GLBIO 2013 — A Survey of Protein Storage and Search Cost Reduction Techniques

This page will be updated with sources and a more significant writeup of my GLBIO 2013 submision.

If you see this page, give me a poke at [email protected] and I'll try to get this straightened out sooner rather than later.


To determine the biological functions of a newly sequenced protein, one often compares it against a database of existing proteins to find sequences which share substructures or motifs. Because the function of protein is highly correlated with its structure, finding similar sequences which have known biological function is a quick way to understand a new protein's likely functions.

Two concerns for comparing protein sequences are computation time and storage requirements. The most accurate algorithms to find similar protein strands use a dynamic programming approach, but this is computationally expensive. Modern methods attempt to reduce the time complexity by using heuristics or data transformations to decrease the amount of data examined. These approaches provide only approximate matches, but this is acceptable because often a group of similar protein sequences should be examined to determine the new sequence's properties.

In this research, a group of exact and approximate algorithms for selecting similar protein sequences from a database of proteins with known biological functions to the one under question is examined. Search speed relative to size of protein and size of database along with how the match sets overlap are factors examined. Modern compression algorithms are examined for their efficiency in reducing storage requirements and their (de)compression time. Algorithms and techniques examined include Smith-Waterman, BLAST, and prediction by partial matching. An application of a music thumbnailing algorithm to protein searching will be presented. The pros and cons of these algorithms will be presented.