Fast approximate string matching in Perl

(Measurement of Sequence Distance using Myers Algorithm)

This code measures the distance between a query and a text string, representing protein sequences, by finding the best alignment using the Myers 98 algorithm documented in A fast bit-vector algorithm for approximate string matching based on dynamic programming.

"The approximate string matching problem is to find all locations at which a query of length m_ matches a substring of a text of length _n with k_-or-fewer differences. Prior bit-vector algorithms, e.g. _agrep, run in O(nmk/w) time, where w_ is the word size of the machine, 32 bits. The Myers algorithm runs in _O(nm/w) time. It achieves its increased speed by "calculating a bit-vector representation of the relocatable dynamic programming matrix for the problem." The algorithm’s performance is therefore independent of k.

The Myers algorithm computes a matrix of bit-vectors for each query string; consisting of M vectors of length W, where M is the size of the string alphabet, and W is the word size for the machine. (Query strings must be truncated if they are larger than the machine word size, normally 32 bits.)  A bit is set for each position in the query string where the character matches the alphabet letter corresponding to the current vector (or matrix row.)

This permits very fast approximate matching over a large set of text (or database) strings for a given query string; the dynamic programming matrix is calculated a column at a time using the natural parallelism of bit-vector operations. This parallelism is achieved at the cost of assembling the bit-vector matrix for each query string. This program searches for 9 different protein sequences (truncated to length 32) in a text library of 742 much longer protein sequences. The sequences are read from FASTA files, and then kept in memory for speed. Larger libraries could be used, but memory overflow may occur; the presence of sufficient memory is assumed.

This implementation was done in Perl, and was a project for the Biological Sequence Analysis course at the UCSC Extension. Further details are given in a short report (PDF 225kb) and in the code: