Large String Matcher

Python

Dynamic Programming Algorithms

Divide and Conquer Algorithms

Summary

This project was created for my Analysis of Algorithms class at USC Viterbi School of Engineering.

The goal of this project was to implement in code a dynamic programming algorithm that rated the relative similarity between two strings. Use cases for an efficient algorithm would be for matching the similarity between strands of DNA.

The Problem

While a basic algorithm is easy to implement, it has its drawbacks. A dynamic programming algorithm uses memoization to record data. The amount of memory required for this algorithm is an array of size M*N where M and N are the lengths of the two input strings. While this is fine for small strings, DNA strands can be millions of strings long, requiring a lot of memory.

Basic Algorithm:
OPT(i, j) = min(OPT(i-1, j-1) + alpha(i, j), OPT(i-1, j) + delta, OPT(i, j-1) + delta)

The Solution

Creating an improved version of the algorithm requires a divide-and-conquer approach. The algorithm splits the first string in half then finds the optimal point to split the second string before finding the optimal alignment between the split sides. This is recursively repeated until one of the two sides has less than two characters. The optimal alignment can then be returned from the recursive base case, resulting in an optimal alignment for the whole string.

This approach reduces the memory requirement from M*N to 2*N, a significant improvement for very long strings. The tradeoff is the runtime is increased by a polynomial amount.

M+NTime in MS (Basic)Time in MS (Memory Efficient)Memory in KB (Basic)Memory in KB (Memory Efficient)
160.35619735717773440.39720535278320311030810436
641.6839504241943363.089904785156251046810500
1285.63788414001464810.6070041656494141050010608
25620.43390274047851639.0861034393310551098811216
38451.0747432708740296.169948577880861182010808
51286.45486831665039160.650968551635741203610568
768204.65803146362305372.50876426696781331210680
1024374.7079372406006670.60899734497071493210612
1280609.07173156738281069.20289993286131669610740
1536852.21409797668461694.16308403015141785610780
20481592.80514717102052824.4647979736332192410736
25602375.6480216979984341.2227630615232598010892
30723282.59396553039556043.5400009155273232411024
35844383.2199573516857586.0209465026863857611060
39686169.41809654235811138.3688449859624494011164