faqts : Computers : Programming : Languages : Python : Snippets

+ Search
Add Entry AlertManage Folder Edit Entry Add page to http://del.icio.us/
Did You Find This Entry Useful?

61 of 68 people (90%) answered Yes
Recently 8 of 10 people (80%) answered Yes


Distance algorithm for fuzzy string matching (Levenshtein distance)

Feb 25th, 2003 06:18
Gregoire Dooms, Nathan Wallace, Hans Nowak, Snippet 228, Magnus L. Hetland

Packages: miscellaneous
Explanation of the Levenshtein distance algorithm...
(Got a request for this, and thought I might as well post it...)
The algorithm:
def distance(a,b):
    c = {}
    n = len(a); m = len(b)
    for i in range(0,n+1):
        c[i,0] = i
    for j in range(0,m+1):
        c[0,j] = j
    for i in range(1,n+1):
        for j in range(1,m+1):
            x = c[i-1,j]+1
            y = c[i,j-1]+1
            if a[i-1] == b[j-1]:
                z = c[i-1,j-1]
                z = c[i-1,j-1]+1
            c[i,j] = min(x,y,z)
    return c[n,m]
It calculates the following: Given two strings, a and b, and three
operations, adding, subtracting and exchanging single characters, what
is the minimal number of steps needed to translate a into b?
The method is based on the following idea:
  We want to find the distance between a[:x] and b[:y]. To do this, we
  first calculate
  1) the distance between a[:x-1] and b[:y], adding the cost of a
     subtract-operation, used to get from a[:x] to a[:z-1];
  2) the distance between a[:x] and b[:y-1], adding the cost of an
     addition-operation, used to get from b[:y-1] to b[:y];
  3) the distance between a[:x-1] and b[:y-1], adding the cost of a
     *possible* exchange of the letter b[y] (with a[x]).
The cost of the subtraction and addition operations are 1, while the
exchange operation has a cost of 1 if a[x] and b[y] are different, and
0 otherwise.
After calculating these costs, we choose the least one of them (since
we want to use the best solution.)
Instead of doing this recursively (i.e. calculating ourselves "back"
from the final value), we build a cost-matrix c containing the optimal
costs, so we can reuse them when calculating the later values. The
costs c[i,0] (from string of length n to empty string) are all i, and
correspondingly all c[0,j] (from empty string to string of length j)
are j.
Finally, the cost of translating between the full strings a and b
(c[n,m]) is returned.
I guess that ought to cover it...