FreePascal Information Logo Friend of FreePascal Compiler Title
Articles with Feedback, FPC News Library, PDF Collection, Mail Lists, Books, Newsgroups, IRC Open online discussion areas Research and Tutorials Tools, Compilers and Utilities Blurbs about us, advertising, etc.
Welcome to the FoFPC Research Notes: Levenshtein

Data Searching using Levenshtein

by: G.E. Ozz Nixon Jr.
Published: October 2007
©opyright 2009 by Friends of FPC



     Data Searching is a very complex art in itself. Ignoring the long history of Internet "search engines" growing from Archie in 1992 to today's Google engine. Original techniques to provide users with "full text" search required a solution designed for common mistakes, one of which is called Levenshtein.

     Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.

Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
function Levenshtein(const String1,String2: string): extended; // similarity engine
var
{$IFDEF VER100} // Delphi 3 can only do 4kbx4kb array - does not support dynamic arrays
   Matrix:Array[0..4095] of Array[0..4095] of Integer;
{$ELSE}
   Matrix:Array of Array of Integer;
{$ENDIF}
   maxl,i,j,n,m,distanz,cost:Integer;
   s_i,t_j:char;

begin
   Result:=0.0;
   If (String1='') or (String2='') then Exit;
{$IFDEF ASSEMBLY}
   asm // 1090108: By Ozz (FASTER THAN Length()!)
     MOV EAX, String1;       // Store Str Address
     MOV EAX, [EAX-$04]; // Move to "Size" Int32
     MOV N, EAX;    // Put into Result
     MOV EAX, String2;       // Store Str Address
     MOV EAX, [EAX-$04]; // Move to "Size" Int32
     MOV M, EAX;    // Put into Result
   End;
{$ELSE}
   N:=Length(String1);
   M:=Length(String2);
{$ENDIF}
{$IFNDEF VER100}
  setlength(Matrix,n+1,m+1);
{$ENDIF}
  if n=0 then distanz:=m
  else begin
    if m=0 then distanz:=n
    else begin
      for i:=0 to n do Matrix[i,0]:=i;
      for j:=0 to m do Matrix[0,j]:=j;
      for i:=1 to n do begin
        s_i:=String1[i];
        for j:=1 to m do begin
          t_j:=String2[j];
          if s_i=t_j then cost:=0
          else cost:=1;
          Matrix[i,j]:=Min(Min(Matrix[i-1,j]+1,Matrix[i,j-1]+1),Matrix[i-1,j-1]+cost);
        end;
      end;
      distanz:=Matrix[n,m];
    end;
  end;
  if m>n then maxl:=m
  else maxl:=n;
  result:=1-(distanz/maxl);
  if result<0 then result:=0
  else Result:=Result*100;
end;

     Using this algorithm, both "Robert" and "Rupert" return "66.66%" the same, while "Rubin" and "Robin" yields "80%". "Ashcroft" and "Ashcraft" yields "87.5%". However, "George" and "Jorge" yields "66.66%", and "Dave" and "David" yields "60%" the same.

     In this sample, we see another way of veiwing the differences between words. Things to pay close attention to is the result for Ashcroft and Ashcraft, even though close to us, this algorithm really didn't consider them very closer. Also, note that George and Jorge return the same result as Robert and Rupert. However, Dave and David are still out there. Please continue reading the other algorithms I use in my search engines.

G.E. Ozz Nixon Jr.
 Links and Products we find useful



ButtonGenerator.com
Valid XHTML 1.0 Transitional Internet Map
Programmer's Heaven
grat-i-fi-ca-tion - noun
the state of being gratified; great satisfaction.


"FoFPC is simply brilliant, a single site with exactly what I was looking for - you made this so easy for me!"

Doug Edwards
eXtreme Accounting
Locations of visitors to this page world map hits counter
Copyright 2009 by 3F, LLC. All rights reserved. Worldwide.
Your request was processed by server #3 in 0.001548 secs.

sponsor
This sponsor helps us with our documentation