FreePascal Information Logo Friend of FreePascal Compiler Title
Articles with Feedback, FPC News Library, PDF Collection, Mail Lists, Books, Newsgroups, IRC Open online discussion areas Research and Tutorials Tools, Compilers and Utilities Blurbs about us, advertising, etc.
Welcome to the FoFPC Research Notes: Similarity

Data Searching using Similarity

by: G.E. Ozz Nixon Jr.
Published: October 2007
©opyright 2009 by Friends of FPC



     Data Searching is a very complex art in itself. Ignoring the long history of Internet "search engines" growing from Archie in 1992 to today's Google engine. Original techniques to provide users with "full text" search required a solution designed for common mistakes, one of which is called Similarity.

     Similarity like Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so called edit distance). This algorithm while much more complex than Levenshtein basically returns the same values.

Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
function Similarity(Const String1,String2: string): extended; // similarity engine
Var nMaxLaenge:Byte;
    Str1,str2:String;
    Max1,Max2:Integer;

  procedure analyse(Var st1,st2:String);
  var B:Byte;
  begin
    B:=1;
    While B<=Length(St1) Do
      If Pos(st1[B],st2)=0 Then Delete(st1,b,1)
      Else Inc(B);
  end;

  procedure LikeRec(Const S1,S2:String;Const nElem,nCurPos,nLaenge:Integer);
  Var
     Max1,Max2:Integer;

  begin
{$IFDEF ASSEMBLY}
    asm // 1090108: By Ozz (FASTER THAN Length()!)
      MOV EAX, S1;       // Store Str Address
      MOV EAX, [EAX-$04]; // Move to "Size" Int32
      MOV Max1, EAX;    // Put into Result
      MOV EAX, S2;       // Store Str Address
      MOV EAX, [EAX-$04]; // Move to "Size" Int32
      MOV Max2, EAX;    // Put into Result
    End;
{$ELSE}
    Max1:=Length(S1);
    Max2:=Length(S2);
{$ENDIF}
    if (nLaenge+Max1-nElem+1<=nMaxLaenge) OR (nLaenge+Max2-nCurPos+1<=nMaxLaenge)
      Then Exit;
    If (nCurPos>Max2) OR (nElem>Max1) Then Begin
      If nLaenge>nMaxLaenge Then nMaxLaenge:=nLaenge;
      Exit;
    End;
    If Str1[nElem]=Str2[nCurPos] Then
       LikeRec(Str1,Str2,nElem+1,nCurPos+1,nLaenge+1)
    Else Begin
       LikeRec(Str1,Str2,nElem+1,nCurPos,nLaenge);
       LikeRec(Str1,Str2,nElem,nCurPos+1,nLaenge);
    End;
  end;

begin
   Result:=0.0;
   If (String1='') or (String2='') then Exit;
{$IFDEF ASSEMBLY}
   asm // 1090108: By Ozz (FASTER THAN Length()!)
     MOV EAX, String1;       // Store Str Address
     MOV EAX, [EAX-$04]; // Move to "Size" Int32
     MOV Max1, EAX;    // Put into Result
     MOV EAX, String2;       // Store Str Address
     MOV EAX, [EAX-$04]; // Move to "Size" Int32
     MOV Max2, EAX;    // Put into Result
   End;
{$ELSE}
   Max1:=Length(String1);
   Max2:=Length(String2);
{$ENDIF}
   If Max1+Max2<50 Then Begin
      Str1:=String1;
      Str2:=String2;
      Analyse(Str1,Str2);
      Analyse(Str2,Str1);
      nMaxLaenge:=0;
      LikeRec(Str1,Str2,1,1,0);
      If Max1*Max2<>0 Then
         If Max1>Max2 Then result:=nMaxLaenge/Max1
         Else result:=nMaxLaenge/Max2
      Else result:=0;
   End
   Else Begin
      Result:=0;
   End;
   Result:=Result*100;
end;

     Using this algorithm, both "Robert" and "Rupert" return "66.66%" the same, while "Rubin" and "Robin" yields "80%". "Ashcroft" and "Ashcraft" yields "87.5%". However, "George" and "Jorge" yields "66.66%", and "Dave" and "David" yields "60%" the same.

     In this sample, we see another way of veiwing the differences between words. Oddly, when comparing the code and technique between similarity and Levenshtein, you may notice that Levenshtein code is very straight foward, easier to digest. However, out of over 100,000 name tests only a few times did this algorithm return results much different than Levenshtein. Please continue reading the other algorithms I use in my search engines.

G.E. Ozz Nixon Jr.
 Links and Products we find useful



ButtonGenerator.com
Valid XHTML 1.0 Transitional Internet Map
Programmer's Heaven
grat-i-fi-ca-tion - noun
the state of being gratified; great satisfaction.


"FoFPC is simply brilliant, a single site with exactly what I was looking for - you made this so easy for me!"

Doug Edwards
eXtreme Accounting
Locations of visitors to this page world map hits counter
Copyright 2009 by 3F, LLC. All rights reserved. Worldwide.
Your request was processed by server #3 in 0.002414 secs.

sponsor
Click to visit our paid sponsor