Data Searching using Similarity
by: G.E. Ozz Nixon Jr.
Published: October 2007
©opyright 2009 by Friends of FPC
Data Searching is a very complex art in itself. Ignoring the long history of
Internet "search engines" growing from Archie in 1992 to today's Google engine.
Original techniques to provide users with "full text" search required a solution
designed for common mistakes, one of which is called Similarity.
Similarity like Levenshtein distance is a metric for measuring the amount of difference between
two sequences (i.e., the so called edit distance). This algorithm while much more complex
than Levenshtein basically returns the same values.
Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
function Similarity(Const String1,String2: string): extended; // similarity engine
Var nMaxLaenge:Byte;
Str1,str2:String;
Max1,Max2:Integer;
procedure analyse(Var st1,st2:String);
var B:Byte;
begin
B:=1;
While B<=Length(St1) Do
If Pos(st1[B],st2)=0 Then Delete(st1,b,1)
Else Inc(B);
end;
procedure LikeRec(Const S1,S2:String;Const nElem,nCurPos,nLaenge:Integer);
Var
Max1,Max2:Integer;
begin
{$IFDEF ASSEMBLY}
asm // 1090108: By Ozz (FASTER THAN Length()!)
MOV EAX, S1; // Store Str Address
MOV EAX, [EAX-$04]; // Move to "Size" Int32
MOV Max1, EAX; // Put into Result
MOV EAX, S2; // Store Str Address
MOV EAX, [EAX-$04]; // Move to "Size" Int32
MOV Max2, EAX; // Put into Result
End;
{$ELSE}
Max1:=Length(S1);
Max2:=Length(S2);
{$ENDIF}
if (nLaenge+Max1-nElem+1<=nMaxLaenge) OR (nLaenge+Max2-nCurPos+1<=nMaxLaenge)
Then Exit;
If (nCurPos>Max2) OR (nElem>Max1) Then Begin
If nLaenge>nMaxLaenge Then nMaxLaenge:=nLaenge;
Exit;
End;
If Str1[nElem]=Str2[nCurPos] Then
LikeRec(Str1,Str2,nElem+1,nCurPos+1,nLaenge+1)
Else Begin
LikeRec(Str1,Str2,nElem+1,nCurPos,nLaenge);
LikeRec(Str1,Str2,nElem,nCurPos+1,nLaenge);
End;
end;
begin
Result:=0.0;
If (String1='') or (String2='') then Exit;
{$IFDEF ASSEMBLY}
asm // 1090108: By Ozz (FASTER THAN Length()!)
MOV EAX, String1; // Store Str Address
MOV EAX, [EAX-$04]; // Move to "Size" Int32
MOV Max1, EAX; // Put into Result
MOV EAX, String2; // Store Str Address
MOV EAX, [EAX-$04]; // Move to "Size" Int32
MOV Max2, EAX; // Put into Result
End;
{$ELSE}
Max1:=Length(String1);
Max2:=Length(String2);
{$ENDIF}
If Max1+Max2<50 Then Begin
Str1:=String1;
Str2:=String2;
Analyse(Str1,Str2);
Analyse(Str2,Str1);
nMaxLaenge:=0;
LikeRec(Str1,Str2,1,1,0);
If Max1*Max2<>0 Then
If Max1>Max2 Then result:=nMaxLaenge/Max1
Else result:=nMaxLaenge/Max2
Else result:=0;
End
Else Begin
Result:=0;
End;
Result:=Result*100;
end;
Using this algorithm, both "Robert" and "Rupert" return "66.66%" the same,
while "Rubin" and "Robin" yields "80%". "Ashcroft" and "Ashcraft" yields "87.5%".
However, "George" and "Jorge" yields "66.66%", and "Dave" and "David"
yields "60%" the same.
In this sample, we see another way of veiwing the differences between words.
Oddly, when comparing the code and technique between similarity and Levenshtein,
you may notice that Levenshtein code is very straight foward, easier to digest.
However, out of over 100,000 name tests only a few times did this algorithm
return results much different than Levenshtein.
Please continue reading the other algorithms I use in my search engines.
G.E. Ozz Nixon Jr.