Data Searching using Double-Metaphone
by: G.E. Ozz Nixon Jr.
Published: October 2007
©opyright 2009 by Friends of FPC
Data Searching is a very complex art in itself. Ignoring the long history of
Internet "search engines" growing from Archie in 1992 to today's Google engine.
Original techniques to provide users with "full text" search required a solution
designed for common mistakes, one of which is called Metaphone.
Double-Metaphone is a second generation phonetic algorithm, based upon an algorithm
published in 1990 for indexing words by pronunciation. The algorithm
produces variable length keys as its output, as opposed to Soundex's fixed-length
keys. Similar sounding words share the same keys. Double-Metaphone differs from
Metaphone by introducing primary and secondary codes, supporting pronunciation
in English, Slavic, Germanic, Celtic, Greek, French, Italian, Spanish and Chinese.
Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
function Metaphone2(S: string): string; // different design
Const MaxNum=22;
Tab:Array[1..MaxNum,1..8] Of String=(
{ Org,Ers,Nach,Vor,NotNach,NotVor,Anf,Ende}
('Ä','E','','','','','',''),
('AE','E','','','','','',''),
('AY','AI','','','','','',''),
('B','P','','','','','','1'),
('CHS','KS','','','','','',''),
('CK','K','','','','','',''),
('C','K','','','','KH','',''),
('D','T','','','','','','1'),
('EY','AI','','','','','',''),
('EI','AI','','','','','',''),
('G','K','','','N','','','1'),
('PF','F','','','','','1',''),
('PH','F','','','','','1',''),
('PN','N','','','','','1',''),
('QU','KW','','','','','',''),
('Q','K','','','','U','',''),
('TZ','TS','','','','','',''),
('V','F','','','','','',''),
('X','KS','','','','','',''),
('Y','J','','','','','1',''),
('Y','Ü','','','EA','','0',''),
('Z','TS','','','T','','','')
);
Var Anfang,Ende:Boolean;
Vor,Nach,Work,Output:String;
L,I,Num:Integer;
begin
Result:='';
If (S='') then Exit;
{$IFDEF ASSEMBLY}
asm // 1090108: By Ozz (FASTER THAN Length()!)
MOV EAX, S; // Store Str Address
MOV EAX, [EAX-$04]; // Move to "Size" Int32
MOV Num, EAX; // Put into Result
End;
{$ELSE}
Num:=Length(S);
$ENDIF}
if Num>0 then begin
S:=Uppercase(S);
I:=1;
While I1)Or(Tab[Num,3]=''))AND
((QuickPos(Nach,Tab[Num,4])>1)Or(Tab[Num,4]=''))AND
(QuickPos(Vor,Tab[Num,5])=0)AND
(QuickPos(Nach,Tab[Num,6])=0)AND
((Anfang And (Tab[Num,7]='1'))OR
(Not(Anfang)And(Tab[Num,7]='0'))OR(Tab[Num,7]=''))AND
((Ende And (Tab[Num,8]='1'))OR
(Not(Ende)And(Tab[Num,8]='0'))OR(Tab[Num,8]='')) Then Begin
Output:=Output+Tab[Num,2];
If Length(Tab[Num,1])>1 Then Inc(I,Length(Tab[Num,1])-1);
End
Else Output:=Output+Work[1];
End
Else OutPut:=OutPut+Work[1];
Inc(I,1);
Until I>L;
Result:=Output;
end;
end;
Designed to use Levenshtein, this algorithm returns "Robert" and "Rupert" as "66.66%" the same,
while "Rubin" and "Robin" yields "80%". "Ashcroft" and "Ashcraft" yields "87.5%".
However, "George" and "Jorge" yields "66.66%", and "Dave" and "David"
yields "60%" the same.
In these simple tests, Metaphone2 with Levenshtein yield the same as using plain
Levenshtein against the words. However, introduce words from Europe and this version
of the Metaphone helps you match words much better than Levenshtein by itself.
Read more on
other comparison techniques I used in my search engines.
G.E. Ozz Nixon Jr.