Data Searching using Metaphone
by: G.E. Ozz Nixon Jr.
Published: October 2007
©opyright 2009 by Friends of FPC
Data Searching is a very complex art in itself. Ignoring the long history of
Internet "search engines" growing from Archie in 1992 to today's Google engine.
Original techniques to provide users with "full text" search required a solution
designed for common mistakes, one of which is called Metaphone.
Metaphone is a phonetic algorithm, an algorithm published in 1990 for indexing words
by their English pronunciation. The algorithm produces variable length keys as its
output, as opposed to Soundex's fixed-length keys. Similar sounding words share the
same keys.
Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
function Metaphone(Const S: string): string; // single-metaphone
var
StrLen:Integer;
Cnt:Integer;
Str1:Char;
StrPrev:Char;
Str2:String;
Str:String;
begin
Result:='';
If (S='') then Exit;
Str:=Lowercase(S);
{$IFDEF ASSEMBLY}
asm // 1090108: By Ozz (FASTER THAN Length()!)
MOV EAX, S; // Store Str Address
MOV EAX, [EAX-$04]; // Move to "Size" Int32
MOV StrLen, EAX; // Put into Result
End;
{$ELSE}
StrLen:=Length(S);
{$ENDIF}
Cnt:=1;
Str2:=Copy(S,1,2);
// 4 pre-processing rules:
if QuickPos(Str2+',','ae,gn,kn,pn,wr,')>0 then begin // find silent first letters and remove
Delete(Str,1,1);
Dec(StrLen);
end
else if (Str2='wh') then begin // drop silent "H"
Delete(Str,2,1);
Dec(StrLen);
end;
Str1:=Str[1];
// x sounds like "s". change to "s"
if (Str1='x') then Str[1]:='s'
else if (Str1 in ['a','e','i','o','u']) then begin // drop leaving vowels
Delete(Str,1,1);
Dec(StrLen);
Result:=Str1;
end;
// MAIN:
While Cnt<=StrLen do begin
If (Cnt>1) then StrPrev:=Str1
Else StrPrev:=#32; // space
Str1:=Str[Cnt];
If (StrPrev<>Str1) then begin
case Str1 of
'f','j','l','m','n','r':Result:=Result+Str1;
'q':Result:=Result+'k';
'v':Result:=Result+'f';
'x':Result:=Result+'ks';
'z':Result:=Result+'s';
'b':begin
if (Cnt=StrLen) then begin
if StrPrev<>'m' then result:=result+'b';
end
else result:=result+'b';
end;
'c':begin
if (Copy(Str,Cnt,2)='ch') or
(Copy(Str,Cnt,3)='cia') then Result:=Result+'x'
else if (QuickPos(Str2+',','ci,ce,cy,')>0) and (StrPrev<>'s') then
Result:=Result+'s'
else Result:=Result+'k';
end;
'd':begin
if (QuickPos(Copy(Str,Cnt,3)+',','dge,dgy,dgi,')>0)
then Result:=Result+'j'
else Result:=Result+'t';
end;
'g':if cnt>1 then begin
if QuickPos(Copy(Str,cnt-1,3)+',','dge,dgy,dgi,dha,dhe,dhi,dho,dhu,')=0 then begin
if QuickPos(Copy(Str,cnt,2)+',','gi,ge,gy,')>0 then Result:=Result+'j'
else if (Copy(Str,cnt,2)<>'gn') or
((Copy(Str,cnt,2)<>'gh') and (cnt'c') then Result:=Result+'k';
'p':if (Copy(Str,cnt,2)='ph') then Result:=Result+'f'
else result:=result+Str1;
's':if (QuickPos(Copy(Str,cnt,3)+',','sia,sio,')>0) or
(Copy(Str,cnt,2)='sh') then Result:=Result+'x'
else Result:=Result+Str1;
't':if (QuickPos(Copy(Str,cnt,3)+',','tia,tio,')>0) then Result:=Result+'x'
else if Copy(Str,cnt,2)='th' then Result:=Result+'0' // zero
else if Copy(Str,cnt,3)='tch' then Result:=Result+Str1;
'w':if (QuickPos(Copy(Str,cnt,2)+',','wa,we,wi,wo,wu,')=0) then Result:=Result+Str1;
'y':if (QuickPos(Copy(Str,cnt,2)+',','ya,ye,yi,yo,yu,')=0) then Result:=Result+Str1;
end; // case
end; // if differnet character
Inc(Cnt);
end; // while
end;
Using this algorithm, both "Robert" and "Rupert" return the same string "rbr"
while "Rubin" and "Robin" yields "rbn". "Ashcroft" and "Ashcraft" yields "axkrf".
However, "George" and "Jorge" yields "rj" and "jrj" and "Dave" and "David"
yields "tf" and "tft".
This simple example shows an improvement for Ashcroft and Ashcraft over Soundex, however,
George and Jorge and Dave and David need an improved theory. Read more on
other comparison techniques I used in my search engines.
G.E. Ozz Nixon Jr.