FreePascal Information Logo Friend of FreePascal Compiler Title
Articles with Feedback, FPC News Library, PDF Collection, Mail Lists, Books, Newsgroups, IRC Open online discussion areas Research and Tutorials Tools, Compilers and Utilities Blurbs about us, advertising, etc.
Welcome to the FoFPC Research Notes: Metaphone

Data Searching using Metaphone

by: G.E. Ozz Nixon Jr.
Published: October 2007
©opyright 2009 by Friends of FPC



     Data Searching is a very complex art in itself. Ignoring the long history of Internet "search engines" growing from Archie in 1992 to today's Google engine. Original techniques to provide users with "full text" search required a solution designed for common mistakes, one of which is called Metaphone.

     Metaphone is a phonetic algorithm, an algorithm published in 1990 for indexing words by their English pronunciation. The algorithm produces variable length keys as its output, as opposed to Soundex's fixed-length keys. Similar sounding words share the same keys.

Code snippet from DXSock 6.0 WLM (Courtesy of Brain Patchwork DX, LLC)
function Metaphone(Const S: string): string; // single-metaphone
var
   StrLen:Integer;
   Cnt:Integer;
   Str1:Char;
   StrPrev:Char;
   Str2:String;
   Str:String;

begin
   Result:='';
   If (S='') then Exit;
   Str:=Lowercase(S);
{$IFDEF ASSEMBLY}
   asm // 1090108: By Ozz (FASTER THAN Length()!)
     MOV EAX, S;       // Store Str Address
     MOV EAX, [EAX-$04]; // Move to "Size" Int32
     MOV StrLen, EAX;    // Put into Result
   End;
{$ELSE}
   StrLen:=Length(S);
{$ENDIF} 
   Cnt:=1;
   Str2:=Copy(S,1,2);
   // 4 pre-processing rules:
   if QuickPos(Str2+',','ae,gn,kn,pn,wr,')>0 then begin // find silent first letters and remove
      Delete(Str,1,1);
      Dec(StrLen);
   end
   else if (Str2='wh') then begin // drop silent "H"
      Delete(Str,2,1);
      Dec(StrLen);
   end;
   Str1:=Str[1];
   // x sounds like "s". change to "s"
   if (Str1='x') then Str[1]:='s'
   else if (Str1 in ['a','e','i','o','u']) then begin // drop leaving vowels
      Delete(Str,1,1);
      Dec(StrLen);
      Result:=Str1;
   end;
   // MAIN:
   While Cnt<=StrLen do begin
      If (Cnt>1) then StrPrev:=Str1
      Else StrPrev:=#32; // space
      Str1:=Str[Cnt];
      If (StrPrev<>Str1) then begin
         case Str1 of
            'f','j','l','m','n','r':Result:=Result+Str1;
            'q':Result:=Result+'k';
            'v':Result:=Result+'f';
            'x':Result:=Result+'ks';
            'z':Result:=Result+'s';
            'b':begin
               if (Cnt=StrLen) then begin
                  if StrPrev<>'m' then result:=result+'b';
               end
               else result:=result+'b';
            end;
            'c':begin
               if (Copy(Str,Cnt,2)='ch') or
                  (Copy(Str,Cnt,3)='cia') then Result:=Result+'x'
               else if (QuickPos(Str2+',','ci,ce,cy,')>0) and (StrPrev<>'s') then
                  Result:=Result+'s'
                  else Result:=Result+'k';
            end;
            'd':begin
               if (QuickPos(Copy(Str,Cnt,3)+',','dge,dgy,dgi,')>0)
                  then Result:=Result+'j'
               else Result:=Result+'t';
            end;
            'g':if cnt>1 then begin
               if QuickPos(Copy(Str,cnt-1,3)+',','dge,dgy,dgi,dha,dhe,dhi,dho,dhu,')=0 then begin
                  if QuickPos(Copy(Str,cnt,2)+',','gi,ge,gy,')>0 then Result:=Result+'j'
                  else if (Copy(Str,cnt,2)<>'gn') or
                     ((Copy(Str,cnt,2)<>'gh') and (cnt'c') then Result:=Result+'k';
            'p':if (Copy(Str,cnt,2)='ph') then Result:=Result+'f'
               else result:=result+Str1;
            's':if (QuickPos(Copy(Str,cnt,3)+',','sia,sio,')>0) or
                  (Copy(Str,cnt,2)='sh') then Result:=Result+'x'
               else Result:=Result+Str1;
            't':if (QuickPos(Copy(Str,cnt,3)+',','tia,tio,')>0) then Result:=Result+'x'
               else if Copy(Str,cnt,2)='th' then Result:=Result+'0' // zero
               else if Copy(Str,cnt,3)='tch' then Result:=Result+Str1;
            'w':if (QuickPos(Copy(Str,cnt,2)+',','wa,we,wi,wo,wu,')=0) then Result:=Result+Str1;
            'y':if (QuickPos(Copy(Str,cnt,2)+',','ya,ye,yi,yo,yu,')=0) then Result:=Result+Str1;
         end; // case
      end; // if differnet character
      Inc(Cnt);
   end; // while
end;

     Using this algorithm, both "Robert" and "Rupert" return the same string "rbr" while "Rubin" and "Robin" yields "rbn". "Ashcroft" and "Ashcraft" yields "axkrf". However, "George" and "Jorge" yields "rj" and "jrj" and "Dave" and "David" yields "tf" and "tft".

     This simple example shows an improvement for Ashcroft and Ashcraft over Soundex, however, George and Jorge and Dave and David need an improved theory. Read more on other comparison techniques I used in my search engines.

G.E. Ozz Nixon Jr.
 Links and Products we find useful



ButtonGenerator.com
Valid XHTML 1.0 Transitional Internet Map
Programmer's Heaven
grat-i-fi-ca-tion - noun
the state of being gratified; great satisfaction.


"Wow! We are so pleased to see Friends of FreePascal Compiler ... with such a cool look and feel!"

"We like the fact you wrote all of the server scripts using FPC!"

Ian Wright
Codemasters
Locations of visitors to this page world map hits counter
Copyright 2009 by 3F, LLC. All rights reserved. Worldwide.
Your request was processed by server #3 in 0.003335 secs.

sponsor
This sponsor helps us with our documentation