
![]() | ![]() |
1.22. Soundex Matching
1.22.1. Problem
You have two English surnames and
want to know whether they sound somewhat similar, regardless of
spelling. This would let you offer users a "fuzzy search" of names in
a telephone book to catch "Smith" and "Smythe" and others within the
set, such as "Smite" and
"Smote".
1.22.2. Solution
Use
the standard Text::Soundex module:use Text::Soundex;
$CODE = soundex($STRING);
@CODES = soundex(@LIST);
Or use the CPAN module
Text::Metaphone:use Text::Metaphone;
$phoned_words = Metaphone('Schwern');
1.22.3. Discussion
The soundex algorithm hashes words (particularly English surnames)
into a small space using a simple model that approximates an English
speaker's pronunciation of the words. Roughly speaking, each word is
reduced to a four-character string. The first character is an
uppercase letter; the remaining three are digits. By comparing the
soundex values of two strings, we can guess whether they sound
similar.The following program prompts for a name and looks for similarly
sounding names from the password file. This same approach works on
any database with names, so you could key the database on the soundex
values if you wanted to. Such a key wouldn't be unique, of course.use Text::Soundex;
use User::pwent;
print "Lookup user: ";
chomp($user =<STDIN>);
exit unless defined $user;
$name_code = soundex($user);
while ($uent = getpwent( )) {
($firstname, $lastname) = $uent->gecos =~ /(\w+)[^,]*\b(\w+)/;
if ($name_code eq soundex($uent->name) ||
$name_code eq soundex($lastname) ||
$name_code eq soundex($firstname) )
{
printf "%s: %s %s\n", $uent->name, $firstname, $lastname;
}
}
The
Text::Metaphone module from CPAN addresses the same problem in a
different and better way. The soundex function
returns a letter and a three-digit code that maps just the beginning
of the input string, whereas Metaphone returns a
code as letters of variable length. For example:soundex metaphone
Christiansen C623 KRSXNSN
Kris Jenson K625 KRSJNSN
Kyrie Eleison K642 KRLSN
Curious Liaison C624 KRSLSN
To
get the most of Metaphone, you should also use the String::Approx
module from CPAN, described more fully in Recipe 6.13. It allows for there to be errors in the match
and still be successful. The edit distance is
the number of changes needed to go from one string to the next. This
matches a pair of strings with an edit distance of two:if (amatch("string1", [2], "string2") { }
There's also an
adist function that reports the edit distance. The
edit distance between "Kris Jenson" "Christiansen" is 6, but between
their Metaphone encodings is only 1. Likewise, the distance between
the other pair is 8 originally, but down to 1 again if you compare
Metaphone encodings.use Text::Metaphone qw(Metaphone);
use String::Approx qw(amatch);
if (amatch(Metaphone($s1), [1], Metaphone($s1)) {
print "Close enough!\n";
}
This would successfully match both of our example pairs.
1.22.4. See Also
The documentation for the standard Text::Soundex and User::pwent
modules; the Text::Metaphone and String::Approx modules from CPAN;
your system's passwd(5) manpage; Volume 3,
Chapter 6 of The Art of Computer Programming, by
Donald E. Knuth (Addison-Wesley)
![]() | ![]() | ![]() |
1.21. Constant Variables | ![]() | 1.23. Program: fixstyle |

Copyright © 2003 O'Reilly & Associates. All rights reserved.