The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.
Example with a french accented word :
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)
You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.
Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.
Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.
<?php
function utf8_to_extended_ascii($str, &$map)
{
$matches = array();
if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
return $str; foreach ($matches[0] as $mbc)
if (!isset($map[$mbc]))
$map[$mbc] = chr(128 + count($map));
return strtr($str, $map);
}
function levenshtein_utf8($s1, $s2)
{
$charMap = array();
$s1 = utf8_to_extended_ascii($s1, $charMap);
$s2 = utf8_to_extended_ascii($s2, $charMap);
return levenshtein($s1, $s2);
}
?>
Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms