Fuzzy lookup of names with a Dutch Metaphone implementation

Phonetic algorithms are heavily used to provide suggestions and corrections in spell checkers and also for searching and matching names. You could use such algorithms to populate autocomplete controls for the lookup of cities and streets, patients in a hospital, or reservations for a hotel stay. All you need to know is how a part of the name of the street, patient, or guest sound like, transform your search text with the algorithm, and lookup the encoded string in the dictionary. Metaphone is a popular algorithm family in this domain. Metaphone is used for encoding a word so that similar sounding words encode the same. That sounds as Soundex, doesn't it ? Well, that's true, but Metaphone it doesn't suffer from a fixed length output (Soundex just encode to a letter followed by three numbers), it's not specialized in names only, and it's not specialized in the English language only. For the record: because of these same reasons, Metaphone requires more computational power and storage capacity than Soundex.

Metaphone and double metaphone (the less English-centric version) are in the public domain and can be freely implemented. So I did some research into the English metaphone rules, and then jumped into the web of Dutch pronunciation rules. I downloaded one of the many available free implementations (in T-SQL), and adapted it towards the Dutch language. And believe me, if you thought that Dutch spelling was difficult, you should check our pronunciation rules Undecided. Fortunately Metaphone ignores all vowels (except the initial one) and diphthongs, and it doesn't specify how you should pronounce a 'g' or a 'r', so I think I came up with a pretty region-agnostic encoding (there are many-many-many Dutch dialects).

Examples

Here are a couple of examples:

DECLARE @Test TABLE(Name NVARCHAR(50))

 

INSERT @Test VALUES('vrede'), ('wrede'), ('luycks'), ('luiks')

INSERT @Test VALUES('Vandenbroecke'), ('Vandenberghe'), ('Verbrugghe'), ('Verschueren')

INSERT @Test VALUES('baobab'), ('china'), ('shampoo'), ('sjamaan'), ('tjokvol')

 

SELECT Name, U2UConsult.MetaphoneDutch(name) AS [Dutch Metaphone]

  FROM @Test

GO

And the result:

Here's a lookup in the AdventureWorks persons table for last names that contain a fragment that sounds like "kamer":

DECLARE @x NVARCHAR(150) = U2UConsult.MetaphoneDutch('kamer')

 

;WITH cte AS (

SELECT DISTINCT LastName, U2UConsult.MetaphoneDutch(LastName) AS [Dutch Metaphone]

  FROM Person.Person)

SELECT * FROM cte WHERE [Dutch Metaphone] LIKE ('%' + @x + '%')

GO

Here's the result:

Of course you should not launch this type of queries continuously against a production database. The encoded lookup value should be stored through a trigger, a persisted calculated column, or an indexed view.

Source Code

Here's the source code. The function is optimized for readability, not performance:

USE [AdventureWorks2008]

GO

 

CREATE Schema [U2UConsult]

GO

 

CREATE FUNCTION [U2UConsult].[MetaphoneDutch](@word AS VARCHAR(150))

RETURNS VARCHAR(150)

AS

BEGIN

    DECLARE @result VARCHAR(150)

    DECLARE @nextChar CHAR(2)

    DECLARE @currentChar CHAR(1)

    DECLARE @previousChar CHAR(1)

    DECLARE @length TINYINT

    DECLARE @count TINYINT

 

    SET @word = LTRIM(RTRIM(@word))

    SET @length = LEN(@word)

    SET @count = 1

    SET @result = ''

    SET @currentChar = SUBSTRING(@word, @count, 1)

 

    IF @currentChar IN ('a', 'e', 'i', 'o', 'u')

        SET @result = 'a'

 

    WHILE @count <= @length BEGIN

        SET @currentChar = SUBSTRING(@word, @count, 1);

        if (@currentChar = ' ') BEGIN

            SET @result = @result + ' ' + U2UConsult.MetaphoneDutch(SUBSTRING(@word, @count, @length))

            RETURN @result

        END

        IF @count <> 1

            SET @previousChar = SUBSTRING(@word, (@count-1), 1)

        ELSE

            SET @previousChar = ' '

        IF @previousChar <> @currentChar BEGIN

            SET @nextChar = SUBSTRING(@word, @count + 1, 1)

 

            -- b --

            IF @currentChar = 'b'

                IF @count = @length

                    SET @result = @result + 'p'

                ELSE

                    SET @result = @result + 'b'

 

            -- c --

            IF @currentChar = 'c'

                IF @nextChar = 'h'

                    IF @count = 1

                        SET @result = @result + 'x'

                    ELSE

                        SET @result = @result + 'g'

                ELSE IF @nextChar IN ('i', 'e', 'y') AND @previousChar <> 's'

                    SET @result = @result + 's'

                ELSE

                    SET @result = @result + 'k'

 

            -- d --

            IF @currentChar = 'd'

                IF SUBSTRING(@word,@count,3) IN ('dge', 'dgy', 'dgi')

                    SET @result = @result + 'j'

                ELSE

                    IF @count = @length

                        SET @result = @result + 't'

                    ELSE

                        SET @result = @result + 'd'

 

            -- f --

            IF @currentChar = 'f'

                SET @result = @result + 'f'

 

            -- g --

            IF @currentChar = 'g'

                SET @result = @result + 'g'

 

            -- h --

            IF @currentChar = 'h'

                IF @previousChar NOT IN ('c', 's', 'p', 't', 'g') AND @nextChar IN ('a', 'e', 'i', 'o', 'u', 'y')

                    SET @result = @result + 'h'

 

            -- j --

            IF @currentChar = 'j'

                IF @previousChar NOT IN ('i', 's')

                    IF @previousChar = 't'

                        SET @result = @result + 'x'

                    ELSE

                        SET @result = @result + 'j'

 

            -- k --

            IF @currentChar = 'k'

                IF @previousChar <> 'c'

                    SET @result = @result + 'k'

 

            -- l --

            IF @currentChar = 'l'

                SET @result = @result + 'l'

 

            -- m --

            IF @currentChar = 'm'

                SET @result = @result + 'm'

 

            -- n --

            IF @currentChar = 'n'

                SET @result = @result + 'n'

 

            -- p --

            IF @currentChar = 'p'

                IF @nextChar = 'h'

                    SET @result = @result + 'f'

                ELSE

                    SET @result = @result + 'p'

 

            -- q --

            IF @currentChar = 'q'

                SET @result = @result + 'k'

 

            -- r --

            IF @currentChar IN ('r')

                SET @result = @result + 'r'

 

            -- s --

            IF @currentChar = 's'

                IF @nextChar IN ('h', 'j')

                    SET @result = @result + 'x'

                ELSE

                    SET @result = @result + 's'

 

            -- t --

            IF @currentChar = 't'

                IF SUBSTRING(@word,@count,3) IN ('tia', 'tio', 'tie', 'tiu')

                    SET @result = @result + 's'

                ELSE

                    SET @result = @result + 't'

 

            -- v --

            IF @currentChar = 'v'

                SET @result = @result + 'v'

 

            -- w --

            IF @currentChar = 'w'

                IF @nextChar = 'r'

                    SET @result = @result + 'v'

                ELSE

                    SET @result = @result + 'w'

 

            -- x --

            IF @currentChar = 'x'

                SET @result = @result + 'ks'

 

            -- y --

            IF @currentChar = 'y'

                IF @nextChar IN ('a', 'e', 'i', 'o', 'u')

                    SET @result = @result + 'j'

 

            -- z --

            IF @currentChar = 'z'

                SET @result = @result + 'z'

        END

        SET @count = @count + 1

    END

 

    RETURN @result -- Mandatory

END

Enjoy!