Users often enter data approximately or inaccurately. But sometimes, we need to search or match this inaccurate data anyway!
For example, when searching for a customer, users should find & match the existing record rather than create an unwanted duplicate.
There are standard algorithms for measuring string distance, but we’ll need a few extra steps to make this work efficiently against a database.
Measuring String Distance
Let’s start with a simple problem. Suppose the user searches for a customer, and we have some candidates to compare. Which one is closest?
Search: "Dan Wood" Matches: "Dan's Woodwork" (best) "Dance Studio" "Julia's Kitchen & Food" (worst)
The algorithm we need is called a “string distance” or “edit distance” measurement. We can measure the distance between the user input & each candidate string, and order our matches so the closest are preferred.
There are several popular measures of string distance or match quality:
- Levenshtein distance: minimum number of inserts, deletes, or substitutions to transform one string into the other
- Damerau-Levenshtein distance: like Levenshtein, but also counting a transposition of two adjacent characters as a single edit
- Longest Common Substring distance: minimum number of symbols that have to be removed from both strings until the resulting substrings are identical
- q-gram distance: sum of absolute differences between the N-gram vectors of both strings
- Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams
Normally, the Levenshtein (or Damerau-Levenshtein) distance or an N-gram-based distance (q-gram or Jaccard) is preferred, and there are good implementations or pseudocode outlines for each of these on the net.
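To make this concrete, here’s a minimal sketch of the classic Levenshtein algorithm, using the standard two-row dynamic-programming table. In production you might prefer a library implementation instead (e.g. LevenshteinDistance in Apache Commons Text):

// Minimal two-row Levenshtein distance: counts the inserts, deletes,
// & substitutions needed to transform one string into the other.
public static int levenshtein (String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];

    // cost of building each prefix of b from the empty string
    for (int j = 0; j <= b.length(); j++)
        prev[j] = j;

    for (int i = 1; i <= a.length(); i++) {
        curr[0] = i;  // cost of deleting the first i characters of a
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i-1) == b.charAt(j-1)) ? 0 : 1;
            curr[j] = Math.min(Math.min(
                    prev[j] + 1,           // deletion
                    curr[j-1] + 1),        // insertion
                    prev[j-1] + cost);     // substitution (or match)
        }
        int[] tmp = prev; prev = curr; curr = tmp;  // reuse the two rows
    }
    return prev[b.length()];
}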
But though these similarity measures are effective for ranking matches, they’re expensive to run. We could run them against perhaps 50 or 100 candidate matches, but not an entire table. So, what to do?
Database Query by N-Grams
To search efficiently against a database, we need to restrict our query so it brings back only a limited subset of data.
Where string distance measures were useful for sorting matches by quality, we now need to filter, so that only reasonable matches get returned at all.
None of these complex “string distance” measures can be run in SQL directly, but there is one building block we can use: the LIKE operator.
The LIKE clause searches for a specific pattern or substring, which can effectively give us a single N-gram search. Note, though, that a pattern with a leading wildcard ('%...') can’t use an ordinary B-tree index, so on very large tables a trigram index (such as PostgreSQL’s pg_trgm) may be needed to keep it efficient.
select * from CUSTOMER where NAME like '%dan%'
But this search uses just one trigram (an N-gram of 3 characters). What if the user misspelt “Dan”? We need a fuzzy search that can match elsewhere in the string, even if one part is misspelt!
The solution is to sample & search for multiple N-grams.
We should make our code configurable as to 1) how many N-grams are searched for, and 2) how long the N-grams are. As a starting point, though, three N-grams of three letters each works well.
The simplest approach is to select the N-grams evenly throughout the length of the search term. This code makes them lower-case, in expectation of case-insensitive matching:
static final int PARTIAL_COUNT = 3, PARTIAL_LENGTH = 3;

public Set<String> selectNGrams (String term) {
    Set<String> partialSet = new TreeSet<>();
    int availDistance = Math.max( term.length() - PARTIAL_LENGTH, 0);

    for (int i = 0; i < PARTIAL_COUNT; i++) {
        // spread the sample positions evenly over the available range
        int pos0 = (PARTIAL_COUNT > 1) ? availDistance * i / (PARTIAL_COUNT-1) : 0;
        int pos1 = Math.min( pos0 + PARTIAL_LENGTH, term.length());

        // lower-case each N-gram for case-insensitive matching
        String partial = term.substring( pos0, pos1).toLowerCase();
        partialSet.add( partial);
    }
    return partialSet;
}
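With the defaults above, selectNGrams("Dan Wood") returns the set {"dan", "n w", "ood"}: exactly the samples used in the example query below.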
Having collected a set of N-grams, we can then query for a Customer which matches any of these.
select * from CUSTOMER where (NAME like ? or NAME like ? or NAME like ?)
Prepared statements & bound SQL parameters are required for security, but to illustrate, here is what our “Dan Wood” example would look like:
select * from CUSTOMER where (NAME like '%dan%' or NAME like '%n w%' or NAME like '%ood%')
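Here’s a minimal sketch of how the query could be built & bound with JDBC. The findCandidates name, the Connection argument, and returning just the NAME column are assumptions made for this example:

import java.sql.*;
import java.util.*;
import java.util.stream.Collectors;

public List<String> findCandidates (Connection conn, Set<String> ngrams) throws SQLException {
    // one "NAME like ?" placeholder per N-gram, joined with "or"
    String conditions = ngrams.stream()
            .map(g -> "NAME like ?")
            .collect(Collectors.joining(" or "));
    String sql = "select NAME from CUSTOMER where (" + conditions + ")";

    try (PreparedStatement stmt = conn.prepareStatement(sql)) {
        // bind each N-gram, wrapped in wildcards, as a parameter
        int index = 1;
        for (String gram : ngrams)
            stmt.setString(index++, "%" + gram + "%");

        List<String> names = new ArrayList<>();
        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next())
                names.add(rs.getString("NAME"));
        }
        return names;
    }
}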
Searching the database like this is efficient & brings back only customers which are at least vaguely plausible as matches.
This approach is not perfect (excessive misspelling can cause the database search to miss), but users can easily retype & search again. If failed matches become a frequent problem, it’s easy to increase the number of N-grams searched via configuration; this can be balanced against the number of results & database load.
The idea here is not to attempt to be perfect in the database — or to be complex or costly there — but to perform a simple & efficient “rough” filtering step, which is still sufficiently fuzzy.
Once we have the results suitably pared down, we can bring these into Java & use our more-expensive string distance measurement to sort them.
This gives both an effective & efficient fuzziness to the database search, and a very high quality “best match” first ordering of results to the user.
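Putting the pieces together, a search might look like the following sketch, where selectNGrams, findCandidates, & levenshtein are the example methods from above, and conn is an assumed JDBC Connection:

// filter roughly in the database, then sort precisely in Java
List<String> candidates = findCandidates(conn, selectNGrams("Dan Wood"));
candidates.sort(Comparator.comparingInt(
        (String name) -> levenshtein("dan wood", name.toLowerCase())));
// candidates.get(0) is now the closest match, e.g. "Dan's Woodwork"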
Have you used similar approaches? Is this useful for you?
References:
– http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
– http://stackoverflow.com/questions/955110/similarity-string-comparison-in-java