Comparing strings with tolerance
I'm looking for a way to compare a string with an array of strings. Doing an exact 开发者_如何转开发search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on.
Is there some kind of framework which can perform such a search? I'm having something in mind that the search algorithm will return a few results order by the percentage of match or something like this.
You could use the Levenshtein Distance algorithm.
"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." - Wikipedia.com
This one is from dotnetperls.com:
using System;
/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings.
/// </summary>
public static int Compute(string s, string t)
{
int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];
// Step 1
if (n == 0)
{
return m;
}
if (m == 0)
{
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
// Step 3
for (int i = 1; i <= n; i++)
{
//Step 4
for (int j = 1; j <= m; j++)
{
// Step 5
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
}
class Program
{
static void Main()
{
Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
}
}
You may in fact prefer to use the Damerau-Levenshtein distance algorithm, which also allows characters to be transposed, which is a common human error in data entry. You'll find a C# implementation of it here.
There is nothing in the .NET framework that will help you with this out-of-the-box.
The most common spelling mistakes are those where the letters are a decent phonetic representation of the word, but not the correct spelling of the word.
For example, it could be argued that the words sword
and sord
(yes, that's a word) have the same phonetic roots (they sound the same when you pronounce them).
That being said, there are a number of algorithms that you can use to translate words (even mispelled ones) into phonetic variants.
The first is Soundex. It's fairly simple to implement and there are a fair number of .NET implementations of this algorithm. It's rather simple, but it gives you real values you can compare to each other.
Another is Metaphone. While I can't find a native .NET implementation of Metaphone, the link provided has links to a number of other implementations which could be converted. The easiest to convert would probably be the Java implementation of the Metaphone algorithm.
It should be noted that the Metaphone algorithm has gone through revisions. There is Double Metaphone (which has a .NET implementation) and Metaphone 3. Metaphone 3 is a commercial application, but has a 98% accuracy rate compared to an 89% accuracy rate for the Double Metaphone algorithm when run against a database of common English words. Depending on your need, you might want to look for (in the case of Double Metaphone) or purchase (in the case of Metaphone 3) the source for the algorithm and convert or access it through the P/Invoke layer (there are C++ implementations abound).
Metaphone and Soundex differ in the sense that Soundex produces fixed length numeric keys, whereas Metaphone produces keys of different length, so the results will be different. In the end, both will do the same kind of comparison for you, you just have to find out which suits your needs the best, given your requirements and resources (and intolerance levels for the spelling mistakes, of course).
Here is an implementation of the LevenshteinDistance method that uses far less memory while producing the same results. This is a C# adaptation of the pseudo code found in this wikipedia article under the "Iterative with two matrix rows" heading.
public static int LevenshteinDistance(string source, string target)
{
// degenerate cases
if (source == target) return 0;
if (source.Length == 0) return target.Length;
if (target.Length == 0) return source.Length;
// create two work vectors of integer distances
int[] v0 = new int[target.Length + 1];
int[] v1 = new int[target.Length + 1];
// initialize v0 (the previous row of distances)
// this row is A[0][i]: edit distance for an empty s
// the distance is just the number of characters to delete from t
for (int i = 0; i < v0.Length; i++)
v0[i] = i;
for (int i = 0; i < source.Length; i++)
{
// calculate v1 (current row distances) from the previous row v0
// first element of v1 is A[i+1][0]
// edit distance is delete (i+1) chars from s to match empty t
v1[0] = i + 1;
// use formula to fill in the rest of the row
for (int j = 0; j < target.Length; j++)
{
var cost = (source[i] == target[j]) ? 0 : 1;
v1[j + 1] = Math.Min(v1[j] + 1, Math.Min(v0[j + 1] + 1, v0[j] + cost));
}
// copy v1 (current row) to v0 (previous row) for next iteration
for (int j = 0; j < v0.Length; j++)
v0[j] = v1[j];
}
return v1[target.Length];
}
Here is a function that will give you the percentage similarity.
/// <summary>
/// Calculate percentage similarity of two strings
/// <param name="source">Source String to Compare with</param>
/// <param name="target">Targeted String to Compare</param>
/// <returns>Return Similarity between two strings from 0 to 1.0</returns>
/// </summary>
public static double CalculateSimilarity(string source, string target)
{
if ((source == null) || (target == null)) return 0.0;
if ((source.Length == 0) || (target.Length == 0)) return 0.0;
if (source == target) return 1.0;
int stepsToSame = LevenshteinDistance(source, target);
return (1.0 - ((double)stepsToSame / (double)Math.Max(source.Length, target.Length)));
}
Your other option is to compare phonetically using Soundex or Metaphone. I just completed an article that presents C# code for both algorithms. You can view it at http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex.
Here are two methods that calculate the Levenshtein Distance between strings.
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Once you have the result, you'll need to define what value you want to use as a threshold for a match or not. Run the function on a bunch of sample data to get a good idea of how it works to help decide on your particular threshold.
/// <summary>
/// Calculates the Levenshtein distance between two strings--the number of changes that need to be made for the first string to become the second.
/// </summary>
/// <param name="first">The first string, used as a source.</param>
/// <param name="second">The second string, used as a target.</param>
/// <returns>The number of changes that need to be made to convert the first string to the second.</returns>
/// <remarks>
/// From http://www.merriampark.com/ldcsharp.htm
/// </remarks>
public static int LevenshteinDistance(string first, string second)
{
if (first == null)
{
throw new ArgumentNullException("first");
}
if (second == null)
{
throw new ArgumentNullException("second");
}
int n = first.Length;
int m = second.Length;
var d = new int[n + 1, m + 1]; // matrix
if (n == 0) return m;
if (m == 0) return n;
for (int i = 0; i <= n; d[i, 0] = i++)
{
}
for (int j = 0; j <= m; d[0, j] = j++)
{
}
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
int cost = (second.Substring(j - 1, 1) == first.Substring(i - 1, 1) ? 0 : 1); // cost
d[i, j] = Math.Min(
Math.Min(
d[i - 1, j] + 1,
d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
return d[n, m];
}
精彩评论