Detecting string similarity using Levenshtein distance (is SLOW)
The query returns 21 million records
They way I loop through this table takes forever. What other solutions are there?
SqlDataReader rd = DbInfo.DataRdr(Conn,
"SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID as AID, b.ID AS BUD " +
"FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID <> b.ID");
while (rd.Read())
{
if (rd["ANAME"].ToString().LevenshteinDistance(rd["BNAME"].ToString()) <= 10)
{
Logging.Write(...);
}
}
public static int LevenshteinDistance(this string s, string t)
{
if (s == null)
throw new ArgumentNullException("s");
if (t == null)
throw new ArgumentNullException("t");
int n = s.Length;
int m = t.Length;
int[,] d = new int[n+开发者_开发百科1,m+1];
if (n == 0 || m == 0)
return Math.Max(m, n);
for (int i = 0; i <= n; i++)
{
d[i, 0] = i;
}
for (int i = 0; i < m; i++)
{
d[0, i] = i;
}
for (int i = 0; i < n; i++)
{
for (int j = 0; j < m; j++)
{
int cost = (t[j] == s[i]) ? 0 : 1;
d[i + 1, j + 1] = Math.Min(Math.Min(d[i, j + 1] + 1, d[i + 1, j] + 1), d[i, j] + cost);
}
}
return d[n, m];
}
You could start by using the following as your query instead, depending on how often the NUM
columns are actually equal:
SELECT a.NAME AS ANAME, b.NAME AS BNAME, other things
FROM myTable a
JOIN myTable b
ON a.NUM = b.NUM
AND a.id < b.id
Then your SQL query will give you pairs of rows with matching NUM
s that you can call LevenshteinDistance
on. Something like:
DataTable dt1 = new DataTable();
dt1.Load(DbInfo.DataRdr(Conn, "[the query I wrote above]"));
for (int i = 0; i < dt1.Rows.Count; i++)
{
if (dt1.Rows[i]["ANAME"].ToString().LevenshteinDistance(dt1.Rows[i]["BNAME"].ToString()) <= 10)
{
Logging.Write(...[modify the query so it returns the things you want to log]...);
}
}
You're comparing dt1.Rows[i]["Name"].ToString()
with dt1.Rows[j]["Name"].ToString()
even when i == j
.
Try looping 0 <= i < dt1.Rows.Count - 1
and i + 1 <= j < dt1.Rows.Count
.
Also, you're logging only if dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString()
, which is probably a faster check. There's no point in doing Levenshtein if that's false.
EDIT: @John is right about the dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString()
(both i
?).
Optimizations:
1) Take a look at your data. Maybe you can do some checks to sort out invalid pairs faster. If the lenght of Name
varies over more than 10 you can check if the difference between s.Lenght
and t.Length
is greater 10 and return a high distance right away (maybe int.MaxValue or just 100). There is no point in calculating the distance if it`s clearly out of scope anyway.
2) Look for small optimizations. Looping twice over 150k rows means 22.5 billion iterations. Small changes may have a great effect. You may try to cache to row objects and remove the ToString()
by usign the Equals()
method. I think this would be faster than accessing the i-th element of your datatable 150000 times.
for (int i = 0; i < dt1.Rows.Count; i++)
{
var outerRow = dt1.Rows[i];
for (int j = 0; i + 1 < dt1.Rows.Count; j++)
{
var innerRow = dt1.Rows[j];
if (Equals(outerRow["NUM"] == innerRow["NUM"]))
{
if (outerRow["Name"].ToString().LevenshteinDistance(innerRow.ToString()) <= 10)
{
Logging.Write(...);
}
}
}
3) Try to reduce/split the datasets. Execute a query to get all possible values of NUM
select distinct NUM from myTable
. Then for each NUM
in your result do your original query but using a where condition and select only the name: SELECT name from myTable where NUM = currentNum
.
This way you dont have to compare the NUM row and you don
t select odd data. Your code can be optimized to do just the levenshtein distance but using the optimizations stated in 1+2.
4) Try a different approach like fulltext search.
I just hat to solve a similar problem, finding matches in a 1.2 million rows table. I used lucene.net which provides me realtime results when searching over one or more properties of my rows.
They do levenshtein too, but Maybe its faster than your implementation ;) MSSQL Server supports fulltext search, too.
The biggest improvement you can make is to reduce the solution space being considered. Since you want a max distance of 10, any strings that differ by length more than 10 cannot possibly qualify:
SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID as AID, b.ID AS BUD
FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID < b.ID
WHERE length(a.NAME) - length(b.NAME) BETWEEN -10 AND 10;
Next, profile your code and see where are the hot spots. A good entry article: Find Application Bottlenecks with Visual Studio Profiler.
And have a look at Optimizing the Levenshtein Algorithm in C#.
Edit
Also Chris noticed that since levenshtein(a,b) == levenshtein(b,a)
you need only select on the join a.ID < b.ID, since the matrix is symmetrical. This will halve your problem right off the bat.
After doing the other optimizations mentioned in this thread, you can move the Levenshtein computation to the server and only SELECT the rows that matches your Edit distance criteria. I needed this functionality in a project, so I made a library out of it, here. The edit distance method used in the lib only requires n * 2 memory instead of n * m.
For instance, even when on the server, you only want to do the EditDistance computation when the difference in string length is < 10, so check for that first. Something like
SELECT a.NAME as NameA, b.NAME as NameB
FROM table a
JOIN table b ON a.NUM = b.NUM
WHERE a.Id < b.Id
AND length(a.NAME) - length(b.NAME) BETWEEN -10 AND 10 OR
EditDistance(a.Name, b.Name) < 10
精彩评论