
Detecting string similarity using Levenshtein distance is SLOW

The query returns 21 million records

The way I loop through this table takes forever. What other solutions are there?

SqlDataReader rd = DbInfo.DataRdr(Conn,
 "SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID as AID, b.ID AS BUD " +
 "FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID <> b.ID");

while (rd.Read())
{
   if (rd["ANAME"].ToString().LevenshteinDistance(rd["BNAME"].ToString()) <= 10)
   {
      Logging.Write(...);
   }
}



    public static int LevenshteinDistance(this string s, string t)
    {
        if (s == null)
            throw new ArgumentNullException("s");
        if (t == null)
            throw new ArgumentNullException("t");

        int n = s.Length;
        int m = t.Length;

        // If either string is empty, the distance is the length of the other.
        if (n == 0 || m == 0)
            return Math.Max(m, n);

        // d[i, j] holds the distance between the first i chars of s and the first j chars of t.
        int[,] d = new int[n + 1, m + 1];

        // Transforming a prefix to or from the empty string costs its length.
        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }
        for (int j = 0; j <= m; j++)
        {
            d[0, j] = j;
        }

        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < m; j++)
            {
                int cost = (t[j] == s[i]) ? 0 : 1;

                // Cheapest of: deletion, insertion, substitution (or match when cost == 0).
                d[i + 1, j + 1] = Math.Min(Math.Min(d[i, j + 1] + 1, d[i + 1, j] + 1), d[i, j] + cost);
            }
        }

        return d[n, m];
    }
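
A quick sanity check with the classic example: three single-character edits turn "kitten" into "sitting", so the method should return 3.

    int d = "kitten".LevenshteinDistance("sitting");   // kitten -> sitten -> sittin -> sitting: 3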


You could start by using the following as your query instead, depending on how often the NUM columns are actually equal:

SELECT a.NAME AS ANAME, b.NAME AS BNAME, other things
FROM myTable a
JOIN myTable b
ON a.NUM = b.NUM
AND a.id < b.id

Then your SQL query will give you pairs of rows with matching NUMs that you can call LevenshteinDistance on. Something like:

DataTable dt1 = new DataTable();
dt1.Load(DbInfo.DataRdr(Conn, "[the query I wrote above]"));

for (int i = 0; i < dt1.Rows.Count; i++)
{
   if (dt1.Rows[i]["ANAME"].ToString().LevenshteinDistance(dt1.Rows[i]["BNAME"].ToString()) <= 10)
   {
     Logging.Write(...[modify the query so it returns the things you want to log]...);
   }
}


You're comparing dt1.Rows[i]["Name"].ToString() with dt1.Rows[j]["Name"].ToString() even when i == j.

Try looping 0 <= i < dt1.Rows.Count - 1 and i + 1 <= j < dt1.Rows.Count.

Also, you should log only if dt1.Rows[i]["NUM"].ToString() == dt1.Rows[j]["NUM"].ToString(), which is probably a faster check. There's no point in doing Levenshtein if that's false.

EDIT: @John is right about the NUM comparison; the second index should be j, not i (corrected above).
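
In loop form, the suggested bounds look like this (a sketch; dt1 is the DataTable from the answer above, and ComparePair is a hypothetical stand-in for the NUM check plus the Levenshtein call):

for (int i = 0; i < dt1.Rows.Count - 1; i++)
{
   for (int j = i + 1; j < dt1.Rows.Count; j++)
   {
      // Each unordered pair (i, j) is visited exactly once, never with i == j.
      ComparePair(dt1.Rows[i], dt1.Rows[j]);
   }
}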


Optimizations:

1) Take a look at your data. Maybe you can do some cheap checks to rule out invalid pairs faster. If the length of NAME varies by more than 10, you can check whether the difference between s.Length and t.Length is greater than 10 and return a high distance right away (maybe int.MaxValue or just 100). There is no point in calculating the distance if it's clearly out of range anyway; see the sketch below.
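
A minimal sketch of that guard, wrapping the extension method from the question (the name LevenshteinDistanceCapped is made up for illustration):

public static int LevenshteinDistanceCapped(this string s, string t, int max)
{
    // Lengths differing by more than max already force a distance above max,
    // so we can skip the O(n * m) computation entirely.
    if (Math.Abs(s.Length - t.Length) > max)
        return int.MaxValue;

    return s.LevenshteinDistance(t);
}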

2) Look for small optimizations. Looping twice over 150k rows means 22.5 billion iterations, so small changes can have a great effect. You may try to cache the row objects and drop the ToString() calls by using the Equals() method. I think this would be faster than accessing the i-th element of your DataTable 150,000 times.

for (int i = 0; i < dt1.Rows.Count - 1; i++)
{
   var outerRow = dt1.Rows[i];
   for (int j = i + 1; j < dt1.Rows.Count; j++)
   {
      var innerRow = dt1.Rows[j];
      if (Equals(outerRow["NUM"], innerRow["NUM"]))
      {
         if (outerRow["Name"].ToString().LevenshteinDistance(innerRow["Name"].ToString()) <= 10)
         {
            Logging.Write(...);
         }
      }
   }
}

3) Try to reduce/split the data sets. Execute a query to get all possible values of NUM (SELECT DISTINCT NUM FROM myTable). Then, for each NUM in the result, run your original query with a WHERE condition and select only the name: SELECT NAME FROM myTable WHERE NUM = currentNum.

This way you don't have to compare the NUM column at all, and you don't pull back irrelevant data. Your code then does nothing but the Levenshtein distance, still using the optimizations from 1) and 2).
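
A rough sketch of that approach, reusing the Conn and DbInfo.DataRdr helper from the question (parameterize the query in real code instead of concatenating):

DataTable nums = new DataTable();
nums.Load(DbInfo.DataRdr(Conn, "SELECT DISTINCT NUM FROM myTable"));

foreach (DataRow numRow in nums.Rows)
{
   DataTable names = new DataTable();
   names.Load(DbInfo.DataRdr(Conn, "SELECT NAME FROM myTable WHERE NUM = " + numRow["NUM"]));

   // Compare names only within this NUM group.
   for (int i = 0; i < names.Rows.Count - 1; i++)
   {
      string a = names.Rows[i]["NAME"].ToString();
      for (int j = i + 1; j < names.Rows.Count; j++)
      {
         if (a.LevenshteinDistance(names.Rows[j]["NAME"].ToString()) <= 10)
         {
            Logging.Write(...);   // placeholder, as in the question
         }
      }
   }
}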

4) Try a different approach, like full-text search.

I just had to solve a similar problem, finding matches in a 1.2-million-row table. I used Lucene.NET, which gives me real-time results when searching over one or more properties of my rows.

Lucene does Levenshtein too, and maybe its implementation is faster than yours ;) MS SQL Server supports full-text search as well.


The biggest improvement you can make is to reduce the solution space being considered. Since you want a maximum distance of 10, any two strings whose lengths differ by more than 10 cannot possibly qualify:

SELECT a.NAME AS ANAME, b.NAME AS BNAME, a.ID AS AID, b.ID AS BID
FROM myTable a JOIN myTable b ON a.NUM = b.NUM AND a.ID < b.ID
WHERE LEN(a.NAME) - LEN(b.NAME) BETWEEN -10 AND 10;

Next, profile your code and see where the hot spots are. A good introductory article: Find Application Bottlenecks with Visual Studio Profiler.

And have a look at Optimizing the Levenshtein Algorithm in C#.

Edit

Also, Chris noticed that since levenshtein(a, b) == levenshtein(b, a), you only need to select a.ID < b.ID in the join, because the matrix is symmetrical. This halves your problem right off the bat.


After applying the other optimizations mentioned in this thread, you can move the Levenshtein computation to the server and SELECT only the rows that match your edit-distance criterion. I needed this functionality in a project, so I made a library out of it, here. The edit distance method used in the lib only requires n * 2 memory instead of n * m, because each row of the matrix depends only on the row before it.
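
The library's exact code isn't reproduced here, but a minimal sketch of the two-row idea (the name LevenshteinDistanceTwoRow is made up for illustration) looks like this:

public static int LevenshteinDistanceTwoRow(this string s, string t)
{
    int n = s.Length, m = t.Length;
    if (n == 0 || m == 0)
        return Math.Max(n, m);

    // Keep only two rows of the DP matrix alive at a time: O(m) memory instead of O(n * m).
    int[] prev = new int[m + 1];
    int[] curr = new int[m + 1];
    for (int j = 0; j <= m; j++)
        prev[j] = j;

    for (int i = 0; i < n; i++)
    {
        curr[0] = i + 1;
        for (int j = 0; j < m; j++)
        {
            int cost = (s[i] == t[j]) ? 0 : 1;
            curr[j + 1] = Math.Min(Math.Min(prev[j + 1] + 1, curr[j] + 1), prev[j] + cost);
        }
        // The current row becomes the previous row for the next iteration.
        var tmp = prev; prev = curr; curr = tmp;
    }
    return prev[m];
}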

For instance, even on the server you only want to run the EditDistance computation when the difference in string length is within 10, so check for that first. Something like:

SELECT a.NAME AS NameA, b.NAME AS NameB
FROM myTable a
JOIN myTable b ON a.NUM = b.NUM
WHERE a.ID < b.ID
  AND LEN(a.NAME) - LEN(b.NAME) BETWEEN -10 AND 10
  AND EditDistance(a.NAME, b.NAME) <= 10
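
EditDistance is not a built-in T-SQL function; on SQL Server it would typically come from a CLR scalar UDF, for example one provided by a library like the one linked above. A minimal sketch of such a wrapper, reusing the two-row method sketched earlier and deployed with CREATE ASSEMBLY / CREATE FUNCTION:

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public static class StringDistanceUdf
{
    // Hypothetical CLR scalar UDF exposing the edit distance to T-SQL.
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlInt32 EditDistance(SqlString a, SqlString b)
    {
        if (a.IsNull || b.IsNull)
            return SqlInt32.Null;
        return new SqlInt32(a.Value.LevenshteinDistanceTwoRow(b.Value));
    }
}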
