High performance PHP simaliraty checking on large database

2023-03-09 09:59 问答作者：

I have 30,000 rows in a database that need to be similarity checked (using similar_text or another such function).

In order to do this it will require doing 30,000^2 checks for each columns.

I estimate I will be checking on average 4 columns.

This means I will have to do 3,600,000,000 checks.

What is the best (fastest, and most reliable) way to do this with PHP, bearing in mind request memory limits and time limits etc?

The server need to still actively serve webpages at the 开发者_如何学编程same time as doing this.

PS. The server we are using is an 8 core Xeon 32 GB ram.

Edit:

The size of each column is normally less that 50 characters.

I guess you just need FULL TEXT search.

If that not fits you, you have only one chance to solve this: cache the results. So you will not have to parse 3bil of records for each requests

Anyway here how you can do it:

   $result = array();       

   $sql = "SELECT * FROM TABLE";
   while( $row = ... ) {
      $result[] = $row;    //> Append the current record     
   }

Now results contains all the rows from your table.

At this point you said you want to similar_text() all columns with each other.
To do that and cache the results you need at least a table (as I said in the comment).

   //> Starting calculating the similarity
   foreach($result as $k=>$v) {
      foreach($result as $k2=>$v2) {

           //> At this point you have 2 rows, $v and $v2 containing your column

           $similarity = 0;

           $similartiy += levensthein($v['column1'],$v2['column1']);
           $similartiy += levensthein($v['column2'],$v2['column2']);               
           //> What ever comparison you need here between columns

           //> Now you can finally store the result by inserting in a table the $similarity
           "INSERT DELAYED INTO similarity (value) VALUES ('$similarity')"; 

      }                      
   }

2 Things you have to notice:

I used levensthein because it's much faster than similar_text (notice it's value it's the contrary of similar_text, because the greater the value levensthein returns the less the affinity between string)
I Used INSERT DELAYED to greatly lower the database cost

oy... similar_text() is O(n^3) !

do you really need a percentage similarity for each comparison or can you just do a quick compare of the first/middle/last X bytes of the strings to narrow the field?

if you're just looking for dups say... you can probably narrow down the number of comparisons you need to do, and that will be the most effective tack imho.

继续阅读：performance php

High performance PHP simaliraty checking on large database

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？