Comparing text blocks for similar content

2023-01-17 01:15 问答作者：

I have two text blocks, containing company names. Both contain names of hundreds of companies, the newer list has more companies. How do I remove the duplicate company names from the two lists , so that I am left with the new names only? Sample text block one:

Company Name One, Address line 1, line 2, phone, email
..
Random text 
..

Company Name Two 
Address, Phone,email
..
Random text 
..
Company Name 3 
Address, Phone,email

Sample text block two:

..Random Text..
M/s Company Name One Extra Random Text, Address line 1, line 2, phone, 
..random text...
M/s Company Name Two 
Address, Phone
...

The Company name, address etc are similar not same. The second block has the words M/s before all company names. I would like to do this in php, using regex perhaps.

I would like to out put company names which match, for eg. in the example given above I would like to output that Company names: Company name One, Company name two are common to both test blocks.

Update: thanks to @Wrikken, I have the text in two strings. I can explode the second block using the M/s, and get an array. How do I then check each item from this array to match the first text block which is one long string?

Although, I have since done the job manually, I would still like to know how two text blocks can be compared for similarity and hence the bounty.

Update: Output for @Joyce Babu code

..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Addr开发者_开发知识库ess line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone ... ... ... ... ... ... ... ... ... ... ... ...

Output for @nikic

array(2) { [0]=>  string(17) "..Random Text.. " [4]=>  string(16) "Address, Phone " }

Output for @Joyce Babu second post

andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...Company Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name Tworess, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phone

@Joyce Babu Final Code

<?php
set_time_limit(500);
$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
$G=0;
    $c=0;

    foreach($arNew as $line){
    if(substr($line, 0, 4) == 'M/s '){
    $c++;   
    echo "<BR/>".$c.".)";
        $line = trim(substr($line, 4));
        foreach($arOld as $old){
            similar_text($line, $old, $percentage);
            if ($percentage > 80){
                continue;
            }
        }
        echo $line;
    }else{
    $G++;
    }
}
echo "<br/>".$G . " DID NOT MATCH";
?>

@ Output From Joyce Babu final code

1.)Company Name One Extra Random Text, Address line 1, line 2, phone,
2.)Company Name Two
4 DID NOT MATCH

Try this

set_time_limit(500)
$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
foreach($arNew as $line){
    if(substr($line, 0, 3) === 'M/s '){
        $line = trim(substr($line, 3));
        foreach($arOld as $old){
            similar_text($line, $old, $percentage);
            if ($percentage > 80){
                continue;
            }
        }
        echo $line;
    }
}

Create an array of both (possibly using the file() function, depending on the format of the text, or possibly just an explode() on content), and use array_diff().

if you need to compare this lists only once, i'd suggest converting docs to txt and then you'll be able to compare using regex. otherwise you'll need to use third party software to access info in the docs... like here maybe Reading/Writing a MS Word file in PHP

$oldList = file('oldList.txt');
$newList = file('newList.txt');
$list = array_udiff($newList, $oldList, 'compare');

function compare($new, $old) {
    similar_text($old, substr($new, 3), $percent);
    return $percent >= 80 ? 0 : 1;
}

This is my basic idea. To find all texts similar by 80% and remove them from the $newList. You should adjust the percentage to satisfy your needs. The M/s is removed by substr($new, 3).

If there are no key fields for uniquely identifying the records, I think you will have to use something like similar_text or levenshtein.

$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
foreach($arNew as $line){
   $line = trim(substr($line, 3));
   foreach($arOld as $old){
    similar_text($line, $old, $percentage);
    if ($percentage < 60){
        echo $line;
    }
   }
}

继续阅读：comparison diff php regex

Comparing text blocks for similar content

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？