开发者

PHP code to compare two large text files with ~300,000 entries and output differences

i've got two lists A and B, B = A + C - D. All elements are unique, no duplicates. How do i get the lists of:

(1) the new items added, C

(2) the old items removed, D

C and D aren't more than 10000 elements or so.

Edit

Crap, sorry guys - forgot the important det开发者_Go百科ail - these are both text files, not in memory elements.


I think the size of arrays is irrelevant unless you really want to focus on how performant this operation is going to be i.e., you are going for a specific number of executions per unit of time.

If you just need to do it to get it done, it seems pretty trivial to me using array_diff()

$a = array( 1, 2, 3, 4 );
$b = array( 1, 3, 5, 7 ); // 2 and 4 removed, 5 and 7 added

$c = array_diff( $b, $a ); // [5, 7]
$d = array_diff( $a, $b ); // [2, 4]


The most efficient way to do this will be to sort your lists first and access the elements of your array as few times as possible. An example:

<?php

sort($a, SORT_NUMERIC);
sort($b, SORT_NUMERIC);
$c = array();
$d = array();
while (($currA = array_pop($a)) !== null) {
        while (($currB = array_pop($b)) !== null) {
                if ($currB == $currA) {
                        // exists in both, skip value
                        continue 2;
                }
                if ($currA > $currB) {
                        // exists in A only, add to D, push B back on to stack
                        $d[] = $currA;
                        $b[] = $currB;
                        continue 2;
                }
                // exists in B only, add to C
                $c[] = $currB;
        }
        // exists in A only, for values of A < all of B
        $d[] = $currA;
}

This is going to perform orders of magnitude faster than 2 calls to array_diff even for lists that are only a few hundred elements long.


You said you already have two files A and B.

Here's the easiest, fastest solution assuming you're running on a Unix system.

system("comm -13 A B > C");
system("comm -23 A B > D");

//read C and D in PHP


function diffLists($listA,$listB) {

  $resultAdded = array();
  $resultRemoved = array();
  foreach($listB AS $item) {
    if (!in_array($item,$listA)) {
       $resultAdded[] = $item;
    }
  }
  foreach($listA AS $item) {
    if (!in_array($item,$listB)) {
      $resultRemoved[] = $item;
    }
  }
  return array($resultAdded,$resultRemoved);
}



$myListA = array('item1','item2','item3');
$myListB = array('item1','item3','item4');
print_r(diffLists($myListA,$myListB));

This should output an array with 2 elements. the first element is a list of items that are ADDED in list B and the second element is a list of items that were REMOVED in list B.


You might want try Levenshtein algorithm if you want to this more efficiently,

http://en.wikipedia.org/wiki/Levenshtein_distance


Searching for every value of A in B (and vice-versa) has O(n^2) complexity.

For large amounts of data, you're probably better to sort each of the lists O(n log n), then make a single pass through the sorted lists computing the added/removed elements. (Relatively easy to do, since you know there are no duplicates.)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜