PHP code to compare two large text files with ~300,000 entries and output differences
i've got two lists A and B, B = A + C - D. All elements are unique, no duplicates. How do i get the lists of:
(1) the new items added, C (2) the old items removed, D C and D aren't more than 10000 elements or so.Edit
Crap, sorry guys - forgot the important det开发者_Go百科ail - these are both text files, not in memory elements.
I think the size of arrays is irrelevant unless you really want to focus on how performant this operation is going to be i.e., you are going for a specific number of executions per unit of time.
If you just need to do it to get it done, it seems pretty trivial to me using array_diff()
$a = array( 1, 2, 3, 4 );
$b = array( 1, 3, 5, 7 ); // 2 and 4 removed, 5 and 7 added
$c = array_diff( $b, $a ); // [5, 7]
$d = array_diff( $a, $b ); // [2, 4]
The most efficient way to do this will be to sort your lists first and access the elements of your array as few times as possible. An example:
<?php
sort($a, SORT_NUMERIC);
sort($b, SORT_NUMERIC);
$c = array();
$d = array();
while (($currA = array_pop($a)) !== null) {
while (($currB = array_pop($b)) !== null) {
if ($currB == $currA) {
// exists in both, skip value
continue 2;
}
if ($currA > $currB) {
// exists in A only, add to D, push B back on to stack
$d[] = $currA;
$b[] = $currB;
continue 2;
}
// exists in B only, add to C
$c[] = $currB;
}
// exists in A only, for values of A < all of B
$d[] = $currA;
}
This is going to perform orders of magnitude faster than 2 calls to array_diff even for lists that are only a few hundred elements long.
You said you already have two files A and B.
Here's the easiest, fastest solution assuming you're running on a Unix system.
system("comm -13 A B > C");
system("comm -23 A B > D");
//read C and D in PHP
function diffLists($listA,$listB) {
$resultAdded = array();
$resultRemoved = array();
foreach($listB AS $item) {
if (!in_array($item,$listA)) {
$resultAdded[] = $item;
}
}
foreach($listA AS $item) {
if (!in_array($item,$listB)) {
$resultRemoved[] = $item;
}
}
return array($resultAdded,$resultRemoved);
}
$myListA = array('item1','item2','item3');
$myListB = array('item1','item3','item4');
print_r(diffLists($myListA,$myListB));
This should output an array with 2 elements. the first element is a list of items that are ADDED in list B and the second element is a list of items that were REMOVED in list B.
You might want try Levenshtein algorithm if you want to this more efficiently,
http://en.wikipedia.org/wiki/Levenshtein_distance
Searching for every value of A in B (and vice-versa) has O(n^2) complexity.
For large amounts of data, you're probably better to sort each of the lists O(n log n), then make a single pass through the sorted lists computing the added/removed elements. (Relatively easy to do, since you know there are no duplicates.)
精彩评论