stristr and speed

2023-01-15 00:54

I've got two files: file a is around 5 MB and file b is around 66 MB. I need to find out whether any of the lines in file a occur inside file b, and if so write them to file c.

This is the way I'm currently handling it:

ini_set("memory_limit", "1000M");
set_time_limit(0);
$small_list = file("a.csv");
$big_list = file_get_contents("b.csv");
$new_list = "c.csv";
$fh = fopen($new_list, 'a');
foreach ($small_list as $one_line)
{
    // Strict comparison: stristr() can return a string that is loosely equal to false.
    if (stristr($big_list, $one_line) !== false)
    {
        fwrite($fh, $one_line);
        echo "record found: " . $one_line . "<br>";
    }
}

The issue is that it has been running (successfully) for over an hour and is maybe 3,000 lines into the 160,000 in the smaller file. Any ideas?

Build arrays with hashes as indices:

Read in file a.csv line by line and store in a_hash[md5($line)] = array($offset, $length)
Read in file b.csv line by line and store in b_hash[md5($line)] = true

By using the hashes as indices you will automagically not wind up having duplicate entries.

Then for every hash that has an index in both a_hash and b_hash read in the contents of the file (using offset and length you stored in a_hash) to pull out the actual line text. If you're paranoid about hash collisions then store offset/length for b_hash as well and verify with stristr.

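This will run a lot faster, since each check is one array-key lookup rather than a scan of the whole 66 MB string, and it avoids holding b.csv in memory as a single string.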
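A rough sketch of that first suggestion, in case it helps. The function names index_lines() and common_lines() are made up for illustration, and it assumes you want whole-line, case-sensitive matches (unlike stristr, which matches substrings and ignores case). It stores offset/length for b.csv too, which is only needed if you want the collision check.

```php
<?php
// Hypothetical helper: map md5(line) => [offset, length] for each line of a file.
function index_lines(string $path): array
{
    $index = [];
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (true) {
        $offset = ftell($fh);          // where this line starts
        $line = fgets($fh);
        if ($line === false) {
            break;                     // end of file
        }
        $text = rtrim($line, "\r\n"); // ignore the line ending
        // Duplicate lines share a key, so only one entry per distinct line is kept.
        $index[md5($text)] = [$offset, strlen($text)];
    }
    fclose($fh);
    return $index;
}

// Hypothetical helper: return the lines of $small that also appear in $big.
function common_lines(string $small, string $big): array
{
    $a_hash = index_lines($small);
    $b_hash = index_lines($big);
    $found = [];
    $fh = fopen($small, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $small");
    }
    // Keep only the hashes that are keys in both arrays.
    foreach (array_intersect_key($a_hash, $b_hash) as [$offset, $length]) {
        if ($length === 0) {
            continue;                  // skip blank lines; fread() needs a length > 0
        }
        fseek($fh, $offset);
        $found[] = fread($fh, $length);
    }
    fclose($fh);
    return $found;
}
```

You would call it as common_lines('a.csv', 'b.csv') and write the returned lines to c.csv. Note that each distinct line in b.csv costs one array entry, so memory use grows with the number of lines.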

If you want to reduce the memory requirement further and don't mind checking for duplicates yourself, then:

Read in file a.csv line by line and store in a_hash[md5($line)] = false
Read in file b.csv line by line, hash each line and check whether the hash exists in a_hash.
If a_hash[md5($line)] == false, write the line to c.csv and set a_hash[md5($line)] = true

Some example code for the second suggestion:

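$a_file = fopen('a.csv', 'r');
$b_file = fopen('b.csv', 'r');
$c_file = fopen('c.csv', 'w+');

if (!$a_file || !$b_file || !$c_file) {
    echo "Broken!<br>";
    exit;
}

// Hash every line of the small file; false means "not written to c.csv yet".
$a_hash = array();
while (($line = fgets($a_file)) !== false) {
    // Strip the line ending so a final line without a newline still matches.
    $a_hash[md5(rtrim($line, "\r\n"))] = false;
}
fclose($a_file);

// Stream the big file and look each line up by its hash.
while (($line = fgets($b_file)) !== false) {
    $line = rtrim($line, "\r\n");
    $hash = md5($line);
    if (isset($a_hash[$hash]) && !$a_hash[$hash]) {
        echo 'record found: ' . $line . '<br>';
        fwrite($c_file, $line . "\n");
        $a_hash[$hash] = true; // don't write the same line twice
    }
}

fclose($b_file);
fclose($c_file);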

Try sorting the files first (especially the large one). Then you only need to check the first few characters of each line in b, and can stop (and go to the next line in a) once you're past that prefix. You can even build an index of the line on which each first character starts in the sorted file (a starts on line 0, b starts on line 1337, c on line 13986 and so on).
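One way to read the sorting idea, as a sketch only: once both lists are sorted you can walk them side by side and never look back. This is a plain merge-style walk rather than the first-character index described above, it assumes whole-line, case-sensitive matches, and sorted_intersection() is a made-up name.

```php
<?php
// Hypothetical helper: lines present in both arrays, found with a merge-style walk.
function sorted_intersection(array $a, array $b): array
{
    sort($a, SORT_STRING);
    sort($b, SORT_STRING);
    $found = [];
    $i = 0;
    $j = 0;
    $na = count($a);
    $nb = count($b);
    while ($i < $na && $j < $nb) {
        $cmp = strcmp($a[$i], $b[$j]);
        if ($cmp < 0) {
            $i++; // a's line sorts before b's current line, so it is not in b
        } elseif ($cmp > 0) {
            $j++; // b's line sorts before a's current line, so skip ahead in b
        } else {
            // Match: record it once, even if a contains the line several times.
            if (!$found || $found[count($found) - 1] !== $a[$i]) {
                $found[] = $a[$i];
            }
            $i++;
        }
    }
    return $found;
}
```

For the files in the question this could be called as sorted_intersection(file('a.csv', FILE_IGNORE_NEW_LINES), file('b.csv', FILE_IGNORE_NEW_LINES)). The sort dominates the cost; after that each line of either file is visited at most once.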

Try using ob_flush() and flush() in the loop, so that each "record found" line reaches the browser as soon as it is printed.

foreach ($small_list as $one_line)
{
    if (stristr($big_list, $one_line) !== false)
    {
        fwrite($fh, $one_line);
        echo "record found: " . $one_line . "<br>";
    }
    // Push any buffered output to the browser; @ hides the notice when no buffer is active.
    @ob_flush();
    @flush();
}
