How to speed up processing a huge text file?

I have an 800 MB text file with 18,990,870 lines in it (each line is a record). I need to pick out certain records and, if there is a match, write them into a database.

It is taking an age to work through them, so I wondered if there is a way to do it any quicker.

My PHP is reading a line at a time as follows:

    $fp2 = fopen('download/pricing20100714/application_price', 'r');
    if (!$fp2) { echo 'ERROR: Unable to open file.'; exit; }
    while (!feof($fp2)) {
        $line = stream_get_line($fp2, 128, $eoldelimiter); // use 2048 if very long lines
        if ($line[0] === '#') continue; // skip lines that start with #
        $field = explode($delimiter, $line);
        list($export_date, $application_id, $retail_price, $currency_code, $storefront_id) = explode($delimiter, $line);
        if ($currency_code == 'USD' and $storefront_id == '143441') {
            // does application_id exist?
            $application_id = mysql_real_escape_string($application_id);
            $query = "SELECT * FROM jos_mt_links WHERE link_id='$application_id';";
            $res = mysql_query($query);
            if (mysql_num_rows($res) > 0) {
                echo $application_id . " application id has price of " . $retail_price . " with currency of " . $currency_code . "\n";
            } // end if exists in SQL
        } else {
            // currency or storefront didn't match; skip this record
        } // end check for currency and storefront
    } // end while statement
    fclose($fp2);


At a guess, the performance issue is that the script issues a separate query for every record that matches the USD currency and your storefront.

If space and IO aren't an issue, you might just blindly write all 19M records into a new staging DB table, add indices and then do the matching with a filter?
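For MySQL (which the mysql_* calls in the question suggest), a rough sketch of that staging approach; the table name price_staging and the column types are assumptions based on the five fields in the file, and the delimiter would need to match the actual file:

    -- Staging table matching the file's five fields (types assumed)
    CREATE TABLE price_staging (
        export_date    VARCHAR(20),
        application_id VARCHAR(32),
        retail_price   DECIMAL(10,2),
        currency_code  CHAR(3),
        storefront_id  VARCHAR(16)
    );

    -- Bulk-load the file in one pass (lines starting with # would need
    -- stripping beforehand, and the delimiter adjusting to the real one)
    LOAD DATA LOCAL INFILE 'download/pricing20100714/application_price'
    INTO TABLE price_staging
    FIELDS TERMINATED BY '\t';

    -- Index the join column, then do the matching with a single query
    CREATE INDEX idx_staging_app ON price_staging (application_id);

    SELECT s.application_id, s.retail_price, s.currency_code
    FROM price_staging s
    JOIN jos_mt_links l ON l.link_id = s.application_id
    WHERE s.currency_code = 'USD' AND s.storefront_id = '143441';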


Don't reinvent the wheel; this has been done before. Use a database to search through the file's content: load the file into a staging table in your database and query it using indexes where they add value. Most if not all databases have import/loading tools to get a file into the database relatively fast.


19M rows in a DB will slow things down if the DB is not designed properly. You can still use text files if they are partitioned properly: recreating multiple smaller files, split on certain parameters and stored in sorted order, might work, as in the sketch below.
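To illustrate the partitioning idea, a minimal PHP sketch that splits the big file into one smaller file per storefront_id, so later passes only read the partition they need (the tab delimiter, field order, and the partitions/ output directory are assumptions):

    $in = fopen('download/pricing20100714/application_price', 'r');
    $out = array(); // one open handle per storefront_id
    while (($line = fgets($in)) !== false) {
        if ($line[0] === '#') continue; // skip comment lines
        $fields = explode("\t", rtrim($line, "\n"));
        if (count($fields) < 5) continue; // skip malformed lines
        $storefront = $fields[4]; // storefront_id is the 5th field
        if (!isset($out[$storefront])) {
            $out[$storefront] = fopen("partitions/storefront_$storefront.txt", 'w');
        }
        fwrite($out[$storefront], $line);
    }
    fclose($in);
    foreach ($out as $fp) { fclose($fp); }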

Anyway, PHP is not the best language for file IO and processing; it is much slower than Java at this task, while plain old C would be one of the fastest for the job. PHP should be restricted to generating dynamic Web output, with the core processing done in Java/C. Ideally it would be a Java/C service that generates the output, with PHP using that feed to generate the HTML.


You are parsing the input line twice by doing two explodes in a row:

    $field = explode($delimiter, $line);
    list($export_date, ...., $storefront_id) = explode($delimiter, $line);

Since $field is never used afterwards, I would start by removing the first line.

Also, if you are only using the query to test for a match based on your condition, don't use SELECT *; use something like this:

    "SELECT 1 FROM jos_mt_links WHERE link_id='$application_id';"

You could also, as Brandon Horsley suggested, buffer a set of application_id values in an array and modify your select statement to use the IN clause thereby reducing the number of queries you are performing.


Have you tried profiling the code to see where it's spending most of its time? That should always be your first step when trying to diagnose performance problems.
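A quick way to get that answer without a full profiler is to time the parsing work and the database work separately with microtime(); whichever total dominates tells you where to optimize. A sketch of the pattern, to be wrapped around the corresponding parts of the existing loop:

    $parse_time = $query_time = 0.0;

    // inside the loop, around the read/parse code:
    $t = microtime(true);
    // ... stream_get_line() and explode() ...
    $parse_time += microtime(true) - $t;

    // inside the loop, around the database code:
    $t = microtime(true);
    // ... mysql_query() and mysql_num_rows() ...
    $query_time += microtime(true) - $t;

    // after the loop:
    printf("parsing: %.1fs, queries: %.1fs\n", $parse_time, $query_time);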


Preprocess with sed and/or awk ?
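For example, assuming tab-delimited fields in the order the question's code expects, an awk pre-filter could shrink the 19M-line file down to just the candidate records before PHP ever sees it:

    # keep only USD / storefront 143441 records, dropping # comment lines
    awk -F'\t' '!/^#/ && $4 == "USD" && $5 == "143441"' \
        download/pricing20100714/application_price > filtered.txt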


Databases are built and designed to cope with large amounts of data; PHP isn't. You need to re-evaluate how you are storing the data.

I would dump all the records into a database, then delete the records you don't need. Once you have done that, you can copy those records wherever you want.
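Using a staging table like the one sketched in an earlier answer, that delete step is a single statement (table and column names assumed):

    DELETE FROM price_staging
    WHERE currency_code <> 'USD' OR storefront_id <> '143441';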


As others have mentioned, the expense is likely in your database query. It might be faster to load a batch of records from the file (instead of one at a time) and perform one query to check multiple records.

For example, load 1000 records that match the USD currency and storefront at a time into an array and execute a query like:

    'SELECT link_id FROM jos_mt_links WHERE link_id IN (' . implode(',', $application_id_array) . ')'

This will return the list of those records that are in the database. Alternatively, you could change the SQL to NOT IN to get the list of those records that are not in the database.
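A sketch of that batching loop in PHP, using the same mysql_* API as the question (the batch size of 1000 and the "\n" / "\t" delimiters are assumptions):

    // Run one IN query for a whole batch of candidate ids
    function check_batch($batch) {
        if (!$batch) return;
        $sql = 'SELECT link_id FROM jos_mt_links WHERE link_id IN (' . implode(',', $batch) . ')';
        $res = mysql_query($sql);
        while ($row = mysql_fetch_assoc($res)) {
            echo $row['link_id'] . " exists in the database\n";
        }
    }

    $batch = array();
    while (!feof($fp2)) {
        $line = stream_get_line($fp2, 1024, "\n");
        if ($line === false || $line === '' || $line[0] === '#') continue;
        list($export_date, $application_id, $retail_price, $currency_code, $storefront_id) = explode("\t", $line);
        if ($currency_code == 'USD' and $storefront_id == '143441') {
            $batch[] = "'" . mysql_real_escape_string($application_id) . "'";
        }
        if (count($batch) >= 1000) {
            check_batch($batch);
            $batch = array();
        }
    }
    check_batch($batch); // flush the final partial batch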
