Import and process text file within MySQL
I'm working on a research project that requires me to process large CSV files (~2-5 GB) with 500,000+ records. These files contain information on government contracts (from USASpending.gov). So far, I've been using PHP or Python scripts to attack the files row by row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks whether the entity named is already in the database (using a combination of string and regex matching); if it is not, it first adds the entity to a table of entities and then proceeds to parse the rest of the record and insert the information into the appropriate tables. The list of entities numbers over 100,000.
Here are the basic functions (part of a class) that try to match each record with any existing entities:
private function _getOrg($data)
{
    // if the name of the organization is null, skip the record
    if ($data[44] == '') return null;

    // use each of the possible names to check whether the organization exists
    $names = array($data[44], $data[45], $data[46], $data[47]);

    // cycle through the names
    foreach ($names as $name) {
        // check to see if there is actually an entry here
        if ($name != '') {
            if (($org_id = $this->_parseOrg($name)) != null) {
                // updates some information of the existing entity based on the record
                $this->update_org_meta($org_id, $data);
                return $org_id;
            }
        }
    }
    return $this->_addOrg($data);
}
private function _parseOrg($name)
{
    // check to see if it matches any org names
    // db class function, performs a simple LIKE match
    $this->db->where('org_name', $name, 'like');
    $result = $this->db->get('orgs');
    if (mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    // check to see if it matches any org aliases
    $this->db->where('org_alias_name', $name, 'like');
    $result = $this->db->get('orgs_aliases');
    if (mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    return null; // no matches, so a new entity has to be added
}
The _addOrg function inserts the new entity's information into the db, where hopefully it will match subsequent records.
Here's the problem: I can only get these scripts to parse about 10,000 records/hour, which, given the file sizes, means a few solid days for each file. The way my db is structured requires several different tables to be updated for each record because I'm compiling multiple external datasets. So, each record updates two tables, and each new entity updates three tables. I'm worried that this adds too much lag time between the MySQL server and my script.
Here's my question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or a PHP/Python wrapper) to speed up the processing?
I'm running this on Mac OS X 10.6 with a local MySQL server.
Load the file into a temporary/staging table using LOAD DATA INFILE and then use a stored procedure to process the data; it shouldn't take more than 1-2 minutes at most to completely load and process the data.
You might also find some of my other answers of interest:
Optimal MySQL settings for queries that deliver large amounts of data?
MySQL and NoSQL: Help me to choose the right one
How to avoid "Using temporary" in many-to-many queries?
60 million entries, select entries from a certain month. How to optimize database?
Interesting presentation:
http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/
Example code (may be of use to you):
truncate table staging;
start transaction;
load data infile 'your_data.dat'
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');
commit;
drop procedure if exists process_staging_data;
delimiter #
create procedure process_staging_data()
begin
insert ignore into organisations (org_name) select distinct org_name from staging;
update...
etc..
-- or use a cursor if you have to ??
end#
delimiter ;
call process_staging_data();
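If you'd rather drive the staging workflow above from a script instead of the mysql client, a minimal Python sketch could look like this (assumes MySQLdb or any DB-API 2.0 cursor; the statements simply mirror the SQL above):

```python
# Sketch: run the truncate / bulk-load / process steps from Python.
# Assumes a DB-API cursor (e.g. from MySQLdb); file path is whatever you pass in.
LOAD_SQL = (
    "LOAD DATA INFILE %s INTO TABLE staging "
    "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
    "LINES TERMINATED BY '\\n'"
)

def run_staging_pipeline(cursor, path):
    """Truncate staging, bulk-load the file, then process it server-side."""
    cursor.execute("TRUNCATE TABLE staging")
    cursor.execute(LOAD_SQL, (path,))
    cursor.execute("CALL process_staging_data()")
```

The point is that the heavy lifting stays inside MySQL; the script only issues three statements per file.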
Hope this helps
It sounds like you'd benefit the most from tuning your SQL queries, which is probably where your script spends most of its time. I don't know how the PHP MySQL client performs, but MySQLdb for Python is fairly fast. Doing naive benchmark tests, I can easily sustain 10k insert/select queries per second on one of my older quad-cores. Instead of doing one SELECT after another to test whether the organization exists, using a REGEXP to check for them all at once might be more efficient (discussed here: MySQL LIKE IN()?). MySQLdb also lets you use executemany() to do multiple inserts in a single call, which you could almost certainly leverage to your advantage. Perhaps your PHP client lets you do the same thing?
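As a sketch of the executemany() idea (the table and column names here are hypothetical, and the batching helper is just illustrative):

```python
# Sketch: batch inserts with executemany() instead of one INSERT per record.
# Assumes MySQLdb (or any DB-API 2.0 cursor); table/columns are hypothetical.
def chunked(rows, size=1000):
    """Yield successive batches of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def insert_contracts(cursor, rows, batch_size=1000):
    """Insert (org_id, amount) tuples in batches rather than one at a time."""
    sql = "INSERT INTO contracts (org_id, amount) VALUES (%s, %s)"
    for batch in chunked(rows, batch_size):
        cursor.executemany(sql, batch)
```

With a real connection, this turns one round trip per record into one per batch, which is usually where most of the time goes.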
Another thing to consider: with Python you can use multiprocessing to try to parallelize as much as possible. PyMOTW has a good article about multiprocessing.
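A minimal multiprocessing sketch along those lines (the parse_record body is just a stand-in for your real parsing logic):

```python
# Sketch: fan CPU-bound record parsing out across worker processes.
# parse_record is a hypothetical stand-in for the real parsing code.
from multiprocessing import Pool

def parse_record(line):
    """Split a CSV line and normalize the first field as the org name."""
    fields = line.split(',')
    return fields[0].strip().upper()

def parse_all(lines, workers=4):
    """Parse records in parallel across `workers` processes."""
    with Pool(workers) as pool:
        return pool.map(parse_record, lines)
```

Only the parsing is parallelized here; the database inserts would still happen in one place afterwards, which avoids contention on the connection.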