
How to parse an extremely large xml file in either java or php and insert into a mysql DB [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 4 years ago.

I'm trying to parse a massive XML file into my MySQL database. The file is 4.7 GB. I know, it's insane.

The data comes from here: http://www.discogs.com/data/ (the newest album XML is 700 MB zipped and 4.7 GB unzipped)

I can either use Java or PHP to parse the file and update the database. I assume Java is the smarter choice.

I need to find a way to parse the XML without filling my 4 GB of RAM, and load it into the DB.

What is the smartest way of doing this? I've heard of SAX parsers; am I thinking in the right direction?

For now, I don't care about downloading the images from those URLs, I just want the data in my database. I haven't designed the tables yet, but I'm more interested in the XML side right now.

I used PHP's fread() to read the file's first 1000 bytes, so at least I can see what it looks like. Here's a sample of the structure of the first album in the file:

<releases>
<release id="1" status="Accepted">
    <images>
        <image height="600" type="primary" uri="http://s.dsimg.com/image/R-1-1193812031.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812031.jpeg" width="600" />
        <image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812053.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812053.jpeg" width="600" />
        <image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812072.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812072.jpeg" width="600" />
        <image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812091.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812091.jpeg" width="600" />
    </images>
    <artists>
        <artist>
            <name>Persuader, The</name>
        </artist>
    </artists>
    <title>Stockholm</title>
    <labels>
        <label catno="SK032" name="Svek" />
    </labels>
    <formats>
        <format name="Vinyl" qty="2">
            <descriptions>
                <description>12"</description>
            </descriptions>
        </format>
    </formats>
    <genres>
        <genre>Electronic</genre>
    </genres>
    <styles>
        <style>Deep House</style>
    </styles>
    <country>Sweden</country>
    <released>1999-03-00</released>
    <notes>Recorded at the Globe studio in Stockholm. The titles are the names of Stockholm's districts.</notes>
    <master_id>5427</master_id>
    <tracklist>
        <track>
            <position>A</position>
            <title>Östermalm</title>
            <duration>4:45</duration>
        </track>
        <track>
            <position>B1</position>
            <title>Vasastaden</title>
            <duration>6:11</duration>
        </track>
        <track>
            <position>B2</position>
            <title>Kungsholmen</title>
            <duration>2:49</duration>
        </track>
        <track>
            <position>C1</position>
            <title>Södermalm</title>
            <duration>5:38</duration>
        </track>
        <track>
            <position>C2</position>
            <title>Norrmalm</title>
            <duration>4:52</duration>
        </track>
        <track>
            <position>D</position>
            <title>Gamla Stan</title>
            <duration>5:16</duration>
        </track>
    </tracklist>
</release>

Thanks.


You obviously need a streaming API rather than a DOM, which would need to hold the whole document in memory. Java supports SAX and StAX. I've never used StAX myself, but I've heard it's easier to use than SAX while still efficient.

Make sure to split the work into multiple transactions as well: the database won't cope with that many insert statements in a single transaction.


I faced a similar problem some time ago. Here is part of a script which imported a roughly 28 MB file without reading all the data into memory. It reads the file node by node, so only a small part of the XML is held in memory at any time. The script will need small modifications to fit your needs.

// Assumed to be defined elsewhere in the original script (adapt to your setup):
//   $columns         - map of DB column name => XML child element name
//   $DateColumns     - column keys holding dates in d/m/Y form
//   $SplitDateInsert - 'enabled' to also store separate year/month/day columns
//   $tableName       - target table name
//   $db              - a DB wrapper exposing query()
// Note: building SQL with addslashes() is fragile; prepared statements would
// be safer, but the original approach is kept here.
$reader = new XMLReader();
$reader->open('<path_to_large_xml_file>');
while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            // "Table" was the record element in the original file;
            // for the Discogs dump this would be "release".
            if ($reader->localName == "Table") {

                // Expand just this one record into a small DOM tree.
                $node = $reader->expand();
                $dom = new DOMDocument();
                $n = $dom->importNode($node, true);
                $dom->appendChild($n);
                $sxe = simplexml_import_dom($n);

                $Data = array();
                $DataColumns = array();

                foreach ($columns as $key => $column)
                {
                    if (in_array($key, $DateColumns))
                    {
                        // Convert d/m/Y into MySQL DATETIME format.
                        $DateArray = explode('/', substr(trim($sxe->$column), 0, 10));
                        $ValueColumn = date('Y-m-d H:i:s',
                            mktime(0, 0, 0, $DateArray[1], $DateArray[0], $DateArray[2]));
                        $Data[] = '\''.$ValueColumn.'\'';
                        $DataColumns[] = $key;

                        if ($SplitDateInsert == 'enabled')
                        {
                            $Data[] = '\''.$DateArray[2].'\'';
                            $Data[] = '\''.$DateArray[1].'\'';
                            $Data[] = '\''.$DateArray[0].'\'';

                            $DataColumns[] = $key.'_year';
                            $DataColumns[] = $key.'_month';
                            $DataColumns[] = $key.'_day';
                        }
                    } else {
                        $ValueColumn = addslashes(trim($sxe->$column));
                        $Data[] = '\''.$ValueColumn.'\'';
                        $DataColumns[] = $key;
                    }
                }

                $SQL = "INSERT INTO {$tableName} (".implode(',', $DataColumns).") VALUES (".implode(',', $Data).")";
                $db->query($SQL);

            } // END IF Table
            break;
    }
}
$reader->close();


If I were to parse it using PHP, I would do it in two steps:

  1. Split the file every few <release> nodes and turn each mini-file into a valid XML document.
  2. Parse each resulting file separately.

If speed is not of the essence, then PHP would actually be better, as it is easy to parse text/XML in PHP.


Assuming that MySQL has capabilities similar to Oracle's in this regard, why not let the DB handle the parsing? In Oracle you could just register the XML Schema, create a structured XMLType table (probably more useful than a CLOB), and then insert the file.

I've never used it for anything quite that large, but I don't see why it shouldn't work, and it's done in a few lines of code. You'd just need someone with MySQL experience to explain the details. (For what it's worth, MySQL does ship a LOAD XML INFILE statement as of 5.5, but it maps one element to one row, so nested data like the tracklist would still need separate handling.)


I would second the suggestion to use StAX on Java. Or, even simpler, StaxMate, which adds a bit more convenience on top.

But what exactly do you need to do with the XML? If you just want to stream it into the DB, there are ways to handle BLOBs as streams (although MySQL JDBC drivers are notorious for quirks), so that should be doable.


4.7 GB is not insane, merely mild. If you have 64-bit Java, extended VTD-XML should be the most efficient and easiest-to-use option available.


You haven't said what processing you want to do with the XML. You could consider using Saxon's streaming-mode XSLT (it needs the Saxon-EE product, which costs money). If the processing is essentially a "burst-mode streaming" exercise that handles each release element in turn, it shouldn't be too difficult. Of course you can also use low-level SAX or StAX parsing, but that will almost certainly require developing and debugging rather more code.
