Where can I find a large tabbed hierarchical data set for parser testing?

First, apologies, as I realize this is only tangentially related to parser programming.

I've spent hours looking for a text file containing something like the following, but with hundreds (hopefully thousands) of sub-entries. A complete biological classification file would be perfect. A massive version of the following would be great, as my parser parses simple tabbed files:

TL;DR: I need a massive single-file hierarchical data set, something like the following:

Kingdoms
    Monera
    Protista
    Fungi
    Plants
    Animals
        Porifera
            Sponges
        Coelenterates
            Hydra
            Coral
            Jellyfish
        Platyhelminthes
            Flatworms
            Flukes
        Nematodes
            Roundworms
            Tapeworms
        Chordates
            Urochordates
            Cephalochordates
            Vertebrates
                Fish
                Amphibians
                Reptiles
                Birds
                Mammals

The best I've been able to find are tree-of-life images (from which I transcribed the sample data set above). A single file with a TON of real data would be awesome. It doesn't have to be a biological classification data set, but I would really like the data to reflect something in the real world. (My parser feeds a menu, so it would be great if the remainder of my testing used a data set that actually meant something!) Even a file that is not tabbed would be great, as long as the data could fairly easily be regex'ed into a tabbed format.

Any ideas? Thanks!
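For anyone testing against a file of this shape, here is a minimal sketch of reading a tab-indented file into a tree. It is not the asker's parser: the `TreeNode` class and `parseTabbedTree()` function are hypothetical names made up for illustration, and the sketch assumes one leading tab per level (the sample above may display as spaces).

```php
<?php
// Hypothetical node type for this sketch: a name plus a list of child nodes.
final class TreeNode
{
    public $name;
    public $children = [];

    public function __construct(string $name)
    {
        $this->name = $name;
    }
}

// Hypothetical helper: build a tree from text with one leading tab per level.
function parseTabbedTree(string $text): TreeNode
{
    $root = new TreeNode('(root)');
    // $stack[$d] is the node that a line with $d leading tabs attaches to.
    $stack = [$root];

    foreach (preg_split('/\R/', $text) as $line) {
        if (trim($line) === '') {
            continue; // skip blank lines
        }
        // Count leading tabs; clamp so a jump of several levels still attaches somewhere.
        $depth = min(strspn($line, "\t"), count($stack) - 1);
        $node = new TreeNode(trim($line));
        $stack[$depth]->children[] = $node;

        // Forget deeper levels, then make this node the parent for the next level down.
        $stack = array_slice($stack, 0, $depth + 1);
        $stack[] = $node;
    }
    return $root;
}
```

Calling `parseTabbedTree(file_get_contents('kingdoms.txt'))` (the file name is an assumption) returns a synthetic root whose `children` array holds the top-level entries, which makes it easy to count nodes or check nesting depth on a large data set.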


It is possible that the XML layout has changed since the earlier answer was posted, but the code submitted there is no longer accurate. The resulting dump contains extraneous entries: some of the nodes have aliases (denoted as 'othername' in the XML) that are reported as distinct nodes themselves.

I used the script below to generate the correct dump.

<?php
$reader = new XMLReader();
// node_id=1 is the root node (15963, used in the earlier script, is the primates index)
$reader->open('http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=1');
$set = -1; // 1 while inside an OTHERNAMES block (aliases), -1 otherwise
while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            if ($reader->name == "OTHERNAMES") {
                $set = 1; // names that follow are aliases, so skip them
            }
            if ($reader->name == "NODES" || $reader->name == "NODE") {
                $set = -1; // back in a real node, so print names again
            }
            if ($reader->name == "NAME" && $set == -1) {
                echo str_repeat("\t", $reader->depth - 2); // repeat tabs for depth
                $node = $reader->expand();
                echo $node->textContent . "\n";
            }
            break;
    }
}
?>


This turned out to be such a pain in the ass. I finally tracked down a data feed from "The Tree of Life Web Project" at tolweb.org, and wrote the PHP script below to provide the basic functionality my post was looking for.

Change the node_id to have it print a tabbed representation of any of tolweb.org's data - just take the id from the page you're browsing on their site and change the node_id below.

Be aware though - their data feeds serve up large files, so definitely download the file to your own server (and change the "open" method below to point to the local file) if you're going to hit it more than once or twice.

More info on tolweb.org data feeds can be found here: http://tolweb.org/tree/home.pages/downloadtree.html

<?php
$reader = new XMLReader();
// 15963 is the primates index
$reader->open('http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=15963');
while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            // Print every NAME element, indented by its depth in the XML
            if ($reader->name == "NAME") {
                echo str_repeat("\t", $reader->depth - 2); // repeat tabs for depth
                $node = $reader->expand();
                echo $node->textContent . "\n";
            }
            break;
    }
}
?>
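
The advice about downloading the feed once and reading it locally could be sketched roughly as follows. The `cachedFeed()` helper and the cache file name are hypothetical, not part of tolweb.org's service, and the sketch assumes `allow_url_fopen` is enabled so that `copy()` can read from a URL.

```php
<?php
// Hypothetical helper: return a local path for the feed, downloading it only
// when no non-empty cached copy exists yet.
function cachedFeed(string $url, string $cacheFile): string
{
    clearstatcache(true, $cacheFile); // make sure the file checks below are fresh
    if (!is_file($cacheFile) || filesize($cacheFile) === 0) {
        // copy() accepts a URL as its source when allow_url_fopen is enabled.
        if (!copy($url, $cacheFile)) {
            throw new RuntimeException("Could not download $url");
        }
    }
    return $cacheFile;
}
```

With this in place, the `open` call in the script becomes something like `$reader->open(cachedFeed($url, __DIR__ . '/tolweb-15963.xml'));`, where `$url` holds the feed URL. Only the first run hits tolweb.org; delete the cache file to force a fresh download.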