Where can I find a large tabbed hierarchical data set for parser testing?

2023-04-10 18:36 Author:

First, apologies as I realize this is only tangentially related to parser programming.

I've spent hours looking for a text file containing something like the following, but with hundreds (hopefully thousands) of sub-entries. A complete biological classification file would be perfect, since my parser parses simple tabbed files.

TL;DR - I need a massive single-file hierarchical data set, something like the following:

Kingdoms
    Monera
    Protista
    Fungi
    Plants
    Animals
        Porifera
            Sponges
        Coelenterates
            Hydra
            Coral
            Jellyfish
        Platyhelminthes
            Flatworms
            Flukes
        Nematodes
            Roundworms
            Tapeworms
        Chordates
            Urochordates
            Cephalochordates
            Vertebrates
                Fish
                Amphibians
                Reptiles
                Birds
                Mammals

The best I've been able to find are tree-of-life images (from which I transcribed the sample data set above). A single file with a TON of real data would be awesome. It doesn't have to be a biological classification data set, but I would really like the data to reflect something in the real world. (My parser feeds a menu, so it would be great if the remainder of my testing used a data set that actually meant something!) Even a file that is not tabbed would be great, as long as the data can be fairly easily regex'ed into a tabbed format.

Any ideas? Thanks!

It is possible that the XML layout has changed since the earlier answer was posted, but the code in that answer is no longer accurate. The resulting dump contains extraneous entries: some of the nodes have aliases (denoted as 'othername'), and these are reported as distinct nodes themselves.

I used the script below to generate the correct dump. It skips every NAME element that appears inside an OTHERNAMES block.

<?php
$reader = new XMLReader();
// node_id=1 is used here; the earlier answer used 15963 (the primates index)
$reader->open('http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=1');
$set = -1; // 1 while inside an OTHERNAMES (alias) block, -1 otherwise
while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            if ($reader->name == "OTHERNAMES") {
                $set = 1;  // aliases follow; suppress their NAME elements
            }
            if ($reader->name == "NODES" || $reader->name == "NODE") {
                $set = -1; // back among real nodes
            }
            if ($reader->name == "NAME" && $set == -1) {
                echo str_repeat("\t", $reader->depth - 2);  // repeat tabs for depth
                $node = $reader->expand();
                echo $node->textContent . "\n";
            }
            break;
    }
}
?>
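
To check the alias filtering without hitting the live feed, the same idea can be run against a small inline document. This is only a sketch: the sample XML below is a hypothetical, simplified guess at the feed layout (TREE, NODE, NAME, OTHERNAMES, NODES), and `tabbedNames` is a name made up for this example, so the real feed may need adjustments.

```php
<?php
// Hypothetical sample; the real tolweb.org feed may be laid out differently.
$sample = <<<XML
<TREE>
  <NODE>
    <NAME>Primates</NAME>
    <OTHERNAMES>
      <OTHERNAME><NAME>Monkeys and apes</NAME></OTHERNAME>
    </OTHERNAMES>
    <NODES>
      <NODE><NAME>Strepsirrhini</NAME></NODE>
      <NODE><NAME>Haplorrhini</NAME></NODE>
    </NODES>
  </NODE>
</TREE>
XML;

// Hypothetical helper: returns the tabbed dump as a string instead of echoing it.
function tabbedNames(string $xml): string
{
    $reader = new XMLReader();
    $reader->XML($xml);          // parse from a string rather than a URL
    $out = '';
    $inOtherNames = false;       // true while inside an OTHERNAMES block
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::END_ELEMENT) {
            if ($reader->name === 'OTHERNAMES') {
                $inOtherNames = false;   // alias block finished
            }
            continue;
        }
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;                    // ignore text, whitespace, comments
        }
        if ($reader->name === 'OTHERNAMES' && !$reader->isEmptyElement) {
            $inOtherNames = true;        // aliases follow; skip their NAMEs
        }
        if ($reader->name === 'NAME' && !$inOtherNames) {
            // max() guards against a negative repeat count on shallow documents
            $out .= str_repeat("\t", max(0, $reader->depth - 2));
            $out .= $reader->expand()->textContent . "\n";
        }
    }
    $reader->close();
    return $out;
}

echo tabbedNames($sample);
```

In this sample each level of nesting adds two to the reader depth (a NODES wrapper plus a NODE), so child names come out indented by two tabs. If the parser under test expects one tab per level, halving the repeat count is one possible adjustment.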

This turned out to be such a pain in the ass. I finally tracked down a data feed from "The Tree of Life Web Project" at tolweb.org. I wrote the PHP script below to provide the basic functionality my post was looking for.

Change the node_id to have it print a tabbed representation of any of tolweb.org's data - just take the id from the page you're browsing on their site and change the node_id below.

Be aware though - their data feeds serve up large files, so definitely download the file to your own server (and change the "open" method below to point to the local file) if you're going to hit it more than once or twice.

More info on tolweb.org data feeds can be found here: http://tolweb.org/tree/home.pages/downloadtree.html

<?php
$reader = new XMLReader();
$reader->open('http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=15963'); //15963 is the primates index
while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            if ($reader->name == "NAME"){
                echo str_repeat("\t", $reader->depth - 2);  //repeat tabs for depth
                $node = $reader->expand();
                echo $node->textContent . "\n";
            }
            break;
    }
}
?>
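
The advice above about downloading the file once and then reading the local copy could be sketched roughly like this. `cachedFeed` and the cache file name are hypothetical, and fetching a URL with `copy()` assumes `allow_url_fopen` is enabled on your server.

```php
<?php
// Hypothetical helper: copy the feed to a local file once, then reuse that copy.
function cachedFeed(string $source, string $cacheFile): string
{
    if (!is_file($cacheFile)) {
        // copy() accepts URLs as well as local paths when allow_url_fopen is on
        if (!copy($source, $cacheFile)) {
            throw new RuntimeException("Could not fetch $source");
        }
    }
    return $cacheFile;
}

// Usage (commented out so the sketch runs without network access):
// $url = 'http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=15963';
// $reader = new XMLReader();
// $reader->open(cachedFeed($url, __DIR__ . '/primates.xml'));
```

Delete the cached file whenever you want a fresh copy of the data.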
