Where can I find a large tabbed hierarchical data set for parser testing?
First, apologies as I realize this is only tangentially related to parser programming.
I've spend hours looking for a text file containing something like the following but with hundreds (hopefully thousands) of sub-entries. A complete biological classification file would be perfect. A massive version of the following would be great as my parser parses simple tabbed files:
TL,DR - I need a massive single-file hierarchical data set something like the following:
Kindoms
Monera
Protista
Fungi
Plants
Animals
Porifera
开发者_运维百科 Sponges
Coelenterates
Hydra
Coral
Jellyfish
Platyhelminthes
Flatworms
Flukes
Nematodes
Roundworms
Tapeworms
Chordates
Urochordataes
Cephalochordates
Vertebrates
Fish
Amphibians
Reptiles
Birds
Mammals
The best I've been able to find are tree-of-life images (from which I transcribed the sample data set above). A single file with a TON of real data would be awesome. It doesn't have to be a biological classification data set, but I would really like the data to reflect something in the real-world. (My parser feeds a menu - would be great if the remainder of my testing was with a data set that actually meant something!) Even if the file is not tabbed but the data was fairly easily regex'ed to a tabbed format... that would be great.
Any ideas? Thanks!
It is possible that the xml layout was changed since the last answer but the code submitted above is no longer accurate. The resulting dump is extraneous. Some of the nodes have aliases (denoted as 'othername') that are reported as distinct nodes themselves.
I used the script below to generate the correct dump.
<?php
$reader = new XMLReader();
$reader->open('http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=1'); //15963 is the primates index
$set=-1;
while ($reader->read()) {
switch ($reader->nodeType) {
case (XMLREADER::ELEMENT):
if ($reader->name == "OTHERNAMES"){
$set=1;
}
if ($reader->name == "NODES"){
$set=-1;
}
if ($reader->name == "NODE"){
$set=-1;
}
if ($reader->name == "NAME" AND $set == -1){
echo str_repeat("\t", $reader->depth - 2); //repeat tabs for depth
$node = $reader->expand();
echo $node->textContent . "\n";
}
break;
}
}
?>
This turned out to be such a pain in the ass. I finally tracked down a data feed from "The Tree of Life Web Project" at tolweb.org. I made the php script below to provide the basic functionality my post was looking for.
Change the node_id to have it print a tabbed representation of any of tolweb.org's data - just take the id from the page you're browsing on their site and change the node_id below.
Be aware though - their data feeds serve up large files, so definitely download the file to your own server (and change the "open" method below to point to the local file) if you're going to hit it more than once or twice.
More info on tolweb.org data feeds can be found here: http://tolweb.org/tree/home.pages/downloadtree.html
<?php
$reader = new XMLReader();
$reader->open('http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=15963'); //15963 is the primates index
while ($reader->read()) {
switch ($reader->nodeType) {
case (XMLREADER::ELEMENT):
if ($reader->name == "NAME"){
echo str_repeat("\t", $reader->depth - 2); //repeat tabs for depth
$node = $reader->expand();
echo $node->textContent . "\n";
}
break;
}
}
?>
精彩评论