Trying to parse XML in Perl, but long data string gets cut off
I've tried to parse an XML file with XML::Simple and XML::Twig, with the same result. The other fields in the file work just fine.
The file in question can be retrieved here:
curl -s "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130"
Is this a problem with the parser or the file? The output was the same with both parsers. The HTML tags in the string are stored in the XML.
Input field (inside xml-tags named 'summary'):
<summary type="html"><p>Toxoplasmosis is a disease caused by the parasite <em>Toxoplasma gondii</em>. More than 60 million people in the U.S. have the parasite. Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak immune systems and babies whose mothers become infected for the first time during pregnancy. Problems can include damage to the brain, eyes and other organs.</p>
^I
<p>You can get toxoplasmosis from </p>
<ul>
<li>^IWaste from an infected cat</li>
<li>^IEating contaminated meat that is raw or not well cooked </li>
<li>^IUsing utensils or cutting boards after they've had contact with raw meat </li>
<li>^IDrinking infected water </li>
<li>^IReceiving an infected organ transplant or blood transfusion</li>
</ul>
<p>Most people with toxoplasmosis don't need treatment. There are drugs to treat it for pregnant women and people with weak immune systems. </p>

<p class="NLMattribution">Centers for Disease Control and Prevention</p></summary>
Output after XML-parsing:
<p>Toxoplasmosis is a disease caused by the parasite <em>Toxoplasma gondii</em>. More than 60 million people in the U.S. have the parasite. Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak im<p class="NLMattribution">Centers for Disease Control and Prevention</p>to treat it for pregnant women and people with weak immune systems. </p>her organs.</p>
Solution to the problem: The XML files contain carriage returns ("\r"), which cause problems for the parsers. After downloading the XML files, I removed the carriage returns with the following line:
sed -i 's/\r//g' *.xml
The parsers now work as expected.
Update: The carriage return does not affect the parser, only the output which appears truncated and mixed up. Removing it did however solve my problem.
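If the files cannot be edited beforehand, the same cleanup can be done in Perl on the text after parsing. This is only a minimal sketch: the helper name strip_cr and the sample string are assumptions for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: delete every carriage return from a string.
sub strip_cr {
    my ($text) = @_;
    $text =~ tr/\r//d;    # tr with /d removes the listed characters
    return $text;
}

# Sample text standing in for the parsed summary. On a terminal, each "\r"
# moves the cursor back to the start of the line, so later text overwrites
# earlier text and the output looks truncated and mixed up.
my $raw   = "first paragraph\r\nsecond paragraph\r\n";
my $clean = strip_cr($raw);

print $clean;
```

Applying this to the value returned by the parser, just before printing, should have the same visible effect as the sed command, without touching the downloaded files.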
I do get some weird results when parsing the curl output through a pipe (using XML::Twig->new->parse on curl -s "http://..." |): the content appears truncated, and it changes from call to call.
Things look better if I parse a file created from the curl result, or use XML::Twig's native parseurl method. Then the result is consistent, and it is what you want:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new->parseurl( "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130" );
my $summary = $twig->first_elt( 'summary');
print $summary->text, "\n";
Honestly I have no idea why this happens. I'll try looking into it a little more, but I suspect there is nothing I can do: if the problem shows up in both XML::Simple and XML::Twig, then it's probably at a lower level of the stack, XML::Parser or expat and their interaction with curl.
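For the other option mentioned above, parsing a file created from the curl result, a minimal sketch could look like the following. The helper name summary_from_file and the file name passed on the command line are assumptions (for example, a file saved with curl -s "http://..." > result.xml); the carriage-return stripping is borrowed from the question's own fix and is optional.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse a saved XML file and return the text of its
# first <summary> element, with carriage returns removed.
sub summary_from_file {
    my ($path) = @_;
    my $twig    = XML::Twig->new->parsefile($path);
    my $summary = $twig->first_elt('summary');
    return unless defined $summary;    # no <summary> element in the file
    my $text = $summary->text;
    $text =~ tr/\r//d;                 # drop carriage returns before printing
    return $text;
}

# Usage: perl summary.pl result.xml   (both names are assumptions)
my $file = shift @ARGV;
if (defined $file) {
    my $text = summary_from_file($file);
    print defined $text ? "$text\n" : "no <summary> element found\n";
}
```

Reading from a file keeps the network fetch and the parsing separate, which makes it easier to check whether the odd results come from the pipe or from the data itself.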