Trying to parse XML in Perl, but long data string gets cut off

2023-03-10 11:07

I've tried to parse an XML file with both XML::Simple and XML::Twig, with the same result. The other fields in the file work just fine.

The file in question can be retrieved here:

curl -s "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130"

Is this a problem with the parser or the file? The output was the same with both parsers. The HTML tags in the string are stored escaped in the XML.

Input field (inside xml-tags named 'summary'):

<summary type="html">&lt;p&gt;Toxoplasmosis is a disease caused by the parasite &lt;em&gt;Toxoplasma gondii&lt;/em&gt;. More than 60 million people in the U.S. have the parasite.  Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak immune systems and babies whose mothers become infected for the first time during pregnancy. Problems can include damage to the brain, eyes and other organs.&lt;/p&gt;&#xd;^I&#xd;&lt;p&gt;You can get toxoplasmosis from &lt;/p&gt;&#xd;&lt;ul&gt;&#xd;&lt;li&gt;^IWaste from an infected cat&lt;/li&gt;&#xd;&lt;li&gt;^IEating contaminated meat that is raw or not well cooked &lt;/li&gt;&#xd;&lt;li&gt;^IUsing utensils or cutting boards after they've had contact with raw meat &lt;/li&gt;&#xd;&lt;li&gt;^IDrinking infected water &lt;/li&gt;&#xd;&lt;li&gt;^IReceiving an infected organ transplant or blood transfusion&lt;/li&gt;&#xd;&lt;/ul&gt;&#xd;&lt;p&gt;Most people with toxoplasmosis don't need treatment. There are drugs to treat it for pregnant women and people with weak immune systems. &lt;/p&gt;&#xd;&#xd;&lt;p class="NLMattribution"&gt;Centers for Disease Control and Prevention&lt;/p&gt;</summary>

Output after XML-parsing:

<p>Toxoplasmosis is a disease caused by the parasite <em>Toxoplasma gondii</em>. More than 60 million people in the U.S. have the parasite.  Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak im<p class="NLMattribution">Centers for Disease Control and Prevention</p>to treat it for pregnant women and people with weak immune systems. </p>her organs.</p>

Solution to the problem: The XML files contain carriage returns ("&#xd;"), which cause problems for the parsers. After I downloaded the XML files, I removed the carriage returns with the following line:

sed -i 's/&#xd;//g' *.xml

The parsers now work as expected.

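Update: The carriage return does not affect the parsers, only how the output is displayed: the text appears truncated and mixed up, presumably because the terminal moves the cursor back to the start of the line at each carriage return, so later text overwrites earlier text. Removing the carriage returns did, however, solve my problem.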
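
As an alternative to editing the downloaded files with sed, the carriage returns could also be removed from the text after parsing, just before printing. A minimal sketch, assuming the parsed summary is already in a scalar; strip_cr is a hypothetical helper name (not part of XML::Simple or XML::Twig), and the sample string is made up for illustration:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: returns a copy of the text with every carriage
# return (\r, the character that &#xd; decodes to) deleted.
sub strip_cr {
    my ($text) = @_;
    $text =~ tr/\r//d;
    return $text;
}

# Made-up sample standing in for the parsed 'summary' text.
my $summary = "<p>first paragraph</p>\r\t\r<p>second paragraph</p>";

print strip_cr($summary), "\n";
```

With XML::Twig this would presumably be called on the element's text, e.g. strip_cr( $summary->text ), leaving the XML files themselves untouched.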

I do get some weird results when parsing the curl output through a pipe (using XML::Twig->new->parse( curl -s "http://..." | )): the content appears truncated, and it changes from call to call...

Things look better if I parse a file created from the curl result, or use XML::Twig's native parseurl method. Then the result is consistent and is what you want:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $twig    = XML::Twig->new->parseurl( "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130" );
my $summary = $twig->first_elt( 'summary');

print $summary->text, "\n";

Honestly I have no idea why this happens. I'll try looking into it a little more, but I suspect there is nothing I can do: if the problem shows up in both XML::Simple and XML::Twig, then it's probably at a lower level of the stack, XML::Parser or expat and their interaction with curl.
