Read out node content with XPath from several XML files

2023-04-06 17:55

I have a question about reading out node content with XPath from several XML files. I am fully aware that there are masses of resources on the internet about this but, believe me, it is really driving me crazy.

I want to read out information from files in the Reuters RCV1 experimental corpus. All the files in this corpus share the same structure. I post it here as an example:

<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="1000000" id="root" date="xxx" xml:lang="en">
<title>title title title</title>
<headline>headline headline headline</headline>
<byline>Jack Daniels</byline>
<dateline>Blabla</dateline>
<text>
<p> Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 </p>
<p> Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 </p>
<p> Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 </p>
<p> Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 </p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
  <code code="MEX">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-02-20"/>
  </code>
</codes>
<codes class="bip:topics:1.0">
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-08-20"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
<dc element="dc.creator.location" value="xxx"/>
<dc element="dc.creator.location.country.name" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
</metadata>
</newsitem>

The final goal of my task is to convert these several thousand files into CSV. I am doing this with the software RapidMiner by addressing the different node contents via their XPath addresses.

This works without problems for every field except one: the content of <text></text>.

With //newsitem/text/p/node() it only ever returns the first paragraph. What I am looking for, however, is to extract all the plain text from all paragraphs. This means the CSV files should look approximately like this:

title, headline, date, text, location
titleblabla, headlineblabla, xxx, paragraph 1 paragraph 2 paragraph 3, anywhere
othertitleblabla, otherheadlineblabla, otherdatexxx, other paragraph 1 paragraph 2 paragraph 3, nowhere

Please, could somebody be so kind as to help me achieve this with XPath? I have also tried the whole thing with string matching, but this takes ages and I additionally have to strip the XML tags.

Thank you very much,

alexandre (a desperate xpath/xml newbie)

It seems from your description that RapidMiner retrieves the string value of the node-set selected by a given XPath expression. By definition (in XPath 1.0), the string value of a node-set is the string value of its first node in document order, which matches your description of the problem exactly.
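The behaviour described above can be reproduced outside RapidMiner. The following is only a sketch in Python with the standard-library xml.etree.ElementTree module, not RapidMiner's actual implementation. SAMPLE is a trimmed stand-in for one corpus file, and node_set_string_value is a hypothetical helper that mimics the "first node only" rule. It selects the <p> elements rather than their child nodes, which gives the same string value for this sample.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for one corpus file (assumption: same shape as the
# example in the question, with most elements left out).
SAMPLE = """<newsitem itemid="1000000" id="root">
<title>title title title</title>
<text>
<p> Paragraph 1 </p>
<p> Paragraph 2 </p>
<p> Paragraph 3 </p>
</text>
</newsitem>"""

root = ET.fromstring(SAMPLE)

# The node-set: every <p> below <text> (the path is relative to <newsitem>).
paragraphs = root.findall("./text/p")


def node_set_string_value(nodes):
    """Hypothetical helper: string value of a node-set, XPath 1.0 style.

    Only the first node contributes; the remaining nodes are ignored.
    """
    return "".join(nodes[0].itertext()) if nodes else ""


first_only = node_set_string_value(paragraphs)
print(len(paragraphs))      # number of <p> elements that were selected
print(repr(first_only))     # text of the first paragraph only
```

All three paragraphs are selected, but converting the node-set to a string keeps only the first one.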

Solution:

Instead of:

//newsitem/text/p/node()

use:

/newsitem/text

The string value of the only element selected in the provided document by the expression above is, by definition, the concatenation of all of its descendant text nodes, which is exactly what you want.
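For anyone doing the same conversion outside RapidMiner, here is a minimal sketch of the idea in Python using only the standard library. It is an illustration under assumptions, not a tested pipeline for the full corpus: SAMPLE is a shortened stand-in for one file, "anywhere" is a placeholder location value, and string_value and newsitem_to_row are hypothetical helper names. ElementTree has no XPath string() function, so "".join(element.itertext()) is used to get the same concatenation of descendant text nodes.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for one corpus file (assumption: real files carry
# more elements, and "anywhere" is a placeholder value).
SAMPLE = """<newsitem itemid="1000000" id="root" date="xxx">
<title>title title title</title>
<headline>headline headline headline</headline>
<text>
<p> Paragraph 1 </p>
<p> Paragraph 2 </p>
</text>
<metadata>
<dc element="dc.creator.location" value="anywhere"/>
</metadata>
</newsitem>"""


def string_value(element):
    """Concatenate all descendant text nodes, like the XPath string value."""
    return "".join(element.itertext())


def newsitem_to_row(xml_text):
    """Turn one <newsitem> document into a list of CSV fields."""
    root = ET.fromstring(xml_text)
    text = string_value(root.find("./text"))
    location = root.find("./metadata/dc[@element='dc.creator.location']")
    return [
        root.findtext("./title"),
        root.findtext("./headline"),
        root.get("date"),
        " ".join(text.split()),  # collapse newlines and runs of spaces
        location.get("value") if location is not None else "",
    ]


row = newsitem_to_row(SAMPLE)

# Write the header and the single row to an in-memory CSV for inspection.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "headline", "date", "text", "location"])
writer.writerow(row)
print(buffer.getvalue())
```

To process the whole corpus, the same function could be called once per file and each row written to a real output file instead of the in-memory buffer.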
