
Read out node content with XPath from several XML files

I have a question about reading out node content with XPath from several XML files. I am fully aware that there are masses of resources on the internet about this, but believe me, it is really driving me crazy.

I want to read out information from files in the Reuters RCV1 experimental corpus. All the files in this corpus share the same structure. I post the structure here as an example:

<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="1000000" id="root" date="xxx" xml:lang="en">
<title>title title title</title>
<headline>headline headline headline</headline>
<byline>Jack Daniels</byline>
<dateline>Blabla</dateline>
<text>
<p> Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 </p>
<p> Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 </p>
<p> Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 </p>
<p> Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 </p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
  <code code="MEX">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-02-20"/>
  </code>
</codes>
<codes class="bip:topics:1.0">
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-08-20"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
<dc element="dc.creator.location" value="xxx"/>
<dc element="dc.creator.location.country.name" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
</metadata>
</newsitem>

The final goal of my task is to convert these several thousand files into CSV. I am doing this with the software RapidMiner by addressing the different node contents via their XPath address.

This is absolutely no problem for all points but one, the content of <text></text>.

With //newsitem/text/p/node() it always returns only the first paragraph. What I am looking for, however, is to extract the plain text from all paragraphs. This means the CSV files should look approximately like this:

title, headline, date, text, location
titleblabla, headlineblabla, xxx, paragraph 1 paragraph 2 paragraph 3, anywhere
othertitleblabla, otherheadlineblabla, otherdatexxx, other paragraph 1 paragraph 2 paragraph 3, nowhere 

Could somebody please help me achieve this with XPath? I have also tried the whole thing with string matching, but this takes ages and I additionally have to get rid of the XML tags.

Thank you very much,

alexandre (a desperate xpath/xml newbie)


It seems from your description that RapidMiner retrieves the string value of the node(-set) selected by a given XPath expression. By definition, the string value of a node-set is the string value of the first node in this node-set -- this matches exactly your description of the problem.

Solution:

Instead of:

//newsitem/text/p/node() 

use:

/newsitem/text

The expression above selects exactly one element in the provided document. By definition, the string value of an element is the concatenation of all of its text-node descendants, which is exactly what you want.
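For anyone who wants to check the result outside RapidMiner, here is a minimal sketch in Python using only the standard library. It is not how RapidMiner works internally; it simply mimics the string value of /newsitem/text by joining all descendant text nodes, then collapses whitespace (roughly what XPath's normalize-space() would do). The names SAMPLE, newsitem_to_row and rows_to_csv are made up for this illustration, and the trimmed sample document and the choice of dc.creator.location as the "location" column are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed-down sample with the same layout as the corpus file in the question
# (hypothetical values; the XML declaration is omitted because we parse a str).
SAMPLE = """<newsitem itemid="1000000" id="root" date="xxx" xml:lang="en">
<title>title title title</title>
<headline>headline headline headline</headline>
<text>
<p> Paragraph 1 </p>
<p> Paragraph 2 </p>
<p> Paragraph 3 </p>
</text>
<metadata>
<dc element="dc.creator.location" value="anywhere"/>
</metadata>
</newsitem>"""


def newsitem_to_row(xml_source):
    """Return [title, headline, date, text, location] for one newsitem document."""
    root = ET.fromstring(xml_source)

    # Equivalent of the string value of /newsitem/text:
    # concatenate every text node below <text>, in document order.
    text_elem = root.find("text")
    raw = "".join(text_elem.itertext()) if text_elem is not None else ""
    # Collapse runs of whitespace and newlines into single spaces.
    body = " ".join(raw.split())

    # Assumption: the "location" column comes from the dc.creator.location entry.
    loc = root.find("./metadata/dc[@element='dc.creator.location']")

    return [
        root.findtext("title", default=""),
        root.findtext("headline", default=""),
        root.get("date", ""),
        body,
        loc.get("value", "") if loc is not None else "",
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text with the header from the question."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["title", "headline", "date", "text", "location"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv([newsitem_to_row(SAMPLE)]))
```

For the real corpus files, ET.parse(path).getroot() would be the natural replacement for ET.fromstring, since it reads the iso-8859-1 encoding from the XML declaration itself.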
