Read out node content with XPath from several XML files
I have a question about reading out node content with XPath from several XML files. I am fully aware that there are masses of resources on the internet about this, but believe me, it is really driving me crazy.
I want to read out information from files in the Reuters RCV1 experimental corpus. All the files in this corpus share the same structure. I post it here as an example:
<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="1000000" id="root" date="xxx" xml:lang="en">
<title>title title title</title>
<headline>headline headline headline</headline>
<byline>Jack Daniels</byline>
<dateline>Blabla</dateline>
<text>
<p> Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 </p>
<p> Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 </p>
<p> Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 </p>
<p> Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 </p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
<code code="MEX">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-02-20"/>
</code>
</codes>
<codes class="bip:topics:1.0">
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-08-20"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
<code code="xxx">
<editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
</code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
<dc element="dc.creator.location" value="xxx"/>
<dc element="dc.creator.location.country.name" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
</metadata>
</newsitem>
The final goal of my task is to convert these several thousand files to CSV. I am doing this with the software RapidMiner by addressing the different node contents via their XPath address.
This is absolutely no problem for all fields but one: the content of <text></text>.
With //newsitem/text/p/node()
it always delivers only the first paragraph. What I am looking for, however, is a way to extract the plain text of all paragraphs. This means the CSV files should look approximately like this:
title, headline, date, text, location
titleblabla, headlineblabla, xxx, paragraph 1 paragraph 2 paragraph 3, anywhere
othertitleblabla, otherheadlineblabla, otherdatexxx, other paragraph 1 paragraph 2 paragraph 3, nowhere
Please, could somebody be so kind as to help me achieve this with XPath? I have also tried the whole thing with string matching, but this takes ages, and additionally I have to strip out the XML tags.
Thank you very much,
alexandre (a desperate xpath/xml newbie)
It seems from your description that RapidMiner retrieves the string value of the node(-set) selected by a given XPath expression. By definition, in XPath 1.0 the string value of a node-set is the string value of the first node in this node-set (in document order). This matches your description of the problem exactly.
Solution:
Instead of:
//newsitem/text/p/node()
use:
/newsitem/text
The string value of the only element selected in the provided document by the expression above is, by definition, the concatenation of all of its text-node descendants, which is exactly what you want.
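For anyone who wants to check this behaviour outside RapidMiner, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not what RapidMiner does internally: the helper names string_value, normalize_space and newsitem_to_row are made up for this example, the sample document is a shortened version of the one in the question, and the location value "anywhere" is a placeholder. ElementTree does not evaluate the XPath string() function, so the string value is emulated by joining all descendant text nodes with itertext(), and the whitespace cleanup mimics what XPath's normalize-space() would do.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the question's document (values are placeholders).
SAMPLE = """<newsitem itemid="1000000" id="root" date="xxx" xml:lang="en">
<title>title title title</title>
<headline>headline headline headline</headline>
<text>
<p> Paragraph 1 </p>
<p> Paragraph 2 </p>
<p> Paragraph 3 </p>
</text>
<metadata>
<dc element="dc.creator.location" value="anywhere"/>
</metadata>
</newsitem>"""


def string_value(element):
    """Emulate the XPath string value: all descendant text nodes, concatenated."""
    return "".join(element.itertext())


def normalize_space(value):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(value.split())


def newsitem_to_row(xml_source):
    """Build one CSV row (title, headline, date, text, location) from one news item."""
    root = ET.fromstring(xml_source)
    text = root.find("text")  # the single <text> element, i.e. /newsitem/text
    location = root.find("./metadata/dc[@element='dc.creator.location']")
    return [
        normalize_space(root.findtext("title", default="")),
        normalize_space(root.findtext("headline", default="")),
        root.get("date", ""),
        normalize_space(string_value(text)) if text is not None else "",
        location.get("value", "") if location is not None else "",
    ]


row = newsitem_to_row(SAMPLE)

# Let the csv module handle quoting, since paragraph text may contain commas.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue(), end="")
```

For the several thousand files in the corpus, the same function could be called once per file (reading each file's bytes and passing them to newsitem_to_row) and every row written with a single csv.writer.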