XPath Node to String
How can I select the string contents of the following nodes:
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
I have tried a few things
//span/text()
Doesn't get the bold tag
//span/string(.)
is invalid
string(//span)
only selects 1 node
I am using SimpleXML in PHP, and the only other option I can think of is to use //span, which returns:
Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test
)
[1] => SimpleXMLElement Object
(
[@attributes] => Array
(
[class] => url
)
[b] => test2
)
)
*Note that it also drops the "more words" text from the second span.
So I guess I could then flatten each item in the array using PHP somehow? XPath is preferred, but any other ideas would help too.
$xml = '<foo>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml); //or load an HTML document with loadHTML()
$x= new DOMXpath($dom);
foreach($x->query("//span[@class='url']") as $node) echo $node->textContent;
You don't even need XPath for this:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('span') as $span) {
if(in_array('url', explode(' ', $span->getAttribute('class')))) {
$span->nodeValue = $span->textContent;
}
}
echo $dom->saveHTML();
EDIT after comment below
If you just want to fetch the string, you can do echo $span->textContent; instead of replacing the nodeValue. I understood that you wanted one string for the span instead of the nested structure. In that case, you should also consider whether simply running strip_tags on the span snippet wouldn't be the faster and easier alternative.
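As a rough sketch of the strip_tags idea, assuming you already have the markup of a single span as a string (the $snippet variable below is hypothetical) and that collapsing whitespace is acceptable:

```php
<?php
// Hypothetical snippet: the markup of one span, as in the question.
$snippet = '<span class="url">
word
<b class=" ">test2</b>
more words
</span>';

// strip_tags() removes the tags but keeps all text, including line breaks.
$text = strip_tags($snippet);

// Collapse runs of whitespace to single spaces and trim the ends.
$text = trim(preg_replace('/\s+/', ' ', $text));

echo $text, PHP_EOL; // word test2 more words
```

Keep in mind that strip_tags works on the raw string rather than on a parsed document, so it is only a reasonable shortcut when the snippet is small and well formed.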
With PHP 5.3 you can also register arbitrary PHP functions for use as callbacks in XPath queries. The following fetches the text content of all span elements and their child nodes and returns it as a single string.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions();
echo $xp->evaluate('php:function("nodeTextJoin", //span)');
// Custom Callback function
function nodeTextJoin($nodes)
{
$text = '';
foreach($nodes as $node) {
$text .= $node->textContent;
}
return $text;
}
Using XMLReader:
$xmlr = new XMLReader;
$xmlr->xml($doc);
while ($xmlr->read()) {
if (($xmlr->nodeType == XMLReader::ELEMENT) && ($xmlr->name == 'span')) {
echo $xmlr->readString();
}
}
Output:
word
test
word
test2
more words
SimpleXML doesn't like mixing text nodes with other elements, which is why you're losing some content there. The DOM extension, however, handles that just fine. Luckily, DOM and SimpleXML are two sides of the same coin (libxml), so it's very easy to switch between them. For instance:
foreach ($yourSimpleXMLElement->xpath('//span') as $span)
{
// will not work as expected
echo $span;
// will work as expected
echo textContent($span);
}
function textContent(SimpleXMLElement $node)
{
return dom_import_simplexml($node)->textContent;
}
//span//text()
This may be the best you can do. You'll get multiple text nodes because the text is stored in separate nodes in the DOM. If you want a single string you'll have to just concatenate the text nodes yourself since I can't think of a way to get the built-in XPath functions to do it.
Using string() or concat() won't work because these functions expect string arguments. When you pass a node-set to a function expecting a string, the node-set is converted to a string by taking the text content of the first node in the node-set. The rest of the nodes are discarded.
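A minimal PHP sketch of both points, assuming DOMXPath and a compacted version of the question's markup (the variable names are only for illustration): string() yields the first span only, while looping over the spans and joining their descendant text nodes yields one string per span.

```php
<?php
// Compacted version of the question's markup (an assumption for brevity).
$xml = '<foo><span class="url">word <b class=" ">test</b></span>'
     . '<span class="url">word <b class=" ">test2</b> more words</span></foo>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// string() converts the node-set to the string value of its first node only.
$first = $xpath->evaluate('string(//span)'); // "word test"

// To cover every span, iterate and join the descendant text nodes yourself.
$texts = [];
foreach ($xpath->query('//span') as $span) {
    $joined = '';
    foreach ($xpath->query('.//text()', $span) as $textNode) {
        $joined .= $textNode->nodeValue;
    }
    $texts[] = $joined;
}

print_r($texts); // "word test" and "word test2 more words"
```

Grouping by span inside the loop avoids having to work out afterwards which text node belonged to which parent.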
How can I select the string contents of the following nodes:
First, I think your question is not clear.
You could select the descendant text nodes, as John Kugelman has answered, with
//span//text()
I recommend using an absolute path (one that doesn't start with //). But with this approach you would need to process the text nodes and work out which parent span each one is a child of. So it would be better to just select the span elements (for example, //span) and then process their string values.
With XPath 2.0 you could use:
string-join(//span, '.')
Result:
word test. word test2 more words
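PHP's DOMXPath is built on libxml2, which implements XPath 1.0, so string-join() is not available there. A rough equivalent, assuming it is acceptable to normalize whitespace per span and join in PHP, might look like this:

```php
<?php
$xml = '<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$values = [];
foreach ($xpath->query('//span') as $span) {
    // Evaluate normalize-space() with the current span as the context node.
    $values[] = $xpath->evaluate('normalize-space(.)', $span);
}

// implode() stands in for the XPath 2.0 string-join() call.
$joined = implode('.', $values);
echo $joined, PHP_EOL; // word test.word test2 more words
```

The separator and the whitespace normalization are choices made for this sketch; drop normalize-space() if the original line breaks matter.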
With XSLT 1.0, this input:
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
With this stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
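<xsl:output method="text"/>
<!-- Drop whitespace-only text nodes so that position() counts only the span elements -->
<xsl:strip-space elements="*"/>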
<xsl:template match="span[@class='url']">
<xsl:value-of select="concat(substring('.',1,position()-1),normalize-space(.))"/>
</xsl:template>
</xsl:stylesheet>
Output:
word test.word test2 more words
Along the lines of Alejandro's XSLT 1.0 "but any other ideas would help too" answer...
XML:
<?xml version="1.0" encoding="UTF-8"?>
<div>
<span class="url">
word
<b class=" ">test</b>
</span>
<span class="url">
word
<b class=" ">test2</b>
more words
</span>
</div>
XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="span">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
</xsl:stylesheet>
OUTPUT:
word test
word test2 more words