Pulling out a full node with child nodes using XPath

2022-12-08 17:07

I'm using XPath to select a section from an HTML page. However, when I use XPath to extract the node, it selects the right node but returns only the text surrounding the inner HTML tags, not the tags themselves.

Sample HTML

<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>

I have the following XPath

/body/div

I get the following

At first glance you may ask, “what do you mean?” It means that we want to help figure...

I want

At first glance you may ask, “what exactly do you mean?” It means that we want to help you figure...

If you look at the sample HTML, there are <i> and <b> HTML tags in the content. The words within those tags are "lost" when I extract the content.

I'm using SimpleXML in PHP if that makes a difference.

Your XPath is fine, though you can remove the final /. as that's redundant:

/atom/content

All of the HTML is inside a <![CDATA[ ]]> section, so in the XML DOM you actually only have text there. The <i> and <b> tags will not be parsed as tags but will just show up as text. Using a CDATA section is exactly the same as if your XML were written like this:

<atom>
    <content>
      At first glance you may ask, &amp;#8220;what &lt;i&gt;exactly&lt;/i&gt;
      do you mean?&amp;#8221; It means that we want to help &lt;b&gt;you&lt;/b&gt; figure...
    </content>
</atom>

So, it's whatever you're doing with the <content> element afterwards that's dropping those tags. Are you later parsing the text as HTML, or running it through a filter, or something like that?
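The CDATA point above can be checked with a short script. This is only a sketch: the two XML strings are shortened, made-up stand-ins for the feed, and the variable names are illustrative. It loads the same content once as CDATA and once as escaped markup, then compares what SimpleXML sees.

```php
<?php
// Two spellings of the same content: a CDATA section vs. escaped markup.
// Both strings are illustrative stand-ins, not the original feed.
$withCdata = '<atom><content><![CDATA[what <i>exactly</i> do you mean?]]></content></atom>';
$escaped   = '<atom><content>what &lt;i&gt;exactly&lt;/i&gt; do you mean?</content></atom>';

$a = simplexml_load_string($withCdata);
$b = simplexml_load_string($escaped);

// Select the element with the XPath from the answer and read its text.
$contentA = $a->xpath('/atom/content')[0];
$contentB = $b->xpath('/atom/content')[0];

$textA = (string) $contentA;
$textB = (string) $contentB;

// Both forms yield the same plain string, with the tags present as text.
var_dump($textA === $textB);   // bool(true)
echo $textA, PHP_EOL;

// Neither <content> element has child elements: <i> was never parsed as a tag.
var_dump($contentA->count());  // int(0)
var_dump($contentB->count());  // int(0)
```

If the tags are still present in the string at this point, the loss is happening in a later step, not in the XPath selection.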

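SimpleXML handles mixed content (text nodes interleaved with child elements) poorly: casting an element to a string returns only the element's own text and skips the text inside its child elements. You'll have to use a custom solution instead.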

You can use asXML() on each div element then remove the div tags, or you can convert the div elements to DOMNodes then loop over $div->childNodes and serialize each child. Note that your HTML entities will most likely be replaced by the actual characters if available.
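Both approaches could look roughly like the sketch below. The helper names innerXmlViaAsXml() and innerXmlViaDom() are made up for this illustration; asXML(), getName(), dom_import_simplexml() and DOMDocument::saveXML() are the standard PHP calls. How the curly-quote entities come out may differ between the two, so check that on your own data.

```php
<?php
// Hypothetical helper: serialize the element, then strip its own outer tags.
function innerXmlViaAsXml(SimpleXMLElement $el): string
{
    $name = preg_quote($el->getName(), '~');
    $pattern = '~^<' . $name . '\b[^>]*>|</' . $name . '>\s*$~';
    return preg_replace($pattern, '', trim($el->asXML()));
}

// Hypothetical helper: convert to a DOM node and serialize each child node.
function innerXmlViaDom(SimpleXMLElement $el): string
{
    $node = dom_import_simplexml($el);
    $out = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes a single node: text, <i>, <b>, ...
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$html =
'<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>';

$body = simplexml_load_string($html);

foreach ($body->xpath('/body/div') as $div) {
    echo trim(innerXmlViaAsXml($div)), PHP_EOL;
    echo trim(innerXmlViaDom($div)), PHP_EOL;
}
```

The asXML() variant relies on a regular expression to strip the wrapper, which is fine for a simple element like this one; the DOM variant avoids string surgery altogether.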

Alternatively, you can take a look at the SimpleDOM project and use its innerHTML() method.

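// simpledom_load_string() comes from the SimpleDOM library, not from PHP itself,
// so SimpleDOM must be loaded (e.g. via include) before this runs.
$html =
'<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>';

$body = simpledom_load_string($html);

// innerHTML() returns the markup inside each matched <div>, child tags included.
foreach ($body->xpath('/body/div') as $div)
{
    var_dump($div->innerHTML());
}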

I don't know if SimpleXML is different, but it seems to me you need to make sure you're selecting all node types and not just text. In standard XPath you would use /body/div/node()
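As far as I know, SimpleXML's xpath() only hands back SimpleXMLElement objects, so the node() expression is easier to try with PHP's DOM extension. A minimal sketch, assuming the sample HTML parses as well-formed XML (it does here) and using DOMDocument and DOMXPath in place of SimpleXML:

```php
<?php
$html =
'<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>';

$doc = new DOMDocument();
$doc->loadXML($html);

$xpath = new DOMXPath($doc);

// node() matches every child of the div: text nodes as well as <i> and <b>.
$nodes = $xpath->query('/body/div/node()');

$inner = '';
foreach ($nodes as $node) {
    // Serialize each matched node on its own and glue the pieces together.
    $inner .= $doc->saveXML($node);
}

echo trim($inner), PHP_EOL;
```

Here the query matches five nodes (three text nodes plus the i and b elements), and the concatenated result keeps both tags and the words inside them.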
