Pulling out a full node with child nodes using XPath

I'm using XPath to select a section from an HTML page. However, when I use XPath to extract the node, it returns only the text surrounding the HTML tags and not the HTML tags themselves.

Sample HTML

<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>

I have the following XPath

/body/div

I get the following

At first glance you may ask, &#8220;what do you mean?&#8221; It means that we want to help figure...

I want

At first glance you may ask, &#8220;what <i>exactly</i> do you mean?&#8221; It means that we want to help <b>you</b> figure...

Notice that the sample HTML contains <i> and <b> tags in the content. The words within those tags are "lost" when I extract the content.

I'm using SimpleXML in PHP if that makes a difference.


Your XPath is fine, though you can remove the final /. as that's redundant:

/atom/content

All of the HTML is inside a <![CDATA[ ]]> section, so in the XML DOM you actually only have text there. The <i> and <b> tags will not be parsed as tags but will just show up as text. Using a CDATA section is exactly the same as if your XML were written like this:

<atom>
    <content>
      At first glance you may ask, &amp;#8220;what &lt;i&gt;exactly&lt;/i&gt;
      do you mean?&amp;#8221; It means that we want to help &lt;b&gt;you&lt;/b&gt; figure...
    </content>
</atom>

So, it's whatever you're doing with the <content> element afterwards that's dropping those tags. Are you later parsing the text as HTML, or running it through a filter, or something like that?
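To make the CDATA point concrete, here is a minimal sketch. The XML string is a made-up, shortened stand-in for your feed (an assumption, since the real document isn't shown here); it only illustrates that SimpleXML sees the markup as plain text with no child elements.

```php
<?php
// Hypothetical, shortened document: the HTML sits inside a CDATA section.
$xml = '<atom><content><![CDATA[what <i>exactly</i> do you mean?]]></content></atom>';

$atom = simplexml_load_string($xml);
$content = $atom->xpath('/atom/content')[0];

// Casting to string yields the raw text, tags included as literal characters.
$text = (string) $content;

// The <i> inside CDATA is not an element, so there are no children to lose.
$childCount = count($content->children());

echo $text, "\n";
echo $childCount, "\n";
```

If the tags survive up to this point in your own code, whatever removes them happens after the XPath step.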


SimpleXML doesn't cope well with mixed content (text nodes interleaved with child elements), so you'll have to use a custom solution instead.

You can use asXML() on each div element then remove the div tags, or you can convert the div elements to DOMNodes then loop over $div->childNodes and serialize each child. Note that your HTML entities will most likely be replaced by the actual characters if available.
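Both options might look roughly like the sketch below. The helper names innerXmlViaAsXml() and innerXmlViaDom() are made up for illustration, and the sample string is a shortened, ASCII-only version of the question's HTML so entity handling doesn't get in the way. The regex version assumes no attribute value on the outer tag contains a ">" character.

```php
<?php
$html = '<body><div>what <i>exactly</i> do you mean? We want to help <b>you</b> figure...</div></body>';
$body = simplexml_load_string($html);

// Option 1: serialize the element, then strip its own opening and closing tag.
function innerXmlViaAsXml(SimpleXMLElement $el): string
{
    $xml = trim($el->asXML());
    return preg_replace('~^<[^>]+>|</[^>]+>$~', '', $xml);
}

// Option 2: switch to DOM and serialize every child node (text and elements).
function innerXmlViaDom(SimpleXMLElement $el): string
{
    $node = dom_import_simplexml($el);
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

foreach ($body->xpath('/body/div') as $div) {
    echo innerXmlViaAsXml($div), "\n";
    echo innerXmlViaDom($div), "\n";
}
```

The DOM version is the more robust of the two since it never has to guess where the outer tag ends.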

Alternatively, you can take a look at the SimpleDOM project and use its innerHTML() method.

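// Assumes the SimpleDOM library has already been included;
// simpledom_load_string() and innerHTML() come from SimpleDOM, not core PHP.
$html =
'<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>';

$body = simpledom_load_string($html);

// Dump the inner markup of each matching div, child tags included.
foreach ($body->xpath('/body/div') as $div) {
    var_dump($div->innerHTML());
}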


I don't know if SimpleXML is different, but it seems to me you need to make sure you're selecting all node types and not just text. In standard XPath you would use /body/div/node()
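Here is a rough sketch of that idea using PHP's DOM extension rather than SimpleXML. As far as I know, SimpleXML's xpath() only hands back SimpleXMLElement objects, so DOMXPath is the easier place to try node(). The sample string is a shortened, ASCII-only stand-in for the question's HTML.

```php
<?php
$html = '<body><div>what <i>exactly</i> do you mean? We want to help <b>you</b> figure...</div></body>';

$doc = new DOMDocument();
$doc->loadXML($html);
$xpath = new DOMXPath($doc);

// node() matches every child of the div: text nodes as well as elements.
$inner = '';
foreach ($xpath->query('/body/div/node()') as $node) {
    $inner .= $doc->saveXML($node);
}

echo $inner, "\n";
```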
