A little help with this xPath?

I am getting some info from an RSS feed.

<?php
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->load('http://www.myrss.com');
libxml_clear_errors();

$xPath = new DOMXPath($dom);
$links = $xPath->query('xxxxx');
foreach($links as $link) {
    printf("%s \n", $link->nodeValue);
}
?>

I have managed to get the TITLE, LINK and DESCRIPTION with //item/title and so on; however, I want to get the text content and the image of the description separately.

Viewing the page source in Firefox, this is the markup I see for the image and the content. Both are inside <description></description>.

IMAGE

<div class="separator" style="clear: both; text-align: center;"><a href="LINK TO IMAGE" imageanchor="1" 
style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="192" 
src="LINK TO IMAGE" width="320" /></a></div>

CONTENT TEXT

<span class="Apple-style-span" style="font-family: 'Trebuchet MS', sans-serif;"> CONTENT TEXT IS HERE </span>

What XPath should I use to get that data? Thank you.


If it is what it looks like and the content is HTML-encoded, you can't do it in one step. You must retrieve each description's text and parse it into its own DOM (unless you want to resort to regex, which I would strongly discourage).

When in doubt, you can pass it through Tidy first. DOMDocument has loadHTML(), which is pretty resilient, but it is not guaranteed to load any arbitrary HTML.

// beware, this is untested. it should give you an idea, though.

$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);

$dom->load('http://www.myrss.com');
libxml_clear_errors();

$xPath = new DOMXPath($dom);
$items = $xPath->query('/rss/channel/item');

foreach($items as $item) {
    $descr = $xPath->query('./description', $item);
    // there should be at most one, but foreach gracefully
    // handles the case where there is no <description>
    foreach ($descr as $d) {
        $temp_dom = new DOMDocument();
        $temp_dom->loadHTML( $d->nodeValue );   // error handling/Tidy here!

        $temp_xpath = new DOMXPath($temp_dom);

        $img = $temp_xpath->query('//img');
        $txt = $temp_xpath->query('//span[@class="Apple-style-span"]');

        // now do something with $img and $txt
    }

}
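
For anyone who wants to try the two-step idea without a live feed, here is a minimal self-contained sketch. The inline feed, the example.com URLs and the `$results` array are made up for illustration; only the element names and the `Apple-style-span` class come from the question. It uses `evaluate()` with `string()` so each query returns a plain string rather than a node list.

```php
<?php
// Hypothetical feed fragment standing in for the real RSS. The description
// holds HTML-escaped markup, as described in the question.
$rss = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <item>
      <title>Example post</title>
      <description>&lt;div class="separator"&gt;&lt;a href="http://example.com/full.jpg"&gt;&lt;img src="http://example.com/thumb.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;span class="Apple-style-span"&gt; CONTENT TEXT IS HERE &lt;/span&gt;</description>
    </item>
  </channel>
</rss>
XML;

libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadXML($rss);
$xPath = new DOMXPath($dom);

$results = [];
foreach ($xPath->query('/rss/channel/item/description') as $d) {
    // nodeValue comes back with the XML escaping already resolved,
    // so at this point it is an ordinary HTML string.
    $html = $d->nodeValue;
    if (trim($html) === '') {
        continue; // nothing to parse for this item
    }

    $inner = new DOMDocument;
    $inner->loadHTML($html);
    $innerXPath = new DOMXPath($inner);

    // string(...) yields the first match as a string, or '' if none.
    $results[] = [
        'image' => $innerXPath->evaluate('string(//img/@src)'),
        'text'  => trim($innerXPath->evaluate('string(//span[@class="Apple-style-span"])')),
    ];
}
libxml_clear_errors();

foreach ($results as $r) {
    printf("%s | %s\n", $r['image'], $r['text']);
}
```

If the real feed puts the text outside that span, the second query would need adjusting; the class name is only what the question's sample shows.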


Your code wasn't formatted correctly, which makes it hard for others to work with.

However, the interactive tool here: http://www.bubasoft.net/ (XPath Builder) is very helpful when constructing XPath queries.


It looks like the content is encoded/escaped, so you can't query it with XPath because it isn't HTML/XML. Take a look at htmlentities and html_entity_decode.

You should extract the content, convert it to HTML/XML, and load it into a separate DOMDocument. Then you can query it using XPath.
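
A small sketch of that decode-then-load sequence, assuming you are holding the raw escaped string (the `$escaped` value below is a made-up sample based on the span from the question). Note that if you read the description through `DOMDocument`'s `nodeValue`, the XML parser has already resolved one level of escaping, so the explicit `html_entity_decode` call would only be needed for strings obtained some other way or escaped twice.

```php
<?php
// Hypothetical escaped snippet, modelled on the span shown in the question.
$escaped = '&lt;span class="Apple-style-span"&gt; CONTENT TEXT IS HERE &lt;/span&gt;';

// Turn the entities back into real markup.
$html = html_entity_decode($escaped, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Load the decoded markup into its own document and query it.
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$text = trim($xp->evaluate('string(//span[@class="Apple-style-span"])'));

echo $text, "\n";
```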
