PHP function to grab all links inside a <DIV> on remote site using scrape method

2023-01-22 17:48

Does anyone have a PHP function that can grab all links inside a specific DIV on a remote site? Usage might be:

$links = grab_links($url,$divname);

It should return an array I can use. I can figure out how to grab links, but I'm not sure how to restrict it to a specific div.

Thanks! Scott

Check out PHP XPath. It will let you query a document for the contents of specific tags and so on. The example on the php site is pretty straightforward: http://php.net/manual/en/simplexmlelement.xpath.php

The following example grabs every <a> element inside any DIV in a document:

$xml = new SimpleXMLElement($docAsString);

$result = $xml->xpath('//div//a');

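This works on HTML as well as XML, but only if the markup is well-formed (for example, valid XHTML). SimpleXMLElement throws an exception when the string cannot be parsed as XML.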
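To narrow that query to one specific DIV, an attribute predicate can be added to the XPath expression. Here is a minimal sketch, assuming the DIV is identified by its id; the sample document and the id "content" are made up for illustration. Note that if the page declares a default XML namespace (as strict XHTML usually does), a plain //div will not match until a namespace prefix is registered.

```php
<?php
// Hypothetical sample document; SimpleXMLElement needs well-formed XML/XHTML.
$docAsString = <<<XML
<html>
  <body>
    <div id="nav"><a href="/home">Home</a></div>
    <div id="content">
      <p><a href="/one">One</a></p>
      <a href="/two">Two</a>
    </div>
  </body>
</html>
XML;

$xml = new SimpleXMLElement($docAsString);

// Only anchors that are descendants of the div whose id is "content".
$links = array();
foreach ($xml->xpath('//div[@id="content"]//a') as $a) {
    // Attributes are exposed via array access; cast to get a plain string.
    $links[] = (string) $a['href'];
}

print_r($links);
```

The link inside the "nav" div is skipped because the predicate [@id="content"] filters the DIVs before //a descends into them.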

Good XPath reference: http://msdn.microsoft.com/en-us/library/ms256086.aspx

In the past I have used the PHP Simple HTML DOM Parser library with success:

http://simplehtmldom.sourceforge.net/

Samples:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images 
foreach($html->find('img') as $element) 
       echo $element->src . '<br>';

// Find all links 
foreach($html->find('a') as $element) 
       echo $element->href . '<br>';

I found something that seems to do what I wanted.

http://www.earthinfo.org/xpaths-with-php-by-example/

<?php

$html = new DOMDocument();
// The @ suppresses the warnings libxml emits for malformed HTML.
@$html->loadHTMLFile('http://www.bbc.com');
$xpath = new DOMXPath($html);
// href attribute of every link inside the div with this id
$nodelist = $xpath->query("//div[@id='news_moreTopStories']//a/@href");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}

// for images: reuse the same document and XPath object
// instead of downloading the page a second time

echo "<br><br>";
$nodelist = $xpath->query("//div[@id='promo_area']//img/@src");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}

?>
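If the DIV you want is identified by a class rather than an id, the same approach should still work with a slightly longer predicate. A rough sketch, assuming a made-up HTML string and the class name "promo"; it also uses libxml_use_internal_errors() as an alternative to the @ operator for keeping parser warnings out of the output:

```php
<?php
// Hypothetical markup with an unclosed <p> and <br> to mimic real-world HTML.
$markup = '<html><body>
<div class="promo top"><a href="/a">A</a><p>unclosed<br><a href="/b">B</a></div>
<div class="footer"><a href="/c">C</a></div>
</body></html>';

// Collect libxml warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// Match "promo" as one whole word of the space-separated class attribute.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' promo ')]//a/@href";

$hrefs = array();
foreach ($xpath->query($query) as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

The concat/normalize-space trick avoids false matches such as class="promotion", which a plain contains(@class, 'promo') would also pick up.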

I also tried the PHP DOM method, and it seems faster...

http://w-shadow.com/blog/2009/10/20/how-to-extract-html-tags-and-their-attributes-with-php/

$html = file_get_contents('http://www.bbc.com');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ suppresses the warnings that are
//emitted if the $html string isn't well-formed HTML.
@$dom->loadHTML($html);

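//Get all links inside the element with id 'news_moreTopStories'. You could
//also use any other tag name here, like 'img' or 'table', to extract other tags.
//getElementById() returns null when no such element exists, so guard against it.
$container = $dom->getElementById('news_moreTopStories');
$links = $container ? $container->getElementsByTagName('a') : array();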

//Iterate over the extracted links and display their URLs
foreach ($links as $link){
    //Extract and show the "href" attribute. 
    echo $link->getAttribute('href'), '<br>';
}
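
For what it's worth, the DOM approach can be wrapped into the grab_links($url, $divname) shape from the question. This is only a sketch under a few assumptions: $divname is taken to be the DIV's id attribute, grab_links_from_html() is a hypothetical helper split out so the parsing can be tried without a network call, and the sample markup is invented.

```php
<?php
// Hypothetical helper: collect the href of every <a> inside the element
// whose id attribute equals $divId. Returns an empty array if it is missing.
function grab_links_from_html($html, $divId)
{
    $links = array();
    if (!is_string($html) || trim($html) === '') {
        return $links; // nothing to parse
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ hides warnings about sloppy markup

    $container = $dom->getElementById($divId);
    if ($container === null) {
        return $links;
    }

    foreach ($container->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

// The signature from the question; $divname is assumed to be the div's id.
function grab_links($url, $divname)
{
    $html = @file_get_contents($url); // false on failure
    return grab_links_from_html($html, $divname);
}

// Offline usage with a hypothetical snippet instead of a live site.
$sample = '<div id="menu"><a href="/x">X</a><a>no href</a></div>'
        . '<div id="other"><a href="/y">Y</a></div>';
print_r(grab_links_from_html($sample, 'menu'));
```

The returned hrefs are whatever the page contains, so relative URLs would still need to be resolved against $url if absolute links are required.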
