HTML parsing with PHP DOMDocument
I'm trying to extract content from a forum. I want to get the links of all topics that have more than one page. This is the format of a single-page topic:
<td align="left">
<div class="topicos">
<a href="/_t_1593901" title="Welcome2">
<span class="titulo">
Hello World!
</span>
</a><br>
</div>
</td>
And this is the topic format if it has more than one page:
<td align="left">
<div class="topicos">
<a href="_t_1594517" title="Welcome">
<span class="titulo">
Hello World!
</span>
</a><br>
</div>
<span class="quickPaging">
[<img src="http://forum.imguol.com//forum/themes/jogos/images/clear.gif" class="master-sprite sprite-icon-minipost" alt="Ir à página" title="Ir à página">
Ir à página:
<a href="/_t_1594517?&page=1">1</a>,
<a href="/_t_1594517?&page=2">2</a>,
<a href="/_t_1594517?&page=3">3</a>,
<a href="/_t_1594517?&page=4">4</a>,
<a href="/_t_1594517?&page=5">5</a>
]</span>
</td>
I want to get the id (_t_1594517) of those topics with 5 or more pages. How can I do that? This is what I was trying, but I got lost, and I didn't understand the DOMDocument documentation very well. I'm new to programming and PHP, please help:
<?php
$html = new DOMDocument();
$url = "http://website.com/forum/?page=";
$page = "1";
while($page <= 10)
{
$html->loadHTML($url + $page);
foreach($html->getElementsByTagName('td') as $td)
{
if($td->hasAttributes())
{
if($td->getAttribute('align') == "left")
{
$div = $td->getElementsByTagName('div');
if($div->hasAttributes())
{
if($td->getAttribute('class') == "topicos")
{
$a = $td->getElementsByTagName('a');
{
if($a->hasAttributes())
{
/*$return['link'][] =*/ echo $a->getElementById('href')->tagName;
}
}
}
}
}
}
}
}
?>
I think XPath can help you.
If $with_links held the HTML content with the 5 page links, then:
$doc = new DOMDocument();
$doc->loadHTML($with_links);
$xpath = new DOMXPath($doc);
$quick_paging_links = $xpath->query('//span[@class="quickPaging"]/a[contains(@href,"_t_")]/@href');
if($quick_paging_links->length>4)
{
$first_href = $quick_paging_links->item(0)->value;
$id = substr($first_href, 1, strpos($first_href, '?')-1);
echo 'Topic with id '.$id.' has '.$quick_paging_links->length." links.\n";
}
will produce the output:
Topic with id _t_1594517 has 5 links.
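The snippet above handles a single topic. For a full forum page with many topics, one possible extension is to run the paging query relative to each quickPaging span, so the links are counted per topic. This is only a sketch under assumptions: the function names topicsWithManyPages and pagingLinks, the sample markup, and the id _t_1600000 are hypothetical and only modelled on the HTML in the question.

```php
<?php
// Hypothetical helper: returns the ids of topics whose quick-paging block
// lists at least $minPages page links.
function topicsWithManyPages(string $html, int $minPages = 5): array
{
    $doc = new DOMDocument();
    // Forum markup is rarely valid HTML, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $ids = [];
    foreach ($xpath->query('//span[@class="quickPaging"]') as $span) {
        // Query relative to this span so each topic is counted separately.
        $hrefs = $xpath->query('./a[contains(@href, "_t_")]/@href', $span);
        if ($hrefs->length < $minPages) {
            continue;
        }
        // Pull the "_t_<digits>" part out of the first page link.
        if (preg_match('/_t_\d+/', $hrefs->item(0)->value, $match)) {
            $ids[] = $match[0];
        }
    }
    return $ids;
}

// Hypothetical helper that builds a quick-paging span for the sample below.
function pagingLinks(string $id, int $pages): string
{
    $links = [];
    for ($p = 1; $p <= $pages; $p++) {
        $links[] = "<a href=\"/{$id}?&page={$p}\">{$p}</a>";
    }
    return '<span class="quickPaging">[' . implode(', ', $links) . ']</span>';
}

// Sample page: one single-page topic, one with 5 pages, one with 3 pages.
$sample = '<table><tr>'
    . '<td align="left"><div class="topicos"><a href="/_t_1593901">A</a></div></td>'
    . '<td align="left"><div class="topicos"><a href="/_t_1594517">B</a></div>'
    . pagingLinks('_t_1594517', 5) . '</td>'
    . '<td align="left"><div class="topicos"><a href="/_t_1600000">C</a></div>'
    . pagingLinks('_t_1600000', 3) . '</td>'
    . '</tr></table>';

print_r(topicsWithManyPages($sample));
```

To cover several forum pages, the HTML for each page could be fetched with something like file_get_contents($url . $page) and passed to the function. Note that PHP concatenates strings with "." and not "+", and that loadHTML expects markup, not a URL. If the forum abbreviates long page lists, counting links would undercount, and reading the largest page= value might be more reliable.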