strpos problem: getting value UBLIC returned
I am making a class to open a webpage and store the href values of all outbound links on the page. For some reason it works for the first 3, then goes weird. Below is my code:
class Crawler {
    var $url;
    function construct($url) {
        $this->url = 'http://'.$url;
        $this->crawl();
    }
    function crawl() {
        $str = file_get_contents($this->url);
        $start = 0;
        for($i=0; $i<10; $i++) {
            $beg = strpos($str, '<a href="http://',$start)+16;
            $end = strpos($str,'"',$beg);
            $diff = $end - $beg;
            $links[$i] = substr($str,$beg, $diff);
            $start = $start + $beg;
        }
        print_r($links);
    }
}
$crawler = new Crawler;
$crawler->construct('www.yahoo.com');
Ignore the for loop for the time being; I know it will only return the first 10 links and won't cover the whole document. But if you run this code, the first 3 work fine and then all the other values are UBLIC. Can anyone help? Thanks
Instead of:
$start = $start + $beg;
try:
$start = $beg;
That's likely why you are only seeing the first three matches.
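Also, you need to check that strpos did not return FALSE. The check has to happen before adding 16, because FALSE + 16 evaluates to 16 and would never compare equal to FALSE:
for($i=0; $i<10; $i++) {
    $pos = strpos($str, '<a href="http://',$start);
    if ($pos === FALSE)
        break;
    $beg = $pos + 16;
    //...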
Note, however, that you really should be using DOMDocument to find all tags in a document with a given tag name (a here). In particular, because this is HTML that might not be valid XHTML, you should consider using the loadHTML method.
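A minimal sketch of the DOMDocument approach, for illustration only. The function name extractLinks and the sample markup are hypothetical, and treating "outbound" as "href starts with http:// or https://" is an assumption based on the pattern the question searches for. Parse warnings from malformed HTML are collected through libxml_use_internal_errors rather than printed.

```php
<?php
// Hypothetical helper: collect absolute http(s) href values from an HTML string.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely well-formed; buffer libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns an empty string when the attribute is missing.
        $href = $anchor->getAttribute('href');
        // Assumption: only absolute http(s) URLs count as outbound links.
        if (preg_match('#^https?://#i', $href)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Sample markup (made up for this example).
$sample = '<html><body><a href="http://example.com/a">A</a>'
        . '<a href="/local">L</a>'
        . '<a class="x" href="https://example.org/">B</a></body></html>';
print_r(extractLinks($sample));
```

Unlike the strpos version, this does not depend on href being the first attribute of the tag or on the exact quoting used in the page source.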
I think you have a problem in your logic: you use $start to mark where to start looking for the next href, but the resulting $beg is still an index into the complete string. So when you update $start by adding $beg, you get values that are too high. You should try $start = $beg + 1 instead of $start = $start + $beg.
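The offset logic described above can be sketched as a standalone function. The name findLinks, the $limit parameter, and the sample string are hypothetical and not part of the original class. Once $start overshoots the end of the string, strpos finds nothing, and FALSE + 16 becomes 16; if the page starts with a `<!DOCTYPE html PUBLIC "...` declaration (an assumption about the page being fetched), the text from index 16 up to the next quote would be the repeated UBLIC value.

```php
<?php
// Hypothetical helper illustrating absolute offsets; not the original Crawler class.
function findLinks(string $str, int $limit = 10): array
{
    $needle = '<a href="http://';
    $links = [];
    $start = 0;

    while (count($links) < $limit) {
        // strpos always returns an index into the whole string, whatever $start is.
        $pos = strpos($str, $needle, $start);
        if ($pos === false) {
            break; // no further matches
        }
        $beg = $pos + strlen($needle); // first character after "http://"
        $end = strpos($str, '"', $beg);
        if ($end === false) {
            break; // unterminated attribute value
        }
        $links[] = substr($str, $beg, $end - $beg);
        // Continue just past the closing quote: an absolute position, never a running sum.
        $start = $end + 1;
    }
    return $links;
}

// Sample input (made up for this example).
$sample = '<a href="http://one.test">1</a> text <a href="http://two.test/page">2</a>';
print_r(findLinks($sample));
```

This sketch resumes at $end + 1 rather than $beg + 1; both are absolute positions, so either avoids the overshoot, and resuming after the closing quote simply skips text that has already been consumed.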