Regular expression to match links containing "Google"
I want to use a PHP regular expression to match all the links that contain the word "google". I've tried this:
$url = "http://www.google.com";
$html = file_get_contents($url);
preg_match_all('/<a.*(.*?)".*>(.*google.*?)<\/a>/i',$html,$links);
echo '<pre />';
print_r($links); // it should return 2 links 'About Google' & 'Go to Google English'
However it returns nothing. Why?
It is better to use XPath here:
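// Collect libxml warnings internally; real-world HTML is rarely well-formed
libxml_use_internal_errors(true);
$url = "http://www.google.com";
$html = file_get_contents($url);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Case-insensitive match: translate() lowercases the letters G, O, L, E before comparing.
// Note that text() only considers the anchor's first direct text node.
$query = "//a[contains(translate(text(), 'GOOGLE', 'google'), 'google')]";
// or just (case-sensitive):
// $query = "//a[contains(text(),'Google')]";
$links = $xpath->query($query);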
$links will be a DOMNodeList that you can iterate over.
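For illustration, here is a minimal sketch of iterating that node list. It assumes a made-up HTML string rather than a live fetch, and it uses "." instead of text() in the query so that text inside nested tags is also considered; both choices are assumptions of this sketch, not part of the answer above.

```php
<?php
// Hypothetical sample markup, used instead of fetching a live page.
$html = '<html><body>'
      . '<a href="/about">About Google</a>'
      . '<a href="/news">News</a>'
      . '<a href="/en">Go to <b>GOOGLE</b> English</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// "." is the anchor's full string value, including text inside nested tags.
$query = "//a[contains(translate(., 'GOOGLE', 'google'), 'google')]";

// Map each matching href to the anchor's visible text.
$found = [];
foreach ($xpath->query($query) as $a) {
    $found[$a->getAttribute('href')] = trim($a->textContent);
}
print_r($found);
```

Each item in the loop is a DOMElement, so getAttribute() and textContent are available on it directly.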
You should use a DOM parser, because using regular expressions on HTML documents can be painfully error-prone. Try something like this:
// Collect libxml parse errors internally instead of displaying them
libxml_use_internal_errors(true);
$url = "http://www.google.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$n = 0;
foreach ($doc->getElementsByTagName('a') as $a) {
    // Check whether the anchor's href contains 'google' (case-insensitive) and print it.
    // Compare with !== false: a match at position 0 returns 0, which is falsy.
    if ($a->hasAttribute('href') && stripos($a->getAttribute('href'), 'google') !== false) {
        echo "Anchor " . ++$n . ': ' . $a->getAttribute('href') . '<br>';
    }
}
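One detail worth checking in href tests like this: strpos() and stripos() return the offset of the match, which is 0 when the needle sits at the very start of the string, and 0 is falsy in a plain if condition. A small sketch of a strict check follows; the helper name hrefContains and the sample hrefs are made up for illustration.

```php
<?php
// Hypothetical helper; wraps the strict comparison so callers cannot forget it.
function hrefContains(string $href, string $needle): bool
{
    // stripos() returns the match offset (possibly 0) or false when absent.
    return stripos($href, $needle) !== false;
}

// Sample hrefs, made up for this sketch.
$hrefs = ['google.com/intl/en/about', 'http://www.GOOGLE.com/', 'http://example.com/'];

$matches = [];
foreach ($hrefs as $href) {
    if (hrefContains($href, 'google')) {
        $matches[] = $href;
    }
}
print_r($matches); // the first two hrefs

// The loose check misses a match at offset 0:
var_dump((bool) strpos('google.com/intl/en/about', 'google')); // bool(false)
```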