php - How do I extract bolded terms from a webpage and put them into an associative array?

I'm trying to grab all the bolded terms from a Google results page and put them into an associative array, but the results are erratic. It seems to extract only single-word terms, and sometimes (depending on the query) it grabs words that are not bolded. Does anyone know what I'm doing wrong? Thanks in advance.

$gurl = "http://www.google.com/search?q=marketingpro";
$data = file_get_contents($gurl);

// get bolded
preg_match_all('/<b>(\w+)<\/b>/', $data, $res, PREG_PATTERN_ORDER);
$H = $res[0];
foreach ($H as $X) {
    $bold = strtolower($X);
    $array[$bold] += 1;
}
print_r($array);


Try:

$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.google.com/search?q=marketingpro');
$xpath = new DOMXPath($doc);
$terms = array();
foreach ($xpath->query('//b') as $b)
{
  $terms[$b->nodeValue] = true;
}

var_dump(array_keys($terms));

Running this, I get:

array(15) {
  [0]=>
  string(3) "Web"
  [1]=>
  string(13) "marketing pro"
  [2]=>
  string(12) "marketingpro"
  [3]=>
  string(3) "..."
  ... snip ...
  [14]=>
  string(9) "marketing"
}
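
Since the question asks for an associative array, here is a minimal sketch of the same DOM approach that counts each term instead of only recording it. The sample markup in $html is a made-up stand-in for the fetched page, and lowercasing and trimming the text are my own assumptions about what should count as the same term.

```php
<?php
// Hypothetical sample markup; a real page would be fetched first.
$html = '<html><body><p><b>Marketing Pro</b> tips, <b>marketing</b> '
      . 'news, and more <b>marketing pro</b> links.</p></body></html>';

$doc = new DOMDocument();
// Suppress the warnings that malformed real-world HTML tends to raise.
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$counts = array();
foreach ($xpath->query('//b') as $b) {
    // Normalize so "Marketing Pro" and "marketing pro" share one key.
    $term = strtolower(trim($b->nodeValue));
    if ($term === '') {
        continue;
    }
    // Initialize the key on first sight to avoid an undefined-index notice.
    $counts[$term] = isset($counts[$term]) ? $counts[$term] + 1 : 1;
}

print_r($counts);
```

The isset() check matters because a bare `+= 1` on a key that does not exist yet raises a notice.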


/<b>(\w+)<\/b>/ matches only when the tag contains a single word: \w covers letters, digits and underscore, so any bolded text containing a space or another character is skipped entirely. I'd suggest /<b>([^<]+)<\/b>/ instead, or a DOM/XML parser (but since Google's HTML is invalid, those can fail).
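
To make the difference concrete, here is a small sketch that runs both patterns over a made-up string (the markup in $html is hypothetical, not real Google output). It also reads the captured text from index 1 of the result, because index 0 holds the full match with the <b> tags still attached.

```php
<?php
// Hypothetical sample markup standing in for a fetched results page.
$html = '<b>marketingpro</b>, <b>marketing pro</b>, <b>Marketing</b>, <b>marketing</b>';

// \w+ matches only runs of letters, digits and underscore.
preg_match_all('/<b>(\w+)<\/b>/', $html, $narrow);

// [^<]+ matches everything up to the next tag, spaces included.
preg_match_all('/<b>([^<]+)<\/b>/', $html, $broad);

// Index 1 is the capture group; index 0 would include the tags.
$counts = array();
foreach ($broad[1] as $term) {
    $key = strtolower(trim($term));
    $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
}

print_r($narrow[1]); // the multi-word term is missing here
print_r($counts);
```

Neither pattern matches a <b> tag that carries attributes or contains nested markup, so this is only a sketch for simple cases.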


It extracts only single words, because that's what \w+ means. You could use a broader matching pattern like ([^<>]+) instead.

Or better yet, use QueryPath or phpQuery, which are easier on the eyes:

foreach (qp($html)->find("b") as $bold) {
    $bold = strtolower($bold->text());
    $array[$bold] += 1;
}


You might consider using a DOM parser. There's one here:

http://simplehtmldom.sourceforge.net/

Or, do something like this:

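function getTextBetweenTags($string, $tagname)
{
  // Escape the tag name so it is treated literally inside the pattern.
  $tag = preg_quote($tagname, '/');
  $pattern = "/<$tag>(.*?)<\/$tag>/";
  preg_match($pattern, $string, $matches);
  // Return null instead of raising a notice when nothing matched.
  return isset($matches[1]) ? $matches[1] : null;
}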

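That will work as long as the tag doesn't have any attributes, which <b> tags shouldn't. Note that preg_match returns only the first match, so use preg_match_all if you need every occurrence.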
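
A variant that collects every occurrence might look like the sketch below. The name getAllTextBetweenTags is hypothetical, and the "s" and "i" modifiers are my own additions so that the pattern spans newlines and also matches uppercase <B> tags.

```php
<?php
// Hypothetical helper: returns every match rather than only the first.
function getAllTextBetweenTags($string, $tagname)
{
    // Escape the tag name so it is treated literally inside the pattern.
    $tag = preg_quote($tagname, '/');
    // "s" lets . span newlines; "i" matches <B> as well as <b>.
    $pattern = "/<$tag>(.*?)<\/$tag>/si";
    preg_match_all($pattern, $string, $matches);
    // Index 1 holds the captured text of every match (empty array if none).
    return $matches[1];
}

$terms = getAllTextBetweenTags('<b>one</b> x <B>two words</B>', 'b');
$none  = getAllTextBetweenTags('no bold text here', 'b');
print_r($terms);
```

The attribute limitation still applies: a tag written as <b class="x"> would not be matched.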
