RegEx to get the keywords from HTML

I'm trying to obtain the keywords from an HTML page that I'm scraping with PHP.

So, if the keywords tag looks like this:

<meta name="Keywords" content="MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary">

I want to get this back:

MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary

I've constructed a regex, but it's not doing the trick.

(?i)^(<meta name=\"keywords\" content=\"(.*)\">)

Any ideas?


I would use an HTML/XML parser like DOMDocument and XPath to retrieve the nodes from the DOM:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$keywords = $xpath->query('//meta[translate(normalize-space(@name), "KEYWORDS", "keywords")="keywords"]/@content');
foreach ($keywords as $keyword) {
    echo $keyword->value;
}

The translate function is necessary because PHP's DOMXPath implements XPath 1.0, which has no lower-case() function.

Alternatively, do the filtering in PHP:

$metas = $xpath->query('//meta');
foreach ($metas as $meta) {
    if ($meta->hasAttribute("name") && trim(strtolower($meta->getAttribute("name")))=='keywords' && $meta->hasAttribute("content")) {
        echo $meta->getAttribute("content");
    }
}
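
As a self-contained variant of the DOM approach above, here is a minimal sketch that wraps the lookup in a function and splits the result into individual keywords. The function name extractMetaKeywords and the sample HTML are made up for illustration, and the call to libxml_use_internal_errors() is an addition to keep loadHTML() quiet on sloppy real-world markup.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the keywords as an array.
function extractMetaKeywords(string $html): array
{
    // Collect libxml warnings internally instead of printing them for non-conforming HTML.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // getAttribute() returns an empty string when the attribute is missing.
        if (strtolower(trim($meta->getAttribute('name'))) === 'keywords') {
            // Split on commas and trim the whitespace around each keyword.
            return array_map('trim', explode(',', $meta->getAttribute('content')));
        }
    }
    return [];
}

// Sample input, assumed for illustration only.
$html = '<html><head><meta name="Keywords" content="MacUpdate, Mac Software, Apple"></head><body></body></html>';
print_r(extractMetaKeywords($html));
```

Comparing the name in PHP rather than in XPath avoids the translate() workaround, at the cost of visiting every meta element.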


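Use the built-in function get_meta_tags(). It reads a file or URL and returns an array of the meta tags' content values, keyed by the lowercased name attribute.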
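
A minimal sketch of how that might look. Because get_meta_tags() takes a filename or URL rather than an HTML string, this example writes some sample markup (assumed, for illustration) to a temporary file first; with a live page you would pass the URL directly.

```php
<?php
// Write sample HTML to a temp file, since get_meta_tags() reads from a file or URL.
$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents(
    $path,
    '<html><head><meta name="Keywords" content="MacUpdate, Mac Software, Apple"></head><body></body></html>'
);

$tags = get_meta_tags($path);
unlink($path);

// Keys are the lowercased name attributes, so "Keywords" is found under "keywords".
$keywords = $tags['keywords'] ?? '';
echo $keywords, "\n";
```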


Stop using regex for this. It is brittle against real-world HTML (attribute order, quoting style, extra whitespace) and harder to maintain than a parser.

If you're programming in PHP, check out http://simplehtmldom.sourceforge.net/. Simple HTML DOM is powerful enough to get you everything you need in a very simple object-oriented way.

// Load the library (simple_html_dom.php from the download above)
require_once 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';

Another example:

// Example
$html = str_get_html("<div>foo <b>bar</b></div>"); 
$e = $html->find("div", 0);

echo $e->tag; // Returns: " div"
echo $e->outertext; // Returns: " <div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: " foo <b>bar</b>"
echo $e->plaintext; // Returns: " foo bar"


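(.*) is greedy: it matches as much as it can, so the capture runs up to the LAST "> on the line (the dot does not match newlines by default) rather than stopping at the first closing quote, which is obviously not what you want. Quantifiers are greedy by default. Use a lazy quantifier or a negated character class instead: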

content=\"(.*?)\"

or

content=\"([^\"]*)\"
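
To make the difference concrete, here is a small sketch comparing the three patterns. The input string is made up for illustration: two meta tags on one line, so the greedy version has something to overshoot into.

```php
<?php
// Hypothetical input: two meta tags on the same line.
$html = '<meta name="keywords" content="Apple, Mac"><meta name="author" content="Someone">';

preg_match('/content="(.*)"/i', $html, $greedy);      // greedy: runs to the last quote on the line
preg_match('/content="(.*?)"/i', $html, $lazy);       // lazy: stops at the first closing quote
preg_match('/content="([^"]*)"/i', $html, $negated);  // negated class: cannot cross a quote at all

echo $greedy[1], "\n";   // overshoots into the second tag
echo $lazy[1], "\n";     // Apple, Mac
echo $negated[1], "\n";  // Apple, Mac
```

The negated character class is usually the safer of the two fixes, since it can never run past a quote even when the rest of the pattern forces backtracking.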


Stop trying to parse HTML with regular expressions.

See: RegEx match open tags except XHTML self-contained tags


(?i)<meta\s+name="keywords"\s+content="(.*?)">

In PHP this becomes something like:

preg_match('~<meta\s+name="keywords"\s+content="(.*?)">~i', $html, $matches);


This is a simple regex that matches the first meta keywords tag. It only allows letters, digits, a few punctuation characters used in URLs, HTML entities, and spaces inside the content attribute.

$matches = array();
// Single-quoted string, so backslashes reach the regex engine unchanged; the i flag ignores case.
if (preg_match('/<meta name="keywords" content="([\w;,.: %&#\/\\\\]*)"/i', $html, $matches)) {
    echo $matches[1];
}