Extract a description from a site that has no meta description tag?
I need a PHP function that extracts a description from a site URL when the page has no meta description tag. Any ideas?
I have tried this function, but it doesn't work:
$content = file_get_contents($url);
function getExcerpt($content) {
$text = html_entity_decode($content);
$excerpt = array();
//match all tags
preg_match_all("|<[^>]+>(.*)]+>|", $text, $p, PREG_PATTERN_ORDER);
for ($x = 0; $x < sizeof($p[0]); $x++) {
if (preg_match('< p >i', $p[0][$x])) {
$strip = strip_tags($p[0][$x]);
if (preg_match("/\./", $strip))
$excerpt[] = $strip;
}
if (isset($excerpt[0])){
preg_match("/([^.]+.)/", $strip,$matches);
return $matches[1];
}
}
return false;
}
$excerpt = getExcerpt($content);
Parsing HTML with regular expressions is almost always a bad idea. Thankfully, PHP has libraries that can do the work for you. The following code uses DOMDocument to extract either the meta description or, if one does not exist, the first 1000 characters of text on the page.
<?php
function getExcerpt($html) {
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    // Parse the input HTML into a DOM
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $metaTags = $dom->getElementsByTagName('meta');
    // Check for a meta description and return it if it exists
    foreach ($metaTags as $metaTag) {
        // Values such as "Description" are common, so compare case-insensitively
        if (strtolower($metaTag->getAttribute('name')) === 'description') {
            return $metaTag->getAttribute('content');
        }
    }
    // No meta description, extract an excerpt from the body
    // Get the body node
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return false;
    }
    // Extract the text content
    $bodyText = $body->textContent;
    // Collapse any line breaks
    $bodyText = preg_replace('/\s*\n\s*/', "\n", $bodyText);
    // Collapse any leftover runs of spaces or tabs to single spaces
    $bodyText = preg_replace('/[ \t]+/', ' ', $bodyText);
    // Return the first 1000 bytes (substr() counts bytes, not characters)
    return trim(substr($bodyText, 0, 1000));
}
$html = file_get_contents('test.html');
echo nl2br(getExcerpt($html));
You'll probably want to add a little more logic to it, such as some DOM traversal to locate the main content, or taking a snippet from near the middle of the text. As it is, this code will probably grab a bunch of unwanted text, such as the navigation at the top of the page.
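One way to cut down on the unwanted text is to remove elements that rarely contain article content before reading the body text. The following is only a minimal sketch: the function name getBodyTextWithoutNoise and the default list of tags are assumptions made for illustration, and real pages may need a different list.

```php
<?php
// Hypothetical helper: drop elements that rarely hold article text,
// then return the whitespace-collapsed text of <body>.
function getBodyTextWithoutNoise($html, array $noiseTags = array('script', 'style', 'nav', 'header', 'footer')) {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($noiseTags as $tag) {
        // Copy the live node list to an array first: removing nodes while
        // iterating the live list would skip elements.
        $nodes = iterator_to_array($dom->getElementsByTagName($tag));
        foreach ($nodes as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    // Collapse every run of whitespace to a single space
    return trim(preg_replace('/\s+/', ' ', $body->textContent));
}
```

Without this step, textContent on the body also includes the contents of inline script and style elements, so stripping them is usually worthwhile even if the navigation is left in place.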
You should first check whether a meta description is available. If it is, display it; otherwise, look for <p> tags and use their text as the description. You might want to set a minimum paragraph length: for example, if a paragraph is shorter than 30 characters, move on to the next one. If there are no <p> tags, simply display the title as the description (that's how Facebook and Digg work).
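The fallback order described above could be sketched roughly as follows. The function name describePage and the $minParagraphLength parameter are hypothetical; the 30-character threshold comes from the suggestion above and is measured in bytes here, since strlen() does not count multibyte characters.

```php
<?php
// Hypothetical sketch of the fallback chain: meta description,
// then the first sufficiently long <p>, then <title>.
function describePage($html, $minParagraphLength = 30) {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // 1. Meta description, if present and non-empty
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $content = trim($meta->getAttribute('content'));
        if (strtolower($meta->getAttribute('name')) === 'description' && $content !== '') {
            return $content;
        }
    }

    // 2. First paragraph with at least $minParagraphLength bytes of text
    foreach ($dom->getElementsByTagName('p') as $paragraph) {
        $text = trim(preg_replace('/\s+/', ' ', $paragraph->textContent));
        if (strlen($text) >= $minParagraphLength) {
            return $text;
        }
    }

    // 3. Fall back to the page title
    $title = $dom->getElementsByTagName('title')->item(0);
    if ($title !== null && trim($title->textContent) !== '') {
        return trim($title->textContent);
    }

    // Nothing usable was found
    return false;
}
```

It would be called the same way as the function in the earlier answer, for example describePage(file_get_contents($url)), with a check for a false return value when the page has no usable text at all.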