Getting the first paragraph from a URL that contains no script tags and has more than 20 words

I want to grab, from a URL, the first paragraph that doesn't contain script tags and has more than 20 words. The right paragraph could be the third one, for example. Can you help me with that? This is what I have so far:

 $start = strpos($url, '<p>');
 $end = strpos($url, '</p>', $start);
 $par1 = substr($url, $start, $end - $start + 4);   
 $count = str_word_count($par1);
 if ($count > 20) {     
     $par = html_entity_decode(strip_tags($par1));
     echo $par;
 }

This code is not quite right: it only looks at the first paragraph on the page, and prints it only if it has more than 20 words.


Can't you just approximate the desired length in characters? For instance, 20 words is roughly 80 characters, in which case you could use this XPath: //p[not(script) and string-length(normalize-space(.)) > 80]
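
For what it's worth, here is a minimal sketch of how an XPath like that could be run from PHP with the built-in DOM extension. The function name firstLongParagraph and the $minWords parameter are made up for illustration, and it assumes $html already holds the page's markup. It counts words with str_word_count rather than estimating from characters, and uses not(.//script) so that nested script tags are rejected as well:

```php
<?php
// Hypothetical helper (not from the original post): returns the text of the
// first <p> that has no <script> inside it and more than $minWords words,
// or null if no paragraph qualifies.
function firstLongParagraph(string $html, int $minWords = 20): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // not(.//script) skips paragraphs with a script element anywhere inside.
    foreach ($xpath->query('//p[not(.//script)]') as $p) {
        // textContent is already tag-free and entity-decoded.
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
        if (str_word_count($text) > $minWords) {
            return $text; // first match in document order
        }
    }
    return null;
}
```

You would call it as firstLongParagraph($html) on whatever string you fetched, for example with file_get_contents(). Note that loadHTML() assumes ISO-8859-1 unless the page declares its charset, so non-ASCII pages may need extra handling.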


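This iterates through the p tags using the code you posted above, skips any that contain script tags, and stops at the first one with more than 20 words. I think that's what you're looking for:

// $url is expected to hold the page's HTML; you may want to rename it.
// Note that this only matches a bare <p>, not <p class="...">.
$contents = explode("<p>", $url);
array_shift($contents); // drop everything before the first <p>

foreach($contents as $p)
{
     $end = strpos($p, '</p>');
     if($end === false)
     {
         continue; // no closing tag, so not a complete paragraph
     }
     $p = substr($p, 0, $end); // keep only what is inside the <p>
     if(stripos($p, '<script') !== false)
     {
         continue; // skip paragraphs that contain script tags
     }
     $p = html_entity_decode(strip_tags($p));
     if(str_word_count($p) > 20)
     {
         echo $p;
         break; // stop at the first matching paragraph
     }
}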