How to extract blocks of text from an HTML page?
I would like to extract blocks of text with more than 100 words from a large HTML page using PHP. Whether the text is contained in <p>...</p> doesn't matter.
I only care about the number of words that make up a coherent text block, so text outside of HTML paragraphs should also be taken into consideration.
How can this be done?
I use phpQuery. If you are familiar with jQuery, the two share the same syntax. You might be concerned about installing a new library, but trust me, this library is well worth the extra overhead.
phpQuery
You can then access it like this:
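// $doc is assumed to be a phpQuery document, e.g. from phpQuery::newDocumentHTML($html)
foreach ($doc->find('p') as $element) {
    $element = pq($element); // wrap the raw DOM element in a phpQuery object
    echo str_word_count($element->text()), "\n"; // word count of the paragraph's text
}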
Use the PHP Simple HTML DOM Parser.
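// $html is assumed to be a parsed document, e.g. from str_get_html() or file_get_html()
foreach ($html->find('p') as $element) {
    // plaintext holds the element's text; src would only read a src attribute
    echo str_word_count($element->plaintext), "\n";
}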
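Both snippets above only look at <p> elements, while the question also asks about text outside paragraphs. One possible way to cover that case, sketched here with PHP's bundled DOM extension rather than either library, is to group every text node under its nearest non-inline ancestor and keep the groups with more than 100 words. The function name extractTextBlocks, the $minWords parameter and the list of inline tags are illustrative assumptions, not part of any library.

```php
<?php
// Sketch: return text blocks with more than $minWords words, whether or not
// they sit inside <p> tags. extractTextBlocks is a hypothetical helper name.
function extractTextBlocks(string $html, int $minWords = 100): array
{
    // Tags treated as inline, so their text is merged into the enclosing block.
    // This list is an assumption; extend it as needed.
    $inline = ['a', 'abbr', 'b', 'code', 'em', 'i', 'small', 'span', 'strong', 'sub', 'sup', 'u'];

    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Note: loadHTML() assumes ISO-8859-1 unless the page declares a charset.
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Every non-blank text node in the body, excluding scripts and styles.
    $query = '//body//text()[normalize-space()][not(ancestor::script or ancestor::style)]';

    $groups = [];
    foreach ($xpath->query($query) as $textNode) {
        // Climb past inline wrappers to the nearest block-like ancestor.
        $container = $textNode->parentNode;
        while ($container->parentNode instanceof DOMElement
            && in_array($container->nodeName, $inline, true)) {
            $container = $container->parentNode;
        }
        // The node path uniquely identifies the container element.
        $key = $container->getNodePath();
        $groups[$key] = ($groups[$key] ?? '') . $textNode->nodeValue;
    }

    $blocks = [];
    foreach ($groups as $text) {
        // Collapse runs of whitespace before counting words.
        $text = trim(preg_replace('/\s+/', ' ', $text));
        if (str_word_count($text) > $minWords) {
            $blocks[] = $text;
        }
    }
    return $blocks;
}
```

Calling extractTextBlocks(file_get_contents('page.html')), where page.html is a placeholder file name, would return an array of strings, one per qualifying block. Keep in mind that str_word_count() only recognises ASCII letters by default, so pages in other scripts would need a different word counter.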