Get HTML output (cleaned text) with PHP
Do you know if there is a PHP function that cleans up HTML code (fetched with cURL) and keeps only the visible text (what the browser would actually show)? Thanks.
This is harder than you'd think. An obvious simple solution is to run strip_tags() over it, but that would simply remove tags and leave all text content intact, including embedded JavaScript and CSS, as well as all text inside elements that are normally hidden (e.g. by setting display: none on them). You could try some regex magic to filter out the parts you're not interested in, but regular expressions on HTML are generally a bad idea for anything nontrivial. The ultimate solution is, I'm afraid, to use a proper HTML parser and then pull the actual text out of the resulting DOM tree - by the time you have that, you'll be pretty close to implementing a web browser.
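As a rough illustration of the parser route, here is a minimal sketch using PHP's bundled DOM extension. The helper name visibleText() is made up for this example, and the hidden-element check is only a heuristic: it catches the hidden attribute and inline display:none styles, not rules coming from stylesheets or scripts, so treat it as an approximation rather than what a browser really renders.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns an approximation of the
// text a browser would display for the given HTML string.
function visibleText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Real-world markup is often malformed; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // Note: loadHTML() assumes ISO-8859-1 unless the markup declares a charset.
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Elements whose content is never rendered as page text, plus a heuristic
    // for elements hidden via the hidden attribute or an inline style.
    $query = '//script | //style | //head | //noscript | //template'
           . ' | //*[@hidden]'
           . ' | //*[contains(translate(@style, " ", ""), "display:none")]';

    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // Collapse runs of whitespace the way a browser roughly would.
    return trim(preg_replace('/\s+/u', ' ', $root->textContent));
}
```

Calling visibleText() on the body returned by cURL should give the paragraph text without script or style contents. Text hidden through external CSS will still come through, which is exactly the "close to implementing a web browser" problem described above.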
Take a look at strip_tags():
http://us.php.net/manual/en/function.strip-tags.php
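For comparison, a small sketch of the strip_tags() approach. The wrapper name naiveText() is hypothetical. It also decodes entities with html_entity_decode(), since strip_tags() leaves things like &amp;amp; untouched. Note that script and style contents survive, so this only suits markup you know to be simple.

```php
<?php
// Hypothetical wrapper: strips tags and decodes entities, nothing more.
// It does NOT remove the contents of <script> or <style> elements.
function naiveText(string $html): string
{
    $text = strip_tags($html);
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// The script body is kept as plain text, which is usually not what you want.
echo naiveText('<p>Hello &amp; welcome</p><script>alert(1)</script>');
```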
If you're literally just "cleaning up" the code, then a solution like Tidy (HTML Tidy, available in PHP as the tidy extension) could be your answer.
Some solutions like this will allow you to pull out plain text and might ease your pain.
However, "full on" parsing is a whole other story and you'd better bone up on your regex.
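If the goal is only to repair messy markup before further processing, a sketch along these lines might work. It assumes the optional tidy extension is installed and falls back to returning the input unchanged when it is not. The function name cleanMarkup() and the chosen options are just one possible setup.

```php
<?php
// Hypothetical helper: repairs malformed HTML with the tidy extension when it
// is available; otherwise returns the input as-is.
function cleanMarkup(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy extension not loaded
    }

    $config = [
        'show-body-only' => true, // emit only the contents of <body>
        'wrap'           => 0,    // do not hard-wrap long lines
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? $html : $repaired;
}

// Unclosed tags get closed; the result is still HTML, not plain text.
echo cleanMarkup('<p>Hello <b>world');
```

Tidy only normalizes the markup. To get plain text you would still pass the result through a parser or strip_tags().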