
Get HTML output (cleaned text) with PHP

Do you know if there is any PHP function that cleans up some HTML code (fetched with cURL) and filters out just the visible text (the text the browser is going to show)? Thanks


This is harder than you'd think. An obvious simple solution is to run strip_tags() over it, but that only removes the tags and leaves all text content intact, including embedded JavaScript and CSS, as well as all text inside elements that are normally hidden (e.g. by setting display: none on them). You could try some regex magic to filter out the parts you're not interested in, but regular expressions on HTML are generally a bad idea for anything nontrivial. The ultimate solution is, I'm afraid, to use a proper HTML parser and then pull the actual text out of the resulting DOM tree. By the time you have that working for every case (external stylesheets, scripts that change the page), you'll be pretty close to implementing a web browser.
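As a rough illustration of the parser route, here is a minimal sketch using PHP's bundled DOM extension. The function name visibleText() is made up for this example, and the sketch assumes it is enough to drop script, style, noscript and head elements, comments, and elements hidden with an inline display: none. It does not evaluate stylesheets or run scripts, so it only approximates what a browser would show.

```php
<?php
// Hypothetical helper (not a built-in): approximates the text a browser would display.
function visibleText(string $html): string
{
    if (trim($html) === '') {
        return ''; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Note: loadHTML() assumes ISO-8859-1 unless the markup declares another charset.
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc->documentElement === null) {
        return '';
    }

    $xpath = new DOMXPath($doc);
    // Assumption: only inline styles are checked for display:none.
    $unwanted = $xpath->query(
        '//script | //style | //noscript | //head | //comment()'
        . ' | //*[contains(translate(@style, " ", ""), "display:none")]'
    );

    // Copy the node list to an array first, because removing nodes while
    // iterating a live DOMNodeList can skip entries.
    foreach (iterator_to_array($unwanted) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Collapse runs of whitespace left behind by the markup.
    $text = $doc->documentElement->textContent;
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

Calling visibleText() on the string returned by curl_exec() would give a single line of text. Anything hidden through a CSS class, or inserted by JavaScript at load time, is beyond what this sketch can detect.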


Take a look at strip_tags():

http://us.php.net/manual/en/function.strip-tags.php
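For simple fragments, strip_tags() can be combined with html_entity_decode() and a whitespace cleanup. The following is only a sketch; roughText() is a hypothetical name, and the helper assumes the input is UTF-8 and contains no script or style blocks, because strip_tags() keeps their contents.

```php
<?php
// Hypothetical helper (not a built-in): quick text extraction for simple fragments.
function roughText(string $html): string
{
    // Removes the tags but keeps everything between them.
    $text = strip_tags($html);
    // Turn entities such as &amp; back into characters (assumes UTF-8).
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse whitespace; fall back to the undecoded text if the regex
    // fails (for example on invalid UTF-8, where preg_replace returns null).
    $collapsed = preg_replace('/\s+/u', ' ', $text) ?? $text;
    return trim($collapsed);
}

echo roughText('<p>Fish &amp; chips</p>'), "\n";             // Fish & chips
echo roughText('<p>Hi</p><script>alert(1)</script>'), "\n";  // Hialert(1)
```

The second call shows the limitation: the script body ends up in the output along with the visible text.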


If you're literally just "cleaning up" the code, then a solution like HTML Tidy (available in PHP as the tidy extension) could be your answer.

Some solutions like this will allow you to pull out plain text and might ease your pain.

However, "full on" parsing is a whole other story and you'd better bone up on your regex.
