Extracting data from a web page

2023-02-27 02:26

I am doing a school project that requires extracting data from web pages. To be precise, I need a library or open-source program that extracts human-readable content from HTML/text data, something like the text content a web browser renders.

I know that parsing HTML with regexes is the worst way to extract text from it.

Extra info:

I need it for computing similarity between text documents.

Any help would be appreciated. Thanks

I would highly recommend this question's first answer in an effort to keep you away from parsing HTML with regular expressions. That answer does a far better job of illustrating why you shouldn't than I could, so I defer to that.

You should also look into XML parsers instead of trying to "parse by hand" with a regex, as the referenced question and its answers explain.

If all you care about is textual similarity, you could just write a regex to strip out all the HTML tags of the form </?(every|single|valid|tag)[^>]*> (perhaps first removing all <script>.*</script> elements), then mash all the content into one very long paragraph. That wouldn't be a bad use of a regex at all; that's what they're there for.
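A minimal sketch of that tag-stripping idea in Python, using only the standard library. The function name strip_tags and the three patterns are illustrative assumptions. Instead of listing every valid tag name, the tag pattern matches anything that looks like a tag, which is a simplification of the regex above. It is good enough for feeding a similarity measure, not for faithful rendering.

```python
import re
from html import unescape

# Drop <script> and <style> elements together with everything inside them.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Drop HTML comments.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Drop anything that looks like an opening or closing tag
# (simplification: no whitelist of valid tag names).
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")


def strip_tags(html_text):
    """Return the text of html_text as one long whitespace-normalized string."""
    text = SCRIPT_STYLE_RE.sub(" ", html_text)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = unescape(text)          # turn &amp; into &, &lt; into <, and so on
    return " ".join(text.split())  # collapse runs of whitespace


if __name__ == "__main__":
    page = "<html><script>var x = 1;</script><p>Hello <b>world</b></p></html>"
    print(strip_tags(page))
```

Because every tag is replaced by a space, text split by inline markup inside a single word (for example wor<b>ld</b>) comes out as two tokens; for bag-of-words similarity that is usually an acceptable loss.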

I might recommend http://docs.python.org/library/xml.dom.minidom.html, but IMHO the interface can be very awkward. Also, you don't need access to the hierarchical structure, just the text. If you did need the structure, a parser would be better than a regex (which would then be a terrible idea).
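If you do want a parser for this, here is a rough sketch. It uses Python's standard html.parser module rather than minidom, because minidom expects well-formed XML and most real pages are not. That swap is my assumption, not something recommended above, and the names TextExtractor and extract_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    page = "<p>Hello <b>world</b></p><script>var a = '<b>';</script>"
    print(extract_text(page))
```

Unlike the regex approach, the parser knows that the '<b>' inside the script is script content and not a tag, so it cannot leak into the extracted text.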

Tags: html-content-extraction html-parsing parsing text-extraction
