Stripping irrelevant parts of a web page

Is there an API or systematic way of stripping irrelevant parts of a web page while scraping it via Python? For instance, take this very page -- the only important part is the question and the answers, not the sidebar column, header, etc. One can guess things like that, but is there any smart way of doing it?


There's the approach from the Readability bookmarklet, with at least two Python implementations available:

  • decruft
  • python-readability


In general, no. In specific cases, if you know something about the structure of the site you are scraping, you can use a tool like Beautiful Soup to manipulate the DOM.
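
As a rough illustration of the site-specific approach, here is a minimal sketch that uses only the standard library's `html.parser` as a stand-in for Beautiful Soup. The tag list in `SKIP_TAGS` and the names `MainTextExtractor` and `extract_main_text` are assumptions made for the example; on a real site you would pick the tags, ids or classes after inspecting its markup.

```python
from html.parser import HTMLParser

# Assumed set of tags that hold boilerplate on the target site.
SKIP_TAGS = {"header", "nav", "aside", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped containers we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text chunks found outside the skipped containers."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

With Beautiful Soup the same idea would usually be expressed by finding the unwanted elements and removing them from the tree. Either way it only works as long as the site keeps the structure you coded against.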


One approach is to compare the structure of multiple webpages that share the same template. In this case you would compare multiple SO questions. Then you can determine which content is static (useless) or dynamic (useful).

This field is known as wrapper induction. Unfortunately it is harder than it sounds!
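
To make the idea concrete, here is a deliberately simplified sketch. It assumes each page has already been reduced to a list of text lines and treats any line that occurs on every page as template. The function name `strip_shared_lines` is made up for the example, and real wrapper induction works on the DOM structure and is considerably more involved than this.

```python
from collections import Counter


def strip_shared_lines(pages):
    """pages: a list of pages, each given as a list of text lines.

    Returns the pages with every line removed that occurs on all of them.
    """
    # Count in how many pages each distinct line occurs (at most once per page).
    seen_in = Counter()
    for lines in pages:
        seen_in.update(set(lines))

    # Lines present on every page are assumed to come from the shared template.
    template = {line for line, count in seen_in.items() if count == len(pages)}

    return [[line for line in lines if line not in template] for lines in pages]
```

This needs at least two pages to be meaningful (with a single page every line counts as shared), and it will wrongly drop genuine content that happens to be identical across pages.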


This GitHub project solves your problem, but it's in Java. It may be worth a look: goose
