
Extracting useful data from arbitrary HTML pages?

Is there a library for Ruby or PHP that can parse HTML pages and extract the unique data by comparing each page with other similar pages? It should use some sort of text mining to identify which text is more likely to be noise and repetitive, and which text is more unique and useful.


I'm a PHP guy and have no idea about Ruby, but I think what you want is trivial to achieve:

  • Use something like Simple HTML DOM to parse the pages.
  • Compare the DOM elements of each page with the same elements on the other pages.
  • Get the path of every element whose content differs between pages; those will be your signal elements.
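
The steps above can be sketched roughly as follows. This is only an illustration, not a finished tool: it uses Python's standard `html.parser` in place of Simple HTML DOM, and the names `PathTextCollector`, `path_texts`, `signal_paths` and the two sample pages are made up for the example. It also assumes the pages share the same template, so that an element sits at the same path on every page.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Record the text found directly under each element path."""

    def __init__(self):
        super().__init__()
        self.stack = []           # path segments of the currently open elements
        self.child_counts = [{}]  # per open element: tag name -> children seen so far
        self.texts = {}           # element path -> text directly inside it

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        if tag in VOID_TAGS:
            return
        self.stack.append(f"{tag}[{index}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and self.stack:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def path_texts(html):
    """Return a dict mapping element path -> text for one page."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def signal_paths(pages):
    """Return the paths whose text is not identical on every page."""
    maps = [path_texts(page) for page in pages]
    all_paths = set().union(*maps)
    return [path for path in sorted(all_paths)
            if len({m.get(path) for m in maps}) > 1]


# Two hypothetical pages built from the same template.
page_a = ("<html><body><div>Site menu</div>"
          "<div><h1>Ruby tips</h1><p>Use blocks.</p></div>"
          "<div>Copyright</div></body></html>")
page_b = ("<html><body><div>Site menu</div>"
          "<div><h1>PHP tips</h1><p>Use arrays.</p></div>"
          "<div>Copyright</div></body></html>")

print(signal_paths([page_a, page_b]))
```

Paths whose text is the same on every page (the menu and the copyright line here) are treated as noise; the heading and paragraph paths are reported as signal. Real pages are messier than this, so an exact-match comparison is probably too strict; some fuzzy comparison of the text would likely be needed.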