
Data Extraction?

I am looking for methods to extract various data from various websites. I know there are programs you can buy, but since I am trying to learn, I want to do it myself. Does anyone have suggestions for a general structure, and if so, what language would you write it in? My first thought was Java, but I am more than willing and grateful to hear anyone else's opinion.


What kind of data are you trying to extract, and from which websites? A little more detail on your idea or project would be helpful.

I recently needed to look into and try a few HTML parsers to get some data into a more consolidated format.

I tried JTidy (http://jtidy.sourceforge.net/) and looked into Web-Harvest (http://web-harvest.sourceforge.net/). JTidy wouldn't quite do what I wanted and Web-Harvest was overkill.

I ultimately settled on using Java + htmlparser (http://htmlparser.sourceforge.net/).

It took very little development time to get what I needed and htmlparser allows you to form 'filters' that search for specific things in the DOM.
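
To give a rough idea of what filter-style extraction looks like, here is a minimal sketch that pulls the href of every link out of an HTML string. Note that it does not use htmlparser itself: it uses the parser that ships with the JDK (javax.swing.text.html) as a stand-in, so it runs without extra jars. The class name LinkExtractor and the method extractLinks are made up for illustration, and the "filter" is just the tag check inside the callback.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of htmlparser: a JDK-only stand-in
// for the "filter the parsed document for specific tags" idea.
public class LinkExtractor {

    /** Collects the href attribute of every <a> tag in the given HTML. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();

        // The callback is invoked for each tag the parser encounters;
        // the tag check below plays the role of the filter.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (HTML.Tag.A.equals(tag)) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href=\"http://example.com/one\">One</a>"
                + "<p>Some text</p>"
                + "<a href=\"/two\">Two</a>"
                + "</body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

For a real site you would first download the page and pass its contents to extractLinks. A dedicated library such as htmlparser generally copes better with messy real-world markup than the JDK parser does.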


Look at Hadoop (distributed processing across a grid of machines) and Solr (indexing and search). They support heavy processing and efficient indexing (for efficient searching), respectively.
