Suggestions for the best customizable crawlers and scrapers

I have a website that is pretty good but has very little content. I would like to add information such as news about a particular sector (for example politics, Hollywood, etc.). I believe crawlers are the best approach for this. Is my understanding correct? Please suggest any other way to gather information from various sources without using crawlers.

Secondly, I have been researching for the last two days and cannot find a single tool that is capable of doing this. I want the crawler to find information, normalize it, and store it in a MySQL database. That sounds pretty simple, but it isn't for me.

Since crawling is very resource- and time-consuming, what should I take into consideration before choosing a crawler? I also want to customize it, so an open-source tool that is easy to customize would be great.

Any source that explains the factors to consider when creating crawlers, or that teaches about crawlers in general, would be great. I prefer coding in Java, but I can code in another language if you think it is better suited. I hope I have given enough information. Please don't hesitate to ask if you need more information to make a suggestion.


You can use HTTrack to copy a target website. There is also a Firefox plugin named SpiderZilla. However, both of these just save the pages.

If you want to parse the data in the pages, you can use simple_html_dom (a PHP library) and store the information in MySQL.
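As a rough illustration of the parse, normalize and store step, here is a minimal sketch. It is not simple_html_dom: it uses Python's standard html.parser instead, and the standard sqlite3 module stands in for MySQL so the example runs without a database server. The page layout (headlines as links inside h2 tags), the `news` table, and the names `HeadlineParser`, `parse_headlines` and `store_headlines` are all assumptions made up for the example.

```python
import sqlite3
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects (title, url) for every <a href=...> found inside an <h2>.

    The "headline = link inside h2" layout is an assumption for this sketch;
    a real site will need its own rules.
    """

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.current_href = None
        self.current_text = []
        self.items = []  # list of (title, url) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        # Only collect text while inside a headline link.
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            # Normalize: collapse runs of whitespace and newlines in the title.
            title = " ".join("".join(self.current_text).split())
            if title:
                self.items.append((title, self.current_href))
            self.current_href = None
        elif tag == "h2":
            self.in_h2 = False


def parse_headlines(html):
    """Return a list of (title, url) tuples extracted from an HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.items


def store_headlines(conn, category, items):
    """Store headlines, skipping URLs that are already in the table.

    Hypothetical schema. "INSERT OR IGNORE" is SQLite syntax; MySQL spells
    the same idea "INSERT IGNORE".
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        "url TEXT PRIMARY KEY, title TEXT NOT NULL, category TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO news (url, title, category) VALUES (?, ?, ?)",
        [(url, title, category) for title, url in items],
    )
    conn.commit()


# Made-up sample page used in place of a real download.
SAMPLE_HTML = """
<html><body>
  <h2><a href="/politics/budget-vote">Budget   vote
      delayed</a></h2>
  <h2><a href="/politics/election">Election date set</a></h2>
  <p><a href="/about">About us</a></p>
</body></html>
"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_headlines(conn, "politics", parse_headlines(SAMPLE_HTML))
    for row in conn.execute("SELECT title, url FROM news ORDER BY url"):
        print(row)
```

Using the URL as the primary key means re-crawling the same page does not create duplicate rows, which is probably the first thing you will run into once the crawler runs on a schedule.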


Try the GNU Wget tool. Its options let you control how it crawls and creates data dumps of web pages, for example the recursion depth, which file types to accept, and the wait between requests. It is open source, customizable, and very fast.
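To make that concrete, here is a small sketch that assembles a recursive Wget invocation from Python. The flags used (`--recursive`, `--level`, `--no-parent`, `--wait`, `--accept`, `--directory-prefix`) are standard Wget options; the helper names `build_wget_command` and `run_crawl`, the example URL, and the `dump` directory are made up for illustration.

```python
import subprocess


def build_wget_command(start_url, output_dir, depth=2, wait_seconds=1):
    """Build the argument list for a polite, depth-limited recursive crawl."""
    return [
        "wget",
        "--recursive",                       # follow links found in pages
        f"--level={depth}",                  # stop after this many hops
        "--no-parent",                       # never climb above the start path
        f"--wait={wait_seconds}",            # pause between requests
        "--accept=html,htm",                 # keep only HTML files
        f"--directory-prefix={output_dir}",  # where the dump is written
        start_url,
    ]


def run_crawl(start_url, output_dir):
    """Run the crawl. Requires wget to be installed and on PATH."""
    subprocess.run(build_wget_command(start_url, output_dir), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_crawl(...) to actually download.
    print(" ".join(build_wget_command("https://example.com/news/", "dump")))
```

The saved pages can then be parsed and loaded into the database in a separate step, which keeps downloading and extraction independent of each other.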
