Multilevel web spider with regex match?
I need a web spider to find certain links with regex.
The spider would visit a list of websites, find links that match a list of regex patterns, visit those matched links, and repeat until it reaches the configured depth.
I was about to code this in PHP, but I'm not very good with threads in PHP and I need threads for this application.
So, what do you think is the best solution?
Maybe there's some existing app/code I could configure to create this spider.
There are several crawlers out there which you can use for free:
- Nutch
- Heritrix
- Wikipedia list of open-source crawlers
Nutch is probably the best of these. If you use it, I would recommend taking advantage of its OPIC (On-line Page Importance Computation) scoring instead of specifying the crawl depth yourself. OPIC lets the crawler decide which pages to crawl next based on their estimated importance, so you don't need an artificial depth limit.
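If a full crawler turns out to be more than you need, the loop described in the question (visit seeds, keep links matching a regex list, repeat to a fixed depth) is small enough to sketch by hand. The following is only a minimal illustration in Python using the standard library, not how Nutch or Heritrix work internally. The names `crawl`, `matching_links`, `LinkParser` and the `fetch_page` parameter are made up for this example, and it ignores robots.txt, rate limiting and non-HTML content, which a real spider should handle.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Download a page as text; any failure is treated as an empty page."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""


def matching_links(base_url, html, patterns):
    """Return absolute links found in html that match any compiled pattern."""
    parser = LinkParser()
    parser.feed(html)
    links = set()
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute = urldefrag(urljoin(base_url, href)).url
        if any(p.search(absolute) for p in patterns):
            links.add(absolute)
    return links


def crawl(seeds, patterns, max_depth, fetch_page=fetch, workers=8):
    """Breadth-first crawl; returns {url: depth} for every visited page.

    Seeds are depth 0. Each level is fetched in parallel by a thread pool,
    while link extraction and bookkeeping stay in the main thread.
    """
    compiled = [re.compile(p) for p in patterns]
    visited = {}
    frontier = set(seeds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for depth in range(max_depth + 1):
            batch = sorted(frontier.difference(visited))
            if not batch:
                break
            for url in batch:
                visited[url] = depth
            pages = pool.map(fetch_page, batch)  # results keep batch order
            frontier = set()
            for url, html in zip(batch, pages):
                frontier |= matching_links(url, html, compiled)
    return visited
```

You would call it with something like `crawl(["http://example.com/"], [r"/articles/\d+"], max_depth=2)`. Passing your own `fetch_page` function lets you swap in a different HTTP client or test the logic without touching the network.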