Bot Web Quality
I am looking for a good open-source bot that can check site-quality issues of the kind often required for Google indexing.
For example
- find duplicate titles
- invalid links (JSpider does this, and I expect many others do too)
- identical pages served from different URLs
- etc., where etc. equals Google quality reqs.
Your requirements are very specific so it's very unlikely there is an open source product that does exactly what you want.
There are, however, many open source frameworks for building web crawlers. Which one you use depends on your language preference.
For example:
- For Python, try Scrapy
- For Java, try Arachnid
- For Ruby, try Anemone
- For Perl, try WWW::Spider
Generally, these frameworks provide classes for crawling and scraping a site's pages according to rules you supply, but it's then up to you to extract the data you need by hooking in your own code.
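To make that concrete, here is a minimal sketch of the kind of logic you would hook into a crawler's page callback to catch two of the checks you listed: duplicate titles and identical pages at different URLs. This is stdlib-only Python (the crawling itself is left to whichever framework you pick); the function and parameter names are my own illustration, not part of any of the libraries above.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element of an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def page_fingerprint(html):
    """Hash of the raw body, used to spot byte-identical pages at different URLs."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """pages: dict mapping URL -> fetched HTML (filled in by your crawler's callback).

    Returns (dup_titles, dup_bodies):
      dup_titles: dict of title -> list of URLs sharing that title
      dup_bodies: list of URL groups whose bodies are identical
    """
    by_title = defaultdict(list)
    by_hash = defaultdict(list)
    for url, html in pages.items():
        by_title[extract_title(html)].append(url)
        by_hash[page_fingerprint(html)].append(url)
    dup_titles = {t: urls for t, urls in by_title.items() if len(urls) > 1}
    dup_bodies = [urls for urls in by_hash.values() if len(urls) > 1]
    return dup_titles, dup_bodies
```

In a real spider you would call `find_duplicates` (or accumulate into the two dicts incrementally) from whatever per-page hook your framework offers, e.g. Scrapy's `parse` callback. Note that the body-hash check only catches byte-identical pages; near-duplicates would need fuzzier fingerprinting.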
Google Webmaster Tools is a web-based service rather than an on-demand bot, and it doesn't do everything you've asked for. It does cover some of it, plus a lot of things you haven't asked for, and, being from Google, it no doubt matches your "etc., where etc. equals Google quality reqs." better than anything else will.