How to approach a Google Groups discussions crawler
As an exercise in RSS, I would like to be able to search through pretty much all Unix discussions on this group:
comp.unix.shell
I know enough Python and understand basic RSS, but I am stuck on ... how do I grab all messages between particular dates, or at least all messages between Nth recent and Mth recent?
High-level descriptions and pseudo-code are welcome.
Thank you!
EDIT:
I would like to be able to go back more than 100 messages, but I do not want to do it by parsing 10 messages at a time, such as with this URL:
http://groups.google.com/group/comp.unix.shell/topics?hl=en&start=2000&sa=N
There must be a better way.
Crawling Google Groups violates Google's Terms of Service, specifically the phrase:
use any robot, spider, site search/retrieval application, or other device to retrieve or index any portion of the Service or collect information about users for any unauthorized purpose
Are you sure you want to announce that you're doing that so openly? And are you blind to the consequences of your result?
For the N most recent messages, it seems you could pass a parameter such as ?num=50 in the feed URL.
For example, the 50 newest messages from the comp.unix.shell group:
http://groups.google.com/group/comp.unix.shell/feed/atom_v1_0_msgs.xml?num=50
Then pick up a feed-parsing library such as Universal Feed Parser (feedparser).
Each entry in feedparser has an updated_parsed attribute, which you could use to check whether a message falls within a particular date range:
>>> e.updated_parsed # parses all date formats
(2005, 11, 9, 11, 56, 34, 2, 313, 0)
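To make the date-range check concrete, here is a minimal sketch. It assumes each entry exposes updated_parsed as a UTC time tuple, as in the output above; the names entries_between and fetch_entries are made up for illustration, and whether the feed really honors ?num=50 is untested here.

```python
import calendar
from datetime import datetime, timezone

# Feed URL from the answer above; the num parameter is an assumption.
FEED_URL = "http://groups.google.com/group/comp.unix.shell/feed/atom_v1_0_msgs.xml?num=50"


def entries_between(entries, start, end):
    """Return entries whose updated_parsed lies in [start, end).

    start and end are timezone-aware UTC datetimes. Entries without an
    updated_parsed value are skipped.
    """
    selected = []
    for entry in entries:
        parsed = entry.get("updated_parsed")  # UTC time tuple, or None
        if parsed is None:
            continue
        # timegm treats the tuple as UTC, unlike time.mktime (local time).
        updated = datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)
        if start <= updated < end:
            selected.append(entry)
    return selected


def fetch_entries(url=FEED_URL):
    """Hypothetical helper: download and parse the feed with feedparser."""
    import feedparser  # third-party: pip install feedparser
    return feedparser.parse(url).entries
```

Usage would look something like entries_between(fetch_entries(), start, end), with start and end built as datetime(2005, 11, 1, tzinfo=timezone.utc) and so on. Note this only filters whatever the feed returns, so it cannot reach further back than the feed itself does.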
As Randal mentioned, this violates Google's ToS -- however, as a hypothetical, or for use on another site without these restrictions, you could pretty easily rig something up with urllib and BeautifulSoup. Use urllib to open the page and then use BeautifulSoup to grab all the thread topics (and links, if you want to crawl deeper). You can then programmatically find the link to the next page of results and make another urllib request for page 2 -- then repeat the process.
At this point you should have all the raw data, then it is just a matter of manipulating the data and implementing your searching functionality.
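The fetch, parse, follow-next-page loop could be sketched roughly as below, for a site that permits crawling. To keep it dependency-free, this swaps in the standard library's html.parser where the answer suggests BeautifulSoup; the loop itself is the same. The link text "Next", the max_pages cap, and the names LinkCollector and crawl are all assumptions for illustration, and a real page would need its own selectors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download a page with urllib and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch, next_text="Next", max_pages=5):
    """Follow 'next page' links, returning (title, absolute_url) pairs."""
    url, seen, topics = start_url, set(), []
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        next_url = None
        for href, text in parser.links:
            if text == next_text:  # assumed label of the pagination link
                next_url = urljoin(url, href)
            else:
                topics.append((text, urljoin(url, href)))
        url = next_url
    return topics
```

Passing fetch as a parameter makes it easy to try the loop against canned HTML before pointing it at a live site. The seen set and max_pages guard against pagination loops.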
Have you thought about Yahoo's YQL? It's not too bad and can access a lot of APIs. http://developer.yahoo.com/yql/
I don't know if Groups is supported, but you can access RSS feeds. It could be helpful.