How to download search results from Google Scholar using R?

I would like to extract the first 100 results (say) of a Google Scholar search using R. Does anyone know how to do it?

To be precise, I just need the name of the paper, authors and citation count.

P.S. Would this be legal?


Please consider the updated biobucket post:

http://thebiobucket.blogspot.com/2011/11/r-function-google-scholar-webscraper.html


There are some Python and Perl scrapers out there that you might be able to adapt, linked at http://bmb-common.blogspot.com/2011/02/does-google-scholar-suck-or-am-i-just.html


I can't speak to the legalities of your task, but there are a few ways you can go about this. While I am not strong in XPath, it might be the best way. I believe that you can use the XML package to retrieve the page contents and use XPath to extract the data of the elements you need.

For instance, I use Chrome as my browser, and when I inspected the page with Developer Tools, there does appear to be a structure to the page, with the data "hidden" inside various tags that you should be able to extract fairly easily using XPath.

Check out this link for an example of using XPath.
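
To make the idea concrete, here is a minimal sketch of the XML + XPath approach. It parses a small inline HTML string rather than a live page, and the tag and class names (`result`, `title`, `authors`, `cited`) are made up for illustration. The real Google Scholar markup is different and may change, so you would have to inspect the page and adjust the XPath expressions yourself.

```r
library(XML)

# Hypothetical stand-in for a results page; the real Google Scholar markup
# and class names differ and can change without notice.
html <- '
<html><body>
  <div class="result">
    <h3 class="title">First paper</h3>
    <div class="authors">A Author, B Author</div>
    <div class="cited">Cited by 12</div>
  </div>
  <div class="result">
    <h3 class="title">Second paper</h3>
    <div class="authors">C Author</div>
    <div class="cited">Cited by 3</div>
  </div>
</body></html>'

# Parse the HTML from a string instead of a URL or file.
doc <- htmlParse(html, asText = TRUE)

# One XPath query per field; xmlValue returns the text content of each node.
titles  <- xpathSApply(doc, "//div[@class='result']/h3[@class='title']", xmlValue)
authors <- xpathSApply(doc, "//div[@class='result']/div[@class='authors']", xmlValue)
cited   <- xpathSApply(doc, "//div[@class='result']/div[@class='cited']", xmlValue)

# Combine the fields into one row per paper.
results <- data.frame(
  title     = titles,
  authors   = authors,
  citations = as.integer(sub("Cited by ", "", cited, fixed = TRUE)),
  stringsAsFactors = FALSE
)
print(results)
```

Note that this assumes every result has all three fields. If a paper has no citation count on the real page, the three vectors would no longer line up, and it would be safer to loop over the result nodes and query each one separately.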

HTH and Good Luck


You can definitely retrieve the HTML content of the page using RCurl and parse it using the XML package, as suggested by Btibert3. The only issue you might face is that Google won't allow you to run queries in a "robotic" way. After something like 200 queries in a short period of time, it stops returning results. Maybe that's different with Google Scholar, but I doubt it...
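
One way to stay under such a limit is to pause between requests. The sketch below is only an illustration: `fetch_politely`, `fetch_fun` and the 5-second default delay are my own hypothetical names and values, not anything documented by Google, and I don't know what delay (if any) would actually be enough. The dry run uses a stub instead of a real download; for real pages you could pass something like `function(u) RCurl::getURL(u)`.

```r
# fetch_fun is a hypothetical placeholder for whatever downloads one page,
# e.g. function(u) RCurl::getURL(u). delay_sec is an assumed value.
fetch_politely <- function(urls, fetch_fun, delay_sec = 5) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    pages[[i]] <- fetch_fun(urls[i])
    # Pause between requests, but not after the last one.
    if (i < length(urls)) Sys.sleep(delay_sec)
  }
  pages
}

# Dry run with a stub fetcher and placeholder URLs, so nothing is downloaded.
urls  <- paste0("http://example.org/page", 1:3)
pages <- fetch_politely(urls, function(u) paste("content of", u), delay_sec = 0.1)
length(pages)
```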


A solution was recently published here:

http://thebiobucket.blogspot.com/2011/11/visually-examine-google-scholar-search.html
