
Scraping ASP.NET site with Ruby

I would like to scrape the search results of this ASP.NET site using Ruby and preferably just using Hpricot (I cannot open an instance of Firefox): http://www.ngosinfo.gov.pk/SearchResults.aspx?name=&foa=0

However, I am having trouble figuring out how to go through each page of results. Basically, I need to simulate clicking on links like these:

<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$2','')" class="blue_11" id="ctl00_ContentPlaceHolder1_Pager1">2</a>
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$3','')" class="blue_11" id="ctl00_ContentPlaceHolder1_Pager1">3</a>

etc.

I tried using Net::HTTP to handle the POST. The response was the correct HTML page, but it contained no search results (I'm probably not doing it correctly). In addition, the page URL contains no parameter indicating the page number, so I cannot request a specific page of results that way.

Any help would be greatly appreciated.


Using mechanize-1.0.0, the following works:

 require 'mechanize'

 agent = Mechanize.new
 page = agent.get('http://127.0.0.1/some.aspx')

 # ASP.NET WebForms reads the postback target and argument from these two fields
 form = page.form("aspnetForm")
 form.add_field!('__EVENTARGUMENT', 'Page$2')
 form.add_field!('__EVENTTARGET', 'ctl00$ContentPlaceHolder1$gvwSomeList')
 page = agent.submit(form) # this gets page 2
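
The pager links in the question pass the target as the first argument of `__doPostBack` and leave the second one empty, so the two field values can be read straight from each link's `href`. The following is only a rough sketch in plain Ruby: the helper names `parse_postback` and `postback_fields` are made up for illustration, and the `__VIEWSTATE` value is a placeholder standing in for whatever hidden fields the real page contains.

```ruby
# Hypothetical helpers; only the __doPostBack link format comes from the question.
POSTBACK_RE = /__doPostBack\('([^']*)','([^']*)'\)/

# Returns [event_target, event_argument] for a pager link href,
# or nil if the href is not a __doPostBack call.
def parse_postback(href)
  match = POSTBACK_RE.match(href.to_s)
  match && [match[1], match[2]]
end

# Merges the postback target and argument into the hidden fields already
# present in the form (such as __VIEWSTATE), which are sent back unchanged.
def postback_fields(hidden_fields, href)
  target, argument = parse_postback(href)
  raise ArgumentError, "not a __doPostBack link: #{href.inspect}" unless target

  hidden_fields.merge('__EVENTTARGET' => target, '__EVENTARGUMENT' => argument)
end

href = "javascript:__doPostBack('ctl00$ContentPlaceHolder1$Pager1$3','')"
fields = postback_fields({ '__VIEWSTATE' => 'placeholder-from-page' }, href)
puts fields['__EVENTTARGET']
```

With Mechanize, the two values would go into the form before `agent.submit(form)`, as in the snippet above. A hand-built Net::HTTP request that leaves out the page's hidden fields may be why the POST came back without results, though that is a guess without seeing the request.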


Even better, check out Mechanize. A good starting point for screen scraping is the railscasts.com episode on Mechanize.


If you're just getting started, you might want to check out Nokogiri. It's more lightweight and better-documented than Hpricot (which appears to have been abandoned).

Edit: Jakub Hampl is correct - Mechanize is what you're looking for to interact with web sites. It works in cooperation with Nokogiri (which parses HTML and XML).
