开发者

Finding the web address of a Javascript link

I am using RCurl in R to try and download data from a website, but I'm having trouble finding out what URL to use. He开发者_开发知识库re is the site:

http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX

See how in the upper right, above the displayed sheet, there's a link to download the data as a .csv file? I was wondering if there was a way to find a regular HTTP address for that .csv file, because RCurl can't handle the Javascript commands.


I will give you a quick and dirty way to get the data. Firstly you can use Fiddler2 http://www.fiddler2.com/fiddler2/ to inspect the POST that your browser sends. This results in the following POST:

POST http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX HTTP/1.1
Host: www.invescopowershares.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Referer: http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX
Content-Type: application/x-www-form-urlencoded
Content-Length: 70669

__EVENTTARGET=ctl00%24MainPageLeft%24MainPageContent%24ExportHoldings1%24LinkButton1&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUKLTE1OTcxNjYzNw9kFgJmD2QWBAIDD2QWBAIDD2QWCAIBDw9kFgQeC2........

So we can see that 3 parameters are being posted namely __EVENTTARGET, __EVENTVALIDATION and __VIEWSTATE.

The required form for the postForm call would be:

postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)

Now comes the quick and dirty bit. I would just open a browser and get the relevant parameters that it recieves as follows:

library(rcom)
ie = comCreateObject('InternetExplorer.Application')
ie[["visible"]]=T # true for debugging
ie$Navigate2("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
while(comGetProperty(ie,"busy")||comGetProperty(ie,"ReadyState")<4){
Sys.sleep(1)
print(comGetProperty(ie,"ReadyState"))
}
myDoc<-comGetProperty(ie,"Document")
myPW<-comGetProperty(myDoc,"parentWindow")
comInvoke(myPW,"execScript","var dumVar1=theForm.__EVENTVALIDATION.value;var dumVar2=theForm.__VIEWSTATE.value;","JavaScript")
event.val<-myPW[["dumVar1"]]
view.state<-myPW[["dumVar2"]]
event.target<-"ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
ie$Quit()
ftarget<-"http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
web.data<-postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
write(web.data[1],'temp.csv')
fin.data<-read.csv('temp.csv')


> fin.data[1,]
  ticker SecurityNum                      Name CouponRate maturitydate
1    PGX   949746879 WELLS FARGO & COMPANY PFD       0.08             
     rating  Shares PercentageOfFund PositionDate
1 BBB+/Baa3 2538656       0.04442112   06/11/2012

__EVENTVALIDATION, __VIEWSTATE maybe be always the same or they maybe session cookies. You could probably get them using RCurl but as I say this is the quick and dirty solution and we just take the ones that Internet Explorer is given. Things to note:

1). This requires windows with IE installed to use the rcom bits.

2). If you are running ie9 you may need to add invescopowershares.com to the Compatibility View Settings (as microsoft seems to have blocked event.val<-myPW[["dumVar1"]] type com calls)

EDIT (UPDATE)

Having looked thru the website in more detail __EVENTVALIDATION, __VIEWSTATE are being set as javascript variables on the initial page. We can just parse these in a quick and dirty fashion as follows without resorting to calling a browser.

dum<-getURL("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX")
event.target<-"ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1"
event.val<-unlist(strsplit(dum,"__EVENTVALIDATION\" value=\""))[2]
event.val<-unlist(strsplit(event.val,"\" />\r\n\r\n<script"))[1]
view.state<-unlist(strsplit(dum,"id=\"__VIEWSTATE\" value=\""))[2]
view.state<-unlist(strsplit(view.state,"\" />\r\n\r\n\r\n<script"))[1]
ftarget<-"http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX"
web.data<-postForm(ftarget, "form name" = "aspnetForm", "method" = "POST", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm","__EVENTTARGET"=event.target,"__EVENTVALIDATION"=event.val,"__VIEWSTATE"=view.state)
write(web.data[1],'temp.csv')
fin.data<-read.csv('temp.csv')

The above should work cross platform.


Clicking on the Download link executes this piece of JavaScript:

__doPostBack('ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1','')

That __doPostBack function appears to simply fill in a couple of hidden form fields on that page then submit a POST request.

A quick googling shows that RCurl is capable of submitting a POST request. So, what you would need to do is look in the source of that page, find the form with name "aspnetForm", take all the fields from that form, and create your own POST request that submits the fields to the action URL (http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX).

Can't guarantee this will work, though. There appears to be a hidden form field named __VIEWSTATE that appears to encode some information, and I don't know how this factors in.


This is definitely the way to get the .csv file in RCurl, but I can't figure out what form fields I want to use in getForm to make it work. Should I use the fields from the doPostBack command that's attached to the "Download" link on the page, or should I use the fields from aspnetForm on the source page. Just for reference, the aspnetForm field we're interested in is:

" form name="aspnetForm" method="post" action="holdings.aspx?ticker=PGX" id="aspnetForm" style="margin:0px" "

... and the postForm request I just tried that didn't work was

postForm("http://www.invescopowershares.com/products/holdings.aspx?ticker=PGX", "form name" = "aspnetForm", "method" = "post", "action" = "holdings.aspx?ticker=PGX", "id" = "aspnetForm", "style" = "margin:0px")

Thanks for all of the help!


There is now a function in the qmao package that will do this for you. (It is based on code from a now deleted answer to this question.)

You can use the dlPowerShares function like this:

require("qmao")
Symbol <- "PGX"
dat <- qmao:::dlPowerShares(event.target = "ctl00$MainPageLeft$MainPageContent$ExportHoldings1$LinkButton1", 
                            action = paste0("holdings.aspx?ticker=", Symbol))
> head(dat)
  ticker SecurityNum                           Name CouponRate maturitydate    rating  Shares PercentageOfFund PositionDate
1    PGX   173080201         CITIGROUP CAPITAL XIII    0.07875   10/30/2040    BB/Ba2 2998647       0.04274939   08/31/2012
2    PGX   949746879      WELLS FARGO & COMPANY PFD    0.08000              BBB+/Baa3 2549992       0.03935854   08/31/2012
3    PGX   06739H362      BARCLAYS BK PLC              0.08125                A-/Baa3 2757635       0.03644835   08/31/2012
4    PGX   46625H621      JPMORGAN CHASE               0.08625              BBB+/Baa1 2416021       0.03310707   08/31/2012
5    PGX   060505765   BANK OF AMERICA CORP PFD 8.2    0.08200                 BB+/B1 2345508       0.03128002   08/31/2012
6    PGX   060505559 BANC OF AMERICA CORP PFD 8.625    0.08625                 BB+/B1 2259484       0.03001599   08/31/2012

In the code above, event.target is the first string inside of the javascript:__doPostBack() function that you will get when you right click the “Download” link and “Copy Link Address”.

action is the product-specific part of the action url.

Internally, the code follows Jeff's suggestion in his answer and searches the source of the page for the values of the fields for the "aspnetForm". It then uses those values in a call to postForm (from the RCurl package.)

In the qmao package, dlPowerShares is used by getHoldings.powershares. Also, getHoldings will call getHoldings.powershares if one of the Symbols passed to it is the symbol of a PowerShares ETF.


p.s. if qmao:::dlPowerShares is called with its defaults, it will download the PowerShares product list from http://www.invescopowershares.com/products/

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜