Grab website content that's not in the source code

I want to grab some financial data from sites like http://www.fxstreet.com/rates-charts/currency-rates/

Up to now I have been using liburl to fetch the page source and a regexp search to extract the data, which I then store in a file.

Yet there is a little problem: on the page as I see it in the browser, the data is updated almost every second. When I open the source code, however, the data I'm looking for changes only every two minutes. So my program gets the data at a much lower time resolution than is possible.

I have two questions:

(i) How is it possible that source code which remains static for two minutes produces a table that changes every second? What is the mechanism?

(ii) How do I get the data with one-second time resolution, i.e. how do I read out such a changing table that's not shown in the source code?

Thanks in advance, David


You can use the network panel in Firebug to examine the HTTP requests being sent out (typically to fetch data) while the page is open. The page you referenced appears to send POST requests to http://ttpush.fxstreet.com/http_push/, then receive and parse a JSON response.


Try sending a POST request to http://ttpush.fxstreet.com/http_push/connect and see what you get.

It will continuously load new data.

EDIT:

You can use liburl or Python, it doesn't really matter. Under HTTP, when you browse the web, you send GET or POST requests. Go to the website, open the Developer Tools (Chrome) or Firebug (Firefox plugin), and you will see that after all the data is loaded there is one request that doesn't close - it stays open.

When you have a website and you want to fetch data continuously, there are a few techniques:

  • make separate requests (using AJAX) every few seconds - this opens a connection for each request, and if you want frequent data updates it's wasteful
  • use long polling or server polling - make one request that fetches the data. It stays open, and the server flushes data to the socket (to your browser) whenever it needs to. The TCP connection remains open. When the connection times out, you can reopen it. This is normally more efficient than the above, but the connection remains open.
  • use XMPP or some other protocol (not HTTP) - used mainly for chat, like Facebook/MSN I think, and probably Google's and some others.
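
To make the first technique concrete, here is a minimal Python sketch of polling at a fixed interval. The names `poll`, `fetch`, `interval` and `max_polls` are made up for illustration; `fetch` stands for whatever function you already have that downloads the page and extracts the value you care about.

```python
import time
from typing import Callable, Iterator, Optional


def poll(fetch: Callable[[], str],
         interval: float,
         max_polls: Optional[int] = None,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Call fetch() repeatedly and yield a value only when it has changed."""
    last = None
    count = 0
    while max_polls is None or count < max_polls:
        current = fetch()  # one separate request per iteration
        if current != last:
            yield current
            last = current
        count += 1
        # Wait before the next request, except after the final one.
        if max_polls is None or count < max_polls:
            sleep(interval)
```

Something like `for value in poll(my_fetch, interval=1.0): ...` would then give roughly one-second resolution, at the cost of one request per second.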

The website you posted uses the second method: when it detects a POST request to that page, it keeps the connection open and dumps data continuously. What you need to do is make a POST request to that page, and find out which parameters (if any) have to be sent. It doesn't matter how you make the request, as long as you send the right parameters.

You need to read the response using a delimiter: every time they want the client to process data, they probably send \n or some other delimiter.

Hope this helps. If you still can't get this to work, let me know and I'll go into more technical detail.
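
As a rough sketch of reading such an open response in Python with only the standard library: `read_chunks` and `iter_messages` are hypothetical helper names, the `\n` delimiter is only a guess as noted above, and the POST body is left to you, since the parameters the server expects have to be taken from the network panel.

```python
import urllib.request
from typing import Iterable, Iterator


def iter_messages(chunks: Iterable[bytes],
                  delimiter: bytes = b"\n") -> Iterator[bytes]:
    """Yield complete delimiter-terminated messages from a stream of chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A chunk may contain zero, one, or several complete messages.
        while delimiter in buffer:
            message, buffer = buffer.split(delimiter, 1)
            if message:  # skip empty messages between adjacent delimiters
                yield message
    # Data left over when the stream ends was never terminated.
    if buffer:
        yield buffer


def read_chunks(url: str, body: bytes, chunk_size: int = 1024) -> Iterator[bytes]:
    """POST body to url and yield raw chunks as they arrive."""
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=60) as response:
        while True:
            # read1 returns as soon as some data is available.
            chunk = response.read1(chunk_size)
            if not chunk:
                break  # server closed the connection; reopen if needed
            yield chunk
```

You would then loop with `for message in iter_messages(read_chunks(url, body)): ...` and parse each message. If the server turns out to use a different delimiter, pass it as the second argument.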
