Grab website content that's not in the source code

2023-03-30 18:07

I want to grab some financial data from sites like http://www.fxstreet.com/rates-charts/currency-rates/

Up to now I've been using liburl to grab the source code and some regexp searches to extract the data, which I then store in a file.

Yet there is a little problem: on the page as I see it in the browser, the data is updated almost every second. When I open the source code, however, the data I'm looking for changes only every two minutes. So my program gets the data at a much lower time resolution than is possible.

I have two questions:

(i) How is it possible that source code which remains static for two minutes produces a table that changes every second? What is the mechanism?

(ii) How do I get the data with one-second time resolution, i.e. how do I read out a changing table like this when it's not shown in the source code?

Thanks in advance, David

You can use the network panel in Firebug to examine the HTTP requests being sent out (typically to fetch data) while the page is open. The page you've referenced appears to be sending POST requests to http://ttpush.fxstreet.com/http_push/, then receiving and parsing a JSON response.

Try sending a POST request to http://ttpush.fxstreet.com/http_push/connect and see what you get.

It will continuously load new data.

EDIT:

You can use liburl or Python, it doesn't really matter. Under HTTP, when you browse the web, your browser sends GET or POST requests. Go to the website, open the Developer Tools (Chrome) or Firebug (Firefox plugin), and you will see that after all the data is loaded there's a request that doesn't close: it stays open.

When you have a website and you want to fetch data continuously, there are a few techniques you can use:

1. Make separate requests (using Ajax) every few seconds. This opens a connection for each request, so if you want frequent data updates it's wasteful.
2. Use long polling or server polling: make one request that fetches the data. It stays open, and the server flushes data to the socket (to your browser) whenever it has something new. The TCP connection remains open, and when it times out you can reopen it. This is normally more efficient than the first technique, at the cost of holding a connection open.
3. Use XMPP or some other non-HTTP protocol. This is used mainly for chat, like Facebook/MSN I think, and probably Google's and some others.
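
For comparison, the first technique in the list (a fresh request every few seconds) could look roughly like the sketch below. The `poll` helper and its parameters are made up for illustration; `fetch` stands in for whatever function actually downloads and parses the page, and `sleep` is injectable only so the loop can be tried without waiting.

```python
import time


def poll(fetch, interval_seconds, max_polls, sleep=time.sleep):
    """Call fetch() max_polls times, pausing interval_seconds between calls.

    `fetch` is a hypothetical callable that performs one request and
    returns whatever data it extracted.
    """
    snapshots = []
    for attempt in range(max_polls):
        snapshots.append(fetch())
        # No need to wait after the final poll.
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    return snapshots


if __name__ == "__main__":
    # Stand-in for a real HTTP fetch: just hands out increasing numbers.
    counter = iter(range(100))
    print(poll(lambda: next(counter), interval_seconds=0.1, max_polls=3))
```

Every call to `fetch` pays for a full request and response, which is why this gets wasteful once you want updates every second.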

The website you posted uses the second method: when it detects a POST request to that page, it keeps the connection open and dumps data continuously. What you need to do is make a POST request to that page, and work out which parameters (if any) have to be sent. It doesn't matter how you make the request, as long as you send the right parameters.

You need to read the response with a delimiter. Probably every time they want the client to process data, they send \n or some other delimiter.

Hope this helps. If you still can't get around this, let me know and I'll go into more technical detail.
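
As a rough sketch of making that POST request with Python's standard `urllib`: the URL is the one mentioned above, but the parameter name `symbols` is purely hypothetical, since the real parameters have to be read off the request shown in the browser's network panel. `build_push_request` and `open_stream` are illustrative helper names, not part of any library.

```python
import urllib.parse
import urllib.request

# URL taken from the answer above; whether it still responds is not guaranteed.
PUSH_URL = "http://ttpush.fxstreet.com/http_push/connect"


def build_push_request(url, params):
    """Build a POST request. Supplying `data` is what makes urllib use POST."""
    body = urllib.parse.urlencode(params).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def open_stream(request, timeout=60):
    """Send the request and return the still-open response object.

    The caller reads from the response as data arrives and should be ready
    to reconnect when the server closes the connection or it times out.
    """
    return urllib.request.urlopen(request, timeout=timeout)


if __name__ == "__main__":
    # "symbols" is a hypothetical parameter name, used only for illustration.
    request = build_push_request(PUSH_URL, {"symbols": "EURUSD"})
    # Inspect the request without sending it.
    print(request.get_method(), request.full_url, request.data)
```

Nothing is sent until `open_stream` is called, so you can compare the built request against what the browser sends before trying it for real.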
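
A minimal sketch of reading with a delimiter, assuming the server separates messages with `\n` and that each message is JSON (both are guesses that need checking against the real response). `iter_messages` is an illustrative helper; it works on any file-like byte stream, such as the response object returned by `urllib.request.urlopen`, and the demo uses an in-memory stream with made-up data instead of the network.

```python
import io
import json


def iter_messages(stream, delimiter=b"\n", chunk_size=1024):
    """Yield complete messages from a byte stream, split on `delimiter`."""
    # Prefer read1(), which returns as soon as some data is available
    # instead of waiting for a full chunk; fall back to read() otherwise.
    read = getattr(stream, "read1", stream.read)
    buffer = b""
    while True:
        chunk = read(chunk_size)
        if not chunk:
            break  # the server closed the connection
        buffer += chunk
        while delimiter in buffer:
            message, buffer = buffer.split(delimiter, 1)
            if message:  # skip empty keep-alive lines
                yield message
    if buffer:
        yield buffer  # trailing data without a final delimiter


if __name__ == "__main__":
    # Made-up sample data standing in for the live stream.
    fake_stream = io.BytesIO(b'{"EURUSD": 1.1}\n{"EURUSD": 1.2}\n')
    for raw in iter_messages(fake_stream):
        print(json.loads(raw))
```

Buffering matters because a network read can end in the middle of a message; only data up to a delimiter is handed on, and the remainder waits for the next chunk.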
