When is a webpage considered to be "loaded", in the presence of JS etc
Information: I have no knowledge of JavaScript. None.
I'm curious if there's any way to determine when a webpage is completely loaded? Let's say I have a crawler, that uses webkit to render pages (and webkit's JS engine to parse any JS functions and finish processing the DOM etc), I'm curious if there's any way to know when a webpage is 'done' loading? What I consider to be done:
1) All scripts have finished executing. 2) No pending AJAX calls. 3) The DOM is completely processed and loaded based on currently available information.
For a more concrete hypothetical, from looking at the source of a few sites, I see that they load ads by using a script tag that injects stuff into the DOM, and issues AJAX calls to load and populate the ads. How can one determine when all this is done?
(replace the example by anything asynchronous, I guess. I just couldn't think of anything more universal than the above.)
By "detect", I mean, in any manner possible. For instance, injecting a bit of JS code into the page that writes something to the page to let me know stuff is done. Or for instance with QtWebkit, JS can call into C++ (I believe), so a JS snippet could call a C++ function to let it know when the page was 'loaded'. Whatever works, in short.
The current 'naive' implementation I have just sits and waits for a few seconds after loading a page. It's stupid.
Please be as detailed as possible, and feel free to say 'read this first' if more background information is required prior to me understanding the answer.
Thank you very much!
It's in general impossible to say whether a page that contains asynchronous, script-driven content is truly done loading. Aside from the fundamental issue of the halting problem, it's possible for scripts or plugins to register for periodic timer events and continue modifying or adding to the page indefinitely.
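To make the timer argument concrete, here is a toy model, not real browser code. The "page" is a plain list of nodes and a hand-rolled scheduler stands in for the browser's timer queue; makeScheduler, loadPage and the node strings are all hypothetical names made up for this sketch. It shows that a single periodic timer is enough for the page to keep changing no matter how long you wait.

```javascript
// Toy model only: nothing here touches a real DOM or real timers.
// makeScheduler and loadPage are hypothetical names for illustration.
function makeScheduler() {
  const timers = [];
  return {
    // Stand-in for setInterval: run `callback` on every tick, forever.
    setInterval(callback) { timers.push(callback); },
    // Advance simulated time by one tick, firing every registered timer.
    tick() { timers.forEach((callback) => callback()); },
    // Number of periodic timers still registered.
    pendingTimers() { return timers.length; },
  };
}

function loadPage(scheduler) {
  const page = { nodes: ["<h1>Headline</h1>"] };
  // The page's own script: append more content on every timer tick.
  scheduler.setInterval(() => {
    page.nodes.push(`<div>update ${page.nodes.length}</div>`);
  });
  return page;
}

const scheduler = makeScheduler();
const page = loadPage(scheduler);
const sizeAfterInitialLoad = page.nodes.length;

// Wait "a while": the page is still growing, and the timer is still armed.
for (let i = 0; i < 5; i++) scheduler.tick();
const sizeLater = page.nodes.length;

console.log(sizeAfterInitialLoad, sizeLater, scheduler.pendingTimers()); // prints: 1 6 1
```

However many ticks you wait, there is always one more pending, so "no work left" never becomes true for a page like this.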
The usual convention I've seen is to treat a page as done loading once the entire DOM has been loaded, the resources (images, stylesheets, scripts, etc.) referenced directly from that DOM have been loaded, and all script code has been read and executed through once. Text emitted via document.write() is treated for this purpose as if it were directly included in the source HTML. If you're using QtWebKit, I believe this is the behavior you will see if you connect to the signal QWebPage::loadFinished(bool). (You can get the QWebPage that owns a QWebFrame using the accessor page().)
Deferred actions set up by the script code, whether by timers, events waiting for other resources to finish loading, or what have you, are not counted; media players and other plugins may complicate things further, because each media type or even each player may have a different standard for what constitutes "loaded".
A number of recent JavaScript libraries exploit this behavior to improve perceived page load times by loading an incomplete page containing just the first screen's worth of content plus some script, and not actually beginning to load images and content "below the fold" until after the first screenful or so is done loading and rendering. It's not very friendly to automated tools, crawlers or those who consider JavaScript a privilege to be earned by trusted sites, though.
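Since a true "done" signal does not exist, one common compromise (a heuristic I am adding here, not something QtWebKit provides) is to call the page "settled" once there are no requests in flight and nothing has changed for a short quiet window, with a hard timeout for pages that never go quiet. The sketch below shows only the bookkeeping. QuiescenceTracker, quietMs, timeoutMs and the method names are hypothetical, and timestamps are passed in explicitly so the logic can run outside a browser.

```javascript
// Hypothetical helper: reports when a page looks "settled" under a heuristic,
// not when it is provably finished. Times are plain millisecond numbers.
class QuiescenceTracker {
  constructor({ quietMs, timeoutMs, startedAt }) {
    this.quietMs = quietMs;         // silence required before declaring "settled"
    this.timeoutMs = timeoutMs;     // hard cap so endless timers cannot stall a crawl
    this.startedAt = startedAt;
    this.lastActivityAt = startedAt;
    this.pendingRequests = 0;
  }

  // Call when an asynchronous request is sent.
  requestStarted(now) {
    this.pendingRequests += 1;
    this.lastActivityAt = now;
  }

  // Call when a request completes or fails.
  requestFinished(now) {
    this.pendingRequests = Math.max(0, this.pendingRequests - 1);
    this.lastActivityAt = now;
  }

  // Call whenever the DOM changes.
  domMutated(now) {
    this.lastActivityAt = now;
  }

  // Returns "settled", "timeout", or "waiting".
  status(now) {
    const quietFor = now - this.lastActivityAt;
    if (this.pendingRequests === 0 && quietFor >= this.quietMs) return "settled";
    if (now - this.startedAt >= this.timeoutMs) return "timeout";
    return "waiting";
  }
}

// An ad script issues a request, then injects its result into the DOM.
const tracker = new QuiescenceTracker({ quietMs: 500, timeoutMs: 10000, startedAt: 0 });
tracker.requestStarted(100);
console.log(tracker.status(900));   // waiting (request still in flight)
tracker.requestFinished(1200);
tracker.domMutated(1250);
console.log(tracker.status(1500));  // waiting (only 250 ms of quiet)
console.log(tracker.status(1750));  // settled

// A page that mutates every 100 ms never goes quiet, so the timeout decides.
const noisy = new QuiescenceTracker({ quietMs: 500, timeoutMs: 10000, startedAt: 0 });
for (let t = 100; t <= 10000; t += 100) noisy.domMutated(t);
console.log(noisy.status(10000));   // timeout
```

In a real page the three notification methods would presumably be driven by a MutationObserver callback and by wrappers around XMLHttpRequest or fetch, using Date.now() as the clock. The result would go back to the crawler through whatever JS-to-host bridge the embedding offers. The quiet window and timeout are tuning knobs, so this is still a guess, just a better-informed one than a fixed sleep.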