Getting a webpage's source code without actually accessing the page
There are lots of web pages which simply run a script without having any material on them. Is there any way of seeing the page source without actually visiting the page, given that it just redirects you?
Will using an HTML parser work for this? I'm using simpleHTMLdom to parse the page.
In Firefox you can use the view-source protocol to view only the source code of a page, without rendering it or executing any JavaScript on it.
Example: view-source:http://stackoverflow.com/q/5781021/298479 (copy it to your address bar)
Yes, simply parsing the HTML will get you the client-side (JavaScript) code.
When these pages are accessed through a browser, the browser runs the code and follows the redirect. When you access them using a scraper or your own program, the code is not run, so the static script can be obtained.
Of course, you can't access the server-side code (PHP). That's impossible.
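To illustrate the point above, here is a minimal sketch in Python (the question uses simpleHTMLdom in PHP, so treat this as an analogy, not a drop-in). It uses only the standard library; the names `RedirectFinder`, `find_client_side_redirects`, `fetch_source` and the sample markup are made up for this example. The parser only reads the markup, so scripts and meta refreshes are listed but never executed.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class RedirectFinder(HTMLParser):
    """Collects <script> contents and meta-refresh targets without running anything."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # text of each <script> block (empty for src-only tags)
        self.meta_refresh = []   # content attribute of each <meta http-equiv="refresh">
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            self._in_script = True
            self.scripts.append("")
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            self.meta_refresh.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def find_client_side_redirects(html):
    """Return (script texts, meta-refresh values) found in the given HTML string."""
    finder = RedirectFinder()
    finder.feed(html)
    finder.close()
    return finder.scripts, finder.meta_refresh


def fetch_source(url):
    """Download the raw HTML as text; nothing in it is executed."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical page that would redirect a browser immediately.
SAMPLE = """<html><head>
<meta http-equiv="refresh" content="0; url=http://example.com/next">
<script>window.location = "http://example.com/elsewhere";</script>
</head><body></body></html>"""

scripts, refreshes = find_client_side_redirects(SAMPLE)
print(scripts)
print(refreshes)
```

For a live page you would pass the result of `fetch_source(url)` to `find_client_side_redirects` instead of `SAMPLE`. Note that `urlopen` still follows server-side (HTTP) redirects on its own.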
If you need a quick and dirty fix, you could disable JavaScript and meta redirects. (Internet Explorer can disable these in the Internet Options dialog; Firefox can use the NoScript add-on for the same effect.)
This won't stop any server-side redirects, but it will prevent client-side redirects and allow you to see the document's HTML source.
The only way to get the page's HTML source is to send an HTTP request to the web server and receive its response, which amounts to visiting the page.
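As a rough sketch of what "sending an HTTP request" looks like from your own program, the Python snippet below issues a single GET with the standard library's `http.client`, which does not follow redirects by itself. The function name `fetch_without_following` and the User-Agent string are invented for this example. If the server answers with a 3xx status, you get the status and the `Location` header back instead of being taken to the target.

```python
import http.client
from urllib.parse import urlsplit


def fetch_without_following(url, timeout=10):
    """Send one GET request and return (status, Location header, body text)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # The User-Agent value is arbitrary; some servers reject requests without one.
        conn.request("GET", path, headers={"User-Agent": "source-viewer-sketch"})
        response = conn.getresponse()
        body = response.read().decode("utf-8", errors="replace")
        # Location is None unless the server sent a redirect (or similar) header.
        return response.status, response.getheader("Location"), body
    finally:
        conn.close()
```

Calling `fetch_without_following("http://myurl")` still contacts the server, so from the server's point of view the page has been visited; the only difference is that nothing is rendered, executed, or followed.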
If you're on a *nix based operating system, try using curl from the terminal.
curl http://www.google.com
wget or lynx will also work well if you have access to a command-line Linux shell:
wget http://myurl
lynx -dump http://myurl
If you are trying to HTML-scrape the contents of a page that builds 90%+ of its content/view by executing JavaScript, you are going to run into issues unless you render the page to a (hidden) screen and then scrape that. Otherwise you'll end up scraping a few script tags, which does you little good.
For example, if I try to scrape my Gmail inbox page, I get an empty HTML page with just a few scattered script tags (likely typical of almost all GWT-based apps).
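One way to check whether a page falls into this category before building a scraper is to count its script tags and see how much visible text is left in the raw HTML. The sketch below does that with Python's standard `html.parser`; the class `ContentProfile`, the function `profile`, and the sample markup are invented for illustration and are not taken from any real site.

```python
from html.parser import HTMLParser


class ContentProfile(HTMLParser):
    """Counts <script> tags and gathers text that is outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.visible_text = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.visible_text.append(text)


def profile(html):
    """Return (number of script tags, visible text joined by spaces)."""
    parser = ContentProfile()
    parser.feed(html)
    parser.close()
    return parser.script_tags, " ".join(parser.visible_text)


# Hypothetical shell page of a JavaScript-driven app.
SAMPLE = """<html><head><script src="app.js"></script>
<script>bootstrap({user: "me"});</script></head>
<body><div id="root"></div><noscript>Please enable JavaScript.</noscript></body></html>"""

script_count, text = profile(SAMPLE)
print(script_count, repr(text))
```

If the text comes back nearly empty while the script count is non-zero, plain HTML parsing will probably not be enough and some form of rendering is needed.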
Does the page/site you are scraping have an API? If not, is it worth asking them if they have one in the works?
Typically, these types of tools walk a fine line between "stealing" information and "sharing" information, so you may need to tread lightly.