Getting a webpage's source code without actually accessing the page

Posted: 2023-02-28 09:23

There are lots of web pages which simply run a script without having any material on them. Is there any way of seeing the page source without actually visiting the page, given that it just redirects you?

Will using an HTML parser work for this? I'm using simpleHTMLdom to parse the page.

In Firefox you can use the view-source protocol to view only the source code of a site without actually rendering it or executing any JavaScript on it.

Example: view-source:http://stackoverflow.com/q/5781021/298479 (copy it to your address bar)

Yes, simply parsing the HTML will get you the client-side (JavaScript) code.

When these pages are accessed through a browser, the browser runs the code and follows the redirect. When you access them with a scraper or your own program, the code is not run, so the static script can be retrieved as plain text.

Of course, you can't access the server-side code (PHP). That's impossible.
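As a rough illustration of that point, here is a minimal Python sketch that pulls the redirect-related pieces out of a page's static HTML without executing anything. It uses only the standard library's html.parser; the class name, the function name and the SAMPLE markup are made up for this example, and the original question used simpleHTMLdom in PHP rather than Python.

```python
from html.parser import HTMLParser


class RedirectScriptFinder(HTMLParser):
    """Collect inline scripts, external script URLs and meta-refresh targets."""

    def __init__(self):
        super().__init__()
        self.inline_scripts = []   # text found between <script> and </script>
        self.script_srcs = []      # values of <script src="...">
        self.meta_refresh = []     # content of <meta http-equiv="refresh">
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            if attrs.get("src"):
                self.script_srcs.append(attrs["src"])
            self._in_script = True
            self._buffer = []
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            self.meta_refresh.append(attrs.get("content") or "")

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so buffer them.
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            code = "".join(self._buffer).strip()
            if code:
                self.inline_scripts.append(code)
            self._in_script = False


def find_redirect_code(html):
    """Parse an HTML string and return the populated finder (nothing is executed)."""
    finder = RedirectScriptFinder()
    finder.feed(html)
    finder.close()
    return finder


# Hypothetical page that does nothing but redirect.
SAMPLE = """<html><head>
<meta http-equiv="refresh" content="0; url=http://example.com/next">
<script>window.location = "http://example.com/next";</script>
</head><body></body></html>"""

found = find_redirect_code(SAMPLE)
print(found.meta_refresh)
print(found.inline_scripts)
```

The parser only reads the markup as text, so the redirect never fires; you still have to download the HTML first, which is a normal request to the server.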

If you need a quick and dirty fix, you could disable JavaScript and meta redirects. (Internet Explorer can disable these in the Internet Options dialog; Firefox can use the NoScript add-on for the same effect.)

This won't prevent any server-side redirects, but it will prevent client-side redirects and allow you to see the document's HTML source.

The only way to get a page's HTML source is to send an HTTP request to the web server and receive the response, which amounts to visiting the page.
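To make that concrete, here is a small Python sketch of sending the request yourself while refusing to follow any HTTP-level (server-side) redirect, so you get back whatever the server sent for that exact URL. It relies on the standard library's urllib.request; the names NoRedirect and fetch_raw and the 10-second default timeout are choices made for this example only.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines to follow 3xx responses."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to build a follow-up request;
        # the 3xx response then surfaces as an HTTPError.
        return None


def fetch_raw(url, timeout=10):
    """Return (status, Location header or None, body text) for this URL only."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return resp.status, resp.headers.get("Location"), body
    except urllib.error.HTTPError as err:
        # 3xx responses (and 4xx/5xx) end up here; the body is still readable.
        body = err.read().decode("utf-8", "replace")
        return err.code, err.headers.get("Location"), body
```

Calling fetch_raw on a URL that answers with a 302 should give you the 302 status and the Location it points to instead of silently landing on the target page. A script-driven redirect is not an HTTP redirect at all, so in that case you simply get a 200 and the HTML containing the script.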

If you're on a *nix-based operating system, try using curl from the terminal.

curl http://www.google.com

wget or lynx will also work well if you have access to a command-line Linux shell:

wget http://myurl
lynx -dump http://myurl

If you are trying to scrape the contents of a page that builds 90%+ of its content or view by executing JavaScript, you are going to run into issues unless you render the page (for example, to a hidden screen) and then scrape the result. Otherwise you'll end up scraping a few script tags, which does you little good.

E.g., if I try to scrape my Gmail inbox page, I get an empty HTML page with just a few scattered script tags (likely typical of almost all GWT-based apps).
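One way to spot this situation before investing in a scraper is to measure how much visible text the static HTML actually contains. The Python sketch below is only a heuristic under stated assumptions: the names TextVsScript and looks_script_driven are invented here, and the 200-character threshold is an arbitrary guess that you would want to tune for your own pages.

```python
from html.parser import HTMLParser


class TextVsScript(HTMLParser):
    """Count characters of visible text versus characters inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_chars = 0
        self.script_tags = 0
        self._skip = None  # "script" or "style" while inside one of them

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = tag
            if tag == "script":
                self.script_tags += 1

    def handle_endtag(self, tag):
        if tag == self._skip:
            self._skip = None

    def handle_data(self, data):
        n = len(data.strip())
        if self._skip == "script":
            self.script_chars += n
        elif self._skip is None:
            self.text_chars += n


def looks_script_driven(html, min_text_chars=200):
    """Guess whether a page is mostly an empty shell filled in by JavaScript.

    min_text_chars is an arbitrary cut-off chosen for this sketch.
    """
    counter = TextVsScript()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


# Hypothetical examples: an app shell and a plain article.
shell = ('<html><head><title>Inbox</title><script src="app.js"></script>'
         '<script>boot();</script></head><body><div id="root"></div></body></html>')
article = "<html><body><p>" + "word " * 100 + "</p></body></html>"

print(looks_script_driven(shell))
print(looks_script_driven(article))
```

If the check comes back true, static parsing will probably give you little beyond the script tags, and rendering the page or looking for an API is likely the better route.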

Does the page/site you are scraping have an API? If not, is it worth asking them if they have one in the works?

Typically these types of tools walk a fine line between "stealing" information and "sharing" information, so you may need to tread lightly.
