Save full webpage

2022-12-11 09:20

I've run into a problem while working on a project. I want to "crawl" certain websites of interest and save each as a "full web page", including styles and images, in order to build a mirror of them. Several times I have bookmarked a website to read later, and a few days later the website was down because it got hacked and the owner didn't have a backup of the database.

Of course, I can read the files with PHP very easily using fopen("http://website.com", "r") or fsockopen(), but the main goal is to save the full web pages so that if a site goes down, it is still available to others, like a "programming time machine" :)

Is there a way to do this without reading and saving each and every link on the page?

Objective-C solutions are also welcome, since I'm trying to learn more of it as well.

Thanks!

You actually need to parse the HTML and all the CSS files it references, which is NOT easy. However, a fast way to do it is to use an external tool like wget. After installing wget, you could run this from the command line:

wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://example.com/mypage.html

This will download mypage.html along with all linked CSS files, images, and images referenced inside the CSS. Once wget is installed on your system, you could use PHP's system() function to control wget programmatically.
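The answer suggests PHP's system(); the same idea can be sketched in Python with the standard subprocess module. This is only an illustration: build_wget_command and mirror_page are hypothetical helper names, the flags are copied from the wget command quoted above, and --directory-prefix is added on the assumption that you want the files in a folder of your choosing.

```python
import shutil
import subprocess

# Flags copied from the wget command quoted in the answer.
WGET_FLAGS = [
    "--no-parent",
    "--timestamping",
    "--convert-links",
    "--page-requisites",
    "--no-directories",
    "--no-host-directories",
    "-erobots=off",
]


def build_wget_command(url, target_dir):
    """Return the argument list for one wget run (hypothetical helper)."""
    # --directory-prefix tells wget where to put the downloaded files.
    return ["wget", *WGET_FLAGS, f"--directory-prefix={target_dir}", url]


def mirror_page(url, target_dir):
    """Run wget for a single page and return wget's exit code."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    # Passing a list (no shell) avoids quoting problems with odd URLs.
    return subprocess.run(build_wget_command(url, target_dir)).returncode
```

Calling mirror_page("http://example.com/mypage.html", "mirror") would then save the page and its requisites under ./mirror, provided wget is installed.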

NOTE: You need at least wget 1.12 to properly save images that are referenced through CSS files.
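To illustrate why CSS needs its own parsing pass, here is a minimal sketch of pulling url(...) references out of a stylesheet. It is not how wget does it; css_references is a hypothetical helper, the regular expression is a simplification that ignores @import rules written without url(), and relative paths are resolved against the stylesheet's own URL.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def css_references(css_text, css_url):
    """Return absolute URLs of files referenced via url(...) in css_text."""
    refs = []
    for match in CSS_URL.finditer(css_text):
        target = match.group(2).strip()
        # Inline data: URIs carry their content already; nothing to download.
        if target and not target.startswith("data:"):
            refs.append(urljoin(css_url, target))
    return refs
```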

Is there a way to do this without reading and saving each and every link on the page?

Short answer: No.

Longer answer: if you want to save every page in a website, you're going to have to read every page in a website with something on some level.

It's probably worth looking into the Linux app wget, which may do something like what you want.

One word of warning: sites often link out to other sites, which link to other sites, and so on. Make sure you put some kind of "stop if different domain" condition in your spider!
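A "stop if different domain" condition could look roughly like the sketch below. The function names are hypothetical, and the check compares exact host names, so www.example.com and example.com count as different sites; loosen that if your target uses both.

```python
from urllib.parse import urljoin, urlparse


def same_host(url, start_url):
    """True if url points at the same host as start_url."""
    return urlparse(url).hostname == urlparse(start_url).hostname


def links_to_follow(page_url, hrefs, start_url):
    """Resolve raw href values against page_url and keep same-host ones."""
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        # Skip mailto:, javascript: and other non-web schemes.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if same_host(absolute, start_url):
            kept.append(absolute)
    return kept
```

A spider would pass every href it finds through links_to_follow and only queue what comes back.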

If you prefer an Objective-C solution, you could use the WebArchive class from WebKit.
It provides a public API that allows you to store whole web pages as a .webarchive file (like Safari does when you save a webpage).

Some nice features of the webarchive format:

Completely self-contained (including CSS, scripts, images)
QuickLook support
Easy to decompose

Whatever app is going to do the work (your code, or code that you find) is going to have to do exactly that: download a page, parse it for references to external resources and links to other pages, and then download all of that stuff. That's how the web works.
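The "parse it for references" step can be sketched with Python's standard html.parser module. This is a rough illustration, not a complete crawler: ReferenceCollector and collect_references are hypothetical names, and only img, script, stylesheet link, and a tags are handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceCollector(HTMLParser):
    """Collects page requisites (images, scripts, stylesheets) and links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []  # files needed to render this page
        self.links = []      # other pages a crawler might visit next

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.resources.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))


def collect_references(html, base_url):
    """Return (resources, links) found in html, as absolute URLs."""
    parser = ReferenceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources, parser.links
```

The resources list is what has to be downloaded alongside the page; the links list is what a crawler would visit next.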

But rather than doing the heavy lifting yourself, why not check out curl and wget? They're standard on most Unix-like OSes, and do pretty much exactly what you want. For that matter, your browser probably does, too, at least on a single page basis (though it'd also be harder to schedule that).

I'm not sure if you need a programming solution to "crawl websites" or personally need to save websites for offline viewing, but if it's the latter, there's a great app for Windows, Teleport Pro, and SiteCrawler for Mac.

You can use IDM (Internet Download Manager) for downloading full webpages. There's also HTTrack.
