How do you use wget (with the -mk options) to mirror a site and its externally-linked images?
I know of wget -mkp http://example.com to mirror a site and all of its internally-linked files.
But, I need to backup a site where all the images are stored on a separate domain. How could I download those images as well with wget, and update the src tags accordingly?
Thank you!
A slightly modified version of @PatrickHorn's answer:
First, cd into the top directory containing the downloaded files.
"first wget to find pages recursively, albeit only from that one domain"
wget --recursive --timestamping -l inf --no-remove-listing --page-requisites http://site.com
"second wget which spans hosts but does not retrieve pages recursively"
find site.com -name '*.htm*' -exec wget --no-clobber --span-hosts --timestamping --page-requisites http://{} \;
I've tried this, and it seems to have mostly worked - I get all the .htm(l) pages from just the site I'm after, then the external files. I haven't yet been able to change the links to be relative to the local copies of the external files.
wget with -r and -H is pretty dangerous since it can easily make its way to a large site (perhaps through an ad or search box) and span the whole Internet. The trick for downloading all the dependencies of a single page is that you don't necessarily want recursion, but you do want to download page requisites as well as allow wget to span hosts, as in:
wget -H -N -kp http://<site>/<document>
However, with this command, now you don't get the recursive behavior.
So to combine the two, we can use the first wget to find pages recursively, albeit only from that one domain; and a second wget which spans hosts but does not retrieve pages recursively:
wget -mkp http://example.com
find example.com/ -name '*.html*' -exec wget -nc -HNkp http://{} \;
The -nc is the important point: it tells wget to act as though it downloaded the page from the server, but to use the local copy on your disk instead, which means the references should already have been converted. Next, it will fetch all the resources; finally it should clobber the original file (which needs the query string) and name the second one correctly. Note that this downloads each file twice so that it can fix the references. However, the place I am stuck is -k: it converts relative URLs that it did not download back to absolute URLs, so after the second step all the links are remote URLs again.
Luckily, this problem should be a bit easier to solve by hand because all of the absolute links should start with "http://example.com/", so it might be possible to run a simple "sed" script to fix the link references.
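For example, something along these lines might do it (a rough sketch only, assuming GNU sed and that you will serve or browse the mirror from inside the example.com/ directory so that root-relative links resolve; a purely file-based setup would need proper per-page relative paths instead):
cd example.com
find . -name '*.htm*' -exec sed -i 's|http://example\.com/|/|g' {} \;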
What I would suggest, if you know which domains you expect example.com to pull content from, is to use the -D option to restrict the download to only those domains and nothing else. For example, for google.com you would include gstatic.com as well.
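Something along these lines, purely as an illustration (the domain list depends entirely on the site you are mirroring):
wget -mkp -H -Dgoogle.com,gstatic.com http://www.google.com/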
There's another person here with a similar question, but downloading remote images seems not to have been resolved.
The thread here suggests just biting the bullet and doing "-r -l 1 -H", but also using -A to restrict the files that actually get saved to image or CSS types.
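A command along those lines might look like this (a sketch only; the accepted extensions and the URL are placeholders, not taken from that thread):
wget -r -l 1 -H -p -N -A jpg,jpeg,png,gif,css http://example.com/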
Assuming you know the separate domain where the images are stored, things are much simpler than you'd expect using a recent wget build (i.e. version >= 1.20). For example, let's suppose the images are hosted at http://www.images.domain; then try this:
wget -mkp -E -np -H -Dexample.com,images.domain http://example.com
In the above example I added a few more parameters to the starting -mkp. Some of them [-E (--adjust-extension) and -np (--no-parent)] are there just because I think they can be convenient to use; the ones you definitely need for this purpose are the following:
-H (--span-hosts) => enables spanning across hosts when doing recursive retrieval
-D<comma-separated domain list> (--domains=<comma-separated domain list>) => sets the domains to be followed when retrieving files
That's it; have a look at the wget manual for further reference.