How do you use wget (with the -mk options) to mirror a site and its externally-linked images?
I know of wget -mkp http://example.com to mirror a site and all of its internally-linked files.
But, I need to backup a site where all the images are stored on a separate domain. How could I download those images as well with wget, and update the src tags accordingly?
Thank you!
A slightly modified version of @PatrickHorn's answer:
First, cd into the top directory containing the downloaded files.
"first wget to find pages recursively, albeit only from that one domain"
wget --recursive --timestamping -l inf --no-remove-listing --page-requisites http://site.com
"second wget which spans hosts but does not retrieve pages recursively"
find site.com -name '*.htm*' -exec wget --no-clobber --span-hosts --timestamping --page-requisites http://{} \;
I've tried this, and it seems to have mostly worked - I get all the .htm(l) pages from just the site I'm after, then the external files. I haven't yet been able to change the links to be relative to the local copies of the external files.
wget with -r and -H is pretty dangerous since it can easily make its way to a large site (perhaps through an ad or search box) and span the whole Internet. The trick for downloading all the dependencies of a single page is that you don't necessarily want recursion, but you do want to download page requisites as well as allow wget to span hosts, as in:
wget -H -N -kp http://<site>/<document>
However, with this command, now you don't get the recursive behavior.
So to combine the two, we can use the first wget to find pages recursively, albeit only from that one domain; and a second wget which spans hosts but does not retrieve pages recursively:
wget -mkp http://example.com
find example.com/ -name '*.html*' -exec wget -nc -HNkp http://{} \;
The -nc is the important point: it tells wget to act as though it downloaded the page from the server, but to use the local copy on your disk instead, which means the references should already have been converted. Next, it will fetch all the resources; finally it should clobber the original file (which needs the query string) and name the second one correctly. Note that this downloads each file twice so that it can fix the references. However, the place I am stuck is -k: it converts relative URLs that it did not download back to absolute URLs, so after the second step all the links are remote URLs again.
Luckily, this problem should be a bit easier to solve by hand because all of the absolute links should start with "http://example.com/", so it might be possible to run a simple "sed" script to fix the link references.
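For example, something along these lines might do it (a rough sketch only, assuming GNU sed and that you will serve or browse the mirror from inside the example.com/ directory so that root-relative links resolve; a purely file-based setup would need proper per-page relative paths instead):
cd example.com
find . -name '*.htm*' -exec sed -i 's|http://example\.com/|/|g' {} \;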
What I would suggest, if you know which domains you expect example.com to pull content from, is to use the -D option to restrict the download to only those domains and nothing else. For example, for google.com you would include gstatic.com as well.
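Something along these lines, purely as an illustration (the domain list depends entirely on the site you are mirroring):
wget -mkp -H -Dgoogle.com,gstatic.com http://www.google.com/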
There's another person here with a similar question, but downloading remote images seems not to have been resolved.
The thread here suggests just biting the bullet and doing "-r -l 1 -H", but also using -A to restrict the files that actually get saved to image or CSS types.
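A command along those lines might look like this (a sketch only; the accepted extensions and the URL are placeholders, not taken from that thread):
wget -r -l 1 -H -p -N -A jpg,jpeg,png,gif,css http://example.com/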
Assuming you know the separate domain where the images are stored, things are much simpler than you'd expect using a recent wget build (i.e. version >= 1.20). For example, let's suppose the images are hosted at http://www.images.domain; then try this:
wget -mkp -E -np -H -Dexample.com,images.domain http://example.com
In the above example I added a few more parameters to the starting -mkp. Some of them [-E (--adjust-extension) and -np (--no-parent)] are there just because I think they can be convenient to use; the ones you definitely need for this purpose are the following:
-H (--span-hosts) => enables spanning across hosts when doing recursive retrieval
-D<comma-separated domain list> (--domains=<comma-separated domain list>) => sets the domains to be followed when retrieving files
That's it; have a look at the wget manual for further reference.