Use curl to parse XML, get an image's URL and download it

2023-01-09 18:47

I want to write a shell script to get an image from an rss feed. Right now I have:

curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g'

I use this to grab the first occurrence of an image URL in the file. Now I want to put this URL in a variable so I can use cURL again to download the image. Any help appreciated! (Also, you might give tips on how to better remove everything else from the line with the URL. This is the line:

 <img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />

There's probably some better regex to remove everything except the URL than my solution.) Thanks in advance!

Using a regexp to parse HTML/XML is a Bad Idea in general. Therefore I'd recommend that you use a proper parser.

If you don't object to using Perl, let Perl do the proper XML or HTML parsing for you using appropriate parser libraries:

HTML

curl -s http://BOGUS.com | perl -e '{use HTML::TokeParser; 
    $parser = HTML::TokeParser->new(\*STDIN); 
    $img = $parser->get_tag('img') ; 
    print "$img->[1]->{src}\n"; 
}'

/content02/groups/intranetcommon/documents/image/blk_logo.gif

XML

curl http://BOGUS.com/whdata0.xml | perl -e '{use XML::Twig;
    $twig=XML::Twig->new(twig_handlers =>{img => sub { 
       print $_[1]->att("src")."\n"; exit 0;}}); 
    open(my $fh, "-");
    $twig->parse($fh);
}'

/content02/groups/intranetcommon/documents/image/blk_logo.gif

I used wget instead of curl, but it's just the same:

#!/bin/bash
url='http://www.nichtlustig.de/rss/nichtrss.rss'
wget -O- -q "$url" | awk 'BEGIN{ RS="</a>" }
/<img src=/{
  gsub(/.*<img src=\"/,"")
  gsub(/\".[^>]*>/,"")
  print
}'  |  xargs -I{} wget "{}"

Use a DOM parser and extract all img elements using getElementsByTagName. Then add them to a list/array, loop through and separately fetch them.

I would suggest using Python, but any language would have a DOM library.
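A minimal sketch of that DOM approach in Python, using the standard library's xml.dom.minidom. The sample document, the second URL in it, and the function name img_sources are made up for illustration. Note that minidom only accepts well-formed XML, and in a real RSS feed the img tags may sit inside escaped or CDATA description text, which would then need a second pass with an HTML parser.

```python
from xml.dom.minidom import parseString

# Hypothetical, well-formed sample standing in for the downloaded document.
SAMPLE = """<root>
  <item><img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" /></item>
  <item><img src="http://example.com/second.jpg" alt="" /></item>
</root>"""


def img_sources(xml_text):
    """Return the src attribute of every img element, in document order."""
    dom = parseString(xml_text)
    return [img.getAttribute("src")
            for img in dom.getElementsByTagName("img")
            if img.hasAttribute("src")]


if __name__ == "__main__":
    # Each printed URL could then be fetched separately, e.g. with curl or wget.
    for src in img_sources(SAMPLE):
        print(src)
```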

#!/bin/sh
URL=$(curl http://foo.com/rss.xml | grep -E '<img src="http://www.foo.com/full/' | head -1 | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' | sed 's/ //g')
curl -C - -O "$URL"

This totally does the job! Any idea on the regex?
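On the regex question: for this one known line format, a capture group that grabs only the quoted src value avoids deleting everything else piece by piece. A sketch in Python (the names IMG_SRC and first_img_src are invented here; it assumes double-quoted attributes and is not a general HTML parser):

```python
import re

# Capture whatever sits between the quotes of the first src="..." inside an img tag.
# [^>]*? keeps the match inside one tag; [^"]+ stops at the closing quote.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc="([^"]+)"')


def first_img_src(text):
    """Return the first img src found in text, or None if there is none."""
    match = IMG_SRC.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    line = ' <img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />'
    print(first_img_src(line))
```

Because the URL is captured rather than left over after a chain of substitutions, the width, height and alt attributes no longer need to be matched at all.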

Here's a quick Python solution:

# Python 3 with the bs4 package (BeautifulSoup 4)
import sys
from bs4 import BeautifulSoup

soup = BeautifulSoup(sys.stdin.read(), 'html.parser')
print(soup.find_all('img')[0]['src'])

Usage:

$ curl http://www.google.com/`curl http://www.google.com | python get_img_src.py`

This works like a charm and will not leave you trying to find the magical regex that will parse random HTML (Hint: there is no such expression, especially not if you have a greedy matcher like sed.)
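If installing BeautifulSoup is not an option, roughly the same thing can be sketched with Python 3's standard library alone. The class name FirstImgParser and the function first_img_url are hypothetical. urljoin is used because a src may be relative (like the /content02/... path in the Perl output above), in which case it has to be resolved against the page URL before it can be downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstImgParser(HTMLParser):
    """Remember the src of the first img tag that has one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_url(html_text, base_url):
    """Return the absolute URL of the first img in html_text, or None."""
    parser = FirstImgParser()
    parser.feed(html_text)
    return urljoin(base_url, parser.src) if parser.src else None


if __name__ == "__main__":
    # Sample input built from the relative path shown earlier in this thread.
    sample = '<p><img src="/content02/groups/intranetcommon/documents/image/blk_logo.gif"></p>'
    print(first_img_url(sample, "http://BOGUS.com/"))
```

The returned URL is already absolute, so it can be passed straight to curl -O without prepending the site address by hand.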
