Regular expression to extract data from an HTML page
I want to extract all anchor tags from HTML pages. I am running this on Linux:
lynx --source http://www.imdb.com | egrep "<a[^>]*>"
But that is not working as expected, since the output contains unwanted text:
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>
I want just
<a href >...</a>
Is there a good way to do this?
If you have a -P option in your grep so that it accepts PCRE patterns, you should be able to use better regexes. Sometimes a minimal quantifier like *? helps. Also, you’re getting the whole input line, not just the match itself; if you have a -o option to grep, it will list only the part that matches.
grep -Po '<a[^<>]*>'
If your grep doesn’t have those options, try
perl -00 -nle 'print $1 while /(<a[^<>]*>)/gi'
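That version can also match a tag that spans line boundaries, because -00 makes perl read the input a paragraph at a time instead of a line at a time.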
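As a rough illustration of the points above (printing only the match, matching across line breaks, and the effect of a minimal quantifier), here is a small Python sketch. The sample markup and the variable names are made up for the example, and the \b after the a is an addition of mine so that tags such as <abbr> are not picked up; it is not a general HTML parser.

```python
import re

# Hypothetical sample input: the first opening tag is split across two lines.
html = '''<p><a class="amazon-affiliate-site-name"
   href="http://www.fabric.com">Fabric</a><br>
<a href="/chart/top">Top</a></p>'''

# [^<>]* also matches newlines, so an opening tag that spans lines is found.
# \b keeps the pattern from matching other tags that start with "a".
OPEN_ANCHOR = re.compile(r'<a\b[^<>]*>', re.IGNORECASE)

# Minimal quantifier: .*? stops at the first </a> after each opening tag.
LAZY_ANCHOR = re.compile(r'<a\b[^<>]*>.*?</a>', re.IGNORECASE | re.DOTALL)

# Greedy quantifier for comparison: .* runs on to the last </a> in the input.
GREEDY_ANCHOR = re.compile(r'<a\b[^<>]*>.*</a>', re.IGNORECASE | re.DOTALL)

open_tags = OPEN_ANCHOR.findall(html)
lazy = LAZY_ANCHOR.findall(html)
greedy = GREEDY_ANCHOR.findall(html)

# Like grep -o, print only the matched text, not the surrounding line.
for match in lazy:
    print(match)
```

With this input the lazy pattern yields two separate anchors, while the greedy one swallows everything from the first <a to the last </a> as a single match.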
To do a real parse of HTML requires regexes substantially more complex than you are apt to wish to enter on the command line. Here’s one example, and here’s another. Those may not convince you to try a non-regex approach, but they should at least show you how much harder it is in the general case than in specific ones.
This answer shows why all things are possible, but not all are expedient.
Why can't you use options like --dump?
lynx --dump --listonly http://www.imdb.com
Try grep -Eo:
$ echo '<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>' | grep -Eo '<a[^>]*>'
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">
But please read the answer that MAK linked to.
Here are some examples of why you should not use regex to parse HTML.
To extract the values of the 'href' attribute of anchor tags, run:
$ python -c'import sys, lxml.html as h
> root = h.parse(sys.argv[1]).getroot()
> root.make_links_absolute(base_url=sys.argv[1])
> print("\n".join(root.xpath("//a/@href")))' http://imdb.com | sort -u
Install the lxml module if needed: $ sudo apt-get install python-lxml (or python3-lxml for Python 3).
Output
http://askville.amazon.com http://idfilm.blogspot.com/2011/02/another-class.html http://imdb.com http://imdb.com/ http://imdb.com/a2z http://imdb.com/a2z/ http://imdb.com/advertising/ http://imdb.com/boards/ http://imdb.com/chart/ http://imdb.com/chart/top http://imdb.com/czone/ http://imdb.com/features/hdgallery http://imdb.com/features/oscars/2011/ http://imdb.com/features/sundance/2011/ http://imdb.com/features/video/ http://imdb.com/features/video/browse/ http://imdb.com/features/video/trailers/ http://imdb.com/features/video/tv/ http://imdb.com/features/yearinreview/2010/ http://imdb.com/genre http://imdb.com/help/ http://imdb.com/helpdesk/contact http://imdb.com/help/show_article?conditions http://imdb.com/help/show_article?rssavailable http://imdb.com/jobs http://imdb.com/lists http://imdb.com/media/index/rg2392693248 http://imdb.com/media/rm3467688448/rg2392693248 http://imdb.com/media/rm3484465664/rg2392693248 http://imdb.com/media/rm3719346688/rg2392693248 http://imdb.com/mymovies/list http://imdb.com/name/nm0000207/ http://imdb.com/name/nm0000234/ http://imdb.com/name/nm0000631/ http://imdb.com/name/nm0000982/ http://imdb.com/name/nm0001392/ http://imdb.com/name/nm0004716/ http://imdb.com/name/nm0531546/ http://imdb.com/name/nm0626362/ http://imdb.com/name/nm0742146/ http://imdb.com/name/nm0817980/ http://imdb.com/name/nm2059117/ http://imdb.com/news/ http://imdb.com/news/celebrity http://imdb.com/news/movie http://imdb.com/news/ni7650335/ http://imdb.com/news/ni7653135/ http://imdb.com/news/ni7654375/ http://imdb.com/news/ni7654598/ http://imdb.com/news/ni7654810/ http://imdb.com/news/ni7655320/ http://imdb.com/news/ni7656816/ http://imdb.com/news/ni7660987/ http://imdb.com/news/ni7662397/ http://imdb.com/news/ni7665028/ http://imdb.com/news/ni7668639/ http://imdb.com/news/ni7669396/ http://imdb.com/news/ni7676733/ http://imdb.com/news/ni7677253/ http://imdb.com/news/ni7677366/ http://imdb.com/news/ni7677639/ http://imdb.com/news/ni7677944/ 
http://imdb.com/news/ni7678014/ http://imdb.com/news/ni7678103/ http://imdb.com/news/ni7678225/ http://imdb.com/news/ns0000003/ http://imdb.com/news/ns0000018/ http://imdb.com/news/ns0000023/ http://imdb.com/news/ns0000031/ http://imdb.com/news/ns0000128/ http://imdb.com/news/ns0000136/ http://imdb.com/news/ns0000141/ http://imdb.com/news/ns0000195/ http://imdb.com/news/ns0000236/ http://imdb.com/news/ns0000344/ http://imdb.com/news/ns0000345/ http://imdb.com/news/ns0004913/ http://imdb.com/news/top http://imdb.com/news/tv http://imdb.com/nowplaying/ http://imdb.com/photo_galleries/new_photos/2010/ http://imdb.com/poll http://imdb.com/privacy http://imdb.com/register/login http://imdb.com/register/?why=footer http://imdb.com/register/?why=mymovies_footer http://imdb.com/register/?why=personalize http://imdb.com/rg/NAV_TWITTER/NAV_EXTRA/http://www.twitter.com/imdb http://imdb.com/ri/TRAILERS_HPPIRATESVID/TOP_BUCKET/102785/video/imdb/vi161323033/ http://imdb.com/search http://imdb.com/search/ http://imdb.com/search/name?birth_monthday=02-12 http://imdb.com/search/title?sort=num_votes,desc&title_type=feature&my_ratings=exclude http://imdb.com/sections/dvd/ http://imdb.com/sections/horror/ http://imdb.com/sections/indie/ http://imdb.com/sections/tv/ http://imdb.com/showtimes/ http://imdb.com/tiger_redirect?FT_LIC&licensing/ http://imdb.com/title/tt0078748/ http://imdb.com/title/tt0279600/ http://imdb.com/title/tt0377981/ http://imdb.com/title/tt0881320/ http://imdb.com/title/tt0990407/ http://imdb.com/title/tt1034389/ http://imdb.com/title/tt1265990/ http://imdb.com/title/tt1401152/ http://imdb.com/title/tt1411238/ http://imdb.com/title/tt1411238/trivia http://imdb.com/title/tt1446714/ http://imdb.com/title/tt1452628/ http://imdb.com/title/tt1464174/ http://imdb.com/title/tt1464540/ http://imdb.com/title/tt1477837/ http://imdb.com/title/tt1502404/ http://imdb.com/title/tt1504320/ http://imdb.com/title/tt1563069/ http://imdb.com/title/tt1564367/ 
http://imdb.com/title/tt1702443/ http://imdb.com/tvgrid/ http://m.imdb.com http://pro.imdb.com/r/IMDbTabNB/ http://resume.imdb.com http://resume.imdb.com/ https://secure.imdb.com/register/subscribe?c=a394d4442664f6f6475627 http://twitter.com/imdb http://wireless.amazon.com http://www.3news.co.nz/The-Hobbit-media-conference--full-video/tabid/312/articleID/198020/Default.aspx http://www.amazon.com/exec/obidos/redirect-home/internetmoviedat http://www.audible.com http://www.boxofficemojo.com http://www.dpreview.com http://www.endless.com http://www.fabric.com http://www.imdb.com/board/bd0000089/threads/ http://www.imdb.com/licensing/ http://www.imdb.com/media/rm1037220352/rg261921280 http://www.imdb.com/media/rm2695346688/tt1449283 http://www.imdb.com/media/rm3987585536/tt1092026 http://www.imdb.com/name/nm0000092/ http://www.imdb.com/photo_galleries/new_photos/2010/index http://www.imdb.com/search/title?sort=num_votes,desc&title_type=tv_series&my_ratings=exclude http://www.imdb.com/sections/indie/ http://www.imdb.com/title/tt0079470/ http://www.imdb.com/title/tt0079470/quotes?qt0471997 http://www.imdb.com/title/tt1542852/ http://www.imdb.com/title/tt1606392/ http://www.imdb.de http://www.imdb.es http://www.imdb.fr http://www.imdb.it http://www.imdb.pt http://www.movieline.com/2011/02/watch-jon-hamm-talk-butthole-surfers-paul-rudd-impersonate-jay-leno-at-book-reading-1.php http://www.movingimagesource.us/articles/un-tv-20110210 http://www.npr.org/blogs/monkeysee/2011/02/10/133629395/james-franco-recites-byron-to-the-worlds-luckiest-middle-school-journalist http://www.nytimes.com/2011/02/06/books/review/Brubach-t.html http://www.shopbop.com/welcome http://www.smallparts.com http://www.twinpeaks20.com/details/ http://www.twitter.com/imdb http://www.vanityfair.com/hollywood/features/2011/03/lauren-bacall-201103 http://www.warehousedeals.com http://www.withoutabox.com http://www.zappos.com
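If lxml is not available, roughly the same idea can be sketched with only the Python standard library. This is a minimal sketch, not the method used above: the LinkCollector class, the extract_links helper and the sample markup are names and data invented for this example, and html.parser is more forgiving but less capable than lxml.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.add(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return the sorted, de-duplicated absolute links found in html_text."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return sorted(collector.links)


# Hypothetical sample input; a real run would read the fetched page instead.
sample = '''<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>
<A HREF="/chart/top">Top</A> <a name="x">no href</a> <a href="/chart/top">dup</a>'''

links = extract_links(sample, "http://imdb.com")
print("\n".join(links))
```

The set plus sorted() plays the role of sort -u in the pipeline above, and urljoin stands in for make_links_absolute.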
To extract the values of the 'href' attribute of anchor tags, you may also use xmlstarlet after converting the HTML to XHTML with HTML Tidy (Mac OS X version released on 25 March 2009):
curl -s www.imdb.com |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a/@href" -v '.' -n |
grep '^[[:space:]]*http://' | sort -u | nl
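The namespace-qualified XPath in the xmlstarlet step can be hard to read, so here is a small Python sketch of the same selection using the standard library's ElementTree. It assumes the input is already well-formed XHTML (as Tidy would produce); the sample document and variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Same prefix-to-namespace mapping as the -N x=... option above.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical, already well-formed XHTML input.
xhtml = '''<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<p><a href="http://www.fabric.com">Fabric</a></p>
<p><a href="/chart/top">Top</a> <a name="x">anchor only</a></p>
</body>
</html>'''

root = ET.fromstring(xhtml)

# Select every x:a element that has an href attribute, then de-duplicate and sort.
hrefs = sorted({a.get("href") for a in root.iterfind(".//x:a[@href]", XHTML_NS)})

# Keep only absolute http:// links, as the grep stage in the pipeline does.
absolute = [h for h in hrefs if h.startswith("http://")]

print("\n".join(absolute))
```

Without the namespace mapping the search for a plain a element would find nothing, because every element in an XHTML document lives in the XHTML namespace.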
On Mac OS X you may also use the command line tool linkscraper:
linkscraper http://www.imdb.com
see: http://codesnippets.joyent.com/posts/show/10772