how to grep part of the content from a string in bash
For example, when filtering an HTML file, if every line follows this pattern:
<a href="xxxxxx" style="xxxx"><i>some text</i></a>
How can I get the content of href, and how can I get the text between <i> and </i>?
cat file | cut -f2 -d\"
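The cut command above only covers the href part of the question. A minimal sketch that covers both parts, assuming every line really does match the pattern from the question (one <a> per line, href as the first quoted attribute, exactly one <i>...</i> pair); the variable names line, href and text are only for illustration:

```shell
# Sample line copied from the question; a real script would read lines from the file.
line='<a href="xxxxxx" style="xxxx"><i>some text</i></a>'

# href: splitting on double quotes, the value is the second field.
href=$(printf '%s\n' "$line" | cut -d'"' -f2)

# <i> text: let sed capture whatever sits between <i> and </i>.
# -n plus the trailing p prints only lines where the substitution matched.
text=$(printf '%s\n' "$line" | sed -n 's|.*<i>\(.*\)</i>.*|\1|p')

printf '%s\n' "$href" "$text"
```

For a whole file, the same sed expression can be run directly on it, e.g. sed -n 's|.*<i>\(.*\)</i>.*|\1|p' file. This is still pattern matching rather than HTML parsing, so it is likely to break as soon as the markup deviates from the assumed shape.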
FYI: Just about every other HTML/regexp post on Stack Overflow explains why getting values from HTML using anything other than HTML parsing is a bad idea. You may want to read some of those. This one for example.
If href is always the second space-separated token in a line, then you can try:
grep "href" file | cut -d' ' -f2 | cut -d'=' -f2
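Note that the pipeline above leaves the surrounding double quotes on the value. A small sketch of the same idea with the quotes stripped, run against the sample line from the question instead of a file (the variable names line, raw and href are only for illustration):

```shell
# Sample line copied from the question.
line='<a href="xxxxxx" style="xxxx"><i>some text</i></a>'

# Field 2 on spaces is href="xxxxxx"; field 2 of that on '=' is "xxxxxx", quotes included.
raw=$(printf '%s\n' "$line" | grep 'href' | cut -d' ' -f2 | cut -d'=' -f2)

# Delete the double quotes to get the bare value.
href=$(printf '%s\n' "$raw" | tr -d '"')

printf '%s\n' "$href"
```

If the URL itself can contain an = sign (a query string, for example), cut -d'=' -f2 will truncate it at that point; using -f2- instead keeps everything after the first =.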
Here's how to do it using xmlstarlet (optionally with tidy):
# extract content of href and <i>...</i>
echo '<a href="xxxxxx" style="xxxx"><i>some text</i></a>' |
xmlstarlet sel -T -t -m "//a" -v @href -n -v i -n
# using tidy & xmlstarlet
echo '<a href="xxxxxx" style="xxxx"><i>some text</i></a>' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:a" -v @href -n -v . -n