Reading file in a pattern using awk
I have an input file in the following format:
<td> Name1 </td>
<td> <span class="test"><a href="url1">Link </a></span></td>
<td> Name2 </td>
<td> <span class="test"><a href="url2">Link </a></span></td>
I want an awk script that reads this file and produces the following output:
url1 Name1
url2 Name2
Can anyone help me out with this trivial-looking problem? Thanks.
Extracting one href per line is relatively simple, as long as the markup conforms to XHTML standards, there is at most one href on a line, and you don't care about the enclosing tags. Perl is easier than awk for this:
$ perl -ne 'print "$1\n" if /href="([^"]+)"/'
If you care about the enclosing tags, or the markup is not standards-conformant, you cannot use regular expressions to parse HTML. It is impossible.
Added: oops, you do care about context. Forget about regexps and use a real HTML parser.
Here is an awk script that does the job:
awk '
/a href=\".*\"/ { sub( /^.*a href=\"/,"" ); sub(/\".*/,""); print $0, name }
{ name = $2 }
'
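To see how the two rules in that script interact, here is a small sketch that runs it against the sample input from the question. The shell variable names `sample` and `result` are made up for illustration. The point worth noticing is that after `sub()` rewrites `$0` on a link line, `$2` is empty, so `name` is simply cleared until the next name line sets it again.

```shell
# Sample input copied from the question; "sample" is a hypothetical variable name.
sample='<td> Name1 </td>
<td> <span class="test"><a href="url1">Link </a></span></td>
<td> Name2 </td>
<td> <span class="test"><a href="url2">Link </a></span></td>'

result=$(printf '%s\n' "$sample" | awk '
# Link line: cut everything up to the href value, then everything after it,
# and print what is left together with the name remembered earlier.
/a href=\".*\"/ { sub( /^.*a href=\"/,"" ); sub(/\".*/,""); print $0, name }
# Every line: remember field 2. On a name line this is the name; on a link
# line the record was just rewritten to the bare URL, so field 2 is empty.
{ name = $2 }
')

printf '%s\n' "$result"
```

This relies on each name line coming directly before its link line, as in the sample.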
This might work:
awk 'BEGIN {i=1}
# store every input line
{line[i++]=$0}
END {
j=1;
# join each link line with the name line before it
while (j<i)
{print line[j+1] line[j]; j+=2}
}' yourfile | awk '{split($4, a, "\""); print a[2], $6}'
gawk '/^<td>/ {n = $2; getline; print gensub(/.*href="([^"]*).*/,"\\1",1), n}' infile
url1 Name1
url2 Name2
awk 'BEGIN{RS="></td>\n"; FS="> | </|\""}{print $7, $2}' infile
This treats every two lines as one record.
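A multi-character RS is handled as a regular expression by gawk and mawk, but POSIX leaves that behavior unspecified. The same idea of pairing every two lines can be sketched without it. This is only an illustration that assumes names and links strictly alternate and that `href` is the second quoted attribute on the link line, as in the sample; `sample`, `pairs`, and `part` are made-up names.

```shell
# Sample input copied from the question; "sample" is a hypothetical variable name.
sample='<td> Name1 </td>
<td> <span class="test"><a href="url1">Link </a></span></td>
<td> Name2 </td>
<td> <span class="test"><a href="url2">Link </a></span></td>'

pairs=$(printf '%s\n' "$sample" | awk '
# Odd-numbered lines carry the name in the second whitespace-separated field.
NR % 2 == 1 { name = $2; next }
# Even-numbered lines carry the link. Splitting on double quotes puts the
# class value in piece 2 and the href value in piece 4.
{ split($0, part, "\""); print part[4], name }
')

printf '%s\n' "$pairs"
```

If the attribute order on the link line can vary, the fixed index 4 no longer holds and a pattern-based extraction would be needed instead.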