Reading file in a pattern using awk

I have an input file in the following format:

<td> Name1 </td>
<td> <span class="test"><a href="url1">Link </a></span></td>
<td> Name2 </td>
<td> <span class="test"><a href="url2">Link </a></span></td>

I want an awk script that reads this file and prints the following:

url1 Name1
url2 Name2

Can anyone help me out with this trivial-looking problem? Thanks.


Extracting one href per line is relatively simple, as long as the markup conforms to XHTML, there is at most one href on a line, and you don't care about the enclosing tags. Perl makes it easier, though:

$ perl -ne 'print "$1\n" if /href="([^"]+)"/'
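
Since the question asks for awk, the same one-href-per-line extraction can be sketched in POSIX awk with match(). This is only a sketch under the same assumptions as the perl one-liner (double-quoted attribute values, at most one href per line); the function name extract_hrefs is made up for illustration.

```shell
# Hypothetical helper: print the value of the first href="..." on each line.
# Assumes double-quoted attribute values and at most one href per line.
extract_hrefs() {
    awk 'match($0, /href="[^"]+"/) {
        # RSTART/RLENGTH cover href="...": skip the 6 characters of href="
        # and drop the closing quote.
        print substr($0, RSTART + 6, RLENGTH - 7)
    }' "$@"
}
```

Called as `extract_hrefs infile`, it prints only the URLs; pairing them with the names still needs one of the approaches from the other answers.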

If you care about enclosing tags or they are not standard conformant, you cannot use regular expressions to parse HTML. It is impossible.

Added: oops, you do care about context, so forget about regexps and use a real HTML parser.


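Here is an awk script that does the job:

awk '
# Link line: strip everything around the URL, then print it with the saved name
/a href=\".*\"/ { sub( /^.*a href=\"/,"" ); sub(/\".*/,"");  print $0, name }
# Every line: remember the second field (the name on a <td> Name </td> line)
                { name = $2 }
' infile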


This might work:

awk 'BEGIN {i=1}
     {line[i++]=$0}
     END {
      j=1;
      while (j<i)
      {print line[j+1] line[j]; j+=2}
     }' yourfile|awk '{split($4,a,"\""); print a[2],$6}'


gawk '/^<td>/ {n = $2; getline; print gensub(/.*href="([^"]*).*/,"\\1",1), n}' infile

url1 Name1
url2 Name2


awk 'BEGIN{RS="></td>\n"; FS="> | </|\""}{print $7, $2}' infile

This treats every two lines as one record.
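
A multi-character RS is honoured by gawk and mawk but is not guaranteed by POSIX awk. As a hedged, portable sketch of the same idea, each odd line can be held back and glued to the following even line before the fields are split. The function name pair_lines is hypothetical, and the field numbers assume exactly the two-line layout shown in the question.

```shell
# Hypothetical helper: treat every two input lines as one record,
# without relying on a multi-character RS.
pair_lines() {
    awk 'BEGIN { FS = "> | </|\"" }      # same separators as the one-liner
    NR % 2 { held = $0; next }           # odd line: remember the name row
    {
        $0 = held "\n" $0                # even line: rebuild the 2-line record
        print $7, $2                     # assigning $0 re-splits the fields
    }' "$@"
}
```

Called as `pair_lines infile`, it should print the same two lines as the gawk answer for the sample input.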
