How to use sed-awk-gawk to display a matched string

2023-02-13 11:37 问答作者：

I've got a file called 'res' that's 29374 characters of http data in a one-line string. Inside it, there are several http links, but I only want to be display those that end in '/idNNNNNNNNN' where N is a digit. In fact I'm only interested in the string 'idNNNNNNNNN'. I've tried with:

cat res | sed -n '0,/.*\开发者_StackOverflow社区(id[0-9]*\).*/s//\1/p'

but I get the whole file. Do you know a way to do it?

perl -n -E 'say $1 while m!/id(\d{9})!g' input-file

should work. That assumes exactly 9 digits; that's the {9} in the above. You can match 8 or 9 ({8,9}), 8 or more ({8,}), up to 9 ({0,9}), etc.

Example of this working:

$ echo -n 'junk jumk http://foo/id231313 junk lalala http://bar/id23123 asda' | perl -n -E 'say $1 while m!id(\d{0,9})!g'
231313
23123

That's with the 0 to 9 variant, of course.

If you're stuck with a pre-5.10 perl, use -e instead of -E and print "$1\n" instead of say $1.

How it works

First is the two command-line arguments to Perl. -n tells Perl to read input from standard input or files given on the command line, line by line, setting $_ to each line. $_ is perl's default target for a lot of things, including regular expression matches. -E merely tells Perl that the next argument is a Perl one-liner, using the new language features (vs. -e which does not use the 5.10 extensions).

So, looking at the one liner: say means to print out some value, followed by a newline. $1 is the first regular expression capture (captures are made by parentheses in regular expressions). while is a looping construct, which you're probably familiar with. m is the match operator, the ! after it is the regular expression delimiter (normally, you see / here, but since the pattern contains / it's easier to use something else, so you don't have to escape the / as \/). /id(\d{9}) is the regular expression to match. Keep in mind that the delimiter is !, so the / is not special, it just matches a literal /. The parentheses form a capture group, so $1 will be the number. The ! is the delimiter, followed by g which means to match as many times as possible (as opposed to once). This is what makes it pick up all the URLs in the line, not just the first. As long as there is a match, the m operator will return a true value, so the loop will continue (and run that say $1, printing out the match).

Two-sed solution

I think this is one way to do this with only sed. Much more complicated!

echo 'junk jumk http://foo/id231313 junk lalala http://bar/id23123 asda' | \
    sed 's!http://!\nhttp://!g' | \
    sed 's!^.*/id\([0-9]*\).*$!\1!'

cat res | perl -ne 'chomp; print "$1\n" if m/\/(id\d*)/'

The trouble is that sed and grep and awk work on lines, and you've only got one line. So, you probably need to split things up so you have more than one line -- then you can make the normal tools work.

tr ':' '\012' < res |
sed -n 's%.*/\(id[0-9][0-9]*\).*%\1%p'

This takes advantage of URLs containing colons and maps colons to newlines with tr, then uses sed to pick up anything up to a slash, followed by id and one or more digits, followed by anything, and prints out the id and digit string (only). Since these only occur in URLs, they will only appear one per line and relatively near the start of the line too.

Here's a solution using only one invocation of sed:

sed -n 's| |\n|g;/^http/{s|http://[^/]*/id\([0-9]*\)|\1|;P};D' inputfile

Explanation:

s| |\n|g; - Divide and conquer
/^http/{ - If pattern space begins with "http"
- s|http://[^/]*/id$[0-9]*$|\1|; - capture the id
- P - Print the string preceding the first newline
}; - end if
D - Delete the string preceding the first newline regardless of whether it contains "http"

Edit:

This version uses the same technique but is more selective.

sed -n 's|http://|\n&|g;/^\n*http/{s|\n*http://[^/]*/id\([0-9]*\)|\1\n|;P};D' inputfile

继续阅读：sed

How to use sed-awk-gawk to display a matched string

How it works

Two-sed solution

更多精彩内容

精彩评论

最新问答

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

哪里医院专治输卵管堵塞好？

外语基础薄弱的人出国自由行，带哪种翻译器比较好？？

输卵管积液手术价格？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？

How it works

Two-sed solution

更多精彩内容

精彩评论

最新问答

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

哪里医院专治输卵管堵塞好？

外语基础薄弱的人出国自由行，带哪种翻译器比较好？？

输卵管积液手术价格？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生 新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？