Using ? with sed
I just want to get the number of a file that may or may not be gzip'd. However, it appears that a regular expression in sed does not support a ?. Here's what I tried:
echo 'file_1.gz'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
and nothing was returned. Then I added a ? to the string being analyzed:
echo 'file_1.gz?'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
and got:
1
So, it looks like the ? used in most regexes is not supported in sed, right? Well then, I would just like sed to give a 1 for both file_1 and file_1.gz. What's the best way to do that in a bash script if execution time is critical?
The equivalent to x? is \(x\|\).
However, many versions of sed support an option to enable "extended regular expressions", which include ?. In GNU sed the flag is -r. Note that this also changes unescaped parens to do grouping, e.g.:
echo 'file_1.gz'|sed -n -r 's/.*_(.*)(\.gz)?/\1/p'
Actually, there's another bug in your regex: the greedy .* in the parens is going to swallow up the ".gz" if there is one. sed doesn't have a non-greedy equivalent to * as far as I know, but you can use | to work around this. | in sed (and many other regex implementations) will use the leftmost match that works, so you can do something like this:
echo 'file_1.gz'|sed -r 's/(.*_(.*)\.gz)|(.*_(.*))/\2\4/'
This tries to match with .gz, and only tries without it if that doesn't work. Only one of group 2 or 4 will actually exist (since they are on opposite sides of the same |), so we just concatenate them to get the value we want.
If you're looking for an answer to the specific example given in the question, or why it uses the ? incorrectly (regardless of syntax), see the answer by Laurence Gonsalves.
If you're looking instead for the answer to the general question of why ? doesn't exhibit its special meaning in sed as you might expect:
By default, sed uses the "POSIX basic regular expressions" syntax, so the question mark must be escaped as \? to apply its special meaning; otherwise it matches a literal question mark. As an alternative, you can use the -r or --regexp-extended option to use the "extended regular expression" syntax, which reverses the meaning of escaped and non-escaped special characters, including ?.
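To see the difference concretely, here is a small sketch. The sample text 'color colour' and the three function names are made up for illustration, and it assumes a sed that accepts -E. The \{0,1\} interval is the basic-syntax way to say "optional" without relying on \? being supported.

```shell
# Basic syntax: an unescaped ? is a literal question mark, so nothing matches.
bre_literal() { printf '%s\n' 'color colour' | sed 's/colou?r/X/g'; }

# Basic syntax: \{0,1\} means "zero or one of the preceding item".
bre_interval() { printf '%s\n' 'color colour' | sed 's/colou\{0,1\}r/X/g'; }

# Extended syntax: an unescaped ? makes the preceding item optional.
ere_optional() { printf '%s\n' 'color colour' | sed -E 's/colou?r/X/g'; }

bre_literal     # prints: color colour
bre_interval    # prints: X X
ere_optional    # prints: X X
```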
In the words of the GNU sed documentation (view by running 'info sed' on Linux):
The only difference between basic and extended regular expressions is in the behavior of a few characters: '?', '+', parentheses, and braces ('{}'). While basic regular expressions require these to be escaped if you want them to behave as special characters, when using extended regular expressions you must escape them if you want them to match a literal character.
and the option is explained:
-r
--regexp-extended
Use extended regular expressions rather than basic regular expressions. Extended regexps are those that `egrep' accepts; they can be clearer because they usually have less backslashes, but are a GNU extension and hence scripts that use them are not portable.
Update
Newer versions of GNU sed now say this:
-E
-r
--regexp-extended
Use extended regular expressions rather than basic regular expressions. Extended regexps are those that 'egrep' accepts; they can be clearer because they usually have fewer backslashes. Historically this was a GNU extension, but the '-E' extension has since been added to the POSIX standard (http://austingroupbugs.net/view.php?id=528), so use '-E' for portability. GNU sed has accepted '-E' as an undocumented option for years, and *BSD seds have accepted '-E' for years as well, but scripts that use '-E' might not port to other older systems.
So, if you need to preserve compatibility with ancient GNU sed, stick with -r. But if you prefer better cross-platform portability on more modern systems (e.g. Linux+Mac support), go with -E (but note that there are still some quirks and differences between GNU sed and BSD sed, so you'll have to make sure your scripts are portable in any case).
echo 'file_1.gz'|sed -n 's/.*_\(.*\)\?\(\.gz\)/\1/p'
Works. You have to put the return in the right spot, and you have to escape it.
A function that should return a number that follows the '_' in a filename, regardless of file extension:
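realname () {
    local n=${1##*/}        # strip any leading directory
    local rn="${n%.*}"      # strip the last extension, if any
    # pass the name to sed as text (not as a file to read) and drop everything up to the last "_"
    printf '%s\n' "${rn:-$n}" | sed 's/^.*_//'
}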
You should use awk, which is superior to sed when it comes to field grabbing/parsing:
$ awk -F'[._]' '{print $2}' <<<"file_1"
1
$ awk -F'[._]' '{print $2}' <<<"file_1.gz"
1
Alternatively you can just use Bash's parameter expansion like so:
var=file_1.gz;
temp=${var#*_};
file=${temp%.*}
echo $file
Note: this works when var=file_1 as well.
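Since the question says execution time matters, the expansion above can be wrapped in a function that never starts a sed or awk process. This is only a sketch: the name number_of is made up here, and it assumes the number is whatever sits between the first underscore and the last dot.

```shell
# number_of NAME: print the part between the first "_" and the last "."
number_of() {
    local temp="${1#*_}"          # drop everything up to the first underscore
    printf '%s\n' "${temp%.*}"    # drop the last .suffix, if there is one
}

number_of file_1.gz
number_of file_1
```

Both calls should print 1, with no external command involved.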
Part of the solution lies in escaping the question mark or using the -r option.
sed 's/.*_\([^.]*\)\(\.\?[^.]\+\)\?$/\1/'
or
sed -r 's/.*_([^.]*)(\.?[^.]+)?$/\1/'
will work for:
file_1.gz
file_12.txt
file_123
resulting in:
1
12
123
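A quick way to check this against the three names is a loop like the one below. It is a sketch that uses -E in place of -r (current GNU sed accepts both for extended syntax), and the function name strip_to_number is hypothetical.

```shell
# strip_to_number NAME: print the text between the last "_" and the
# extension. Written for names with at most one dot, as in the list above.
strip_to_number() {
    printf '%s\n' "$1" | sed -E 's/.*_([^.]*)(\.?[^.]+)?$/\1/'
}

for f in file_1.gz file_12.txt file_123; do
    strip_to_number "$f"
done
```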
I just realized that I could do something very easy:
echo 'file_1.gz'|sed -n 's/.*_\([0-9]*\).*/\1/p'
Notice the [0-9]* instead of a .*. @Laurence Gonsalves's answer made me realize the greediness of my previous post.
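Wrapped in a function, that idea might look like the sketch below. The name digits_after_underscore is made up for illustration. The regex is the one from this post and uses only basic syntax, so no -r or -E is needed.

```shell
# digits_after_underscore NAME: print the run of digits that directly
# follows the last "_" in NAME (prints nothing if NAME has no "_").
digits_after_underscore() {
    printf '%s\n' "$1" | sed -n 's/.*_\([0-9]*\).*/\1/p'
}

digits_after_underscore file_1.gz
digits_after_underscore file_1
```

Both calls should print 1, since the trailing .* absorbs the extension whether or not it is present.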