Is it more efficient to grep twice or use a regular expression once?
I'm trying to parse a couple of 2gb+ files and want to grep on a couple of levels.
Say I want to fetch lines that contain "foo" and lines that also contain "bar".
I could do grep foo file.log | grep bar, but my concern is that it will be expensive running it twice.
Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)'
instead?
grep -E '(foo|bar)'
will find lines containing 'foo' OR 'bar'.
You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:
sed '/foo/!d;/bar/!d' file.log
awk '/foo/ && /bar/' file.log
Both commands -- in theory -- should be much more efficient than your cat | grep | grep
construct because:
- Both sed and awk perform their own file reading; no need for pipe overhead
- The 'programs' I gave to sed and awk above use Boolean short-circuiting to quickly skip lines not containing 'foo', thus testing only lines containing 'foo' against the /bar/ regex
However, I haven't tested them. YMMV :)
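For what it's worth, here is a quick sanity check rather than a benchmark: a minimal sketch, assuming a throwaway five-line sample file (the file name and its contents are made up for illustration), that confirms the sed and awk one-liners select the same lines as the piped greps.

```shell
# Build a tiny sample file in a temp directory (hypothetical data).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.log"
printf '%s\n' 'foo then bar' 'bar then foo' 'only foo' 'only bar' 'neither' > "$sample"

# The three candidate ways of selecting lines with BOTH words.
piped=$(grep foo "$sample" | grep bar)
sed_out=$(sed '/foo/!d;/bar/!d' "$sample")
awk_out=$(awk '/foo/ && /bar/' "$sample")

# Prints:
#   foo then bar
#   bar then foo
printf '%s\n' "$awk_out"

# Prints "all three agree" when the outputs are identical.
if [ "$piped" = "$sed_out" ] && [ "$piped" = "$awk_out" ]; then
    echo "all three agree"
fi

rm -rf "$tmpdir"
```

This only shows the commands are equivalent in what they select; it says nothing about which is faster on a 2 GB file.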
In theory, the fastest way should be:
grep -E '(foo.*bar|bar.*foo)' file.log
For several reasons: First, grep reads directly from the file, rather than adding the step of having cat read it and stuff it down a pipe for grep to read. Second, it uses only a single instance of grep, so each line of the file only has to be processed once. Third, grep -E
is generally faster than plain grep on large files (but slower on small files), although this will depend on your implementation of grep. Finally, grep (in all its variants) is optimized for string searching, while sed and awk are general-purpose tools that happen to be able to search (but aren't optimized for it).
These two operations are fundamentally different. This one:
cat file.log | grep foo | grep bar
looks for foo in file.log, then looks for bar in whatever the last grep output. Whereas cat file.log | grep -E '(foo|bar)'
looks for either foo or bar in file.log. The output should be very different. Use whatever behavior you need.
As for efficiency, they're not really comparable because they do different things. Both should be fast enough, though.
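To make the difference concrete, here is a small sketch, assuming a made-up five-line sample file, that counts what each form matches. The variable names are only for illustration.

```shell
# Hypothetical sample: two lines with both words, one with each word alone, one with neither.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.log"
printf '%s\n' 'foo then bar' 'bar then foo' 'only foo' 'only bar' 'neither' > "$sample"

# Piped greps: lines containing foo AND bar.
and_count=$(grep foo "$sample" | grep -c bar)
# Plain alternation: lines containing foo OR bar.
or_count=$(grep -c -E '(foo|bar)' "$sample")
# Alternation of both orderings: foo AND bar again, in a single grep.
both_orders_count=$(grep -c -E '(foo.*bar|bar.*foo)' "$sample")

echo "AND via piped greps:        $and_count"          # 2
echo "OR via (foo|bar):           $or_count"           # 4
echo "AND via (foo.*bar|bar.*foo): $both_orders_count" # 2

rm -rf "$tmpdir"
```

So the pattern from the question, '(foo.*bar|bar.*foo)', is the one that matches the piped version, not '(foo|bar)'.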
If you're doing this:
cat file.log | grep foo | grep bar
You're only printing lines that contain both foo
and bar
in any order. If this is your intention:
grep -e "foo.*bar" -e "bar.*foo" file.log
will be more efficient, since the file only has to be parsed once.
Notice I don't need the cat
which is more efficient in itself. You rarely ever need cat
unless you are concatenating files (which is the purpose of the command). 99% of the time you can either add a file name to the end of the first command in a pipe, or if you have a command like tr
that doesn't allow you to use a file, you can always redirect the input like this:
tr 'a-z' 'A-Z' < "$fileName"
But, enough about useless cats. I have two at home.
You can pass multiple regular expressions to a single grep, which is usually a bit more efficient than piping multiple greps. However, if you can eliminate regular expressions, you might find this the most efficient:
fgrep "foo" file.log | fgrep "bar"
Unlike grep, fgrep doesn't parse regular expressions, which means it can parse lines much, much faster. Try this:
time fgrep "foo" file.log | fgrep "bar"
and
time grep -e "foo.*bar" -e "bar.*foo" file.log
And see which is faster.
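If you want to try the comparison without touching the real logs first, here is a rough sketch under a few assumptions: the input is a synthetic file generated with awk, the function names (two_pass, one_pass, run_timed) are made up for this example, and timing uses date +%s (whole seconds, widely available but not strictly POSIX) so it does not depend on the shell's time keyword. It uses grep -F, which is the same thing as fgrep; newer GNU grep releases print an obsolescence warning for the fgrep spelling.

```shell
# Synthetic input: "foo" on every 3rd line, "bar" on every 5th, so every
# 15th line has both; "bar" goes first on even lines and last on odd ones.
tmpdir=$(mktemp -d)
big="$tmpdir/big.log"
awk 'BEGIN {
    for (i = 1; i <= 300000; i++) {
        line = "line " i
        if (i % 3 == 0) line = line " foo"
        if (i % 5 == 0) {
            if (i % 2 == 0) line = "bar " line
            else            line = line " bar"
        }
        print line
    }
}' > "$big"

# The two approaches from the answer, wrapped so they can be timed alike.
two_pass() { grep -F foo "$1" | grep -F bar; }
one_pass() { grep -e 'foo.*bar' -e 'bar.*foo' "$1"; }

# Run a function on the file, discard its output, report whole seconds.
run_timed() {
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo "$1: $((end - start))s"
}

run_timed two_pass "$big"
run_timed one_pass "$big"

# Check that both approaches select the same number of lines.
two_pass_count=$(two_pass "$big" | wc -l | tr -d ' ')
one_pass_count=$(one_pass "$big" | wc -l | tr -d ' ')
echo "two_pass matched $two_pass_count lines, one_pass matched $one_pass_count"

rm -rf "$tmpdir"
```

On a file this small both timings will likely read 0s; the numbers only become meaningful when the functions are pointed at something the size of the real file.log, and a second run may be faster simply because the file is already in the page cache.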