Is it more efficient to grep twice or use a regular expression once?

Posted 2023-03-07 02:36

I'm trying to parse a couple of 2gb+ files and want to grep on a couple of levels.

Say I want to fetch lines that contain "foo" and lines that also contain "bar".

I could do grep foo file.log | grep bar, but my concern is that it will be expensive running it twice.

Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)' instead?

grep -E '(foo|bar)' will find lines containing 'foo' OR 'bar'.

You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:

sed '/foo/!d;/bar/!d' file.log

awk '/foo/ && /bar/' file.log

Both commands -- in theory -- should be much more efficient than your cat | grep | grep construct because:

Both sed and awk read the file themselves, so there is no pipe overhead.
The 'programs' I gave to sed and awk above short-circuit: lines not containing 'foo' are skipped immediately, so only lines containing 'foo' are tested against the /bar/ regex.

However, I haven't tested them. YMMV :)
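As a quick sanity check, here is a minimal sketch that runs the sed and awk one-liners and the two-grep pipeline against a tiny sample file and keeps each result for comparison. The temp file and its five lines are made up for illustration; they are not from the original logs.

```shell
# Build a tiny sample log in a temp file (hypothetical contents).
sample=$(mktemp)
printf '%s\n' 'foo only' 'bar only' 'foo then bar' 'bar then foo' 'neither' > "$sample"

# Three ways to keep only the lines that contain both strings.
sed_out=$(sed '/foo/!d;/bar/!d' "$sample")
awk_out=$(awk '/foo/ && /bar/' "$sample")
grep_out=$(grep foo "$sample" | grep bar)

# Print one of the results; the other two can be compared against it.
printf '%s\n' "$sed_out"
rm -f "$sample"
```

On this sample all three select the same two lines, so the choice between them comes down to speed on your real files.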

In theory, the fastest way should be:

grep -E '(foo.*bar|bar.*foo)' file.log

For several reasons: First, grep reads directly from the file, rather than adding the step of having cat read it and stuff it down a pipe for grep to read. Second, it uses only a single instance of grep, so each line of the file only has to be processed once. Third, grep -E may be faster than plain grep on large files (and slower on small files), but this depends on your implementation of grep, so measure rather than assume. Finally, grep (in all its variants) is optimized for string searching, while sed and awk are general-purpose tools that happen to be able to search (but aren't optimized for it).

These two operations are fundamentally different. This one:

cat file.log | grep foo | grep bar

looks for foo in file.log, then looks for bar in whatever the previous grep printed. Whereas cat file.log | grep -E '(foo|bar)' looks for either foo or bar in file.log. The output should be very different. Use whichever behavior you need.

As for efficiency, they're not really comparable because they do different things. Both should be fast enough, though.
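To make the AND versus OR difference concrete, here is a small sketch that counts matching lines both ways. The sample file and its four lines are hypothetical.

```shell
# Tiny sample file (hypothetical contents).
sample=$(mktemp)
printf '%s\n' 'foo only' 'bar only' 'foo and bar' 'neither' > "$sample"

# AND: the second grep only sees lines that already matched foo.
and_count=$(grep foo "$sample" | grep -c bar)

# OR: a single alternation matches a line containing either string.
or_count=$(grep -c -E '(foo|bar)' "$sample")

echo "AND: $and_count  OR: $or_count"
rm -f "$sample"
```

Here only one line contains both strings, while three contain at least one of them.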

If you're doing this:

cat file.log | grep foo | grep bar

You're only printing lines that contain both foo and bar in any order. If this is your intention:

grep -e "foo.*bar" -e "bar.*foo" file.log

This will be more efficient, since the file only has to be parsed once.

Notice I don't need the cat, which is more efficient in itself. You rarely need cat unless you are concatenating files (which is the purpose of the command). 99% of the time you can either add a file name to the end of the first command in a pipe, or, if you have a command like tr that doesn't accept a file name, you can redirect the input like this:

tr 'a-z' 'A-Z' < "$fileName"
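A minimal sketch of the same idea, comparing the cat pipeline with plain input redirection. The temp file and its one line of text are made up for the example; fileName just mirrors the variable used above.

```shell
# Temp file standing in for $fileName (hypothetical contents).
fileName=$(mktemp)
printf '%s\n' 'useless use of cat' > "$fileName"

# With cat: an extra process and a pipe.
with_cat=$(cat "$fileName" | tr 'a-z' 'A-Z')

# With redirection: tr reads the file on its standard input directly.
with_redirect=$(tr 'a-z' 'A-Z' < "$fileName")

echo "$with_redirect"
rm -f "$fileName"
```

Both forms produce the same text; the redirect simply skips the extra process.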

But, enough about useless cats. I have two at home.

You can pass multiple regular expressions to a single grep which is usually a bit more efficient than piping multiple greps. However, if you can eliminate regular expressions, you might find this the most efficient:

fgrep "foo" file.log | fgrep "bar"

Unlike grep, fgrep (equivalent to grep -F) matches fixed strings rather than regular expressions, which can let it scan lines much faster. Try this:

time fgrep "foo" file.log | fgrep "bar"

and

time grep -e "foo.*bar" -e "bar.*foo" file.log

And see which is faster.
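Before comparing timings, it is worth confirming that both commands select the same lines. The sketch below does that on a synthetic file; the 20000 generated lines are an assumption for illustration, and grep -F is used in place of fgrep since the two are equivalent.

```shell
# Generate a synthetic log (hypothetical data): lines start with foo or baz
# and end with bar or qux.
log=$(mktemp)
awk 'BEGIN {
  for (i = 1; i <= 20000; i++)
    printf "%s %d %s\n", (i % 3 ? "foo" : "baz"), i, (i % 5 ? "qux" : "bar")
}' > "$log"

# Fixed-string pipeline: two passes, no regex engine.
fixed_count=$(grep -F "foo" "$log" | grep -F -c "bar")

# Single grep with two patterns covering both orders.
regex_count=$(grep -c -e "foo.*bar" -e "bar.*foo" "$log")

echo "fixed-string pipeline: $fixed_count, single regex: $regex_count"
rm -f "$log"
```

Once the counts agree, prefixing each command with time on the real 2gb+ files gives the actual comparison; results will vary with the grep implementation and the data.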
