Why is this grep filter slow?

2023-02-18 07:23 Q&A author:

I want to get the first two letters of every word in the BSD dict word list, excluding words that are only one letter long.

Without the one-letter exclusion it runs extremely fast:

time cat /usr/share/dict/web2 | cut -c 1-2 | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null

real    0m0.227s
user    0m0.375s
sys 0m0.021s

Filtering with grep '..', however, is painfully slow:

time cat /usr/share/dict/web2 | cut -c 1-2 | grep '..' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null

real    1m16.319s
user    1m0.694s
sys 0m10.225s

What's going on here?

The problem is the UTF-8 Locale, easy workaround for 100x speedup

What's really slow on the Mac is the UTF-8 locale.

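Replace grep '..' with LC_ALL=C grep '..' and your command will run over 100x faster. In the C locale grep treats every byte as one character, so it skips the multibyte UTF-8 decoding.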

This is probably true of Linux as well, except that a given Linux distro may be more likely to default to the C locale.
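
As a rough sketch of the workaround, here is the question's pipeline with LC_ALL=C applied only to the grep stage. The tiny word list and the temp file name are made up for illustration, standing in for /usr/share/dict/web2, and it is far too small to show any timing difference.

```shell
# Hypothetical sample data standing in for /usr/share/dict/web2.
words=$(mktemp "${TMPDIR:-/tmp}/words.XXXXXX")
printf '%s\n' a apple apricot b banana be > "$words"

# LC_ALL=C is set for grep alone, so the other stages keep the session locale.
result=$(cut -c 1-2 "$words" | LC_ALL=C grep '..' | tr 'a-z' 'A-Z' | uniq -c)
printf '%s\n' "$result"

rm -f "$words"
```

Setting the variable as a prefix on the one command keeps the change local. Exporting LC_ALL=C for the whole shell would also work, but it affects every later command in that session.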

I don't know why it is so awful. But I know one quick way to speed it up is to invert your grep(1) expression with -v, and throw away all one-character lines:

$ time cat /usr/share/dict/words | cut -c 1-2 | grep -v '^.$' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null

real    0m0.086s
user    0m0.090s
sys  0m0.000s

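This might run a little better, and it also removes the need for cut and its extra pipe. The pattern matches exactly two characters, so -o prints only the first two letters and lines shorter than that produce no output:

cat /usr/share/dict/web2 | grep -Eo '^.{2}' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null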

It might be faster still if you cut down on the excessive pipes and the useless cat. The length check skips the one-letter words:

$ awk 'length($0) > 1 { a[toupper(substr($0,1,2))]++ } END { for (i in a) print i, a[i] }' file
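
A minimal sketch of the same awk approach on a made-up word list, with a length check to skip one-letter lines. The sample words and temp file are assumptions for illustration. The trailing sort is there because the order of awk's for (i in a) loop is unspecified.

```shell
# Hypothetical sample data standing in for the dictionary file.
words=$(mktemp "${TMPDIR:-/tmp}/words.XXXXXX")
printf '%s\n' a apple apricot b banana be > "$words"

# Count upper-cased two-letter prefixes, skipping lines shorter than two characters.
counts=$(awk 'length($0) > 1 { a[toupper(substr($0, 1, 2))]++ }
              END { for (i in a) print i, a[i] }' "$words" | sort)
printf '%s\n' "$counts"

rm -f "$words"
```

Note that this prints the prefix first and the count second, the reverse of uniq -c.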
