Ruby string search: which is faster split or regex?

2023-04-06 20:30 问答作者：

This is a two part question. Given you have an array of strings that can be split at a character (e.g. email addresses at '@' or file names at '.') which is the the most performant way of finding the characters before the split character?

my_string.split(char开发者_JAVA百科)[0]

my_string[/regex/]

The second part of the question is what's how do you write a regex to get everything before the first instance of a character. The regex below finds certain characters before a '.' (because '.' is not in the pattern) but that was my hacky way to get to a solution.

my_string[/[A-Za-z0-9\_-]+/]

thanks!

The easiest way to answer the first part is to, as always, benchmark it with your real data. For example:

require 'benchmark'
Benchmark.bm do |x|
  x.report { 50000.times { a = 'a@b.c'.split('@')[0] } }
  x.report { 50000.times { a = 'a@b.c'[/[^@]+/] } }
end

says (on my setup):

      user     system      total        real
  0.130000   0.010000   0.140000 (  0.130946)
  0.090000   0.000000   0.090000 (  0.096260)

So the regex solution looks a little bit faster but the difference is barely noticeable even with 50 000 iterations. OTOH, the regex solution says exactly what you mean ("give me everything before the first @") whereas the split solution gets your desired result in a slightly roundabout way.

The split approach is probably slower because it has to scan the whole string to split it into pieces, then build an array of the pieces, and finally extract the first element of the array and throw the rest away; I don't know if the VM is clever enough to recognize that it doesn't need to build the array so that's just a bit of quick guess work.

As far as your second question is concerned, say what you mean:

my_string[/[^.]+/]

If you want everything before the first period then say "everything until a period" rather than "the first chunk that is made of these characters (which happen to not contain a period)".

partition will be quicker than split as it won't continue checking after the first match.

A regular slice with index will be quicker than a regexp slice.

The regexp slice also slows down considerably as the portion of the string before the match becomes larger. It becomes slower than the original split after ~ 10 characters and then becomes a whole lot worse from there on in. If you have a Regexp without a + or * match, I think it fairs a bit better.

require 'benchmark'
n=1000000

def bench n,email
  printf "\n%s %s times\n", email, n
  Benchmark.bm do |x|
      x.report('split    ') do n.times{ email.split('@')[0]  } end
      x.report('partition') do n.times{ email.partition('@').first  } end
      x.report('slice reg') do n.times{ email[/[^@]+/]  } end
      x.report('slice ind') do n.times{ email[0,email.index('@')]  } end
  end
end


bench n, 'a@be.pl'
bench n, 'some_name@regulardomain.com'
bench n, 'some_really_long_long_email_name@regulardomain.com'
bench n, 'some_name@rediculously-extra-long-silly-domain.com'
bench n, 'some_really_long_long_email_name@rediculously-extra-long-silly-domain.com'
bench n, 'a'*254 + '@' + 'b'*253    # rfc limits
bench n, 'a'*1000 + '@' + 'b'*1000  # for other string processing

Results 1.9.3p484:

a@be.pl 1000000 times
       user     system      total        real
split      0.405000   0.000000   0.405000 (  0.410023)
partition  0.375000   0.000000   0.375000 (  0.368021)
slice reg  0.359000   0.000000   0.359000 (  0.357020)
slice ind  0.312000   0.000000   0.312000 (  0.309018)

some_name@regulardomain.com 1000000 times
       user     system      total        real
split      0.421000   0.000000   0.421000 (  0.432025)
partition  0.374000   0.000000   0.374000 (  0.379021)
slice reg  0.421000   0.000000   0.421000 (  0.411024)
slice ind  0.312000   0.000000   0.312000 (  0.315018)

some_really_long_long_email_name@regulardomain.com 1000000 times
       user     system      total        real
split      0.593000   0.000000   0.593000 (  0.589034)
partition  0.531000   0.000000   0.531000 (  0.529030)
slice reg  0.764000   0.000000   0.764000 (  0.771044)
slice ind  0.484000   0.000000   0.484000 (  0.478027)

some_name@rediculously-extra-long-silly-domain.com 1000000 times
       user     system      total        real
split      0.483000   0.000000   0.483000 (  0.481028)
partition  0.390000   0.016000   0.406000 (  0.404023)
slice reg  0.406000   0.000000   0.406000 (  0.411024)
slice ind  0.312000   0.000000   0.312000 (  0.344020)

some_really_long_long_email_name@rediculously-extra-long-silly-domain.com 1000000 times
       user     system      total        real
split      0.639000   0.000000   0.639000 (  0.646037)
partition  0.609000   0.000000   0.609000 (  0.596034)
slice reg  0.764000   0.000000   0.764000 (  0.773044)
slice ind  0.499000   0.000000   0.499000 (  0.491028)

a<254>@b<253> 1000000 times
       user     system      total        real
split      0.952000   0.000000   0.952000 (  0.960055)
partition  0.733000   0.000000   0.733000 (  0.731042)
slice reg  3.432000   0.000000   3.432000 (  3.429196)
slice ind  0.624000   0.000000   0.624000 (  0.625036)

a<1000>@b<1000> 1000000 times
       user     system      total        real
split      1.888000   0.000000   1.888000 (  1.892108)
partition  1.170000   0.016000   1.186000 (  1.188068)
slice reg 12.885000   0.000000  12.885000 ( 12.914739)
slice ind  1.108000   0.000000   1.108000 (  1.097063)

2.1.3p242 holds about the same % differences but is about 10-30% quicker at everything, except for the regexp split where it slows down even more.

继续阅读：regex ruby

Ruby string search: which is faster split or regex?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？