开发者

Ruby reduce all whitespace to single spaces

I'm not sure how do this, as I'm pretty new to regular expressions, and can't seem to开发者_如何学Go find the proper method to accomplish this but say I have the following as a string (all tabs, and newlines included)

1/2 cup  




            onion           
             (chopped)

How can I remove all the whitespace and replace each instance with just a single space?


This is a case where regular expressions work well, because you want to treat the whole class of whitespace characters the same and replace runs of any combination of whitespace with a single space character. So if that string is stored in s, then you would do:

fixed_string = s.gsub(/\s+/, ' ')


Within Rails you can use String#squish, which is an active_support extensions.

require 'active_support'

s = <<-EOS
1/2 cup  

            onion
EOS

s.squish
# => 1/2 cup onion


You want the squeeze method:

str.squeeze([other_str]*) → new_str
Builds a set of characters from the other_str parameter(s) using the procedure described for String#count. Returns a new string where runs of the same character that occur in this set are replaced by a single character. If no arguments are given, all runs of identical characters are replaced by a single character.

   "yellow moon".squeeze                  #=> "yelow mon"
   "  now   is  the".squeeze(" ")         #=> " now is the"
   "putters shoot balls".squeeze("m-z")   #=> "puters shot balls"


The problem with the simplest solution gsub(/\s+/, ' ') is that it is very SLOW, as it replaces every space, even if it is single. But usually there is 1 space between words and we should fix only if there are 2 or more whitespaces in sequence.

Better solution is tr("\r\n\t", ' ').gsub(/ {2,}/, ' ') – first replace special whitespacing to ordinary spaces (tr works faster than gsub for replacing 1 char) and then squeeze spaces only if there are 2 or more consecutive spaces.

def method1(s) s.gsub!(/\s+/, ' '); s end
def method2(s) s.tr!("\r\n\t", ' '); s.gsub!(/ {2,}/, ' '); s end

Benchmark.bm do |x|
  n = 100_000
  x.report('method1') { n.times { method1("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('method2') { n.times { method2("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
end;1

#          user     system      total        real
# method1  2.907425   0.024254   2.931679 (  3.406144)
# method2  0.644329   0.011254   0.655583 (  0.658699)    


The selected answer will not remove non-breaking space characters.

This should work in 1.9:

fixed_string = s.gsub(/(\s|\u00A0)+/, ' ')


If speed is a concern then your best bet is this.

.tr("\r\n\t", ' ').gsub(/ {2,}/, ' ')

This replaces whitespace characters with a space then replaces multiple spaces with a single space.

I saw the benchmark that Lev posted and compared variations of gsub .sqeeze .tr and .squish. I expanded his benchmark to try them out and while .squeeze is the fastest it does not answer the questions since it would only compress multiple tabs/new lines to a singe tab/new line.

# Replace multiple whitespace characters with a single space.
def method1(s) s.gsub!(/\s+/, ' '); s end # (in place)
def method2(s) s = s.gsub(/\s+/, ' '); s end

# Replace characters with a space then replace multiple spaces with a single space.
def method3(s) s.gsub!(/[\r\n\t]/, ' '); s.gsub!(/ {2,}/, ' '); s end # (in place)
def method4(s) s = s.gsub(/[\r\n\t]/, ' ').gsub(/ {2,}/, ' '); s end

# Replace characters with a space then replace multiple spaces with a single space.
def method5(s) s.tr!("\r\n\t", ' '); s.gsub!(/ {2,}/, ' '); s end # (in place)
def method6(s) s = s.tr("\r\n\t", ' ').gsub(/ {2,}/, ' '); s end

# Replace multiple whitespace characters with a single space.
def method7(s) s.squish!; s end # (in place)
def method8(s) s = s.squish; s end

# Combines multiple spaces into a single space
def method9(s) s.squeeze!(" "); s end # (in place)
def method10(s) s = s.squeeze(" "); s end

Benchmark.bm do |x|
  n = 100_000
  x.report('.gsub!      ') { n.times { method1("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('.gsub       ') { n.times { method2("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('.gsub!.gsub!') { n.times { method3("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('.gsub .gsub ') { n.times { method4("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('.tr!.gsub!  ') { n.times { method5("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('.tr .gsub   ') { n.times { method6("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('.squish     ') { n.times { method7("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('.squish!    ') { n.times { method8("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('.squeeze!   ') { n.times { method9("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
  x.report('.squeeze    ') { n.times { method10("Lorem   ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
end

Which gets these results

=>
#               user       system     total       real
# .gsub!        2.019544   0.030325   2.049869 (  2.059379)
# .gsub         1.968179   0.011204   1.979383 (  1.988050)
# .gsub!.gsub!  0.770042   0.014097   0.784139 (  0.787055)
# .gsub .gsub   0.728955   0.011577   0.740532 (  0.742887)
# .tr!.gsub!    0.487014   0.008260   0.495274 (  0.496820)
# .tr .gsub     0.487231   0.007769   0.495000 (  0.497164)
# .squish!      2.005224   0.011673   2.016897 (  2.025851)
# .squish       2.043497   0.013331   2.056828 (  2.066794)
# .squeeze!     0.117615   0.002004   0.119619 (  0.120140)
# .squeeze      0.196301   0.012094   0.208395 (  0.209267)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜