Ruby reduce all whitespace to single spaces
I'm not sure how to do this, as I'm pretty new to regular expressions and can't seem to find the proper method to accomplish it. Say I have the following as a string (all tabs and newlines included):
1/2 cup
onion
(chopped)
How can I remove all the whitespace and replace each instance with just a single space?
This is a case where regular expressions work well, because you want to treat the whole class of whitespace characters the same and replace runs of any combination of whitespace with a single space character. So if that string is stored in s, then you would do:
fixed_string = s.gsub(/\s+/, ' ')
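For illustration, here is that call applied to a string shaped like the one in the question. The exact mix of tabs and newlines in the sample is an assumption, since the original whitespace was not preserved, and the strip call is an optional extra that only drops the space left at the ends.

```ruby
# Sample input; the exact tabs and newlines are assumed for illustration.
s = "1/2 cup\n\t\t\tonion\n   (chopped)\n"

# Collapse every run of whitespace into one space.
fixed_string = s.gsub(/\s+/, ' ')
p fixed_string   # => "1/2 cup onion (chopped) "

# Optionally trim the space left over from the trailing newline.
trimmed = fixed_string.strip
p trimmed        # => "1/2 cup onion (chopped)"
```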
Within Rails you can use String#squish, which is an active_support extension.
require 'active_support/core_ext/string' # loads the String extensions, including squish
s = <<-EOS
1/2 cup
onion
EOS
s.squish
# => "1/2 cup onion"
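Outside Rails, a rough plain-Ruby stand-in is possible. The helper below is only a sketch and not ActiveSupport's own code; the name squish_like is hypothetical. It collapses whitespace runs and trims both ends, which is the behavior squish is documented to have.

```ruby
# Hypothetical helper approximating String#squish without ActiveSupport.
def squish_like(str)
  # Collapse whitespace runs to one space, then trim both ends.
  str.gsub(/[[:space:]]+/, ' ').strip
end

s = "  1/2 cup\n\t onion\n"
p squish_like(s)   # => "1/2 cup onion"
```

Unlike squish!, this version returns a new string and leaves the original untouched.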
You want the squeeze method:
str.squeeze([other_str]*) → new_str
Builds a set of characters from the other_str parameter(s) using the procedure described for String#count. Returns a new string where runs of the same character that occur in this set are replaced by a single character. If no arguments are given, all runs of identical characters are replaced by a single character.
"yellow moon".squeeze #=> "yelow mon"
"  now   is  the".squeeze(" ") #=> " now is the"
"putters shoot balls".squeeze("m-z") #=> "puters shot balls"
The problem with the simplest solution, gsub(/\s+/, ' '), is that it is slow: it replaces every whitespace match, even a single space. Usually there is only one space between words, so a replacement is needed only where two or more whitespace characters occur in sequence.
A better solution is tr("\r\n\t", ' ').gsub(/ {2,}/, ' '): first replace the special whitespace characters with ordinary spaces (tr is faster than gsub for replacing single characters), then collapse spaces only where there are two or more in a row.
require 'benchmark'
def method1(s) s.gsub!(/\s+/, ' '); s end
def method2(s) s.tr!("\r\n\t", ' '); s.gsub!(/ {2,}/, ' '); s end
Benchmark.bm do |x|
n = 100_000
x.report('method1') { n.times { method1("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('method2') { n.times { method2("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
end
#               user     system      total        real
# method1   2.907425   0.024254   2.931679 (  3.406144)
# method2   0.644329   0.011254   0.655583 (  0.658699)
The selected answer will not remove non-breaking space characters.
This should work in 1.9:
fixed_string = s.gsub(/(\s|\u00A0)+/, ' ')
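An alternative that should give the same result is the POSIX bracket class [[:space:]]. In Ruby 1.9 and later it matches Unicode whitespace such as U+00A0 in UTF-8 strings, while \s covers only ASCII whitespace. The sample string below is assumed for illustration.

```ruby
# Sample with non-breaking spaces (U+00A0) mixed in; assumed input.
s = "1/2\u00A0cup \u00A0\n onion"

# \s leaves the non-breaking spaces in place.
ascii_only = s.gsub(/\s+/, ' ')
p ascii_only   # => "1/2 cup   onion" with U+00A0 still present

# [[:space:]] treats U+00A0 as whitespace too.
unicode_aware = s.gsub(/[[:space:]]+/, ' ')
p unicode_aware   # => "1/2 cup onion"
```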
If speed is a concern then your best bet is this.
.tr("\r\n\t", ' ').gsub(/ {2,}/, ' ')
This replaces whitespace characters with a space then replaces multiple spaces with a single space.
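One caveat worth checking: \s also matches form feed and vertical tab, which tr("\r\n\t", ' ') leaves alone. The sketch below uses a hypothetical helper, collapse_fast, that lists those characters as well and uses squeeze(' ') for the second step. It has not been timed here, so any speed benefit is an assumption to verify with your own benchmark.

```ruby
# Hypothetical helper: map each ASCII whitespace character to a space,
# then collapse runs of spaces without using a regex.
def collapse_fast(str)
  str.tr("\r\n\t\f\v", ' ').squeeze(' ')
end

s = "Lorem ipsum\n\n dolor \t\t\tsit\famet"

p collapse_fast(s)                         # => "Lorem ipsum dolor sit amet"
p collapse_fast(s) == s.gsub(/\s+/, ' ')   # => true

# The shorter tr list misses the form feed:
p s.tr("\r\n\t", ' ').gsub(/ {2,}/, ' ')   # => "Lorem ipsum dolor sit\famet"
```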
I saw the benchmark that Lev posted and compared variations of .gsub, .squeeze, .tr and .squish. I expanded his benchmark to try them out. While .squeeze is the fastest, it does not answer the question, since it only collapses runs of the same character: squeeze(" ") merges repeated spaces but leaves tabs and newlines as they are instead of turning them into a space.
require 'benchmark'
require 'active_support/core_ext/string' # provides squish and squish!
# Replace multiple whitespace characters with a single space.
def method1(s) s.gsub!(/\s+/, ' '); s end # (in place)
def method2(s) s = s.gsub(/\s+/, ' '); s end
# Replace characters with a space then replace multiple spaces with a single space.
def method3(s) s.gsub!(/[\r\n\t]/, ' '); s.gsub!(/ {2,}/, ' '); s end # (in place)
def method4(s) s = s.gsub(/[\r\n\t]/, ' ').gsub(/ {2,}/, ' '); s end
# Replace characters with a space using tr, then replace multiple spaces with a single space.
def method5(s) s.tr!("\r\n\t", ' '); s.gsub!(/ {2,}/, ' '); s end # (in place)
def method6(s) s = s.tr("\r\n\t", ' ').gsub(/ {2,}/, ' '); s end
# Replace multiple whitespace characters with a single space.
def method7(s) s.squish!; s end # (in place)
def method8(s) s = s.squish; s end
# Combines multiple spaces into a single space
def method9(s) s.squeeze!(" "); s end # (in place)
def method10(s) s = s.squeeze(" "); s end
Benchmark.bm do |x|
n = 100_000
x.report('.gsub! ') { n.times { method1("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('.gsub ') { n.times { method2("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('.gsub!.gsub!') { n.times { method3("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('.gsub .gsub ') { n.times { method4("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('.tr!.gsub! ') { n.times { method5("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('.tr .gsub ') { n.times { method6("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('.squish! ') { n.times { method7("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('.squish ') { n.times { method8("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('.squeeze! ') { n.times { method9("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
x.report('.squeeze ') { n.times { method10("Lorem ipsum\n\n dolor \t\t\tsit amet, consectetur\n \n\t\n adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.") } }
end
Which gets these results:
#                    user     system      total        real
# .gsub!         2.019544   0.030325   2.049869 (  2.059379)
# .gsub          1.968179   0.011204   1.979383 (  1.988050)
# .gsub!.gsub!   0.770042   0.014097   0.784139 (  0.787055)
# .gsub .gsub    0.728955   0.011577   0.740532 (  0.742887)
# .tr!.gsub!     0.487014   0.008260   0.495274 (  0.496820)
# .tr .gsub      0.487231   0.007769   0.495000 (  0.497164)
# .squish!       2.005224   0.011673   2.016897 (  2.025851)
# .squish        2.043497   0.013331   2.056828 (  2.066794)
# .squeeze!      0.117615   0.002004   0.119619 (  0.120140)
# .squeeze       0.196301   0.012094   0.208395 (  0.209267)
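To make the limitation of the squeeze(" ") variants in this benchmark concrete, here is a small sketch comparing their output with the regex version. The sample string is assumed for illustration.

```ruby
# Assumed sample containing newlines, tabs and a run of spaces.
s = "1/2 cup\n\n\t\tonion   (chopped)"

# squeeze(" ") merges the run of spaces but keeps newlines and tabs.
spaces_only = s.squeeze(" ")
p spaces_only   # => "1/2 cup\n\n\t\tonion (chopped)"

# The regex version turns every whitespace run into one space.
normalized = s.gsub(/\s+/, ' ')
p normalized    # => "1/2 cup onion (chopped)"
```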