Parse Lines of Single Words and Groups of Words Inside Quotes Using Regular Expressions in Ruby
I'm trying to figure out how to better parse lines of text that have values that look like this:
line1
'Line two' fudgy whale 'rolly polly'
fudgy 'line three' whale
fudgy whale 'line four'
'line five' 'fish heads'
line six
I wish to use a single regular expression to display the desired output. I already know how to kludge it up to get the desired output, but I want a single expression.
Desired output:
["line1"]
["Line two", "fudgy", "whale", "rolly polly"]
["fudgy", "line three", "whale"]
["fudgy", "whale", "line four"]
["line five", "fish heads"]
["line", "six"]
The line reading is already handled for me via Cucumber. Each line is read as one string value, and I want to parse out single words and any number of words contained inside single quotes. I know less than nothing about regular expressions, but I've cobbled together a regular expression using the regex "or" operator ("|") that got me close.
Taking that regex I first tried parsing each line using a string split:
text_line.split(/(\w+)|'(.*?)'/)
Which resulted in the following, less than acceptable, arrays:
["", "line1"]
["", "Line two", " ", "fudgy", " ", "whale", " ", "rolly polly"]
["", "fudgy", " ", "line three", " ", "whale"]
["", "fudgy", " ", "whale", " ", "line four"]
["", "line five", " ", "fish heads"]
["", "line", " ", "six"]
I next tried using scan instead of a split and I saw this:
text_line.scan(/(\w+)|'(.*?)'/)
[["line1", nil]]
[[nil, "Line two"], ["fudgy", nil], ["whale", nil], [nil, "rolly polly"]]
[["fudgy", nil], [nil, "line three"], ["whale", nil]]
[["fudgy", nil], ["whale", nil], [nil, "line four"]]
[[nil, "line five"], [nil, "fish heads"]]
[["line", nil], ["six", nil]]
So I could see the regex "or" operator was producing one value for each capture group on either side of the "or", which made sense. Knowing that, I figured out I could use scan, flatten, and compact to clean it up, giving me the desired output:
text_line.scan(/(\w+)|'(.*?)'/).flatten.compact
["line1"]
["Line two", "fudgy", "whale", "rolly polly"]
["fudgy", "line three", "whale"]
["fudgy", "whale", "line four"]
["line five", "fish heads"]
["line", "six"]
But using the scan, flatten, and compact looks incredibly ugly and it seems like I'm just monkey patching my own bad regular expression. I'm thinking instead of ham-handedly fixing the sloppy output from my poorly constructed regex I should just write a better regular expression.
So, is it possible to use a single regular expression to parse the above lines and get the desired output? I may be way off on the regex to begin with but I'm thinking if I could just somehow group the or's so they only return one value per group that would probably be what I'm looking for.
Please feel free to suggest alternate solutions but I'm looking for elegant solutions done the Ruby way since I'm trying to teach myself how to use the language.
Thanks in advance for your time.
Edited to incorporate tininfi's better, more accurate regex.
If you want an array of arrays of different sizes, you can do it in two steps: .split and .scan. In your case, the regex passed to .scan has () on both sides of the |, which is why you get the nil values (these are supposed to be useful, but not in your case). So you either have to use .flatten.compact or add a third step, .delete.
text.split("\n").map { |line| p line.scan(/'([^']+)'|(\w+)/).flatten.compact }
text.split("\n").map { |line| p line.scan(/'[^']+'|\w+/).map { |token| token.delete("'") } }
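For comparison, here is a minimal runnable sketch of the two variants above, applied to the sample lines from the question. The constant `TEXT` and the method names `tokens_with_groups` and `tokens_without_groups` are made up for illustration and are not part of any library.

```ruby
# Sample input from the question, one entry per line.
TEXT = <<~LINES
  line1
  'Line two' fudgy whale 'rolly polly'
  fudgy 'line three' whale
  fudgy whale 'line four'
  'line five' 'fish heads'
  line six
LINES

# Variant 1: two capture groups, so scan returns [quoted, word] pairs
# with nil in the unused slot; flatten.compact removes the nils.
def tokens_with_groups(line)
  line.scan(/'([^']+)'|(\w+)/).flatten.compact
end

# Variant 2: no capture groups, so scan returns whole matches as strings.
# The surrounding quotes are still attached and are removed with delete.
def tokens_without_groups(line)
  line.scan(/'[^']+'|\w+/).map { |token| token.delete("'") }
end

# Print the parsed tokens for every sample line.
TEXT.each_line(chomp: true) do |line|
  p tokens_with_groups(line)
end
```

Both methods should give the same arrays for the sample lines. The second avoids the nil values entirely because a regex without capture groups makes scan return plain strings.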
You could simplify the regex to:
'(.*?)'|(\w+)
You still have to use the flatten and compact, but at least it is a bit nicer looking. Not that you specified the need, but this will allow for the string:
'quote one' 'quote two'
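As a small sketch of why flatten and compact are still needed with this pattern, the snippet below shows the raw pairs that scan returns for that string and the flat list after cleanup. `SIMPLE_PATTERN` and `simple_tokens` are hypothetical names chosen for this example.

```ruby
# Simplified pattern from the answer: quoted phrase first, bare word second.
SIMPLE_PATTERN = /'(.*?)'|(\w+)/

# Collapse the [quoted, word] pairs and drop the nils to get a flat list.
def simple_tokens(line)
  line.scan(SIMPLE_PATTERN).flatten.compact
end

sample = "'quote one' 'quote two'"

# One [quoted, word] pair per match; the group that did not match is nil.
p sample.scan(SIMPLE_PATTERN)

# The flattened, nil-free token list.
p simple_tokens(sample)
```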
The approach below was rejected as less elegant than the original solution.
You could try:
regex = %r((\w+)|(?:')([^'\r\n]*)(?:'))
text.split(regex).delete_if { |x| x.strip.empty? }
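A runnable sketch of this split-based idea follows. It assumes the quoted group is meant to exclude single quotes (so that a match cannot run from the first quote on a line to the last one), and the names `SPLIT_PATTERN` and `tokens_by_split` are invented for the example.

```ruby
# Assumed variant of the pattern: the quoted group excludes single quotes.
SPLIT_PATTERN = %r{(\w+)|'([^'\r\n]*)'}

# split keeps captured text in its result, so the tokens survive while the
# whitespace between them shows up as separate pieces to be rejected.
def tokens_by_split(line)
  line.split(SPLIT_PATTERN).reject { |piece| piece.strip.empty? }
end

p tokens_by_split("'Line two' fudgy whale 'rolly polly'")
p tokens_by_split("line six")
```

The design trade-off is that the regex here describes the tokens rather than the separators, which is the reverse of how split is normally used, so a cleanup step is still required.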
I have a feeling that you still don't like this, but this is the closest thing to "a single regular expression" I could come up with:
text_line.scan(/(?<=')(?:[^\s][^']*)(?=')|(?:\w+)/)
This breaks if the input text has a quoted word that starts with a space.
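To make that limitation concrete, here is a short sketch that wraps the lookaround pattern in a method (the name `tokens_lookaround` is hypothetical) and runs it on a normal line and on a line whose quoted phrase begins with a space.

```ruby
# Lookbehind and lookahead assert the quotes without consuming them, so
# scan returns plain strings and no flatten/compact step is needed.
LOOKAROUND_PATTERN = /(?<=')(?:[^\s][^']*)(?=')|(?:\w+)/

def tokens_lookaround(line)
  line.scan(LOOKAROUND_PATTERN)
end

# Normal case: quoted phrases stay together.
p tokens_lookaround("'Line two' fudgy whale 'rolly polly'")

# Failure case: the phrase starts with a space, so the quoted branch
# cannot match and the words are picked up one at a time instead.
p tokens_lookaround("' leading space' word")
```

The `[^\s]` at the start of the quoted branch is what keeps the gap between a closing quote and the next opening quote from being mistaken for a quoted phrase, and it is also the reason for the leading-space limitation.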