regex to break a string into "key" / "value" pairs when # of pairs is variable?

I'm using Ruby 1.9 and I'm wondering if there's a simple regex way to do this.

I have many strings that look like some variation of this:

str = "Allocation:  Random, Control:  Active Control, Endpoint Classification:  Safety Study, Intervention Model:  Parallel Assignment, Masking:  Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor), Primary Purpose:  Treatment"

The idea is that I'd like to break this string into its functional components:

  • Allocation: Random
  • Control: Active Control
  • Endpoint Classification: Safety Study
  • Intervention Model: Parallel Assignment
  • Masking: Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor)
  • Primary Purpose: Treatment

The "syntax" of the string is that there is a "key", which consists of one or more words or other characters (e.g. Intervention Model), followed by a colon (:). Each key has a corresponding "value" (e.g. Parallel Assignment) that immediately follows the colon (:). The "value" consists of words, commas, or whatever else, but the end of the "value" is signaled by a comma.

The # of key/value pairs is variable. I'm also assuming that colons (:) aren't allowed to be part of the "value" and that commas (,) aren't allowed to be part of the "key".

One would think that there is a "regexy" way to break this into its component pieces, but my attempt at making an appropriate matching regex only picks up the first key/value pair and I'm not sure how to capture the others. Any thoughts on how to capture the other matches?

irb(main):138:0> regex = /(([^,]+?): ([^:]+?,))+?/
=> /(([^,]+?): ([^:]+?,))+?/
irb(main):139:0> str = "Allocation:  Random, Control:  Active Control, Endpoint Classification:  Safety Study, Intervention Model:  Parallel Assignment, Masking:  Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor), Primary Purpose:  Treatment"
=> "Allocation:  Random, Control:  Active Control, Endpoint Classification:  Safety Study, Intervention Model:  Parallel Assignment, Masking:  Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor), Primary Purpose:  Treatment"
irb(main):140:0> str.match regex
=> #<MatchData "Allocation:  Random," 1:"Allocation:  Random," 2:"Allocation" 3:" Random,">
irb(main):141:0> $1
=> "Allocation:  Random,"
irb(main):142:0> $2
=> "Allocation"
irb(main):143:0> $3
=> " Random,"
irb(main):144:0> $4
=> nil


irb(main):003:0> pp Hash[ *str.split(/\s*([^,]+:)\s+/)[1..-1] ]
{"Allocation:"=>"Random,",
 "Control:"=>"Active Control,",
 "Endpoint Classification:"=>"Safety Study,",
 "Intervention Model:"=>"Parallel Assignment,",
 "Masking:"=>
  "Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor),",
 "Primary Purpose:"=>"Treatment"}

The whitespace parts of the regex aren't needed, but they help to clean up the output slightly. I leave it to you to do any minor follow-up cleanup, such as removing the colons from the end of the keys or the trailing commas from the values.
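For what it's worth, that follow-up cleanup could look something like the sketch below. It reuses the same split regex and assumes every key ends in a single colon and every value except the last ends in a single comma, as in the example string. The names parts, pairs and cleaned are just illustrative.

```ruby
str = "Allocation:  Random, Control:  Active Control, Endpoint Classification:  Safety Study, Intervention Model:  Parallel Assignment, Masking:  Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor), Primary Purpose:  Treatment"

# split keeps the captured keys in the result; the first element is the
# empty string in front of the first key, so drop it with [1..-1]
parts = str.split(/\s*([^,]+:)\s+/)[1..-1]

# walk the flat [key, value, key, value, ...] array two elements at a time,
# trimming the trailing ":" from each key and the trailing "," from each value
pairs = parts.each_slice(2).map { |key, value| [key.chomp(':'), value.chomp(',')] }

cleaned = Hash[pairs]
# cleaned["Allocation"] is now "Random"
```

chomp only removes the given suffix when it is actually present, so the last value ("Treatment"), which has no trailing comma, passes through unchanged.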


After some trial and error, I managed to get the following to work on your example string and regex:

str.split(/((?:[^,]+?): (?:[^:]+?,(?![^\(]+?\))))+?/).delete_if(&:empty?).map{|s| s.strip.chomp(',')}

I had to add a lookahead to ensure that commas inside parentheses are ignored, and I made some of the groups non-capturing. The delete_if and map at the end are purely cosmetic.
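As for why the original attempt only returned the first pair: String#match stops at the first match, whereas String#scan returns every match. Here is a rough alternative built on scan, not taken from either answer above. It relies on the assumptions stated in the question (no colons in values, no commas in keys) and treats a value as ending right before a comma that starts the next "key:", or at the end of the string. The PAIR constant is a made-up name.

```ruby
str = "Allocation:  Random, Control:  Active Control, Endpoint Classification:  Safety Study, Intervention Model:  Parallel Assignment, Masking:  Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor), Primary Purpose:  Treatment"

# group 1: the key, any run of characters that are neither ":" nor ","
# group 2: the value, matched lazily up to (but not including) either
#          a comma that is followed by another "key:", or the end of the string
PAIR = /\s*([^:,]+):\s*(.*?)(?=,\s*[^:,]+:|\z)/

# with capture groups, scan returns one [key, value] array per match
pairs = str.scan(PAIR)
result = Hash[pairs]
# result["Masking"] keeps its inner commas:
# "Double Blind (Subject, Caregiver, Investigator, Outcomes Assessor)"
```

Because the lookahead only accepts a comma that is followed by something ending in a colon, the commas inside the parentheses are skipped without any special handling for the parentheses themselves.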
