Better way to parse "Description (tag)" to "Description, tag"

I have a text file with thousands of lines like these. Each line is a category description followed by its keyword in parentheses:

Chemicals (chem) 
Electrical (elec) 

I need to convert these lines to comma separated values like so:

Chemicals, chem
Electrical, elec

What I am using is this:

lines = line.gsub!('(', ',').gsub!(')', '').split(',')

I would like to know if there is a better way to do this.

For posterity, this is the full code (based on the answers):

require 'csv'

# The block form of CSV.open closes the output file automatically.
CSV.open('output.csv', 'w') do |csvfile|
  File.foreach('c:/categories.txt') do |line|
    desc, cat = line.split('(')
    next if cat.nil?  # skip lines without a "(tag)" part
    # strip whitespace, then drop the closing parenthesis from the tag
    csvfile << [desc.strip, cat.strip.chomp(')')]
  end
end


Try something like this:

line.sub!(/ \((\w+)\)$/, ', \1')

The \1 is replaced with the text captured by the first group of the regexp (in this case, always the category keyword). So it basically changes " (chem)" into ", chem".
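To see the backreference on its own, here is a small sketch. The `convert` helper and the sample strings are made up for illustration. Note that `\w+` only matches word characters, so a line whose tag contains anything else (or a line with no tag at all) comes back unchanged.

```ruby
# Hypothetical helper: rewrites "Description (tag)" as "Description, tag".
def convert(line)
  # \1 refers to the first capture group, i.e. the text inside the parentheses.
  line.sub(/ \((\w+)\)$/, ', \1')
end

puts convert("Chemicals (chem)")             # Chemicals, chem
puts convert("Dyes & Intermediates (dyes)")  # Dyes & Intermediates, dyes
puts convert("No tag here")                  # No tag here (unchanged)
```

Because the pattern is anchored with `$`, only a parenthesised tag at the end of the line is rewritten.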

Let's create an example using a text file:

lines = []
File.open('categories.txt', 'r') do |file|
  while line = file.gets 
    lines << line.sub(/ \((\w+)\)$/, ', \1')
  end
end

Based on the question updates I can propose this:

require 'csv'

csv_file = CSV.open('output.csv', 'w')

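File.open('c:/categories.txt') do |f|
  f.each_line do |c|
    # scan returns an array of capture-group arrays, so write each one as its own row
    c.scan(/^(.+) \((\w+)\)$/).each { |row| csv_file << row }
  end
end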

csv_file.close


Starting with Ruby 1.9, you can do it in one method call:

str = "Chemicals (chem)\n"
mapping = { ' (' => ', ',
            ')'  => ''}

str.gsub(/ \(|\)/, mapping)  #=> "Chemicals, chem\n"


In Ruby, a cleaner, more efficient way to do it would be:

# split(' ', 2) returns a 2-element array: all characters up to the first
# space, and all characters after it. Multiple assignment then puts each
# element in its own local variable.
# Assumes a single-word description and no trailing newline on line.
description, tag = line.split(' ', 2)
# extract the inner characters (everything but the first and last)
tag = tag[1, (tag.length - 1) - 1]
# rejoin the parts into a new string
new_line = description << ", " << tag

This should be computationally faster (if you have a lot of rows) because it uses direct string operations instead of regular expressions, but measure it on your own data to be sure.
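If you want to stay with plain string operations but also cope with descriptions that contain spaces, one option is to split on the last " (" rather than the first space. The sketch below uses `String#rpartition` for that. The `split_description` helper is a hypothetical name, and no speed claim is attached to it.

```ruby
# Hypothetical helper: splits on the last " (" so multi-word descriptions survive.
def split_description(line)
  description, separator, rest = line.chomp.rpartition(' (')
  # rpartition returns ["", "", line] when the separator is absent.
  return nil if separator.empty? || !rest.end_with?(')')
  [description, rest.chomp(')')]
end

p split_description("Chemicals (chem)\n")           # ["Chemicals", "chem"]
p split_description("Dyes & Intermediates (dyes)")  # ["Dyes & Intermediates", "dyes"]
p split_description("No tag here")                  # nil
```

Returning nil for malformed lines lets the caller decide whether to skip them or report them.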


No need to manipulate the string. Just grab the data and output it to the CSV file. Assuming that you have something like this in the data:

Chemicals (chem)

Electrical (elec)

Dyes & Intermediates (dyes)

This should work:

File.open('categories.txt', 'r') do |file|
  file.each_line do |line|
    csvfile << line.match(/^(.+)\s\((.+)\)$/) { |m| [m[1], m[2]] }
  end
end
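One caveat: the block form of `match` returns nil when a line does not match (a blank line, for instance), and nil is not a valid row. A self-contained sketch that skips such lines is shown below. It uses `CSV.generate` and an in-memory string instead of real files so it can be run as is, and the sample input is made up for illustration.

```ruby
require 'csv'

# Sample input standing in for the contents of categories.txt.
input = "Chemicals (chem)\nElectrical (elec)\n\nDyes & Intermediates (dyes)\n"

output = CSV.generate do |csv|
  input.each_line do |line|
    m = line.match(/^(.+)\s\((.+)\)$/)
    next unless m  # skip blank or malformed lines
    csv << [m[1], m[2]]
  end
end

puts output
```

Letting the CSV library build each row also means a description that itself contains a comma gets quoted correctly.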


Benchmarks relevant to discussion in @hundredwatt's answer:

require 'benchmark'

line = "Chemicals (chem)"

# @hundredwatt
puts Benchmark.measure {
  100000.times do
    description, tag = line.split(' ', 2)
    tag = tag[1, (tag.length - 1) - 1]
    new_line = description << ", " << tag
  end
} # => 0.18

# NeX
puts Benchmark.measure {
  100000.times do
    line.sub!(/ \((\w+)\)$/, ', \1')
  end
} # => 0.08

# steenslag
mapping = { ' (' => ', ',
  ')'  => ''}
puts Benchmark.measure {
  100000.times do
    line.gsub(/ \(|\)/, mapping)
  end
} # => 0.08
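One thing to watch in measurements like these: `sub!` modifies `line` in place, so after the first iteration the string no longer contains parentheses and later calls have nothing to replace. Below is a sketch of a variant that keeps the input frozen and checks that all three approaches agree before timing them. The lambda names are made up for illustration, and no timings are quoted because they depend on the machine and Ruby version.

```ruby
require 'benchmark'

line = "Chemicals (chem)".freeze  # frozen so no variant can modify the input
mapping = { ' (' => ', ', ')' => '' }.freeze
n = 100_000

split_based = lambda do
  description, tag = line.split(' ', 2)
  "#{description}, #{tag[1..-2]}"  # drop the surrounding parentheses
end
sub_based  = -> { line.sub(/ \((\w+)\)$/, ', \1') }
gsub_based = -> { line.gsub(/ \(|\)/, mapping) }

# Sanity check: every variant should produce the same string.
results = [split_based, sub_based, gsub_based].map(&:call)
raise "variants disagree: #{results.inspect}" unless results.uniq == ["Chemicals, chem"]

Benchmark.bm(6) do |x|
  x.report('split') { n.times { split_based.call } }
  x.report('sub')   { n.times { sub_based.call } }
  x.report('gsub')  { n.times { gsub_based.call } }
end
```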


I know nothing about Ruby, but it is easy in PHP:

preg_match('~(.+) \((.+)\)~', 'Chemicals (chem)', $m);

$result = $m[1] . ', ' . $m[2];