Better way to parse "Description (tag)" to "Description, tag"
I have a text file with thousands of lines like these, each a category description followed by its keyword in parentheses:
Chemicals (chem)
Electrical (elec)
I need to convert these lines to comma separated values like so:
Chemicals, chem
Electrical, elec
What I am using is this:
lines = line.gsub!('(', ',').gsub!(')', '').split(',')
I would like to know if there is a better way to do this.
For posterity, this is the full code (based on the answers):
require 'rubygems'
require 'csv'
csvfile = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
  f.readlines.each do |line|
    (desc, cat) = line.split('(')
    desc.strip!
    cat.strip!
    csvfile << [desc, cat[0, cat.length - 1]]  # drop the trailing ")"
  end
end
csvfile.close  # flush the buffered rows to output.csv
Try something like this:
line.sub!(/ \((\w+)\)$/, ', \1')
The \1 will be replaced with the first capture group of the given regexp (in this case, always the category keyword). So it basically changes " (chem)" into ", chem".
Let's create an example using a text file:
lines = []
File.open('categories.txt', 'r') do |file|
  while line = file.gets
    lines << line.sub(/ \((\w+)\)$/, ', \1')
  end
end
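To see what that substitution does line by line, here is a small sketch run on in-memory strings instead of a file. The `samples` array is made up for illustration (two lines in the question's format plus one line without a tag); a line that does not match the pattern is simply left unchanged.

```ruby
# Illustrative input: two lines in the question's format and one without a tag.
samples = ["Chemicals (chem)\n", "Dyes & Intermediates (dyes)\n", "No tag here\n"]

# sub returns a new string; \1 inserts the text captured by (\w+).
# $ matches just before the trailing newline, so the newline is kept.
converted = samples.map { |line| line.sub(/ \((\w+)\)$/, ', \1') }

converted.each { |line| puts line }
```

Because the pattern is anchored at the end of the line, spaces inside the description do not matter.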
Based on the question updates, I can propose this:
require 'csv'
csv_file = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
  # scan with a block yields one [description, tag] array per match
  f.each_line { |c| c.scan(/^(.+) \((\w+)\)$/) { |row| csv_file << row } }
end
csv_file.close
Starting with Ruby 1.9, you can do it in one method call:
str = "Chemicals (chem)\n"
mapping = { ' (' => ', ',
')' => ''}
str.gsub(/ \(|\)/, mapping) #=> "Chemicals, chem\n"
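The way this works: when the second argument to `gsub` is a hash, each matched piece of text is looked up as a key and replaced with the corresponding value. A minimal sketch applying the same mapping to several lines (the `lines` array is illustrative, standing in for the contents of the file):

```ruby
mapping = { ' (' => ', ', ')' => '' }

# Illustrative stand-in for the lines read from the file.
lines = ["Chemicals (chem)\n", "Dyes & Intermediates (dyes)\n"]

# Each match (" (" or ")") is replaced by mapping[match].
converted = lines.map { |line| line.gsub(/ \(|\)/, mapping) }

converted.each { |line| puts line }
```

One thing to keep in mind is that this replaces every " (" and ")" on the line, so it assumes the only parentheses are the ones around the tag.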
In Ruby, a cleaner, more efficient way to do it would be:
description, tag = line.chomp.split(' ', 2) # split(' ', 2) returns a two-element array:
# everything up to the first space, and everything after it. Multiple assignment then
# puts each element in its own local variable. This assumes the description has no spaces;
# chomp removes the trailing newline so the last character of tag is the ")"
tag = tag[1..-2] # extract the inside characters (not first or last) of the string
new_line = description << ", " << tag # rejoin the parts into a new string
This may be computationally faster (if you have a lot of rows) because it uses direct string operations instead of regular expressions, but it is worth measuring on your own data.
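Splitting on the first space only works when the description is a single word. If descriptions can contain spaces, one regex-free alternative is to split at the last " (" instead. The following is a sketch of that idea using `String#rpartition`; the helper name `convert_line` is made up for this example and is not part of the answer above.

```ruby
# Hypothetical helper: converts "Description (tag)" to "Description, tag"
# using only plain string methods.
def convert_line(line)
  # rpartition splits at the LAST occurrence of " (" and returns
  # [before, separator, after]; separator is "" when nothing was found.
  description, separator, tag = line.chomp.rpartition(' (')
  return line.chomp if separator.empty?  # no tag: return the line unchanged
  "#{description}, #{tag.chomp(')')}"    # chomp(')') drops the closing parenthesis
end

puts convert_line("Chemicals (chem)\n")
puts convert_line("Dyes & Intermediates (dyes)\n")
```

Whether this is actually faster than a regular expression depends on the Ruby version and the data, so it is best treated as a readability choice unless a benchmark says otherwise.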
No need to manipulate the string. Just grab the data and output it to the CSV file. Assuming that you have something like this in the data:
Chemicals (chem)
Electrical (elec)
Dyes & Intermediates (dyes)
This should work:
File.open('categories.txt', 'r') do |file|
  file.each_line do |line|
    csvfile << line.match(/^(.+)\s\((.+)\)$/) { |m| [m[1], m[2]] }
  end
end
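Here is a self-contained sketch of the same approach that can be run without any files. It makes two assumptions that are not in the snippet above: the input is an in-memory string rather than categories.txt, and lines that do not match the pattern (such as blank lines) are skipped, since `match` returns nil for them and nil cannot be appended as a row.

```ruby
require 'csv'

# In-memory stand-in for categories.txt, with a trailing blank line.
input = "Chemicals (chem)\nElectrical (elec)\nDyes & Intermediates (dyes)\n\n"

output = CSV.generate do |csv|
  input.each_line do |line|
    m = line.match(/^(.+)\s\((.+)\)$/)
    next unless m          # skip blank or malformed lines
    csv << [m[1], m[2]]    # CSV takes care of separators and quoting
  end
end

puts output
```

Note that CSV writes fields separated by a bare comma, with no space after it.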
Benchmarks relevant to discussion in @hundredwatt's answer:
require 'benchmark'
line = "Chemicals (chem)"
# @hundredwatt
puts Benchmark.measure {
  100000.times do
    description, tag = line.split(' ', 2)
    tag = tag[1, (tag.length - 1) - 1]
    new_line = description << ", " << tag
  end
} # => 0.18
# NeX
puts Benchmark.measure {
  100000.times do
    line.sub!(/ \((\w+)\)$/, ', \1')
  end
} # => 0.08
# steenslag
mapping = { ' (' => ', ',
            ')' => '' }
puts Benchmark.measure {
  100000.times do
    line.gsub(/ \(|\)/, mapping)
  end
} # => 0.08
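One caveat when timing these: `sub!` modifies `line` in place, so after its first iteration the string no longer contains parentheses, and any later iterations or measurements that share `line` work on the already-converted text. Below is a sketch of a variant that uses only non-destructive calls on a frozen string and checks that all three approaches produce the same result before timing them. The labels and variable names are illustrative, and no timings are shown because they depend on the machine and Ruby version.

```ruby
require 'benchmark'

line = "Chemicals (chem)".freeze   # frozen, so accidental mutation raises an error
mapping = { ' (' => ', ', ')' => '' }
iterations = 100_000

# Each lambda builds a new string and leaves line untouched.
approaches = {
  'split'     => -> { d, t = line.split(' ', 2); "#{d}, #{t[1..-2]}" },
  'sub'       => -> { line.sub(/ \((\w+)\)$/, ', \1') },
  'gsub+hash' => -> { line.gsub(/ \(|\)/, mapping) }
}

# Sanity check: every approach must produce the same converted line.
results = approaches.values.map(&:call).uniq
raise "approaches disagree: #{results.inspect}" unless results == ["Chemicals, chem"]

Benchmark.bm(10) do |bm|
  approaches.each do |name, code|
    bm.report(name) { iterations.times { code.call } }
  end
end
```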
I know nothing about Ruby, but it is easy in PHP:
preg_match('~(.+?)\s*\((.+)\)~', 'Chemicals (chem)', $m);
$result = $m[1] . ', ' . $m[2]; // "Chemicals, chem"