Problem with TXT file extraction in ruby
I have a data file in TXT format, and I would like to parse the URL field from it using the Ruby code below:
f = File.open(txt_file, "r")
f.each_line { |line|
  rows = line.split(',')
  rows[3].each do |url|
    next if url=="URL"
    puts url
  end
}
TXT contains:
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
output:
0
Why does the output come from the option field "0,0,0,0,0,0"? How do I skip this and get the URL field?
Environment: Ruby 1.8.7, Rails 2.3.8, gem 1.3.7
I'd check out a CSV parsing tool to make this easier:
require 'rubygems'
require 'faster_csv'
# Print the URL column of every row, skipping the header row
FasterCSV.foreach(txt_file, :quote_char => '"',
                  :col_sep => ',', :row_sep => :auto) do |row|
  puts row[3] if row[3] != "URL"
end
Also, I think you're misunderstanding how split() works. If you run split() against one row from your file, you get back an array of columns for that single row, not a multidimensional array as rows[3].each would suggest.
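To make that concrete, here is a small sketch that runs split(',') on one data line from the question. The variable names line and fields are just for illustration. It shows that the commas inside the quoted option field are treated as separators too, which is why index 3 ends up holding "0".

```ruby
# One data line from the question's file
line = '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"'

# split(',') knows nothing about quotes, so it cuts inside "0,0,0,0,0,0" as well
fields = line.split(',')

p fields.length  # => 9
p fields[3]      # => "0"
p fields.last    # => "\"http://domain.com/xym.jpg\""
```

Note that the result is a flat array of strings, so fields[3] is a single string, not a list of URLs.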
EDIT: Before reading, note that I completely agree with the answer by Jeff Swensen; I'll leave my answer here regardless.
I'm not entirely sure what your inner loop (rows[3].each) is for, because rows[3] is a single field from a single line, so there is only one URL to iterate over. You could split on the ** characters and return an Array of URLs, but then you would still need to remove the extra double quotes. Or you could use a regular expression, like so:
#!/usr/bin/env ruby
f = DATA
urls = f.readlines.map do |line|
  line[/([^"]+)"\*\*/, 1]
end
urls.compact!
p urls
__END__
name ,option,price, **URL**
"x", "0,0,0,0,0,0", "123.40",**"http://domain.com/xym.jpg"**
"x", "0,0,0,0,0,0", "111.34",**"http://domain.com/yum.jpg"**
The call to compact! is needed because map inserts nil wherever a line doesn't match the expression (the header line, for example). For the String#[] method, see the Ruby String documentation.
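As a quick illustration of that behaviour, the sketch below applies the same expression to a header line and a data line. The lines array is made up for the example and reuses the sample data with the ** markers. String#[] with a regex and a capture index returns the captured text, or nil when nothing matches.

```ruby
# Sample lines (illustrative): a header without quotes and one data line
lines = [
  'name ,option,price, **URL**',
  '"x", "0,0,0,0,0,0", "123.40",**"http://domain.com/xym.jpg"**'
]

# str[regex, 1] returns capture group 1, or nil if the regex does not match
matches = lines.map { |l| l[/([^"]+)"\*\*/, 1] }

p matches          # => [nil, "http://domain.com/xym.jpg"]
p matches.compact  # => ["http://domain.com/xym.jpg"]
```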
The reason that "0" is the result is that your code blindly splits on the comma character, when you seem to be expecting CSV-style parsing (where a column value may contain delimiter characters if the entire value is enclosed in quotes). I highly suggest using a CSV parser. If you are using Ruby 1.9.2, you already have access to FasterCSV: it ships as the standard csv library (require 'csv').
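Here is a minimal sketch of the CSV-parser route using the standard csv library on Ruby 1.9 or later. Two things are my own assumptions rather than part of the question: the data is inlined in a heredoc instead of being read from txt_file, and the line is normalized first, because as far as I know Ruby's CSV parser is strict about quoting and rejects whitespace between a comma and an opening quote (as in `"x", "0,...`).

```ruby
require 'csv'

# Sample data copied from the question (inlined here instead of reading txt_file)
text = <<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT

# drop(1) skips the header row
urls = text.lines.drop(1).map do |line|
  # Assumption: remove whitespace between a comma and an opening quote,
  # since the strict parser is expected to reject it otherwise
  normalized = line.chomp.gsub(/,\s+"/, ',"')
  CSV.parse_line(normalized)[3]
end

p urls  # => ["http://domain.com/xym.jpg", "http://domain.com/yum.jpg"]
```

The quoted option field comes back as the single value "0,0,0,0,0,0", so index 3 is the URL column as intended.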
If you are sure that the fields you want are always surrounded by double quotations, you can use that as the basis for extracting rather than the comma.
File.open(txt_file) do |f|
  f.each_line do |l|
    cols = l.scan(/(?<!\\)"(.*?)(?<!\\)"/)
    cols[3].tap{|url| puts url if url}
  end
end
- In your code, the opened IO is not closed. This is a bad practice. It is better to use a block so that you do not forget to close it.
- The two (?<!\\)" in the regex match non-escaped double quotations. They use negative lookbehind.
- .*? is a non-greedy match, which keeps a match from running past a non-escaped double quotation.
- tap is there to avoid repeating the cols[3] operation twice, once in puts and once in if.
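For illustration, here is what that scan returns on one of the sample lines, plus a made-up line containing escaped quotes (the escaped string is my own example, not from the question). This sketch assumes Ruby 1.9 or later, since the lookbehind syntax is not supported by the 1.8 regex engine. With one capture group, scan returns an array of one-element arrays.

```ruby
line = '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"'

# Each match is wrapped in its own array because the regex has one capture group
cols = line.scan(/(?<!\\)"(.*?)(?<!\\)"/)

p cols[3]          # => ["http://domain.com/xym.jpg"]
p cols.flatten[3]  # => "http://domain.com/xym.jpg"

# Hypothetical line with backslash-escaped quotes inside the first field
escaped = '"say \"hi\"", "next"'
escaped_cols = escaped.scan(/(?<!\\)"(.*?)(?<!\\)"/).flatten

# The escaped quotes stay inside the first field instead of ending it
puts escaped_cols.length  # => 2
```

puts accepts the one-element array from cols[3] and prints its element, which is why the code above works without flatten.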
Edit again
If you use Ruby 1.8.7, you can either
- update your regex engine to Oniguruma by following the easy steps at http://oniguruma.rubyforge.org/
or
- replace the regex with one that avoids lookbehind, and drop tap as well. Use the following instead:
File.open(txt_file) do |f|
  f.each_line do |l|
    cols = l.scan(/(?:\A|[^\\])"(.*?[^\\]|)"/)
    url = cols[3]
    puts url if url
  end
end
I would recommend using Oniguruma. It is the regex engine that Ruby has used since 1.9, and it is much more powerful and faster than the one used in Ruby 1.8. It can be installed easily on Ruby 1.8.
The data is in CSV format, but if all you want to do is grab the last field in the string, then do just that:
text =<<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT
require 'pp'
text.lines.map{ |l| l.split(',').last }
If you want to clean up the double-quotes and trailing line-breaks:
text.lines.map{ |l| l.split(',').last.gsub('"', '').chomp }
# => ["URL", "http://domain.com/xym.jpg", "http://domain.com/yum.jpg"]