Problem with TXT file extraction in Ruby

2023-03-01 13:15

I have a data file in TXT format, and I would like to parse the URL field from it using the Ruby code below:

f = File.open(txt_file, "r")
f.each_line { |line|
  rows = line.split(',')
  rows[3].each do |url|
    next if url=="URL"
    puts url
  end
}

TXT contains:

name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"

output:

Why does the output come from the option field "0,0,0,0,0,0"? How do I skip this and get the URL field?

Environment: Ruby 1.8.7, Rails 2.3.8, gem 1.3.7

I'd check out a CSV parsing tool to make this easier:

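 require 'rubygems'
 require 'faster_csv'

 # Walk the file row by row; quoted fields may contain commas.
 FasterCSV.foreach(txt_file, :quote_char => '"',
         :col_sep => ',', :row_sep => :auto) do |row|
   # Skip the header row and print the URL column of every data row.
   puts row[3] if row[3] != "URL"
 end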

Also, I think you're misunderstanding how split() works. If you run split() against one row from your file, you get back a flat array of column strings for that single row, not a multidimensional array as rows[3].each would suggest.
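To make that concrete, here is a small sketch that runs split(',') on one data line copied from the question. The variable names line and fields are only illustrative. It shows that index 3 lands inside the quoted option field, which is where the stray "0" comes from. (On Ruby 1.8, String#each iterates over the lines of a string, which is why rows[3].each ran at all instead of raising an error.)

```ruby
# One data line from the question's file.
line = '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"'

# split(',') knows nothing about quotes: it cuts at every comma,
# including the commas inside the quoted option field.
fields = line.split(',')

puts fields.length  # 9 pieces rather than 4 columns
puts fields[3]      # a bare 0, one fragment of the option field
puts fields.last    # the URL, still wrapped in double quotes
```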

EDIT: Before reading, I completely agree with the answer by Jeff Swensen, I'll leave my answer here regardless.

I'm not entirely sure what your inner loop (rows[3].each) is for: once a line has been split, rows[3] is a single string, not a collection of rows. Assuming the ** markers around the URL column are literally in the file, you could split on ** to get an Array of URLs, but you would still need to remove the extra double quotes. Alternatively, you could use a regular expression, like so:

#!/usr/bin/env ruby

f = DATA
urls = f.readlines.map do |line|
  line[/([^"]+)"\*\*/, 1] 
end
urls.compact!

p urls

__END__
name ,option,price, **URL**
"x", "0,0,0,0,0,0", "123.40",**"http://domain.com/xym.jpg"**
"x", "0,0,0,0,0,0", "111.34",**"http://domain.com/yum.jpg"**

The call to compact! is needed because map inserts nil wherever a line does not match the expression. For details on String#[] with a regular expression and a capture index, see the Ruby String documentation.
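As a minimal illustration of those two points, the sketch below applies the same pattern to a header line and a data line. It assumes, as this answer does, that the ** markers are literally present in the input; the names lines, pattern and matches are only illustrative.

```ruby
lines = [
  'name ,option,price, **URL**',
  '"x", "0,0,0,0,0,0", "123.40",**"http://domain.com/xym.jpg"**'
]

# String#[] with a regexp and an index returns that capture group,
# or nil when the line does not match at all.
pattern = /([^"]+)"\*\*/
matches = lines.map { |line| line[pattern, 1] }

p matches          # the header line contributes a nil
p matches.compact  # compact drops the nil entries
```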

The reason that "0" is the result is that your code blindly splits on every comma, while you seem to expect CSV-style parsing (where a column value may contain the delimiter if the entire value is enclosed in quotes). I highly suggest using a CSV parser. If you are using Ruby 1.9.2, the FasterCSV library is already included as the standard csv library (require 'csv').
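A minimal sketch of that approach with the standard csv library (Ruby 1.9 or later) might look like the following. One assumption to flag: the sample data has a space between some commas and the next opening quote, and Ruby's CSV parser is strict about quoting and may reject such lines, so the sketch removes that whitespace first. The gsub normalization and the variable names are my own additions, not part of the question.

```ruby
require 'csv'

text = <<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT

urls = []
text.each_line do |line|
  # Assumption: drop whitespace between a closing quote's comma and the
  # next opening quote so every quoted field starts right after the comma.
  normalized = line.gsub(/",\s+"/, '","')
  row = CSV.parse_line(normalized)
  next if row.nil? || row[3] == "URL"  # skip blank lines and the header
  urls << row[3]
end

p urls
```

With a real file, the same loop body could sit inside File.foreach(txt_file) instead of text.each_line.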

If you are sure that the fields you want are always surrounded by double quotes, you can use the quotes rather than the commas as the basis for extraction.

File.open(txt_file) do |f|
  f.each_line do |l|
    cols = l.scan(/(?<!\\)"(.*?)(?<!\\)"/)
    cols[3].tap{|url| puts url if url}
  end
end

In your code, the opened IO is never closed, which is bad practice. Passing a block to File.open closes the file for you when the block ends.
The two (?<!\\)" in the regex each match a non-escaped double quote; (?<!\\) is a negative lookbehind.
.*? is a non-greedy match, which stops the match at the first non-escaped double quote instead of running past it.
tap avoids writing cols[3] twice, once in puts and once in the if.
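To see what the scan call returns, here is a small sketch using the same regex on a data line, on the header line, and on a made-up string containing escaped quotes. It needs Ruby 1.9 or later because of the lookbehind. The variable names and the escaped sample are illustrative assumptions. Because the regex has one capture group, scan returns an array of one-element arrays, so flatten is used here to get plain strings.

```ruby
quoted = /(?<!\\)"(.*?)(?<!\\)"/

line = '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"'
cols = line.scan(quoted)
p cols[3]          # a one-element array holding the URL

columns = cols.flatten
puts columns[3]    # the URL as a plain string

# The header has no quoted fields, so nothing matches and cols[3] would be nil.
header = 'name,option,price,URL'
p header.scan(quoted)

# Hypothetical field with backslash-escaped quotes inside it.
escaped = '"say \"hi\"", "b"'
p escaped.scan(quoted).flatten
```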

Edit again

If you use Ruby 1.8.7, whose built-in regex engine does not support lookbehind, you can either

update your regex engine to Oniguruma by following the steps at http://oniguruma.rubyforge.org/

or replace the regex with one that does not need lookbehind. The version below also does without tap:

File.open(txt_file) do |f|
  f.each_line do |l|
    cols = l.scan(/(?:\A|[^\\])"(.*?[^\\]|)"/)
    url = cols[3]
    puts url if url
  end
end

I would recommend using Oniguruma. It is the regex engine that Ruby has used since 1.9, and it is much more powerful and faster than the one used in Ruby 1.8. It can be installed easily on Ruby 1.8.

The data is in CSV format, but if all you want to do is grab the last field in the string, then do just that:

text =<<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT

text.lines.map{ |l| l.split(',').last }
# => ["URL\n", "\"http://domain.com/xym.jpg\"\n", "\"http://domain.com/yum.jpg\"\n"]

If you want to clean up the double-quotes and trailing line-breaks:

text.lines.map{ |l| l.split(',').last.gsub('"', '').chomp }
# => ["URL", "http://domain.com/xym.jpg", "http://domain.com/yum.jpg"]
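If the header entry is not wanted either, one possible variation is to drop the first line before mapping. This is only a sketch: it assumes the header is always the first line and that the URL itself never contains a comma.

```ruby
text = <<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT

# drop(1) skips the header line; delete('"') strips the double quotes.
urls = text.lines.drop(1).map { |l| l.split(',').last.delete('"').chomp }
p urls
```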
