How can I further process the line of data that causes the Ruby FasterCSV library to throw a MalformedCSVError?

2023-04-11 16:34 问答作者：

The incoming data file(s) contain malformed CSV data such as non-escaped quotes, as well as (valid) CSV data such as fields containing new lines. If a CSV format error is detected I would like to use an alternative routine on that data.

With the following sample code (abbreviated for simplicity)

FasterCSV.open( file ){|csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the good rows here...

    rescue FasterCSV::MalformedCSVError => e
      # Do things with the bad rows here...
      next
    end
  end
}开发者_运维知识库

The MalformedCSVError is caused in the csv.shift method. How can I access the data that caused the error from the rescue clause?

require 'csv' #CSV in ruby 1.9.2 is identical to FasterCSV

# File.open('test.txt','r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row #handle row
    end
  rescue  CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end

# Output:

# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]   

__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid

Just feed the file line by line to FasterCSV and rescue the error.

This is going to be really difficult. Some things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is to make your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then when FasterCSV raises an exception you can ask your special File object for the last line. Something like this:

class MyFile < File
  attr_accessor :last_gets
  @last_gets = ''

  def gets(*args)
    line = super
    @last_gets << $/ << line
    line
  end
end

# then...

file  = MyFile.open(filename, 'r')
csv   = FasterCSV.new file

row = true
while row
  begin
    break unless row = csv.shift

    # do things with the good row here...

  rescue FasterCSV::MalformedCSVError => e
    bad_row = file.last_gets

    # do something with bad_row here...

    next
  ensure
    file.last_gets = '' # nuke the @last_gets "buffer"
  end
end

Kinda neat, right? BUT! there are caveats, of course:

I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.
This ~~fails utterly~~ might or might not fail if your CSV file contains newline characters inside quoted fields. The reason for this is described in the source--basically, if a quoted value contains a newline then shift has to do additional gets calls to get the entire line. There could be a clever way around this limitation but it's not coming to me right now. If you're sure your file doesn't have any newline characters within quoted fields then this shouldn't be a worry for you, though.

Your other option would be to read the file using File.gets and pass each line in turn to FasterCSV#parse_line but I'm pretty sure in so doing you'd squander any performance advantage gained from using FasterCSV.

I used Jordan's file subclassing approach to fix the problem with my input data before CSV ever tries to parse it. In my case, I had a file that used \" to escape quotes, instead of the "" that CSV expects. Hence,

class MyFile < File
  def gets(*args)
    line = super
    if line != nil
      line.gsub!('\\"','""')  # fix the \" that would otherwise cause a parse error
    end
    line
  end
end

infile = MyFile.open(filename)
incsv = CSV.new(infile)

while row = infile.shift
  # process each row here
end

This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.

继续阅读：fastercsv ruby

How can I further process the line of data that causes the Ruby FasterCSV library to throw a MalformedCSVError?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？