CSV parsing for embedded double quotes

I've written a simple CSV file parser, but after reading the wiki page on CSV formats I noticed some "extensions" to the basic format, specifically commas embedded in a field by enclosing it in double quotes. I've managed to parse those, but there is a second issue: embedded double quotes.

Example:

12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]

I can't seem to find the correct way to distinguish between a double quote that encloses a field and one that is embedded in it. So my question is: what is the correct way/algorithm to parse CSV formats such as the one above?


The way I normally think about this is to treat a field as either a single unquoted value or a sequence of double-quoted segments that are joined by quote characters to form the value. That is,

  • to parse the next atom in the row:
    • read up to the first non whitespace character
    • if the current character is not a quote:
      • mark the current spot
      • read up to the next comma or newline
      • return the text between the mark and the character before the comma (strip spaces if appropriate)
    • if the current character is a quote:
      • create an empty string buffer
      • while the current character is a quote
        • mark the current position +1 (skip the quote character)
        • read up to the next quote
        • if this is not the first segment, append a quote to the buffer
        • append to the buffer the text between the mark and the character before the current position (to strip both quotes)
        • advance one character (past the just read quote)
      • read up to the next comma or newline
      • return the buffer

Essentially, split the quoted string into its double-quoted segments and then concatenate them with quote characters in between. Thus "ABC, ""IJK"" XYZ" becomes the three segments [ABC, ], [IJK] and [ XYZ], which in turn become ABC, "IJK" XYZ.
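
The steps above can be sketched roughly as follows in Python. This is only an illustration under a few assumptions: the function name parse_row is made up, the input is a single line without the trailing newline, and anything between a closing quote and the next comma is ignored.

```python
def parse_row(line):
    """Split one CSV line into fields by joining quoted segments with quotes."""
    fields = []
    pos = 0
    n = len(line)
    while True:
        # Read up to the first non-whitespace character.
        while pos < n and line[pos] in " \t":
            pos += 1
        if pos < n and line[pos] == '"':
            segments = []
            # Each pass consumes one "..." segment. A doubled quote shows up
            # as two adjacent segments, e.g. "ABC, ""IJK"" XYZ" is three.
            while pos < n and line[pos] == '"':
                mark = pos + 1                  # skip the opening quote
                end = line.find('"', mark)      # read up to the next quote
                if end == -1:
                    raise ValueError("unterminated quoted field")
                segments.append(line[mark:end])
                pos = end + 1                   # step past the quote just read
            # Joining puts a quote between consecutive segments.
            fields.append('"'.join(segments))
            comma = line.find(",", pos)
        else:
            # Unquoted atom: everything up to the next comma, stripped.
            comma = line.find(",", pos)
            end = n if comma == -1 else comma
            fields.append(line[pos:end].strip())
        if comma == -1:
            return fields
        pos = comma + 1
```

Calling parse_row on the line from the question should give the two fields 12345 and ABC, "IJK" XYZ. Joining a list of segments, rather than testing whether the buffer is empty, also copes with a field that starts with an embedded quote, such as """abc""".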


I would do this using a single-character look-ahead: if you're scanning a quoted string and find a double quote, look at the next character to see if it is also a double quote. If it is, then the pair represents a single double-quote character in the output. If it's any other character, you're looking at the end of the quoted string (and hopefully that next character is a comma!). Be sure to account for the end-of-line condition when looking at the next character, too.
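
A minimal sketch of the look-ahead idea in Python, assuming a single line with no trailing newline. The name split_csv_line is hypothetical, and the sketch does not check that a closing quote is actually followed by a comma.

```python
def split_csv_line(line):
    """Split one CSV line, using one character of look-ahead for quotes."""
    fields = []
    field = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                # Look ahead one character, guarding against end of line.
                if i + 1 < len(line) and line[i + 1] == '"':
                    field.append('"')   # a "" pair is one literal quote
                    i += 1              # consume the second quote as well
                else:
                    in_quotes = False   # a lone quote closes the string
            else:
                field.append(ch)        # commas are ordinary text in here
        elif ch == '"':
            in_quotes = True
        elif ch == ',':
            fields.append("".join(field))
            field = []
        else:
            field.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted field")
    fields.append("".join(field))
    return fields
```

The bounds check on i + 1 is the end-of-line condition mentioned above: a quote that is the last character of the line is treated as a closing quote.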


If you find an opening double quote, you should then look for the matching double quote at the end of the word/string. If you can't find one, that is an error. The same goes for a single quote.

I suggest you try Flex/Bison to write a parser for the CSV file. Both tools generate C source files, which you can then compile and call from your C++ program. In Flex, you create a scanner that recognizes your tokens, like "word" or ""word"". In Bison, you define the syntax.


A double double-quote ("") is a literal double-quote, while a lone double-quote (") is used for enclosing text (including commas).

Here's a regex for a csv field, if that makes things easier:

([^",\n][^,\n]*)|"((?:[^"]|"")+)"

Group 1 will contain the field if it isn't in quotes, group 2 will contain the field if it is in quotes, minus the surrounding quotes. In that case, just replace all instances of "" with ".
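
For illustration, here is one way the regex could be applied with Python's re module. The names FIELD_RE and fields_from_line are made up for this sketch. Note that the pattern requires at least one character per field, so empty fields are simply skipped by this approach.

```python
import re

# The pattern from above: group 1 is an unquoted field,
# group 2 is the inside of a quoted field.
FIELD_RE = re.compile(r'([^",\n][^,\n]*)|"((?:[^"]|"")+)"')

def fields_from_line(line):
    """Collect every field the pattern finds in one CSV line."""
    fields = []
    for match in FIELD_RE.finditer(line):
        if match.group(2) is not None:
            # Quoted field: collapse each "" pair into a single quote.
            fields.append(match.group(2).replace('""', '"'))
        else:
            fields.append(match.group(1))
    return fields
```

If empty fields matter for your data, a character-by-character scan is probably easier to get right than extending the pattern.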


I suggest reading: Stop Rolling Your Own CSV Parser and this CSV RFC. The first is really just someone who wants you to use their C# CSV parser, but still explains many issues.

Your parser should examine one character at a time. I used a two-boolean strategy for my parser in D: one boolean records whether the string is currently quoted, and the other flags that a quote was just seen. Each quote toggles whether the string is quoted or not. When you hit a quote inside a quoted cell, set the flag and turn quoting off. If the next character is also a quote, turn quoting back on, add a quote to the result, and clear the flag. If the next character isn't a quote, clear the flag and leave quoting off.
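
The original parser was written in D and is not shown here, so the following Python sketch is only a guess at the same two-boolean idea. The names parse_line, quoted and pending are assumptions, and the sketch does not report an unterminated quote.

```python
def parse_line(line):
    """Split one CSV line using two booleans as the parser state."""
    fields = []
    cell = []
    quoted = False    # currently inside a quoted cell
    pending = False   # the previous character was a quote seen while quoted
    for ch in line:
        if pending:
            pending = False
            if ch == '"':
                # Second quote of a pair: literal quote, quoting back on.
                cell.append('"')
                quoted = True
                continue
            # Otherwise the earlier quote closed the cell; handle ch below.
        if quoted:
            if ch == '"':
                pending = True   # could be a closing quote or half of a pair
                quoted = False   # turn quoting off until the next char decides
            else:
                cell.append(ch)
        elif ch == '"':
            quoted = True
        elif ch == ',':
            fields.append("".join(cell))
            cell = []
        else:
            cell.append(ch)
    fields.append("".join(cell))
    return fields
```

Unlike the look-ahead approach, this never peeks at the next character. The decision about a quote is deferred until that character arrives, which also suits input that is read as a stream.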
