CSV parsing for embedded double quotes
I've written a simple CSV file parser. But after looking at the wiki page on CSV formats I noticed some "extensions" to the basic format, specifically commas embedded in a field by wrapping it in double quotes. I've managed to parse those; however, there is a second issue: embedded double quotes.
Example:
12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]
I can't seem to find the correct way to distinguish an embedded double quote from one that encloses the field. So my question is: what is the correct way/algorithm to parse CSV formats such as the one above?
The way I normally think about this is to treat a field as either a single unquoted value or a sequence of double-quoted segments that are joined by quote characters to form the value. That is,
- to parse the next atom in the row:
  - read up to the first non-whitespace character
  - if the current character is not a quote:
    - mark the current spot
    - read up to the next comma or newline
    - return the text between the mark and the character before the comma (strip spaces if appropriate)
  - if the current character is a quote:
    - create an empty string buffer
    - while the current character is a quote:
      - mark the current position + 1 (skip the quote character)
      - read up to the next quote
      - if this is not the first segment, append a quote to the buffer
      - append to the buffer the text between the mark and the character before the current position (to strip both quotes)
      - advance one character (past the quote just read)
    - read up to the next comma or newline
    - return the buffer
Essentially, split the quoted string into its double-quoted segments and then concatenate them with quotes in between. Thus "ABC, ""IJK"" XYZ" becomes [ABC, ], [IJK] and [ XYZ], which in turn becomes ABC, "IJK" XYZ
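The steps above can be sketched roughly as follows. This is only an illustration in Python, not a full CSV reader: the names read_field and parse_row are made up for the example, the input is assumed to be a single row, and the segment loop runs while the current character is a quote.

```python
def read_field(line, pos=0):
    """Parse one field starting at pos.

    Returns (value, next_pos, more), where more is True if a comma
    was consumed and another field follows.
    """
    n = len(line)
    # Read up to the first non-whitespace character (spaces and tabs only).
    while pos < n and line[pos] in " \t":
        pos += 1

    if pos < n and line[pos] == '"':
        segments = []
        # Each pass consumes one "..." segment. A quote right after a
        # closing quote opens the next segment.
        while pos < n and line[pos] == '"':
            mark = pos + 1                  # skip the opening quote
            end = line.find('"', mark)      # read up to the next quote
            if end == -1:
                raise ValueError("unterminated quoted field")
            segments.append(line[mark:end])
            pos = end + 1                   # advance past the closing quote
        # Joining with quotes restores the embedded quote characters.
        value = '"'.join(segments)
        # Read up to the next comma or newline.
        while pos < n and line[pos] not in ",\n":
            pos += 1
    else:
        mark = pos
        while pos < n and line[pos] not in ",\n":
            pos += 1
        value = line[mark:pos].strip()

    more = pos < n and line[pos] == ","
    if more:
        pos += 1                            # step past the comma
    return value, pos, more


def parse_row(line):
    """Split one CSV row into a list of field values."""
    fields, pos, more = [], 0, True
    while more:
        value, pos, more = read_field(line, pos)
        fields.append(value)
    return fields


print(parse_row('12345,"ABC, ""IJK"" XYZ"'))
# ['12345', 'ABC, "IJK" XYZ']
```

Collecting the segments in a list and joining them with a quote also covers the case where the first segment is empty, e.g. a field written as """" (a single literal quote).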
I would do this using a single-character look-ahead: if you're scanning a quoted string and find a double quote, look at the next character to see if it is also a double quote. If it is, then the pair represents a single double-quote character in the output. If it's any other character, you're looking at the end of the quoted string (and hopefully that next character is a comma!). Be sure to account for the end-of-line condition when looking at the next character, too.
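A minimal sketch of that look-ahead idea, in Python for brevity. The function name split_csv_line is just for illustration, and it assumes the input is one row with no trailing newline:

```python
def split_csv_line(line):
    """Split one CSV row using a single-character look-ahead."""
    fields = []
    buf = []
    in_quotes = False
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == '"':
                # Look ahead, guarding against the end of the line.
                if i + 1 < n and line[i + 1] == '"':
                    buf.append('"')     # "" means one literal quote
                    i += 1              # consume the second quote
                else:
                    in_quotes = False   # lone quote closes the field
            else:
                buf.append(ch)          # commas are plain data in here
        elif ch == '"':
            in_quotes = True            # opening quote
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields


print(split_csv_line('12345,"ABC, ""IJK"" XYZ"'))
# ['12345', 'ABC, "IJK" XYZ']
```

This version does not check that the character after a closing quote really is a comma; a stricter parser could raise an error there.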
If you find an opening double quote, look for the matching double quote at the end of the word/string. If you can't find one, the input is malformed. The same goes for a single quote.
I suggest you try Flex/Bison to write a parser for the CSV file. The two tools generate a parser for you as C files, which you can then call from your C++ program. In Flex, you create a scanner that finds your tokens, like "word" or ""word"". In Bison, you define the grammar.
A doubled double-quote ("") is a literal double-quote, while a lone double-quote (") is used for enclosing text (including commas).
Here's a regex for a csv field, if that makes things easier:
([^",\n][^,\n]*)|"((?:[^"]|"")+)"
Group 1 will contain the field if it isn't in quotes; group 2 will contain the field if it is in quotes, minus the surrounding quotes. In that case, just replace all instances of "" with ".
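For what it's worth, here is one way that regex could be applied, sketched in Python. The helper name fields_by_regex is invented for the example. Note that the pattern as written needs at least one character per field, so empty fields are silently skipped.

```python
import re

# The field pattern from above: group 1 = unquoted field,
# group 2 = quoted field without its surrounding quotes.
FIELD = re.compile(r'([^",\n][^,\n]*)|"((?:[^"]|"")+)"')


def fields_by_regex(line):
    """Collect every field the pattern finds in one CSV row."""
    out = []
    for m in FIELD.finditer(line):
        if m.group(2) is not None:
            # Quoted field: collapse each "" into a single ".
            out.append(m.group(2).replace('""', '"'))
        else:
            out.append(m.group(1))
    return out


print(fields_by_regex('12345,"ABC, ""IJK"" XYZ"'))
# ['12345', 'ABC, "IJK" XYZ']
print(fields_by_regex('a,,b'))
# ['a', 'b']  (the empty middle field is not matched)
```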
I suggest reading: Stop Rolling Your Own CSV Parser and this CSV RFC. The first is really just someone who wants you to use their C# CSV parser, but still explains many issues.
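To illustrate the "don't roll your own" point: if a language with a CSV library is an option, the doubled-quote case is usually handled out of the box. For example, Python's standard csv module treats "" inside a quoted field as a literal quote by default. This is only a comparison point, not a suggestion that it fits the asker's setup.

```python
import csv
import io

data = '12345,"ABC, ""IJK"" XYZ"\n'

# csv.reader accepts any iterable of lines; StringIO stands in for a file.
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# [['12345', 'ABC, "IJK" XYZ']]
```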
Your parser should examine one character at a time. I used a two-boolean strategy for my parser in D. Each quote toggles whether the string is quoted or not. When you hit a quote inside a quoted cell, you set a flag and turn quoting off. If the next character is a quote, quoting is turned back on, a quote is added to the result and the flag is cleared. If the next character isn't a quote, the flag is cleared and quoting stays off.
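Here is how I read that two-boolean strategy, sketched in Python rather than D. The names split_with_flags, quoted and pending are my own, and the input is assumed to be a single row without its trailing newline.

```python
def split_with_flags(line):
    """Split one CSV row using two booleans and no look-ahead."""
    fields, buf = [], []
    quoted = False    # currently inside a quoted cell
    pending = False   # previous character was a quote seen while quoted
    for ch in line:
        if pending:
            pending = False
            if ch == '"':
                # Second quote of a "" pair: emit one quote, resume quoting.
                buf.append('"')
                quoted = True
                continue
            # Otherwise the earlier quote closed the cell; quoting stays
            # off and ch is handled below as ordinary unquoted input.
        if quoted:
            if ch == '"':
                quoted = False   # tentatively close the cell
                pending = True   # and remember that a quote was seen
            else:
                buf.append(ch)
        elif ch == '"':
            quoted = True        # opening quote
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    if quoted:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields


print(split_with_flags('12345,"ABC, ""IJK"" XYZ"'))
# ['12345', 'ABC, "IJK" XYZ']
```

Compared with a look-ahead, this only ever inspects the current character, which is handy when the input arrives as a stream.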