CSV parsing for embedded double quotes

2023-01-31 19:22

I've written a simple CSV file parser. But after looking at the wiki page on CSV formats I noticed some "extensions" to the basic format, specifically embedded commas via double quotes. I've managed to parse those; however, there is a second issue: embedded double quotes.

Example:

12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]

I can't seem to find a correct way to distinguish an embedded (escaped) double quote from an enclosing one. So my question is: what is the correct way/algorithm to parse CSV formats such as the one above?

The way I normally think about this is to treat a value as either a single unquoted value or a sequence of double-quoted segments that, joined by quote characters, form the value. That is,

to parse the next atom in the row:
- read up to the first non whitespace character
- if the current character is not a quote:
  - mark the current spot
  - read up to the next comma or newline
  - return the text between the mark and the character before the comma (strip spaces if appropriate)
- if the current character is a quote:
  - create an empty string buffer
  - while the current character is a quote
    - mark the current position +1 (skip the quote character)
    - read up to the next quote
    - if this is not the first segment, append a quote to the buffer
    - append to the buffer the text between the mark and the character before the current position (to strip both quotes)
    - advance one character (past the just read quote)
  - read up to the next comma or newline
  - return the buffer

Essentially, split the quoted string into its double-quoted segments and then concatenate them with a quote character between each pair. Thus "ABC, ""IJK"" XYZ" is read as the segments [ABC, ], [IJK] and [ XYZ], which join to ABC, "IJK" XYZ
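
The segment idea can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the name read_atom, the (value, next_pos) return convention, and the choice to raise ValueError on a missing closing quote are mine, not part of the answer. It collects the segments in a list and joins them with quotes at the end, which also copes with a field whose first segment is empty.

```python
def read_atom(text, pos=0):
    """Read one field starting at pos.

    Returns (value, next_pos), where next_pos is the index of the
    delimiter (comma or newline) that ended the field, or len(text).
    """
    n = len(text)
    # Skip leading spaces and tabs.
    while pos < n and text[pos] in " \t":
        pos += 1

    # Unquoted field: take everything up to the next comma or newline.
    if pos >= n or text[pos] != '"':
        end = pos
        while end < n and text[end] not in ",\n":
            end += 1
        return text[pos:end].strip(), end

    # Quoted field: each iteration consumes one "..." segment.
    segments = []
    while pos < n and text[pos] == '"':
        mark = pos + 1                  # skip the opening quote
        close = text.find('"', mark)    # read up to the next quote
        if close == -1:
            raise ValueError("unterminated quoted field")
        segments.append(text[mark:close])
        pos = close + 1                 # advance past the quote just read

    # Ignore anything between the closing quote and the delimiter.
    while pos < n and text[pos] not in ",\n":
        pos += 1
    return '"'.join(segments), pos
```

A caller would read one field, step over the comma at next_pos, and call read_atom again until it reaches a newline or the end of the text.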

I would do this using a single character of look-ahead: if you're scanning a quoted string and find a double quote, look at the next character to see if it is also a double quote. If it is, then the pair represents a single double-quote character in the output. If it's any other character, you're looking at the end of the quoted string (and hopefully that next character is a comma!). Be sure to account for the end-of-line condition when looking at the next character, too.
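
A minimal Python sketch of the look-ahead approach, for one line of input. The function name parse_line and the decision to raise ValueError for an unclosed quote are assumptions made for the example.

```python
def parse_line(line):
    """Split one CSV line into fields using one character of look-ahead."""
    fields = []
    buf = []
    in_quotes = False
    i = 0
    n = len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == '"':
                # Look ahead, guarding against the end of the line.
                if i + 1 < n and line[i + 1] == '"':
                    buf.append('"')   # a "" pair is one literal quote
                    i += 2
                    continue
                in_quotes = False     # a lone quote closes the field
            else:
                buf.append(ch)        # commas are ordinary characters here
        elif ch == '"':
            in_quotes = True
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields
```

For the example in the question, parse_line('12345,"ABC, ""IJK"" XYZ"') gives the two fields 12345 and ABC, "IJK" XYZ.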

If you find an opening double quote, you should look for the matching double quote at the end of the word/string. If you can't find one, the input is in error. The same applies to single quotes.
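
One cheap way to catch the missing-quote case, sketched in Python: in a well-formed record the enclosing quotes come in pairs and so do escaped quotes, so an odd number of double quotes means some quote was never closed (or the field continues on the next line). The helper name has_unclosed_quote is hypothetical, and an even count does not by itself prove the line is valid.

```python
def has_unclosed_quote(line):
    """Return True if the line contains an odd number of double quotes."""
    return line.count('"') % 2 == 1
```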

I suggest you try Flex/Bison to write a parser for the CSV file. Together the two tools generate a parser as C source files, which you can then compile and call from your C++ program. In Flex, you create a scanner that finds your tokens, like "word" or ""word"". In Bison, you define the syntax.

A double double-quote ("") is a literal double-quote, while a lone double-quote (") is used for enclosing text (including commas).
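
As a small Python illustration of that rule, here is the escaping applied in both directions. The helper names quote_field and unquote_field are made up for this sketch, and unquote_field assumes it is handed a complete quoted field, including its enclosing quotes.

```python
def quote_field(value):
    """Enclose a value in quotes, doubling any quotes it contains."""
    return '"' + value.replace('"', '""') + '"'


def unquote_field(quoted):
    """Strip the enclosing quotes and turn each "" back into "."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted field")
    return quoted[1:-1].replace('""', '"')
```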

Here's a regex for a csv field, if that makes things easier:

([^",\n][^,\n]*)|"((?:[^"]|"")+)"

Group 1 will contain the field if it isn't in quotes, group 2 will contain the field if it is in quotes, minus the surrounding quotes. In that case, just replace all instances of "" with ".
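
A sketch of how the pattern might be used from Python's re module; the function name fields_from_line is an assumption. Note that, as written, the pattern requires at least one character per field, so empty fields (as in a,,b) and empty quoted fields ("") are simply not reported by this approach.

```python
import re

# The pattern from above: group 1 is an unquoted field,
# group 2 is the inside of a quoted field.
FIELD_RE = re.compile(r'([^",\n][^,\n]*)|"((?:[^"]|"")+)"')


def fields_from_line(line):
    """Return the non-empty fields that the pattern finds in one line."""
    fields = []
    for match in FIELD_RE.finditer(line):
        if match.group(2) is not None:
            # Quoted field: unescape doubled quotes.
            fields.append(match.group(2).replace('""', '"'))
        else:
            fields.append(match.group(1))
    return fields
```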

I suggest reading: Stop Rolling Your Own CSV Parser and this CSV RFC. The first is really just someone who wants you to use their C# CSV parser, but still explains many issues.
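
The same advice applies in other languages. As an illustration that is not from the linked article, Python's standard csv module already treats a doubled quote inside a quoted field as a literal quote by default, so the example from the question needs no hand-written parser there:

```python
import csv
import io

# The sample record from the question, as it would appear in a file.
data = '12345,"ABC, ""IJK"" XYZ"\n'

# csv.reader accepts any iterable of lines, including a file-like object.
rows = list(csv.reader(io.StringIO(data)))
print(rows)
```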

Your parser should examine one character at a time. I used a two-bool strategy for my parser in D: one bool records whether the cell is currently quoted, the other flags that a quote was just seen. Each quote toggles whether the string is quoted or not. When you hit a quote inside a quoted cell, you set the flag and turn quoting off. If the next character is a quote, quoting is turned back on, a quote is added to the result and the flag is cleared. If the next character isn't a quote, the flag is cleared and quoting stays off.
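
The original parser was in D and is not shown, so the following Python sketch is only my reading of the two-bool idea. The names split_row, quoted and saw_quote are assumptions, as is raising ValueError at the end of a line that is still inside quotes.

```python
def split_row(line):
    """Split one CSV line using two boolean flags and no look-ahead."""
    fields = []
    buf = []
    quoted = False      # currently inside a quoted cell
    saw_quote = False   # the previous character was a quote that turned quoting off
    for ch in line:
        if quoted:
            if ch == '"':
                quoted = False      # maybe the end, maybe half of a "" pair
                saw_quote = True
            else:
                buf.append(ch)
            continue
        if ch == '"':
            if saw_quote:
                buf.append('"')     # second quote of a pair: literal quote
            quoted = True           # quoting is on (again)
            saw_quote = False
            continue
        saw_quote = False           # any other character clears the flag
        if ch == ",":
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    if quoted:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields
```

Compared with look-ahead, this version never peeks at the next character, which can be convenient when the input arrives as a stream.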
