开发者

Parsing commas and quotemarks in degenerate CSV files with Regular Expressions

I need to parse strings inputs where the columns are separated by columns and any field that contains a comma in the data is wrapped in quotes (commas separated, quoted text identifiers). For this project I need to remove the quotes and any commas that occur between pairs of quotes. Basically, I need to remove commas and quotes that are contained in fields while preserving the commas that are used to separate the fields. Here's a little code I put together that handles the simple scenario:

// Sample input 1: This works and covers 99% of the records that I need to parse.
string str1 = "an_email_address@somewhere.com,2010/03/27 12:2:02,,some_first_name,some_last_name,,\"This Address Works, Suite 200\",Some City,TN,09876-5432,9795551212x123,XYZ";
str1 = Regex.Replace(str1, "\"([^\"^,]*),([^\"^,]*)\"", "$1$2");
Console.WriteLine(str1);
// Outputs: an_email_address@somewhere.com,2010/03/27 12:2:02,,some_first_name,some_last_name,,This Address Works Suite 200,Some City,TN,09876-5432,9795551212x123,XYZ

Although this code works for most of my records, it doesn't work when a field contains more than one comma. What I would like to do is modify the code so that it remove each instance of a comma contained within the column no matter how many commas there are in the field. I don't want to hard code only handling 2 commas, or 3 commas, or 25 commas. The code should just remove all the commas in the field. Below is an example of what my code doesn't handle properly.

// Sample input 2: This doesn't work since there is more than 1 comma between the quotes.
string str2 = "an_email_address@somewhere.com,2010/03/27 12:2:02,,some_first_name,some_last_name,,\"i,l,k,e, c,o,m,m,a,s, i,n ,m,y, f,i,e,l,d\",Some City,TN,09876-5432,9795551212x123,XYZ";
str2 = Regex.Replace(str2, "\"([^\"^,]*),([^\"^,]*)\"", "$1$2");
Console.WriteLine(str2);
// Desired output: an_email_address@somewhere.com,2010/03/27 12:2:02,,some_first_na开发者_开发百科me,some_last_name,,i like commas in my field,Some City,TN,09876-5432,9795551212x123,XYZ

How can I accomplish this with regular expressions?


Matching quotes and regular expression don't go hand in hand, and you are probably better of using a CSV parser, as Michael Madsen suggested.
However, if you know the quotes only occur as you expect, you can do something like the following:

str2 = Regex.Replace(str2, "\"[^\"]*\"",
                     match => match.Value.Trim('\"').Replace(",", ""));


Here's a pure regex version:

str2 = Regex.Replace(str0, 
          @"""|,(?=(?>[^""]*""[^""]*(?:""[^""]*""[^""]*)*)$)", 
          String.Empty);

It matches any quotation mark, or a comma if it's followed by an odd number of quotation marks, and replaces it with nothing.

I would only go this route if I absolutely had to, for example if I were working with a framework that only let me specify the regex and the replacement string. Otherwise, I would either go with @Kobi's approach (because it's so much more readable) or use a dedicated CSV processor. They're not hard to find.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜