Parsing a large CSV file, dealing with commas and quotes
I need to load a large CSV file (>1MB) and parse it. Generally this is quite easy to do by splitting first on line breaks and then on commas. The problem, though, is that some entries contain strings that include their own commas. When the spreadsheet is converted to CSV, the entries containing commas are wrapped in quotes.
I've written a parser that first escapes all the commas in these strings, then splits it on linebreaks and then commas, and then unescapes the values again.
This is quite a slow process for such a long string, as I need to iterate through the whole string. Does anyone know a faster or more optimised method of dealing with this?
Have you had a look at csvlib yet? It is a parser library for ActionScript 3. It claims to be designed to properly handle quoted strings.
Hopefully, you are already enclosing your strings in quotes, especially the ones containing the commas. CSV parsers cannot distinguish a comma that is part of a string from a comma that separates two strings, unless the strings have quotes around them.
Good:
"This string, has a comma", "This string doesn't"
Bad:
This string, has a comma, this string doesn't
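To make the difference concrete, here is a small sketch using Python's standard csv module (Python is used only for illustration; the question itself is about ActionScript). It assumes the two example lines above as input, and sets skipinitialspace=True because the examples have a space after each comma.

```python
import csv
import io

# The two example lines from above, as they would appear in a CSV file.
good = '"This string, has a comma", "This string doesn\'t"'
bad = "This string, has a comma, this string doesn't"

# skipinitialspace=True ignores the space that follows each delimiter,
# so the second quoted field in the "good" line is still recognised.
good_fields = next(csv.reader(io.StringIO(good), skipinitialspace=True))
bad_fields = next(csv.reader(io.StringIO(bad), skipinitialspace=True))

print(len(good_fields), good_fields)  # two fields, comma kept inside the first
print(len(bad_fields), bad_fields)    # three fields, the string was split
```

The quoted line comes back as two fields, while the unquoted line is split into three, which is exactly the ambiguity described above.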
Processing the file in a single pass will reduce the time. A simple state machine can handle the commas embedded in quoted values, so no separate escape and unescape steps are needed.
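A minimal sketch of such a state machine follows, written in Python for illustration rather than ActionScript. The function name parse_csv and its parameters are hypothetical, and it assumes the common CSV convention that a doubled quote inside a quoted field stands for a literal quote. The only state is a single flag recording whether the current character is inside quotes.

```python
def parse_csv(text, delimiter=",", quote='"'):
    """Parse CSV text in one pass; returns a list of rows (lists of strings)."""
    rows, row, field = [], [], []
    in_quotes = False  # the only state: are we inside a quoted field?
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < n and text[i + 1] == quote:
                    field.append(quote)  # doubled quote means a literal quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                field.append(ch)         # commas and newlines are plain data here
        elif ch == quote:
            in_quotes = True             # opening quote
        elif ch == delimiter:
            row.append("".join(field))   # end of field
            field = []
        elif ch == "\n" or ch == "\r":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1                   # treat \r\n as one line break
            row.append("".join(field))   # end of field and end of row
            field = []
            rows.append(row)
            row = []
        else:
            field.append(ch)
        i += 1
    # Flush the last row if the text does not end with a line break.
    if field or row:
        row.append("".join(field))
        rows.append(row)
    return rows


sample = 'name,note\r\n"Smith, John","said ""hi"""\nplain,value'
for parsed_row in parse_csv(sample):
    print(parsed_row)
```

Each character is visited once, and fields are built in lists and joined at the end, which avoids repeated string concatenation on a long input. The same loop structure should carry over to ActionScript with charAt and an Array, though that port is not shown or tested here.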
- Add a reference to the Microsoft.VisualBasic assembly (yes, it says VisualBasic, but it works in C# just as well; remember that in the end it is all just IL).
- Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the CSV file.
Here is the sample code:
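Imports Microsoft.VisualBasic.FileIO

' Using disposes the parser even if an exception is thrown
Using parser As New TextFieldParser("C:\mar0112.csv")
    parser.TextFieldType = FieldType.Delimited
    parser.SetDelimiters(",")
    ' Quoted fields are read as single values, so embedded commas are kept (this is the default)
    parser.HasFieldsEnclosedInQuotes = True
    While Not parser.EndOfData
        ' Read one row, already split into fields
        Dim fields() As String = parser.ReadFields()
        For Each field As String In fields
            'TODO: Process field
        Next
    End While
End Using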