Trouble Parsing Unicode CSV File

I'm trying to load a CSV into a DataTable using this:

class CSVReader
{
    public System.Data.DataTable GetDataTable(string strFileName)
    {
        // The Jet text driver treats the directory as the "database"
        // and the file name as the table; dispose the connection when done.
        using (System.Data.OleDb.OleDbConnection conn = new System.Data.OleDb.OleDbConnection(
            "Provider=Microsoft.Jet.OleDb.4.0; Data Source = " + System.IO.Path.GetDirectoryName(strFileName) +
            "; Extended Properties = \"Text;HDR=YES;FMT=Delimited\""))
        {
            conn.Open();
            string strQuery = "SELECT * FROM [" + System.IO.Path.GetFileName(strFileName) + "]";
            System.Data.OleDb.OleDbDataAdapter adapter = new System.Data.OleDb.OleDbDataAdapter(strQuery, conn);
            System.Data.DataSet ds = new System.Data.DataSet("CSV File");
            adapter.Fill(ds);
            return ds.Tables[0];
        }
    }
}

It works fine on one CSV, but not another (the first two lines of the file that doesn't load properly are quoted in the edits below).

It simply loads "T" as the first column name, and everything else is blank/null. I tried manually looking at the first line with

using (StreamReader x = new StreamReader(fileName)) { string firstLine = x.ReadLine(); }

and the equivalent with File.ReadAllLines, referencing the array's first entry (file[0]).

Both simply return "T" as the "first line" and anything beyond that is blank. Any ideas why it only sees the first character in the CSV and nothing else?

EDIT: First line looks like this:

TERM(s),OBJECTID,FILE,PATH,HIT COUNT


The second line looks like this:

"(test)","172911","16369318","Item001.E01/Partition 1/NONAME [NTFS]/[unallocated space]/13621367/16369318","4"

EDIT 2: I switched over to the library somebody linked (CsvReader) and it seems to have taken a lot of the pain out. I tried re-encoding the file as UTF-8 with Notepad++ and it gets farther along, until it hits:

LumenWorks.Framework.IO.Csv.MalformedCsvException was unhandled Message=The CSV appears to be corrupt near record '1373' field '3' at position '2601'. Current raw data : '32/System.ServiceModel/06d6eab93282d2b136a377bd50b7c5a9/System.ServiceModel.ni.dll","11" "(vc)","40656","Adobe AIR Application Installer.swf","Item001.E01/Partition 1/NONAME [NTFS]/[root]/Program Files/Common Files/Adobe AIR/Versions/1.0/Adobe AIR Application Installer.swf","11" "(vc)","503322","䄳䆷䞫䄦䠥","Item001.E01/Partition 1/NONAME [NTFS]/[root]/WINDOWS/Installer/520ae67.msp/䄳䆷䞫䄦䠥","11"

I'm guessing that it is taking issue with the foreign characters in the UTF-8 encoding. If I leave the file in its original encoding, it processes incorrectly. I don't want to make the user open the file and re-save it as ASCII/UTF-16, since it is ~90 MB. I've been trying to Google around, but most people say .NET can handle any encoding.

It seems the file is output as UCS-2 LE (which I think is essentially UTF-16 LE, right?). I'm confused why CsvReader/StreamReader are taking issue.
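
(Worst case, I suppose I could transcode the file in code instead of making the user re-save it. A minimal sketch, assuming the source really is UTF-16 LE; ConvertToUtf8 is just a hypothetical helper name:)

    // Hypothetical helper: stream-convert a UTF-16 LE file to UTF-8
    // line by line, so the ~90 MB file is never held in memory at once.
    // Requires: using System.IO; using System.Text;
    static void ConvertToUtf8(string sourcePath, string destPath)
    {
        using (StreamReader reader = new StreamReader(sourcePath, Encoding.Unicode)) // UTF-16 LE
        using (StreamWriter writer = new StreamWriter(destPath, false, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line);
            }
        }
    }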

Diagnosed but not fully solved

When I append "characterset=Unicode" to the connection string in my OleDb function, it seems to work for the UCS-2 LE/Unicode encoding. I would prefer to use the custom CsvReader lib, but it seems to use TextReader (which, as far as I can tell, can't handle Unicode).

http://www.codeproject.com/KB/database/CsvReader.aspx

The following code does not work. It doesn't throw an error, but it seems to stall out, even on its own thread:

Bad code for UCS-2/Unicode:

    using (CsvReader csv = new CsvReader(
               new StreamReader(kwfile, Encoding.Unicode), true))
    {
        csv.MissingFieldAction = MissingFieldAction.ReplaceByEmpty;
        keywordHits.Load(csv);
    }

Working, but not preferred solution:

    public System.Data.DataTable GetDataTable(string strFileName)
    {
        // Same as before, but with "characterset=Unicode" so the Jet text
        // driver reads the UTF-16 LE file correctly.
        using (System.Data.OleDb.OleDbConnection conn = new System.Data.OleDb.OleDbConnection(
            "Provider=Microsoft.Jet.OleDb.4.0; Data Source = " + System.IO.Path.GetDirectoryName(strFileName) +
            "; Extended Properties = \"Text;characterset=Unicode;HDR=YES;FMT=Delimited\""))
        {
            conn.Open();
            string strQuery = "SELECT * FROM [" + System.IO.Path.GetFileName(strFileName) + "]";
            System.Data.OleDb.OleDbDataAdapter adapter = new System.Data.OleDb.OleDbDataAdapter(strQuery, conn);
            System.Data.DataSet ds = new System.Data.DataSet("CSV File");
            adapter.Fill(ds);
            return ds.Tables[0];
        }
    }
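
(Side note: from what I've read, the same hint can live in a schema.ini file placed next to the CSV instead of in the connection string. A minimal sketch, assuming the file is named keywords.csv; untested on my end:)

    [keywords.csv]
    Format=CSVDelimited
    ColNameHeader=True
    CharacterSet=Unicode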

I've tried it with the optional third parameter of the CsvReader as well, and it makes no difference. The program "works" by loading one CSV with the CsvReader class but has to fall back to OleDb for the Unicode CSV. Obviously, using StreamReader with the Encoding.Unicode parameter works, but then I'd have to re-invent the wheel to parse out possibly malformed entries. Any thoughts? Or is this the best I can do without rewriting the CsvReader? (One alternative I've been eyeing is sketched below.)
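
(The alternative I mean: the built-in Microsoft.VisualBasic.FileIO.TextFieldParser takes an Encoding and handles quoted fields, so something like this might sidestep the issue entirely. A sketch only; kwfile is the same path variable as above, and I haven't verified it against the full 90 MB file:)

    // Requires a reference to Microsoft.VisualBasic.dll and:
    // using Microsoft.VisualBasic.FileIO; using System.Text;
    using (TextFieldParser parser = new TextFieldParser(kwfile, Encoding.Unicode))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true;
        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields(); // one CSV record per call
            // ... append fields to the DataTable here
        }
    }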


A long shot, but perhaps your file is encoded using UTF-16. The bytes in the file would look like this:

0x54 0x00 0x45 0x00 ...

Reading these bytes using UTF-8 encoding (the default for StreamReader) will yield the following characters:

T <NUL> E <NUL> ...

Try to open your file using a binary editor. If the encoding is unexpected then open it in a text editor and save it using an encoding that will work for you (ASCII or UTF-8 are good candidates).
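
If you'd rather check from code than install a hex editor, here is a quick sketch that dumps the first few bytes (fileName is your CSV path). A UTF-16 LE file typically starts with the byte order mark 0xFF 0xFE:

    // Requires: using System; using System.IO;
    byte[] head = new byte[8];
    using (FileStream fs = File.OpenRead(fileName))
    {
        fs.Read(head, 0, head.Length);
    }
    // For a UTF-16 LE file starting with "TERM", this prints something like
    // "FF-FE-54-00-45-00-52-00".
    Console.WriteLine(BitConverter.ToString(head));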


Is this snippet from the beginning of the file? I found that the OleDb reader bases the schema on the first row, so if subsequent rows have, for example, more columns, the additional columns are ignored.

I stopped using it for that reason. If you want something lightweight, I have a fully-functional CsvReader in this answer.


If your file is UTF-16 encoded, try specifying the encoding in the StreamReader and/or CSV reader code. The default encoding in .NET is UTF-8, which will produce the <NUL> characters Martin Liversage described. StreamReader has an overload that takes an encoding: new StreamReader(path, Encoding.Unicode), where Encoding.Unicode is .NET's name for UTF-16 LE. For the CSV classes I don't know.
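
For example (path being your CSV file):

    // Encoding.Unicode = UTF-16 little-endian, matching a UCS-2 LE file.
    using (StreamReader reader = new StreamReader(path, Encoding.Unicode))
    {
        string header = reader.ReadLine(); // "TERM(s),OBJECTID,FILE,PATH,HIT COUNT"
    }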
