Trouble Parsing Unicode CSV File

I'm trying to load a CSV into a DataTable using this:

class CSVReader
{
    public System.Data.DataTable GetDataTable(string strFileName)
    {
        // The Jet text driver treats the directory as the "database"
        // and the file name as the table; dispose the connection when done.
        using (System.Data.OleDb.OleDbConnection conn = new System.Data.OleDb.OleDbConnection(
            "Provider=Microsoft.Jet.OleDb.4.0; Data Source = " + System.IO.Path.GetDirectoryName(strFileName) +
            "; Extended Properties = \"Text;HDR=YES;FMT=Delimited\""))
        {
            conn.Open();
            string strQuery = "SELECT * FROM [" + System.IO.Path.GetFileName(strFileName) + "]";
            System.Data.OleDb.OleDbDataAdapter adapter = new System.Data.OleDb.OleDbDataAdapter(strQuery, conn);
            System.Data.DataSet ds = new System.Data.DataSet("CSV File");
            adapter.Fill(ds);
            return ds.Tables[0];
        }
    }
}

It works fine on one CSV, but not another (the first two lines of the file that doesn't load properly are quoted in the edits below).

It simply loads "T" as the first column name, and everything else is blank/null. I tried manually looking at the first line with

using (StreamReader x = new StreamReader(fileName)) { string firstLine = x.ReadLine(); }

and the equivalent with File.ReadAllLines, referencing the array's first entry (file[0]).

Both simply return "T" as the "first line" and anything beyond that is blank. Any ideas why it only sees the first character in the CSV and nothing else?

EDIT: First line looks like this:

TERM(s),OBJECTID,FILE,PATH,HIT COUNT


The second line looks like this:

"(test)","172911","16369318","Item001.E01/Partition 1/NONAME [NTFS]/[unallocated space]/13621367/16369318","4"

EDIT 2: I switched over to the library somebody linked (CsvReader) and it seems to have taken a lot of the pain out. I tried re-encoding the file as UTF-8 with Notepad++ and it gets farther along, until it hits:

LumenWorks.Framework.IO.Csv.MalformedCsvException was unhandled Message=The CSV appears to be corrupt near record '1373' field '3' at position '2601'. Current raw data : '32/System.ServiceModel/06d6eab93282d2b136a377bd50b7c5a9/System.ServiceModel.ni.dll","11" "(vc)","40656","Adobe AIR Application Installer.swf","Item001.E01/Partition 1/NONAME [NTFS]/[root]/Program Files/Common Files/Adobe AIR/Versions/1.0/Adobe AIR Application Installer.swf","11" "(vc)","503322","䄳䆷䞫䄦䠥","Item001.E01/Partition 1/NONAME [NTFS]/[root]/WINDOWS/Installer/520ae67.msp/䄳䆷䞫䄦䠥","11"

I'm guessing that it is taking issue with the foreign characters in the UTF-8 encoding. If I leave the file in its original encoding, it processes incorrectly. I don't want to make the user open the file and re-save it as ASCII/UTF-16, since it is ~90 MB. I've been trying to Google around, but most people say .NET can handle any encoding.

It seems the file is output as UCS-2 LE (which I think is essentially UTF-16 LE, right?). I'm confused why CsvReader/StreamReader are taking issue.
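
(Worst case, I suppose I could transcode the file in code instead of making the user re-save it. A minimal sketch, assuming the source really is UTF-16 LE; ConvertToUtf8 is just a hypothetical helper name:)

    // Hypothetical helper: stream-convert a UTF-16 LE file to UTF-8
    // line by line, so the ~90 MB file is never held in memory at once.
    // Requires: using System.IO; using System.Text;
    static void ConvertToUtf8(string sourcePath, string destPath)
    {
        using (StreamReader reader = new StreamReader(sourcePath, Encoding.Unicode)) // UTF-16 LE
        using (StreamWriter writer = new StreamWriter(destPath, false, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line);
            }
        }
    }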

Diagnosed but not fully solved

When I append "characterset=Unicode" to the connection string in my OleDb function, it seems to work for the UCS-2 LE/Unicode encoding. I would prefer to use the custom CsvReader lib, but it seems to use TextReader (which, as far as I can tell, can't handle Unicode).

http://www.codeproject.com/KB/database/CsvReader.aspx

The following code does not work. It doesn't throw an error, but it seems to stall out, even on its own thread:

Bad code for UCS-2/Unicode:

    using (CsvReader csv = new CsvReader(
               new StreamReader(kwfile, Encoding.Unicode), true))
    {
        csv.MissingFieldAction = MissingFieldAction.ReplaceByEmpty;
        keywordHits.Load(csv);
    }

Working, but not preferred solution:

    public System.Data.DataTable GetDataTable(string strFileName)
    {
        // Same as before, but with "characterset=Unicode" so the Jet text
        // driver reads the UTF-16 LE file correctly.
        using (System.Data.OleDb.OleDbConnection conn = new System.Data.OleDb.OleDbConnection(
            "Provider=Microsoft.Jet.OleDb.4.0; Data Source = " + System.IO.Path.GetDirectoryName(strFileName) +
            "; Extended Properties = \"Text;characterset=Unicode;HDR=YES;FMT=Delimited\""))
        {
            conn.Open();
            string strQuery = "SELECT * FROM [" + System.IO.Path.GetFileName(strFileName) + "]";
            System.Data.OleDb.OleDbDataAdapter adapter = new System.Data.OleDb.OleDbDataAdapter(strQuery, conn);
            System.Data.DataSet ds = new System.Data.DataSet("CSV File");
            adapter.Fill(ds);
            return ds.Tables[0];
        }
    }
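
(Side note: from what I've read, the same hint can live in a schema.ini file placed next to the CSV instead of in the connection string. A minimal sketch, assuming the file is named keywords.csv; untested on my end:)

    [keywords.csv]
    Format=CSVDelimited
    ColNameHeader=True
    CharacterSet=Unicode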

I've tried it with the optional third parameter of the CsvReader as well, and it makes no difference. The program "works" by loading one CSV with the CsvReader class but has to fall back to OleDb for the Unicode CSV. Obviously, using StreamReader with the Encoding.Unicode parameter works, but then I'd have to re-invent the wheel to parse out possibly malformed entries. Any thoughts? Or is this the best I can do without rewriting the CsvReader? (One alternative I've been eyeing is sketched below.)
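
(The alternative I mean: the built-in Microsoft.VisualBasic.FileIO.TextFieldParser takes an Encoding and handles quoted fields, so something like this might sidestep the issue entirely. A sketch only; kwfile is the same path variable as above, and I haven't verified it against the full 90 MB file:)

    // Requires a reference to Microsoft.VisualBasic.dll and:
    // using Microsoft.VisualBasic.FileIO; using System.Text;
    using (TextFieldParser parser = new TextFieldParser(kwfile, Encoding.Unicode))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true;
        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields(); // one CSV record per call
            // ... append fields to the DataTable here
        }
    }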


A long shot, but perhaps your file is encoded using UTF-16. The bytes in the file would look like this:

0x54 0x00 0x45 0x00 ...

Reading these bytes using UTF-8 encoding (the default for StreamReader) will yield the following characters:

T <NUL> E <NUL> ...

Try to open your file using a binary editor. If the encoding is unexpected then open it in a text editor and save it using an encoding that will work for you (ASCII or UTF-8 are good candidates).
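
If you'd rather check from code than install a hex editor, here is a quick sketch that dumps the first few bytes (fileName is your CSV path). A UTF-16 LE file typically starts with the byte order mark 0xFF 0xFE:

    // Requires: using System; using System.IO;
    byte[] head = new byte[8];
    using (FileStream fs = File.OpenRead(fileName))
    {
        fs.Read(head, 0, head.Length);
    }
    // For a UTF-16 LE file starting with "TERM", this prints something like
    // "FF-FE-54-00-45-00-52-00".
    Console.WriteLine(BitConverter.ToString(head));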


Is this snippet from the beginning of the file? I found that the OleDb reader bases the schema on the first row, so if subsequent rows have, for example, more columns, the additional columns are ignored.

I stopped using it for that reason. If you want something lightweight, I have a fully-functional CsvReader in this answer.


If your file is UTF-16 encoded, try specifying the encoding in the StreamReader and/or CSV reader code. The default encoding in .NET is UTF-8, which will produce the <NUL> characters Martin Liversage described. StreamReader has an overload that takes an encoding: new StreamReader(path, Encoding.Unicode), where Encoding.Unicode is .NET's name for UTF-16 LE. For the CSV classes I don't know.
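
For example (path being your CSV file):

    // Encoding.Unicode = UTF-16 little-endian, matching a UCS-2 LE file.
    using (StreamReader reader = new StreamReader(path, Encoding.Unicode))
    {
        string header = reader.ReadLine(); // "TERM(s),OBJECTID,FILE,PATH,HIT COUNT"
    }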
