How can I parse this file in C# when there is a CRLF inside a field?

I'm trying to parse a file that looks like this:

|| Column Header A || Column Header B || Column Header C ||CRLF

| Data A | Data B | Data C |CRLF

| Data A | Data B | Data C |CRLF

("CRLF" represents a line break)

I had code to parse this fine:

I first parse the file into an array of lines:

 string[] lines = fileString.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

Then, I parse each row to an array of column data values,

First, I parse to get the header using:

  string Delimiter = "||";
  string[] columns = line.Split(new string[] { Delimiter }, StringSplitOptions.RemoveEmptyEntries);

Then parse the rest of the rows using

  string Delimiter = "|";
  string[] columns = line.Split(new string[] { Delimiter }, StringSplitOptions.RemoveEmptyEntries);

This worked perfectly until I found a record that had a CRLF inside of a field so my parsing broke up.

Can anyone think of a good way to parse the data below that handles CRLF correctly? Here is an example:

|| Column Header A || Column Header B || Column Header C ||CRLF

| Data A | Data B | Data C |CRLF

| Data A | Data B CRLF Continued B | Data C |CRLF

The issue is that when I do the initial parsing to get the array of lines, I get 4 lines here instead of 3 (because the last line shows up as two entries in that array.)


What you have here is delimited text. String.Split() is a notoriously naive choice for parsing that kind of data: it's slow and prone to problems such as the one you're experiencing now. A better solution is something like the Microsoft.VisualBasic.FileIO.TextFieldParser class or the fast CSV reader over on CodeProject. Bear in mind that such parsers typically allow a line break inside a field only when the field is quoted.


Not exactly elegant, but this brute-force solution is the first to come to mind. Split, and then combine if short:

var lines = content.Split(...);
string[] header = lines[0].Split(...);
int numberOfColumns = header.Length;

var parsedLines = new List<string[]>();
for (int i = 1; i < lines.Length; i++) {
   var line = lines[i];
   string[] fields;

   // Too few fields means the record was cut in two by an embedded CRLF.
   while ((fields = line.Split(...)).Length < numberOfColumns && i + 1 < lines.Length) {
     // combine with the next line (restoring the line break), and increment i
     line += Environment.NewLine + lines[++i];
   }

   parsedLines.Add(fields);
}
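
For reference, here is a minimal self-contained sketch of the same split-and-combine idea, with the elided Split calls filled in using the delimiters from the question ("||" for the header, "|" for data rows). The class name SplitAndCombine and the method name Parse are made up for illustration, and the sketch assumes the first physical line is a complete header row.

```csharp
using System;
using System.Collections.Generic;

public static class SplitAndCombine
{
    // Hypothetical helper: parses the pipe-delimited text from the question,
    // gluing physical lines together until a record has enough fields.
    public static List<string[]> Parse(string content)
    {
        var newlineChars = new[] { '\r', '\n' };
        string[] lines = content.Split(newlineChars, StringSplitOptions.RemoveEmptyEntries);

        // The header row uses "||" between columns.
        string[] header = lines[0].Split(new[] { "||" }, StringSplitOptions.RemoveEmptyEntries);
        int numberOfColumns = header.Length;

        var parsedLines = new List<string[]>();
        for (int i = 1; i < lines.Length; i++)
        {
            string line = lines[i];
            string[] fields = line.Split(new[] { "|" }, StringSplitOptions.RemoveEmptyEntries);

            // Too few fields means the record continues on the next physical line.
            while (fields.Length < numberOfColumns && i + 1 < lines.Length)
            {
                line += "\r\n" + lines[++i];
                fields = line.Split(new[] { "|" }, StringSplitOptions.RemoveEmptyEntries);
            }
            parsedLines.Add(fields);
        }
        return parsedLines;
    }
}
```

One limitation of counting fields: a line break inside the last field of a row goes undetected, because the first fragment already contains the full number of fields.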


There's a simple fix in this case:

Grab one line. Does it end with a |? If not, add a CRLF and the next line to it. Repeat until it does end in |, then parse it.
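
A rough sketch of that loop follows. The names PipeRecordReader and ReadRecords are made up for illustration. It assumes every complete record ends with | (ignoring trailing whitespace) and that no embedded line break falls directly after a |.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PipeRecordReader
{
    // Hypothetical helper: yields one logical record per iteration, joining
    // physical lines until the accumulated text ends with '|'.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        string pending = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Ignore blank lines between records.
            if (pending == null && line.Length == 0) continue;

            // Put back the CRLF that ReadLine stripped from inside the field.
            pending = pending == null ? line : pending + "\r\n" + line;

            if (pending.TrimEnd().EndsWith("|", StringComparison.Ordinal))
            {
                yield return pending;
                pending = null;
            }
        }
        // Assumption: a trailing fragment without a closing '|' is still returned.
        if (pending != null) yield return pending;
    }
}
```

Each record returned this way can then be split on | (or || for the header) exactly as in the question's original code.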


This is a classic example of bad data, or rather a bad choice of delimiters. Before writing a parser, you must be 100% sure about the data your code should expect.

In this case you encountered a CRLF in your data. How would you (or your code) know that it's not actually a delimiter?

I'd say use a better delimiter if you have the choice.

EDIT: You need to agree on the delimiter with the sender, and then it is the sender's responsibility to ensure the data quality.

Looking at your sample data, '|CRLF' seems to be a better record delimiter than 'CRLF'. But how do you (the parser) make sure that this delimiter does not occur in the actual data? You cannot. What you can do is validate the data against the pattern agreed with the sender (e.g. the number of columns in a record). If the validation fails, report the error back to the sender and ask for a retransmission.
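
As an illustration only, splitting on '|CRLF' and validating the column count could look something like the sketch below. RecordValidator, ParseAndValidate and the expectedColumns parameter are hypothetical names, and the sketch assumes Windows-style CRLF line endings and no intentionally empty fields (empty entries are dropped).

```csharp
using System;
using System.Collections.Generic;

public static class RecordValidator
{
    // Hypothetical helper: splits on "|" + CRLF and rejects any record whose
    // field count differs from the agreed number of columns.
    public static List<string[]> ParseAndValidate(string content, int expectedColumns)
    {
        string[] records = content.Split(new[] { "|\r\n" }, StringSplitOptions.RemoveEmptyEntries);
        var rows = new List<string[]>();
        for (int i = 0; i < records.Length; i++)
        {
            // "||" in the header collapses to empty entries, which are dropped.
            string[] fields = records[i].Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);
            if (fields.Length != expectedColumns)
                throw new FormatException(
                    $"Record {i + 1} has {fields.Length} fields, expected {expectedColumns}.");
            rows.Add(fields);
        }
        return rows;
    }
}
```

The first row returned is the header. A FormatException here is the point at which you would go back to the sender.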

A better approach would be for the sender to give you a header with the details of the data (i.e. the number of records, the number of columns, etc.).

As a parser, your control over the data is limited. This problem NEEDS support from the sender.


Just an idea based on what you've shown in the question:

Remove every CRLF that doesn't appear right after | or ||, leaving only the final one in each row (to mark the line break). With that done, I think your current code will still work the way you want.

Something like this:

string wrongLine = "| Data A | Data B \r\n Continued B | Data C |\r\n";

string rightLine = wrongLine.Replace(" " + Environment.NewLine, string.Empty);

It'll give you this output (maintaining the last CRLF):

"| Data A | Data B Continued B | Data C |\r\n"
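
The Replace call above only catches a CRLF that happens to follow a space. A slightly more general sketch of the same idea uses a regular expression with a negative lookbehind. LineBreakCleaner and RemoveEmbeddedLineBreaks are made-up names, and the sketch assumes CRLF line endings and no blank lines between records.

```csharp
using System.Text.RegularExpressions;

public static class LineBreakCleaner
{
    // Hypothetical helper: replaces every CRLF that is not immediately
    // preceded by '|' with a single space.
    public static string RemoveEmbeddedLineBreaks(string text)
    {
        return Regex.Replace(text, @"(?<!\|)\r\n", " ");
    }
}
```

Because the line break is swapped for a space rather than deleted, words on either side never run together, at the cost of some extra whitespace inside the field.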


You should consider a CSV parsing library.

However, if you are really against that path and can guarantee that your column headers are free of stray CRLFs, you could do something like this (more proof of concept than best practice):

string headerDelimiter = "||";
string cellDelimiter = "|";
int headerEnd = fileString.IndexOf(Environment.NewLine);

// The header row is everything up to the first line break.
string[] columns = fileString.Substring(0, headerEnd)
   .Split(new string[] { headerDelimiter }, StringSplitOptions.RemoveEmptyEntries);

// Everything after the header is split into cells; data rows use a single "|".
string[] cells = fileString.Substring(headerEnd)
   .Split(new string[] { cellDelimiter }, StringSplitOptions.RemoveEmptyEntries);

List<string> rows = new List<string>();
StringBuilder row = new StringBuilder();
int cellIndex = 0;
int breakIndex = columns.Length;
char[] trimChars = new char[] { '\r','\n',' ' };

foreach(string c in cells)
{
   // Skip the line break between one row's closing "|" and the next row's opening "|".
   if (c.Trim(trimChars).Length == 0) continue;

   if (cellIndex == breakIndex)
   {
       rows.Add(row.ToString().Trim(trimChars));
       cellIndex = 0;
       row = new StringBuilder();
   }
   row.Append(c).Append(" ");
   cellIndex ++;
}
rows.Add(row.ToString().Trim(trimChars));