Regex to parse a tab-delimited file

I'm trying to get this regex to work for capturing the fields on a tab-delimited line. It seems to work in all cases except when the line starts with two tabs:

^\t|"(?<field>[^"]+|\t(?=\t))"|(?<field>[^\t]+|\t(?=\t))|\t$

For example, where \t represents a tab, the line

\t \t 123 \t abc \t 345 \t efg

yields only 5 captured fields instead of 6, because one of the two leading empty fields (the "blanks" between the tabs) is omitted.


Regular expressions are probably not the best tool for this job. I suggest you use the TextFieldParser class, which is intended for parsing files with delimited or fixed-width fields. The fact that it resides in the Microsoft.VisualBasic assembly is a little annoying if you're coding in C#, but it doesn't prevent you from using it.
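As a rough sketch of what that could look like for tab-delimited input: the class name `TabDelimitedReader` and the method `ReadRows` below are made up for illustration, and the option values are assumptions about what the question needs (tab as the only delimiter, quoted fields allowed). It requires a reference to the Microsoft.VisualBasic assembly.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TabDelimitedReader
{
    // Hypothetical helper: reads every line from the reader and returns
    // one string[] of fields per line.
    public static List<string[]> ReadRows(TextReader reader)
    {
        var rows = new List<string[]>();

        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("\t");              // tab is the only delimiter
            parser.HasFieldsEnclosedInQuotes = true; // allow "quoted" fields
            parser.TrimWhiteSpace = false;           // keep field contents as-is

            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }

        return rows;
    }
}
```

To try it on the line from the question, you could pass `new StringReader("\t\t123\tabc\t345\tefg")` and inspect the returned array. To read a file instead, pass a `StreamReader`. Note that TextFieldParser skips blank lines, so a line containing nothing but whitespace will not produce a row.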


I agree that a regex is not the right tool for the job here.

I was in the middle of cleaning this code up when Thomas posted that link to a nice little gem in the framework. I've used this method for parsing delimited text that may contain quoted strings and escape characters. It's probably not the most optimized in the world but it's pretty readable in my opinion and it gets the job done.

// Needs: using System.Collections.Generic; using System.Text;
// Place this method inside a static class of your choice.

/// <summary>
/// Breaks a string into tokens using a delimiter, a text qualifier, and an escape character.
/// </summary>
/// <param name="line">The string to tokenize.</param>
/// <param name="delimiter">The delimiter between tokens, such as a comma or a tab.</param>
/// <param name="textQualifier">The text qualifier, which allows the delimiter to be embedded in a single token.</param>
/// <param name="escapeSequence">The escape character, which allows the text qualifier to be embedded in a token.</param>
/// <returns>A lazily evaluated sequence of string tokens.</returns>
public static IEnumerable<string> Tokenize(string line, char delimiter, char textQualifier = '"', char escapeSequence = '\\')
{
    var inString = false;
    var escapeNext = false;
    var token = new StringBuilder();

    foreach (char c in line) {

        // If the last character was the escape character, then it doesn't matter what
        // this character is (delimiter, text qualifier, etc.) because it needs
        // to appear as part of the field value.
        if (escapeNext) {
            escapeNext = false;
            token.Append(c);
            continue;
        }

        if (c == escapeSequence) {
            escapeNext = true;
            continue;
        }

        // A text qualifier toggles quoted mode and is not part of the token itself.
        if (c == textQualifier) {
            inString = !inString;
            continue;
        }

        // Hit the end of the current token? Empty tokens are returned too,
        // so consecutive delimiters produce empty strings.
        if (c == delimiter && !inString) {
            yield return token.ToString();

            // Clear the string builder (instead of allocating a new one).
            token.Clear();
            continue;
        }

        token.Append(c);
    }

    // The last token has no trailing delimiter, so return it here.
    yield return token.ToString();
}