C# - Splitting on a pipe with an escaped pipe in the data?
I've got a pipe delimited file that I would like to split (I'm using C#). For example:
This|is|a|test
However, some of the data can contain a pipe in it. If it does, it will be escaped with a backslash:
This|is|a|pip\|ed|test (this is a pip|ed test)
I'm wondering if there is a regexp or some other method to split this apart on just the "pure" pipes (that is, pipes that have no backslash in front of them). My current method is to replace the escaped pipes with a custom bit of text, split on pipes, and then replace my custom text with a pipe. Not very elegant and I can't help but think there's a better way. Thanks for any help.
Just use String.IndexOf() to find the next pipe. If the previous character is not a backslash, then use String.Substring() to extract the word. Alternatively, you could use String.IndexOfAny() to find the next occurrence of either the pipe or backslash.
I do a lot of parsing like this, and it is really pretty straightforward. Done correctly, this approach will also tend to run faster.
EDIT
In fact, maybe something like this. It would be interesting to see how this compares performance-wise to a RegEx solution.
public List<string> ParseWords(string s)
{
    List<string> words = new List<string>();
    int pos = 0;
    while (pos < s.Length)
    {
        // Get word start
        int start = pos;
        // Get word end
        pos = s.IndexOf('|', pos);
        while (pos > 0 && s[pos - 1] == '\\')
        {
            pos++;
            pos = s.IndexOf('|', pos);
        }
        // Adjust for pipe not found
        if (pos < 0)
            pos = s.Length;
        // Extract this word
        words.Add(s.Substring(start, pos - start));
        // Skip over pipe
        if (pos < s.Length)
            pos++;
    }
    return words;
}
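The String.IndexOfAny() alternative mentioned above could look roughly like the sketch below. It is not the code from this answer: the class and method names (PipeSplitter, SplitEscaped) are made up for illustration, and it assumes that a backslash makes the next character literal, so the escape backslash is removed from the returned fields.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class PipeSplitter
{
    private static readonly char[] Special = { '|', '\\' };

    // Splits on unescaped pipes. Assumption: a backslash escapes the
    // character that follows it, and the backslash itself is dropped.
    public static List<string> SplitEscaped(string s)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        int pos = 0;
        while (true)
        {
            // Jump straight to the next pipe or backslash
            int next = s.IndexOfAny(Special, pos);
            if (next < 0)
            {
                // No more special characters: the rest is the last field
                current.Append(s, pos, s.Length - pos);
                fields.Add(current.ToString());
                return fields;
            }
            // Copy the plain text in front of the special character
            current.Append(s, pos, next - pos);
            if (s[next] == '|')
            {
                // Unescaped pipe: the current field is complete
                fields.Add(current.ToString());
                current.Clear();
                pos = next + 1;
            }
            else if (next + 1 < s.Length)
            {
                // Backslash: keep the following character literally
                current.Append(s[next + 1]);
                pos = next + 2;
            }
            else
            {
                // Lone backslash at the very end: keep it as-is
                current.Append('\\');
                pos = next + 1;
            }
        }
    }
}
```

Calling PipeSplitter.SplitEscaped(@"This|is|a|pip\|ed|test") gives five fields, the fourth being pip|ed. Unlike ParseWords, this sketch also returns a trailing empty field for input that ends in a pipe, which is a design choice rather than a requirement.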
This oughta do it:
string test = @"This|is|a|pip\|ed|test (this is a pip|ed test)";
string[] parts = Regex.Split(test, @"(?<!(?<!\\)*\\)\|");
The regular expression basically says: split on pipes that aren't preceded by an escape character. I shouldn't take any credit for this though, I just hijacked the regular expression from this post and simplified it.
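As a rough illustration of "split on pipes that aren't preceded by an escape character", here is a sketch that uses a plain single-character lookbehind, (?<!\\)\|, rather than the exact pattern above, and then removes the escape backslashes from each part. The class name RegexPipeSplit is hypothetical, and like the pattern above it does not treat \\ as an escaped backslash.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class RegexPipeSplit
{
    public static string[] Split(string input)
    {
        // Split on every pipe that is not directly preceded by a backslash
        string[] parts = Regex.Split(input, @"(?<!\\)\|");

        // Turn the escaped pipes that remain inside the parts into plain pipes
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Replace(@"\|", "|");
        }
        return parts;
    }
}
```

For the input @"This|is|a|pip\|ed|test" this returns five parts, and the fourth one is pip|ed.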
EDIT
In terms of performance, compared to the manual parsing method provided in this thread, I found that this Regex implementation is 3 to 5 times slower than Jonathon Wood's implementation using the longer test string provided by the OP.
With that said, if you don't instantiate or add the words to a List<string> and return void instead, Jon's method comes in at about 5 times faster than the Regex.Split() method (0.01ms vs. 0.002ms) for purely splitting up the string. If you add back the overhead of managing and returning a List<string>, it was about 3.6 times faster (0.01ms vs. 0.00275ms), averaged over a few sets of a million iterations. I did not use the static Regex.Split() for this test; I instead created a new Regex instance with the expression above outside of my test loop and then called its Split method.
UPDATE
Using the static Regex.Split() function is actually a lot faster than reusing an instance of the expression. With this implementation, the use of regex is only about 1.6 times slower than Jon's implementation (0.0043ms vs. 0.00275ms).
The results were the same using the extended regular expression from the post I linked to.
I came across a similar scenario. In my case the number of unescaped pipes (that is, pipes not written as "\|") was fixed. This is how I handled it:
string sPipeSplit = "This|is|a|pip\\|ed|test (this is a pip|ed test)";
string sTempString = sPipeSplit.Replace("\\|", "¬"); // replace \| with a placeholder character that is assumed not to occur in the data
string[] sSplitString = sTempString.Split('|');
//string sFirstString = sSplitString[0].Replace("¬", "\\|"); // If you have a fixed number of fields and copy each one to another field, do the replace while copying.
/* Or you could use a loop to replace everything at once.
   Replace() returns a new string, so the result has to be assigned back:
for (int i = 0; i < sSplitString.Length; i++)
{
    sSplitString[i] = sSplitString[i].Replace("¬", "\\|");
}
*/
Here is another solution.
One of the most beautiful things about programming is that there are several ways to solve the same problem:
string text = @"This|is|a|pip\|ed|test"; //The original text
string parsed = ""; //Where you will store the parsed string
bool flag = false;
foreach (var x in text.Split('|')) {
    bool endsWithArroba = x.EndsWith(@"\");
    parsed += flag ? "|" + x + " " : endsWithArroba ? x.Substring(0, x.Length-1) : x + " ";
    flag = endsWithArroba;
}
Cory's solution is pretty good. But if you prefer not to work with Regex, you can simply search for "\|" and replace it with some other character, do your split, and then replace that character with "\|" again.
Another option is to do the split, then examine all the strings, and if the last character of one is a \, join it with the next string.
Of course, all this ignores what happens if you need an escaped backslash before a pipe, like "\\|".
Overall, I lean towards regex though.
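For what it's worth, the "split, then join pieces that end in a backslash" option could be sketched as below. The names (SplitAndRejoin, Split) are invented for the example, and as noted above it makes no attempt to handle an escaped backslash in front of a pipe.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper illustrating the split-then-rejoin idea.
public static class SplitAndRejoin
{
    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        string[] pieces = input.Split('|');

        for (int i = 0; i < pieces.Length; i++)
        {
            string piece = pieces[i];

            // A piece ending in a backslash means the pipe after it was escaped.
            // The last piece has no pipe after it, so it is never rejoined.
            bool escapedPipe = i < pieces.Length - 1
                && piece.EndsWith("\\", StringComparison.Ordinal);

            if (escapedPipe)
            {
                // Drop the backslash, put the pipe back, keep building this field
                current.Append(piece, 0, piece.Length - 1).Append('|');
            }
            else
            {
                current.Append(piece);
                fields.Add(current.ToString());
                current.Clear();
            }
        }
        return fields;
    }
}
```

With @"This|is|a|pip\|ed|test" as input, the result has five fields and the fourth is pip|ed.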
Frankly, I prefer to use FileHelpers because, even though this isn't comma delimited, it's basically the same thing. And they have a great story about why you shouldn't write this stuff yourself.
You can do this with a regex. Once you decide to use a backslash as your escape character, you have two escape cases to account for:
- Escaping a pipe: \|
- Escaping a backslash that you want interpreted literally.
Both of these can be done in the same regex. Escaped backslashes will always be two \ characters together. Consecutive escaped backslashes will always be even numbers of \ characters. If you find an odd-numbered sequence of \ before a pipe, it means you have several escaped backslashes, followed by an escaped pipe. So you want to use something like this:
/^(?:((?:[^|\\]|(?:\\{2})|\\\|)+)(?:\||$))*/
Confusing, perhaps, but it should work. Explanation:
^              #The start of a line
(?:...
  [^|\\]       #A character other than | or \ OR
  (?:\\{2})    #Two \ characters, i.e. one escaped backslash (repeating the group covers any even number) OR
  \\\|         #A literal \ followed by a literal |
...)+          #Repeat the preceding at least once
(?:\||$)       #Either a literal | or the end of a line
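In C# specifically, the odd/even backslash rule can also be written as a split pattern, because .NET allows variable-length lookbehind. The sketch below is an alternative to the expression above, not a translation of it: it splits on a pipe preceded by an even number (including zero) of backslashes, then unescapes each field. The class name EscapeAwareSplit is made up, and it assumes a backslash escapes whatever single character follows it.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not the regex from the answer above.
public static class EscapeAwareSplit
{
    // A pipe preceded by an even-length run of backslashes is a real delimiter.
    // The inner (?<!\\) makes sure the whole run is counted, not just its tail.
    private const string Delimiter = @"(?<=(?<!\\)(?:\\\\)*)\|";

    public static string[] Split(string input)
    {
        string[] fields = Regex.Split(input, Delimiter);
        for (int i = 0; i < fields.Length; i++)
        {
            // Unescape: a backslash followed by any character becomes that character
            fields[i] = Regex.Replace(fields[i], @"\\(.)", "$1", RegexOptions.Singleline);
        }
        return fields;
    }
}
```

Here @"a\\|b" (two backslashes before the pipe) splits into a\ and b, while @"a\|b" stays a single field, a|b.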