Text Parsing - My Parser Skipping commands
I'm trying to parse text formatting. I want to mark inline code, much like SO does, with backticks (`). The rule is supposed to be that if you want to use a backtick inside an inline code element, you should use double backticks around the inline code, like this:
`` mark inline code with backticks ( ` ) ``
My parser seems to skip over the double backticks completely for some reason. Here's the code for the function that does the inline code parsing:
private string ParseInlineCode(string input)
{
    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == '`' && input[i - 1] != '\\')
        {
            if (input[i + 1] == '`')
            {
                string str = ReadToCharacter('`', i + 2, input);
                while (input[i + str.Length + 2] != '`')
                {
                    str += ReadToCharacter('`', i + str.Length + 3, input);
                }
                string tbr = "``" + str + "``";
                str = str.Replace("&", "&amp;");
                str = str.Replace("<", "&lt;");
                str = str.Replace(">", "&gt;");
                input = input.Replace(tbr, "<code>" + str + "</code>");
                i += str.Length + 13;
            }
            else
            {
                string str = ReadToCharacter('`', i + 1, input);
                input = input.Replace("`" + str + "`", "<code>" + str + "</code>");
                i += str.Length + 13;
            }
        }
    }
    return input;
}
If I use single backticks around something, it wraps it in the <code> tags correctly.
In the while loop
while (input[i + str.Length + 2] != '`')
{
    str += ReadToCharacter('`', i + str.Length + 3, input);
}
you look at the wrong index, i + str.Length + 2 instead of i + str.Length + 3, and in turn you have to add the backtick in the body. It should probably be
while (input[i + str.Length + 3] != '`')
{
    str += '`' + ReadToCharacter('`', i + str.Length + 3, input);
}
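To see why the index matters, here is a minimal, self-contained sketch of just the double-backtick scan. ReadToCharacter is not shown in the question, so the version below is an assumption: it returns the text from the start index up to, but not including, the next occurrence of the character (or up to the end of the input if there is none). ReadDoubleBacktickBody is a hypothetical helper name, and the sketch assumes the input is well-formed, i.e. a closing `` exists.

```csharp
using System;

static class DoubleBacktickScan
{
    // Assumed behaviour of the question's ReadToCharacter (its source is not shown):
    // return the text from 'start' up to, but not including, the next 'c'.
    static string ReadToCharacter(char c, int start, string input)
    {
        int end = input.IndexOf(c, start);
        return end < 0 ? input.Substring(start) : input.Substring(start, end - start);
    }

    // Hypothetical helper: reads the body of a ``...`` span whose opening
    // delimiter starts at index i. Assumes a closing `` exists.
    public static string ReadDoubleBacktickBody(string input, int i)
    {
        string str = ReadToCharacter('`', i + 2, input);
        // input[i + str.Length + 2] is the backtick that stopped the read, so the
        // character to test is the one after it, at i + str.Length + 3.
        while (input[i + str.Length + 3] != '`')
        {
            // A lone backtick belongs to the body: keep it and read on.
            str += '`' + ReadToCharacter('`', i + str.Length + 3, input);
        }
        return str;
    }

    static void Main()
    {
        // prints "[ a ` b ]"
        Console.WriteLine("[" + ReadDoubleBacktickBody("`` a ` b ``", 0) + "]");
    }
}
```

With the original index, i + str.Length + 2, the condition always inspects the backtick that ended the previous read, so the loop body never runs and the body is cut off at the first inner backtick.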
But there are some more bugs in your code. The following line will cause an IndexOutOfRangeException if the first character of the input is a backtick, because input[i - 1] is then evaluated with i = 0.
if (input[i] == '`' && input[i - 1] != '\\')
And the following line will cause an IndexOutOfRangeException if the input contains an odd number of separated backticks and the last character of the input is a backtick, because input[i + 1] is then past the end of the string.
if (input[i + 1] == '`')
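Both accesses can be guarded with explicit bounds checks. The sketch below uses two hypothetical helpers, IsBacktickAt and IsUnescapedBacktickAt, which are not part of the original code; they return false instead of indexing outside the string. Like the original check, it only looks at the single preceding character, so an escaped backslash before a backtick is not treated specially.

```csharp
using System;

static class BacktickGuards
{
    // True if input[i] exists and is a backtick; never indexes out of range.
    public static bool IsBacktickAt(string input, int i)
    {
        return i >= 0 && i < input.Length && input[i] == '`';
    }

    // True if input[i] is a backtick that is not preceded by a backslash.
    // At i == 0 there is no preceding character, so the backtick counts as unescaped.
    public static bool IsUnescapedBacktickAt(string input, int i)
    {
        return IsBacktickAt(input, i) && (i == 0 || input[i - 1] != '\\');
    }

    static void Main()
    {
        Console.WriteLine(IsUnescapedBacktickAt("`", 0));   // True
        Console.WriteLine(IsBacktickAt("`", 1));            // False
        Console.WriteLine(IsUnescapedBacktickAt("\\`", 1)); // False
    }
}
```

In the loop, the two conditions would then read if (IsUnescapedBacktickAt(input, i)) and if (IsBacktickAt(input, i + 1)).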
You should probably refactor your code into smaller methods and not handle too many cases inside a single method; that is very prone to bugs. If you have not yet written unit tests for the code, I strongly suggest doing so. Parsers are not easy to test because of all the kinds of invalid input you have to be prepared for, so you may want to have a look at PEX, a tool that automatically generates test cases for your code by analyzing all branching points and trying to take every possible code path.
I quickly started PEX and ran it against the code. It found the IndexOutOfRangeException cases I thought of and some more. And of course PEX found the obvious NullReferenceException if the input is a null reference. Here are the inputs that PEX found to cause exceptions:
case1 = "`"
case2 = "\0`"
case3 = "\0``"
case4 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0````"
case5 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0````````````\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0`"
case6 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0````````````\0\0\0\0\0\0\0\0\0\0``<\0\0`````````````````````````````````````````````````````````````````````````````````````\0\0\0\0\0\0\0\0\0\0``<\0\0```````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````\0\0\0\0\0\0\0\0\0`\0```````````````"
My "fix" of your code changed the inputs that cause exceptions (and maybe also introduced new bugs). PEX caught the following in the modified code.
case7 = "\0```"
case8 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0`\0"
case9 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0````````````\0\0\0\0\0\0\0\0\0\0``<\0\0`````````````````````````````````````````````````````````````````````````````````````\0\0\0\0\0\0\0\0\0\0``\0`\0`\0``"
None of these three inputs caused exceptions in the original code, while cases 4 and 6 no longer cause exceptions in the modified code.
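One possible shape for the refactoring suggested above, as a sketch rather than a drop-in replacement. The names InlineCode, Parse, EscapeHtml and IsUnescapedBacktick are hypothetical. Two behaviours are assumptions about what is wanted: a delimiter without a matching close is emitted as literal text, and a null input throws ArgumentNullException. Instead of replacing inside the input while iterating over it, the sketch builds the result in a StringBuilder, so the loop index always refers to the unmodified input.

```csharp
using System;
using System.Text;

static class InlineCode
{
    // Same three replacements as the original code; '&' must be handled first.
    static string EscapeHtml(string s)
    {
        return s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // True if input[i] is a backtick that is not preceded by a backslash.
    static bool IsUnescapedBacktick(string input, int i)
    {
        return input[i] == '`' && (i == 0 || input[i - 1] != '\\');
    }

    public static string Parse(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        var output = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            if (!IsUnescapedBacktick(input, i))
            {
                output.Append(input[i]);
                i++;
                continue;
            }

            // The opening delimiter is "``" if a second backtick follows, else "`".
            string delimiter = (i + 1 < input.Length && input[i + 1] == '`') ? "``" : "`";
            int bodyStart = i + delimiter.Length;
            int close = input.IndexOf(delimiter, bodyStart, StringComparison.Ordinal);
            if (close < 0)
            {
                // Assumption: an unclosed delimiter is kept as literal text.
                output.Append(delimiter);
                i = bodyStart;
                continue;
            }

            string body = input.Substring(bodyStart, close - bodyStart);
            output.Append("<code>").Append(EscapeHtml(body)).Append("</code>");
            i = close + delimiter.Length;
        }
        return output.ToString();
    }

    static void Main()
    {
        // prints: use <code>a&lt;b</code> or <code> x ` y </code>
        Console.WriteLine(Parse("use `a<b` or `` x ` y ``"));

        // Short inputs from the PEX lists above; none of them should throw here.
        foreach (string s in new[] { "`", "\0`", "\0``", "\0```" })
            Parse(s);
    }
}
```

The short PEX inputs make convenient regression tests for a rewrite like this, since each of them crashed either the original or the modified code.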
Here is a little snippet tested in LINQPad to get you started:
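void Main()
{
    string test = "here is some code `public void Method( )` but ``this is not code``";
    // Match a run of non-backtick characters enclosed in single backticks.
    Regex r = new Regex( @"(`[^`]+`)" );
    MatchCollection matches = r.Matches( test );
    foreach( Match match in matches )
    {
        Console.Out.WriteLine( match.Value );
        // A match preceded by another backtick is part of a double-backtick span.
        // The Index > 0 check avoids reading test[-1] when the input starts with a backtick.
        if( match.Index > 0 && test[match.Index - 1] == '`' )
            Console.Out.WriteLine( "NOT CODE" );
        else
            Console.Out.WriteLine( "CODE" );
    }
}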
Output:
`public void Method( )`
CODE
`this is not code`
NOT CODE
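If double-backtick spans should count as code, as the question describes, the pattern can be extended with an alternation that tries the double-backtick form first. The following is only a sketch: the class and method names are hypothetical, HTML escaping of the body and backslash escapes are left out for brevity, and runs of three or more backticks are not given any special meaning.

```csharp
using System;
using System.Text.RegularExpressions;

static class InlineCodeRegex
{
    // Group 1: body of a ``...`` span (may contain lone backticks, matched lazily).
    // Group 2: body of a `...` span (no backticks allowed inside).
    static readonly Regex Span = new Regex(@"``(.+?)``|`([^`]+)`");

    public static string Mark(string input)
    {
        return Span.Replace(input, m =>
        {
            // Exactly one of the two groups takes part in any given match.
            string body = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            return "<code>" + body + "</code>";
        });
    }

    static void Main()
    {
        string test = "here is some code `public void Method( )` and ``a ` b``";
        // prints: here is some code <code>public void Method( )</code> and <code>a ` b</code>
        Console.WriteLine(Mark(test));
    }
}
```

Because Regex.Replace builds a new string from the matches, there is no manual index arithmetic and therefore no way to step outside the input.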