Text Parsing - My Parser Skipping commands

2023-01-01 13:58

I'm trying to parse text formatting. I want to mark inline code with backticks (`), much like SO does. The rule is that if you want to use a backtick inside an inline code element, you wrap the inline code in double backticks.

Like this:

`` mark inline code with backticks ( ` ) ``

My parser seems to skip over the double backticks completely for some reason. Here's the function that does the inline code parsing:

    private string ParseInlineCode(string input)
    {
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '`' && input[i - 1] != '\\')
            {
                if (input[i + 1] == '`')
                {
                    string str = ReadToCharacter('`', i + 2, input);
                    while (input[i + str.Length + 2] != '`')
                    {
                        str += ReadToCharacter('`', i + str.Length + 3, input);
                    }
                    string tbr = "``" + str + "``";
                    str = str.Replace("&", "&amp;");
                    str = str.Replace("<", "&lt;");
                    str = str.Replace(">", "&gt;");
                    input = input.Replace(tbr, "<code>" + str + "</code>");
                    i += str.Length + 13;
                }
                else
                {
                    string str = ReadToCharacter('`', i + 1, input);
                    input = input.Replace("`" + str + "`", "<code>" + str + "</code>");
                    i += str.Length + 13;
                }
            }
        }
        return input;
    }

If I use single backticks around something, it wraps it in the <code> tags correctly.

In the while-loop

while (input[i + str.Length + 2] != '`')
{
    str += ReadToCharacter('`', i + str.Length + 3, input);
}

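you look at the wrong index: i + str.Length + 2 instead of i + str.Length + 3. Assuming ReadToCharacter stops just before the next backtick, the character at i + str.Length + 2 is always that backtick, so the loop body never runs. Once you test the following character instead, you also have to add the skipped backtick back in the body. It should probably be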

while (input[i + str.Length + 3] != '`')
{
    str += '`' + ReadToCharacter('`', i + str.Length + 3, input);
}
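To make the index arithmetic concrete, here is a small sketch. ReadToCharacter is not shown in the question, so the version below is a hypothetical stand-in that returns the text from the start index up to, but not including, the next occurrence of the character; Probe and IndexDemo are illustrative names as well.

```csharp
using System;

public static class IndexDemo
{
    // Hypothetical stand-in for the question's ReadToCharacter (an assumption):
    // returns the text from start up to, but not including, the next c,
    // or the rest of the input if c does not occur again.
    public static string ReadToCharacter(char c, int start, string input)
    {
        int end = input.IndexOf(c, start);
        return end < 0 ? input.Substring(start) : input.Substring(start, end - start);
    }

    // Returns the two characters the discussion is about: the one at
    // i + str.Length + 2 and the one at i + str.Length + 3.
    public static string Probe(string input, int i)
    {
        string str = ReadToCharacter('`', i + 2, input);
        char stopper = input[i + str.Length + 2]; // the backtick that ended the read
        char decider = input[i + str.Length + 3]; // '`' only if this is the closing pair
        return stopper.ToString() + decider;
    }
}
```

For the input " ``a ` b``" with i = 1 (the first opening backtick), str is "a ", the character at i + str.Length + 2 is the inner backtick itself, and the character at i + str.Length + 3 is the space after it, which is what tells the loop to keep reading.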

But there are some more bugs in your code. The following line will cause an IndexOutOfRangeException if the first character of the input is a backtick.

 if (input[i] == '`' && input[i - 1] != '\\')

And the following line will cause an IndexOutOfRangeException if the input contains an odd number of separated backticks and the last character of the input is a backtick.

if (input[i + 1] == '`')

You should probably refactor your code into smaller methods rather than handling so many cases inside a single method, which is very prone to bugs. If you have not yet written unit tests for the code, I strongly suggest you do so. Parsers are not easy to test because of all the kinds of invalid input you have to be prepared for, so you may want to have a look at PEX, a tool that automatically generates test cases for your code by analyzing all branching points and trying to take every possible code path.
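As a rough illustration of what splitting the method up could look like, here is a minimal sketch. It is not a drop-in replacement for your code: the class and helper names are made up, it builds the output in a StringBuilder instead of calling Replace on the input, and it assumes that an unclosed delimiter should simply stay in the text as literal backticks.

```csharp
using System;
using System.Text;

public static class InlineCodeSketch
{
    // Converts `code` and ``code with ` inside`` spans to <code> elements.
    public static string Parse(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        var output = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            if (!IsUnescapedBacktick(input, i))
            {
                output.Append(input[i]);
                i++;
                continue;
            }

            string delimiter = DelimiterAt(input, i);
            int start = i + delimiter.Length;
            int end = input.IndexOf(delimiter, start, StringComparison.Ordinal);
            if (end < 0)
            {
                // Assumption: no closing delimiter means the backticks are plain text.
                output.Append(delimiter);
                i = start;
                continue;
            }

            output.Append("<code>")
                  .Append(EscapeHtml(input.Substring(start, end - start)))
                  .Append("</code>");
            i = end + delimiter.Length;
        }
        return output.ToString();
    }

    // Checks the previous character only when there is one.
    private static bool IsUnescapedBacktick(string s, int i) =>
        s[i] == '`' && (i == 0 || s[i - 1] != '\\');

    // Checks the next character only when there is one.
    private static string DelimiterAt(string s, int i) =>
        i + 1 < s.Length && s[i + 1] == '`' ? "``" : "`";

    private static string EscapeHtml(string s) =>
        s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
}
```

Each helper does one bounds-checked lookup, so inputs such as a lone backtick no longer index outside the string. The sketch does not treat an escaped backtick inside a span specially; the first matching delimiter closes the span.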

I quickly started PEX and ran it against the code. It found the IndexOutOfRangeExceptions I had in mind and some more. And of course PEX found the obvious NullReferenceException if the input is a null reference. Here are the inputs that PEX found to cause exceptions.

case1 = "`"

case2 = "\0`"

case3 = "\0``"

case4 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0````"

case5 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0````````````\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0`"

case6 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0````````````\0\0\0\0\0\0\0\0\0\0``<\0\0`````````````````````````````````````````````````````````````````````````````````````\0\0\0\0\0\0\0\0\0\0``<\0\0```````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````\0\0\0\0\0\0\0\0\0`\0```````````````"

My "fix" of your code changed the inputs that cause exceptions (and maybe also introduced new bugs). PEX caught the following in the modified code.

case7 = "\0```"

case8 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0`\0"

case9 = "\0`\0````````````\u0001``````````````\0\0\0\0\0\0\0\0\0\0\0```\0````````````\0\0\0\0\0\0\0\0\0\0``<\0\0`````````````````````````````````````````````````````````````````````````````````````\0\0\0\0\0\0\0\0\0\0``\0`\0`\0``"

None of these three inputs caused exceptions in the original code, while cases 4 and 6 no longer cause exceptions in the modified code.

Here is a little snippet, tested in LINQPad, to get you started:

void Main()
{
    string test = "here is some code `public void Method( )` but ``this is not code``";
    Regex r = new Regex( @"(`[^`]+`)" );

    MatchCollection matches = r.Matches( test );

    foreach( Match match in matches )
    {
        Console.Out.WriteLine( match.Value );
        // Guard against a match at index 0 before looking at the previous character.
        if( match.Index > 0 && test[match.Index - 1] == '`' )
            Console.Out.WriteLine( "NOT CODE" );
        else
            Console.Out.WriteLine( "CODE" );
    }
}

Output:

`public void Method( )`
CODE
`this is not code`
NOT CODE
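If double backticks should delimit code as well, as the rule in the question describes, one possible variation is to capture the opening delimiter and require the same delimiter to close the span. The sketch below is only one way to do it; the class and method names are made up, and it does not handle backslash-escaped backticks.

```csharp
using System.Text.RegularExpressions;

public static class BacktickRegexSketch
{
    // Group 1: an opening run of one or two backticks that is not part of a longer run.
    // Group 2: the span content, matched lazily.
    // \1: the same delimiter again, also not part of a longer run.
    private static readonly Regex CodeSpan = new Regex(
        @"(?<!`)(`{1,2})(?!`)(.+?)(?<!`)\1(?!`)");

    public static string Convert(string text) =>
        CodeSpan.Replace(text, m => "<code>" + Escape(m.Groups[2].Value) + "</code>");

    private static string Escape(string s) =>
        s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
}
```

With this pattern a single backtick inside a double-backtick span does not end the span, because the backreference needs two backticks in a row to match.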
