
Pascal-like string literal regular expression

I'm trying to match Pascal string literal input against the following pattern: @"^'([^']|(''))*'$", but it isn't working. What is wrong with the pattern?

public void Run()
{             
    using(StreamReader reader = new StreamReader(String.Empty))
    {
        var LineNumber = 0;
        var LineContent = String.Empty;

        while(null != (LineContent = reader.ReadLine()))
        {
            LineNumber++;

            String[] InputWords = new Regex(@"\(\*(?:\w|\d)*\*\)").Replace(LineContent.TrimStart(' '), @" ").Split(' ');

            foreach(String word in InputWords)
            {
                Scanner.Scan(word);
            }

        }
    }
}

I search each input line for Pascal comments and replace each one with a space. Then I split the line on spaces and match each resulting substring against the following patterns:

private void Initialize()
{
    MatchingTable = new Dictionary<TokenUnit.TokenType, Regex>();

    MatchingTable[TokenUnit.TokenType.Identifier] = new Regex
    (
        @"^[_a-zA-Z]\w*$",
        RegexOptions.Compiled | RegexOptions.Singleline
    );
    MatchingTable[TokenUnit.TokenType.NumberLiteral] = new Regex
    (
        @"(?:^\d+$)|(?:^\d+\.\d*$)|(?:^\d*\.\d+$)",
         RegexOptions.Compiled | RegexOptions.Singleline
    );
}
// ... Here it all comes together
public TokenUnit Scan(String input)
{                         
    foreach(KeyValuePair<TokenUnit.TokenType, Regex> node in this.MatchingTable)
    {
        if(node.Value.IsMatch(input))
        {
            return new TokenUnit
            {
                Type = node.Key
            };
        }
    }
    return new TokenUnit
    {
        Type = TokenUnit.TokenType.Unsupported
    };
}


The pattern appears to be correct, although it could be simplified:

^'(?:[^']+|'')*'$

Explanation:

^      # Match start of string
'      # Match the opening quote
(?:    # Match either...
 [^']+ # one or more characters except the quote character
 |     # or
 ''    # two quote characters (= escaped quote)
)*     # any number of times
'      # Then match the closing quote
$      # Match end of string
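The commented breakdown above can be used almost as-is in .NET by passing RegexOptions.IgnorePatternWhitespace, which makes the engine skip unescaped whitespace and # comments in the pattern. Here is a minimal sketch of that; the class name PascalStringPattern and the field name Verbose are only illustrative and are not part of the question's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PascalStringPattern
{
    // Verbose form of ^'(?:[^']+|'')*'$ (hypothetical helper for illustration).
    // With IgnorePatternWhitespace, layout whitespace and # comments are ignored.
    public static readonly Regex Verbose = new Regex(@"
        ^        # start of string
        '        # opening quote
        (?:      # either...
          [^']+  #   one or more characters except the quote character
          |      #   or
          ''     #   two quote characters, i.e. an escaped quote
        )*       # any number of times
        '        # closing quote
        $        # end of string
        ", RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        Console.WriteLine(Verbose.IsMatch("'It''s fine'"));   // True
        Console.WriteLine(Verbose.IsMatch("'It's broken'"));  // False
    }
}
```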

This regex will fail if the input you're checking it against contains anything besides a Pascal string (say, surrounding whitespace).

So if you want to use the regex to find Pascal strings within a larger text corpus, then you need to remove the ^ and $ anchors.
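To make the difference concrete, here is a small sketch comparing the two approaches. It assumes, as the question's code suggests, that the line is split on spaces before matching; in that case a literal containing a space is cut into pieces and no piece passes the anchored pattern, while the unanchored pattern run over the whole line still finds it. The class and method names (PascalStringFinder, FindLiterals, CountAnchoredMatches) are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PascalStringFinder
{
    private static readonly Regex Anchored = new Regex(@"^'(?:[^']+|'')*'$");
    private static readonly Regex Unanchored = new Regex(@"'(?:[^']+|'')*'");

    // Returns every Pascal string literal found anywhere in the text.
    public static List<string> FindLiterals(string text)
    {
        var found = new List<string>();
        foreach (Match m in Unanchored.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }

    // Counts how many space-separated pieces pass the anchored pattern.
    public static int CountAnchoredMatches(string text)
    {
        int count = 0;
        foreach (string piece in text.Split(' '))
        {
            if (Anchored.IsMatch(piece)) count++;
        }
        return count;
    }

    public static void Main()
    {
        string line = "s := 'hello world';";
        Console.WriteLine(CountAnchoredMatches(line));  // 0
        foreach (string literal in FindLiterals(line))
        {
            Console.WriteLine(literal);                 // 'hello world'
        }
    }
}
```

If the scanner has to keep working on pre-split words, the literals would need to be extracted before the split; that is a design decision for the question's code and not something the pattern alone can solve.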

And if you want to allow double quotes, too, then you need to augment the regex:

^(?:'(?:[^']+|'')*'|"(?:[^"]+|"")*")$

In C#:

foundMatch = Regex.IsMatch(subjectString, "^(?:'(?:[^']+|'')*'|\"(?:[^\"]+|\"\")*\")$");

This regex will match strings like

'This matches.'
'This too, even though it ''contains quotes''.'
"Mixed quotes aren't a problem."
''

It won't match strings like

'The quotes aren't balanced or escaped.'
There is something 'before or after' the quotes.
    "Even whitespace is a problem."
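A quick way to check the examples above yourself is to run them through Regex.IsMatch. The following is only a sketch; the class name QuoteExamples and the method IsQuotedString are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteExamples
{
    // Single- or double-quoted string, with doubled quotes as the escape.
    private static readonly Regex Quoted =
        new Regex("^(?:'(?:[^']+|'')*'|\"(?:[^\"]+|\"\")*\")$");

    public static bool IsQuotedString(string s) => Quoted.IsMatch(s);

    public static void Main()
    {
        string[] samples =
        {
            "'This matches.'",
            "'This too, even though it ''contains quotes''.'",
            "\"Mixed quotes aren't a problem.\"",
            "''",
            "'The quotes aren't balanced or escaped.'",
            "There is something 'before or after' the quotes.",
            "    \"Even whitespace is a problem.\""
        };

        // Prints True for the first four samples and False for the last three.
        foreach (string s in samples)
        {
            Console.WriteLine($"{IsQuotedString(s),-5} {s}");
        }
    }
}
```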
