C# Regex Expression Issue
I am trying to parse the following line:
"\#" TEST #comment hello world
In my input, the #comment always comes at the end of the line. There may or may not be a comment, but if there is, it's always at the end of the line.
I used the following Regex to parse it:
(\#.+)?
I have the RegexOptions.RightToLeft option on. I expected it to pull #comment hello world. But instead it is pulling "#" TEST #comment hello world"
Why is my Regex expression not pulling the right thing and what is the valid Regex expression I need to make it pull correctly?
The important question is: How do you tell the difference between the # inside the quoted string and the # that starts the comment? Let's assume for simplicity that the last # on the line starts a comment.
In that case, what you want to match is
- one #
- an arbitrary sequence of text not containing #
- until the end of the line
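So let's put that into a regex: #[^#]*$. You don't need RightToLeft for it. You also don't need to escape # in C# regular expressions (unless RegexOptions.IgnorePatternWhitespace is set, in which case an unescaped # starts a comment inside the pattern).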
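As a quick check of that pattern, here is a minimal sketch. The class and method names (TrailingCommentDemo, ExtractComment) are made up for illustration, and it keeps the assumption stated above that the last # on the line starts the comment.

```csharp
using System;
using System.Text.RegularExpressions;

static class TrailingCommentDemo
{
    // Hypothetical helper: returns the text from the last '#' to the end
    // of the line, or an empty string when the line contains no '#'.
    public static string ExtractComment(string line)
    {
        // '#', then any run of non-'#' characters, anchored at the end.
        Match m = Regex.Match(line, "#[^#]*$");
        return m.Success ? m.Value : "";
    }

    static void Main()
    {
        // The sample line from the question: "\#" TEST #comment hello world
        Console.WriteLine(ExtractComment("\"\\#\" TEST #comment hello world"));
    }
}
```

Note that a line with no comment but with a # inside the quotes (such as "\#" TEST) would still produce a match starting at that #, which is exactly the limitation of the simplifying assumption.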
Of course, if you provide information on how to see the difference between a "valid" # and a "comment-starting" #, a more elegant solution could be found that allows for # within comments.
I think you'll find too many edge cases when trying to pull this off with regular expressions. Dealing with the quotes is what really complicates things, not to mention escape characters.
A procedural solution is not complicated, and will be faster and easier to modify as needs dictate. Note that I don't know what the escape characters should be in your example, but you could certainly add that to the algorithm...
// Requires: using System; using System.Text;
string CodeSnippet = Resource1.CodeSnippet;
StringBuilder CleanCodeSnippet = new StringBuilder();
bool InsideQuotes = false;
bool InsideComment = false;

Console.WriteLine("BEFORE");
Console.WriteLine(CodeSnippet);
Console.WriteLine("");

for (int i = 0; i < CodeSnippet.Length; i++)
{
    switch (CodeSnippet[i])
    {
        case '"':
            // A quote toggles string state, unless it sits inside a comment.
            if (!InsideComment) InsideQuotes = !InsideQuotes;
            break;
        case '#':
            // A '#' outside a quoted string starts a comment.
            if (!InsideQuotes) InsideComment = true;
            break;
        case '\n':
            // A comment ends at the end of its line.
            InsideComment = false;
            break;
    }

    // Copy every character that is not part of a comment.
    if (!InsideComment)
    {
        CleanCodeSnippet.Append(CodeSnippet[i]);
    }
}

Console.WriteLine("AFTER");
Console.WriteLine(CleanCodeSnippet.ToString());
Console.WriteLine("");
This example strips the comments away from the CodeSnippet. I assumed that's what you were after.
Here's the output:
BEFORE
"\#" TEST #comment hello world
"ab" TEST #comment hello world
"ab" TEST #comment "hello world
"ab" + "ca" + TEST #comment
"\#" TEST
"ab" TEST
AFTER
"\#" TEST
"ab" TEST
"ab" TEST
"ab" + "ca" + TEST
"\#" TEST
"ab" TEST
As I said, you'll probably need to add escape characters to the algorithm. But this is a good starting point.
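One way the escape handling might look is sketched below. It rests on an assumption the question does not confirm: that a backslash inside a quoted string escapes the next character (so \" does not close the string). The names CommentStripper and StripComments are invented for this sketch.

```csharp
using System;
using System.Text;

static class CommentStripper
{
    // Assumption (not confirmed by the question): inside quotes, a
    // backslash makes the following character literal.
    public static string StripComments(string source)
    {
        var result = new StringBuilder();
        bool insideQuotes = false;
        bool insideComment = false;
        bool escaped = false;

        foreach (char c in source)
        {
            if (c == '\n')
            {
                // A newline ends any comment and is always kept.
                insideComment = false;
                escaped = false;
                result.Append(c);
                continue;
            }
            if (insideComment) continue; // Drop comment text.

            if (escaped)
            {
                escaped = false; // Escaped character: copy it literally.
            }
            else if (c == '\\' && insideQuotes)
            {
                escaped = true; // The next character is escaped.
            }
            else if (c == '"')
            {
                insideQuotes = !insideQuotes;
            }
            else if (c == '#' && !insideQuotes)
            {
                insideComment = true; // Comment starts; drop the '#'.
                continue;
            }
            result.Append(c);
        }
        return result.ToString();
    }

    static void Main()
    {
        Console.WriteLine(StripComments("\"a\\\"#b\" TEST #comment hello world"));
    }
}
```

Like the loop above, this keeps the whitespace that precedes the comment, so a TrimEnd on each line may still be wanted.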
The + operator tries to match as many times as it can. To match as few times as possible, use its lazy equivalent, +?. With RightToLeft still on, that gives:
(#.+?)
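To see the difference between the two quantifiers, the sketch below runs both patterns with RegexOptions.RightToLeft against the line from the question. The helper name MatchRightToLeft is only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns the first match found when scanning
    // from the end of the input towards the start.
    public static string MatchRightToLeft(string pattern, string input)
    {
        return Regex.Match(input, pattern, RegexOptions.RightToLeft).Value;
    }

    static void Main()
    {
        // The sample line from the question: "\#" TEST #comment hello world
        string line = "\"\\#\" TEST #comment hello world";

        // Greedy: .+ extends leftwards as far as it can, so the match
        // starts at the first '#' on the line.
        Console.WriteLine(MatchRightToLeft("#.+", line));

        // Lazy: .+? stops at the first '#' it meets coming from the right.
        Console.WriteLine(MatchRightToLeft("#.+?", line));
    }
}
```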
Of course, this would give trouble with comments that contain #:
"\#" TEST #comment #hello #world
Use " #.+". I left the \ out of my test because \# is not a recognized escape sequence in a C# string literal. I left out the (, ) and ? because they were not needed.
Regex regex = new Regex(" #.+");
Console.WriteLine(regex.Match("#\" TEST #comment hello world"));
For the test string you've given, this regex pulls the comment correctly (with right to left option): /((?: #).+)$/
Disclaimer:
- Also pulls the whitespace just before the '#', so you may need to do a trim.
- The comment cannot contain the sequence ' #'.
This will match "#" and everything after it, which is the expected behavior :)
var reg = new Regex("#(.)*");
Hope this helps
Right, I've tested this one and it seems to do the necessary.
\#.+(\#.+)$
Specifically, it skips past the first #, then captures everything from the second # to the end of the line, returning
#comment hello world
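For completeness, a short sketch of reading that capture group in C#. The wrapper CaptureComment is a made-up name; the pattern is the one given above, and the comment text lives in Groups[1] rather than in the overall match.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureGroupDemo
{
    // Hypothetical wrapper: returns capture group 1 of the pattern above,
    // or an empty string when the line has fewer than two '#' characters.
    public static string CaptureComment(string line)
    {
        Match m = Regex.Match(line, @"\#.+(\#.+)$");
        return m.Success ? m.Groups[1].Value : "";
    }

    static void Main()
    {
        // The sample line from the question: "\#" TEST #comment hello world
        Console.WriteLine(CaptureComment("\"\\#\" TEST #comment hello world"));
    }
}
```

Because the pattern requires two # characters, it relies on the quoted # always being present; a line whose only # is the comment marker would not match.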