
Regex syntax highlighting question follow-up

Hey all, this is a follow-up to this question: Regex syntax highlighting question

I’m not sure what the procedure is when a question is spawned off an answer to a previous question, so if this is the wrong way to go about it let me know.

Basically, I was unclear in my previous question. I have been messing around in http://www.rubular.com/ trying to get my RegEx to work, but to no avail. The problem lies in the messages I'm trying to parse, which are pretty irregular and have lots of messy nested quotes. Here is an example message, which is pretty much the worst-case scenario:

2011/03/04 10:27:17   [STUFF] subject=STUFF, message={ANNOYINGFIELD1="STUFF HEADER=(STUFF,STUFF,STUFF) FIELD=STUFF FIELD=0 FIELD= FIELD=84HDH.1 FIELD=9.6 FIELD="more stuff here" FIELD=- FIELD=NO FIELD="-" ANNOYINGFIELD2="A WHOLE BUNCH OF STUFF""}

As you can see, the confusing parts (for me, at least) are ANNOYINGFIELD1, whose quote encompasses the whole rest of the message (and I don't want to color it, because the things inside need to be colored); HEADER, which throws an awesome parenthesis curveball; and ANNOYINGFIELD2, which is similar to the first, except that I actually do want it to be colored (that is, fields with quoted strings INSIDE ANNOYINGFIELD1 should be colored). To further clarify, I want the end result to be something like this (I don't have to stick to this, because I don't know what RegEx is capable of, but something close).

(Bold will take the place of color 1, and Italics color 2)

2011/03/04 10:27:17 [STUFF] subject=STUFF, message={ANNOYINGFIELD1="STUFF HEADER=(STUFF,STUFF,STUFF) FIELD=STUFF FIELD=0 FIELD= FIELD=84HDH.1 FIELD=9.6 FIELD="inside needs to be italics, editor giving me problems" FIELD=- FIELD=NO FIELD="-" ANNOYINGFIELD2="more italics""}

Since this was confusing just to write, please let me know if I need to clarify anything.

EDIT

I've been modifying some of the suggestions from my first attempt at asking the question, and this is REALLY close: ((\S+)=((?:\x22[^\x22]+\x22|[^>\s]+)))\s The only things it's messing up on are fields with no value (i.e. FIELD1= FIELD2= are not receiving color) and an occasional edge case where the last field has quotes, so it looks like this: (FIELD1="stuff stuff stuff""}). Any thoughts?


You're looking at a context-sensitive grammar there. In other words, you're looking to match certain patterns differently depending on what is around them. Traditionally regular expressions weren't designed to handle that. But .NET has balanced group definitions which make this possible (although difficult). Ryan Byington describes the technique here. To test this, you'll want to use an app that actually uses the .NET implementation such as Regex Hero.


However, you may find that even with the full power of .NET regular expressions, you'll run into edge cases that are seemingly impossible to solve. Context-sensitive grammars are just complicated like that.

That's why I actually wrote my own parser in the form of an embedded pushdown automaton to solve this problem. The idea is simpler than it sounds...

  • Parse the string from left to right one character at a time.
  • When you encounter a ", push it onto a List<string>. When you encounter a second ", pop it off the list.

It's like working with a stack. When the top of the stack has a " on it, then you know that you're inside a pair of quotes. As you're parsing from left to right, you're constantly checking the stack to establish that context.
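The stack idea above can be sketched roughly as follows. This is only an illustration in Python, not the parser described in this answer. The function name `quoted_strings` and the open/close rule are assumptions: a `"` directly after `=` is treated as an opening quote, and any other `"` closes the innermost open one. That rule happens to fit the sample message, but it would break on a value that itself ends in `=`.

```python
def quoted_strings(message, min_depth=2):
    """Return the quoted strings nested at least `min_depth` quotes deep.

    Assumption (not from the original answer): a '"' that directly
    follows '=' opens a string; any other '"' closes the innermost one.
    """
    stack = []   # start index of every quote that is currently open
    found = []
    prev = ""
    for i, ch in enumerate(message):
        if ch == '"':
            if prev == "=":
                stack.append(i)          # opening quote: push
            elif stack:
                start = stack.pop()      # closing quote: pop
                # Depth of the string just closed is len(stack) + 1.
                if len(stack) + 1 >= min_depth:
                    found.append(message[start + 1:i])
        prev = ch
    return found


MESSAGE = (
    '2011/03/04 10:27:17   [STUFF] subject=STUFF, '
    'message={ANNOYINGFIELD1="STUFF HEADER=(STUFF,STUFF,STUFF) FIELD=STUFF '
    'FIELD=0 FIELD= FIELD=84HDH.1 FIELD=9.6 FIELD="more stuff here" '
    'FIELD=- FIELD=NO FIELD="-" ANNOYINGFIELD2="A WHOLE BUNCH OF STUFF""}'
)

if __name__ == "__main__":
    # Prints ['more stuff here', '-', 'A WHOLE BUNCH OF STUFF']
    print(quoted_strings(MESSAGE))
```

With `min_depth=2` the outer ANNOYINGFIELD1 string is skipped and only the strings nested inside it are reported, which is the set the question wants colored.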


Figured it out! I just needed a | at the end to grab the no-value cases. It still doesn't handle that one edge case, but this is a good enough compromise for me!

Final RegEx: ((\S+)=((?:\x22[^\x22]+\x22|[^>\s]+)|))\s
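For anyone who wants to try the pattern outside Rubular, here is a small sketch using Python's `re` module. The helper name `fields` and the shortened sample string are made up for illustration, and other regex engines may differ in small details.

```python
import re

# The final regex from this post, unchanged.
FIELD_RE = re.compile(r'((\S+)=((?:\x22[^\x22]+\x22|[^>\s]+)|))\s')


def fields(text):
    """Return (name, value) pairs; value is '' for fields with no value."""
    return [(m.group(2), m.group(3)) for m in FIELD_RE.finditer(text)]


# Shortened, hypothetical sample covering the plain, empty and quoted cases.
SAMPLE = 'FIELD=STUFF FIELD= FIELD="more stuff here" FIELD=- '

if __name__ == "__main__":
    for name, value in fields(SAMPLE):
        print(name, repr(value))
```

The empty alternative added by the trailing | is what lets `FIELD= ` match with an empty value. Because the pattern ends in \s, a field at the very end of the input with no whitespace after it is not matched.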


Just given the general rules, a one-off solution could look like the one below. The regexes are written in Perl, but the syntax is the same in .NET.

You have a nested situation, but your conditions may not be of the recursive, balanced kind. If it doesn't work, nothing is lost; simply don't use it.

Note that the tags are just placeholders. After the final regex, use one more substitution to swap in the color control codes you need.

use strict;
use warnings;

my $original_str = '2011/03/04 10:27:17   [STUFF] subject=STUFF, message={ANNOYINGFIELD1="STUFF HEADER=(STUFF,STUFF,STUFF) FIELD=STUFF FIELD=0 FIELD= FIELD=84HDH.1 FIELD=9.6 FIELD="9.6 CMP(ILD Oxide_ACL)" FIELD=- FIELD=NO FIELD="-" ANNOYINGFIELD2="A WHOLE BUNCH OF STUFF""}  ';

my $str = '
2011/03/04 10:27:17   [STUFF] 

subject=STUFF,
message=
  {
     ANNOYINGFIELD1=
     "
        STUFF 
        HEADER=(STUFF,STUFF,STUFF)
        FIELD=STUFF
        FIELD=0
        FIELD=
        FIELD=84HDH.1
        FIELD=9.6
        FIELD=
        "
           9.6 CMP(ILD Oxide_ACL)
        "
        FIELD=-
        FIELD=NO
        FIELD="-"
        ANNOYINGFIELD2=
        "
             A WHOLE BUNCH OF STUFF
        "
     "
  }
';

# Run the same five substitution passes over any message string.
sub colorize {
    my ($s) = @_;
    # Pass 1: wrap every field name (the word before '=') in <c1>.
    $s =~ s/(\w+)(\s*=)/<c1>$1<\/c1>$2/g;
    # Pass 2: wrap quoted text that contains no '"' or '=' in <c2>.
    $s =~ s/"(\s*)((?:(?!["=]|\s*").)+)(\s*)"/"$1<c2>$2<\/c2>$3"/gs;
    # Pass 3: wrap bare text after a quote, up to the next quote, '=', brace, or tag.
    $s =~ s/"(\s*)((?:(?!["={}]|\s*<c).)+)(\s*)/"$1<c2>$2<\/c2>$3/sg;
    # Pass 4: wrap unquoted values that follow '='.
    $s =~ s/(=\s*)((?:(?!["={]|\s*<c).)+)(?!\s*\w+\s*=)/$1<c2>$2<\/c2>/sg;
    # Pass 5: drop <c2> wrappers that enclose only whitespace.
    $s =~ s/<c2>(\s*)<\/c2>/$1/g;
    return $s;
}

print colorize($str), "\n";
print colorize($original_str), "\n";

__END__

output:

2011/03/04 10:27:17   [STUFF]

<c1>subject</c1>=<c2>STUFF,</c2>
<c1>message</c1>=
  {
     <c1>ANNOYINGFIELD1</c1>=
     "
        <c2>STUFF</c2>
        <c1>HEADER</c1>=<c2>(STUFF,STUFF,STUFF)</c2>
        <c1>FIELD</c1>=<c2>STUFF</c2>
        <c1>FIELD</c1>=<c2>0</c2>
        <c1>FIELD</c1>=
        <c1>FIELD</c1>=<c2>84HDH.1</c2>
        <c1>FIELD</c1>=<c2>9.6</c2>
        <c1>FIELD</c1>=
        "
           <c2>9.6 CMP(ILD Oxide_ACL)</c2>
        "
        <c1>FIELD</c1>=<c2>-</c2>
        <c1>FIELD</c1>=<c2>NO</c2>
        <c1>FIELD</c1>="<c2>-</c2>"
        <c1>ANNOYINGFIELD2</c1>=
        "
             <c2>A WHOLE BUNCH OF STUFF</c2>
        "
     "
  }

2011/03/04 10:27:17   [STUFF] <c1>subject</c1>=<c2>STUFF,</c2> <c1>message</c1>={<c1>ANNOYINGFIELD1</c1>="<c2>STUFF</c2> <c1>HEADER</c1>=<c2>(STUFF,STUFF,STUFF)</c2> <c1>FIELD</c1>=<c2>STUFF</c2> <c1>FIELD</c1>=<c2>0</c2> <c1>FIELD</c1>= <c1>FIELD</c1>=<c2>84HDH.1</c2> <c1>FIELD</c1>=<c2>9.6</c2> <c1>FIELD</c1>="<c2>9.6 CMP(ILD Oxide_ACL)</c2>" <c1>FIELD</c1>=<c2>-</c2> <c1>FIELD</c1>=<c2>NO</c2> <c1>FIELD</c1>="<c2>-</c2>" <c1>ANNOYINGFIELD2</c1>="<c2>A WHOLE BUNCH OF STUFF</c2>""}  
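
The last step mentioned in this answer, swapping the placeholder tags for real color codes, might look something like the sketch below. It is written in Python rather than Perl, and the choice of ANSI escape sequences (bold for c1, italic for c2, mirroring the bold/italic convention in the question) is an assumption; substitute whatever control codes your viewer expects. The function name `tags_to_ansi` is made up.

```python
import re

# Assumed mapping: ANSI SGR bold for color 1, italic for color 2.
ANSI = {"c1": "\x1b[1m", "c2": "\x1b[3m"}
RESET = "\x1b[0m"


def tags_to_ansi(text):
    """Swap <c1>/<c2> placeholder tags for ANSI escape sequences."""
    def repl(match):
        closing, name = match.group(1), match.group(2)
        # A closing tag resets all attributes; an opening tag turns one on.
        return RESET if closing else ANSI[name]
    return re.sub(r"<(/?)(c[12])>", repl, text)


if __name__ == "__main__":
    print(tags_to_ansi('<c1>FIELD</c1>="<c2>more stuff here</c2>"'))
```

Since the tagged output never nests one tag inside another, a plain reset on every closing tag is enough here.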