Tokenizing a complex string (a Snort rule) with regular expressions (.NET)

2023-03-07 05:13

I need help from the Regex wizards out there. I am trying to write a simple parser that can tokenize the options list of a Snort rule (Snort, the IDS/IPS software). The problem is, I can't seem to find a workable formula that breaks apart the individual rule options based on their terminating semi-colon. The formulas that I have cooked up grab all options between the parentheses into a single capture group.

I am using the excellent RegExr tool at the GSkinner site with some of the below sample rule options from Emerging Threats (I parsed off the rule header -- that's easy to tokenize):

(msg:"ET DELETED Majestic-12 Spider Bot User-Agent (MJ12bot)"; flow:to_server,established; content:"|0d 0a|User-Agent\: MJ12bot|0d 0a|"; classtype:trojan-activity; reference:url,www.majestic12.co.uk/; reference:url,doc.emergingthreats.net/2003409; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Majestic-12; sid:2003409; rev:4;)
(msg:"ET DELETED Majestic-12 Spider Bot User-Agent Inbound (MJ12bot)"; flow:to_server,established; content:"|0d 0a|User-Agent\: MJ12bot"; classtype:trojan-activity; reference:url,www.majestic12.co.uk/; reference:url,doc.emergingthreats.net/2007762; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Majestic-12; sid:2007762; rev:4;)
(msg:"ET POLICY McAfee Update User Agent (McAfee AutoUpdate)"; flow:to_server,established; content:"User-Agent|3a| "; http_header; nocase; content:"McAfee AutoUpdate"; http_header; pcre:"/User-Agent\x3a[^\n]+McAfee AutoUpdate/i"; classtype:not-suspicious; reference:url,doc.emergingthreats.net/2003381; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_McAffee; sid:2003381; rev:6;)
(msg:"ET DELETED Metacafe.com family filter off"; flow:established,to_server; content:"POST"; http_method; content:"Host|3a| www.metacafe.com"; http_header; fast_pattern:6,16; content:"submit=Continue+-+I%27m+over+18"; classtype:policy-violation; reference:url,doc.emergingthreats.net/2006367; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Metacafe; sid:2006367; rev:7;)

And this is the formula:

([a-zA-Z0-9_:]+(?:[\w\s.,\-/=<>+!\[\]\(\)\{\}\"|\\;'?`~@#$%^&*])+;)

The problem is, it doesn't handle colons inside an option's value, so two of the rules above will not have their 'content' options properly parsed. Otherwise, on RegExr, each option is highlighted in blue, including the terminating semi-colon but NOT the space after it. If I fed this into .NET, I should be able to do a Regex.Split and break apart all the tokens correctly.

If I add the colon to the character list, then on RegExr, the entire set of rules will get tokenized as a single blob of text, which is not what I want. Further attempts to tweak the formula result in Adobe Flash crashing, indicating I'm hitting a bug in either Flash or RegExr.
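
A small sketch of why adding the colon collapses everything into one match. It uses Python's re module purely as a runnable stand-in (an assumption; nothing here relies on behaviour that differs from .NET's Regex), and the shortened sample string is hypothetical, assembled from the rules above. The value character class already contains ';', so the greedy '+' only stops where it meets a character outside the class. Without ':' in the class, the next option's colon acts as that stop; with ':' added, nothing stops it before the end of the text.

```python
import re

# Value character class from the formula above, without and with ':'.
CLASS_NO_COLON = r'''[\w\s.,\-/=<>+!\[\]\(\)\{\}\"|\\;'?`~@#$%^&*]'''
CLASS_WITH_COLON = r'''[:\w\s.,\-/=<>+!\[\]\(\)\{\}\"|\\;'?`~@#$%^&*]'''


def find_tokens(value_class, text):
    """Apply the original formula with the given value character class."""
    pattern = r'([a-zA-Z0-9_:]+(?:' + value_class + r')+;)'
    return re.findall(pattern, text)


# Shortened sample (hypothetical, assembled from the rules above).
sample = r'flow:to_server,established; content:"User-Agent\: MJ12bot"; sid:2003409;'

# Without ':' the content option is cut at the colon inside its value.
print(find_tokens(CLASS_NO_COLON, sample))
# With ':' the greedy class runs to the last ';', giving one big match.
print(find_tokens(CLASS_WITH_COLON, sample))
```

So the colon is doing double duty in the original formula: it is what keeps the greedy match from running into the next option, which is also why it cannot simply be allowed inside values.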

I've not ruled out writing my own string tokenizer, but I was hoping regex could save me from dealing with things like counting my open quotations, escaped characters, whitespace, etc.

Snort rule options typically come in the following format:

option:value;
option:"string value";
option:!"negated string value";
option:>num;
option:param1,param2,param3;
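
Once a single option has been isolated, splitting it into name and value is the easy part. A minimal sketch in Python (chosen only so it is easy to run; split_option is a hypothetical helper, not part of Snort or .NET), assuming the option name never contains a colon:

```python
def split_option(option):
    """Split one 'name:value;' option into (name, value).

    value is None for bare keywords such as 'nocase;'.
    """
    body = option.strip()
    if body.endswith(";"):
        body = body[:-1]                      # drop only the terminating semicolon
    name, sep, value = body.partition(":")    # split at the FIRST colon only
    return name.strip(), (value.strip() if sep else None)


for opt in ['option:value;', 'option:!"negated string value";',
            'option:>num;', 'option:param1,param2,param3;', 'nocase;']:
    print(split_option(opt))
```

Splitting at the first colon only means colons inside the value (as in the 'content' options above) are left alone.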

But several options tend to have more 'exotic' formats for their value, like byte_test. And everyone's favourite, 'pcre', which is basically an option for matching with Perl-compatible regexes. So any such tokenizer has to avoid getting confused if it runs into the 'pcre' keyword with a regex in it.

Thoughts?

Edit: This below is REALLY close:

([\w]+:?(?:[\x20]|)?(?:[\x00-\xff])*?;)

But, according to RegExr, it gets messed up by pcre syntax:

(msg:"ET WEB_SPECIFIC_APPS Horde 3.0.9-3.1.0 Help Viewer Remote PHP Exploit"; flow:established,to_server; content:"/services/help/"; nocase; http_uri; pcre:"/module=[^\;]*\;.*\"/UGi"; classtype:web-application-attack; reference:url,www.milw0rm.com/exploits/1660; reference:cve,2006-1491; reference:bugtraq,17292; reference:url,doc.emergingthreats.net/2002867; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/WEB_SPECIFIC_APPS/WEB_Horde; sid:2002867; rev:9; http_method;)

In the above, every single option is highlighted as a distinct grouping, except ]*\;.*\"/. I would think that \x00-\xff would get it all, but because I am using a lazy match, it stops at the first semicolon it reaches, including the escaped ones inside the pcre string. A greedy match gets everything, including all the spaces between options, which I do not want. So I need to somehow modify the regex to handle tokenizing pcre text.
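
The effect can be reproduced outside RegExr. The sketch below runs the same pattern through Python's re module (an assumption made for convenience; the constructs used behave the same in .NET) against a shortened excerpt of the Horde rule's options:

```python
import re

# The "REALLY close" pattern from the edit above, unchanged.
LAZY = re.compile(r'([\w]+:?(?:[\x20]|)?(?:[\x00-\xff])*?;)')

# Shortened excerpt of the Horde rule's options.
sample = r'nocase; pcre:"/module=[^\;]*\;.*\"/UGi"; sid:2002867;'

for token in LAZY.findall(sample):
    print(token)
# nocase;
# pcre:"/module=[^\;
# UGi";
# sid:2002867;
```

The lazy quantifier ends the pcre token at the first ';' regardless of the backslash in front of it, and the text between that point and 'UGi' cannot start a new token because none of it matches \w.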

Edit 2: This does the trick:

([\w]+:?(?:[\x20]|)?(?<!\\)\"?.*?(?<!\\)\"?;)

I had to play with a few example regex's that work with quoted strings. Finally realized that I am staring at negative look-behinds that avoid quotes that are escaped. This seems to solve any other escaped character, too, because escaped characters only appear inside unescaped quotes.
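
A quick check of that pattern, again sketched with Python's re module as a stand-in for .NET (an assumption; both support this fixed-width negative look-behind). Because the closing quote is optional, the second look-behind effectively guards the terminating ';' itself, so an escaped '\;' is skipped. The second sample is hypothetical and shows the limit of the approach: a semicolon inside quotes that is not backslash-escaped still ends the token early.

```python
import re

# The look-behind pattern from Edit 2, unchanged.
LOOKBEHIND = re.compile(r'([\w]+:?(?:[\x20]|)?(?<!\\)\"?.*?(?<!\\)\"?;)')

# Shortened excerpt of the Horde rule's options.
sample = r'nocase; pcre:"/module=[^\;]*\;.*\"/UGi"; sid:2002867;'
print(LOOKBEHIND.findall(sample))

# Hypothetical input with an UNESCAPED semicolon inside the quotes.
unescaped = 'msg:"a; b"; sid:1;'
print(LOOKBEHIND.findall(unescaped))
```

On the first sample the pcre option comes back as one token; on the second, the msg option is split in two, so the pattern relies on every semicolon inside a quoted value being escaped.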

No need for lookaround. Just carefully write the regex to precisely match what you need. This is made much clearer (and easier to maintain) by writing it in verbose free-spacing mode, although VB.NET syntax makes it awkward to do so:

' Requires: Imports System.Text.RegularExpressions
Dim RegexObj As New Regex(
    "# Match the options list of a Snort rule, enclosed in parentheses." & chr(10) & _
    "\(                              # Literal opening parenthesis." & chr(10) & _
    "(?:                             # Group for one or more options." & chr(10) & _
    "  \w+                           # Required option name." & chr(10) & _
    "  (?:                           # Group for optional option value." & chr(10) & _
    "    :                           # Option name and value are separated by :" & chr(10) & _
    "    (?:                         # Group for option value alternatives." & chr(10) & _
    "      ""                        # Either a double quoted string," & chr(10) & _
    "      [^""\\]*                  # {normal} Use ""Unrolling the Loop""." & chr(10) & _
    "      (?:                       # Begin {(special normal*)*} construct." & chr(10) & _
    "        \\.                     # {special} == escaped anything." & chr(10) & _
    "        [^""\\]*                # More {normal*} non-quote, non-escapes." & chr(10) & _
    "      )*                        # Finish {(special normal*)*} construct." & chr(10) & _
    "      ""                        # Closing quote." & chr(10) & _
    "    | '[^'\\]*(?:\\.[^'\\]*)*'  # or a single quoted string," & chr(10) & _
    "    | [^;]+                     # or one or more non semi-colons." & chr(10) & _
    "    )                           # End group for option value alternatives." & chr(10) & _
    "  )?                            # Option value is optional." & chr(10) & _
    "  ; \s*                         # Option ends with ;, optional ws." & chr(10) & _
    ")+                              # One or more options." & chr(10) & _
    "\)                              # Literal closing parenthesis.",
    RegexOptions.IgnorePatternWhitespace)
Dim MatchResults As Match = RegexObj.Match(SubjectString)
While MatchResults.Success
    ' matched text: MatchResults.Value
    ' match start: MatchResults.Index
    ' match length: MatchResults.Length
    MatchResults = MatchResults.NextMatch()
End While

This regex demonstrates use of Jeffrey Friedl's "Unrolling the Loop" efficiency technique for correctly matching quoted strings which may contain escaped characters. (See: MRE3, i.e. Mastering Regular Expressions, 3rd edition.)
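
The regex above matches the whole parenthesised list in one go. To get one token per option, the same value alternatives can be applied to a single option at a time. The following is a sketch only, written with Python's re module in verbose mode (an assumption made so it is easy to run; re.VERBOSE plays the role of RegexOptions.IgnorePatternWhitespace). The names OPTION and tokenize_options are hypothetical, the optional leading '!' for negated strings is an addition not present in the regex above, and the single-quoted alternative is left out for brevity.

```python
import re

OPTION = re.compile(r'''
    (\w+)                        # option name, e.g. msg, content, pcre
    (?:                          # optional ":value" part
      \s* : \s*
      (                          # value alternatives
        !? "                     # optionally negated double-quoted string
        [^"\\]*                  # {normal}: non-quote, non-backslash
        (?: \\. [^"\\]* )*       # {(special normal*)*}: escaped char, more normal
        "
      | [^;]+                    # or an unquoted value without semicolons
      )
    )?
    \s* ;                        # terminating semicolon
''', re.VERBOSE)


def tokenize_options(rule_options):
    """Return (name, value) pairs; value is None for bare keywords."""
    return [(m.group(1), m.group(2)) for m in OPTION.finditer(rule_options)]


# Shortened excerpt of the Horde rule's options.
sample = r'(content:"/services/help/"; nocase; pcre:"/module=[^\;]*\;.*\"/UGi"; sid:2002867; rev:9;)'
for name, value in tokenize_options(sample):
    print(name, value)
```

Because the quoted alternative consumes escaped characters as a unit, the '\;' and '\"' inside the pcre value never get a chance to end the token. In .NET, a pattern of the same shape could be passed to Regex.Matches to walk the options one by one.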

Oh yeah, one more thing... Icarus has found you!
