Regex help: My regex pattern will match invalid Dictionary
I hope you can help me out. I'm using C# with .NET 4.0.
I want to validate a file structure like this:
const string dataFileScr = @"
Start 0
{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple */
}
PZ 11
{
IA_return()
}
GDC 7
{
Message = 6
Message = 7
Message = 8
Message = 8
RepeatCount = 2
ErrorMessage = 10
ErrorMessage = 11
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}
";
So far I managed to build this regex pattern
const string patternFileScr = @"
^
((?:\[|\s)*
(?<Section>[^\]\r\n]*)
(?:\])*
(?:[\r\n]{0,}|\Z))
(
(?:\{) ### !! improve for .ini file, dont take {
(?:[\r\n]{0,}|\Z)
( # Begin capture groups (Key Value Pairs)
(?!\}|\[) # Stop capture groups if a } is found; new section
(?:\s)* # Line with space
(?<Key>[^=]*?) # Any text before the =, matched few as possible
(?:[\s]*=[\s]*) # Get the = now
(?<Value>[^\r\n]*) # Get everything that is not a line break
(?:[\r\n]{0,})
)* # End Capture groups
(?:[\r\n]{0,})
(?:\})?
(?:[\r\n\s]{0,}|\Z)
)*
";
and this C# code:
Dictionary<string, Dictionary<string, string>> DictDataFileScr
    = (from Match m in Regex.Matches(dataFileScr,
                                     patternFileScr,
                                     RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline)
       select new
       {
           Section = m.Groups["Section"].Value,
           // Pair the n-th Key capture with the n-th Value capture.
           kvps = (from cpKey in m.Groups["Key"].Captures.Cast<Capture>().Select((a, i) => new { a.Value, i })
                   join cpValue in m.Groups["Value"].Captures.Cast<Capture>().Select((b, i) => new { b.Value, i }) on cpKey.i equals cpValue.i
                   select new KeyValuePair<string, string>(cpKey.Value, cpValue.Value)).OrderBy(_ => _.Key)
                  .ToDictionary(kvp => kvp.Key, kvp => kvp.Value)
       }).ToDictionary(itm => itm.Section, itm => itm.kvps);
It works for
const string dataFileScr = @"
Start 0
{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple */
}
GDC 7
{
Message = 6
RepeatCount = 2
ErrorMessage = 10
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}
";
in other words
Section1
{
key1=value1
key2=value2
}
Section2
{
key1=value1
key2=value2
}
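but it does not work in three cases.
1. Repeated keys in one section. I would like the values to be joined, like this:
DictDataFileScr["GDC 7"]["Message"] = "6|7|8|8"
DictDataFileScr["GDC 7"]["ErrorMessage"] = "10|11"
2. It does not work for an .ini file like
[Section1]
key1 = value1
key2 = value2
[Section2]
key1 = value1
key2 = value2
...
3. It does not work for a section that contains a function call, like
PZ 11
{
IA_return()
}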
Here is a complete rework of the regex in C#.
Assumptions (tell me if any of them is wrong):
- An INI file section can only have key/value pair lines in its body
- In a non-INI file section, function calls can't have any parameters
Regex flags:
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.Singleline
Input test:
const string dataFileScr = @"
Start 0
{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple */
}
PZ 11
{
IA_return()
}
GDC 7
{
Message = 6
Message = 7
Message = 8
Message = 8
RepeatCount = 2
ErrorMessage = 10
ErrorMessage = 11
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}
[Section1]
key1 = value1
key2 = value2
[Section2]
key1 = value1
key2 = value2
";
Reworked regex:
const string patternFileScr = @"
(?<Section> (?# Start of a non ini file section)
(?<SectionName>[\w ]+)\s* (?# Capture section name)
{ (?# Match but don't capture beginning of section)
(?<SectionBody> (?# Capture section body. Section body can be empty)
(?<SectionLine>\s* (?# Capture zero or more line(s) in the section body)
(?: (?# A line can be either a key/value pair, a comment or a function call)
(?<KeyValuePair>(?<Key>[\w\[\]]+)\s*=\s*(?<Value>[\w-]*)) (?# Capture key/value pair. Key and value are sub-captured separately)
|
(?<Comment>/\*.+?\*/) (?# Capture comment)
|
(?<FunctionCall>[\w]+\(\)) (?# Capture function call. A function can't have parameters though)
)\s* (?# Match but don't capture white characters)
)* (?# Zero or more line(s), previously mentioned in comments)
)
} (?# Match but don't capture end of section)
)
|
(?<Section> (?# Start of an ini file section)
\[(?<SectionName>[\w ]+)\] (?# Capture section name)
(?<SectionBody> (?# Capture section body. Section body can be empty)
(?<SectionLine> (?# Capture zero or more line(s) in the section body. Only key/value pair allowed.)
\s*(?<KeyValuePair>(?<Key>[\w\[\]]+)\s*=\s*(?<Value>[\w-]+))\s* (?# Capture key/value pair. Key and value are sub-captured separately)
)* (?# Zero or more line(s), previously mentioned in comments)
)
)
";
Discussion: The regex is built to match either non-INI file sections (1) or INI file sections (2).
(1) Non-INI file sections: These sections consist of a section name followed by a body enclosed in { and }. The section name can contain letters, digits or spaces. The section body consists of zero or more lines. A line can be a key/value pair (key = value), a comment (/* Here is a comment */) or a function call with no parameters (my_function()).
(2) INI file sections: These sections consist of a section name enclosed in [ and ], followed by zero or more key/value pairs. Each pair is on one line.
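To show how the two branches described above can feed the dictionary from the question, here is a minimal sketch. It uses a condensed variant of the pattern, not the exact regex above, and the names SectionParser, Parse and the Name group are made up for this example. It assumes one key/value pair per line and joins repeated keys with "|", as the question asks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class SectionParser
{
    // Condensed variant of the reworked pattern (an assumption, not the original).
    // Branch 1: Name { key/value pairs, comments, parameterless calls }
    // Branch 2: [Name] followed by key/value pairs only
    const string Pattern = @"
        (?<Name>[\w ]+) \s* \{
            (?: \s* (?: (?<Key>[\w\[\]]+) \s*=\s* (?<Value>[\w-]*)
                      | /\* .*? \*/
                      | \w+ \(\) ) )*
        \s* \}
      |
        \[ (?<Name>[\w ]+) \]
            (?: \s* (?<Key>[\w\[\]]+) \s*=\s* (?<Value>[\w-]+) )*";

    public static Dictionary<string, Dictionary<string, string>> Parse(string text)
    {
        var result = new Dictionary<string, Dictionary<string, string>>();
        var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

        foreach (Match m in Regex.Matches(text, Pattern, options))
        {
            // .NET keeps every capture of a repeated group, so the n-th Key
            // capture lines up with the n-th Value capture.
            var keys = m.Groups["Key"].Captures.Cast<Capture>().Select(c => c.Value);
            var values = m.Groups["Value"].Captures.Cast<Capture>().Select(c => c.Value);

            // Group by key so repeated keys are merged instead of throwing
            // in ToDictionary; a later section with the same name overwrites.
            result[m.Groups["Name"].Value.Trim()] = keys
                .Zip(values, (k, v) => new { k, v })
                .GroupBy(p => p.k, p => p.v)
                .ToDictionary(g => g.Key, g => string.Join("|", g.ToArray()));
        }
        return result;
    }
}
```

With the input test above, SectionParser.Parse(dataFileScr)["GDC 7"]["Message"] should come out as the joined string the question asks for, and a section that contains only a function call should map to an empty dictionary. Grouping before ToDictionary is the design choice that matters here: the original LINQ query fails on repeated keys because ToDictionary rejects duplicates.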
Do yourself and your sanity a favor and learn how to use GPLex and GPPG. They are the closest thing that C# has to Lex and Yacc (or Flex and Bison, if you prefer) which are the proper tools for this job.
Regular expressions are great tools for performing robust string matching, but when you want to match structures of strings, you need a "grammar". This is what a parser is for. GPLex takes a bunch of regular expressions and generates a super-fast lexer. GPPG takes the grammar you write and generates a super-fast parser.
Trust me, learn how to use these tools ... or any other tools like them. You'll be glad you did.
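To give a feel for parsing the structure instead of matching it, here is a small hand-written sketch. It is not GPLex or GPPG output and does not use either tool; it is a plain line-by-line state machine. The names LineParser and Parse are made up, and it assumes one item per line, single-line comments and no nested braces.

```csharp
using System;
using System.Collections.Generic;

static class LineParser
{
    // Hand-rolled sketch, not generated by GPLex/GPPG.
    // Returns each section with its key/value pairs in file order (repeats kept).
    public static Dictionary<string, List<KeyValuePair<string, string>>> Parse(string text)
    {
        var sections = new Dictionary<string, List<KeyValuePair<string, string>>>();
        List<KeyValuePair<string, string>> current = null; // body of the open section
        string pendingName = null; // last bare line; becomes a name if '{' follows
        bool inBraces = false;

        foreach (var raw in text.Split('\n'))
        {
            var line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("/*")) continue; // blank or comment

            if (line == "{")
            {
                if (pendingName == null) throw new FormatException("'{' without a section name");
                current = sections[pendingName] = new List<KeyValuePair<string, string>>();
                pendingName = null;
                inBraces = true;
            }
            else if (line == "}")
            {
                if (!inBraces) throw new FormatException("unmatched '}'");
                inBraces = false;
                current = null;
            }
            else if (!inBraces && line.StartsWith("[") && line.EndsWith("]"))
            {
                // INI style header: the section stays open until the next header or name.
                var name = line.Substring(1, line.Length - 2).Trim();
                current = sections[name] = new List<KeyValuePair<string, string>>();
            }
            else
            {
                int eq = line.IndexOf('=');
                if (eq > 0)
                {
                    if (current == null) throw new FormatException("key/value outside a section: " + line);
                    current.Add(new KeyValuePair<string, string>(
                        line.Substring(0, eq).Trim(), line.Substring(eq + 1).Trim()));
                }
                else if (!inBraces)
                {
                    pendingName = line; // candidate name for a braces section
                    current = null;
                }
                // Inside braces, any other line (e.g. a function call) is skipped.
            }
        }
        if (inBraces) throw new FormatException("missing '}'");
        return sections;
    }
}
```

Unlike a regex that silently skips what it cannot match, this sketch throws a FormatException on a missing or unmatched brace, which is closer to the validation the question asks about. A generated lexer and parser would give the same kind of feedback with a proper grammar behind it.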
Regarding the .ini file case: it won't work because your regular expression requires a { after [Section]. Your regex will only match something like this:
[Section] { key = value }
Here is a sample in Perl. Perl doesn't have named capture arrays, probably because of backtracking.
Maybe you can pick something out of the regex though. This assumes there is no nesting of {} brackets.
Edit: Never content to leave well enough alone, a revised version is below.
use strict;
use warnings;
my $str = '
Start 0
{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple
*/
}
asdfasdf
PZ 11
{
IA_return()
}
[ section 5 ]
this = that
[ section 6 ]
this_ = _that{hello() hhh = bbb}
TOC{}
GDC 7
{
Message = 6
Message = 7
Message = 8
Message = 8
RepeatCount = 2
ErrorMessage = 10
ErrorMessage = 11
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}
';
use re 'eval';
my $rx = qr/
\s*
( \[ [^\S\n]* )? # Grp 1 optional ini section delimiter '['
(?<Section> \w+ (?:[^\S\n]+ \w+)* ) # Grp 2 'Section'
(?(1) [^\S\n]* \] |) # Condition, if we matched '[' then look for ']'
\s*
(?<Body> # Grp 3 'Body' (for display only)
(?(1)| \{ ) # Condition, if we're not a ini section then look for '{'
(?{ print "Section: '$+{Section}'\n" }) # SECTION debug print, remove in production
(?: # _grp_
\s* # whitespace
(?: # _grp_
\/\* .*? \*\/ # some comments
| # OR ..
# Grp 4 'Key' (tested with print, Perl doesn't have named capture arrays)
(?<Key> \w[\w\[\]]* (?:[^\S\n]+ [\w\[\]]+)* )
[^\S\n]* = [^\S\n]* # =
(?<Value> [^\n]* ) # Grp 5 'Value' (tested with print)
(?{ print " k\/v: '$+{Key}' = '$+{Value}'\n" }) # KEY,VALUE debug print, remove in production
| # OR ..
(?(1)| [^{}\n]* ) # any chars except newline and [{}] on the condition we're not a ini section
) # _grpend_
\s* # whitespace
)* # _grpend_ do 0 or more times
(?(1)| \} ) # Condition, if we're not a ini section then look for '}'
)
/x;
while ($str =~ /$rx/xsg)
{
print "\n";
print "Body:\n'$+{Body}'\n";
print "=========================================\n";
}
__END__
Output
Section: 'Start 0'
k/v: 'Next' = '1'
k/v: 'Author' = 'rk'
k/v: 'Date' = '2011-03-10'
Body:
'{
Next = 1
Author = rk
Date = 2011-03-10
/* Description = simple
*/
}'
=========================================
Section: 'PZ 11'
Body:
'{
IA_return()
}'
=========================================
Section: 'section 5'
k/v: 'this' = 'that'
Body:
'this = that
'
=========================================
Section: 'section 6'
k/v: 'this_' = '_that{hello() hhh = bbb}'
Body:
'this_ = _that{hello() hhh = bbb}
'
=========================================
Section: 'TOC'
Body:
'{}'
=========================================
Section: 'GDC 7'
k/v: 'Message' = '6'
k/v: 'Message' = '7'
k/v: 'Message' = '8'
k/v: 'Message' = '8'
k/v: 'RepeatCount' = '2'
k/v: 'ErrorMessage' = '10'
k/v: 'ErrorMessage' = '11'
k/v: 'onKey[5]' = '6'
k/v: 'onKey[6]' = '4'
k/v: 'onKey[9]' = '11'
Body:
'{
Message = 6
Message = 7
Message = 8
Message = 8
RepeatCount = 2
ErrorMessage = 10
ErrorMessage = 11
onKey[5] = 6
onKey[6] = 4
onKey[9] = 11
}'
=========================================