How to adjust my regex to work with multiline and more complex text?

Background: I've written a small library that can create ASP.NET controls from a string.

Sample text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et 
{{asp:hyperlink|NavigateUrl="/faq.aspx";Text="FAQ";}}
{{codesample|Text="FAQ";}}
accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur 

I got my initial help here. I've expanded the code with some reflection to gain full flexibility, so that it can render both WebControls and UserControls. It works fine so far for every user control I've tested. The problem I'm facing now is that the parsing of my property key-value pairs is not flexible enough to support arbitrary multiline content.

This is part of the code that I'm using for the string operations:

substring = substring.Replace("\\&quot;", "\""); //substring is the string containing lorem ipsum
substring = substring.Replace("&quot;", "\"");
Regex r = new Regex("{{(?<single>([a-z0-9:]*))\\|((?<pair>([a-z0-9:]*=\"[a-z0-9.:/?_~=]*\";))*)}}", RegexOptions.Singleline | RegexOptions.IgnoreCase);
Match m = r.Match(substring);
if (m.Success)
{
    Dictionary<string, string> properties = new Dictionary<string, string>();
    foreach (Capture cap in m.Groups["pair"].Captures)
    {
        string key = cap.ToString().Substring(0, cap.ToString().IndexOf("="));
        if (!properties.ContainsKey(key))
        {
            string value = cap.ToString().Substring(cap.ToString().IndexOf("=\"") + 2);
            value = value.Substring(0, value.Length - 2);
            properties.Add(key, value);
        }
    }
    MethodInfo dynamicRenderControl = null;
    String controlString = m.Groups["single"].Value.ToLower();
}

(The string comes from a database. It was previously set in my CMS. I have left out the code for getting groups of {{FOO|BAR="Foo2";}})

This is what the regex does. Example:

{{asp:hyperlink|NavigateUrl="/faq.aspx";Text="FAQ";}}

It parses "asp:hyperlink" into m.Groups["single"]. It is the string I need for mapping to a specific control type.

After the '|' I have the list of properties that will be captured into m.Groups["pair"].Captures.

This all works fine, but not for multiline text or more complex text. E.g.

{{codesample|Text="using System.Text;<br />\r\nusing System.Bla;";}}

This is where my code breaks. Question:

How must I adjust the regex to make it work for multiline text that starts with \" and ends with \"; although there might also be \" inside that text? Or is that not possible with regex?

Edit: I've been thinking. It's not possible to achieve what I want with regex, because a \" in the text automatically breaks the code. I'm switching the outer delimiter to the CDATA syntax that XML uses (see the Wikipedia entry for CDATA).

"<![CDATA[This is my content]]>";

This means that each entry looks like this:

{{codesample|Text="<![CDATA[this is text on the first line<br />\r\nthis is text on the second line]]>";}}

Where the beginning of the value is

"<![CDATA[

and the end

]]>";

I've been trying to write this regex myself but I failed. Could anyone assist me with this one?


To get the effect you're describing you need single-line mode, which allows . to match newlines in addition to 'any character'. In .NET the option is called RegexOptions.Singleline, and you can switch it on in two ways:

  • in the Regex constructor, by passing RegexOptions.Singleline; however, that changes every . in the pattern, which can mess up the rest of the regex.
  • inline, using the syntax (?s) to turn it on and (?-s) to turn it off. You can use this to turn it on just before the expression you want to be able to match multiple lines and back off afterwards.
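
As a rough sketch of the two variants above: the input string and the class and method names are made up for illustration, only RegexOptions.Singleline and the inline (?s) / (?-s) syntax are part of .NET itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical input: a quoted value that spans two lines.
    public const string Input = "Text=\"line one\r\nline two\";";

    // Variant 1: the option is set for the whole pattern.
    public static string ViaOption() =>
        Regex.Match(Input, "\"(.*?)\";", RegexOptions.Singleline).Groups[1].Value;

    // Variant 2: the option is switched on only around the part
    // that is allowed to span lines, and switched off again afterwards.
    public static string ViaInline() =>
        Regex.Match(Input, "\"(?s)(.*?)(?-s)\";").Groups[1].Value;

    // Without the option, '.' does not match '\n', so nothing is found.
    public static bool MatchesWithoutOption() =>
        Regex.IsMatch(Input, "\"(.*?)\";");

    public static void Main()
    {
        Console.WriteLine(ViaOption() == ViaInline());
        Console.WriteLine(MatchesWithoutOption());
    }
}
```

Both variants should return the same two-line value here; the inline form is mostly useful when other parts of a larger pattern rely on . stopping at the end of a line.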

That takes care of spanning multiple lines. Now for double quotes embedded in the string... I'm assuming they'll be escaped somehow? Is it plain backslash escaping? Double-quoting? You'll have to see which is the case, there's a solution for every case. However... in the words of some very wise man (can't remember who he was so obviously wiser than me), 'if you have a problem and say - I know, I'll use regex -- now you have two problems'. That can certainly be the case when you keep discovering corner cases.

Edit:

Note that you can actually ignore escaped characters, up to a point. For example, you can match a quote only when it is not preceded by a backslash by using a negative look-behind assertion, written (?<!\\) in front of the quote. Going that way is a bit more complex, though, and I'm not sure of all the details myself.
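
A minimal sketch of the look-behind idea, assuming the quotes inside the value are escaped with a plain backslash (the question does not confirm this). The class name, method name and input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedQuoteDemo
{
    // Matches key="value"; where the closing quote must not be
    // preceded by a backslash. (?<!\\) is the negative look-behind;
    // Singleline lets '.' cross line breaks inside the value.
    private static readonly Regex Pair = new Regex(
        @"(?<key>\w+)=""(?<value>.*?)(?<!\\)"";",
        RegexOptions.Singleline);

    public static string ValueOf(string input) =>
        Pair.Match(input).Groups["value"].Value;

    public static void Main()
    {
        // Hypothetical input: backslash-escaped quotes plus a line break.
        string input = "Text=\"say \\\"hi\\\"\r\nthere\";";
        Console.WriteLine(ValueOf(input));
    }
}
```

One corner case this does not handle: a value that legitimately ends in a backslash would make the closing quote look escaped, so the match would run on to a later quote or fail.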

In the case of CDATA it's considerably easier to write a regex; all you need to do is turn on single line as I said, and:

  • match the start, which is \"\<!\[CDATA\[; you need to escape the characters because most of them have specific meanings in the regex syntax. To be on the safe side (if you don't feel like looking for documentation on what exactly you need to escape), you can escape with a backslash pretty much any non-standard character.
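  • match any characters, for the shortest length possible before encountering the end marker: (.+?) - note the question mark directly after the quantifier, which makes the match non-greedy. (Written as (.+)? the question mark would instead make the whole group optional and leave the match greedy.)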
  • match the CDATA end tag: \]\]\>\";.

So the complete expression would be... (without testing it a great deal, mind you):

(
{{
(?<single>\w*)
\|
(?<pair>
  (?<key>\w*)="\<!\[CDATA\[(?<cdatavalue>.*?)\]\]\>";)*
}}
)+

(I've spread it across multiple lines, which works with RegexOptions.IgnorePatternWhitespace, to make it more readable.)

However it might make for some awkward code when going over the results so I've taken the liberty of improving it slightly:

(
{{
(?<title>.*?)
\|
((?<single>\w*)
|
(?<pair>
  (?<key>\w*)
  ="\<!\[CDATA\[
  (?<cdatavalue>.*?)
  \]\]\>";
)+
)
}}
)+

(Note that when pasting in Visual Studio you'll need to escape the quotes again!)

What this does, when going through multiple matches with the option ExplicitCapture on (to only capture named groups), is this:

  • the match will contain a title group. This is the first part of the regex.
  • the match will have some data in either the single or pair groups; you can check with string.IsNullOrEmpty which one has matched.
  • if the single contains something, then that's the match you're looking for.
  • if the pair contains something, you can look further at the key and cdatavalue groups for the key-value-pair broken up according to what you requested.
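
Here is a rough sketch of how going over the results could look. It is not tested against the real CMS data: the class and method names are made up, the value group is written with a lazy quantifier, the braces are escaped to be on the safe side, and the outer ( ... )+ wrapper is left out in favour of Regex.Matches.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CdataTagDemo
{
    // Spread over several lines thanks to IgnorePatternWhitespace.
    // ExplicitCapture means only the named groups are captured.
    private static readonly Regex Tag = new Regex(@"
        \{\{
        (?<title>.*?)
        \|
        (
          (?<single>\w*)
          |
          (?<pair>
            (?<key>\w*)
            =""<!\[CDATA\[
            (?<cdatavalue>.*?)
            \]\]>"";
          )+
        )
        \}\}",
        RegexOptions.Singleline
        | RegexOptions.IgnorePatternWhitespace
        | RegexOptions.ExplicitCapture);

    // Returns one description line per single value or key-value pair.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        foreach (Match m in Tag.Matches(text))
        {
            string title = m.Groups["title"].Value;
            if (m.Groups["pair"].Success)
            {
                // A repeated group keeps every repetition in Captures.
                CaptureCollection keys = m.Groups["key"].Captures;
                CaptureCollection values = m.Groups["cdatavalue"].Captures;
                for (int i = 0; i < keys.Count; i++)
                    lines.Add(title + ": " + keys[i].Value + " = " + values[i].Value);
            }
            else
            {
                lines.Add(title + ": single = " + m.Groups["single"].Value);
            }
        }
        return lines;
    }

    public static void Main()
    {
        string sample =
            "{{asp:sample|test}}\r\n" +
            "{{asp:codesample|Text=\"<![CDATA[this is text on the first line<br />\r\n" +
            "this is text on the second line]]>\";}}";
        foreach (string line in Describe(sample))
            Console.WriteLine(line);
    }
}
```

Checking Groups["pair"].Success does the same job as the string.IsNullOrEmpty test mentioned above, and it still works when a value happens to be empty.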

Example: sample text:

{{asp:sample|test}}
{{asp:codesample|Text="<![CDATA[this is text on the first line<br />
this is text on the second line]]>";}}

Results:

(screenshot of the match results; image not available)

Also, can't believe I didn't mention this earlier: Expresso is an awesome tool for testing and developing .NET regexes, and it's free (the registration required is a minor nuisance).

Holy cow, that was long. Sorry for the long-windedness.


If I've understood your problem correctly, I believe this should solve the problem?

Regex r = new Regex("{{(?<single>([a-z0-9:]*))\\|((?<pair>([a-z0-9:]*=\"[^\"]*\";))*)}}", RegexOptions.Singleline | RegexOptions.IgnoreCase);

The [^\"]* part captures everything between the opening " and the closing ", including line breaks.
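
A quick sketch to check this against the multiline example from the question. The class and method names are only for illustration; the pattern is the one above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    private static readonly Regex Tag = new Regex(
        "{{(?<single>([a-z0-9:]*))\\|((?<pair>([a-z0-9:]*=\"[^\"]*\";))*)}}",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the first captured key="value"; pair, or null if the tag does not match.
    public static string FirstPair(string text)
    {
        Match m = Tag.Match(text);
        return m.Success && m.Groups["pair"].Captures.Count > 0
            ? m.Groups["pair"].Captures[0].Value
            : null;
    }

    public static void Main()
    {
        // The multiline example from the question.
        Console.WriteLine(FirstPair(
            "{{codesample|Text=\"using System.Text;<br />\r\nusing System.Bla;\";}}"));

        // A quote inside the value ends [^"]* early, so this tag is not matched at all.
        Console.WriteLine(FirstPair("{{codesample|Text=\"a \\\"b\\\" c\";}}") == null);
    }
}
```

A negated character class matches line breaks regardless of the Singleline option, so the multiline value goes through. The price is that any " inside the value stops the match, escaped or not.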

Br. Morten
