C# Regular expressions find and replace links with only uppercase characters and don't match exclusions

2023-01-26 07:13

I'm having a hard time with what seems to be a simple regex task. I'd like to find every href link in a block of text that contains uppercase characters and replace it with its lowercase form, with the exclusions described below.

For example

href="/image-ZOOM.aspx?UPPERcasE=someThing" should match and be replaced with

href="/image-zoom.aspx?uppercase=something"

href="/image-coorect.aspx" - would not match

It should also exclude href="javascript:function();" and should not lowercase anything between <% %> tags.

For example:

href="/images/PDFs/<%=Product.ShortSku %>.pdf" should be translated into

href="/images/pdfs/<%=Product.ShortSku %>.pdf"

I've tried something like href="([^"]*[A-Z]+[^"]*)" but that still matches links that are all lowercase. Could you please shed some light?

Thanks!

The tricky part is your <% ... %> requirement. It's actually pretty simple once you break each part of the URL into groups.

href="/images/PDFs/<%=Product.ShortSku %>.pdf"
      |_____1_____||__________2_________||_3_|

1. This group must exist.
2. This group is optional.
3. If group 2 doesn't exist then group 3 won't exist, in which case group 1 matches the entire href content. If group 2 exists, group 3 will be the remainder of the href content.

Applying the same breakdown to a string with no <% ... %> content, group 1 covers the entire href value:

href="/image-ZOOM.aspx?UPPERcasE=someThing"
      |________________1_________________|

I ended up with this pattern which makes use of named groups:

@"href=""(?!javascript:)(?=[^""]*[A-Z])(?<Start>[^""<]+)(?<Special><%[^""]+%>)?(?<End>[^""]*)"""

href="" : matches href and opening double-quote.
(?!javascript:) : negative look-ahead to ignore javascript functions.
(?=[^""]*[A-Z]) : positive look-ahead to find uppercase letters in the content to come. The [^""]* matches any char that isn't a double-quote. This is done to avoid going past the end of the content and greedily matching unintended content.
(?<Start>[^""<]+) : named group that matches any char as long as it is not a double-quote or opening angle bracket. Look at the earlier depiction - the angle bracket check ensures we stop if <% ... %> content is encountered. If no such content is encountered, this group continues until it reaches the closing double-quote.
(?<Special><%[^""]+%>)? : optional named group to capture <% ... %> content. The trailing ? marks this entire group as optional.
(?<End>[^""]*) : named group to match any remaining content. Notice here I use * to make it match zero or more characters. This allows this portion of the pattern to match nothing in the case where the Special group doesn't exist.
"" : closing double-quote.
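
To make the breakdown above concrete, here is a minimal sketch that prints what each named group captures. It reuses the pattern exactly as given; the two sample inputs are taken from the question, and the code assumes C# 9 or later so it can run as top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern copied from the answer; nothing about it is changed here.
string pattern = @"href=""(?!javascript:)(?=[^""]*[A-Z])(?<Start>[^""<]+)(?<Special><%[^""]+%>)?(?<End>[^""]*)""";

string[] samples =
{
    "href=\"/image-ZOOM.aspx?UPPERcasE=someThing\"",
    "href=\"/images/PDFs/<%=Product.ShortSku %>.pdf\"",
};

foreach (string sample in samples)
{
    Match m = Regex.Match(sample, pattern);
    Console.WriteLine(sample);
    Console.WriteLine("  Start:   '{0}'", m.Groups["Start"].Value);
    // Success is false when the optional group did not take part in the match.
    Console.WriteLine("  Special: '{0}' (matched: {1})",
        m.Groups["Special"].Value, m.Groups["Special"].Success);
    Console.WriteLine("  End:     '{0}'", m.Groups["End"].Value);
}
```

For the first input, Start holds the whole href value and the other two groups are empty. For the second, Start is /images/PDFs/, Special is <%=Product.ShortSku %> and End is .pdf.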

Sample code:

using System;
using System.Text.RegularExpressions;

string[] inputs =
{
    "href=\"/image-ZOOM.aspx?UPPERcasE=someThing\"", // match
    "href=\"/image-coorect.aspx\"",  // no match, lowercase
    "href=\"javascript:function();\"", // no match, javascript
    "href=\"/images/PDFs/<%=Product.ShortSku %>.pDf\"", // bypass <% %> content
};

string pattern = @"href=""(?!javascript:)(?=[^""]*[A-Z])(?<Start>[^""<]+)(?<Special><%[^""]+%>)?(?<End>[^""]*)""";

foreach (var input in inputs)
{
    Console.WriteLine("{0,6}: {1}", Regex.IsMatch(input, pattern), input);
    // Rebuild the attribute: lowercase Start and End, keep Special untouched.
    // ToLowerInvariant keeps the result independent of the current culture.
    string result = Regex.Replace(input, pattern,
                        m => "href=\""
                            + m.Groups["Start"].Value.ToLowerInvariant()
                            + m.Groups["Special"].Value
                            + m.Groups["End"].Value.ToLowerInvariant()
                            + "\"");
    Console.WriteLine("Result: " + result);
    Console.WriteLine();
}

This uses a lambda in place of the MatchEvaluator. Essentially we are reconstructing the string and referring to the named groups, altering the case on the groups we want to modify. The subtle key to this code is that if a group didn't match we can still reference it and it'll simply give us an empty string. Also, this might not be obvious from the code, but when a match fails the original string is returned unaltered by Regex.Replace.
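
The two points in the previous paragraph can be checked with a short sketch. The helper name LowercaseHref and the sample strings are made up for illustration, and ToLowerInvariant is my choice to keep the lowercasing culture-independent; the pattern itself is the one from this answer. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"href=""(?!javascript:)(?=[^""]*[A-Z])(?<Start>[^""<]+)(?<Special><%[^""]+%>)?(?<End>[^""]*)""";

// Hypothetical helper wrapping the replacement described above.
string LowercaseHref(string input) =>
    Regex.Replace(input, pattern, m =>
        "href=\""
        + m.Groups["Start"].Value.ToLowerInvariant()
        + m.Groups["Special"].Value          // left untouched
        + m.Groups["End"].Value.ToLowerInvariant()
        + "\"");

// No match (javascript link): the input comes back unchanged.
string script = "href=\"javascript:openWindow();\"";
Console.WriteLine(LowercaseHref(script) == script); // True

// An unmatched optional group can still be read; its Value is empty.
Match plain = Regex.Match("href=\"/image-ZOOM.aspx\"", pattern);
Console.WriteLine(plain.Groups["Special"].Success);     // False
Console.WriteLine(plain.Groups["Special"].Value == ""); // True

// Several links in one string: only the one with uppercase is rewritten.
string html = "<a href=\"/Home.aspx\">Home</a> <a href=\"/about.aspx\">About</a>";
Console.WriteLine(LowercaseHref(html));
// <a href="/home.aspx">Home</a> <a href="/about.aspx">About</a>
```

Because Regex.Replace only touches the matched spans, the link text (Home, About) and any markup outside the href attributes keep their original case.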

Maybe you're matching case-insensitively. Make sure that you're not passing RegexOptions.IgnoreCase (the .NET equivalent of the "/i" modifier).

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

List<string> list = new List<string>() {
    "href=\"/image-ZOOM.aspx?UPPERcasE=someThing\"",
    "href=\"/image-zoom.aspx?uppercase=something\"",
    "href=\"/image-coorect.aspx\"",
    "href=\"javascript:function();\""
};

foreach (string l in list)
{
    // No RegexOptions are passed, so [A-Z] is case-sensitive.
    if (Regex.IsMatch(l, "href=\"([^\"]*[A-Z]+[^\"]*)\""))
    {
        Console.WriteLine(l);
    }
}

Will only match: href="/image-ZOOM.aspx?UPPERcasE=someThing"
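
As a quick check of the IgnoreCase point, the sketch below runs the pattern from the question against an all-lowercase link, once with default options and once with RegexOptions.IgnoreCase. The variable names are illustrative, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

string questionPattern = "href=\"([^\"]*[A-Z]+[^\"]*)\"";
string lowercaseLink = "href=\"/image-coorect.aspx\"";

// Default options: [A-Z] only matches uppercase letters.
bool caseSensitive = Regex.IsMatch(lowercaseLink, questionPattern);

// IgnoreCase: [A-Z] also matches a-z, so the lowercase link matches too.
bool caseInsensitive = Regex.IsMatch(lowercaseLink, questionPattern, RegexOptions.IgnoreCase);

Console.WriteLine(caseSensitive);   // False
Console.WriteLine(caseInsensitive); // True
```

If the second result is what you are seeing in your own code, an IgnoreCase option somewhere in the call is the likely cause.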

Ok, I'm confused. If you have a collection of controls and/or tags on your page, you can test whether each one is an anchor and, if so, get the href attribute from the tag, then set the href to href.ToLower ...

Is there a particular reason to use a regex to solve a string and DOM parsing problem? Seems like overkill to me.
