C# Regular expressions find and replace links with only uppercase characters and don't match exclusions
I'm having a hard time with what seems to be a simple regex task. I'd like to lowercase every href link in a block of text that contains at least one uppercase character, with the following exclusions.
For example,
href="/image-ZOOM.aspx?UPPERcasE=someThing"
should match and be replaced with
href="/image-zoom.aspx?uppercase=something"
whereas
href="/image-coorect.aspx"
would not match.
It should also exclude href="javascript:function();" and should not lowercase anything between <% %> tags.
For example,
href="/images/PDFs/<%=Product.ShortSku %>.pdf"
gets translated into
href="/images/pdfs/<%=Product.ShortSku %>.pdf"
I've tried something like href="([^"]*[A-Z]+[^"]*)" but that still matches links that are all lowercase. Could you please shed some light on this?
Thanks!
The tricky part is your <% ... %> requirement. It's actually pretty simple once you break each part of the URL into groups.
href="/images/PDFs/<%=Product.ShortSku %>.pdf"
      |_____1_____||__________2_________||_3_|
1. This group must exist.
2. This group is optional.
3. If group 2 doesn't exist then group 3 won't exist, in which case group 1 matches the entire href content. If group 2 exists, group 3 will be the remainder of the href content.
By understanding the above you end up with this for other strings:
href="/image-ZOOM.aspx?UPPERcasE=someThing"
      |________________1_________________|
I ended up with this pattern which makes use of named groups:
@"href=""(?!javascript:)(?=[^""]*[A-Z])(?<Start>[^""<]+)(?<Special><%[^""]+%>)?(?<End>[^""]*)"""
- href="" : matches href and the opening double-quote. (Inside a verbatim @"..." string, a doubled "" stands for a single double-quote character.)
- (?!javascript:) : negative look-ahead to ignore javascript functions.
- (?=[^""]*[A-Z]) : positive look-ahead to find an uppercase letter in the content to come. The [^""]* matches any char that isn't a double-quote. This is done to avoid going past the end of the content and greedily matching unintended content.
- (?<Start>[^""<]+) : named group that matches any char as long as it is not a double-quote or an opening angle bracket. Look at the earlier depiction: the angle bracket check ensures we stop if <% ... %> content is encountered. If no such content is encountered, the pattern continues until it reaches the closing double-quote.
- (?<Special><%[^""]+%>)? : optional named group to capture <% ... %> content. The trailing ? marks this entire group as optional.
- (?<End>[^""]*) : named group to match any remaining content. Notice here I use * to make it match zero or more characters. This allows this portion of the pattern to act as an optional match in the case where the Special group doesn't exist.
- "" : closing double-quote.
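To see how the pieces described above line up with the diagram, here is a minimal sketch that prints what each named group captures for the <% ... %> example. It reuses the pattern unchanged; the variable names are only illustrative, and it is written with top-level statements (C# 9 or later), so on older compilers the body would go inside a Main method.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as in the answer, unchanged.
string pattern = @"href=""(?!javascript:)(?=[^""]*[A-Z])(?<Start>[^""<]+)(?<Special><%[^""]+%>)?(?<End>[^""]*)""";

string input = "href=\"/images/PDFs/<%=Product.ShortSku %>.pdf\"";
Match m = Regex.Match(input, pattern);

// Print each named group so the three-part split is visible.
// Start   captures /images/PDFs/
// Special captures <%=Product.ShortSku %>
// End     captures .pdf
foreach (string name in new[] { "Start", "Special", "End" })
{
    Console.WriteLine("{0,-8}[{1}]", name + ":", m.Groups[name].Value);
}
```

The Start group stops at the opening angle bracket, which is what leaves the <% ... %> block for the Special group to pick up intact.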
Sample code:
using System;
using System.Text.RegularExpressions;

string[] inputs =
{
    "href=\"/image-ZOOM.aspx?UPPERcasE=someThing\"",      // match
    "href=\"/image-coorect.aspx\"",                       // no match, lowercase
    "href=\"javascript:function();\"",                    // no match, javascript
    "href=\"/images/PDFs/<%=Product.ShortSku %>.pDf\"",   // bypass <% %> content
};
string pattern = @"href=""(?!javascript:)(?=[^""]*[A-Z])(?<Start>[^""<]+)(?<Special><%[^""]+%>)?(?<End>[^""]*)""";
foreach (var input in inputs)
{
    // Report whether the pattern matches this input at all.
    Console.WriteLine("{0,6}: {1}", Regex.IsMatch(input, pattern), input);
    // Rebuild the href: lowercase Start and End, keep the <% ... %> block as-is.
    // ToLowerInvariant avoids culture-specific casing surprises in URLs.
    string result = Regex.Replace(input, pattern,
        m => "href=\""
            + m.Groups["Start"].Value.ToLowerInvariant()
            + m.Groups["Special"].Value
            + m.Groups["End"].Value.ToLowerInvariant()
            + "\"");
    Console.WriteLine("Result: " + result);
    Console.WriteLine();
}
This uses a lambda in place of the MatchEvaluator. Essentially we are reconstructing the string and referring to the named groups, altering the case on the groups we want to modify. The subtle key to this code is that if a group didn't match we can still reference it and it'll simply give us an empty string. Also, this might not be obvious from the code, but when a match fails the original string is returned unaltered by Regex.Replace.
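Both of those points can be checked directly. The sketch below is only an illustration: the input "/Home.ASPX" and the variable names are made up for the demonstration, and the evaluator simply lowercases the whole match. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"href=""(?!javascript:)(?=[^""]*[A-Z])(?<Start>[^""<]+)(?<Special><%[^""]+%>)?(?<End>[^""]*)""";

// No <% ... %> block here, so the optional Special group takes no part in the match.
Match m = Regex.Match("href=\"/Home.ASPX\"", pattern);
Console.WriteLine(m.Success);                         // True
Console.WriteLine(m.Groups["Special"].Success);       // False
Console.WriteLine(m.Groups["Special"].Value.Length);  // 0 (empty string, no exception)

// The javascript: look-ahead rejects this input, so Replace has nothing to do.
string unchanged = "href=\"javascript:doSomething();\"";
string result = Regex.Replace(unchanged, pattern, match => match.Value.ToLowerInvariant());
Console.WriteLine(result == unchanged);               // True
```

Because an unmatched group yields an empty string, the replacement lambda can concatenate all three groups without checking Success first.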
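Maybe you're matching case-insensitively. .NET has no "/i" modifier; the equivalent is RegexOptions.IgnoreCase (or the inline (?i) option), so make sure you're not using either one, because then [A-Z] also matches lowercase letters. Without it, your pattern behaves as expected: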
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

List<string> list = new List<string>() {
    "href=\"/image-ZOOM.aspx?UPPERcasE=someThing\"",
    "href=\"/image-zoom.aspx?uppercase=something\"",
    "href=\"/image-coorect.aspx\"",
    "href=\"javascript:function();\""
};
foreach (string l in list)
{
    // Case-sensitive by default: [A-Z] requires a real uppercase letter.
    if (Regex.IsMatch(l, "href=\"([^\"]*[A-Z]+[^\"]*)\""))
    {
        Console.WriteLine(l);
    }
}
This will only match: href="/image-ZOOM.aspx?UPPERcasE=someThing"
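For comparison, here is a small sketch of what happens to the same pattern once case-insensitive matching is switched on, which would explain all-lowercase links being matched. The variable names are illustrative, and the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = "href=\"([^\"]*[A-Z]+[^\"]*)\"";
string allLower = "href=\"/image-coorect.aspx\"";

// Default options: [A-Z] only matches uppercase letters.
bool caseSensitive = Regex.IsMatch(allLower, pattern);

// With IgnoreCase, [A-Z] also matches a-z, so an all-lowercase link matches.
bool ignoreCase = Regex.IsMatch(allLower, pattern, RegexOptions.IgnoreCase);

// The inline (?i) option has the same effect as RegexOptions.IgnoreCase.
bool inlineOption = Regex.IsMatch(allLower, "(?i)" + pattern);

Console.WriteLine(caseSensitive); // False
Console.WriteLine(ignoreCase);    // True
Console.WriteLine(inlineOption);  // True
```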
OK, I'm confused. If you have a collection of controls and/or tags on your page, you can test each one to see whether it is an anchor, and if so, get the href attribute from the tag and set it to href.ToLower() ...
Is there a particular reason to use a regex to solve a string and DOM parsing problem? Seems like overkill to me.