Hyperlink regex including http(s):// not working in C#
I think this is sufficiently different from similar questions to warrant a new one.
I have the following regex to match opening hyperlink tags in HTML, including the http(s):// part in order to exclude mailto: links:
<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>
When I run this through Nregex (with escaping removed) it matches correctly for the following test cases:
<a href="http://www.bbc.co.uk">
<a href="http://bbc.co.uk">
<a href="https://www.bbc.co.uk">
<a href="mailto:rory@domain.com">
However, when I run this in my C# code, it fails. Here is the matching code:
public static IEnumerable<string> GetUrls(this string input, string matchPattern)
{
    var matches = Regex.Matches(input, matchPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    foreach (Match match in matches)
    {
        yield return match.Groups["href"].Value;
    }
}
And my tests:
@"<a href=""https://www.bbc.co.uk"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(1);
@"<a href=""mailto:rory@domain.com"">bbc</a>".GetUrls(StringExtensions.HtmlUrlRegexPattern).Count().ShouldEqual(0);
The problem seems to be in the \\b(https?):// part, which I added. Removing it makes the normal URL test pass, but then the mailto: test fails.
Can anyone shed any light?
Are you writing the regex like this?
@"<a[^>]*?href=[""'](?<href>\\b(https?)://[^\[\]""]+?)[""'][^>]*?>"
If so, you have too many backslashes in the word boundary. Because it's a verbatim string literal, the regex compiler sees two backslashes, just as you wrote them, so it thinks you're looking for a literal backslash followed by the letter b.
But you don't need to use a word boundary there anyway. You're already specifying that the protocol must be immediately preceded by a single- or double-quote, so it can't be preceded by a word character.
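To make the escaping difference concrete, here is a minimal sketch. The class and member names (WordBoundaryDemo, Matches, and the three constants) are made up for illustration; only Regex.IsMatch is a real API.

```csharp
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Verbatim literal: the regex engine receives \\b,
    // which means a literal backslash followed by the letter b.
    public const string DoubledInVerbatim = @"\\bhttp";

    // Verbatim literal: the regex engine receives \b, a word boundary.
    public const string SingleInVerbatim = @"\bhttp";

    // Regular literal: the C# compiler collapses \\ into one backslash,
    // so the regex engine also receives \b, a word boundary.
    public const string DoubledInRegular = "\\bhttp";

    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }
}
```

With an input such as "http://x preceded by a double quote, the first pattern should not match, while the other two should. The first pattern only matches text that really contains a backslash and a b in front of http.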
The problem is that your regex is actually looking to match something like <a href="\bhttps://... If you remove the \\b (which is unnecessary), it should work. Use this instead:
<a[^>]*?href=[""'](?<href>(https?)://[^\[\]""]+?)[""'][^>]*?>
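As a rough check, the pattern without the word boundary can be exercised against the inputs from the question. This is only a sketch: HrefExtractor and its non-extension GetUrls method are illustrative names, and RegexOptions.Compiled is left out for brevity.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Pattern without \\b, written as a verbatim literal
    // (inside @"...", "" stands for one double quote).
    public const string Pattern =
        @"<a[^>]*?href=[""'](?<href>(https?)://[^\[\]""]+?)[""'][^>]*?>";

    // Returns the value of the named group "href" for every match.
    public static List<string> GetUrls(string input)
    {
        return Regex.Matches(input, Pattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Groups["href"].Value)
            .ToList();
    }
}
```

Calling HrefExtractor.GetUrls on the https://www.bbc.co.uk anchor from the tests should yield one URL, and on the mailto: anchor it should yield none.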
As general advice, when dealing with regular expressions, break them down into their constituent pieces and get each piece working on its own. Then you can focus on assembling them to match your input. Sometimes this is hard to do, particularly with complex expressions involving backtracking or lookahead, but your case is simple enough that you should be able to decompose the expression into parts that work individually.
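One possible way to apply this advice is to keep each piece in its own constant and try it in isolation before concatenating. The split below and all names (PatternParts, TagStart, Quote, Url, TagEnd, Full) are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

public static class PatternParts
{
    // Each piece can be tested on its own before assembling the whole.
    public const string TagStart = @"<a[^>]*?href=";                  // opening tag up to href=
    public const string Quote = @"[""']";                            // single or double quote
    public const string Url = @"(?<href>(https?)://[^\[\]""]+?)";    // http(s) URL, captured as "href"
    public const string TagEnd = @"[^>]*?>";                         // rest of the tag

    // The full expression is just the pieces in order.
    public const string Full = TagStart + Quote + Url + Quote + TagEnd;

    public static bool IsMatch(string input, string piece)
    {
        return Regex.IsMatch(input, piece, RegexOptions.IgnoreCase);
    }
}
```

If a piece such as Url already rejects mailto:rory@domain.com on its own, any failure of Full on an http link must come from another piece or from how the pieces are joined.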
I think this should work:
@"<a[^>]*?href=[""'](?<href>(https?):[/][/][^\[\]""]+?)[""'][^>]*?>"
You don't need to escape / characters in regular expressions, but it doesn't hurt to wrap them in a [ ] character class.
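A small sketch comparing the two spellings of the slashes; SlashDemo and its members are hypothetical names used only for this comparison.

```csharp
using System.Text.RegularExpressions;

public static class SlashDemo
{
    // A forward slash has no special meaning in a .NET pattern.
    public const string Plain = @"https?://";

    // The same slashes, each wrapped in a one-character class.
    public const string Bracketed = @"https?:[/][/]";

    public static bool MatchesPlain(string input)
    {
        return Regex.IsMatch(input, Plain);
    }

    public static bool MatchesBracketed(string input)
    {
        return Regex.IsMatch(input, Bracketed);
    }
}
```

Both methods should agree on any input, so the bracketed form is a matter of taste rather than a fix.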