Help identifying issue in Relative to Absolute URL RegEx Replacement
I have been working with some code I found here to help me with converting relative URLs to absolute URLs in HTML page source.
I want to work with RegEx, and not HTML Agility pack for this particular problem.
I've modified the code slightly, and it works well for relative URLs with a preceding "/", which are replaced. As far as I can tell, though, relative URLs that don't include a preceding slash are not.
I'm pretty sure the issue is in the initial RegEx string, as no replacements are attempted. This is beyond my regular expression knowledge.
Can anyone help me identify what is causing this not to match the types of URL I have described?
const string htmlPattern = "(?<attrib>\\shref|\\ssrc|\\sbackground)\\s*?=\\s*?"
+ "(?<delim1>[\"'\\\\]{0,2})(?!#|http|ftp|mailto|javascript)"
+ "/(?<url>[^\"'>\\\\]+)(?<delim2>[\"'\\\\]{0,2})";
// Wrapper Code
public static string GetRelativePathReplacedHtml(string source, Uri uri)
{
source = source.HtmlAppRelativeUrlsToAbsoluteUrls( uri );
return source;
}
// RegEx Match Code
public static string HtmlAppRelativeUrlsToAbsoluteUrls(this string html, Uri rootUrl)
{
if (string.IsNullOrEmpty(html))
return html;
const string htmlPattern = "(?<attrib>\\shref|\\ssrc|\\sbackground)\\s*?=\\s*?"
+ "(?<delim1>[\"'\\\\]{0,2})(?!#|http|ftp|mailto|javascript)"
+ "/(?<url>[^\"'>\\\\]+)(?<delim2>[\"'\\\\]{0,2})";
var htmlRegex = new Regex(htmlPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
html = htmlRegex.Replace(html, m => htmlRegex.Replace(m.Value, "${attrib}=${delim1}" + ("~/" + m.Groups["url"].Value).ToAbsoluteUrl(rootUrl) + "${delim2}"));
const string cssPattern = "@import\\s+?(url)*['\"(]{1,2}"
+ "(?!http)\\s*/(?<url>[^\"')]+)['\")]{1,2}";
var cssRegex = new Regex(cssPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
html = cssRegex.Replace(html, m => cssRegex.Replace(m.Value, "@import url(" + ("~/" + m.Groups["url"].Value).ToAbsoluteUrl(rootUrl) + ")"));
return html;
}
// Url Conversion
public static string ToAbsoluteUrl(this string relativeUrl, Uri rootUrl)
{
if (string.IsNullOrEmpty(relativeUrl))
return relativeUrl;
if (relativeUrl.StartsWith("/"))
relativeUrl = relativeUrl.Insert(0, "~");
if (!relativeUrl.StartsWith("~/"))
relativeUrl = relativeUrl.Insert(0, "~/");
var url = rootUrl;
var port = url.Port != 80 ? (":" + url.Port) : String.Empty;
// return string.Format("{0}://{1}{2}{3}", url.Scheme, url.Host, port, VirtualPathUtility.ToAbsolute(relativeUrl));
return string.Format("{0}://{1}{2}{3}", url.Scheme, url.Host, port, relativeUrl.Replace("~/", "/"));
}
Change
+ "/(?<url>[^\"'>\\\\]+)(?<delim2>[\"'\\\\]{0,2})";
to
+ "(?<url>[^\"'>\\\\]+)(?<delim2>[\"'\\\\]{0,2})";
i.e., drop the leading slash,
and in the CSS section change
+ "(?!http)\\s*/(?<url>[^\"')]+)['\")]{1,2}";
to
+ "(?!http)\\s*(?<url>[^\"')]+)['\")]{1,2}";
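To see the effect of dropping the slash, here is a minimal sketch that tests both versions of the HTML pattern against a few sample tags. The class name PatternCheck and the sample inputs are made up for illustration; only the two pattern strings come from the posts above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Pattern from the question: a literal "/" is required before the url group.
    public const string WithSlash =
        "(?<attrib>\\shref|\\ssrc|\\sbackground)\\s*?=\\s*?"
        + "(?<delim1>[\"'\\\\]{0,2})(?!#|http|ftp|mailto|javascript)"
        + "/(?<url>[^\"'>\\\\]+)(?<delim2>[\"'\\\\]{0,2})";

    // Same pattern with the literal "/" dropped, as suggested in this answer.
    public const string WithoutSlash =
        "(?<attrib>\\shref|\\ssrc|\\sbackground)\\s*?=\\s*?"
        + "(?<delim1>[\"'\\\\]{0,2})(?!#|http|ftp|mailto|javascript)"
        + "(?<url>[^\"'>\\\\]+)(?<delim2>[\"'\\\\]{0,2})";

    // Returns true when the pattern finds at least one attribute to rewrite.
    public static bool Matches(string pattern, string html)
    {
        return Regex.IsMatch(html, pattern, RegexOptions.IgnoreCase);
    }
}
```

With these definitions, a tag such as `<a href="a/b.html">` is only picked up by WithoutSlash, while an absolute `http://` link is skipped by both because of the negative lookahead.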
I suspect the Regex is only part of the issue. It requires relative URLs to begin with an opening "/", but removing this restriction alone would still fail, since the ToAbsoluteUrl method ultimately calls VirtualPathUtility.ToAbsolute, which requires a rooted URL (relative to the application or absolute).
You can change the ToAbsoluteUrl function to return the proper absolute URL for the given attribute. When the expression is changed as suggested below, ToAbsoluteUrl will receive the HTML attribute value without the preceding ~/, e.g., /path/a.aspx instead of /~/path/a.aspx. The pattern can then be loosened to:
const string htmlPattern = @"(?<attrib>\s(?>href|src|background))\s*=\s*"
// mailto: and javascript: URLs have no "//" after the colon, so they are excluded separately
+ @"(?<delim1>[""'\\])(?!#|(?>https?|ftp|file)://|(?>mailto|javascript):)"
+ @"(?<url>.+?)\k<delim1>";
// to handle escaped delimiters in the URL string, use the line below
// in place of last segment:
// + @"(?<url>.+?)(?<!(?:(?<!\\)(?:\\\\)*)\\)\k<delim1>";
and two lines later:
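// The loosened pattern defines no delim2 group; the closing delimiter is the
// same character as delim1 (matched by \k<delim1>), so delim1 is reused here.
html = htmlRegex.Replace(html, m => m.Result("${attrib}=${delim1}"
+ (m.Groups["url"].Value).ToAbsoluteUrl(rootUrl)
+ "${delim1}"));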
(I replaced the inner Regex.Replace with m.Result, which seems to be the original author's intent.)
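For the ToAbsoluteUrl change mentioned above, one possible sketch is to let System.Uri do the resolution instead of assembling the string by hand. This is not the original author's method. It assumes the Uri passed in is the URL of the page being processed, so unrooted paths resolve against that page's directory, and it assumes a leading "~/" should be treated as the site root. The class name UrlResolver is made up for illustration.

```csharp
using System;

public static class UrlResolver
{
    // Hypothetical replacement for ToAbsoluteUrl: resolves relativeUrl against
    // pageUrl using the standard URI resolution rules built into System.Uri.
    public static string ToAbsoluteUrl(this string relativeUrl, Uri pageUrl)
    {
        if (string.IsNullOrEmpty(relativeUrl))
            return relativeUrl;

        // Assumption: a leading "~/" (ASP.NET app-root syntax) means the site root.
        if (relativeUrl.StartsWith("~/", StringComparison.Ordinal))
            relativeUrl = relativeUrl.Substring(1);

        // Uri.TryCreate handles rooted ("/a"), unrooted ("a/b"), and "../" paths,
        // and keeps a non-default port from pageUrl.
        Uri absolute;
        return Uri.TryCreate(pageUrl, relativeUrl, out absolute)
            ? absolute.AbsoluteUri
            : relativeUrl; // leave the value untouched if it cannot be resolved
    }
}
```

For a page at http://example.com/dir/page.html, "/path/a.aspx" resolves against the host root and "img/x.png" resolves against /dir/. It also avoids the hand-written port check, which appends ":443" for https URLs because it only compares against 80.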
Three important notes. First, m.Groups["url"].Value is not escaped before it becomes part of the replacement string, so a URL containing substitution syntax, such as /path/$1.aspx or /path/${url}.aspx, will have that token expanded instead of being kept literally. (This characteristic was present in the original code.)
Second, the general caveat applies that using regular expressions to match HTML is not advised. For instance, if href="/path.asp" happens to appear in the source outside a tag, it will be matched and converted. (You can add a lookbehind like (?<=\<[^>]*) to the start of the pattern to guard against this, but even this causes issues in cases like <a onmouseover="g(f(this)>2)" href="/a.aspx"> because of the >2.)
Third, this doesn't address the CSS import, although that can be resolved similarly (most simply by removing the / after \\s* in cssPattern).
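One way to sidestep the unescaped-replacement problem from the first note is to build the result by plain string concatenation inside the MatchEvaluator, so that nothing taken from the URL is ever parsed as a substitution pattern. The following is a minimal sketch under a few assumptions: the class name SafeRewrite is made up, the url group uses a character class rather than .+? so an empty attribute cannot swallow the following one, and Uri.TryCreate stands in for ToAbsoluteUrl.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeRewrite
{
    // Quoted href/src/background values that are not anchors or absolute URLs.
    static readonly Regex HtmlRegex = new Regex(
        @"(?<attrib>\s(?>href|src|background))\s*=\s*"
        + @"(?<delim1>[""'])(?!#|(?>https?|ftp|file)://|(?>mailto|javascript):)"
        + @"(?<url>[^""'>]+)\k<delim1>",
        RegexOptions.IgnoreCase);

    public static string Rewrite(string html, Uri pageUrl)
    {
        if (string.IsNullOrEmpty(html))
            return html;

        // The evaluator's return value is inserted literally, so "$1" or
        // "${name}" inside the URL is never treated as a substitution token.
        return HtmlRegex.Replace(html, m =>
        {
            string delim = m.Groups["delim1"].Value;
            string url = m.Groups["url"].Value;
            Uri resolved;
            string absolute = Uri.TryCreate(pageUrl, url, out resolved)
                ? resolved.AbsoluteUri
                : url;
            return m.Groups["attrib"].Value + "=" + delim + absolute + delim;
        });
    }
}
```

Note that this sketch only handles quoted attribute values and normalizes any whitespace around the "=" away. It is still subject to the general caveat about matching HTML with regular expressions.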