C# regular expression to match img src="*" type URLs

2023-01-16 02:36

I have a regex in C# that I'm using to match image tags and pull out the URL. My code works in most situations. The code below will "fix" all relative image URLs by converting them to absolute URLs.

The issue is that the regex will not match the following:

<img height="150" width="202" alt="" src="../Image%20Files/Koala.jpg" style="border: 0px solid black; float: right;">

For example, it matches this one just fine:

<img height="147" width="197" alt="" src="../Handlers/SignatureImage.ashx?cid=5" style="border: 0px solid black;">

Any ideas on how to make it match would be great. I think the issue is the % but I could be wrong.

Regex rxImages = new Regex(" src=\"([^\"]*)\"", RegexOptions.IgnoreCase & RegexOptions.IgnorePatternWhitespace);
mc = rxImages.Matches(html);
if (mc.Count > 0)
{
    Match m = mc[0];
    string relitiveURL = html.Substring(m.Index + 6, m.Length - 7);
    if (relitiveURL.Substring(0, 4) != "http")
    {
        Uri absoluteUri = new Uri(baseUri, relitiveURL);
        ret += html.Substring(0, m.Index + 5);
        ret += absoluteUri.ToString();
        ret += html.Substring(m.Index + m.Length - 1, html.Length - (m.Index + m.Length - 1));
        ret = convertToAbsolute(URL, ret);
    }
}

Using RegEx to parse images in this way is a bad idea. See here for a good demonstration of why.

You can use an HTML parser such as the HTML Agility Pack to parse the HTML and query it using XPath syntax.
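As a rough sketch of the parser route, assuming the HtmlAgilityPack NuGet package is installed: the class name ImgSrcRewriter and the method MakeImageUrlsAbsolute are made up for illustration, and baseUri stands in for whatever page URL the HTML came from. It selects every img element that has a src attribute via XPath and resolves the value against the base.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class ImgSrcRewriter
{
    // Hypothetical helper: rewrites every <img src> in html to an absolute URL.
    public static string MakeImageUrlsAbsolute(Uri baseUri, string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
            return html;

        foreach (HtmlNode img in images)
        {
            string src = img.GetAttributeValue("src", string.Empty);
            Uri absolute;
            // Resolving against baseUri leaves already-absolute URLs unchanged.
            if (Uri.TryCreate(baseUri, src, out absolute))
                img.SetAttributeValue("src", absolute.AbsoluteUri);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

AbsoluteUri is used instead of ToString() so that escapes such as %20 stay escaped in the rewritten attribute.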

First, I would try to skip all the manual parsing and use LINQ to HTML:

HDocument document = HDocument.Load("http://www.microsoft.com");

foreach (HElement element in document.Descendants("img"))
{
   Console.WriteLine("src = " + element.Attribute("src"));
}

If that didn't work, only then would I go back to manual parsing and I'm sure one of the fine gentle-people here has already posted a working regex for your needs.
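If you do fall back to a regex, here is a minimal sketch using only the .NET base class library. The names ImageUrlFixer and ConvertToAbsolute are hypothetical, not taken from the question. Two things differ from the posted code: the options are combined with | (combining two different flags with & gives RegexOptions.None, and IgnorePatternWhitespace would drop the leading literal space from the pattern anyway), and Regex.Replace visits every match instead of only mc[0].

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImageUrlFixer
{
    // Matches src="..." or src='...' preceded by whitespace.
    // The "url" group holds the attribute value without the quotes.
    private static readonly Regex SrcPattern = new Regex(
        "(?<prefix>\\ssrc\\s*=\\s*)(?<quote>[\"'])(?<url>.*?)\\k<quote>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: rewrites every matched src value to an absolute URL.
    public static string ConvertToAbsolute(Uri baseUri, string html)
    {
        return SrcPattern.Replace(html, m =>
        {
            string url = m.Groups["url"].Value;
            Uri absolute;
            // Leave the attribute untouched if the value cannot be resolved.
            if (!Uri.TryCreate(baseUri, url, out absolute))
                return m.Value;

            string quote = m.Groups["quote"].Value;
            // AbsoluteUri keeps escapes such as %20 intact.
            return m.Groups["prefix"].Value + quote + absolute.AbsoluteUri + quote;
        });
    }
}
```

Note that this rewrites any src attribute it finds (script, iframe and so on), not only those on img tags, and it still has the usual weaknesses of regex-based HTML handling, such as unquoted attribute values.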

Regex is a bad idea here; it's better to use an HTML parser. That said, here is a regex I used for parsing links:

String body = "..."; //body of the page
Matcher m = Pattern.compile("(?im)(?:(?:(?:href)|(?:src))[ ]*?=[ ]*?[\"'])(((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))|((?:\\/{0,1}[\\w\\.]+)+))[\"']").matcher(body);
while(m.find()){
  String absolute = m.group(2);
  String relative = m.group(3);
}

It's a lot easier with a parser, though, and lighter on resources. Here is a link showing what I eventually wrote when I switched to a parser:

http://notetodogself.blogspot.com/2007/11/extract-links-using-htmlparser.html

It's probably not as helpful, since that was Java and you need C#.

I don't know what your program does, but I'm guessing this is the kind of thing you could do in 5 minutes from the command line in Linux. You can download Windows versions of many of the same tools (sed, for instance) and save yourself the hassle of writing all that code.
