C# regular expression to match img src="*" type URLs

I have a regex in C# that I'm using to match image tags and pull out the URL. My code works in most situations. The code below is meant to "fix" all relative image URLs by converting them to absolute URLs.

The issue is that the regex will not match the following:

<img height="150" width="202" alt="" src="../Image%20Files/Koala.jpg" style="border: 0px solid black; float: right;">

For example, it matches this one just fine:

<img height="147" width="197" alt="" src="../Handlers/SignatureImage.ashx?cid=5" style="border: 0px solid black;">

Any ideas on how to make it match would be great. I think the issue is the % but I could be wrong.

Regex rxImages = new Regex(" src=\"([^\"]*)\"", RegexOptions.IgnoreCase & RegexOptions.IgnorePatternWhitespace);
mc = rxImages.Matches(html);
if (mc.Count > 0)
{
    Match m = mc[0];
    string relitiveURL = html.Substring(m.Index + 6, m.Length - 7);
    if (relitiveURL.Substring(0, 4) != "http")
    {
        Uri absoluteUri = new Uri(baseUri, relitiveURL);
        ret += html.Substring(0, m.Index + 5);
        ret += absoluteUri.ToString();
        ret += html.Substring(m.Index + m.Length - 1, html.Length - (m.Index + m.Length - 1));
        ret = convertToAbsolute(URL, ret);
    }
}


Using RegEx to parse images in this way is a bad idea. See here for a good demonstration of why.

You can use an HTML parser such as the HTML Agility Pack to parse the HTML and query it using XPath syntax.
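
As a rough sketch of what that could look like for the question's use case, assuming the HtmlAgilityPack NuGet package is referenced (the class and method names below are made up for illustration, not part of the library):

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class ImageSrcRewriter
{
    // Hypothetical helper: rewrites every <img src="..."> to an absolute URL.
    public static string MakeImageUrlsAbsolute(Uri baseUri, string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
        {
            return html;
        }

        foreach (HtmlNode img in images)
        {
            string src = img.GetAttributeValue("src", "");
            Uri resolved;
            // Resolving an already absolute URL against baseUri leaves it absolute.
            if (src.Length > 0 && Uri.TryCreate(baseUri, src, out resolved))
            {
                img.SetAttributeValue("src", resolved.AbsoluteUri);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the parser hands you the attribute value directly, there is no index arithmetic on the match, and characters such as % in the URL need no special handling. Note that the parser may re-serialize the rest of the markup slightly differently from the input.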


First, I would try to skip all the manual parsing and use LINQ to HTML:

HDocument document = HDocument.Load("http://www.microsoft.com");

foreach (HElement element in document.Descendants("img"))
{
   Console.WriteLine("src = " + element.Attribute("src"));
}

If that didn't work, only then would I go back to manual parsing and I'm sure one of the fine gentle-people here has already posted a working regex for your needs.
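
If you do fall back to a regex, here is one possible sketch using only the .NET base class library. Two things stand out in the question's code: `RegexOptions.IgnoreCase & RegexOptions.IgnorePatternWhitespace` is a bitwise AND of two different flags, which evaluates to `RegexOptions.None` (combining flags needs `|`), and only `mc[0]` is ever examined. The class name `ImageUrlFixer` and the pattern are my own assumptions rather than a tested fix for every page, and the pattern only handles double-quoted `src` values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImageUrlFixer
{
    // Group 1: everything from "<img" up to and including the opening quote of src.
    // Group 2: the URL itself. Group 3: the closing quote.
    private static readonly Regex ImgSrc = new Regex(
        "(<img\\b[^>]*?\\ssrc\\s*=\\s*\")([^\"]*)(\")",
        RegexOptions.IgnoreCase);

    public static string ConvertToAbsolute(Uri baseUri, string html)
    {
        // Replace with a MatchEvaluator visits every match, not just the first one.
        return ImgSrc.Replace(html, m =>
        {
            string url = m.Groups[2].Value;
            Uri resolved;
            if (url.Length == 0 || !Uri.TryCreate(baseUri, url, out resolved))
            {
                return m.Value; // leave anything unresolvable untouched
            }
            // AbsoluteUri keeps escapes such as %20 in place.
            return m.Groups[1].Value + resolved.AbsoluteUri + m.Groups[3].Value;
        });
    }
}
```

With a base of `http://example.com/app/pages/page.aspx` (an invented URL), the Koala tag from the question should come back with `src="http://example.com/app/Image%20Files/Koala.jpg"`. The sketch uses `AbsoluteUri` rather than `ToString()` because, as far as I know, `Uri.ToString()` returns an unescaped form in which `%20` can turn back into a space.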


Regex is a bad idea; better to use an HTML parser. Here is a regex I used for parsing links, though:

String body = "..."; //body of the page
Matcher m = Pattern.compile("(?im)(?:(?:(?:href)|(?:src))[ ]*?=[ ]*?[\"'])(((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))|((?:\\/{0,1}[\\w\\.]+)+))[\"']").matcher(body);
while(m.find()){
  String absolute = m.group(2);
  String relative = m.group(3);
}

It's a lot easier with a parser, though, and better on resources. Here is a link showing what I eventually wrote when I switched to a parser:

http://notetodogself.blogspot.com/2007/11/extract-links-using-htmlparser.html

Probably not as helpful, since that was Java and you need C#.


I don't know what your program does, but I'm guessing this is an example of something you would do in 5 minutes from the command line in Linux. You can download Windows versions of many of the same tools (sed, for instance) and save yourself the hassle of writing all that code.
