Constructing a regular expression to wrap images with <a>

A web page contains lots of image elements:

<img src="myImage.gif" width="180" height="18" />

But they may not be well-formed: for example, the width or height attribute may be missing, and the tag may not be properly closed with /. The src attribute is always there.

I need a regular expression that wraps these with a hyperlink having href set to the src of the img.

<a href="myImage.gif" target="_blank"><img src="myImage.gif" width="180" height="18" /></a>

I can successfully locate the images using this regexp in this editor: http://gskinner.com/RegExr/:

<img src="([^<]*)"[^<]*>

But what is the next step?


A DOM-based method is best, but if that regex works (not easy to accomplish for general HTML input) to match the desired <img> elements, with the value of the src attribute captured in \1, then just replace the whole match (captured in \0) with:

<a href="\1" target="_blank">\0</a>

In Java, the backreferences in the replacement string are written $0 and $1; I'm not sure what language you're using, so adjust accordingly.

In Java, something like this would work:

String imgHrefed = str.replaceAll(
   // [^"]* (not [^<]*) stops the capture at the closing quote of src
   "<img src=\"([^\"]*)\"[^<]*>",
   "<a href=\"$1\" target=\"_blank\">$0</a>"
);

It wasn't clear from your question what to do with any other attributes that the <img> may have. The above replacement keeps them as they are. If you also want to rewrite them (i.e. you're not just wrapping <img> in an <a> anymore), then perhaps you want to rewrite to this:

<a href="\1" target="_blank"><img src="\1" width="180" height="18" /></a>


In JavaScript, use string.replace(), where $1 is the captured src:

// g flag: replace every image, not only the first
str.replace(/<img src="([^"]*)"[^<]*>/g,
    '<a href="$1" target="_blank"><img src="$1" width="180" height="18" /></a>')

Or better still, capture the whole image tag (the src is now $2, since it is the second capture group):

s.replace(/(<img src="([^"]*)"[^<]*>)/g, '<a href="$2" target="_blank">$1</a>')
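The same replacement can also be written without the extra capture group, because JavaScript's $& token stands for the whole match (the equivalent of \0 above). The following is only a minimal sketch: the helper name wrapImages and the sample string are made up for illustration, and it still assumes that src is the first attribute and is double-quoted.

```javascript
// Hypothetical helper: wraps every <img src="..."> tag in a link to its src.
// Assumes src is the first attribute and is double-quoted, as in the question.
function wrapImages(html) {
  // [^"]* stops the capture at the closing quote of src;
  // the g flag replaces every match, not only the first one.
  const imgPattern = /<img src="([^"]*)"[^<]*>/g;
  // $1 is the captured src, $& is the whole matched tag.
  return html.replace(imgPattern, '<a href="$1" target="_blank">$&</a>');
}

// Illustrative input: one self-closed tag and one unclosed tag.
const sample = '<img src="a.gif" width="180" height="18" /> <img src="b.gif">';
console.log(wrapImages(sample));
```

Both tags in the sample are wrapped, and their other attributes are left exactly as they were.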


In .NET the regex is basically the same as in JavaScript in most cases, but the notation of the surrounding code is slightly different.

    string imageHtmlSnippet = @"<img src=""myImage.gif"" width=""180"" height=""18"" />";
    string imageHtmlReplacement = @"<a href=""$1"" target=""_blank""><img src=""$1"" width=""180"" height=""18"" /></a>";

    // [^""]* (not [^<]*) stops the capture at the closing quote of src
    Regex findImages = new Regex(@"<img src=""([^""]*)""[^<]*>");

    string fixedHtmlSnippet = findImages.Replace(imageHtmlSnippet, imageHtmlReplacement);

HOWEVER - this regex will fail if the src isn't the first attribute on the tag. I don't have time to fix it because I should already be out the door :)

In truth, you should be looking at an HTML parsing library such as HtmlAgilityPack to parse it (if you are working in .NET):
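One way to relax the src-first restriction, shown here in JavaScript as a rough sketch rather than a complete fix: allow any attributes before src. The helper name wrapImagesAnyOrder is hypothetical. The pattern still assumes a double-quoted src and no > character inside attribute values, so a real parser remains the safer choice for arbitrary HTML.

```javascript
// Hypothetical helper: wraps <img> tags in a link even when src is not
// the first attribute. Assumes src is double-quoted and that no attribute
// value contains a '>' character.
function wrapImagesAnyOrder(html) {
  // <img\b            the tag name as a whole word
  // [^>]*?            any attributes before src (lazy)
  // \ssrc="([^"]*)"   whitespace, then the src value captured as $1
  // [^>]*>            the rest of the tag
  const imgPattern = /<img\b[^>]*?\ssrc="([^"]*)"[^>]*>/gi;
  // $& is the whole matched tag, so the other attributes are kept as-is.
  return html.replace(imgPattern, '<a href="$1" target="_blank">$&</a>');
}

console.log(wrapImagesAnyOrder('<img width="180" src="myImage.gif" height="18">'));
```

Requiring whitespace before src keeps the pattern from matching attributes that merely end in "src", such as data-src.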

  • http://runtingsproper.blogspot.com/search/label/HtmlAgilityPack