Constructing a regular expression to wrap images with <a>
A web page contains lots of image elements:
<img src="myImage.gif" width="180" height="18" />
But they may not be well-formed: for example, the width or height attribute may be missing, and the tag may not be properly closed with />. The src attribute is always there.
I need a regular expression that wraps these with a hyperlink having href set to the src of the img.
<a href="myImage.gif" target="_blank"><img src="myImage.gif" width="180" height="18" /></a>
I can successfully locate the images using this regexp in this editor (http://gskinner.com/RegExr/):
<img src="([^<]*)"[^<]*>
But what is the next step?
A DOM-based method is best, but if that regex works (not easy to accomplish for general HTML input) to match the desired <img> elements, with the value of the src attribute captured in \1, then just replace the whole match (captured in \0) with:
<a href="\1" target="_blank">\0</a>
In Java, the backreferences in the replacement string will be $0 and $1; I'm not sure what language you're using, so adjust accordingly.
In Java, though, something like this would work:
String imgHrefed = str.replaceAll(
    "<img src=\"([^\"]*)\"[^>]*>",  // [^\"]* stops at the closing quote of src
    "<a href=\"$1\" target=\"_blank\">$0</a>"
);
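For comparison, JavaScript spells the whole match `$&` rather than `$0`. The following is only a minimal sketch: the sample string is made up for illustration, and the pattern still assumes a double-quoted src that comes first in the tag.

```javascript
// Hypothetical sample input; the second tag is unclosed and has no width/height.
const html = '<p><img src="a.gif" width="180" height="18" /> <img src="b.png"></p>';

// [^"]* stops at the closing quote of src; [^>]* skips any remaining attributes.
// The g flag makes replace() process every image, not just the first one.
const imgPattern = /<img src="([^"]*)"[^>]*>/g;

// $1 is the captured src value, $& is the entire matched <img> tag.
const wrapped = html.replace(imgPattern, '<a href="$1" target="_blank">$&</a>');

console.log(wrapped);
```

Because `$&` reinserts the tag exactly as it was matched, any other attributes on the image are preserved untouched.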
It wasn't clear from your question what to do with any other attributes that the <img> may have. The above replacement keeps them as they are. If you also want to rewrite them (i.e. you're not just wrapping <img> in an <a> anymore), then perhaps you want to rewrite to this:
<a href="\1" target="_blank"><img src="\1" width="180" height="18" /></a>
In JavaScript, use string.replace() with $1 standing for the captured src value (the g flag replaces every image, not just the first):
str.replace(/<img src="([^"]*)"[^>]*>/g,
'<a href="$1" target="_blank"><img src="$1" width="180" height="18" /></a>')
Or better still, capture the whole image tag (now the src is $2 since it's in the second capture):
str.replace(/(<img src="([^"]*)"[^>]*>)/g, '<a href="$2" target="_blank">$1</a>')
In .NET the regex is basically the same as in JavaScript in most cases, but the notation of the surrounding code is slightly different:
using System.Text.RegularExpressions;
string imageHtmlSnippet = @"<img src=""myImage.gif"" width=""180"" height=""18"" />";
string imageHtmlReplacement = @"<a href=""$1"" target=""_blank""><img src=""$1"" width=""180"" height=""18"" /></a>";
// [^""]* stops at the closing quote of src, so $1 holds only the src value
Regex findImages = new Regex(@"<img src=""([^""]*)""[^>]*>");
string fixedHtmlSnippet = findImages.Replace(imageHtmlSnippet, imageHtmlReplacement);
HOWEVER: this regex will fail if the src isn't the first attribute on the tag. I don't have time to fix it because I should already be out the door :)
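One possible way to relax the "src must come first" restriction is to let the pattern skip over earlier attributes. This is a hedged JavaScript sketch, not a general HTML solution: the sample input is invented, and it still assumes double-quoted src values and no `>` inside attribute values.

```javascript
// Hypothetical input: src appears after another attribute in the first tag.
const sample = '<img width="180" src="first.gif" height="18" /><img src="second.gif">';

// [^>]*? lazily skips attributes that precede src without leaving the tag;
// \s requires whitespace right before src, so data-src="..." is not mistaken for it.
const anySrcPosition = /<img\b[^>]*?\ssrc="([^"]*)"[^>]*>/gi;

// $1 is the src value, $& is the whole matched tag.
const result = sample.replace(anySrcPosition, '<a href="$1" target="_blank">$&</a>');

console.log(result);
```

The same pattern should carry over to .NET largely unchanged, with `$0` in place of `$&` for the whole match.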
In truth you should be looking at an HTML parsing library such as HtmlAgilityPack to parse it (if you are working in .NET):
- http://runtingsproper.blogspot.com/search/label/HtmlAgilityPack
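The same parse-instead-of-regex idea applies in a browser without any library. Below is a minimal JavaScript sketch, assuming the code runs against a DOM Document; the function name `wrapImagesInLinks` is made up for illustration, and the sketch does not check whether an image is already inside a link.

```javascript
// Wraps every <img> that has a src attribute in <a href="src" target="_blank">.
// `doc` is expected to be a DOM Document (for example, `document` in a browser).
function wrapImagesInLinks(doc) {
  // querySelectorAll returns a static list, so replacing nodes in the loop is safe.
  const images = doc.querySelectorAll('img[src]');
  for (const img of images) {
    const link = doc.createElement('a');
    link.setAttribute('href', img.getAttribute('src'));
    link.setAttribute('target', '_blank');
    // Put the link where the image was, then move the image inside it.
    img.parentNode.replaceChild(link, img);
    link.appendChild(img);
  }
  return images.length; // number of images that were wrapped
}
```

In a page script this would be called as `wrapImagesInLinks(document);`. Since the parser has already dealt with attribute order, quoting and unclosed tags, none of the regex caveats apply.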