Read a web page with all images in Base64-Embedded format

In my scenario I want to download the HTML of a page (any page on the Internet) programmatically, but I also want all of the images in the HTML to be embedded as Base64 data rather than referenced by URL.

In other words, instead of:

<img src='/images/delete.gif' />

I want the downloaded html to look like this:

<img src="data:image/gif;base64,R0lGODl..." />

This way I don't need to go through the process of storing all the images in directories and so on.

Do any of you have an idea how this can be done? Or is there a plugin that does this efficiently?


Well, you'd need to:

  • Download the original HTML
  • Find each img element in the HTML (for instance using the HTML Agility Pack) and for each one:
    • If it's already using a data URL, ignore it
    • Otherwise:
      • Download the image
      • Encode it in Base64 using Convert.ToBase64String
      • Replace the original img tag with one using the Base64 version (either in the original string, or via a DOM representation)
  • Save the final HTML to disk
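
The steps above could be sketched roughly as follows. This is only a minimal sketch, assuming the HtmlAgilityPack NuGet package is installed. The names ImageInliner, InlineImages, ToDataUrl and FetchWithHttpClient are made up for illustration, and the fetch callback is a hypothetical hook so that the download step can be swapped out.

```csharp
using System;
using System.Net.Http;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is referenced

public static class ImageInliner
{
    // Builds a data URL such as "data:image/gif;base64,R0lGODl...".
    public static string ToDataUrl(byte[] bytes, string mimeType) =>
        "data:" + mimeType + ";base64," + Convert.ToBase64String(bytes);

    // Rewrites every <img src="..."> in the HTML to use a data URL.
    // 'fetch' is a hypothetical callback returning the image bytes and MIME type.
    public static string InlineImages(string html, Uri pageUri,
        Func<Uri, (byte[] Bytes, string MimeType)> fetch)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null) return html;

        foreach (var img in images)
        {
            // Decode entities such as &amp; that may appear inside the URL.
            string src = HtmlEntity.DeEntitize(img.GetAttributeValue("src", ""));
            if (src.Length == 0 || src.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
                continue; // empty or already embedded

            // Resolve relative URLs such as "/images/delete.gif" against the page URL.
            if (!Uri.TryCreate(pageUri, src, out Uri absolute))
                continue; // leave malformed URLs untouched

            var (bytes, mimeType) = fetch(absolute);
            img.SetAttributeValue("src", ToDataUrl(bytes, mimeType));
        }
        return doc.DocumentNode.OuterHtml;
    }

    // One possible fetcher built on HttpClient; it blocks to keep the serial version simple.
    public static (byte[] Bytes, string MimeType) FetchWithHttpClient(HttpClient client, Uri uri)
    {
        using var response = client.GetAsync(uri).GetAwaiter().GetResult();
        response.EnsureSuccessStatusCode();
        byte[] bytes = response.Content.ReadAsByteArrayAsync().GetAwaiter().GetResult();
        // Fall back to a generic type if the server sends no Content-Type header.
        string mime = response.Content.Headers.ContentType?.MediaType ?? "application/octet-stream";
        return (bytes, mime);
    }
}
```

A caller would download the page HTML first, then call something like ImageInliner.InlineImages(html, pageUri, uri => ImageInliner.FetchWithHttpClient(client, uri)) and write the result out with File.WriteAllText. Only img elements are handled here; images referenced from CSS would need separate treatment.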

Are any of these steps causing you a particular problem? You could potentially make it quicker by downloading the images in parallel, but I'd get a serial version working first.
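
For the parallel variant, one possible shape is to start all the downloads and wait for them together with Task.WhenAll. This is just a sketch: ParallelImageFetcher and FetchAllAsync are hypothetical names, and the download callback is an assumption standing in for whatever HTTP call you use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelImageFetcher
{
    // Downloads every distinct URL concurrently and returns a URL -> bytes map.
    // 'download' is a hypothetical callback, for example HttpClient.GetByteArrayAsync.
    public static async Task<Dictionary<Uri, byte[]>> FetchAllAsync(
        IEnumerable<Uri> imageUris, Func<Uri, Task<byte[]>> download)
    {
        // Pages often reuse the same image, so fetch each URL only once.
        var distinct = imageUris.Distinct().ToList();

        // Start all downloads first, then wait for them as a group.
        var tasks = distinct.Select(download).ToList();
        byte[][] results = await Task.WhenAll(tasks);

        // Task.WhenAll preserves the order of the input tasks.
        var map = new Dictionary<Uri, byte[]>();
        for (int i = 0; i < distinct.Count; i++)
            map[distinct[i]] = results[i];
        return map;
    }
}
```

With an HttpClient instance named client, the call might look like await ParallelImageFetcher.FetchAllAsync(uris, uri => client.GetByteArrayAsync(uri)). Note that this starts every download at once, so for pages with a lot of images you would probably want to limit the concurrency.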


Instead of using an HTML page with images as Base64-encoded strings in the src attribute, you might consider using the MHTML format. Most browsers support the format, and it embeds all external resources (including images).

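// Assumes COM references to "Microsoft CDO for Windows 2000 Library" and the ADODB types it uses (Windows only).
var msg = new CDO.MessageClass();
msg.MimeFormatted = true;
// Fetch the page and pack it, together with its external resources, into one MIME message.
msg.CreateMHTMLBody("http://www.google.com", CDO.CdoMHTMLFlags.cdoSuppressNone, "", "");
var stream = msg.GetStream();
// Read the whole MHTML document as text.
var mhtml = stream.ReadText(stream.Size);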


Use a regular expression (regex) to extract URLs from img tags, translate them to absolute URLs using the Uri class, then use WebClient to download the target images. After that it's just a case of using Convert.ToBase64String to produce the Base64.
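
A rough sketch of that approach is below. The names RegexImageInliner, InlineImages and GuessMimeType are hypothetical, the download callback is an assumption so the network call can be swapped out, and guessing the MIME type from the file extension is a simplification. A regex is also not a real HTML parser, so unquoted or unusual markup is simply left alone.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class RegexImageInliner
{
    // Group 1: everything up to "src=", group 2: the quote character, group 3: the URL.
    private static readonly Regex ImgSrc = new Regex(
        @"(<img\b[^>]*?\ssrc\s*=\s*)(['""])(.*?)\2",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // 'download' is a hypothetical callback, for example WebClient.DownloadData.
    public static string InlineImages(string html, Uri pageUri, Func<Uri, byte[]> download)
    {
        return ImgSrc.Replace(html, match =>
        {
            string src = match.Groups[3].Value;
            if (src.Length == 0 || src.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
                return match.Value; // empty or already embedded

            // Translate the (possibly relative) URL to an absolute one.
            if (!Uri.TryCreate(pageUri, src, out Uri absolute))
                return match.Value; // leave malformed URLs untouched

            string base64 = Convert.ToBase64String(download(absolute));
            string dataUrl = "data:" + GuessMimeType(absolute) + ";base64," + base64;

            // Keep the original tag prefix and quote character.
            string quote = match.Groups[2].Value;
            return match.Groups[1].Value + quote + dataUrl + quote;
        });
    }

    // Hypothetical helper: guesses the MIME type from the file extension.
    private static string GuessMimeType(Uri uri)
    {
        switch (Path.GetExtension(uri.AbsolutePath).ToLowerInvariant())
        {
            case ".gif": return "image/gif";
            case ".png": return "image/png";
            case ".jpg":
            case ".jpeg": return "image/jpeg";
            case ".svg": return "image/svg+xml";
            default: return "application/octet-stream";
        }
    }
}
```

With a WebClient instance named client, the call could be RegexImageInliner.InlineImages(html, pageUri, client.DownloadData). WebClient is marked obsolete in recent .NET versions in favour of HttpClient, so on newer runtimes you may prefer to pass a callback built on that instead.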
