开发者

Using htmlparse to replace image and css source urls in a html file (python)

I'm trying to write a script that will download a webpage, including all the images and style sheets - i.e. so a locally hosted version looks identical to the original.

Originally I was just downloading the images, but I realize now that I have to (of course) edit the html source so that the img src actually points to the locally hosted image. As I have to change the html source anyway, I decided it was better if I just updated the locally hosted file to point to the images and style sheets hosted remotely.

So this brings me to my question, can I use htmlparse to search for the style sheets and image tags and then replace the links to them with the updated versions?

I've had a look at the htmlparse documentation, but I'm still pretty new to python so some parts unclear. I thought it might be possible to use:

HTMLParser.handle_data(data)
This method is called to process arbitrary data. It is intended to be overridden by a 
derived class; the base class implementation does nothing.

and add my own replacing class to it? Or am I on totally the wrong lines?

Another option of course would be to use regular expressions to search for the tags and replace the text after them, but this could get pretty complicated so I was wondering if htmlparse wo开发者_运维知识库uld provide a simpler solution.

I realize that beautiful soup would be the ideal solution, but I will be distributing the finished tool around my company, so I can't use any third party modules unfortunately. Similarly I'd like the tool to be platform independent, so unfortunately cannot use wget.

Thanks for any input =)


You can use any modules to your heart's content if you pack the Python program to self-contained binary (not even Python runtime necessary) with this: http://www.pyinstaller.org/

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜