
Upload images from a web page

I want to implement a feature similar to this: http://www.tineye.com/parse?url=yahoo.com - allow the user to upload images from any web page.

The main problem for me is that it takes too much time for web pages with a large number of images.

I'm doing this in Django (using curl or urllib) according to the following scheme:

  1. Grab the HTML of the page (takes about 1 sec for big pages):

    file = urllib.urlopen(requested_url)
    html_string = file.read()
    
  2. Parse it with an HTML parser (BeautifulSoup), looking for img tags and collecting their src attributes into a list (also takes about 1 sec for big pages; a simplified sketch of steps 1 and 2 appears after the code below).

  3. Check the sizes of all images in my list and, if they are big enough, return them in a JSON response (this takes very long, about 15 sec, when there are about 80 images on a web page). Here's the code of the function:


import urllib
from PIL import ImageFile

def get_image_size(uri):
    # Read only the first 1 KB and let PIL's incremental parser
    # try to extract the image dimensions from it.
    f = urllib.urlopen(uri)
    try:
        data = f.read(1024)
    finally:
        f.close()
    if not data:
        return None
    p = ImageFile.Parser()
    p.feed(data)
    if p.image:
        return p.image.size
    # not an image (or the header didn't fit into the first 1 KB)
    return None

As you can see, I'm not loading the full image to get its size, only 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).
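
For reference, steps 1 and 2 look roughly like this (a simplified sketch; the actual code differs in details, and the helper name get_image_urls is just illustrative):

import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

def get_image_urls(requested_url):
    # Step 1: grab the HTML of the page
    html_string = urllib.urlopen(requested_url).read()
    # Step 2: parse it and collect the src of every img tag,
    # resolving relative URLs against the page URL
    soup = BeautifulSoup(html_string)
    return [urlparse.urljoin(requested_url, img['src'])
            for img in soup.findAll('img') if img.get('src')]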

So how can I make it work faster?

Maybe there is a way to avoid making a request for every single image?

Any help will be highly appreciated.

Thanks!


I can think of a few optimisations:

  1. parse the HTML as you are reading it from the stream (see the sketch below)
  2. use a SAX-style (event-driven) parser, which goes well with the point above
  3. use HEAD requests to get the sizes of the images
  4. put your image URLs in a queue, then use a few threads to connect and get the file sizes (see the sketch after the HEAD example)
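
A rough sketch of points 1 and 2, using Python 2's built-in event-driven HTMLParser and feeding it chunks as the page is downloaded (class and function names are only illustrative):

from HTMLParser import HTMLParser
import urllib

class ImgSrcCollector(HTMLParser):
    # Event-driven parser: collects the src attribute of every <img> tag
    def __init__(self):
        HTMLParser.__init__(self)
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            attrs = dict(attrs)
            if 'src' in attrs:
                self.srcs.append(attrs['src'])

def collect_img_srcs(url):
    # Feed the parser chunk by chunk instead of reading the whole page first
    parser = ImgSrcCollector()
    f = urllib.urlopen(url)
    try:
        chunk = f.read(8192)
        while chunk:
            parser.feed(chunk)
            chunk = f.read(8192)
    finally:
        f.close()
    parser.close()
    return parser.srcs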

An example of a HEAD request:

$ telnet m.onet.pl 80
Trying 213.180.150.45...
Connected to m.onet.pl.
Escape character is '^]'.
HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1
host: m.onet.pl

HTTP/1.0 200 OK
Server: nginx/0.8.53
Date: Sat, 09 Apr 2011 18:32:44 GMT
Content-Type: image/jpeg
Content-Length: 37545
Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT
Expires: Sat, 16 Apr 2011 18:32:44 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes
Age: 6575
X-Cache: HIT from emka1.m10r2.onet
Via: 1.1 emka1.m10r2.onet:80 (squid)
Connection: close

Connection closed by foreign host.
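
And a rough sketch of points 3 and 4 in Python 2 (httplib, Queue and threading; the names, worker count and simplified URL handling are only illustrative):

from Queue import Queue, Empty
import httplib
import threading
import urlparse

def head_content_length(uri):
    # Issue a HEAD request and return the Content-Length in bytes (or None)
    parts = urlparse.urlparse(uri)
    conn = httplib.HTTPConnection(parts.netloc)
    try:
        conn.request("HEAD", parts.path or "/")
        length = conn.getresponse().getheader("Content-Length")
    finally:
        conn.close()
    return int(length) if length is not None else None

def get_sizes(image_urls, num_workers=8):
    # Fill a queue with URLs and let a few threads drain it in parallel
    queue = Queue()
    for url in image_urls:
        queue.put(url)
    sizes = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = queue.get_nowait()
            except Empty:
                return
            try:
                size = head_content_length(url)
            except Exception:
                size = None
            with lock:
                sizes[url] = size

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sizes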


You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).

Here's a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.

|milo|laurie|¥ cat test.py
import urllib2
uri = "http://download.thinkbroadband.com/1GB.zip"

def get_file_size(uri):
    file = urllib2.urlopen(uri)
    content_header, = [header for header in file.headers.headers if header.startswith("Content-Length")]
    _, str_length = content_header.split(':')
    length = int(str_length.strip())
    return length

if __name__ == "__main__":
    get_file_size(uri)
|milo|laurie|¥ time python2 test.py
python2 test.py  0.06s user 0.01s system 35% cpu 0.196 total
0
