
How can I extract images from a site that I'm linking to?

If you're familiar with Reddit, you'll know how all of their posts containing pictures get a small thumbnail preview beside the title of the submission. How does Reddit go about doing that? Does it just check to see if the link ends with .jpg, .png, .bmp, etc.?


reddit will try to pull a thumbnail from any source, not just an image URL. It does this first by applying set rules for specific sites, and second by running one generic process that retrieves thumbnails for unknown URLs. All of this runs as an automated periodic task.

One of the (many) benefits of reddit is that the source code is open, and if you understand Python, you should check out /r2/lib/scraper.py for a more detailed look at how this process works.

Also, while StackOverflow is a great place to have programming-related questions answered, you might also want to check out reddit's own /r/redditdev for information on reddit development.

How can I extract images from a site that I'm linking to?


  1. Indeed, if the URL contains .jpg, .png, etc., use that.
  2. If the site is a popular domain (flickr.com, youtube.com, amazon.com, etc.), have a set of predefined rules to extract something you know will be relevant (be it the featured image, YouTube thumbnail, Amazon product image, etc.).
  3. Otherwise, if all you have to work with is some HTML, you'll have to dig it out yourself. You could choose the first image on the page, the biggest by size, or even the one you've algorithmically determined to be the most relevant (e.g. relatively big, inside what you think is the main body content).
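The three steps above can be sketched roughly like this. This is only an illustration, not reddit's actual code: the names `pick_thumbnail`, `SITE_RULES`, `youtube_rule`, and the `generic_scraper` callback are all made up for the example, and the YouTube thumbnail URL pattern is an assumption you should verify before relying on it.

```python
from urllib.parse import urlparse, parse_qs

# Extensions treated as "this link is already an image" (step 1).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp")


def is_direct_image(url):
    """True if the URL path (ignoring the query string) ends with an image extension."""
    return urlparse(url).path.lower().endswith(IMAGE_EXTENSIONS)


def youtube_rule(url):
    """Hypothetical site rule: build a thumbnail URL from the ?v= video id.

    The img.youtube.com pattern is an assumption made for this sketch.
    """
    video_id = parse_qs(urlparse(url).query).get("v", [None])[0]
    if video_id is None:
        return None
    return "https://img.youtube.com/vi/%s/default.jpg" % video_id


# Step 2: predefined rules keyed by domain (only one shown here).
SITE_RULES = {"youtube.com": youtube_rule}


def pick_thumbnail(url, generic_scraper=None):
    """Try the direct-image check, then a site rule, then a generic fallback."""
    if is_direct_image(url):
        return url

    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    rule = SITE_RULES.get(host)
    if rule is not None:
        thumbnail = rule(url)
        if thumbnail is not None:
            return thumbnail

    # Step 3: hand off to whatever HTML-based scraper the caller supplies.
    if generic_scraper is not None:
        return generic_scraper(url)
    return None
```

Keeping the site rules in a dictionary means adding support for a new domain is just one more entry, and the generic fallback stays separate from the special cases.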

If you have to resort to the last option, one technique I'd recommend is to extract multiple images and A/B test them to find the one with the best click-through rate. Given enough traffic, that should settle on the image that performs best.
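As a very rough sketch of the selection step in such a test, assuming you already record impressions and clicks per candidate image somewhere (the `stats` mapping and both function names are hypothetical):

```python
def click_through_rate(impressions, clicks):
    """Clicks divided by impressions; 0.0 when the image was never shown."""
    return clicks / impressions if impressions else 0.0


def best_candidate(stats):
    """Return the image URL with the highest click-through rate.

    `stats` maps image URL -> (impressions, clicks).
    """
    return max(stats, key=lambda url: click_through_rate(*stats[url]))
```

A real test would also want a minimum number of impressions per image before trusting the numbers; this sketch leaves that out.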


You can parse the page's HTML and read the src attribute of each <img> tag.
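Here is a minimal sketch of that using only Python's standard library. The `ImageCollector` class and the helper names are made up for this example, and "largest" here just means the biggest width times height declared in the tag, which many pages omit, so treat it as a heuristic.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def _to_int(value):
    """Parse a width/height attribute; return 0 if missing or not a plain integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


class ImageCollector(HTMLParser):
    """Collects the src of every <img> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return
        area = _to_int(attr_map.get("width")) * _to_int(attr_map.get("height"))
        self.images.append({"url": urljoin(self.base_url, src), "area": area})


def extract_images(html, base_url):
    """Return a list of {'url', 'area'} dicts for the <img> tags in the HTML."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.images


def largest_image(html, base_url):
    """Return the URL of the image with the largest declared area, or None."""
    images = extract_images(html, base_url)
    if not images:
        return None
    return max(images, key=lambda image: image["area"])["url"]
```

You would fetch the page yourself and pass the HTML text plus the page URL, so that relative src values resolve correctly.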
