How would you pick the best image from a webpage in a crawler?
If you were given a random webpage from the internet and had only the HTML source, what method would you use to find the single image that best describes that page? Assume there are no meta tags or other hints.
Facebook does something similar when you post a link, but it presents a choice of n images rather than actually picking one, unless the page has the appropriate meta tags.
Try to analyze the structure of the page. The majority of web pages roughly have a header, a content area and a footer. The content area is the most likely to contain images related to the subject of the page, so that's what you're looking for.
Find the content area
Most content areas are div elements with an ID or class named content, so that's always a good first guess. There may be alternative descriptors of the content element, so you'll need to do some research to find common patterns.
The content area will also contain multiple h1 or h2 headings in most cases, so that's another indicator to look for.
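The "look for a div whose id or class hints at content" guess above can be sketched with Python's standard-library HTML parser. The hint words ("content", "main") are heuristics taken from the answer, not anything guaranteed by the page:

```python
from html.parser import HTMLParser

class ContentFinder(HTMLParser):
    """Flags pages that have a div whose id or class suggests
    a main content area. Purely heuristic: 'content' and 'main'
    are common naming conventions, not a standard."""
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        # Some attributes may be valueless (None), so guard with `or ""`.
        hint = " ".join(v or "" for v in (attrs.get("id"), attrs.get("class"))).lower()
        if "content" in hint or "main" in hint:
            self.found = True

finder = ContentFinder()
finder.feed('<div id="header"></div><div class="main-content"><h1>Post</h1></div>')
print(finder.found)  # True
```

A real crawler would also record which div matched, so later image extraction can be restricted to that subtree.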
Find the header and footer
Another approach is to identify the header and footer. Headers usually contain hints such as the site logo: an image, a telltale CSS class name, or a link to the root of the site. Footers are most likely to contain things like copyright statements.
You can also find the header and footer by analyzing the links on the page. Most internal links will be in the header and footer, while the content has relatively more outgoing links, if any.
Once you have the header and footer, the content is usually in between :)
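The internal-vs-outgoing link distinction mentioned above can be sketched with a small classifier; the example domains are made up for illustration:

```python
from urllib.parse import urlparse

def is_internal(href: str, page_domain: str) -> bool:
    """Relative links and links to the same domain count as internal.
    Headers/footers tend to cluster internal links; content areas
    have relatively more outgoing ones."""
    netloc = urlparse(href).netloc
    return netloc == "" or netloc == page_domain

links = ["/about", "https://example.com/contact", "https://other.com/source"]
print([is_internal(h, "example.com") for h in links])  # [True, True, False]
```

Scanning the page top-to-bottom and finding where internal links thin out gives a rough boundary between header and content.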
Find an image
Once you've identified the content area, the first image is usually your best pick. You should, however, ignore images with a small width and/or height, as these will likely be decorative images.
You could also double-check the images against any included CSS files, to make sure you're not picking an image that's related to the design of the page.
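The small-image filter described above might look like this; the 100-pixel threshold is an arbitrary assumption to tune against real pages:

```python
def plausible_content_image(attrs: dict, min_size: int = 100) -> bool:
    """Skip images whose declared width or height is tiny, since those
    are likely decorative (spacers, icons, bullets). Images with no
    size attributes pass; measuring them would require a download."""
    try:
        w = int(attrs.get("width", min_size))
        h = int(attrs.get("height", min_size))
    except ValueError:
        return True  # non-numeric sizes (e.g. "100%"): give benefit of the doubt
    return w >= min_size and h >= min_size

print(plausible_content_image({"width": "16", "height": "16"}))   # False
print(plausible_content_image({"width": "640", "height": "480"})) # True
```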
Fall back to an educated guess
If you cannot reliably guess the content area of the page, just use the biggest image on the page, as egrunin suggested. Again, you can check this image against the CSS files, to rule out any design-related images.
In the fall-back case, you could log the URL and review those pages to improve your image detection algorithms.
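The biggest-image fallback is a one-liner once sizes are known; here the sizes are assumed to come from attributes or downloaded headers:

```python
def biggest_image(images):
    """Fallback heuristic: pick the image with the largest area.
    `images` is a list of (url, width, height) tuples."""
    return max(images, key=lambda img: img[1] * img[2])[0]

imgs = [("spacer.gif", 1, 1), ("photo.jpg", 800, 600), ("icon.png", 32, 32)]
print(biggest_image(imgs))  # photo.jpg
```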
This is best-guess stuff, but:
- ignoring anything hosted in another domain will eliminate most ads
- once you've grabbed the images, you can get their size; the biggest is probably the one to use.
- images that are inside <a> elements pointing to the root of the domain are probably logos. Example: the SO logo on this page is inside <a href="/"></a>.
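The logo rule in the last bullet can be implemented by tracking whether the parser is inside a root-pointing link (a sketch using the standard-library parser; anchors don't nest in valid HTML, so a simple flag suffices):

```python
from html.parser import HTMLParser

class LogoDetector(HTMLParser):
    """Collects <img> tags nested inside <a href="/">, which are
    likely site logos rather than content images."""
    def __init__(self):
        super().__init__()
        self.in_root_link = False
        self.logos = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.in_root_link = attrs.get("href") == "/"
        elif tag == "img" and self.in_root_link:
            self.logos.append(attrs.get("src"))

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_root_link = False

d = LogoDetector()
d.feed('<a href="/"><img src="logo.png"></a><img src="photo.jpg">')
print(d.logos)  # ['logo.png']
```

Anything collected in logos can then be excluded from the candidate list.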
Edited to add:
It's true that large sites use auxiliary servers for their images. But you could probably write a couple of simple parsing rules that cover 80% of cases, picking out hosts like g-ecx.images-amazon.com and static.ak.fbcdn.net as non-ad servers.
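Such a rule could be a simple whitelist check layered on the same-domain filter; the whitelist below contains only the two hosts named above and would need curating in practice:

```python
# Hypothetical whitelist of known image-CDN hosts that should NOT be
# treated as ad servers despite being on a different domain.
SAFE_IMAGE_HOSTS = {"g-ecx.images-amazon.com", "static.ak.fbcdn.net"}

def same_site_or_whitelisted(img_host: str, page_host: str) -> bool:
    """Keep images hosted on the page's own domain or a known CDN."""
    return img_host == page_host or img_host in SAFE_IMAGE_HOSTS

print(same_site_or_whitelisted("static.ak.fbcdn.net", "www.facebook.com"))  # True
print(same_site_or_whitelisted("ads.doubleclick.net", "example.com"))       # False
```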
If you find an og:image meta property, you can use it quite safely: it's part of the Open Graph specification, which is used to provide images for Facebook links.
Example of format:
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>
...
</head>
...
</html>
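Extracting og:image from markup like the example above takes only a few lines with the standard-library parser:

```python
from html.parser import HTMLParser

class OGImageParser(HTMLParser):
    """Pulls the image URL from <meta property="og:image" content="..."/>."""
    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image":
            self.og_image = attrs.get("content")

p = OGImageParser()
p.feed('<head><meta property="og:image" '
       'content="http://ia.media-imdb.com/rock.jpg"/></head>')
print(p.og_image)  # http://ia.media-imdb.com/rock.jpg
```

Since the question assumes no meta tags, this would just be the first check before falling back to the structural heuristics.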
Well, I would look for divs/spans/h1 elements with a class or id like "logo" or "top". Almost every page has its logo at the top of the page. Just look at the Stack Overflow logo on this page :)
I do it this way in my crawler and it works fine :)
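This id/class heuristic can be expressed as a single regex scan; note that substring matching on "top" will also hit words like "laptop", so a real crawler would want word-boundary-aware matching:

```python
import re

# Heuristic from the answer above: elements whose id/class mentions
# "logo" or "top" usually mark the site header, so images inside
# them can be skipped when picking a representative image.
LOGO_HINT = re.compile(
    r'(?:id|class)\s*=\s*["\'][^"\']*(?:logo|top)[^"\']*["\']', re.I)

print(bool(LOGO_HINT.search('<div id="logo"><img src="logo.png"></div>')))     # True
print(bool(LOGO_HINT.search('<div class="article"><img src="pic.jpg"></div>'))) # False
```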