How can I distinguish between graphics and photographs?
I have a directory of images, photos, web graphics, logos, etc... these are all pulled from the web. There are .jpg, .gif, and .png files.
I would like to extract images that are of real things (keep photos and remove graphics). I'm not trying to get actual / original photographs, just images of real life stuff versus computer made graphics (I'm not sure how to say this more clearly). Almost all of these images have been manipulated and exif information will not be available.
A large (even very large) margin of error is acceptable.
I've already:
- removed images with low color counts using imagecolorstotal()
- removed images that have large height to width ratios, and vice versa (a ratio of 3+ works shockingly well).
- removed images that are smaller than a certain dimension (50-75px is good)
I'm thinking about removing images with histogram values concentrated around certain colors, rather than a smooth or distributed curve. I have not attempted this yet.
How else can I improve this filtering of images to extract (mostly) real photos? I'd prefer to use PHP but that is not required.
UPDATE: It turns out that for my application, the first three things I had already tried were a solid 80% solution. Further filtering can be done using some of the answers below.
The function exif_read_data can provide information about the camera used, although the available fields differ greatly from camera to camera. This won't be a perfect solution, but it should add to what you are already using.
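As a minimal sketch of that idea: the helper names below (exif_names_camera, file_has_camera_exif) are hypothetical, and the check assumes the exif extension is installed and that a camera-made file carries a Make or Model tag. Since most of the images in the question have had their EXIF data stripped, a hit is a useful positive signal, but a miss says very little.

```php
<?php
// Hypothetical helper: true when an EXIF array names a camera make or model.
function exif_names_camera($exif): bool
{
    if (!is_array($exif)) {
        return false; // exif_read_data() returns false when no usable header is found
    }
    return !empty($exif['Make']) || !empty($exif['Model']);
}

// Hypothetical wrapper around exif_read_data(); assumes the exif extension.
function file_has_camera_exif(string $path): bool
{
    if (!function_exists('exif_read_data')) {
        return false; // extension not available, so no signal either way
    }
    // @ suppresses the warning raised for formats without EXIF support (GIF, PNG).
    return exif_names_camera(@exif_read_data($path));
}
```

Images that pass this check could skip the other filters entirely, while everything else goes through them as before.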
Entropy would be a good metric to differentiate "real" photos from computer graphics. It really is just a more structured version of your histogram idea. Entropy is given by
H(X) = -sum(p[i] * log2(p[i]))
where p[i] is the probability of the ith color. p[i] is pretty much the histogram value at each color (the fraction, from 0.0 to 1.0, of pixels that have color i). The more distributed the colors are, the higher H(X) will be. If the pixels are only distributed among a few colors, H(X) will be small.
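To make the formula concrete, here is a small sketch that works on plain pixel counts rather than a real image. The function name histogram_entropy and the sample histograms are made up for illustration.

```php
<?php
// Hypothetical helper: Shannon entropy (in bits) of a color histogram.
// $counts holds, for each distinct color, the number of pixels with that color.
function histogram_entropy(array $counts): float
{
    $total = array_sum($counts);
    if ($total <= 0) {
        return 0.0; // empty histogram: nothing to measure
    }
    $entropy = 0.0;
    foreach ($counts as $count) {
        if ($count <= 0) {
            continue; // a color with no pixels contributes nothing
        }
        $p = $count / $total;          // p[i]: fraction of pixels with this color
        $entropy -= $p * log($p, 2);   // accumulate -p[i] * log2(p[i])
    }
    return $entropy;
}

// A flat graphic: every pixel has the same color.
echo histogram_entropy(['#ffffff' => 1000]), "\n";
// Four colors in equal shares.
echo histogram_entropy([250, 250, 250, 250]), "\n";
```

By the formula, the single-color histogram should come out at 0 bits and the four equal colors at 2 bits. A photograph with thousands of distinct colors would score far higher than either.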
Note that compressed filesize is directly related to entropy (higher entropy, higher filesize), so the suggestion in another answer to use filesize could be an indirect way of getting at this.
Below is the code that I've used and the reasoning behind why I've applied each filter. I've done a lot of testing on these functions and settings, but you'll still want to run some tests to optimize these settings for your set of images.
I've used IMagick ( the PHP wrapper for ImageMagick ) to do the work when calculating the following image attributes:
$Image = new Imagick( $image_path );
$height = $Image->getImageHeight();
$width = $Image->getImageWidth();
$histogram = $Image->getImageHistogram();
$num_colors = $Image->getImageColors();
Height to Width Ratio
Filtering images by height to width ratio eliminates a large percentage of junk. The closer you set the filter to 1:1, the more junk it removes, but you'll start to filter out lots of good images too. This is one of the most valuable filters I've applied:
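// max height to width ratio we allow on images before we junk them
$max_size_ratio = 3;
// longer side divided by shorter side, so both very tall and very wide images are caught
$size_ratio = max( $height / $width, $width / $height );
if( $size_ratio > $max_size_ratio )
    throw new Exception( "image height to width ratio exceeded max of $max_size_ratio" );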
Number of Colors
Filtering out images with fewer than 32 colors generally removes only junk images. However, I also lost lots of black and white diagrams and drawings.
// min number of colors allowed before junking
$min_colors = 32;
if( $num_colors < $min_colors )
throw new Exception( "image had less than $min_colors colors" );
Min Height and Width
I apply two size checks: an absolute minimum that both dimensions must meet, and a slightly larger minimum that at least one dimension must meet. Together they helped filter out some junk.
// min height and width in pixels both dimensions must meet
$min_height_single = 50;
$min_width_single = 50;
if(
    $width < $min_width_single
    || $height < $min_height_single
)
    throw new Exception( "height or width were smaller than absolute minimum" );
// min height and width in pixels at least one dimension must meet
$min_height = 75;
$min_width = 75;
if(
    $width < $min_width
    && $height < $min_height
)
    throw new Exception( "height and width were both smaller than minimum combo" );
Image Color Entropy using the Image Histogram
Finally, I calculate image color entropy ( as suggested by @Jason in his answer ) for every image in my system. When I'm choosing images to display, I generally order them ranked by this entropy in descending order. The higher the entropy, the more likely an image is to be a photograph of a real thing, versus a graphic. There are three major problems with this method:
- Highly stylized graphics tend to have higher entropies because of the great color depth and color variation.
- Photographs that have been photoshopped to have solid backgrounds and studio backgrounds tend to have lower entropies because of the dominant solid color.
- This did not work well as an absolute filter because of the wide variation between images in my set, their file types, color depths, etc.
Where it is extremely useful, however, is in choosing the best image out of a small subset within my whole set. An example would be to choose which image to display as the primary image out of all the images found on one webpage.
Here is the function I use to calculate image entropy:
function set_image_entropy()
{
    // create Imagick object and get image data
    $Image = new Imagick( $this->path );
    $histogram = $Image->getImageHistogram();
    $height = $Image->getImageHeight();
    $width = $Image->getImageWidth();
    $num_pixels = $height * $width;
    // calculate the entropy term for each color in the image
    $entropies = array();
    foreach( $histogram as $color )
    {
        // fraction of all pixels that have this color
        $color_count = $color->getColorCount();
        $color_percentage = $color_count / $num_pixels;
        $entropies[] = $color_percentage * log( $color_percentage, 2 );
    }
    // calculate total image color entropy
    $entropy = ( -1 ) * array_sum( $entropies );
    return $entropy;
}
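The ranking use described above (picking the primary image from the handful found on one page) could be sketched roughly like this. The function name pick_primary_image and the sample scores are hypothetical, and array_key_first assumes PHP 7.3 or later.

```php
<?php
// Hypothetical helper: given entropy scores keyed by image path,
// return the path with the highest entropy, or null for an empty set.
function pick_primary_image(array $entropy_by_path): ?string
{
    if (count($entropy_by_path) === 0) {
        return null;
    }
    arsort($entropy_by_path); // sort by score, highest first, keeping the keys
    return (string) array_key_first($entropy_by_path);
}

// Made-up scores for three images found on one page.
$scores = [
    'logo.png'   => 1.2,
    'photo.jpg'  => 7.9,
    'banner.gif' => 3.4,
];
echo pick_primary_image($scores), "\n";
```

Because the scores are only compared with each other, the wide variation in entropy across file types and color depths matters much less here than it does for an absolute threshold.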
Graphics and line drawings are usually smaller when stored as PNG, while photos are smaller when stored as JPEG. Store each image in both formats, and make an educated guess based on the file sizes.
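One possible way to try this without writing files to disk, assuming the GD extension is available: encode the image into memory in both formats and compare the byte counts. The function names and the JPEG quality of 85 are assumptions for illustration, and the result will shift with the quality setting.

```php
<?php
// Hypothetical decision rule: whichever format encodes smaller wins.
function guess_from_sizes(int $png_bytes, int $jpeg_bytes): string
{
    return $png_bytes < $jpeg_bytes ? 'graphic' : 'photo';
}

// Hypothetical wrapper: re-encode an image with GD and compare sizes.
// Returns null when the file cannot be read or decoded.
function guess_by_reencoding(string $path, int $jpeg_quality = 85): ?string
{
    $data = @file_get_contents($path);
    if ($data === false || $data === '') {
        return null;
    }
    $im = @imagecreatefromstring($data);
    if ($im === false) {
        return null;
    }

    // Capture the PNG encoding in an output buffer instead of a file.
    ob_start();
    imagepng($im);
    $png_bytes = strlen(ob_get_clean());

    // Same for the JPEG encoding.
    ob_start();
    imagejpeg($im, null, $jpeg_quality);
    $jpeg_bytes = strlen(ob_get_clean());

    return guess_from_sizes($png_bytes, $jpeg_bytes);
}
```

Given the large margin of error the question allows, this is probably best treated as one more vote alongside the ratio, size, and color-count filters rather than as a filter of its own.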