Extracting the *relevant* image from a web page

I have a couple of Twitter-powered news aggregation websites. I have been planning to add images from the articles that I find on Twitter.

If I download the page and extract images using the <img> tag, I get a bunch of images, not all of them relevant to the article. For example, images of buttons, icons, ads, etc. are captured. How do I extract the image accompanying the article? I know there is a solution: Facebook's link sharer does this pretty well.

Mithun

Duplicate of: How to find and extract "main" image in website


Download all images from the page and blacklist any that come from an ad server. Then find some heuristic that will get you the correct image...

I think something like:

  • Biggest resolution += 5 pts
  • Biggest filesize += 10 pts
  • JPEG += 2 pts

Then take the image with the most points and throw the rest away.

This probably works for the majority of sites.

(Would require some fiddling with the heuristics though)
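A minimal sketch of the point system above, in Python. It assumes the images have already been downloaded and measured; the `Candidate` record, the `AD_HOSTS` blacklist, and the function names are hypothetical, and the weights are simply the ones suggested in the list.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical blacklist of ad-server hosts; a real one would be much longer.
AD_HOSTS = {"ads.example.com", "doubleclick.net"}


@dataclass
class Candidate:
    url: str
    width: int       # pixels
    height: int      # pixels
    size_bytes: int  # downloaded file size


def is_ad(url):
    """True if the URL's host is a blacklisted host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def pick_main_image(candidates):
    """Return the highest-scoring non-ad candidate, or None if none remain."""
    pool = [c for c in candidates if not is_ad(c.url)]
    if not pool:
        return None
    biggest_area = max(c.width * c.height for c in pool)
    biggest_size = max(c.size_bytes for c in pool)

    def score(c):
        points = 0
        if c.width * c.height == biggest_area:
            points += 5   # biggest resolution
        if c.size_bytes == biggest_size:
            points += 10  # biggest file size
        if urlparse(c.url).path.lower().endswith((".jpg", ".jpeg")):
            points += 2   # JPEG, judged by file extension only
        return points

    return max(pool, key=score)
```

Ties go to whichever candidate appears first in the page, and the JPEG check trusts the file extension rather than the actual content type, so both are places where the fiddling would start.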


It's been a long time, but this may help someone next time.

You can use this API: https://urlmeta.org/

It's very simple to use, and the result is just what we need.

Example of using the API:

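<?php
$url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";

// URL-encode the article address so it survives as a query parameter value.
$result = file_get_contents('https://api.urlmeta.org/?url=' . urlencode($url));
if ($result === false) {
    die("Request failed\n");
}
$array = json_decode($result, true); // decode into an associative array
print_r($array['meta']['image']);

?>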

And that's the result you needed.


I came up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails.

  1. Say the headline of the page I find is "this is a headline"
  2. I use this as a query to the Google Image API and then extract the first thumbnail I find.

It actually works quite well in the majority of cases. Check it out for yourself: http://cricketfresh.in

Mithun

PS: I think this is a good answer, but I will give credit to anyone who comes up with a more elegant one.


I would guess that Facebook has a link extractor for the various sites it supports. Something like id="content" -> img (1st).

I guess I was wrong. It seems that Facebook uses the Open Graph Protocol to define which image (og:image) and which metadata to use.
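For pages that publish Open Graph metadata, reading og:image could look something like the sketch below. It uses only Python's standard-library HTML parser and assumes the page HTML has already been downloaded as a string; the class and function names are hypothetical. Pages without the tag yield None, so a fallback heuristic would still be needed for those.

```python
from html.parser import HTMLParser


class OGImageParser(HTMLParser):
    """Records the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image = None

    def handle_starttag(self, tag, attrs):
        # Only meta tags matter, and only the first og:image is kept.
        if tag != "meta" or self.image is not None:
            return
        attrs = dict(attrs)
        if attrs.get("property") == "og:image" and attrs.get("content"):
            self.image = attrs["content"]


def find_og_image(html):
    """Return the og:image URL declared in the HTML, or None if absent."""
    parser = OGImageParser()
    parser.feed(html)
    return parser.image
```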
