
Facebook-like on-demand meta content scraper

Have you guys ever noticed that FB scrapes the link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, various images from the linked page, or a video thumbnail from a video-related link (like YouTube)?

Any ideas how one would copy this functionality? I'm thinking about a couple of Gearman workers, or even better just JavaScript that does XHR requests and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it up in a nice class? Anything? :)

thanks!


FB scrapes the meta tags from the HTML.

I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.

As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
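That size-based filtering could be sketched like this (a guess at how such a heuristic might work, not FB's actual logic; the `pickThumbnailCandidates` name and the 50×50 threshold are hypothetical):

```php
<?php
// Keep only images large enough to be plausible thumbnails,
// skipping 1px spacers, icons and button graphics.
// The 50x50 minimum is a made-up threshold, not FB's actual value.
function pickThumbnailCandidates(array $images, $minW = 50, $minH = 50)
{
    $candidates = array();
    foreach ($images as $img) {          // each entry: array(url, width, height)
        list($url, $w, $h) = $img;
        if ($w >= $minW && $h >= $minH) {
            $candidates[] = $url;
        }
    }
    return $candidates;
}
```

For remote images you'd first have to learn the dimensions, e.g. with `getimagesize($url)`, which is an extra HTTP round-trip per image.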

Edit: I don't know exactly what you're looking for, but here's a PHP function for scraping the relevant data from pages. It uses the Simple HTML DOM library from http://simplehtmldom.sourceforge.net/

I've had a look at how FB does it, and it looks like the scraping is done server-side.


    class ScrapedInfo
    {
        public $url;
        public $title;
        public $description;
        public $imageUrls;
    }

    function scrapeUrl($url)
    {
        $info = new ScrapedInfo();
        $info->url = $url;
        $html = file_get_html($info->url);

        //Grab the page title
        $info->title = trim($html->find('title', 0)->plaintext);

        //Grab the page description from <meta name="description">
        $meta = $html->find('meta[name=description]', 0);
        if ($meta)
                $info->description = trim($meta->content);

        //Precompute the pieces needed to absolutize relative image URLs
        $parts  = parse_url($url);
        $root   = $parts['scheme'] . '://' . $parts['host'];
        $path   = isset($parts['path']) ? $parts['path'] : '/';
        $folder = $root . substr($path, 0, strrpos($path, '/') + 1);

        //Grab the image URLs
        $imgArr = array();
        foreach ($html->find('img') as $element)
        {
                $rawUrl = $element->src;

                //Turn any relative URLs into absolutes
                if (substr($rawUrl, 0, 4) == "http")
                        $imgArr[] = $rawUrl;                          //already absolute
                elseif (substr($rawUrl, 0, 2) == "//")
                        $imgArr[] = $parts['scheme'] . ':' . $rawUrl; //protocol-relative
                elseif (substr($rawUrl, 0, 1) == "/")
                        $imgArr[] = $root . $rawUrl;                  //root-relative
                else
                        $imgArr[] = $folder . $rawUrl;                //document-relative
        }
        $info->imageUrls = $imgArr;

        return $info;
    }


Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones, but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing, you could always fall back to a website thumbnail generation service.
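Checking for that hint could look like this (a minimal sketch using PHP's built-in DOMDocument instead of a third-party library; the `findImageSrcLink` name is my own):

```php
<?php
// Return the thumbnail a page explicitly advertises via
// <link rel="image_src" href="..."/>, or null if there is none.
function findImageSrcLink($htmlString)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($htmlString);   // suppress warnings on sloppy real-world markup

    foreach ($doc->getElementsByTagName('link') as $link) {
        if ($link->getAttribute('rel') === 'image_src') {
            return $link->getAttribute('href');
        }
    }
    return null;   // caller falls back to scanning <img> tags
}
```

If this returns null, you'd fall back to scanning the page's images as in the scraping answer above.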


As I am developing a project like that, it is not as easy as it seems: encoding issues, content rendered with JavaScript, and the sheer number of non-semantic websites are some of the big problems I encountered. Extracting video info and trying to get auto-play behavior in particular is always tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as on FB.
