How to mimic Facebook's "link share" functionality using Node.js and JavaScript
What I want to mimic is the link share feature Facebook provides. You simply enter a URL, and FB automatically fetches an image, the title, and a short description from the target website. How would one program this in JavaScript with Node.js and any other JavaScript libraries that may be required? I found an example using PHP's fopen function, but I'd rather not include PHP in this project.
Is what I'm asking for an example of web scraping? Do I just need to retrieve the data inside the meta tags of the target website, and then get the image tags using CSS selectors?
If someone can point me in the right direction, that'd be greatly appreciated. Thanks!
Look at THIS post. It discusses scraping with node.js. HERE you have lots of previous info on scraping with javascript and jquery.
That said, Facebook doesn't actually guess what the title, description, and preview are; at least most of the time, it gets that info from meta tags present in sites that want to be more accessible to FB users.
Maybe you could make use of that existing metadata to pull titles, descriptions, and image previews. The docs on the available metadata are HERE.
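As a rough illustration of reading that metadata, here is a minimal sketch that collects Open Graph properties (og:title, og:description, og:image) from a string of HTML. The function name extractOpenGraph is made up for this example, and it uses regular expressions only to stay free of dependencies; it assumes simple, well-formed meta tags, so a real DOM parser would be more robust on arbitrary pages.

```javascript
// Hypothetical helper: collect Open Graph properties from raw HTML.
// Assumption: meta tags are simple and contain no ">" inside attribute values.
function extractOpenGraph(html) {
  const result = {};
  // Grab every <meta ...> tag, regardless of attribute order.
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    const property = /\bproperty\s*=\s*["']og:([^"']+)["']/i.exec(tag);
    const content = /\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
    if (!property || !content) continue;
    const name = property[1];
    const value = content[1] !== undefined ? content[1] : content[2];
    // Keep the first occurrence when a property is repeated (e.g. several og:image tags).
    if (!Object.prototype.hasOwnProperty.call(result, name)) {
      result[name] = value;
    }
  }
  return result;
}

// Example input, invented for illustration.
const sampleHtml =
  '<html><head>' +
  '<meta property="og:title" content="Example Page">' +
  '<meta content="A short summary" property="og:description" />' +
  '<meta property="og:image" content="https://example.com/a.png">' +
  '<meta name="viewport" content="width=device-width">' +
  '</head><body></body></html>';

const og = extractOpenGraph(sampleHtml);
console.log(og.title, og.description, og.image);
```

If a page has none of these tags, the returned object is simply empty, and you would fall back to scraping the page itself.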
Yes, web scraping is required, and that's the easy part. The hard part is a generic algorithm to find headings and the relevant text and images.
How to scrape
You can use jsdom to download the page and build a DOM structure on your server, then scrape it there using jQuery. You can find a good tutorial at blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs, as suggested by @generalhenry above.
What to scrape
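I guess a good way to find the heading would be:
var h;
// Try h1 first, then h2, and so on down to h6.
for (var i = 1; i <= 6; i++) {
  var candidate = $('h' + i).first();
  // A jQuery object is always truthy, so check its length instead.
  if (candidate.length) {
    h = candidate;
    break;
  }
}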
Now h will hold the heading element, or undefined if no heading was found. The alternative could be to simply get the page's title tag. :)
As for the images: list all, or the first few, images on the page that are reasonably large, so as to filter out sprites used for buttons, arrows, etc.
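A minimal sketch of that size filter could look like the following. It assumes you have already gathered each image's src, width, and height (for example from the img attributes); the 100-pixel threshold and the name pickPreviewImages are arbitrary choices for illustration, not anything Facebook documents.

```javascript
// Hypothetical minimum side length, in pixels, for a preview candidate.
const MIN_SIDE = 100;

// images: array of { src, width, height } records gathered from the page.
// Returns up to `limit` image URLs whose width and height both reach MIN_SIDE.
// Images with missing or non-numeric dimensions are skipped.
function pickPreviewImages(images, limit = 3) {
  return images
    .filter((img) => Number(img.width) >= MIN_SIDE && Number(img.height) >= MIN_SIDE)
    .slice(0, limit)
    .map((img) => img.src);
}

// Example data, invented for illustration.
const picked = pickPreviewImages([
  { src: 'arrow.png', width: 16, height: 16 },
  { src: 'hero.jpg', width: 800, height: 400 },
  { src: 'unknown.gif' },
  { src: 'photo.jpg', width: '300', height: '200' },
]);
console.log(picked);
```

Images that do not declare their dimensions are dropped here; to consider them you would have to download them and measure their real size.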
And while fetching the remote data, make sure that the ProcessExternalResources flag is off. This will ensure that script tags for ads do not pollute the fetched page.
And yes, the relevant text would be in some tags after h.