Search for copies of data from all over the internet
I need your help and advice from a developer's point of view: how do people run sites like copyscape.com? Basically, they search for copies of data across the whole internet. I want to know how they search and catalog every website on the internet, the same way Google builds its index of sites.

Please guide me: how do they search data from all over the internet? How is it possible to keep track of each and every website? How does Google know there is a new site on the internet, and how do its crawlers learn that a new website has launched? In short, I want to know how I can develop a site that searches for copies of data across the whole internet without depending on any third-party API. Please advise me; I hope you will help.
Thanks.
Google's crawlers don't know when a new site is launched. Usually developers must submit their sites to Google or get incoming links from sites that are already indexed.
And nobody has a copy of the entire Internet. There are websites that nothing links to, so they never get visited by any crawler; this part of the web is called the deep web and is generally inaccessible to crawlers.
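If you want to see the basic mechanics of discovery, here is a minimal sketch of a link-following crawler: it starts from a seed URL (a placeholder here) and can only find new pages through links on pages it has already fetched, which is exactly why unlinked sites stay invisible. It assumes the `requests` and `beautifulsoup4` packages and skips everything a real crawler needs (robots.txt, politeness delays, deduplication by content, etc.):

```python
# Minimal breadth-first crawler sketch. The seed URL is a placeholder;
# swap in real starting points. Not production code.
import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=50):
    seen = {seed}          # URLs we have already queued
    queue = deque([seed])  # URLs waiting to be fetched
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # unreachable page; skip it
        fetched += 1
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            # Resolve relative links against the current page.
            link = urllib.parse.urljoin(url, a["href"])
            link = link.split("#")[0]  # drop fragments
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
        yield url, resp.text  # hand the page off to an indexer

if __name__ == "__main__":
    for page_url, _html in crawl("https://example.com"):
        print(page_url)
```

Notice that the crawler never "learns" about a site that no fetched page links to; that is the whole discovery model.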
How do they do it exactly? I don't know. Maybe they index popular sites where text is likely to be copied, like Blogger, ezinearticles, etc., and if they don't find the text on those sites, they simply say it's original. That's just a theory, and I am probably wrong.
Me? I would probably use Google. Take a good chunk of text from the website you are checking and search for it, then filter out the results that come from the original website. And voilà: the remaining results are the websites that contain that exact phrase, which is presumably copied.
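As a rough sketch of that idea, here is what the check could look like using Google's Custom Search JSON API (so it does depend on a third-party API; the API key and search-engine ID are placeholders you would create in the Google developer console, and this is just one possible approach, not Copyscape's actual method):

```python
# Sketch: search for an exact phrase, drop hits from the original site.
# API_KEY and ENGINE_ID are placeholders, not real credentials.
import urllib.parse

import requests

API_KEY = "YOUR_API_KEY"      # placeholder
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder (the "cx" parameter)

def find_copies(phrase, original_domain):
    params = {
        "key": API_KEY,
        "cx": ENGINE_ID,
        "q": f'"{phrase}"',  # quote the phrase for an exact match
    }
    resp = requests.get("https://www.googleapis.com/customsearch/v1",
                        params=params, timeout=10)
    resp.raise_for_status()
    copies = []
    for item in resp.json().get("items", []):
        host = urllib.parse.urlparse(item["link"]).netloc
        # Skip results on the site the text originally came from.
        if original_domain not in host:
            copies.append(item["link"])
    return copies

if __name__ == "__main__":
    for url in find_copies("a good chunk of distinctive text here",
                           "original-site.example"):
        print(url)
```

The key design point is picking a phrase long and distinctive enough that an exact-match hit almost certainly means copying, not coincidence.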