Programmatically determining if someone owns a website?
I need to figure out the best way to determine whether someone is the actual owner of a website. I don't just mean the domain, although in many cases the two will coincide.
My first inclination was to have them put a special comment in their HTML that my program can scrape, e.g.:
<!-- @webcode:1234 -->
One possible problem with that approach is that someone could, in theory, post it in the comments section of a page or inject it through some other way of adding content. Then again, I'm not sure that anything I ask them to do couldn't be faked the same way.
My other idea, since I was also planning to offer a JavaScript widget, was to just scrape for that, although I didn't necessarily want to force them to add the widget.
<script type="text/javascript" src="http://yoursite.com/widget/widget/A4923D2342JF"></script>
What other mechanisms could be employed to determine ownership/control of a website?
Here are the options that Google uses for Domain verification:
Create a CNAME or TXT record in your domain's DNS settings. These methods require accessing DNS settings for your domain at your domain host's website. Which method you can choose (CNAME or TXT record) depends on what's offered in your Google Apps control panel. We're currently rolling out the TXT record method but still ask many customers to create a CNAME record, instead.
Upload an HTML file to your domain's web server: This method requires being able to upload files to your domain's web server. Try doing this if you don't have access to your domain's DNS settings.
Add a tag to your home page: This method is available only for some customers (it's another new method we're rolling out). It requires accessing your domain's web server but not uploading to it. Try doing this if you have write access to files on the server but can't upload new files.
A CNAME/TXT record or an HTML file uploaded to the root of the domain is the most secure option, since each requires real control over the domain or its web server. If you want to be a bit more lax, you could use a meta tag in the head element, which someone cannot plant simply by posting a comment on a page. It all depends on how secure you want to be.
Do what Google does for their Webmaster Tools. Generate a unique key, and have them put it in a meta tag in the head of their front page. It's pretty unlikely that a user who does not own the site will be able to change the contents within the <head></head> tags. If they can, the site is vulnerable to almost any kind of vandalism, and is hopeless.
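A minimal sketch of the scraper side of this check, using only the Python standard library. The meta name `site-verification` and the function name are assumptions made for illustration, not anything Google or the original poster specifies. The parser only counts meta tags seen while inside the head, so the same tag pasted into the body or hidden in an HTML comment is ignored.

```python
from html.parser import HTMLParser


class HeadMetaFinder(HTMLParser):
    """Collects <meta name=... content=...> pairs that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.metas = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "body":
            # A <body> tag ends the head even if </head> was omitted.
            self.in_head = False
        elif tag == "meta" and self.in_head:
            attr = dict(attrs)
            if attr.get("name") and attr.get("content") is not None:
                self.metas[attr["name"].lower()] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def has_verification_meta(html, expected_token, meta_name="site-verification"):
    """True if the page's head holds a meta tag carrying the expected token.

    meta_name is a hypothetical name chosen for this example.
    """
    finder = HeadMetaFinder()
    finder.feed(html)
    return finder.metas.get(meta_name.lower()) == expected_token
```

You would fetch the front page yourself and pass its HTML in, along with the key you generated for that user.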
You could keep your original idea but only accept the comment inside, say, the <head> tag of the website. This way you avoid someone pasting the comment into a 'comments' section, as you originally feared.
In fact, I subscribed to a service that did just that: include the special comment in the header section of your page.
Make it part of the requirement that the comment be inside the <head> tag. Typically, even user-generated content wouldn't make its way into the head.
Also, your concern about the comment hack is probably unnecessary. Any comment system worth its salt escapes user comments so that they are not rendered as actual HTML markup.
Have them put a file with a hard-to-guess name on the server?
Such as http://www.example.com/5gdbadcab234g3.txt
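One way this could look in Python, as a rough sketch. The function names and the idea of also checking the file's contents are assumptions added here, not part of the suggestion above. The fetch step is passed in as a parameter so the check can be exercised without touching the network.

```python
import secrets
from urllib.parse import urljoin
from urllib.request import urlopen


def new_challenge():
    """Return (filename, body) for a file the site owner must upload."""
    return secrets.token_hex(16) + ".txt", secrets.token_hex(16)


def fetch_text(url, timeout=10):
    """Download a URL and return the start of its body as text (network call)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read(4096).decode("utf-8", errors="replace")


def verify_file(site_root, filename, expected_body, fetch=fetch_text):
    """True if the file at the site's root contains exactly the expected body."""
    url = urljoin(site_root, "/" + filename)
    try:
        return fetch(url).strip() == expected_body
    except OSError:
        # urllib's URLError and HTTPError are OSError subclasses:
        # unreachable host, 404, timeout and so on all count as "not verified".
        return False
```

Comparing the body, rather than just looking for a successful response, guards against servers that answer every path with a generic page.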
The only true way is to be able to access their fileserver. Anything transferred through HTTP can be reproduced.
If you don't have access to their server, then the best way would be to have an encrypted string embedded on the page (or in an image or some binary file on that page).
The string should be composed of the URI, author, and timestamp. That way, even if someone does copy this string to their own website, you would still be able to determine the original author and page. An added bonus is that you'll be able to detect theft.
Granted, this is only as good as the algorithm that encrypts the page/author combination; hackers who are good at decrypting could get around it. Additionally, a dishonest author could create his own key for his page, so you'd need to host the encryption yourself so that no one could tinker with the timestamp. Also, this requires that all authors place the code on their pages.
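A hedged sketch of the server-hosted variant. Note that it swaps encryption for an HMAC-SHA256 signature: the URI, author and timestamp stay readable, but the string cannot be altered or forged without the key, which never leaves your server. `SECRET_KEY`, the token layout and all function names are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json
import time

# Assumption: a long random secret that is stored only on your server.
SECRET_KEY = b"replace-with-a-long-random-server-side-secret"


def issue_token(uri, author, timestamp=None):
    """Build a signed string for the author to embed on the page."""
    ts = int(time.time() if timestamp is None else timestamp)
    payload = json.dumps(
        {"uri": uri, "author": author, "ts": ts},
        sort_keys=True, separators=(",", ":"),
    ).encode("utf-8")
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode("ascii") + "."
            + base64.urlsafe_b64encode(signature).decode("ascii"))


def read_token(token):
    """Return the payload dict if the signature is valid, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # wrong number of parts or invalid base64
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(payload)


def token_matches_page(token, page_uri):
    """True only if the token is genuine and was issued for this exact URI."""
    claims = read_token(token)
    return claims is not None and claims["uri"] == page_uri
```

A token copied onto another site still verifies as genuine, but its `uri` no longer matches the page it was found on, which is how the copy shows up.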
I know you mentioned that it isn't necessarily domain-dependent, but that would help. You could hash the domain (domains are unique), ideally together with a secret only you know, since a plain hash of a public domain name can be computed by anyone. Then send the person that string to put somewhere on their site, either in a .txt file or in the header as others have mentioned.
Then you store all their domains and hashes in a database, and your scraper checks that the domain it is scraping matches the hashed string it finds there. If it checks out, it's fine.