Preventing data theft
I know it's impossible to stop people from copying our data entirely, but I have a large database and I want to at least deter automated scripts from scraping it.
My ideas so far:
- use JavaScript or encode the HTML = heavy, and could easily be decoded
- reCAPTCHA for the search = no way, users will just leave my website
- inserting random data and tags into the site HTML to defeat regex ripping = good? (a sketch of what I mean follows)
Any ideas are appreciated.
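To make the third idea concrete, here is a rough sketch of what I have in mind; the `obfuscate` helper, the random class names, and the hide-decoys-with-CSS trick are all made up for illustration:

```python
import random
import string

def obfuscate(value: str) -> tuple[str, str]:
    """Wrap each real digit in a plain span and interleave decoy spans
    whose randomly named classes a generated stylesheet hides."""
    spans, rules = [], []
    for digit in value:
        # Decoy span with a random class; the CSS rule below hides it.
        cls = "".join(random.choices(string.ascii_lowercase, k=4))
        spans.append(f'<span class="{cls}">{random.randint(10, 99)}</span>')
        rules.append(f".{cls} {{ display: none; }}")
        # The real digit, in an unclassed span that stays visible.
        spans.append(f"<span>{digit}</span>")
    return "".join(spans), "\n".join(rules)

markup, css = obfuscate("3552")
# A naive regex that grabs every span's text now picks up the decoys;
# only a client that actually applies the CSS sees the real value.
```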
Why would people want to steal your database? Why does it matter if they do? Would simply asking them not to be sufficient?
Make your policy clear and ensure that your company's legal department has checked the wording. Discourage unauthorised syndication by making it clear that it is not permitted and that you will take legal steps to prevent it.
Or better still, encourage authorised syndication. People will only carry out unauthorised syndication if there is no sensible way for them to do so in an authorised manner.
Technical measures might have some effect, but would only deter those who aren't particularly competent or determined.
None of the solutions you proposed would work; a good script writer could easily bypass them. But there is a technical measure you can take on the application server side: implement a rate limit. Only allow one search from a given IP address every, say, 10 seconds (a minimal sketch follows). This will make automated data-mining from your site very slow.
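A minimal sketch of such a limiter, assuming a Flask app; the route, the window, and the in-memory dict are illustrative (a real deployment would want shared storage such as Redis, and care with proxies that hide the client IP):

```python
import time
from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 10   # one search per IP per 10-second window
last_search = {}      # ip -> time of the last allowed search (in-memory)

@app.route("/search")
def search():
    ip = request.remote_addr
    now = time.time()
    if now - last_search.get(ip, 0.0) < WINDOW_SECONDS:
        abort(429)    # 429 Too Many Requests: makes mining painfully slow
    last_search[ip] = now
    return "search results go here"
```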
I think Alexa inserts random tags into the markup, and it gave me a heck of a time when I tried to mine it... they put some extra tags in the Alexa rankings, like <span class="a5r">35</span><span class="et4">52</span><span class="arer">16</span>
and unless you downloaded the style sheet too and looked at the rendering rules, you couldn't figure out what number that was supposed to be.
But... if I had been patient enough, I could have "rendered" the numbers and then mined it; it just wasn't worth it for me (a rough sketch of that step is below). Limiting page requests to a humanly possible rate (50/min or something) would probably work well.
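For the curious, a rough sketch of that "rendering" step, assuming the stylesheet simply hides the decoy classes with display: none (the markup and rules here are made up, not Alexa's actual ones):

```python
import re

html = ('<span class="a5r">35</span>'
        '<span class="et4">52</span>'
        '<span class="arer">16</span>')
css = ".a5r { display: none; }\n.arer { display: none; }"

# Collect the classes the stylesheet hides; only the rest get rendered.
hidden = set(re.findall(r"\.([\w-]+)\s*\{[^}]*display:\s*none", css))

rendered = "".join(
    text
    for cls, text in re.findall(r'<span class="([\w-]+)">([^<]*)</span>', html)
    if cls not in hidden
)
print(rendered)  # -> "52" under these made-up rules
```

Which is why rate limiting is the more reliable defence: obfuscation only raises the cost, it doesn't make the data unrecoverable.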