开发者

How to prepare for data loss in a production website?

I am building an app that is fast moving into production and I am concerned about the possibility that due to hacking, some silly personal error (like running rake db:schema:load or rake db:rollback) or other circumstance we may suffer data loss in one database table or even across the system.

While I don't find it likely that the above will happen, I would be remiss in not being prepared in case it ever does.

I am using Heroku's PG Backups (which is to be replaced with something else this month), and I also run automated daily backups to S3: http://trevorturk.com/2010/04/14/automated-heroku-backups/, successfully generating .dump files.

What is the correct way to deal with data loss on a production app?

  1. How would I restore the .dump file in case I need to? Can I do a selective restore if a small part of the system is hit?
  2. In case a selective restore is not possible: assume one table loses data 4 hours after the last backup. Result => would fixing the lost table require rolling back 4 hours of users' activity? Any good solution to this?
  3. 开发者_如何转开发
  4. What is the best way to support users through the inconvenience if something like this happens?


A full DR (disaster recovery) solution requires the following:

  1. Multisite. If a fire, flood, Osama Bin Laden or whathaveyou strikes the Amazon (or is it Salesforce?) data center that Heroku uses, you want to be sure that your data is safe elsewhere.
  2. On-going replication of the data to a separate site (or sites). That means that every transaction that's written to your database on one site, is replicated within seconds to the mirror database on the other site. Most RDBMS's have mechanisms to let you do a master-slave replication like that.
  3. The same goes for anything you put on a filesystem outside of the database, such as images, XML configuration files etc. S3 is a good solution here - they replicate everything to multiple data centers for you.
  4. I won't hurt to create periodic (daily or so) dumps of the database and store them separately (e.g. on S3). This helps you recover from data corruption that propagates to the slave DBs.
  5. Automate the process of data recovery. You want this to just work when you need it.
  6. Test everything. Ideally, you want to automate the test process and run it periodically to ensure that your backups can restore. Netflix Chaos Monkey is an extreme example of this.

I'm not sure how you'd implement all this on Heroku. A complete solution is still priced out of reach for most companies - we're running this across our own data centers (one in the US, one in EU) and it costs many millions. Work according to the 80-20 rule - on-going backup to a separate site, plus a well tested recovery plan (continuously test your ability to recover from backups) covers 80% of what you need.

As for supporting users, the best solution is simply to communicate timely and truthfully when trouble happens and make sure you don't lose any data. If your users are paying for your service (i.e. you're not ad-supported), then you should probably have an SLA in place.


About backups, you cannot be sure at 100 percent every time that no data will be lost. The best is to test it on another server. You must have at leat two types of backup :

  • A database backup, like pg-dump. A dump is uniquely SQL commands so you can use it to recreate the whole database, just a table, or just a few rows. You loose the data added in the meantime.

  • A code backup, for example a git repository.


in addition to Hartator's answer:

  • use replication if your DB offers it, e.g. at least master/slave replication with one slave

  • do database backups on a slave DB server and store them externally (e.g. scp or rsync them out of your server)

  • use a good version control system for your source code, e.g. Git

  • use a solid deploy mechanism, such as Capistrano and write your custom tasks, so nobody needs to do DB migrations by hand

  • have somebody you trust check your firewall setup and the security of your system in general

The DB-Dumps contain SQL-commands to recreate all tables and all data... if you were to restore only one table, you could extract that portion from a copy of the dump file and (very carefully) edit it and then restore with the modified dump file (for one table).

Always restore first to an independent machine and check if the data looks right. e.g. you could use one Slave server, take if offline, then restore there locally and check the data. Good if you have two slaves in your system, then the remaining system has still one master and one slave while you restore to the second slave.


To simulate a fairly simple "total disaster recovery" on Heroku, create another Heroku project and replicate your production application completely (except use a different custom domain name).

You can add multiple remote git targets to a single git repository so you can use your current production code base. You can push your database backups to the replicated project, and then you should be good to go.

The only step missing from this exercise verses a real disaster recovery is assigning your production domain to the replicated Heroku project.

If you can afford to run two copies of your application in parallel, you could automate this exercise and have it replicate itself on a regular basis (e.g. hourly, daily) based on your data loss tolerance.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜