Setup for a high volume of database writes
I am researching a project that would require hundreds of database writes per minute. I have never dealt with this volume of writes before, and I am looking for good, scalable techniques and technologies.
I am a comfortable Python developer with experience in Django and SQLAlchemy. I am thinking I will build the data interface on Django, but I don't think it is a good idea to go through the ORM for the volume of writes I will require. I am definitely open to learning new technologies.
The solution will live on Amazon Web Services, so I have access to all their tools. Ultimately I am looking for advice on database selection, data-writing techniques, and any other needs I may have that I do not yet realize.
Any advice on where to start?
Thanks, CG
Follow the trends; in other words, enter the world of NoSQL. Technologies worth considering include MongoDB and Redis: they are fast, scalable, and have decent Python drivers. For example, MongoDB plays nicely with Django and has a lot in common with traditional SQL databases such as MySQL. Redis, on the other hand, offers more primitive data structures but is superior in raw speed (which, of course, also depends somewhat on the driver). Using either of them (or both, a sensible combination for an ambitious project), you are free, and sometimes forced, to write your own low-level logic to accomplish your needs.
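To give a feel for what that looks like, here is a minimal sketch using the pymongo and redis-py drivers, assuming both servers are running locally; the database, collection, and key names are placeholders:

```python
# A minimal sketch, assuming the pymongo and redis packages are installed
# and servers are running on localhost; "myapp"/"events" are placeholders.
from datetime import datetime, timezone

import pymongo
import redis

mongo = pymongo.MongoClient("mongodb://localhost:27017")
events = mongo["myapp"]["events"]     # database and collection are created lazily

r = redis.Redis(host="localhost", port=6379)

def record_event(user_id, payload):
    # MongoDB: one document per write, no schema migration needed.
    events.insert_one({
        "user_id": user_id,
        "payload": payload,
        "created_at": datetime.now(timezone.utc),
    })
    # Redis: keep a fast, in-memory counter alongside the durable record.
    r.incr(f"events:count:{user_id}")

record_event(42, {"action": "signup"})
```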
You should actually be okay with low hundreds of writes per minute through SQLAlchemy (that's only a couple per second); if you're talking more like a thousand a minute, then yes, that might be problematic.
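If you do outgrow per-object ORM writes, you can drop down to SQLAlchemy Core and batch inserts. A minimal sketch, with a hypothetical table and connection URL:

```python
# A minimal sketch of batched inserts through SQLAlchemy Core rather than the ORM.
# The table definition and connection URL are placeholders.
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine("postgresql://user:pass@localhost/mydb")
metadata = MetaData()

measurements = Table(
    "measurements", metadata,
    Column("id", Integer, primary_key=True),
    Column("sensor", String(50)),
    Column("value", Integer),
)
metadata.create_all(engine)

rows = [{"sensor": "a", "value": i} for i in range(500)]

# Passing a list of dicts makes SQLAlchemy use executemany under the hood,
# so the whole batch is sent in a single statement execution.
with engine.begin() as conn:
    conn.execute(measurements.insert(), rows)
```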
What kind of data do you have? If it's fairly flat (few tables, few relations), you might want to investigate a non-relational database such as CouchDB or MongoDB. If you want to use SQL, I strongly recommend PostgreSQL; it seems to handle large databases and frequent writes a lot better than MySQL.
It also depends on how complex the data you're inserting is.
Unfortunately, I think you're just going to have to try a couple of things and run benchmarks, as each situation is different and query optimizers are basically magic.
If it's just a few hundred writes per minute, you can still manage with a relational DB. I'd pick PostgreSQL (8.0+), which has a separate background writer process. It also has tunable durability settings that let you trade some strict ACID guarantees for speed, some of them even at the transaction level.
Postgres is well documented, but it assumes a somewhat deeper understanding of SQL and relational DB theory to fully understand it and get the most out of it.
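As one example of such a trade-off, asynchronous commit (available in 8.3 and later) can be toggled per transaction. A minimal sketch using the psycopg2 driver; the DSN and table are placeholders:

```python
# A minimal sketch of trading commit durability for speed per transaction,
# assuming PostgreSQL 8.3+ and psycopg2; the DSN and table are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")

with conn:                      # commits on success, rolls back on error
    with conn.cursor() as cur:
        # With synchronous_commit off, COMMIT returns before the WAL is
        # flushed to disk: a crash may lose the last few transactions,
        # but the data that does survive is never corrupted.
        cur.execute("SET LOCAL synchronous_commit TO OFF")
        cur.execute(
            "INSERT INTO measurements (sensor, value) VALUES (%s, %s)",
            ("a", 42),
        )
```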
The alternative would be a newfangled "NoSQL" system, which can probably scale even better, but at the cost of buying into a very different technology stack.
Either way, if you are using Python, it is not 100% critical that writes survive a shutdown or power loss, and you need low latency, use a threadsafe Queue.Queue and worker threads to decouple the writes from your main application thread(s), as sketched below.
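A minimal sketch of that pattern; the module is spelled `queue` in Python 3, and `write_batch()` is a placeholder for whatever bulk-insert call you end up using:

```python
# A minimal sketch of decoupling writes with a threadsafe queue and a worker
# thread. write_batch() is a hypothetical stand-in for your real bulk insert.
import queue
import threading

write_queue = queue.Queue()

def write_batch(rows):
    # Placeholder: replace with a real bulk insert (SQLAlchemy Core, psycopg2, ...).
    print(f"writing {len(rows)} rows")

def writer_worker():
    while True:
        rows = [write_queue.get()]        # block until at least one item arrives
        while not write_queue.empty():    # then drain whatever else is waiting
            rows.append(write_queue.get())
        write_batch(rows)
        for _ in rows:
            write_queue.task_done()

threading.Thread(target=writer_worker, daemon=True).start()

# Request-handling threads just enqueue and return immediately.
write_queue.put({"sensor": "a", "value": 1})
write_queue.join()                        # wait for pending writes (e.g. in tests)
```

The main thread never blocks on the database; the worker batches whatever has accumulated, which smooths out bursts at the cost of losing queued-but-unwritten rows if the process dies.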