How stable is RabbitMQ in production (using DRBD and Pacemaker)?
Looking for experience with RabbitMQ, especially in HA configuration using Pacemaker and DRDB as recommended here: http://www.rabbitmq.com/pacemaker.html
开发者_运维百科The DRBD part in particular makes me nervous, so I'm hoping someone here has real-world experience to share.
Works most of the time. However you'll have to pay special attention to fencing (split brain), when dealing with DRBD. On a production system it's always a pain to have to fix this kind of issues manually.
We failed to run RabbitMQ in a master/slave (multi-state RA). We thought we'd enhance availability. We're back to a single instance now. If anyone else has experience with several RabbitMQ instances running concurrently and backing a master entity that would be great to share!
I find the lack of tools to debug Pacemaker when there are issues is a big hurdle to deploy to live systems... It's not always clear what Pacemaker is "thinking" or doing. hb_report is not sufficient unfortunately.
Hope this helps,
D.
We tried master/slave configuration as well, however it became difficult to maintain all instances up to date with no downtime. And trust me, you want to update RabbitMQ. There are always bugs popping up either in RabbitMQ itself or in Erlang.
We've are getting about 100 crashes per year without any meaningful explanation in the logs. The error log just has generic "error while starting" in it and that's pretty much it. Sometimes it won't start after the crash and most of those times, the only solution is to delete all the persistent messages from all instances, so that the queue state is synchronized across the cluster. Other times it would crash immediately after launching and only after multiple repeated attempts will it properly load. Meaning there is no added reliability what so ever when using master/slave. At least there was none in our case. (RabbitMQ 3.5.3, Erlang 18.0)
It works for production, but only if you keep a copy of the message somewhere in the logs or in the database, from where it can be quickly recovered after a major crash.
精彩评论