AppFabric Cache seems unstable
We're trying to use the AppFabric distributed cache. After a lot of back and forth with non-domain servers, we finally put them in a domain, and installation/setup was a bit easier. We got it up and running after fighting through a ton of errors, most of which seem trivial for AppFabric to test for or to report with a more descriptive error message. "Temporary error" does not explain a lot...
But there are still issues.
We set up 3 servers, one of which is the "lead". We finally got the cache working and confirmed this by pointing a Network Load Balancer at one server at a time, verifying that we can set a cache item on one server and retrieve it from another.
Then I restarted the AppFabric Caching service on all servers and suddenly it is not working. Get-CacheHost says they are up, but we get exceptions like:
ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out
ErrorCode<ERRCA0017>:SubStatus<ES0001>:There is a temporary failure. Please retry later.
Why would this error condition occur by simply restarting the services?
Is AppFabric Cache really ready for production use? What happens if a server goes offline? Long timeouts? Are we dependent on the "lead" server being up? I suspect it will be back up after 5-10 minutes of R&R. It seems to come back by itself sometimes.
Update: It did come up after a few minutes. We have now tested by removing one server from the cluster and it resulted in a long timeout and finally an exception.
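Since the error text itself says "Please retry later", one client-side mitigation is to retry with a growing delay instead of failing on the first exception. Below is a minimal, language-neutral sketch in Python of that idea. It is not AppFabric client code: `TemporaryCacheError`, `call_with_retry` and the delay values are hypothetical names and numbers chosen for illustration, and a real client would catch whatever exception its cache library raises for ERRCA0017/ERRCA0018.

```python
import time


class TemporaryCacheError(Exception):
    """Hypothetical stand-in for a transient failure such as ERRCA0017 or ERRCA0018."""


def call_with_retry(operation, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run operation(), retrying on TemporaryCacheError with exponential backoff.

    The delay doubles after each failed attempt and is capped at max_delay.
    The sleep function is injectable so the helper can be tested without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TemporaryCacheError:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

With the defaults the helper waits at most 0.5 + 1 + 2 + 4 = 7.5 seconds in total, which would not cover the multi-minute recovery we saw after a full restart. For that case the application probably needs to fall back to the underlying data source rather than wait on the cache.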
We have been debugging this for some time and I'm sharing what we have found so far.
- UAC on Windows 2008 actually blocks access to the local computer, so commands directed at the local computer will fail. Start PowerShell as admin or turn off UAC completely to bypass this.
- Simply changing the config file manually will not work. You need to use the export and import commands (Export-CacheClusterConfig and Import-CacheClusterConfig).
- Firewalls are a major issue as the installer opens the 222* range of ports, but the PowerShell tools use other Windows services. Turning off the firewall on all servers (not recommended) solved the problem.
- If a server is removed from the cluster there will be an initial timeout before the cluster can operate again.
- After a restart the cluster takes 2-5 minutes to get back up.
- If one server is unreachable during a restart, startup takes even longer.
- If the server holding the shared config file share is not reachable, the services will not start. We tried to solve this by giving each server a private share.
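
To narrow down the firewall issue without turning the firewall off everywhere, it can help to probe the cache ports from each server. The following is a rough Python sketch, assuming the default AppFabric port numbers (22233 cache, 22234 cluster, 22235 arbitration, 22236 replication); adjust them if the cluster was configured differently. The function names and the host names in the usage note are made up for illustration. It only tests TCP reachability of the 222* ports, so it will not detect blocked Windows services that the PowerShell tools rely on.

```python
import socket

# Assumed AppFabric defaults; verify against the actual cluster configuration.
DEFAULT_PORTS = {
    "cache": 22233,
    "cluster": 22234,
    "arbitration": 22235,
    "replication": 22236,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_hosts(hosts, ports=DEFAULT_PORTS, timeout=2.0):
    """Map (host, port_name) to True/False for every host and port combination."""
    return {
        (host, name): port_is_open(host, port, timeout)
        for host in hosts
        for name, port in ports.items()
    }
```

Calling `check_hosts(["cache1", "cache2", "cache3"])` from each server in turn gives a quick picture of which host-to-host paths are blocked.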