
AppFabric Cache seems unstable

We're trying to use the AppFabric distributed cache. After a lot of back and forth with non-domain servers, we finally put them in a domain, and installation and setup became a bit easier. We got it up and running after fighting through a ton of errors, most of which seem trivial for AppFabric to test for or to report with a more descriptive error message. "Temporary error" does not explain a lot...

But there are still issues.

We set up 3 servers, one of which is the "lead". We finally got the cache working and confirmed this by pointing a Network Load Balancer at one server at a time, verifying that we can set a cache entry on one server and retrieve it from another.

Then I restarted the AppFabric Caching service on all servers and suddenly it is not working. Get-CacheHost says they are up, but we get exceptions like:

ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out
ErrorCode<ERRCA0017>:SubStatus<ES0001>:There is a temporary failure. Please retry later.

Why would this error condition occur by simply restarting the services?

Is AppFabric Cache really ready for production use?

What happens if a server goes offline? Long timeouts?

Are we dependent on the "lead" server being up?

I suspect it will be back up after 5-10 minutes of R&R. It seems to come back by itself sometimes.

Update: It did come up after a few minutes. We have now tested by removing one server from the cluster and it resulted in a long timeout and finally an exception.
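Since the cluster did recover on its own after a few minutes, one way to ride out errors like ERRCA0017 and ERRCA0018 on the client side is to retry with a growing delay. The following is only a language-neutral sketch written in Python, not AppFabric's own API: `TransientCacheError` and `call_with_retry` are hypothetical names, and `operation` stands for whatever cache call your client actually makes.

```python
import time


class TransientCacheError(Exception):
    """Hypothetical stand-in for a retryable failure such as ERRCA0017/ERRCA0018."""


def call_with_retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on TransientCacheError with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can fall back to the
    backing store instead of waiting indefinitely.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientCacheError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

Keeping the number of attempts small matters here: the cluster can take minutes to come back, so the caller should probably give up quickly and treat the cache as a miss rather than block a request for that long.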


We have been debugging this for some time and I'm sharing what we have found so far.

  • UAC on Windows 2008 actually blocks access to the local computer, so commands directed at the local computer will fail. Start PowerShell as admin or turn off UAC completely to get around this.
  • Simply editing the config file by hand will not work. You need to use the export and import commands (Export-CacheClusterConfig and Import-CacheClusterConfig).
  • Firewalls are a major issue: the installer opens the 222* range of ports, but the PowerShell tools rely on other Windows services. Turning off the firewall on all servers (not recommended) solved the problem.
  • If a server is removed from the cluster, there will be an initial timeout before the cluster can operate again.
  • After a restart, the cluster takes 2-5 minutes to come back up.
  • If one server is not reachable during a restart, the startup time increases.
  • If the server holding the shared file share for the config is not reachable, the services will not start. We tried to solve this by giving each server a private share.
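
To narrow down the firewall issue without switching the firewall off everywhere, a quick reachability check between hosts can help. This is a rough sketch, not part of the AppFabric tooling: `check_ports` is a hypothetical helper, and the port numbers 22233-22236 are an assumption about what the "222* range" covers on a default install, so substitute the ports from your own cluster configuration. It only tests plain TCP connects and says nothing about the other Windows services the PowerShell tools depend on.

```python
import socket

# Assumption: ports a default install is believed to use; adjust to your config.
ASSUMED_PORTS = (22233, 22234, 22235, 22236)


def check_ports(host, ports=ASSUMED_PORTS, timeout=2.0):
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Running something like `check_ports("cachehost2")` from each server toward the others (the host name here is a placeholder) gives a small matrix of which directions are blocked.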