
Testing fault-tolerant code

I’m currently working on a server application where we have agreed to try to maintain a certain level of service. The level of service we want to guarantee is: if a request is accepted by the server and the server sends an acknowledgement to the client, we want to guarantee that the request will happen, even if the server crashes. As requests can be long running and the acknowledgement time needs to be short, we implement this by persisting the request, then sending an acknowledgement to the client, then carrying out the various actions to fulfill the request. As actions are carried out they too are persisted, so the server knows the state of a request on start-up, and there are also various reconciliation mechanisms with external systems to check the accuracy of our logs.
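To make the discussion concrete, here is a minimal sketch of that flow; the names `RequestLog`, `send_ack`, and `plan_actions` are stand-ins for our actual components:

```python
import json
import os

class RequestLog:
    """Stand-in for our durable log; appends and fsyncs each record."""
    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # the record must survive a crash

def handle_request(log, request, send_ack, plan_actions):
    log.append({"event": "accepted", "request": request})  # persist first
    send_ack(request["id"])                                # ack only once durable
    for name, action in plan_actions(request):
        action()                                           # possibly long running
        log.append({"event": "done", "request_id": request["id"],
                    "action": name})                       # record progress
```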

This all seems to work fairly well, but we have difficulty saying so with any conviction, as we find it very difficult to test our fault-tolerant code. So far we’ve come up with two strategies, but neither is entirely satisfactory:

  • Have an external process watch the server and try to kill it at what the external process thinks is an appropriate point in the test
  • Add code to the application that will cause it to crash at certain known critical points (a minimal sketch follows this list)
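As an illustration of the second strategy, here is roughly what we mean, assuming the crash points are guarded by an environment variable rather than conditional compilation:

```python
import os
import signal

def crash_point(name):
    """Die abruptly at a named point, but only when fault testing is enabled.
    CRASH_AT is a hypothetical environment variable naming the point to hit;
    SIGKILL mimics a hard crash, so no cleanup handlers get to run."""
    if os.environ.get("CRASH_AT") == name:
        os.kill(os.getpid(), signal.SIGKILL)

def handle_request(request):
    print("persisted", request)    # placeholder for the real persist step
    crash_point("after-persist")   # on restart the request must still be known
    print("acked", request)        # placeholder for the acknowledgement
    crash_point("after-ack")       # the request must still be fulfilled
    print("fulfilled", request)

if __name__ == "__main__":
    handle_request("req-1")        # run with CRASH_AT=after-ack to test recovery
```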

My problem with the first strategy is that the external process cannot know the exact state of the application, so we cannot be sure we’re hitting the most problematic points in the code. My problem with the second strategy, although it gives more control over where the fault occurs, is that I do not like having fault-injection code within my application, even with conditional compilation etc. I fear it would be too easy to overlook a fault-injection point and have it slip into a production environment.


I think there are a few ways to deal with this. First, I would suggest a comprehensive set of integration tests for these various pieces of code, using dependency injection or factory objects to produce broken actions during those tests.
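As a rough sketch of what I mean, assuming your actions share a common callable interface, a test-only factory can wrap the real one and hand back an action that fails partway through a request (`real_factory` and `make_action` are illustrative names):

```python
class BrokenActionFactory:
    """Test-only factory: delegates to the real one, but makes the nth action fail."""
    def __init__(self, real_factory, fail_at):
        self.real_factory = real_factory
        self.fail_at = fail_at
        self.count = 0

    def make_action(self, spec):
        action = self.real_factory.make_action(spec)
        self.count += 1
        if self.count == self.fail_at:
            def broken():
                raise IOError("injected failure in action %r" % (spec,))
            return broken
        return action
```

The integration tests wire the server up with `BrokenActionFactory`; the production wiring never mentions it, so there is no injection code to leak.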

Secondly, running the application with random kill -9s, and disabling network interfaces, may be a good way to test these things.
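A minimal harness for the random-kill part, assuming the server is started as a child process from a hypothetical ./server binary and a separate ./check-consistency verifier exists:

```python
import random
import signal
import subprocess
import time

# Repeatedly start the server, let it run for a random interval, kill -9 it,
# then check that the acknowledgement guarantee still holds after recovery.
for trial in range(100):
    server = subprocess.Popen(["./server"])              # assumed server binary
    time.sleep(random.uniform(0.1, 5.0))                 # random point in its life
    server.send_signal(signal.SIGKILL)                   # abrupt, no shutdown hooks
    server.wait()
    subprocess.run(["./check-consistency"], check=True)  # assumed verifier script
```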

I would also suggest testing file-system failure. How you would do that depends on your OS; on Solaris or FreeBSD I would create a ZFS file system backed by a file, and then rm the file while the application is running.
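Scripted, that could look roughly like this; the pool name and paths are made up, and the exact commands differ a little between Solaris and FreeBSD (mkfile vs truncate, for instance):

```python
import subprocess

BACKING = "/var/tmp/zpool-backing.img"

def run(*cmd):
    subprocess.run(cmd, check=True)

# Create a file-backed pool so the "disk" can be yanked with a simple rm.
run("truncate", "-s", "256m", BACKING)
run("zpool", "create", "crashpool", BACKING)
# ... start the application with its data under /crashpool, then:
run("rm", BACKING)  # the file system now fails underneath the running application
```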

If you are using database code, then I would suggest testing failure of the database as well.

Another alternative to dependency injection, and probably the solution I would use, is interceptors. You can enable crash-test interceptors in your code; these would know the state of the application and introduce the failures listed above, or any others you may want to create, at the correct time. This would not require changes to your existing code, just some additional code to wrap it.
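A minimal sketch of such an interceptor, assuming it is only armed when a crash-test environment variable is set (`PersistentStore` is a hypothetical component being wrapped):

```python
import os
import signal

class CrashTestInterceptor:
    """Wraps a component and kills the process on entry to a watched method.
    Armed only when the (hypothetical) CRASH_TEST_METHOD env var is set."""
    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        if callable(attr) and name == os.environ.get("CRASH_TEST_METHOD"):
            def intercepted(*args, **kwargs):
                os.kill(os.getpid(), signal.SIGKILL)  # crash before the real call
            return intercepted
        return attr

# Test wiring: store = CrashTestInterceptor(PersistentStore(...));
# production code constructs PersistentStore directly and never sees this class.
```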


A possible answer to the first point is to multiply the experiments with your external process so that the probability of hitting the problematic parts of the code is increased. You can then analyse the core dump file to determine where the code actually crashed.
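For example, after enabling core dumps (ulimit -c unlimited), a batch gdb run can pull the crash site out of each dump without an interactive session; the binary name and core file pattern here are assumptions:

```python
import glob
import subprocess

# Print the backtrace of every collected core dump for the ./server binary.
for core in glob.glob("core.*"):
    subprocess.run(["gdb", "--batch", "-ex", "bt", "./server", core])
```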

Another way is to increase observability and/or controllability by stubbing library or kernel calls, i.e. without modifying your application code.
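On Linux, for instance, reasonably recent strace versions can inject syscall failures into an unmodified process; this sketch assumes an strace with fault-injection support is installed and picks write() as an arbitrary target:

```python
import subprocess

# Fail every write() from the third one onwards with EIO,
# as seen by the unmodified server process.
subprocess.run([
    "strace", "-f",
    "-e", "inject=write:error=EIO:when=3+",  # strace syscall fault injection
    "./server",                              # assumed server binary
])
```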

You can find some resources on the Fault injection page of Wikipedia, in particular in the Software Implemented Fault Injection section.


Your concern about fault injection is not a fundamental one. You merely need a foolproof way to prevent such code from ending up in a deployment. One way to do so is by designing your fault injector as a debugger, i.e. the faults are injected by a process external to your process. This already provides a level of isolation. Furthermore, most OSes provide some kind of access control which prevents debugging unless it is specifically enabled. In the most primitive form this means limiting debugging to root; other operating systems require a specific "debug privilege". Naturally, nobody will have that in production, and thus your fault injector cannot even run there.

Practically, the fault injector can set breakpoints at specific addresses, i.e. at a function or even a line of code. You can then react to that, e.g. by terminating the process after a certain breakpoint has been hit three times.
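A rough sketch of that idea, driving gdb from a small script; the function name persist_request and the hit count are made up for illustration:

```python
import subprocess
import sys

pid = sys.argv[1]  # pid of the running server process

# gdb command script: break on a persistence function, let it pass twice,
# then kill the target outright on the third hit.
GDB_SCRIPT = """\
break persist_request
ignore 1 2
commands 1
  kill inferiors 1
  quit
end
continue
"""

with open("crash.gdb", "w") as f:
    f.write(GDB_SCRIPT)

subprocess.run(["gdb", "-p", pid, "--batch", "-x", "crash.gdb"])
```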


I was just about to write the same as Justin :)

The component I would suggest replacing during testing is the logging component (if you have one; if not, I'd strongly suggest implementing one...). It's relatively easy to replace it with code that generates errors, and the logger usually gets enough information to know the current application state.

It also seems feasible to make sure that the testing code doesn't go into production. I would discourage conditional compilation, though, and rather go with a configuration file to select the logging component.
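A minimal sketch of both ideas together, assuming the logger class is named in a config file and the test environment simply points at a fault-injecting one (the [logging] section and the class names are illustrative):

```python
import configparser
import importlib

class FaultInjectingLogger:
    """Test-only logger: behaves normally except it raises on one chosen event."""
    def __init__(self, fail_on="persisted"):
        self.fail_on = fail_on

    def log(self, event, **context):
        if event == self.fail_on:
            raise IOError("injected logging failure at %r" % (event,))
        print(event, context)

def make_logger(config_path="app.ini"):
    # Production config names the real logger class; the test config names
    # FaultInjectingLogger. No test code is reachable without that config entry.
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    module_name, _, class_name = cfg["logging"]["class"].rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)()
```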

Using "random" kills might help to detect errors, but it is not well suited for systematic testing because of its non-determinism. Therefore I wouldn't use it for automated tests.
