MEASURING SERVER SOFTWARE RELIABILITY WITH FOCUSED ITERATIVE TESTING

Pramod Gupta and David Godwin

IBM

psgupta@ca.ibm.com


Abstract

Server software (e.g. database servers, messaging systems) is typically multi-process, multi-threaded and multi-user. It is a fertile source of timing-related defects because the code is complex and many users access the system concurrently. Resources (e.g. structures in memory, files on disk) have to be locked correctly, and transactions have to be written to disk properly and replayed correctly if the system crashes.

We present a general approach for testing server software and measuring its reliability. The concrete example in this presentation involves crash recovery in a database. However, the technique, called Focused Iterative Testing, is general in nature and can be applied to a wide variety of server software.

Testing software to find timing-related defects is a difficult task. The present approach increases the probability of finding such defects, and of reproducing them for debugging purposes. Moreover, we can quantify whether a defect is common or rare: a defect that happens once in a thousand iterations may be called rare, whereas one that occurs once in ten iterations is common.
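As a minimal sketch of this quantification, the following Python snippet estimates how often a defect occurs from a record of iteration outcomes. The function name `defect_frequency` and the boolean encoding of outcomes are assumptions made for illustration; they are not taken from any tool described here.

```python
from fractions import Fraction


def defect_frequency(outcomes):
    """Return the observed fraction of iterations in which the defect occurred.

    `outcomes` is a sequence of booleans, True meaning the defect was
    observed in that iteration (a hypothetical encoding).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("at least one iteration is required")
    return Fraction(sum(outcomes), len(outcomes))


# A defect seen once in 1000 iterations versus once in 10 iterations.
rare = defect_frequency([True] + [False] * 999)
common = defect_frequency([True] + [False] * 9)
print(rare, common)  # 1/1000 1/10
```

Where the boundary between "rare" and "common" lies is a judgment call for each component; the sketch only reports the observed frequency.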

Hardware reliability is commonly measured by mean time to failure. While some hardware defects may be timing-related, the main class of defects is due to wear and tear of the hardware over time.

The same concept of mean time to failure has also been commonly applied to measure software reliability. However, using mean time to failure for this purpose has limitations. Time to failure depends on the hardware being used (memory, CPU). Also, a large mean time to failure does not imply excellent quality for events that are rare but that the software must handle correctly when they do happen. For example, a database server may have a large mean time to failure during normal operations; however, if there is a power outage, it must be able to handle crash recovery successfully.

In order to overcome the limitations of mean time to failure, we use the mean number of successful iterations to measure the reliability of a particular component of server software. This is a precise quantitative measure which is independent of the hardware on which the server software is run. We believe that measuring the number of successful iterations, rather than time to failure, is a better way to measure software reliability.
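One way to compute such a measure is sketched below in Python. It assumes the metric is the average length of the runs of consecutive successful iterations that end in a failure, by analogy with mean time to failure; that interpretation, the function name `mean_successful_iterations` and the sample log are assumptions for illustration only.

```python
def mean_successful_iterations(outcomes):
    """Mean number of consecutive successful iterations before each failure.

    `outcomes` is a sequence of booleans, True meaning the iteration
    succeeded. A trailing run of successes that has not yet ended in a
    failure is ignored, because its final length is not yet known
    (an assumption of this sketch).
    """
    runs = []
    current = 0
    for ok in outcomes:
        if ok:
            current += 1
        else:
            runs.append(current)
            current = 0
    if not runs:
        raise ValueError("no failure observed; the mean is undefined")
    return sum(runs) / len(runs)


# Hypothetical log: 3 successes, a failure, 5 successes, a failure, 2 successes.
log = [True] * 3 + [False] + [True] * 5 + [False] + [True] * 2
print(mean_successful_iterations(log))  # 4.0
```

Because the count is in iterations rather than seconds, the same log yields the same value regardless of how fast the underlying machine is.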

The Focused Iterative Testing approach consists of the following steps:

1) Find a focus area which has timing-related defects, e.g. crash recovery.

2) Identify the finite set of commands to be tested, e.g. the restart database command after a crash.

3) Write a tool to automate the whole test case and control the background workload. Note that even though the number of commands being tested is finite, the background workload can have any content, e.g. for a database server, the background workload would consist of SQL and database administration commands.

4) Run the tool repeatedly until the target number of successful iterations is regularly reached. Track the number of successful iterations as a function of time, e.g. for testing crash recovery in a database server, we would track the number of successful crash recovery iterations. During a release cycle, this number can be tracked and the decision to ship or not ship the feature can be based on whether the target number of iterations was reached.
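The steps above can be sketched as a small Python harness. This is an illustration under stated assumptions, not the tool used by the authors: `run_focused_iterations` is a hypothetical driver, and `simulated_crash_recovery` is a random stand-in for a real crash and restart cycle against a server under background workload.

```python
import random


def run_focused_iterations(run_iteration, target):
    """Repeat `run_iteration` until it fails or `target` successes are reached.

    `run_iteration` is a hypothetical callable that starts the background
    workload, triggers the event under test (e.g. a crash followed by a
    restart) and returns True if the command under test succeeded.
    Returns the number of consecutive successful iterations.
    """
    successes = 0
    while successes < target:
        if not run_iteration():
            break
        successes += 1
    return successes


def simulated_crash_recovery(rng, failure_probability=0.01):
    """Stand-in for a real crash and restart cycle (simulation only)."""
    return rng.random() >= failure_probability


rng = random.Random(42)  # fixed seed so the simulated run is repeatable
target = 1000
count = run_focused_iterations(lambda: simulated_crash_recovery(rng), target)
print(f"{count} successful iterations (target {target})")
print("target reached" if count == target else "target not reached")
```

In a real setting, the count returned by each run would be recorded over the release cycle and compared against the target when deciding whether the feature is ready to ship.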

We have applied this approach with great success to Crash recovery, Fast Communication Manager, Monitoring, Workload Management, Faster redistribute and other components of DB2. In our presentation we will describe the general applicability of Focused Iterative Testing to measuring the reliability of server software.