WoSAR Keynote Talks

Larry Bernstein

Historical perspectives on software rejuvenation

Distributed and mobile computing breaks down physical barriers. Virtual offices are wherever people are, and databases stretch around the world. Lawyers carry entire legal libraries into court in a 2-pound PC. Express delivery drivers update their corporate databases when a recipient signs a handheld device. Water meters are read as trucks drive through neighborhoods. Tolls are paid as cars drive through toll booths. Reliable computers are an essential foundation of this new way of life. Hardware and network elements are reasonably reliable, but now the software must be demonstrably so. The idea that reliable software is critical is not new: insights from the 1960s led to software rejuvenation research and use by the 1990s. Here is my story of how this happened, why the use of this powerful technology is not widespread, and why it should be.

Kishor Trivedi

The Role of Measurements and Models in Software Rejuvenation

The study of software failures has become more important since it was recognized that computer system outages are due more to software faults than to hardware faults. Recently, the phenomenon of "software aging", in which the state of a software system degrades with time, has been reported in widely used software and also in high-availability and safety-critical systems. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation. This may eventually lead to performance degradation of the software, a crash/hang failure, or both. To counteract this phenomenon, a proactive approach to fault management called "software rejuvenation" has been proposed. It essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. This process removes the accumulated errors and frees up operating system resources. The preventive action can be performed at optimal times (for example, when the load on the system is low) so that the overhead due to planned system downtime is minimal. The method therefore avoids unplanned and potentially expensive system outages due to software aging. In this talk, we start with a classification of software faults. We then discuss methods of evaluating the effectiveness of proactive fault management in operational software systems and of determining optimal times to perform rejuvenation. This is done by developing stochastic models that trade off the cost of unexpected failures due to software aging against the overhead of proactive fault management. Next, we present measurements from a real system that exhibits aging. We then discuss how to predict the time to resource exhaustion using the measured data. Finally, we describe the use of rejuvenation in cluster systems and its implementation in a major commercial system.
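
As a rough illustration of predicting the time to resource exhaustion from measured data, the sketch below fits a straight line to samples of a free resource (for example, free memory) and extrapolates to the point where it reaches a floor. This is only a minimal sketch under stated assumptions: the abstract does not say which estimator the talk uses, so ordinary least squares is a stand-in here, and the function name, parameters, and sample values are all hypothetical.

```python
from typing import Optional, Sequence


def estimate_time_to_exhaustion(
    times: Sequence[float],
    free: Sequence[float],
    floor: float = 0.0,
) -> Optional[float]:
    """Estimate the time remaining until a free resource reaches `floor`.

    Assumption (not from the abstract): the resource declines roughly
    linearly, so an ordinary least-squares line is a usable trend model.
    Returns the time remaining after the last sample, or None when the
    fitted trend is flat or increasing (no exhaustion predicted).
    """
    n = len(times)
    if n < 2 or n != len(free):
        raise ValueError("need at least two samples of equal length")

    mean_t = sum(times) / n
    mean_y = sum(free) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, free))

    slope = sxy / sxx                      # resource units per time unit
    if slope >= 0:
        return None                        # no downward trend detected
    intercept = mean_y - slope * mean_t

    t_exhaust = (floor - intercept) / slope  # time at which the line hits floor
    return max(0.0, t_exhaust - times[-1])


if __name__ == "__main__":
    # Hypothetical samples: hours since restart, free memory in MB.
    hours = [0, 1, 2, 3]
    free_mb = [100, 90, 80, 70]
    print(estimate_time_to_exhaustion(hours, free_mb))
```

An estimate like this could feed a rejuvenation policy, for instance by scheduling a restart in a low-load window that falls before the predicted exhaustion time. Real measurements tend to be noisy and seasonal, so a more robust trend estimator may be preferable in practice.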