
Performance-Oriented System Test in High Performance Computing Clusters

Bin Ye

The 17th IEEE International Symposium on Software Reliability Engineering (ISSRE 2006) -- Industry Practices
Raleigh, North Carolina, USA, 6-10 November 2006


Abstract

In a high performance computing (HPC) cluster, system performance is affected by a wide range of factors, including hardware configuration, the software stack, and the tuning of all related components. For example, GPFS file system performance depends on the storage subsystems, the cluster interconnect, and the software stack. In earlier cluster system tests in our lab, system performance received no attention until a performance-related problem surfaced while a large cluster was in use by many testers. Problem determination then caused cluster down-time, which impeded overall system test progress and could have delayed the delivery schedule for the clustering software.

To solve this problem, a performance-oriented methodology was proposed and implemented in HPC cluster system test. The key to this methodology is to measure the performance of every major subsystem that could affect system performance at each stage of system testing, most importantly at the cluster bring-up stage. Full-scale system test does not start until the performance of all major subsystems is understood and acceptable system performance is achieved. Performance is then monitored closely throughout the testing cycle so that performance-related defects can be detected.
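The bring-up gate described above can be illustrated with a minimal sketch. It is not taken from the paper: the class and function names, the subsystem names, and the throughput figures are all hypothetical, and it assumes each subsystem is summarized by a single measured throughput compared against a minimum acceptable value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsystemCheck:
    """One subsystem measurement (hypothetical structure, not from the paper)."""
    name: str        # e.g. "interconnect" or "storage"
    measured: float  # measured throughput, in any consistent unit
    required: float  # minimum acceptable throughput, same unit

    def passed(self) -> bool:
        # A subsystem is acceptable when it meets or exceeds its minimum.
        return self.measured >= self.required


def failing_subsystems(checks):
    """Return the names of subsystems that fall below their required minimum."""
    return [c.name for c in checks if not c.passed()]


def ready_for_full_scale_test(checks):
    """Gate: proceed only if at least one check exists and none of them fail."""
    checks = list(checks)
    return bool(checks) and not failing_subsystems(checks)


# Illustrative numbers only; they are assumptions, not measurements from the paper.
bring_up = [
    SubsystemCheck("interconnect", measured=950.0, required=900.0),
    SubsystemCheck("storage", measured=310.0, required=400.0),
]
blocked_by = failing_subsystems(bring_up)
proceed = ready_for_full_scale_test(bring_up)
```

In this illustration the storage check is below its minimum, so `proceed` is false and `blocked_by` names the subsystem to investigate before full-scale testing begins. The same checks could be rerun during the test cycle to watch for regressions.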

As an example, this methodology was applied in testing the GPFS 3.1 release and compared with testing of the earlier GPFS 2.3 release, when the methodology had not yet been adopted. In testing GPFS 2.3, many network and storage performance problems were encountered during the regular test cycle. Considerable effort was spent debugging them, resulting in only 75% completion of planned testing by the scheduled delivery date.

Drawing on the lessons learned from the GPFS 2.3 release, we adopted this performance-oriented methodology for GPFS 3.1. Setup time was shortened, and both the number of defects returned because of network problems and the system down-time spent debugging network and storage problems were reduced. 100% testing completion was achieved on schedule.


  