Scientific Workflows:
Scientific Computing Meets Transactional Workflows

Munindar P. Singh and Mladen A. Vouk
Department of Computer Science
North Carolina State University
Raleigh, NC 27695-7534

+1.919.515.5677 (voice)
+1.919.515.7925 (fax)
singh@ncsu.edu

+1.919.515.7886 (voice)
+1.919.515.7925 (fax)
vouk@adm.csc.ncsu.edu




Abstract

We introduce the idea of Scientific Workflows as an amalgamation of scientific problem-solving and traditional workflow techniques. Scientific workflows share many features of business workflows, but also go beyond them. Many known workflow results and techniques can be leveraged in scientific settings, and many additional features of scientific applications can be usefully deployed in business settings. Scientific workflows promise to become an important area of research within workflow and process automation, and will lead to the development of the next generation of problem-solving and decision-support environments. In the spirit of this NSF workshop, we focus on the conceptual aspects.

1. Introduction

Workflows have drawn an enormous amount of attention in the databases and information systems research and development communities [Elm92], [Geo95], [Hsu93]. Over 100 workflow products of various shapes and sizes now exist. Much of the recent research interest in workflows has been focused on workflows as they arise in business environments, e.g., [Ell79]. The products too are geared to enterprise computing, e.g., [Ley94]. Although business workflows deserve the attention they are receiving, another class of workflows emerges naturally in sophisticated scientific problem-solving environments. We believe this class of workflows, which we dub scientific workflows, will become ever more important as computing expands into the routine activities of scientists. Indeed, there are compelling reasons why scientific workflows should be of significance to the research community:

  1. Although business applications are important, some of the heaviest users of computing are in the sciences.
  2. The sciences are becoming increasingly computation-intensive. It is no longer possible for scientists to carry out their day-to-day activities without heavy use of computing. This holds in fields and problem areas as diverse as computational biology, chemistry, genetics, electrical utility management, and reasoning about the environment.
  3. Scientific workflows, as we understand them, are crucial to the success of major initiatives in high-performance computing. As parallel computing expands, systems such as PVM and standards such as MPI encourage scientists to construct complex distributed solutions that span networks [Bal94] and, through web-based interfaces, invite incorporation into still more complex systems that may include interactions with economic and business flows. Workflows represent the logical culmination of this trend. They provide the abstractions necessary for effective use of computational resources and for the development of robust problem-solving environments that marshal high-performance computing resources.
  4. Scientific techniques can be generalized and marshaled for business workflows. These include process simulation techniques [Elm95].

Section 2 defines scientific workflows and discusses their similarities with, and differences from, business workflows. Section 3 highlights some of the key research challenges in scientific workflows. Section 4 describes two of our recent prototype systems that incorporate scientific workflow concepts, and shows how they might be synthesized into a powerful theory and tools for scientific workflow management.

We emphasize that the principal aim of this document is to identify the key issues, and some promising ways of thinking about them, rather than to present complete solutions.

2. What are Scientific Workflows?

We use the term scientific workflows as a blanket term to describe the structured series of activities and computations that arise in scientific problem-solving. In many science and engineering areas, the use of computation is not only heavy, but also complex and structured, with intricate dependencies. Graph-based notations, e.g., generalized activity networks (GAN), are a natural way of representing numerical and human processing [Den96, Elm95, Elm66]; a minimal sketch of such a representation appears after the list below. These structured activities are often termed studies or experiments. Nevertheless, they bear the following similarities to what the databases research community calls workflows.
  1. Scientific problem-solving usually involves the invocation of a number and variety of analysis tools. However, these are typically invoked in a routine manner: the computations involve much detail (e.g., sequences of format translations that ensure that the tools can process each other's outputs), and often routine verification and validation of the data and the outputs. As scientific data sets are consumed and generated by the pre- and post-processors and simulation programs, the intermediate results are checked for consistency and validated to ensure that the computation as a whole remains on track.
  2. Semantic mismatches among the databases and the analysis tools must be handled. Some of the tools are designed for performing simulations under different circumstances or assumptions, which must be accommodated to prevent spurious results. Heterogeneous databases are extensively accessed; they also provide repositories for intermediate results. When the computation runs into trouble, semantic rollforward must be attempted; just as for business workflows, rollback is often not an option.
  3. Many large-scale scientific computations of interest are long-term, easily lasting weeks if not months. They can also involve much human intervention. This is especially so during the early stages of process (workflow) design; as the workflows are debugged, the exceptions that arise come to be handled automatically. Thus, in the end, production runs frequently require no more than semiskilled human support. The roles of the participating humans must be explicitly represented to enable effective intervention by the right person.
  4. The computing environments are heterogeneous. They include individual supercomputers as well as networks of workstations and supercomputers. This puts additional stress on run-time support and management. Also, users typically want some predictability in the time a given computation will take to complete. Making estimates of this kind is extremely complex and requires performance modeling of both the computational units and the interconnecting networks.
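To ground the graph-based view referred to above, the following is a minimal sketch, in Python, of a study recorded and executed as a partial order of processing steps with a validation stage, in the spirit of item 1. The Study class and all step names (emissions, translate, simulate, validate) are hypothetical placeholders of our own, not part of any existing system.

```python
# A minimal sketch of a scientific study as a partial order (DAG) of
# tool invocations.  All names are hypothetical.

from collections import defaultdict

class Study:
    """A partial order of processing steps; edges carry data dependencies."""

    def __init__(self):
        self.steps = {}               # step name -> callable taking {dep: result}
        self.deps = defaultdict(set)  # step name -> names of prerequisite steps

    def add_step(self, name, func, after=()):
        self.steps[name] = func
        self.deps[name] |= set(after)

    def run(self):
        """Execute every step in some order consistent with the partial order."""
        done, results = set(), {}
        while len(done) < len(self.steps):
            ready = [n for n in self.steps
                     if n not in done and self.deps[n] <= done]
            if not ready:
                raise RuntimeError("cyclic or unsatisfiable dependencies")
            for name in ready:
                inputs = {d: results[d] for d in self.deps[name]}
                results[name] = self.steps[name](inputs)
                done.add(name)
        return results

# Hypothetical stages of an air-quality study.
study = Study()
study.add_step("emissions", lambda _: {"NOx": 1.0})
study.add_step("translate", lambda r: {"model_input": r["emissions"]["NOx"]},
               after=["emissions"])
study.add_step("simulate", lambda r: {"ozone_ppm": 0.07}, after=["translate"])
study.add_step("validate", lambda r: 0.0 < r["simulate"]["ozone_ppm"] < 0.5,
               after=["simulate"])
print(study.run())
```

The point is only that the partial order itself, rather than a fixed sequential script, is the unit of specification; any comparable graph representation would serve equally well.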

Consequently, it is appropriate to view these coarse-granularity, long-lived, complex, heterogeneous, scientific computations as workflows. Although we introduce the term scientific workflows here, we emphasize that the activities this term covers are to a large extent already carried out by practitioners in scientific computing. However, by describing these activities as workflows, we hope to bring to bear on them the advanced techniques being developed in workflows research. These include sophisticated notions of workflow specification and of toolkits and environments for describing and managing workflows. In this way, scientific workflows will be to problem-solving environments what business workflows are to enterprise integration. Further, by making the connections explicit, we also hope to draw upon research in software process modeling and engineering [Cur92], which has no obvious correlate at an appropriately high level in scientific computing.

3. Challenges

Scientific workflows go beyond business workflows. Therefore, it stands to reason that existing tools, which are inadequate even for business settings, would also be inadequate for scientific workflows. Certain research challenges must be surmounted in order for scientific workflow management to become practicable. We identify two classes of such challenges. The first category applies to workflows in general. This category includes the usual issues, such as i) handling exceptions, ii) handling the roles of different participants and allowing the role bindings to change, iii) declarative specification of control and data flow, and iv) automatic execution and monitoring of workflows to meet stated specifications. Solutions to these issues can be leveraged for scientific computing when workflow techniques are applied there.

The second category is specific to scientific workflows and includes the features required for scientific computations, but which may not be adequately addressed in traditional workflows research. This category includes issues such as i) the ability to handle a vast number and variety of analysis tools, not just database systems, ii) interfacing to a diverse array of computational environments including supercomputers, and iii) the ability to handle activity mixes that are different from typical business profiles. These would be extensions to current workflow research, but would pay off ultimately in future business applications.

Scientific workflows often begin as research workflows and end up as production workflows. Early in the lifecycle, they require considerable human intervention and collaboration; later they begin to be executed increasingly automatically. Thus in the production mode, there is typically less room for collaboration at the scientific level and the computations are more long-lived. This happens partly because of limitations of the available technology. We speculate that if true workflow technology were available to manage scientific computations, there would be a reduced push to automate everything and the quality of the solutions obtained could be improved by involving the right humans at the appropriate places.

Be that as it may, during the research phase, scientific workflows need to be enacted and animated far more intensively than business workflows. In this phase, which is more extensive than the corresponding phase for business workflows, the emphasis is on execution with a view to design, and thus naturally includes iterative execution. The corresponding activity can be viewed as a correlate of business process engineering. For this reason, the approaches for constructing, managing, and coordinating process models will find useful application in scientific settings, provided the main problems are cast appropriately. Also, the techniques for animation and enactment can be fed into business process design. Thus ideas from process modeling can be incorporated, but because of the intensity of the tasks and the stress on enactment, those ideas must be realized using general workflow techniques. In the production phase, there is still a need for human intervention, more than present scientific environments can support. True gains will be attained by extending scientific environments to use workflow techniques, rather than by restricting them to fully automatic distributed systems.

Some of the features that scientific workflows need and can be imported from classical workflow paradigms are i) succinct and natural, declarative specification of the workflows themselves, ii) high-level views of the computations, iii) elegant incorporation of human decisions into the process, and iv) coordination and synchronization with other scientific and business workflows.

On the other hand, characteristics that go beyond business workflows are also important. These include i) the preponderance of analysis tools relative to databases, ii) the relative uniqueness of each workflow, particularly during the research phase when there is less opportunity to use canned or "normal" solutions, iii) the explicit representation of knowledge needed at different stages, and iv) the auditability of the computations when their results are used to make decisions that carry regulatory or legislative implications.

Considerable progress has been made both in i) the implementation of complex systems of scientific computations, and ii) workflow specification and scheduling. Despite this, there is currently no unified theory or system that formalizes scientific workflows as defined above.

4. Prototypes and Research Directions

This has motivated us to attempt to fuse our experiences from the scientific problem-solving and computational community with those from the workflow community. Below, we describe two recent, independent efforts in these areas in which we have been engaged. One system is used for specifying and scheduling arbitrary workflows, the other for enacting and managing scientific computations. We show how their limitations with respect to scientific workflows may be addressed by unifying them into a cohesive approach that also accommodates the insights of colleagues at other institutions. This promises rich rewards in building workflow management systems that can handle scientific as well as business workflows.

4.1. The MCNC Environmental Decision Support System

Although computations that can be appropriately studied as scientific workflows arise in a number of areas, we give one specific example so as to ground our discussion in our experience. We consider the case of study management for environmental applications.

The Environmental Decision Support System (EDSS) is an experimental system being constructed by the MCNC North Carolina Supercomputing Center (NCSC) in collaboration with NC State [Amb95, Vou95]. It involves a study planner, a scheduler, a visualization subsystem, and an object-oriented repository interconnected by a lightweight "software bus" [Bal96]. The system provides access to heterogeneous database systems that, in the near future, will include a GIS. A study is modeled as a partial order of program invocations and (possibly) human interventions. The partial order describes the flow of data from one program to the next. Each program performs some useful function, such as a simulation, visualization, or data reduction, and consumes and produces scientific and other data sets. Although an EDSS prototype is in the alpha-testing phase, the full system will require integration of scientific and appropriate economic and business models and workflows so that regulatory decisions can be fully qualified. The original EDSS design plans call for rule-based process flow control [Coa93]. However, the current graph-based implementation stops short of that, mostly for lack of an adequate formal semantic and structural specification framework. We hope that the theory of scientific workflows will provide this framework.
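The internals of the EDSS software bus are not described here; purely as an assumed sketch, the fragment below shows how loosely coupled components (scheduler, visualizer, repository) could exchange study events over a lightweight publish/subscribe bus. The SoftwareBus class and topic names are illustrative and are not the actual [Bal96] interface.

```python
# A toy publish/subscribe "software bus" loosely coupling workflow components.
# Class, topic, and component names are hypothetical.

from collections import defaultdict

class SoftwareBus:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)

bus = SoftwareBus()

# Hypothetical components registering interest in study events.
bus.subscribe("study.step.completed", lambda m: print("visualizer: plot", m))
bus.subscribe("study.step.completed", lambda m: print("repository: archive", m))

# A hypothetical scheduler announces that a simulation step has finished.
bus.publish("study.step.completed", {"step": "simulate", "status": "ok"})
```

A real software bus would of course add naming, distribution, and fault handling; the sketch shows only the loose coupling that makes it easy to add or replace components.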

4.2. Workflow Specification and Scheduling a la Carnot

One of the authors has been involved in the design and implementation of three workflow specification and execution systems in industry. The first system was based on an expert system shell [Sin94]; the second was based on temporal logic [Att93]; the third on temporal logic and process algebra [Sin96a, Sin96b]. All three systems were implemented, the last as a fully distributed system.

What this research provides is a generic facility through which computations can be structured in terms of selected events of the constituent tasks. These events are the ones significant for coordinating the various tasks. Although this research makes substantial progress in understanding the distributed events that underlie workflows, and gives formal semantics and scheduling algorithms, it remains at a somewhat low level of abstraction. It does not provide a view of computations from the perspective of the user. This visual aspect of the flows is well covered by the EDSS "Planner", HeNCE [Beg93, Beg94], VPE [Don95], and a variety of similar facilities. On the other hand, we expect that the extension of this research into the realm of scientific workflows will provide the process control specification and enactment framework that EDSS and similar scientific systems need for full implementation of their decision-support functionalities.
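As a toy illustration of coordinating tasks through their significant events, and emphatically not the actual implementation of the systems cited above, the sketch below guards one intertask dependency of the form "task B may start only after task A commits".

```python
# A toy sketch of enforcing an intertask dependency on significant events:
# "B may start only after A commits".  All names are hypothetical.

class Task:
    """A task exposing its significant events (start, commit) to a coordinator."""

    def __init__(self, name):
        self.name = name
        self.state = "initial"  # initial -> started -> committed

    def start(self):
        self.state = "started"
        print(self.name, "started")

    def commit(self):
        self.state = "committed"
        print(self.name, "committed")

def attempt_start(task, guard):
    """Permit the 'start' event of task only if its dependency guard holds."""
    if guard():
        task.start()
        return True
    print(task.name, "start deferred: dependency not yet satisfied")
    return False

a, b = Task("A"), Task("B")
b_may_start = lambda: a.state == "committed"  # "B starts only after A commits"

attempt_start(b, b_may_start)  # deferred: A has not committed yet
a.start()
a.commit()
attempt_start(b, b_may_start)  # now allowed
```

The actual systems express such dependencies declaratively, in temporal logic or process algebra, and derive schedulers from them; the guard above merely illustrates the underlying idea of constraining significant events.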

4.3. Proposed Approach

Although the above examples by no means provide an exhaustive survey of the issues in scientific workflows, they are suggestive of the kinds of techniques and technologies that are available and of the major challenges that remain to be overcome. Our proposed approach builds on our previous work by attempting to unify the complementary strengths of these two efforts. Our objectives can be summarized succinctly as follows.

Our approach involves a notion of a partial order of computations, as is usual. However, unlike current approaches, we allow the partial order or digraph of computations to be specified dynamically. Thus, not all possibilities need to be anticipated in advance; they can instead be encoded more compactly in a rule-based manner and automatically invoked when necessary. The graph representation nevertheless enables analysis and optimization. This can be ignored in many business settings, but is particularly important in computation-intensive applications executed on supercomputers and large networks. Eventually it will also enable metareasoning about the control structures with a view to estimating the resources required and producing optimal execution plans.
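A minimal sketch of what dynamic, rule-based specification of the computation graph could look like follows; the rule format and all names are our own illustration rather than a settled design.

```python
# Rule-driven, dynamic expansion of a computation graph: a step is added only
# when its triggering condition over the results gathered so far becomes true.

def refine_grid(results):
    # Hypothetical extra computation triggered by a high intermediate value.
    return {"refined": results["simulate"] / 2}

# Each rule: (condition over accumulated results, new step name, step function).
rules = [
    (lambda r: r.get("simulate", 0.0) > 0.1, "refine", refine_grid),
]

def expand(graph, results):
    """Add steps whose triggering conditions now hold; return their names."""
    added = []
    for condition, name, func in rules:
        if name not in graph and condition(results):
            graph[name] = func
            added.append(name)
    return added

graph = {"simulate": None}    # steps scheduled so far (functions elided)
results = {"simulate": 0.15}  # hypothetical intermediate result

for name in expand(graph, results):
    results[name] = graph[name](results)
print(results)  # {'simulate': 0.15, 'refine': {'refined': 0.075}}
```

Because the rules only ever add nodes and edges, the accumulated graph remains available for the analysis, optimization, and metareasoning mentioned above.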

5. Conclusions

We have argued that scientific computations are no less structured or complex than their business or enterprise integration cousins. Further, scientific workflows are an interesting research concept from the perspective of databases. One, scientific computations in problem-solving environments, which are of great importance, have the key features of workflows and provide a rich testbed on which to apply workflow ideas. Two, scientific workflows are sufficiently different from business workflows to merit separate study and will lead to a number of interesting research problems that have not come up in traditional business environments.

Computations more similar to scientific than to business workflows also arise in other applications, e.g., conducting marketing analyses, producing legal briefs, or performing decision-support analyses in general. Consequently, many of the research advances made with scientific workflows will also have ramifications in the broader segments of decision-making applications. We conjecture that existing problem-solving and workflow computing environments will be merged into powerful decision-support environments that will find widespread use wherever computing is prevalent today.

References

  1. J. Ambrosiano, R. Balay, C. Coats, A. Eyth, S. Fine, D. Hils, T. Smith, S. Thorpe, T. Turner, and M. Vouk, "The Environmental Decision Support System: Air Quality Modeling and Beyond," Proceedings of the U.S. EPA Next Generation Environmental Modeling Computational Methods (NGEMCOM) Workshop, Bay City, Michigan, August 7-9, 1995.
  2. P.C. Attie, M.P. Singh, A.P. Sheth, and M. Rusinkiewicz, "Specifying and Enforcing Intertask Dependencies," Proceedings of the 19th Very Large Databases Conference (VLDB), August 1993.
  3. R. Balay and V. Wall. "Use of File Transport Wrappers for a HeNCE/PVM Implementation of the Urban Airshed Model," PVM Users' Group Meeting, Oak Ridge, Tennessee, May 19-20, 1994.
  4. R. Balay and M. A. Vouk. "A Lightweight Software Bus for Prototyping Problem Solving Environments," Accepted for the Special Session on Networks and Distributed Systems in the Eleventh International Conference on Systems Engineering, Las Vegas, 1996.
  5. A. Beguelin, J. Dongarra, A. Geist, R. Manchek, K. Moore, and V. Sunderam, "PVM and HeNCE: Tools for Heterogeneous Network Computing," Environments and Tools for Parallel Scientific Computing, J. Dongarra and B. Tourancheau (eds.), Advances in Parallel Computing, Vol. 6, North-Holland, 1993.
  6. A. Beguelin, J. Dongarra, A. Geist, and R. Manchek, "HeNCE: A Heterogeneous Network Computing Environment," Scientific Programming, Vol. 3, No. 1, pp. 49-60.
  7. C. Coats, "Classes for the Models-3 System," Requirements Documentation for Models3 Project, EPA, January 1993.
  8. B. Curtis, M.I. Kellner, and J. Over, "Process Modeling," Communications of the ACM, Vol. 35(9), pp. 75-90, September 1992.
  9. R.L. Dennis, D.W. Byun, J.H. Novak, K.J. Galluppi, C.C. Coats, and M.A. Vouk, "The Next Generation of Integrated Air Quality Modeling: EPA's Models-3," Atmospheric Environment, accepted, in print, expected 1996.
  10. J. Dongarra and P. Newton, "Overview of VPE: A Visual Environment for Message-Passing Parallel Programming," Heterogeneous Computing Workshop '95, Proceedings of the 4th Heterogeneous Computing Workshop, Santa Barbara, CA, April 25, 1995.
  11. C. A. Ellis, "Information Control Nets: A Mathematical Model of Office Information Flow", Proceedings of the Conference on Simulation, Measurement and Modeling of Computer Systems, 1979.
  12. M. Hsu (ed.), "Special Issue on Workflow and Extended Transaction Systems", IEEE Data Engineering, Vol. 16(2), June 1993.
  13. S.E. Elmaghraby, "On generalized activity networks," J. Ind. Eng., Vol. 17, pp. 621-631, 1966.
  14. A.K. Elmagarmid, "Database Transaction Models for Advanced Applications", Morgan Kaufmann, 1992.
  15. S.E. Elmaghraby, E.I. Baxter, and M.A. Vouk, "An Approach to the Modeling and Analysis of Software Production Processes," Intl. Trans. Operational Res., Vol. 2(1), pp. 117-135, 1995.
  16. D. Georgakopoulos, M. Hornick, and A. Sheth, "An Overview of Workflow Management: From Process Modeling to Workflow Automation Infrastructure," Distributed and Parallel Databases, Vol. 3(2), April 1995.
  17. F. Leymann and W. Altenhuber, "Managing Business Processes as an Information Resource", IBM Systems Journal, Vol. 33(2), pp. 326-348, 1994.
  18. M.P. Singh and M.N. Huhns, "Automating Workflows for Service Provisioning: Integrating AI and Database Technologies," IEEE Expert, Vol. 9(1), October 1994.
  19. M.P. Singh, "Formal Semantics for Workflow Computations", January 1996. Extends "Semantical Considerations on Workflows: Algebraically Specifying and Scheduling Intertask Dependencies," Proceedings of the 5th International Workshop on Database Programming Languages (DBPL), September 1995.
  20. M.P. Singh, "Synthesizing Distributed Constrained Events from Transactional Workflow Specifications," Proceedings of the 12th International Conference on Data Engineering (ICDE), March 1996.
  21. M.A. Vouk, R. Balay, and J. Ambrosiano, "EDSS - An Environment for Large-Scale Numerical Computing and Decision Making," International IFIP/WG 2.5 Workshop on Current Directions in Numerical Software and High Performance Computing, Kyoto, Japan, October 16-17, 1995.

© Singh & Vouk. All rights reserved. Permission to copy is granted for research and academic purposes provided this notice is included intact. Contact address:

singh@ncsu.edu