Reliability Prediction for Component-based Software Architectures
This page describes a scientific approach for predicting the reliability of IT systems with component-based software architectures. The approach builds upon a model of the system's architecture rather than on the system itself. Hence, it is broadly applicable throughout the software development process, including early architectural design stages, when the implementations of the involved components are not yet fully available. Software architects may use the approach to compare multiple design alternatives with respect to their expected reliability, or to identify reliability-critical parts of the architecture or steps during service execution.
The following introduces the basics of the approach, gives practical guidance for reliability modelling and prediction, references relevant scientific publications and case studies and provides access to the tool support.
An IT system is perceived as reliable by its users if it provides service as expected, meaning that it produces the expected results without any unwanted side effects. Any deviation from service as expected is defined to be a failure. System reliability is defined as the probability that a system operates failure-free in a specified execution environment for a specified time interval. More concretely, our approach predicts the system's reliability as the probability that a system user proceeds through a specified usage scenario (a sequence of system service invocations) without experiencing a failure.
The approach is based on the assumption that a system's reliability is essentially determined by its software architecture and the failure potentials it contains. We interpret the term software architecture in a broad sense that includes the following aspects:
- the structure of an IT system in terms of its included software component instances and their interconnections;
- the provided and required interfaces of each software component, as well as its internal (high-level) control and data flow;
- the resource environment of the system with its computing nodes, their interconnections and included hardware resources;
- the allocation of software components to computing nodes and the usage of hardware resources during service execution;
- the usage of system-external services for providing the system's own services;
- the system's usage profile in terms of a set of usage scenarios, where each scenario specifies the sequences of invoked system services and their input parameter properties;
- the software, hardware and network failure potentials that the system comprises;
- the failure potentials associated with system-external service invocations;
- the capabilities of service execution to recover from local failure occurrences and to prevent them from reaching the system's boundaries.
Our approach provides a design-oriented architecture modelling language for IT systems, which allows for explicit representation of all the aspects listed above. Furthermore, the approach includes an analysis method that evaluates the reliability of a given architecture specification. Together, the modelling language and the analysis method support software architects in their design decisions.
Our approach is based on the Palladio Component Model (PCM) as a design-oriented modelling language for component-based software architectures. The description in this section assumes that the reader is familiar with general architecture modelling as done with PCM (for further information, see the PCM tutorials). The reliability-specific extensions of the PCM meta-model allow for expressing failure potentials and capabilities for failure recovery that exist within the system's architecture:
- Software failure potentials: Software implementations can be flawed for various reasons. Programming errors can lead to unexpected situations during service execution (such as division by zero). Specification errors may lead to unexpected behaviour even though the implementation accurately realizes a given specification. Moreover, implemented computational procedures may be subject to natural limitations (e.g., algorithms for image recognition or virus detection naturally have success rates of less than 100%). Our approach allows for annotating computational actions during service execution with independent per-visit failure probabilities, expressing that a visit to the action may lead to a failure of a specified type with the specified probability.
- Hardware failure potentials: Our approach associates hardware-related failure potentials with the individual hardware resources that are modelled as part of the system's resource environment. Due to inevitable wear-out effects, the lifetime of these resources is naturally limited. When a resource becomes unavailable, all accesses to this resource during service execution lead to hardware-induced failure occurrences, until the resource is eventually repaired or replaced. Our approach expresses limited resource availability by annotating each resource with a Mean-Time-To-Failure (MTTF) and a Mean-Time-To-Repair (MTTR) value.
- Network failure potentials: Distributed IT systems include network connections, which can be affected by various phenomena such as communication overload, transmission protocol errors, physical interference on transmission lines or unavailability of transmission devices, leading to corruption or loss of the transported messages. Our approach allows for annotating a network communication link between two computing nodes with a transmission failure probability, i.e. the probability that a network-induced failure occurs when the link is used for transmitting a service invocation or return message between two software components.
- System-external failure potentials: Today's IT systems are increasingly interconnected in the sense that one system acts as a user of other systems, i.e. invokes (external) service operations of other systems in order to provide its own (target) services. In such a situation, external service failures can lead to failures of the target services of the invoking system. Consequently, our approach allows for annotating an invocation of a system-external service operation with a failure probability, denoting the possibility that the invocation may result in a failure.
- Failure recovery: IT systems typically include fault tolerance (FT) capabilities, which allow them to autonomously recover from local failure occurrences and to prevent these occurrences from reaching the system's boundaries. Such capabilities significantly influence the user-perceived system reliability and should be explicitly planned during system design. To this end, our approach introduces a new action type in the PCM's behavioural meta-model, expressing how service execution can mask failure occurrences through specific failure-handling behaviours and return to normal operation afterwards.
To find out more about the discussed modelling capabilities, see PCM-based Reliability Modelling.
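The annotations listed above can be combined in a small back-of-the-envelope calculation. The following Python sketch is not part of the PCM tooling; all class names, function names and numeric values are hypothetical. It assumes that failures are statistically independent, that a hardware resource is characterised by its steady-state availability MTTF / (MTTF + MTTR), and that a failure-handling behaviour catches every failure of the action it guards. The actual analysis may treat these aspects in more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareResource:
    """Hypothetical stand-in for a resource annotated with MTTF and MTTR."""
    name: str
    mttf_hours: float
    mttr_hours: float

    def availability(self) -> float:
        # Assumption: steady-state availability, i.e. the long-run
        # fraction of time in which the resource is operational.
        return self.mttf_hours / (self.mttf_hours + self.mttr_hours)


def sequence_success(step_failure_probs):
    """Success probability of steps executed one after another,
    assuming that the steps fail independently of each other."""
    success = 1.0
    for failure_prob in step_failure_probs:
        success *= 1.0 - failure_prob
    return success


def with_recovery(primary_failure, handler_failure):
    """Failure probability of an action guarded by one handler.
    Assumption: the handler catches every failure of the primary
    behaviour, so the action fails only if both of them fail."""
    return primary_failure * handler_failure


# Illustrative values only; none of them stem from a case study.
cpu = HardwareResource("cpu", mttf_hours=9_999.0, mttr_hours=1.0)
steps = [
    1.0 - cpu.availability(),  # hardware resource unavailable on access
    0.001,                     # per-visit software failure probability
    0.002,                     # transmission failure on a network link
    with_recovery(0.05, 0.1),  # external call guarded by a fallback
]
reliability = sequence_success(steps)
print(f"predicted success probability: {reliability:.6f}")
```

The sketch shows why recovery matters: the unguarded external call would contribute a failure probability of 0.05, whereas the guarded call contributes only 0.05 * 0.1 under the stated assumptions.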
Given an architectural specification of the IT system under study, our approach conducts fully automated reliability prediction and presents the results as visual feedback to the modeller. Internally, the prediction transforms the original PCM meta-model instance (enriched with the reliability-specific constructs and annotations) into a discrete-time Markov chain (DTMC) and then solves the Markov chain. Prediction results comprise the success probabilities of all specified usage scenarios, as well as the reliability impacts of specified failure potentials at different levels of granularity. Additionally, a simulation-based prediction method is available. For further details, see PCM-based Reliability Prediction.
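To give an intuition for the solving step, the following Python sketch computes the probability of reaching a success state in a small absorbing DTMC. It is a minimal illustration under stated assumptions, not the transformation or solver used by the tool: the chain, its state names and its probabilities are invented, and the fixed-point iteration is just one simple way to solve such a chain.

```python
def success_probability(transitions, start, success="SUCCESS",
                        tol=1e-12, max_iter=100_000):
    """Probability of eventually reaching `success` from `start`.

    `transitions` maps each transient state to a dictionary
    {next_state: probability}. States without an entry are treated
    as absorbing (here: the success state and any failure state).
    """
    for state, row in transitions.items():
        if abs(sum(row.values()) - 1.0) > 1e-9:
            raise ValueError(f"probabilities of {state!r} must sum to 1")

    # x[s] approximates P(reach success | currently in state s).
    x = {state: 0.0 for state in transitions}
    for _ in range(max_iter):
        delta = 0.0
        for state, row in transitions.items():
            # Absorbing non-success states contribute 0 via x.get().
            new_value = sum(
                prob * (1.0 if target == success else x.get(target, 0.0))
                for target, prob in row.items()
            )
            delta = max(delta, abs(new_value - x[state]))
            x[state] = new_value
        if delta < tol:
            break
    return 1.0 if start == success else x.get(start, 0.0)


# Hypothetical usage scenario: log in, browse (possibly repeatedly),
# then download. Each step may end in the absorbing FAILURE state.
chain = {
    "login":    {"browse": 0.999, "FAILURE": 0.001},
    "browse":   {"browse": 0.5, "download": 0.498, "FAILURE": 0.002},
    "download": {"SUCCESS": 0.99, "FAILURE": 0.01},
}
result = success_probability(chain, "login")
print(f"scenario success probability: {result:.6f}")
```

For this chain the result can be checked by hand: the browse state satisfies x = 0.5 * x + 0.498 * 0.99, so x = 0.98604, and the scenario success probability is 0.999 * 0.98604.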
The approach has been validated on a number of different systems:
- A web-based media store product line (case study overview | model download)
- An industrial control system product line (case study overview | model download)
- A distributed business reporting system (case study overview | model download)
- The Common Component Modelling Example (CoCoME) (CoCoME homepage | model download)
- The SLA@SOI Open Reference Case (ORC) (SLA@SOI homepage | model download)
- A web-based audio hosting solution (model download)
In order to examine the models, install the Eclipse-based PCM 3.3 Release and import the downloaded models into the Eclipse workspace.
A separate discussion covers the approach's ability to model input parameters and their propagation (detailed discussion).
Tool support for reliability modelling and prediction is realized as a set of Eclipse plug-ins, included in the Palladio Component Model (PCM) tool suite, and distributed as an Eclipse-based application. All uploaded models and illustrations on this web site refer to the PCM 3.3 Release version. Generally, the most current version is available from the Palladio download page.
- Dr. Franz Brosch, formerly Research Center for Information Technology (FZI), Karlsruhe, Germany
- Dr. Barbora Buhnova, Masaryk University, Brno, Czech Republic
- Dr. Heiko Koziolek, ABB Corporate Research, Ladenburg, Germany
- Prof. Dr. Ralf Reussner, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
- Franz Brosch. Integrated Software Architecture-Based Reliability Prediction for IT Systems. PhD thesis, Institut für Programmstrukturen und Datenorganisation (IPD), Karlsruher Institut für Technologie, Karlsruhe, June 2012.
- Franz Brosch, Heiko Koziolek, Barbora Buhnova, and Ralf Reussner. Architecture-Based Reliability Prediction with the Palladio Component Model. IEEE Transactions on Software Engineering, 99(PrePrints), 2011.
- Franz Brosch, Barbora Buhnova, Heiko Koziolek, and Ralf Reussner. Reliability Prediction for Fault-Tolerant Software Architectures. In International ACM Sigsoft Conference on the Quality of Software Architectures (QoSA), pages 75-84, New York, NY, USA, 2011. ACM.
- Franz Brosch, Heiko Koziolek, Barbora Buhnova, and Ralf Reussner. Parameterized Reliability Prediction for Component-based Software Architectures. In International Conference on the Quality of Software Architectures (QoSA), volume 6093 of LNCS, pages 36-51. Springer, 2010.
- Franz Brosch and Barbora Zimmerova. Design-Time Reliability Prediction for Software Systems. In International Workshop on Software Quality and Maintainability (SQM), pages 70-74, March 2009.
- Heiko Koziolek and Franz Brosch. Parameter Dependencies for Component Reliability Specifications. In Proceedings of the 6th International Workshop on Formal Engineering approaches to Software Components and Architectures (FESCA), volume 253 of ENTCS, pages 23-38. Elsevier, 2009.