PCM-based Reliability Prediction

From SDQ-Wiki

The Palladio Component Model (PCM) can be used to predict the reliability of IT systems with component-based software architectures. While the overall approach to PCM-based reliability modelling and prediction is described on a separate page, this page specifically focusses on the prediction part, which evaluates a fully specified PCM instance with respect to its expected reliability. The description assumes that the reader is familiar with the general concepts of PCM (for further information, see the PCM tutorials).

Background

PCM-based reliability prediction determines an IT system's expected reliability as the probability of successful service execution, relative to a given PCM UsageScenario. The prediction method focuses on the behavioural specifications of a PCM instance, i.e. ScenarioBehaviours specifying user behaviour and ResourceDemandingBehaviours specifying system behaviour. Taken together, these specifications form a tree of nested action sequences: each ScenarioBehaviour and ResourceDemandingBehaviour represents a sequence of individual actions, and the actions can include further nested behaviours (e.g., a LoopAction contains a body behaviour, a BranchAction contains multiple branch transition behaviours, and an ExternalCallAction points to the next ResourceDemandingSEFF behaviour). The UsageScenario itself contains one ScenarioBehaviour that constitutes the topmost action sequence.

The overall tree of nested action sequences and the semantics of each action determine the possible execution paths through the UsageScenario, as well as the occurrence probability of each execution path (in this discussion, the term 'execution path' is meant to include user actions as well as system actions). For example, the body behaviour of a LoopAction is executed multiple times in a row, while for a BranchAction, exactly one of its branch transition behaviours is executed. Loop iteration counts and branch transition probabilities are stochastically specified and may include parameter dependencies (i.e. they may depend on properties of input parameters as specified in the PCM UsageModel), which are resolved prior to reliability prediction by the dependency solver algorithm.
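
As a rough illustration of how nested behaviours determine execution paths and their occurrence probabilities, the following Python sketch enumerates the paths of a small, invented scenario. The classes Action, Branch and Loop and the function paths are hypothetical stand-ins and not part of the PCM tooling; loop iteration counts and branch transition probabilities are assumed to be already resolved to plain numbers.

```python
from dataclasses import dataclass
from itertools import product


# Hypothetical stand-ins for PCM actions (not the real PCM meta-model classes).
@dataclass
class Action:
    name: str


@dataclass
class Branch:
    # List of (probability, behaviour) pairs; exactly one behaviour is executed.
    transitions: list


@dataclass
class Loop:
    # Iteration count distribution {count: probability} and the body behaviour.
    iterations: dict
    body: list


def paths(behaviour):
    """Return all (probability, [action names]) pairs for a sequence of steps."""
    result = [(1.0, [])]
    for step in behaviour:
        if isinstance(step, Action):
            options = [(1.0, [step.name])]
        elif isinstance(step, Branch):
            options = [(p * q, names)
                       for p, sub in step.transitions
                       for q, names in paths(sub)]
        elif isinstance(step, Loop):
            options = []
            body_paths = paths(step.body)
            for count, p in step.iterations.items():
                # The body is executed `count` times in a row.
                for combo in product(body_paths, repeat=count):
                    prob, names = p, []
                    for q, body_names in combo:
                        prob *= q
                        names = names + body_names
                    options.append((prob, names))
        else:
            raise TypeError(f"unsupported step: {step!r}")
        # Append every option to every path collected so far.
        result = [(p * q, a + b) for p, a in result for q, b in options]
    return result


# Invented example scenario: login, then browse or buy, then 1 or 2 downloads.
scenario = [
    Action("login"),
    Branch([(0.7, [Action("browse")]), (0.3, [Action("buy")])]),
    Loop({1: 0.5, 2: 0.5}, [Action("download")]),
]

for probability, names in paths(scenario):
    print(f"{probability:.2f}  {' -> '.join(names)}")
```

The occurrence probabilities of all enumerated paths sum to 1. Explicit enumeration grows quickly with nesting depth and iteration counts, so it serves here only to make the semantics of branches and loops concrete.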

During service execution, visits to certain actions can result in a failure occurrence. Hence, these actions are the potential points of failure (PPOFs) of the service execution. Other actions possess the capability to recover from failure occurrences and constitute the points of recovery (PORs). In short, PPOFs include InternalActions, which may fail due to (i) system-internal software faults or (ii) unavailable required system-internal hardware resources, and ExternalCallActions, which may fail due to (i) system-external service failures, (ii) system-internal network transmission failures or (iii) unavailable system-internal hardware resources (which may be required by a remote computing node that hosts the called software component). PORs include RecoveryActions, which can tolerate failure occurrences of their internal RecoveryActionBehaviours. See PCM-based Reliability Modelling for further details.

Considering the set of possible execution paths and their occurrence probabilities, the included PPOFs and PORs and their associated reliability annotations (i.e. failure types, failure probabilities and MTTF/MTTR values), all required information is available to conduct the reliability prediction. The main prediction result is the probability P(SUCCESS|U) of a successful (i.e. failure-free) run through a UsageScenario U, as well as its counterpart P(FAILURE|U) = 1 - P(SUCCESS|U). The latter P(FAILURE|U) may be further differentiated depending on the configuration of the reliability prediction, in order to give more insight into the reliability impacts of the individual modelled failure potentials. The prediction method itself includes a transformation from a PCM instance to a discrete-time Markov chain (DTMC) and solves the chain in order to obtain the prediction results.
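
The final solving step can be sketched as computing the absorption probabilities of a small DTMC. The chain below, its state names and all probabilities are invented for illustration, and the simple mass-propagation loop is only one possible way to solve such a chain; it is not claimed to be the method used by the PCM solver.

```python
def absorption_probabilities(transitions, start, tol=1e-12, max_steps=100_000):
    """Propagate probability mass through a DTMC until it is absorbed.

    `transitions` maps each transient state to {successor: probability}.
    States that do not appear as keys are treated as absorbing.
    """
    mass = {start: 1.0}   # probability mass currently in transient states
    absorbed = {}         # probability mass collected in absorbing states
    for _ in range(max_steps):
        if sum(mass.values()) < tol:
            break
        next_mass = {}
        for state, m in mass.items():
            for successor, p in transitions[state].items():
                if successor in transitions:
                    next_mass[successor] = next_mass.get(successor, 0.0) + m * p
                else:
                    absorbed[successor] = absorbed.get(successor, 0.0) + m * p
        mass = next_mass
    return absorbed


# Invented example: an internal action (software failure probability 0.001),
# followed by a remote call (network failure probability 0.002); afterwards
# the sequence is repeated with probability 0.5, otherwise the run succeeds.
chain = {
    "start": {"internal": 1.0},
    "internal": {"FAILURE_software": 0.001, "call": 0.999},
    "call": {"FAILURE_network": 0.002, "decide": 0.998},
    "decide": {"internal": 0.5, "SUCCESS": 0.5},
}

result = absorption_probabilities(chain, "start")
for state, probability in sorted(result.items()):
    print(f"{state}: {probability:.6f}")
```

In this toy chain, the value obtained for "SUCCESS" plays the role of P(SUCCESS|U), and the values of the two failure states add up to P(FAILURE|U) = 1 - P(SUCCESS|U).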

Reliability Prediction Configuration

Based on a fully specified PCM instance, reliability prediction can be configured and executed as an Eclipse run configuration as follows:

  • in the Eclipse workbench, select "Run" - "Run Configurations..." from the main menu;
  • in the "Run Configurations" dialog, create a new run configuration of type "PCM Solver Reliability";
  • configure the settings described below and click on "Run".

The available settings are organized in multiple tabs "Architecture Model(s)", "Analysis Configuration", "Analysis Options" and "Common". The following explains the first three tabs in detail (the "Common" tab contains generic settings for all Eclipse run configurations, see the Eclipse help for further information). Notice that some of the settings are not actually relevant for the reliability solver - they have been taken over from the SimuCom run configuration. Restricting the displayed settings to the relevant ones remains an open issue for the implementation.


Figure 1: Reliability Configuration Settings in the "Architecture Model(s)" Tab


The "Architecture Model(s)" tab contains settings to specify the PCM instance under study:

  • Middleware repository file: Not relevant for the reliability solver. The default entry "pathmap://PCM_MODELS/Glassfish.repository" can be kept.
  • Event middleware repository file: Not relevant for the reliability solver (unless an event middleware repository with modelled failure potentials is used). The default entry "pathmap://PCM_MODELS/default_event_middleware.repository" can be kept.
  • Allocation file: The file that contains the PCM Allocation model as part of the modelled PCM instance.
  • Usage file: The file that contains the PCM UsageModel as part of the modelled PCM instance.

Notice that the PCM Allocation and UsageModel are sufficient to unambiguously define a PCM instance, as they reference the other model parts.


Figure 2: Reliability Configuration Settings in the "Analysis Configuration" Tab


The "Analysis Configuration" tab specifies general settings with respect to the conducted reliability analysis:

  • Location of temporary data: The reliability solver will create a temporary project in the Eclipse workspace during the analysis. The name of this project can either be set to the default, or a specific name can be selected.
  • Temporary data: Indicates if the temporary project shall be automatically deleted at the end of the analysis.
  • Accuracy influence analysis: This option is currently not supported by the reliability solver. Leave it unchecked.
  • Sensitivity analysis: A sensitivity model file and a result log file can be specified, in order to conduct a series of prediction runs (see the Prediction Series section).


Figure 3: Reliability Configuration Settings in the "Analysis Options" Tab


The "Analysis Options" tab contains further settings to specify how the reliability prediction shall be conducted:

  • Stop conditions: Although prediction runs are usually fast (taking only seconds), models with a high number of hardware resources may lead to infeasibly long prediction times. In such cases, the user can specify stop conditions to balance prediction accuracy against analysis time. Whenever one or more stop conditions are specified, the analysis may stop earlier than usual - as soon as any stop condition is fulfilled. Prediction results will be available, but may differ from the final results that a full analysis would have yielded. The maximum possible deviation between the obtained results and the final results is known and is reported to the user. The individual stop conditions are as follows:
    • Number of evaluated system states: for n modelled hardware resources, there are 2^n possible system hardware states (where each system hardware state is a combination of availability states of all hardware resources). A full analysis run evaluates all system hardware states. The stop condition forces the analysis to stop after the specified number of evaluated system hardware states. By evaluating states with high occurrence probabilities first, the obtained prediction result is usually close to the final result even if only a fraction of all states have been evaluated.
    • Number of exact decimal places: this stop condition directly relates to the numerical accuracy (in terms of exact decimal places) of the obtained prediction result. If the condition is set, the analysis stops as soon as the maximum possible deviation of the current prediction result from the final result is below the given threshold.
    • Solving time: sets a real-time upper limit for the analysis time. The analysis stops as soon as its execution time reaches the limit.
  • Markov transformation: A reliability prediction run includes a transformation from the given PCM instance to a discrete-time Markov chain (DTMC) as its central step. Several options provide control over the execution of this transformation:
    • Apply Markov model reduction: By default, the transformation algorithm reduces the DTMC on-the-fly during its creation, such that the resulting DTMC is of a basic structure. The basic structure contains one initial state, one success state and a set of failure states for the distinguished failure modes. Success and failure mode probabilities are directly visible from this structure as the transition probabilities of the DTMC. By unchecking this option, the on-the-fly reduction is switched off, and the full DTMC representing the whole visited behavioural tree will result from the transformation. This can be useful for debugging purposes or for comparing the approach to other DTMC-based reliability prediction approaches.
    • Add Markov traces: If this option is checked, the individual states of the resulting DTMC are additionally equipped with tracing information, to allow for mapping them back to the original PCM instance. Again, this option allows for debugging the transformation.
    • Iterate over physical system states: This option switches to an alternative method of handling hardware resources, avoiding the exponential number of system hardware state evaluations. While the alternative method may greatly reduce analysis time for models with many hardware resources, it may also significantly impact prediction accuracy and must be handled with care.
    • Store Markov model: This option allows for permanently storing the DTMC that results from the Markov transformation. The DTMC is stored as an EMF model, and an EMF editor for inspection of the model is available as part of the PCM installation.
  • Evaluation mode: Determines the granularity of considered failure modes and their occurrence probabilities (in other words, the differentiation of P(FAILURE|U), see the Background section). Generally, the more differentiated the analysis, the bigger the created DTMC, and the longer the required analysis time. Notice that the first two modes ("single" and "category") constitute a simplified evaluation without consideration of failure recovery. They can be applied if a modelled PCM instance does not contain any RecoveryActions. The following evaluation modes are available, each with the listed prediction results:
    • Single failure mode
      • Probability of success
      • Probability of failure
    • Failure categories
      • Probability of success
      • Probability of software-induced failure occurrences
      • Probability of hardware-induced failure occurrences
      • Probability of network-induced failure occurrences
    • Failure types
      • Probability of success
      • Probability of software-induced failure occurrences per specified SoftwareInducedFailureType
      • Probability of hardware-induced failure occurrences per specified ProcessingResourceType
      • Probability of network-induced failure occurrences per specified CommunicationLinkResourceType
    • Points of failure
      • Probability of success
      • Probability of internal software-induced failure occurrences per specified SoftwareInducedFailureType and InternalAction
      • Probability of internal hardware-induced failure occurrences per specified ProcessingResourceType and ResourceContainer
      • Probability of internal network-induced failure occurrences per specified CommunicationLinkResourceType and LinkingResource
      • Probability of external software-induced failure occurrences per specified SoftwareInducedFailureType and OperationRequiredRole and OperationSignature
      • Probability of external hardware-induced failure occurrences per specified ProcessingResourceType and OperationRequiredRole and OperationSignature
      • Probability of external network-induced failure occurrences per specified CommunicationLinkResourceType and OperationRequiredRole and OperationSignature
  • Logging: Provides extended logging facilities for the Markov transformation:
    • Print Markov statistics on console: Provides additional output on the console during the Markov analysis, such as the number of evaluated system hardware states and the required execution times.
    • Log results of individual Markov transformation runs: Lists the individual evaluated system hardware states, their occurrence and success probabilities in a CSV format. Can be used for debugging purposes.
  • Markov analysis results: Saves the displayed Markov analysis results (see the Prediction Results Display section) permanently in HTML format.
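
The following Python sketch illustrates why a maximum possible deviation can be reported when the evaluation of system hardware states stops early. It is a simplification and not a description of the actual solver: it assumes that each resource's steady-state availability is MTTF / (MTTF + MTTR), that resources fail independently, and that a hypothetical function success_given_state returns the success probability for one fixed hardware state. All resource names and numbers are invented.

```python
from itertools import product


def availability(mttf, mttr):
    """Steady-state availability of a single resource (assumed formula)."""
    return mttf / (mttf + mttr)


def enumerate_states(avail):
    """Return all 2^n (probability, state) pairs, most probable first.

    For brevity, this sketch builds the full list before sorting; a real
    implementation would have to generate the states lazily.
    """
    names = list(avail)
    states = []
    for combo in product([True, False], repeat=len(names)):
        p = 1.0
        for name, up in zip(names, combo):
            p *= avail[name] if up else 1.0 - avail[name]
        states.append((p, dict(zip(names, combo))))
    states.sort(key=lambda entry: entry[0], reverse=True)
    return states


def predict(avail, success_given_state, max_states=None, max_deviation=None):
    """Return (success probability so far, maximum deviation, states evaluated)."""
    covered = 0.0    # probability mass of the states evaluated so far
    success = 0.0    # partial sum of P(state) * P(success | state)
    evaluated = 0
    for p, state in enumerate_states(avail):
        if max_states is not None and evaluated >= max_states:
            break
        if max_deviation is not None and 1.0 - covered <= max_deviation:
            break
        success += p * success_given_state(state)
        covered += p
        evaluated += 1
    # Unevaluated states can add at most their total probability mass.
    return success, max(0.0, 1.0 - covered), evaluated


# Invented example with three resources, given as (MTTF, MTTR) pairs.
avail = {
    "cpu1": availability(1000.0, 10.0),
    "cpu2": availability(2000.0, 10.0),
    "hdd": availability(500.0, 20.0),
}


def success_given_state(state):
    # Hypothetical rule: the service needs cpu1 and hdd and then succeeds
    # with probability 0.99; without them it always fails.
    return 0.99 if state["cpu1"] and state["hdd"] else 0.0


full, _, n_full = predict(avail, success_given_state)
partial, deviation, n_partial = predict(avail, success_given_state, max_states=2)
print(f"full analysis:  {full:.6f} ({n_full} states)")
print(f"early stop:     {partial:.6f} + at most {deviation:.6f} ({n_partial} states)")
```

Because every state contributes a success probability between 0 and 1, the final result lies between the partial sum and the partial sum plus the unevaluated probability mass. Under the same assumptions, a stop condition of d exact decimal places would correspond to a max_deviation of 10^-d.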

Prediction Results Display

At the end of a reliability prediction run, the prediction results are automatically displayed in HTML format in the central pane of the Eclipse workbench. Through the "Save results to file" option of the "Analysis Options" tab, the results can also be persisted in a file (see the Reliability Prediction Configuration section). A double-click on this file within the Eclipse workbench will again display the results. Figures 4 and 5 show the visualization of prediction results by example.


Figure 4: Failure Mode Analysis Results


Figure 5: Impact Analysis Results


Prediction results are displayed individually for each PCM UsageScenario. For each UsageScenario, the display includes the scenario name and ID, the overall success probability, a failure mode analysis (Figure 4) and optionally an impact analysis (Figure 5). The failure mode analysis shows the occurrence probabilities of individual failure modes, according to the specification of the evaluation mode in the analysis options (see the Reliability Prediction Configuration section). In the example, the most differentiated "points of failure" evaluation mode has been selected, and Figure 4 shows some of the corresponding prediction results (internal software-induced failure modes per InternalAction and SoftwareInducedFailureType, as well as internal hardware-induced failure modes per ResourceContainer and ProcessingResourceType).

The impact analysis as shown in Figure 5 is only provided if the evaluation mode has been set to "points of failure". It aggregates individual failure mode probabilities to show the overall failure impacts (i.e. contributions to P(FAILURE|U)) of software components, component services, component operations etc. The results of the impact analysis allow for drawing conclusions about the system under study. For example, the analysis reveals which software component within a planned architecture and usage profile context is most critical (i.e. most likely to cause failure occurrences due to included software faults) and should be rigorously tested after development.

By clicking on the headers of each individual results table, the table contents can be sorted in ascending or descending order of the corresponding column (in Figure 4, the "Component" column of the topmost table is sorted in descending order).

Prediction Series

While a single reliability prediction run reveals several insights into a system under study, the obtained prediction results refer to a single architectural candidate and usage profile. They cannot be used to investigate questions such as the following:

  • What is the variance between optimistic and pessimistic estimations of reliability annotations such as failure probabilities, MTTF and MTTR values?
  • How do changing usage profile aspects influence the expected system reliability?
  • What is the effect of gradual reliability improvements throughout the architecture?

To this end, a sensitivity meta-model has been defined, which specifies gradual changes of individual aspects of an underlying PCM instance. Hence, modellers can easily express variations of a system architecture and usage profile, and they can trigger a series of prediction runs that automatically investigates each variation and collects the results.


Figure 6: Sensitivity Model for PCM Parameter Variations


Figure 6 shows an example of a specified sensitivity model. To create such a model, proceed as follows:

  • in the Eclipse workbench, select "File" - "New" - "Other..." from the main menu;
  • in the "New" dialog, select "Example EMF Model Creation Wizards" - "Sensitivity Model";
  • choose a workspace parent folder and file name for the model file;
  • select "Configuration" as the root model object.

The model is constructed according to the following rules:

  • The top-level SensitivityConfiguration contains SensitivityParameters and SensitivityResultSpecifications.
  • A SensitivityParameter specifies a variation of a certain aspect of the underlying PCM instance. It can relate to a single value (such as the size of a ParametricResourceDemand), a group of values (such as all software failure probabilities in a given ResourceDemandingSEFF), or a combination of other parameters. If the parameter is a single value, it may be of type DOUBLE or STRING. A set of concrete subclasses is used to specify different kinds of parameters (e.g., ProbabilisticBranchParameter, SoftwareReliabilityParameter or HardwareMTTFParameter). The CombinedSensitivityParameter includes a set of child parameters and is used to represent a combination of those.
  • Actual value variations are specified through SensitivityParameterVariations, which can express DOUBLE ranges, DOUBLE sequences or STRING sequences. The specification may be either absolute or relative to the original values contained in the underlying PCM instance.
  • SensitivityResultSpecifications define which prediction result aggregations are of interest in each prediction run. They may relate to failure impacts of overall failure dimensions (e.g. all hardware-induced failure types) or of groups of individual specified failure types.

In the example of Figure 6, the SensitivityConfiguration "InteractiveBatchSensitivity" contains a CombinedSensitivityParameter "InteractiveBatchVariation", which influences a Branch in the underlying PCM UsageModel by gradually shifting the branch transition probabilities of two branch transitions "InteractiveTransition" and "BatchTransition" between 0 and 1. Hence, the model expresses a gradual change of user behaviour in the UsageModel. Moreover, a set of SensitivityResultSpecifications specifies prediction results of interest, namely the overall failure impacts of the software, hardware and network dimensions, as well as the failure impacts related to certain processing steps in the behavioural specifications of the PCM instance.

A series of prediction runs (instead of a single run only) can be triggered by the "Sensitivity analysis" settings of the reliability run configuration (see the Reliability Prediction Configuration section). A sensitivity model file related to the given PCM instance must be specified. The prediction results of all runs will be stored in a CSV format at the specified location ("Sensitivity Result Log File").
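
Conceptually, such a prediction series resembles the following Python sketch, which varies a branch transition probability over a DOUBLE range and logs one result row per run. The functions double_range, run_series and toy_predict are hypothetical, the prediction itself is replaced by an invented formula, and the CSV column layout is made up for illustration; the actual format of the result log file may differ.

```python
import csv
import io


def double_range(first, last, steps):
    """Evenly spaced values from `first` to `last` (both inclusive)."""
    if steps < 2:
        return [first]
    return [first + i * (last - first) / (steps - 1) for i in range(steps)]


def toy_predict(p_interactive):
    # Stand-in for a full prediction run (invented numbers): interactive
    # requests succeed with probability 0.999, batch requests with 0.990.
    return p_interactive * 0.999 + (1.0 - p_interactive) * 0.990


def run_series(values, predict, out):
    """Run one prediction per parameter value and log the results as CSV."""
    writer = csv.writer(out)
    writer.writerow(["run", "p_interactive", "p_batch", "p_success"])
    for run, p in enumerate(values, start=1):
        # The two branch transition probabilities always sum to 1.
        writer.writerow([run, f"{p:.2f}", f"{1.0 - p:.2f}", f"{predict(p):.6f}"])


# Shift the branch transition probability from 0 to 1 in five steps.
log = io.StringIO()
run_series(double_range(0.0, 1.0, 5), toy_predict, log)
print(log.getvalue())
```

This mirrors the idea of the "InteractiveBatchVariation" example, in which the probabilities of the two branch transitions are shifted against each other between 0 and 1.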

Reliability Simulation

As an alternative to the previously discussed Markov analysis, the PCM also offers a means to predict a system's reliability through simulation, using its built-in SimuCom engine (see [1] for details). Traditionally, SimuCom simulates the behaviour of system users, the execution of system services and the consumption of hardware resources over a simulated timeline, and it collects data about the system's performance such as completion times of service execution and the utilization of resources. For reliability prediction, SimuCom has been extended to trigger failure occurrences according to the failure potentials included in the modelled PCM instance. To this end, the simulation generates exceptions when visiting potential points of failure (PPOFs) according to the specified software and network failure probabilities, hardware MTTF/MTTR values and external service failure probabilities. Just like the Markov analysis, the simulation outputs the success and failure mode probabilities of each UsageScenario.

As failure occurrences are typically rare events, the applicability of SimuCom for reliability prediction is limited to moderate reliability levels and/or "simple" system architectures, and it is not promoted as the main method for PCM-based reliability prediction. Furthermore, the simulation results are affected by statistical variation, in contrast to the numerically "exact" results of the Markov analysis. However, the reliability simulation does possess interesting features, which are not provided by the Markov analysis:

  • The simulation takes the concept of execution time explicitly into account (whereas the Markov analysis abstracts from all time-related aspects). A simulation run observes a system over a limited mission time, considering resource demand sizes and resource processing speeds, system workloads, waiting times due to concurrent accesses to processing resources and passive resources, network latency effects and throughput limitations, as well as user delays. For hardware resources, the simulation observes them switching back and forth between being available and unavailable over time (whereas the Markov analysis just considers the time-independent steady-state availability of each resource).
  • The simulation records each simulated usage scenario run with its time of occurrence and its execution result (which may be either success or a certain failure mode). This allows for inferring not only the overall success and failure mode probabilities, but also the service-level Time-to-Failure (TTF) distributions, as well as the distribution of failure occurrences over the simulated timeline (for example, there may be clusters of failures in periods when critical resources are unavailable, and fewer failures in other periods).
  • The simulation captures success and failure mode probabilities not only at the system level, but for each involved component service operation throughout the architecture.
  • By conducting repeated simulation runs, the variances of the inferred success and failure mode probabilities can additionally be determined.

Notice that SimuCom will use and interpret all performance-specific annotations of the PCM instance (e.g. system workloads, resource demand sizes, resource processing speeds, resource schedulers etc.), even if it is used only for reliability simulation. Moreover, the simulation makes some additional assumptions that are not necessary for the Markov analysis:

  • hardware resource Times-to-Failure (TTFs) and Times-to-Repair (TTRs) are exponentially distributed;
  • all failure occurrences during service execution cancel the current control and data flow, and they lead the usage scenario run to an immediate end (unless they are handled by a RecoveryAction);
  • for actions that represent PPOFs, failure occurrences are triggered directly at the start of the action execution.
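
The first assumption can be illustrated with a small Python sketch that lets a single hardware resource alternate between available and unavailable periods with exponentially distributed durations. The function simulate_availability and all numbers are hypothetical and unrelated to the SimuCom implementation; the sketch only shows that, over a long mission time, the observed availability approaches the steady-state value MTTF / (MTTF + MTTR).

```python
import random


def simulate_availability(mttf, mttr, mission_time, rng):
    """Return the fraction of the mission time during which a resource is up.

    Times-to-Failure and Times-to-Repair are drawn from exponential
    distributions with means `mttf` and `mttr`, respectively.
    """
    t = 0.0
    up_time = 0.0
    up = True  # the resource is assumed to be available at time 0
    while t < mission_time:
        mean = mttf if up else mttr
        duration = rng.expovariate(1.0 / mean)
        duration = min(duration, mission_time - t)  # cut off at mission end
        if up:
            up_time += duration
        t += duration
        up = not up
    return up_time / mission_time


rng = random.Random(42)  # fixed seed for a reproducible sketch
observed = simulate_availability(mttf=100.0, mttr=10.0,
                                 mission_time=1_000_000.0, rng=rng)
steady_state = 100.0 / (100.0 + 10.0)
print(f"observed availability:     {observed:.4f}")
print(f"steady-state availability: {steady_state:.4f}")
```

For short mission times, the observed availability can deviate noticeably from the steady-state value, which is one source of the statistical variation in simulation results.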


Figure 7: Visualization of Reliability Simulation Results


In order to conduct a reliability simulation with SimuCom, proceed as follows:

  • in the Eclipse workbench, select "Run" - "Run Configurations..." from the main menu;
  • in the "Run Configurations" dialog, create a new run configuration of type "SimuBench";
  • configure the SimuCom settings as usual and, in addition, select "Simulate failures" in the "Feature Settings" tab.

With the "Simulate failures" option activated, SimuCom will consider the modelled failure potentials and generate corresponding failure occurrences. In the simulation results, a new class of sensors exists - the execution result sensors. These sensors record the results (success or failure mode X) of all usage scenario runs, entry-level system calls and component service invocations. On the highest level (i.e. the usage scenario runs), the results correspond to those of the Markov analysis, expressing the overall success and failure mode probabilities of the scenario. Each execution result sensor has two graphical visualizations:

  • All execution results: displays a pie chart with all results (success and failure modes) according to their percentage shares.
  • Failures only: displays a pie chart with failure modes only.

Figure 7 shows an example of a "Failures only" visualization of a usage scenario execution result sensor. Additionally, the reliability results are printed to the Eclipse Console view as direct feedback to the simulation user.

References

  1. Steffen Becker. Coupled Model Transformations for QoS Enabled Component-Based Software Design. PhD thesis, University of Oldenburg, Germany, March 2008.