Reliability Prediction Case Study: Business Reporting System
This page gives an overview of the Business Reporting System (BRS), which serves as a case study for PCM-based reliability modelling and prediction. The model represents a system that generates management reports from business data collected in a database. We have reported on the case study in our TSE 2011 publication.
Business Reporting System
The BRS model, as created for the case study, covers the core functionality that the system offers to its users:
- Data queries related to the stored business data
- System maintenance functionality
Figure 1 gives an overview of the involved software components, hardware resources, and the modelled system usage.
Users can access the BRS via web browsers to perform data queries. This functionality is provided by the WebProcessing component through the IWebProcessing interface, which offers the following service operations:
- ProcessUserSession(Boolean login, UserData data): Allows users to log in (login = true) or log out (login = false). Each user session starts with a login and concludes with a logout.
- CreateGraphicalReport(Boolean detailed, List<DataEntryId> entries): Creates a report for the requested business data entries. The report may be coarse-grained (detailed = false) or fine-grained (detailed = true). The system generates a report document and returns this document as a result.
- CreateOnlineReport(Boolean detailed, List<DataEntryId> entries): Creates a report for the requested business data entries. The report may be coarse-grained (detailed = false) or fine-grained (detailed = true). The system generates the report as a web page and displays this page to the user.
- CreateGraphicalView(): Creates a pre-defined statistical summary of the currently stored data. The system generates a document and returns it as a result.
- CreateOnlineView(): Creates a pre-defined statistical summary of the currently stored data. The system generates a corresponding web page and displays this page to the user.
Users with administration rights can directly access the system backend and perform system maintenance. To this end, the CoreGraphicEngine composite component offers the following service operation through the IAdministration interface:
- PerformMaintenance(MaintenanceCommand command): Performs a pre-defined maintenance command.
Note that the BRS functionality as described above abstracts from the details of the actual interfaces, in order to focus on their reliability-relevant aspects.
The modelled system configuration consists of 23 components instantiated from the following 12 component types:
- WebProcessing: Provides the web-based access point for system users.
- Scheduler: Schedules all user session commands and data queries according to the available system resources.
- User Management: Manages user sessions and access rights.
- GraphicalProcessing: Controls the execution of graphical report and view generation.
- OnlineProcessing: Controls the execution of online report and view generation.
- GraphicalLoadBalancer: Distributes incoming graphical requests to the 3 application servers.
- OnlineLoadBalancer: Distributes incoming online requests to the 3 application servers.
- InnerCoreReportingEngine: Core functionality for the generation of graphical reports and views.
- CRESingleMessageAdapter: Optimizes the message exchange with the underlying database with respect to performance.
- FastCoreReportingEngine: Core functionality for the generation of online reports and views.
- CacheAccess: Represents the access to the application server caches.
- DatabaseAccess: Represents the access to the underlying database.
The BRS model comprises 6 servers as follows:
- WebServer: Hosts the web-based access point for system users. Contains a modelled CPU that represents the computational capacities of the server.
- SchedulerServer: Hosts the scheduler and controlling components for system commands, as well as the user management component. Contains a modelled CPU that represents the computational capacities of the server.
- ApplicationServer 1/2/3: 3 redundant servers that host the core functionality for report and view generation. Each server includes a cache for fast data retrieval and contains a modelled CPU representing its computational capacities.
- DatabaseServer: Hosts the system's database. Contains a modelled CPU representing the computational capacities of the server, and a modelled hard disk drive (HDD) representing the database storage.
The system servers are connected through communication links as shown in Figure 1.
As system reliability can only be predicted with respect to a given system usage, the BRS model includes 3 envisioned usage scenarios with given occurrence probabilities:
- Usage Scenario 1 - Administrator: This scenario consists of a login, maintenance command, and logout. It has an occurrence probability of 0.1 (i.e. 10% of all executed scenarios are administrator scenarios).
- Usage Scenario 2 - Sales Manager: Consists of a login, online report or view command, and logout. Has an occurrence probability of 0.7. Reports are requested in 10% of all cases; views in 90%. 8% of all requested reports are detailed. Report requests refer to an average of 100 business data entries.
- Usage Scenario 3 - Accounting Manager: Consists of a login, graphical report or view command, and logout. Has an occurrence probability of 0.2. Reports are requested in 90% of all cases; views in 10%. 64% of all requested reports are detailed. Report requests refer to an average of 100 business data entries.
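The usage profile above can be written down as plain data, which makes it easy to derive how often a given kind of request occurs across all scenario runs. The following is a minimal sketch, not part of the BRS model itself: the dictionary layout and the names `SCENARIOS`, `report_share`, and `detailed_share` are assumptions introduced for illustration, while the numbers are taken from the scenario descriptions above.

```python
# Usage profile transcribed from the three scenarios above.
# The field names are hypothetical; only the numbers come from the text.
SCENARIOS = {
    # Administrators run maintenance commands and request no reports.
    "administrator": {"probability": 0.1, "report_share": 0.0, "detailed_share": 0.0},
    "sales_manager": {"probability": 0.7, "report_share": 0.10, "detailed_share": 0.08},
    "accounting_manager": {"probability": 0.2, "report_share": 0.90, "detailed_share": 0.64},
}


def report_probability(scenarios):
    """Probability that a randomly chosen scenario run requests a report."""
    return sum(s["probability"] * s["report_share"] for s in scenarios.values())


def detailed_report_probability(scenarios):
    """Probability that a randomly chosen scenario run requests a detailed report."""
    return sum(
        s["probability"] * s["report_share"] * s["detailed_share"]
        for s in scenarios.values()
    )


print(f"report requested:          {report_probability(SCENARIOS):.4f}")
print(f"detailed report requested: {detailed_report_probability(SCENARIOS):.4f}")
```

Under this profile, a quarter of all scenario runs request a report (0.7 * 0.1 + 0.2 * 0.9), and most detailed reports come from accounting managers.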
The risk of failures-on-demand during the execution of the usage scenarios is modelled through the following reliability-specific annotations to the BRS model:
- Software Failures: modelled through failure probabilities of the internal actions in component behavioural specifications (all internal action failure probabilities set to 10^-5).
- Hardware Unavailability: modelled through MTTF and MTTR values for hardware resources (MTTF between 27800 hours and 55600 hours as shown in Figure 1; MTTR between 10 hours and 20 hours as shown in Figure 1).
- Network Failures: modelled through communication link failure probabilities (all failure probabilities set to 10^-6 as shown in Figure 1).
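To get a feeling for what the MTTF and MTTR ranges above mean, one can convert them into a steady-state availability using the common formula MTTF / (MTTF + MTTR). The sketch below assumes that this formula applies and pairs the range bounds arbitrarily; the actual per-resource values are those shown in Figure 1, which is not reproduced here.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time a resource is available: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Illustrative pairings of the range bounds stated above (assumption:
# the real MTTF/MTTR pairing per resource is given in Figure 1).
for mttf, mttr in [(27800, 20), (27800, 10), (55600, 20), (55600, 10)]:
    availability = steady_state_availability(mttf, mttr)
    print(f"MTTF={mttf} h, MTTR={mttr} h -> availability={availability:.6f}")
```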
Besides the standard version of the BRS model, we applied different architectural tactics for reliability improvement, resulting in the following design alternatives:
- Design Alternative 1 - High-reliability Components: decreased internal action failure probabilities (from 10^-5 to 0)
- Design Alternative 2 - High-availability Servers: decreased MTTR values for all modelled hardware resources
  - WebServer - CPU: MTTR = 10.0 hours
  - SchedulerServer - CPU: MTTR = 10.0 hours
  - ApplicationServer - CPU: MTTR = 3.3 hours
  - DatabaseServer - CPU: MTTR = 3.3 hours
  - DatabaseServer - HDD: MTTR = 3.3 hours
- Design Alternative 3 - High-reliability Network: decreased network failure probability (from 10^-6 to 0)
Download the complete BRS model instance here.
To demonstrate the capabilities of our reliability modelling and prediction approach, we performed a series of reliability predictions for the BRS model, with the following results:
Figure 2 shows the predicted BRS system reliability (i.e. the probability of a successful run through a usage scenario) for the different usage scenarios and design alternatives. The aggregated usage scenario is the overall success probability, taking into account each individual scenario and its occurrence probability. The figure shows several trends, which can also be confirmed by reasoning about the system:
- The administrator usage scenario generally has the highest reliability. This is because the maintenance commands are not as computationally intensive as the data queries.
- Sales managers generally experience fewer failures than accounting managers, because they mostly request view generation, which is less computationally intensive (while accounting managers mostly request report generation).
- The aggregated usage scenario shows results similar to those of the sales manager usage scenario, because most executed usage scenarios are sales manager scenarios (70%).
- Design alternatives 1-3 generally increase system reliability, as each of them applies an architectural tactic for reliability improvement.
Furthermore, the figure allows for interesting, less intuitive conclusions:
- Design alternatives 1 and 2 generally have more impact on system reliability than alternative 3. This shows that network failures are less critical for the BRS than software failures and unavailability of hardware resources.
- Design alternative 1 (improved software reliability) shows the most benefit for the most computationally intensive usage scenario (the accounting manager). With this alternative, accounting managers will even experience higher reliability than sales managers.
- For administrators, only alternative 2 shows a noticeable improvement. Software and network reliability are not critical for this usage scenario.
- Overall (i.e. in the aggregated scenario), design alternatives 1 and 2 (i.e. improved software reliability & hardware availability) show the most benefit.
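The aggregated usage scenario mentioned for Figure 2 is a weighted sum: each scenario's success probability is weighted by its occurrence probability. The sketch below illustrates this; the occurrence probabilities are those of the BRS usage scenarios, whereas the per-scenario reliabilities are hypothetical placeholders and are not the predicted values from Figure 2.

```python
import math

# Occurrence probabilities from the BRS usage scenarios.
OCCURRENCE = {"administrator": 0.1, "sales_manager": 0.7, "accounting_manager": 0.2}


def aggregated_reliability(per_scenario, occurrence):
    """Overall success probability as the occurrence-weighted sum of scenario reliabilities."""
    if not math.isclose(sum(occurrence.values()), 1.0):
        raise ValueError("occurrence probabilities must sum to 1")
    return sum(occurrence[name] * per_scenario[name] for name in occurrence)


# Hypothetical per-scenario reliabilities, chosen only to illustrate the weighting.
example = {"administrator": 0.9995, "sales_manager": 0.9990, "accounting_manager": 0.9970}
print(f"aggregated reliability: {aggregated_reliability(example, OCCURRENCE):.5f}")
```

Because the sales manager scenario carries a weight of 0.7, the aggregated value stays close to that scenario's reliability, which matches the trend noted above.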
Figure 3 further investigates the BRS design alternatives and shows their failure potential when the required model parameters are subject to uncertainty. The figure shows three experiments. Each experiment alters the model parameters with respect to one of three dimensions (software reliability, hardware availability, and network reliability), while leaving the other dimensions unchanged. More concretely, the first experiment varies the software failure probabilities of all InternalActions in the model between 10^-7 and 10^-3; the second experiment varies the MTTF values of all hardware resources between 0.1 years and 10 years; the third experiment varies the failure probabilities of all network links between 10^-7 and 10^-3. Such variations can express the uncertainty when estimating the model parameters at an early design stage. Taking a closer look at the results, one finds:
- Changing one dimension for the worse generally increases the failure potential of all design alternatives (i.e. it decreases their success probabilities). There are only two exceptions to this rule: Alternative 1 (high-reliability components) does not react to changes of software failure probabilities, because it always sets all software failure probabilities to 0. Alternative 3 (high-reliability network) does not react to changes of network failure probabilities, as it always sets them to 0.
- As a general rule (and in accordance with general intuition), changing any dimension for the worse increases the impact of the design alternative that deals with this dimension. More concretely, alternative 1 is superior for high software failure probabilities, alternative 2 for low hardware MTTF values, and alternative 3 for high network failure probabilities.
- Most interesting are the break-even points where design alternatives change their ranking. For example, alternative 2 is the preferred choice for software failure probabilities between 10^-7 and 10^-5, while alternative 1 is better for probabilities above 10^-5.
- Software architects can use the results of the experiments to make a better-informed decision between the design alternatives, taking into account the uncertainty of the estimated input parameters. For example, assuming that software and network failure probabilities are generally not above 10^-5, and hardware MTTF values are not above 8 years, a decision for alternative 2 (high-availability servers) can be made with high confidence that this is actually the most beneficial choice with respect to the system's reliability.
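The break-even behaviour discussed above can be illustrated with a deliberately simplified toy model. This is not the PCM prediction itself: it assumes that a scenario run succeeds only if the hardware is available and all executed internal actions succeed independently, and it ignores network failures. The availability values and the number of executed actions below are hypothetical.

```python
def success_probability(software_failure_prob, executed_actions, hardware_availability):
    """Toy model: hardware must be available and every internal action must succeed."""
    return hardware_availability * (1.0 - software_failure_prob) ** executed_actions


def break_even_software_failure_prob(base_availability, improved_availability, executed_actions):
    """Software failure probability at which alternatives 1 and 2 perform equally.

    Alternative 1 (software failure probability 0, base hardware):
        R1 = base_availability
    Alternative 2 (unchanged software, improved hardware):
        R2 = improved_availability * (1 - p) ** executed_actions
    Setting R1 = R2 and solving for p gives the expression below.
    """
    ratio = base_availability / improved_availability
    return 1.0 - ratio ** (1.0 / executed_actions)


# Hypothetical inputs, chosen only for illustration.
BASE_AVAILABILITY = 0.9990      # hardware availability of the base design
IMPROVED_AVAILABILITY = 0.9996  # hardware availability with reduced MTTR
EXECUTED_ACTIONS = 50           # internal actions executed per scenario run

p_star = break_even_software_failure_prob(
    BASE_AVAILABILITY, IMPROVED_AVAILABILITY, EXECUTED_ACTIONS
)
print(f"break-even software failure probability: {p_star:.2e}")
```

Below the break-even point, improving hardware availability (alternative 2) yields the higher success probability in this toy model; above it, removing software failures (alternative 1) wins. This mirrors the qualitative ranking change described for Figure 3.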
Figure 4 shows the sensitivity of the BRS system reliability to changes of software failure probabilities (assuming the base design alternative). The figure summarizes a series of experiments. In each experiment, we varied the internal action failure probabilities of one component type between 0 and 0.1, while leaving all other software failure probabilities unchanged (i.e. 10^-5). If the altered component type has multiple instances in the BRS model (such as the InnerCoreReportingEngine), the alteration affects each instance of this type. The results show how critical the failure probabilities of each component type are with respect to the BRS system reliability. We conducted each experiment for each of the defined usage scenarios, including the aggregated usage scenario.
As the figure shows, system reliability generally decreases with increasing software failure probabilities of each component type. Beyond that, the sensitivity analysis allows for various conclusions. For example, if one is interested in identifying the component types with the greatest impact on system reliability, one finds:
- Most critical to system reliability are the CacheAccess and DatabaseAccess components (this becomes obvious when looking at the aggregated usage scenario), due to the fact that they are visited most frequently during the execution of the individual usage scenarios. A closer look reveals the following:
  - They are the only components with a non-linear effect on system reliability, as they are visited multiple times during the execution of report requests.
  - They are extremely critical to the accounting manager scenario, which mostly requests report generation (this is because comprehensive processing is necessary for every report generation request; the generation of a report requires a lookup of several hundred historical database entries, where each individual lookup is associated with its own independent failure probability).
  - On the other hand, they are not critical to the administrator scenario, as they are not involved in the execution of maintenance commands.
- The InnerCoreReportingEngine component is involved in all usage scenarios and thus has a high overall impact on system reliability.
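The non-linear effect of the CacheAccess and DatabaseAccess components follows from repeated, independent lookups: if a single lookup fails with probability p, a request with n lookups succeeds with probability (1 - p)^n. The sketch below assumes this independence, as stated above, and uses 100 lookups per request as an illustrative value (the usage scenarios refer to an average of 100 business data entries); the function name is hypothetical.

```python
def request_success_probability(lookup_failure_prob, lookups):
    """Success probability of a request that performs `lookups` independent lookups."""
    return (1.0 - lookup_failure_prob) ** lookups


# Compare a component visited once with one visited 100 times per request.
for p in (1e-5, 1e-3, 1e-2):
    once = request_success_probability(p, 1)
    many = request_success_probability(p, 100)
    print(f"p={p:g}: 1 visit -> {once:.6f}, 100 visits -> {many:.6f}")
```

For a component visited once, the success probability falls linearly in p. With 100 visits it drops much faster as p grows, which is why report-heavy scenarios are so sensitive to these two components.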
The analysis shown by Figure 5 is similar to that of Figure 4, but varies MTTF values of hardware resources instead of software failure probabilities. More concretely, each experiment varies the MTTF values of all modelled hardware resources of one server between 0.1 years and 10 years, while leaving all other MTTF values unchanged. The results show the significance of the MTTF values of each server to the BRS system reliability, and allow for various conclusions, such as:
- Obvious differences between the significance of the individual servers are only visible for low MTTF values (below 1 year).
- The application servers have lower impact on system reliability than the other servers, as they are replicated (which means that each server can be substituted by the other servers).
- The scheduler server has the greatest impact. Unlike the database and web servers, it impacts all usage scenarios, including the administrator scenario.
- The administrator scenario is least sensitive to hardware MTTF values, as it only involves the scheduler and main application servers for its maintenance commands.
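The lower impact of the replicated application servers can be made plausible with a small calculation. The sketch below assumes that the replicas fail independently, that any single available replica is sufficient, that availability is MTTF / (MTTF + MTTR), and that a year has 8760 hours. None of these assumptions is taken from the BRS model; the MTTF of 0.1 years is the lower bound of the experiment range and the MTTR of 10 hours is the lower bound stated earlier.

```python
HOURS_PER_YEAR = 8760  # assumption: 365 days of 24 hours


def availability(mttf_hours, mttr_hours):
    """Steady-state availability of a single resource."""
    return mttf_hours / (mttf_hours + mttr_hours)


def replicated_availability(single_availability, replicas):
    """Availability of a group that works as long as at least one replica is available."""
    return 1.0 - (1.0 - single_availability) ** replicas


single = availability(0.1 * HOURS_PER_YEAR, 10)  # worst-case MTTF of 0.1 years
group = replicated_availability(single, 3)       # three application servers
print(f"single server: {single:.6f}")
print(f"three replicas: {group:.8f}")
```

Even at the lowest MTTF considered, the replicated group remains far more available than a single server under these assumptions, which is consistent with the application servers showing the smallest impact in Figure 5.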