Reliability Improvements

From SDQ-Wiki

This page collects known architectural tactics and model improvements for optimizing the reliability of software architectures.

Reliability Improvements Collection

Reference Name Type of Improvement Short Description Reference How to model in PCM
High-reliability components Reliability (Software) Invest more effort in the implementation and testing of components to achieve higher component reliability --- Decrease internal action failure probabilities
High-availability hardware Availability (Hardware) Spend more money on buying better hardware --- Increase physical resource MTTF
High-reliability network Reliability (Network) Spend more money on more reliable / higher capacity network links --- Decrease communication link failure probability
Change component deployment Topological Change the deployment of components to servers such that "reliability-sensitive" components are located on servers with high availability. --- Change PCM allocation model
Change component assembly Topological If possible, change assembly of components such that services are provided by the least "reliability-sensitive" components --- Change PCM system model
Optimize external services (NOT YET SUPPORTED) Reliability (External Services) Spend more money on system-external services with higher reliability. --- Increase reliability of system-external services
Redundant hardware (also: fault-tolerant hardware, failover) Availability (Hardware) Spend more money on redundant physical resources (e.g., RAID arrays, redundant CPUs, redundant servers)

Rozanski2005

With the steady-state unavailability U = MTTR / (MTTF + MTTR), assume for an n-fold redundant resource:

U(n) = U(1)^n

Under the assumption

MTTF(n)+MTTR(n)=MTTF(1)+MTTR(1)

it follows:

MTTR(n)=(MTTR(1)^n)/((MTTF(1)+MTTR(1))^(n-1))

and

MTTF(n)=MTTF(1)+MTTR(1)-MTTR(n)
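
The formulas above can be evaluated directly. The following is a minimal sketch, assuming the standard steady-state definition U = MTTR / (MTTF + MTTR), which is consistent with the derivation above. The function names and the example values (990 and 10 time units) are hypothetical and only serve as illustration.

```python
from typing import Tuple


def unavailability(mttf: float, mttr: float) -> float:
    """Steady-state unavailability U = MTTR / (MTTF + MTTR) (assumed definition)."""
    return mttr / (mttf + mttr)


def redundant_resource(mttf1: float, mttr1: float, n: int) -> Tuple[float, float]:
    """Return (MTTF(n), MTTR(n)) for an n-fold redundant resource."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption from the text: MTTF(n) + MTTR(n) = MTTF(1) + MTTR(1)
    cycle = mttf1 + mttr1
    # MTTR(n) = MTTR(1)^n / (MTTF(1) + MTTR(1))^(n-1)
    mttr_n = mttr1 ** n / cycle ** (n - 1)
    # MTTF(n) = MTTF(1) + MTTR(1) - MTTR(n)
    mttf_n = cycle - mttr_n
    return mttf_n, mttr_n


# Hypothetical example: a single resource with MTTF = 990 and MTTR = 10
mttf2, mttr2 = redundant_resource(mttf1=990.0, mttr1=10.0, n=2)
print(mttf2, mttr2)
print(unavailability(990.0, 10.0), unavailability(mttf2, mttr2))
```

For these values, the unavailability of the duplicated resource equals the square of the single-resource unavailability, as required by U(n) = U(1)^n. The resulting MTTF(n) and MTTR(n) can then be used as the physical resource parameters in the PCM model.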

Heartbeat (also: ping/echo) Availability (Hardware) Spend additional money on a monitoring component or system that periodically tests the availability of physical resources. If a resource turns out to be unavailable, a repair action can be taken immediately.

Bass2003

Kim2009

Decrease the MTTR of the monitored resource in steady state to

MTTR=M/2+R

with (average) check interval M and (average) repair time R.

Possibly decrease the processing speed of the monitored resource if the monitoring puts load on it.
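
As an illustration of the heartbeat formula, the sketch below computes the monitored MTTR and the resulting steady-state availability. It assumes the standard definition A = MTTF / (MTTF + MTTR). The function names and all numeric values (including the unmonitored MTTR used for comparison) are hypothetical.

```python
def heartbeat_mttr(check_interval: float, repair_time: float) -> float:
    """MTTR = M/2 + R: on average a failure is detected after half a check
    interval M, and the repair itself then takes R."""
    return check_interval / 2 + repair_time


def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR) (assumed definition)."""
    return mttf / (mttf + mttr)


# Hypothetical values, all in hours
mttf = 1000.0
unmonitored_mttr = 10.0  # assumed MTTR without monitoring
monitored_mttr = heartbeat_mttr(check_interval=0.5, repair_time=2.0)

print(monitored_mttr)
print(availability(mttf, unmonitored_mttr), availability(mttf, monitored_mttr))
```

The monitored MTTR replaces the original MTTR of the physical resource in the model; the MTTF is left unchanged in this sketch.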

Design diversity / n-version programming Reliability (Software) Implement one algorithm in n different ways. Let each computation request be handled by all versions simultaneously. Apply a voting algorithm that collects all results and chooses one of them according to a certain strategy (e.g., majority voting). Higher costs arise from designing n algorithms and from the n-fold computational load at run-time.

Bass2003

Kienzle2003

Think of an internal action as being executed n times. Assume a failure probability for each version and a certain voting strategy. Calculate the overall failure probability from the individual failure probabilities and the probability that the voting algorithm makes the right decision. Increase the resource consumption of the internal action to reflect the n-fold computation and the voting overhead.
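
One possible way to carry out this calculation is sketched below for majority voting. It rests on several assumptions that are not stated in the table: the versions fail independently, the request succeeds only if a strict majority of versions returns the correct result (ties count as failures), and the voter itself fails with a separate, independent probability. The function and parameter names are hypothetical.

```python
from typing import Sequence


def majority_vote_failure(version_failure_probs: Sequence[float],
                          voter_failure_prob: float = 0.0) -> float:
    """Overall failure probability of an n-version internal action
    under majority voting (assumes independent version failures)."""
    n = len(version_failure_probs)
    if n == 0:
        raise ValueError("at least one version is required")
    # dist[k] = probability that exactly k of the versions seen so far failed
    dist = [1.0]
    for p in version_failure_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1 - p)      # this version succeeds
            nxt[k + 1] += prob * p        # this version fails
        dist = nxt
    needed_correct = n // 2 + 1           # strict majority
    max_failures = n - needed_correct     # failures the vote can tolerate
    p_majority_correct = sum(dist[: max_failures + 1])
    # The request succeeds only if the majority is correct AND the voter works
    return 1 - (1 - voter_failure_prob) * p_majority_correct


# Hypothetical example: three versions, each failing with probability 0.1
print(majority_vote_failure([0.1, 0.1, 0.1]))
print(majority_vote_failure([0.1, 0.1, 0.1], voter_failure_prob=0.001))
```

The result would replace the failure probability of the internal action; the n-fold resource demand and the voting overhead still have to be modeled separately.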
Transaction logging (NOT YET SUPPORTED) Reliability (Software) Log the steps of transactions to persistent storage so that certain steps can be redone after a system failure

Rozanski2005

Kienzle2003

Explicitly model how the system recovers from a failure and follows an alternative execution path.
Rejuvenation techniques Reliability (Software) Automatically restart components, application servers, or operating systems after failures to ensure high availability ? You need a model of how the internal action failure probability increases over time. Such a model, combined with a regular restart, yields an average failure probability that can be used as a fixed value for the internal action.
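
The averaging step for rejuvenation can be sketched as follows. The aging model used here (a failure probability that grows linearly with the time since the last restart) is purely a hypothetical assumption, as are the function names and numbers; the sketch also assumes that requests are spread uniformly over the restart interval.

```python
from typing import Callable


def average_failure_probability(p_of_t: Callable[[float], float],
                                restart_interval: float,
                                steps: int = 1000) -> float:
    """Average p(t) over one restart interval using the midpoint rule.

    p_of_t maps the time since the last restart to the failure probability
    of the internal action at that time.
    """
    dt = restart_interval / steps
    total = sum(p_of_t((i + 0.5) * dt) for i in range(steps))
    return total / steps


def linear_aging(t: float) -> float:
    """Hypothetical aging model: starts at 0.001 and grows by 0.0001 per hour."""
    return min(1.0, 0.001 + 0.0001 * t)


# Restart every 24 hours (hypothetical)
print(average_failure_probability(linear_aging, restart_interval=24.0))
```

The returned average would be entered as the fixed failure probability of the internal action. Shorter restart intervals lower this value under any aging model in which the failure probability increases over time.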

Notes

  • data diversity
  • environment diversity
  • incorporate sensitivity in analysis to know where to start with heuristic
  • n-version programming: very high costs for additional component (new development, not only licensing)