WICSA ECSA 2012

From SDQ-Wiki

Workload-aware System Monitoring Using Performance Predictions Applied to a Large-scale E-Mail System

Supplementary wiki page for the paper by Christoph Rathfelder, Stefan Becker, Klaus Krogmann, and Ralf Reussner. This page collects information that could not be included in the paper due to space restrictions but provides relevant background for understanding the details of the contributions.

Palladio Model (Excerpts)

Tool Support

The tooling employed in the case study combines the open source tool Palladio-Bench with customer-specific extensions by 1&1. These extensions for monitoring and log file extraction are specific to the 1&1 platform. Comparable monitoring and log file extraction mechanisms can be developed for other systems. One example of a tool that can serve as a foundation is ProM6.

Supplementary Figure

The following figure shows a comparison of the measured and predicted resource utilisation of a proxy server. Please note that the figures represent different servers, not the same one.

WICSA ECSA 2012-measured-predicted-resource-utilisation-proxy-server.png

Figure 9 presents the measured (solid) and predicted (dashed) resource utilisations on the Proxy servers for one day. The predicted curves have the same characteristics as the measured ones, with only small differences between predicted and measured values. This shows that the resource utilisation expected from the performance predictions fits the values measured on the live system.

Remarks on the validation

To compare measured and predicted values, we calculated the difference between the measured and the predicted value. The monitoring framework only provides average CPU utilisation values for the last few minutes. The prediction is computed from the workload measured on the system during the same time period. The deviation threshold that triggers a warning is system-specific and depends on the quality of the prediction model.
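
The comparison described above can be illustrated with a minimal Python sketch. It is not the implementation used in the case study: the `Sample` type, its field names, the use of the absolute difference, and the threshold value of 0.10 are all assumptions made for illustration, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring interval (type and field names are hypothetical)."""
    timestamp: str
    measured_cpu: float   # average measured CPU utilisation, 0.0 to 1.0
    predicted_cpu: float  # utilisation predicted from the workload of the same interval


def deviation(sample: Sample) -> float:
    """Absolute difference between measured and predicted utilisation (assumed metric)."""
    return abs(sample.measured_cpu - sample.predicted_cpu)


def find_warnings(samples, threshold):
    """Return the samples whose deviation exceeds the system-specific threshold."""
    return [s for s in samples if deviation(s) > threshold]


if __name__ == "__main__":
    # Made-up illustrative numbers, not measurements from the case study.
    samples = [
        Sample("10:00", 0.42, 0.40),
        Sample("10:30", 0.45, 0.44),
        Sample("11:00", 0.70, 0.46),
    ]
    for s in find_warnings(samples, threshold=0.10):
        print(f"{s.timestamp}: deviation {deviation(s):.2f} exceeds threshold")
```

In such a sketch the threshold is the only tuning parameter. A less accurate prediction model would need a larger value to avoid spurious warnings.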

As we were not able to "play" with the live system, we could not analyse different series of errors to evaluate the detection rate of our algorithms or to measure the rates of false positives and false negatives, as requested by one of the reviewers. For this reason, we can only evaluate the prediction accuracy and analyse the difference between predicted and measured utilisation values. This analysis shows that the predicted utilisation curve fits the measured utilisation curve over one day (measured every 30 minutes) quite well. As described in the paper, the difference is less than 10% over the whole day, except for the manually induced software update, which was the failure/unexpected behaviour we wanted to detect.

Further reading

Diploma Thesis "Performance-Modellierung des 1&1 Mail-Systems" (German)