Analysis of Classifier Performance on Aggregated Energy Status Data

Thesis announcement
Type: Bachelor's thesis or Master's thesis
Posting: ClassifierAggregationPerformance.pdf
Supervisor: If you are interested or have questions, please contact:

Dominik Werle (E-Mail: dominik.werle@kit.edu, Telefon: +49-721-608-41609)

The industrial sector accounts for 24% of the overall energy consumption in the European Union. From a business perspective, energy consumption is also an important factor in the overall production cost, so companies benefit directly from energy-efficient production processes. At the same time, the shift to renewable energies requires companies to adapt to a less flexible energy supply.

In parallel with these new challenges, the technology for measuring energy consumption is advancing. Smart meters can measure different physical quantities, such as voltage, frequency and harmonic distortion, which give an indication of machine behavior and of the quality of the electrical grid. With sample rates of up to multiple measurements per second, these devices produce huge amounts of data. This challenges the data processing infrastructure to scale up to hundreds of thousands of events per second.

The focus of this thesis is to model and analyze the impact of different aggregation methods on the result quality of classifiers for high-volume energy status data sets.

The result quality of a classification algorithm depends on the characteristics of the underlying training data, such as size, missing values or the data distribution. The time and resources needed to perform an analysis are related to these characteristics as well. Current research in the field of automated classifier selection focuses on finding the best-suited algorithm for a specific data set. It does not answer the question of how an aggregation of the data, such as sampling or averaging, would influence the classification. In addition, the models that predict classifier rankings do not perform well on data sets with characteristics that are not covered by the training sets. These issues are, however, very important when planning or revising a software system for the analysis of energy status data.
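To make the two aggregation methods named above concrete, the following minimal sketch contrasts sampling and averaging on a short series. The function names and the power readings are hypothetical and chosen only for illustration; they are not taken from the project.

```python
from statistics import mean

def downsample(values, step):
    """Sampling: keep every `step`-th reading and drop the rest."""
    return values[::step]

def window_average(values, size):
    """Averaging: replace each non-overlapping window of `size` readings by its mean."""
    return [mean(values[i:i + size]) for i in range(0, len(values), size)]

# Hypothetical power readings in watts with one short spike (assumption, not real data).
readings = [100.0, 102.0, 98.0, 100.0, 101.0, 500.0, 99.0, 100.0]

sampled = downsample(readings, 4)       # the spike falls between two kept readings
averaged = window_average(readings, 4)  # the spike raises the mean of its window

print(sampled)
print(averaged)
```

Both methods reduce the eight readings to two values, but they keep different information: in this example sampling loses the spike entirely, while averaging keeps a smoothed trace of it. A classifier trained on the aggregated data may therefore behave differently depending on the method.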

This suggests the following research questions:

  • How do different aggregation techniques influence the performance and result quality of classification algorithms?
  • How does the performance of the classification relate to the performance overhead of the aggregation?

  • What are characteristics of energy status data that are helpful for this prediction?

For this thesis, you will:

  • Choose a set of classifiers and aggregation methods.
  • Automate the systematic aggregation of data and measurement of classifier performance.
  • Train and evaluate a model for the relationship between performance and result quality.
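The first two steps above could be prototyped roughly as follows. This is only a sketch under stated assumptions: the synthetic "idle"/"active" data, the nearest-centroid classifier and all function names are hypothetical stand-ins, and a real evaluation would use actual energy status data and the classifiers chosen for the thesis.

```python
import random
import time
from statistics import mean

def make_sample(rng, label, length=16):
    # Hypothetical synthetic data: "active" machines draw more power than "idle" ones.
    base = 200.0 if label == "active" else 100.0
    return [rng.gauss(base, 30.0) for _ in range(length)], label

# Aggregation methods under comparison (illustrative choices).
def no_aggregation(xs):
    return list(xs)

def every_fourth(xs):
    return xs[::4]

def mean_of_four(xs):
    return [mean(xs[i:i + 4]) for i in range(0, len(xs), 4)]

def train_centroids(samples):
    # Nearest-centroid classifier: one mean feature vector per label.
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_label.items()}

def predict(centroids, features):
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def evaluate(aggregate, train, test):
    # Returns (accuracy, elapsed seconds) for aggregation, training and prediction.
    start = time.perf_counter()
    train_agg = [(aggregate(x), y) for x, y in train]
    test_agg = [(aggregate(x), y) for x, y in test]
    centroids = train_centroids(train_agg)
    correct = sum(predict(centroids, x) == y for x, y in test_agg)
    elapsed = time.perf_counter() - start
    return correct / len(test_agg), elapsed

rng = random.Random(0)  # fixed seed so that runs are repeatable
labels = ["idle", "active"]
train = [make_sample(rng, labels[i % 2]) for i in range(200)]
test = [make_sample(rng, labels[i % 2]) for i in range(100)]

results = {method.__name__: evaluate(method, train, test)
           for method in (no_aggregation, every_fourth, mean_of_four)}
for name, (accuracy, seconds) in results.items():
    print(f"{name}: accuracy={accuracy:.2f}, time={seconds * 1000:.2f} ms")
```

Each entry in `results` pairs a result-quality figure (accuracy) with a performance figure (elapsed time). Collected over many data sets, classifiers and aggregation methods, such pairs would be the input for the model mentioned in the third step.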

Our technology stack builds upon modern data processing frameworks such as Apache Cassandra and Apache Spark. Experimental evaluations can be run on a cluster with 512 GB RAM and 48 cores.

In this thesis, you will work on current research questions and acquire practical knowledge of large-scale data analytics. Knowledge from a lecture such as “Big Data Analytics” and an interest in software performance engineering are beneficial.