Performance Modeling of Distributed Computing

From SDQ-Institutsseminar
Speaker Valerii Zhyla
Talk type Master's thesis
Advisor Larissa Schmid
Date Fri, May 3, 2024
Presentation mode in person
Abstract Optimizing resource allocation in distributed computing systems is crucial for enhancing system efficiency and reliability. Predicting job execution metadata from resource demands and platform characteristics plays a key role in this optimization process.

Distributed computing simulators are used to model and predict system behavior for this purpose. Among the various simulators developed in recent decades, this thesis focuses on the state-of-the-art simulator DCSim. DCSim simulates the nodes and links of the configured platform, generates workloads according to configured parameter distributions, and performs the simulations. The simulated job execution metadata is accurate, but the computational resources and time required by the simulations increase superlinearly with the number of simulated nodes.

In this thesis, we explore the application of Recurrent Neural Networks and Transformer models for predicting job execution metadata within distributed computing environments. We focus on data preparation, model training, and evaluation for handling numerical sequences of varying lengths. This approach aims to improve the scalability of prediction by using deep neural networks, trained on simulated or historical data, to forecast job execution metadata.
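The abstract does not state how sequences of varying lengths are prepared for the models. One common technique, shown here purely as an illustrative assumption, is to right-pad every sequence to the longest length in a batch and keep a mask so that padded positions are ignored when computing the error. The function names `pad_and_mask` and `masked_mean_abs_error` are hypothetical and are not taken from the thesis or from DCSim.

```python
from typing import List, Sequence, Tuple


def pad_and_mask(
    sequences: Sequence[Sequence[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[List[int]]]:
    """Right-pad each sequence to the longest length and return a 0/1 mask.

    Hypothetical helper: the thesis may prepare its data differently.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    padded: List[List[float]] = []
    mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        # 1 marks a real value, 0 marks a padded position.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


def masked_mean_abs_error(
    predictions: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Mean absolute error over real (unpadded) positions only."""
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(predictions, targets, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Two made-up job sequences of different lengths (illustrative values only).
    targets, mask = pad_and_mask([[1.0, 2.0, 3.0], [4.0]])
    predictions = [[1.0, 2.0, 5.0], [4.0, 9.0, 9.0]]
    print(masked_mean_abs_error(predictions, targets, mask))
```

Because the mask excludes padded positions, the predictions at those positions do not affect the reported error, which keeps short and long sequences comparable within one batch.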

We assess the models across four scenarios of increasing complexity, evaluating their ability to generalize to unseen jobs and platforms. We examine the training duration and the amount of data necessary to achieve accurate predictions, and discuss whether such models can overcome the scalability challenges of DCSim. The key findings of this work demonstrate that the models generalize to sequences whose lengths were encountered during training but fall short in generalizing across different platforms.