Search by property

This page provides a simple search interface for finding objects that contain a property with a given data value. Other available search interfaces are the property search and the query builder.

List of results

Improving Document Information Extraction with efficient Pre-Training + (SAP Document Information Extraction (DOX) is a service to extract logical entities from scanned documents based on the well-known Transformer architecture. The entities comprise header information such as document date or sender name, and line items from tables on the document with fields such as line item quantity. The model currently needs to be trained on a huge number of labeled documents, which is impractical. This also hinders the deployment of the model at large scale, as it cannot easily adapt to new languages or document types. Recently, pretraining large language models with self-supervised learning techniques has shown good results as a preliminary step and allows reducing the number of labels required in follow-up steps. However, to generalize self-supervised learning to document understanding, we need to take into account different modalities: text, layout and image information of documents. How to do that efficiently and effectively is not yet clear. The goal of this thesis is to come up with a technique for self-supervised pretraining within SAP DOX. We will evaluate our method and design decisions against SAP data as well as public data sets. Besides the accuracy of the extracted entities, we will measure to what extent our method lets us lower label requirements.)
Wichtigkeit von Merkmalen für die Klassifikation von SAT-Instanzen (Proposal) + (SAT is one of the most important NP-hard problems in theoretical computer science, which is why research is particularly interested in finding efficient solving methods for it. To this end, instances are classified by grouping similar problem instances into instance families, a process that is to be automated with machine learning methods. The bachelor's thesis addresses, among others, the following questions: Which (most important) features allow an instance to be assigned to a particular family? How can a good classifier for this problem be built? What do instances that are often misclassified have in common? What does a sensible division into families look like?)
Verification of Access Control Policies in Software Architectures + (Security in software systems becomes more important as systems become more complex and connected. Therefore, it is desirable to conduct security analysis on an architectural level. Data-based privacy analyses are one possible approach in this direction. Such approaches are evaluated on case studies. Most exemplary systems for case studies are developed specifically for the approach under investigation. Therefore, it is not that simple to find a fitting case study. The thesis introduces a method to create usable case studies for data-based privacy analyses. The method is applied to the Community Component Modeling Example (CoCoME). The evaluation is based on a GQM plan and shows that the method is applicable. It also shows that the created case study is able to check whether illegal information flow is present in CoCoME. Additionally, it is shown that the provided meta model extension is able to express the case study.)
Beyond Similarity - Dimensions of Semantics and How to Detect them + (Semantic similarity estimation is a widely used and well-researched area. Current state-of-the-art approaches estimate text similarity with large language models. However, semantic similarity estimation often ignores fine-grained differences between semantically similar sentences. This thesis proposes the concept of semantic dimensions to represent fine-grained differences between two sentences. A workshop with domain experts identified ten semantic dimensions. From the workshop insights, a model for semantic dimensions was created. Afterward, 60 participants decided via a survey which semantic dimensions are useful to users. Detectors for the five most useful semantic dimensions were implemented in an extendable framework. To evaluate the semantic dimension detectors, a dataset of 200 sentence pairs was created. The detectors reached an average F1 score of 0.815.)
Faster Feedback Cycles via Integration Testing Strategies for Serverless Edge Computing + (Serverless computing allows software engineers to develop applications in the cloud without having to manage the infrastructure. The infrastructure is managed by the cloud provider. Therefore, software engineers treat the underlying infrastructure as a black box and focus on the business logic of the application. This lack of inside knowledge leads to an increased testing difficulty as applications tend to be dependent on the infrastructure and other applications running in the cloud environment. While isolated unit and functional testing is possible, integration testing is a challenge, as reliable results are often only achieved after deploying to the deployment environment because infrastructure specifics and other cloud services are only available in the actual cloud environment. This leads to a laborious development process. For this reason, this thesis deals with creating testing strategies for serverless edge computing to reduce feedback cycles and speed up development time. For evaluation, the developed testing strategies are applied to Lambda@Edge in AWS.)
Influence of Load Profile Perturbation and Temporal Aggregation on Disaggregation Quality + (Smart meters are becoming more and more popular. With smart meters, new privacy issues arise. A prominent privacy issue is disaggregation, i.e., the determination of appliance usages from aggregated smart meter data. The goal of this thesis is to evaluate load profile perturbation and temporal aggregation techniques regarding their ability to prevent disaggregation. To this end, we used a privacy operator framework for temporal aggregation and perturbation, and the NILM TK framework for disaggregation. We evaluated the influence of the operators from the framework on disaggregation quality, both individually and in combination. One main observation is that the de-noising operator from the framework prevents disaggregation best.)
Modelling and Enforcing Access Control Requirements for Smart Contracts + (Smart contracts are software systems employing the underlying blockchain technology to handle transactions in a decentralized and immutable manner. Due to the immutability of the blockchain, smart contracts cannot be upgraded after their initial deployment. Therefore, reasoning about a contract’s security aspects needs to happen before the deployment. One common vulnerability for smart contracts is improper access control, which enables entities to modify data or employ functionality they are prohibited from accessing. Due to the nature of the blockchain, access to data, represented through state variables, can only be achieved by employing the contract’s functions. To correctly restrict access on the source code level, we improve the approach by Reiche et al., who enforce access control policies based on a model on the architectural level. This work aims at correctly enforcing role-based access control (RBAC) policies for Solidity smart contract systems on the architectural and source code level. We extend the standard RBAC model by Sandhu, Ferraiolo, and Kuhn to also incorporate insecure information flows and authorization constraints for roles. We create a metamodel to capture the concepts necessary to describe and enforce RBAC policies on the architectural level. The policies are enforced in the source code by translating the model elements to formal specifications. For this purpose, an automatic code generator is implemented. To reason about the implemented smart contracts on the source code level, tools like solc-verify and Slither are employed and extended. Furthermore, we outline the development process resulting from the presented approach. To evaluate our approach and uncover problems and limitations, we employ a case study using the three smart contract software systems Augur, Fizzy and Palinodia.
Additionally, we apply a metamodel coverage analysis to reason about the metamodel’s and the generator’s completeness. Furthermore, we provide an argumentation concerning the approach’s correct enforcement. This evaluation shows how a correct enforcement can be achieved under certain assumptions and when information flows are not considered. The presented approach can detect 100% of the violations of the underlying RBAC policies that were manually introduced during the case study. Additionally, the metamodel is expressive enough to describe RBAC policies and contains no unnecessary elements, since approximately 90% of the created metamodel is covered by the implemented generator. We identify and describe limitations like oracles or public variables.)
Methodology for Evaluating a Domain-Specific Model Transformation Language + (As soon as a system is described by several models, these different descriptions can contradict one another. Model transformations are a suitable means of avoiding this even when the models are edited by several parties in parallel. There is by now a rich body of research on transforming changes between two models. However, the challenge of developing model transformations between more than two models has not yet been solved adequately. The Commonalities Language is a declarative, domain-specific programming language with which multidirectional model transformations can be programmed by combining bidirectional mapping specifications. Because it has not yet been validated empirically, it remains an open question whether the language is suitable for developing realistic model transformations and what advantages it offers over an alternative programming language for model transformations. In this thesis, I design a case study with which the Commonalities Language is evaluated. I discuss the methodology and the validity of this case study. Furthermore, I present congruence, a new property for bidirectional model transformations. It ensures that the two directions of a transformation are compatible with each other. I derive from practical examples why we can expect transformations to usually be congruent. I then discuss the design decisions behind a testing strategy with which two model transformation implementations that both realize the same consistency specification can be tested. The testing strategy also includes a practical use of congruence.
Finally, I present improvements to the Commonalities Language. Together, the contributions of this thesis make it possible to carry out a case study on programming languages for model transformations. This allows a better understanding of the advantages of these languages to be gained. Congruence can improve the usability of arbitrary model transformations and could prove useful for constructing networks of model transformations. The testing strategy can be applied to arbitrary acceptance tests for model transformations.)
Modeling of Security Patterns in Palladio + (Software itself and the contexts it is used in typically evolve over time. Analyzing and ensuring the security of evolving software systems in contexts that are also evolving poses many difficulties. In my thesis, I declare a number of goals and propose processes for eliciting attacks, their prerequisites and mitigating security patterns for a given architecture model, and for annotating the model with security-relevant information. I show how this information can be used to analyze the system's security with regard to modeled attacks, using an attack validity algorithm I specify. Process and algorithm are used in a case study on CoCoME in order to show the applicability of each of them and to analyze the fulfillment of the previously stated goals. Security catalog meta-models and instances of catalogs containing a number of elements have been provided.)
Multi-model Consistency through Transitive Combination of Binary Transformations + (Software systems are usually described through multiple models that address different development concerns. These models can contain shared information, which leads to redundant representations of the same information and dependencies between the models. These representations of shared information have to be kept consistent for the system description to be correct. The evolution of one model can cause inconsistencies with regard to other models for the same system. Therefore, some mechanism of consistency restoration has to be applied after changes have occurred. Manual consistency restoration is error-prone and time-consuming, which is why automated consistency restoration is necessary. Many existing approaches use binary transformations to restore consistency for a pair of models, but systems are generally described through more than two models. To achieve multi-model consistency preservation with binary transformations, they have to be combined through transitive execution. In this thesis, we explore the transitive combination of binary transformations and study the resulting problems. We develop a catalog of six failure potentials that can manifest in failures with regard to consistency between the models. The knowledge about these failure potentials can inform a transformation developer about possible problems arising from the combination of transformations. One failure potential is a consequence of the transformation network topology and the domain models used. It can only be avoided through topology adaptations. Another failure potential emerges when two transformations try to enforce conflicting consistency constraints. This can only be repaired through adaptation of the original consistency constraints. Both failure potentials are case-specific and cannot be solved without knowing which transformations will be combined.
Furthermore, we develop two transformation implementation patterns to mitigate two other failure potentials. These patterns can be applied by the transformation developer to an individual transformation definition, independent of the combination scenario. For the remaining two failure potentials, no general solution has been found yet and further research is necessary. We evaluate the findings with a case study that involves two independently developed transformations between a component-based software architecture model, a UML class diagram and its Java implementation. All failures revealed by the evaluation could be classified with the identified failure potentials, which gives an initial indicator for the completeness of our failure potential catalog. The proposed patterns prevented all failures of their targeted failure potentials, which made up 70% of all observed failures. This shows that the developed implementation patterns are applicable and help to mitigate issues arising from transitively combining binary transformations.)
Abstrakte und konsistente Vertraulichkeitsspezifikation von der Architektur bis zum Code + (Software systems can process sensitive information. To ensure its confidentiality, both the architecture model and its implementation can be analyzed with respect to information flow. For this purpose, a confidentiality specification is defined. Both model levels hold a representation of the same specification. As the system evolves, the specification can change on both levels and consequently contain contradictory statements. To verify the confidentiality of the information, the specification elements in the source code must be translated into another language in an additional step. The bachelor's thesis deals with the transformation of the different representations of a software system's confidentiality specification. This includes a mapping concept for keeping the confidentiality specification consistent and the translation into a language that can be used for verification.)
Automatisiertes GUI-basiertes Testen einer Passwortmanager-Applikation mit Neuroevolution + (Software testing is essential for ensuring the quality and functionality of software products. Both manual and automated methods exist. However, automated approaches as well as human and script-based tests have limitations regarding cost efficiency and time required. Monkey testing, characterized by random clicks on the user interface, often does not sufficiently take the logic of the application into account. This bachelor's thesis concentrates on the automated neuroevolutionary testing method, which uses neural networks as test agents and refines them over several generations by means of evolutionary algorithms. To evaluate these agents and compare them with monkey testing, a simulated version of a password manager application was used. A reward structure was implemented within the simulated application. The results show that the neuroevolutionary testing approach performs significantly better than monkey testing in terms of the rewards achieved. This leads to better consideration of the application logic in the testing process.)
GUI-basiertes Testen einer Lernplattform-Anwendung durch Nutzung von Neuroevolution + (Software testing is necessary to ensure the quality and functionality of software artifacts. Both automated and manual testing approaches exist. However, automated approaches as well as human and script-based testing scale less well in terms of time and cost. Monkey testing, which is characterized by random clicks on the user interface, often does not sufficiently take the application logic into account. The focus of this bachelor's thesis is the automated neuroevolutionary testing approach, which uses neural networks as test agents and improves them over several generations with the help of evolutionary algorithms. To enable the training of the agents and the comparison with monkey testing, a simulated version of the learning platform Anki was implemented. To assess the test agents, a reward structure was developed in the simulated application. The results show that the neuroevolutionary testing approach performs significantly better than monkey testing in terms of rewards achieved. As a result, the application logic is better taken into account in the testing process.)
Entity Linking für Softwarearchitekturdokumentation + (Software architecture documentation contains technical terms from the domain of software development. If these terms are found and linked to the matching terms in a database, people and text processing systems can use this information to understand the documentation better. The technical terms in documentation correspond to entity mentions in the text. In this work, we present our domain-specific entity linking system. The system links entity mentions within software architecture documentation to the corresponding entities within a knowledge base. The system comprises a domain-specific knowledge base, a preprocessing module, and an entity linking system.)
Entwicklung einer Entwurfszeit-DSL zur Formalisierung von Runtime Adaptationsstrategien für SAS zum Zweck der Strategie-Optimierung + (Today's software systems are becoming increasingly complex and are subject to ever more varying conditions. As a result, self-adaptive systems are gaining importance, because they can adapt dynamically to new conditions by making changes to themselves. Domain-specific modeling languages (DSLs) for formalizing adaptation strategies are an important means of modeling and optimizing the design of feedback loops of self-adaptive software systems. This proposal puts forward a bachelor's thesis that addresses the question of how an optimization of adaptation strategies can be described in a DSL at design time.)
Preventing Code Insertion Attacks on Token-Based Software Plagiarism Detectors + (Some students tasked with mandatory programming assignments lack the time or dedication to solve the assignment themselves. Instead, they plagiarize a peer’s solution by slightly modifying the code. However, there exist numerous tools that assist in detecting these kinds of plagiarism. These tools can be used by instructors to identify plagiarized programs. The most widely used type of plagiarism detection tool is the token-based plagiarism detector. Such detectors are resilient against many types of obfuscation attacks, such as renaming variables or whitespace modifications. However, they are susceptible to the insertion of lines of code that do not affect the program flow or result. The working assumption so far was that the successful obfuscation of plagiarism takes more effort and skill than solving the assignment itself. This assumption was broken by automated plagiarism generators, which exploit this weakness. This work aims to develop mechanisms against code insertions that can be directly integrated into existing token-based plagiarism detectors. For this, we first develop mechanisms to negate the negative effect of many types of code insertion. Then we implement these mechanisms prototypically in a state-of-the-art plagiarism detector. We evaluate our implementation by running it on a dataset consisting of real student submissions and automatically generated plagiarism. We show that with our mechanisms, the similarity rating of automatically generated plagiarism increases drastically. Consequently, the plagiarism generator we use fails to create usable plagiarism.)
Software Plagiarism Detection on Intermediate Representation + (Source code plagiarism is a widespread problem in computer science education. To counteract this, software plagiarism detectors can help identify plagiarized code. Most state-of-the-art plagiarism detectors are token-based. It is common to design and implement a new dedicated language module to support a new programming language. This process can be time-consuming; furthermore, it is unclear whether it is even necessary. In this thesis, we evaluate the necessity of dedicated language modules for Java and C/C++ and derive conclusions for designing new ones. To achieve this, we create a language module for the intermediate representation of LLVM. For the evaluation, we compare it to two existing dedicated language modules in JPlag. While our results show that dedicated language modules are better for plagiarism detection, language modules for intermediate representations show better resilience to obfuscation attacks.)
Portables Auto-Tuning paralleler Anwendungen + (Both offline and online tuning are common solutions for automatically optimizing parallel applications. Both approaches have their individual advantages and disadvantages: offline tuning has minimal negative impact on the application's run time, but the tuned parameter values are only usable on hardware known in advance. Online tuning, in contrast, provides dynamic parameter values that are determined at the application's run time and thus on the target hardware; however, this can negatively affect the application's run time. We try to merge the advantages of both approaches by reusing pre-optimized parameter configurations on the target hardware and, where applicable, with a different application. We evaluate both the hardware portability and the application portability of the configurations using five example applications.)
DomainML: A modular framework for domain knowledge-guided machine learning + (Standard, data-driven machine learning approaches learn relevant patterns solely from data. In some fields, however, learning only from data is not sufficient. A prominent example of this is healthcare, where the problem of data insufficiency for rare diseases is tackled by integrating high-quality domain knowledge into the machine learning process. Despite the existing work in the healthcare context, making general observations about the impact of domain knowledge is difficult, as different publications use different knowledge types, prediction tasks and model architectures. It further remains unclear whether the findings in healthcare are transferable to other use cases, as well as how much intellectual effort this requires. With this thesis, we introduce DomainML, a modular framework to evaluate the impact of domain knowledge on different data science tasks. We demonstrate the transferability and flexibility of DomainML by applying the concepts from healthcare to cloud system monitoring. We then observe how domain knowledge impacts the model’s prediction performance across both domains, and suggest how DomainML could further be used to refine both the given domain knowledge and the quality of the underlying dataset.)
State of the Art: Multi Actor Behaviour and Dataflow Modelling for Dynamic Privacy + (State-of-the-art talk given as part of Praxis der Forschung.)
Data-Preparation for Machine-Learning Based Static Code Analysis + (Static Code Analysis (SCA) has become an integral part of modern software development, especially since the rise of automation in the form of CI/CD. How machine learning can best help improve the state of SCA, and thus facilitate maintainable, correct, and secure software, is an ongoing question. However, machine learning needs a solid foundation to learn on. This thesis proposes an approach to build that foundation by mining data on software issues from real-world code. We show how we used that concept to analyze over 4000 software packages and generate over two million issue samples. Additionally, we propose a method for refining this data and apply it to an existing machine learning SCA approach.)
Creating Study Plans by Generating Workflow Models from Constraints in Temporal Logic + (Students are confronted with a huge number of regulations when planning their studies at a university. It is challenging for them to create a personalized study plan while still complying with all official rules. The STUDYplan software aims to overcome the difficulties by enabling an intuitive and individual modeling of study plans. To make use of existing work in the business process domain, a study plan can be interpreted as a sequence of business process tasks that indicate courses. This thesis focuses on the idea of synthesizing business process models from declarative specifications that indicate official and user-defined regulations for a study plan. We provide an elaborated approach for the modeling of study plan constraints and a generation concept specialized to study plans. This work motivates, discusses, partially implements and evaluates the proposed approach.)
A comparative study of subgroup discovery methods + (Subgroup discovery (SD) is a data mining technique that is used to extract interesting relationships in a dataset related to a target variable. These relationships are described in the form of rules. Multiple SD techniques have been developed over the years. This thesis establishes a comparative study of a number of these techniques in order to identify the state-of-the-art methods. It also analyses the effects that discretization as a preprocessing step has on them. Furthermore, it investigates the effect of hyperparameter optimization on these methods. Our analysis showed that PRIM, DSSD, Best Interval and FSSD outperformed the other subgroup discovery methods evaluated in this study and are to be considered state-of-the-art. It also shows that discretization offers an efficiency improvement for methods that do not employ internal discretization, whereas it has a negative impact on the quality of subgroups generated by methods that perform it internally. Finally, the results demonstrate that Apriori-SD and SD-Algorithm were the most positively affected by the hyperparameter optimization.)
Software Testing + (TBA)
Exploring Modern IDE Functionalities for Consistency Preservation + (TBA)
Exploring the Traceability of Requirements and Source Code via LLMs + (TBA)

Data-Driven Approaches to Predict Material Failure and Analyze Material Models + (The prediction of material failure is useful in many industrial contexts such as predictive maintenance, where it helps reduce costs by preventing outages. However, failure prediction is a complex task. Typically, material scientists need to create a physical material model to run computer simulations. In real-world scenarios, the creation of such models is often not feasible, as the measurement of exact material parameters is too expensive. Material scientists can use material models to generate simulation data. These data sets are multivariate sensor value time series. In this thesis we develop data-driven models to predict upcoming failure of an observed material. We identify and implement recurrent neural network architectures, as recent research indicated that these are well suited for predictions on time series. We compare the prediction performance with traditional models that do not directly predict on time series but involve an additional step of feature calculation. Finally, we analyze the predictions to find abstractions in the underlying material model that lead to unrealistic simulation data and thus impede accurate failure prediction. Knowing such abstractions empowers material scientists to refine the simulation models. The updated models would then contain more relevant information and make failure prediction more precise.)
Improving SAP Document Information Extraction via Pretraining and Fine-Tuning + (Techniques for extracting relevant information from documents have made significant progress in recent years and have become a key task in the digital transformation. With deep neural networks, it became possible to process documents without specifying hard-coded extraction rules or templates for each layout. However, such models typically have a very large number of parameters. As a result, they require many annotated samples and long training times. One solution is to create a basic pretrained model using self-supervised objectives and then to fine-tune it using a smaller document-specific annotated dataset. However, implementing and controlling the pretraining and fine-tuning procedures in a multi-modal setting is challenging. In this thesis, we propose a systematic method that consists of pretraining the model on large unlabeled data and then fine-tuning it with a virtual adversarial training procedure. For the pretraining stage, we implement an unsupervised informative masking method, which improves upon standard Masked-Language Modelling (MLM). In contrast to randomly masking tokens as in MLM, our method exploits Point-Wise Mutual Information (PMI) to calculate individual masking rates based on statistical properties of the data corpus, e.g., how often certain tokens appear together on a document page. We test our algorithm in a typical business context at SAP and report an overall improvement of 1.4% on the F1-score for extracted document entities. Additionally, we show that the implemented methods improve the training speed, robustness and data-efficiency of the algorithm.)
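The page-level co-occurrence statistic mentioned in the entry above can be illustrated with a minimal sketch. It only computes PMI(a, b) = log(p(a, b) / (p(a) p(b))), with probabilities estimated as the fraction of pages containing the tokens. The function name `page_pmi` and the toy pages are hypothetical, and how the thesis turns PMI values into masking rates is not stated in the abstract, so that step is deliberately left out.

```python
import math
from collections import Counter
from itertools import combinations

def page_pmi(pages):
    """Return PMI for every token pair that co-occurs on at least one page.

    Assumption for this sketch: p(token) and p(pair) are estimated as the
    fraction of pages on which the token (or both tokens) appear.
    """
    n = len(pages)
    token_count = Counter()
    pair_count = Counter()
    for page in pages:
        tokens = sorted(set(page))          # count each token once per page
        token_count.update(tokens)
        pair_count.update(combinations(tokens, 2))
    return {
        (a, b): math.log((c / n) / ((token_count[a] / n) * (token_count[b] / n)))
        for (a, b), c in pair_count.items()
    }

# Hypothetical toy corpus: each inner list is the token set of one page.
pages = [
    ["invoice", "total"],
    ["invoice", "total"],
    ["invoice", "date"],
    ["letter", "date"],
]
pmi = page_pmi(pages)
```

Pairs with positive PMI co-occur more often than independence would predict; pairs with negative PMI co-occur less often.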
Analyse von Zeitreihen-Kompressionsmethoden am Beispiel von Google N-Grams + (Temporal text corpora like the Google Ngram dataset usually incorporate a vast number of words and expressions, called ngrams, and their respective usage frequencies over the years. The large quantity of entries complicates working with the dataset, as transformations and queries are resource- and time-intensive. However, many use cases do not require the whole corpus to obtain a sufficient dataset and achieve acceptable results. We propose various compression methods to reduce the absolute number of ngrams in the corpus. Additionally, we utilize time-series compression methods for quick estimations about the properties of ngram usage frequencies. CHQL (Conceptual History Query Language) queries on the Google Ngram dataset serve as the basis for our compression method design and experimental validation. The goal is to find compression methods that reduce the complexity of queries on the corpus while still maintaining good results.)
Analyse von Zeitreihen-Kompressionsmethoden am Beispiel von Google N-Gram + (Temporal text corpora like the Google Ngram Data Set usually incorporate a vast number of words and expressions, called ngrams, and their respective usage frequencies over the years. The large quantity of entries complicates working with the data set, as transformations and queries are resource- and time-intensive. However, many use cases do not require the whole corpus to obtain a sufficient data set and achieve acceptable query results. We propose various compression methods to reduce the total number of ngrams in the corpus. Specifically, we propose compression methods that, given an input dictionary of target words, find a compression tailored for queries on a specific topic. Additionally, we utilize time-series compression methods for quick estimations about the properties of ngram usage frequencies. CHQL (Conceptual History Query Language) queries on the Google Ngram Data Set serve as the basis for our compression method design and experimental validation.)
Implementation and Evaluation of CHQL Operators in Relational Database Systems + (The IPD defined CHQL, a query algebra that makes it possible to formalize queries about conceptual history. CHQL is currently implemented in MapReduce, which offers less flexibility for query optimization than relational database systems do. The scope of this thesis is to implement the given operators in SQL and analyze performance differences by identifying limiting factors and optimizing queries on the logical and physical level. At the end, we will provide efficient query plans and fast operator implementations to execute CHQL queries in relational database systems.)
The Kconfig Variability Framework as a Feature Model + (The Kconfig variability framework is used to develop highly variable software such as the Linux kernel, ZephyrOS and NuttX. Kconfig allows developers to break down their software into modules and define the dependencies between these modules, so that when a concrete configuration is created, the semantic dependencies between the selected modules are fulfilled, ensuring that the resulting software product can function. Kconfig has often been described as a tool to define software product lines (SPLs), which often occur within the context of feature-oriented programming (FOP). In this paper, we introduce methods to transform Kconfig files into feature models so that the semantics of the model defined in a Kconfig file are preserved. The resulting feature models can be viewed with FeatureIDE, which allows further analysis of the Kconfig file, such as the detection of redundant dependencies and cyclic dependencies.)
Review of data efficient dependency estimation + (The amount and complexity of data collected in industry is increasing, and data analysis is rising in importance. Dependency estimation is a significant part of knowledge discovery and allows strategic decisions based on this information. Several examples highlight its importance. Knowing that a correlation exists between the regular dose of a drug and the health of a patient helps to understand the impact of a newly manufactured drug. Knowing how the case material, brand, and condition of a watch influence the price on an online marketplace can help to buy watches at a good price. Material sciences can also use dependency estimation to predict many properties of a material before it is synthesized in the lab, so fewer experiments are necessary. Many dependency estimation algorithms require a large amount of data for a good estimation. But data can be expensive; for example, experiments in material sciences consume material and take time and energy. Because data collection is expensive, algorithms need to be data efficient. But there is a trade-off between the amount of data and the quality of the estimation. A lack of data brings uncertainty to the estimation. However, the algorithms do not always quantify this uncertainty. As a result, we do not know whether we can rely on the estimation or need more data for an accurate estimation. In this bachelor's thesis we compare different state-of-the-art dependency estimation algorithms using a list of criteria addressing these challenges and more. We partly developed the criteria ourselves and partly took them from relevant publications. The existing publications formulated many of the criteria only qualitatively; part of this thesis is to make these criteria quantitatively measurable where possible, and to come up with a systematic approach of comparison for the rest. From 14 selected criteria, we focus on criteria concerning data efficiency and uncertainty estimation, because they are essential for lowering the cost of dependency estimation, but we will also check other criteria relevant for the application of algorithms. As a result, we will rank the algorithms in the different aspects given by the criteria, and thereby identify potential for improvement of the current algorithms. We do this in two steps. First, we check general criteria in a qualitative analysis. For this we check whether the algorithm is capable of guided sampling, whether it is an anytime algorithm, and whether it uses incremental computation to enable early stopping, all of which lead to more data efficiency. We also conduct a quantitative analysis on well-established and representative datasets for the dependency estimation algorithms that performed well in the qualitative analysis. In these experiments we evaluate more criteria: robustness, which is necessary for error-prone data; efficiency, which saves computation time; convergence, which guarantees an accurate estimation given enough data; and consistency, which ensures we can rely on an estimation.)
Identifying Security Requirements in Natural Language Documents + (The automatic identification of requirements, and their classification according to their security objectives, can be helpful to derive insights into the security of a given system. However, this task requires significant security expertise to perform. In this thesis, the capability of modern Large Language Models (such as GPT) to replicate this expertise is investigated. This requires the transfer of the model's understanding of language to the given specific task. In particular, different prompt engineering approaches are combined and compared, in order to gain insights into their effects on performance. GPT ultimately performs poorly for the main tasks of identification of requirements and of their classification according to security objectives. Conversely, the model performs well for the sub-task of classifying the security-relevance of requirements. Interestingly, prompt components influencing the format of the model's output seem to have a higher performance impact than components containing contextual information.)
Predicting System Dependencies from Tracing Data Instead of Computing Them + (The concept of Artificial Intelligence for IT Operations combines big data and machine learning methods to replace a broad range of IT operations including availability and performance monitoring of services. In large-scale distributed cloud infrastructures a service is deployed on different separate nodes. As the size of the infrastructure increases in production, the analysis of metrics parameters becomes computationally expensive. We address the problem by proposing a method to predict dependencies between metrics parameters of system components instead of computing them. To predict the dependencies we use time windowing with different aggregation methods and distributed tracing data that contain detailed information about the system execution workflow. In this bachelor thesis, we inspect the different representations of distributed traces, from simple counting of events to more complex graph representations. We compare them with each other and evaluate the performance of such methods.)
Change Detection in High Dimensional Data Streams + (The data collected in many real-world scenarios such as environmental analysis, manufacturing, and e-commerce are high-dimensional and come as a stream, i.e., data properties evolve over time – a phenomenon known as "concept drift". This brings numerous challenges: data-driven models become outdated, and one is typically interested in detecting specific events, e.g., the critical wear and tear of industrial machines. Hence, it is crucial to detect change, i.e., concept drift, to design a reliable and adaptive predictive system for streaming data. However, existing techniques can only detect "when" a drift occurs and neglect the fact that various drifts may occur in different dimensions, i.e., they do not detect "where" a drift occurs. This is particularly problematic when data streams are high-dimensional. The goal of this Master’s thesis is to develop and evaluate a framework to efficiently and effectively detect “when” and “where” concept drift occurs in high-dimensional data streams. We introduce stream autoencoder windowing (SAW), an approach based on the online training of an autoencoder, while monitoring its reconstruction error via a sliding window of adaptive size. We will evaluate the performance of our method against synthetic data, in which the characteristics of drifts are known. We then show how our method improves the accuracy of existing classifiers for predictive systems compared to benchmarks on real data streams.)
Automated Test Selection for CI Feedback on Model Transformation Evolution + (The development of the transformation model also comes with the appropriate system-level testing to verify its changes. Due to the complex nature of the transformation model, the number of tests increases as the structure and feature description become more detailed. However, executing all test cases for every change is costly and time-consuming. Thus, it is necessary to conduct a selection for the transformation tests. In this presentation, you will be introduced to a change-based test prioritization and transformation test selection approach for early fault detection.)
Statistical Generation of High Dimensional Data Streams with Complex Dependencies + (The evaluation of data stream mining algorithms is an important task in current research. The lack of a ground truth data corpus that covers a large number of desirable features (especially concept drift and outlier placement) is the reason why researchers resort to producing their own synthetic data. This thesis proposes a novel framework ("streamgenerator") that allows creating data streams with finely controlled characteristics. The focus of this work is the conceptualization of the framework; however, a prototypical implementation is provided as well. We evaluate the framework by testing our data streams against state-of-the-art dependency measures and outlier detection algorithms.)
Statistical Generation of High-Dimensional Data Streams with Complex Dependencies + (The extraction of knowledge from data streams is one of the most crucial tasks of modern-day data science. Due to their nature, data streams are ever evolving, and knowledge derived at one point in time may be obsolete in the next period. The need for specialized algorithms that can deal with high-dimensional data streams and concept drift is prevalent. A lot of research has gone into creating this kind of algorithm. The problem here is the lack of data sets with which to evaluate them. A ground truth for a common evaluation approach is missing. A solution to this could be the synthetic generation of data streams with controllable statistical properties, such as the placement of outliers and the subspaces in which special kinds of dependencies occur. The goal of this Bachelor thesis is the conceptualization and implementation of a framework which can create high-dimensional data streams with complex dependencies.)
Theory-guided Load Disaggregation in an Industrial Environment + (The goal of Load Disaggregation (or Non-intrusive Load Monitoring) is to infer the energy consumption of individual appliances from their aggregated consumption. This facilitates energy savings and efficient energy management, especially in the industrial sector. However, previous research showed that Load Disaggregation underperforms in the industrial setting compared to the household setting. Also, the domain knowledge available about industrial processes remains unused. The objective of this thesis was to improve load disaggregation algorithms by incorporating domain knowledge in an industrial setting. First, we identified and formalized several domain knowledge types that exist in the industry. Then, we proposed various ways to incorporate them into the Load Disaggregation algorithms, including Theory-Guided Ensembling, Theory-Guided Postprocessing, and Theory-Guided Architecture. Finally, we implemented and evaluated the proposed methods.)
Tuning of Explainable Artificial Intelligence (XAI) tools in the field of text analysis + (The goal of this bachelor thesis was to analyse classification results using SHAP, a method published in 2017. Explaining how an artificial neural network makes a decision is an interdisciplinary research subject combining computer science, math, psychology and philosophy. We analysed these explanations from a psychological standpoint, and after presenting our findings we will propose a method to improve the interpretability of text explanations using text hierarchies, with little or no loss of accuracy. Secondly, the goal was to test a framework developed to analyse a multitude of explanation methods. This framework will be presented alongside our findings, together with how to use it to create your own analysis. This Bachelor thesis is addressed to people familiar with artificial neural networks and other machine learning methods.)
Specifying and Maintaining the Correspondence between Architecture Models and Runtime Observations + (The goal of this thesis is to provide a generic concept of a correspondence model (CM) to map high-level model elements to corresponding low-level model elements and to generate this mapping during implementation of the high-level model using a correspondence model generator (CMG). In order to evaluate our approach, we implement and integrate the CM for the iObserve project. Further, we implement the proposed CMG and integrate it into ProtoCom, the source code generator used by the iObserve project. We first evaluate the feasibility of this approach by checking whether such a correspondence model can be specified as desired and generated by the CMG. Secondly, we evaluate the accuracy of the approach by checking the generated correspondences against a reference model.)
Intelligent Match Merging to Prevent Obfuscation Attacks on Software Plagiarism Detectors + (The increasing number of computer science students has prompted educators to rely on state-of-the-art source code plagiarism detection tools to deter the submission of plagiarized coding assignments. While these token-based plagiarism detectors are inherently resilient against simple obfuscation attempts, recent research has shown that obfuscation tools empower students to easily modify their submissions, thus evading detection. These tools automatically use dead code insertion and statement reordering to avoid discovery. The emergence of ChatGPT has further raised concerns about its obfuscation capabilities and the need for effective mitigation strategies. Existing defence mechanisms against obfuscation attempts are often limited by their specificity to certain attacks or dependence on programming languages, requiring tedious and error-prone reimplementation. In response to this challenge, this thesis introduces a novel defence mechanism against automatic obfuscation attacks called match merging. It leverages the fact that obfuscation attacks change the token sequence to split up matches between two submissions so that the plagiarism detector discards the broken matches. Match merging reverts the effects of these attacks by intelligently merging neighboring matches based on a heuristic designed to minimize false positives. Our method’s resilience against classic obfuscation attacks is demonstrated through evaluations on diverse real-world datasets, including undergrad assignments and competitive coding challenges, across six different attack scenarios. Moreover, it significantly improves detection performance against AI-based obfuscation. What sets our method apart is its language- and attack-independence, while its minimal runtime overhead makes it seamlessly compatible with other defence mechanisms.)
Efficient k-NN Search of Time Series in Arbitrary Time Intervals + (The k nearest neighbors (k-NN) of a time series are the k closest sequences within a dataset regarding a distance measure. Often, not the entire time series, but only specific time intervals are of interest, e.g., to examine phenomena around special events. While numerous indexing techniques support the k-NN search of time series, none of them is designed for an efficient interval-based search. This work presents the novel index structure Time Series Envelopes Index Tree (TSEIT), which significantly speeds up the k-NN search of time series in arbitrary user-defined time intervals.)
Reinforcement Learning for Solving the Knight’s Tour Problem + (The knight’s tour problem is an instance of the Hamiltonian path problem, a typical NP-hard problem. A knight makes L-shaped moves on a chessboard and tries to visit all the squares exactly once. The tour is closed if the knight can finish a complete tour and end on a square that is one knight's move away from its starting square; otherwise, it is open. Many algorithms and heuristics have been proposed to solve this problem. The most well-known one is Warnsdorff’s heuristic. Warnsdorff’s idea is to move, in a greedy fashion, to the square with the fewest possible onward moves. Although this heuristic is fast, it does not always return a closed tour. Also, it only works on boards of certain dimensions. Due to its greedy behaviour, it can easily get stuck in a local optimum. The same holds for the other existing approaches. Our goal in this thesis is to come up with a new strategy based on reinforcement learning. Ideally, it should be able to find a closed tour on chessboards of any size. We will consider several approaches: value-based methods, policy optimization and actor-critic methods. Compared to previous work, our approach is non-deterministic and sees the problem as a single-player game with a tradeoff between exploration and exploitation. We will evaluate the effectiveness and efficiency of the existing methods and new heuristics.)
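Warnsdorff's greedy rule from the entry above can be sketched in a few lines. This is a minimal illustration of the baseline heuristic only, not the reinforcement learning approach of the thesis. The function names, the tie-breaking (first candidate in move order) and the start square are assumptions made for the sketch.

```python
# The eight L-shaped knight moves as (row, column) offsets.
MOVES = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def onward_moves(board, n, r, c):
    """Return the unvisited squares reachable from (r, c) by one knight move."""
    return [
        (r + dr, c + dc)
        for dr, dc in MOVES
        if 0 <= r + dr < n and 0 <= c + dc < n and board[r + dr][c + dc] == -1
    ]

def warnsdorff_tour(n, start=(0, 0)):
    """Try to build an open tour greedily; return the visit order or None if stuck."""
    board = [[-1] * n for _ in range(n)]    # -1 marks an unvisited square
    r, c = start
    board[r][c] = 0
    path = [start]
    for step in range(1, n * n):
        candidates = onward_moves(board, n, r, c)
        if not candidates:
            return None                     # greedy choice led to a dead end
        # Warnsdorff's rule: pick the candidate with the fewest onward moves.
        r, c = min(candidates, key=lambda sq: len(onward_moves(board, n, *sq)))
        board[r][c] = step
        path.append((r, c))
    return path

tour = warnsdorff_tour(8)
```

The sketch makes no attempt to close the tour, and it returns `None` rather than backtracking when the greedy choice fails, which mirrors the limitation described in the entry.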
Discovering data-driven Explanations + (The main goal of knowledge discovery is to increase knowledge using some set of data. In many cases it is crucial that results are human-comprehensible. Subdividing the feature space into boxes with unique characteristics is a commonly used approach for achieving this goal. The patient-rule-induction method (PRIM) extracts such "interesting" hyperboxes from a dataset by generating boxes that maximize the occurrence of some class inside them. However, the quality of the results varies when applied to small datasets. This work will examine to what extent data generators can be used to artificially increase the amount of available data in order to improve the accuracy of the results. Secondly, it will be tested whether probabilistic classification can improve the results when using generated data.)
Conception and Design of Privacy-preserving Software Architecture Templates + (The passing of new regulations like the European GDPR has clarified that in the future it will be necessary to build privacy-preserving systems to protect the personal data of their users. This thesis will introduce the concept of privacy templates to help software designers and architects in this matter. Privacy templates are at their core similar to design patterns and provide reusable and general architectural structures which can be used in the design of systems to improve privacy in early stages of design. In this thesis we will conceptualize a small collection of privacy templates to make it easier to design privacy-preserving software systems. Furthermore, the privacy templates will be categorized and evaluated to classify them and assess their quality across different quality dimensions.)
Modellierung und Verifikation von Mehrgüterauktionen als Workflows am Beispiel eines Auktionsdesigns + (The presentation will be in English. The objective of this thesis was to develop a system for verifying multi-item auctions as workflows, using one auction design as an example. Building on various preliminary work, the thesis modelled the Clock-Proxy auction design as a workflow and prepared it for verification with process verification methods. A multitude of analysis approaches for auction design already exist, but they ultimately rest on models that allow little variation. For more complex auction procedures such as the multi-item auctions considered in this thesis, these approaches do not offer satisfactory options. Based on the existing methods, an approach was developed that focuses on the data-centric extension of the modelling and of the verification approaches. As a first step, the rules and data were therefore integrated into the workflow model. The challenge was to extract the control and data flow as well as the data and rules from the workflow model by means of an algorithm, and to extend existing transformation algorithms sufficiently. The evaluation of the approach shows that, with the developed software, the thesis achieved the overall goal of verifying a workflow by means of properties.)
Measuring the Privacy Loss with Smart Meters + (The rapid growth of renewable energy sources and the increased sales of electric vehicles contribute to a more volatile power grid. Energy suppliers rely on data to predict the demand and to manage the grid accordingly. The rollout of smart meters could provide the necessary data. On the other hand, smart meters can leak sensitive information about the customer. Several solutions were proposed to mitigate this problem. Some depend on privacy measures to calculate the degree of privacy one could expect from a solution. This bachelor thesis constructs a set of experiments which help to analyse some privacy measures and thereby determine whether the value of a privacy measure increases or decreases with an increase in privacy.)
Standardized Real-World Change Detection Data + (The reliable detection of change points is a fundamental task when analysing data across many fields, e.g., in finance, bioinformatics, and medicine. To define “change points”, we assume that there is a distribution, which may change over time, generating the data we observe. A change point then is a change in this underlying distribution, i.e., the distribution coming before a change point is different from the distribution coming after. The principled way to compare distributions, and to find change points, is to employ statistical tests. While change point detection is an unsupervised problem in practice, i.e., the data is unlabelled, the development and evaluation of data analysis algorithms requires labelled data. Only a few labelled real-world data sets are publicly available, and many of them are either too small or have ambiguous labels. Further issues are that reusing data sets may lead to overfitting, and preprocessing (e.g., removing outliers) may manipulate results. To address these issues, van den Burg et al. published 37 data sets annotated by data scientists and ML researchers and used them for an assessment of 14 change detection algorithms. Yet, there remain concerns due to the fact that these are labelled by hand: Can humans correctly identify changes according to the definition, and can they be consistent in doing so? The goal of this Bachelor's thesis is to algorithmically label their data sets following the formal definition and to also identify and label larger and higher-dimensional data sets, thereby extending their work. To this end, we leverage a non-parametric hypothesis test which builds on Maximum Mean Discrepancy (MMD) as a test statistic, i.e., we identify changes in a principled way. We will analyse the labels so obtained and compare them to the human annotations, measuring their consistency with the F1 score. To assess the influence of the algorithmic and definition-conform annotations, we will use them to reevaluate the algorithms of van den Burg et al. and compare the respective performances.)
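The MMD statistic named in the entry above can be illustrated with a minimal sketch of the standard biased estimate, MMD² = mean k(x, x') + mean k(y, y') − 2 mean k(x, y), computed on the samples before and after a candidate change point. The RBF kernel, the bandwidth parameter `gamma`, the function names and the toy data are assumptions of this sketch. The abstract does not specify the kernel or how the test threshold is derived, so the hypothesis test itself is not shown.

```python
import math

def rbf(a, b, gamma=1.0):
    """RBF kernel between two points given as equal-length tuples."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd2_biased(xs, ys, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between two samples."""
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2 * mean_kernel(xs, ys, gamma)
    )

# Hypothetical one-dimensional windows before and after a candidate change point.
before = [(0.0,), (0.1,), (-0.1,)]
after = [(3.0,), (3.1,), (2.9,)]
no_change = mmd2_biased(before, before)
change = mmd2_biased(before, after)
```

A value near zero suggests the two windows come from the same distribution, while a large value points to a change; turning this into a decision requires a significance threshold, for example from a permutation test.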