Standardized Real-World Change Detection Data: Difference between revisions

From SDQ-Institutsseminar

Revision as of 10 May 2022, 16:56

Speaker Moritz Teichner
Talk type Proposal
Advisor Florian Kalinke
Date Fri 13 May 2022
Talk mode in person
Abstract The reliable detection of change points is a fundamental task when analysing data across many fields, e.g., in finance, bioinformatics, and medicine. To define “change points”, we assume that the observed data is generated by a distribution that may change over time. A change point is then a change in this underlying distribution, i.e., the distribution before a change point differs from the distribution after it. The principled way to compare distributions, and thus to find change points, is to employ statistical tests.

While change point detection is an unsupervised problem in practice, i.e., the data is unlabelled, the development and evaluation of data analysis algorithms requires labelled data. Only a few labelled real-world data sets are publicly available, and many of them are either too small or have ambiguous labels. Further issues are that reusing data sets may lead to overfitting, and that preprocessing (e.g., removing outliers) may manipulate results. To address these issues, van den Burg et al. published 37 data sets annotated by data scientists and ML researchers and used them to assess 14 change detection algorithms. Yet concerns remain because these data sets are labelled by hand: Can humans correctly identify changes according to the definition, and can they do so consistently?

The goal of this Bachelor’s thesis is to algorithmically label their data sets following the formal definition and to also identify and label larger and higher-dimensional data sets, thereby extending their work. To this end, we leverage a non-parametric hypothesis test which builds on Maximum Mean Discrepancy (MMD) as a test statistic, i.e., we identify changes in a principled way. We will analyse the labels so obtained and compare them to the human annotations, measuring their consistency with the F1 score. To assess the influence of the algorithmic, definition-conforming annotations, we will use them to reevaluate the algorithms of van den Burg et al. and compare the respective performances.
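
To illustrate the kind of test the abstract describes, here is a minimal sketch of an MMD-based two-sample test in Python. It is not the thesis implementation. The Gaussian kernel, the median-heuristic bandwidth, the biased estimator, the permutation-based approximation of the null distribution, and the names `mmd2_biased` and `mmd_permutation_test` are all assumptions made for illustration.

```python
import numpy as np


def mmd2_biased(kernel_matrix, n):
    """Biased estimate of squared MMD from a joint kernel matrix.

    The first n rows/columns belong to the first sample, the rest to the second.
    """
    kxx = kernel_matrix[:n, :n].mean()
    kyy = kernel_matrix[n:, n:].mean()
    kxy = kernel_matrix[:n, n:].mean()
    return kxx + kyy - 2.0 * kxy


def mmd_permutation_test(x, y, num_permutations=500, seed=0):
    """Return (statistic, p_value) for H0: x and y come from the same distribution.

    Assumptions (not taken from the source): Gaussian kernel with a
    median-heuristic bandwidth, null distribution approximated by permutation.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    z = np.vstack([x, y])
    n = len(x)

    # Pairwise squared Euclidean distances of the pooled sample.
    sq_norms = np.sum(z**2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T, 0.0)

    # Median heuristic for the squared bandwidth; fall back to 1.0 if degenerate.
    bandwidth_sq = np.median(sq_dists[np.triu_indices(len(z), k=1)])
    if bandwidth_sq <= 0.0:
        bandwidth_sq = 1.0
    kernel_matrix = np.exp(-sq_dists / (2.0 * bandwidth_sq))

    observed = mmd2_biased(kernel_matrix, n)

    # Permuting the pooled sample simulates the null hypothesis of no change.
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(num_permutations):
        perm = rng.permutation(len(z))
        permuted = kernel_matrix[np.ix_(perm, perm)]
        if mmd2_biased(permuted, n) >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (num_permutations + 1)
    return observed, p_value
```

One hypothetical way to use such a test for labelling is to split a time series at a candidate position, pass the segment before and the segment after as `x` and `y`, and mark the position as a change point if the p-value falls below a chosen significance level. The window sizes and the significance level would have to be chosen separately; the source does not specify them.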