Efficient Pruning of N-gram Corpora for Culturomics using Language Models

From IPD-Institutsseminar
Speaker Caspar Friedrich Maximilian Nagy
Talk type Bachelor's thesis
Advisor Jens Willkomm
Date Fri, 23 October 2020
Abstract Big data technology pushes the frontiers of science. A particularly interesting application is culturomics, which uses big data techniques to accurately quantify and observe language and culture over time. A milestone in enabling this kind of analysis in a traditionally humanistic field was the effort around the Google Books project. The scanned books were transformed into a so-called N-gram corpus, which contains the frequencies of words and word combinations over time. Unfortunately, this corpus is enormous, requiring over 2 terabytes of storage, which makes handling, storing, and querying it difficult. In this bachelor's thesis, we introduce a novel technique to reduce the storage requirements of N-gram corpora. It uses natural language processing to estimate the counts of N-grams. Our approach is able to prune around 30% more effectively than state-of-the-art methods.