From IPD-Institutsseminar
Session (all sessions)
Date Fri, 19 June 2020, 11:30
Duration 20 min
Room Room 348 (Building 50.34)
Previous session Fri, 12 June 2020
Next session Fri, 26 June 2020


Speaker Caspar Nagy
Title Approximating an Ngram Corpus with Probabilistic Methods
Talk type Proposal
Advisor Jens Willkomm
Abstract In this work, we consider ngram corpora, i.e., sets of word sequences of different lengths together with their usage frequencies in natural language. For example, the 3-gram "bag of words" may occur 200 times. There is a clear dependence between the usage frequencies of (1) the unigrams "bag", "of", and "words", (2) the bigrams "bag of" and "of words", and (3) the trigram "bag of words". Language models exploit part of this connection to implement grammar correction or speech recognition. From a database point of view, the ngram corpus contains information that is either redundant or can be estimated well. This suggests that the corpus size can be reduced substantially while its information is still provided with high accuracy.

We investigate how the frequencies of n-grams and (n+1)-grams relate to each other, in both directions. Our objective is to store only a part of the full ngram corpus and to estimate the rest of the corpus.
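
To illustrate the idea of estimating higher-order counts from stored lower-order ones, here is a minimal sketch in Python. It is not the method proposed in the talk: it assumes a simple first-order Markov approximation, count(w1 w2 w3) ≈ count(w1 w2) · count(w2 w3) / count(w2), and the toy sentence and the helper names `ngram_counts` and `estimate_trigram` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def estimate_trigram(w1, w2, w3, unigrams, bigrams):
    """Estimate count(w1 w2 w3) from unigram and bigram counts only.

    Assumption (illustrative, not from the talk): w3 depends only on w2,
    so count(w1 w2 w3) ~ count(w1 w2) * count(w2 w3) / count(w2).
    """
    middle = unigrams[(w2,)]
    if middle == 0:
        # The middle word never occurs, so the trigram cannot occur either.
        return 0.0
    return bigrams[(w1, w2)] * bigrams[(w2, w3)] / middle


# Toy corpus (hypothetical); a real ngram corpus would be far larger.
tokens = "a bag of words is a bag of tokens".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)  # kept only to compare against the estimate

estimate = estimate_trigram("bag", "of", "words", unigrams, bigrams)
actual = trigrams[("bag", "of", "words")]
print(f"estimated: {estimate}, actual: {actual}")
```

In this sketch only the unigram and bigram tables would need to be stored; the trigram table is computed solely to check the estimate. On real data the approximation will generally deviate from the true counts, and that error is what has to be weighed against the storage saved.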
