Evidence-based Token Abstraction for Software Plagiarism Detection

Speaker: Hannes Greule
Talk type: Bachelor's thesis
Advisor: Timur Sağlam
Date: Fri, 28 April 2023
Presentation mode: in person
Abstract: Programming assignments for students are a target of plagiarism. Especially for graded assignments, instructors want to detect plagiarism among the students. For larger courses, however, manual inspection of all submissions is a resource-intensive task. For this purpose, there are numerous tools that can help detect plagiarism in submissions. Many well-known plagiarism detection tools are token-based. In an abstraction step, they map source code to a list of tokens, and these lists are then compared with each other. While there is much research on comparison algorithms, the mapping itself is often considered only superficially. In this work, we conduct two experiments that address the issue of token abstraction. To that end, we design different token abstractions and explain their differences. We then evaluate these abstractions using multiple datasets. We show that the different abstractions have pros and cons, and that a higher abstraction level does not necessarily perform better. These findings are useful when adding support for new programming languages and when improving existing plagiarism detection tools. Furthermore, the results can help in choosing abstractions tailored to specific requirements.
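
To make the abstraction step tangible, the following minimal Java sketch is illustrative only: the token names are hypothetical and are not the abstractions designed in the thesis. It shows how a token-based detector might map two submissions that differ only in identifier names onto the same token list, which is then used for comparison.

import java.util.List;

// Minimal sketch of token abstraction for plagiarism detection.
// Identifiers and literals are abstracted away, so purely lexical
// changes (renaming variables) do not change the token list.
public class TokenAbstractionSketch {

    enum Token { METHOD_BEGIN, ASSIGN, LOOP_BEGIN, LOOP_END, RETURN, METHOD_END }

    // Submission A:  int sum(int[] a)      { int s = 0; for (int x : a)    s += x; return s; }
    // Submission B:  int total(int[] vals) { int t = 0; for (int v : vals) t += v; return t; }
    // Under this abstraction, both submissions map to the same token list:
    static final List<Token> TOKENS = List.of(
            Token.METHOD_BEGIN, Token.ASSIGN,
            Token.LOOP_BEGIN, Token.ASSIGN, Token.LOOP_END,
            Token.RETURN, Token.METHOD_END);

    public static void main(String[] args) {
        System.out.println(TOKENS);  // prints the shared token sequence
    }
}

A coarser abstraction could, for example, additionally collapse different loop kinds into a single LOOP token; choosing how much to abstract is exactly the trade-off the thesis evaluates.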