A little goes a long way: Improving toxic language classification despite data scarcity

Mika Juuti, Tommi Gröndahl, Adrian Flanagan, N. Asokan

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-reviewed


Abstract

Detection of some types of toxic language is hampered by extreme scarcity of labeled training data. Data augmentation – generating new synthetic data from a labeled seed dataset – can help. The efficacy of data augmentation on toxic language classification has not been fully explored. We present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers, ranging from shallow logistic regression architectures to BERT – a state-of-the-art pre-trained Transformer network. We compare the performance of eight techniques on very scarce seed datasets. We show that while BERT performed the best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including GPT-2-generated sentences. We discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints.
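The abstract mentions GPT-2-generated sentences as one of the augmentation techniques. The following Python sketch illustrates the general idea only, not the authors' exact procedure: it uses the Hugging Face transformers text-generation pipeline to sample continuations of a few labeled seed sentences and assigns each continuation the seed's label. The model choice, label-propagation heuristic, and generation parameters are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of GPT-2-based data augmentation for a scarce labeled seed set.
# Assumptions (not from the paper): prompt format, label inheritance heuristic,
# and all generation hyperparameters.
from transformers import pipeline

# Tiny toy seed set; in practice these would come from the scarce labeled data.
seed_examples = [
    ("you are all completely worthless", 1),      # toxic
    ("thanks for the helpful explanation", 0),    # non-toxic
]

generator = pipeline("text-generation", model="gpt2")

augmented = []
for text, label in seed_examples:
    # Sample several continuations of the seed sentence; each synthetic
    # sentence inherits the seed's label (a common heuristic, assumed here).
    outputs = generator(
        text,
        max_new_tokens=30,
        num_return_sequences=3,
        do_sample=True,
        top_k=50,
        pad_token_id=50256,  # GPT-2 has no pad token; reuse EOS to avoid warnings
    )
    for out in outputs:
        new_text = out["generated_text"][len(text):].strip()
        if new_text:
            augmented.append((new_text, label))

print(f"Generated {len(augmented)} synthetic examples from {len(seed_examples)} seeds")
```

The synthetic examples would then be appended to the seed data before training a classifier; the paper combines GPT-2 generation with other augmentation techniques rather than using it alone.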
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: EMNLP 2020
Publisher: Association for Computational Linguistics
Pages: 2991-3009
Number of pages: 18
ISBN (Electronic): 978-1-952148-90-3
DOIs
Publication status: Published - 20 Nov 2020
MoE publication type: A4 Conference publication
Event: Conference on Empirical Methods in Natural Language Processing - Virtual, Online
Duration: 16 Nov 2020 – 20 Nov 2020

Conference

Conference: Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP
City: Virtual, Online
Period: 16/11/2020 – 20/11/2020
