A little goes a long way: Improving toxic language classification despite data scarcity

Mika Juuti, Tommi Gröndahl, Adrian Flanagan, N. Asokan

Research output: Article in book/conference proceedings; Conference article in proceedings; Scientific; peer-reviewed

Abstract

Detection of some types of toxic language is hampered by extreme scarcity of labeled training data. Data augmentation – generating new synthetic data from a labeled seed dataset – can help. The efficacy of data augmentation on toxic language classification has not been fully explored. We present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers, ranging from shallow logistic regression architectures to BERT – a state-of-the-art pre-trained Transformer network. We compare the performance of eight techniques on very scarce seed datasets. We show that while BERT performed the best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including GPT-2-generated sentences. We discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints.
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: EMNLP 2020
Publisher: Association for Computational Linguistics
Pages: 2991-3009
Number of pages: 18
ISBN (electronic): 978-1-952148-90-3
DOI (permanent links):
Status: Published - 20 Nov 2020
OKM publication type: A4 Article in conference proceedings
Event: Conference on Empirical Methods in Natural Language Processing - Virtual, Online
Duration: 16 Nov 2020 to 20 Nov 2020

Conference

Conference: Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP
City: Virtual, Online
Period: 16/11/2020 to 20/11/2020
