More features are not always better: Evaluating generalizing models in incident type classification of tweets

Axel Schulz, Christian Guckelsberger, Benedikt Schmidt

Research output: Chapter in Book/Report/Conference proceeding; Conference article in proceedings; Scientific; peer-review

2 Citations (Scopus)

Abstract

Social media represents a rich source of up-to-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity for further processing. This learning problem is often concerned with regionally restricted datasets, such as data from only one city. Because social media data such as tweets varies considerably across different cities, the training of efficient models requires labeling data from each city of interest, which is costly and time-consuming. In this study, we investigate which features are most suitable for training generalizable models, i.e., models that show good performance across different datasets. We reimplemented the most popular features from the state of the art in addition to other novel approaches, and evaluated them on data from ten different cities. We show that many sophisticated features are not necessarily valuable for training a generalized model and are outperformed by classic features such as plain word n-grams and character n-grams.
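
As a rough illustration of the two classic feature types named in the abstract, the following minimal sketch extracts word n-grams and character n-grams from a short text. The function names, the lowercasing, and the whitespace tokenization are assumptions made for this example; they are not taken from the paper, whose preprocessing is not described here.

```python
from typing import List


def word_ngrams(text: str, n: int) -> List[str]:
    """Return contiguous word n-grams (assumed: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    """Return contiguous character n-grams over the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


if __name__ == "__main__":
    # Hypothetical example text, not drawn from the paper's datasets.
    example = "Fire on Main Street"
    print(word_ngrams(example, 2))
    print(char_ngrams("fire", 3))
```

In a classifier, such n-grams would typically be counted per tweet and used as sparse features; how the counts are weighted or filtered is a separate design choice not covered by this sketch.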

Original language: English
Title of host publication: Conference Proceedings - EMNLP 2015
Subtitle of host publication: Conference on Empirical Methods in Natural Language Processing
Publisher: Association for Computational Linguistics
Pages: 421-430
Number of pages: 10
ISBN (Electronic): 9781941643327
Publication status: Published - 2015
MoE publication type: A4 Conference publication
Event: Conference on Empirical Methods in Natural Language Processing - Lisbon, Portugal
Duration: 17 Sept 2015 to 21 Sept 2015

Conference

Conference: Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP
Country/Territory: Portugal
City: Lisbon
Period: 17/09/2015 to 21/09/2015
