More features are not always better: Evaluating generalizing models in incident type classification of tweets

Axel Schulz, Christian Guckelsberger, Benedikt Schmidt

Research output: Article in book/conference proceedings, Conference contribution, Scientific, peer-reviewed

2 Citations (Scopus)

Abstract

Social media represents a rich source of up-to-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity for further processing. This learning problem is often concerned with regionally restricted datasets such as data from only one city. Because social media data such as tweets varies considerably across different cities, the training of efficient models requires labeling data from each city of interest, which is costly and time-consuming. In this study, we investigate which features are most suitable for training generalizable models, i.e., models that show good performance across different datasets. We reimplemented the most popular features from the state of the art in addition to other novel approaches, and evaluated them on data from ten different cities. We show that many sophisticated features are not necessarily valuable for training a generalized model and are outperformed by classic features such as plain word-n-grams and character-n-grams.
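
To make the "classic features" named in the abstract concrete, the following is a minimal sketch of word n-gram and character n-gram extraction. It is not the authors' implementation: the lowercasing, the whitespace tokenization, the function names, the chosen n-gram orders, and the example tweet are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the contiguous word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the contiguous character n-grams of a lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_features(text, word_n=(1, 2), char_n=(3,)):
    """Count word and character n-grams, prefixing each key with its feature type."""
    feats = Counter()
    for n in word_n:
        feats.update(f"w{n}:{g}" for g in word_ngrams(text, n))
    for n in char_n:
        feats.update(f"c{n}:{g}" for g in char_ngrams(text, n))
    return feats


# Hypothetical example tweet, not taken from the paper's datasets.
features = ngram_features("Car crash on Main Street")
```

The resulting counts form a sparse bag-of-n-grams representation that could be passed to any standard classifier. Because such features depend only on surface text, they need no city-specific resources, which is one plausible reading of why they might transfer across datasets.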

Original language: English
Title of host publication: Conference Proceedings - EMNLP 2015
Subtitle: Conference on Empirical Methods in Natural Language Processing
Publisher: Association for Computational Linguistics
Pages: 421-430
Number of pages: 10
ISBN (electronic): 9781941643327
DOI - permanent links
Status: Published - 2015
MoE publication type: A4 Article in conference proceedings
Event: Conference on Empirical Methods in Natural Language Processing - Lisbon, Portugal
Duration: 17 Sept 2015 to 21 Sept 2015

Conference

Conference: Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP
Country/Territory: Portugal
City: Lisbon
Period: 17/09/2015 to 21/09/2015
