Abstract
Social media represents a rich source of up-to-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity for further processing. This learning problem is often concerned with regionally restricted datasets such as data from only one city. Because social media data such as tweets varies considerably across different cities, the training of efficient models requires labeling data from each city of interest, which is costly and time-consuming. In this study, we investigate which features are most suitable for training generalizable models, i.e., models that show good performance across different datasets. We reimplemented the most popular features from the state of the art in addition to other novel approaches, and evaluated them on data from ten different cities. We show that many sophisticated features are not necessarily valuable for training a generalized model and are outperformed by classic features such as plain word n-grams and character n-grams.
Original language | English |
---|---|
Title of host publication | Conference Proceedings - EMNLP 2015 |
Subtitle of host publication | Conference on Empirical Methods in Natural Language Processing |
Publisher | Association for Computational Linguistics |
Pages | 421-430 |
Number of pages | 10 |
ISBN (Electronic) | 9781941643327 |
DOIs | |
Publication status | Published - 2015 |
MoE publication type | A4 Conference publication |
Event | Conference on Empirical Methods in Natural Language Processing - Lisbon, Portugal; Duration: 17 Sept 2015 → 21 Sept 2015 |
Conference
Conference | Conference on Empirical Methods in Natural Language Processing |
---|---|
Abbreviated title | EMNLP |
Country/Territory | Portugal |
City | Lisbon |
Period | 17/09/2015 → 21/09/2015 |