Abstract
Social media represents a rich source of up-to-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity for further processing. This learning problem is often concerned with regionally restricted datasets such as data from only one city. Because social media data such as tweets varies considerably across different cities, the training of efficient models requires labeling data from each city of interest, which is costly and time-consuming. In this study, we investigate which features are most suitable for training generalizable models, i.e., models that show good performance across different datasets. We reimplemented the most popular features from the state of the art in addition to other novel approaches, and evaluated them on data from ten different cities. We show that many sophisticated features are not necessarily valuable for training a generalized model and are outperformed by classic features such as plain word n-grams and character n-grams.
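To make the "classic features" named in the abstract concrete, the following is a minimal sketch of word n-gram and character n-gram extraction. It is not the authors' implementation: the whitespace tokenization, lowercasing, choice of n, raw-count weighting, the `w:`/`c:` feature prefixes, and the sample tweet are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams as tuples (assumes lowercasing and whitespace tokenization)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams over the lowercased text, spaces included."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def featurize(text, word_n=(1, 2), char_n=(3,)):
    """Build a bag-of-n-grams count vector; the n ranges are illustrative defaults."""
    features = Counter()
    for n in word_n:
        for gram in word_ngrams(text, n):
            features["w:" + " ".join(gram)] += 1
    for n in char_n:
        for gram in char_ngrams(text, n):
            features["c:" + gram] += 1
    return features


# Hypothetical tweet, not taken from the paper's datasets.
example = featurize("Car crash on Main Street")
```

The resulting counts could be passed to any standard classifier. Features of this kind depend only on surface strings and need no city-specific resources.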
Original language | English |
---|---|
Title | Conference Proceedings - EMNLP 2015 |
Subtitle | Conference on Empirical Methods in Natural Language Processing |
Publisher | Association for Computational Linguistics |
Pages | 421-430 |
Number of pages | 10 |
ISBN (electronic) | 9781941643327 |
DOI - permanent links | |
Status | Published - 2015 |
OKM publication type | A4 Article in conference proceedings |
Event | Conference on Empirical Methods in Natural Language Processing - Lisbon, Portugal. Duration: 17 Sep 2015 → 21 Sep 2015 |
Conference
Conference | Conference on Empirical Methods in Natural Language Processing |
---|---|
Abbreviated title | EMNLP |
Country/Region | Portugal |
City | Lisbon |
Period | 17/09/2015 → 21/09/2015 |