Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model

Iiro Rastas, Yann Ciarán Ryan, Iiro Tiihonen, Mohammadreza Mohammadnia Qaraei, Liina Repo, Rohit Babbar, Eetu Mäkelä, Mikko Tolonen, Filip Ginter

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceedings › Scientific › peer-review

Abstract

In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We establish the performance of the BERT model on a publication year prediction task against linear baseline models and human judgement, finding the BERT model to be superior to both and able to date the works, on average, with less than 7 years absolute error. We also explore how language change over time affects the model by analyzing the features the model uses for publication year predictions as given by the Integrated Gradients model explanation method.
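
The two quantities named in the abstract, average absolute dating error and Integrated Gradients attributions, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy model `toy_year_model`, its `WEIGHTS`, and the example inputs are hypothetical stand-ins for the BERT model and real ECCO documents, chosen only so the example runs with the Python standard library. Integrated Gradients is approximated here with a midpoint Riemann sum along the straight path from a baseline input to the actual input.

```python
import math
from typing import Callable, List, Sequence


def mean_absolute_error(true_years: Sequence[float],
                        predicted_years: Sequence[float]) -> float:
    """Average of |true - predicted| over all documents."""
    if len(true_years) != len(predicted_years) or not true_years:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(t - p) for t, p in zip(true_years, predicted_years)]
    return sum(errors) / len(errors)


def integrated_gradients(grad_fn: Callable[[List[float]], List[float]],
                         x: Sequence[float],
                         baseline: Sequence[float],
                         steps: int = 50) -> List[float]:
    """Midpoint Riemann approximation of Integrated Gradients.

    attribution_i = (x_i - baseline_i) * average gradient of the model
    output with respect to feature i along the path baseline -> x.
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad_fn(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy "year predictor": 1700 + 100 * sigmoid(w . x).
# It stands in for a real model purely for illustration.
WEIGHTS = [1.5, -2.0, 0.5]


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def toy_year_model(x: Sequence[float]) -> float:
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1700.0 + 100.0 * _sigmoid(z)


def toy_year_gradient(x: Sequence[float]) -> List[float]:
    """Analytic gradient of toy_year_model with respect to each input."""
    s = _sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)))
    return [100.0 * s * (1.0 - s) * w for w in WEIGHTS]


if __name__ == "__main__":
    features = [1.0, 0.0, 1.0]   # hypothetical input features
    baseline = [0.0, 0.0, 0.0]   # all-zero reference input
    attributions = integrated_gradients(toy_year_gradient, features, baseline)
    print("attributions:", attributions)
    print("sum of attributions:", sum(attributions))
    print("output difference:",
          toy_year_model(features) - toy_year_model(baseline))
    print("MAE:", mean_absolute_error([1750, 1760], [1755, 1752]))
```

A useful sanity check on any Integrated Gradients implementation is the completeness property: the attributions should sum, up to approximation error, to the difference between the model output at the input and at the baseline. In a text model the features would be token embeddings rather than three hand-picked numbers, and the gradients would come from automatic differentiation rather than a closed-form expression.
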
Original language: English
Title of host publication: Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change
Publisher: Association for Computational Linguistics
Pages: 68-77
Number of pages: 10
ISBN (Electronic): 978-1-955917-42-1
DOIs
Publication status: Published - 2022
MoE publication type: A4 Conference publication
Event: Workshop on Computational Approaches to Historical Language Change - Dublin, Ireland
Duration: 26 May 2022 to 27 May 2022
Conference number: 3

Workshop

Workshop: Workshop on Computational Approaches to Historical Language Change
Country/Territory: Ireland
City: Dublin
Period: 26/05/2022 to 27/05/2022
