Abstract
Disambiguation is an important step in the semantic data transformation process. In this context, it resolves the ambiguity of which person a record describes. Constellation of Correspondence (CoCo) is a data integration project focused on historical epistolary data. In its data transformation flow, actor records from the source data are linked to actor entities in an external linked open data source to enrich the actors' information with metadata found in external databases.
This work presents an advanced disambiguation system for the CoCo data transformation flow. The system delivers reliable and flexible linking and offers several advantages: the incorporation of an additional external database, novel linking rule definition and implementation, and more transparent presentation and management of linking result provenance. This work also evaluates linking performance across various linking cases, with a human expert judging whether the links proposed as valid by the linking system are in fact accurate. The system and the proposed rule configuration deliver satisfactory performance on the easier, more common cases but still struggle to achieve good precision on rarer edge cases.
Several observations about the data emerged during the development and evaluation of the system. In particular, name similarity is important in determining a link between two actors, yet the names are imperfectly similar in the majority of valid links. This observation justifies tolerating some dissimilarity in name comparison despite the importance of name similarity. The remaining imperfections of the system motivate the future work proposed here: further fine-tuning of the linking and selection rules, and advancing the evaluation by making it more complete and by researching a more automated evaluation process.
Original language | English |
---|---|
Qualification | Master's degree |
Awarding institution |
|
Supervisor/advisor |
|
Date of award | 9 Oct 2023 |
Publisher | |
Status | Published - 9 Oct 2023 |
MoEC publication type | G2 Master's thesis, polytechnic Master's thesis |