Data has become an essential resource that guides decision-making at many levels of society. To fully leverage the abundance of data sources, the sources need to be integrated, which poses difficult computational challenges. Entity resolution techniques address these challenges by identifying data records that refer to the same underlying entity. Relational information about the records (for example, a friendship network between the users of a social networking service) is often available, but traditional entity resolution techniques ignore it. The goal of this thesis is to develop novel collective entity resolution methods that match records by leveraging relational information and produce an entity network. The developed methods are applicable to a wide array of applications, from bioinformatics to ontologies, but the initial motivation for this work was the problem of integrating genealogical data to infer large-scale genealogical networks (family trees). This thesis makes the following methodological contributions. First, we develop novel methods for linking vital records, such as birth records, to infer genealogical networks. An experimental evaluation of the inferred networks shows that even fully automatic methods can produce fairly accurate networks, and moreover, that the estimated link probabilities provide a reliable way to quantify the certainty of the inferred family relationships. Second, we propose methods with theoretical guarantees for aggregating the edges of directed acyclic graphs when the correspondence between input-graph nodes is known. Third, if the correspondence is unknown, an alignment between the nodes has to be found. We study the resulting network alignment problem and propose methods for aligning multiple networks and for aligning networks actively by leveraging human experts.
The proposed vital-record linking methods have been employed to automatically link a dataset of five million historical birth records from Finland. To visualize the resulting network and to enable the exploration of the inferred links, we have developed an online tool called AncestryAI, which has so far been used by thousands of genealogists in Finland. In the final part of the thesis, we demonstrate the usefulness of the inferred genealogical network for the field of computational social science by presenting a longitudinal analysis of assortative mating, that is, the tendency to marry someone with a similar socioeconomic status. This phenomenon is quantified by comparing the socioeconomic statuses of the automatically inferred spouses. We find evidence that assortative mating existed in Finland (1735-1885), but interestingly, we do not observe any monotonically decreasing or increasing trend in the strength of assortative mating.
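As a rough illustration of what comparing the socioeconomic statuses of spouses can look like, the sketch below computes one possible measure: the observed share of couples with the same status, divided by the share expected if spouses were paired at random. This ratio is an assumption made for illustration and is not necessarily the measure used in the thesis; the function name `same_status_ratio` and the status labels are hypothetical, and the toy couples are made-up data.

```python
from collections import Counter


def same_status_ratio(couples):
    """Observed share of same-status couples divided by the share
    expected under random pairing (illustrative measure, an assumption)."""
    n = len(couples)
    if n == 0:
        raise ValueError("need at least one couple")

    # Share of couples in which both spouses have the same status.
    observed = sum(1 for a, b in couples if a == b) / n

    # Expected share if the two sides were paired independently,
    # keeping each side's status distribution fixed.
    first = Counter(a for a, _ in couples)
    second = Counter(b for _, b in couples)
    expected = sum(first[s] * second[s] for s in first) / (n * n)
    if expected == 0:
        raise ValueError("no status occurs on both sides; ratio undefined")

    return observed / expected


# Toy, made-up couples: (status of first spouse, status of second spouse).
couples = [
    ("farmer", "farmer"),
    ("farmer", "farmer"),
    ("crofter", "crofter"),
    ("farmer", "crofter"),
]
ratio = same_status_ratio(couples)
print(ratio)
```

A ratio above 1 means same-status couples are more common than random pairing would produce, and a ratio near 1 means no detectable assortative mating under this measure. Computing the ratio separately for each time window would give a longitudinal series of the kind the abstract describes.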
|Translated title of the contribution||Kollektiivisia tietueiden linkitysmenetelmiä verkostojen päättelyyn|
|Publication status||Published - 2018|
|MoE publication type||G5 Doctoral dissertation (article)|
- entity resolution
- network alignment
- machine learning
- computational social science