Data science for social good: Theory and applications in epidemics, polarization, and fair clustering

Han Xiao

Research output: Thesis, Doctoral Thesis, Collection of Articles


Technical innovations have transformed our lives fundamentally, in both positive and negative ways. In this thesis, we look at the negative side. We identify three problems to tackle, namely epidemics, online polarization, and bias in automated decision-making processes, and address them with data-driven methods.

Thanks to globalization, our world is more interconnected than ever before. While trade and the exchange of ideas are happening at an unprecedented rate, diseases also spread rapidly across the globe, as evidenced by the COVID-19 pandemic. To contain epidemics effectively, it is crucial to identify as many infected persons as possible. In practice, however, it is almost impossible to obtain complete information about who is infected. We study this challenge in the context of social networks, where a disease spreads via network edges. Specifically, we assume that only a subset of all infections is observed, and we seek to infer who else is infected. We consider two different settings: (1) a temporal setting, in which infection times are also observed, and (2) a probabilistic setting, in which an infection probability is produced for each individual.

Social-media platforms enable people to share and access information easily. Meanwhile, flawed designs in these platforms contribute to the formation of online polarization. As a result, people are unlikely to adopt new ideas that differ from their beliefs, which ultimately leads to a polarized society. To tackle online polarization, we argue that it is important to discover who is involved in the polarization. We consider a problem setting on social networks in which the interaction between two persons is either friendly or antagonistic. Given some seed nodes that represent different sides of a polarized subgraph, we seek to find the polarized subgraph that is relevant to the seeds. Finding such structures can help in understanding the nature of polarization and in mitigating it.

Machine-learning algorithms allow the automation of many decision-making processes, for example, deciding whether to grant a loan to an applicant. However, unfair results that favor one demographic group (e.g., male) over another (e.g., female) have been observed. Such unfair outcomes may further affect the well-being of the mistreated groups. In this thesis, we focus on the task of data clustering, which has applications in infrastructure design and online social media. We discuss potential fairness issues in existing clustering algorithms that are designed to be fair. Motivated by these issues, we propose a new fair clustering formulation that captures a novel notion of fairness.

For all proposed problems, we study their complexity and design algorithms whose theoretical performance we analyze. We evaluate the efficacy of all proposed algorithms in both synthetic and real-world settings.
Translated title of the contribution: Data science for social good - Theory and applications in epidemics, polarization, and fair clustering
Original language: English
Qualification: Doctor's degree
Awarding Institution: Aalto University
Supervising Professor: Gionis, Aristides
Print ISBNs: 978-952-60-3989-3
Electronic ISBNs: 978-952-60-3990-9
Publication status: Published - 2020
MoE publication type: G5 Doctoral dissertation (article)


  • data mining
  • graph mining
  • social network analysis
  • epidemics
  • fairness
  • online polarization
  • algorithm design
  • approximation algorithm

