A Representative User-centric Dataset of 10 Million GitHub Developers

  • Qingyuan Gong (Creator)
  • Jiayun Zhang (Creator)
  • Yang Chen (Creator)
  • Yu Xiao (Contributor)
  • Xiaoming Fu (Creator)
  • Pan Hui (Creator)
  • Xiang Li (Creator)
  • Xin Wang (Creator)



Using the GitHub APIs, we construct an unbiased dataset of over 10 million GitHub users. The data were collected between Jul. 20 and Aug. 27, 2018, and cover 10,649,574 users, 118,602,740 commits, and 20,999,258 repositories. Each data entry is stored in JSON format and represents one GitHub user. It contains the descriptive information from the user's profile page, together with information about the user's commit activity and the public repositories the user created or forked.
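As a rough illustration of how such per-user JSON entries might be consumed, the sketch below tallies users, commits, and repositories from a stream of records. It assumes one JSON object per line, and the field names `login`, `commit_list`, and `repo_list` are hypothetical placeholders; the actual file layout and schema should be checked against the dataset's own documentation.

```python
import io
import json


def iter_users(stream):
    """Yield one dict per non-empty line (assumes one JSON object per line)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(stream, commits_key="commit_list", repos_key="repo_list"):
    """Count users, commits, and repositories.

    The key names are hypothetical; pass the real ones once the schema is known.
    """
    users = commits = repos = 0
    for user in iter_users(stream):
        users += 1
        # Missing or null fields are treated as empty lists.
        commits += len(user.get(commits_key) or [])
        repos += len(user.get(repos_key) or [])
    return {"users": users, "commits": commits, "repositories": repos}


# Tiny in-memory sample with made-up records, standing in for a real file.
sample = io.StringIO(
    '{"login": "alice", "commit_list": [{}, {}], "repo_list": [{}]}\n'
    '{"login": "bob", "commit_list": [], "repo_list": [{}, {}]}\n'
)
print(summarize(sample))
```

Reading line by line keeps memory use flat, which matters for a collection of this size. For a real file, an open file handle can be passed in place of the in-memory sample.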
Date made available: 1 Jan 2018
Publisher: Harvard Dataverse

Dataset Licences

  • CC0-1.0

Research output

  • Detecting Malicious Accounts in Online Developer Communities Using Deep Learning

    Gong, Q., Zhang, J., Chen, Y., Li, Q., Xiao, Y., Wang, X. & Hui, P., Nov 2019, CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, p. 1251-1260 (ACM International Conference on Information & Knowledge Management).

    Research output: Chapter in Book/Report/Conference proceeding, Conference article in proceedings, Scientific, peer-review

    Open Access
    18 Citations (Scopus)
    267 Downloads (Pure)
