A Dataset for GitHub Repository Deduplication: Replication Package

Diomidis Spinellis, Zoe Kotti & Audris Mockus
This is the replication package for creating a dataset of GitHub projects that are copies of others. GitHub projects can be easily replicated through the site’s fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project’s...