For the last year and a half I’ve been working with Anita Sarma, a professor at the University of Nebraska, Lincoln, and her graduate student, Corey Jergensen, to try to understand some of the social dynamics around GitHub. As we began to dig into the ecosystem we realized that we had an opportunity to perform some novel analysis on the community. Specifically, GitHub is a highly networked ecosystem, yet most of the queries that we were running were localized around single projects or developers. At the time, graph databases were taking off, so we decided to learn a new technology while getting some data at the same time.
This resulted in the creation of GitMiner, a tool that uses the GitHub APIs to download all the data about a project and its related users, issues, pull requests, and basically everything else that you can get out of the API. It then stores this information inside of a graph database, something that I’ve written about before when I first published a dataset on the TinkerPop family of projects.
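To give a rough feel for why a graph shape suits this kind of data, here is a minimal Python sketch. It is not GitMiner's code or schema: the project names, user names, and helper functions are all made up for illustration, and a plain in-memory dictionary stands in for a real graph database. It links projects to contributors and then asks a question that spans projects, namely which developers two projects have in common.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: (project, contributor) pairs.
# These names are invented for illustration, not taken from the dataset.
CONTRIBUTIONS = [
    ("proj-a", "alice"),
    ("proj-a", "bob"),
    ("proj-b", "bob"),
    ("proj-b", "carol"),
    ("proj-c", "alice"),
    ("proj-c", "bob"),
]


def build_graph(contributions):
    """Return a bipartite adjacency map: project -> set of contributors."""
    graph = defaultdict(set)
    for project, user in contributions:
        graph[project].add(user)
    return graph


def shared_developers(graph):
    """Map each pair of projects to the developers they have in common."""
    shared = {}
    for a, b in combinations(sorted(graph), 2):
        common = graph[a] & graph[b]
        if common:  # keep only pairs that actually share someone
            shared[(a, b)] = common
    return shared


graph = build_graph(CONTRIBUTIONS)
for (a, b), devs in shared_developers(graph).items():
    print(f"{a} <-> {b}: {sorted(devs)}")
```

In a real graph database the same question becomes a short traversal from one project node through its contributor edges to other project nodes, which is the kind of cross-project query that is awkward when the data is organized one project at a time.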
Now we’ve had a chance to formally publish a larger set of data: thousands of projects associated with Ruby on Rails. The data are published at this year’s Working Conference on Mining Software Repositories. If you’d like to read the paper itself, here’s the authors’ pre-print and the GitHub repository with the data.
In the coming weeks or months I’ll probably write more about how to use GitMiner to collect large amounts of data from GitHub and how to crawl that data. In the interim, however, I’ll leave you with this nifty picture of shared developers between projects, which is part of an upcoming submission of ours.
Citation: Wagstrom, P., Jergensen, C., and Sarma, A. A Network of Rails: A Graph Dataset of Ruby on Rails and Associated Projects. Proceedings of the 2013 Working Conference on Mining Software Repositories, ACM (2013).