The second GitHub data challenge was announced a while ago. GitHub records every event that happens and opens the data to the public as a service. It is a big data archive: last year's events alone amount to about 16G, and people have had a lot of fun with it.
The sheer amount of event data seems able to answer a lot of questions and reveal some interesting patterns. I wondered what I could find in these events, something at least not boring. Finally I came up with this question: how did programming languages evolve over the last year?
Why did I choose this question? The main reason is probably that it is simpler to analyze the data from a language point of view. There is a lot of information in each event record. You can typically find out who performed the event and which repository it relates to, and possibly personal information such as the actor's location and company, if they entered those details when they created their GitHub account. If the event has something to do with a repository, you can also get information such as the repository's language, owner, and description; there is also event-specific data (e.g. IssuesEvent carries some issue-specific data). So it is possible to analyze the event data by when (time), where (people's locations), and what (events and their related information). However, the location of an event is not always available, so a location-based analysis would suffer from incomplete data. Fortunately most events are repository related (e.g. WatchEvent), and nearly all of these events are marked with a language, so the information is abundant if we want to know how languages change over time.
To complete this work I used a bit of shell script to download the events data from the GitHub archive, and d3 to draw the alluvial diagrams shown below. The work is pushed to my github repository.
The GitHub archive service exposes events data by the hour, so in order to get all data from the last year, I had to download the data for every hour of the last year. The download took several hours, and the data eats up 16G of my disk space.
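The hourly download can be sketched in a few lines. This is not the original shell script; it is a minimal Python sketch, assuming the URL scheme the archive service used at the time (data.githubarchive.org/YYYY-MM-DD-H.json.gz, with an unpadded hour):

```python
from datetime import datetime, timedelta

# assumed URL scheme of the archive service; adjust if it has changed
ARCHIVE_URL = "http://data.githubarchive.org/{y}-{m:02d}-{d:02d}-{h}.json.gz"

def hourly_urls(start, end):
    """Yield one archive URL per hour in the half-open range [start, end)."""
    t = start
    while t < end:
        yield ARCHIVE_URL.format(y=t.year, m=t.month, d=t.day, h=t.hour)
        t += timedelta(hours=1)

urls = list(hourly_urls(datetime(2013, 1, 1), datetime(2014, 1, 1)))
print(len(urls))  # 8760 hourly files for 2013
```

Each URL can then be fetched with any HTTP client, e.g. `urllib.request.urlretrieve`, and the gzipped files kept on disk for the extraction step.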
Then I extracted the counts of every event type for every language and every month using a ruby script and the yajl gem. It took less than one hour to extract the csv file I wanted; yajl is a pretty fast library.
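The original extraction was a ruby script using yajl; the same counting step can be sketched in Python with the standard json module, assuming the old timeline event format where each record carries a `type`, a `created_at` timestamp, and a `repository.language` field:

```python
import json
from collections import Counter

def count_events(lines):
    """Count (language, month, event_type) triples from JSON event lines.

    Assumes the old GitHub timeline format: each record has a `type`,
    a `created_at` timestamp, and (for repository events) a
    `repository.language` field.
    """
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        repo = event.get("repository") or {}
        lang = repo.get("language")
        if lang is None:  # skip events not marked with a language
            continue
        month = event["created_at"][:7]  # e.g. "2013-08"
        counts[(lang, month, event["type"])] += 1
    return counts

# hypothetical sample records for illustration
sample = [
    '{"type": "WatchEvent", "created_at": "2013-08-01T12:00:00Z", "repository": {"language": "Ruby"}}',
    '{"type": "IssuesEvent", "created_at": "2013-08-02T09:30:00Z", "repository": {"language": "Ruby"}}',
    '{"type": "WatchEvent", "created_at": "2013-08-03T17:45:00Z", "repository": {"language": "Ruby"}}',
]
print(count_events(sample)[("Ruby", "2013-08", "WatchEvent")])  # 2
```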
The rest of the work was handed over to R. Once I read the csv file into R, I filtered out the event types with zero counts, which left me with 14 event types (e.g. WatchEvent). So finally I got a data.frame with 16 columns (14 event types plus language and month), in which each row represents a language's event counts in some month.
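The long-format counts can then be pivoted into the wide layout described above, one row per (language, month) with a zero-filled column per event type. A Python sketch (only three of the 14 event types are listed, for brevity):

```python
from collections import defaultdict

# illustrative subset; the real data has 14 event types
EVENT_TYPES = ["WatchEvent", "PushEvent", "IssuesEvent"]

def to_wide(counts):
    """Pivot (language, month, event_type) -> count into one row per
    (language, month), with one zero-filled column per event type."""
    rows = defaultdict(lambda: dict.fromkeys(EVENT_TYPES, 0))
    for (lang, month, etype), n in counts.items():
        if etype in rows[(lang, month)]:
            rows[(lang, month)][etype] = n
    return rows

wide = to_wide({("Ruby", "2013-08", "WatchEvent"): 2,
                ("Ruby", "2013-08", "IssuesEvent"): 1})
print(wide[("Ruby", "2013-08")])
```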
Taking a closer look at our initial question, I basically want to know how active each language is compared with the others, and how that activity changes over time. Since every language has 14 event-type counts for each month, I chose to compute the magnitude of this 14-dimensional vector as the indicator of a language's overall activity in each month.
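The activity indicator is just the Euclidean norm (magnitude) of the monthly count vector; a minimal sketch:

```python
import math

def activity(event_counts):
    """Overall activity of a language in a month: the Euclidean norm
    (magnitude) of its vector of per-event-type counts."""
    return math.sqrt(sum(c * c for c in event_counts))

# e.g. a language with only two non-zero event counts in a month
print(activity([3, 4] + [0] * 12))  # 5.0
```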
In each month we can also run a clustering analysis of the languages, so that we learn which languages have similar activity and how these groups change over time. Here I perform a principal component analysis of the 14-dimensional vectors, then use the basic kmeans clustering algorithm. The principal component analysis reduces the dimension of the vectors to 2 while still preserving almost all of the variability. The reduced 2-dimensional vectors are also convenient for the visual representation later on.
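The original analysis was done in R; the same PCA-then-kmeans pipeline can be sketched with numpy, using an eigendecomposition of the covariance matrix for PCA and plain Lloyd iterations for kmeans (the synthetic data and all names here are illustrative, not the author's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]          # sort by decreasing variance
    return Xc @ vecs[:, order[:2]]

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None] - centers[None]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# two obviously separated activity groups in a fake 14-dimensional event space
low = rng.normal(0, 1, size=(10, 14))
high = rng.normal(20, 1, size=(10, 14))
X = np.vstack([low, high])
labels = kmeans(pca_2d(X), k=2)
print(labels)
```

With such well-separated groups the two clusters recover the low- and high-activity languages exactly; on the real data the cluster boundaries are of course fuzzier.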
By clustering the languages in each month and computing their overall activity, we can draw the following alluvial diagram:
In the diagram above, each column is a month, and languages are grouped and separated by empty lines. Groups are ordered by their maximum overall activity, and languages within each group are ordered by their overall activity.
There are a few interesting points revealed by this diagram:
For example, the number of IssuesEvent in August 2013 for Nemerle is extraordinarily high considering its normal performance.
There is probably more interesting stuff you can find in the resulting diagram; at the very least, by doing this analysis I learned about a lot of languages that I had never heard of before. The diagram is not perfect, and I hope it is not too bad.
For those who are interested in the clustering analysis, I also post below the 2-dimensional clustering plot of the languages in each month. The plots are generated by running PCA dimension reduction followed by kmeans clustering.