For the last year and a half I’ve been working with Anita Sarma, a professor at the University of Nebraska, Lincoln and her graduate student, Corey Jergensen, to try and understand some of the social dynamics around GitHub. As we began to dig at the ecosystem we realized that we had an opportunity to perform some novel analysis on the community. Specifically, GitHub is a highly networked ecosystem and most of the queries that we were doing were localized around single projects or developers. At this time graph databases were taking off so we decided to learn a new technology while getting some data at the same time.
This resulted in the creation of GitMiner, a tool that utilizes the GitHub APIs to download all the data about a project and its related users, issues, pull requests, and basically everything else that you can get out of the API. It then stores this information inside of a graph database - something that I’ve written about before when I first published a dataset on the Tinkerpop family of projects.
In the coming weeks/months I’ll probably write more about how to use GitMiner to collect large amounts of data from GitHub and how to crawl this data. For the interim, however, I’ll leave you with this nifty picture of shared developers between projects, which is part of an upcoming submission of ours.
Developers shared between projects in our Ruby on Rails dataset. The size of nodes represents the number of developers on the project, edge width is the number of shared developers between projects, and color represents programming language. [link to full size image]
On May 1st, 2012 I embarked on an experiment at work — I started signing work emails to my team and friends inside and outside the office with the words “KTHXBYE” or “KTHXBAI”. The goal was to see how long it would take until someone mentioned or asked about it. About two weeks after I started the experiment a friend from Microsoft noticed it and mentioned it to me. Of course, I replied with a meme:
To which my friend at Microsoft was gracious enough to reply with a meme of his own. This experiment was clearly off to an awesome start.
I expected that I’d hear back from other folks in a matter of days. But then the days turned to weeks and the weeks slowly turned into months. I concluded one of two things: either no one actually read my email, or no one actually caught the reference. Undaunted I persisted. Over the course of the experiment I sent out more than 450 messages with the signature “KTHXBYE” and about 65 with “KTHXBAI”, although I only realized that “KTHXBAI” was the appropriate spelling late into the experiment.
Finally, yesterday, March 5, 2013, the experiment came to an end. My manager asked about what it meant, and she googled the definition. Which, unfortunately, led to the Urban Dictionary definition of “KTHXBAI”.
My response: “Ughh…”. This led to an explanation that Urban Dictionary shouldn’t be trusted and that, no, I wasn’t telling my co-workers to get bent at the end of every email. I had to introduce the whole concept of LOLCats, which thankfully was backed up by the creation of my 2007 intern project called LOLJazz, which somehow lingers on as a zombie inside of our Rational Team Concert Server. Still, I wasn’t out of the woods: there was the chance that it could still be “actionable”. This is where I had an ace up my sleeve. During the development of Watson, IBM’s Jeopardy! playing computer, the team, which happens to be in my organization, fed the entire Urban Dictionary into Watson. As could be guessed, the importation of Urban Dictionary into Watson led to many hilarious and wholly inappropriate responses. In short, Urban Dictionary was a cesspool and shouldn’t be used as canon. Rather, in this case the Cheezburger kthxbai is a much better source.
A traditional use of KTHXBAI (even if it is misspelled)
And so, my experiment has come to an end. In the end it was sorta a drag as month after month passed with no one mentioning it. After I talked about it as an experiment everyone came out of the woodwork to say they had seen it and wondered what it meant, but didn’t bother to ask. Which leads me to wonder, how often does this happen? Do people even read my emails? Do they just ignore things they don’t understand or perceive as irrelevant? Do they do that to everyone, or just me? Could I start saying that we need to replace the fitzervalve on the flux capacitor in order to keep the servers from frobnicating themselves and get away with it?
Now, it’s time to find a new subversive work experiment…
Once again I’m looking for an amazingly bright Ph.D. student to work with me over the course of the summer. The position is open to Ph.D. students from any university and at any point of their studies, and I can nearly guarantee it’s going to be an awesome experience.
The primary task will be applying machine learning techniques (lexical analysis, network extraction, predictive analytics) to the usage data from a large piece of commercial software. With a little bit of luck the software will be instrumented by this point in time so you’ll just need to slice and dice the data and find awesome stuff. The goal, of course, is to publish an amazing paper that provides great insight into how users actually use this type of software and provide guidance to architects and developers of such a system.
A loose list of desirable skills:
Java: Most of our tools are written in Java. It took me a while to get used to this, but Java has some nice advantages for developing code to run in an enterprise. Here at IBM we really love it and most of our software, including the tool we’re looking at, is built in Java.
Software Engineering Processes: Domain expertise in understanding the relationships between the different levels of stakeholders in a software project is immensely helpful and will make it a lot easier to tease great bits of nuggets out of the data.
Machine Learning: We use various types of machine learning, both Java libraries and some R to understand the data. On the Java side knowledge of text analysis packages such as OpenNLP is helpful.
Statistics: I love R. If you love R it helps out.
Visualizations: I’m big on making great visualizations to show off our findings. If you’re a ninja with ggplot or d3 then you probably qualify.
Of course, there’s a variety of other skills that are helpful too. The intern absolutely must be self motivated and able to find answers to questions on their own. This isn’t an unsupervised position, but I travel a lot and am frequently out of the office, which limits my ability to provide direct daily supervision. As a result, excellent communication skills are also helpful: you should know how to ask questions over email in a way that is succinct while providing enough information for other people to answer the question. If you’ve got a great profile on StackOverflow you’re probably already there.
There are some great advantages to spending a summer working with me at IBM TJ Watson Research in Yorktown Heights, NY. First, you’ll be working with some of the smartest people in the world at a facility that has an amazing legacy. IBM Research was the genesis of DRAM, the processors in all major video game consoles, Watson - the Jeopardy! playing computer, LASIK, and thousands of other things. We make the world awesome.
Second, our interns come from around the world and are generally smarter than we are. You know that feeling you get when you go to a conference? You’re always excited about new ideas and feel like you could go home and churn out your thesis in a week. Imagine that feeling for an entire summer! I had a blast when I interned here and met some incredible young researchers who I’m still friends with.
Thirdly, we’re just outside of New York in scenic Westchester County, NY. I took the train into the city every Friday, Saturday, and Sunday when I interned here. It was the perfect combination of excitement from New York City and a setting where you can really get work done. You may be saying “isn’t New York really expensive?”. You’re entirely right. Don’t worry, we pay enough that it’s totally worth your time.
Tech jobs are hot in New York right now. Last year while sitting in LaGuardia Airport waiting for a flight I was hacking on some code for work in Eclipse and a guy who was shoulder surfing me tried to persuade me to interview for positions he had available at his hedge fund. If you visit any Meetup in the city you’ll hear from dozens of people who are looking for the best and the brightest. When I combine these with a publicly visible GitHub profile, a resume that’s sitting on my web page, and a fairly complete LinkedIn profile it means that messages from recruiters are constantly flooding my mailbox.
They’re nearly all amateurish wastes of my time.
In this series of posts I’m going to chronicle why they’re such a waste of my time. Here’s a paraphrased recent message I got:
Dear Dr. Wagstrom,
I work for MegaHyperTech, a leading technology placement firm in New York City. We came across your profile on GitHub and later found your resume and think that you may have the talent that our client Quanttastic Solutions is looking for. They’re a hedge fund that makes it feel like you’re working at Google. They hire only the best and the brightest from schools like MIT, Berkeley, CMU, and Michigan. We’d love to send your information over there, but we noticed that you don’t list your GPA on your resume and they only hire individuals with exemplary GPAs. If you’re willing to update your resume to include that information for all your degrees we think that you’d enjoy the challenge.
The recruiter is entirely correct, I don’t list my GPA on my resume. This is done for a couple of reasons: first, my degrees are intertwined. It’s really hard to differentiate the GPAs for computer science, electrical engineering, and computer engineering bachelor’s degrees. They’re all in the 3.5 - 3.9 range, but really I don’t remember what they were. Likewise, my master’s and Ph.D. from Carnegie Mellon are also intertwined and probably have a similar range.
But the bigger issue is that a Ph.D. isn’t about classes. In fact, while working on a Ph.D. if you’re taking a required class that isn’t directly related to your research you probably shouldn’t spend enough time to get an ‘A’ in the class. The measure of the work for a Ph.D. is the thesis and the publications that come out as a result of doing the research. I think of all the times that I met with my advisors I was asked my grades only once, and it was over a concern that I was spending too much time on my homework for my machine learning class.
So here’s my hope that maybe at some point a recruiter will read this. If you ask me for my GPA you’re not going to get it. If your client insists on GPAs for their candidates, then they don’t know what they’re getting.
A couple of years ago I got the bright idea that I’d get my wife open water SCUBA certification as her Christmas present. She likes aquariums and fish and I thought it would be a fun way to do something different when we travel. Fast forward to the present day and I’ve got a closet filled with neoprene, BCDs, fins, first aid kits, and a dive log filled with all sorts of certification cards from PADI.
We purchased our own equipment relatively early in the process of learning how to SCUBA dive - shortly after getting our open water certification, thanks in large part to a nice tax rebate. For the most part we’ve been very happy with our purchases and I feel like it’s made us much more comfortable when we’re underwater. One of the key components of diving is a dive computer. The most basic dive computers tell you your depth and warn you if you’re ascending too fast or are going to need a decompression stop somewhere along the way. More advanced computers replace your entire diving console and provide a compass and wireless integration of you and your buddy’s pressure gauge. Yeah, we went for that kind of over the top dive computer and bought the Uwatec Galileo Luna.
Uwatec/ScubaPro Galileo Luna Hoseless Air Integrated Dive Computer
I’ll be the first to admit that this probably wasn’t the wisest of ideas. I spent two weeks researching $55 gel pads for my standing desk and here we just decided to drop $2000 on a couple of dive computers thanks to thirty minutes at our local dive shop. We’ve been completely thrilled with them under water. Where we’ve had more problems is getting data out of them above water. More advanced computers also take periodic samplings of your depth, remaining air, water temperature, etc. You can use this data to reconstruct a dive profile in a way that is much richer than what normally appears in your dive log.
Screenshot of jTrak - Does it feel like it’s 1999?
Getting this data off your computer isn’t trivial. Dive computers are
expensive for a couple of reasons: they’re produced in relatively low
volumes, they often license patented algorithms for estimating your
air consumption and remaining bottom time, and, of course, they need to be waterproof. This
means that you can’t just drop a USB port on the outside of the
case. Nor can you just put a USB port under a rubber flap. At 30
meters you’re facing about 400 kPa of pressure - four times the
pressure at the surface. Water will find a way in. If it gets in the
salt will corrode everything and it will die. Thus, dive computers
tend to be very well sealed and make even trivial things like changing
the battery a process that requires tools and new grease for the O-rings.
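The 400 kPa figure is easy to sanity-check with a back-of-the-envelope calculation: absolute pressure is the surface pressure plus the weight of the water column. The sketch below assumes a standard atmosphere and a typical sea water density; neither constant comes from the dive computer's documentation, and the function name is made up for illustration.

```python
# Rough absolute pressure at depth in sea water.
# Assumed constants: standard atmosphere and a typical sea water density.
SURFACE_PRESSURE_KPA = 101.325   # one standard atmosphere
SEAWATER_DENSITY = 1025.0        # kg/m^3, typical value (assumption)
GRAVITY = 9.80665                # m/s^2


def absolute_pressure_kpa(depth_m: float) -> float:
    """Surface pressure plus the weight of the water column, in kPa."""
    hydrostatic_pa = SEAWATER_DENSITY * GRAVITY * depth_m
    return SURFACE_PRESSURE_KPA + hydrostatic_pa / 1000.0


if __name__ == "__main__":
    for depth in (0, 10, 20, 30):
        p = absolute_pressure_kpa(depth)
        ratio = p / SURFACE_PRESSURE_KPA
        print(f"{depth:>2} m: {p:6.1f} kPa ({ratio:.1f}x surface)")
```

With these constants, 30 meters works out to roughly 403 kPa, which is where the "four times the surface" rule of thumb comes from.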
There really isn’t a standard interface to these devices. It seems as
though a lot of devices, such as the Mares puck computers, have
corrosion resistant metallic contacts that connect to a USB controller
with an FTDI USB→Serial chip in it. However, the Uwatec Galileo
decided to be more advanced and use what I’m sure was the hip
protocol at the time: IrDA.
Now, in case you missed it, IrDA was all the rage in the 1990’s and
early 2000’s. Every laptop seemed to ship with an IrDA port built
in. You could use it to synchronize data with your Palm or Handspring
in the late 1990’s. Once cell phones were more common you could even
tether your laptop to your cell phone and get very slow data. In the
pre-WiFi, pre-EDGE days this was pretty hot stuff. “Was” being the key
word. Hot stuff being around the speed of the 28.8k modem that I used
back in 1994.
You can still find devices that use IrDA, most notably a lot of the
heart rate monitors from Polar, but for the most part the technology
is from about 10 years ago. This also means that you’re dealing with
the headaches of 10 years ago, including the near total lack of Mac
support for devices. Those that do support the Mac often only support
the PPC Mac and never really fully supported it anyway. Did I mention
that Mac OS X doesn’t even have full support for IrDA? Just try opening
up a socket using AF_IRDA. It doesn’t exist. Ughh. This was going to
be a great adventure.
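If you want to see whether your own platform exposes IrDA sockets at all, a quick probe is to look for the IrDA address family and try to open a socket with it. This is only a sketch: it relies on Python defining `socket.AF_IRDA` only where the operating system headers provide it, and the helper name `irda_sockets_available` is made up for illustration.

```python
import socket


def irda_sockets_available() -> bool:
    """Return True if this platform can open an IrDA socket."""
    # Python only defines socket.AF_IRDA where the OS headers provide it,
    # so a missing attribute already means "no IrDA sockets here".
    family = getattr(socket, "AF_IRDA", None)
    if family is None:
        return False
    try:
        sock = socket.socket(family, socket.SOCK_STREAM)
    except OSError:
        # The constant exists but the kernel has no IrDA stack available.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("IrDA sockets available:", irda_sockets_available())
```

On a machine without an IrDA stack this should simply report that the sockets are unavailable rather than raise.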
Setting Baseline Expectations
My first naive attempts were to hack an IrDA driver into the framework
of libdivecomputer. There was already support for IrDA
dive computers under Windows and Linux, and I had confirmed that they
worked just fine with my computer, so how hard could it be? The answer is
a lot more complicated than I thought. The first step was to find an
IrDA dongle that even worked with Mac OS X. I ordered a couple of cheap
ones off eBay and had no luck. I read a few comments from folks saying that
the official Uwatec USB->IrDA devices worked with JTrak on Mac OS X. However,
the official dongles are about $70 and JTrak is a bit less than what I’m looking for in a
dive logging software. Fortunately, I was able to find another device that looked nearly identical from the outside - the IRJoy USB 2.0 USB IrDA adapter for $30. When this guy arrived a quick scan showed that it was the exact same hardware as the official Uwatec dongle - both were based on the MosChip 7780 chipset.
Plugging the device into my trusty Thinkpad x31 showed that it quickly and easily worked both in Windows and Linux using the SmartTrak software from Uwatec, JTrak, and the test applications from libdivecomputer. I knew that I could at least make some progress. Next up was to test it on my Mac. I plugged in the IrDA stick and fired up JTrak and to my amazement it just worked. That NEVER happens. Poking around showed why it worked: the company behind JTrak had licensed a complete pure Java IrDA stack. Well, at least I could use JTrak if everything else failed. However, I had my eyes set on something much prettier, MacDive.
Writing a Driver
I had heard people refer to the fact that the MosChip devices had a Mac driver, but most of those conversations ended many years ago - as if I needed more evidence that I was dealing with a dead protocol. After some digging around and emailing random customer service addresses I found that the IP for the MosChip devices was sold off to a company in Taiwan called Asix. They provided a couple of different versions of the driver and I eventually found one that worked in full 64 bit mode on Mac OS X Lion. Score.
The driver came with a simple test application that would let me read the data coming over the device as though it was a serial device. Using this test application I was able to position the reader in the line of sight of other IrDA devices and receive data. Neat. The problem is that I was getting the raw bytes of the IrDA sockets. There’s a lot of overhead in there that goes along with handshaking, setting speed, and resending data when connections are interrupted. None of this seemed to be enabled in the driver. The driver simply provided a couple of serial devices that I could open up and use to smack bits back and forth. If I wanted this to work I would need to write a complete IrDA stack on top of this serial device.
The problem is that the IrDA stack is actually fairly complex. There’s a myriad of different protocols that stack on top of IrDA to make everything work. This was basically the equivalent of trying to implement TCP/IP using just the raw bits coming over the 802.11 physical layer. In other words, it was a nasty layer mismatch that was not going to do me any favors.
The Multifaceted IrDA Stack - From Wikipedia
I continued to email Asix, who were more than helpful, although they seemed most concerned that I would write a driver that would let the user transfer files with Windows and cell phones. After a few more emails I explained what a dive computer was and how much of a niche this issue was, and Asix offered me an NDA to work on the driver
and attempt to implement the AF_IRDA stack for Mac OS X. If I were in
undergrad this would probably sound like a great idea. However, I’m not.
I’ve got a job that keeps me quite busy and has me flying back and forth
between New York and Washington on a weekly basis. I just don’t have the
time to acquire the knowledge needed to hack together a driver on Mac OS X.
Of course, there’s also the issue of me performing gratis work for a
for-profit company, which I didn’t really want to do either.
VMs to the Rescue
This left me with really only one simple solution, use what I know
already works for communicating with the Galileo Luna, Linux or Windows.
In an effort to keep this simple and avoid worrying about license issues
I chose to use a very minimal Linux installation under VirtualBox as my
guest environment. The next problem was the software to make use of my
data. There were a couple of different ways to handle this, either do
all of my log work inside of the virtual machine, or just download the
data in the virtual machine and copy it over to my Mac to do most of the
work on the log. Starting up a VM is a bit of a pain, so the choice was
made to use Mac dive log software, download the data in the VM, and then
copy it over.
There are a couple of different formats that might be able to fit the bill: SDE, UDCF, UDDF, and ZXL. SDE is the output format from Suunto Dive Explorer software. There doesn’t appear to be much documentation for the format, but it supposedly contains all the necessary information that a diver might want to recreate a dive log on a computer. Supposedly Subsurface, a dive log software package by Linus Torvalds, can import from SDE, so there should be some source code there that I just haven’t had a chance to dig at yet. ZXL is a format designed by DAN to collect information for scientific studies of diving related injuries. UDCF and UDDF are formats developed by a group of interested divers that seem to have achieved moderate success. UDCF can be considered to be the little brother to the more robust UDDF. Many tools support UDCF, but it lacks official mechanisms to do things like save the pressure in a tank.
The most promising format seems to be UDDF - the Universal Dive Data Format. UDDF, like most interchange formats, sadly uses XML so it is parseable by neither humans nor machines. It is able to contain information about dive profile, temperature, and air usage, which are the main things I want to track. I wasn’t able to find a tool that used libdivecomputer to produce a UDDF file, so I wrote my own, the cleverly named dc2uddf.
dc2uddf is a simple tool that uses libdivecomputer and libxml2 to download data from a dive computer and save it as a UDDF file. That’s all it does. There isn’t much of a user interface, but it works, and it’s written in C, which makes me feel a little more like a programmer than I normally do. I’m certain there are some things that it is doing incorrectly. If folks discover problems, email me or [file issues on GitHub][ghissues] and I’ll be sure to fix them. Along the way I’ve also found several defects in the UDDF standard, so I feel like I’m making the standard better too.
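To give a feel for what ends up in such a file, here is a minimal Python sketch that turns a handful of samples into a UDDF-style document. It is not the dc2uddf code (which is C on top of libxml2), and the element names and units (`waypoint`, `divetime`, `depth`, `temperature` in Kelvin, `tankpressure` in Pascal) reflect my reading of the UDDF layout and should be treated as assumptions to check against the specification. The function name and sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET


def samples_to_uddf(samples):
    """Build a UDDF-style document from dive samples.

    Each sample is a (seconds, depth_m, temp_c, tank_bar) tuple.
    Element names and SI units are assumptions, not verified against the spec.
    """
    root = ET.Element("uddf", version="3.0.0")
    profile = ET.SubElement(root, "profiledata")
    group = ET.SubElement(profile, "repetitiongroup")
    dive = ET.SubElement(group, "dive", id="dive1")
    waypoints = ET.SubElement(dive, "samples")
    for seconds, depth_m, temp_c, tank_bar in samples:
        wp = ET.SubElement(waypoints, "waypoint")
        ET.SubElement(wp, "divetime").text = f"{seconds:.1f}"
        ET.SubElement(wp, "depth").text = f"{depth_m:.2f}"
        # Assumed units: temperature in Kelvin, tank pressure in Pascal.
        ET.SubElement(wp, "temperature").text = f"{temp_c + 273.15:.2f}"
        ET.SubElement(wp, "tankpressure").text = f"{tank_bar * 100000:.0f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical samples: time (s), depth (m), temperature (C), tank (bar).
    demo = [(0, 0.0, 24.0, 200.0), (60, 10.0, 22.0, 190.0)]
    print(samples_to_uddf(demo))
```

A real export would also need the header sections (generator, diver, gas definitions) that a dive log program expects; this only covers the profile samples.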
Now I’m at the point where I can download the data using a Linux VM and then
copy the data over to my Mac where I can easily import it into the excellent
MacDive software, as you can see below.
The Pretty-Pretty Output of MacDive
The Future
I’ve thought about a couple of ways that I could make this a bit more streamlined. The current candidate is to get a Raspberry Pi board and create a small dedicated device for downloading dive computer data. Basically you’d turn it on, put the dive computer within range, press a button on the case for the Raspberry Pi and your data would be automatically downloaded. You could feed it an SD card and it could either use the configuration file on the SD card to upload the data to a remote host or just store a copy on the SD card. However, given the long waits for Raspberry Pis at the current time and my busy schedule I’ll just have to wait on that idea.
I’ve also toyed with the idea of making a service that provides real analytics on dives. Right now there are a couple of different sites that allow you to share dive logs. DiveBoard seems to be the most cross-platform of sites and they’ve even developed a browser plugin based on libdivecomputer to automatically upload your dives from your browser. Aside from their plugin they allow users to upload UDCF, SDE, and ZXL files. They’ve even gone so far as to extend UDCF to allow for pressure information — although this seems to be a clear sign to me that they should consider allowing UDDF uploads.
Another community is Suunto Movescount. This is the successor to Suunto’s Dive Explorer software and reflects the fact that they’ve moved beyond just diving metrics. The problem is that as near as I can see it’s a locked platform. There doesn’t appear to be any way to get your data out of it, or, for that matter, get data from non-Suunto devices into it.
Both of these sites are missing some of the potential for such sites, which is the ability to measure and track rather than just keeping a log. It’s something that sites like RunKeeper are just beginning to explore with efforts like their FitnessReports, but even those reports are rather cursory. There’s a number of metrics that we can calculate both on an individual and across a community that would be highly beneficial to everyone involved - divers, dive shops, travel agents, tour operators, and gear manufacturers, to name just a few. However, the description of these analytics will have to wait for a future post.
Recently I’ve seen a number of friends and acquaintances post some variation of the following message to their Facebook walls:
In response to the new Facebook guidelines I hereby declare that my copyright is attached to all of my personal details, illustrations, comics, paintings, professional photos and videos, etc. (as a result of the Berne Convention). For commercial use of the above my written consent is needed at all time.
By the present communiqué, I notify Facebook that it is strictly forbidden to disclose, copy, distribute, disseminate, or take any other action against me on the basis of this profile and/or its content. The aforementioned prohibited actions also apply to employees, students, agents and/or any staff under Facebook’s direction or control.
The content of this profile is private and confidential information. The violation of my privacy is punished by law (UCC 1 1-308-308 1-103 and the Rome Statute).
Facebook is now an open capital entity. All members are recommended to publish a notice like this, or if you prefer, you may copy and paste this version. If you do not publish a statement at least once, you will be tacitly allowing the use of elements such as your photos as well as the information contained in your profile status updates.
The intent of these postings is to limit the way that Facebook is legally allowed to use or share your information. On the one hand this makes me happy because it seems as though some people are taking their privacy seriously. On the other hand, it’s very frustrating because of the ham-fisted way people are going about this.
The crux of the problem is that the Facebook Terms of Service supersede any declaration or addendum you attempt to make toward Facebook. Specifically clause 19.5:
Any amendment to or waiver of this Statement must be made in writing and signed by us.
However, you might think there is a loophole that will protect you somehow. Maybe something that Facebook forgot to expressly enumerate. Sorry, that’s covered in clause 19.10:
We reserve all rights not expressly granted to you.
As an additional level of backup the posts typically attempt to cite various portions of the Uniform Commercial Code, most often Article 1. First, it’s important to understand what the UCC is. It is NOT some overarching set of Federal Laws. The UCC is an attempt to harmonize various state laws and make it easier to do business across state lines. In some ways you can think of the UCC a little like the Talmud: the text is important, but so are the comments that go along with it. Unfortunately, the text and comments are copyrighted, so these semi-binding documents are not accessible to the common man (that’s a whole different problem, one which Carl Malamud and Public.Resource.org are attempting to remedy).
Anyway, we’ll ignore for a moment that the entirety of Article 1 of the UCC deals with definitions and ways to interpret further rules, and therefore probably isn’t the thing you’re looking for. The first reference, UCC 1-308 (which is often mistyped 1-308-308, which renders it null in the eyes of the law) reads:
§ 1-308. Performance or Acceptance Under Reservation of Rights.
(a) A party that with explicit reservation of rights performs or promises performance or assents to performance in a manner demanded or offered by the other party does not thereby prejudice the rights reserved. Such words as “without prejudice,” “under protest,” or the like are sufficient.
(b) Subsection (a) does not apply to an accord and satisfaction.
However, the issue with 1-308 is that your Facebook content, while being a creative work, isn’t a performance in most cases. There isn’t a transaction from Facebook unto you for performing such an action, therefore this most likely doesn’t apply.
Second is UCC 1-103, I have no idea how this got mixed up in here:
§ 1-103. Construction of [Uniform Commercial Code] to Promote its Purposes and Policies: Applicability of Supplemental Principles of Law.
(a) [The Uniform Commercial Code] must be liberally construed and applied to promote its underlying purposes and policies, which are: (1) to simplify, clarify, and modernize the law governing commercial transactions; (2) to permit the continued expansion of commercial practices through custom, usage, and agreement of the parties; and (3) to make uniform the law among the various jurisdictions.
(b) Unless displaced by the particular provisions of [the Uniform Commercial Code], the principles of law and equity, including the law merchant and the law relative to capacity to contract, principal and agent, estoppel, fraud, misrepresentation, duress, coercion, mistake, bankruptcy, and other validating or invalidating cause supplement its provisions.
Reading through this I can’t understand why 1-103 was even brought into this. It’s a simple description of the UCC and highlighting that unless the UCC attempts to supersede laws for things like fraud, duress, and bankruptcy, that they stay in effect.
Finally, let’s look at the appeal to the Rome Statute. I’m going to go out on a limb here and say this was added by someone in Europe, as the original postings I saw by Americans didn’t include this caveat. I’m assuming that the Rome Statute refers to the Rome Statute of the International Criminal Court. This international agreement established the International Criminal Court and gave it authority to investigate crimes when the host nations have chosen not to investigate. For example, the ICC often comes into play with state sponsored genocide.
One could easily argue that the United States has initiated investigations into privacy and Facebook (see the Senate Judiciary Committee Subcommittee on Privacy, Technology and the Law meeting on July 18, 2012 when Franken tore into Facebook’s manager of Privacy and Public Policy). The fact that the US is conducting investigations would seem to deny the ICC any sort of jurisdiction and would therefore put such an investigation outside the bounds of the International Criminal Court, which really has non-first-world problems to deal with, like genocide.
In short, if you’re really concerned about your privacy, posting such a message on Facebook doesn’t do anything other than annoy your friends. If you’re really concerned about your privacy on Facebook you need to stop using it altogether.
Important Disclaimer: I am not a lawyer. I’m merely someone who took the time to read the Facebook Terms of Service and look up the relevant portions of the law that people are attempting to quote. None of this should be regarded as real legal advice.
Development of any moderately complex software package is a social
process. Even if a project is developed entirely by a single person,
there is still a social component that consists of all of the people
who use the software, file bugs, and provide recommendations for
enhancements. This social aspect is one of the driving forces behind
the proliferation of social software development sites such as
GitHub, SourceForge, Google Code, and BitBucket.
These sites combine together a variety of tools that are common for
software development such as version control, bug trackers, mailing lists,
release management, project planning, and wikis. In addition, some of
these have more social aspects that allow you to find and follow
individual developers or watch particular projects. In this post I’m
going to show you how we can use some of this information to gain insight
into a software development community, specifically the community
around the Tinkerpop stack of tools for graph databases.
Graph Databases
Graph Databases are in the broad family of NoSQL databases. For about
30 years the dominant form of data storage and access has been through
relational databases (e.g. Oracle, MySQL, PostgreSQL, DB2, etc). These
present your data as a table with various rows. These tables can have
constraints and pointers that map a column in one table to a column in
another table through a process called a join. In this way it’s
possible to create relations between records and build rich
collections of data.
Relational databases are very nice and can scale fairly well, but they’re
not suitable for all problems. In particular, there may be cases where
atomicity can be sacrificed in exchange for higher performance or
where the schema of the data may frequently change resulting in severe
problems mapping the data to a traditional database.
This has led to a multitude of different solutions for data storage
and access. Some of the more popular solutions are Google’s BigTable
for distributed data storage, MongoDB for a schemaless document
database, and Memcached for distributed object storage and
caching. These alternative styles of databases are generally lumped
into a category of NoSQL, which means either “Not SQL” or “Not Only
SQL” or perhaps something else depending on who you speak to.
A specific subclass of NoSQL databases is graph databases. A graph
database represents your data as a network of vertices and edges that
connect them. Vertices and edges can have various properties that
define the object. As opposed to traditional databases where a query
crawls over the entire table to find the appropriate elements, queries
within a graph database are often done via traversals that walk the
graph from one node to another. Examples of graph databases are
Neo4j, OrientDB, Trinity,
InfiniteGraph, and Dex. A complete description
of these databases is beyond the simple explanation here, but
Wikipedia has a decent primer on graph databases.
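To make the vertex/edge/property idea a bit more concrete, here is a minimal sketch in plain Python rather than in any real graph database. The users, the repository, and the REPO_WATCHED label are all made up for illustration, and a real graph database would follow adjacency pointers instead of scanning an edge list the way this toy does.

```python
# Toy property graph: vertices and edges both carry data.
# Every name and value here is hypothetical.
vertices = {
    1: {"type": "USER", "login": "alice"},
    2: {"type": "USER", "login": "bob"},
    3: {"type": "REPOSITORY", "name": "example-repo"},
}

# Each edge is (source vertex id, label, target vertex id).
edges = [
    (1, "FOLLOWING", 2),
    (1, "REPO_WATCHED", 3),
    (2, "REPO_WATCHED", 3),
]


def out_neighbors(vertex_id, label):
    """Follow outgoing edges with the given label and return target ids."""
    return [dst for src, lbl, dst in edges if src == vertex_id and lbl == label]


def in_neighbors(vertex_id, label):
    """Follow incoming edges with the given label and return source ids."""
    return [src for src, lbl, dst in edges if dst == vertex_id and lbl == label]


# Traversal: start at alice, hop to the repositories she watches,
# then hop backwards to everyone who watches those repositories.
watched = out_neighbors(1, "REPO_WATCHED")
co_watchers = sorted(
    {
        vertices[user]["login"]
        for repo in watched
        for user in in_neighbors(repo, "REPO_WATCHED")
    }
)
print(co_watchers)
```

The point is only that a query is expressed as a walk from a starting vertex rather than as a scan and join over whole tables.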
Tinkerpop Background
Tinkerpop is a loosely coupled virtual organization centered around
Marko Rodriguez that develops infrastructure libraries and interfaces
for graph databases.
Tinkerpop has six major projects that are hosted on Github:
Pipes: A general data flow and processing framework
Blueprints: A library to abstract graph database interfaces
Frames: An object-to-graph mapper
Gremlin: A domain specific language for traversing graphs
Rexster: A general web interface for Blueprints supported databases
Furnace: A library of algorithms for traversing graphs
The Tinkerpop Network
As part of an ongoing research effort between IBM and the University
of Nebraska, Lincoln, I’ve written a tool called GitMiner
that can connect to Github and pull down information on a set of
projects. In celebration of Gremlin hitting 600 watchers on Github, I
pulled the complete network for all of the Tinkerpop projects from
Github from May 1-3, 2012. This network contains the following pieces
of information:
In a future post I’ll provide more details of how you can use GitMiner
to access data on your own projects. I’ll also provide some pointers
to other data sets people may wish to analyze.
Getting Started with Analysis
For this analysis we’re going to use a couple of different software
packages. First, we’ll be using Gremlin to do some queries of the
database and to create exportable networks for further
analysis. Additional analysis will be conducted using R. These
instructions are generically for people running a Mac, Linux, or other
operating system with a POSIX-like command line interface. If you’re
on Windows you should be able to follow along but you’ll need to
modify the shell commands. All the tools used in this analysis are
cross-platform, open source, and freely available.
Installing Gremlin
I’m not going to repeat everything in the Gremlin docs
here, but here’s a brief overview of what you’ll need to do to get
going on a Mac or Linux:
This assumes that you’ve already got a nice Java development
environment set up and that you have Maven installed. If this
is your first time using Maven to build any Java packages this can
take a long time as it will automatically download all of the
dependencies needed to compile and run Gremlin.
Installing R
R is a language for statistical computing. It’s slow, uses strange
syntax, and is a memory hog. In short, it’s quite possibly one of the
worst possible ways to do this analysis. However, it also is the
dominant language in the field and provides a huge number of libraries
and tutorials that we’ll use for our analysis.
There are a variety of different ways to interact with R. If you’re on
Windows or a Mac the standard downloads of R have a decent graphical
interface for editing scripts and running commands. If you’re an Emacs
hacker, ESS is a great library that interfaces nicely with R. If
working inside of Eclipse is your thing, then use StatEt. Personally,
I use R-Studio for most of my work. Further screenshots will be based
on R-Studio, but you should be able to follow along with other interfaces.
Installing R-Studio is straightforward. Visit the
R-Studio Desktop download page and download and
install the version for your platform.
Downloading the Data
I’ve posted the Tinkerpop Social Graph as a Neo4j
database. You should visit it and download
TinkerpopSocialGraph.20120501.db.tar.gz. After
downloading it you should go into the directory where you downloaded
and compiled Gremlin and extract it. If you’re on a Mac or Linux, the commands
will generally be something like this:
The dataset is fairly large, about 148MB compressed. It’s quite a bit
of data and if you’re a lazy student taking your first SNA class it
should have enough data to do a really kick-ass class project. If
you’re a grad student and interested in writing a paper on this sort
of data email me and we can probably collaborate.
Exploring the Graph
Gremlin provides an interactive interpreter that we can use to explore
the graph. You can start it up by running ./gremlin.sh. Then run the
following commands. Lines that begin with gremlin> are the lines you
should type into the interpreter.
To begin with we’ll connect to the graph and get a specific node from
the database. In this case, we’ll pull up the node that represents
Marko Rodriguez, the main developer of tools from Tinkerpop.
         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = new Neo4jGraph("tinkerpop/tinkerpop.db")
==>neo4jgraph[EmbeddedGraphDatabase [/Users/pwagstro/gremlin/tinkerpop/tinkerpop.db]]
gremlin> marko = g.idx("user-idx").get("login","okram").next()
==>v[8]
gremlin> marko.map()
==>location=Santa Fe, New Mexico
==>sys_last_updated=1335930109
==>blog=http://markorodriguez.com
==>type=USER
==>gravatarId=https://secure.gravatar.com/avatar/fb12ea6a621399613aae4d692533e067?d=https://a248.e.akamai.net/assets.github.com%2Fimages%2Fgravatars%2Fgravatar-140.png
==>followers=57
==>following=12
==>createdAt=1257359950
==>name=Marko A. Rodriguez
==>login=okram
==>fullname=Marko A. Rodriguez
==>gitHubId=148925
==>sys_events_added=1335918859
==>user_type=User
==>totalPrivateRepoCount=0
==>private_gist_count=0
==>sys_last_full_update=1335918850
==>biography=graph algebra, digital librarianship, computational eudaemonics, graph theory, network science, government architecture, network metrics, decision support systems, computational social choice theory, social networks, scientometrics, collective intelligence, semantic networks, ontologies, bibliometrics, information science, swarm intelligence, information markets, peer-review process, computational sociology, knowledge engineering, computer architecture, programming languages, theoretical computing, psychometrics, multi-relational graphs, knowledge representation, reasoning, neural networks, multi-valued logic, neural growth algorithms, recommendation algorithms, distributed computing, ethics.
==>diskUsage=0
==>url=https://api.github.com/users/okram
==>public_gist_count=14
==>collaborators=0
==>email=marko@markorodriguez.com
==>sys_created_at=1335918699
==>ownedPrivateRepoCount=0
==>public_repo_count=0
The values output by marko.map() are the properties of the vertex
that represents Marko in the database. With the exception of the
properties that begin with sys_, which were added by
GitMiner when the data were imported, all of the other
properties are obtained directly from the GitHub API.
In a similar vein we can get the vertex that represents Gremlin using
the following commands:
While this provides a lot of information about individual vertices in
the database, it doesn’t provide information about how projects or
people are related. We get at this information by looking at the edges
connected to a vertex. Within databases such as Neo4j and OrientDB
edges are directed and always go from a single source node to a
single target node. This query will iterate over all of the outgoing
edges from Marko and count up their types.
There are a lot of types of edges in the database (see EdgeType.java
in the project source for the complete list). In this case
we’ll focus on the project social network, which is shown through the
FOLLOWING and FOLLOWER relationships. At the time of the data pull
Marko was following 12 people and had 57 followers.
Likewise, we can do a similar query for incoming edges:
When we reverse the direction and look at incoming edges these numbers
differ, and it shows that there are only nine people that Marko is a
follower of and 41 people that are following Marko. The difference in
these values is because the data only contains the sample of people
around the Tinkerpop projects. Thus, we can see that there are
57-41=16 people that are following Marko that don’t show up in the
data. This is because they don’t have activity, such as creating
issues, commenting on issues, or watching a repository, that would
pick them up in our sample. We know they exist, but we don’t have much
information about them.
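If the counting step is hard to picture, here is a rough sketch of the same idea in plain Python rather than Gremlin. The edge list is a made-up toy, not the real data, and the sketch makes no claim about what each label means in GitMiner; it simply tallies labels per direction. The last three lines redo the arithmetic from the paragraph above using the numbers reported there.

```python
from collections import Counter

# Hypothetical directed edges as (source, label, target).
# FOLLOWING and FOLLOWER are labels mentioned in this post; the edges are invented.
edges = [
    ("marko", "FOLLOWING", "alice"),
    ("marko", "FOLLOWING", "bob"),
    ("marko", "FOLLOWER", "carol"),
    ("alice", "FOLLOWING", "marko"),
    ("carol", "FOLLOWING", "marko"),
    ("bob", "FOLLOWER", "marko"),
]


def label_counts(vertex, direction):
    """Count edge labels leaving ("out") or entering ("in") a vertex."""
    position = 0 if direction == "out" else 2
    return Counter(edge[1] for edge in edges if edge[position] == vertex)


outgoing = label_counts("marko", "out")
incoming = label_counts("marko", "in")
print(dict(outgoing), dict(incoming))

# Numbers taken from the text: followers on the profile vs. followers in the sample.
followers_on_profile = 57
followers_in_sample = 41
missing_from_sample = followers_on_profile - followers_in_sample
print(missing_from_sample)
```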
Your First Graph Traversal
Now that you’ve gotten a feel for getting information about a single
vertex in the graph, it’s time to do a simple traversal. To start with,
let’s get the names of all of the contributors to Gremlin.
This query starts with the Gremlin vertex we identified before and
then follows all edges labeled REPO_CONTRIBUTOR which is GitHub’s
way of saying someone has code in the project repository. Once we’ve
followed all of those edges we can fetch the login name of the users.
In a similar vein, we can get the name of all of the projects that
Marko has contributed to using the following query:
Now, we can put the two together. Our first query got a list of all of
the people who contributed to Gremlin. Let’s take it a step further and
get the list of all of the people who have contributed to projects
that Marko has contributed to.
This, however, shows many people multiple times. Let’s just count how many
times each name appears and then sort the list. This will give a rough
idea of the people that Marko works most closely with.
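For readers who think better in loops than in pipes, the count-and-sort step can be sketched in plain Python. This is not the Gremlin query, just the same logic on a hypothetical contributor table; only the login okram comes from the real data.

```python
from collections import Counter

# Hypothetical project -> contributor logins. Only "okram" is a real login.
contributors = {
    "gremlin": ["okram", "alice", "bob"],
    "pipes": ["okram", "alice"],
    "blueprints": ["okram", "alice", "carol"],
}


def projects_of(user):
    """First hop: projects the user has contributed to."""
    return [project for project, users in contributors.items() if user in users]


def collaborators(user):
    """Second hop: everyone else on those projects, counted and sorted."""
    counts = Counter()
    for project in projects_of(user):
        for other in contributors[project]:
            if other != user:
                counts[other] += 1
    return counts.most_common()  # highest count first


ranking = collaborators("okram")
print(ranking)
```

A name that shows up once per shared project ends up with a count equal to the number of shared projects, which is why the sorted list works as a rough closeness measure.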
Taking this a step further, let’s look at what other projects people in
this set watch. We need to branch out another layer, but first we need
to be careful and add in a dedup() in the pipe to ensure that we’re
not counting some projects too often.
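Why the dedup matters is easiest to see with numbers. The following is a small Python sketch on invented data (it is not Gremlin's dedup() step, only the same idea): a person reached through two projects would otherwise have their watched projects counted twice.

```python
from collections import Counter

# Hypothetical data: who contributes to what, and who watches what.
contributors = {"gremlin": ["okram", "alice"], "pipes": ["okram", "alice"]}
watching = {
    "okram": ["gremlin", "pipes"],
    "alice": ["gremlin", "pipes", "storm"],
}

# Fan out from projects to people; each person appears once per project.
people_with_duplicates = [u for users in contributors.values() for u in users]

# Without deduplication every watched project is counted once per path.
naive = Counter(repo for u in people_with_duplicates for repo in watching[u])

# Deduplicate people (keeping first-seen order) before branching out again.
people = list(dict.fromkeys(people_with_duplicates))
deduped = Counter(repo for u in people for repo in watching[u])

print(dict(naive))
print(dict(deduped))
```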
It’s no surprise that the projects in the Tinkerpop stack are the most
watched projects among the developers who work on Tinkerpop
projects. However, there are a few other interesting pieces of
software that seem popular. Among others, Storm is a realtime computation
system written in Java and Clojure that’s great for munging through
thousands of logs. Bulbs is a nifty Python interface to many
graph databases. Nerlo is a mechanism to use Neo4j
from within Erlang. My apologies if my descriptions are wrong, as some
of these projects are new to me too.
That’s enough about traversals in the data for now. I’ll leave you to
explore the data on your own. In future articles I’ll cover more about
actually mining the relationships.
Exporting a Graph to GraphML
While graph databases and Gremlin are very useful for storing your
data and doing traversals on data, they’re not always well structured
for doing computation on the data and gaining insight over a wide
number of projects. In grad school I studied with one of the leaders
in the field of social network analysis, and now that she’s
given me a hammer, it seems like everything looks like a nail. In this
section I describe how to get your data out of a graph database and
into a program like R.
A common interchange format for social network data is in the GraphML
format - an XML specification for describing graphs. It was first used
by individuals interested in visualizing large scale graphs. As such,
it has significant drawbacks that make it less than ideal compared to
other formats such as DynetML (e.g. only a single graph, no nesting,
edges must all be directed or undirected). In any case, it’s what we
have, so we’ll use it. Fortunately, both Gremlin and the igraph
package for R, which we’ll be using later, support GraphML.
I’ve created a simple script that you can run in your current Gremlin
session. You should be able to just paste this code into your running
Gremlin session and it will save the network to a file called
follower.graphml.
The astute observer will notice a couple of things about this. First,
we’re using a specialized method to get all of the users associated
with the Gremlin project on GitHub. However, we’re not following all
of the ways a user can be associated. For example, we’re not looking at
issues, pull requests, commits, or other events.
Secondly, we’re skipping a lot of edges and vertices. In this case
we’re skipping every edge that doesn’t lead to a user in this set. The
reason for that is because if we didn’t skip these edges we’d have a
network with 30,000+ nodes as opposed to the 606 in this
network. While it’s possible to do analysis on networks of that size,
it is much slower and would prove to be a bit of a distraction here.
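The edge-skipping idea can be sketched independently of Gremlin. The Python below is a toy under stated assumptions: the users and follow edges are invented, and the output is a bare-bones GraphML document with node ids only and no properties, so it is much simpler than what a real export would contain.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of users associated with a project, and follow edges
# that may point outside that set.
users = {"alice", "bob", "carol"}
follows = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "dave"),  # dave is outside the set, so this edge is skipped
    ("erin", "alice"),  # erin is outside the set, so this edge is skipped
]

# Keep only edges whose source and target are both in the user set.
kept = [(src, dst) for src, dst in follows if src in users and dst in users]


def to_graphml(nodes, edges):
    """Serialize nodes and directed edges as a minimal GraphML string."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    # GraphML wants one edge direction default for the whole graph.
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for node in sorted(nodes):
        ET.SubElement(graph, "node", id=node)
    for src, dst in edges:
        ET.SubElement(graph, "edge", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")


xml_text = to_graphml(users, kept)
print(xml_text)
```

Writing xml_text to a file would give something a GraphML reader can load, though without any of the vertex properties shown earlier.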
Network as visualized in Cytoscape
This finishes the section of the article dealing with Gremlin from the
command line. From here on out the operations are done in R.
Very Important: Before exiting Gremlin run the command
g.shutdown() to close the graph database. If you don’t do this
then you’ll have to wait for a recovery process the next time you
look at the data.
Examining the Data in R
Within R the first thing to do is to make sure you have the igraph
package installed. You can do this by running the following command
and following the directions:
install.packages('igraph')
Now that we’ve got igraph installed, it’s time to have some
fun. First, we need to tell R to use the functions inside of the
igraph library and to load our data.
First let’s get some summary information. This can be done with the
ecount and vcount functions. It shows that in the current network
there are 510 edges and 606 nodes.
> ecount(graph)
[1] 510
> vcount(graph)
[1] 606
This network has a lot of isolates in it. That’s somewhat to be
expected as not every user utilizes the follower feature of
GitHub. The following commands will remove isolates from our data set
and result in a network of 236 vertices and 510 edges.
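In case the term is new: an isolate is a vertex with no edges at all, so dropping isolates shrinks the vertex count but leaves the edge count alone. Here is that idea as a tiny Python sketch on made-up data; it is not the igraph code, just the logic.

```python
# Hypothetical network: five vertices, two edges.
nodes = {"a", "b", "c", "d", "e"}
edges = [("a", "b"), ("b", "c")]

# A vertex is connected if it appears at either end of any edge.
connected = {node for edge in edges for node in edge}

isolates = nodes - connected
remaining = nodes - isolates

print(sorted(isolates))
print(len(remaining), len(edges))
```

That is the same pattern as in the real data, where the vertex count drops from 606 to 236 while the edge count stays at 510.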
First, let’s get an idea of the degree of the vertices in the
graph. This command creates a histogram that clumps vertices by the
number of edges they have. We see that only a very few have many
edges; most have fewer than 10 edges. I should stress, this does not
reflect the total number of people those accounts follow, rather it
reflects only the total number of people within Gremlin that each
account follows.
hist(degree(graph))
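What that one-liner computes can be spelled out by hand. The Python sketch below uses an invented edge list and counts both incoming and outgoing edges for each vertex, which I believe matches igraph's default for degree(); treat that as an assumption rather than a guarantee.

```python
from collections import Counter

# Hypothetical directed edges.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

# Degree: number of edges touching each vertex, in either direction.
degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

# Histogram: how many vertices have each degree.
histogram = Counter(degree.values())
for value in sorted(histogram):
    print(f"degree {value}: {'#' * histogram[value]}")
```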
Now, let’s look at a couple of the classic centrality
measures. Betweenness centrality calculates the proportion of all
shortest paths between vertices that a particular vertex sits on. If
communication had to go person to person and could only go along
connections that are established, these people would prove to be key
in the network.
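To show what is being computed, here is a small Python sketch of betweenness for an unweighted directed graph using Brandes' algorithm. It is a toy written for this explanation and not what igraph does internally. The graph is invented, and the scores are raw counts of shortest paths passing through each vertex rather than proportions.

```python
from collections import deque


def betweenness(nodes, adjacency):
    """Unnormalized betweenness for an unweighted directed graph (Brandes)."""
    score = {v: 0.0 for v in nodes}
    for start in nodes:
        # Breadth-first search from start, counting shortest paths (sigma)
        # and remembering which vertices precede each vertex on those paths.
        dist = {start: 0}
        sigma = {v: 0 for v in nodes}
        sigma[start] = 1
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, accumulating dependencies.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != start:
                score[w] += delta[w]
    return score


# Hypothetical follow graph: a -> b, and b -> c and d.
nodes = ["a", "b", "c", "d"]
adjacency = {"a": ["b"], "b": ["c", "d"]}
scores = betweenness(nodes, adjacency)
print(scores)
```

In this toy the only multi-hop shortest paths (a to c and a to d) both run through b, so b is the only vertex with a nonzero score.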
Now that we’ve calculated those centralities, let’s take a look. We’ll
start with betweenness. According to this data the user that has the
most central role is spmallette, an active participant
in the Tinkerpop communities, followed by ahzf, a developer
who is working on .NET ports of many Blueprints services. In third
place is a research account from a university in Korea. This account
shows up all over the place and I generally consider it to be a spam
account. It follows tens of thousands of users and therefore creates
artificially short paths between users, boosting its score in the
process. In fourth place is Marko, the leader of Tinkerpop.
In the betweenness centrality model, which is a directed model, users
who follow few additional users are penalized. As Marko only follows a
handful of users, his score is low, despite the fact that many people
in the community follow him.
However, when we use eigenvector centrality, which is a more robust
centrality metric, we find a more interesting picture. Marko
and peterneubauer are the top individuals, followed by spmallette and
joshsh, additional developers of Tinkerpop.
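The intuition behind eigenvector centrality is that you are important if important people point at you. A rough way to see it is power iteration, sketched below in Python on an invented follow graph. This is an illustration only: the names are placeholders, scores flow along incoming edges, the result is scaled so the top score is 1, and the toy graph was chosen so that the iteration settles down, which is not guaranteed for arbitrary graphs.

```python
def eigenvector_centrality(nodes, edges, iterations=200):
    """Approximate eigenvector centrality by repeated score passing."""
    score = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        updated = {v: 0.0 for v in nodes}
        # Each vertex hands its current score to everyone it points at.
        for src, dst in edges:
            updated[dst] += score[src]
        top = max(updated.values())
        if top == 0:
            return updated
        # Rescale so the largest score is 1.
        score = {v: value / top for v, value in updated.items()}
    return score


# Hypothetical follow edges (follower, followed).
nodes = ["marko", "alice", "bob", "carol"]
edges = [
    ("alice", "marko"),
    ("bob", "marko"),
    ("carol", "marko"),
    ("marko", "alice"),
    ("alice", "bob"),
    ("bob", "carol"),
]
centrality = eigenvector_centrality(nodes, edges)
leader = max(centrality, key=centrality.get)
print(leader)
```

Unlike the betweenness case, a user who follows few people is not penalized here; being followed by well-followed accounts is what counts.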
There’s always more that you can do with these tools, and in the
future I’ll discuss some more, but for now I hope this has given you a
taste for how to mine social networks from GitHub. Enjoy!
This is my first new post written in markdown for the static version
of
patrick.wagstrom.net/weblog. The
only reason I was running both PHP and MySQL on my server was to host
WordPress, which became a pain in the ass with all of the
upgrades. This eliminates all of those nasty security holes and allows
me to focus a little bit more on just writing. Which is what a weblog
is supposed to be.
I’m running Octopress, which is a blogging
framework based on Jekyll. The downside to
this is that it cannot accommodate dynamic elements,
so all comments need to be farmed off to an external
service. Fortunately, I was already using
IntenseDebate. With only a very small
amount of work I was able to migrate everything over to the new
system. Perhaps most substantial is that I had to write a patch to
Octopress to support IntenseDebate. I’ve since created a
pull request for IntenseDebate support
on Github. Hopefully the authors will see fit to pull it in.
So yeah, it’s a little more work now that I don’t have a web interface
to do things like manage images and remember my links, but I can write
posts from any text editor, which is VERY handy for when I’m stuck in
airplanes and too cheap to pay for WiFi.
Overall I’m not certain if this is a good idea. In the past I’ve
extolled various reasons why
you shouldn’t try to do it yourself. However
there is also merit to doing it yourself. Up until this point I’ve
been an active Ruby Hater, and it’s becoming clear that I should at
least be peripherally aware of what Ruby can do. Although my
extensions to this point have not involved hacking Ruby, it might at
some point in the future.
So, for now, enjoy the fact that every post is showing up again in
your RSS reader and marvel at the beautiful new theme, with no more
worries about annoying security faults and a faster response time.
In the past couple of days I have twice received an opinion piece from the
Wall Street Journal which suggests that the models used for estimating climate
change are grossly pessimistic and that we really need not be concerned with
anthropogenic climate
change.
It was signed by sixteen scientists and engineers. The problem is that almost
none of these people are climatologists - which is the field they are claiming
is producing invalid science. Anyone can call themselves a scientist - having a
Ph.D. helps - but, just because you are a scientist does not mean that you can
speak authoritatively on all issues related to science. Stephen Hawking is a
brilliant scientist, but he studies astrophysics, not climatology. I trust him
on a lot of things, but I wouldn’t trust him on climate change. Nor would I
trust Albert Einstein, Louis Pasteur, Marie Curie, or Isaac Newton on issues of
climate change.
So, who are these climate change deniers that have the right frothing at the
mouth again? Let’s take a quick look.
Claude Allegre, former director of the Institute for the Study of the Earth, University of Paris - Is a geochemist, which might make him qualified. It’s hard to tell as he has spent most of his time doing political work recently. He appears to have a strong contrarian streak, such as in 1996 when he insisted that asbestos was harmless and that anger over it was caused by mass hysteria. The last time I checked, the link between asbestos and mesothelioma was pretty firm.
J. Scott Armstrong, cofounder of the Journal of Forecasting and the International Journal of Forecasting - This one gave me a decent chuckle. At first I thought he was a climate forecasting scientist, nope. Armstrong’s expertise is in marketing style forecasting, as in trends. His journal is also published by Elsevier. I think I threw up a little in my mouth.
Jan Breslow, head of the Laboratory of Biochemical Genetics and Metabolism, Rockefeller University - A medical doctor and not a climatologist. Breslow is perhaps most well known for his work on heart disease. This is great work he has done, but it’s not atmospheric science.
Roger Cohen, fellow, American Physical Society - It’s difficult to find information on Cohen. Prior to retirement he worked for ExxonMobil research, but that’s about all I can find. I can’t seem to find any publications on any issue. However, he does have a very common name, making him hard to google. He frequently consorts with William Happer, who appears later in the list.
Edward David, member, National Academy of Engineering and National Academy of Sciences - As a member of the National Academy of Engineering I have great respect for Dr. David. However, he is an electrical engineer and has been largely retired from research for more than 20 years. Did I mention he was director of research at Exxon from 1977-1985?
William Happer, professor of physics, Princeton - Seems to have moved away from research as he’s advanced in his career. During his prime he was a leader in the field of spectroscopy. Which, in case you didn’t know, has nothing to do with climate change. During his 2009 testimony to Congress he indicated that an increase in CO2 is good for the planet because it’s good for plants. Yes, very much like the Competitive Enterprise Institute’s “CO2, We Call it Life” video.
Michael Kelly, professor of technology, University of Cambridge, U.K. - Kelly primarily works on semi-conductors, specifically SRAM. He is not a climatologist or even a chemical engineer or chemist.
William Kininmonth, former head of climate research at the Australian Bureau of Meteorology - Kininmonth is, perhaps, a meteorologist, although there is little information easily available about his activities. It is known that he is not a prominent researcher in any field and his “Australasian Climate Research Institute” is run out of his home and appears to be only his own writings.
Richard Lindzen, professor of atmospheric sciences, MIT - Lindzen is perhaps the most qualified individual on this list. He is well known for his skepticism of anthropogenic climate change. He stands out from the other signatories because he can speak with true scientific authority on the issue.
James McGrath, professor of chemistry, Virginia Technical University - McGrath studies polymers and fuel cells. He is a scientist, but not a climate scientist.
Rodney Nichols, former president and CEO of the New York Academy of Sciences - This one took me a while longer to find out information about. I believe that Dr. Nichols is a physicist from Harvard, which means he could be a climatologist. However, looking at his publication record for the last 40 years you’ll find that most of his work is dealing with science and technology policy – issues that are close to my heart. However, this doesn’t qualify him as a climatologist. I’m sure he is well learned in a variety of topics, but I don’t believe he has a deep knowledge of the current research on climatology.
Harrison H. Schmitt, Apollo 17 astronaut and former U.S. senator - As an astronaut Harrison Schmitt was on the mission that took the famous “Blue Marble” picture of the earth. In fact, evidence indicates that Schmitt most likely took the photo that has been credited with being a critical catalyst for the environment movement in the 1970’s. Outside of his astronaut career he was a university professor, geologist, and senator from New Mexico. None of these are related to the atmosphere or climate science.
Nir Shaviv, professor of astrophysics, Hebrew University, Jerusalem - Shaviv is primarily an astrophysicist known for his work on cosmic rays and luminosity. He has his own theory of global warming which says that the cosmic rays of the sun are responsible for global warming. His theory has not been widely accepted and has faced great challenges because of the fact that the solar output has been decreasing since the mid 1980’s.
Henk Tennekes, former director, Royal Dutch Meteorological Service - Also a professor of Aeronautical Engineering at Penn State, Tennekes is most well known for his work on turbulence in airflows. In fact, he literally wrote the book on it. Unfortunately, that’s not a book on climate change. He was reportedly ousted from the Royal Dutch Meteorological Service for his denial of climate change and his occasional reliance on biblical texts for justification. Look, I’m a Christian and a scientist, but I realize that I can’t use biblical texts to justify my work; that’s not how science works.
Antonio Zichichi, president of the World Federation of Scientists, Geneva - Primarily a sub-nuclear physicist who has worked at labs like CERN and FermiLab. His title of President of the World Federation of Scientists is self bestowed as he is the founder. It should not be considered to be an analog to the Federation of American Scientists. He is a highly cited researcher, and has done significant work in popularizing science in Italy, but he is not a climatologist.
Out of the sixteen people listed I count one atmospheric scientist, Lindzen,
and a half, Allegre. In any community of scientists you’ll have dissenters. The
fact that they could round up only one and a half climate scientists for this
letter should show you just how strong the case for global warming really is.
Want more evidence? 255 scientists, all members of the National Academy of
Sciences, including 11 Nobel laureates, wrote a scathing response, rejected by
the Wall Street Journal and later published in
Science.
Are you one of the best software engineering students in the world? Do you dig mining software repositories? Are you a wizard at social network analysis? Interested in a great summer job looking at what makes software teams work? Even better, want to work with me?
We suggest that you apply online. If you’ve got questions you can email directly for more information. But hurry up, as we’re going to start our selection and interview process soon.
PS. For faculty, this is a great way to offload students for the summer if you’d like to take off to St. Barth’s for a few months.
I'm a software engineering researcher at the IBM TJ Watson Research
Center in Yorktown Heights, NY. For more information about what I do
for my day job check out my academic home page.
These writings and opinions on this site are my own and do not reflect
those of IBM...duh.