My Delusional Dream

New Paper: A Network of Rails: A Graph Dataset of Ruby on Rails and Associated Projects

For the last year and a half I’ve been working with Anita Sarma, a professor at the University of Nebraska, Lincoln and her graduate student, Corey Jergensen, to try and understand some of the social dynamics around GitHub. As we began to dig at the ecosystem we realized that we had an opportunity to perform some novel analysis on the community. Specifically, GitHub is a highly networked ecosystem and most of the queries that we were doing were localized around single projects or developers. At this time graph databases were taking off so we decided to learn a new technology while getting some data at the same time.

This resulted in the creation of GitMiner, a tool that utilizes the GitHub APIs to download all the data about a project and its related users, issues, pull requests, and basically everything else that you can get out of the API. It then stores this information inside of a graph database - something that I’ve written about before when I first published a dataset on the Tinkerpop family of projects.

Now we’ve had a chance to formally publish a larger set of data, thousands of projects associated with Ruby on Rails. The data are published in this year’s conference on Mining Software Repositories. If you’d like to read the actual paper, here’s the authors’ pre-print of the paper and the GitHub repository with the actual data.

In the coming weeks/months I’ll probably write more about how to use GitMiner to collect large amounts of data from GitHub and how to crawl this data. For the interim, however, I’ll leave you with this nifty picture of shared developers between projects, which is part of an upcoming submission of ours.

Developers shared between projects in our Ruby on Rails dataset. The size of nodes represents the number of developers on the project, edge width is the number of shared developers between projects, and color represents programming language. [link to full size image]

Citation: Wagstrom, P., Jergensen, C., and Sarma, A. A Network of Rails: A Graph Dataset of Ruby on Rails and Associated Projects. Proceedings of the 2013 Working Conference on Mining Software Repositories, ACM (2013).

The KTHXBAI Experiment

On May 1st, 2012 I embarked on an experiment at work — I started signing work emails to my team and friends inside and outside the office with the words “KTHXBYE” or “KTHXBAI”. The goal was to see how long it would take until someone mentioned or asked about it. About two weeks after I started the experiment a friend from Microsoft noticed it and mentioned it to me. Of course, I replied with a meme:

To which my friend at Microsoft was gracious enough to reply with a meme of his own. This experiment was clearly off to an awesome start.

I expected that I’d hear back from other folks in a matter of days. But then the days turned to weeks and the weeks slowly turned into months. I concluded one of two things: either no one actually read my email, or no one actually caught the reference. Undaunted I persisted. Over the course of the experiment I sent out more than 450 messages with the signature “KTHXBYE” and about 65 with “KTHXBAI”, although I only realized that “KTHXBAI” was the appropriate spelling late into the experiment.

Finally, yesterday, March 5, 2013, the experiment came to an end. My manager asked about what it meant, and she googled the definition. Which, unfortunately, led to the Urban Dictionary definition of “KTHXBAI”.

My response: “Ughh…”. This led to an explanation that Urban Dictionary shouldn’t be trusted and that, no, I wasn’t telling my co-workers to get bent at the end of every email. I had to introduce the whole concept of LOLCats, which thankfully was backed up by the creation of my 2007 intern project called LOLJazz, which somehow lingers on as a zombie inside of our Rational Team Concert Server. Still, I wasn’t out of the woods; there was the chance that it could still be “actionable”. This is where I had an ace up my sleeve. During the development of Watson, IBM’s Jeopardy! playing computer, the team, which happens to be in my organization, fed the entire Urban Dictionary into Watson. As could be guessed, the importation of Urban Dictionary into Watson led to many hilarious and wholly inappropriate responses. In short, Urban Dictionary was a cesspool and shouldn’t be used as canon. Rather, in this case the Cheezburger kthxbai is a much better source.

A traditional use of KTHXBAI (even if it is misspelled)

And so, my experiment has come to an end. In the end it was sorta a drag as month after month passed with no one mentioning it. After I talked about it as an experiment everyone came out of the woodwork to say they had seen it and wondered what it meant, but didn’t bother to ask. Which leads me to wonder, how often does this happen? Do people even read my emails? Do they just ignore things they don’t understand or perceive as irrelevant? Do they do that to everyone, or just me? Could I start saying that we need to replace the fitzervalve on the flux capacitor in order to keep the servers from frobnicating themselves and get away with it?

Now, it’s time to find a new subversive work experiment…

Looking for an Intern for Summer 2013

Once again I’m looking for an amazingly bright Ph.D. student to work with me over the course of the summer. The position is open to Ph.D. students from any university and at any point of their studies, and I can nearly guarantee it’s going to be an awesome experience.

The primary task will be applying machine learning techniques (lexical analysis, network extraction, predictive analytics) to the usage data from a large piece of commercial software. With a little bit of luck the software will be instrumented by this point in time so you’ll just need to slice and dice the data and find awesome stuff. The goal, of course, is to publish an amazing paper that provides great insight into how users actually use this type of software and provide guidance to architects and developers of such a system.

A loose list of desirable skills:

  • Java: Most of our tools are written in Java. It took me a while to get used to this, but Java has some nice advantages for developing code to run in an enterprise. Here at IBM we really love it and most of our software, including the tool we’re looking at, is built in Java.
  • Software Engineering Processes: Domain expertise in understanding the relationships between the different levels of stakeholders in a software project is immensely helpful and will make it a lot easier to tease great bits of nuggets out of the data.
  • Machine Learning: We use various types of machine learning, both Java libraries and some R to understand the data. On the Java side knowledge of text analysis packages such as OpenNLP is helpful.
  • Statistics: I love R. If you love R it helps out.
  • Visualizations: I’m big on making great visualizations to show off our findings. If you’re a ninja with ggplot or d3 then you probably qualify.

Of course, there’s a variety of other skills that are helpful too. The intern absolutely must be self motivated and able to find answers to questions on their own. This isn’t an unsupervised position, but I travel a lot and am frequently out of the office, which limits my ability to provide direct daily supervision. As a result, excellent communication skills are also helpful — you should know how to ask questions over email in a way that is succinct while providing enough information for other people to answer the question. If you’ve got a great profile on StackOverflow you’re probably already there.

There’s some great advantages to spending a summer working with me at IBM TJ Watson Research in Yorktown Heights, NY. First, you’ll be working with some of the smartest people in the world at a facility that has an amazing legacy. IBM Research was the genesis of DRAM, the processors in all major video game consoles, Watson - the Jeopardy! playing computer, LASIK, and thousands of other things. We make the world awesome.

Second, our interns come from around the world and are generally smarter than we are. You know that feeling you get when you go to a conference? You’re always excited about new ideas and feel like you could go home and churn out your thesis in a week. Imagine that feeling for an entire summer! I had a blast when I interned here and met some incredible young researchers who I’m still friends with.

Thirdly, we’re just outside of New York in scenic Westchester County, NY. I took the train into the city every Friday, Saturday, and Sunday when I interned here. It was the perfect combination of excitement from New York City and a setting where you can really get work done. You may be saying “isn’t New York really expensive?”. You’re entirely right. Don’t worry, we pay enough that it’s totally worth your time.

Interested? You can either email me or visit our intern hiring page for more information. We won’t be taking applications that much longer, so be sure to act soon.

Rules for Recruiters, Vol 1: GPA Doesn’t Matter If You Have a Ph.D.

Tech jobs are hot in New York right now. Last year while sitting in LaGuardia Airport waiting for a flight I was hacking on some code for work in Eclipse and a guy who was shoulder surfing me tried to persuade me to interview for positions he had available at his hedge fund. If you visit any Meetup in the city you’ll hear from dozens of people who are looking for the best and the brightest. When I combine these with a publicly visible GitHub profile, a resume that’s sitting on my web page, and a fairly complete LinkedIn profile it means that messages from recruiters are constantly flooding my mailbox.

They’re nearly all amateurish wastes of my time.

In this series of posts I’m going to chronicle why they’re such a waste of my time. Here’s a paraphrased recent message I got:

Dear Dr. Wagstrom,

I work for MegaHyperTech, a leading technology placement firm in New York City. We came across your profile on GitHub and later found your resume and think that you may have the talent that our client Quanttastic Solutions is looking for. They’re a hedge fund that makes it feel like you’re working at Google. They hire only the best and the brightest from schools like MIT, Berkeley, CMU, and Michigan. We’d love to send your information over there, but we noticed that you don’t list your GPA on your resume and they only hire individuals with exemplary GPAs. If you’re willing to update your resume to include that information for all your degrees we think that you’d enjoy the challenge.

The recruiter is entirely correct, I don’t list my GPA on my resume. This is done for a couple of reasons: first, my degrees are intertwined. It’s really hard to differentiate the GPAs for computer science, electrical engineering, and computer engineering bachelor’s degrees. They’re all in the 3.5 - 3.9 range, but really I don’t remember what they were. Likewise, my master’s and Ph.D. from Carnegie Mellon are also intertwined and probably have a similar range.

But the bigger issue is that a Ph.D. isn’t about classes. In fact, while working on a Ph.D. if you’re taking a required class that isn’t directly related to your research you probably shouldn’t spend enough time to get an ‘A’ in the class. The measure of the work for a Ph.D. is the thesis and the publications that come out as a result of doing the research. I think of all the times that I met with my advisors I was asked my grades only once, and it was over a concern that I was spending too much time on my homework for my machine learning class.

So here’s my hope that maybe at some point a recruiter will read this. If you ask me for my GPA you’re not going to get it. If your client insists on GPAs for their candidates, then they don’t know what they’re getting.

30 Meters Underwater with a Dead Physical Layer Protocol

A couple of years ago I got the bright idea that I’d get my wife open water SCUBA certification as her Christmas present. She likes aquariums and fish and I thought it would be a fun way to do something different when we travel. Fast forward to the present day and I’ve got a closet filled with neoprene, BCDs, fins, first aid kits, and a dive log filled with all sorts of certification cards from PADI.

We purchased our own equipment relatively early in the process of learning how to SCUBA dive - shortly after getting our open water certification, thanks in large part to a nice tax rebate. For the most part we’ve been very happy with our purchases and I feel like it’s made us much more comfortable when we’re underwater. One of the key components of diving is a dive computer. The most basic dive computers tell you your depth and warn you if you’re ascending too fast or are going to need a decompression stop somewhere along the way. More advanced computers replace your entire diving console and provide a compass and wireless integration of you and your buddy’s pressure gauge. Yeah, we went for that kind of over the top dive computer and bought the Uwatec Galileo Luna.

Uwatec/ScubaPro Galileo Luna Hoseless Air Integrated Dive Computer

I’ll be the first to admit that this probably wasn’t the wisest of ideas. I spent two weeks researching $55 gel pads for my standing desk and here we just decided to drop $2000 on a couple of dive computers thanks to thirty minutes at our local dive shop. We’ve been completely thrilled with them under water. Where we’ve had more problems is getting data out of them above water. More advanced computers also take periodic samplings of your depth, remaining air, water temperature, etc. You can use this data to reconstruct a dive profile in a way that is much richer than what normally appears in your dive log.

Screenshot of jTrak - Does it feel like it's 1999?

Getting this data off your computer isn’t trivial. Dive computers are expensive for a couple of reasons: they’re produced in relatively low volumes, they often license patented algorithms for estimating your air consumption and remaining bottom time, and, of course, they need to be waterproof. This means that you can’t just drop a USB port on the outside of the case. Nor can you just put a USB port under a rubber flap. At 30 meters you’re facing about 400kPa of pressure - four times the pressure at the surface. Water will find a way in. If it gets in the salt will corrode everything and it will die. Thus, dive computers tend to be very well sealed and make even trivial things like changing the battery a process that requires tools and new grease for the O-rings.
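The 400 kPa figure is easy to sanity-check. Here is a minimal sketch in Python, assuming a seawater density of about 1025 kg/m³ and standard sea-level air pressure; both constants are my own assumptions for illustration, not values taken from any dive computer.

```python
# Rough absolute pressure at depth: surface air pressure plus the weight
# of the water column above you. Constants are illustrative assumptions.
SURFACE_PRESSURE_PA = 101_325.0   # standard atmosphere at sea level
SEAWATER_DENSITY = 1025.0         # kg/m^3, typical seawater (assumed)
GRAVITY = 9.80665                 # m/s^2


def absolute_pressure_pa(depth_m: float) -> float:
    """Return approximate absolute pressure in pascals at depth_m meters."""
    return SURFACE_PRESSURE_PA + SEAWATER_DENSITY * GRAVITY * depth_m


if __name__ == "__main__":
    at_30m = absolute_pressure_pa(30.0)
    print(f"{at_30m / 1000:.0f} kPa, "
          f"{at_30m / SURFACE_PRESSURE_PA:.2f}x surface pressure")
```

Under those assumptions the result at 30 meters is roughly 403 kPa, or about four times the pressure at the surface.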

There really isn’t a standard interface to these devices. It seems as though a lot of devices, such as the Mares puck computers, have corrosion resistant metallic contacts that connect to a USB controller with an FTDI USB→Serial chip in it. However, the Uwatec Galileo decided to be more advanced and use what I’m sure was the hip protocol at the time: IrDA.

Now, in case you missed it, IrDA was all the rage in the 1990’s and early 2000’s. Every laptop seemed to ship with an IrDA port built in. You could use it to synchronize data with your Palm or Handspring in the late 1990’s. Once cell phones were more common you could even tether your laptop to your cell phone and get very slow data. In the pre-wifi, pre-EDGE days this was pretty hot stuff. “Was” being the key word. Hot stuff being around the speed of the 28.8k modem that I used back in 1994.

You can still find devices that use IrDA, most notably a lot of the heart rate monitors from Polar, but for the most part the technology is from about 10 years ago. This also means that you’re dealing with the headaches of 10 years ago, including the near total lack of Mac support for devices. Those that do support the Mac often only support the PPC Mac and never really fully supported it anyway. Did I mention that MacOS X doesn’t even have full support for IrDA? Just try opening up a socket using AF_IRDA. It doesn’t exist. Ughh. This was going to be a great adventure.
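If you want to see whether your own machine has IrDA sockets, here is a small Python sketch. It relies on the fact that Python's `socket` module only defines the `AF_IRDA` constant on platforms whose system headers provide it; the helper name `irda_sockets_available` is my own invention. Even where the constant exists, the kernel may still refuse to create the socket, so the sketch checks both.

```python
import socket


def irda_sockets_available() -> bool:
    """Return True only if this platform can actually create an IrDA socket."""
    # The constant is simply absent where the OS has no IrDA address family.
    family = getattr(socket, "AF_IRDA", None)
    if family is None:
        return False
    try:
        # The constant may exist even though the kernel lacks IrDA support.
        sock = socket.socket(family, socket.SOCK_STREAM)
    except OSError:
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("IrDA sockets available:", irda_sockets_available())
```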

Setting Baseline Expectations

My first naive attempts were to hack an IrDA driver into the framework of libdivecomputer. There was already support for IrDA dive computers under Windows and Linux, and I had confirmed that they worked just fine with my computer, so how hard could it be? The answer is a lot more complicated than I thought. The first step was to find an IrDA dongle that even worked with Mac OS X. I ordered a couple of cheap ones off eBay and had no luck. I read a few comments from folks saying that the official Uwatec USB->IrDA devices worked with JTrak on Mac OS X, however the official dongles are about $70 and JTrak is a bit less than what I’m looking for in dive logging software. Fortunately, I was able to find another device that looked nearly identical from the outside - the IRJoy USB 2.0 USB IrDA adapter for $30. When this guy arrived a quick scan showed that it was the exact same hardware as the official Uwatec dongle - both were based on the MosChip 7780 chipset.

Plugging the device into my trusty ThinkPad X31 showed that it quickly and easily worked both in Windows and Linux using the SmartTrak software from Uwatec, JTrak, and the test applications from libdivecomputer. I knew that I could at least make some progress. Next up was to test it on my Mac. I plugged in the IrDA stick and fired up JTrak and to my amazement it just worked. That NEVER happens. Poking around showed why it worked: the company behind JTrak had licensed a complete pure Java IrDA stack. Well, at least I could use JTrak if everything else failed. However I had my eyes set on something much prettier, MacDive.

Writing a Driver

I had heard people refer to the fact that the MosChip devices had a Mac driver, but most of those conversations ended many years ago - as if I needed more evidence that I was dealing with a dead protocol. After some digging around and emailing random customer service addresses I found that the IP for the MosChip devices was sold off to a company in Taiwan called Asix. They provided a couple of different versions of the driver and I eventually found one that worked in full 64 bit mode on Mac OS X Lion. Score.

The driver came with a simple test application that would let me read the data coming over the device as though it was a serial device. Using this test application I was able to position the reader in the line of sight of other IrDA devices and receive data. Neat. The problem is that I was getting the raw bytes of the IrDA sockets. There’s a lot of overhead in there that goes along with handshaking, setting speed, and resending data when connections are interrupted. None of this seemed to be enabled in the driver. The driver simply provided a couple of serial devices that I could open up and use to smack bits back and forth. If I wanted this to work I would need to write a complete IrDA stack on top of this serial device.

The problem is that the IrDA stack is actually fairly complex. There’s a myriad of different protocols that stack on top of IrDA to make everything work. This was basically the equivalent of trying to implement TCP/IP using just the raw bits coming over the 802.11 physical layer. In other words, it was a nasty layer mismatch that was not going to do me any favors.
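To give a feel for what even the very bottom of that stack involves, here is a rough Python sketch of the asynchronous frame wrapping used at slow IrDA speeds, as I understand it: a start byte, byte stuffing for reserved values, a 16-bit checksum, and an end byte. The constants and CRC parameters are my recollection of that wrapper and should be checked against the IrLAP specification before being trusted. This is not code from the Asix driver or from libdivecomputer, and everything above this layer (discovery, negotiation, retransmission, IrLMP, TinyTP) is left out entirely.

```python
# Assumed framing constants for the async (SIR) wrapper; verify against the spec.
BOF, EOF, CE, XOR_MASK = 0xC0, 0xC1, 0x7D, 0x20


def fcs16(data: bytes) -> int:
    """Reflected CRC-16 (poly 0x8408, init 0xFFFF, result complemented)."""
    fcs = 0xFFFF
    for byte in data:
        fcs ^= byte
        for _ in range(8):
            fcs = (fcs >> 1) ^ 0x8408 if fcs & 1 else fcs >> 1
    return fcs ^ 0xFFFF


def wrap(payload: bytes) -> bytes:
    """Append the checksum (low byte first), escape reserved bytes, add BOF/EOF."""
    fcs = fcs16(payload)
    body = payload + bytes([fcs & 0xFF, fcs >> 8])
    out = bytearray([BOF])
    for b in body:
        if b in (BOF, EOF, CE):
            out += bytes([CE, b ^ XOR_MASK])  # escape marker plus flipped byte
        else:
            out.append(b)
    out.append(EOF)
    return bytes(out)


def unwrap(frame: bytes) -> bytes:
    """Reverse wrap(); raise ValueError on bad delimiters or checksum."""
    if len(frame) < 4 or frame[0] != BOF or frame[-1] != EOF:
        raise ValueError("missing frame delimiters")
    body = bytearray()
    escaped = False
    for b in frame[1:-1]:
        if escaped:
            body.append(b ^ XOR_MASK)
            escaped = False
        elif b == CE:
            escaped = True
        else:
            body.append(b)
    if len(body) < 2:
        raise ValueError("frame too short for a checksum")
    payload = bytes(body[:-2])
    received = body[-2] | (body[-1] << 8)
    if fcs16(payload) != received:
        raise ValueError("checksum mismatch")
    return payload
```

A round trip such as `unwrap(wrap(b"\xc0hello"))` should hand back the original bytes, and that is only the framing; the real work is in the layers stacked on top of it.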

The Multifaceted IrDA Stack - From Wikipedia

I continued to email Asix, who were more than helpful, although they seemed most concerned that I would write a driver that would let the user transfer files with Windows and cell phones. After a few more emails I explained what a dive computer was and how much of a niche this issue was and Asix offered me an NDA to work on the driver and attempt to implement the AF_IRDA stack for Mac OS X. If I were in undergrad this would probably sound like a great idea. However, I’m not. I’ve got a job that keeps me quite busy and has me flying back and forth between New York and Washington on a weekly basis. I just don’t have the time to acquire the knowledge needed to hack together a driver on Mac OS X. Of course, there’s also the issue of me performing gratis work for a for-profit company, which I didn’t really want to do either.

VMs to the Rescue

This left me with really only one simple solution: use what I know already works for communicating with the Galileo Luna, Linux or Windows. In an effort to keep this simple and avoid worrying about license issues I chose to use a very minimal Linux installation under VirtualBox as my guest environment. The next problem was the software to make use of my data. There were a couple of different ways to handle this: either do all of my log work inside of the virtual machine, or just download the data in the virtual machine and copy it over to my Mac to do most of the work on the log. Starting up a VM is a bit of a pain, so the choice was made to use Mac dive log software, download the data in the VM, and then copy it over.

There are a couple of different formats that might be able to fit the bill: SDE, UDCF, UDDF, and ZXL. SDE is the output format from Suunto Dive Explorer software. There doesn’t appear to be much documentation for the format, but it supposedly contains all the necessary information that a diver might want to recreate a dive log on a computer. Supposedly Subsurface, a dive log software package by Linus Torvalds, can import from SDE, so there should be some source code there that I just haven’t had a chance to dig at yet. ZXL is a format designed by DAN to collect information for scientific studies of diving related injuries. UDCF and UDDF are formats developed by a group of interested divers that seem to have achieved moderate success. UDCF can be considered to be the little brother to the more robust UDDF. Many tools support UDCF, but it lacks official mechanisms to do things like save the pressure in a tank.

The most promising format seems to be UDDF - the Universal Dive Data Format. UDDF, like most interchange formats, sadly uses XML so it is parseable by neither humans nor machines. It is able to contain information about dive profile, temperature, and air usage, which are the main things I want to track. I wasn’t able to find a tool that used libdivecomputer to produce a UDDF file, so I wrote my own, the cleverly named dc2uddf.

dc2uddf is a simple tool that uses libdivecomputer and libxml2 to download data from a dive computer and save it as a UDDF file. That’s all it does. There isn’t much of a user interface, but it works, and it’s written in C, which makes me feel a little more like a programmer than I normally do. I’m certain there are some things that it is doing incorrectly; if folks discover problems, email me or file issues on GitHub and I’ll be sure to fix them. Along the way I’ve also found several defects in the UDDF standard, so I feel like I’m making the standard better too.
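To show roughly what the output side of such a tool looks like, here is a minimal sketch in Python rather than C. It is not dc2uddf's code and does not talk to libdivecomputer at all: it takes a hypothetical list of samples and writes them as waypoints. The element names and the SI units (meters, seconds, kelvin, pascals) follow my reading of the UDDF documentation and should be checked against the spec; a real file also needs sections such as the generator block that I omit here.

```python
import xml.etree.ElementTree as ET


def samples_to_uddf(samples):
    """Build a bare-bones UDDF-style document from dive samples.

    samples: iterable of (seconds, depth_m, temp_celsius, tank_bar) tuples.
    Element names are assumptions based on the UDDF documentation.
    """
    root = ET.Element("uddf", version="3.0.0")
    profiledata = ET.SubElement(root, "profiledata")
    group = ET.SubElement(profiledata, "repetitiongroup", id="rg1")
    dive = ET.SubElement(group, "dive", id="dive1")
    samples_el = ET.SubElement(dive, "samples")
    for seconds, depth_m, temp_c, tank_bar in samples:
        waypoint = ET.SubElement(samples_el, "waypoint")
        ET.SubElement(waypoint, "divetime").text = str(seconds)
        ET.SubElement(waypoint, "depth").text = f"{depth_m:.1f}"
        # Convert to SI units: kelvin for temperature, pascals for pressure.
        ET.SubElement(waypoint, "temperature").text = f"{temp_c + 273.15:.2f}"
        ET.SubElement(waypoint, "tankpressure").text = f"{tank_bar * 1e5:.0f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up samples, purely for illustration.
    demo = [(0, 0.0, 24.0, 200.0), (60, 9.5, 22.5, 195.0), (120, 18.2, 21.0, 188.0)]
    print(samples_to_uddf(demo))
```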

Now I’m at the point where I can download the data using a Linux VM and then copy the data over to my Mac where I can easily import it into the excellent MacDive software, as you can see below.

The Pretty-Pretty Output of MacDive

The Future

I’ve thought about a couple of ways that I could make this a bit more streamlined. The current candidate is to get a Raspberry Pi board and create a small dedicated device for downloading dive computer data. Basically you’d turn it on, put the dive computer within range, press a button on the case for the Raspberry Pi and your data would be automatically downloaded. You could feed it an SD card and it could either use the configuration file on the SD card to upload the data to a remote host or just store a copy on the SD card. However, given the long waits for Raspberry Pis at the current time and my busy schedule I’ll just have to wait on that idea.

I’ve also toyed with the idea of making a service that provides real analytics on dives. Right now there are a couple of different sites that allow you to share dive logs. DiveBoard seems to be the most cross-platform of sites and they’ve even developed a browser plugin based on libdivecomputer to automatically upload your dives from your browser. Aside from their plugin they allow users to upload UDCF, SDE, and ZXL files. They’ve even gone so far as to extend UDCF to allow for pressure information — although this seems to be a clear sign to me that they should consider allowing UDDF uploads.

Another community is Suunto Movescount. This is the successor to Suunto’s Dive Explorer software and reflects the fact that they’ve moved beyond just diving metrics. The problem is that as near as I can see it’s a locked platform. There doesn’t appear to be any way to get your data out of it, or, for that matter, get data from non-Suunto devices into it.

Both of these sites are missing some of the potential for such sites, which is the ability to measure and track rather than just keeping a log. It’s something that sites like RunKeeper are just beginning to explore with efforts like their FitnessReports, but even those reports are rather cursory. There’s a number of metrics that we can calculate both on an individual and across a community that would be highly beneficial to everyone involved - divers, dive shops, travel agents, tour operators, and gear manufacturers, to name just a few. However, the description of these analytics will have to wait for a future post.

On the Facebook Terms of Service

Recently I’ve seen a number of friends and acquaintances post some variation of the following message to their Facebook walls:

In response to the new Facebook guidelines I hereby declare that my copyright is attached to all of my personal details, illustrations, comics, paintings, professional photos and videos, etc. (as a result of the Berne Convention). For commercial use of the above my written consent is needed at all time.

By the present communiqué, I notify Facebook that it is strictly forbidden to disclose, copy, distribute, disseminate, or take any other action against me on the basis of this profile and/or its content. The aforementioned prohibited actions also apply to employees, students, agents and/or any staff under Facebook’s direction or control.

The content of this profile is private and confidential information. The violation of my privacy is punished by law (UCC 1 1-308-308 1-103 and the Rome Statute).

Facebook is now an open capital entity. All members are recommended to publish a notice like this, or if you prefer, you may copy and paste this version. If you do not publish a statement at least once, you will be tacitly allowing the use of elements such as your photos as well as the information contained in your profile status updates.

The intent of these postings is to limit the way that Facebook is legally allowed to use or share your information. On the one hand this makes me happy because it seems as though some people are taking their privacy seriously. On the other hand, it’s very frustrating because of the ham-fisted way people are going about this.

The crux of the problem is that the Facebook Terms of Service supersede any declaration or addendum you attempt to make toward Facebook. Specifically clause 19.5:

Any amendment to or waiver of this Statement must be made in writing and signed by us.

However, you might think there is a loophole that will protect you somehow. Maybe something that Facebook forgot to expressly enumerate. Sorry, that’s covered in clause 19.10:

We reserve all rights not expressly granted to you.

As an additional level of backup the posts typically attempt to cite various portions of the Uniform Commercial Code, most often Article 1. First, it’s important to understand what the UCC is. It is NOT some overarching set of Federal Laws. The UCC is an attempt to harmonize various state laws and make it easier to do business across state lines. In some ways you can think of the UCC a little like the Talmud: the text is important, but so are the comments that go along with it. Unfortunately, the text and comments are copyrighted, so these semi-binding documents are not accessible to the common man (that’s a whole different problem, one which Carl Malamud and Public.Resource.org are attempting to remedy).

Anyway, we’ll ignore for a moment that the entirety of Article 1 of the UCC deals with definitions and ways to interpret further rules, and therefore probably isn’t the thing you’re looking for. The first reference, UCC 1-308 (which is often mistyped 1-308-308, which renders it null in the eyes of the law) reads:

§ 1-308. Performance or Acceptance Under Reservation of Rights.

(a) A party that with explicit reservation of rights performs or promises performance or assents to performance in a manner demanded or offered by the other party does not thereby prejudice the rights reserved. Such words as “without prejudice,” “under protest,” or the like are sufficient.

(b) Subsection (a) does not apply to an accord and satisfaction.

However, the issue with 1-308 is that your Facebook content, while being a creative work, isn’t a performance in most cases. There isn’t a transaction from Facebook unto you for performing such an action, therefore this most likely doesn’t apply.

Second is UCC 1-103; I have no idea how this got mixed up in here:

§ 1-103. Construction of [Uniform Commercial Code] to Promote its Purposes and Policies: Applicability of Supplemental Principles of Law.

(a) [The Uniform Commercial Code] must be liberally construed and applied to promote its underlying purposes and policies, which are: (1) to simplify, clarify, and modernize the law governing commercial transactions; (2) to permit the continued expansion of commercial practices through custom, usage, and agreement of the parties; and (3) to make uniform the law among the various jurisdictions.

(b) Unless displaced by the particular provisions of [the Uniform Commercial Code], the principles of law and equity, including the law merchant and the law relative to capacity to contract, principal and agent, estoppel, fraud, misrepresentation, duress, coercion, mistake, bankruptcy, and other validating or invalidating cause supplement its provisions.

Reading through this I can’t understand why 1-103 was even brought into this. It’s a simple description of the UCC’s purpose, and it highlights that unless the UCC explicitly displaces them, laws covering things like fraud, duress, and bankruptcy stay in effect.

Finally, let’s look at the appeal of the Rome Statute. I’m going to go out on a limb here and say this was added by someone in Europe as the original postings I saw by Americans didn’t include this caveat. I’m assuming that the Rome Statute refers to the Rome Statute of the International Criminal Court. This international agreement established the International Criminal Court and gave the UN authority to investigate crimes when the host nations have chosen not to investigate. For example, the ICC often comes into play with state sponsored genocide.

One could easily argue that the United States has initiated investigations in privacy and Facebook (see the Senate Judiciary Committee Subcommittee on Privacy, Technology and the Law meeting on July 18, 2012 when Franken tore into Facebook’s manager of Privacy and Public Policy). The fact that the US is conducting investigations would seem to deny the ICC any sort of jurisdiction, and would therefore put such an investigation outside the bounds of the International Criminal Court, which really has non-first-world-problems to deal with, like genocide.

In short, if you’re really concerned about your privacy, posting such a message on Facebook doesn’t do anything other than annoy your friends. If you’re really concerned about your privacy on Facebook you need to stop using it altogether.

Important Disclaimer: I am not a lawyer. I’m merely someone who took the time to read the Facebook Terms of Service and look up the relevant portions of the law that people are attempting to quote. None of this should be regarded as real legal advice.

Mining GitHub - Followers in Tinkerpop

Development of any moderately complex software package is a social process. Even if a project is developed entirely by a single person, there is still a social component that consists of all of the people who use the software, file bugs, and provide recommendations for enhancements. This social aspect is one of the driving forces behind the proliferation of social software development sites such as GitHub, SourceForge, Google Code, and BitBucket.

These sites combine a variety of tools that are common for software development such as version control, bug trackers, mailing lists, release management, project planning, and wikis. In addition, some of these have more social aspects that allow you to find and follow individual developers or watch particular projects. In this post I’m going to show you how we can use some of this information to gain insight into a software development community, specifically the community around the Tinkerpop stack of tools for graph databases.

Graph Databases

Graph Databases are in the broad family of NoSQL databases. For about 30 years the dominant form of data storage and access has been through relational databases (e.g. Oracle, MySQL, PostgreSQL, DB2, etc). These present your data as a table with various rows. These tables can have constraints and pointers that map a column in one table to a column in another table through a process called a join. In this way it’s possible to create relations between records and build rich collections of data.
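
To make the idea of a join concrete, here’s a minimal Python sketch using the standard library’s sqlite3 module. The `users` and `repos` tables and their rows are invented for illustration and have nothing to do with the dataset used later in this post.

```python
import sqlite3

# In-memory database with two hypothetical tables linked by owner_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT);
    CREATE TABLE repos (
        id INTEGER PRIMARY KEY,
        name TEXT,
        owner_id INTEGER REFERENCES users(id)
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO repos VALUES (10, 'parser', 1), (11, 'crawler', 1), (12, 'plotter', 2);
""")

# The join maps repos.owner_id onto users.id to relate the two tables.
rows = conn.execute("""
    SELECT users.login, repos.name
    FROM repos JOIN users ON repos.owner_id = users.id
    ORDER BY repos.name
""").fetchall()

print(rows)
```

Each row in the result pairs a repository with the login of its owner, even though the two facts live in different tables.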

Relational databases are very nice and can scale fairly well, but they’re not suitable for all problems. In particular, there may be cases where atomicity can be sacrificed in exchange for higher performance or where the schema of the data may frequently change resulting in severe problems mapping the data to a traditional database.

This has led to a multitude of different solutions for data storage and access. Some of the more popular solutions are Google’s BigTable for distributed data storage, MongoDB for a schemaless document database, and Memcached for distributed object storage and caching. These alternative styles of database are generally lumped into a category of NoSQL, which means either “Not SQL” or “Not Only SQL” or perhaps something else depending on who you speak to.

A specific subclass of NoSQL databases is graph databases. A graph database represents your data as a network of vertices and edges that connect them. Vertices and edges can have various properties that define the object. As opposed to traditional databases where a query crawls over the entire table to find the appropriate elements, queries within a graph database are often done via traversals that walk the graph from one node to another. Examples of graph databases are Neo4j, OrientDB, Trinity, InfiniteGraph, and Dex. A complete description of these databases is beyond the simple explanation here, but Wikipedia has a decent primer on graph databases.
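
As a rough illustration of vertices, edges, properties, and traversals, here’s a toy property graph held in plain Python dictionaries. The names (`vertices`, `edges`, `out`) and the data are made up, and this is not how Neo4j or any of the databases above actually store things; it’s only a sketch of the model.

```python
from collections import defaultdict

# Vertices carry properties. All data here is hypothetical.
vertices = {
    1: {"type": "USER", "login": "alice"},
    2: {"type": "USER", "login": "bob"},
    3: {"type": "REPOSITORY", "fullname": "example/project"},
}

# Edges are directed and labeled: (source id, label, target id, properties).
edges = [
    (1, "FOLLOWING", 2, {"since": 2011}),
    (1, "REPO_WATCHED", 3, {}),
    (2, "REPO_WATCHED", 3, {}),
]

# Index edges by (source, label) so a traversal step is a lookup
# rather than a scan over every edge.
out_index = defaultdict(list)
for src, label, dst, props in edges:
    out_index[(src, label)].append(dst)


def out(vertex_id, label):
    """Return the ids reached by following outgoing edges with this label."""
    return list(out_index.get((vertex_id, label), []))


# Two-step traversal: repositories watched by the people that alice follows.
watched_by_followed = [
    vertices[repo]["fullname"]
    for user in out(1, "FOLLOWING")
    for repo in out(user, "REPO_WATCHED")
]
print(watched_by_followed)
```

The traversal never looks at vertices that aren’t reachable from the starting point, which is the key difference from crawling an entire table.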

Tinkerpop Background

Tinkerpop is a loosely coupled virtual organization centered around Marko Rodriguez that develops infrastructure libraries and interfaces for graph databases.

Tinkerpop has six major projects that are hosted on GitHub:

  • Pipes: A general data flow and processing framework
  • Blueprints: A library to abstract graph database interfaces
  • Gremlin: A domain specific language for traversing graphs
  • Frames: An object mapper for graph databases
  • Rexster: A general web interface for Blueprints supported databases
  • Furnace: A library of algorithms for traversing graphs

The Tinkerpop Network

As part of an ongoing research effort between IBM and the University of Nebraska, Lincoln, I’ve written a tool called GitMiner that can connect to GitHub and pull down information on a set of projects. In celebration of Gremlin hitting 600 watchers on GitHub, I pulled the complete network for all of the Tinkerpop projects from GitHub from May 1-3, 2012. This network contains the following pieces of information:

In a future post I’ll provide more details of how you can use GitMiner to access data on your own projects. I’ll also provide some pointers to other data sets people may wish to analyze.

Getting Started with Analysis

For this analysis we’re going to use a couple of different software packages. First, we’ll be using Gremlin to do some queries of the database and to create exportable networks for further analysis. Additional analysis will be conducted using R. These instructions are generically for people running a Mac, Linux, or other operating system with a posix-like command line interface. If you’re on Windows you should be able to follow along but you’ll need to modify the shell commands. All the tools used in this analysis are cross-platform, open source, and freely available.

Installing Gremlin

I’m not going to repeat everything in the Gremlin docs here, but here’s a brief overview of what you’ll need to do to get going on a Mac or Linux:

cd ~
git clone git://github.com/tinkerpop/gremlin.git
cd gremlin
mvn clean compile package

This assumes that you’ve already got a nice Java development environment set up and that you have Maven installed. If this is your first time using Maven to build any Java packages this can take a long time as it will automatically download all of the dependencies needed to compile and run Gremlin.

Installing R

R is a language for statistical computing. It’s slow, uses strange syntax, and is a memory hog. In short, it’s quite possibly one of the worst possible ways to do this analysis. However, it also is the dominant language in the field and provides a huge number of libraries and tutorials that we’ll use for our analysis.

There are a variety of different ways to interact with R. If you’re on Windows or a Mac the standard downloads of R have a decent graphical interface for editing scripts and running commands. If you’re an Emacs hacker, ESS is a great library that interfaces nicely with R. If working inside of Eclipse is your thing, then use StatEt. Personally, I use R-Studio for most of my work. Further screenshots will be based on R-Studio, but you should be able to follow along with other interfaces.

Installing R-Studio is straightforward. Visit the R-Studio Desktop download page and download and install the version for your platform.

Downloading the Data

I’ve posted the Tinkerpop Social Graph as a Neo4j database. You should visit it and download TinkerpopSocialGraph.20120501.db.tar.gz. After downloading it, go into the directory where you downloaded and compiled Gremlin and extract it. If you’re on a Mac or Linux, the commands will generally be something like this:

cd ~/gremlin
tar -zxvf ~/Downloads/TinkerpopSocialGraph.20120501.db.tar.gz

The dataset is fairly large, about 148MB compressed. It’s quite a bit of data and if you’re a lazy student taking your first SNA class it should have enough data to do a really kick-ass class project. If you’re a grad student and interested in writing a paper on this sort of data email me and we can probably collaborate.

Exploring the Graph

Gremlin provides an interactive interpreter that we can use to explore the graph. You can start it up by running ./gremlin.sh. Then run the following commands. Lines that begin with gremlin> are the lines you should type into the interpreter.

To begin with we’ll connect to the graph and get a specific node from the database. In this case, we’ll pull up the node that represents Marko Rodriguez, the main developer of tools from Tinkerpop.

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = new Neo4jGraph("tinkerpop/tinkerpop.db")
==>neo4jgraph[EmbeddedGraphDatabase [/Users/pwagstro/gremlin/tinkerpop/tinkerpop.db]]
gremlin> marko = g.idx("user-idx").get("login","okram").next()
==>v[8]
gremlin> marko.map()
==>location=Santa Fe, New Mexico
==>sys_last_updated=1335930109
==>blog=http://markorodriguez.com
==>type=USER
==>gravatarId=https://secure.gravatar.com/avatar/fb12ea6a621399613aae4d692533e067?d=https://a248.e.akamai.net/assets.github.com%2Fimages%2Fgravatars%2Fgravatar-140.png
==>followers=57
==>following=12
==>createdAt=1257359950
==>name=Marko A. Rodriguez
==>login=okram
==>fullname=Marko A. Rodriguez
==>gitHubId=148925
==>sys_events_added=1335918859
==>user_type=User
==>totalPrivateRepoCount=0
==>private_gist_count=0
==>sys_last_full_update=1335918850
==>biography=graph algebra, digital librarianship, computational eudaemonics, graph theory, network science, government architecture, network metrics, decision support systems, computational social choice theory, social networks, scientometrics, collective intelligence, semantic networks, ontologies, bibliometrics, information science, swarm intelligence, information markets, peer-review process, computational sociology, knowledge engineering, computer architecture, programming languages, theoretical computing, psychometrics, multi-relational graphs, knowledge representation, reasoning, neural networks, multi-valued logic, neural growth algorithms, recommendation algorithms, distributed computing, ethics.
==>diskUsage=0
==>url=https://api.github.com/users/okram
==>public_gist_count=14
==>collaborators=0
==>email=marko@markorodriguez.com
==>sys_created_at=1335918699
==>ownedPrivateRepoCount=0
==>public_repo_count=0

The values output by marko.map() are the properties of the vertex that represents Marko in the database. With the exception of the properties that begin with sys_, which were added by GitMiner when the data were imported, all of the other properties are obtained directly from the GitHub API.

In a similar vein we can get the vertex that represents Gremlin using the following commands:

gremlin> gremlin = g.idx("repo-idx").get("reponame", "tinkerpop/gremlin").next()
==>v[673]
gremlin> gremlin.map()
==>openIssues=17
==>isFork=false
==>sshUrl=git@github.com:tinkerpop/gremlin.git
==>pushedAt=1335827022
==>sys_last_updated=1335929775
==>type=REPOSITORY
==>masterBranch=master
==>htmlUrl=https://github.com/tinkerpop/gremlin
==>hasIssues=true
==>isPrivate=false
==>createdAt=1258695334
==>description=A Graph Traversal Language
==>name=gremlin
==>cloneUrl=https://github.com/tinkerpop/gremlin.git
==>gitUrl=git://github.com/tinkerpop/gremlin.git
==>fullname=tinkerpop/gremlin
==>watchers=600
==>gitHubId=379199
==>svnUrl=https://github.com/tinkerpop/gremlin
==>homepage=http://gremlin.tinkerpop.com
==>url=https://api.github.com/repos/tinkerpop/gremlin
==>size=341021
==>updatedAt=1335827026
==>forks=30
==>sys_created_at=1335918734
==>hasDownloads=true
==>language=Java
==>reponame=tinkerpop/gremlin
==>hasWiki=true

While this provides a lot of information about individual vertices in the database, it doesn’t provide information about how projects or people are related. We get at this information by looking at the edges connected to a vertex. Within databases such as Neo4j and OrientDB edges are directed and always go from a single source node to a single target node. This query will iterate over all of the outgoing edges from Marko and count up their types.

gremlin> m = [:]
gremlin> marko.outE.label.groupCount(m).iterate(); null
==>null
gremlin> m.sort{a,b -> a.value <=> b.value}
==>EMAIL=1
==>ORGANIZATION_MEMBER=1
==>GRAVATAR=1
==>ISSUE_ASSIGNEE=9
==>FOLLOWING=12
==>REPO_WATCHED=34
==>PULLREQUEST_COMMENT_OWNER=37
==>FOLLOWER=57
==>USER_EVENT=300
==>ISSUE_OWNER=404
==>ISSUE_COMMENT_OWNER=639
==>ISSUE_EVENT_ACTOR=700

There are a lot of types of edges in the database (see EdgeType.java in the project source for the complete list). In this case we’ll focus on the project social network, which is shown through the FOLLOWING and FOLLOWER relationships. At the time of data pull Marko was following 12 people and had 57 followers.

Likewise, we can do a similar query for incoming edges:

gremlin> m = [:]
gremlin> marko.inE.label.groupCount(m).iterate(); null 
==>null
gremlin> m.sort{a,b -> a.value <=> b.value}           
==>REPO_CONTRIBUTOR=6
==>FOLLOWER=9
==>EVENT_FOLLOW_USER=14
==>PULLREQUEST_MERGED_BY=27
==>FOLLOWING=41

When we reverse the direction and look at incoming edges these numbers differ: only nine of the people that Marko is a follower of, and 41 of the people that are following Marko, show up. The difference in these values is because the data only contains the sample of people around the Tinkerpop projects. Thus, we can see that there are 57-41=16 people that are following Marko that don’t show up in the data. This is because they don’t have activity, such as creating issues, commenting on issues, or watching a repository, that would pick them up in our sample. We know they exist, but we don’t have much information about them.
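
The arithmetic can be written out as a short Python sketch. The four counts are copied from the Gremlin output above; the variable names and the second subtraction (for people Marko follows) are my own extension of the same reasoning, so treat that part as an assumption rather than something the data pull reports directly.

```python
# Counts copied from the two groupCount outputs above.
outgoing_follower = 57   # matches the "followers" property on Marko's profile
outgoing_following = 12  # matches the "following" property on Marko's profile
incoming_following = 41  # followers of Marko that are themselves in the sample
incoming_follower = 9    # people Marko follows that are themselves in the sample

# Followers that exist on GitHub but never showed up in the sample.
followers_outside_sample = outgoing_follower - incoming_following

# Same logic applied to the people Marko follows (an assumption by analogy).
followed_outside_sample = outgoing_following - incoming_follower

print(followers_outside_sample, followed_outside_sample)
```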

Your First Graph Traversal

Now that you’ve gotten a feel for getting information about a single vertex in the graph, it’s time to do a simple traversal. To start with, let’s get the names of all of the contributors to Gremlin.

gremlin> gremlin.out('REPO_CONTRIBUTOR').login
==>pauljackson
==>espeed
==>spmallette
==>invalid-email-address
==>joshsh
==>jramsdale
==>NQuinn
==>peterneubauer
==>tinkerpop
==>zcox
==>xedin
==>okram

This query starts with the Gremlin vertex we identified before and then follows all edges labeled REPO_CONTRIBUTOR, which is GitHub’s way of saying someone has code in the project repository. Once we’ve followed all of those edges we can fetch the login name of the users.

In a similar vein, we can get the name of all of the projects that Marko has contributed to using the following query:

gremlin> marko.in('REPO_CONTRIBUTOR').fullname
==>tinkerpop/rexster
==>tinkerpop/furnace
==>tinkerpop/gremlin
==>tinkerpop/pipes
==>tinkerpop/blueprints
==>tinkerpop/frames

Now, we can put the two together. Our first query got a list of all of the people who contributed to Gremlin. Let’s take it a step further and get the list of all of the people who have contributed to projects that Marko has contributed to.

gremlin> marko.in('REPO_CONTRIBUTOR').out('REPO_CONTRIBUTOR').login
==>joshsh
==>jordanlewis
==>okram
==>spmallette
[ OUTPUT TRUNCATED FOR BREVITY ]

This, however, shows many people multiple times. Let’s just count how many times each name appears and then sort the list. This will give a rough idea of the people that Marko works most closely with.

gremlin> m = [:]
gremlin> marko.in('REPO_CONTRIBUTOR').out('REPO_CONTRIBUTOR').login.groupCount(m).iterate(); null
==>null
gremlin> m.sort{a,b -> a.value <=> b.value}
==>espeed=1
==>invalid-email-address=1
==>NQuinn=1
==>zcox=1
==>xedin=1
==>svzdvd=1
==>jtakakura=1
==>sgomezvillamor=1
==>fescale-AC=1
==>hendrens=1
==>countvajhula=1
==>tor5=1
==>pierredewilde=1
==>lvca=1
==>alexaverbuch=1
==>dmitriid=1
==>jordanlewis=2
==>pauljackson=2
==>peterneubauer=2
==>tinkerpop=2
==>spmallette=3
==>joshsh=4
==>jramsdale=4
==>okram=6
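
For readers who don’t have Gremlin handy, the same two-hop count can be sketched in plain Python. The `contributors` mapping and the logins in it are hypothetical stand-ins for the REPO_CONTRIBUTOR edges; only the shape of the computation mirrors the query above.

```python
from collections import Counter

# Hypothetical REPO_CONTRIBUTOR edges: repository -> contributor logins.
contributors = {
    "org/alpha": ["ann", "ben", "cat"],
    "org/beta": ["ann", "ben"],
    "org/gamma": ["ann", "dan"],
    "org/delta": ["eve"],
}


def repos_of(login):
    """Like user.in('REPO_CONTRIBUTOR'): repositories the user has code in."""
    return [repo for repo, people in contributors.items() if login in people]


def co_contributor_counts(login):
    """Like .in(...).out(...).login.groupCount(m): tally logins over shared repos."""
    counts = Counter()
    for repo in repos_of(login):
        counts.update(contributors[repo])
    return counts


counts = co_contributor_counts("ann")
for login, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(login, n)
```

Note that the starting user is counted once per repository, which is why okram=6 in the real output lines up with the six projects listed earlier.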

Taking this a step forward, let’s look at what other projects people in this set watch. We need to branch out another layer, but first we need to be careful and add in a dedup() in the pipe to ensure that we’re not counting some projects too often.

gremlin> m = [:]; 
gremlin> marko.in('REPO_CONTRIBUTOR').out('REPO_CONTRIBUTOR').dedup().out('REPO_WATCHED').fullname.groupCount(m).iterate(); null            
==>null
gremlin> m.sort{a,b -> a.value <=> b.value }
[ OUTPUT TRUNCATED FOR BREVITY ]
==>tong/hxmpp.lop=3
==>twitter/flockdb=3
==>twitter/gizzard=3
==>banker/mongulator=3
==>dgreco/graphbase=3
==>espeed/bulbs=4
==>tinkerpop/tinkubator=4
==>nerlo/nerlo=4
==>nathanmarz/storm=4
==>neo4j/community=5
==>tinkerpop/furnace=6
==>tinkerpop/frames=6
==>tinkerpop/rexster=8
==>tinkerpop/pipes=9
==>tinkerpop/gremlin=11
==>tinkerpop/blueprints=18
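
To see why the dedup() step matters, here’s a small Python sketch with invented data. Two people share two repositories, so without de-duplication each of them is visited twice and everything they watch is double counted. The function name `watch_counts` and all of the data are hypothetical.

```python
from collections import Counter

# Hypothetical data: repository -> contributors, and person -> watched projects.
contributors = {
    "org/alpha": ["ann", "ben"],
    "org/beta": ["ann", "ben"],
}
watched = {
    "ann": ["lib/one"],
    "ben": ["lib/one", "lib/two"],
}


def watch_counts(login, dedup):
    # Walk user -> repositories -> contributors. A person shows up once
    # for every repository they share with the starting user.
    people = [
        person
        for repo, members in contributors.items()
        if login in members
        for person in members
    ]
    if dedup:
        # Keep the first occurrence of each person, preserving order.
        people = list(dict.fromkeys(people))
    counts = Counter()
    for person in people:
        counts.update(watched.get(person, []))
    return counts


without_dedup = watch_counts("ann", dedup=False)
with_dedup = watch_counts("ann", dedup=True)
print(without_dedup)
print(with_dedup)
```

With de-duplication each count is the number of distinct people watching a project, which is the number we actually care about.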

It’s no surprise that the projects in the Tinkerpop stack are the most watched projects among the developers who work on Tinkerpop projects. However, there are a few other interesting pieces of software that seem popular. Among others, Storm is a realtime computation system written in Java and Clojure that’s great for munging through thousands of logs. Bulbs is a nifty Python interface to many graph databases. Nerlo is a mechanism to use Neo4j from within Erlang. My apologies if my descriptions are wrong, as some of these projects are new to me too.

That’s enough about traversals in the data for now. I’ll leave you to explore the data on your own. In future articles I’ll cover more about actually mining the relationships.

Exporting a Graph to GraphML

While graph databases and Gremlin are very useful for storing your data and doing traversals on data, they’re not always well structured for doing computation on the data and gaining insight over a wide number of projects. In grad school I studied with one of the leaders in the field of social network analysis, and now that she’s given me a hammer, it seems like everything looks like a nail. In this section I describe how to get your data out of a graph database and into a program like R.

A common interchange format for social network data is the GraphML format - an XML specification for describing graphs. It was first used by individuals interested in visualizing large scale graphs. As such, it has significant drawbacks that make it less than ideal compared to other formats such as DynetML (e.g. only a single graph, no nesting, edges must all be directed or undirected). In any case, it’s what we have, so we’ll use it. Fortunately, both Gremlin and the igraph package for R, which we’ll be using later, support GraphML.

I’ve created a simple script that you can run in your current Gremlin session. You should be able to just paste this code into your running Gremlin session and it will save the network to a file called follower.graphml.

The astute observer will notice a couple of things about this. First, we’re using a specialized method to get all of the users associated with the Gremlin project on GitHub. However, we’re not following all of the ways a user can be associated. For example, we’re not looking at issues, pull requests, commits, or other events.

Secondly, we’re skipping a lot of edges and vertices. In this case we’re skipping every edge that doesn’t lead to a user in this set. The reason for that is because if we didn’t skip these edges we’d have a network with 30,000+ nodes as opposed to the 606 in this network. While it’s possible to do analysis on networks of that size, it is much slower and would prove to be a bit of a distraction here.
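
The edge-skipping idea can be sketched in Python. This is not the Gremlin export script; it’s a minimal, hypothetical stand-in that writes a GraphML document containing only the follow edges whose endpoints are both in a chosen set of users. The function name, the ids, and the logins are all invented for illustration.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def induced_follower_graphml(users, follows):
    """Build GraphML for `users` (id -> login) and the follow edges among them.

    Edges that touch anyone outside `users` are skipped.
    Returns the XML text and the number of edges kept.
    """
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the one vertex attribute we export.
    ET.SubElement(root, "key", {
        "id": "login", "for": "node",
        "attr.name": "login", "attr.type": "string",
    })
    graph = ET.SubElement(root, "graph", {"id": "G", "edgedefault": "directed"})
    for uid, login in users.items():
        node = ET.SubElement(graph, "node", id=str(uid))
        ET.SubElement(node, "data", key="login").text = login
    kept = 0
    for src, dst in follows:
        if src in users and dst in users:
            ET.SubElement(graph, "edge", source=str(src), target=str(dst))
            kept += 1
    return ET.tostring(root, encoding="unicode"), kept


# Hypothetical sample: user 99 is outside the set, so two edges are dropped.
sample_users = {1: "ann", 2: "ben"}
sample_follows = [(1, 2), (1, 99), (99, 2)]
xml_text, kept = induced_follower_graphml(sample_users, sample_follows)
print(kept)
```

Writing `xml_text` to a file would give something a GraphML reader could load, with the skipped edges simply absent.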

Network as visualized in Cytoscape

This finishes the section of the article dealing with Gremlin from the command line. From here on out the operations are done in R.

Very Important: Before exiting Gremlin run the command g.shutdown() to close the graph database. If you don’t do this then you’ll have to wait for a recovery process the next time you look at the data.

Examining the Data in R

Within R the first thing to do is to make sure you have the igraph package installed. You can do this by running the following command and following the directions:

install.packages('igraph')

Now that we’ve got igraph installed, it’s time to have some fun. First, we need to tell R to use the functions inside of the igraph library and to load our data.

library(igraph)
setwd("~/gremlin")
graph <- read.graph("follower.graphml", format="graphml")

First let’s get some summary information. This can be done with the ecount and vcount functions. It shows that in the current network there are 510 edges and 606 nodes.

> ecount(graph)
[1] 510
> vcount(graph)
[1] 606

This network has a lot of isolates in it. That’s somewhat to be expected as not every user utilizes the follower feature of GitHub. The following commands will remove isolates from our data set and result in a network of 236 vertices and 510 edges.

> isolates <- which(degree(graph, mode = 'all') == 0) - 1
> graph <- delete.vertices(graph, isolates)
> summary(graph)
Vertices: 236 
Edges: 510 
Directed: TRUE 
No graph attributes.
Vertex attributes: location, sys_last_updated, type, blog, gravatarId, following, followers, createdAt, name, login, fullname, gitHubId, sys_events_added, user_type, totalPrivateRepoCount, private_gist_count, biography, sys_last_full_update, diskUsage, url, public_gist_count, collaborators, email, sys_created_at, company, ownedPrivateRepoCount, public_repo_count, id.
Edge attributes: sys_created_at, id.
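
The isolate removal is simple enough to sketch without igraph. In this hypothetical Python version a vertex is kept only if it appears at either end of some edge; the vertex names and the `drop_isolates` helper are made up for illustration.

```python
def drop_isolates(vertices, edges):
    """Return the vertices that touch at least one edge, in their original order."""
    touched = set()
    for src, dst in edges:
        touched.add(src)
        touched.add(dst)
    return [v for v in vertices if v in touched]


# Hypothetical network: "d" and "e" follow nobody and have no followers.
toy_vertices = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c")]

connected = drop_isolates(toy_vertices, toy_edges)
print(len(connected), len(toy_edges))
```

As in the R output above, the vertex count drops while the edge count stays the same, since an isolate by definition has no edges to lose.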

First, let’s get an idea of the degree of the vertices in the graph. This command creates a histogram that clumps vertices by the number of edges they have. We see that only a very few have many edges; most have fewer than 10 edges. I should stress, this does not reflect the total number of people those accounts follow, rather it reflects only the total number of people within Gremlin that each account follows.

hist(degree(graph))

Now, let’s look at a couple of the classic centrality measures. Betweenness centrality calculates the proportion of all shortest paths between vertices that a particular vertex sits on. If communication had to go person to person and could only go along connections that are established, these people would prove to be key in the network.

results <- data.frame(login=get.vertex.attribute(graph, "login"))
results$betweenness <- betweenness(graph)
results$evcent <- evcent(graph)$vector

Now that we’ve calculated those centralities, let’s take a look. We’ll start with betweenness. According to this data the user that has the most central role is spmallette, an active participant in the Tinkerpop communities, followed by ahzf, a developer who is working on .Net ports of many Blueprints services. In third place is a research account from a university in Korea. This account shows up all over the place and I generally consider it to be a spam account. It follows tens of thousands of users and therefore creates artificially short paths between users, boosting its score in the process. In fourth place is Marko, the leader of Tinkerpop.

> results[order(-results$betweenness), c("login", "betweenness")][1:10, ]
         login betweenness
165 spmallette    8415.414
156       ahzf    8332.422
134     hcilab    7963.272
16       okram    6492.181
107  igrigorik    5793.494
235  joshbuddy    4623.192
87      collin    3305.313
172   pangloss    2442.794
219   stonegao    2165.359
60        dann    1876.813
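
If you’re curious what a betweenness calculation involves, here’s a compact Python sketch of Brandes’ algorithm for directed, unweighted graphs. It’s an illustration of the idea, not igraph’s implementation, and the tiny diamond-shaped graph is hypothetical. The scores are unnormalized sums, so they’re comparable in spirit to the numbers in the table above.

```python
from collections import deque


def betweenness(nodes, edges):
    """Unnormalized betweenness for a directed, unweighted graph (Brandes)."""
    adj = {v: [] for v in nodes}
    for src, dst in edges:
        adj[src].append(dst)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and remembering which vertices precede each vertex on them.
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, passing credit to predecessors
        # in proportion to the share of shortest paths that run through them.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical diamond: two equally short routes from "a" to "d".
diamond = betweenness(["a", "b", "c", "d"],
                      [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
print(diamond)
```

In the diamond, "b" and "c" each sit on half of the shortest paths from "a" to "d", so they split the credit.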

In the betweenness centrality model, which is a directed model, users who follow few additional users are penalized. As Marko only follows a handful of users, his score is low, despite the fact that many people in the community follow him.

However, when we use eigenvector centrality, which is a more robust centrality metric, we find a more interesting picture. Marko and peterneubauer are the top individuals, followed by spmallette and joshsh, additional developers of Tinkerpop.

> results[order(-results$evcent), c("login", "evcent")][1:10, ]
            login    evcent
16          okram 1.0000000
167 peterneubauer 0.9739565
165    spmallette 0.6607181
13         joshsh 0.5872818
107     igrigorik 0.5665205
226         thobe 0.5273195
219      stonegao 0.5220129
14   alexaverbuch 0.5173995
178       nawroth 0.4466164
156          ahzf 0.4284547
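
Eigenvector centrality can be approximated with a few lines of power iteration. The Python sketch below treats the graph as undirected and rescales so the top score is 1, like the 1.0000000 for okram above; igraph’s own routine is more sophisticated and handles direction, so take this as a simplified illustration on a hypothetical star-shaped graph.

```python
def eigenvector_centrality(nodes, edges, iterations=200):
    """Power iteration on an undirected graph, scaled so the maximum score is 1."""
    neighbours = {v: set() for v in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    score = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Each vertex takes its own score plus the scores of its neighbours.
        # Including its own score (iterating with A + I) leaves the leading
        # eigenvector unchanged but stops bipartite graphs from oscillating.
        new = {v: score[v] + sum(score[u] for u in neighbours[v]) for v in nodes}
        top = max(new.values())
        score = {v: x / top for v, x in new.items()}
    return score


# Hypothetical star: one hub connected to three leaves.
star = eigenvector_centrality(
    ["hub", "leaf1", "leaf2", "leaf3"],
    [("hub", "leaf1"), ("hub", "leaf2"), ("hub", "leaf3")],
)
print(star)
```

A vertex scores highly when it’s connected to other high-scoring vertices, which is why being followed by well-connected people counts for more than the raw number of followers.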

There’s always more that you can do with these tools, and in the future I’ll discuss some more, but for now I hope this has given you a taste for how to mine social networks from GitHub. Enjoy!

Static Bloggin

This is my first new post written in markdown for the static version of patrick.wagstrom.net/weblog. The only reason I was running both PHP and MySQL on my server was to host WordPress, which became a pain in the ass with all of the upgrades. This eliminates all of those nasty security holes and allows me to focus a little bit more on just writing. Which is what a weblog is supposed to be.

I’m running Octopress, which is a blogging framework based on Jekyll. The downside is that it cannot accommodate dynamic elements, so all comments need to be farmed off to an external service. Fortunately, I was already using IntenseDebate. With only a very small amount of work I was able to migrate everything over to the new system. Perhaps most substantial is that I had to write a patch to Octopress to support IntenseDebate. I’ve since created a pull request for IntenseDebate support on GitHub. Hopefully the authors will see fit to pull it in.

So yeah, it’s a little more work now that I don’t have a web interface to do things like manage images and remember my links, but I can write posts from any text editor, which is VERY handy for when I’m stuck in airplanes and too cheap to pay for WiFi.

Overall I’m not certain if this is a good idea. In the past I’ve extolled various reasons why you shouldn’t try to do it yourself. However there is also merit to doing it yourself. Up until this point I’ve been an active Ruby Hater, and it’s becoming clear that I should at least be peripherally aware of what Ruby can do. Although my extensions to this point have not involved hacking Ruby, it might at some point in the future.

So, for now, enjoy the fact that every post is showing up again in your RSS reader and marvel at the beautiful new theme, with no more worries about annoying security faults and a faster response time.

I am not a climatologist, and neither are most of these people

In the past couple of days I have twice received an opinion piece from the Wall Street Journal which suggests that the models used for estimating climate change are grossly pessimistic and that we really need not be concerned with anthropogenic climate change. It was signed by sixteen scientists and engineers. The problem is that almost none of these people are climatologists - which is the field they are claiming is producing invalid science. Anyone can call themselves a scientist - having a Ph.D. helps - but, just because you are a scientist does not mean that you can speak authoritatively on all issues related to science. Stephen Hawking is a brilliant scientist, but he studies astrophysics, not climatology. I trust him on a lot of things, but I wouldn’t trust him on climate change. Nor would I trust Albert Einstein, Louis Pasteur, Marie Curie, or Isaac Newton on issues of climate change.

So, who are these climate change deniers that have the right frothing at the mouth again? Let’s take a quick look.

  • Claude Allegre, former director of the Institute for the Study of the Earth, University of Paris - Is a geochemist, which might make him qualified. It’s hard to tell as he has spent most of his time doing political work recently. He appears to have a strong contrarian streak, such as in 1996 when he insisted that asbestos was harmless and that anger over it was caused by mass hysteria. The last time I checked the link between asbestos and mesothelioma was pretty firm.
  • J. Scott Armstrong, cofounder of the Journal of Forecasting and the International Journal of Forecasting - This one gave me a decent chuckle. At first I thought he was a climate forecasting scientist, nope. Armstrong’s expertise is in marketing style forecasting, as in trends. His journal is also published by Elsevier. I think I threw up a little in my mouth.
  • Jan Breslow, head of the Laboratory of Biochemical Genetics and Metabolism, Rockefeller University - A medical doctor and not a climatologist. Breslow is perhaps most well known for his work on heart disease. This is great work he has done, but it’s not atmospheric science.
  • Roger Cohen, fellow, American Physical Society - It’s difficult to find information on Cohen. Prior to retirement he worked for ExxonMobil research, but that’s about all I can find. I can’t seem to find any publications on any issue. However, he does have a very common name, making him hard to google. He frequently consorts with William Happer, who appears later in the list.
  • Edward David, member, National Academy of Engineering and National Academy of Sciences - As a member of the National Academy of Engineering I have great respect for Dr. David. However, he is an electrical engineer and has been largely retired from research for more than 20 years. Did I mention he was director of research at Exxon from 1977-1985?
  • William Happer, professor of physics, Princeton - Seems to have moved away from research as he’s advanced in his career. During his prime he was a leader in the field of spectroscopy. Which, in case you didn’t know, has nothing to do with climate change. During his 2009 testimony to Congress he indicated that an increase in CO2 is good for the planet because it’s good for plants. Yes, very much like the Competitive Enterprise Institute’s “CO2, We Call it Life” video.
  • Michael Kelly, professor of technology, University of Cambridge, U.K. - Kelly primarily works on semi-conductors, specifically SRAM. He is not a climatologist or even a chemical engineer or chemist.
  • William Kininmonth, former head of climate research at the Australian Bureau of Meteorology - Kininmonth is, perhaps, a meteorologist, although there is little information easily available about his activities. It is known that he is not a prominent researcher in any field and his “Australasian Climate Research Institute” is run out of his home and appears to be only his own writings.
  • Richard Lindzen, professor of atmospheric sciences, MIT - Lindzen is perhaps the most qualified individual on this list. He is well known for his skepticism of anthropogenic climate change. He stands out from the other signatories because he can speak with true scientific authority on the issue.
  • James McGrath, professor of chemistry, Virginia Technical University - McGrath studies polymers and fuel cells. He is a scientist, but not a climate scientist.
  • Rodney Nichols, former president and CEO of the New York Academy of Sciences - This one took me a while longer to find out information about. I believe that Dr. Nichols is a physicist from Harvard, which means he could be a climatologist. However, looking at his publication record for the last 40 years you’ll find that most of his work is dealing with science and technology policy – issues that are close to my heart. However, this doesn’t qualify him as a climatologist. I’m sure he is well learned in a variety of topics, but I don’t believe he has a deep knowledge of the current research on climatology.
  • Harrison H. Schmitt, Apollo 17 astronaut and former U.S. senator - As an astronaut Harrison Schmitt was on the mission that took the famous “Blue Marble” picture of the earth. In fact, evidence indicates that Schmitt most likely took the photo that has been credited with being a critical catalyst for the environment movement in the 1970’s. Outside of his astronaut career he was a university professor, geologist, and senator from New Mexico. None of these are related to the atmosphere or climate science.
  • Nir Shaviv, professor of astrophysics, Hebrew University, Jerusalem - Shaviv is primarily an astrophysicist known for his work on cosmic rays and luminosity. He has his own theory of global warming which says that the cosmic rays of the sun are responsible for global warming. His theory has not been widely accepted and has faced great challenges because of the fact that the solar output has been decreasing since the mid 1980’s.
  • Henk Tennekes, former director, Royal Dutch Meteorological Service - Also a professor of Aeronautical Engineering at Penn State, Tennekes is most well known for his work on turbulence in airflows. In fact, he literally wrote the book on it. Unfortunately, that’s not a book on climate change. He was reportedly ousted from the Royal Dutch Meteorological Service for his denial of climate change and his sometimes reliance on biblical texts for justification. Look, I’m a Christian and a scientist, but I realize that I can’t use biblical texts to justify my work; that’s not how science works.
  • Antonio Zichichi, president of the World Federation of Scientists, Geneva - Primarily a sub-nuclear physicist who has worked at labs like CERN and Fermilab. His title of President of the World Federation of Scientists is self-bestowed, as he is the founder. It should not be considered an analog to the Federation of American Scientists. He is a highly cited researcher, and has done significant work in popularizing science in Italy, but he is not a climatologist.

Out of the sixteen people listed I count one atmospheric scientist, Lindzen, and a half, Allegre. In any community of scientists you’ll have dissenters. The fact that they could round up only one and a half climate scientists for this letter should show you just how strong the case for global warming really is. Want more evidence? 255 scientists, all members of the National Academy of Sciences, including 11 Nobel laureates, wrote a scathing response that was rejected by the Wall Street Journal and later published in Science.

Looking for Summer Interns in the Software Technology Group at IBM TJ Watson Research Center

IBM TJ Watson Research Center (Photo by Simon Greig)

Are you one of the best software engineering students in the world? Do you dig mining software repositories? Are you a wizard at social network analysis? Interested in a great summer job looking at what makes software teams work? Even better, want to work with me?

The Software Technologies Group at the IBM TJ Watson Research Center in Hawthorne, NY is looking for summer interns! I started at IBM back in 2007 as an intern and had a great time meeting some of the smartest students from around the world. Students are given a chance to work with the best technology in the world and often end up submitting papers to ICSE, CHI, CSCW, or FSE as a result of their work with us.

We suggest that you apply online. If you’ve got questions, you can email me directly for more information. But hurry up, as we’re going to start our selection and interview process soon.

PS. For faculty, this is a great way to offload students for the summer if you’d like to take off to St. Barth’s for a few months.