My Delusional Dream

Open Source and Technology Predictions for 2014

This marks my fourth attempt at being a technology pundit after previous attempts in 2010, 2011, and 2012. I honestly have no idea what happened to my 2013 predictions or if I ever made them. As usual, these predictions are intended to be concrete and testable, rather than vague things like “the singularity happens” without providing a definition of the Singularity.

  • America finally sees chip and pin credit cards

It’s no longer a rarity to find a credit card with a chip on it, but they’re almost always chip and signature cards. While the chip provides additional security as the actual credit card number is never sent, merely a cryptographic challenge/response, the signature still is easily forged and often ignored.

The Target data breach in November and December 2013 showed just how fragile and insecure our credit card system is. Within days of the breach, banks were buying their own credit card numbers from black market sites to notify their customers. Banks are proactively issuing new cards to customers, but this leaves them open for the next massive security fault. A shift to chip and pin would make the system massively safer - but this requires infrastructure changes. Expect to see this happen quickly as at least one major bank announces that fraudulent charges resulting from magnetic stripe scans of cards that have a chip will be borne by the vendor, not the bank. This will cause an uproar, but it needs to happen.

  • Hulu is bought by HBO or Verizon

Hulu has been trying to sell itself for a couple of years. It’s always on-again and off-again. While Hulu is attempting to move forward with original series, it’s not having the same luck that Netflix has had with its high profile series (e.g. “House of Cards”, “Arrested Development”, and “Orange is the New Black”). This is partially because of a vision problem, and partially because Hulu currently lacks access to the fine-grained customer data that Netflix has. Fortunately for Hulu’s financial state, HBO and Verizon are looking to expand in this area. HBO is looking to fend off challenges from Netflix and, while it has continually produced high quality series, it may be struggling, and access to the data behind Hulu would bolster HBO’s demographic knowledge. HBO-provided content would also give Hulu a much needed shot in the arm and provide HBO a lifeline beyond the slowly dying medium of cable television.

In contrast, Verizon is facing challenges from Comcast as it attempts to become a communication and data hub. Verizon has tried to launch their own television networks and distribution, but their streaming platform is still rather poor. Acquisition of Hulu would give Verizon a high quality streaming platform, inroads into Comcast’s customer base, and a nascent studio that develops content.

There’s an outside chance Amazon could purchase Hulu - it would be a great expansion of Amazon Prime. However, I think that if Jeff Bezos wanted Hulu under the Amazon banner it would have happened already. This most likely means that Amazon has big plans for whatever will become of Amazon Prime Video, as Bezos doesn’t play small. How exactly Bezos intends to notify his 20 million+ Amazon Prime customers that he has high quality video streaming remains a bit of a mystery.

  • SteamOS will finally drive the year of the Linux Desktop

Although Linux has been successful for years as a server platform, the joke among supporters of the Linux desktop has always been “next year will be the year of the Linux desktop”. Valve’s work on SteamOS will finally make that a sort of reality. Gabe Newell has been vocal about migrating away from Windows because of the mess that is Windows 8, and it appears that developers are becoming increasingly frustrated as well. Steam for Linux has already made it possible to play hundreds of games on Linux with no additional effort - expect to see this number continue to grow in 2014. These games will prove a necessary component as Valve goes head to head against Microsoft and Sony for control of the living room. Along the way Valve will pick up something that Linux users have longed for - support for Netflix.

  • Dual operating system tablets will land with a resounding thud

Continuing with the discussion of the woes of Windows 8 - we’re hearing that many vendors plan on announcing dual platform tablets at CES. I’m going on record right now and saying that this is a stupid idea that will fail in the marketplace. End users don’t want to think about having to reboot a tablet to switch between Android and Windows. They just want it to work. The added cost of the dual platform tablets, combined with the fact that I’m certain they’ll have uniformly terrible user experiences and build quality, ensures that this is an idea that is dead on arrival.

  • Microsoft merges Windows Phone and Windows RT

Recently Microsoft has been running commercials describing the one consistent interface across all their devices - phones, tablets, and desktops. The problem is that people not only want one interface, they want one system. Apple and Google did the right thing by not creating a new operating system when they created tablets (although, if we’re splitting hairs, one might say that Android Honeycomb was a new operating system). People want to have their apps at their fingertips - and this means that they either have the phone operating system on the tablet or the desktop operating system on the tablet. Microsoft attempted to have it both ways with Windows RT and satisfied no one - expect Windows RT to be folded into Windows Phone so that Microsoft’s tablets and phones finally share a single system.

  • Snowden will get asylum in Brazil or Germany

Greenwald has hinted that only a small portion of the overall trove of Snowden documents has been released. Both countries have acted as though they’ve been surprised about the revelations that the United States was spying on them, although the reality is that their security apparatus should be gutted wholesale if they didn’t already know this. Nonetheless, while most of these revelations won’t do anything to change the behind the scenes action, politicians in both countries will seek to strike a “special” deal with Snowden that at least gives them the perception of getting inside information on the spying scandal.

  • A major cloud provider will suffer a data breach exposing data on VMs to other customers

This is one that we haven’t heard much about yet, but I’d imagine we’ll see more about it this year. Many companies, both large and small, are moving their operations to shared cloud hosting environments such as Amazon EC2, SoftLayer, and Rackspace. Recently it was found that Digital Ocean, a smaller hosting service, had a flaw in their systems that didn’t securely delete VMs. This led to the possibility that someone could see the data of other customers. Fortunately, it seems as though this was relatively innocuous, but given the potential for problems here, I’d say that this is the first of many such problems we’ll hear about in the future.

  • At least one Silicon Valley “luminary” will lose his job because of comments made about women in tech

There recently was a bit of a brouhaha regarding comments that Paul Graham made about women and innovation, although, according to Graham, his comments were taken out of context. Nevertheless, the whole firestorm was oddly reminiscent of Donglegate and it’s clear this is still a really touchy issue - and for good reason. Given the fact that not all tech luminaries are as socially smart as they are technologically smart, it’s only a matter of time until one of them says something really stupid that will be taken even more out of context than Paul Graham’s comments.

  • An automobile manufacturer will suffer a major security breach in the software installed on their vehicles

More and more it seems as though the software in automobiles is the differentiating factor between brands and even within brands. While software had a simple start in cars, it’s not unheard of for cars such as the Chevy Volt to have 10 million lines of code encompassing software under dozens of different licenses. Unfortunately, building good software is hard. Building secure software is even harder. Remote access through apps, such as the OnStar Remote Link app, provides numerous new attack vectors. It doesn’t take a genius to see where this is going.

  • A major credit card data breach will lead to legislation that shifts the onus of credit repair from the consumer to the vendor that leaked the data

Right now if your credit card information is stolen through a vendor, you’re not responsible for the charges, but you are responsible for cleaning up any mess on your credit report that may result. Quite frankly, this is a massive abdication of responsibility, and I think that once we see a large scale data breach that gets enough people really angry (it could be Target), we’ll see legislation that finally asks the folks who suffered the data breach to bear the costs of fixing it. This goes beyond the usual and useless “1 year of free credit monitoring”.

  • The House and the Senate will remain largely status quo

I don’t think this needs any dramatic explanation. The House is rigged enough that there are few competitive districts. The Senate will remain in Democratic hands not because the Democrats will run a set of great campaigns, but more because the Republicans insist on nominating crazy people.

What are your thoughts about these predictions? Have I started to go too easy? Should I go back to focusing almost exclusively on open source? Do you have predictions of your own that you’d like to share?

Working on Thanksgiving - We’re All Part of the Problem

The days are once again getting depressingly short here on the East Coast. It’s barely light out when I enter my office in the morning and it’s completely dark by the time I leave. This means only one thing - the Christmas shopping season is upon us.

This year, as it has been since K-Mart first decided to be open on Thanksgiving sometime in the 1990s, a smattering of retail stores have chosen to open up to consumers on Thanksgiving evening and stay open round the clock until late the next night. Can’t you just taste the delicious frenzy of consumerism?

Also this year, just like in previous years, there are people who believe it’s a travesty that the stores are open on Thanksgiving. Surprisingly the cries from both the left and the right are similar - “the workers shouldn’t have to work on Thanksgiving!” - although the reasoning is fairly different. The left believes it to be a workers’ rights issue while the right often sees it as a sign of the decline of the nuclear family unit. After all, if Mom or Dad needs to be at work at 6pm on Thanksgiving then they’re not going to be hanging out with their family and eating food and watching football.

But here’s the rub - people act as though retail workers are the only people who would ever need to work on Thanksgiving. Yes, Walmart deciding to open at 6pm will require hundreds of thousands of people to go to work on Thanksgiving evening. I, honestly, don’t think that’s going to make much of a dent in the number of people working. Here’s why:

We live in a connected society. When we’re spending a relaxing Thanksgiving at home we’re still doing activities that require people to be at work. Turkey Trots on Thanksgiving morning have become almost a rite over the course of the past decade. These races require large numbers of law enforcement and medical personnel to either work or “volunteer” to work the races. If you’re at home you’re almost certainly using electricity, which means that there needs to be a small number of people at the power plant. Those football games don’t just happen; they require thousands of people to make a successful event - from the players, to vendors, to parking staff, to the crews in the trucks that make sure the production goes off without a hitch. If you think that you’ll clear your conscience by shopping on Amazon instead, you should know that someone is wearing a pager and probably getting a phone call because the crush of people like you is going to cause some system to fail. Ditto if you decide to sit down and watch Netflix.

In fact, nearly everything we do on a daily basis requires someone to work. That’s not a terrible thing - that’s just how society works. To act like folks are suddenly shocked because retail workers have to work on Thanksgiving evening is missing a huge part of the issue. We’ve long had people working behind the scenes on Thanksgiving, now we’ve got a lot more people working in the front. So, before you say that “everyone deserves a holiday”, you should ask yourself if you’re really more concerned about the welfare of the worker and his/her family, or if you’re trying to cover up for your own guilt in being part of a larger problem of an interconnected society.

Hacking JazzHub Projects From the Command Line

JazzHub is a new project from IBM that provides the power of Rational Team Concert (RTC) in a cloud environment. If you’re working on a public project it’s entirely free. In fact, it’s free for private projects too, at least until the end of 2013.

The obvious question that most people are going to ask is “Why should I use JazzHub if I know how to use GitHub?” This is a perfectly fine question and I don’t fault anyone for asking it. In fact, for most projects you’re going to be just fine using GitHub. It has a lot of great features for collaborative software development that have truly revolutionized the way that software is developed. JazzHub is designed for projects that want and need more robust mechanisms for project planning and management. A project on JazzHub can utilize the full power of Rational Team Concert’s excellent support for agile processes to really plan out their software development processes. If it doesn’t feel like software development to you without words like “scrum”, “iteration”, and “sprint”, then you’ll feel right at home using JazzHub.

Rational Team Concert is designed to be used with a thick client that is a nice extension to the Eclipse Integrated Development Environment. As licensing for RTC is done on the server side, you can download and use RTC for work on JazzHub for free. Hop on over to the jazz.net downloads page to grab the newest version of Rational Team Concert for your combination of architecture and operating system.

One seriously underutilized feature of RTC is the inclusion of a very robust command line program for interacting with RTC’s powerful source code management (SCM) facilities. This article attempts to explain what you need to do to create a project on JazzHub and then commit your code from the command line. It’s a perfect way to work if you’re comfortable with dropping down to a shell to manage your files inside of an SCM, as is often the case with Git and Subversion.

Creating A Project

Before we get too far, we’ll start by creating a public project on JazzHub called scmtest. The full naming convention for JazzHub projects is USERNAME | PROJECTNAME. Yes, this means that there is a pair of spaces and a pipe in the project name (as opposed to GitHub, which uses USERNAME/PROJECTNAME). It’s not that big of a deal if you remember to use quotes when working from the command line.

JazzHub Create Project Screen

Options Selected for Creating Sample Project

After a minute or two JazzHub will return with your brand new shiny project page. Congratulations! Now you’re on your way to hacking it from the command line.

Project Successfully Created

Setting up lscm

By this point you should have a project on JazzHub and also have downloaded a copy of the RTC client for your machine. You’ll need to unpack the archive somewhere; in my case I unpacked it to /Applications/RTC-4.0.4. You’ll find a file hierarchy that looks a little like this:

/Applications/RTC-4.0.4
`-- jazz
    |-- client
    |   |-- eclipse
    |   |-- license
    |   `-- sametime
    |-- properties
    |   `-- version
    `-- scmtools
        `-- eclipse

The lscm program resides in jazz/scmtools/eclipse. There are a variety of ways that you can make it easy to execute. I symlinked it to /usr/local/bin/lscm using the following command:

~> ln -s /Applications/RTC-4.0.4/jazz/scmtools/eclipse/lscm /usr/local/bin/lscm

If you’re using a newer version of RTC, specifically 4.0.3 or newer, you’re going to have a much better time because there’s a native code version of lscm rather than the old version that fired up a Java virtual machine on every invocation. This makes for a much faster experience.

Using lscm

The first thing you’ll need to do is log in to the RTC server. You’ll need to know the CCM server for your JazzHub project; this is included in the original email you were sent, and in my case (and probably your case too) it was https://hub.jazz.net/ccm01. You’ll also want to create an alias for JazzHub - in my case it’s just jazzhub - and then tell it to cache your credentials so you don’t need to log in all the time.

~> lscm login -r https://hub.jazz.net/ccm01 -n jazzhub -u pwagstro -c

Now you can start creating the directory structure for the project. Before we dive in too far, I recommend reading up on how the RTC SCM works, because it’s a bit of a brain twist from what you’re used to if you normally work with Subversion or Git.

Unlike Git and Subversion where you first start with files on your local drive and then later send them to the remote repository, with RTC’s SCM you start by loading a remote workspace. In the case of JazzHub a workspace is created by default that is called USERNAME | PROJECTNAME Workspace. The following commands will create an empty directory and load the workspace.

~> mkdir scmtest
~> cd scmtest
~/scmtest> lscm load -r jazzhub --all "pwagstro | scmtest Workspace"
Nothing to load. File system unmodified.

In a mechanism similar to Subversion, RTC doesn’t want you to have files at the top level of your directory. In fact, if you have files at the top level of your directory you’ll probably break RTC’s command line SCM. Just don’t do it. Instead, you’ll have a collection of directories that represent different major modules. Each module, in turn, belongs to one or more components. In this case we will make a module called Test and populate it with a simple README.md file with no content.

~/scmtest> mkdir Test
~/scmtest> cd Test
~/scmtest/Test> touch README.md
~/scmtest/Test> cd ..

The next step is to tell RTC to share the Test module as part of your current workspace and the default component. JazzHub automatically makes a default component called USERNAME | PROJECT Default Component. Just use that and it will make things easier. Here we first create the workspace from the project’s stream, then share Test, and finally check in the README.md file.

~/scmtest> lscm create workspace -r jazzhub -d "Default Workspace" "pwagstro | scmtest Workspace" -s "pwagstro | scmtest Stream"
Workspace (1528) "pwagstro | scmtest Workspace" successfully created
~/scmtest> lscm share -r jazzhub "pwagstro | scmtest Workspace" "pwagstro | scmtest Default Component" Test
Shared successfully
~/scmtest> lscm checkin Test/README.md

After checking in, the code hasn’t been delivered yet; rather, the change is sitting in a staged change set that still needs to be delivered to the stream. First, we can check to see what change sets are currently outgoing.

~/scmtest> lscm status
Workspace: (1528) "pwagstro | scmtest Workspace" <-> (1529) "pwagstro | scmtest Stream"
  Component: (1530) "pwagstro | scmtest Default Component"
    Baseline: (1531) 1 "Initial Baseline"
    Outgoing:
      Change sets:
        (1532) *--@  "Share" 06-Oct-2013 10:18 PM

The default change set comment is simply “Share”. This is really bad, but unfortunately lscm doesn’t make it easy to create nice commit messages like Git does. From the above output we can see that our outgoing change set alias is 1532. This is a value that we can use to change the change set comment as follows:

~/scmtest> lscm changeset comment 1532 "Initial attempt at sharing"

Now we can look at the output and see if it has our proper commit message.

~/scmtest> lscm status
Workspace: (1528) "pwagstro | scmtest Workspace" <-> (1529) "pwagstro | scmtest Stream"
  Component: (1530) "pwagstro | scmtest Default Component"
    Baseline: (1531) 1 "Initial Baseline"
    Outgoing:
      Change sets:
        (1532) *--@  "Initial attempt at sharing" 06-Oct-2013 10:18 PM

Everything looks good, let’s deliver the changes.

~/scmtest> lscm deliver
Delivering changes:
  Repository: https://hub.jazz.net/ccm01/
  Workspace: (1529) "pwagstro | scmtest Stream"
    Component: (1530) "pwagstro | scmtest Default Component"
      Change sets:
        (1532) ---$  "Initial attempt at sharing" 06-Oct-2013 10:20 PM
          Changes:
            --a-- /Test/
            --a-- /Test/README.md
Deliver command successfully completed.

Congratulations, you’ve now pushed your first file using the RTC SCM to JazzHub.

Synchronizing the Changes to Orion

One of the really valuable features of JazzHub is the ability to edit and check in your code right from the browser. This, in theory, lets someone do all of their development within the cloud. Go back to your project page on JazzHub and select the Code tab at the top.

Click on the Code Tab

Click on the Code Tab on Your JazzHub Project Page

This should take you right to a page where you’ll see your Test module on the side. Click on the twistie to reveal your blank README.md file. Now, enter something useful into it. Click Save when you’re happy.

Enter in Something Useful into the Editor

Now, check your changes in by clicking on “Check in”.

Check In Your Code

Now click “submit”.

Congratulations! You’ve just successfully committed your first bit of code using the JazzHub editor.

Recent Talk: Abusing the GitHub API and Graph Databases to Gain Insight About Your Project

Last night I, along with a lot of other amazing folks, gave a lightning talk at the Data Science DC meetup. In addition to talks about being a “growth hacker”, random forests, consensus clustering, and “If you give a nerd a number”, there was my humble talk about GitHub, Graph Databases, and gaining insight about the social aspects of your project.

In short, I did an exploration of the Julia programming language using my tool, GitMiner, to evaluate the social aspects of the community around the language. I was pleased when I saw that one of the foremost influencers of Julia, at least when measured by watched repositories, was Data Community DC’s own Harlan Harris. I promise I didn’t plan that, but it made for a slightly more interesting story.

For folks that are interested, I’ve posted the code that I used for the analysis as a GitHub repository. Feel free to check it out, fork it, or do whatever you want with it - under the terms of the MIT license, of course.

New Paper: A Network of Rails: A Graph Dataset of Ruby on Rails and Associated Projects

For the last year and a half I’ve been working with Anita Sarma, a professor at the University of Nebraska-Lincoln, and her graduate student, Corey Jergensen, to try to understand some of the social dynamics around GitHub. As we began to dig at the ecosystem we realized that we had an opportunity to perform some novel analysis on the community. Specifically, GitHub is a highly networked ecosystem, and most of the queries that we were doing were localized around single projects or developers. At the time graph databases were taking off, so we decided to learn a new technology while getting some data at the same time.

This resulted in the creation of GitMiner, a tool that utilizes the GitHub APIs to download all the data about a project and its related users, issues, pull requests, and basically everything else that you can get out of the API. It then stores this information inside of a graph database - something that I’ve written about before when I first published a dataset on the Tinkerpop family of projects.

Now we’ve had a chance to formally publish a larger set of data: thousands of projects associated with Ruby on Rails. The data were published at this year’s Working Conference on Mining Software Repositories. If you’d like to read the paper, here’s the authors’ pre-print and the GitHub repository with the actual data.

In the coming weeks/months I’ll probably write more about how to use GitMiner to collect large amounts of data from GitHub and how to crawl this data. In the interim, however, I’ll leave you with this nifty picture of shared developers between projects, which is part of an upcoming submission of ours.

Developers shared between projects in our Ruby on Rails dataset. The size of nodes represents the number of developers on the project, edge width is the number of shared developers between projects, and color represents programming language. [link to full size image]

Citation: Wagstrom, P., Jergensen, C., and Sarma, A. A Network of Rails: A Graph Dataset of Ruby on Rails and Associated Projects. Proceedings of the 2013 Working Conference on Mining Software Repositories, ACM (2013).

The KTHXBAI Experiment

On May 1st, 2012 I embarked on an experiment at work — I started signing work emails to my team and friends inside and outside the office with the words “KTHXBYE” or “KTHXBAI”. The goal was to see how long it would take until someone mentioned or asked about it. About two weeks after I started the experiment a friend from Microsoft noticed it and mentioned it to me. Of course, I replied with a meme:

To which my friend at Microsoft was gracious enough to reply with a meme of his own. This experiment was clearly off to an awesome start.

I expected that I’d hear back from other folks in a matter of days. But then the days turned to weeks and the weeks slowly turned into months. I concluded one of two things: either no one actually read my email, or no one actually caught the reference. Undaunted I persisted. Over the course of the experiment I sent out more than 450 messages with the signature “KTHXBYE” and about 65 with “KTHXBAI”, although I only realized that “KTHXBAI” was the appropriate spelling late into the experiment.

Finally, yesterday, March 5, 2013, the experiment came to an end. My manager asked what it meant and googled the definition, which, unfortunately, led to the Urban Dictionary definition of “KTHXBAI”.

My response: “Ughh…”. This led to an explanation that Urban Dictionary shouldn’t be trusted and that, no, I wasn’t telling my co-workers to get bent at the end of every email. I had to introduce the whole concept of LOLCats, which thankfully was backed up by the creation of my 2007 intern project called LOLJazz, which somehow lingers on as a zombie inside of our Rational Team Concert Server. Still, I wasn’t out of the woods; there was the chance that it could still be “actionable”. This is where I had an ace up my sleeve. During the development of Watson, IBM’s Jeopardy! playing computer, the team, which happens to be in my organization, fed the entire Urban Dictionary into Watson. As could be guessed, the importation of Urban Dictionary into Watson led to many hilarious and wholly inappropriate responses. In short, Urban Dictionary was a cesspool and shouldn’t be used as canon. Rather, in this case, the Cheezeburger kthxbai entry is a much better source.

A traditional use of KTHXBAI (even if it is misspelled)

And so, my experiment has come to an end. In the end it was sorta a drag as month after month passed with no one mentioning it. After I talked about it as an experiment everyone came out of the woodwork to say they had seen it and wondered what it meant, but didn’t bother to ask. Which leads me to wonder, how often does this happen? Do people even read my emails? Do they just ignore things they don’t understand or perceive as irrelevant? Do they do that to everyone, or just me? Could I start saying that we need to replace the fitzervalve on the flux capacitor in order to keep the servers from frobnicating themselves and get away with it?

Now, it’s time to find a new subversive work experiment…

Looking for an Intern for Summer 2013

Once again I’m looking for an amazingly bright Ph.D. student to work with me over the course of the summer. The position is open to Ph.D. students from any university and at any point of their studies, and I can nearly guarantee it’s going to be an awesome experience.

The primary task will be applying machine learning techniques (lexical analysis, network extraction, predictive analytics) to the usage data from a large piece of commercial software. With a little bit of luck the software will be instrumented by this point in time so you’ll just need to slice and dice the data and find awesome stuff. The goal, of course, is to publish an amazing paper that provides great insight into how users actually use this type of software and provide guidance to architects and developers of such a system.

A loose list of desirable skills:

  • Java: Most of our tools are written in Java. It took me a while to get used to this, but Java has some nice advantages for developing code to run in an enterprise. Here at IBM we really love it and most of our software, including the tool we’re looking at, is built in Java.
  • Software Engineering Processes: Domain expertise in understanding the relationships between the different levels of stakeholders in a software project is immensely helpful and will make it a lot easier to tease great bits of nuggets out of the data.
  • Machine Learning: We use various types of machine learning, both Java libraries and some R to understand the data. On the Java side knowledge of text analysis packages such as OpenNLP is helpful.
  • Statistics: I love R. If you love R it helps out.
  • Visualizations: I’m big on making great visualizations to show off our findings. If you’re a ninja with ggplot or d3 then you probably qualify.

Of course, there are a variety of other skills that are helpful too. The intern absolutely must be self motivated and able to find answers to questions on their own. This isn’t an unsupervised position, but I travel a lot and am frequently out of the office, which limits my ability to provide direct daily supervision. As a result, excellent communication skills are also helpful — you should know how to ask questions over email in a way that is succinct while providing enough information for other people to answer the question. If you’ve got a great profile on StackOverflow you’re probably already there.

There are some great advantages to spending a summer working with me at IBM TJ Watson Research in Yorktown Heights, NY. First, you’ll be working with some of the smartest people in the world at a facility that has an amazing legacy. IBM Research was the genesis of DRAM, the processors in all major video game consoles, Watson - the Jeopardy! playing computer, LASIK, and thousands of other things. We make the world awesome.

Second, our interns come from around the world and are generally smarter than we are. You know that feeling you get when you go to a conference? You’re always excited about new ideas and feel like you could go home and churn out your thesis in a week. Imagine that feeling for an entire summer! I had a blast when I interned here and met some incredible young researchers who I’m still friends with.

Thirdly, we’re just outside of New York in scenic Westchester County, NY. I took the train into the city every Friday, Saturday, and Sunday when I interned here. It was the perfect combination of excitement from New York City and a setting where you can really get work done. You may be saying “isn’t New York really expensive?”. You’re entirely right. Don’t worry, we pay enough that it’s totally worth your time.

Interested? You can either email me or visit our intern hiring page for more information. We won’t be taking applications that much longer, so be sure to act soon.

Rules for Recruiters, Vol 1: GPA Doesn’t Matter If You Have a Ph.D.

Tech jobs are hot in New York right now. Last year, while sitting in LaGuardia Airport waiting for a flight, I was hacking on some code for work in Eclipse and a guy who was shoulder surfing me tried to persuade me to interview for positions he had available at his hedge fund. If you visit any Meetup in the city you’ll hear from dozens of people who are looking for the best and the brightest. Combine that with a publicly visible GitHub profile, a resume that’s sitting on my web page, and a fairly complete LinkedIn profile, and it means that messages from recruiters are constantly flooding my mailbox.

They’re nearly all amateurish wastes of my time.

In this series of posts I’m going to chronicle why they’re such a waste of my time. Here’s a paraphrased recent message I got:

Dear Dr. Wagstrom,

I work for MegaHyperTech, a leading technology placement firm in New York City. We came across your profile on GitHub and later found your resume and think that you may have the talent that our client Quanttastic Solutions is looking for. They’re a hedge fund that makes it feel like you’re working at Google. They hire only the best and the brightest from schools like MIT, Berkeley, CMU, and Michigan. We’d love to send your information over there, but we noticed that you don’t list your GPA on your resume and they only hire individuals with exemplary GPAs. If you’re willing to update your resume to include that information for all your degrees we think that you’d enjoy the challenge.

The recruiter is entirely correct: I don’t list my GPA on my resume. This is done for a couple of reasons: first, my degrees are intertwined. It’s really hard to differentiate the GPAs for computer science, electrical engineering, and computer engineering bachelor’s degrees. They’re all in the 3.5 - 3.9 range, but really I don’t remember what they were. Likewise, my master’s and Ph.D. from Carnegie Mellon are also intertwined and probably have a similar range.

But the bigger issue is that a Ph.D. isn’t about classes. In fact, while working on a Ph.D., if you’re taking a required class that isn’t directly related to your research you probably shouldn’t spend enough time to get an ‘A’ in the class. The measure of the work for a Ph.D. is the thesis and the publications that come out as a result of doing the research. In all the times that I met with my advisors, I think I was asked about my grades only once, and it was over a concern that I was spending too much time on my homework for my machine learning class.

So here’s my hope that maybe at some point a recruiter will read this. If you ask me for my GPA you’re not going to get it. If your client insists on GPAs for their candidates, then they don’t know what they’re getting.

30 Meters Underwater with a Dead Physical Layer Protocol

A couple of years ago I got the bright idea that I’d get my wife an open water SCUBA certification as her Christmas present. She likes aquariums and fish and I thought it would be a fun way to do something different when we travel. Fast forward to the present day and I’ve got a closet filled with neoprene, BCDs, fins, first aid kits, and a dive log filled with all sorts of certification cards from PADI.

We purchased our own equipment relatively early in the process of learning how to SCUBA dive - shortly after getting our open water certification, thanks in large part to a nice tax rebate. For the most part we’ve been very happy with our purchases and I feel like it’s made us much more comfortable when we’re underwater. One of the key components of diving is a dive computer. The most basic dive computers tell you your depth and warn you if you’re ascending too fast or are going to need a decompression stop somewhere along the way. More advanced computers replace your entire diving console and provide a compass and wireless integration of your and your buddy’s pressure gauges. Yeah, we went for that kind of over the top dive computer and bought the Uwatec Galileo Luna.

Uwatec/ScubaPro Galileo Luna Hoseless Air Integrated Dive Computer

I’ll be the first to admit that this probably wasn’t the wisest of ideas. I spent two weeks researching $55 gel pads for my standing desk, and here we just decided to drop $2000 on a couple of dive computers thanks to thirty minutes at our local dive shop. We’ve been completely thrilled with them under water. Where we’ve had more problems is getting data out of them above water. More advanced computers also take periodic samplings of your depth, remaining air, water temperature, etc. You can use this data to reconstruct a dive profile in a way that is much richer than what normally appears in your dive log.

Screenshot of jTrak - Does it feel like it’s 1999?

Getting this data off your computer isn’t trivial. Dive computers are expensive for a couple of reasons: they’re produced in relatively low volumes, they often license patented algorithms for estimating your air consumption and remaining bottom time, and, of course, they need to be waterproof. This means that you can’t just drop a USB port on the outside of the case. Nor can you just put a USB port under a rubber flap. At 30 meters you’re facing about 400 kPa of pressure - four times the pressure at the surface. Water will find a way in. If it gets in, the salt will corrode everything and it will die. Thus, dive computers tend to be very well sealed and make even trivial things like changing the battery a process that requires tools and new grease for the O-rings.
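
For the curious, that 400 kPa figure falls straight out of basic hydrostatics, assuming seawater at roughly 1025 kg/m³:

P = P_atm + ρgh ≈ 101 kPa + (1025 kg/m³ × 9.81 m/s² × 30 m) ≈ 101 kPa + 302 kPa ≈ 403 kPa

That’s right around four times the 101 kPa of air pressure alone, which is why nothing short of a properly greased O-ring seal survives down there.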

There really isn’t a standard interface to these devices. It seems as though a lot of devices, such as the Mares puck computers, have corrosion resistant metallic contacts that connect to a USB controller with an FTDI USB→Serial chip in it. However, the Uwatec Galileo decided to be more advanced and use what I’m sure was the hip protocol at the time: IrDA.

Now, in case you missed it, IrDA was all the rage in the 1990s and early 2000s. Every laptop seemed to ship with an IrDA port built in. You could use it to synchronize data with your Palm or Handspring in the late 1990s. Once cell phones were more common you could even tether your laptop to your cell phone and get very slow data. In the pre-WiFi, pre-EDGE days this was pretty hot stuff. “Was” being the key word - hot stuff meaning around the speed of the 28.8k modem that I used back in 1994.

You can still find devices that use IrDA, most notably a lot of the heart rate monitors from Polar, but for the most part the technology is from about 10 years ago. This also means that you’re dealing with the headaches of 10 years ago, including the near total lack of Mac support for devices. Those that do support the Mac often only support the PPC Mac and never really fully supported it anyway. Did I mention that Mac OS X doesn’t even have full support for IrDA? Just try opening up a socket using AF_IRDA. It doesn’t exist. Ughh. This was going to be a great adventure.
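
To give a sense of what’s missing, here’s a rough sketch of what talking to an IrDA device looks like on Linux - roughly the approach libdivecomputer’s IrDA backend takes: open an AF_IRDA socket, ask the stack which devices it has discovered, and connect by service name. The service name below is just a placeholder, not the Galileo’s actual one. On Mac OS X this falls over at the very first socket() call, because AF_IRDA simply isn’t defined.

/* A sketch of Linux's IrDA socket API; the service name is a placeholder. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/types.h>
#include <linux/irda.h>

int main(void)
{
    int fd = socket(AF_IRDA, SOCK_STREAM, 0);   /* no such family on Mac OS X */
    if (fd < 0) {
        perror("socket(AF_IRDA)");
        return 1;
    }

    /* Ask the IrDA stack which devices it has discovered in range. */
    unsigned char buf[sizeof(struct irda_device_list) +
                      9 * sizeof(struct irda_device_info)];
    socklen_t len = sizeof(buf);
    struct irda_device_list *list = (struct irda_device_list *) buf;
    if (getsockopt(fd, SOL_IRLMP, IRLMP_ENUMDEVICES, buf, &len) < 0 ||
        list->len < 1) {
        fprintf(stderr, "no IrDA devices in range\n");
        close(fd);
        return 1;
    }

    /* Connect to the first device found, by (placeholder) service name. */
    struct sockaddr_irda peer;
    memset(&peer, 0, sizeof(peer));
    peer.sir_family = AF_IRDA;
    peer.sir_lsap_sel = LSAP_ANY;
    peer.sir_addr = list->dev[0].daddr;
    strncpy(peer.sir_name, "DiveComputer", sizeof(peer.sir_name) - 1);

    if (connect(fd, (struct sockaddr *) &peer, sizeof(peer)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    /* From here on it's plain read() and write() on fd. */
    close(fd);
    return 0;
}

Everything above the bare infrared pulses - discovery, addressing, retransmission - is work the operating system’s IrDA stack is doing for you in that little program, which is exactly the part the Mac never got.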

Setting Baseline Expectations

My first naive attempts were to hack an IrDA driver into the framework of libdivecomputer. There was already support for IrDA dive computers under Windows and Linux, and I had confirmed that they worked just fine with my computer - how hard could it be? The answer is a lot more complicated than I thought. The first step was to find an IrDA dongle that even worked with Mac OS X. I ordered a couple of cheap ones off eBay and had no luck. I read a few comments from folks saying that the official Uwatec USB->IrDA devices worked with JTrak on Mac OS X; however, the official dongles cost about $70 and JTrak is a bit less than what I’m looking for in dive logging software. Fortunately, I was able to find another device that looked nearly identical from the outside - the IRJoy USB 2.0 IrDA adapter for $30. When this guy arrived a quick scan showed that it was the exact same hardware as the official Uwatec dongle - both were based on the MosChip 7780 chipset.

Plugging the device into my trusty Thinkpad x31 showed that it quickly and easily worked in both Windows and Linux using the SmartTrak software from Uwatec, JTrak, and the test applications from libdivecomputer. I knew that I could at least make some progress. Next up was to test it on my Mac. I plugged in the IrDA stick, fired up JTrak, and to my amazement it just worked. That NEVER happens. Poking around showed why it worked: the company behind JTrak had licensed a complete pure Java IrDA stack. Well, at least I could use JTrak if everything else failed. However, I had my eyes set on something much prettier: MacDive.

Writing a Driver

I had heard people refer to the fact that the MosChip devices had a Mac driver, but most of those conversations ended many years ago, as if I needed more evidence that I was dealing with a dead protocol. After some digging around and emailing random customer service addresses I found that the IP for the MosChip devices was sold off to a company in Taiwan called Asix. They provided a couple of different versions of the driver and I eventually found one that worked in full 64-bit mode on Mac OS X Lion. Score.

The driver came with a simple test application that would let me read the data coming over the device as though it was a serial device. Using this test application I was able to position the reader in the line of sight of other IrDA devices and receive data. Neat. The problem is that I was getting the raw bytes of the IrDA sockets. There’s a lot of overhead in there that goes along with handshaking, setting speed, and resending data when connections are interrupted. None of this seemed to be enabled in the driver. The driver simply provided a couple of serial devices that I could open up and use to smack bits back and forth. If I wanted this to work I would need to write a complete IrDA stack on top of this serial device.

The problem is that the IrDA stack is actually fairly complex. There’s a myriad of different protocols that stack on top of IrDA to make everything work. This was basically the equivalent of trying to implement TCP/IP using just the raw bits coming over the 802.11 physical layer. In other words, it was a nasty layer mismatch that was not going to do me any favors.

The Multifaceted IrDA Stack - From Wikipedia

I continued to email Asix, who were more than helpful, although they seemed most concerned that I would write a driver that would let the user transfer files with Windows and cell phones. After a few more emails I explained what a dive computer was and how much of a niche this issue was, and Asix offered me an NDA to work on the driver and attempt to implement the AF_IRDA stack for Mac OS X. If I were in undergrad this would probably sound like a great idea. However, I’m not. I’ve got a job that keeps me quite busy and has me flying back and forth between New York and Washington on a weekly basis. I just don’t have the time to acquire the knowledge needed to hack together a driver on Mac OS X. Of course, there’s also the issue of me performing gratis work for a for-profit company, which I didn’t really want to do either.

VMs to the Rescue

This left me with really only one simple solution: use what I already knew worked for communicating with the Galileo Luna - Linux or Windows. In an effort to keep this simple and avoid worrying about license issues I chose to use a very minimal Linux installation under VirtualBox as my guest environment. The next problem was the software to make use of my data. There were a couple of different ways to handle this: either do all of my log work inside of the virtual machine, or just download the data in the virtual machine and copy it over to my Mac to do most of the work on the log. Starting up a VM is a bit of a pain, so the choice was made to use Mac dive log software, download the data in the VM, and then copy it over.

There are a couple of different formats that might be able to fit the bill: SDE, UDCF, UDDF, and ZXL. SDE is the output format from Suunto Dive Explorer software. There doesn’t appear to be much documentation for the format, but it supposedly contains all the necessary information that a diver might want to recreate a dive log on a computer. Supposedly Subsurface, a dive log software package by Linus Torvalds, can import from SDE, so there should be some source code there that I just haven’t had a chance to dig at yet. ZXL is a format designed by DAN to collect information for scientific studies of diving related injuries. UDCF and UDDF are formats developed by a group of interested divers that seem to have achieved moderate success. UDCF can be considered the little brother of the more robust UDDF. Many tools support UDCF, but it lacks official mechanisms to do things like save the pressure in a tank.

The most promising format seems to be UDDF - the Universal Dive Data Format. UDDF, like most interchange formats, sadly uses XML so it is parseable by neither humans nor machines. It is able to contain information about dive profile, temperature, and air usage, which are the main things I want to track. I wasn’t able to find a tool that used libdivecomputer to produce a UDDF file, so I wrote my own, the cleverly named dc2uddf.

dc2uddf is a simple tool that uses libdivecomputer and libxml2 to download data from a dive computer and save it as a UDDF file. That’s all it does. There isn’t much of a user interface, but it works, and it’s written in C, which makes me feel a little more like a programmer than I normally do. I’m certain there are some things that it is doing incorrectly; if folks discover problems, email me or file issues on GitHub and I’ll be sure to fix them. Along the way I’ve also found several defects in the UDDF standard, so I feel like I’m making the standard better too.
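
For the curious, here’s roughly what the libxml2 half of that looks like - not dc2uddf’s actual code, just a trimmed sketch of building a small UDDF-style document and writing it out. The element names follow my reading of the UDDF samples and may not match the schema exactly, and a real exporter would fill the waypoints from libdivecomputer’s parsed samples rather than hard-coded values.

/*
 * Illustrative sketch only: element names approximate the UDDF schema,
 * and the sample points are hard-coded instead of coming from a dive
 * computer via libdivecomputer.
 */
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int main(void)
{
    xmlDocPtr doc = xmlNewDoc(BAD_CAST "1.0");
    xmlNodePtr root = xmlNewNode(NULL, BAD_CAST "uddf");
    xmlNewProp(root, BAD_CAST "version", BAD_CAST "3.2.0");
    xmlDocSetRootElement(doc, root);

    xmlNodePtr profiledata = xmlNewChild(root, NULL, BAD_CAST "profiledata", NULL);
    xmlNodePtr group = xmlNewChild(profiledata, NULL, BAD_CAST "repetitiongroup", NULL);
    xmlNodePtr dive = xmlNewChild(group, NULL, BAD_CAST "dive", NULL);
    xmlNodePtr samples = xmlNewChild(dive, NULL, BAD_CAST "samples", NULL);

    /* A few fake sample points: elapsed seconds and depth in meters. */
    double profile[][2] = { { 0, 0.0 }, { 60, 12.4 }, { 120, 29.8 } };
    for (size_t i = 0; i < sizeof(profile) / sizeof(profile[0]); i++) {
        char divetime[32], depth[32];
        snprintf(divetime, sizeof(divetime), "%.0f", profile[i][0]);
        snprintf(depth, sizeof(depth), "%.1f", profile[i][1]);

        xmlNodePtr waypoint = xmlNewChild(samples, NULL, BAD_CAST "waypoint", NULL);
        xmlNewChild(waypoint, NULL, BAD_CAST "divetime", BAD_CAST divetime);
        xmlNewChild(waypoint, NULL, BAD_CAST "depth", BAD_CAST depth);
    }

    xmlSaveFormatFileEnc("dive.uddf", doc, "UTF-8", 1);
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}

Compile it against libxml2 (the flags from xml2-config do the trick) and you end up with a dive.uddf file that UDDF-aware log software can at least attempt to import.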

Now I’m at the point where I can download the data using a Linux VM and then copy the data over to my Mac where I can easily import it into the excellent MacDive software, as you can see below.

The Pretty-Pretty Output of MacDive

The Future

I’ve thought about a couple of ways that I could make this a bit more streamlined. The current candidate is to get a Raspberry Pi board and create a small dedicated device for downloading dive computer data. Basically you’d turn it on, put the dive computer within range, press a button on the case for the Raspberry Pi and your data would be automatically downloaded. You could feed it an SD card and it could either use the configuration file on the SD card to upload the data to a remote host or just store a copy on the SD card. However, given the long waits for Raspberry Pis at the current time and my busy schedule I’ll just have to wait on that idea.

I’ve also toyed with the idea of making a service that provides real analytics on dives. Right now there are a couple of different sites that allow you to share dive logs. DiveBoard seems to be the most cross-platform of sites and they’ve even developed a browser plugin based on libdivecomputer to automatically upload your dives from your browser. Aside from their plugin they allow users to upload UDCF, SDE, and ZXL files. They’ve even gone so far as to extend UDCF to allow for pressure information — although this seems to be a clear sign to me that they should consider allowing UDDF uploads.

Another community is Suunto Movescount. This is the successor to Suunto’s Dive Explorer software and reflects the fact that they’ve moved beyond just diving metrics. The problem is that as near as I can see it’s a locked platform. There doesn’t appear to be any way to get your data out of it, or, for that matter, get data from non-Suunto devices into it.

Both of these sites are missing some of the potential for such sites, which is the ability to measure and track rather than just keeping a log. It’s something that sites like RunKeeper are just beginning to explore with efforts like their FitnessReports, but even those reports are rather cursory. There’s a number of metrics that we can calculate both on an individual and across a community that would be highly beneficial to everyone involved - divers, dive shops, travel agents, tour operators, and gear manufacturers, to name just a few. However, the description of these analytics will have to wait for a future post.

On the Facebook Terms of Service

Recently I’ve seen a number of friends and acquaintances post some variation of the following message to their Facebook walls:

In response to the new Facebook guidelines I hereby declare that my copyright is attached to all of my personal details, illustrations, comics, paintings, professional photos and videos, etc. (as a result of the Berne Convention). For commercial use of the above my written consent is needed at all time.

By the present communiqué, I notify Facebook that it is strictly forbidden to disclose, copy, distribute, disseminate, or take any other action against me on the basis of this profile and/or its content. The aforementioned prohibited actions also apply to employees, students, agents and/or any staff under Facebook’s direction or control.

The content of this profile is private and confidential information. The violation of my privacy is punished by law (UCC 1 1-308-308 1-103 and the Rome Statute).

Facebook is now an open capital entity. All members are recommended to publish a notice like this, or if you prefer, you may copy and paste this version. If you do not publish a statement at least once, you will be tacitly allowing the use of elements such as your photos as well as the information contained in your profile status updates.

The intent of these postings is to limit the way that Facebook is legally allowed to use or share your information. On the one hand this makes me happy because it seems as though some people are taking their privacy seriously; on the other hand, it’s very frustrating because of the ham-fisted way people are going about it.

The crux of the problem is that the Facebook Terms of Service supersede any declaration or addendum you attempt to make toward Facebook. Specifically clause 19.5:

Any amendment to or waiver of this Statement must be made in writing and signed by us.

However, you might think there is a loophole that will protect you somehow. Maybe something that Facebook forgot to expressly enumerate. Sorry, that’s covered in clause 19.10:

We reserve all rights not expressly granted to you.

As an additional level of backup the posts typically attempt to cite various portions of the Uniform Commercial Code, most often Article 1. First, it’s important to understand what the UCC is. It is NOT some overarching set of federal laws. The UCC is an attempt to harmonize various state laws and make it easier to do business across state lines. In some ways you can think of the UCC a little like the Talmud: the text is important, but so are the comments that go along with it. Unfortunately, the text and comments are copyrighted, so these semi-binding documents are not accessible to the common man (that’s a whole different problem, one which Carl Malamud and Public.Resource.org are attempting to remedy).

Anyway, we’ll ignore for a moment that the entirety of Article 1 of the UCC deals with definitions and ways to interpret further rules, and therefore probably isn’t the thing you’re looking for. The first reference, UCC 1-308 (which is often mistyped 1-308-308, which renders it null in the eyes of the law) reads:

§ 1-308. Performance or Acceptance Under Reservation of Rights.

(a) A party that with explicit reservation of rights performs or promises performance or assents to performance in a manner demanded or offered by the other party does not thereby prejudice the rights reserved. Such words as “without prejudice,” “under protest,” or the like are sufficient.

(b) Subsection (a) does not apply to an accord and satisfaction.

However, the issue with 1-308 is that your Facebook content, while being a creative work, isn’t a performance in most cases. There isn’t a transaction from Facebook unto you for performing such an action; therefore, this most likely doesn’t apply.

Second is UCC 1-103, I have no idea how this got mixed up in here:

§ 1-103. Construction of [Uniform Commercial Code] to Promote its Purposes and Policies: Applicability of Supplemental Principles of Law.

(a) [The Uniform Commercial Code] must be liberally construed and applied to promote its underlying purposes and policies, which are: (1) to simplify, clarify, and modernize the law governing commercial transactions; (2) to permit the continued expansion of commercial practices through custom, usage, and agreement of the parties; and (3) to make uniform the law among the various jurisdictions.

(b) Unless displaced by the particular provisions of [the Uniform Commercial Code], the principles of law and equity, including the law merchant and the law relative to capacity to contract, principal and agent, estoppel, fraud, misrepresentation, duress, coercion, mistake, bankruptcy, and other validating or invalidating cause supplement its provisions.

Reading through this I can’t understand why 1-103 was even brought into this. It’s a simple description of the UCC, highlighting that unless the UCC explicitly displaces them, laws covering things like fraud, duress, and bankruptcy stay in effect.

Finally, let’s look at the appeal to the Rome Statute. I’m going to go out on a limb here and say this was added by someone in Europe, as the original postings I saw by Americans didn’t include this caveat. I’m assuming that the Rome Statute refers to the Rome Statute of the International Criminal Court. This international agreement established the International Criminal Court and gave the UN authority to investigate crimes when the host nations have chosen not to investigate. For example, the ICC often comes into play with state sponsored genocide.

One could easily argue that the United States has initiated investigations into privacy and Facebook (see the Senate Judiciary Committee Subcommittee on Privacy, Technology and the Law meeting on July 18, 2012, when Franken tore into Facebook’s manager of Privacy and Public Policy). The fact that the US is conducting investigations would seem to deny the ICC any sort of jurisdiction, and would therefore place such an investigation outside the bounds of the International Criminal Court — which really has non-first-world problems to deal with, like genocide.

In short, if you’re really concerned about your privacy posting such a message on Facebook doesn’t do anything other than annoy your friends. If you’re really concerned about your privacy on Facebook you need to stop using it altogether.

Important Disclaimer: I am not a lawyer. I’m merely someone who took the time to read the Facebook Terms of Service and look up the relevant portions of the law that people are attempting to quote. None of this should be regarded as real legal advice.