Sunday, September 22, 2013

The Command Line

I recently came across an article by Greg Reda, a data scientist at GrubHub, entitled "Useful Unix commands for data science". The article is fantastic, and it got me thinking about which skills data workers really need, and which skill sets will emerge as the most vital as the data science field grows.

To me, the command line is a pretty comfortable place. I'm old. I remember when Bill Gates buried DOS with Windows 95. I even spent some time rocking CP/M on one of these dinosaurs I dug out of my parents' garage as a kid. So when I read this article about working with data on the command line, my first thought was, "duh."

Then I thought a little more. There's not a lot of discussion in the data science community about the basics. Instead, people talk about software programs. You hear comparatively little talk about topics like file systems, optimization, and hardware. This is especially ironic given the ubiquity of the term "big data" (a term which, frankly, most people who work with "big" data dislike). When business people talk "big data", they're often just discussing desktop-based tools that run decades-old analysis techniques on sets of data that really aren't very big. The irony grows when you consider that those working with truly big data are more concerned with things like file systems, optimization, and hardware than with what the GUI on their desktop looks like.

This isn't the fault of the data people so much as it is the nature of business. The people making the decisions--the ones dying to hire people who can work with "big data" because the competition is hiring too--simply do not have the time to learn everything about a new aspect of doing business. So the buzzwords prevail, the GUIs sure look cool, and the person who says "I've used Hadoop" gets hired over the person who says "I did a custom implementation of MapReduce for a rack of GPUs so I could parallelize the execution of awk."

So where does that leave the command line? A sadly neglected place. But a place that any serious data worker is going to have to get used to, even if their bosses don't get it. GUIs are limited, and to do really interesting things with your data you are better off digging into your code, adding support for relevant arguments (don't waste time hooking them to buttons!), dusting off your terminal, and ditching the GUI. And with the magic of a few basic UNIX (or Linux, or Cygwin, or MacOS) commands, you're off and running, while your colleagues are still clicking through their explorer windows loading data sets into their shiny new tool.
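To make "adding support for relevant arguments" a little more concrete, here is a minimal sketch of what that can look like in Python. Everything in it is my own illustration rather than anything from Greg's article: the script name `colmean.py`, the function `column_mean`, and the choice of averaging a column are all hypothetical. The only real point is that a script that takes arguments and reads standard input can sit in a pipeline next to the usual Unix tools.

```python
import argparse
import csv
import sys


def column_mean(lines, column, delimiter=","):
    """Return the mean of a named numeric column.

    `lines` is any iterable of text lines (an open file, sys.stdin,
    or a plain list) whose first line is a header row.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    total = 0.0
    count = 0
    for row in reader:
        total += float(row[column])
        count += 1
    if count == 0:
        raise ValueError("no data rows found")
    return total / count


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Print the mean of one column of a delimited file."
    )
    parser.add_argument("column", help="name of the column to average")
    parser.add_argument(
        "path", nargs="?", default="-",
        help="input file, or '-' (the default) to read standard input",
    )
    parser.add_argument(
        "-d", "--delimiter", default=",", help="field delimiter (default: ',')"
    )
    args = parser.parse_args(argv)

    if args.path == "-":
        # Reading stdin is what lets the script sit at the end of a pipe.
        print(column_mean(sys.stdin, args.column, args.delimiter))
    else:
        with open(args.path, newline="") as handle:
            print(column_mean(handle, args.column, args.delimiter))


if __name__ == "__main__":
    main()
```

Assuming a made-up file called `orders.csv` with a `price` column, something like `head -n 1000 orders.csv | python colmean.py price` would average the first 999 data rows, and `python colmean.py price orders.csv` would do the whole file. No buttons required.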
