Saturday, September 28, 2013

I got Nielsen-ed

Recently, I opened my mailbox and saw this:

Nielsen is one of the world's oldest information companies. For close to 100 years, they've created a business out of providing information about what people consume. While Nielsen is most famous for their TV ratings, their business covers everything from internet usage to what kind of soap you're buying. This is a company that lives and dies by the quality of their data. So how exactly does Nielsen get their data?

We have the huge envelope above--certainly less likely to end up in the trash than something more modest. Not pictured is a postcard they sent me about a week before the envelope came, informing me to expect a large envelope. They are extremely thorough. But it was upon opening the envelope that things really got interesting:

Cash! Cold hard cash! Apparently, this is an old surveying trick. Someone I recently met who works in consumer surveying said that for years you could get by with only a quarter.

My original plan was to send a bunch of garbage responses back to Nielsen and write a blog post about how easy it is to get bad data, and how rarely such issues are discussed by the analytics community. But those two bucks made me play it straight. Amazing. I'm aware of the psychology at work here, yet I still went along.

The actual survey was extremely brief. Given the time it took, I was making about $90 an hour:

The most interesting part was that about a quarter of the survey had to do with Spanish and Hispanic demographics:

But my favorite question was #3:

I wonder if such a response will increase or decrease my chances of getting a future survey? On one hand, many TV broadcasters are still dismissive of households that no longer consume TV programming through traditional means. On the other hand, Nielsen has a pretty good track record of changing with the times, and the times they are a changin'. (I'm also tickled by the emphasis on "working". But clearly this is an important detail to consider.)

I returned my survey and happily pocketed my candy money. A couple days later, I even got a follow-up:

Now that's thorough. Having had my data harvested by Nielsen, it's interesting to see how much attention to detail they put into data collection (and how much postage they're willing to buy). The form letter is even signed by the "Chief Research Officer". I wonder how many psychologists consulted on that decision?

Sunday, September 22, 2013

The Command Line

I recently came across an article by Greg Reda, a data scientist at GrubHub, entitled, "Useful Unix commands for data science". The article is fantastic, and it really got me thinking about what skills data workers really need, and which different skill sets will emerge as the most vital as the data science field grows.

To me, the command line is a pretty comfortable place. I'm old. I remember when Bill Gates buried DOS with Windows 95. I even spent some time rocking CP/M on one of these dinosaurs I dug out of my parents' garage as a kid. So when I read this article about working with data on the command line, my first thought was, "duh."

Then I thought a little more. There's not a lot of discussion in the data science community about the basics. Instead, people talk about software programs. You hear comparatively little talk about topics like file systems, optimization, and hardware. This is especially ironic given the ubiquity of the term "big data" (a term which, frankly, most people who work with "big" data dislike). When business people talk "big data", they're often just discussing desktop-based tools that run decades-old analysis techniques on sets of data that really aren't very big. The irony grows when you consider that those working with truly big data are more concerned with things like file systems, optimization, and hardware than about what the GUI on their desktop looks like.

This isn't the fault of the data people, as much as it is the nature of business. The people making the decisions--the ones dying to hire people who can work with "big data" because the competition is hiring too--simply do not have the time to learn everything about a new aspect of doing business. So the buzzwords prevail, the GUIs sure look cool, and the person who says "I've used Hadoop" gets hired over the person who says "I did a custom implementation of mapReduce for a rack of GPUs so I could parallelize the execution of awk."

So where does that leave the command line? In a sadly neglected place, but one that any serious data worker is going to have to get used to, even if their bosses don't get it. GUIs are limited; to do really interesting things with your data, you are better off digging into your code, adding support for relevant arguments (don't waste time hooking them to buttons!), dusting off your terminal, and ditching the GUI. With the magic of a few basic UNIX (or Linux, or Cygwin, or MacOS) commands, you're off and running while your colleagues are still clicking through their explorer windows, loading data sets into their shiny new tool.
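To make that concrete, here's a tiny example of the kind of pipeline Reda's article is about. The sales.csv file and its columns are invented for illustration; the commands themselves are bog-standard UNIX:

```shell
# Invent a small CSV so the pipeline below actually runs.
# Hypothetical columns: date, region, amount.
printf 'date,region,amount\n2013-09-01,east,10\n2013-09-02,west,20\n2013-09-03,east,5\n' > sales.csv

# Most common values in the region column, header row skipped:
# tail strips the header, cut grabs column 2, sort groups the values,
# uniq -c counts them, and sort -rn puts the biggest count first.
tail -n +2 sales.csv | cut -d',' -f2 | sort | uniq -c | sort -rn
# east appears twice, west once
```

No tool to install, no data set to "import", and each stage of the pipeline can be inspected on its own by lopping commands off the end.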

Saturday, September 14, 2013

Natural Language Processing - Making Your Computer Understand You

One of the more interesting branches of analytics is natural language processing (henceforth NLP). In English (haha!), NLP is getting computers to understand language. This is not a trivial task. Think about how many words there are, how many years it took you to master language, how difficult it can be to explain grammar rules, etc. Now make a computer understand all of that.

Despite the challenges of NLP, many victories have been won, and many amazing things developed. We haven't quite reached the era of a Star Trek-style computer, but our phones can now handle basic commands (sometimes), and technologies like IBM's Watson are being adapted to things besides Jeopardy. It's only a matter of a few years before something as powerful as Watson is available on your smartphone.

Such stuff may seem to be the realm of researchers, far removed from the work of data scientists in the trenches of industry. But that's not the case at all. Tools like the Natural Language Toolkit (NLTK) make experimenting with NLP fun and easy (if you know Python), and there's a free book available on the site that even includes an intro to Python. And recently, Google released a set of algorithms in an open source project they're calling word2vec. Word2vec requires more technical weightlifting to get started with than the NLTK, but it's also some very cutting-edge stuff, algorithms fresh from the annals of Google Research. Word2vec is especially cool because it will determine relationships between lexical ideas on its own. GigaOm has more to say about why word2vec is so fascinating.
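To give a flavor of the word2vec idea, here's a toy sketch of the famous "king - man + woman lands near queen" style of vector arithmetic. The vectors below are made up for illustration (real word2vec learns hundreds of dimensions from billions of words; nothing here is actual word2vec output), but the arithmetic is the same:

```python
import math

# Made-up 3-dimensional "word vectors" -- purely illustrative.
vectors = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The analogy: king - man + woman = ?
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the nearest word that wasn't part of the question.
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # prints "queen"
```

The remarkable part is that word2vec discovers geometric structure like this from raw text, with no one ever telling it what "gender" or "royalty" means.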

In a few years, you might actually want to use Siri.

Friday, September 6, 2013

Free Twitter Data Analytics Book

From the Data Mining and Machine Learning Lab at Arizona State University comes a free book, Twitter Data Analytics.

This book covers not only the Twitter API, but also how best to store tweets, how to analyze them, and how to visualize Twitter activity. It's a full compendium of Twitter analytics, written by some pretty sharp guys getting cash from the US Government for their work with Twitter.

There's a lot of information bouncing around the web about Twitter analytics, but I've never seen the whole topic covered, beginning to end, in one place, let alone in a format this readable. If you have a bit of programming experience, you know enough to do everything in this book. And I'm fairly certain all the tools they use are free.

The version currently available is a Preprint, and they have a publisher, so this might not be free forever.

Get it here: