Thursday, December 19, 2013

Building a Graph Database of Airline On-Time Flight Performance with Neo4j

My colleague Nicole White and I have released our third video discussing our implementation of the US Department of Transportation's Flight On-Time/Delays dataset. But this time, we've created a graph database, using Neo4j. (The first two videos aren't required viewing, but if you're curious here's us explaining the relational design and a tour of our implementation in Oracle.)

The first third of the video explains exactly what a graph database is, and the video as a whole touches on why graph databases are so cool and powerful. So please watch, learn, and comment:

Wednesday, November 27, 2013

What airline will keep you sitting in a delayed plane the longest?

My colleague Nicole White and I have made another video for our database class. You can watch the first video here. In this second video we implement our database tracking flight on-time performance, and we run some analysis on it, including identifying what might have been the most miserable flight of 2012.

Saturday, November 23, 2013

Tutorial - Building a Classification Tree in Weka (or, "Data Mining for the Scared and Intimidated")

This tutorial is based on material taught and provided by Professor Maytal Saar-Tsechansky as part of The University of Texas' Masters of Science in Business Analytics program.

A classification tree is one of the simplest, yet most powerful data mining techniques. A lot of smoke and mirrors surround what data mining is, and to pull back the curtain a bit I've decided to show the world how easy it is to do basic data mining.

In this example, we'll be using a dataset of basketball players (note this is just a sample of players, not a complete list). Download link.

You can open up the attached file in Excel, or pretty much any other spreadsheet program or text editor. There are a lot of fields in this dataset. Here's what they all are:

league: What league the player played in. N = National Basketball Association (NBA); A = American Basketball Association (ABA)
games: Games Played
minutes: Minutes Played
pts: Points
offReb: Offensive Rebounds
defReb: Defensive Rebounds
reb: Rebounds
asts: Assists
stl: Steals
blk: Blocks
turnover: Turnovers
pf: Personal Fouls
fga: Field Goals Attempted
fgm: Field Goals Made
fta: Free Throws Attempted
ftm: Free Throws Made
tpa: Three Pointers Attempted* SEE NOTE BELOW
tpm: Three Pointers Made* SEE NOTE BELOW
total: Total number of seasons played. This value is calculated as follows: "lastSeason - firstSeason + 1"
position: C = Center; F = Forward; G = Guard
firstSeason: First season played. The year corresponds to the first year of the season (i.e. a value of 2000 represents the 2000-2001 season).
lastSeason: Last season played. The year corresponds to the first year of the season (i.e. a value of 2000 represents the 2000-2001 season). Note that 2004 (the 2004-2005 season) is the last year for which there is data.
careerEnded: 1 if career has ended, 0 otherwise. This field was calculated as follows: if the "lastSeason" field is lower than 2004, the value is 1, otherwise 0. Note that this calculation naively assumes that no players retired at the end of the 2004 season.
yrsRetired2004: The number of seasons that a player has been retired as of the 2004-2005 season.
class: A field showing whether or not a player was inducted to the Basketball Hall of Fame as a player. This field has value 1 if the player has been inducted, and 0 otherwise.
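The two calculated fields above are simple enough to reproduce yourself. Here's a quick Python sketch, using made-up season values for a single hypothetical player:

```python
# Derived fields for one made-up player, following the formulas above.
first_season = 1969  # firstSeason: first year of the player's first season
last_season = 1975   # lastSeason: first year of the player's last season

# total: "lastSeason - firstSeason + 1"
total = last_season - first_season + 1

# careerEnded: 1 if lastSeason is before 2004 (the last year with data), else 0
career_ended = 1 if last_season < 2004 else 0

print(total, career_ended)  # 7 1
```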

If you look through the data, you'll see it has some issues. Those of you who are basketball fans will also find some other problems. This data is messy. Welcome to the world of data mining.

The focus of our analysis will be the "class" field, which tells us if a player is in the Hall of Fame or not. This is the field we will try to predict.

The first step is by far the most difficult. You have to download and run Weka, and you'll need Java installed to do this. There's about a 50% chance this won't work off the bat (and that's likely Java's fault) so you might have to do some troubleshooting. Welcome to the world of Data Science.

Once we start Weka, we'll see a window like this:

Click "Explorer", and you'll get a new window:

We have to give Weka a file to analyze. You can download a version of the basketball player data formatted for Weka here. Or if you are savvy, you can load the CSV from earlier and try to make it work.
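If you'd rather script the whole thing than click through a GUI, the same kind of tree can be grown outside Weka. Below is a minimal sketch using Python's scikit-learn, which is my substitution, not something this tutorial uses (Weka's J48 is a C4.5 learner, while scikit-learn grows CART-style trees), trained on a tiny made-up sample:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up sample: [ftm, fga] per player; class 1 = Hall of Fame, 0 = not.
X = [[5000, 15000], [300, 2000], [4200, 13000],
     [150, 900], [3900, 12500], [200, 1500]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned splits -- a text analogue of Weka's "Visualize tree" view.
print(export_text(tree, feature_names=["ftm", "fga"]))

# Classify a new (made-up) high-free-throw player.
print(tree.predict([[4500, 14000]]))
```

On real data you'd load the CSV above instead of hard-coding six players, but the mechanics are the same.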

Once you've downloaded the file, click "Open File" in the upper left corner, and navigate to the Hall-of-fame-train.arff file you just downloaded. (.arff is Weka's native file format.)

You'll see something like this:

There's quite a bit going on here. But in the interest of sticking to the goal of making a tree, we're going to jump to tree generation. Click the "Classify" tab at the top of the window:

Then click the "Choose" button in the upper-left, and navigate to trees/J48, and click it:

Finally, click the "Start" button in the middle-left of the screen. You'll see some crazy output:

Congratulations, you've just made your first classification tree! Let's take a look at it. Below the "Start" button, right click the entry that just appeared (it has the time, followed by "trees.J48"), and then choose "Visualize tree".

A new window will appear:

This is your tree. But what does it all mean? All the circles correspond to fields from the data, as identified above. All of the squares are "leaves," places where the tree ends. Inside each of these leaves is a 0 or 1, and some counts. The 0 and 1 correspond to the "class" field from the data--whether or not a player made the Hall of Fame. The tree was built by finding the fields that best predicted whether "class" was 0 or 1, or in English, whether a player made the Hall of Fame (class=1) or not (class=0). Finally, the labels on the branches between nodes and leaves show the value at which each field was split.

An example should make this clearer. At the top of the tree is a node "ftm". This is a field from the dataset, meaning "Free Throws Made". Below "ftm" are "<= 1929" and "> 1929". This means that players who made 1929 free throws or fewer went one way down the tree, and players who made more than 1929 free throws went the other way.

Following the tree to the left, we see a leaf (square) that reads "0 (434.0/5.0)". The "0" means most of the players in this leaf had class=0, meaning they did not end up in the Hall of Fame. The first number, "434.0", is the total number of players who landed in this leaf, and the second number, "5.0", is how many of them were misclassified--that is, 5 of those 434 players actually had the other class (1), meaning they did end up in the Hall of Fame.

More interestingly, since this is the first division in the tree, it means that the best predictor for a basketball player making the hall of fame is how many free throws they made. Not points, not games, not position, but boring free throws. Why? The tree can't tell you that. That's up to you to decide.

Moving down to the right, we see a node of "firstSeason" (the first season the player played in), and to the right of that "fga" (field goals attempted). One more move to the right brings us to a leaf reading "1 (15.0/1.0)". The "1" means most of the players in that leaf are class=1 and thus in the Hall of Fame. And "15.0/1.0" means that 15 players landed in this leaf, of whom 1 was misclassified (class=0, not in the Hall of Fame). So if a player makes a lot of free throws, started after 1962, and attempted over 12258 field goals, then there is a pretty good chance they made it into the Hall of Fame.
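That path through the tree is really just a nested if-statement. Here it is written out as Python, with the thresholds (1929 free throws made, a 1962 first season, 12258 field goal attempts) taken from the tree above. For simplicity, every branch other than the one we just followed returns 0, so treat this as a reading aid, not a usable Hall of Fame predictor:

```python
def hall_of_fame_branch(ftm, first_season, fga):
    """Follow one path of the tree described above.

    Returns 1 (likely Hall of Fame) only for the single leaf discussed in
    the post; every other branch returns 0 here for simplicity.
    """
    if ftm <= 1929:           # root split on free throws made
        return 0
    if first_season <= 1962:  # next split on first season played
        return 0
    if fga <= 12258:          # final split on field goals attempted
        return 0
    return 1                  # the leaf reading "1 (15.0/1.0)"

print(hall_of_fame_branch(ftm=3000, first_season=1970, fga=13000))  # 1
print(hall_of_fame_branch(ftm=1000, first_season=1970, fga=13000))  # 0
```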

Now you're a real Data Scientist. Hopefully you'll get that $100,000 a year job soon!

If enough people ask, I'll show how you can use this tree to predict if a future player will end up in the Hall of Fame.

Tuesday, October 22, 2013

We're building a database of US flight on-time performance

As part of the classwork in the University of Texas at Austin Business Analytics program, we're doing basic database design. Simple relational databases are something I'm already familiar with, so I teamed up with another experienced colleague of mine to do something a little more interesting.

The US Department of Transportation maintains a dataset of every US flight's on-time performance going back to 1987. We're building a database to hold this information, with the idea of doing some analysis later on. And once we've implemented a traditional relational database, we're going to do the whole thing again as a graph database, using Neo4j, and then compare its functionality to the relational database. Should be interesting.

We made a beautifully nerdy video of our relational database design. Watch and comment!

Thursday, October 10, 2013

"Evil" Analytics and a Code of Ethics

While talking with a seasoned developer at an Austin Startup Week event last night, the topic of data science ethics came up. He mentioned a project he was familiar with that involved using social media data to identify which of your competitor's employees would be easiest to poach. While such a project is not on the level of what the NSA is up to, it certainly raises the issue of how easy (and tempting) it is to use data for questionable purposes.

Ethics is often casually mentioned when discussing the impact of big data, but rarely is ethics given anything more than a cursory acknowledgment. However, the ethical implications of big data are staggering and need to be seriously discussed. It is better to have this tough conversation now, rather than wait until it can't be ignored. Indeed, if tales of ethical lapses on the part of data scientists pile up, the damage to the profession could be irreversible--we'd find ourselves in the position of bankers, but with less pay and no political connections. Now is the time to lay the rules down, so that data projects violating mainstream ethical standards can be labeled as such, and their negative impact to the field lessened.

Via a colleague of mine in UT Austin's Business Analytics program, I learned about a recent effort to establish a set of ethical standards for data science. While there's been a recent proliferation of data science/analytics/big data organizations, hopefully the focus on ethics will make this attempt successful. And you can join for free. Let's finish this conversation now, before people outside the data science field finish it for us.

Saturday, October 5, 2013

The Healthcare Big Data Goldrush

With the Affordable Care Act (Obamacare) in the news so much recently, I've been thinking a lot about my own past experiences in healthcare, and what I saw (and didn't see). Then I saw this New York Times blog post. It's quite a read, but the big takeaways for me are that doctors are losing their sanity and confidence, while research continues to show that the demeanor and confidence of a doctor have a far greater effect on outcomes than expected.

A combination of irrational exuberance and economic forces (including legislative) have created something of a big data gold rush in healthcare. And like any good gold rush, you're going to see unorganized growth, heartbreak, fortunes made and lost, and more than a few dead bodies.

The truism being thrown around is that healthcare is the next industry to be transformed by data, and that data is going to change healthcare more than anything else (even, perhaps, than Obamacare) over the next decade. And while both proclamations might be true, they fundamentally strip away the most basic aspect of healthcare--the relationship between the patient and the doctor, the sick and the well, the dying and the living, the needy and the needed. No amount of analysis, legislation, technology, or bureaucracy can get around this fact. And yet, it seems to be ignored more each day, sucked under a tidal wave of implementations, electronic health records, politics, "innovations", and process improvement.

Banking, retail--these are industries where human relationships were never at the core, and it's probably not a coincidence that they have responded so well to going under the big-data knife. The continual attempts to humanize these industries ("relationship banking", the ad industry) are a testament to how inhuman money and consumption can be. Healthcare, fundamentally, is very much about the human connection. And while there's certainly potential for data to improve the quality of care, as we fearlessly march forward into our data-driven healthcare future, the human element is getting pushed further and further away from the center.

I do not mean to imply that data has nothing to offer healthcare. It clearly does. What I am saying is that as data professionals we cannot forget that the problems we attempt to solve are, fundamentally, not about data in the end. To forget this is to do a disservice to our colleagues, our customers, and society. In retail and finance you can do a lot without ever thinking about the living, breathing people at the opposite ends of a transaction. But healthcare will not be so accommodating to data-centric problem solving. And yet, as existing data applications in healthcare turn out to not be as fruitful as hoped, the industry's answer is more data!

This is bigger than healthcare. If the big-datification of healthcare turns out to be a turkey, think about what that means for the future of analytics and big data.

Saturday, September 28, 2013

I got Nielsen-ed

Recently, I opened my mailbox and saw this:

Nielsen is one of the world's oldest information companies. For close to 100 years, they've created a business out of providing information about what people consume. While Nielsen is most famous for their TV ratings, their business covers everything from internet usage to what kind of soap you're buying. This is a company that lives and dies by the quality of their data. So how exactly does Nielsen get their data?

We have the huge envelope above--certainly less likely to end up in the trash than something more modest. Not pictured is a postcard they sent me about a week before the envelope came, informing me to expect a large envelope. They are extremely thorough. But it was upon opening the envelope that things really got interesting:

Cash! Cold hard cash! Apparently, this is an old surveying trick. Someone I recently met who works in consumer surveying said for years you could get by with only a quarter.

My original plan was to send a bunch of garbage responses back to Nielsen, and write a blog post about how easy it is to get bad data, and how rarely such issues are discussed by the analytics community. But those two bucks made me play it straight. Amazing. I'm aware of the psychology at work here, yet I still went along.

The actual survey was extremely brief. Given the time it took, I was making about $90 an hour:

The most interesting part was that about a quarter of the survey had to do with Spanish and Hispanic demographics:

But my favorite question was #3:

I wonder if such a response will increase or decrease my chances of getting a future survey? On one hand, many TV broadcasters are still dismissive of households that no longer consume TV programming through traditional means. On the other hand, Nielsen has a pretty good track record of changing with the times, and the times they are a changin'. (I'm also tickled by the emphasis on "working". But clearly this is an important detail to consider.)

I returned my survey and happily pocketed my candy money. A couple days later, I even got a follow-up:

Now that's thorough. Having had my data harvested by Nielsen, it's interesting to see how much attention to detail they put into data collection (and how much postage they're willing to buy). The form letter is even signed by the "Chief Research Officer". I wonder how many psychologists consulted on that decision?

Sunday, September 22, 2013

The Command Line

I recently came across an article by Greg Reda, a data scientist at GrubHub, entitled, "Useful Unix commands for data science". The article is fantastic, and it really got me thinking about what skills data workers really need, and which different skill sets will emerge as the most vital as the data science field grows.

To me, the command line is a pretty comfortable place. I'm old. I remember when Bill Gates buried DOS with Windows 95. I even spent some time rocking CP/M on one of these dinosaurs I dug out of my parents' garage as a kid. So when I read this article about working with data on the command line, my first thought was, "duh."

Then I thought a little more. There's not a lot of discussion in the data science community about the basics. Instead, people talk about software programs. You hear comparatively little talk about topics like file systems, optimization, and hardware. This is especially ironic given the ubiquity of the term "big data" (a term which, frankly, most people who work with "big" data dislike). When business people talk "big data", they're often just discussing desktop-based tools that run decades-old analysis techniques on sets of data that really aren't very big. The irony grows when you consider that those working with truly big data are more concerned with things like file systems, optimization, and hardware than about what the GUI on their desktop looks like.

This isn't the fault of the data people, as much as it is the nature of business. The people making the decisions--the ones dying to hire people who can work with "big data" because the competition is hiring too--simply do not have the time to learn everything about a new aspect of doing business. So the buzzwords prevail, the GUIs sure look cool, and the person who says "I've used Hadoop" gets hired over the person who says "I did a custom implementation of mapReduce for a rack of GPUs so I could parallelize the execution of awk."

So where does that leave the command line? A sadly neglected place. But a place that any serious data worker is going to have to get used to, even if their bosses don't get it. GUIs are limited, and to do really interesting things with your data you are better off digging into your code, adding support for relevant arguments (don't waste time hooking them to buttons!), dusting off your terminal, and ditching the GUI. And with the magic of a few basic UNIX (or Linux, or Cygwin, or MacOS) commands, you're off and running, while your colleagues are still clicking through their explorer windows loading data sets into their shiny new tool.

Saturday, September 14, 2013

Natural Language Processing - Making Your Computer Understand You

One of the more interesting branches of analytics is natural language processing (henceforth NLP). In English (haha!), NLP is getting computers to understand language. This is not a trivial task. Think about how many words there are, how many years it took you to master language, how difficult it can be to explain grammar rules, etc. Now make a computer understand all of that.

Despite the challenges of NLP, many victories have been won, and many amazing things developed. We haven't quite reached the era of a Star Trek-style computer, but our phones can now handle basic commands (sometimes), and technologies like IBM's Watson are now being adapted to things besides Jeopardy. It's only a matter of a few years before something as powerful as Watson is available on your smartphone.

Such stuff may seem to be the realm of researchers, far removed from the work of data scientists in the trenches of industry. But that's not the case at all. Tools like the Natural Language Toolkit (NLTK) make experimenting with NLP fun and easy (if you know Python), and there's a free book available on the site that even includes an intro to Python. And recently, Google open-sourced a set of algorithms in a project they're calling word2vec. Word2vec requires some more technical weightlifting to get started with than the NLTK, but it's also some very cutting edge stuff -- algorithms fresh from the annals of Google Research. Word2vec is especially cool because it will determine relationships between lexical ideas on its own. GigaOm has more to say about why word2vec is especially fascinating.
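If you want a taste of this before installing anything, the warm-up exercises in the NLTK book start with tokenizing text and counting word frequencies, which you can approximate with nothing but Python's standard library (a toy sketch, not NLTK itself):

```python
import re
from collections import Counter

text = ("Natural language processing is getting computers to understand "
        "language. Understanding language is not a trivial task.")

# Crude tokenizer: lowercase the text, keep runs of letters only.
tokens = re.findall(r"[a-z]+", text.lower())

# Word frequencies -- the first step of almost any NLP pipeline.
freq = Counter(tokens)
print(freq.most_common(3))
```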

In a few years, you might actually want to use Siri.

Friday, September 6, 2013

Free Twitter Data Analytics Book

From the Data Mining and Machine Learning Lab at Arizona State University comes a free book, Twitter Data Analytics.

This book covers not only the Twitter API, but also how best to store tweets, how to analyze them, and how to visualize Twitter activity. It's a full compendium of Twitter analytics, written by some pretty sharp guys getting cash from the US Government for their work with Twitter.

There's a lot of information bouncing around the web about Twitter analytics, but I've never seen the whole topic covered, beginning to end, in one place. Let alone in a format this readable. If you have a bit of programming experience, you know enough to do everything in this book. And I'm fairly certain all the tools they use are free.

The version currently available is a Preprint, and they have a publisher, so this might not be free forever.

Get it here:

Tuesday, August 27, 2013

Same Data, Different Conclusions

Recently, a research firm called Canalys published a press release entitled, "Half of top iPad apps either unavailable or not optimized on Android". The press release was promptly picked up by a handful of major media outlets, including AllThingsD and even The Guardian.

What exactly did Canalys find? To quote their press release directly:
New Canalys’ App Interrogator research highlights one of the deficiencies of the Android ecosystem: limited availability of high-quality, tablet-optimized apps in the Google Play store. Of the top 50 paid and free iPad apps in Apple’s US App Store, based on aggregated daily rankings in the first half of 2013, 30% were absent from Google Play. A further 18% were available, but not optimized for tablet users, offering no more than a smart phone app blown up to the size of a tablet screen. 
Just 52% of apps had Android versions both available through Google Play and optimized (if only a little) for tablet use. ‘Quite simply, building high-quality app experiences for Android tablets has not been among many developers’ top priorities to date,’ said Canalys Senior Analyst Tim Shepherd.
First, note that the press release headline is not clear. It is easy for someone browsing headlines to come away thinking, "Wow, only half the top apps are on Android!"

Luckily, Canalys does clarify the headline in the full press release, but the finding is still portrayed dramatically. Just barely over half of the top paid iPad apps are available in full tablet form in the Google Play store? That's quite a deficiency, and certainly noteworthy. But a deeper look into the data reveals a slightly different story. You can find the full data here.

If you simply remove all the Apple-published apps from the study, suddenly the results change. 59% of the top (now 44 instead of 50) paid apps are available in full versions from Google Play, and 70% are available, but not necessarily "optimized". (It is also worth noting that the published data never defines exactly what "optimized" means.)

If we look at Canalys' free app data, the picture becomes even rosier for Android. Again discounting Apple apps, 87% of the top (now 45 instead of 50) free apps are available in full, "optimized" versions for Android tablets.

Combining both free and paid app data now gives us 73% of apps available in "optimized" versions on both platforms, and 79% available in some version.
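The arithmetic is easy to check. Here's a quick Python sketch using app counts I reverse-engineered from the percentages above -- the exact counts are my inference, not figures Canalys published:

```python
# Top-50 app counts with Apple-published apps removed (inferred, not official).
paid_total, paid_optimized, paid_available = 44, 26, 31
free_total, free_optimized, free_available = 45, 39, 39

def pct(part, whole):
    return round(100 * part / whole)

print(pct(paid_optimized, paid_total))  # paid apps with "optimized" versions: 59
print(pct(paid_available, paid_total))  # paid apps available at all: 70
print(pct(free_optimized, free_total))  # free apps with "optimized" versions: 87
print(pct(paid_optimized + free_optimized, paid_total + free_total))  # combined "optimized": 73
print(pct(paid_available + free_available, paid_total + free_total))  # combined available: 79
```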

It is also worth noting that Canalys chooses to frame their conclusion as "30% were absent from Google Play," instead of stating that, "70% were present on Google Play." While this is the same conclusion quantitatively, our minds perceive the two statements differently.

Is the data half-empty or half-full?

While here I've chosen to discuss this one Canalys study, the use of "massaged" data (the word "bias" has become too loaded) is commonplace in press releases, analysis reports, and the major media. This is not a revelation, but something many data professionals, and even many well-educated members of the public, have known for a long time.

But then again, who really wants to stare at tables all day except for a few Math teachers and analysts? Communications professionals know this, and they make sure to distill data down into more digestible, dramatic nuggets, lest they lose their share of attention. Can you really blame an organization for dumbing down their quantitative conclusions into a form that the average reader can appreciate? At the end of the day, like in this Onion article about the coverage of Miley Cyrus' VMA performance, it's all about how many eyeballs you can grab.