Wednesday, November 27, 2013

What airline will keep you sitting in a delayed plane the longest?

My colleague Nicole White and I have made another video for our database class. You can watch the first video here. In this second video we implement our database tracking flight on-time performance, and we run some analysis on it, including identifying what might have been the most miserable flight of 2012.

Saturday, November 23, 2013

Tutorial - Building a Classification Tree in Weka (or, "Data Mining for the Scared and Intimidated")

This tutorial is based on material taught and provided by Professor Maytal Saar-Tsechansky as part of The University of Texas' Master of Science in Business Analytics program.

A classification tree is one of the simplest yet most powerful data mining techniques. A lot of smoke and mirrors surround what data mining is, and to pull back the curtain a bit I've decided to show the world how easy it is to do basic data mining.

In this example, we'll be using a dataset of basketball players. (Note that this is just a sample of players; it is not complete.) Download link.

You can open up the attached file in Excel, or pretty much any other spreadsheet program or text editor. There are a lot of fields in this dataset. Here's what they all are:

league: What league the player played in. N = National Basketball Association (NBA); A = American Basketball Association (ABA)
games: Games Played
minutes: Minutes Played
pts: Points
offReb: Offensive Rebounds
defReb: Defensive Rebounds
reb: Rebounds
asts: Assists
stl: Steals
blk: Blocks
turnover: Turnovers
pf: Personal Fouls
fga: Field Goals Attempted
fgm: Field Goals Made
fta: Free Throws Attempted
ftm: Free Throws Made
tpa: Three Pointers Attempted* SEE NOTE BELOW
tpm: Three Pointers Made* SEE NOTE BELOW
total: Total number of seasons played. This value is calculated as follows: "lastSeason - firstSeason + 1"
position: C = Center; F = Forward; G = Guard
firstSeason: First season played. The year corresponds to the first year of the season (i.e. a value of 2000 represents the 2000-2001 season).
lastSeason: Last season played. The year corresponds to the first year of the season (i.e. a value of 2000 represents the 2000-2001 season). Note that 2004 (2004-2005 season) is the last year for which there is data.
careerEnded: 1 if career has ended, 0 otherwise. This field was calculated as follows: if the "lastSeason" field is lower than 2004, the value is 1, otherwise 0. Note that this calculation naively assumes that no players retired at the end of the 2004 season.
yrsRetired2004: The number of seasons that a player has been retired as of the 2004-2005 season.
class: A field showing whether or not a player was inducted to the Basketball Hall of Fame as a player. This field has value 1 if the player has been inducted, and 0 otherwise.
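
The two derived fields whose formulas are spelled out above ("total" and "careerEnded") are easy to recompute if you want to sanity-check the file. Here is a minimal sketch in Python. The function name and the example player are made up for illustration, and it assumes the season fields have already been read in as integers.

```python
LAST_SEASON_WITH_DATA = 2004  # the 2004-2005 season, per the field notes above


def derived_fields(first_season, last_season):
    """Recompute 'total' and 'careerEnded' from the two season fields."""
    # total = lastSeason - firstSeason + 1
    total = last_season - first_season + 1
    # careerEnded = 1 if lastSeason is lower than 2004, otherwise 0
    career_ended = 1 if last_season < LAST_SEASON_WITH_DATA else 0
    return {"total": total, "careerEnded": career_ended}


# A hypothetical player whose first season was 1990-1991 and last was 1999-2000.
print(derived_fields(1990, 1999))
```

Comparing these recomputed values against the columns in the file is one quick way to spot some of the messy rows.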

If you look through the data, you'll see it has some issues. Those of you who are basketball fans will also find some other problems. This data is messy. Welcome to the world of data mining.

The focus of our analysis will be the "class" field, which tells us if a player is in the Hall of Fame or not. This is the field we will try to predict.

The first step is by far the most difficult. You have to download and run Weka, and you'll need Java installed to do this. There's about a 50% chance this won't work off the bat (and that's likely Java's fault) so you might have to do some troubleshooting. Welcome to the world of Data Science.

Once we start Weka, we'll see a window like this:

Click "Explorer", and you'll get a new window:

We have to give Weka a file to analyze. You can download a version of the basketball player data formatted for Weka here. Or, if you are savvy, you can load the CSV from earlier and try to make it work.

Once you've downloaded the file, click "Open File" in the upper left corner, and navigate to the Hall-of-fame-train.arff file you just downloaded. (.arff is Weka's native file format.)
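
If you're curious what an .arff file looks like inside, it is plain text: a relation name, one "@attribute" line per field, and then the rows under "@data". Here is a rough sketch in Python of writing one by hand. The function name, the relation name, and the single example row are all invented for illustration, and this is not necessarily how the downloadable file was produced. One thing worth noticing is that "class" is declared as a nominal field with values {0,1} rather than as a number, because J48 expects a nominal class to predict.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of the allowed nominal values.
    rows: list of dicts keyed by attribute name.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(row[name]) for name, _ in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical three-field slice of the data, with one made-up row.
attributes = [("league", ["N", "A"]), ("ftm", "numeric"), ("class", ["0", "1"])]
rows = [{"league": "N", "ftm": 2100, "class": 1}]
text = to_arff("halloffame", attributes, rows)
print(text)
```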

You'll see something like this:

There's quite a bit going on here. But in the interest of sticking to the goal of making a tree, we're going to jump to tree generation. Click the "Classify" tab at the top of the window:

Then click the "Choose" button in the upper-left, and navigate to trees/J48, and click it:

Finally, click the "Start" button in the middle-left of the screen. You'll see some crazy output:

Congratulations, you've just made your first classification tree! Let's take a look at it. Below the "Start" button, right click the entry that just appeared (it has the time, followed by "trees.J48"), and then choose "Visualize tree".

A new window will appear:

This is your tree. But what does this all mean? All the circles correspond to fields from the data, as explained and identified above. All of the squares are "leaves", places where the tree ends. Inside each of these leaves is a 0 or 1, and some counts. The 0 and 1 correspond to the "class" field from the data--whether or not a player made the Hall of Fame. The tree was built by trying to find what fields best predicted if this "class" was 0 or 1, or in English, if a player made the Hall of Fame (class=1) or not (class=0). Finally, the numbers in between nodes and leaves show what value the data in the field was split at.

An example should clarify this. At the top of the tree is a node "ftm". This is a field from the dataset, meaning "Free Throws Made". Below "ftm" are "<= 1929" and "> 1929". What this means is that players who made 1929 free throws or fewer go one way down the tree, and players who made more than 1929 free throws go the other way down the tree.

Following the tree to the left, we see a leaf (square) that reads "0 (434.0/5.0)". The "0" means most of the players in this leaf had class=0, meaning they did not end up in the Hall of Fame. The first number, "434.0", is the number of players who landed in this leaf. The second number, "5.0", is how many of those players had the other class (1), meaning that they did end up in the Hall of Fame. That leaves 429 players in this leaf who really were class=0.
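
Leaf labels all follow the same pattern, so they can be read mechanically. In J48's output, the first number in the parentheses is how many training rows reached the leaf, and the second (when present) is how many of those rows do not match the leaf's label. Here is a small sketch in Python that pulls a label apart. The function name and the "purity" figure are my own additions for illustration, not something Weka reports.

```python
import re

# Matches labels such as "0 (434.0/5.0)" or "1(15.0/1.0)"; the "/wrong" part is optional.
LEAF_PATTERN = re.compile(r"^\s*(\S+)\s*\(([\d.]+)(?:/([\d.]+))?\)\s*$")


def parse_leaf(label):
    """Split a J48 leaf label into its predicted class and counts."""
    match = LEAF_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a leaf label: {label!r}")
    predicted, reached, wrong = match.groups()
    reached = float(reached)
    wrong = float(wrong) if wrong is not None else 0.0
    return {
        "predicted": predicted,          # the class the leaf votes for
        "reached": reached,              # rows that landed in this leaf
        "wrong": wrong,                  # rows whose class differs from the label
        "correct": reached - wrong,      # rows that match the label
        "purity": (reached - wrong) / reached,
    }


print(parse_leaf("0 (434.0/5.0)"))
```

For the leaf above, this reports 434 players reaching the leaf, 5 of them in the Hall of Fame, and 429 not.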

More interestingly, since this is the first division in the tree, it means that the best predictor for a basketball player making the hall of fame is how many free throws they made. Not points, not games, not position, but boring free throws. Why? The tree can't tell you that. That's up to you to decide.
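
How does the tree decide which field goes first? J48 is Weka's version of the C4.5 algorithm, which scores each candidate split by how much it reduces the "entropy" (the mixed-up-ness) of the class labels. Below is a simplified sketch in Python of that score, called information gain. Two caveats: J48 actually uses a normalized variant called gain ratio, which is left out here, and the six players in the example are invented toy data, not rows from the dataset.

```python
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * log2(p)
    return result


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting into value <= threshold and value > threshold."""
    left = [c for v, c in zip(values, labels) if v <= threshold]
    right = [c for v, c in zip(values, labels) if v > threshold]
    total = len(labels)
    remainder = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - remainder


# Hypothetical toy data, invented for illustration (not taken from the dataset).
ftm = [100, 500, 1500, 2500, 3000, 4000]
hall_of_fame = [0, 0, 0, 1, 1, 1]

good_split = information_gain(ftm, hall_of_fame, 1929)  # separates the classes cleanly
weak_split = information_gain(ftm, hall_of_fame, 300)   # peels off a single player
print(good_split, weak_split)
```

A split that separates the two classes cleanly scores higher than one that barely changes the mix, and the highest-scoring field ends up at the top of the tree.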

Moving down to the right, we see a node of "firstSeason" (the first season the player played in), and to the right of that "fga" (field goal attempts). One more move to the right brings us to a leaf reading "1(15.0/1.0)". This means most of the players in that leaf are class=1 and thus in the Hall of Fame. And "15.0/1.0" means that 15 players landed in the leaf, and 1 of them did not make the Hall of Fame (class=0), so the other 14 did (class=1). So if a player made a lot of free throws, started after 1962, and attempted over 12258 field goals, then there is a pretty good chance they made it into the Hall of Fame.
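
A tree is really just a stack of if-statements, and the two paths walked through above can be written out that way. This is only a sketch in Python: it covers just the two leaves described in this post, it assumes the "firstSeason" and "fga" splits sit at 1962 and 12258 (as the prose suggests), and it returns None for every branch not described here. The function name and the example players are hypothetical.

```python
def follow_described_path(ftm, first_season, fga):
    """Return 0 or 1 for the two leaves described in the text, else None."""
    if ftm <= 1929:
        return 0  # the leaf reading "0 (434.0/5.0)"
    if first_season > 1962 and fga > 12258:
        return 1  # the leaf reading "1(15.0/1.0)"
    return None  # branches this walkthrough does not cover


# Hypothetical players, made up for illustration.
print(follow_described_path(ftm=1000, first_season=1980, fga=5000))
print(follow_described_path(ftm=3000, first_season=1975, fga=15000))
print(follow_described_path(ftm=3000, first_season=1950, fga=15000))
```

The first player falls in the "not in the Hall of Fame" leaf, the second in the "Hall of Fame" leaf, and the third goes down a branch this post doesn't walk through.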

Now you're a real Data Scientist. Hopefully you'll get that $100,000 a year job soon!

If enough people ask, I'll show how you can use this tree to predict if a future player will end up in the Hall of Fame.