Sunday, August 17, 2014

Your Database is So Retro: Old Data, New Databases

(Reposted from Texas Enterprise Magazine)


  • The classic relational database, which functions like a spreadsheet and is best suited for data that can be organized neatly into specific categories, is not always the best option
  • Document-based databases can accommodate ambiguities among many different types of data, and are well suited to data lacking structure
  • Graph-based databases impose a network-like structure on data, bringing meaning to complicated “real world” data lacking obvious structure
As data grows in both size and scope, it’s time to consider new ways of storing it. Until recently, back-end hardware and software doing the heavy lifting essentially treated all data identically. Customer demographics, press releases, product photos, and web data were all considered slight variations of the same kind of information. And regardless of the form of that information, it was all shredded, pummeled, and re-boxed into something resembling a collection of spreadsheets. This old way is known as the relational database.
The relational database is robust, effective, and proven. It is logically consistent and easy to understand — think lots of tables with rows and columns just like a spreadsheet, where a cell can point to data in another table. Decades of engineering have made relational databases fast, reliable, and flexible. Almost all the popular business database technologies — most Oracle, Microsoft, and SAP products, and anything containing the letters “SQL” — are relational databases. For many kinds of data, especially data that might easily fit in a spreadsheet (demographics, inventory, sales leads), nothing beats a relational database.
But what if your data isn’t so neat and clean? What if you have millions of documents you need searched at a moment’s notice? What if your data is better represented as a sort of social network, where the relationships between data are just as important as the data itself? What if your data does not fit clearly into any sort of logical structure?
Many new database technologies have come onto the market in the last few years. While these alternative databases are not as time-tested as many relational database products, they offer distinct advantages over relational databases in some common business situations. It is worth understanding that in the era of “big data” not everything is the nail it appears to be when you’re wielding only a relational database hammer. Many organizations are switching over old relational databases to these new technologies and enjoying the advantages over their competition.
I discuss two of the most popular alternatives to relational databases below. These two types of alternative database styles, document-based and graph-based, fall under the buzzword “NoSQL.” NoSQL simply refers to any database that is not strictly a relational database. It does not mean these models don’t support SQL, which is a well-established technology for getting information out of a database; it means they are “not only SQL.” These alternatives provide a fundamentally different way of thinking about data than the relational database model.
Finally, please note that different database products can vary wildly, and that these descriptions should be taken as generic. Many vendors are working hard to make their NoSQL databases better while eliminating disadvantages. Also, the list of advantages and disadvantages is by no means complete, especially with regard to the more technical issues. Ask your favorite database administrator or search engine if you want to learn more.

Document-Based Databases

What they are: Rather than the tables underlying a relational database, document-based databases organize data into “documents,” which exist in the grey area between a web page and a traditional table found in a relational database. A document can take many forms (a business card, an encyclopedia entry, a web page, an annual report, an entire book) and can hold any sort of data.
Documents share definitions of the kind of data in them, but only when necessary. For example, a report and a book might need an “author” field, but a business card would not, and a document-based database can handle these ambiguities in stride. Documents can also be linked together and reference other documents or parts thereof.
Wikipedia, with its pages, users, images, and categories, functions like a document database. Document databases are especially suited for storing “unstructured” and “semi-structured” data, or even just data that has mixed structures. For example, think about a database for an e-retailer, where different types of products need information managed with greatly differing structures. It could be hard to fit different types of products with different information storage requirements into a relational database with a very rigid structure for information. But a document database can handle this problem quite well, happily storing products with different relevant information like CDs (artist, link to MP3s), clothes (size, color), and cars (make, model, year, repair history) without causing an information management nightmare.
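To make the e-retailer example concrete, here is a minimal sketch in plain Python rather than any particular database product. The product records, field names, and the `find` helper are all made up for illustration; a real document database would offer its own query interface.

```python
# Hypothetical product "documents": each is a plain dict with its own set of fields.
products = [
    {"type": "cd", "title": "Blue Album", "artist": "Example Band",
     "mp3_link": "https://example.com/blue.mp3"},
    {"type": "clothing", "title": "T-shirt", "size": "M", "color": "red"},
    {"type": "car", "title": "Used sedan", "make": "ExampleMake", "model": "X",
     "year": 2009, "repair_history": ["new brakes"]},
]

_MISSING = object()  # sentinel so that an absent field never matches a criterion


def find(documents, **criteria):
    """Return the documents whose fields equal every given criterion.

    A document that lacks one of the fields is simply not a match;
    no shared schema is required.
    """
    return [
        doc for doc in documents
        if all(doc.get(field, _MISSING) == value for field, value in criteria.items())
    ]


red_items = find(products, color="red")            # only the T-shirt has a color
cars_2009 = find(products, type="car", year=2009)  # only the car has a year
```

Note that nothing stops someone from storing "colour" on one product and "color" on another, which is the kind of error the disadvantages below refer to.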
Advantages: Allows for different kinds of data to be stored easily — you don’t have to make every document the same. Most document-based databases allow for very quick searching of text. The design of the database does not need to be set when you deploy it, and new types of information can easily be added. You don’t need to assign meaning to all the data you enter.
Disadvantages: You don’t need to assign meaning to all the data you enter (yes, that’s both an advantage and a disadvantage). Often slower than a relational database and often requires more storage space. Errors in how data is described can be easily introduced. Similar data is not necessarily treated as such. Not as many protections against duplication of data.
Popular products: MongoDB, CouchDB, MarkLogic

Graph-Based Databases

What they are: “Graph” does not mean “chart.” A graph is a mathematical system that can be described in terms of chunks of information (called “nodes”) and the relationships between these chunks of information (“edges”). Think of a social network: Individuals (nodes) are linked together by friendships (edges). Or a highway system: Towns (nodes) are linked together by roads (edges).
Different kinds of nodes and edges can be used in the same database to add many layers of meaning. Think of a corporate structure: Employees are nodes, the edges between two people are the relationships (teammate of, supervisor of, subordinate of), and employees can have many different relationships with their fellow employees. Projects can also be nodes, and projects can have edges with people (team member of, project lead of) and edges with other projects (dependent on, replaced by). Many kinds of data are well represented by graphs, but it requires a very different way of thinking about information.
Certain kinds of data that are almost impossible to represent in a relational database fit very well in a graph database. For example, a large bank wants to track the flow of support calls through its call center. Some support calls could be easily represented in a relational database — customer A calls at time B, talks to representative C about issue D, which is resolved. Other support calls would be essentially impossible to track in a relational database. Customer A calls at time B, talks to representative C about issue D. Representative C resolves issue D, but then customer A has issue E and F. Representative C transfers customer A to representative G, who can resolve issue E, with a note to then transfer them to representative H, who can resolve issue F. Sadly, representative G cannot resolve issue E, and wants to call back customer A. Customer A says to actually contact customer I, the other holder of the joint account. Such a scenario would require multiple relational databases working in tandem, and would be a sort of kludge. But this scenario is easy to store in a graph. All the letters in the previous example (customer, representatives, issues) would be nodes, and the various relationships (answered by, resolved by, transferred to, return call to) would be edges.
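As a rough illustration, the support call above can be written down as a list of edges in plain Python. This is only a sketch, not how any graph database product stores data; the "has issue" and "joint account with" relationship names are my own additions, and the `reachable` helper is hypothetical.

```python
from collections import defaultdict

# Each edge is (source node, relationship, target node).
# Node names follow the letters used in the example.
edges = [
    ("customer A", "answered by", "representative C"),
    ("customer A", "has issue", "issue D"),
    ("customer A", "has issue", "issue E"),
    ("customer A", "has issue", "issue F"),
    ("issue D", "resolved by", "representative C"),
    ("representative C", "transferred to", "representative G"),
    ("representative G", "transferred to", "representative H"),
    ("representative G", "return call to", "customer I"),
    ("customer A", "joint account with", "customer I"),
]

# Adjacency list: node -> list of (relationship, target) pairs.
graph = defaultdict(list)
for source, relationship, target in edges:
    graph[source].append((relationship, target))


def reachable(start, relationship):
    """Follow only edges of one relationship type, returning nodes in visit order."""
    seen, stack, order = {start}, [start], []
    while stack:
        node = stack.pop()
        for rel, target in graph.get(node, []):
            if rel == relationship and target not in seen:
                seen.add(target)
                order.append(target)
                stack.append(target)
    return order


# Who does the call pass through after representative C?
transfer_chain = reachable("representative C", "transferred to")

# Which of customer A's issues have no "resolved by" edge yet?
issues = [target for rel, target in graph["customer A"] if rel == "has issue"]
unresolved = [
    issue for issue in issues
    if not any(rel == "resolved by" for rel, _ in graph.get(issue, []))
]
```

Adding a new twist to the call, such as another transfer or a second callback, is just one more edge in the list; no existing structure has to change.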
Advantages: Makes it easier to express many kinds of data that require significant kludging to fit in a relational database. Certain kinds of searches that are very difficult in a relational database (i.e., any search where relationships between different kinds of data are important) are very quick and easy. Easily allows for new kinds of data. Very well suited to the irregular, complex data involved in mapping the “real world.”
Disadvantages: Operations on large amounts of data can be very slow. Can use a lot of space. Not widely used in business environments (yet). Very easy to describe data inconsistently, which can quickly reduce the usefulness of the database. Generally requires all data to exist explicitly in relation to other data. Can be conceptually difficult to understand at first.
Popular products: Neo4j, Titan, FlockDB

Thursday, August 7, 2014

Projects: Drugs, Diabetes, Politics, Walmart

I've finally found the time to post about some of the projects I worked on during my second semester in the MS in Business Analytics program at the University of Texas.

First is a project where we text mined an internet forum discussing illegal drugs. This is one of our visualizations, which shows which drugs are commonly taken together:



Next is a project where we analyzed tweets from US Senators to determine what gets retweeted and how Senators can increase their retweets. We learned that being Ted Cruz helps a lot. But even if you aren't Ted Cruz, things like time of day, topics discussed in the tweet, hashtags, and pictures affect the number of retweets.

Then there was a project where we predicted diabetes from electronic medical records. The data came from an old Kaggle competition, but on expert advice we changed our model to optimize for lift (determining likelihood of diabetes) rather than simple yes/no predictions as in the original contest. A big challenge in this project was determining how to deal with the 1000 or so attributes we had to analyze. The full report goes into great detail about different feature selection methods and how they performed relative to one another.
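
For readers unfamiliar with lift, here is a small sketch of one common way to compute it: the rate of positives among the highest-scored cases divided by the overall rate. The scores and labels are made up, not the Kaggle data, and this is not necessarily the exact definition we used in the report.

```python
def lift_at(scores, labels, fraction):
    """Positive rate in the top-scored fraction divided by the overall positive rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Made-up model scores and true outcomes (1 = has diabetes) for ten patients.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Both of the top two patients are positive, against 3 in 10 overall.
lift_top_20 = lift_at(scores, labels, 0.2)
```

A lift above 1 means the top-scored group contains more true cases than a random group of the same size would.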


Monday, June 16, 2014

Talking about IBM Watson with Sandy Carter

At SXSW this year, I was interviewed by IBM executive Sandy Carter about IBM's Watson computer system, and she wrote a brief blog post about it. You can see the interview here.

Seeing yourself on camera is strange. I can't tell if I look good or not, so I'd love to hear what people think. And people who have lots of camera experience, do you have any tips? Oh, and those stripes....

Friday, April 4, 2014

Surveillance Vs. Business

I wrote another article for Texas Enterprise, this time about the coming data backlash, how the NSA is bad for business, and what businesses can do about all of this. Read it!

Sunday, March 30, 2014

I presented a research poster

Two of my colleagues and I presented a research poster at a recent UT conference. We built a predictive model to improve the efficacy of tax audits. Better make sure your tax return is clean this year!

You can read more about it and see the poster on my website.

And thank you to my colleagues, Nicole White and Ying Du.

Tuesday, March 11, 2014

I'm Writing for Another Website

Why has this blog been so silent? Because I'm now writing monthly for Texas Enterprise, a web magazine out of UT. I will still be updating this blog, but I'll also be writing monthly there.

My first article is up now, and it's basically a bumbling rant about how saying "big data" makes you sound a bit...dumb. Please read and leave snarky comments here.

Tuesday, January 28, 2014

Why did my bank say a legitimate credit card charge was fraudulent, but fail to detect actual fraudulent charges?

A friend of mine posted this to Facebook recently:

Modern life may be cushy, but it is not without its trials. Who hasn't experienced the frustration of getting that phone call from the bank (or worse, having their card stop working) because a charge on their card was flagged as possibly fraudulent? Experiencing false-fraud on a card is becoming more and more common--in a very informal survey of eleven of my nearby colleagues, eight have had fraudulent charges appear on their card (and four were affected by the recent Target data breach, but that's another topic altogether).

So what exactly is going on here? Why are banks flagging legitimate charges as fraudulent while plenty of truly fraudulent charges still go through? The answer: because of the software.

Every charge you make on a piece of plastic goes through a computer program that attempts to determine if your charge is fraudulent. Into this program goes information about you, about the transaction you just made, about your location, about contact you've had with the bank recently, about pretty much anything the bank can find. This computer program then does one of three things:
1) Nothing (the charge is almost certainly not fraud).
2) Approve the charge, but notify the customer of suspicious activity (the charge might be fraud).
3) Deny the charge, and cut the card off until the customer is reached (the charge is almost certainly fraud).
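
The three outcomes above can be sketched as a function of a single fraud score. This is purely illustrative: the `decide` function and both cutoff values are hypothetical, and real systems are far more involved.

```python
def decide(fraud_probability, notify_at=0.5, deny_at=0.9):
    """Map a fraud score between 0 and 1 to one of the three actions.

    The cutoffs notify_at and deny_at are made-up values; a bank would
    choose its own.
    """
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    if fraud_probability >= deny_at:
        return "deny"                # almost certainly fraud
    if fraud_probability >= notify_at:
        return "approve_and_notify"  # might be fraud
    return "approve"                 # almost certainly not fraud


low = decide(0.10)
medium = decide(0.60)
high = decide(0.95)
```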

The exact workings of how the input turns into one of the decisions at the end is something of a trade secret, but a lot of the general workings are very well understood.

The short answer is that the software to detect fraud actually learns from past patterns of fraud. It might be that the store you used your card at is historically 30% more likely to have a fraudulent transaction than the store next door, and this would factor into the decision. It might be that you are transacting online, and online transactions are 150% more likely to be fraudulent. It might be something far more complicated than either of these examples. The real truth is that the patterns the fraud-detection software "learns" are often far too complicated for humans to really grasp the significance of.  It is well-documented that this self-learning software is far better at things like fraud detection than a team of humans attempting to come up with rules on their own. All the humans know is that the software works well enough, and saves a lot of money by flagging fraudulent activity that would otherwise go undetected.

Obviously, there is some tweaking, inspection, and overriding done by the software's human masters. And banks have a lot of leeway to define where the lines between clear fraud, possible fraud, and not fraud fall (this is the part of the system that most involves human insight, and is most often the part of the system that makes or breaks its usefulness). But this gives a general idea of what's going on behind the scenes after you swipe your card.

So why does this fraud-detection software fail so much? Part of the issue is perception--we tend to focus on the mistakes the software makes, and don't ever consider how much worse a human would be. But the human masters are at work behind the scenes messing things up. Since the cost of missed fraud is often much higher than the cost of inconvenience for a false flag, even when you include the loss of customer goodwill, banks tweak their algorithms to be a bit overzealous in what is marked as fraud; they aim for the "sweet spot" between catching as much fraud as possible and not angering their customers too much.
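
Here is a toy version of that trade-off. Every number is invented: the scored charges, the candidate cutoffs, and the dollar costs are assumptions for illustration only.

```python
def total_cost(threshold, scored_charges, missed_fraud_cost, false_flag_cost):
    """Add up the cost of missed fraud and of false flags at a given cutoff."""
    cost = 0.0
    for score, is_fraud in scored_charges:
        flagged = score >= threshold
        if is_fraud and not flagged:
            cost += missed_fraud_cost   # fraud that slipped through
        elif flagged and not is_fraud:
            cost += false_flag_cost     # legitimate charge that was flagged
    return cost


# Made-up (fraud score, actually fraud?) pairs for seven past charges.
charges = [(0.95, True), (0.7, False), (0.6, True), (0.4, False),
           (0.3, True), (0.2, False), (0.1, False)]
thresholds = [0.25, 0.5, 0.75]

# When missed fraud costs far more than a false flag, the lowest cutoff wins.
overzealous = min(thresholds, key=lambda t: total_cost(t, charges, 500, 20))

# Flip the costs and the highest cutoff wins instead.
lenient = min(thresholds, key=lambda t: total_cost(t, charges, 20, 500))
```

With these made-up costs, the cheapest policy is the one that flags the most charges, false alarms included.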

Ultimately, the short answer to my friend's question is, "We built an algorithm to build a fraud detection system; that algorithm went and built the actual algorithm to determine if a charge is fraud or not. The specifics of what the fraud detection algorithm flags are far less important to us than the algorithm's ability to be right enough to justify its existence."

But here we have two cases of the fraud-detection algorithm failing. So what happened to my friend? Let's begin with the recent Evernote and Netflix charges. Why would a fraud detection system flag these charges as possibly fraudulent? First, it is important to understand that most stolen credit card numbers are coming from large leaks like the Target breach, and not the theft of the physical card or your waiter at Friday's stealing your card information when he takes your card away to run the charges.

Now, put yourself in the mind of someone who has stolen credit card numbers, say a few thousand you purchased from some underground website. You know that only some of these numbers are going to work. You have the means to create a fake card from the information you have, but just walking into a store with one of your newly minted fake cards might not end well. Your purchase may get denied--or worse, the card might already be marked as stolen and you'd be in a lot of trouble. So what do you do? You have a software script that you feed your stolen numbers into, and it goes and makes accounts for subscription-based web services. Why such services? Because there is a time-based factor--you can use the fake account you made to check and see if the subscription still works the following month. And if the subscription still works, you know you have a goldmine: a credit card that works, and has an owner that doesn't pay much attention to their credit card statements. You're assured that this card probably won't be turned away if you use it in person, and if you're really tricky you can make multiple charges over a long period of time.

Because of behavior like this, subscription-based web services are going to be seen by the fraud-detection algorithm as more suspect than, say, groceries at Aldi. Throw in a few other factors (that my friend was abroad at the time and presumably her card company knew that, and possibly her activity was erratic on top of that) and suddenly your transaction is getting flagged. Incorrectly.

But wait....shouldn't the fraud detection algorithm be able to see these are charges that have been happening for months and months? Yes, it should. But it is important to understand that the whole fraud-detection process has to happen very quickly. Processing times for credit cards are measured in seconds, and a lot of that time is transmission of the information to the processing center. The software that detects fraud generally only has milliseconds to make a decision. This obviously complicates things, and makes the engineering of these systems extremely complicated. Verifying every charge that comes in from a subscription-based service against a search of the past month of card activity is slow--it involves looking into a database of past transactions, and a lookup in a database that big is going to be too slow (this is changing, and there are ways to sort of get around this, but that's a digression).

Now what about four Aldis in one day? Well, my friend said that was a year ago, and since these algorithms get better over time that could be one explanation. Another is that perhaps four of the same chain in one day isn't so weird--there are a lot of psychotic coupon-clippers out there. And I bet we all can name a college student who has eaten Chipotle four times in a 24-hour period. While I believe Aldi doesn't take coupons, it's very likely the fraud detection system doesn't know that.

And I can say this for certain--back in 2013, spending $2,000 at four Aldis in one day was not as fraud-like a pattern as two subscription-based Internet services in early 2014 (a subtle point: the definition of fraud-like changed from 2013 to 2014 as the algorithm continued to learn). Fraud-detection algorithms, above all else, are a product of their past successes and failures. The larger point to be made is we are not as smart as we think we are--unfortunately, overall, the algorithm is going to be better at fraud detection than any human. But when the algorithm is wrong, we can take comfort in the fact that we're not completely outdated (yet?).