Sunday, August 17, 2014

Your Database is So Retro: Old Data, New Databases

(Reposted from Texas Enterprise Magazine)

Takeaway

  • The classic relational database, which functions like a spreadsheet and is best suited for data that can be organized neatly into specific categories, is not always the best
  • Document-based databases can accommodate ambiguities among many different types of data, and are well suited to data lacking structure
  • Graph-based databases impose a network-like structure on data, bringing meaning to complicated “real world” data lacking obvious structure
As data grows in both size and scope, it’s time to consider new ways of storing it. Until recently, back-end hardware and software doing the heavy lifting essentially treated all data identically. Customer demographics, press releases, product photos, and web data were all considered slight variations of the same kind of information. And regardless of the form of that information, it was all shredded, pummeled, and re-boxed into something resembling a collection of spreadsheets. This old way is known as the relational database.
The relational database is robust, effective, and proven. It is logically consistent and easy to understand — think lots of tables with rows and columns just like a spreadsheet, where a cell can point to data in another table. Decades of engineering have made relational databases fast, reliable, and flexible. Almost all the popular business database technologies — most Oracle, Microsoft, and SAP products, and anything containing the letters “SQL” — are relational databases. For many kinds of data, especially data that might easily fit in a spreadsheet (demographics, inventory, sales leads), nothing beats a relational database.
But what if your data isn’t so neat and clean? What if you have millions of documents you need searched at a moment’s notice? What if your data is better represented as a sort of social network, where the relationships between data are just as important as the data itself? What if your data does not fit clearly into any sort of logical structure?
Many new database technologies have come onto the market in the last few years. While these alternative databases are not as time-tested as many relational database products, they offer distinct advantages over relational databases in some common business situations. It is worth understanding that in the era of “big data” not everything is the nail it appears to be when you’re wielding only a relational database hammer. Many organizations are switching over old relational databases to these new technologies and enjoying the advantages over their competition.
I discuss two of the most popular alternatives to relational databases below. These two types of alternative database styles, document-based and graph-based, fall under the buzzword “NoSQL.” NoSQL simply refers to any database that is not strictly a relational database. It does not mean these models don’t support SQL, which is a well-established technology for getting information out of a database; it means they are “not only SQL.” These alternatives provide a fundamentally different way of thinking about data than the relational database model.
Finally, please note that different database products can vary wildly, and that these descriptions should be taken as generic. Many vendors are working hard to make their NoSQL databases better while eliminating disadvantages. Also, the list of advantages and disadvantages is by no means complete, especially in regards to issues that are more technical. Ask your favorite database administrator or search engine if you want to learn more.

Document-Based Databases

What they are: Rather than the tables underlying a relational database, document-based databases organize data into “documents,” which exist in the grey area between a web page and a traditional table found in a relational database. A document can take many forms — a business card, an encyclopedia entry, a web page, an annual report, an entire book — but can be any sort of data.
Documents share definitions of the kind of data in them, but only when necessary. For example, a report and a book might need an “author” field, but a business card would not, and a document-based database can handle these ambiguities in stride. Documents can also be linked together and reference other documents or parts thereof.
Wikipedia, with its pages, users, images, and categories functions like a document database. Document databases are especially suited for storing “unstructured” and “semi-structured” data, or even just data that has mixed structures. For example, think about a database for an e-retailer, where different types of products need information managed with greatly differing structures. It could be hard to fit different types of products with different information storage requirements into a relational database with a very rigid structure for information. But a document database can handle this problem quite well, happily storing products with different relevant information like CDs (artist, link to MP3s), clothes(size, color), and cars(make, model, year, repair history) without causing an information management nightmare.
Advantages: Allows for different kinds of data to be stored easily — you don’t have to make every document the same. Most document-based databases allow for very quick searching of text. The design of the database does not need to be set when you deploy it, and new types of information can easily be added. You don’t need to assign meaning to all the data you enter.
Disadvantages: You don’t need to assign meaning to all the data you enter (yes, that’s both an advantage and a disadvantage). Often slower than a relational database and often requires more storage space. Errors in how data is described can be easily introduced. Similar data is not necessarily treated as such. Not as many protections against duplication of data.
Popular products: MongoDB, CouchDB, MarkLogic

Graph-Based Databases

What they are: “Graph” does not mean “chart.” A graph is a mathematical system that can be described in terms of chunks of information (called “nodes”) and the relationships between these chunks of information (“edges”). Think of a social network: Individuals (nodes) are linked together by friendships (edges). Or a highway system: Towns (nodes) are linked together by roads (edges).
Different kinds of nodes and edges can be used in the same database to add many layers of meaning. Think of a corporate structure: Employees are nodes, the edges between two people are the relationship — teammate of, supervisor of, subordinate of — and employees can have many different relationships with their fellow employees. Projects can also be nodes, and projects can have edges with people — team member of, project leadof — and edges with other projects — dependent on, replaced by. Many kinds of data are well represented by graphs, but it requires a very different way of thinking about information.
Certain kinds of data that are almost impossible to represent in a relational database fit very well in a graph database. For example, a large bank wants to track the flow of support calls through its call center. Some support calls could be easily represented in a relational database — customer A calls at time B, talks to representative C about issue D, which is resolved. Other support calls would be essentially impossible to track in a relational database. Customer A calls at time B, talks to representative C about issue D. Representative C resolves issue D, but then customer A has issue E and F. Representative C transfers customer A to representative G, who can resolve issue E, with a note to then transfer them to representative H, who can resolve issue F. Sadly, representative G cannot resolve issue E, and wants to call back customer A. Customer A says to actually contact customer I, the other holder of the joint account. Such a scenario would require multiple relational databases working in tandem, and would be a sort of kludge. But this scenario is easy to store in a graph. All the letters in the previous example (customer, representatives, issues) would be nodes, and the various relationships (answered by, resolved by, transferred to, return call to) would be edges.
Advantages: Makes it easier to express many kinds of data that require significant kludging to fit in a relational database. Certain kinds of searches that are very difficult in a relational database (i.e., any search where relationships between different kinds of data are important) are very quick and easy. Easily allows for new kinds of data. Very well suited to the irregular, complex data involved in mapping the “real world.”
Disadvantages: Operations on large amounts of data can be very slow. Can use a lot of space. Not widely used in business environments (yet). Very easy to describe data inconsistently, which can quickly reduce the usefulness of the database. Generally requires all data to exist explicitly in relation to other data. Can be conceptually difficult to understand at first.
Popular products: Neo4j, Titan, FlockDB

Thursday, August 7, 2014

Projects: Drugs, Diabetes, Politics, Walmart

I've finally found the time to post about some of the projects I worked on during my second semester in the MS in Business Analytics program at the University of Texas.

First is a project where we text mined an internet forum discussing illegal drugs. This is one of our visualizations, which shows which drugs are commonly taken together:

 

                          


Next is a project where we analyzed tweets from US Senators to determine what gets retweeted and how Senators can increase their retweets. We learned that being Ted Cruz helps a lot. But even if you aren't Ted Cruz, things like time of day, topics discussed in the tweet, hashtags, and pictures affect the number of retweets.

Then there was a project where we predicted diabetes from electronic medical records. The data came from an old Kaggle competition, but on expert advice we changed our model to optimize for lift (determining likelihood of diabetes) rather than simple yes/no predictions as in the original contest. A big challenge in this project was determining how to deal with the 1000 or so attributes we had to analyze. The full report goes into great detail about different feature selection methods and how they performed relative to one another.