Tuesday, January 6, 2015

5 Big Data Trends to Watch for in 2015

(Reposted from Texas Enterprise Magazine)


  • Hadoop will lose its status as the go-to big data buzzword.
  • The low cost and high convenience of the cloud will become increasingly key to successful big data projects.
  • Security and the difficulty of managing big data projects in production will be the key challenges.
2014 saw big data IPOs, huge companies betting their future on big data, and a flood of new tools and technologies promising solutions to big data problems.
2015 will be an important year as well, as big data becomes more mainstream and the growing amount of stored information makes analytics increasingly important. Here are five trends to watch for in 2015:

1 – Hadoop is Dead! (Long live Hadoop!)

There is no single buzzword more associated with big data than Hadoop. The two terms have become almost interchangeable in some contexts. But 2015 will see Hadoop's importance wane, and executives everywhere will scramble for a new buzzword.
Hadoop has always had issues. It is too intimidating and does not do enough on its own. The first solutions to the Hadoop-is-scary problem were tools like Pig and Hive, which sat on top of Hadoop and made big data problems more manageable. Next, companies like Hortonworks, MapR, and Cloudera began providing a Hadoop stack that let companies focus less on IT and more on how they were going to actually solve their big data problems.
The natural progression for these Hadoop vendors is to continue to differentiate their offerings and expertise, with the goal of having potential customers ask for them by name rather than generic Hadoop. Cloudera arguably began this with Impala, and the pile of cash Hortonworks is sitting on post-IPO will let them invest heavily in R&D. Meanwhile, companies with big data problems not appropriate for a turnkey solution will look less towards Hadoop and more to next-generation technologies like Apache Spark. These new offerings build on Hadoop, so the technology isn’t going anywhere; it is the Hadoop name that will have clearly jumped the shark by the end of 2015.
In short, private Hadoop vendors will move towards marketing themselves as distinct solution providers rather than Hadoop implementers, and the organizations doing big data from scratch will increasingly find newer technologies that offer more power than Hadoop while being easier to wrangle.

2 – Big Data Continues to the Cloud

IT folk in the know understand that cloud is inevitable, and big data is no exception. What will make big data take up cloud even faster than other technologies is how well suited the cloud is to big data.
Almost all big data solutions (including Hadoop) are run on large clusters of essentially off-the-shelf computers. Many organizations have a need for a large cluster, consisting of dozens or even hundreds of computers, but they do not need all this power 24/7. And with normal rates of hardware failure, even a mid-sized cluster can require a full-time staff just to run around and swap defective parts. These two issues — maintaining computing power that lies idle, and staffing an IT department — are issues instantly solved by a well-designed cloud platform. It simply does not make economic sense for an organization to maintain its own cluster, unless it’s dealing with Google-scale big data problems.
Another trend in 2015 will also drive big data work to the cloud:

3 – Big Data as a Service

Traditionally, big data on the cloud has meant virtual machines turned on or off by an organization as computing needs change (think Amazon’s Elastic MapReduce). But more and more, big data problems will be solved by interfaces and software living in the cloud rather than by virtual machines that organizations need to manage on their own.
Google fired a warning shot last year when it announced upcoming public access to its internal big data service Dataflow, which essentially lets users run code without worrying about the management of big data ETL pipelines. And an ever-growing array of startups is offering big data solutions as a service. Ersatz Labs is one such startup, offering a simple Web interface to build deep learning models — a technique that as recently as a year ago was only known to academics and researchers. More and more, these big data-centric services will make running a Hadoop cluster unnecessary overhead for the majority of organizations.

4 – Security Slows Down Big Data

Information security has become a big deal. Last year’s Target breach showed just how vulnerable many companies are, and the recent Sony Pictures hack crippled a multi-billion dollar organization. Security, long considered an afterthought, is now at the forefront of business leaders’ minds moving into 2015. No part of corporate IT will be spared, including big data.
So how will a long overdue shift to security-centric IT affect big data? Unfortunately, it is going to make many things slower and more difficult. Many big data technologies are simply not built with security in mind. 2015 is likely the year we hear about a data breach of a Hadoop cluster, with hackers downloading an amount of data that makes Sony’s 11 lost terabytes look puny. Data breaches will lead to panic, which will lead to duct-tape solutions that maybe fix the security holes but leave big data practitioners pulling their hair out contending with walls of security presently not in place.

5 – Poorly Maintained Machine Learning Comes to Haunt Early Adopters

While many organizations still struggle to implement machine learning, some organizations are far along and have run into a new problem: large bases of machine learning models are incredibly difficult to maintain, much more so than a codebase.
Google published research this year detailing struggles with maintaining a large number of machine learning models. While Google has far more machine learning models in production than most organizations, this is not a problem that will go away. Google was just one of the first to encounter this problem, and when Google encounters a big data problem, it is a problem other organizations will face in the future. (Fun fact: over 10 years ago, Google was one of the first companies to publicly discuss Hadoop-scale problems.)
While 2015 will probably not see hundreds of organizations struggling with machine learning gone out of control, 2015 will be the beginning of this discussion — a discussion that will ultimately lead to a boom in demand for an almost impossible to find skill set: data scientists who understand high-level systems architecture.
Do you agree? Disagree? Are any key trends missing? What are your big data predictions for 2015?

Sunday, August 17, 2014

Your Database is So Retro: Old Data, New Databases

(Reposted from Texas Enterprise Magazine)


  • The classic relational database, which functions like a spreadsheet and is best suited for data that can be organized neatly into specific categories, is not always the best fit
  • Document-based databases can accommodate ambiguities among many different types of data, and are well suited to data lacking structure
  • Graph-based databases impose a network-like structure on data, bringing meaning to complicated “real world” data lacking obvious structure
As data grows in both size and scope, it’s time to consider new ways of storing it. Until recently, back-end hardware and software doing the heavy lifting essentially treated all data identically. Customer demographics, press releases, product photos, and web data were all considered slight variations of the same kind of information. And regardless of the form of that information, it was all shredded, pummeled, and re-boxed into something resembling a collection of spreadsheets. This old way is known as the relational database.
The relational database is robust, effective, and proven. It is logically consistent and easy to understand — think lots of tables with rows and columns just like a spreadsheet, where a cell can point to data in another table. Decades of engineering have made relational databases fast, reliable, and flexible. Almost all the popular business database technologies — most Oracle, Microsoft, and SAP products, and anything containing the letters “SQL” — are relational databases. For many kinds of data, especially data that might easily fit in a spreadsheet (demographics, inventory, sales leads), nothing beats a relational database.
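To make "a cell can point to data in another table" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are all hypothetical and exist only for illustration; they do not come from any real system.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customers(id),"  # the 'cell' that points at another table
    " item TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "Austin"), (2, "Grace", "Dallas")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, "laptop"), (11, 1, "monitor"), (12, 2, "keyboard")],
)


def orders_for(name):
    """Follow the customer_id pointer with a JOIN to list one customer's items."""
    rows = conn.execute(
        "SELECT o.item FROM orders o"
        " JOIN customers c ON c.id = o.customer_id"
        " WHERE c.name = ? ORDER BY o.id",
        (name,),
    ).fetchall()
    return [item for (item,) in rows]
```

Calling orders_for("Ada") walks from the customers table to the orders table through the shared id, which is the basic move behind almost every relational query.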
But what if your data isn’t so neat and clean? What if you have millions of documents you need searched at a moment’s notice? What if your data is better represented as a sort of social network, where the relationships between data are just as important as the data itself? What if your data does not fit clearly into any sort of logical structure?
Many new database technologies have come onto the market in the last few years. While these alternative databases are not as time-tested as many relational database products, they offer distinct advantages over relational databases in some common business situations. It is worth understanding that in the era of “big data” not everything is the nail it appears to be when you’re wielding only a relational database hammer. Many organizations are switching over old relational databases to these new technologies and enjoying the advantages over their competition.
I discuss two of the most popular alternatives to relational databases below. These two alternative database styles, document-based and graph-based, fall under the buzzword “NoSQL.” NoSQL simply refers to any database that is not strictly a relational database. It does not mean these models don’t support SQL, which is a well-established language for getting information out of a database; it means they are “not only SQL.” These alternatives provide a fundamentally different way of thinking about data than the relational database model.
Finally, please note that different database products can vary wildly, and that these descriptions should be taken as generic. Many vendors are working hard to make their NoSQL databases better while eliminating disadvantages. Also, the list of advantages and disadvantages is by no means complete, especially with regard to issues that are more technical. Ask your favorite database administrator or search engine if you want to learn more.

Document-Based Databases

What they are: Rather than the tables underlying a relational database, document-based databases organize data into “documents,” which exist in the grey area between a web page and a traditional table found in a relational database. A document can take many forms (a business card, an encyclopedia entry, a web page, an annual report, an entire book) and can hold any sort of data.
Documents share definitions of the kind of data in them, but only when necessary. For example, a report and a book might need an “author” field, but a business card would not, and a document-based database can handle these ambiguities in stride. Documents can also be linked together and reference other documents or parts thereof.
Wikipedia, with its pages, users, images, and categories, functions like a document database. Document databases are especially suited for storing “unstructured” and “semi-structured” data, or even just data that has mixed structures. For example, think about a database for an e-retailer, where different types of products need information managed with greatly differing structures. It could be hard to fit different types of products with different information storage requirements into a relational database with a very rigid structure for information. But a document database can handle this problem quite well, happily storing products with different relevant information like CDs (artist, link to MP3s), clothes (size, color), and cars (make, model, year, repair history) without causing an information management nightmare.
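The e-retailer example can be sketched in a few lines of plain Python, with dictionaries standing in for documents. This is not how MongoDB or any other product works internally; the field names, the sample products, and the find helper are all hypothetical, chosen only to show documents with different shapes living side by side.

```python
# Hypothetical product "documents"; every field name here is illustrative.
products = [
    {"type": "cd", "title": "Example Album", "artist": "Example Artist",
     "mp3_links": ["track01.mp3"]},
    {"type": "clothing", "title": "T-shirt", "size": "M", "color": "blue"},
    {"type": "car", "title": "Used sedan", "make": "ExampleMake",
     "model": "ExampleModel", "year": 2009, "repair_history": ["new brakes"]},
]


def find(collection, **criteria):
    """Return documents that contain every requested field with the requested value.

    A document that lacks one of the fields is skipped rather than raising an error.
    """
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


def fields_in_use(collection):
    """List every field name that appears in at least one document."""
    return sorted({key for doc in collection for key in doc})
```

Note that nothing stops someone from saving a shirt with a "colour" field instead of "color"; find(products, color="blue") would quietly miss it. That is the flexibility and the risk of this style in one example.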
Advantages: Allows for different kinds of data to be stored easily — you don’t have to make every document the same. Most document-based databases allow for very quick searching of text. The design of the database does not need to be set when you deploy it, and new types of information can easily be added. You don’t need to assign meaning to all the data you enter.
Disadvantages: You don’t need to assign meaning to all the data you enter (yes, that’s both an advantage and a disadvantage). Often slower than a relational database and often requires more storage space. Errors in how data is described can be easily introduced. Similar data is not necessarily treated as such. Not as many protections against duplication of data.
Popular products: MongoDB, CouchDB, MarkLogic

Graph-Based Databases

What they are: “Graph” does not mean “chart.” A graph is a mathematical system that can be described in terms of chunks of information (called “nodes”) and the relationships between these chunks of information (“edges”). Think of a social network: Individuals (nodes) are linked together by friendships (edges). Or a highway system: Towns (nodes) are linked together by roads (edges).
Different kinds of nodes and edges can be used in the same database to add many layers of meaning. Think of a corporate structure: employees are nodes, and the edges between two people are their relationships (teammate of, supervisor of, subordinate of). Employees can have many different relationships with their fellow employees. Projects can also be nodes, with edges to people (team member of, project lead of) and edges to other projects (dependent on, replaced by). Many kinds of data are well represented by graphs, but it requires a very different way of thinking about information.
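Here is a minimal sketch of the corporate structure as typed nodes and labeled edges, in plain Python rather than in any particular graph database. The employee and project names are made up, and the neighbors and sources helpers are hypothetical conveniences, not part of any product's API.

```python
# Hypothetical org data: node names and types are illustrative only.
nodes = {
    "alice": "employee",
    "bob": "employee",
    "carol": "employee",
    "apollo": "project",
    "zeus": "project",
}

# Each edge is (source node, relationship label, target node).
edges = [
    ("alice", "supervisor of", "bob"),
    ("bob", "teammate of", "carol"),
    ("alice", "project lead of", "apollo"),
    ("bob", "team member of", "apollo"),
    ("zeus", "dependent on", "apollo"),
]


def neighbors(source, relation=None):
    """Nodes that `source` points to, optionally limited to one relationship label."""
    return [t for s, r, t in edges if s == source and (relation is None or r == relation)]


def sources(target, relation=None):
    """Nodes that point to `target`, optionally limited to one relationship label."""
    return [s for s, r, t in edges if t == target and (relation is None or r == relation)]
```

Asking "who or what is connected to the apollo project?" is a single call, sources("apollo"), and it returns both people and another project because the graph does not care what kind of node sits at either end of an edge.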
Certain kinds of data that are very awkward to represent in a relational database fit very well in a graph database. For example, a large bank wants to track the flow of support calls through its call center. Some support calls could be easily represented in a relational database: customer A calls at time B, talks to representative C about issue D, which is resolved. Other support calls would be far harder to track in a relational database. Customer A calls at time B, talks to representative C about issue D. Representative C resolves issue D, but then customer A has issues E and F. Representative C transfers customer A to representative G, who can resolve issue E, with a note to then transfer them to representative H, who can resolve issue F. Sadly, representative G cannot resolve issue E, and wants to call back customer A. Customer A says to actually contact customer I, the other holder of the joint account. Such a scenario would require many relational tables joined together, and would be a sort of kludge. But this scenario is easy to store in a graph. All the letters in the previous example (customers, representatives, issues) would be nodes, and the various relationships (answered by, resolved by, transferred to, return call to) would be edges.
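The support-call scenario above can be sketched as a small graph and walked with a breadth-first search. This is a simplified illustration in plain Python, not a real graph database: the "raised" relationship is my own addition so issues can be tied to a customer, the call time B is left out, and the planned hand-off from G to H is stored as an ordinary "transferred to" edge.

```python
from collections import deque

# Edges are (source, relationship, target). "raised" is an assumed label
# added for this sketch; the others come from the scenario in the text.
edges = [
    ("customer A", "raised", "issue D"),
    ("customer A", "raised", "issue E"),
    ("customer A", "raised", "issue F"),
    ("customer A", "answered by", "rep C"),
    ("issue D", "resolved by", "rep C"),
    ("rep C", "transferred to", "rep G"),
    ("rep G", "transferred to", "rep H"),
    ("rep G", "return call to", "customer I"),
]


def reachable(start, relations):
    """Breadth-first walk from `start`, following only edges whose label is in `relations`."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for source, relation, target in edges:
            if source == node and relation in relations and target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


def unresolved_issues(customer):
    """Issues the customer raised that have no 'resolved by' edge yet."""
    raised = [t for s, r, t in edges if s == customer and r == "raised"]
    resolved = {s for s, r, t in edges if r == "resolved by"}
    return [issue for issue in raised if issue not in resolved]
```

Following only "transferred to" edges from rep C gives the hand-off chain, while unresolved_issues("customer A") shows what is still open. Both questions are answered by walking edges, with no table design decided in advance.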
Advantages: Makes it easier to express many kinds of data that require significant kludging to fit in a relational database. Certain kinds of searches that are very difficult in a relational database (i.e., any search where relationships between different kinds of data are important) are very quick and easy. Easily allows for new kinds of data. Very well suited to the irregular, complex data involved in mapping the “real world.”
Disadvantages: Operations on large amounts of data can be very slow. Can use a lot of space. Not widely used in business environments (yet). Very easy to describe data inconsistently, which can quickly reduce the usefulness of the database. Generally requires all data to exist explicitly in relation to other data. Can be conceptually difficult to understand at first.
Popular products: Neo4j, Titan, FlockDB

Thursday, August 7, 2014

Projects: Drugs, Diabetes, Politics, Walmart

I've finally found the time to post about some of the projects I worked on during my second semester in the MS in Business Analytics program at the University of Texas.

First is a project where we text mined an internet forum discussing illegal drugs. This is one of our visualizations, which shows which drugs are commonly taken together:



Next is a project where we analyzed tweets from US Senators to determine what gets retweeted and how Senators can increase their retweets. We learned that being Ted Cruz helps a lot. But even if you aren't Ted Cruz, things like time of day, topics discussed in the tweet, hashtags, and pictures affect the number of retweets.

Then there was a project where we predicted diabetes from electronic medical records. The data came from an old Kaggle competition, but on expert advice we changed our model to optimize for lift (determining likelihood of diabetes) rather than simple yes/no predictions as in the original contest. A big challenge in this project was determining how to deal with the 1000 or so attributes we had to analyze. The full report goes into great detail about different feature selection methods and how they performed relative to one another.
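For readers unfamiliar with lift, here is a minimal sketch of one common way to compute it: the rate of positives among the highest-scored patients divided by the rate among all patients. This is an assumed textbook definition, not necessarily the exact metric used in our report, and the lift_at function and its fraction parameter are hypothetical names.

```python
def lift_at(scores, labels, fraction=0.1):
    """Positive rate in the top-scored `fraction` of cases divided by the overall positive rate.

    `scores` are model outputs (higher means more likely positive) and
    `labels` are 0/1 outcomes in the same order.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positive cases")
    # Always look at one case or more, even for tiny inputs.
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:k]) / k
    return top_rate / base_rate
```

A lift of 1 means the top-scored group is no richer in positives than the population as a whole; the higher the number, the better the model is at putting likely cases at the top of the list.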


Monday, June 16, 2014

Talking about IBM Watson with Sandy Carter

At SXSW this year, I was interviewed by IBM executive Sandy Carter about IBM's Watson computer system, and she wrote a brief blog post about it. You can see the interview here.

Seeing yourself on camera is strange. I can't tell if I look good or not, so I'd love to hear what people think. And people who have lots of camera experience, do you have any tips? Oh, and those stripes....

Friday, April 4, 2014

Surveillance Vs. Business

I wrote another article for Texas Enterprise, this time about the coming data backlash, how the NSA is bad for business, and what businesses can do about all of this. Read it!

Sunday, March 30, 2014

I presented a research poster

Two of my colleagues and I presented a research poster at a recent UT conference. We built a predictive model to improve the efficacy of tax audits. Better make sure your tax return is clean this year!

You can read more about it and see the poster on my website.

And thank you to my colleagues, Nicole White and Ying Du.

Tuesday, March 11, 2014

I'm Writing for Another Website

Why has this blog been so silent? Because I'm now writing monthly for Texas Enterprise, a web magazine out of UT. I will still be updating this blog, but I'll also be writing monthly there.

My first article is up now, and it's basically a bumbling rant about how saying "big data" makes you sound a bit...dumb. Please read and leave snarky comments here.