Do you Hadoop?

For those of you who don’t know, Big Data is a new spin on an old concept. It’s one part industry rebranding of old news and one part a real change in that old news. Large companies have long been capturing massive amounts of data and pushing it to their “Data Warehouse”. Then they would use “Business Intelligence” tools to analyze the data. Business Intelligence was a hot term back in the ’90s and early 2000s, and it still is today. Companies like Teradata, Cognos, Business Objects, and SAS have been calling themselves Business Intelligence companies for years. Today, they are more likely to call themselves Big Data companies. The Hadoop project is throwing a wrench into that plan. So what is the big deal about Big Data, and what’s really new? There are some big differences, but you need to look closely.

General vs Specific Information – When I started my career with Deloitte, I was in a group called Data Quality and Integrity. We would analyze massive amounts of data using a tool called SAS. For example, we were engaged to analyze all of the Medicare and Medicaid billing information. I honestly don’t remember how big the dataset was, but it was on several tape drives backed up from a mainframe system. So how did we make sense of so much data? We sampled it. Sometimes you don’t need to analyze every piece of data; you just need a general sense of the information. We would take a subset of the data and then estimate with a margin of error of +/-3%. For example, we might find that 46% of patients were prescribed medication (the real number could have been anywhere from 43% to 49%). If we had had the computing power of today, we could have gotten an exact number. Now that we know that about 46% of the patients were prescribed medication, we want to take action with those patients. Big Data would get the names, addresses, phone numbers, Facebook pages, and medical history of every patient in the 46%. The use of the data is not general, it’s specific.
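To make the sampling idea concrete, here is a minimal sketch of how an estimate with a +/-3% margin can be produced. It is not the SAS work described above. It assumes a 95% confidence level (the level we used is not stated here), uses the standard normal approximation for a proportion, and runs on made-up data; the function names and the synthetic patient list are hypothetical.

```python
import math
import random

# Assumption: a 95% confidence level, which corresponds to z = 1.96.
Z_95 = 1.96


def required_sample_size(margin, p=0.5, z=Z_95):
    """Smallest sample size whose margin of error is at most `margin`.

    p=0.5 is the worst case, so the result is safe for any true proportion.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


def estimate_proportion(sample, z=Z_95):
    """Return (estimate, margin_of_error) for a list of True/False values."""
    n = len(sample)
    p_hat = sum(sample) / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin


# Synthetic stand-in for the billing records: True means "prescribed medication".
rng = random.Random(42)
population = [rng.random() < 0.46 for _ in range(100_000)]

n = required_sample_size(0.03)          # records needed for +/-3%
sample = rng.sample(population, n)      # simple random sample, no replacement
p_hat, margin = estimate_proportion(sample)

print(f"sampled {n} of {len(population)} records")
print(f"estimate: {p_hat:.1%} +/- {margin:.1%}")
```

Under these assumptions, roughly a thousand records are enough for a +/-3% estimate no matter how large the full dataset is, which is why sampling was so attractive when the data lived on tape.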

Structured vs Unstructured Data – For years, information within businesses has been driven by databases or sometimes just flat files of information. Most information was numeric and there was a relational structure to it. You typically had one set of data and you analyzed that one set. Big Data couples structured and unstructured data, most likely in its raw format. The computer storage of the world today is not filled with numbers; it’s filled with images, videos, and unstructured text.

Informational vs Actionable – I’ve built several business intelligence systems on Cognos, Business Objects, SAS, and even some with just database software. Every single one of these systems ended in a static or sometimes dynamic report delivered to some executive, salesperson, marketing manager, etc. I really wonder how many of them took action on the information. The future will have a different application. Gone are the days of people acting on information. Today, computers act on information and programmers run the business (kind of). The lines between software programmers and Business Intelligence analysts will blur, and programming will become more information-driven and less user-input-driven.

Contained vs Uncontained – Until now, companies rarely analyzed data other than their own. Occasionally they would buy some sort of third-party market information or data source. In most of those cases, they consumed it separately. Sometimes the information was added to the warehouse for the overall analysis, but not always. No matter what, though, the data usually lived inside the company walls. As we all know, those days are over. Some information will live securely inside the walls of companies. Other data will live outside the company in the cloud. Some internal data will be pushed to the cloud. Information will be uncontained and it will be difficult to constrain. Who owns it and what can it be used for? Privacy issues and ethical issues will only get more complicated.

Small vs Big – This is pretty obvious, but I think it’s worth noting. The size and scale of information are creating both storage and processing challenges. The engineers of the world are hard at work to solve this problem. One of our clients, GenieDB (www.geniedb.com), is building software to provide Multi-Data Center MySQL databases. We have also seen growth in the number of requests for Hadoop experts. We are only scratching the surface of this; it’s big now, but it will only get bigger.

The applications for Big Data are endless. A couple of references will give you some context on what this is all about. I’m attending a conference on Monday to learn more about the engineering side of Big Data: http://goo.gl/dxVeC. McKinsey’s business view of the Big Data future is another good one to check out: http://goo.gl/hdwyy

