
How big does my Data need to be to be considered Big enough?

by vamsi_cz5cgo

One of the questions I get a lot from clients is “We do not have massive sizes of data to justify incorporating Hadoop into our legacy business application(s) or application area. However, it is clearly a superior way of doing data processing compared to what we have historically done both from a business & technology perspective. The data volumes that our application needs to process are x GB at the most but there are now a variety of formats that we can’t support with a classical RDBMS/Data warehouse style approach. How do we go about tackling this situation? ”

The one thing to clear out of the way is that there is no single universally accepted definition of Big Data. I have always defined Big Data as “the point at which your application architecture breaks down in its ability to process X GB/TB data volumes, which arrive at a given ingress velocity, with an expectation from the business that insights will be gleaned at a specified egress velocity. The volumes encompass a variety of new data types – unstructured and semi-structured – in addition to the classical structured feeds.” If you are at that point, you have first a business problem, then a Big Data problem – or an opportunity, depending on how you look at it.

By this yardstick, what counts as Big Data varies application by application and line of business by line of business across the enterprise. What makes it Big Data for an OLTP application may not even be Small Data for an OLAP application. It all depends on the business context.

I just ran across this poll by KDnuggets (http://www.kdnuggets.com/2013/04/poll-results-largest-dataset-analyzed-data-mined.html), which shows the sizes of the largest datasets analyzed. As the table below shows, the average was in the 40-50 GB range – which raises the question of what dataset sizes are actually being worked on in data projects.

[Chart: largest dataset analyzed / data mined, 2013 vs. 2012 (KDnuggets poll results)]
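One technical caveat often raised about putting modest-sized datasets on Hadoop is Namenode memory: HDFS keeps metadata for every file and block in the Namenode's heap, so many tiny files cost far more than a few large ones. The back-of-the-envelope sketch below is illustrative only – the 128 MB block size and the ~150-bytes-per-object figure are common rules of thumb, not exact numbers, and the function name is my own:

```python
import math

def namenode_memory_mb(num_files, avg_file_size_mb,
                       block_size_mb=128, bytes_per_object=150):
    """Very rough Namenode heap estimate. Assumes each file, plus each
    of its blocks, costs about 150 bytes of heap (a common Hadoop rule
    of thumb, not an exact figure) and a 128 MB HDFS block size."""
    blocks_per_file = max(1, math.ceil(avg_file_size_mb / block_size_mb))
    objects = num_files * (1 + blocks_per_file)  # 1 file object + its blocks
    return objects * bytes_per_object / 1e6

# The same ~50 GB of data stored two ways:
many_small = namenode_memory_mb(1_000_000, 0.05)  # a million tiny files
few_large = namenode_memory_mb(400, 128)          # 400 block-sized files
print(f"{many_small:.0f} MB vs {few_large:.2f} MB of Namenode heap")
```

The point is that dataset *shape* matters as much as raw size: 50 GB as a million small files stresses the Namenode far more than the same 50 GB as a few hundred block-sized files.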

There are technical arguments for and against processing smaller datasets on a Hadoop platform (more on HDFS block sizes, Namenode memory, performance etc. in a follow-up post), but let's take a step back and consider some of the strategic goals that every CXO needs to keep in mind while considering Hadoop, whatever the size of their data –

  1. The need to incorporate a Hadoop platform in your lines of business depends on what your business needs are. However, keep in mind that what may not even be a need today can become an urgent business imperative tomorrow. As one CIO on Wall St put it in a recent conversation – “We need to build a Hadoop platform & grow those skills as we need to learn to be disruptive. We will be disrupted if we don’t.” To me that quote sums it up: Big Data is about business disruption. Harnessing it can help you crack open new markets or new lines of thinking, while the lack of it can atrophy your business.
  2. Understanding and building critical skills in this rapidly maturing technology area are key to retaining and hiring the brightest IT employees. Big Data surpasses Cloud Computing in its impact on the direction of a business and is not just a Yahoo or Google or Amazon thing anymore. Cloud is just a horizontal capability with no intrinsic business value. There surely is value in being able to provision compute, network and storage on the fly. However, there is no innate benefit unless you have successful applications running on them. And the currency of every successful application is good ole’ data.
  3. Examples abound in every vertical of innovative shops leveraging data to gain competitive advantage. Don’t leave your enterprise beholden to a moribund data architecture.
  4. Hadoop’s inherent parallel processing capabilities and its ability to run complex analytics in record times (see the TeraSort benchmark) provide significant savings in the scarcest resource of them all – time. In 2015, Hadoop is neither a dark art nor alchemy. A leading vendor like Hortonworks provides robust quick-start capabilities along with security, management and governance frameworks. What’s more, a plethora of existing database, data-warehouse & analytics vendors integrate readily & robustly with data in a Hadoop cluster.
  5. Hadoop (Gen 2) is not just a batch data processing platform. It has multiple personas – a real-time, streaming, interactive platform for any kind of data processing (batch, analytical, in-memory & graph-based), with search, messaging & governance capabilities built in. Whatever your application use case, chances are that you can get a lot done even with a small Hadoop cluster. Use cases run the gamut from risk management to transaction analysis to drug discovery to IoT, limited only by one’s imagination.
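To make the parallel-processing point in item 4 concrete, here is the classic MapReduce word count, written as Hadoop Streaming-style mapper and reducer functions. This is a minimal sketch: the `run_local` driver is a stand-in I wrote to simulate, in one process, the shuffle/sort that a real Hadoop cluster performs across many machines; it is not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Sum the partial counts for a single word."""
    return (word, sum(counts))

def run_local(lines):
    """Illustrative in-process driver: sort the mapper output by key
    (the shuffle a Hadoop cluster would do), then reduce each group."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

if __name__ == "__main__":
    print(run_local(["big data big insights", "data at scale"]))
```

Because the mapper touches each line independently and the reducer touches each key independently, the same two functions scale from this toy driver to a thousand-node cluster – that independence is the source of the time savings item 4 describes.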

We are still in the early days of understanding how Big Data can impact our business & world. Over-regulating data management & architecture and discouraging experimentation among data & business teams, whether through an overly conservative approach or long budget cycles, is a recipe for suboptimal business results.

IT executives in bimodal organizations recognize that providing agile, responsive data feeds to business owners is key to creating and meeting customer needs. And while the use of enterprise data needs to be governed, there must also be room for experimentation.

Done right, your Big Data CoE (Center of Excellence) can be your next big profit center.

