This first post in a four-part series on Data Lakes focuses on the business reasons to create one. The second post will delve deeper into the technology considerations & choices around data ingest & processing in the lake to satisfy myriad business requirements. The third will tackle the critical topics of metadata management, data cleanliness & governance. The fourth & final post in the series will focus on the business justification for building out a Big Data Center of Excellence (COE).
“Business owners at the C level are saying, ‘Hey guys, look. It’s no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop.’” – Mike Lang, CEO of Revelytix
The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just to provide engaging visualization but also to personalize the services clients care about across multiple modes of interaction. Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. For example, Banking now requires the ability to engage consumers in a seamless experience across an average of four to five channels – Mobile, eBanking, Call Center, Kiosk etc. Healthcare is a close second, where caregivers expect patient, medication & disease data at their fingertips, just a few swipes away on an iPad app.
Big Data has been the chief catalyst in this disruption. The Data Lake architectural & deployment pattern makes it possible to first store all of this data and then lets the panoply of Hadoop ecosystem projects & technologies operate on it to produce business results.
Let us consider a few of the major industry verticals and the sheer data variety that players in these areas commonly possess –
The Healthcare & Life Sciences industry possesses some of the most diverse data across the spectrum, ranging from –
- Structured Clinical data, e.g. patient ADT (admission, discharge & transfer) information
- Freehand clinical notes
- Patient Insurance information
- Device Telemetry
- Medication data
- Patient Trial Data
- Medical Images – e.g. CT scans, MRIs etc.
Players in the Manufacturing industry are leveraging the datasets below, among many others, to derive new insights in a highly process-oriented industry –
- Supply chain data
- Demand data
- Pricing data
- Operational data from the shop floor
- Sensor & telemetry data
- Sales campaign data
Data In Banking – For many years now, corporate IT organizations in the financial industry have been tackling data challenges caused by strict silo-based approaches that inhibit data agility.
Consider some of the traditional sources of data in banking –
- Customer Account data e.g. Names, Demographics, Linked Accounts etc
- Core Banking Data
- Transaction Data which captures the low-level details of every transaction (e.g. debit, credit, transfer, credit card usage etc.)
- Wire & Payment Data
- Trade & Position Data
- General Ledger Data, e.g. AP (accounts payable), AR (accounts receivable), cash management & purchasing information etc.
- Data from other systems supporting banking reporting functions.
Industries have changed around us since the advent of relational databases & enterprise data warehouses. Relational Databases (RDBMS) & Enterprise Data Warehouses (EDW) were built with very different purposes in mind. RDBMS systems excel at online transaction processing (OLTP) use cases, where massive volumes of structured data need to be processed quickly. EDWs, on the other hand, perform online analytical processing (OLAP) functions, where extracts are taken from OLTP systems, then loaded & sliced in different ways to support reporting and analysis. Neither kind of system is suited to handling immense volumes of data, let alone data with highly variable structure.
Let us consider the main reasons why these legacy data storage & processing techniques are unsuited to today's business realities.
- Legacy data technologies enforce a vertical scaling model that is sorely unsuited to massive data volumes, which call for elastic, horizontal scale-out across commodity nodes
- The structure of the data needs to be modeled up front in a paradigm called ‘schema on write’, which sorely inhibits time to market for new business projects
- Traditional data systems hit processing bottlenecks when large amounts of high-variety data are pushed through them
- Limits in the types of analytics that can be performed. In industries like Retail, Financial Services & Telecommunications, enterprises need to build detailed models of customer accounts to predict their overall service-level satisfaction in real time. These models are predictive in nature and use data science techniques as an integral component. The higher volumes of data, along with the attribute richness that can be fed to them (e.g. transaction data, social network data, transcribed customer call data), ensure that the models are highly accurate & can provide an enormous amount of value to the business. Legacy systems are not a great fit here; a minimal sketch of such a model follows this list.
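To make this concrete, here is a minimal, hypothetical sketch of such a predictive model built with Spark MLlib on data already landed in a lake. The HDFS path, the feature columns and the binary “churned” label are all assumptions for illustration, not a reference implementation.

```python
# A minimal sketch of a customer-satisfaction/churn model in Spark MLlib.
# Assumes a Spark cluster and a hypothetical Parquet dataset of customer
# features (transaction counts, scored call transcripts, social activity)
# carrying a binary "churned" label.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-sketch").getOrCreate()

# Hypothetical path and column names -- adjust to your own lake layout.
df = spark.read.parquet("hdfs:///lake/curated/customer_features")

# Combine the numeric attributes into a single feature vector.
assembler = VectorAssembler(
    inputCols=["txn_count_90d", "avg_balance", "call_sentiment", "social_mentions"],
    outputCol="features",
)
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

# Fit a simple logistic regression as a baseline predictive model.
model = LogisticRegression(labelCol="churned").fit(train)
print("Test AUC:", model.evaluate(test).areaUnderROC)
```

The richer the attribute set fed into `inputCols`, the more accurate such a model tends to become – which is precisely the argument for landing all of this data in one place first.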
Given all of the above data complexity and the need to adopt agile analytical methods – what is the first step that enterprises must take?
The answer is the adoption of the Data Lake as an overarching data architecture pattern. Let's define the term first. A data lake is two things – a data storage repository (be it small or massive) and a data processing engine. A data lake provides “massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs”. Data lakes are created to ingest, transform, process, analyze & finally archive large amounts of any kind of data – structured, semi-structured and unstructured.
Illustration – The Data Lake Architecture Pattern
What Big Data brings to the equation, beyond its strength in data ingest & processing, is a unified architecture. For instance, MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Apache Hadoop YARN opened Hadoop up to other data processing engines (e.g. Apache Spark/Storm) that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. The result is that ANY kind of application processing can be run inside a Hadoop runtime – batch, realtime, interactive or streaming.
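As a simple illustration, here is a classic word-count batch job written with Spark that could be submitted to a YARN cluster (e.g. via `spark-submit --master yarn`) and run alongside existing MapReduce jobs over the same HDFS data. The paths are hypothetical.

```python
# A minimal Spark batch job that a YARN cluster can schedule alongside
# existing MapReduce jobs over the same HDFS data. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yarn-wordcount").getOrCreate()

# Read raw text already sitting in HDFS, unchanged from its source form.
lines = spark.read.text("hdfs:///lake/raw/logs").rdd.map(lambda r: r[0])

# Classic word count: split, pair each word with 1, sum the pairs.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///lake/processed/wordcounts")
spark.stop()
```

The same files could just as well be consumed by a Hive query, a Storm topology or a legacy MapReduce job – that is the point of the unified architecture.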
Visualization – The average enterprise user is now also accustomed to BYOD in the age of self-service. The Digital Mesh only exacerbates the gap in user experiences, as information consumers navigate applications while consuming services across a mesh that is both multi-channel and expected to provide a Customer 360 view across all these engagement points. While information management technology has grown at a blistering pace, the human ability to process and comprehend numerical data has not. Applications being developed in 2016 are beginning to adopt intelligent visualization approaches that are easy to use, highly interactive and enable the user to manipulate corporate & business data with their fingertips – much like an iPad app. Tools such as intelligent dashboards, scorecards, mashups etc. are helping change visualization paradigms that were based on histograms, pie charts and tons of numbers. Big Data improvements in data lineage & quality are greatly helping the visualization space.
The Final Word –
Specifically, the Data Lake architectural pattern provides the following benefits –
The ability to store enormous amounts of data with a high degree of agility & low cost: The schema-on-read architecture makes it trivial to ingest any kind of raw data into Hadoop in a manner that preserves its structure. Business analysts can then explore this data and define a schema to suit the needs of their particular application.
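Here is a minimal schema-on-read sketch using Spark: the raw JSON files sit in the lake untouched, and an analyst-defined schema is applied only at read time. The path and field names are hypothetical; another team could project an entirely different schema onto the same files.

```python
# Schema-on-read: the schema lives with the reader, not the storage layer.
# File path and field names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# An analyst-defined schema, applied only when the data is queried.
txn_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("channel", StringType()),
    StructField("ts", TimestampType()),
])

# The raw JSON was ingested as-is; structure is imposed here, at read time.
txns = spark.read.schema(txn_schema).json("hdfs:///lake/raw/transactions")
txns.groupBy("channel").sum("amount").show()
```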
The ability to run any kind of Analytics on the data: Hadoop supports multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set. You are only restricted by your use case.
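As a sketch of that flexibility, the snippet below (assuming Spark with Structured Streaming and a hypothetical HDFS directory of Parquet events) runs both a one-off batch aggregation and a continuous streaming aggregation against the very same data set.

```python
# Two access methods over one data set: a batch aggregate and a
# continuous streaming query against the same HDFS directory.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-access").getOrCreate()

# Batch: a one-off aggregate over everything landed so far.
batch = spark.read.parquet("hdfs:///lake/raw/events")
batch.groupBy("event_type").count().show()

# Streaming: the same directory treated as an unbounded source;
# new files are picked up and folded into the counts as they arrive.
stream = (spark.readStream
               .schema(batch.schema)   # reuse the schema discovered in batch
               .parquet("hdfs:///lake/raw/events")
               .groupBy("event_type").count())

query = (stream.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination(60)  # run for a minute in this sketch
```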
The ability to analyze, process & archive data while dramatically cutting costs: Since Hadoop was designed to work on low-cost commodity servers with direct-attached storage, it dramatically lowers the overall cost of storage. Enterprises are thus able to retain source data for long periods, providing business applications with far greater historical context.
The ability to augment & optimize Data Warehouses: Data lakes & Hadoop technology are not a ‘rip & replace’ proposition. While they provide a much lower-cost environment than data warehouses, they can also be used as the compute layer to augment these systems. Data can be stored, extracted and transformed in Hadoop, and then a subset of the data, i.e. the results, is loaded into the data warehouse. This frees the EDW’s compute cycles and storage for truly high-value analytics.
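A minimal sketch of this offload pattern with Spark follows: the heavy transformation happens in the lake, and only a compact aggregate is pushed to the warehouse over JDBC. The warehouse URL, table name and credentials are placeholders, and a suitable JDBC driver is assumed to be on the classpath.

```python
# ETL offload: crunch full-fidelity data in the lake, ship only the
# small result set to the EDW. Connection details are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("edw-offload").getOrCreate()

# Transform the complete transaction history inside Hadoop.
raw = spark.read.parquet("hdfs:///lake/raw/transactions")
monthly = (raw.groupBy("account_id",
                       F.date_trunc("month", "ts").alias("month"))
              .agg(F.sum("amount").alias("monthly_spend")))

# Load only the compact aggregate into the warehouse over JDBC
# (a matching JDBC driver must be on the Spark classpath).
(monthly.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://edw-host:5432/warehouse")
        .option("dbtable", "analytics.monthly_spend")
        .option("user", "etl_user")
        .option("password", "***")
        .mode("append")
        .save())
```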
The next post in the series will dive deeper into the architectural choices one needs to make while creating a high-fidelity & business-centric enterprise data lake.