Open Enterprise Hadoop – as secure as Fort Knox

Previous posts in this blog have discussed customers leveraging Open Source, Big Data and Hadoop-related technologies for a range of use cases across industry verticals. We have seen how a Hadoop-powered “Data Lake” can not only provide a solid foundation for a new generation of applications that deliver analytics and insight, but can also increase the number of access points to an organization’s data. As diverse types of external and internal enterprise data are ingested into a central repository, the inherent security risks must be understood and addressed by every actor in the architecture. Security is therefore essential for organizations that store and process sensitive data in the Hadoop ecosystem. Many organizations must adhere to strict corporate security policies as well as rigorous industry guidelines. So how does open source Hadoop stack up to demanding standards such as PCI-DSS?

We have, from time to time, noted the ongoing digital transformation across industry verticals. For instance, banking organizations are building digital platforms that aim to engage customers, partners and employees. Retailers and banks now recognize that the key to winning today’s customer is to offer a seamless experience across multiple channels of engagement. Healthcare providers want to offer their stakeholders – patients, doctors, nurses, suppliers etc. – multiple avenues to access contextual data and services, and the IoT (Internet of Things) domain is abuzz with the possibilities of Connected Car technology.

The aim of this blogpost is to dispel a notion that floats around from time to time – that a Hadoop-led, 100% open source ecosystem is somehow insecure or unable to fit well into a corporate security model. On the broader question of open source security, the Open Source Alliance has noted that – “Open source enables anyone to examine software for security flaws. The continuous and broad peer-review enabled by publicly available source code improves security through the identification and elimination of defects that might otherwise be missed. Gartner for example, recommends the open source Apache Web server as a more secure alternative to closed source Internet Information servers. The availability of source code also facilitates in-depth security reviews and audits by government customers.” [2]

It is well understood that data is the most important asset a business possesses, and the one that nefarious actors are usually after. Consider the retail industry: cardholder data such as card numbers or PANs (Primary Account Numbers) and other authentication data is much sought after by the criminal population.

The consequences of a data breach are myriad & severe and can include –

  • Revenue losses
  • Reputational losses
  • Regulatory sanctions and fines

Previous blogposts have chronicled cybersecurity in some depth; please refer to this post as a starting point for a fairly exhaustive view of the topic. This awareness has led to increased adoption of risk-based security frameworks, e.g. ISO 27001, the US National Institute of Standards and Technology (NIST) Cybersecurity Framework and the SANS Critical Controls. These frameworks offer a common vocabulary and a set of guidelines that enable enterprises to identify and prioritize threats, quickly detect and mitigate risks, and understand security gaps.

In the realm of payment card data, regulators, payment networks and issuer banks recognize this and have enacted a compliance standard – PCI DSS (Payment Card Industry Data Security Standard). PCI DSS is currently in its third generation, v3.0, which was introduced over the course of 2014. It is the most important standard for a host of actors – merchants, processors, payment service providers, or really any entity that stores or uses payment card data. It is also important to note that the compliance process covers all applications and systems at a merchant or payment service provider that touch cardholder data.

The PCI Security Standards Council defines the following 12 requirements for PCI DSS, as depicted in the table below.


Illustration: PCI Data Security Standard – high level overview (source: shopify.com)

While PCI covers a whole range of areas that touch payment data – POS terminals, payment card readers, in-store networks etc. – data security is front and center.

It should be noted, though, that according to the PCI Security Standards Council, which oversees the creation of and guidance around the standard, a technology vendor or product cannot by itself be declared “PCI Compliant.”

Thus, the standard has wide implications on two different dimensions –

1. The technology itself, as it is incorporated at a merchant, as well as

2. The organizational culture around information security policies.

My experience working at both Hortonworks and Red Hat has shown me that open source software runs certified, demanding workloads at hundreds of enterprise customers in verticals such as financial services, retail, insurance, telecommunications and healthcare. The other important point to note is that these customers are PCI, HIPAA and SOX compliant across the board.

It is a misconception that off-the-shelf, proprietary point solutions are needed to provide broad security coverage. Open enterprise Hadoop offers comprehensive, well-rounded implementations across all five of the security pillars discussed below – and, what is more, it is 100% open source.

Let us examine how security in Hadoop works.

The Security Model for Open Enterprise Hadoop – 

The Hadoop community has adopted both a top-down and a bottom-up approach to security, examining all potential access patterns across all components of the platform.

Hadoop and Big Data security needs to be considered along two prongs –

  1. What do the individual projects themselves need to support to guarantee that business architectures built using them are highly robust from a security standpoint? 
  2. What are the essential pillars of security that the platform which makes up every enterprise cluster needs to support? 

Let us consider the first. The Apache Hadoop ecosystem contains 25+ technologies in the realm of data ingestion, processing and consumption. While anything beyond a cursory look is out of scope here, an exhaustive list of the security hooks provided in each of the major projects is covered in [1].

For instance, Apache Ranger manages fine-grained access control through a rich user interface that ensures consistent policy administration across Hadoop data access components. Security administrators have the flexibility to define security policies for a database, table and column, or a file, and can administer permissions for specific LDAP-based groups or individual users. Rules based on dynamic conditions such as time or geolocation can also be added to an existing policy rule. The Ranger authorization model is highly pluggable and can be easily extended to any data source using a service-based definition. [1]

Administrators can use Ranger to define a centralized security policy for the following Hadoop components and the list is constantly enhanced:

  • HDFS
  • YARN
  • Hive
  • HBase
  • Storm
  • Knox
  • Solr
  • Kafka

Ranger works with standard authorization APIs in each Hadoop component, and is able to enforce centrally administered policies for any method used to access the data lake.[1]
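To make this concrete, below is a minimal sketch of how such a centrally administered policy might be created programmatically via Ranger’s public REST API (v2). The host name, port, credentials, Ranger service name and exact payload fields are placeholders and should be verified against the Ranger version in use; this is illustrative, not a reference implementation.

```python
# A minimal sketch: create a Hive column-level policy via Apache Ranger's
# public REST API (v2). Host, credentials, service and policy names are
# placeholders; verify the exact payload fields against your Ranger version.
import requests

RANGER_URL = "https://ranger.example.com:6182"          # assumed Ranger Admin endpoint
AUTH = ("admin", "changeme")                            # placeholder credentials

policy = {
    "service": "hadoop_hive",                           # assumed Hive service name in Ranger
    "name": "analysts_read_customers",
    "resources": {
        "database": {"values": ["sales"]},
        "table":    {"values": ["customers"]},
        "column":   {"values": ["id", "region"]},       # no access to PAN/PII columns
    },
    "policyItems": [{
        "groups":   ["analysts"],                       # LDAP/AD group
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    json=policy,
    auth=AUTH,
    verify="/etc/security/ca-bundle.pem",               # TLS trust store (placeholder path)
)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```

Equivalent policies can of course be created interactively through the Ranger Administration Portal described below; the API simply makes the same central policy store scriptable.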

Now for the second, and more important, question from an overall platform perspective.

There are five essential pillars of security that address the critical requirements security administrators place on data residing in a data lake. If any of these pillars is vulnerable from an implementation standpoint, it creates risk built into the organization’s Big Data environment. Any Big Data security strategy must address all five pillars, with a consistent implementation approach to ensure their effectiveness.


                             Illustration: The Essential Components of Data Security

  1. Authentication – does the user possess appropriate credentials? This is implemented via the Kerberos authentication protocol and allied concepts such as Principals, Realms and KDCs (Key Distribution Centers).
  2. Authorization – what resources is the user allowed to access, based on business need and credentials? Implemented in each Hadoop project and integrated with the organization’s LDAP/AD.
  3. Perimeter Security – prevents unauthorized outside access to the cluster. Implemented via the Apache Knox Gateway, which extends the reach of Hadoop services to users outside of a Hadoop cluster. Knox also simplifies Hadoop security for users who access cluster data and execute jobs (a short access sketch follows the illustration below).
  4. Centralized Auditing – implemented via Apache Atlas and its integration with Apache Ranger.
  5. Security Administration – the central setup and control of all security information using a single console. This is provided by Apache Ranger, whose Administration Portal is the central interface for security administration; users create and update policies, which are then stored in a policy database.


                                           Illustration: Centralized Security Administration
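To make pillars 1 and 3 a little more tangible, here is a minimal sketch of a client listing an HDFS directory through the Knox perimeter gateway over TLS, rather than connecting to the Kerberos-protected cluster services directly. The gateway host, topology name, data path and credentials are assumptions for illustration only.

```python
# A minimal sketch: list an HDFS directory via WebHDFS, routed through the
# Apache Knox perimeter gateway over TLS instead of hitting the NameNode
# directly. Gateway host, topology ("default") and credentials are placeholders.
import requests

KNOX_GATEWAY = "https://knox.example.com:8443/gateway/default"   # assumed topology
CREDENTIALS  = ("sam", "sam-password")                            # LDAP user behind Knox

resp = requests.get(
    f"{KNOX_GATEWAY}/webhdfs/v1/data/retail/transactions",
    params={"op": "LISTSTATUS"},                                  # standard WebHDFS operation
    auth=CREDENTIALS,
    verify="/etc/security/knox-ca.pem",                           # trust the gateway's certificate
)
resp.raise_for_status()

for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"])
```

Behind the gateway, Knox in turn authenticates to the Kerberized Hadoop services, so end users never need direct network access to them.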

It should also be noted that as Hadoop adoption grows, workloads that harness data for complex business analytics and decision-making may need more robust data-centric protection (namely data masking, encryption and tokenization). In addition to Hadoop-native projects such as Apache Ranger, enterprises can take an augmentative approach: partner solutions that offer data-centric protection for Hadoop data, such as Dataguise DgSecure for Hadoop, clearly complement an enterprise-ready Hadoop distribution (such as those from the open source leader Hortonworks) and are worth a close look.

Summary

While implementing Big Data architectures in support of business needs, security administrators should look to provide coverage across each of the above areas as they design the infrastructure. A rigorous, bottom-up approach to data security makes it possible to enforce and manage security across the stack through a central point of administration, which helps prevent potential security gaps and inconsistencies. This approach is especially important for a newer technology like Hadoop, where exciting new projects and data processing engines are incubated at a rapid clip. After all, the data lake is all about building a robust and highly secure platform on which data engines – Storm, Spark etc. – and processing frameworks like MapReduce can function to create business magic.

 References – 

[1] Hortonworks Data Security Guide
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_Security_Guide/

[2] Open Source Alliance of America
http://opensourceforamerica.org/learn-more/benefits-of-open-source-software/

Why Software Defined Infrastructure & why now..(1/6)

The ongoing digital transformation in key verticals like financial services, manufacturing, healthcare and telco has incumbent enterprises fending off a host of new market entrants. Enterprise IT’s best answer is to increase the pace of innovation as a way of driving differentiation in business processes. Though data analytics and automation remain the lynchpin of this approach, software defined infrastructure (SDI) built on the notions of cloud computing has emerged as the main infrastructure differentiator, for a host of reasons which we will discuss in this series.

Software Defined Infrastructure (SDI) is essentially an idea that brings together advances in a host of complementary areas spanning infrastructure software, data and development environments, and it supports a new way of building business applications. The core idea in SDI is that massively scalable applications (in support of diverse customer needs) describe their behavior characteristics (via configuration and APIs) to the underlying datacenter infrastructure, which simply obeys those commands in an automated fashion while abstracting away the underlying complexities.

SDI as an architectural pattern was originally made popular by the web scale giants – the so-called FANG companies of tech: Facebook, Amazon, Netflix and Google (now Alphabet) – but it has gradually begun making its way into the enterprise world.

Common Business IT Challenges prior to SDI – 
  1. The cost of hardware infrastructure typically grows at a higher rate every year than the total IT budget. Cost pressures are driving an overall re-look at the different tiers across the IT landscape.
  2. Infrastructure is not yet completely under the control of IT application development teams, even as business realities dictate rapid application development to meet changing requirements.
  3. Even small, departmental-level applications still require expensive proprietary stacks that are prohibitive in both cost and deployment footprint, and that take weeks of provisioning cycles to spin up.
  4. Big-box proprietary solutions are prompting a hard look at Open Source technologies, which are lean and easy to use with a lightweight deployment footprint. Apps need to dictate the footprint, not vendor-provided containers.
  5. Concerns with acquiring developers who are tooled on cutting-edge development frameworks and methodologies – Big Box technologies have essentially zero developer mindshare.

Key characteristics of an SDI

  1. Applications built on an SDI can detect business events in realtime and respond dynamically by allocating additional resources in three key areas – compute, storage & network – based on the type of workloads being run.
  2. Using an SDI, application developers can seamlessly deploy apps while accessing higher-level programming abstractions that allow for the rapid creation of business services (web, application, messaging, SOA/Microservices tiers), user interfaces and a whole host of other application elements.
  3. From a management standpoint, business application workloads are dynamically and automatically assigned to the available infrastructure (spanning public & private cloud resources) on the basis of the application requirements and required SLAs, in a way that provides continuous optimization across the technology life cycle.
  4. The SDI itself optimizes the entire application deployment through both externally provisioned APIs & internal interfaces between the five essential pieces – Application, Compute, Storage, Network & Management.

The SDI automates the technology lifecycle –

Consider the typical tasks needed to create and deploy enterprise applications. This list includes but is not limited to –

  • onboarding hardware infrastructure,
  • setting up complicated network connectivity to firewalls, routers, switches etc.,
  • making the hardware stack available for consumption by applications,
  • figuring out storage requirements and provisioning them,
  • guaranteeing multi-tenancy,
  • application development,
  • deployment,
  • monitoring,
  • updates, failover & rollbacks,
  • patching,
  • security,
  • compliance checking etc.

The promise of SDI is to automate all of this from a business, technology, developer & IT administrator standpoint.

SDI Reference Architecture –

The SDI encompasses SDC (Software Defined Compute), SDS (Software Defined Storage), SDN (Software Defined Networking), Software Defined Applications and Cloud Management Platforms (CMP) in one logical construct, as can be seen in the picture below.

                      Illustration: The different tiers of Software Defined Infrastructure

The core of the software defined approach is APIs. APIs control the lifecycle of resources (request, approval, provisioning, orchestration & billing) as well as the applications deployed on them. The SDI implies commodity (x86) hardware and a cloud-based approach to architecting the datacenter.
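As one hedged illustration of what “API-driven” means in practice, the sketch below provisions a compute instance through the OpenStack SDK. The cloud profile, image, flavor and network names are placeholders, and any comparable IaaS or cloud management API could play the same role in an SDI.

```python
# A minimal sketch of API-driven provisioning using the OpenStack SDK
# (pip install openstacksdk). The cloud profile, image, flavor and network
# names are placeholders; any programmable IaaS/CMP would serve equally well.
import openstack

# Connects using credentials defined for the "engineering" cloud in clouds.yaml.
conn = openstack.connect(cloud="engineering")

image  = conn.compute.find_image("rhel-7-x86_64")      # placeholder image name
flavor = conn.compute.find_flavor("m1.large")          # placeholder flavor
net    = conn.network.find_network("app-tier")         # placeholder network

server = conn.compute.create_server(
    name="risk-analytics-node-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": net.id}],
)

# Block until the instance is ACTIVE, then print its addresses.
server = conn.compute.wait_for_server(server)
print(server.name, server.addresses)
```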

The ten fundamental technology tenets of the SDI –

1. Highly elastic – scale up or scale down the gamut of infrastructure (compute – VM/bare metal/containers; storage – SAN/NAS/DAS; network – switches/routers/firewalls etc.) in near real time.

2. Highly Automated – Given the scale & multi-tenancy requirements, automation is needed at all levels of the stack (development, deployment, monitoring and maintenance).

3. Low Cost – Oddly enough, the SDI operates at lower CapEx and OpEx than the traditional datacenter, thanks to its reliance on open source technology & a high degree of automation. Further, workload consolidation helps increase hardware utilization.

4. Standardization –  The SDI enforces standardization and homogenization of deployment runtimes, application stacks and development methodologies based on lines of business requirements. This solves a significant IT challenge that has hobbled innovation at large financial institutions.

5. Microservice based applications –  Applications developed for an SDI-enabled infrastructure are built as small, nimble processes that communicate via APIs and over infrastructure like messaging & service mediation components (e.g. Apache Kafka & Camel); a brief messaging sketch follows this list. This offers huge operational and development advantages over legacy applications. While one does not expect core banking applications to move to a microservice model anytime soon, customer-facing applications that need responsive digital UIs will definitely need to consider such approaches.

6. ‘Kind-of-Cloud’ Agnostic –  The SDI does not enforce the concept of a private cloud; rather, it encompasses a range of deployment options – public, private and hybrid.

7. DevOps friendly –  The SDI enforces not just standardization and homogenization of deployment runtimes, application stacks and development methodologies, but also enables a culture of continuous collaboration among developers, operations teams and business stakeholders, i.e. cross-departmental innovation. The SDI is a natural container for workloads that are experimental in nature and can be updated, rolled back or rolled forward incrementally based on changing business requirements. The SDI enables rapid deployment capabilities across the stack, leading to faster time to market for business capabilities.

8. Data, Data & Data –  The heart of any successful technology implementation is data. This includes customer data, transaction data, reference data, risk data, compliance data etc. The SDI provides a variety of tools that enable applications to process data in a batch, interactive or low latency manner, depending on the business requirements.

9. Security –  The SDI must provide robust perimeter defense as well as application-level security, with a strong focus on a Defense in Depth strategy.

10. Governance –  The SDI enforces strong governance requirements for capabilities ranging from ITSM concerns – workload orchestration, business policy enabled deployment, autosizing of workloads – to change management, provisioning, billing, chargeback & application deployments.
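Tenet 5 above mentions microservices communicating over messaging infrastructure such as Apache Kafka. The short sketch below, using the kafka-python client, shows the basic produce/consume pattern such services rely on; the broker address, topic and payload are placeholders chosen for illustration.

```python
# A minimal sketch of two microservices exchanging events over Apache Kafka
# using the kafka-python client (pip install kafka-python). Broker address
# and topic name are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["kafka.example.com:9092"]     # placeholder broker
TOPIC   = "payments.events"              # placeholder topic

# Service A: publish a payment event.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"account": "12345", "amount": 99.50, "currency": "USD"})
producer.flush()

# Service B: consume and act on events independently of Service A.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="fraud-scoring",            # each consumer group sees the full stream
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print("scoring event:", message.value)
    break                                # keep the sketch short; a real service would loop
```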

The next blog in this series will cover the challenges in running applications at massive scale.

Data Lakes power the future of Industrial Analytics..(1/4)

The first post in this four part series on Data lakes will focus on the business reasons to create one. The second post will delve deeper into the technology considerations & choices around data ingest & processing in the lake to satisfy myriad business requirements. The third will tackle the critical topic of metadata management, data cleanliness & governance. The fourth & final post in the series will focus on the business justification to build out a Big Data Center of Excellence (COE).

“Business owners at the C level are saying, ‘Hey guys, look. It’s no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop.’” – Mike Lang, CEO of Revelytix

The onset of digital architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just to provide engaging visualization but also to personalize the services clients care about across multiple modes of interaction. Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. For example, banking now requires the ability to engage consumers in a seamless experience across an average of four to five channels – mobile, eBanking, call center, kiosk etc. Healthcare is a close second, where caregivers expect patient, medication and disease data at their fingertips with a few finger swipes on an iPad app.

Big Data has been the chief catalyst in this disruption. The Data Lake architectural & deployment pattern makes it possible to first store all this data & then enables the panoply of Hadoop ecosystem projects & technologies to operate on it to produce business results.

Let us consider a few of the major industry verticals and the sheer data variety that players in these areas commonly possess – 

The Healthcare & Life Sciences industry possesses some of the most diverse data across the spectrum, ranging from – 

  • Structured Clinical data e.g. Patient ADT information
  • Free hand notes
  • Patient Insurance information
  • Device Telemetry 
  • Medication data
  • Patient Trial Data
  • Medical Images – e.g. CAT Scans, MRIs, CT images etc

Manufacturing industry players are leveraging the below datasets, and many others, to derive new insights in a highly process-oriented industry –

  • Supply chain data
  • Demand data
  • Pricing data
  • Operational data from the shop floor 
  • Sensor & telemetry data 
  • Sales campaign data

Data In Banking – Corporate IT organizations in the financial industry have been tackling data challenges for many years now, due to strict silo-based approaches that inhibit data agility.
Consider some of the traditional sources of data in banking –

  • Customer Account data e.g. Names, Demographics, Linked Accounts etc
  • Core Banking Data
  • Transaction Data which captures the low level details of every transaction (e.g debit, credit, transfer, credit card usage etc)
  • Wire & Payment Data
  • Trade & Position Data
  • General Ledger Data e.g AP (accounts payable), AR (accounts receivable), cash management & purchasing information etc.
  • Data from other systems supporting banking reporting functions.

Industries have changed around us since the advent of relational databases and enterprise data warehouses. Relational Databases (RDBMS) and Enterprise Data Warehouses (EDW) were built with very different purposes in mind. RDBMS systems excel at online transaction processing (OLTP) use cases, where massive volumes of structured data need to be processed quickly. EDWs, on the other hand, perform online analytical processing (OLAP) functions, where data extracts are taken from OLTP systems, then loaded and sliced in different ways to support reporting and analysis. Neither kind of system is well suited to handling immense volumes of data, let alone data with highly variable structure.


Let us consider the main reasons why legacy data storage & processing techniques are unsuited to new business realities of today.

  • Legacy data technology enforces a vertical scaling model that is poorly suited to handling massive data volumes in an elastic, scale up/scale down manner
  • The structure of the data needs to be modeled upfront in a ’schema on write’ paradigm, which sorely inhibits time to market for new business projects
  • Traditional data systems suffer bottlenecks when large amounts of high-variety data are processed through them 
  • Limits exist in the types of analytics that can be performed. In industries like Retail, Financial Services & Telecommunications, enterprises need to build detailed models of customer accounts to predict their overall service-level satisfaction in realtime. These models are predictive in nature and use data science techniques as an integral component. The higher volumes of data, along with the attribute richness that can be provided to them (e.g. transaction data, social network data, transcribed customer call data), ensure that the models are highly accurate and can provide an enormous amount of value to the business. Legacy systems are not a great fit here.

Given all of the above data complexity and the need for agile analytical methods – what is the first step that enterprises must take? 

The answer is the adoption of the Data Lake as an overarching data architecture pattern. Let’s define the term first. A data lake is two things – a massive data storage repository and a data processing engine. A data lake provides “massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs“. [1] Data Lakes are created to ingest, transform, process, analyze and finally archive large amounts of any kind of data – structured, semi-structured and unstructured.


                                  Illustration – The Data Lake Architecture Pattern

What Big Data brings to the equation, beyond its strength in data ingest and processing, is a unified architecture. For instance, MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Apache Hadoop YARN opened Hadoop to other data processing engines (e.g. Apache Spark/Storm) that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. The result is that ANY kind of application processing can be run inside a Hadoop runtime – batch, realtime, interactive or streaming.
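As a small, hedged illustration of this “one platform, many engines” idea, the PySpark sketch below aggregates a dataset already sitting in HDFS; launched with spark-submit --master yarn, it would run on the same YARN-managed cluster as existing MapReduce jobs. The HDFS paths and column names are invented for illustration.

```python
# A minimal PySpark sketch, typically launched with:
#   spark-submit --master yarn store_sales_rollup.py
# so it shares the YARN cluster with existing MapReduce jobs.
# HDFS paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("store-sales-rollup").getOrCreate()

# Read a CSV extract that was landed in the lake as-is.
sales = spark.read.csv("hdfs:///data/retail/sales/", header=True, inferSchema=True)

# Aggregate daily totals per store.
daily_totals = (
    sales.groupBy("store_id", "sale_date")
         .agg(F.sum("amount").alias("total_amount"))
)

daily_totals.write.mode("overwrite").parquet("hdfs:///data/retail/daily_totals/")
spark.stop()
```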

Visualization – Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. For example, banking now requires the ability to engage consumers in a seamless experience across an average of four to five channels – mobile, eBanking, call center, kiosk etc. The average enterprise user is also familiar with BYOD in the age of self service. The Digital Mesh only exacerbates this gap in user experiences, as information consumers navigate applications across a mesh that is both multi-channel and expected to provide a Customer 360 view across all these engagement points. While information management technology has grown at a blistering pace, the human ability to process and comprehend numerical data has not. Applications being developed in 2016 are beginning to adopt intelligent visualization approaches that are easy to use, highly interactive and enable the user to manipulate corporate and business data with their fingertips – much like an iPad app. Tools such as intelligent dashboards, scorecards and mashups are helping change visualization paradigms that were based on histograms, pie charts and tons of numbers. Big Data improvements in data lineage and quality are greatly helping the visualization space.

The Final Word

Specifically, the Data Lake architectural pattern provides the following benefits – 

The ability to store enormous amounts of data with a high degree of agility and low cost: The schema-on-read architecture makes it trivial to ingest any kind of raw data into Hadoop in a manner that preserves its structure. Business analysts can then explore this data and define a schema to suit the needs of their particular application.
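A brief sketch of what “schema on read” looks like in practice, assuming raw JSON clickstream events have been landed in HDFS untouched: the structure is inferred only at read time, and the analyst layers a query-time schema over it. Paths and field names are illustrative placeholders.

```python
# A minimal sketch of schema-on-read with PySpark: raw JSON was landed in HDFS
# untouched; its structure is inferred only when the data is read and queried.
# The path and field names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-exploration").getOrCreate()

# No upfront modeling: Spark infers the schema from the raw events at read time.
clicks = spark.read.json("hdfs:///raw/web/clickstream/2016/06/")
clicks.printSchema()

# The analyst layers a view (a schema suited to this application) over the raw data.
clicks.createOrReplaceTempView("clicks")
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clicks
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```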

The ability to run any kind of Analytics on the data: Hadoop supports multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set.  You are only restricted by your use case.

The ability to analyze, process and archive data while dramatically cutting cost: Since Hadoop was designed to work on low-cost commodity servers with direct attached storage, it helps dramatically lower the overall cost of storage. Enterprises are thus able to retain source data for long periods, providing business applications with far greater historical context.

The ability to augment and optimize Data Warehouses: Data lakes and Hadoop technology are not a ‘rip and replace’ proposition. While they provide a much lower cost environment than data warehouses, they can also be used as the compute layer to augment these systems. Data can be stored, extracted and transformed in Hadoop, and then a subset of the data – i.e. the results – is loaded into the data warehouse. This frees the EDW to dedicate its compute cycles and storage to truly high value analytics.

The next post of the series will dive deeper into the architectural choices one needs to make while creating a high fidelity & business centric enterprise data lake.

References – 

[1] https://en.wikipedia.org/wiki/Data_lake

The six megatrends helping enterprises derive massive value from Big Data..

“The world is one big data problem.” – Andrew McAfee, associate director of the Center for Digital Business at MIT Sloan

Though data as a topic has been close to my heart, it was often a subject I did not deal with much, given my preoccupation with applications, middleware, cloud computing and DevOps. However, I grabbed the chance to teach a Hadoop course in 2012 and it changed the way I looked at data – not merely as an enabler but as the true oil of business. Fast forward to 2016: I have almost completed an amazing and enriching year at Hortonworks. It is a good time for retrospection about how Big Data is transforming businesses across the Fortune 500 landscape. Thus, I present what is not merely the ‘Art of the Possible’ but ‘Business Reality’ – insights distilled from a year of working with real-world customers: companies pioneering Big Data in commercial applications to drive shareholder value and customer loyalty.

Six Megatrends

  Illustration – The Megatrends helping enterprises derive massive value from Big Data 

Presented below are the six megatrends that will continue to drive Big Data into enterprise business and IT architectures for the foreseeable future.

  1. The Internet of Anything (IoAT) – The rise of the machines has been well documented, but enterprises have only begun waking up to the possibilities in 2016. The paradigm of harnessing IoT data by leveraging Big Data techniques has begun to gain industry-wide adoption and cachet. For example, in the manufacturing industry data is being gathered from a wide variety of sensors distributed geographically across factory locations running 24×7. Predictive maintenance strategies that pull together sensor data and prognostics are critical to efficiency and to optimizing the business. In other verticals like healthcare and insurance, massive data volumes are now being reliably generated from diverse sources of telemetry, such as patient monitoring devices as well as human-manned endpoints at hospitals. In transportation, these devices include consumer cars, trucks and other field vehicles, and geolocation devices. Others include field machinery in oil exploration and server logs across IT infrastructure. In the personal consumer space there are fitness devices like FitBit, home and office energy management sensors etc. All of this constitutes the trend that Gartner terms the Digital Mesh. The Mesh is really built by coupling this machine data with ever-growing social media feeds, web clicks, server logs etc. The Digital Mesh leads to an interconnected information deluge that encompasses classical IoT endpoints along with audio, video and social data streams. This poses huge security challenges, as well as business opportunity for forward-looking enterprises (including governments). Applications that leverage Big Data to ingest, connect and combine these disparate feeds into one holistic picture of an entity – whether individual or institution – are clearly beginning to differentiate themselves. IoAT is becoming a huge part of digital transformation initiatives, with more use cases emerging in 2017 across industry verticals.
  2. The Emergence of Unified Architectures – The onset of digital architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just to provide engaging visualization but also to personalize the services clients care about across multiple modes of interaction. Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. For example, banking now requires the ability to engage consumers in a seamless experience across an average of four to five channels – mobile, eBanking, call center, kiosk etc. Healthcare is a close second, where caregivers expect patient, medication and disease data at their fingertips with a few finger swipes on an iPad app. What Big Data brings to the equation, beyond its strength in data ingest and processing, is a unified architecture. For instance, MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Apache Hadoop YARN opened Hadoop to other data processing engines (e.g. Apache Spark/Storm) that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. The result is that ANY kind of application processing can be run inside a Hadoop runtime – batch, realtime, interactive or streaming.
  3. Consumer 360 – Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. For example, banking now requires the ability to engage consumers in a seamless experience across an average of four to five channels – mobile, eBanking, call center, kiosk etc. The healthcare industry stores patient data across multiple silos – ADT (Admit Discharge Transfer) systems, medication systems, CRM systems etc. Data Lakes provide the ability to visualize all of a patient’s data in one place, thus improving outcomes. The Digital Mesh (covered above) only exacerbates this semantic gap in user experiences as information consumers navigate applications while consuming services across the mesh – a mesh that is both multi-channel and in need of a 360-degree customer view across all these engagement points. Applications developed in 2016 and beyond must take a 360-degree approach to ensuring a continuous client experience across the spectrum of endpoints and the platforms that span them, from a data visualization standpoint. Every serious business needs to provide a unified view of a customer across tens of product lines and geographies.
  4. Machine Learning, Data Science & Predictive Analytics – Most business problems are data challenges, and an approach centered on data analysis helps extract meaningful insights that serve the business. It is now a common capability for enterprises to acquire, store and process large volumes of data using a low-cost approach leveraging Big Data and cloud computing. At the same time, the rapid maturation of scalable processing techniques allows us to extract richer insights from data. What we commonly refer to as Machine Learning – a combination of econometrics, machine learning, statistics, visualization and computer science – extracts valuable business insights hiding in data and builds operational systems to deliver that value (a short Spark MLlib sketch follows this list). Data Science has also evolved toward “Deep Neural Nets” (DNNs), which make it possible for smart machines and agents to learn from data flows and to make the products that use them even more automated and powerful. Deep learning involves discovering data insights in a human-like pattern. The web scale world (led by Google and Facebook) has been vocal about its use of advanced Data Science techniques and the move of Data Science into advanced Machine Learning. Data Science is an umbrella concept that refers to the process of extracting business patterns from large volumes of structured, semi-structured and unstructured data, and it is emerging as the key ingredient in enabling a predictive approach to the business. Data Science and its applications across a range of industries are covered in the blogpost http://www.vamsitalkstech.com/?p=1846
  5. Visualization – Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. For example, banking now requires the ability to engage consumers in a seamless experience across an average of four to five channels – mobile, eBanking, call center, kiosk etc. The average enterprise user is also familiar with BYOD in the age of self service. The Digital Mesh only exacerbates this gap in user experiences, as information consumers navigate applications across a mesh that is both multi-channel and expected to provide Customer 360 across all these engagement points. While information management technology has grown at a blistering pace, the human ability to process and comprehend numerical data has not. Applications being developed in 2016 are beginning to adopt intelligent visualization approaches that are easy to use, highly interactive and enable the user to manipulate corporate and business data with their fingertips – much like an iPad app. Tools such as intelligent dashboards, scorecards and mashups are helping change visualization paradigms that were based on histograms, pie charts and tons of numbers. Big Data improvements in data lineage and quality are greatly helping the visualization space.
  6. DevOps – Big Data powered by Hadoop has now evolved into a true application architecture ecosystem, as mentioned above. The 30+ components included in an enterprise-grade platform like the Hortonworks Data Platform (HDP) include APIs (Application Programming Interfaces) to satisfy every kind of data need an application could have – streaming, realtime, interactive or batch – coupled with improvements in predictive analytics. In 2016, enterprise developers leveraging Big Data have been building scalable applications with data as a first-class citizen. Organizations using DevOps are already reaping the rewards, as they are able to streamline, improve and create business processes that reflect customer demand and positively affect customer satisfaction. Examples abound in the webscale world (Netflix, Pinterest, and Etsy), but we now have Fortune 1000 companies in verticals like financial services, healthcare, retail and manufacturing who are benefiting from Big Data and DevOps. Thus, 2016 will be the year when Big Data techniques are no longer the preserve of classical information management teams but move into the umbrella application development area, which encompasses DevOps and Continuous Integration & Delivery (CI-CD).
    One of DevOps’ chief goals is to close the long-standing gap between the engineers who develop and test IT capability and the organizations that are responsible for deploying and maintaining IT operations. Using traditional app dev methodologies, it can take months to design, test and deploy software. No business today has that much time – especially in the age of IT consumerization and end users accustomed to smartphone apps that are updated daily. The focus now is on rapidly developing business applications to stay ahead of competitors by better harnessing Big Data business capabilities. The microservices architecture approach advocated by DevOps – autonomous, cooperative yet loosely coupled applications built as a conglomeration of business-focused services – is a natural fit for the Digital Mesh. The most important additive consideration for microservice-based architectures in 2016 is Analytics Everywhere.
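To ground megatrend 4 slightly, here is a hedged sketch of a predictive model trained on lake data with Spark MLlib; the input path, feature columns and label are assumptions for illustration rather than a prescribed approach.

```python
# A minimal sketch of predictive analytics on lake data with Spark MLlib:
# train a churn model on customer features stored in HDFS. The path, feature
# columns and label ("churned", assumed to be a 0.0/1.0 column) are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-model").getOrCreate()

customers = spark.read.parquet("hdfs:///analytics/customer_features/")

# Assemble the assumed feature columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_calls"],
    outputCol="features",
)
dataset = assembler.transform(customers).select("features", "churned")

train, test = dataset.randomSplit([0.8, 0.2], seed=42)

model = LogisticRegression(labelCol="churned", featuresCol="features").fit(train)

# Score the holdout set and compute a simple accuracy figure.
predictions = model.transform(test)
accuracy = predictions.filter("prediction = churned").count() / float(test.count())
print("holdout accuracy:", accuracy)
spark.stop()
```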

The Final Word – 

We have all heard about the growth of data volumes and variety. 2016 is perhaps the first year in which forward-looking business and technology executives have begun capturing commercial value from the data deluge by balancing analytics with creative user experience.

Thus, modern data applications are making Big Data ubiquitous. Rather than existing as back-shelf tools for the monthly ETL run or for reporting, these modern applications help industry firms incorporate data into every decision they make. Applications in 2016 and beyond are beginning to recognize that analytics is pervasive, relentless, realtime and embedded into our daily lives.