Previous posts in this blog have discussed customers leveraging open source, Big Data and Hadoop-related technologies for a range of use cases across industry verticals. We have seen how a Hadoop-powered “Data Lake” can not only provide a solid foundation for a new generation of applications that deliver analytics and insight, but can also increase the number of access points to an organization’s data. As diverse types of external and internal enterprise data are ingested into a central repository, the inherent security risks must be understood and addressed by every actor in the architecture. Security is thus essential for organizations that store and process sensitive data in the Hadoop ecosystem. Many organizations must adhere to strict corporate security policies as well as rigorous industry guidelines. So how does open source Hadoop stack up to demanding standards such as PCI-DSS?
We have, from time to time, noted the ongoing digital transformation across industry verticals. For instance, banking organizations are building digital platforms that aim to engage customers, partners and employees. Retailers and banks now recognize that the key to winning today’s customer is to offer a seamless experience across multiple channels of engagement. Healthcare providers want to offer their stakeholders – patients, doctors, nurses, suppliers and others – multiple avenues to access contextual data and services, and the IoT (Internet of Things) domain is abuzz with the possibilities of connected car technology.
The aim of this blogpost is to dispel the notion, which floats around from time to time, that a Hadoop-led, 100% open source ecosystem is somehow insecure or unable to fit well into a corporate security model. On the subject of open source security, the Open Source Alliance has noted – “Open source enables anyone to examine software for security flaws. The continuous and broad peer-review enabled by publicly available source code improves security through the identification and elimination of defects that might otherwise be missed. Gartner for example, recommends the open source Apache Web server as a more secure alternative to closed source Internet Information servers. The availability of source code also facilitates in-depth security reviews and audits by government customers.”
It is well understood that data is the most important asset a business possesses, and the one that nefarious actors are usually after. Consider the retail industry – cardholder data such as card numbers or PANs (Primary Account Numbers) and other authentication data is much sought after by the criminal population.
The consequences of a data breach are myriad and severe, and can include –
- Revenue losses
- Reputational damage
- Regulatory sanctions and fines
Previous blogposts have chronicled cybersecurity in some depth; please refer to this post as a starting point for a fairly exhaustive view of cybersecurity. This awareness has led to increased adoption of risk-based security frameworks, e.g. ISO 27001, the US National Institute of Standards and Technology (NIST) Cybersecurity Framework and the SANS Critical Security Controls. These frameworks offer a common vocabulary and a set of guidelines that enable enterprises to identify and prioritize threats, quickly detect and mitigate risks, and understand security gaps.
In the realm of payment card data, regulators, payment networks and issuer banks recognize this and have enacted a compliance standard – the PCI DSS (Payment Card Industry Data Security Standard). PCI DSS is currently in its third-generation incarnation, v3.0, which was introduced over the course of 2014. It is the most important standard for a host of actors – merchants, processors, payment service providers, or really any entity that stores or uses payment card data. It is also important to note that compliance covers all applications and systems at a merchant or a payment service provider.
The PCI Security Standards Council defines the following 12 requirements for PCI-DSS, as depicted in the table below.
Illustration: PCI Data Security Standard – high level overview (source: shopify.com)
While PCI covers a whole range of areas that touch payment data – POS terminals, payment card readers, in-store networks and so on – data security is front and center.
It is to be noted, though, that according to the PCI Security Standards Council, which oversees the creation of and guidance around the standard, a technology vendor or product cannot itself be declared “PCI compliant.”
Thus, the standard has wide implications along two dimensions –
1. The technology itself as it is deployed at a merchant, and
2. The organizational culture around information security policies.
My experience working at both Hortonworks and Red Hat has shown me that open source software runs demanding, certified workloads at hundreds of enterprise customers in verticals such as financial services, retail, insurance, telecommunications and healthcare. The other important point to note is that these customers are PCI, HIPAA and SOX compliant across the board.
It is a total misconception that off-the-shelf, proprietary point solutions are needed to provide broad coverage across the security pillars discussed below. Open enterprise Hadoop offers comprehensive, well-rounded implementations across all five areas and, what is more, it is 100% open source.
Let us examine how security in Hadoop works.
The Security Model for Open Enterprise Hadoop –
The Hadoop community has thus adopted both a top-down and a bottom-up approach to security, examining all potential access patterns across all components of the platform.
Hadoop and Big Data security needs to be considered along two prongs –
- What do the individual projects themselves need to support to guarantee that business architectures built using them are highly robust from a security standpoint?
- What are the essential pillars of security that the platform which makes up every enterprise cluster needs to support?
Let us consider the first. The Apache Hadoop ecosystem contains 25+ projects in the realm of data ingestion, processing and consumption. While anything beyond a cursory look is out of scope here, an exhaustive list of the security hooks provided by each of the major projects is covered here.
For instance, Apache Ranger manages fine-grained access control through a rich user interface that ensures consistent policy administration across Hadoop data access components. Security administrators have the flexibility to define security policies for a database, table and column, or a file, and can administer permissions for specific LDAP-based groups or individual users. Rules based on dynamic conditions such as time or geolocation, can also be added to an existing policy rule. The Ranger authorization model is highly pluggable and can be easily extended to any data source using a service-based definition.
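As a sketch of what such a policy looks like in practice, the snippet below builds the kind of JSON payload an administrator could submit to Ranger’s public REST API to grant a group column-level access on a Hive table. The field names follow Ranger’s policy model, but the admin URL, service name (`hivedev`), group (`analysts`) and table names are illustrative assumptions, not taken from any real deployment.

```python
import json

# Illustrative Ranger Admin endpoint; host and port are assumptions.
RANGER_POLICY_URL = "http://ranger-admin:6080/service/public/v2/api/policy"

def hive_column_policy(service, db, table, columns, group, accesses):
    """Build a Ranger policy payload granting a group the given access
    types on selected columns of a Hive table."""
    return {
        "service": service,
        "name": f"{db}.{table}-{group}",
        "resources": {
            "database": {"values": [db]},
            "table": {"values": [table]},
            "column": {"values": columns},
        },
        "policyItems": [{
            "groups": [group],
            "accesses": [{"type": a, "isAllowed": True} for a in accesses],
        }],
    }

# Allow the (hypothetical) "analysts" group to SELECT two columns only.
policy = hive_column_policy("hivedev", "retail", "transactions",
                            ["txn_id", "amount"], "analysts", ["select"])

# An administrator would then POST this JSON with admin credentials, e.g.:
#   requests.post(RANGER_POLICY_URL, json=policy, auth=("admin", "..."))
print(json.dumps(policy, indent=2))
```

Note how the resource hierarchy (database → table → column) mirrors the fine-grained scoping described above: sensitive columns such as a PAN can simply be left out of the policy.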
Administrators can use Ranger to define centralized security policies for Hadoop components such as HDFS, Hive, HBase, Storm, Knox, Kafka and Solr, and the list is constantly being enhanced.
Ranger works with standard authorization APIs in each Hadoop component, and is able to enforce centrally administered policies for any method used to access the data lake.
Now for the second and more important question, from an overall platform perspective.
There are five essential pillars of security that address the critical requirements administrators place on data residing in a data lake. If any of these pillars is vulnerable in its implementation, risk is built into the organization’s Big Data environment. Any Big Data security strategy must address all five pillars, with a consistent implementation approach to ensure their effectiveness.
Illustration: The Essential Components of Data Security
- Authentication – does the user possess appropriate credentials? Implemented via the Kerberos authentication protocol and allied concepts such as principals, realms and KDCs (Key Distribution Centers).
- Authorization – what resources is the user allowed to access, based on business need and credentials? Implemented in each Hadoop project and integrated with an organization’s LDAP/Active Directory infrastructure.
- Perimeter Security – prevents unauthorized outside access to the cluster. Implemented via the Apache Knox Gateway, which extends the reach of Hadoop services to users outside of a Hadoop cluster. Knox also simplifies Hadoop security for users who access the cluster data and execute jobs.
- Centralized Auditing – implemented via Apache Atlas and its integration with Apache Ranger.
- Security Administration – the central setup and control of all security information from a single console. Apache Ranger provides centralized security administration and management; the Ranger Administration Portal is the central interface, where users create and update policies that are then stored in a policy database.
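To make the perimeter pillar concrete, the sketch below shows how a client-side WebHDFS call is routed through a single Knox gateway endpoint rather than directly to NameNode or DataNode hosts inside the cluster. The gateway hostname, port and topology name (`default`) are illustrative assumptions; the URL pattern follows Knox’s WebHDFS proxying convention.

```python
from urllib.parse import urlencode

# Illustrative Knox gateway endpoint; host, port and topology are assumptions.
KNOX_GATEWAY = "https://knox.example.com:8443/gateway/default"

def webhdfs_via_knox(path, op="LISTSTATUS"):
    """Return the Knox-proxied WebHDFS URL for an HDFS path.
    Clients talk only to the Knox endpoint over TLS; the cluster's
    internal topology stays hidden behind the gateway."""
    return f"{KNOX_GATEWAY}/webhdfs/v1{path}?{urlencode({'op': op})}"

url = webhdfs_via_knox("/data/retail/transactions")
# A client would then issue the request with its credentials, e.g.:
#   requests.get(url, auth=("alice", "..."), verify=True)
print(url)
```

The point of the pattern is that firewall rules need only expose the gateway, which in turn enforces authentication before any request reaches the cluster.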
Illustration: Centralized Security Administration
It is also to be noted that as Hadoop adoption grows, workloads that harness data for complex business analytics and decision-making may need more robust data-centric protection (namely data masking, encryption and tokenization). Thus, in addition to Hadoop projects such as Apache Ranger, enterprises can take an augmentative approach. Partner solutions that offer data-centric protection for Hadoop data, such as Dataguise DgSecure for Hadoop, clearly complement an enterprise-ready Hadoop distribution (such as those from the open source leader Hortonworks) and are definitely worth a close look.
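As a minimal illustration of two of these data-centric protections applied to cardholder data: masking exposes only the last four digits of a PAN for display, while tokenization substitutes a surrogate value. This is a toy sketch only – a production tokenizer would use a secure token vault or format-preserving encryption, not a plain salted hash, and the salt below is a placeholder.

```python
import hashlib

def mask_pan(pan: str) -> str:
    """Mask a PAN, exposing only the last four digits (display use)."""
    return "*" * (len(pan) - 4) + pan[-4:]

def tokenize_pan(pan: str, salt: str = "demo-salt") -> str:
    """Derive a non-reversible surrogate token for a PAN.
    Illustrative only; real tokenization uses a vault or FPE."""
    return hashlib.sha256((salt + pan).encode()).hexdigest()[:16]

masked = mask_pan("4111111111111111")   # "************1111"
token = tokenize_pan("4111111111111111")
print(masked, token)
```

With schemes like these, analytics jobs can join and aggregate on tokens without ever touching the raw PAN, which shrinks the footprint of systems in scope for PCI-DSS.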
While implementing Big Data architectures in support of business needs, security administrators should design coverage for each of the above areas into the infrastructure. A rigorous, bottom-up approach to data security makes it possible to enforce and manage security across the stack through a central point of administration, which helps prevent potential security gaps and inconsistencies. This approach is especially important for newer technology like Hadoop, where exciting new projects and data processing engines are incubated at a rapid clip. After all, the data lake is all about building a robust and highly secure platform on which data engines such as Storm and Spark and processing frameworks like MapReduce can create business magic.
References –
- Hortonworks Data Security Guide
- Open Source Alliance of America