
Big Data and Kubernetes – Best Practices in Deploying Containers with Big Data (4/4)

by Vamsi Chemitiganti

Of all enterprise IT workloads, data-based infrastructure has been the hardest to deliver using a cloud-based deployment model. Data assets are typically the last to move to a cloud – public or private. While Big Data vendors jockey to provide hybrid cloud-enabled platforms, their customers – enterprise IT – still need to take into account five key business principles.


Introduction

With Big Data and allied concepts such as machine learning and AI gaining popularity and adoption across industries, implementations are growing in both number and scope. While the first couple of generations of these implementations focused on datacenter infrastructure, enterprises are now looking to adopt a Cloud-First strategy for their data workloads.

Traditional data development has centered on a waterfall-like model with monolithic stacks and provisioning methodologies. This is in sharp contrast to the flexible, agile development models that leverage a DevOps-based process. Given that Big Data and AI-based workloads are driving the digital agenda, IT now has to support a two-part path via the Hybrid Cloud. While traditional (read: legacy) data-driven applications that provide batch-oriented data analysis and BI insights can stay on bare metal or hypervisor-based infrastructure, new-age data applications (such as streaming analysis and real-time inferencing) typically move to a private-public cloud architecture.

Recommendation #1: Irrespective of Where the Data Lives – Aim For and Stay Cloud Agnostic

In any large Big Data initiative, a range of non-traditional data has to be identified and then ingested into a set of commodity servers, either in an on-premises data center or with a cloud provider such as Amazon AWS or Microsoft Azure. It then needs to be curated by applying business-level processing. This could include identifying attractive businesses via fundamental analysis, or applying algorithms that spot patterns in the data tied to certain trending themes.

All of these non-traditional data streams can be stored on commodity hardware clusters at a fraction of the cost of traditional SAN storage. The combined data can then be analyzed effectively in near real time, providing support for advanced business capabilities.

However, what data assets stay on-premises in a private cloud and what moves to a public cloud needs to be predicated on business impact, data sensitivity, security, application architecture, and cost. There is no “one size fits all” approach.
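To make this concrete, here is a minimal PySpark sketch of a cloud-agnostic ingestion-and-curation job. The storage URIs come from configuration rather than code, so the same job can target on-premises HDFS, Amazon S3, or Azure Blob Storage; the paths, environment variables, and column names below are hypothetical.

```python
# A minimal sketch of a cloud-agnostic ingest-and-curate job. Storage
# locations are injected via configuration, so the same code runs
# against HDFS, S3, or Azure Blob Storage. All names are placeholders.
import os

from pyspark.sql import SparkSession

# e.g. "hdfs:///raw/trades", "s3a://my-bucket/raw/trades",
# or "abfss://container@account.dfs.core.windows.net/raw/trades"
SOURCE_URI = os.environ.get("SOURCE_URI", "hdfs:///raw/trades")
CURATED_URI = os.environ.get("CURATED_URI", "hdfs:///curated/trades")

spark = SparkSession.builder.appName("cloud-agnostic-ingest").getOrCreate()

# Ingest the raw, non-traditional data stream.
raw = spark.read.json(SOURCE_URI)

# Curate: apply business-level processing, e.g. keep only records that
# match a trending theme of interest.
curated = raw.filter(raw["theme"] == "renewables").dropDuplicates(["id"])

# Persist to commodity storage in an open format, avoiding provider lock-in.
curated.write.mode("overwrite").parquet(CURATED_URI)
```

Because nothing in the job references a provider-specific API, the placement decision above becomes a configuration change rather than a rewrite.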

Recommendation #2: Provide a Unified Big Data Operator and Developer Experience Across Clouds

As data moves from the edge to customer premises to the cloud, the applications that consume that data need to be located accordingly. Some questions to ask:

  • Is there a unified API to manage workloads across clouds from a single pane of glass? (A minimal sketch follows this section.)
  • Can applications be developed without undue lock-in to the cloud provider?
  • Does the architecture support a unified management and monitoring model?
  • Can security and governance be applied in as unified a manner as possible? This means unified identity management backing RBAC capabilities, with governance (appropriate data-element access based on role and business policy) built into the architecture.

Does your cloud deployment model account for these scenarios?
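As a minimal sketch of the "single pane of glass" idea, the snippet below issues the same query against clusters running in different clouds using the official Kubernetes Python client. The kubeconfig context names are hypothetical.

```python
# A sketch of one API surface spanning clusters in multiple clouds:
# the same call is issued against each cluster via its kubeconfig
# context. Context names are hypothetical placeholders.
from kubernetes import client, config

CONTEXTS = ["onprem-cluster", "aws-cluster", "azure-cluster"]

for ctx in CONTEXTS:
    # Build an API client scoped to this cluster's credentials.
    api = client.AppsV1Api(
        api_client=config.new_client_from_config(context=ctx)
    )
    deployments = api.list_deployment_for_all_namespaces()
    print(f"{ctx}: {len(deployments.items)} deployments")
    for d in deployments.items:
        print(f"  {d.metadata.namespace}/{d.metadata.name}")
```

The point is not the listing itself but that Kubernetes gives operators one consistent API regardless of which cloud hosts the cluster.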

Recommendation #3: Cloud-Aware CI/CD for Data Scientists and Developers

As opposed to legacy data applications that operate as monolithic entities, Big Data applications should aim to run natively on the cloud and support as-a-service semantics. Accordingly, Big Data applications should be composed of microservices.

Cloud-native applications are stateless, technology-independent and can be deployed in almost any context. Cloud-native is today’s choice when designing and deploying applications that are fast (in terms of deployment and change time), agile and deliver value for money.
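The sketch below illustrates what "stateless and deployable in almost any context" can look like in practice: a minimal Python microservice (using Flask, an assumption; any framework works) that keeps no local state and takes all configuration from the environment. The endpoint and variable names are hypothetical.

```python
# A minimal sketch of a stateless, cloud-native microservice. No state
# lives in the container, and configuration arrives via environment
# variables (e.g. from a Kubernetes ConfigMap), so the service can be
# rescheduled anywhere. All names are hypothetical.
import os

from flask import Flask, jsonify

app = Flask(__name__)

# Injected by the platform, never baked into the image.
MODEL_VERSION = os.environ.get("MODEL_VERSION", "v1")

@app.route("/healthz")
def healthz():
    # Liveness/readiness endpoint for the orchestrator.
    return jsonify(status="ok")

@app.route("/score")
def score():
    # Stateless request handling: no session or local-disk dependency.
    return jsonify(model_version=MODEL_VERSION, score=0.87)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))
```

Because the service holds no state, a CI/CD pipeline can build, deploy, and roll it back on any cluster without data-migration steps.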

Recommendation #4: Use Serverless Technology to Augment Big Data

The Serverless movement uses an event-based model to trigger specialized functions. Big Data architectures overwhelmingly use Kafka as a source, so the notion of stream processing pipelines is not new to the Big Data community. However, Big Data developers and data scientists can adopt as many Serverless functions as make sense for their business process: fronting their applications, validating models much faster than a legacy development process allows, running on-demand pipelines, and so on. Serverless ensures that data developers don't need to maintain and manage Hadoop/Spark infrastructure and servers; ETL and data processing pipelines and jobs can be set up and scaled on demand.
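As an illustration, here is a sketch of an event-driven function in the AWS Lambda style (one possible Serverless platform, not the only one). It consumes Kafka records and performs a lightweight model-validation step with no cluster to manage; the event shape follows the Lambda/MSK integration, and the record payload fields are hypothetical.

```python
# A sketch of a Serverless function triggered by Kafka events, standing
# in for a fast model-validation step. No Hadoop/Spark infrastructure
# is provisioned or managed; the platform scales it on demand.
import base64
import json

def handler(event, context):
    validated, rejected = 0, 0
    # The Kafka trigger delivers messages grouped by topic-partition,
    # with each record's value base64-encoded.
    for records in event.get("records", {}).values():
        for record in records:
            payload = json.loads(base64.b64decode(record["value"]))
            # Hypothetical business rule standing in for model validation.
            if 0.0 <= payload.get("prediction", -1.0) <= 1.0:
                validated += 1
            else:
                rejected += 1
    return {"validated": validated, "rejected": rejected}
```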

Recommendation #5: Manage Costs via a Managed Service

While the public clouds may seem an attractive option from a provisioning standpoint, many early adopters have complained about the high cost of running a complete data infrastructure on them. At the same time, private clouds suffer from issues ranging from availability to developer choice. Given that Big Data architectures will overwhelmingly be based on containers orchestrated with Kubernetes, a managed service such as Platform9's Managed Kubernetes is a great start. Customers get a working, highly available, production-grade infrastructure running their workloads on any cloud – in minutes. Such a platform also de-risks cloud investments by ensuring that the right provisioning and chargeback models are put in place.
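Chargeback starts with measurement. As a minimal sketch (a managed service would provide far richer metering), the snippet below sums CPU requests per namespace with the Kubernetes Python client – a simple basis for a per-team chargeback report.

```python
# A sketch of a basic chargeback input: total CPU requested per
# namespace, assuming namespaces map to teams or cost centers.
from collections import defaultdict

from kubernetes import client, config

def cpu_millicores(value: str) -> int:
    # Convert Kubernetes CPU quantities ("500m" or "2") to millicores.
    return int(value[:-1]) if value.endswith("m") else int(float(value) * 1000)

config.load_kube_config()
v1 = client.CoreV1Api()

usage = defaultdict(int)
for pod in v1.list_pod_for_all_namespaces().items:
    for c in pod.spec.containers:
        requests = c.resources.requests or {}
        if "cpu" in requests:
            usage[pod.metadata.namespace] += cpu_millicores(requests["cpu"])

for ns, millicores in sorted(usage.items()):
    print(f"{ns}: {millicores}m CPU requested")
```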

Data Native Architectures Converge with Cloud Native Architectures

Most Cloud Native architectures are designed in response to Digital Business initiatives, where it is important to personalize and to track minute customer interactions, and their main components leverage a microservices-based design. Given all this, it is important to note that a Big Data stack based on Hadoop (Gen 2) is not just a data processing platform. It has multiple personas: a real-time, streaming, interactive platform that can perform any kind of data processing (batch, analytical, in-memory, and graph-based) while providing search, messaging, and governance capabilities. Thus, Hadoop provides not just massive data storage but also multiple frameworks to process that data, yielding millisecond response times with the utmost reliability, whether for real-time data or historical processing of backend data. My bet for 2019 and 2020 is that these capabilities will increasingly be harnessed as part of a DevOps process to drive microservices-based deployments.

Conclusion

Data is the key to ensuring that digital interactions are infused with insights. Enterprises need to arrive at a careful, well-thought-out Cloud Native data architecture, whether they use data lakes or data warehouses on a hybrid cloud. Architectures and development processes need to be remodeled around the above constructs. Enterprises will derive a lot of value from managed service offerings across these areas.

