The Seven Characteristics of Cloud Native Application Architectures..

We are in the middle of a series of blogs on Software Defined Datacenters (SDDC) @ http://www.vamsitalkstech.com/?p=1833. The key business imperative driving SDDC architectures is their ability to natively support digital applications. Digital applications are “Cloud Native” (CN) in the sense that these platforms are written for cloud frameworks from the outset – instead of being ported over to the Cloud as an afterthought. Thus, Cloud Native application development is emerging as the most important trend in digital platforms. This blog post will define the seven key architectural characteristics of these CN applications.

Image Credit – Shutterstock

What is driving the need for Cloud Native Architectures… 

The previous post in the blog covered the monolithic architecture pattern. Monolithic architectures, which currently dominate the enterprise landscape, are coming under tremendous pressure in various ways and are increasingly perceived to be brittle. Chief among these pressures are massive user volumes, DevOps style development processes, the need to open up business functionality locked within applications to partners, and the heavy human effort required to deploy & manage monolithic architectures. Monolithic architectures also introduce technical debt into the datacenter, which makes it very difficult for the business lines to introduce changes as customer demands change – a key antipattern for digital deployments.

Why Legacy Monolithic Architectures Won’t Work For Digital Platforms..

Applications that require a high release velocity and present many complex moving parts, whether they are worked on by a few or by many development teams, are an ideal fit for the CN pattern.

Introducing Cloud Native Applications…

There is no single and universally accepted definition of a Cloud Native application. I would like to define a CN Application as “an application built using a combination of technology paradigms that are native to cloud computing – including distributed software development, a need to adopt DevOps practices, microservices architectures based on containers, API based integration between the layers of the application, software automation from infrastructure to code, and finally orchestration & management of the overall application infrastructure.”

Further, Cloud Native applications need to be architected, designed, developed, packaged, delivered and managed based on a deep understanding of the frameworks of cloud computing (IaaS and PaaS).

Characteristic #1 CN Applications dynamically adapt to & support massive scale…

The first & foremost characteristic of a CN Architecture is the ability to dynamically support massive numbers of users, large development organizations & highly distributed operations teams. This requirement is even more critical when one considers that cloud computing is inherently multi-tenant in nature.

Within this area, the following typical concerns need to be accommodated –

  1. the ability to grow the deployment footprint dynamically (Scale-up) as well as to decrease the footprint (Scale-down)
  2. the ability to gracefully handle failures across tiers that can disrupt application availability
  3. the ability to accommodate large development teams by ensuring that components themselves provide loose coupling
  4. the ability to work with virtually any kind of infrastructure (compute, storage and network) implementation

Characteristic #2 CN applications need to support a range of devices and user interfaces…

The User Experience (UX) is the most important part of a human facing application. This is particularly true of Digital applications which are omnichannel in nature. End users could not care less about the backend engineering of these applications as they are focused on an engaging user experience.

Demystifying Digital – the importance of Customer Journey Mapping…(2/3)

Accordingly, CN applications need to natively support mobile applications. This includes the ability to support a range of mobile backend capabilities – ranging from authentication & authorization services for mobile devices, location services, customer identification, push notifications, cloud messaging, toolkits for iOS and Android development etc.

Characteristic #3 They are automated to the fullest extent they can be…

The CN application needs to be abstracted completely from the underlying infrastructure stack. This is key as development teams can focus solely on writing their software and do not need to worry about the maintenance of the underlying OS/Storage/Network. One of the key challenges with monolithic platforms (http://www.vamsitalkstech.com/?p=5617) is their inability to efficiently leverage the underlying infrastructure as they have a high degree of dependency on it. Further, the lifecycle of infrastructure provisioning, configuration, deployment, and scaling is mostly manual with lots of scripts and pockets of configuration management.

The CN application, on the other hand, has to be very light on manual asks given its scale. The provision-deploy-scale cycle is highly automated with the application automatically scaling to meet demand and resource constraints and seamlessly recovering from failures. We discussed Kubernetes in one of the previous blogs.

Kubernetes – Container Orchestration for the Software Defined Data Center (SDDC)..(5/7)

Frameworks like these support CN Applications in providing resiliency, fault tolerance and in generally supporting very low downtime.

Characteristic #4 They support Continuous Integration and Continuous Delivery…

The reduction of the vast amount of manual effort witnessed in monolithic applications is not just confined to their deployment as far as CN applications are concerned. From a CN development standpoint, the ability to quickly test and perform quality control on daily software updates is an important aspect. CN applications automate the application development and deployment processes using the paradigms of CI/CD (Continuous Integration and Continuous Delivery).

The goal of CI is that every time source code is added or modified, the build process kicks off & the tests are run immediately. This helps catch errors faster and improves the quality of the application. Once the CI process is done, the CD process builds the application into an artifact suitable for deployment after combining it with suitable configuration. It then deploys the artifact onto the execution environment with the appropriate identifiers for versioning in a manner that supports rollback. CD ensures that the tested artifacts are instantly deployed with acceptance testing.
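The CI/CD flow described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the `Environment` class, the `pipeline` function and the `app-<commit>` naming scheme are hypothetical and do not correspond to any particular CI/CD product.

```python
class Environment:
    """Hypothetical execution environment that remembers deployed versions."""

    def __init__(self):
        self.history = []  # deployed artifact versions, most recent last

    @property
    def live(self):
        # The currently serving version, or None if nothing is deployed.
        return self.history[-1] if self.history else None

    def deploy(self, version):
        self.history.append(version)

    def rollback(self):
        # Drop the most recent version and fall back to the previous one.
        self.history.pop()
        return self.live


def pipeline(commit_id, tests, env, acceptance_check):
    """Run CI (tests) and then CD (package, deploy, acceptance test, rollback)."""
    # CI: every change triggers the tests; stop early on any failure.
    if not all(test() for test in tests):
        return "ci-failed"
    # CD: package a versioned artifact and deploy it.
    artifact = f"app-{commit_id}"
    env.deploy(artifact)
    # Acceptance testing after deployment; roll back if it fails.
    if not acceptance_check(artifact):
        env.rollback()
        return "rolled-back"
    return "deployed"


env = Environment()
first = pipeline("a1", [lambda: True], env, lambda artifact: True)
second = pipeline("b2", [lambda: True], env, lambda artifact: False)
third = pipeline("c3", [lambda: False], env, lambda artifact: True)
```

In this run the first commit is deployed, the second fails acceptance testing and is rolled back to the first, and the third never leaves CI because a test fails.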

Characteristic #5 They support multiple datastore paradigms…

The RDBMS has been a fixture of the monolithic application architecture. CN applications, however, need to work with data formats of the loosely structured kind as well as the regularly structured data. This implies the need to support data streams that are not just high speed but also are better suited to NoSQL/Hadoop storage. These systems provide Schema on Read (SOR) which is an innovative data handling technique. In this model, a format or schema is applied to data as it is accessed from a storage location as opposed to doing the same while it is ingested. As we will see later in the blog, individual microservices can have their own local data storage.
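To make the Schema on Read idea concrete, here is a minimal sketch in Python. The raw records, the field names and the `PAYMENT_SCHEMA` mapping are all made up for illustration; the point is only that the data is stored as it arrived and the schema is applied when it is read.

```python
import json

# Raw events are stored exactly as they arrived; nothing is enforced at write time.
RAW_EVENTS = [
    '{"user": "alice", "amount": "42.50", "channel": "mobile"}',
    '{"user": "bob", "amount": "17", "device": {"os": "iOS"}}',
    '{"user": "carol"}',
]

# Hypothetical read-time schema: field name -> (converter, default value).
PAYMENT_SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
    "channel": (str, "unknown"),
}


def read_with_schema(raw_line, schema):
    """Parse one stored record and project it onto the schema at read time."""
    record = json.loads(raw_line)
    return {
        field: convert(record[field]) if field in record else default
        for field, (convert, default) in schema.items()
    }


rows = [read_with_schema(line, PAYMENT_SCHEMA) for line in RAW_EVENTS]
```

A different consumer could read the same stored events with a different schema (for example one that keeps the `device` field) without anything being re-ingested.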

A Holistic New Age Technology Approach To Countering Payment Card Fraud (3/3)…

Characteristic #6 They support APIs as a key feature…

APIs have become the de facto model that provides developers and administrators with the ability to assemble Digital applications, such as those built from microservices, out of complicated componentry. Thus, there is a strong case to be made for adopting an API centric strategy when developing CN applications. CN applications use APIs in multiple ways – firstly as the way to interface loosely coupled microservices (which abstract out the internals of the underlying application components). Secondly, developers use well-defined APIs to interact with the overall cloud infrastructure services. Finally, APIs enable the provisioning, deployment, and management of platform services.

Why APIs Are a Day One Capability In Digital Platforms..

Characteristic #7 Software Architecture based on microservices…

As James Lewis and Martin Fowler define it – “..the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.” [1]

Microservices are a natural evolution of the Service Oriented Architecture (SOA). The application is decomposed into loosely coupled business functions, each of which is mapped to a microservice. Each microservice is built for a specific granular business function and can be worked on by an independent developer or team. As such it is a separate code artifact and is thus loosely coupled not just from a communication standpoint (typically communicating over a RESTful API with data being passed around using a JSON/XML representation) but also from a build, deployment, upgrade and maintenance process perspective. Each microservice can optionally have its own localized datastore. An important advantage of adopting this approach is that each microservice can be created using a separate technology stack from the other parts of the application. Docker containers are the right choice to run these microservices on. Microservices confer a range of advantages, including easier builds, independent deployment and independent scaling.

A Note on Security…

It goes without saying that security is a critical part of CN applications and needs to be considered and designed for as a cross-cutting concern from the inception. Security concerns impact the design & lifecycle of CN applications, ranging from deployment to updates to image portability across environments. A range of technology choices is available to cover various areas such as Application level security using Role-Based Access Control, Multifactor Authentication (MFA), and A&A (Authentication & Authorization) using protocols such as OAuth, OpenID, SSO etc. Container Security is fundamental to this topic and there are many vendors working on ensuring that once applications are built as part of a CI/CD process as described above, they are packaged into labeled (and signed) containers which can be made part of a verified and trusted registry. This ensures that container image provenance is well understood and protects any users who download the containers for use across their environments.

Conclusion…

In this post, we have tried to look at some architecture drivers for Cloud-Native applications. It is a given that organizations moving from monolithic applications will need to take nimble, small steps to realize the ultimate vision of business agility and technology autonomy. The next post, however, will look at some of the critical foundational investments enterprises will have to make before choosing the Cloud Native route as a viable choice for their applications.

References..

[1] Martin Fowler – https://martinfowler.com/intro.html

Apache Mesos: Cluster Manager for the Software Defined Data Center ..(3/7)

The second and previous blog in this six part series (@ http://www.vamsitalkstech.com/?p=4670) discussed technical challenges with running large scale Digital Applications on traditional datacenter architectures. In this third blog, we will deep dive into another important ecosystem platform – Apache Mesos, a project that aims to abstract away various system resources (CPU, memory, network and disk) to provide consuming digital applications with a giant cluster from which they can draw capacity – a key requirement of the Software Defined Datacenter (SDDC). The next blogpost will deep dive into Linux Containers & Docker.

Introduction and the need for Apache Mesos..

This blog has from time to time discussed how Digital applications are a diverse blend of several different and broad technology paradigms – Big Data, Intelligent Middleware, Messaging, Business Process Management, Data Science et al.

To that end almost every Enterprise Datacenter supporting Digital workloads typically has clusters of multi-varied applications installed. Most traditional datacenters have used either physical or virtual machines (VMs) as the primary runtime unit to run such applications. These VMs are typically provisioned based on application asks and have applications deployed onto them. These VMs then are formed into logical clusters which are essentially a series of machines serving a given business application in an n-tier architecture.

As load increases on these servers, more VMs are provisioned into the cluster and so on. The challenge with this traditional model is that it is fairly static in nature in the sense that machines are preallocated to run certain kinds of workloads (databases, webservers, developer servers etc). The challenge with Digital and Cloud Native applications is that scaling needs to happen dynamically and applications think of the infrastructure as being infinite. These applications present various challenges and headaches that call for the Datacenter to be software defined, as we discussed in the last blog below. We will continue our look at the SDDC by considering one of the important projects in this landscape – Apache Mesos.

Why Digital Platforms Need A Software Defined Datacenter..(2/6)

Apache Mesos is a project that was developed at the University of California at Berkeley circa 2009. While it was initially created to solve the challenge of provisioning and scaling Spark clusters, the Mesos project evolved to become a centralized cluster manager. The central idea of Mesos is to pool together all the physical resources of the cluster and make them available as a single reservoir of highly available resources for different applications (or frameworks) to consume. Over time, Mesos has begun supporting complex n-tier application platforms that leverage capabilities such as Hadoop, Middleware, Jenkins, Kafka, Spark, Machine Learning etc.

As with almost all innovative Cloud & Big Data projects, the adoption of Apache Mesos has primarily been in the web scale arena. Prominent users include highly technical engineering shops such as Twitter, Netflix, Airbnb, Uber, eBay, Yelp and Apple. However, there seems to be early adopter activity with increased acceptance in the Fortune 100. For instance, Verizon signed on in 2015 to use Mesosphere DC/OS (based on Apache Mesos) for datacenter orchestration.

The Many Definitions of Mesos..

At its simplest, Mesos is an Open Source Cluster Manager. What does that mean? Mesos can be described as a cluster manager because it ensures that datacenter hardware resources are managed and advantageously shared among multiple distributed technologies – Big Data, Message Oriented Middleware, Application Servers, Mobile apps etc. Mesos also enables applications to scale with a high degree of resiliency, without having to bother about details of the underlying infrastructure.

The model of resource allocation followed by Mesos allows a range of constituents (sys-admins, developers & DevOps teams) to request resources (CPU, RAM, Storage) from a cloud provider.

Mesos has alternatively been described as a Datacenter Kernel as it provides a single unified view of node resources to software frameworks that wish to consume them via APIs. Mesos performs the role of an Intelligent global level scheduler that can match a massive pool of hardware resources to distributed applications that want to consume these resources. Mesos aggregates all the resources into a large virtual pool using not just virtual machines and containers but primitives such as CPU, I/O and RAM. It also breaks applications into small units that can be assigned across this pool. Mesos also provides APIs in multiple languages to allow applications to be built for it. Apache Spark, the most popular data processing engine, was built originally as a Mesos framework.

It is also called a Data Center Operating System (DCOS) as it performs a similar role to the operating system. Any application that can run on Linux runs on Mesos.


To illustrate how Mesos works, consider two clusters in a datacenter – Cluster A and Cluster B. Cluster A has 8 nodes with each node/server possessing 4 CPUs and 64 GB RAM; Cluster B has 5 nodes with each node/server having 4 CPUs and 64 GB RAM. Mesos can essentially combine both these clusters into one virtual cluster of 52 CPUs and 832 GB RAM. The advantage of this approach is that cluster usage is greatly improved because applications share resources much more efficiently.
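The arithmetic behind this example is easy to check with a short Python sketch. The `Cluster` class and `pooled_capacity` function are illustrative names only, not part of any Mesos API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    nodes: int
    cpus_per_node: int
    ram_gb_per_node: int


def pooled_capacity(clusters):
    """Sum CPU and RAM across every node of every cluster into one virtual pool."""
    total_cpus = sum(c.nodes * c.cpus_per_node for c in clusters)
    total_ram_gb = sum(c.nodes * c.ram_gb_per_node for c in clusters)
    return total_cpus, total_ram_gb


cluster_a = Cluster("A", nodes=8, cpus_per_node=4, ram_gb_per_node=64)
cluster_b = Cluster("B", nodes=5, cpus_per_node=4, ram_gb_per_node=64)

cpus, ram_gb = pooled_capacity([cluster_a, cluster_b])
print(cpus, ram_gb)  # 52 832
```

That is, 8 x 4 + 5 x 4 = 52 CPUs and 13 x 64 = 832 GB RAM.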

Mesos and Cloud Native Applications..

We discussed the differences between Cloud Native and legacy applications in the previous post @ http://www.vamsitalkstech.com/?p=4670 . Mesos has been impactful when running stateless Cloud Native applications as opposed to running traditional applications which are built on a stateful/vertical scaling paradigm. While the defining features of Cloud Native applications are worthy of a dedicated blogpost, these applications can scale to handle massive & increasing amounts of load while tolerating any failure without impacting service. These applications are also intrinsically distributed in nature and are typically composed of loosely coupled microservices. Examples include – stateless web applications running on a Platform as a Service (PaaS), CI/CD applications working on Jenkins, NoSQL databases like HBase, Cassandra, Couchbase and MongoDB. Stateful applications that persist data to disk using an RDBMS aren’t good workloads for Mesos as yet.

When Cloud Native Digital applications are run on Mesos, several of the headaches encountered in running these on legacy datacenters are ameliorated, namely –

  1. Clusters can be dynamically provisioned by Mesos based on demand spikes
  2. Location independence for microservices
  3. Fault tolerance

As it matures, Mesos has also begun supporting multi datacenter deployments with web scale shops like Uber running Cassandra as a framework across datacenters at scale. In the case of Uber, each datacenter has its own Mesos cluster with independent frameworks that exchange information periodically. The Cassandra database includes a seed node that bootstraps the gossip process for new nodes joining the cluster. A custom seed provider was created to launch Cassandra nodes which allows new nodes to be rolled out automatically into the Mesos cluster in each datacenter. (Credit – Abhishek Verma – Uber)

Mesos Architecture..

There are three main architectural primitives in Mesos – Master, Slave, Frameworks. The central orchestrator in the Mesos system is called a Master and the worker processes are called Slaves.

As depicted below, the Master process manages the overall cluster and delegates tasks to the slaves based on the resources requested by Frameworks.

The core Mesos process is installed on all nodes and each node’s role (Master or Slave) is assigned at runtime. The Slaves run application workloads that are requested by appropriate frameworks. This overall setup of Master and Slave daemons makes up a Mesos cluster.

Frameworks, which are commonly called Mesos applications, are composed of two main components. First off, they have a scheduler which registers with the Master to receive resource offers, and then executors which launch workloads or tasks on the slaves. The resource offers are a simple list of a slave’s available capacity – CPU and Memory. The Master receives these offers from the slaves and then provides them to the frameworks. A task can be anything really – a simple script or a command, a MapReduce job, or the initialization of a Jetty/Tomcat/JBOSS AS etc.

The Mesos executor is a program or command on the Slave that runs tasks. No matter which isolation module is used, the executor packages all resources and runs the task on the slave node. When the task is complete, the containers are destroyed and the Slave’s resources are released back to the Master.

For Master HA, you can run multiple masters with only one Active at a given point communicating with the slave nodes. Once the Hot Master fails, Apache Zookeeper is used to manage leader election to a standby Master as depicted. A quorum requires a minimum of 3 Master nodes, but most production deployments are recommended to have 5 Master nodes. Once a new Master is elected, all of the cluster/slave and framework information is submitted to the new Master by the frameworks so that state before failure can be reconstructed. Mesos has elaborate recovery processes for the frameworks, the schedulers and the Slave nodes.
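The preference for 3 or 5 Masters follows from simple majority-quorum arithmetic, sketched below in Python. This is a generic majority calculation offered as an assumption about how the quorum is sized; the function names are illustrative and are not Mesos or Zookeeper APIs.

```python
def majority_quorum(masters: int) -> int:
    """Smallest number of masters that forms a strict majority."""
    if masters < 1:
        raise ValueError("need at least one master")
    return masters // 2 + 1


def tolerated_failures(masters: int) -> int:
    """How many masters can be lost while a majority is still reachable."""
    return masters - majority_quorum(masters)


for n in (1, 3, 4, 5):
    print(n, majority_quorum(n), tolerated_failures(n))
```

Three masters survive one failure and five survive two, while four masters still survive only one. Under this majority assumption, that is why an odd number of Masters is the usual recommendation.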

Apache Mesos Architecture comprises Master Nodes, Slave Nodes and Frameworks.

By some measures, Mesos is a very straightforward concept. Frameworks need to run tasks, and these are traffic managed by Masters which coordinate tasks on worker machines called Slaves.

From a production deployment standpoint, the following components are required – an odd number of Mesos Masters, as many Slave machines as are needed to run applications, a Zookeeper ensemble for HA configurations and an optional Docker engine running on each slave.

The Mesos Resource Allocation Process..

Mesos follows a default resource scheduling model known as two-tier scheduling. This model may seem a little convoluted but it is important to keep in mind that it was designed to satisfy the requirements & constraints of many different frameworks without having to know details of each.

The Master’s allocation module receives resource offers from slaves and then forwards them on to the framework schedulers. These offers specify not just which resources are available but also how much of each resource is on offer. The framework schedulers can accept or reject the Master’s offers based on their current capacity requirements. The Master’s allocation module is customizable based on specific requirements that implementing enterprises may have. The default allocation algorithm is known as Dominant Resource Fairness (DRF) and is based on fair sharing of cluster resources among requesting applications. For instance, DRF ensures that requests are equalized, i.e. CPU hungry applications are provided a higher share of CPU & Memory intensive applications are provided the same fractional amount of RAM.
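A simplified sketch of the DRF idea in Python may help here. It is not the Mesos allocator: the function name, the cluster size and the two framework demands are assumptions chosen for illustration, and the loop is a basic "progressive filling" that repeatedly gives one more task to whichever framework currently has the lowest dominant share.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Simplified DRF: capacity maps resource -> total, demands maps
    framework -> per-task resource needs. Returns task counts and dominant shares."""
    used = {r: 0 for r in capacity}
    allocated = {f: {r: 0 for r in capacity} for f in demands}
    tasks = {f: 0 for f in demands}
    active = set(demands)

    def dominant_share(f):
        # A framework's dominant share is its largest fractional use of any resource.
        return max(Fraction(allocated[f][r], capacity[r]) for r in capacity)

    while active:
        # Pick the framework with the lowest dominant share (name order breaks ties).
        f = min(sorted(active), key=dominant_share)
        need = demands[f]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
                allocated[f][r] += need[r]
            tasks[f] += 1
        else:
            # Simplification: a framework whose next task no longer fits drops out.
            active.remove(f)
    return tasks, {f: dominant_share(f) for f in demands}


capacity = {"cpus": 9, "mem_gb": 18}
demands = {
    "cpu_heavy": {"cpus": 3, "mem_gb": 1},
    "memory_heavy": {"cpus": 1, "mem_gb": 4},
}
tasks, shares = drf_allocate(capacity, demands)
```

With these made-up numbers the CPU hungry framework ends up with 2 tasks (6 of 9 CPUs) and the memory intensive one with 3 tasks (12 of 18 GB), so both hold the same two-thirds share of their dominant resource.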

Mesos follows a two level resource allocation policy (Image Credit – Apache Mesos Project Documentation)

To better illustrate the resource allocation method in Mesos, let us discuss the sequence of events in the above figure from the Apache Mesos documentation [1].

  1. The Slave Node – as depicted, Agent 1 reports to the master that it has 4 CPUs and 4 GB of memory available for allocation to any framework that can use them. The allocation policy module decides to offer framework 1 these resources.
  2. The Master sends a resource offer describing what is available on agent 1 to framework 1.
  3. The Framework’s scheduler then provides the master with more information on the two tasks to run on the agent, using <2 CPUs, 1 GB RAM> for the first task, and <1 CPU, 2 GB RAM> for the second task.
  4. The master sends the tasks to the agent, which allocates appropriate resources to the framework’s executor, which in turn launches the two tasks (depicted with dotted-line borders in the figure). Because 1 CPU and 1 GB of RAM are still unallocated, the allocation module may now offer them to framework 2.
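The bookkeeping in the four steps above can be replayed with a small Python sketch. The `Resources` class and `launch_tasks` function are hypothetical helpers for illustration; they mimic the arithmetic of the offer cycle and are not the Mesos scheduler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    mem_gb: int

    def __sub__(self, other):
        return Resources(self.cpus - other.cpus, self.mem_gb - other.mem_gb)

    def fits_in(self, other):
        return self.cpus <= other.cpus and self.mem_gb <= other.mem_gb


def launch_tasks(offer, tasks):
    """Accept an offer for as many tasks as fit; return launched tasks and leftovers."""
    remaining = offer
    launched = []
    for task in tasks:
        if task.fits_in(remaining):
            launched.append(task)
            remaining = remaining - task
    return launched, remaining


# Steps 1 and 2: agent 1 reports 4 CPUs / 4 GB, which the master offers to framework 1.
offer = Resources(cpus=4, mem_gb=4)
# Step 3: framework 1 replies with two tasks.
tasks = [Resources(cpus=2, mem_gb=1), Resources(cpus=1, mem_gb=2)]
# Step 4: the tasks are launched and the remainder can be offered to framework 2.
launched, leftover = launch_tasks(offer, tasks)
print(leftover)  # Resources(cpus=1, mem_gb=1)
```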

Mesos integration with other SDDC components – Linux Containers, Docker, OpenStack, Kubernetes etc

The Mesosphere stack (Credit – Alexander Rukletsov)

As with other platforms we are discussing in this series, Mesos does not stand alone in the SDDC and leverages other technologies as needed and as discussed in the last post (@ http://www.vamsitalkstech.com/?p=4670). However it needs to be stated that Mesos does have overlapping functionality at times with technologies such as Kubernetes and OpenStack.

However, let us consider the integration points between these technologies –

  1. Linux Containers – Over the last few years, Linux containers have emerged as a viable and lightweight alternative to hypervisors as a way of running multiple applications on a given OS. Different containers share one underlying OS and perform with less overhead than virtual machines. Given that one of the chief goals of Mesos is to run multiple frameworks on the same set of hardware, Mesos implements what are called isolation modules and isolation mechanisms to achieve its goal of multi-tenancy for different applications running on the same hardware. Mesos supports popular technologies for process isolation – cgroups, Solaris Zones, Docker containers. The first two are the default but the Mesos project has recently added Docker as an isolation mechanism.
  2. Schedulers – There is no single widely accepted definition as to what constitutes a Container Orchestration technology. The tooling to achieve this has become one of the trickiest parts of the discussion around launching containers at scale, with multiple projects attempting to capture this market. The requirement in the case of Mesos is straightforward – frameworks constitute applications which need to make the most efficient use of hardware. This means avoiding the overhead of VMs and leveraging containers – cgroups or Docker or Rocket etc. Hence Mesos needs to be able to support container orchestration as a core feature. Mesos follows a pluggable model for container orchestration by supporting schedulers like Kubernetes or YARN or Marathon or Docker Swarm. All these tools provide services that organize containers into clusters and run them on specified servers, along with overall lifecycle management and scheduling of applications running as containers. At large webscale properties, massive container oriented environments running hundreds of microservices are all being managed with this combination of tools using Mesos. Mesos needs to be able to start and stop services in response to failure conditions etc.
  3. Private and Public Cloud Infrastructure as a Service (IaaS) Providers – Mesos works at a different layer of abstraction than an IaaS provider such as OpenStack and aims to solve different problems. While OpenStack provides provisioned infrastructure across OS, Storage, Networking etc., Mesos intends to achieve better cloud instance utilization. Mesos integrates well with OpenStack and runs on top of resources offered up by OpenStack to run frameworks on them. Mesos itself runs on Linux instances in an existing OpenStack deployment though it can also simply run on bare metal. It only requires a small Linux process to run on each of the nodes. Mesos is also significantly simpler than OpenStack and it takes a few hours at most to get it up and running.
    Mesos has also been deployed on public cloud technology with both Microsoft Azure and Amazon AWS. Azure’s container services are built on Mesos. Netflix leverages Mesos extensively on their EC2 cloud and have also written an advanced scheduling library called Fenzo. Fenzo ensures that a first fit kind of assignment is followed where tasks are ‘bin packed’ onto Agents by the requested use of CPU, memory and network bandwidth. Fenzo also autoscales cluster usage based on demand and also spreads tasks of a given job across EC2 availability zones for high availability. [2]

With the stage set from a technology standpoint, let us look at a few real world use cases where Mesos has been deployed in mission critical applications at Netflix.
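Before turning to those use cases, the ‘first fit’ bin packing idea mentioned above in connection with Fenzo can be illustrated with a generic Python sketch. To be clear, this is not Fenzo’s API (Fenzo is a separate library with its own interfaces); the function, agent names and task sizes are all assumptions made up for this example.

```python
def first_fit(tasks, agents):
    """Place each task on the first agent with enough free capacity.

    tasks:  dict of task name -> (cpus, mem_gb, net_mbps) requested
    agents: dict of agent name -> (cpus, mem_gb, net_mbps) available
    Returns a dict of task name -> agent name (or None if nothing fits).
    """
    free = {name: list(capacity) for name, capacity in agents.items()}
    placement = {}
    for task, demand in tasks.items():
        for agent, available in free.items():
            if all(d <= a for d, a in zip(demand, available)):
                # Reserve the resources on this agent and stop searching.
                for i, d in enumerate(demand):
                    available[i] -= d
                placement[task] = agent
                break
        else:
            placement[task] = None  # no agent can host this task
    return placement


agents = {"agent-1": (4, 16, 1000), "agent-2": (4, 16, 1000)}
tasks = {
    "t1": (2, 8, 100),
    "t2": (2, 4, 100),
    "t3": (1, 8, 100),
    "t4": (1, 4, 100),
    "t5": (8, 1, 1),
}
placement = first_fit(tasks, agents)
```

Here t1 and t2 fill up the CPUs on agent-1, so t3 and t4 spill over to agent-2, and t5 is left unplaced because no single agent has 8 free CPUs.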

Mesos Deployment @ Netflix..

Netflix is one of the largest adopters of and contributors to Mesos and uses it across a wide variety of business capabilities. These use cases include real time anomaly detection, data science lifecycle (training and model building batch jobs, machine learning orchestration), and other business applications. These workloads span a range of technical architectures – batch processing, stream processing and running microservices based applications.

Netflix runs its business applications as a collection of microservices deployed on Amazon EC2 and its first use of Mesos was to perform fine grained resource allocation for compute tasks to gain greater unit efficiency on EC2. The first use case for Mesos at large enterprises is typically around increasing the usage and efficiency of elastic cloud services. In Netflix’s case, they needed the cluster scheduler to handle both agent ephemerality and the autoscaling of agents based on demand.

Major Application Use Cases –

  • Mantis – Netflix deals with a lot of operational data that is constantly streaming in to their environment. They have a range of use cases on streaming data such as real-time dashboarding, alerting, anomaly detection, metric generation, and ad-hoc interactive exploration of streaming data. To serve these, Mantis is a reactive stream processing platform that is deployed as a cloud native service which focuses on operational data streams. The other goal of Mantis is to make it easy for different development teams to obtain access to real time events and then to build applications on them. The current throughput of Mantis is around 8 million events per second and Apache Mesos is running hundreds of stream-processing jobs around the clock. For certain kinds of streaming applications, this amounts to tracking millions of unique combinations of data all the time.

    Mantis Architecture is based on Apache Mesos ..
  • Titus – As mentioned above, Netflix runs its application services stack on Amazon EC2 and most workloads run on Linux containers. Netflix created Titus as a container management platform to provision Docker containers on EC2. Netflix had to do this as Amazon ECS was not yet up to par as a container orchestration solution for EC2. The use cases supported by Titus include serving batch jobs which help with algorithm training (similar titles for recommendations, A/B test cell analysis, etc.) as well as hourly ad-hoc reporting and analysis jobs. Titus recently added support for service style invocation for Netflix resources that are used to provide consistent development environments and more fine grained resource management.

    Titus is a Container management platform that provisions Docker containers on EC2.

  • Meson – One of the most important capabilities that Netflix possesses is its uncanny ability to predict what movies and shows its subscribers want to watch based on their previous watching history and similar segmentation data. Netflix excels at personalizing video recommendations and this capability is powered by machine learning algorithms. To ensure that a very large number of machine learning workflow pipelines can be efficiently created, scheduled and managed – Netflix created Meson on top of Apache Mesos. For this system to scale and for the algorithms themselves to be fast, reliable and efficient, it is critical that these pipelines are run over a large cluster of Amazon AWS instances. As depicted below, Meson manages a large number of jobs with differing CPU, Memory and Disk requirements. Once the slaves/agents are chosen, Spark jobs are run on these shared clusters. Meson uses Linux cgroups based isolation. All of the resource scheduling is handled via Fenzo (described above).

    Meson is a platform used to create high velocity Data Science pipelines that power much of Netflix’s intelligent applications.

Conclusion..

Apache Mesos is a promising new technology which attempts to solve scaling and clustering challenges encountered in the Software Defined Datacenter (SDDC). The biggest benefits of using Mesos are more efficient use of infrastructure across complex applications with native support for multitenant applications. Mesos can ensure that multiple kinds of applications or frameworks can share a given set of nodes. This ensures not just more efficient sharing of hardware but also fault tolerance and load balancing for complex Cloud Native applications.

While Mesos has had a good degree of adoption in the webscale properties where it was first created (Twitter, Netflix, Uber and Airbnb, to name the most prominent), it still needs to be proven as a dependable and robust platform in the datacenter.

The next post in this series will explore another exciting technology, Docker, the emerging standard in the Linux container space.

References

[1] Apache Mesos Documentation – http://mesos.apache.org/documentation/latest/architecture/

[2] Distributed Resource Scheduling with Apache Mesos at Netflix – Medium.com

Why Software Defined Infrastructure & why now..(1/6)

The ongoing digital transformation in key verticals like financial services, manufacturing, healthcare and telco has incumbent enterprises fending off a host of new market entrants. Enterprise IT’s best answer is to increase the pace of innovation as a way of driving increased differentiation in business processes. Though data analytics & automation remain the lynchpin of this approach, software defined infrastructure (SDI) built on the notions of cloud computing has emerged as the main infrastructure differentiator, for a host of reasons that we will discuss in this two part blog.

Software Defined Infrastructure (SDI) is essentially an idea that brings together advances in a host of complementary areas spanning infrastructure software, data and development environments. It supports a new way of building business applications. The core idea in SDI is that massively scalable applications (in support of diverse customer needs) describe their behavior characteristics (via configuration & APIs) to the underlying datacenter infrastructure, which simply obeys those commands in an automated fashion while abstracting away the underlying complexities.

SDI as an architectural pattern was originally made popular by the web scale giants – the so-called FANG companies of tech: Facebook, Amazon, Netflix and Alphabet (the erstwhile Google) – but has gradually begun making its way into the enterprise world.

Common Business IT Challenges prior to SDI – 
  1. The cost of hardware infrastructure typically grows every year at a higher rate than the total IT budget. Cost pressures are driving an overall re-look at the different tiers across the IT landscape.
  2. Infrastructure is not yet completely under the control of the IT application development teams, even though business realities dictate rapid app development to meet changing business requirements.
  3. Even small, departmental-level applications still need expensive proprietary stacks, which are not only prohibitive in cost and deployment footprint but also take weeks to spin up in terms of provisioning cycles.
  4. Big box proprietary solutions are leading to a hard look at Open Source technologies, which are lean and easy to use with a lightweight deployment footprint. Apps need to dictate footprint, not vendor provided containers.
  5. There are concerns with acquiring developers who are tooled on cutting edge development frameworks & methodologies. You have zero developer mindshare with Big Box technologies.

Key characteristics of an SDI

  1. Applications built on an SDI can detect business events in realtime and respond dynamically by allocating additional resources in three key areas – compute, storage & network – based on the type of workloads being run.
  2. Using an SDI, application developers can seamlessly deploy apps while accessing higher level programming abstractions that allow for the rapid creation of business services (web, application, messaging, SOA/Microservices tiers), user interfaces and a whole host of application elements.
  3. From a management standpoint, business application workloads are dynamically and automatically assigned to the available infrastructure (spanning public & private cloud resources) on the basis of the application requirements and the required SLA, in a way that provides continuous optimization across the life cycle of technology.
  4. The SDI itself optimizes the entire application deployment through both externally provisioned APIs & internal interfaces between the five essential pieces – Application, Compute, Storage, Network & Management.
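
The third characteristic above, automatic assignment of workloads to public or private infrastructure based on requirements and SLA, can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the pool names, the availability and cost figures, and the "cheapest pool that satisfies the SLA" rule are hypothetical and do not describe any particular cloud management product.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """An infrastructure pool, e.g. a private datacenter or a public cloud."""
    name: str
    free_cpus: int
    availability: float   # advertised availability, e.g. 0.999
    cost_per_cpu: float   # relative cost units (hypothetical)


@dataclass
class Workload:
    name: str
    cpus: int
    min_availability: float  # the SLA this workload requires


def place(workload, pools):
    """Return the name of the cheapest pool meeting capacity and SLA, or None."""
    candidates = [
        p for p in pools
        if p.free_cpus >= workload.cpus
        and p.availability >= workload.min_availability
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda p: p.cost_per_cpu)
    best.free_cpus -= workload.cpus  # reserve the capacity
    return best.name


pools = [
    Pool("private-dc", free_cpus=8, availability=0.999, cost_per_cpu=1.0),
    Pool("public-cloud", free_cpus=64, availability=0.9999, cost_per_cpu=2.5),
]
for wl in [
    Workload("batch-report", cpus=4, min_availability=0.99),
    Workload("payments-api", cpus=4, min_availability=0.9999),
    Workload("analytics", cpus=8, min_availability=0.99),
]:
    print(wl.name, "->", place(wl, pools))
```

In this toy run the low-SLA batch job lands on the cheaper private pool, while the workload with the stricter SLA and the one that no longer fits privately go to the public pool, which is the kind of continuous, policy-driven placement the list describes.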

The SDI automates the technology lifecycle –

Consider the typical tasks needed to create and deploy enterprise applications. This list includes but is not limited to –

  • onboarding hardware infrastructure,
  • setting up complicated network connectivity to firewalls, routers, switches etc,
  • making the hardware stack available for consumption by applications,
  • figuring out storage requirements and provisioning them
  • guaranteeing multi-tenancy
  • application development
  • deployment,
  • monitoring
  • updates, failover & rollbacks
  • patching
  • security
  • compliance checking etc.

The promise of SDI is to automate all of this from a business, technology, developer & IT administrator standpoint.

SDI Reference Architecture –

The SDI encompasses SDC (Software Defined Compute), SDS (Software Defined Storage), SDN (Software Defined Networking), Software Defined Applications and Cloud Management Platforms (CMP) into one logical construct, as can be seen in the picture below.

                      Illustration: The different tiers of Software Defined Infrastructure

APIs are the core of the software defined approach. They control the lifecycle of resources (request, approval, provisioning, orchestration & billing) as well as the applications deployed on them. The SDI implies commodity hardware (x86) & a cloud based approach to architecting the datacenter.
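
One way to picture an API-controlled resource lifecycle is as a small state machine that moves a request through the stages named above. The Python sketch below is only an illustration under stated assumptions: the class names, the strictly linear ordering of stages and the example resource identifier are hypothetical and are not taken from any specific SDI product.

```python
from enum import Enum


class Stage(Enum):
    REQUESTED = "request"
    APPROVED = "approval"
    PROVISIONED = "provisioning"
    ORCHESTRATED = "orchestration"
    BILLED = "billing"


# Assumed linear ordering of the lifecycle; real platforms may branch
# (for example, a rejected approval) or loop (re-provisioning).
NEXT_STAGE = {
    Stage.REQUESTED: Stage.APPROVED,
    Stage.APPROVED: Stage.PROVISIONED,
    Stage.PROVISIONED: Stage.ORCHESTRATED,
    Stage.ORCHESTRATED: Stage.BILLED,
}


class ResourceRequest:
    """Tracks one resource (compute, storage or network) through its lifecycle."""

    def __init__(self, resource_id, kind):
        self.resource_id = resource_id
        self.kind = kind
        self.stage = Stage.REQUESTED
        self.history = [Stage.REQUESTED]

    def advance(self):
        """Move to the next stage, or raise if the lifecycle is complete."""
        nxt = NEXT_STAGE.get(self.stage)
        if nxt is None:
            raise ValueError(f"{self.resource_id} has no stage after {self.stage.value}")
        self.stage = nxt
        self.history.append(nxt)
        return nxt


req = ResourceRequest("vm-42", "compute")  # hypothetical resource id
while req.stage is not Stage.BILLED:
    req.advance()
print([s.value for s in req.history])
```

Keeping the allowed transitions in one table means every caller, whether a portal, a script or another service, goes through the same `advance` call, which is the sense in which the API, rather than a human operator, owns the lifecycle.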

The ten fundamental technology tenets of the SDI –

1. Highly elastic – The SDI can scale up or scale down the gamut of infrastructure (compute – VM/Baremetal/Containers, storage – SAN/NAS/DAS, network – switches/routers/firewalls etc) in near real time.

2. Highly Automated – Given the scale & multi-tenancy requirements, automation is required at all levels of the stack (development, deployment, monitoring and maintenance).

3. Low Cost – The SDI operates at a lower CapEx and OpEx compared to the traditional datacenter due to its reliance on open source technology & a high degree of automation. Further, workload consolidation helps increase hardware utilization.

4. Standardization – The SDI enforces standardization and homogenization of deployment runtimes, application stacks and development methodologies based on line-of-business requirements. This solves a significant IT challenge that has hobbled innovation at large financial institutions.

5. Microservice based applications – Applications developed for an SDI enabled infrastructure are developed as small, nimble processes that communicate via APIs and over infrastructure like messaging & service mediation components (e.g. Apache Kafka & Camel). This offers huge operational and development advantages over legacy applications. While one does not expect Core Banking applications to move over to a microservice model anytime soon, customer facing applications that need responsive digital UIs will definitely need to consider such approaches.

6. ‘Kind-of-Cloud’ Agnostic –  The SDI does not enforce the concept of private cloud, or rather it encompasses a range of deployment options – public, private and hybrid.

7. DevOps friendly – The SDI enforces not just standardization and homogenization of deployment runtimes, application stacks and development methodologies but also enables a culture of continuous collaboration among developers, operations teams and business stakeholders, i.e. cross-departmental innovation. The SDI is a natural container for workloads that are experimental in nature and can be updated, rolled back or rolled forward incrementally based on changing business requirements. The SDI enables rapid deployment capabilities across the stack, leading to faster time to market for business capabilities.

8. Data, Data & Data – The heart of any successful technology implementation is Data. This includes customer data, transaction data, reference data, risk data, compliance data etc. The SDI provides a variety of tools that enable applications to process data in batch, interactive or low-latency modes, depending on what the business requirements are.

9. Security –  The SDI shall provide robust perimeter defense as well as application level security with a strong focus on a Defense In Depth strategy.

10. Governance – The SDI enforces strong governance requirements for capabilities ranging from ITSM requirements (workload orchestration, business policy enabled deployment, autosizing of workloads) to change management, provisioning, billing, chargeback & application deployments.
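
The first tenet, elasticity in near real time, ultimately comes down to a rule that turns observed load into a desired amount of infrastructure. Here is a minimal Python sketch of one such rule, a proportional calculation clamped to a minimum and maximum. The function name, the percentage-based inputs and the default limits are assumptions chosen for illustration, and a real platform would add smoothing and cool-down periods to avoid flapping.

```python
def desired_instances(current, observed_pct, target_pct, min_n=1, max_n=20):
    """Return how many instances are needed to bring utilization to target.

    current      -- instances running now (must be >= 1)
    observed_pct -- average utilization across them, as a whole percentage
    target_pct   -- utilization we would like each instance to run at
    """
    if current < 1 or target_pct <= 0 or observed_pct < 0:
        raise ValueError("current must be >= 1, target_pct > 0, observed_pct >= 0")
    # Integer ceiling division avoids floating point rounding surprises.
    raw = (current * observed_pct + target_pct - 1) // target_pct
    # Clamp to the allowed range so we never scale to zero or without bound.
    return max(min_n, min(max_n, raw))


print(desired_instances(4, 90, 60))    # load above target: scale out
print(desired_instances(6, 20, 60))    # load below target: scale in
print(desired_instances(10, 300, 60))  # extreme load: capped at max_n
```

The same shape of rule applies whether the unit being scaled is a VM, a container or a storage volume. Only the metric and the limits change.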

The next blog in this series, the second, will cover the challenges in running massive scale applications.