We are in the middle of a series of blogs on Software Defined Datacenters (SDDC) @ http://www.vamsitalkstech.com/?p=1833. The key business imperative driving SDDC architectures is their ability to natively support digital applications. Digital applications are “Cloud Native” (CN) in the sense that they are written for cloud frameworks from the outset, instead of being ported over to the Cloud as an afterthought. Thus, Cloud Native application development is emerging as the most important trend in digital platforms. This blog post will define the seven key architectural characteristics of these CN applications.
What is driving the need for Cloud Native Architectures…
The previous post in the blog covered the monolithic architecture pattern. Monolithic architectures, which currently dominate the enterprise landscape, are coming under tremendous pressure in various ways and are increasingly perceived as brittle. Chief among these forces are massive user volumes, DevOps style development processes, the need to open up business functionality locked within applications to partners, and the heavy human effort required to deploy & manage monolithic architectures. Monolithic architectures also introduce technical debt into the datacenter, which makes it very difficult for the business lines to introduce changes as customer demands change – a key antipattern for digital deployments.
Applications that require a high release velocity, that present many complex moving parts, and that are worked on by few or many development teams are an ideal fit for the CN pattern.
Introducing Cloud Native Applications…
There is no single and universally accepted definition of a Cloud Native application. I would like to define a CN Application as “an application built using a combination of technology paradigms that are native to cloud computing – including distributed software development, a need to adopt DevOps practices, microservices architectures based on containers, API based integration between the layers of the application, software automation from infrastructure to code, and finally orchestration & management of the overall application infrastructure.”
Further, Cloud Native applications need to be architected, designed, developed, packaged, delivered and managed based on a deep understanding of the frameworks of cloud computing (IaaS and PaaS).
Characteristic #1 CN Applications dynamically adapt to & support massive scale…
The first & foremost characteristic of a CN Architecture is the ability to dynamically support massive numbers of users, large development organizations & highly distributed operations teams. This requirement is even more critical when one considers that cloud computing is inherently multi-tenant in nature.
Within this area, the following typical concerns need to be accommodated –
- the ability to grow the deployment footprint dynamically (Scale-up) as well as to decrease the footprint (Scale-down)
- the ability to gracefully handle failures across tiers that can disrupt application availability
- the ability to accommodate large development teams by ensuring that components themselves provide loose coupling
- the ability to work with virtually any kind of infrastructure (compute, storage and network) implementation
Characteristic #2 CN applications need to support a range of devices and user interfaces…
The User Experience (UX) is the most important part of a human facing application. This is particularly true of Digital applications which are omnichannel in nature. End users could not care less about the backend engineering of these applications as they are focused on an engaging user experience.
Accordingly, CN applications need to natively support mobile applications. This includes the ability to support a range of mobile backend capabilities – ranging from authentication & authorization services for mobile devices, location services, customer identification, push notifications, cloud messaging, toolkits for iOS and Android development etc.
Characteristic #3 They are automated to the fullest extent they can be…
The CN application needs to be abstracted completely from the underlying infrastructure stack. This is key because development teams can then focus solely on writing their software and do not need to worry about the maintenance of the underlying OS/Storage/Network. One of the key challenges with monolithic platforms (http://www.vamsitalkstech.com/?p=5617) is their inability to efficiently leverage the underlying infrastructure, as they have a high degree of dependency on it. Further, the lifecycle of infrastructure provisioning, configuration, deployment, and scaling is mostly manual, with lots of scripts and pockets of configuration management.
The CN application, on the other hand, has to require very little manual intervention given its scale. The provision-deploy-scale cycle is highly automated, with the application automatically scaling to meet demand and resource constraints and seamlessly recovering from failures. We discussed Kubernetes in one of the previous blogs.
Frameworks like these support CN Applications in providing resiliency, fault tolerance and in generally supporting very low downtime.
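To make “automatically scaling to meet demand” concrete, here is a minimal Python sketch of a proportional scaling rule. It is modeled on the formula documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)); the function name, the CPU-utilization metric and the min/max bounds are illustrative assumptions, and the real autoscaler adds refinements such as a tolerance band and stabilization windows that are omitted here.

```python
import math

def desired_replicas(current_replicas: int, current_util: float,
                     target_util: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling rule modeled on the Kubernetes HPA formula:
        desired = ceil(current_replicas * current_util / target_util)
    clamped to [min_replicas, max_replicas]. The HPA's tolerance band and
    stabilization windows are deliberately left out of this sketch."""
    if target_util <= 0:
        raise ValueError("target utilization must be positive")
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))

# Demand spike: 4 replicas averaging 90% CPU against a 50% target -> scale out.
print(desired_replicas(4, 90, 50))   # 8  (ceil of 7.2)
# Quiet period: 4 replicas averaging 10% CPU -> scale back in.
print(desired_replicas(4, 10, 50))   # 1  (ceil of 0.8)
```

The same rule drives both directions, which is what lets the footprint grow during spikes and shrink afterwards without an operator in the loop.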
Characteristic #4 They support Continuous Integration and Continuous Delivery…
The reduction of the vast amount of manual effort witnessed in monolithic applications is not just confined to their deployment as far as CN applications are concerned. From a CN development standpoint, the ability to quickly test and perform quality control on daily software updates is an important aspect. CN applications automate the application development and deployment processes using the paradigms of CI/CD (Continuous Integration and Continuous Delivery).
The goal of CI is that every time source code is added or modified, the build process kicks off & the tests are run immediately. This helps catch errors faster and improves the quality of the application. Once the CI process is done, the CD process combines the application with suitable configuration and builds it into an artifact suitable for deployment. It then deploys the artifact onto the execution environment with the appropriate versioning identifiers, in a manner that supports rollback. CD ensures that tested artifacts are deployed promptly and validated with acceptance testing.
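The CI/CD flow described above can be sketched as a toy pipeline. Everything here is hypothetical: `Environment`, `ci`, `cd` and `pipeline` are illustrative names, the tests are plain Python callables, and the artifact version is simply a content hash of code plus configuration. The point is the shape of the flow: every change runs the tests, passing changes become versioned artifacts, and a failed acceptance check rolls the environment back to the previous version.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Environment:
    """Hypothetical execution environment that remembers every deployed
    artifact version so that a bad release can be rolled back."""
    history: List[str] = field(default_factory=list)

    @property
    def live(self):
        return self.history[-1] if self.history else None

    def deploy(self, artifact: str) -> None:
        self.history.append(artifact)

    def rollback(self) -> str:
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()          # discard the failed release
        return self.live

def ci(source: str, tests: List[Callable[[str], bool]]) -> bool:
    """CI: every change to the source immediately runs the whole test suite."""
    return all(test(source) for test in tests)

def cd(source: str, config: str, env: Environment) -> str:
    """CD: combine code and configuration into a versioned artifact, then deploy it."""
    version = hashlib.sha256((source + "\n" + config).encode()).hexdigest()[:12]
    artifact = f"app-{version}"
    env.deploy(artifact)
    return artifact

def pipeline(source, config, tests, env, acceptance) -> str:
    if not ci(source, tests):
        return "rejected by CI"
    artifact = cd(source, config, env)
    if not acceptance(artifact):
        return f"rolled back to {env.rollback()}"
    return f"deployed {artifact}"

if __name__ == "__main__":
    env = Environment()
    unit_tests = [lambda src: "bug" not in src]
    print(pipeline("v1", "replicas=2", unit_tests, env, lambda a: True))
    print(pipeline("v2 bug", "replicas=2", unit_tests, env, lambda a: True))
    print(pipeline("v3", "replicas=2", unit_tests, env, lambda a: False))
```

Keeping the deployment history in the environment is what makes rollback a cheap, mechanical operation rather than a manual rebuild.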
Characteristic #5 They support multiple datastore paradigms…
The RDBMS has been a fixture of the monolithic application architecture. CN applications, however, need to work with loosely structured data as well as regularly structured data. This implies the need to support data streams that are not just high speed but also better suited to NoSQL/Hadoop storage. These systems provide Schema on Read (SOR), an innovative data handling technique. In this model, a format or schema is applied to data as it is accessed from a storage location, as opposed to while it is ingested. As we will see later in the blog, individual microservices can have their own local data storage.
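A minimal Python sketch of Schema on Read follows. Raw JSON events are stored exactly as they arrive, and a schema (here a hypothetical mapping of field names to conversion functions) is applied only when the data is read; a second reader can apply a different schema to the same raw data. Real SOR engines such as Hive do far more (partitioning, file formats, type coercion rules), so this only illustrates the principle.

```python
import json
from datetime import date

# Raw events are kept exactly as ingested: no schema is enforced on write,
# so records may have string or numeric amounts, extra fields, or missing fields.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "ts": "2017-05-01"}',
    '{"user": "u2", "amount": 5, "ts": "2017-05-02", "channel": "mobile"}',
    '{"user": "u3"}',
]

# Hypothetical schema: field name -> conversion applied at read time.
PAYMENTS_SCHEMA = {"user": str, "amount": float, "ts": date.fromisoformat}

def read_with_schema(lines, schema):
    """Parse each stored record and project it onto `schema` while reading.
    Fields absent from a record become None; fields not in the schema are ignored."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: (convert(record[name]) if record.get(name) is not None else None)
            for name, convert in schema.items()
        }

payments = list(read_with_schema(RAW_EVENTS, PAYMENTS_SCHEMA))
# A different consumer can read the same raw data with its own, narrower schema.
users = [row["user"] for row in read_with_schema(RAW_EVENTS, {"user": str})]
```

Because nothing was rejected at ingest, the record missing `amount` and the record with an extra `channel` field are both still available to any future reader.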
Characteristic #6 They support APIs as a key feature…
APIs have become the de facto model that gives developers and administrators the ability to assemble Digital applications, such as microservices, out of complex componentry. Thus, there is a strong case to be made for adopting an API centric strategy when developing CN applications. CN applications use APIs in multiple ways. Firstly, APIs are the way to interface loosely coupled microservices (which abstract out the internals of the underlying application components). Secondly, developers use well-defined APIs to interact with the overall cloud infrastructure services. Finally, APIs enable the provisioning, deployment, and management of platform services.
Characteristic #7 Software Architecture based on microservices…
As James Lewis and Martin Fowler define it – “..the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.” 
Microservices are a natural evolution of the Service Oriented Architecture (SOA). The application is decomposed into loosely coupled business functions, which are mapped to microservices. Each microservice is built for a specific granular business function and can be worked on by an independent developer or team. As such, it is a separate code artifact and is thus loosely coupled not just from a communication standpoint (typically communicating via a RESTful API, with data passed around in a JSON/XML representation) but also from a build, deployment, upgrade and maintenance process perspective. Each microservice can optionally have its own localized datastore. An important advantage of this approach is that each microservice can be created using a different technology stack from the other parts of the application. Docker containers are a natural choice for running these microservices. Microservices confer a range of advantages, including easier builds, independent deployment and independent scaling.
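To illustrate a small service built around one business capability, with its own local datastore and a lightweight HTTP/JSON API, here is a minimal sketch using only Python's standard library (`http.server`). The “accounts” service, its route and its data are hypothetical; a production microservice would typically use a proper web framework, run in its own container, and sit behind service discovery.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener

# The service's own local datastore; other services never touch it directly.
ACCOUNTS = {"42": {"id": "42", "balance": 100.0}}

# Client opener that talks to the local service directly, bypassing any proxy.
_OPENER = build_opener(ProxyHandler({}))

class AccountsHandler(BaseHTTPRequestHandler):
    """Hypothetical 'accounts' microservice exposing GET /accounts/<id> as JSON."""

    def do_GET(self):
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "accounts" and parts[1] in ACCOUNTS:
            status, body = 200, ACCOUNTS[parts[1]]
        else:
            status, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):    # silence per-request logging in the demo
        pass

def start_service(port: int = 0) -> HTTPServer:
    """Run the service in a background thread; port 0 picks any free port."""
    server = HTTPServer(("127.0.0.1", port), AccountsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def get_account(base_url: str, account_id: str) -> dict:
    """A consuming service talks to the API, never to the datastore."""
    try:
        with _OPENER.open(f"{base_url}/accounts/{account_id}") as resp:
            return json.loads(resp.read())
    except HTTPError as err:
        return {"status": err.code}

if __name__ == "__main__":
    server = start_service()
    base = f"http://127.0.0.1:{server.server_port}"
    print(get_account(base, "42"))   # the stored record as JSON
    print(get_account(base, "7"))    # {'status': 404}
    server.shutdown()
    server.server_close()
```

Since callers only see the JSON contract, the service could be rewritten in another language or switch datastores without any change on the consumer side.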
A Note on Security…
It goes without saying that security is a critical part of CN applications and needs to be considered and designed for as a cross-cutting concern from inception. Security concerns impact the design & lifecycle of CN applications, ranging from deployment to updates to image portability across environments. A range of technology choices is available to cover various areas, such as application level security using Role-Based Access Control, Multifactor Authentication (MFA), and A&A (Authentication & Authorization) using protocols such as OAuth, OpenID, SSO etc. Container security is fundamental to this topic, and many vendors are working on ensuring that once the application is built as part of a CI/CD process as described above, it is packaged into labeled (and signed) containers which can be made part of a verified and trusted registry. This ensures that container image provenance is well understood, and it protects any users who download the containers for use across their environments.
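Image provenance ultimately rests on content addressing: an image's content is identified by a cryptographic digest, and a consumer checks that what it pulled matches what the trusted pipeline recorded. The Python sketch below shows only that integrity check, using the `sha256:<hex>` digest notation that container registries use; the in-memory `TRUSTED` record and the `publish`/`verify` helpers are hypothetical. Real image signing (for example Docker Content Trust) additionally uses public-key signatures, which this sketch does not model.

```python
import hashlib
import hmac

def digest(blob: bytes) -> str:
    """Content-addressable digest in the 'sha256:<hex>' notation."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()

# Hypothetical record written by the CI/CD pipeline when it publishes an image
# to the trusted registry: image reference -> expected digest.
TRUSTED = {}

def publish(reference: str, blob: bytes) -> str:
    TRUSTED[reference] = digest(blob)
    return TRUSTED[reference]

def verify(reference: str, blob: bytes) -> bool:
    """Accept a downloaded image only if its digest matches the trusted record."""
    expected = TRUSTED.get(reference)
    if expected is None:
        return False                      # unknown provenance: reject
    return hmac.compare_digest(expected, digest(blob))
```

Any change to the image bytes, however small, produces a different digest, so tampering between the registry and the host is detected.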
In this post, we have tried to look at some architecture drivers for Cloud-Native applications. It is a given that organizations moving from monolithic applications will need to take nimble, small steps to realize the ultimate vision of business agility and technology autonomy. The next post will look at some of the critical foundational investments enterprises will have to make before choosing the Cloud Native route as a viable choice for their applications.
 Martin Fowler – https://martinfowler.com/intro.html
The third and previous blog in this seven part series (@ http://www.vamsitalkstech.com/?p=4659) discussed Apache Mesos, a project that aims to abstract away various system resources – CPU, memory, network and disk resources to provide consuming digital applications with a giant cluster from which they can utilize capacity – a key requirement of the Software Defined Datacenter (SDDC). In this fourth blog, we will discuss another important ecosystem technology & project – Linux Containers and Docker – which forms the foundational runtime component in the SDDC. The next blog will discuss Kubernetes – Google’s container orchestration platform.
We can agree that the Digital application is inherently a distributed application. Such applications have historically been extremely hard to develop, set up and manage across a large fleet of data center servers that are a mix of platforms and technologies. Thus it is no surprise that one of the most disruptive developments of the last five years has been the innovation in the Linux container space. Containers now make it possible to run distributed applications at scale.
For business reasons, Digital applications demand constant updates, changes and incremental revisions in response to changing customer needs. The Software Defined Datacenter (SDDC) thus needs a runtime paradigm that enables not just efficient hardware usage but also standardized application environments that are portable, simplified and consistent across hybrid clouds and hypervisors. Containers fill this need and are thus emerging as the natural unit of deployment across the SDDC. Much has been written on the topic of Docker and Linux Container technology. My goal for this blog post is to distill key insights in the container ecosystem.
The Technologies of Linux Containers & Docker
Unlike Virtual Machines, Container Engines such as Docker share a common OS (Image Credit – MSFT Azure)
Linux Containers are similar to, and yet different from, virtual machines. They are similar in the sense that each container shares system resources on the underlying hardware platform (CPU, RAM, and Network), as VMs do. However, while each VM maintains its own separate copy of the Operating System (OS), containers share the same OS kernel while keeping themselves isolated from other containers running on the same OS. How do they do that?
Though the terms ‘Docker’ and ‘Container’ have become almost synonymous, it needs to be noted that Docker is a company focused on developing technology enablement around containers in areas such as orchestration, networking, and management. Docker was also an open source project (now renamed Moby) that provided capabilities such as a standard description of container formats, and utilities for application packaging, deployment & lifecycle management of applications inside Linux Containers. It provides the Docker CLI, a command line tool for the lifecycle management of image-based containers.
Prior to the explosion of interest in Linux containers & the founding of Docker, mainstream Linux distributions (with a minimum kernel level of 3.8) already supported two foundational paradigms: control groups (cgroups) and kernel namespaces. Linux containers use both these features to achieve their goals of isolation and portability. Cgroups enable the host to limit the resources each container process can consume, such as CPU, memory and block I/O. This ensures that containers running on a host cannot starve others of resources, thus avoiding the “Noisy Neighbor” problem that bedeviled a lot of cloud deployments.
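On a host using cgroup v2 (the successor to the cgroup v1 interface that early container runtimes relied on), the limits described above show up as small text files, such as `cpu.max` and `memory.max`, in the container's cgroup directory. The Python sketch below interprets their contents; it works on strings so it runs anywhere, and the helper names are hypothetical. `cpu.max` holds `<quota> <period>` in microseconds (or `max` for no quota), so a value of `50000 100000` caps the group at half a CPU.

```python
from typing import Optional

def cpu_limit(cpu_max: str) -> Optional[float]:
    """Interpret cgroup v2 'cpu.max' ('<quota> <period>', microseconds).
    Returns the number of CPUs the group may use, or None when unlimited."""
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) / int(period)

def memory_limit(memory_max: str) -> Optional[int]:
    """Interpret cgroup v2 'memory.max' (bytes, or 'max' for no limit)."""
    value = memory_max.strip()
    return None if value == "max" else int(value)

def read_limits(cgroup_dir: str):
    """Read the limits of one cgroup v2 directory on a Linux host, e.g. a
    container's cgroup (hypothetical helper; the path depends on the host)."""
    with open(f"{cgroup_dir}/cpu.max") as f:
        cpus = cpu_limit(f.read())
    with open(f"{cgroup_dir}/memory.max") as f:
        mem = memory_limit(f.read())
    return cpus, mem
```

The kernel enforces these values for every process in the group, which is what keeps one busy container from monopolizing the host.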
Kernel namespaces provide another kind of isolation, for process interactions within the OS. Containers can only view and modify resources in their own namespace. This provides a security mechanism whereby other containers and processes on the host cannot launch attacks on a given application running in a tenant container or on the host itself. Thus the combination of these two technologies ensures that multiple applications running within their individual containers can share CPU and Memory without the overhead of virtualization. Docker also gives each container its own network stack (via a network namespace), thus ensuring that resources such as sockets and interfaces are isolated as well.
Companies including Red Hat, IBM, Google, Cisco, VMware, and CoreOS have greatly aided the development and accessibility of containers in their platforms and products.
As discussed, container images are immutable. This is a key advantage of using container technology such as Docker and is made possible by the notion of a Union filesystem. What are Union filesystems and how do they enforce immutability? Much like a Virtual Machine, a container runs from an image, which is typically a snapshot of a filesystem. Container images, however, tend to be much smaller than VM images, since the container shares the host kernel and its image does not need to include one.
Union filesystems are best described as a layered architecture, in that each layer is created independently and then added on top of the previous layer. An example of such a layered image is a base OS layer (its userland files, not the kernel), then a database like Oracle, then Tomcat, and a web application on top. The top layer is always the writable layer. The real advantage of a union filesystem is that images become very efficient from a storage and execution standpoint. Union filesystems also allow portions of the OS to be shared across containers. Simply put, an image contains everything an application needs, including its dependencies and external libraries. When an image is run, it is called a container. Docker originally used a layered copy-on-write filesystem called AUFS (Another Union Filesystem); current Docker releases default to the overlay2 storage driver.
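The layering and copy-on-write behavior can be modeled in a few lines of Python with `collections.ChainMap`, whose lookups search a list of mappings in order while writes go only to the first mapping. In this sketch (file paths and contents are hypothetical) each image layer is a read-only dict shared by every container, and each container gets its own empty writable layer on top. Deletions, which real union filesystems handle with “whiteout” entries, are not modeled.

```python
from collections import ChainMap

# Read-only image layers (path -> file contents), listed bottom to top.
BASE_OS = {"/etc/os-release": "ID=debian", "/bin/sh": "<shell binary>"}
APP_SERVER = {"/opt/tomcat/bin/catalina.sh": "<tomcat script>"}
WEB_APP = {"/opt/tomcat/webapps/shop.war": "<war v1>"}
IMAGE = (BASE_OS, APP_SERVER, WEB_APP)

def run_container(image_layers):
    """Create a container view: a fresh writable layer above the shared image
    layers. ChainMap searches maps left to right, so the topmost layer that
    contains a path wins, and writes land only in the writable layer."""
    writable = {}
    return ChainMap(writable, *reversed(image_layers))

c1 = run_container(IMAGE)
c2 = run_container(IMAGE)
c1["/etc/os-release"] = "ID=debian (patched)"   # copy-on-write: only c1 sees this
```

Both containers reference the very same layer objects, which is the storage saving the paragraph describes; only the small writable layer is per-container.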
Containers and Developers..
Containers are possibly the first infrastructure software category created with developers in mind. The prominence of Linux Containers and Docker has coincided with the onset of agile development practices under the DevOps umbrella – CI/CD etc. Containers are an excellent choice to create agile delivery pipelines and continuous deployment. At their core, Containers enable the creation of multiple self-contained execution environments over the same operating system.
Developers are naturally excited about Linux Containers for five specific reasons –
- Containers allow for image consistency across OS environments. This is a huge help in accelerating the development process from development to debugging to production. Developers can just focus on building their applications (in dev environments that match the test and prod) and packaging them in containers. This just takes a lot of the inefficiency around environment dissimilarities out of the equation.
- Containers are treated as standard Linux processes by the kernel and thus start up orders of magnitude faster than VMs. This means that developers can start their applications in a matter of seconds as long as they run them in a container.
- Containers also give development organizations the ability to standardize application development workflows and update processes. This helps solve the scaling problems that digital applications have created for large organizations.
- Digital applications are leading the move to adopt microservices. Microservices offer a way to build applications as a collection of discrete services as opposed to a monolithic architecture. By their very nature, microservices can be built and managed by different teams. Containerization affords a lightweight way of building and deploying microservices.
- Containers offer a portable way of delivering applications (across Operating Systems) as well as provide horizontal scalability.
Digital Application development using Containers..
There are a few key runtime components involved in operationalizing a small, medium or large scale container infrastructure, as the above illustration depicts.
- Firstly, developers create container images. These images describe an application and its dependencies. An easy way to conceptualize an image is to think of it as a basic deployment template. Images are also immutable, in that they are read-only and any changes happen in the topmost layer, which is writable. Modifying an image means creating a new one; images thus have a parent-child relationship. Developers create images by building their applications in their developer environments, performing unit tests and then pushing to a repository. A large part of this process is usually best automated using CI/CD tools like Jenkins, CruiseControl or Buildbot. Once the container is built with the necessary dependencies, these tools run a battery of tests to validate business functionality.
- The built images are then made available in a Container Registry, which is either maintained internally or sourced from a trusted external source. As the name suggests, registries maintain a catalog of container images of frequently used software, e.g. custom applications and other software packages such as WordPress, relational databases, web servers, Big Data technologies and application servers.
- The next step is to create and deploy (runtime) containers from these images on a set of servers. Once images are released as a result of application development, sys admins work on provisioning the servers that will run these images. Once a container engine is installed on a server, images are loaded onto it and take the runtime shape of containers. Images are delivered to these servers via either a push or a pull mechanism.
- Scheduling of containers on servers is also a process that is usually done by sys admins. This involves running containers of certain kinds on servers that match certain CPU, I/O and network capacity requirements.
- To create complex real world deployments, not only do the servers and networking have to be set up, but the containers also have to be interconnected (e.g. a web server container to an application server) using discovery mechanisms. These containers then also need to connect to a host of enterprise services. Customer traffic is then routed to the clustered containers running on these servers. Finally, the logs and performance of these containers, and of the microservices running in them, are monitored.
- The process then repeats from the first step above.
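The scheduling step (placing containers on servers that match their CPU, I/O and network needs) is, at its simplest, a bin-packing problem. The Python sketch below uses a first-fit heuristic; the `Server` and `ContainerSpec` types and the capacity figures are hypothetical, and real orchestrators use much richer filtering and scoring.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Server:
    name: str
    cpu: float          # free CPU cores
    io_mbps: float      # free disk I/O bandwidth
    net_mbps: float     # free network bandwidth
    containers: List[str] = field(default_factory=list)

@dataclass
class ContainerSpec:
    name: str
    cpu: float
    io_mbps: float
    net_mbps: float

def fits(spec: ContainerSpec, server: Server) -> bool:
    return (spec.cpu <= server.cpu and spec.io_mbps <= server.io_mbps
            and spec.net_mbps <= server.net_mbps)

def schedule(specs: List[ContainerSpec], servers: List[Server]) -> Dict[str, Optional[str]]:
    """First-fit placement: put each container on the first server with enough
    free CPU, I/O and network capacity, then reserve that capacity."""
    placement: Dict[str, Optional[str]] = {}
    for spec in specs:
        target = next((s for s in servers if fits(spec, s)), None)
        if target is None:
            placement[spec.name] = None      # unschedulable: no server has room
            continue
        target.cpu -= spec.cpu
        target.io_mbps -= spec.io_mbps
        target.net_mbps -= spec.net_mbps
        target.containers.append(spec.name)
        placement[spec.name] = target.name
    return placement
```

First-fit is easy to reason about but can fragment capacity; that limitation is one reason orchestrators score candidate servers instead of simply taking the first match.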
Industry Adoption of Containers.
In a few years, containers will deliver the bulk of compute workloads across public cloud providers such as Amazon AWS, Google Compute Engine and Microsoft Azure. Given that the VM options on these clouds can run multiple containers which can scale on demand, the industry will begin to gravitate to higher utilization density. The SDDC has already begun incorporating hybrid architectures, in which applications run on both Linux Containers and Virtual Machines in a complementary fashion.
Customers also have a choice of traditional enterprise operating systems such as Red Hat Enterprise Linux or Microsoft Windows, or can run containers on OSes developed specifically for hosting containers at hyperscale. These OSes provide only the tools needed to manage containers and little else. Examples include Red Hat Atomic Platform and CoreOS. Moving up the stack, pioneers such as Google and Red Hat have added core support for containers in projects such as OpenStack, Kubernetes, Mesos, OpenShift & CloudFoundry by helping with networking and persistent storage. Kubernetes (which we will cover in the next post) also handles provisioning on multiple public cloud platforms. Configuration management platforms such as Ansible, Chef and Puppet now support containerized deployments.
Technical Considerations for Container Adoption
Some key considerations that industry players are tackling from the standpoint of running containers at scale –
- Container Orchestration – organizing groups of containers into composable applications, scheduling them on servers that match their resource requirements, placing containers based on network topology etc.
- Container Networking – Containers follow a pluggable model and the network is no different. Key considerations – an enterprise network connectivity stack is needed to not only provide the interconnect between different containers but also to integrate them with existing Layer 2/3 networks. Additionally, network isolation needs to be provided for microservices running on these containers using either a dedicated IP address for each or an overlay network.
- Management and Monitoring – lifecycle processes spanning management and monitoring raise a range of questions: application patching with low downtime, graceful failures in cloud native applications, container scale-up & scale-down based on traffic patterns etc.
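As a sketch of the “dedicated IP address for each” container option mentioned under Container Networking, here is a toy IP address manager built on Python's standard `ipaddress` module. The class name and the convention of reserving the first host address for a gateway are assumptions for illustration; overlay networks and integration with existing Layer 2/3 networks are not modeled.

```python
import ipaddress
from collections import deque

class IpPool:
    """Hypothetical per-host IP address manager: gives every container its own
    address from one subnet and returns the address to the pool on release."""

    def __init__(self, cidr: str):
        self.network = ipaddress.ip_network(cidr)
        hosts = list(self.network.hosts())      # excludes network/broadcast addresses
        self.gateway = hosts[0]                 # assumption: first host is the bridge gateway
        self.free = deque(hosts[1:])
        self.assigned = {}

    def allocate(self, container: str):
        if container in self.assigned:          # idempotent: same container, same IP
            return self.assigned[container]
        if not self.free:
            raise RuntimeError(f"subnet {self.network} is exhausted")
        ip = self.free.popleft()
        self.assigned[container] = ip
        return ip

    def release(self, container: str) -> None:
        self.free.append(self.assigned.pop(container))
```

Released addresses go to the back of the queue, so a just-freed IP is not immediately handed to a different container while stale connections may still point at it.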
Containers and your Enterprise…
So what is the best way to adopt containers across a large enterprise?
- Develop your container strategy in the context of the Nexus of Forces (i.e., information, mobile, social and cloud) initiatives in your organization — Containers are at the junction of these technologies.
- Institute an organizational process to examine the business value of any initiative to adopt Containers. Understand what tools and platforms to adopt that will abstract away the complexities of using containers.
- Understand the skills required to leverage containers. Containers are a new way of working for both developers and SysOps. Dependency management moves to the developers, but they realize tremendous benefits in adopting containers for high-velocity Digital applications.
- Identify, measure and benchmark key success metrics that capture the ROI of the overall container investment.
To sum up, the Linux (and Windows) container space is exploding from both a mindshare and an adoption standpoint. What is hugely encouraging is that a host of next generation platform technologies (ranging from IaaS to PaaS) are not just choosing to support containers as their basic runtime unit but are also focusing on becoming the de facto solution for a host of container ecosystem use cases – provisioning, orchestration, management, CI/CD et al. The next two blogs will respectively discuss how Google Kubernetes and Red Hat OpenShift overcome these challenges and abstract away much of the complexity around container deployments.
The next blog post in this series will discuss Google Kubernetes, the dominant project in the container orchestration space.
 Introducing Moby Project – https://blog.docker.com/2017/04/introducing-the-moby-project/
The second and previous blog in this six part series (@ http://www.vamsitalkstech.com/?p=4670) discussed technical challenges with running large scale Digital Applications on traditional datacenter architectures. In this third blog, we will deep dive into another important ecosystem platform – Apache Mesos, a project that aims to abstract away various system resources – CPU, memory, network and disk resources to provide consuming digital applications with a giant cluster from which they can utilize capacity – a key requirement of the Software Defined Datacenter (SDDC). The next blogpost will deep dive into Linux Containers & Docker.
Introduction and the need for Apache Mesos..
This blog has from time to time discussed how Digital applications are a diverse blend of several different and broad technology paradigms – Big Data, Intelligent Middleware, Messaging, Business Process Management, Data Science et al.
To that end, almost every Enterprise Datacenter supporting Digital workloads typically has clusters of diverse applications installed. Most traditional datacenters have used either physical or virtual machines (VMs) as the primary runtime unit to run such applications. These VMs are typically provisioned based on application requests and have applications deployed onto them. These VMs are then formed into logical clusters, which are essentially a series of machines serving a given business application in an n-tier architecture.
As load increases on these servers, more VMs are provisioned into the cluster and so on. The challenge with this traditional model is that it is fairly static, in the sense that machines are preallocated to run certain kinds of workloads (databases, webservers, developer servers etc). The challenge with Digital and Cloud Native applications is that scaling needs to happen dynamically, and these applications treat the infrastructure as effectively infinite. These applications present various challenges and headaches that call for the Datacenter to be software defined, as we discussed in the last blog. We will continue our look at the SDDC by considering one of the important projects in this landscape – Apache Mesos.
Apache Mesos is a project that was developed at the University of California at Berkeley circa 2009. While it was initially created to solve the challenge of provisioning and scaling Spark clusters, the Mesos project evolved to become a centralized cluster manager. The central idea of Mesos is to pool together all the physical resources of the cluster and make them available as a single reservoir of highly available resources for different applications (or frameworks) to consume. Over time, Mesos has begun supporting complex n-tier application platforms that leverage capabilities such as Hadoop, Middleware, Jenkins, Kafka, Spark, Machine Learning etc.
As with almost all innovative Cloud & Big Data projects, the adoption of Apache Mesos has primarily been in the web scale arena. Prominent users include highly technical engineering shops such as Twitter, Netflix, Airbnb, Uber, eBay, Yelp and Apple. However, there seems to be early adopter activity with increased acceptance in the Fortune 100. For instance, Verizon signed on in 2015 to use Mesosphere DC/OS (based on Apache Mesos) for datacenter orchestration.
The Many Definitions of Mesos..
At its simplest, Mesos is an Open Source Cluster Manager. What does that mean? Mesos can be described as a cluster manager because it ensures that datacenter hardware resources are managed and advantageously shared among multiple distributed technologies – Big Data, Message Oriented Middleware, Application Servers, Mobile apps etc. Mesos also enables applications to scale with a high degree of resiliency, without having to bother about details of the underlying infrastructure.
The model of resource allocation followed by Mesos allows a range of constituents (sys-admins, developers & DevOps teams) to request resources (CPU, RAM, Storage) from a shared pool, much as they would from a cloud provider.
Mesos has alternatively been described as a Datacenter Kernel as it provides a single unified view of node resources to software frameworks that wish to consume them via APIs. Mesos performs the role of an Intelligent global level scheduler that can match a massive pool of hardware resources to distributed applications that want to consume these resources. Mesos aggregates all the resources into a large virtual pool using not just virtual machines and containers but primitives such as CPU, I/O and RAM. It also breaks applications into small units that can be assigned across this pool. Mesos also provides APIs in multiple languages to allow applications to be built for it. Apache Spark, the most popular data processing engine, was built originally as a Mesos framework.
It is also called a Data Center Operating System (DCOS) as it performs a similar role to the operating system. Any application that can run on Linux runs on Mesos.
To illustrate how Mesos works, consider two clusters in a datacenter – Cluster A and Cluster B. Cluster A has 8 nodes, with each node/server possessing 4 CPUs and 64 GB RAM; Cluster B has 5 nodes, with each node/server having 4 CPUs and 64 GB RAM. Mesos can essentially combine both these clusters into one virtual cluster of 52 CPUs (8 × 4 + 5 × 4) and 832 GB RAM (8 × 64 + 5 × 64). The advantage of this approach is that cluster usage is greatly improved because applications share resources much more efficiently.
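The pooled figure is simply a sum of per-node capacity over both clusters. The small Python sketch below reproduces it; the `Cluster` type and `pool` function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Cluster:
    name: str
    nodes: int
    cpus_per_node: int
    ram_gb_per_node: int

def pool(*clusters: Cluster) -> Tuple[int, int]:
    """Aggregate every node's capacity into one virtual pool: (CPUs, GB RAM)."""
    cpus = sum(c.nodes * c.cpus_per_node for c in clusters)
    ram_gb = sum(c.nodes * c.ram_gb_per_node for c in clusters)
    return cpus, ram_gb

print(pool(Cluster("A", 8, 4, 64), Cluster("B", 5, 4, 64)))   # (52, 832)
```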
Mesos and Cloud Native Applications..
We discussed the differences between Cloud Native and legacy applications in the previous post @ http://www.vamsitalkstech.com/?p=4670. Mesos has been impactful when running stateless Cloud Native applications, as opposed to traditional applications built on a stateful/vertical scaling paradigm. While the defining features of Cloud Native applications are worthy of a dedicated blogpost, these applications can scale to handle massive & increasing amounts of load while tolerating failures without impacting service. These applications are also intrinsically distributed in nature and are typically composed of loosely coupled microservices. Examples include stateless web applications running on a Platform as a Service (PaaS), CI/CD applications working on Jenkins, and NoSQL databases like HBase, Cassandra, Couchbase and MongoDB. Stateful applications that persist data to disk using an RDBMS aren’t good workloads for Mesos as yet.
When Cloud Native Digital applications are run on Mesos, several of the headaches encountered in running these on legacy datacenters are ameliorated, namely –
- Clusters can be dynamically provisioned by Mesos based on demand spikes
- Location independence for microservices
- Fault tolerance
As it matures, Mesos has also begun supporting multi-datacenter deployments, with web-scale shops like Uber running Cassandra as a framework across datacenters at scale. In Uber’s case, each datacenter has its own Mesos cluster with independent frameworks that exchange information periodically. Cassandra uses a seed node to bootstrap the gossip process for new nodes joining the cluster. Uber created a custom seed provider for launching Cassandra nodes, which allows new nodes to be rolled out automatically into the Mesos cluster in each datacenter. (Credit – Abhishek Verma – Uber)
There are three main architectural primitives in Mesos: Master, Slave and Framework. The central orchestrator in a Mesos system is called the Master, and the worker processes are called Slaves (referred to as Agents in newer Mesos releases and documentation).
As depicted below, the Master process manages the overall cluster and delegates tasks to the slaves based on the resources requested by Frameworks.
The same core Mesos process is installed on all nodes, and each node’s personality (Master or Slave) is assigned at runtime. The Slaves run the application workloads requested by the frameworks. Together, the Master and Slave daemons make up a Mesos cluster.
Frameworks, commonly called Mesos applications, are composed of two main components: a scheduler, which registers with the Master to receive resource offers, and executors, which launch workloads (tasks) on the slaves. A resource offer is a simple list of a slave’s available capacity, such as CPU and memory. The slaves report their available resources to the Master, which then presents them as offers to the frameworks. A task can be almost anything: a simple script or command, a MapReduce job, or the startup of a Jetty, Tomcat or JBoss AS server.
The Mesos executor is a process (a program or command) on the Slave that runs tasks. No matter which isolation module is used, the executor packages all the allocated resources and runs the task on the slave node. When the task is complete, its container is destroyed and the Slave’s resources are released back to the Master.
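To make the scheduler/executor split concrete, here is a toy Python model of a framework. It is a sketch under simplifying assumptions, not the real Mesos scheduler API: `ToyScheduler.resource_offers` only loosely mirrors the role of the `resourceOffers` callback in Mesos’s scheduler interface, and `Offer`, `TaskSpec` and `run_tasks` are hypothetical names. An offer that cannot host any pending task is simply declined.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Spare capacity on one slave/agent, as presented to a framework."""
    agent_id: str
    cpus: float
    mem_gb: float


@dataclass
class TaskSpec:
    name: str
    cpus: float
    mem_gb: float
    command: str  # what the executor would run (hypothetical)


@dataclass
class ToyScheduler:
    """Framework-side scheduler: decides what to launch on each offer."""
    pending: list[TaskSpec] = field(default_factory=list)

    def resource_offers(self, offers: list[Offer]) -> dict[str, list[TaskSpec]]:
        launches: dict[str, list[TaskSpec]] = {}
        for offer in offers:
            cpus, mem_gb = offer.cpus, offer.mem_gb
            accepted = []
            for task in list(self.pending):  # iterate over a copy while removing
                if task.cpus <= cpus and task.mem_gb <= mem_gb:
                    accepted.append(task)
                    self.pending.remove(task)
                    cpus -= task.cpus
                    mem_gb -= task.mem_gb
            if accepted:
                launches[offer.agent_id] = accepted
            # Offers with nothing accepted are declined, so the Master
            # can offer those resources to other frameworks.
        return launches


def run_tasks(tasks: list[TaskSpec]) -> tuple[float, float]:
    """Toy executor: 'runs' each task, then reports the resources freed."""
    for task in tasks:
        print(f"running {task.name}: {task.command}")  # stand-in for real work
    # When the tasks finish, their containers are torn down and the
    # resources are released back to the Master.
    return sum(t.cpus for t in tasks), sum(t.mem_gb for t in tasks)


scheduler = ToyScheduler(pending=[
    TaskSpec("web", cpus=1, mem_gb=1, command="python -m http.server"),
    TaskSpec("etl", cpus=2, mem_gb=4, command="./run_etl.sh"),
])
offers = [Offer("agent-1", cpus=1, mem_gb=0.5), Offer("agent-2", cpus=4, mem_gb=8)]
launches = scheduler.resource_offers(offers)
released = run_tasks(launches["agent-2"])
```

Here the offer from `agent-1` is too small for either task and is declined, while both tasks land on `agent-2`.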
For Master HA, you can run multiple masters, with only one active at any given time communicating with the slave nodes. If the active Master fails, Apache ZooKeeper manages leader election to a standby Master, as depicted. A quorum needs a minimum of 3 Master nodes, but 5 Master nodes are recommended for most production deployments. Once a new Master is elected, the slaves and frameworks re-register with it and resubmit their cluster and framework information so that the state before the failure can be reconstructed. Mesos has elaborate recovery processes for the frameworks, the schedulers and the Slave nodes.
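The reason for 3 or 5 masters, and for the odd number of Masters recommended for production, is majority quorum: leader election only succeeds if more than half of the masters can agree. A minimal Python sketch of that arithmetic (the function names are illustrative):

```python
def quorum_size(masters: int) -> int:
    """Smallest majority of the masters (strictly more than half)."""
    return masters // 2 + 1


def tolerated_failures(masters: int) -> int:
    """Masters that can fail while a majority can still elect a leader."""
    return masters - quorum_size(masters)


for n in (3, 4, 5):
    print(f"{n} masters: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 masters raises the quorum from 2 to 3 without tolerating any extra failure, which is why odd-sized ensembles are the norm.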
By some measures, Mesos is a very straightforward concept: Frameworks need to run tasks, and this traffic is managed by Masters, which coordinate the tasks on worker machines called Slaves.
From a production deployment standpoint, the following components are required: an odd number of Mesos Masters, many Slave machines to run applications, a ZooKeeper ensemble for HA configurations and, optionally, a Docker engine running on each slave.
The Mesos Resource Allocation Process..
Mesos follows a default resource scheduling model known as two-level scheduling. This model may seem a little convoluted, but it is important to keep in mind that it was designed to satisfy the requirements and constraints of many different frameworks without the Master having to know the details of each.
The Master’s allocation module takes the resources reported by the slaves and decides which framework schedulers to offer them to, and how much of each resource to offer. The framework schedulers can accept or reject the Master’s offers based on their current capacity requirements. The allocation module can be customized to meet the specific requirements that implementing enterprises may have. The default allocation algorithm is known as Dominant Resource Fairness (DRF) and is based on fair sharing of cluster resources among requesting applications. DRF equalizes each framework’s dominant share, i.e. its largest fractional share of any single resource: a CPU-hungry framework ends up with roughly the same fraction of the cluster’s CPU as a memory-intensive framework receives of the cluster’s RAM.
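To see what equalizing dominant shares means in practice, the Python sketch below implements DRF by progressive filling and runs it on the worked example from the original DRF paper (Ghodsi et al., NSDI 2011): 9 CPUs and 18 GB of RAM shared by framework A, whose tasks need <1 CPU, 4 GB>, and framework B, whose tasks need <3 CPUs, 1 GB>. It is a simplified illustration, not the Mesos allocator’s code; unlike the paper’s pseudocode, which stops as soon as the chosen framework’s next task does not fit, this version keeps serving any framework whose tasks still fit.

```python
from fractions import Fraction


def drf(total: dict[str, int], demands: dict[str, dict[str, int]]) -> dict[str, int]:
    """Dominant Resource Fairness by progressive filling.

    total:   cluster capacity per resource, e.g. {"cpu": 9, "mem": 18}
    demands: per-task demand of each framework (must name every resource)
    Returns how many tasks each framework gets to launch.
    """
    used = {r: 0 for r in total}
    tasks = {f: 0 for f in demands}
    active = set(demands)

    def dominant_share(f: str) -> Fraction:
        # A framework's largest fractional share of any single resource.
        return max(Fraction(tasks[f] * demands[f][r], total[r]) for r in total)

    while active:
        # Serve the framework with the lowest dominant share (ties by name).
        f = min(active, key=lambda name: (dominant_share(name), name))
        if all(used[r] + demands[f][r] <= total[r] for r in total):
            for r in total:
                used[r] += demands[f][r]
            tasks[f] += 1
        else:
            active.remove(f)  # its next task no longer fits in what is left
    return tasks


allocation = drf(
    total={"cpu": 9, "mem": 18},
    demands={"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
print(allocation)  # {'A': 3, 'B': 2}
```

A ends up with 3 tasks (3 CPUs and 12 GB, i.e. 2/3 of the memory) and B with 2 tasks (6 CPUs and 2 GB, i.e. 2/3 of the CPU), so both dominant shares are equal.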
To better illustrate the resource allocation method in Mesos, let us walk through the sequence of events in the above figure, taken from the Apache Mesos documentation:
- The Slave node – as depicted, Agent 1 reports to the Master that it has 4 CPUs and 4 GB of memory free for allocation to any framework that can use it. The allocation policy module decides that framework 1 should be offered these resources.
- The Master sends a resource offer describing what is available on agent 1 to framework 1.
- The framework’s scheduler replies to the Master with information on the two tasks to run on the agent, using <2 CPUs, 1 GB RAM> for the first task and <1 CPU, 2 GB RAM> for the second task.
- The master sends the tasks to the agent, which allocates appropriate resources to the framework’s executor, which in turn launches the two tasks (depicted with dotted-line borders in the figure). Because 1 CPU and 1 GB of RAM are still unallocated, the allocation module may now offer them to framework 2.
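The four steps above reduce to simple bookkeeping. The following Python sketch replays them with the numbers from the Mesos documentation example; `Resources` is a hypothetical helper type, and real offers carry much more (agent IDs, ports, disk, attributes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    mem_gb: int

    def fits(self, need: "Resources") -> bool:
        return need.cpus <= self.cpus and need.mem_gb <= self.mem_gb

    def minus(self, used: "Resources") -> "Resources":
        return Resources(self.cpus - used.cpus, self.mem_gb - used.mem_gb)


# Step 1: agent 1 reports its free capacity; the allocation module
# decides framework 1 should be offered all of it.
offer_to_fw1 = Resources(cpus=4, mem_gb=4)

# Steps 2-3: framework 1's scheduler replies with two tasks.
fw1_tasks = [Resources(2, 1), Resources(1, 2)]

# Step 4: the tasks are launched on agent 1 and their resources deducted.
remaining = offer_to_fw1
for task in fw1_tasks:
    if not remaining.fits(task):
        raise ValueError(f"task {task} does not fit in {remaining}")
    remaining = remaining.minus(task)

# Whatever is left over can now be offered to framework 2.
print(f"Offer to framework 2: {remaining.cpus} CPU, {remaining.mem_gb} GB")
# prints "Offer to framework 2: 1 CPU, 1 GB"
```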
Mesos integration with other SDDC components – Linux Containers, Docker, OpenStack, Kubernetes etc
As with other platforms we are discussing in this series, Mesos does not stand alone in the SDDC; it leverages other technologies as needed, as discussed in the last post (@ http://www.vamsitalkstech.com/?p=4670). It should be stated, though, that Mesos’s functionality does at times overlap with technologies such as Kubernetes and OpenStack.
With that caveat, let us consider the integration points between these technologies –
- Linux Containers – Over the last few years, Linux containers have emerged as a viable, lightweight alternative to hypervisors as a way of running multiple applications on a given OS. Different containers share one underlying OS and run with less overhead than virtual machines. Given that one of the chief goals of Mesos is to run multiple frameworks on the same set of hardware, Mesos implements what are called isolation modules and isolation mechanisms to achieve multi-tenancy for different applications running on the same hardware. Mesos supports popular process isolation technologies: cgroups, Solaris Zones and Docker containers. The first two are the defaults, and the Mesos project has recently added Docker as an isolation mechanism.
- Schedulers – There is no single, widely accepted definition of what constitutes a container orchestration technology. The tooling for it has become one of the trickiest parts of the discussion around launching containers at scale, with multiple projects attempting to capture this market. The requirement in the case of Mesos is straightforward: frameworks constitute applications that need to make the most efficient use of hardware. This means avoiding the overhead of VMs and leveraging containers (cgroups, Docker, rkt/Rocket etc.). Hence Mesos needs to support container orchestration as a core feature. Mesos follows a pluggable model for container orchestration by supporting schedulers such as Kubernetes, YARN, Marathon or Docker Swarm. All these tools organize containers into clusters, run them on specified servers, and handle the overall lifecycle management and scheduling of containerized applications. At large webscale properties, massive container-oriented environments running hundreds of microservices are managed with this combination of tools on top of Mesos. Mesos also needs to be able to start and stop services in response to failure conditions.
- Private and Public Cloud Infrastructure as a Service (IaaS) Providers – Mesos works at a different layer of abstraction than an IaaS provider such as OpenStack and aims to solve different problems. While OpenStack provisions infrastructure across OS, storage, networking etc., Mesos aims to achieve better utilization of cloud instances. Mesos integrates well with OpenStack and runs frameworks on top of the resources that OpenStack provisions. Mesos itself runs on a Linux instance in an existing OpenStack deployment, though it can also run on bare metal. It only requires a small Linux process running on each node. Mesos is also significantly simpler than OpenStack and can typically be up and running within a few hours.
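As one concrete example of a pluggable scheduler, Marathon accepts application definitions as JSON over its REST API. The Python sketch below only builds such a request and does not send it. The host name is hypothetical, and the field names (`id`, `cpus`, `mem`, `instances`, `container.docker.image`) follow Marathon’s app-definition format as commonly documented; they should be checked against the Marathon API reference for the version in use.

```python
import json
import urllib.request

# Hypothetical Marathon endpoint; replace with a real Marathon host.
MARATHON_URL = "http://marathon.example.internal:8080/v2/apps"


def marathon_app(app_id: str, image: str, cpus: float, mem_mb: int, instances: int) -> dict:
    """Build a minimal Marathon app definition for a Docker container."""
    return {
        "id": app_id,
        "cpus": cpus,
        "mem": mem_mb,  # Marathon expresses memory in MB
        "instances": instances,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
    }


def build_request(app: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST that would create the app."""
    return urllib.request.Request(
        MARATHON_URL,
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


app = marathon_app("/web/frontend", "nginx:1.25", cpus=0.5, mem_mb=256, instances=3)
req = build_request(app)
print(req.get_method(), req.full_url)
print(json.dumps(app, indent=2))
# urllib.request.urlopen(req) would submit it to a running Marathon instance.
```

The framework (Marathon) then takes care of accepting Mesos offers, launching the three container instances and restarting them if they fail.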
Mesos has also been deployed on public clouds, including Microsoft Azure and Amazon AWS. Azure’s container services are built on Mesos. Netflix uses Mesos extensively on its EC2 cloud and has also written an advanced scheduling library called Fenzo. Fenzo follows a first-fit style of assignment in which tasks are ‘bin packed’ onto Agents according to their requested CPU, memory and network bandwidth. Fenzo also autoscales the cluster based on demand and spreads the tasks of a given job across EC2 availability zones for high availability. With the stage set from a technology standpoint, let us look at a few real world use cases where Mesos has been deployed in mission critical applications at Netflix.
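Fenzo itself is a Java library with its own API, which is not reproduced here. As a hedged illustration of the first-fit idea only, the Python sketch below places each task on the first agent with enough CPU, memory and network bandwidth left; `Agent` and `first_fit` are hypothetical names, and Fenzo’s fitness functions, autoscaling and availability-zone spreading are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    cpus: float
    mem_gb: float
    net_mbps: float
    tasks: list[str] = field(default_factory=list)


def first_fit(tasks: list[tuple[str, float, float, float]], agents: list[Agent]) -> list[str]:
    """Place each (name, cpus, mem_gb, net_mbps) task on the first agent
    with room on every dimension; return the names that fit nowhere."""
    unplaced = []
    for name, cpus, mem_gb, net_mbps in tasks:
        for agent in agents:
            if cpus <= agent.cpus and mem_gb <= agent.mem_gb and net_mbps <= agent.net_mbps:
                agent.cpus -= cpus
                agent.mem_gb -= mem_gb
                agent.net_mbps -= net_mbps
                agent.tasks.append(name)
                break
        else:
            unplaced.append(name)  # no agent had room for this task
    return unplaced


agents = [
    Agent("agent-1", cpus=4, mem_gb=16, net_mbps=1000),
    Agent("agent-2", cpus=4, mem_gb=16, net_mbps=1000),
]
tasks = [("t1", 2, 8, 200), ("t2", 2, 4, 500), ("t3", 3, 8, 100), ("t4", 2, 2, 600)]
unplaced = first_fit(tasks, agents)
for agent in agents:
    print(agent.name, agent.tasks)
print("unplaced:", unplaced)
```

`agent-1` is packed full before `agent-2` is touched, and a task that fits nowhere (here `t4`) is the kind of signal an autoscaler could use to add agents.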
Mesos Deployment @ Netflix..
Netflix is one of the largest adopters of and contributors to Mesos, and it uses Mesos across a wide variety of business capabilities. These use cases include real time anomaly detection, the data science lifecycle (training and model building batch jobs, machine learning orchestration), and other business applications. These workloads span a range of technical architectures: batch processing, stream processing and microservices based applications.
Netflix runs its business applications as a collection of microservices deployed on Amazon EC2, and its first use of Mesos was to perform fine grained resource allocation for compute tasks to gain greater unit efficiency on EC2. At large enterprises, the first use case for Mesos is typically increasing the utilization and efficiency of elastic cloud services. In Netflix’s case, the cluster scheduler needed to support ephemeral agents as well as autoscale agents based on demand.
Major Application Use Cases –
Mantis – Netflix deals with a lot of operational data constantly streaming into its environment, and has a range of use cases on this streaming data, such as real-time dashboarding, alerting, anomaly detection, metric generation and ad-hoc interactive exploration. Mantis is a reactive stream processing platform, deployed as a cloud native service, that focuses on these operational data streams. Another goal of Mantis is to make it easy for different development teams to access real time events and build applications on them. Mantis currently handles around 8 million events per second, with Apache Mesos running hundreds of stream-processing jobs around the clock. For certain kinds of streaming applications, this amounts to tracking millions of unique combinations of data all the time.
Titus – As mentioned above, Netflix runs its application services stack on Amazon EC2, and most workloads run in Linux containers. Netflix created Titus as a container management platform to provision Docker containers on EC2, because Amazon ECS was not yet up to par as a container orchestration solution for EC2. The use cases supported by Titus include batch jobs that help with algorithm training (similar titles for recommendations, A/B test cell analysis etc.) as well as hourly ad-hoc reporting and analysis jobs. Titus recently added support for service-style jobs, which help provide consistent development environments and more fine grained resource management.
Meson – One of Netflix’s most important capabilities is its uncanny ability to predict which movies and shows its subscribers want to watch, based on their viewing history and similar segmentation data. Netflix excels at personalizing video recommendations, and this capability is powered by machine learning algorithms. To ensure that a very large number of machine learning workflow pipelines can be efficiently created, scheduled and managed, Netflix created Meson on top of Apache Mesos. For this system to scale, and for the algorithms themselves to be fast, reliable and efficient, these pipelines run over a large cluster of Amazon AWS instances. As depicted below, Meson manages a large number of jobs with differing CPU, memory and disk requirements. Once the slaves/agents are chosen, Spark jobs are run on these shared clusters. Meson uses Linux cgroups based isolation, and all resource scheduling is handled via Fenzo (described above).
Apache Mesos is a promising new technology which attempts to solve scaling and clustering challenges encountered in the Software Defined Datacenter (SDDC). The biggest benefits of using Mesos are more efficient use of infrastructure across complex applications with native support for multitenant applications. Mesos can ensure that multiple kinds of applications or frameworks can share a given set of nodes. This ensures not just more efficient sharing of hardware but also fault tolerance and load balancing for complex Cloud Native applications.
While Mesos has seen a good degree of adoption at the webscale properties where it was first adopted (Twitter, Netflix, Uber, Airbnb etc., to name the most prominent), it still needs to prove itself as a dependable and robust platform in the enterprise datacenter.
The next post in this series will explore another exciting technology, Docker, the emerging standard in the Linux container space.
 Apache Mesos Documentation – http://mesos.apache.org/documentation/latest/architecture/
 Distributed Resource Scheduling with Apache Mesos at Netflix – Medium.com
The ongoing digital transformation in key verticals like financial services, manufacturing, healthcare and telco has incumbent enterprises fending off a host of new market entrants. Enterprise IT’s best answer is to increase the pace of innovation as a way of driving greater differentiation in business processes. Though data analytics and automation remain the linchpin of this approach, software defined infrastructure (SDI) built on the notions of cloud computing has emerged as the main infrastructure differentiator, for a host of reasons which we will discuss in this two part blog.
Software Defined Infrastructure (SDI) is essentially an idea that brings together advances in a host of complementary areas spanning infrastructure software, data and development environments. It supports a new way of building business applications. The core idea in SDI is that massively scalable applications (in support of diverse customer needs) describe their behavior characteristics (via configuration and APIs) to the underlying datacenter infrastructure, which simply obeys those commands in an automated fashion while abstracting away the underlying complexities.
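The “describe, and the infrastructure obeys” idea is usually implemented as a declarative reconciliation loop, the pattern Kubernetes controllers follow: the application states what it wants, and a controller repeatedly compares that with what is actually running and issues corrective actions. A minimal Python sketch, assuming the desired state is just an instance count per service; the service names are hypothetical.

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[tuple[str, str, int]]:
    """Compare desired vs. actual instance counts and emit corrective actions.

    Returns (action, service, count) tuples, e.g. ("start", "web", 2).
    A real controller would run this in a loop and execute the actions.
    """
    actions = []
    for service in sorted(set(desired) | set(actual)):
        want = desired.get(service, 0)
        have = actual.get(service, 0)
        if want > have:
            actions.append(("start", service, want - have))
        elif have > want:
            actions.append(("stop", service, have - want))
    return actions


desired = {"web": 5, "payments": 3}
actual = {"web": 3, "payments": 3, "legacy-report": 1}
print(reconcile(desired, actual))
# prints [('stop', 'legacy-report', 1), ('start', 'web', 2)]
```

Because the controller only ever works from the declared state, the same loop handles scale-ups, failures (a crashed instance simply lowers the actual count) and decommissioning.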
SDI as an architectural pattern was originally made popular by the web scale giants, the so-called FANG companies of tech (Facebook, Amazon, Netflix and Alphabet, the erstwhile Google), but has gradually begun making its way into the enterprise world.
- Hardware infrastructure costs are typically growing at a high rate every year compared to the growth of the total IT budget. Cost pressures are driving an overall relook at the different tiers across the IT landscape.
- Infrastructure is not yet completely under the control of the IT application development teams, even as business realities dictate rapid app development to meet changing business requirements.
- Even small, departmental level applications still need to be deployed on expensive proprietary stacks, which are not only prohibitive in cost and deployment footprint but also take weeks to spin up in terms of provisioning cycles.
- Big box proprietary solutions are leading to a hard look at Open Source technologies, which are lean, easy to use and have a lightweight deployment footprint. Apps need to dictate the footprint, not vendor provided containers.
- There are concerns about acquiring developers who are tooled on cutting edge development frameworks and methodologies. You have zero developer mindshare with Big Box technologies.
Key characteristics of an SDI –
- Applications built on an SDI can detect business events in real time and respond dynamically by allocating additional resources in three key areas (compute, storage and network) based on the type of workloads being run.
- Using an SDI, application developers can seamlessly deploy apps while accessing higher level programming abstractions that allow for the rapid creation of business services (web, application, messaging, SOA/ Microservices tiers), user interfaces and a whole host of application elements.
- From a management standpoint, business application workloads are dynamically and automatically assigned to the available infrastructure (spanning public and private cloud resources) on the basis of application requirements and required SLAs, in a way that provides continuous optimization across the technology life cycle.
- The SDI itself optimizes the entire application deployment through both externally provisioned APIs and internal interfaces between the five essential pieces: Application, Compute, Storage, Network and Management.
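One common way to implement “respond dynamically by allocating additional resources” is a proportional scaling rule. Kubernetes’ Horizontal Pod Autoscaler, for example, computes desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The Python sketch below applies the same ratio to a hypothetical pool of instances; the utilization target and the min/max bounds are illustrative assumptions, and real autoscalers add tolerances and cooldown windows to avoid flapping.

```python
import math


def desired_instances(current: int, current_util: float, target_util: float,
                      min_instances: int = 1, max_instances: int = 20) -> int:
    """Proportional (target-tracking) scaling rule.

    Scales the instance count by the ratio of observed to target
    utilization, rounds up, then clamps to [min_instances, max_instances].
    """
    if target_util <= 0:
        raise ValueError("target_util must be positive")
    raw = math.ceil(current * current_util / target_util)
    return max(min_instances, min(max_instances, raw))


print(desired_instances(4, current_util=0.75, target_util=0.5))   # 6: scale out
print(desired_instances(10, current_util=0.25, target_util=0.5))  # 5: scale in
print(desired_instances(10, current_util=1.0, target_util=0.25))  # 20: capped at max
```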
The SDI automates the technology lifecycle –
Consider the typical tasks needed to create and deploy enterprise applications. This list includes but is not limited to –
- onboarding hardware infrastructure,
- setting up complicated network connectivity to firewalls, routers, switches etc,
- making the hardware stack available for consumption by applications,
- figuring out storage requirements and provisioning storage,
- guaranteeing multi-tenancy,
- application development
- updates, failover & rollbacks
- compliance checking etc.
Illustration: The different tiers of Software Defined Infrastructure
APIs are at the core of the software defined approach. APIs control the lifecycle of resources (request, approval, provisioning, orchestration and billing) as well as the applications deployed on them. The SDI implies commodity (x86) hardware and a cloud based approach to architecting the datacenter.
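Because every step in that lifecycle is an API call, the ordering can be enforced in software rather than by a manual ticketing process. A minimal Python sketch, assuming a simple linear sequence of the five stages named above; `ManagedResource` and `Stage` are hypothetical, and a real platform would also model rejection, deprovisioning and re-billing.

```python
from enum import Enum


class Stage(Enum):
    REQUEST = 1
    APPROVAL = 2
    PROVISIONING = 3
    ORCHESTRATION = 4
    BILLING = 5


class ManagedResource:
    """Hypothetical API-managed resource that moves through the lifecycle in order."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stage = Stage.REQUEST
        self.history = [self.stage]

    def advance(self, to: Stage) -> None:
        # Only the next stage in sequence is allowed; skipping approval
        # (or any other step) is rejected.
        if to.value != self.stage.value + 1:
            raise ValueError(f"{self.name}: cannot move from {self.stage.name} to {to.name}")
        self.stage = to
        self.history.append(to)


vm = ManagedResource("vm-42")
vm.advance(Stage.APPROVAL)
try:
    vm.advance(Stage.BILLING)  # skips provisioning and orchestration
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```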
The ten fundamental technology tenets of the SDI –
1. Highly elastic – scale up or scale down the gamut of infrastructure (compute – VM/Baremetal/Containers, storage – SAN/NAS/DAS, network – switches/routers/Firewalls etc) in near real time
2. Highly Automated – Given the scale and multi-tenancy requirements, automation is needed at all levels of the stack (development, deployment, monitoring and maintenance).
3. Low Cost – Oddly enough, the SDI operates at a lower CapEx and OpEx than the traditional datacenter, due to its reliance on open source technology and a high degree of automation. Further, workload consolidation helps increase hardware utilization.
4. Standardization – The SDI enforces standardization and homogenization of deployment runtimes, application stacks and development methodologies based on lines of business requirements. This solves a significant IT challenge that has hobbled innovation at large financial institutions.
5. Microservice based applications – Applications developed for an SDI enabled infrastructure are built as small, nimble processes that communicate via APIs and over infrastructure like messaging and service mediation components (e.g. Apache Kafka and Camel). This offers huge operational and development advantages over legacy applications. While one does not expect Core Banking applications to move over to a microservice model anytime soon, customer facing applications that need responsive digital UIs will definitely need to consider such approaches.
6. ‘Kind-of-Cloud’ Agnostic – The SDI does not enforce the concept of private cloud, or rather it encompasses a range of deployment options – public, private and hybrid.
7. DevOps friendly – The SDI enforces not just standardization and homogenization of deployment runtimes, application stacks and development methodologies but also enables a culture of continuous collaboration among developers, operations teams and business stakeholders, i.e. cross-departmental innovation. The SDI is a natural container for workloads that are experimental in nature and can be updated/rolled-back/rolled forward incrementally based on changing business requirements. The SDI enables rapid deployment capabilities across the stack, leading to faster time to market of business capabilities.
8. Data, Data & Data – The heart of any successful technology implementation is Data. This includes customer data, transaction data, reference data, risk data, compliance data etc. The SDI provides a variety of tools that enable applications to process data in a batch, interactive or low latency manner, depending on the business requirements.
9. Security – The SDI shall provide robust perimeter defense as well as application level security with a strong focus on a Defense In Depth strategy.
10. Governance – The SDI enforces strong governance requirements for capabilities ranging from ITSM requirements – workload orchestration, business policy enabled deployment, autosizing of workloads to change management, provisioning, billing, chargeback & application deployments.