A deep look into OpenShift v2

The PaaS (Platform as a Service) market is dominated by two worthy technologies – OpenShift from Red Hat and Pivotal's Cloud Foundry. It is remarkable that a disruptive technology category like PaaS is overwhelmingly dominated by these two open source ecosystems, which results in great choice for consumers.

I have used OpenShift extensively and have worked with customers on large and successful deployments. While I do believe that Cloud Foundry is a robust technology as well, this post will focus on what I personally know better – OpenShift.

Platform as a Service (PaaS) is a cloud application delivery model that typically sits between IaaS and SaaS in the cloud stack.

The Three Versions of OpenShift

OpenShift is Red Hat's PaaS technology for both private and public clouds. There are three different versions: OpenShift Origin, Online and Enterprise.

OpenShift Origin, the community (and open source) version of OpenShift, is the upstream project for the other two versions. It is hosted on GitHub and released under an Apache 2 license.

OpenShift Online is Red Hat's hosted public PaaS, currently running on Amazon AWS.

OpenShift Enterprise is the hardened version of OpenShift with ISV & vendor certifications.

[Image: the three flavors of OpenShift]

OpenShift Terminology

The Broker and the Node are the two main server types in OSE. The Broker is the manager and orchestrator of the overall infrastructure and the brains behind the operation. The Nodes are the VMs or bare-metal servers where end user applications live.

The Broker exposes a web GUI (a charming throwback to the 1990s) but, more importantly, an enterprise class and robust REST API. Typically one or more Brokers manage multiple Nodes, and multiple Brokers can be clustered for HA purposes.

Application

This is the typical web, BPM or integration application that will run on OpenShift. OpenShift v2 was focused on web application workloads but offered a variety of mature extension points as cartridges.

You can interact with the OpenShift platform via RHC client command-line tools
you install on your local machine, the OpenShift Web Console, or a plug-in you
install in Eclipse to interact with your application in the OpenShift cloud.

Gear

A gear is a server container with a set of resources that allows users to run their
applications.

Cartridge

Cartridges give gears a personality and make them containers for specialized applications. Cartridges are the plug-ins that house the framework or components that can be used to create and run an application. One or more cartridges run on each gear, and the same cartridge can run on many gears for clustering or scaling. There are two kinds of cartridges: Standalone and Embedded. You can also create custom cartridges for whatever is running in your OSE environment – for example, an EAP cartridge with auto-configuration hooks for an F5 load balancer, where the cartridge calls out to the F5 API every time it detects a load event.
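
As a rough sketch of the standalone versus embedded distinction using the rhc client (the application and cartridge names below are placeholders and vary by release):

$ rhc cartridge list                      # see which cartridges the broker offers
$ rhc app create myapp jbossews-2.0       # a standalone web cartridge creates the application
$ rhc cartridge add mysql-5.5 -a myapp    # an embedded cartridge is added to an existing application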

Scalable application

Application scaling enables your application to react to changes in traffic and automatically allocate the necessary resources to handle your increased demand. OpenShift is unique in that it can do application scaling both ways – scale and descale – dynamically. The OpenShift infrastructure monitors incoming web traffic and automatically brings up new gears with the appropriate web cartridge online to handle more requests. When traffic decreases, the platform retires the extra resources. There is a web page dedicated to explaining how scaling works on OpenShift.
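
For a scaled application, the scaling bounds can be inspected and adjusted from the rhc client. A minimal sketch, assuming a recent rhc client and an application named myapp backed by a Tomcat (JBoss EWS) web cartridge (both names are placeholders):

$ rhc app show myapp --gears                                  # list the gears currently backing the application
$ rhc cartridge scale jbossews-2.0 -a myapp --min 1 --max 5   # let the platform scale the web cartridge between 1 and 5 gears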

Foundational Blocks of OpenShift v2 

The OpenShift Enterprise multi-tenancy model is based on Red Hat Enterprise Linux, and it provides a secure isolated environment that incorporates the following three security mechanisms:
SELinux
SELinux is an implementation of a mandatory access control (MAC) mechanism in the Linux kernel. It checks for allowed operations at a level beyond what standard discretionary access controls (DAC) provide. SELinux can enforce rules on files and processes, and on their actions based on defined policy. SELinux provides a high level of isolation between applications running within OpenShift Enterprise because each gear and its contents are uniquely labeled.
Control Groups (cgroups)
Control Groups allow you to allocate processor, memory, and input and output (I/O) resources among applications. They provide control of resource utilization in terms of memory consumption, storage and networking I/O utilization, and process priority. This enables the establishment of policies for resource allocation, thus ensuring that no system resource consumes the entire system and affects other gears or services.
Kernel Namespaces
Kernel namespaces separate groups of processes so that they cannot see resources in other groups. From the perspective of a running OpenShift Enterprise application, for example, the application has access to a running Red Hat Enterprise Linux system, although it could be one of many applications running within a single instance of Red Hat Enterprise Linux.
How does the overall system flow work? 

1. As mentioned above, the Broker and the Node are the two server types in OSE. The Broker is the manager and orchestrator of the overall infrastructure. The Nodes are where the applications live.

A MongoDB instance behind the Broker stores all the metadata about the applications; the Broker also manages DNS through a BIND plug-in, and an authentication service manages user credentials.

2. As mentioned, the gear is the resource container and is where the application lives. Under the covers a gear is really a cgroups and SELinux configuration. Using cgroups and SELinux enables high density, multi-tenant applications.

3. The cartridge is the framework and stack definition – e.g. a Tomcat definition with Spring and Struts installed on it.

4. The overall flow is that the user uses the REST API to ask the Broker to create, say, a Tomcat application, and requests a scaled or a non-scaled application for a given gear size (this is all completely customizable).
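
In practice that request usually goes through the rhc client, which wraps the Broker REST API. A sketch, with names and gear sizes as placeholders (the REST parameter names are indicative only, so check the API documentation for your release):

$ rhc app create shop jbossews-2.0 -s -g medium    # -s requests a scaled application, -g picks the gear size

$ curl -k -u user:password -X POST \
       https://broker.example.com/broker/rest/domains/mydomain/applications \
       --data 'name=shop&cartridge=jbossews-2.0&scale=true&gear_size=medium'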

The Broker communicates with the Nodes and asks which of them have the capacity to host the application. It gets back a simple health check from the Nodes and decides to place the application on the Nodes identified by that health indicator. All of this communication happens via MCollective (which is built on ActiveMQ).

5. Several things happen once the Broker decides where to place the workload.

a) A UNIX-level user is created on the gear. Every application gets a UUID and a user associated with it. A home directory is created for that user, and the SELinux policies associated with that user are applied on that gear, so that it has very limited access, scoped to whatever is on that gear only. If you log in and run a ps, you cannot see anything that is not yours. All of that is controlled by SELinux and is set up when you create an application.
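
A rough way to see this isolation on a node (the gear UUID is a placeholder; on a default install, gear homes live under /var/lib/openshift):

$ ps -efZ | grep <gear-uuid>                 # each gear process carries its own SELinux MCS category pair
$ ls -Z /var/lib/openshift/<gear-uuid>/      # the gear's home directory is labeled with the same categories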

b) The gear is created with the cgroups configuration for memory, CPU, disk space and so on, based on what the user picked.

c) The next step is the actual cartridge install, which lays down the stack (Tomcat in this instance) and its associated libraries. OpenShift then starts Tomcat on that node and takes care of DNS resolution so that the application is publicly addressable.

d) Base OpenShift ships with lots of default cartridges covering all kinds of applications, and the cartridge spec 2.0 made it a lot easier to create new ones.

You can write custom cartridges that expose more ports and services across gears.

You can also build hybrid cartridges, where the first gear spun up runs one framework and the next gear runs another – for example a web front end plus a Vert.x cluster on the back end – with all the inter-gear communication handled by OSE. It is all very customizable.

For instance, you can add EAP and JDG/Business Rules in one hybrid cartridge and have it all autoscaled. Or you can build multiple cartridges, with one as the main cartridge and the embedded cartridges backing it up.

The difference between the hybrid and the single framework is that when it receives a scale-up event, it has to decide what to install on a given gear, so the logic gets a little more complicated.

[Image: OpenShift Enterprise workflow]

e) Networking and OSE.

The first thing to keep in mind is that all of OpenShift Online (which is what a lot of people know and understand) is really designed around Amazon EC2 infrastructure. The 5-ports-per-gear limitation is really an EC2 limitation and does not really apply in the enterprise, where you are typically running on fully controlled VMs.

Inter-gear communication between the gears of an application goes through iptables. This is all configurable, as are the SELinux policies and port ranges; everything is customizable in the Enterprise version.

It is important to understand how routing works on a node to better understand the security architecture of OpenShift Enterprise. An OpenShift Enterprise node includes several front ends to proxy traffic to the gears connected to its internal network.
The HTTPD Reverse Proxy front end routes standard HTTP ports 80 and 443, while the Node.js front end similarly routes WebSockets HTTP requests from ports 8000 and 8443. The port proxy routes inter-gear traffic using a range of high ports. Gears on the same host do not have direct access to each other.
In a scaled application, at least one gear runs HAProxy to load balance HTTP traffic across the gears in the application, using the inter-gear port proxy.
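
Because gear services bind to the node's internal interfaces, developers typically reach them through the rhc client's port forwarding. A small sketch, with the application name as a placeholder:

$ rhc port-forward -a myapp        # tunnels the gear's internal services (web cartridge, databases etc.) to localhost
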
[Image: OpenShift networking]

THE COMPLETE PICTURE –
————————————

All OpenShift applications are built around a Git source control workflow – you code locally, then push your changes to the server. The server then runs a number of hooks to build and configure your application, and finally restarts your application. Optionally, applications can elect to be built using Jenkins, or run using “hot deployment” which speeds up the deployment of code to OpenShift.

As a developer on OpenShift, you make code changes on your local machine, check those changes in locally, and then “push” those changes to OpenShift.

Following a workflow –

The git push is really the trigger for a build, a deploy and a start/stop cycle – all of that happens via a git push. Git here is not used just for source control but as the communication mechanism that pushes the application to the node.

So if one follows the workflow from the developer side, they typically use client tools that speak REST (the CLI or Eclipse) to the Broker to create the application, or to start and stop it. Once that application is created, everything you do – changing Tomcat config, changing your source, deploying a new war and so on – goes through Git.

The process is that you use whatever client tool you prefer – REST, CLI, Eclipse etc. – to ask the Broker to create an application. The Broker, via MCollective, goes to the Nodes and picks the ones it intends to create the gears on.

The gears are then created, along with the Unix user, cgroups and SELinux configuration, home directory and cartridge setup, based on whatever you picked.

Routing configuration is also done on the gear: applications bind to a loopback address on the node, so they are not externally routable by themselves. Every node also runs an Apache instance that multiplexes this traffic so that applications on all nodes are externally reachable.

There is then a callback to the Broker to handle the DNS registration, and the application is now accessible from the browser. At this point, all you have is a simple hello-world templated app.

Also check out the link below on customizing OSE autoscale behavior via HAProxy –

https://www.openshift.com/blogs/customizing-autoscale-functionality-in-openshift

GIT and OSE

Every OpenShift application you create has its own Git repository that only you can access. If you create your application from the command line, rhc will automatically download a copy of that repository (Git calls this ‘cloning’) to your local system. If you create an application from the web console, you’ll need to tell Git to clone the repository. Find the Git URL from the application page, and then run:

$ git clone <git-url>

Once you make changes, you’ll need to ‘add’ and ‘commit’ those changes – ‘add’ tells Git that a file or set of files will become part of a larger check in, and ‘commit’ completes the check in. Git requires that each commit have a message to describe it.

$ git add .
$ git commit -m "A checkin to my application"

Finally, you’re ready to send your changes to your application – you’ll ‘push’ these changes with:

$ git push

And that's it!

OpenShift v3, which was just launched this week at Red Hat Summit 2015, introduces big changes with a move to Docker and Kubernetes in lieu of vanilla Linux containers and bare-bones orchestration. Subsequent posts will focus on v3 as it matures.

Why Web Scale IT will redefine business and why every CXO should care…

"Conformity is the jailer of freedom and the enemy of growth." – John F. Kennedy

A relatively new term that has entered the lexicon of modern IT is Web-scale systems. Credit to Cameron Haight of Gartner for coining it a couple of years ago. I have spent a lot of my time over the last couple of years working with marquee names worldwide on gradually percolating these practices into their critical business lines and applications. However, my goal today is not to define this loosely understood concept but rather to speak to it from a business perspective. We will get to the technology prongs of web-scale in follow-up posts.

The provenance of the term "Web-scale IT" is the recognition that the web-scale giants, led by the Big Four – Google, Amazon, Facebook and Apple – have built robust platforms (as opposed to standalone or loosely federated applications) that have not only contributed to their outstanding business success but have also led to the creation of (open source) software technologies that enable business systems to operate at massive scale, in terms of billions of users and millions of systems. They have done all this while constantly churning out innovative offerings and continuously adapting to and learning from customer feedback. No mean feat.

These four are followed by two other new-age giants – Netflix and LinkedIn. The Netflix stack is one of the primary open source projects that enable the creation of microservices and loosely coupled applications at scale.

As an analyst from Citi remarked a few months ago – “It is impossible to overtake them*”.

* – How Four Companies Took Over the Internet

http://money.cnn.com/2012/11/12/technology/techonomy-big-four/

The term is apt since no Fortune x00 company has been able to match this velocity of change, and quite a few have even seen their business models upended by the blazing speed at which these four giants have been able to cause tectonic shifts in the landscape. A few examples: Amazon AWS in cloud computing, Apple iTunes in the music business, Google's Gmail (to name just one in their amazing panoply of offerings) and Facebook's relentless march to be the personal collaboration platform. Google and Facebook have spun off a number of open source projects that now form the basis of distributed computing and massively scalable compute and data architecture.

[Image Courtesy – FastCompany]

How has all this been possible? In large degree due to visionary leadership backed by a culture of risk taking and innovation.

Gartner rightly proclaims that by 2017, 50% of all global enterprises will adopt web-scale architectures and practices in some shape or form.

http://www.gartner.com/newsroom/id/2675916

The point has been well made that every business is now a software business. The ability of an enterprise to compete with its rivals depends largely on the quality of its information systems and data analytics, as well as a culture that rewards risk taking and internal entrepreneurship.

The five business reasons below make it impossible for a CXO to neglect the adoption of web-scale practices in at least a few critical applications –

1. Digital Transformation – Every large Fortune x00 enterprise is under growing pressure to transform lines of business, or the entire enterprise, into a digital operation. I define digital in this context as being able to "adopt high levels of automation while enabling the business to support multiple channels by which products and services can be delivered to customers. Further, a digital culture encourages constant innovation and agility, resulting in high levels of customer and employee satisfaction."

2. Smart Data & Analytics –  Web-scale techniques ensure that the right data is in the hands of the right employee at the right time so that contextual services can be offered in real time to customers. This has the effect of optimizing existing workflows while also enabling the creation of new business models.

3. Cost Savings – Oddly enough, the move to web-scale actually reduces business and IT costs. You not only end up doing more with fewer employees due to higher levels of automation, but you are also able to constantly cut costs by adopting technologies like cloud computing that reduce both CapEx and OpEx. Almost all web-scale IT is dominated by open source technologies and APIs, which are much more cost effective than proprietary platforms.

4. A Culture of Collaboration – The most vibrant enterprises that have implemented web-scale practices not only offer "IT/Business as a Service" but have also instituted strong cultures of symbiotic relationships between customers (both current and prospective), employees, partners and developers.

5. Building for the Future – The core idea behind implementing web-scale architecture and data management practices is “Be disruptive in your business or be disrupted by competition”. Web-scale practices enable the building of business platforms around which ecosystems can be created and then sustained based on increasing revenue.

The virtuous feedback loop encouraged by constant customer data & feedback enables these platforms to behave like growing organisms. Now that the stage has been laid for “Why Web-scale?”, I will cover the technology prongs from a high level in the next post.

Enter Apache Ambari – the Elephant Rider…

As an aficionado of Hadoop technology and a devoted student of its gradual progression from a Big Data processing framework to an enterprise application and data ecosystem (more on that in follow-up posts), one of the big problems – or rather gaps – with Generation 1 was the lack of an open source, enterprise grade operations console.

This gap essentially meant that operations and admin teams had to be reasonably well educated in Hadoop semantics and internals to handle common use cases like provisioning (i.e. cluster creation and replication), management and monitoring at scale, operational scripting, and resource allocation monitoring.

This challenge is also exacerbated by the fact that Big Data clusters do not live in isolation in your datacenter. You need to integrate them with existing assets like data warehouses, RDBMS’s as well as extend them with whatever configuration & automation technology is currently in place. e.g. Puppet.

Another significant issue is that a leading Hadoop distribution like Hortonworks ships 22+ technology frameworks in addition to the two core Hadoop components – HDFS and YARN. These include services like Hive, Pig, Spark and Storm.

You use the right framework for the right job depending on the business problem you are looking to solve: Hive for interactive SQL, Storm for stream processing, MapReduce for batch, and so on. This means that not all nodes run the same set of services, since they serve different workloads.

The challenge, then, from an operational perspective is to install specific services on the nodes or clusters within the overall infrastructure that need to provide batch, interactive or streaming support to the applications accessing data within them.

Enter Apache Ambari thanks to work in the Apache Software Foundation, led by Hortonworks and many other community members…

What Ambari Does

Ambari enables system administrators to handle all of the above use cases from a provisioning, management, monitoring and integration standpoint.

The key use cases where Ambari shines –

1. Provision a Hadoop Cluster

No matter the size of your Hadoop cluster, the deployment and maintenance of hosts is simplified using Ambari.

Provisioning of clusters, services and components can be accomplished in one of two ways. One can leverage a well designed UI, or use the Blueprints API to automate cluster installs without any manual intervention. Ambari Blueprints are a great feature whereby the operator provides a declarative definition of a Hadoop cluster. One specifies a Stack, the Component layout within the stack and the necessary custom Configuration(s), so that a simple REST API call to the Ambari server results in on-the-fly creation of the Hadoop cluster.

Blueprints also enable users to create their own custom Hadoop configurations depending on the business capability that they need to achieve. Thus clusters are built using the layout, components and stack defined in the Blueprints.
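
A minimal sketch of the Blueprints flow against the Ambari REST API; the host name, credentials and file names below are placeholders, while the /api/v1/blueprints and /api/v1/clusters endpoints and the X-Requested-By header are standard Ambari conventions:

$ curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
       -d @hdp-blueprint.json http://ambari-server:8080/api/v1/blueprints/hdp-small        # register the blueprint
$ curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
       -d @cluster-template.json http://ambari-server:8080/api/v1/clusters/analytics-poc   # create a cluster from it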

The architecture of an Ambari install is described below, but in short the web UI is a client side JavaScript application which accesses the server via a REST API. The server can thus act as a full-fledged REST provider, which enables easy integration with other management tooling. There is a rich set of permissions available based on the user's role, and the Ambari server can optionally plug into your LDAP/RDBMS authentication repository so that permissions do not need to be duplicated in the Ambari database.

2. Manage a Hadoop cluster

Ambari provides tools to simplify cluster management. The web interface and the REST API allow a sysadmin or an enterprise systems management console to control the lifecycle of Hadoop services and components, modify configurations, and manage the ongoing growth of the cluster.
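
Because everything in Ambari is exposed over REST, routine lifecycle operations can be scripted as well; for example, stopping a service (the cluster name and credentials are placeholders):

$ curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
       -d '{"RequestInfo":{"context":"Stop HDFS"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
       http://ambari-server:8080/api/v1/clusters/analytics-poc/services/HDFS    # state INSTALLED stops the service, STARTED starts it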

Ambari leverages two other open source projects, Nagios and Ganglia, which are preconfigured and installed along with the Ambari server and agent.

[Image: Ambari metrics]

3. Monitor a Hadoop cluster

Ambari can monitor the health and status of a Hadoop cluster down to a granular level of detail for metric display on the web UI. Ganglia integration provides not just a wealth of Hadoop-centric metrics (e.g. HDFS disk usage, links, DataNode info) but also useful histograms for understanding patterns of usage from a threshold and trend perspective.

[Image: cluster-wide metrics]

4. Integrate Hadoop with Enterprise Systems Mgmt Disciplines

No Hadoop cluster lives in isolation from a management perspective in an enterprise datacenter. While Ambari shines at Hadoop management and monitoring, it is also fairly easy to hook it up with other enterprise management consoles as well as with systems management products like Puppet, Chef and Ansible.

Ambari accomplishes this in three primary ways –

a) the REST API, which enables integration with existing tools. Certifications have been established for widely used operations tooling; major examples include Microsoft System Center Operations Manager, HP Operations Manager and Teradata Viewpoint.

b) the Stacks feature, which enables one to define a group of services that can be installed as a unit (e.g. HDFS, ZooKeeper, Hive, Pig, Spark), along with the locations of the repositories in your infrastructure where they can be found. Stacks can also be versioned and support a definition hierarchy: base stacks can be created and then inherited from to build higher order stacks.

c) Blueprints, which are a declarative way of creating a live, running cluster from scratch. Blueprints are JSON files with two main parts: a list of host groups (node information) and the allied configuration (which defines the services that go onto those nodes).

Blueprints use specific versions of stacks.

[Image: Ambari Blueprints workflow]

Blueprints are submitted to the Ambari REST API; this workflow is covered in the visual above. I will cover these in more detail when we talk about CloudBreak, a Hadoop-as-a-Service offering built around HDP (the Hortonworks Data Platform).

Architecture –

The Ambari architecture follows a client-server model. The Ambari Server serves as the collection point for data from across the cluster and exposes a REST API; the web UI uses this REST API to interface with the server.
Each node in the cluster runs a copy of the Ambari Agent, which enables the server to perform provisioning, inventory, configuration management and operational activities on the cluster workloads.

[Image: Ambari architecture]

As mentioned above, Ambari also comes pre-packaged with two other popular open source monitoring technologies – Ganglia and Nagios. Each node in the cluster runs an instance of the Ganglia monitoring daemon (gmond), which collects metric information and passes it to the Ganglia meta daemon (gmetad), and from there to the Ambari Server for display on the web UI.

The other component is Nagios which is used for alerting against threshold conditions.

Ambari Views

Ambari Views is a newer project that offers a systematic way to add new tools to Ambari using a plug-in model. Newly added services or proprietary tools can be integrated into the Ambari web user interface using Views.

These new tools might include such things as a Pig query editor, a Hive query editor, a workflow design tool, a Tez visualization tool, or a Hadoop distributed file system (HDFS) browser.
To sum up…

The top level Apache Ambari project has been moving at a rapid clip in terms of innovation – it had 13 releases in 2013 and 6 releases in 2014.

The latest community version (2.0), released in March 2015, supports automated rolling upgrades for components such as ZooKeeper, Ranger, and the core YARN and HDFS services. Further, Hortonworks is putting in place certification processes as well as proven methodologies to help enterprise customers with this new capability, which handles potentially vexing issues like component interdependencies and switching software versions correctly in a rolling fashion.

More information on this new capability at the below link –

Introducing Automated Rolling Upgrades with Apache Ambari 2.0

For anyone intending to deploy, run or even planning a Hadoop cluster of any size, Ambari can make the operations angle a breeze.

For more information, please check out the below link –

http://hortonworks.com/hadoop/ambari/

Another illustration of OpenStack’s growing clout in the enterprise vendor landscape..

As a follow-up to the previous post on Forrester's recognition of OpenStack's growing prominence as the de facto private cloud operating system, I wanted to quickly note the gradual transformation of a major proprietary vendor, EMC, from open source naysayer, to onlooker, to finally an open source participant in critical distributed computing communities like Cloud and Big Data. This is testament to the ascendancy of the open source community in building out web-scale systems and data architectures.

Here is a great blogpost by Dorian Narvesh of EMC on their new reference architecture program.

EMC Going “Big” at OpenStack Summit Vancouver

In Dorian’s eloquent words –

“What a difference an year makes! EMC as a bystander in Atlanta to an Event Sponsor in Paris and today a Premier Sponsor in Vancouver. Crawling, walking, and now running!”

EMC are not just certifying their suite of storage arrays (EMC VNX and XtremIO) and their Software Defined Storage (ScaleIO and ViPR), along with Brocade switches (an EMC partner and much-talked-about acquisition target), on partner distributions like Red Hat, Mirantis and Canonical, but are also putting them through a rigorous optimization process.

Reference architectures will be released for all the above distributions in due course. Below is the one they just released with one of those vendors (all images from the EMC blog).

[Image: EMC OpenStack reference architecture]

EMC are also creating new drivers for these products to certify them with OpenStack Cinder and Manila for the Kilo release. EMC’s high level contributions to the broader OpenStack community are highlighted in the slide below.

[Image: EMC contributions to the OpenStack community]

Couple all of this with the just-announced open sourcing of one of their core software defined storage controllers, ViPR, which provides storage automation functionality. This is EMC's foray into creating their first open source community based on one of their commercial software products.

http://www.emc.com/about/news/press/2015/20150505-02.htm

The upshot of all of this is that EMC customers now have a range of choices for their OpenStack cloud, whether it is an end-to-end EMC solution, or Red Hat, Canonical, Mirantis, or the upcoming VMware distribution of OpenStack.

Kudos to EMC on this significant accomplishment!

OpenStack comes of Age & drives the Software Defined Datacenter!

OpenStack is a project and technology that is close to my heart. While at Red Hat, I spent a lot of time working with early adopter customers across financial services, telco, and media & entertainment who were building heavy duty OpenStack clouds. While I missed the Vancouver Summit (May 2015) for personal reasons, I have been following all the different projects as they go along their merry way adding features and continually improving stability.
A major analyst firm (one of the big two), Forrester, just released a report that sums up what acolytes like myself have been saying all along – "OpenStack is ready – are you?"

This report clearly validates that pathbreaking early adopters (use cases in a bit) are beginning to move proofs of concept and pilots to production, and are reaping the benefits of a transformational technology.

[Image: Forrester report on OpenStack readiness]

For the uninitiated, OpenStack promises a complete ecosystem for building out private clouds. The OpenStack Foundation is backed by some major technology players, including Red Hat, HP, IBM, AT&T, Comcast, Canonical/Ubuntu, VMware, Nebula, Rackspace and SUSE. Built from multiple sub-projects as a modular system, OpenStack allows an IT organization to build out a scalable private (or hybrid) cloud architecture based on an open standard, unlike Amazon Web Services (AWS). This is particularly relevant in financial services given the cost and regulatory pressures banking organizations are under, as well as the need to derive competitive advantage from agile implementations without incurring the security and business risks of a public cloud.

One of the key differentiators of the OpenStack project is that it is composed of various sub-projects, and one does not need to do an end-to-end implementation of the entire ecosystem to realize its benefits. It is designed to provide a very high degree of flexibility, as modules can generally be used in combination or as one-off projects.

The report validates that OpenStack has been production grade for a while, and early adopters who were brave enough to run initial production grade applications are reaping the benefits of being prescient and courageous enough to follow their convictions. In today's fast moving business landscape where forces like Big Data, Social, Mobile and Cloud increasingly dominate, you will either be disruptive or be disrupted. The race to build platforms and expose those business capabilities as APIs while providing seamless provisioning and service to your development community is key. OpenStack is all about being an elastic, agile and cost effective Cloud OS that enables you to layer in higher order capabilities like PaaS and SaaS.
Let’s trace a brief history of OpenStack from a release maturity standpoint as reproduced below from the OpenStack community website.
Series     Status                                        Releases and dates
Liberty    Under development                             Due Oct 15, 2015
Kilo       Current stable release, security-supported    2015.1.0 (Apr 30, 2015)
Juno       Security-supported                            2014.2.3 (Apr 13, 2015), 2014.2.2 (Feb 5, 2015), 2014.2.1 (Dec 5, 2014), 2014.2 (Oct 16, 2014)
Icehouse   Security-supported                            2014.1.4 (Mar 12, 2015), 2014.1.3 (Oct 2, 2014), 2014.1.2 (Aug 8, 2014), 2014.1.1 (Jun 9, 2014), 2014.1 (Apr 17, 2014)
Havana     EOL                                           2013.2.4 (Sep 22, 2014), 2013.2.3 (Apr 3, 2014), 2013.2.2 (Feb 13, 2014), 2013.2.1 (Dec 16, 2013), 2013.2 (Oct 17, 2013)
Grizzly    EOL                                           2013.1.5 (Mar 20, 2014), 2013.1.4 (Oct 17, 2013), 2013.1.3 (Aug 8, 2013), 2013.1.2 (Jun 6, 2013), 2013.1.1 (May 9, 2013), 2013.1 (Apr 4, 2013)
Folsom     EOL                                           2012.2.4 (Apr 11, 2013), 2012.2.3 (Jan 31, 2013), 2012.2.2 (Dec 13, 2012), 2012.2.1 (Nov 29, 2012), 2012.2 (Sep 27, 2012)
Essex      EOL                                           2012.1.3 (Oct 12, 2012), 2012.1.2 (Aug 10, 2012), 2012.1.1 (Jun 22, 2012), 2012.1 (Apr 5, 2012)
Diablo     EOL                                           2011.3.1 (Jan 19, 2012), 2011.3 (Sep 22, 2011)
Cactus     Deprecated                                    2011.2 (Apr 15, 2011)
Bexar      Deprecated                                    2011.1 (Feb 3, 2011)
Austin     Deprecated                                    2010.1 (Oct 21, 2010)
As far as version and component maturity goes, most forward looking customers started heavy duty evaluation with the Havana release in 2013. It is as if they said “Alright – everything is here from the perspective of an enterprise cloud operating system – compute, storage, network, a path to upgrades, certifications from established vendors and an enterprise roadmap etc etc.”
[Image: What's new in OpenStack Kilo]
(Src – Cloud Enabled)
And when the next version, Icehouse, came along, people said "that's production and we are going to run this for real".
And so, from an enterprise standpoint, what made Icehouse palatable was that it was made upgradeable: you could take an OpenStack Havana cloud and upgrade it in place to Icehouse without bringing your workloads down.
Before that, from Grizzly to Havana, you had to take your workloads down as upgrades happened, since OpenStack trunk releases a major version every six months.
Add to that the three-year enterprise lifecycle from best-in-class vendors like Red Hat, which avoided a lot of the churn of having to keep up with upstream releases, and it became a very attractive proposition.
Block storage and Software Defined Networking – SDN (more on SDN in a series of follow-up posts) – were also mature beginning with the Havana release. Plugins had been worked out and certified for major vendors like EMC and Hitachi. Project Ceilometer was also created and proved crucial for getting visibility into usage metrics. Project Sahara was created as an incubator to run Hadoop workloads on an OpenStack cloud. Cloud plays a role in the life cycle of development, deployment and optimization of Big Data applications. Innovation at the largest enterprises is often shackled by the absence of a responsive and agile infrastructure: it can take days to procure servers to host bursts of workloads, which may not be feasible for existing IT departments to rapidly turn around. This problem is by no means unique to Big Data, but what makes it even more relevant here is that processing needs in this specialized area are typically larger than your average IT application.
The other major gain from a deployment standpoint was Project Heat.
So, there is some value in being able to provision VMs into OpenStack, but people have been doing that forever with VMware or Red Hat virtualization. Where Heat gives you a distinct advantage is in being able to provision entire application stacks.
Using Heat, the granular unit in the service catalog – the contract between the service catalog and the infrastructure – becomes an application. Heat provides an amazing level of orchestration.
Think of someone using the EC2/CloudFormation APIs directly: seven API calls to provision seven VMs, more to provision storage, so you are looking at 10 to 15 calls. You then need to check the status of these calls, check the results, put it all into some workflow, handle errors and so on.
With Heat, it is a single API call: you define the stack declaratively, along with its network dependencies. When you invoke this from a service catalog, the catalog can have its own workflow along the lines of "user requests resources", "check that the resources are available", "check the user's permissions and authority".
Currently most large customers use some kind of proprietary provisioning tool to provision complex stacks with compute, network and storage dependencies and a bunch of third-party callouts. As a model this is a huge pain: if something goes wrong, you don't know what failed or how to roll back. When Heat is leveraged, all of these desirables become native capability – a single API call, with a well defined audit trail, stack trace and rollback.
The other important thing with large customers is that, a couple of years ago, most started to standardize on a declarative model and these patterns. They already have dozens of patterns they use to provision their public cloud workloads, and they want to bring those back to private clouds built on OpenStack. Given that Heat is Amazon CloudFormation compatible, those templates are easily translatable.
Anything you can do using OpenStack is automatable and describable via Heat.
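
A minimal sketch of what that declarative model looks like with Heat, using the HOT template format and the classic heat CLI; the image, flavor and network names are placeholders for whatever exists in your cloud:

$ cat > web-stack.yaml <<'EOF'
heat_template_version: 2013-05-23
description: Minimal two-resource application stack
resources:
  web_server:
    type: OS::Nova::Server
    properties:
      image: rhel-7-guest
      flavor: m1.small
      networks:
        - network: private-net
  web_volume:
    type: OS::Cinder::Volume
    properties:
      size: 10
EOF
$ heat stack-create -f web-stack.yaml app-stack    # one call; Heat handles ordering, the audit trail and rollback
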
From the Forrester report –

Regardless, OpenStack’s own continuous release cycle of new OpenStack modules reflects the agile, continuous delivery that many evolving BT organizations look to mirror (see Figure 2). Whether enterprises establish and build a center of OpenStack excellence internally or leverage it through a series of vendor partners, they are turning to OpenStack as the platform layer of their solutions across core projects tying into the larger BT agenda.

So what are some of the enterprise use cases for OpenStack –
A) Create a DIY IaaS and basic PaaS in support of DevOps –

The two step process to do this looks like the below –
i) Replace homegrown cloud platforms –  
Typically for a PaaS, i.e. running systems of engagement like web properties and three-tier applications. A lot of technical debt has been built up on tools like BMC to orchestrate VMs, PaaS instances and containers.
ii) Install the PaaS on OpenStack and give developers self-service access, potentially even running workloads in Docker-managed containers on their Linux operating system or on a container-optimized OS like RHEL Atomic.
B) Telco & NFV –
NFV (Network Functions Virtualization) virtualizes network functions like routers and switches, automating network capabilities that are otherwise expensive and hardware based.
C) Big Data / HPC workloads 
D) Move to a Software driven datacenter by replacing expensive and proprietary virtualization solutions
In summary, OpenStack has a ton of promise and is slowly emerging as the de facto standard for building out private clouds, and as a way of reducing CapEx and OpEx while converting massive amounts of compute, network and storage resources into a Software Defined Infrastructure. While there are still kinks to be worked out – making the SDN more robust, and better native provisioning, deployment and configuration management support – I expect that all of these will duly be worked out in the months to come.

Keystone use cases for Big Data technology


“The lesson is, we all need to expose ourselves to the winds of change.” – Andy Grove 

“Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Senior Vice President, Gartner Research

There is widespread recognition and understanding that Big Data is not a technology challenge or a data management problem but rather a business opportunity. The last few posts highlighted that the challenges associated with the four V's preclude conventional IT approaches to managing and harnessing that precious commodity, business data.

While Big Data as an amorphous technology space is evolving and means different things to different groups within IT, the business insights gleaned from it help one outperform the competition. All of this leads to the conclusion that data is an enterprise asset and should be managed as one. Data silos follow organizational silos, and putting an end to this mindset only improves organizational collaboration.

With this in mind, I have tried to capture 17 proven horizontal use-cases (both customer facing and internal) for Big Data. These are horizontal in the sense that they’re common as a trend (and an opportunity) to virtually every industry vertical – financial services, healthcare, telecom etc.

  1. Improve customer satisfaction
  2. Detect Fraud before, as, and after it happens
  3. Deal with device data (read sensors, IoT etc)
  4. Unified log processing
  5. Improve enterprise security
  6. Engagement via social media
  7. Google search my corporate data haystack
  8. Single view of artifact – customer, patient, supplier etc
  9. Build a picture of your customer over time
  10. Manage Risk
  11. Augment my data warehouse
  12. Stay compliant with regulation
  13. Break down organizational silos
  14. Personalize business services to customers
  15. Analytics @ speed
  16. Dynamically adjust offerings – portfolios, product pricing, promotions etc
  17. Digitally transform my business

I will be elaborating on each of these over this week describing the use-case & real world customer scenarios.

Coming up this week..

For a while, I have been intending to blog about horizontal (read: cross-industry) use-cases and applications of Big Data architecture. I will be able to post something meaty early in the week. My intention is to describe horizontal business problems here and the architectural patterns that address them in later posts.

Big Data #2 – The velocity of Big Data & CEP vs Stream Processing

The term ‘Data In Motion’ is widely used to represent the speed at which large volumes of data are moving into corporate applications in a broad range of industries. Big Data also needs to be Fast Data. Fast in terms of processing potentially millions of messages per second while also accommodating multiple data formats across all these sources.

Stream Processing refers to the ability to process these messages (typically sensor or machine data) as soon as they hit the wire with a high event throughput to query ratio.

Some common use-cases are processing set-top box data from converged media products in the Cable & Telco space to analyze device performance or outage scenarios; fraud detection in the credit card industry; stock ticker and market feed data for financial instruments that must be analyzed in split-second time; smart meter data in the Utilities space; and driver monitoring in transportation. The data stream pattern could be batch or real-time depending on the nature of the business event.

While the structure and the ingress velocity of all this data varies depending on the industry and the use case, what stays common is that these data streams must all be analyzed and reacted upon as soon as possible, with an eye towards business advantage.

The leading Big Data stream processing technology is Storm. More on its architecture and design in subsequent posts. If you currently do not have a stream-based processing solution in place, Storm is a great choice – one that is robust, proven in the field and well supported as part of the Hortonworks HDP.
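
Once a topology is packaged, deploying it to a Storm cluster is a single command from the Storm CLI; a sketch, where the jar, class and topology names are placeholders:

$ storm jar stream-analytics.jar com.example.ClickStreamTopology prod-clicks   # submit the topology to the cluster
$ storm list                                                                   # confirm it is running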

[Image: Storm concepts]

A range of reference use cases is listed here –

http://storm.apache.org/documentation/Powered-By.html

Now, on the surface a lot of this sounds awfully close to Complex Event Processing (CEP). For those of you who have been in the IT industry for a while, CEP is a mature application domain and is defined by Wikipedia as below.

"Complex Event Processing, or CEP, is primarily an event processing concept that deals with the task of processing multiple events with the goal of identifying the meaningful events within the event cloud. CEP employs techniques such as detection of complex patterns of many events, event correlation and abstraction, event hierarchies, and relationships between events such as causality, membership, and timing, and event-driven processes."

However, these two technologies differ along the following lines –

  1. Target use cases
    CEP as a technology and product category has a very valid use-case when one is able to define analytics (based on simple business rules) on pre-modeled, discrete (rather than continuous) events. Big Data stream processing, on the other hand, enables one to apply a range of complex analytics that can infer across multiple aspects of every message.
  2. Scale – Compared to Big Data stream processing, CEP supports small to medium scale. You are typically talking a few tens of thousands of messages, while supporting latencies on the order of seconds to sub-seconds (depending on the CEP product). Big Data stream processing, on the other hand, operates across hundreds of thousands of messages every second while supporting a range of latencies. For instance, Hortonworks has benchmarked Storm at processing one million 100-byte messages per second per node. Thus, stream processing technologies have been proven at web scale, whereas CEP is still a niche and high-end capability depending on the vertical (e.g. low latency trading) or the availability of enlightened architects and developers who get the concept.
  3. Variety of data – CEP operates on structured data types (as opposed to stream processing, which is agnostic to the arriving data format), where the schema of the arriving events has been predefined into some kind of a fact model, e.g. a Java POJO.
    The business logic is then defined using a vendor-provided API.
  4. Type of analytics – CEP's strong suit is its ability to take atomic events and correlate them into compound events (via the API). Most of the commercially viable CEP engines thus provide a rules-based language to define these patterns – patterns which are temporal and spatial in nature, e.g. JBoss Drools DRL. Big Data stream processing, on the other hand, supports a superset of such analytical capabilities with map-reduce, predictive analytics, machine learning, advanced visualization and more.
  5. Scalability – CEP engines are typically small multi-node installs, with vertical scalability being the core model for scaling clusters in production. Stream processing, on the other hand, provides scalable, self-healing clusters across typical 2-CPU, dual-core boxes; the model of scalability and resilience is horizontal. Having said all of this, it is important to note that these two technologies (the Hadoop ecosystem and CEP) can also co-exist. Many customers are looking to build "live" data marts where data is kept in memory as it streams in, a range of queries are continuously applied, and realtime views are then shown to end users. Microsoft, for example, has combined their StreamInsight CEP engine with Apache Hadoop, where MapReduce is used to run scaled-out reducers performing complex event processing over the data partitions.

    [Image: StreamInsight and Hadoop]
    Source – http://blogs.msdn.com/b/microsoft_business_intelligence1/archive/2012/02/22/big-data-hadoop-and-streaminsight.aspx

Big Data #1 – the 4V’s of Big Data

“Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Source – Gartner Research

Industry analysts widely describe the 3 V's (Volume, Velocity and Variety) as the trifecta from a definition perspective. Let's add a fourth V to it, Veracity, which pertains to the signal-to-noise ratio and the concomitant problem of unclean data.

The below infographic from IBM’s research group does a tremendous job of describing each of the V’s in one succinct picture.

[Image: IBM's four V's of Big Data infographic]

Src – http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Volume refers to the rapidly expanding data volumes. We now create as much information every two days as mankind created from the start of time until 2003. In the past, only humans created all this data; now it's machines as well (see IoT).

Thus, Volume = tremendous scale (with a diverse variety of data).

One can see many industry examples in the graphic. Prior to the emergence of projects in the Apache Hadoop ecosystem, a lot of this data was getting trashed, as organizations did not know how to accommodate it with conventional RDBMS/EDW techniques.

Existing approaches simply cannot scale and handle all this data.

A modern data platform built on Hadoop changes all that as we will see in subsequent posts.

Velocity refers to the speed at which these feeds (spanning a multitude of business clients ranging from sensors in cars, power generation & distribution, RFIDs in manufacturing, stock ticks in financial services and social media feeds) are moving into corporate applications – which then need to cleanse, normalize and sort the actionable information from the haystack. The other important aspect of Velocity is that all this new data is not just batch oriented but also streaming, real-time & near-time.

This velocity is both ingress velocity ("the speed at which these feeds can be sent into your architecture") and egress velocity ("the speed at which actionable results need to be gleaned from this data").

Variety refers to the emergence of semi-structured and unstructured data. In the past, data used to be processed and stored in formats compatible with spreadsheets and relational databases. Now you get all sorts of data: photos, tweets, XML, emails and so on. This adds tremendous strain on techniques for collecting, processing and storing data; existing approaches are simply unable to accommodate it and scale.

The fourth V, Veracity, refers to the data governance process for all this data: how the development and management of all these data pipelines is to be handled, with a focus on metadata management, data lineage and data cleanliness. The business impetus is the need to reason on data that is consistent and correct.

According to the above model, the challenges of big data management result from the expansion of all four properties, rather than just the volume alone — the sheer amount of data to be managed.

One question I get very frequently is whether existing technology like Complex Event Processing (CEP) can handle the velocity aspect of Big Data, or the observation that processing big data volumes seems awfully close to what people have encountered with CEP before.

Let’s talk about this in the next post before addressing the other V’s one by one.

Six new age IT capabilities that financial services CIOs should look to leverage..

As data growth continues unabated and cost pressures continue to pile up with increased line of business demands for IT agility; the CIO in financial services is at a crossroads.

Couple this with the need to make bold and big bets in strategically important business areas like risk management, compliance and trading systems, and you have all the makings of a perfect technology storm.


Figure 1 – IT is changing from service provider to service partner

As Brett King notes in his seminal work Bank 3.0, this is the age of the hyper-connected consumer. Customers expect to be able to bank from anywhere, be it a mobile device or internet banking from their personal computer. IT is thus shaping enterprise destiny more than at any point in the past.

Thus there is significant pressure on Banking infrastructures in three major ways –

  • to be able to adapt to this new way of doing things and to be able to offer multiple channels and avenues for such consumers to come in and engage with the business
  • offer agile applications that can detect customer preferences and provide value added services on the fly. Services that not only provide a better experience but also help in building a longer term customer relationship
  • to be able to help the business prototype, test, refine and rapidly develop new business capabilities

The core nature of Corporate IT is thus changing from an agency dedicated to keeping the trains running on time to one that is focused on innovative approaches like being able to offer IT as a service (much like a utility) as discussed above. It is a commonly held belief that large Banks are increasingly turning into software development organizations.


Figure 2 – IT operations faces pressures from both business and line of business development teams

The example of web based start-ups forging ahead of established players in areas like mobile payments is a result of the former's ability to quickly develop, host, and scale applications in a cloud environment. The benefits of rapid application and business capability development have largely been missing from Bank IT's private application platforms in the enterprise data center.


Figure 3 – Business Applications need to be brought to market faster at a very high velocity

  1. Business process and workflow automation not only to remove the dependence on hardwired and legacy processes but also to understand and refine the current state of business processes
  2. In memory/Complex event processing capabilities to be able to detect and react to Big Data streams like social media feeds, payment data. The applicability of these technologies in areas like fraud detection has been well documented
  3. In the data space, the burgeoning Hadoop ecosystem now provides the ability to leverage any processing paradigm, from batch to interactive to real-time. Complementary technologies like data grids and data virtualization enable not only faster analytics but also 360-degree views of customers for regulatory and compliance stakeholders. Such modern data architectures can minimize risk and maximize opportunity
  4. Open Integration that focuses on bringing together disparate capabilities across the enterprise
  5. Cloud computing as an enabler for infrastructure agility and application mobility. Be able to leverage a mix of private and public cloud capabilities in a secure manner.
  6. Platform as a Service (PaaS) to deliver faster time to market, and faster change, for critical applications and business capabilities delivered as X-as-a-Service (XaaS)

A final word of advice in this age of cost cutting is for executives to look to leverage open source where possible across the above areas. The world's leading web properties and Fortune 1000 institutions use a variety of open source technologies and have shown the way forward. Robust and well supported offerings are now available across the above technology spectrum, and they can help cut IT budgets by billions of dollars.