Vamsi Chemitiganti's weekly musings on applying Big Data, Cloud, & Middleware technology to solving industry challenges. Published every Friday or Sunday (if I'm very busy). All opinions are entirely my own. I write this blog so my readers don't have to spend money on expensive consultants.
Distributed Ledger Technology (DLT) and applications built on DLTs – such as cryptocurrencies – are arguably the hottest topics in tech. This post summarizes seven key blogs on the topic of Blockchain and Bitcoin published at VamsiTalksTech.com. It aims to serve as a handy guide for business and technology audiences tasked with understanding and implementing this groundbreaking technology.
We have been discussing the capabilities of Blockchain and Bitcoin for quite some time on this blog. The impact of Blockchain on many industries is now clearly apparent. But can the DLT movement enable business efficiency and profitability?
# 1 – Introduction to Bitcoin –
Bitcoin (BTC) is truly the first decentralized, peer-to-peer, highly secure and purely digital currency. Bitcoin & its cousins such as Ether and AltCoins now regularly get widespread (& mostly positive) notice from a range of industry actors – from Consumers and Banking Institutions to Retailers & Regulators. Riding on the real pathbreaker – the Blockchain, the cryptocurrency movement will help drive democratization in the financial industry and society at large in the years to come. This blog post discusses BTC from a business standpoint.
The term Blockchain is derived from a design pattern that describes a chain of data blocks that map to individual transactions. Each transaction that is conducted in the real world (e.g. a Bitcoin wire transfer) results in the creation of new blocks in the chain. Each new block is created by calculating a cryptographic hash of its previous block and storing that hash in the new block, thus constructing a chain of blocks – hence the name. This post introduces the business potential of Blockchain to the reader.
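The chaining idea described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how Bitcoin itself is implemented: the block layout, the function names (`block_hash`, `append_block`, `is_valid`) and the all-zero genesis hash are assumptions made for the example, and there is no mining, networking or consensus here.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for "no previous block"


def block_hash(index, transactions, previous_hash):
    """Return the SHA-256 digest of a block's contents, including the previous block's hash."""
    payload = json.dumps(
        {"index": index, "transactions": transactions, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, transactions):
    """Create a new block that points at the hash of the current last block."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    block = {
        "index": index,
        "transactions": transactions,
        "previous_hash": previous_hash,
        "hash": block_hash(index, transactions, previous_hash),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Recompute every hash and check that each block points at its predecessor."""
    for i, block in enumerate(chain):
        expected_previous = chain[i - 1]["hash"] if i > 0 else GENESIS_HASH
        if block["previous_hash"] != expected_previous:
            return False
        recomputed = block_hash(block["index"], block["transactions"], block["previous_hash"])
        if block["hash"] != recomputed:
            return False
    return True
```

Because every block's hash covers the previous block's hash, editing an old transaction changes that block's recomputed hash and `is_valid` returns False. This is the property that makes the record hard to alter quietly.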
# 5 – How Blockchain will lead to Industry disruption –
Blockchain lies at the heart of the Bitcoin implementation & is easily the most influential part of the BTC platform ecosystem. Blockchain is thus both a technology platform and a design pattern for building global scale industry applications that make all of the above possible. Its design allows it to serve as a platform for digital currency and also enables it to indelibly record any kind of transaction – be it a currency, a medical record, supply chain data, or a document.
# 7 – Key Considerations in Adapting the Blockchain for the Enterprise –
With advances in various Blockchain based DLT (distributed ledger technology) platforms such as Hyperledger & Ethereum, enterprises have begun to take baby steps to adapt the Blockchain (BC) to industrial scale applications. This post discusses some of the stumbling blocks the author has seen enterprises run into as they look to adopt Blockchain based Distributed Ledger Technology in real-world applications.
The true disruption of Blockchain based distributed ledgers will lie in moving companies to an operating model where they leave behind siloed and stovepiped business processes, to the next generation of distributed business processes predicated on a seamless global platform. The DLT based platform will enable the easy transaction, exchange, and contracting of digital assets. However, before enterprises rush in, they need to perform an adequate degree of due diligence to avoid some of the pitfalls we have highlighted above.
We are in the middle of a series of blogs on Software Defined Datacenters (SDDC) @ http://www.vamsitalkstech.com/?p=1833. The key business imperative driving the SDDC architectures is their ability to natively support digital applications. Digital applications are “Cloud Native” (CN) in the sense that these platforms are originally being written for cloud frameworks – instead of being ported over to the Cloud as an afterthought. Thus, Cloud Native application development is emerging as the most important trend in digital platforms. This blog post will define the seven key architectural characteristics of these CN applications.
What is driving the need for Cloud Native Architectures…
The previous post in the blog covered the monolithic architecture pattern. Monolithic architectures, which currently dominate the enterprise landscape, are coming under tremendous pressure in various ways and are increasingly being perceived to be brittle. Chief among these forces are massive user volumes, DevOps style development processes, the need to open up business functionality locked within applications to partners, and the heavy human requirement to deploy & manage monolithic architectures. Monolithic architectures also introduce technical debt into the datacenter – which makes it very difficult for the business lines to introduce changes as customer demands change – which is a key antipattern for digital deployments.
Applications that require a high release velocity and have many complex moving parts, worked on by anything from a few to many development teams, are an ideal fit for the CN pattern.
Introducing Cloud Native Applications…
There is no single and universally accepted definition of a Cloud Native application. I would like to define a CN Application as “an application built using a combination of technology paradigms that are native to cloud computing – including distributed software development, a need to adopt DevOps practices, microservices architectures based on containers, API based integration between the layers of the application, software automation from infrastructure to code, and finally orchestration & management of the overall application infrastructure.”
Further, Cloud Native applications need to be architected, designed, developed, packaged, delivered and managed based on a deep understanding of the frameworks of cloud computing (IaaS and PaaS).
Characteristic #1 CN Applications dynamically adapt to & support massive scale…
The first & foremost characteristic of a CN Architecture is the ability to dynamically support massive numbers of users, large development organizations & highly distributed operations teams. This requirement is even more critical when one considers that cloud computing is inherently multi-tenant in nature.
Within this area, the following typical concerns need to be accommodated –
the ability to grow the deployment footprint dynamically (Scale-up) as well as to decrease the footprint (Scale-down)
the ability to gracefully handle failures across tiers that can disrupt application availability
the ability to accommodate large development teams by ensuring that components themselves provide loose coupling
the ability to work with virtually any kind of infrastructure (compute, storage and network) implementation
Characteristic #2 CN applications need to support a range of devices and user interfaces…
The User Experience (UX) is the most important part of a human facing application. This is particularly true of Digital applications which are omnichannel in nature. End users could not care less about the backend engineering of these applications as they are focused on an engaging user experience.
Accordingly, CN applications need to natively support mobile applications. This includes the ability to support a range of mobile backend capabilities – ranging from authentication & authorization services for mobile devices, location services, customer identification, push notifications, cloud messaging, toolkits for iOS and Android development etc.
Characteristic #3 They are automated to the fullest extent they can be…
The CN application needs to be abstracted completely from the underlying infrastructure stack. This is key as development teams can focus solely on writing their software and do not need to worry about the maintenance of the underlying OS/Storage/Network. One of the key challenges with monolithic platforms (http://www.vamsitalkstech.com/?p=5617) is their inability to efficiently leverage the underlying infrastructure as they have a high degree of dependency on it. Further, the lifecycle of infrastructure provisioning, configuration, deployment, and scaling is mostly manual with lots of scripts and pockets of configuration management.
The CN application, on the other hand, has to be very light on manual tasks given its scale. The provision-deploy-scale cycle is highly automated with the application automatically scaling to meet demand and resource constraints and seamlessly recovering from failures. We discussed Kubernetes in one of the previous blogs.
Frameworks like these support CN Applications in providing resiliency, fault tolerance and in generally supporting very low downtime.
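To make the idea of automatic scaling concrete, here is a minimal Python sketch of a proportional scaling rule. It is loosely modeled on the rule documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric)); the function name, the default bounds and the sample numbers are assumptions for illustration, and a real autoscaler adds tolerances and stabilization behavior that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to how far the observed metric is from its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the allowed footprint so a metric spike cannot scale without bound.
    return max(min_replicas, min(max_replicas, raw))
```

For example, with 4 replicas running at 90% average CPU against a 60% target, the rule asks for 6 replicas; when load drops to 20%, it scales down to 2.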
Characteristic #4 They support Continuous Integration and Continuous Delivery…
The reduction of the vast amount of manual effort witnessed in monolithic applications is not just confined to their deployment as far as CN applications are concerned. From a CN development standpoint, the ability to quickly test and perform quality control on daily software updates is an important aspect. CN applications automate the application development and deployment processes using the paradigms of CI/CD (Continuous Integration and Continuous Delivery).
The goal of CI is that every time source code is added or modified, the build process kicks off & the tests are run immediately. This helps catch errors faster and improves the quality of the application. Once the CI process is done, the CD process builds the application into an artifact suitable for deployment after combining it with suitable configuration. It then deploys it onto the execution environment with the appropriate identifiers for versioning in a manner that supports rollback. CD ensures that the tested artifacts are instantly deployed with acceptance testing.
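The "versioned deployment that supports rollback" step can be illustrated with a small Python sketch. It is only a toy model of the bookkeeping involved, not any particular CI/CD tool: the `ReleaseManager` class, its method names and the version strings are all hypothetical.

```python
class ReleaseManager:
    """Tracks deployed versions so that a bad release can be rolled back."""

    def __init__(self):
        self.history = []  # deployed versions, oldest first

    def deploy(self, version, tests_passed):
        """Deploy only artifacts whose CI tests passed, and record the version."""
        if not tests_passed:
            raise RuntimeError(f"refusing to deploy {version}: tests failed")
        self.history.append(version)
        return version

    def rollback(self):
        """Drop the current version and return to the previous one."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.history[-1]

    @property
    def current(self):
        """The version currently live, or None if nothing has been deployed."""
        return self.history[-1] if self.history else None
```

The design choice worth noting is that the test gate sits in front of `deploy`, so an untested artifact never enters the version history in the first place.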
Characteristic #5 They support multiple datastore paradigms…
The RDBMS has been a fixture of the monolithic application architecture. CN applications, however, need to work with data formats of the loosely structured kind as well as the regularly structured data. This implies the need to support data streams that are not just high speed but also are better suited to NoSQL/Hadoop storage. These systems provide Schema on Read (SOR) which is an innovative data handling technique. In this model, a format or schema is applied to data as it is accessed from a storage location as opposed to doing the same while it is ingested. As we will see later in the blog, individual microservices can have their own local data storage.
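A small Python sketch may help illustrate Schema on Read. The raw events are stored exactly as they arrived, and each consumer applies its own schema only when it reads them. The sample events, field names and the `read_with_schema` helper are assumptions made up for this example.

```python
import json

# Raw events stored as-is at ingestion time; no schema was enforced on write.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "channel": "mobile"}',
    '{"user": "u2", "amount": "5", "coupon": "SPRING"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    schema maps field name -> (converter, default); fields missing from a
    record get the default, and fields outside the schema are ignored.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: converter(record[field]) if field in record else default
            for field, (converter, default) in schema.items()
        }


# Two consumers read the same raw data through different schemas.
BILLING_SCHEMA = {"user": (str, None), "amount": (float, 0.0)}
MARKETING_SCHEMA = {"user": (str, None), "coupon": (str, "")}
```

Calling `list(read_with_schema(RAW_EVENTS, BILLING_SCHEMA))` yields typed billing rows, while the marketing schema reads the same lines and sees only users and coupons. Neither view required changing how the data was stored.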
Characteristic #6 They support APIs as a key feature…
APIs have become the de facto model that provides developers and administrators with the ability to assemble Digital applications such as microservices using complicated componentry. Thus, there is a strong case to be made for adopting an API centric strategy when developing CN applications. CN applications use APIs in multiple ways – firstly as the way to interface loosely coupled microservices (which abstract out the internals of the underlying application components). Secondly, developers use well-defined APIs to interact with the overall cloud infrastructure services. Finally, APIs enable the provisioning, deployment, and management of platform services.
Characteristic #7 Software Architecture based on microservices…
As James Lewis and Martin Fowler define it – “...the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.”
Microservices are a natural evolution of the Service Oriented Architecture (SOA). The application is decomposed into loosely coupled business functions and mapped to microservices. Each microservice is built for a specific granular business function and can be worked on by an independent developer or team. As such it is a separate code artifact and is thus loosely coupled not just from a communication standpoint (typically communication using a RESTful API with data being passed around using a JSON/XML representation) but also from a build, deployment, upgrade and maintenance process perspective. Each microservice can optionally have its localized datastore. An important advantage of adopting this approach is that each microservice can be created using a separate technology stack from the other parts of the application. Docker containers are the right choice to run these microservices on. Microservices confer a range of advantages, including easier builds, independent deployment and independent scaling.
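As a rough illustration of one such service, the sketch below uses only the Python standard library to expose a single granular business function (price lookup) over a RESTful JSON API, backed by its own local datastore. The service name, the `/prices/<sku>` route and the in-memory price table are hypothetical; a production service would add a real datastore, authentication, and packaging into a container.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical local datastore owned by this one service only.
_PRICES = {"sku-1": 9.5, "sku-2": 20.0}


def get_price(sku):
    """The single business function this service owns; returns (status, body)."""
    if sku in _PRICES:
        return 200, {"sku": sku, "price": _PRICES[sku]}
    return 404, {"error": "unknown sku"}


class PricingHandler(BaseHTTPRequestHandler):
    """Maps GET /prices/<sku> onto the business function and replies in JSON."""

    def do_GET(self):
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "prices":
            status, body = get_price(parts[1])
        else:
            status, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example quiet; a real service would log requests


def make_server(port=0):
    """Build the HTTP server; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), PricingHandler)
```

Calling `make_server(8080).serve_forever()` would start the service locally. Other parts of the application only ever see the HTTP/JSON contract, so this service could later be rewritten in another language or moved to a different datastore without its callers changing.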
A Note on Security…
It goes without saying that security is a critical part of CN applications and needs to be considered and designed for as a cross-cutting concern from the inception. Security concerns impact the design & lifecycle of CN applications ranging from deployment to updates to image portability across environments. A range of technology choices is available to cover various areas such as Application level security using Role-Based Access Control, Multifactor Authentication (MFA), and A&A (Authentication & Authorization) using protocols such as OAuth, OpenID, SSO etc. Container Security is fundamental to this topic, and there are many vendors working on ensuring that once the application is built as part of a CI/CD process as described above, it is packaged into labeled (and signed) containers which can be made part of a verified and trusted registry. This ensures that container image provenance is well understood and also protects any users who download the containers for use across their environments.
In this post, we have tried to look at some architecture drivers for Cloud-Native applications. It is a given that organizations moving from monolithic applications will need to take nimble, small steps to realize the ultimate vision of business agility and technology autonomy. The next post, however, will look at some of the critical foundational investments enterprises will have to make before choosing the Cloud Native route as a viable choice for their applications.
“There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.” – Donald Rumsfeld, 2002, Fmr US Secy of Defense
With machine learning increasing in popularity and adoption across industries, models are increasing in number and scope. McKinsey estimates that large enterprises have seen an increase of about 10 – 25% in their complex models which are being employed across areas as diverse as customer acquisition, risk management, insurance policy management, insurance claims processing, fraud detection and other advanced analytics. However, this increase is accompanied by a rise in model risk, where incorrect model results or design contribute to erroneous business decisions. In this blog post, we discuss the need for model risk management (MRM) and a generic framework to achieve the same from an industry standpoint.
Model Risk Management in the Industry
The Insurance industry has extensively used predictive modeling across a range of business functions including policy pricing, risk management, customer acquisition, sales, and internal financial functions. However, as predictive analytics has become increasingly important, there is always a danger, or a business risk, incurred due to the judgment of the models themselves. While the definition of a model can vary from one company to another, we would like to define a model as a representation of some real-world phenomenon based on the real-world inputs (both quantitative and qualitative) shown to it, which is generated by operating on the inputs using an algorithm to produce a business insight or decision. The model can also provide some level of explanation for the reasons it arrived at the corresponding business insight. There are many ways to create and deliver models to applications. These vary from spreadsheets to specialized packages and platforms. We have covered some of these themes from a model development perspective in a previous blog @ – http://www.vamsitalkstech.com/?p=5321.
Models confer a multitude of benefits, namely:
The ability to reason across complex business scenarios spanning customer engagement, back-office operations, and risk management
The ability to automate decision-making based on historical patterns across large volumes of data
The audit-ability of the model which can explain to the business user how the model arrived at a certain business insight
The performance and the composition of a model depend on the intention of the designer. The reliability of the model depends primarily on access to adequate and representative data and secondly on the ability of the designer to model complex real-world scenarios and not always assume best-case scenarios.
As the financial crisis of 2008 illustrated, the failure of models brought down the insurance company AIG, which caused severe disruption to the global financial system and set off the wider crisis in the global economy. Over the last few years, Machine Learning models have been increasingly adopted into key business processes. If these models go wrong, they can cause severe operational losses. This underscores the importance of putting in place a strategic framework for managing model risk.
A Framework for Model Risk Management
The goal of Model Risk Management (MRM) is to ensure that the entire portfolio of models is governed like any other business asset. To that effect, a Model Risk Management program needs to include the following elements:
Model Planning – The first step in the MRM process is to form a structure by which models are created across the business in a strategic and planned manner. This phase covers the ability to ensure that model objectives are well defined across the business, duplication is avoided, best practices around model development are followed, and modelers are provided with the right volumes of high-quality data to create the most effective models possible. We have covered some of these themes around data quality in a previous blogpost @ http://www.vamsitalkstech.com/?p=5396
Model Validation & Calibration – As models are created for specific business functions, they must be validated for precision, and calibrated to reflect the correct sensitivity & specificity that the business would like to allow for. Every objective could have its own “sweet spot” (i.e. threshold) that the business wants to attain by using the model. For example: a company that wants to go green but realizes that not all of its customers have access to (or desire to use) electronic modes of communication might want to send out the minimum number of flyers that can get the message out, keeping its carbon footprint to a minimum without losing revenue by failing to reach the correct set of customers. All business validation is driven by the business objectives that must be reached and how much wiggle room there is for negotiation.
Model Management – Models that have made it to this stage must now be managed. Management here means answering questions such as: who should use what model for what purpose, how long should the models be used without re-evaluation, what are the criteria for re-evaluation, who will monitor the usage to prevent wrong usage, etc. Management also deals with logistics like where do the models reside, how are they accessed & executed, who gets to modify them versus just use them, how will they be swapped out when needed without disrupting business processes dependent on them, how should they be versioned, can multiple versions of a model be deployed simultaneously, how to detect data fluctuations that will disrupt model behavior prior to it happening, etc.
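The calibration step above can be sketched in Python. The example computes sensitivity and specificity at a given score threshold and then picks the highest threshold (in the flyer example, the fewest flyers sent) that still meets a minimum sensitivity set by the business. The function names, the scores and the labels are made up for illustration and are not taken from any real model.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives when scores >= threshold are flagged."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def highest_threshold_meeting(scores, labels, candidates, min_sensitivity):
    """Return the highest candidate threshold whose sensitivity meets the business minimum."""
    for threshold in sorted(candidates, reverse=True):
        sensitivity, _ = sensitivity_specificity(scores, labels, threshold)
        if sensitivity >= min_sensitivity:
            return threshold
    return None  # no candidate satisfies the objective


# Hypothetical model scores and true outcomes (1 = customer needs a paper flyer).
SCORES = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
LABELS = [1, 1, 0, 1, 0, 0]
```

With this data, requiring a sensitivity of at least 0.6 selects the threshold 0.75 from the candidates 0.35, 0.5 and 0.75, while requiring 0.9 forces the threshold down to 0.35. This is exactly the trade-off between reach and volume that the business has to negotiate.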
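Some of the logistics questions above (versioning, running multiple versions side by side, swapping a model without disrupting its callers) can be sketched with a tiny in-memory registry in Python. The `ModelRegistry` class and its method names are hypothetical, and the "models" are plain functions standing in for real ones.

```python
class ModelRegistry:
    """Keeps every registered model version and tracks which one is active."""

    def __init__(self):
        self._versions = {}  # (name, version) -> callable model
        self._active = {}    # name -> version currently served by default

    def register(self, name, version, model):
        """Store a new version; existing versions are never overwritten."""
        key = (name, version)
        if key in self._versions:
            raise ValueError(f"{name} {version} is already registered")
        self._versions[key] = model

    def promote(self, name, version):
        """Make a registered version the default without touching callers."""
        if (name, version) not in self._versions:
            raise KeyError(f"{name} {version} is not registered")
        self._active[name] = version

    def predict(self, name, features, version=None):
        """Use the active version unless the caller pins a specific one."""
        chosen = version if version is not None else self._active[name]
        return self._versions[(name, chosen)](features)
```

Callers that invoke `predict("churn", features)` keep working when a new version is promoted, while an audit or A/B comparison can still pin an older version explicitly.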
Model Governance – Model Governance covers some of the most strategic aspects of Model Risk Management. The key goal of this process is to ensure that the models are being managed in conformance with industry governance and are being managed with a multistage process across their lifecycle – from Initiation to Business Value to Retirement.
Regulatory Guidance on Model Risk Management
The most authoritative guide on MRM comes from the Federal Reserve System – FRB SR 11-7/OCC Bulletin 2011-12. Though it is not directly applicable to the insurance industry (it’s meant mainly for the banking industry), its framework is considered by many to contain thought leadership on this topic. The SR 11-7 framework includes documentation as part of model governance. An article in the Society of Actuaries April 2016 Issue 3 details a thorough method to use for documenting a model, the process surrounding it, and why such information is necessary. In a highly regulated industry like insurance, every decision made (e.g. assumptions made, judgment calls given circumstances at the time, etc.) in the process of creating a model could be brought under scrutiny & affects the risk of the model itself. With adequate documentation you can attempt to mitigate any risks you can foresee, and have a good starting point for those that might blindside you down the road.
And Now a Warning…
Realize that even after putting MRM into place, models are still limited – they cannot cope with what Donald Rumsfeld dubbed the “unknown unknowns”. As stated in an Economist article: “Almost a century ago Frank Knight highlighted the distinction between risk, which can be calibrated in probability distributions, and uncertainty, which is more elusive and cannot be so neatly captured…The models may have failed but it was their users who vested too much faith in them”. Models, by their definition, are built using probability distributions based on previous experience to predict future outcomes. If the underlying probability distribution changes radically, they can no longer attempt to predict the future – because the assumption upon which they were built no longer holds. Hence the human element must remain vigilant and not put all their eggs into the one basket of automated predictions. A human should always question if the results of a model make sense and intervene when they don’t.
As the saying goes – “Models do not kill markets, people do.” A model is only as good as the assumptions and algorithm choices made by the designer, as well as the quality & scope of the data fed to it. Enterprises therefore need to put in place an internal model risk management program that ensures that their portfolio of models is constantly updated, enriched with data, and managed as any other strategic corporate asset. And never forget that a knowledgeable human must remain in the loop.
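One small, partial safeguard is to check whether recent input data still looks like the data the model was built on. The Python sketch below flags a shift in the mean of a single input using a simple z-score; the function name, the limit of 3.0 and the sample values are assumptions for illustration. A check like this only catches one narrow kind of change, so it supports the human reviewer rather than replacing them.

```python
import statistics


def mean_shift_alert(baseline, recent, z_limit=3.0):
    """Return True when the recent mean is implausibly far from the baseline mean."""
    baseline_mean = statistics.mean(baseline)
    baseline_stdev = statistics.stdev(baseline)
    recent_mean = statistics.mean(recent)
    if baseline_stdev == 0:
        # A constant baseline: any movement at all is a change.
        return recent_mean != baseline_mean
    # Standard error of the mean for a sample the size of the recent window.
    standard_error = baseline_stdev / len(recent) ** 0.5
    z_score = abs(recent_mean - baseline_mean) / standard_error
    return z_score > z_limit


# Hypothetical values of one model input at training time.
BASELINE = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
```

A recent window such as `[10, 9, 11, 10]` raises no alert, while `[15, 16, 14, 15]` does, which is the cue for a person to ask whether the model's results still make sense.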
As times change, so do architectural paradigms in software development. For the more than fifteen years that the industry has been developing large scale JEE/.NET applications, the three-tier architecture has been the dominant design pattern. However, as enterprises embark or continue on their Digital Journey, they are facing a new set of business challenges which demand fresh technology approaches. We have looked into transformative data architectures in great depth in this blog; let us now consider a rethink of the Applications themselves. Applications that were earlier deemed to be sufficiently well-architected are now termed as being monolithic. This post solely focuses on the underpinnings of why legacy architectures will not work in the new software-defined world. My intention is not to criticize a model (the three-tier monolith) that has worked well in the past but merely to reason why it may be time for a generally well accepted newer paradigm.
Traditional Software Platform Architectures…
Digital applications support a wider variety of frontends & channels, they need to accommodate larger volumes of users, they need wider support for a range of business actors – partners, suppliers et al via APIs. Finally, these new age applications need to work with unstructured data formats (as opposed to the strictly structured relational format). From an operations standpoint, there is a strong need for a higher degree of automation in the datacenter. All of these requirements call for agility as the most important construct in the enterprise architecture.
As we will discuss, legacy applications (typically defined as created more than five years ago) are beginning to emerge as one of the key obstacles in doing Digital. The issue is not just in the underlying architectures themselves but also in the development culture involved in building and maintaining such applications.
Consider the vast majority of applications deployed in enterprise data centers. These applications deliver collections of very specific business functions – e.g. onboarding new customers, provisioning services, processing payments etc. Whatever the choice of vendor application platform, the vast majority of existing enterprise applications & platforms essentially follow a traditional three-tier software architecture with specific separation of concerns at each tier (as the vastly simplified illustration depicts below).
The first tier is the Presentation tier which is depicted at the top of the diagram. The job of the presentation tier is to present the user experience. This includes the user interface components that present various clients with the overall web application flow and also renders UI components. A variety of UI frameworks that provide both flow and UI rendering is typically used here. These include Spring MVC, Apache Struts, HTML5, AngularJS et al.
The middle tier is the Business logic tier where all the business logic for the application is centralized while separating it from the user interface layer. The business logic is usually a mix of objects and business rules written in Java using frameworks such as EJB3, Spring etc. The business logic is housed in an application server such as JBoss AS or Oracle WebLogic AS or IBM WebSphere AS – which provides enterprise services (such as caching, resource pooling, naming and identity services et al) to the business components run on these servers. This layer also contains data access logic and also initiates transactions to a range of supporting systems – message queues, transaction monitors, rules and workflow engines, ESB (Enterprise Service Bus) based integration, accessing partner systems using web services, identity, and access management systems et al.
The Data tier is where traditional databases and enterprise integration systems logically reside. The RDBMS rules this area in three-tier architectures & the data access code is typically written using an ORM (Object Relational Mapping) framework such as Hibernate or iBatis or plain JDBC code.
Across all of these layers, common utilities & agents are provided to address cross-cutting concerns such as logging, monitoring, security, single sign-on etc.
The application is packaged as an enterprise archive (EAR) which can be composed of a single or multiple WAR/JAR files. While most enterprise-grade applications are neatly packaged, the total package is typically compiled as a single collection of various modules and then shipped as one single artifact. It should bear mentioning that dependency & version management can be a painstaking exercise for complex applications.
Let us consider the typical deployment process and setup for a three-tier application.
From a deployment standpoint, static content is typically served from an Apache webserver which fronts a Java-based webserver (mostly Tomcat) and then a cluster of backend Java-based application servers running multiple instances of the application for High Availability. The application is Stateful (and Stateless in some cases) in most implementations. The rest of the setup with firewalls and other supporting systems is fairly standard.
While the above architectural template is fairly standard across industry applications built on Java EE, there are some very valid reasons why it has begun to emerge as an anti-pattern when applied to digital applications.
Challenges involved in developing and maintaining Monolithic Applications …
Let us consider what Digital business use cases demand of application architecture and where the monolith falls short.
The entire application is typically packaged as a single enterprise archive (EAR file), which is a combination of various WAR and JAR files. While this certainly makes the deployment easier given that there is only one executable to copy over, it makes the development lifecycle a nightmare. The reason is that even a simple change in the user interface can cause a rebuild of the entire executable. This results in not just long cycles but makes it extremely hard on teams that span various disciplines from the business to QA.
What follows from such long “code-test & deploy” cycles is that the architecture becomes change resistant, the code becomes very complex over time, and the system as a whole becomes slow to respond to rapidly changing business requirements.
Developers are constrained in multiple ways. Firstly the architecture becomes very complex over a period of time which inhibits quick new developer onboarding. Secondly, the architecture force-fits developers from different teams into working in lockstep thus forgoing their autonomy in terms of their planning and release cycles. Services across tiers are not independently deployable which leads to big bang releases in short windows of time. Thus it is no surprise that failures and rollbacks happen at an alarming rate.
From an infrastructure standpoint, the application is tightly coupled to the underlying hardware. From a software clustering standpoint, the application scales better vertically while also supporting limited horizontal scale-out. As volumes of customer traffic increase, performance across clusters can degrade.
The Applications are neither designed nor tested to operate gracefully under failure conditions. This is a key point which does not really get that much attention during design time but causes performance headaches later on.
An important point is that Digital applications & their parts are beginning to be created using different languages such as Java, Scala, and Groovy etc. The Monolith essentially limits such a choice of languages, frameworks, platforms and even databases.
The Architecture does not natively support the notion of API externalization or Continuous Integration and Delivery (CI/CD).
As highlighted above, the architecture primarily supports the relational model. If you need to accommodate alternative data approaches such as NoSQL or Hadoop, you are largely out of luck.
Operational challenges involved in running a Monolithic Application…
The difficulties in running a range of monolithic applications across an operational infrastructure have already been summed up in the other posts on this blog.
The primary issues include –
The Monolithic architecture typically dictates a vertical scaling model, which limits its scalability as users increase. The traditional approach to ameliorate this has been to invest in multiple sets of hardware (servers, storage arrays) to physically separate applications, which results in increases in running cost, a higher personnel requirement and manual processes around system patching and maintenance.
Capacity management tends to be a bit of a challenge as there are many fine-grained resources competing for compute, network and storage resources (vCPU, vRAM, virtual Network etc) that are essentially running on a single JVM. Lots of JVM tuning is needed from a test and pre-production standpoint.
A range of functions needed to be performed around monolithic Applications lack any kind of policy-driven workload and scheduling capability. This is because the Application does very little to drive the infrastructure.
The vast majority of the work needed to be done to provision, schedule and patch these applications is done by system administrators and consequently, automation is minimal at best.
The same is true in Operations Management. Functions like log administration, other housekeeping, monitoring, auditing, app deployment, and rollback are largely manual with some scripting.
It deserves mention that the above Monolithic design pattern will work well for Departmental (low user volume) applications which have limited business impact and for applications serving a well-defined user base with well delineated workstreams. The next blog post will consider the microservices way of building new age architectures. We will introduce and discuss Cloud Native Application development which has been popularized across web-scale enterprises esp Netflix. We will also discuss how this new paradigm overcomes many of the above-discussed limitations from both a development and operations standpoint.
The fourth and previous blog in this seven part series on Software Defined Datacenters (@ http://www.vamsitalkstech.com/?p=5010) discussed how Linux Containers & Docker are emerging as a key component of digital applications. We looked at various drivers & challenges stemming from running Containerized Applications from both a development and IT operations standpoint. In the fifth blog in this series, we will discuss another key emergent technology – Google’s Kubernetes (k8s) – which acts as the foundational runtime orchestrator for large scale container infrastructure. We will take the discussion higher up the stack in the next blog with OpenShift – Red Hat’s PaaS (Platform as a Service) platform – which includes Kubernetes and provides a powerful, agile & polyglot environment to build and manage microservices based applications.
The Importance of Container Orchestration…
With Cloud Native application development emerging as the key trend in Digital platforms, containers offer a natural choice for a variety of reasons within the development process. In a nutshell, Containers are changing the way applications are being architected, designed, developed, packaged, delivered and managed. That is why Container Orchestration has become a critical “must have”: for enterprises to derive tangible business value, they must be able to run large scale containerized applications.
While containers have existed in Unix based operating systems such as Solaris and FreeBSD, pioneering work in the Linux OS community has led to the mainstreaming of this disruptive technology. Now, despite all the benefits afforded to both developers and IT Operations by containers, there are critical considerations involved in running containers at scale in complex n-tier real world applications across multiple datacenters.
What are some of the key considerations in running containers at scale –
Consideration #1 – You need a Model/Paradigm/Platform for the lifecycle management of containers –
This includes the ability to organize applications into groups of containers, schedule these applications on host servers that match their resource requirements, deploy applications as changes happen, manage complex storage integration and network topologies, and provide seamless ways to destroy and restart containers.
This covers a range of lifecycle processes ranging from constant deployments to upgrades to monitoring. Granular issues include support for application patching with minimal downtime, support for canary releases, graceful failures in cloud-native applications, (container) capacity scale up & scale down based on traffic patterns etc.
Consideration #3 – Support development processes moving to DevOps and microservices –
The reasons for this range from rapid feature development to the ability to easily accommodate CI/CD approaches to flexibility (as highlighted in the above point). For instance, k8s removes one of the biggest challenges with using vanilla containers along with CI/CD tools like Jenkins – the challenge of linking individual containers that run microservices with one another. Other useful features include load balancing, service discovery, rolling updates and blue/green deployments.
While the above drivers are just general guidelines, the actual tipping point for large scale container adoption will vary from enterprise to enterprise. However, the common precursor to supporting containerized applications at scale has to be an enterprise grade management and orchestration platform. And for some very concrete reasons we will discuss, k8s is fast emerging as the de facto leader in this segment.
Introducing Kubernetes (K8s)…
Kubernetes (kube or k8s) is an open-source platform that aims to automate the scheduling, deployment and management of applications running on containers. Kubernetes (and platforms built leveraging it) are designed to bring both development and operations teams together. This affects how Cloud Native applications are architected, composed, deployed, and managed.
k8s was incubated at Google (given their expertise in running billions of container workloads at scale) over the last decade. One caveat: the famous cluster controller & container management system known as Borg is deployed extensively at Google. Borg is a predecessor to k8s, but it is generally believed that while k8s borrows its core design tenets from Borg, it only contains a subset of the features present in Borg.
Again, from  – “Kubernetes traces its lineage directly from Borg. Many of the developers at Google working on Kubernetes were formerly developers on the Borg project. We’ve incorporated the best ideas from Borg in Kubernetes, and have tried to address some pain points that users identified with Borg over the years.“
However, k8s is not a Google-only project anymore. In 2015 it was donated to the Cloud Native Computing Foundation (CNCF). The same year also saw the foundational k8s 1.0 release. Since then the project has been moving with a fair degree of feature & release velocity. Version 1.4 followed in 2016. With the current 1.7 release, k8s has found wider industry adoption. The last year has seen heavy contributions from the likes of Red Hat, Microsoft, Mirantis, and Fujitsu to the k8s codebase.
k8s is infrastructure agnostic with clusters deployable on pretty much any Linux distribution – Red Hat, CentOS, Debian, Ubuntu etc. K8s also runs on all popular cloud platforms – AWS, Azure and Google Cloud. It is also virtually hypervisor agnostic, supporting VMware, KVM, and libvirt. It also supports the Docker, Windows Containers and rkt (Rocket) runtimes, with expanding support as newer runtimes become available.
After this brief preamble, let us now discuss the architecture and internals of this exciting technology. We will then discuss why it has begun to garner massive adoption and why it deserves a much closer look by enterprise IT teams.
The Architecture of Kubernetes…
As depicted in the below diagram, Kubernetes (k8s) follows a master-slave methodology much like Apache Mesos and Apache Hadoop.
The k8s Master is the control plane of the architecture. It is responsible for scheduling deployments, acting as the gateway for the API, and for overall cluster management. As depicted in the below illustration, it consists of several components, such as an API server, a scheduler, and a controller manager. The master is responsible for the global, cluster-level scheduling of pods and handling of events. For high availability and load balancing, multiple masters can be set up.
The core API server which runs in the master hosts a RESTful service that can be queried to maintain the desired state of the cluster and to maintain workloads. The admin path always goes through the Master to access the worker nodes and never goes directly to the workers.
The Scheduler service is used to schedule workloads on containers running on the slave nodes. It works in conjunction with the API server to distribute applications across groups of containers working on the cluster. It’s important to note that the management functionality only accesses the master to initiate changes in the cluster and does not access the nodes directly.
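To make the scheduling idea a little more concrete, here is a minimal Python sketch of placing pods on nodes that have enough free resources. This is only an illustration and not the actual k8s scheduler: the Node and PodRequest classes, the node names and the "most free CPU wins" rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_free: float            # free CPU cores (hypothetical unit)
    mem_free: int              # free memory in MiB (hypothetical unit)
    pods: List[str] = field(default_factory=list)


@dataclass
class PodRequest:
    name: str
    cpu: float
    mem: int


def schedule(pod: PodRequest, nodes: List[Node]) -> Optional[str]:
    """Place the pod on the feasible node with the most free CPU.

    Returns the chosen node name, or None if no node can fit the pod.
    """
    feasible = [n for n in nodes if n.cpu_free >= pod.cpu and n.mem_free >= pod.mem]
    if not feasible:
        return None  # the pod would stay pending
    target = max(feasible, key=lambda n: n.cpu_free)
    # Reserve the resources so that later placements see the reduced capacity.
    target.cpu_free -= pod.cpu
    target.mem_free -= pod.mem
    target.pods.append(pod.name)
    return target.name


nodes = [
    Node("node-a", cpu_free=2.0, mem_free=4096),
    Node("node-b", cpu_free=4.0, mem_free=2048),
]
requests = [
    PodRequest("web", cpu=1.0, mem=1024),
    PodRequest("db", cpu=2.0, mem=3072),
    PodRequest("batch", cpu=8.0, mem=512),
]
placements = {p.name: schedule(p, nodes) for p in requests}
print(placements)
```

In this toy run the "batch" pod asks for more CPU than any node has free, so it is left unplaced, which mirrors a workload waiting for capacity.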
The second primitive in the architecture is the concept of a Node. A node refers to a host which may be virtual or physical. The node is the worker in the architecture and runs application stack components on what are called Pods. It needs to be noted that each node runs several kubernetes components such as a kubelet and a kube proxy. The kubelet is an agent process that works to start and stop groups of containers running user applications, manages images etc and communicates with the Docker engine. The kube-proxy works as a proxy networking service that redirects traffic to specific services and pods (we will define these terms in a bit). Both these agents communicate with the Master via the API server.
Nodes (which are VMs or bare metal servers) are joined together to form Clusters. As the name connotes, Clusters are a pool of resources – compute, storage and networking – that are used by the Master to run application components. Nodes, which used to be known as minions in prior releases, are the workers. Nodes host end user applications using their local resources such as compute, network and storage. Thus they include components to aid in logging, service discovery etc. Most of the administrative and control interactions are done via the kubectl command-line tool or by performing RESTful calls to the API server. The state of the cluster and the workloads running on it is constantly synchronized with the Master using all these components.
Clusters can be easily made highly available and scaled up/down on demand. They can also be federated across cloud providers and data centers if a hybrid architecture is so desired.
The next and perhaps the most important runtime abstraction in k8s is called a Pod. It is recommended that applications deployed in a K8s infrastructure be composed of lightweight and stateless microservices. These microservices can be deployed in individual or multiple containers. If the former strategy is chosen, each container only performs a specialized business activity. Though k8s also supports stateful applications, stateless applications confer a variety of benefits including loose coupling, auto-scaling etc.
The Pod is essentially the unit of infrastructure that runs an application or a set of related applications and as such it always exists in the context of a set of Linux namespaces or cgroups. A Pod is a group of one or more containers which always run on the same host. They are always scheduled together and share a common IP address/port configuration. However, these IP assignments cannot be guaranteed to stay the same over time. This can lead to all kinds of communication issues over complex n-tier applications. Kubernetes provides an abstraction called a Service – which is a grouping of a set of pods mapped to a common IP address.
Containers within a pod communicate with one another over IPC mechanisms. Pods also share local storage running on the node with the shared storage essentially mounted on.
The infrastructure can provide services to the pod that span resources and process management.
The key advantage here is that Pods can run related groups of applications while individual containers are kept lightweight and versioned independently. This greatly aids complex software projects where multiple teams work on their own microservices, each created and updated on its own cadence.
Labels are key value pairs that k8s uses to identify a particular runtime element, be it a node, pod or service. They are most frequently applied to pods and can be anything that makes sense to the user or the administrator. An example of a pod label would be – (app=mongodb, cluster=eu3, language=python). Label Selectors determine what Pods are targeted by a Service.
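The way a selector picks out pods by their labels can be sketched in a few lines of Python. This is a simplified, equality-only illustration: the pod names and the matches/select helpers are made up for this example and are not part of any k8s client library.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """True if every key/value pair in the selector is present on the object."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(selector: Labels, pods: Dict[str, Labels]) -> List[str]:
    """Return the (sorted) names of the pods whose labels satisfy the selector."""
    return sorted(name for name, labels in pods.items() if matches(selector, labels))


# Hypothetical pods, labelled in the style of the example above.
pods = {
    "mongo-1": {"app": "mongodb", "cluster": "eu3", "language": "python"},
    "mongo-2": {"app": "mongodb", "cluster": "us1", "language": "python"},
    "web-1": {"app": "frontend", "cluster": "eu3", "language": "python"},
}

eu_mongo = select({"app": "mongodb", "cluster": "eu3"}, pods)
all_mongo = select({"app": "mongodb"}, pods)
print(eu_mongo, all_mongo)
```

The point to notice is that the selector never names a pod directly. Any pod that carries the right labels is picked up, which is what lets a Service keep working as pods come and go.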
From an HA standpoint, administrators can declare a configuration policy that states the number of pods that they need to have running at any given point. This ensures that pod failures can be recovered from automatically by starting new pods. An important HA feature is the notion of replica sets. The Replication controller ensures that a specified number of pods is available to a given application and, in the event of failure, starts new pods so that the actual state matches the desired state. Such groups of pods are called replica sets. Stateful workloads are covered for HA using what are called pet sets (renamed StatefulSets as of the 1.5 release).
The Replication Controller component running in the Master node determines which pods it controls and then uses a pod template file (typically a JSON or YAML file) to create new pods. It is also in charge of ensuring that the number of pods stays in consonance with replica counts. It is important to note that while the Replication controller just replaces dead or dysfunctional pods on the nodes that hosted them, it does not move pods across nodes.
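The "actual state matches desired state" behaviour described above boils down to a reconciliation loop. The Python sketch below is a toy version under stated assumptions: the reconcile function, the pod naming scheme and the "remove the newest pod first" rule are hypothetical and only serve to show the idea.

```python
import itertools
from typing import List

_pod_ids = itertools.count(1)  # hypothetical source of unique pod suffixes


def reconcile(desired_replicas: int, running: List[str], template: str) -> List[str]:
    """Return the pod list after one reconciliation pass.

    Pods are started from the template while there are too few of them
    and removed (newest first) while there are too many.
    """
    pods = list(running)
    while len(pods) < desired_replicas:
        pods.append(f"{template}-{next(_pod_ids)}")
    while len(pods) > desired_replicas:
        pods.pop()
    return pods


# Initial rollout: nothing is running, three replicas are desired.
after_start = reconcile(3, [], "web")

# Simulate a pod failure, then let the controller heal the set.
survivors = [p for p in after_start if p != "web-2"]
after_heal = reconcile(3, survivors, "web")

# The administrator lowers the desired count to two.
after_scale_down = reconcile(2, after_heal, "web")
print(after_start, after_heal, after_scale_down)
```

Note that the failed pod is not revived. A fresh pod is created from the template to restore the count, which is in line with treating pods as disposable.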
Storage & Networking in Kubernetes –
Local pod storage is ephemeral and is reclaimed when the pod dies or is taken offline, but if data needs to be persistent or shared between pods, K8s provides Volumes. So really, depending on the use case, k8s supports a range of storage options from local storage to network storage (NFS, Gluster, Ceph) to cloud storage (Google Cloud or AWS). More details around these emerging features are found at the K8s official documentation.
Kubernetes has a pluggable networking implementation that works with the design of the underlying network. There are four networking challenges to solve:
Container-container communication within a host – this is based purely on IPC & localhost mechanisms
Interpod communication across hosts – Here Kubernetes mandates that all pods be able to communicate with one another without NAT and that the IP of a pod is the same from within the pod and outside of it.
Pod to Service communications – provided by the Service implementation. As we have seen above, K8s services are provided with IP addresses that clients can reach them by. The kube-proxy process, which runs on all nodes, proxies these IP addresses and routes each request to the correct pod.
External to Service communication – again provided by the Service implementation. This is done primarily by mapping the load balancer configuration to services running in the cluster. As outlined above, when traffic is sent to a node, the kube-proxy process ensures that the traffic is routed to the appropriate service.
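The Service idea, one stable address in front of pods whose own IPs keep changing, can be sketched as follows. This is a deliberately simplified Python illustration: the Service class, the IP addresses and the round-robin choice of backend are assumptions for this example and do not describe how kube-proxy is actually implemented.

```python
from typing import Dict, List, Tuple


class Service:
    """Toy model of a stable service address fronting a changing set of pods."""

    def __init__(self, cluster_ip: str, selector: Dict[str, str]) -> None:
        self.cluster_ip = cluster_ip      # stays fixed for the life of the service
        self.selector = selector
        self.endpoints: List[str] = []    # current pod IPs behind the service
        self._next = 0

    def refresh(self, pods: Dict[str, Tuple[Dict[str, str], str]]) -> None:
        """Recompute endpoints from the pods whose labels match the selector."""
        self.endpoints = sorted(
            ip
            for labels, ip in pods.values()
            if all(labels.get(k) == v for k, v in self.selector.items())
        )
        self._next = 0

    def route(self) -> str:
        """Pick the next backend pod IP in round-robin order."""
        if not self.endpoints:
            raise RuntimeError("no pods behind this service")
        ip = self.endpoints[self._next % len(self.endpoints)]
        self._next += 1
        return ip


# Hypothetical pods: name -> (labels, pod IP).
pods = {
    "mongo-1": ({"app": "mongodb"}, "172.17.0.4"),
    "mongo-2": ({"app": "mongodb"}, "172.17.0.5"),
    "web-1": ({"app": "frontend"}, "172.17.0.6"),
}
svc = Service("10.0.0.10", {"app": "mongodb"})
svc.refresh(pods)
first_round = [svc.route() for _ in range(3)]

# mongo-1 is restarted and comes back with a different IP.
pods["mongo-1"] = ({"app": "mongodb"}, "172.17.0.9")
svc.refresh(pods)
second_round = [svc.route() for _ in range(2)]
print(svc.cluster_ip, first_round, second_round)
```

Clients only ever talk to the service address, so the pod IP change between the two rounds is invisible to them.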
Network administrators looking to implement the K8s cluster model have a variety of choices from open source projects such as – Flannel, OpenContrail etc.
Why is Kubernetes such an exciting (and important) Cloud technology –
We have discussed the business & technology advantages of building an SDDC over the previous posts in this series. As a project, k8s has very lofty goals to simplify the lifecycle of not only containers but also to enable the deployment & management of distributed systems across any kind of modern datacenter infrastructure. It’s designed to promote extensibility and pluggability (via APIs) as we will see in the next blog with OpenShift.
There are three specific reasons why k8s is rapidly becoming a de facto choice for Container orchestration-
Once containers are used to build full-blown applications, organizations need to deal with several challenges to enable efficiency in the overall development & deployment processes. These include enabling a rapid speed of application development among various teams working on APIs, UX front ends, business logic, data etc.
The ability to scale application deployments and to ensure a very high degree of uptime by leveraging a self-healing & immutable infrastructure. A range of administrative requirements from monitoring, logging, auditing, patching and managing storage & networking all come to consideration.
The need to abstract developers away from the infrastructure. This is accomplished by allowing dev teams to specify their infrastructure requirements via declarative configuration.
Kubernetes is emerging as the most popular platform to deploy and manage digital applications based on a microservices architecture. As a sign of its increased adoption and acceptance, Kubernetes is being embedded in Platform as a Service (PaaS) offerings where it offers all of the same advantages for administrators (deploying application stacks) while also freeing up developers of complex underlying infrastructure. The next post in this series will discuss OpenShift, Red Hat’s market leading PaaS offering, which leverages best of breed projects such as Docker and Kubernetes.
With advances in various Blockchain based DLT (distributed ledger technology) platforms such as HyperLedger & Ethereum et al, enterprises have begun to take baby steps to adapt the Blockchain (BC) to industrial scale applications. This post discusses some of the stumbling blocks the author has seen enterprises run into as they look to get started on this journey.
Blockchain meets the Enterprise…
The Blockchain is a system & architectural design pattern for recording (immutable) transactions while providing an unalterable historical audit trail. This approach (proven with the hugely successful Bitcoin) guarantees a high degree of security, transparency, and anonymity for distributed applications purpose built for it. Bitcoin is but the first application of this ground breaking design pattern.
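For readers who want to see why such a record is hard to alter, below is a minimal Python sketch of a hash-linked chain of blocks. It assumes a single in-memory ledger with no consensus, mining or networking, and the function names and sample transactions are purely illustrative.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index: int, prev_hash: str, transactions: List[str]) -> str:
    """SHA-256 over the block contents, including the previous block's hash."""
    payload = json.dumps(
        {"index": index, "prev": prev_hash, "tx": transactions}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain: List[Dict], transactions: List[str]) -> None:
    """Add a new block that points at the hash of the current last block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    chain.append(
        {
            "index": index,
            "prev": prev_hash,
            "tx": transactions,
            "hash": block_hash(index, prev_hash, transactions),
        }
    )


def is_valid(chain: List[Dict]) -> bool:
    """Recompute every hash and check that each block links to its predecessor."""
    prev_hash = GENESIS_PREV
    for index, block in enumerate(chain):
        if block["index"] != index or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(index, block["prev"], block["tx"]):
            return False
        prev_hash = block["hash"]
    return True


chain: List[Dict] = []
append_block(chain, ["alice pays bob 5"])
append_block(chain, ["bob pays carol 2"])
valid_before = is_valid(chain)

# Tamper with history: the stored hash no longer matches the block contents.
chain[0]["tx"] = ["alice pays bob 500"]
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

Because each block's hash covers the previous block's hash, changing an old transaction breaks verification unless every later block is rewritten as well. Real platforms add distributed consensus on top so that such a rewrite cannot be done by a single party.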
Due to its origins in the Bitcoin ecosystem, there has been a high degree of association of the Blockchain with the cryptocurrency movement. However, a wide range of potential enterprise applications has been identified in industries such as financial services, healthcare, manufacturing, and retail – as depicted in the below illustration.
Last year, we took an in-depth look into the business potential of the Blockchain design pattern at the below post.
We can then define the Enterprise Blockchain as “a highly secure, resilient, algorithmic & accurate globally distributed ledger (or global database or the biggest filesystem or the largest spreadsheet) that provides an infrastructure pattern to build multiple types of applications that help companies (across every vertical), consumers and markets discover new business models, transact, trade & exchange information & assets.”
While some early deployments and initial standards making activity have been seen in financial services and healthcare, it is also finding significant adoption in optimizing internal operations for globally diversified conglomerates. For instance, tech major IBM claims to host one of the largest blockchain enterprise deployments. The application, known as IGF, provides working capital to more than 4,000 customers, distributors, and partners. IBM uses its blockchain platform to manage disputes in the $48 billion IGF program. The near linear scalability of the blockchain ensures that the IGF can gradually increase the number of members participating in the network.
In particular, the Financial Services Industry has had several bodies aiming to create standards around use cases such as consumer and correspondent banking payments and around the trade lifecycle. Some examples of these are R3 Corda, HyperLedger, and Ethereum. However, there is still a large amount of technology innovation, adoption and ecosystem development that needs to happen before the technology is consumable by your everyday bank or manufacturer or insurer.
The Four Modes of Blockchain Adoption in the Enterprise…
There are certain criteria that need to be met for a business process to benefit from a distributed ledger. First, the business process should comprise various actors (both internal and external to the organization). Second, there should be no reason to have a central authority or middleman to verify daily transactions except when disputes arise. Third, the process should call for a strict audit trail as well as transaction immutability. The assets stored on the blockchain can really be anything – data, contracts or transactions etc.
At a high level, there are four modes of adoption, or, ways in which a BC technology can make its way into an enterprise –
Organic Proofs of Concept (POCs) – These are driven by innovation groups inside the company tasked with exploring the latest technology advances. Oftentimes, these are technology-driven initiatives in search of a business problem. The approach works like this – management targets specific areas in technology where the firm needs to develop capabilities. The innovation team works on defining an appropriate technical approach, reference stack & architecture (in this case for applications that have been determined to be suitable to be POC’d on a DLT) et al. The risk in this approach is that many of the best practices and learnings from other organizations, vendors, and solution providers are not leveraged.
Participation in Industry Consortia – A consortium is a group of companies engaged in a similar business task. These kinds of initiatives are being driven by like minded enterprises banding together (within specific sectors such as financial services, insurance, and healthcare) to define common use-cases that can benefit from sector specific common standards from a DLT standpoint & the ensuing network effects. Consortiums tend to mitigate risk both from a business and a cost standpoint as several companies typically band together to explore the technology. However, these can be difficult to pull off many a time due to competitive and cultural reasons.
In many cases, Regulators are pushing industry leaders to look into use cases (such as Risk Management, BackOffice Processes, and Fraud Detection) which can benefit from adopting distributed ledger technology (DLT).
Partnerships with Blockchain start-ups – These arrangements enable the (slow to move) incumbent market leading enterprises to partner with the brightest entrepreneurial minds in the BC world who are building path-breaking applications which will upend business models. The focus of such efforts has been to identify a set of use-cases & technology approaches that would immensely help the organization in applying BC technology to its internal and external business challenges. The advantage of this approach is that the skills shortage established companies face when tackling immature technology projects can be ameliorated by working with younger organizations.
Having noted all this, the majority of proof of concepts driven out of enterprises are failing or performing suboptimally.
I feel that this is due to various reasons some of which we will discuss below. Point to be noted is that we are assuming that there is strong buy-in around BC and DLT at the highest levels of the organization. Scepticism about this proven design pattern and overcoming it is quite another topic altogether.
The Key Considerations for a Successful Enterprise Blockchain or Distributed Ledger (DLT)…
CONSIDERATION #1 – Targeting the right business use case for the DLT…
As we saw in the above sections, the use cases identified for DLT need to reflect a few foundational themes – non-reliance on a middleman, a business process supporting a truly distributed deployment, building trust among a large number of actors/counterparties, ability to support distributed consensus, and transparency. Due to its flat, peer to peer nature – Blockchain/DLT conclusively eliminates the need for any middleman. It is important that a target use case be realistic from both a functional requirement standpoint as well as from a business process understanding. The majority of enterprise applications can do perfectly well with a centralized database and applying DLT technology to them can cause projects to fail.
CONSIDERATION #2 – The Revenge of the Non Functional requirements…
Generally speaking, the current state of DLT platforms is that they fall short in a few key areas that enterprises usually take for granted in other platforms such as Cloud Computing, Middleware, Data platforms etc. These include key areas such as data privacy, transaction throughput and performance. If one recalls, the community Blockchain (that Bitcoin was built on) prioritized anonymity over privacy. This can sometimes be undesirable in areas such as payments processing or healthcare where the identity of consumers is governed by strict KYC (Know Your Customer) mandates. Thus, from an industry standpoint most DLT platforms are 24 months or so away from coming up to par in these areas in a manner that enterprises can leverage them.
Some of the other requirements, such as performance and scalability, are sometimes not directly tied to business features but lack of support for them can stymie any ambitious intended use of the technology. For instance, a key requirement in payments processing and supplier management is the ability for the platform to process a large number of transactions per second. Most DLTs can only process around ten transactions per second on a permissionless network. This is very far from the throughput needed in use-cases such as payments processing, IoT etc.
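A quick back-of-the-envelope calculation shows how large this gap can be. In the Python sketch below, the ten transactions per second figure comes from the paragraph above, while the daily payment volume is a purely hypothetical workload chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def hours_to_clear(tx_count: int, tps: float) -> float:
    """Hours needed to process tx_count transactions at a sustained rate of tps."""
    return tx_count / tps / 3600


ledger_tps = 10                # figure quoted in the text above
daily_payments = 5_000_000     # hypothetical workload, not a measured number

# Average rate the platform would need to sustain to keep up with one day's volume.
required_tps = daily_payments / SECONDS_PER_DAY

# Time the ledger would need to work through that one day of volume.
backlog_hours = hours_to_clear(daily_payments, ledger_tps)

print(f"required: {required_tps:.1f} tps, time to clear: {backlog_hours:.1f} hours")
```

Under these assumptions one day of payments would take well over five days to clear, so the backlog would simply keep growing.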
The good news is that the DLT community is acutely aware of the enhancements that need to be made to the underlying platforms (e.g. changes to block size) to increase throughput, but these changes will take time to make their way into the mass market given the rigorous engineering work that needs to happen.
The Blockchain/DLT is not a data management paradigm. This is important for adopters to understand. Also, there currently exist very few standards and guidance on integrating distributed applications (Dapps) custom built for DLTs with underlying enterprise assets. These assets include enterprise middleware stacks, identity management platforms, corporate security systems, application data silos, BPM (Business Process Management) and Robotic Process Automation systems etc. For the BC to work for any business capability and as a complete business solution, necessary integration must be provided with a reasonable number of backend systems that influence the business cases- most such integration is sorely lacking. Interoperability is still in its infancy despite vendor claims.
CONSIDERATION #4 – Understand that Smart Contracts are still in their infancy…
The blockchain introduces the important notion of programmable digital instruments or contracts. An important illustration of the possibilities of blockchain is this concept of a “Smart contract”. Instead of static data objects that are inserted into the distributed ledger, a Smart Contract is a program that can perform the generation of downstream actions when appropriate conditions are met. They only become immutable once accepted into the ledger. Business rules are embedded in a contract that can automatically trigger based on certain conditions being met. E.g. a credit pre-qualification or assets transferred after a payment is made or after legal approval is provided etc.
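The "assets transferred after a payment is made and legal approval is provided" example can be sketched as a small program. The Python below is only a conceptual illustration: real smart contracts run on a DLT platform in that platform's own language, and the EscrowContract class, party names and asset identifier here are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EscrowContract:
    """Transfers an asset to the buyer once payment AND legal approval are recorded."""

    seller: str
    buyer: str
    asset: str
    price: int
    paid: bool = False
    approved: bool = False
    settled: bool = False
    events: List[str] = field(default_factory=list)

    def record_payment(self, amount: int, owners: Dict[str, str]) -> None:
        if amount >= self.price:
            self.paid = True
        self._maybe_settle(owners)

    def record_legal_approval(self, owners: Dict[str, str]) -> None:
        self.approved = True
        self._maybe_settle(owners)

    def _maybe_settle(self, owners: Dict[str, str]) -> None:
        """Business rule: fire the downstream action only when all conditions hold."""
        if self.paid and self.approved and not self.settled:
            owners[self.asset] = self.buyer
            self.settled = True
            self.events.append(f"{self.asset} transferred to {self.buyer}")


# Hypothetical asset registry standing in for ledger state.
owners = {"warehouse-receipt-17": "acme"}
contract = EscrowContract(
    seller="acme", buyer="globex", asset="warehouse-receipt-17", price=1000
)

contract.record_payment(1000, owners)
owner_after_payment = owners["warehouse-receipt-17"]

contract.record_legal_approval(owners)
owner_after_approval = owners["warehouse-receipt-17"]
print(owner_after_payment, owner_after_approval, contract.events)
```

The transfer is not triggered by any one party calling "transfer". It fires automatically once the recorded facts satisfy the embedded rule, which is the essence of the concept.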
Smart Contracts are being spoken about as the key functionality for any DLT platform based on Blockchain. While this hype is justified in some sense, it should be noted that smart contracts are again not standards based across major DLT platforms. This means that they’re not auditable & verifiable across both local and global jurisdictions or when companies use different underlying commercial DLTs. The technology will evolve over the next few years but it is still very early days to run large scale production grade applications that leverage Smart Contracts.
CONSIDERATION #5 – SECURITY and DATA PRIVACY CONCERNS…
The promise of the original blockchain platform which ran Bitcoin was very simple. It provided a truly secure, trustable and immutable record on which any digital asset could be run. Parties using the system were all in a permissionless mode which meant that their identities were hidden from one another and from any central authority. While this may work for Bitcoin like projects, the vast majority of industry verticals will need strong legal agreements and membership management capabilities which follow them. Accordingly, these platforms will need to be permissioned.
CONSIDERATION #6 – Blockchain Implementations need to be treated as an integral part of Digital Transformation…
Blockchain as a technology definitely sounds way more exotic than Digital projects, which currently command all the attention. However, an important way to visualize the organizational BC is that it provides an environment of instantaneous collaboration with business partners and customers. That is a core theme of Digital Transformation as one can appreciate. Accordingly, Blockchain/DLT proof of concepts themselves should be centrally funded & governed, and skills should be grown in this area from a development, administration and project management standpoint. Projects should be tracked using fair business metrics and appropriate governance mechanisms instituted as with any other digital initiative.
Surely, Blockchain based distributed ledgers are going to usher in the next generation of distributed business processes. These will enable the easy transaction, exchange, and contracting of digital assets. However, before enterprises rush in, they need to perform an adequate degree of due diligence to avoid some of the pitfalls highlighted above.
“We live in a world awash with data. Data is proliferating at an astonishing rate—we have more and more data all the time, and much of it was collected in order to improve decisions about some aspect of a business, government, or society. If we can’t turn that data into better decision making through quantitative analysis, we are both wasting data and probably creating suboptimal performance.”
— Tom Davenport, 2013 – Professor Babson College, Best Selling Author and Leader at Deloitte Analytics
Data Monetization is the organizational ability to turn data into cost savings & revenues in existing lines of business and to create new revenue streams.
Digitization is driving Banks and Insurance companies to reinvent themselves…
Enterprises operating in the financial services and the insurance industry have typically taken a very traditional view of their businesses. As waves of digitization have begun slowly upending their established business models, firms have begun to recognize the importance of harnessing their substantial data assets which have been built over decades. These assets include fine-grained data about internal operations, customer information and external sources (as depicted in the below illustration). So what does the financial opportunity look like? PwC’s Strategy& estimates that the incremental revenue from monetizing data could potentially be as high as US$ 300 billion every year beginning 2019. This is across all the important segments of financial services – capital markets, commercial banking, consumer finance & banking, and insurance. FinTechs are also looking to muscle into this massive data opportunity.
The compelling advantages of Data Monetization have been well articulated across new business lines, customer experience, cost reduction et al. One of the key aspects of Digital transformation is data and the ability to create new revenue streams or to save costs using data assets.
..Which leads to a huge Market Opportunity for Data Monetization…
In 2013, PwC estimated that the market opportunity in data monetization was a nascent US $175 million. This number has begun to grow immensely over the five years since, with consumer banking and capital markets leading the way.
Digital first has been a reality in the Payments industry with Silicon Valley players like Google and Apple launching their own payments solutions (in the form of Google Pay and Apple Pay).
Visionary Banks & FinTechs are taking the lead in Data Monetization…
Leading firms such as Goldman Sachs & AIG have invested heavily in capabilities around data monetization. In 2012, Goldman purchased the smallest of the three main credit-reporting firms – TransUnion. In three years, Goldman has converted TransUnion into a data-mining machine. In addition to credit-reporting, TransUnion now gathers billions of data points about American consumers. This data is constantly analyzed and then sold to lenders, insurers, and others. Using data monetization, Goldman Sachs has made nearly $600 million in profit. It is expected to make about five times its initial $550 million investment.
From the WSJ article…
By the time of its IPO in 2015, TransUnion had 30 million gigabytes of data, growing at 25% a year and ranging from voter registration in India to drivers’ accident records in the U.S. The company’s IPO documents boasted that it had anticipated the arrival of online lenders and “created solutions that catered to these emerging providers.”
As are forward looking Insurers …
The insurance industry is reckoning with a change in consumer behavior. Younger consumers are very comfortable with using online portals to shop for plans, compare them, purchase them and do other activities that increase the amount of data being collected by the companies. While data and the models that operate on it have long been critical in the insurance industry, their use has been strongest in the actuarial areas. The industry has now begun heavily leveraging data monetization strategies across areas such as new customer acquisition, customer underwriting, dynamic pricing et al. A new trend is for insurers to form partnerships with automakers to tap into a range of telematics information such as driver behavior, vehicle performance, and location data. In fact, automakers are already ingesting and constantly analyzing this data with the intention of leveraging it for a range of use-cases, which include selling this data to insurance companies.
Leading carriers such as AXA are leveraging their data assets to strengthen broker and other channel relationships. AXA’s EB360 platform helps brokers with a range of analytics-infused functions, e.g. tracking the status of applications, managing compensation and commissions, and monitoring progress on business goals. AXA has also optimized user interfaces to minimize data entry while supporting rapid quoting, which helps brokers manage their business easily and strengthens the broker-carrier relationship.
Introducing Five Data Monetization Strategies across Financial Services & Insurance…
Let us now identify and discuss five strategies that enable financial services firms to progressively monetize their data assets. It must be mentioned that doing so requires an appropriate business transformation strategy to be put into place. Such a strategy includes clear business goals, ranging from improving core businesses to entering lateral business areas.
Monetization Strategy #1 – Leverage Data Collected during Business Operations to Ensure Higher Efficiency in Business Operations…
The simplest and easiest way to monetize data is to begin collecting the disparate data generated during the course of regular operations. An example in Retail Banking is to collect data on customer branch visits, online banking usage logs, clickstreams etc. Once collected, the newer data needs to be fused with existing Book of Record Transaction (BORT) data to obtain added intelligence on branch utilization, branch design & optimization, customer service improvements etc. It is very important to ensure that the right metrics are agreed upon and tracked across the monetization journey.
Monetization Strategy #2 – Leverage Data to Improve Customer Service and Satisfaction…
The next progressive step in leveraging both internal and external data is to use it to drive new revenue streams in existing lines of business. This requires fusing both internal and external data to create new analytics and visualization. This is used to drive use cases relating to cross sell and up-sell of products to existing customers.
Monetization Strategy #3 – Use Data to Enter New Markets…
A range of third-party data needs to be integrated and combined with internal data to arrive at a true picture of a customer. Once the Single View of a Customer has been created, the Bank/Insurer has the ability to introduce marketing and customer retention and other loyalty programs in a dynamic manner. These include the ability to combine historical data with real time data about customer interactions and other responses like clickstreams – to provide product recommendations and real time offers.
An interesting angle on this is to provide new adjacent products much like the above TransUnion example illustrates.
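To make the Single View of a Customer idea a little more concrete, here is a minimal Python sketch of fusing records from several systems into one profile per customer. Everything in it is an assumption made for illustration: the `customer_id` key, the source names, and the field names are hypothetical, and a real implementation would also need entity resolution, since separate systems rarely share a clean common key.

```python
from collections import defaultdict


def build_single_view(*sources):
    """Merge per-system customer records into one profile per customer.

    Each source is a (source_name, records) pair. Records are dicts that
    share a hypothetical 'customer_id' key. The first non-empty value seen
    for a field wins; later disagreeing values are logged as conflicts
    instead of silently overwriting what is already there.
    """
    profiles = defaultdict(lambda: {"fields": {}, "conflicts": []})
    for source_name, records in sources:
        for record in records:
            profile = profiles[record["customer_id"]]
            for field, value in record.items():
                if field == "customer_id" or value is None:
                    continue
                existing = profile["fields"].get(field)
                if existing is None:
                    profile["fields"][field] = value
                elif existing != value:
                    profile["conflicts"].append(
                        (field, existing, value, source_name)
                    )
    return dict(profiles)


# Illustrative data only: internal, clickstream, and third-party sources.
core_banking = [{"customer_id": "C1", "name": "A. Smith", "segment": "retail"}]
clickstream = [{"customer_id": "C1", "last_page": "/mortgages", "segment": None}]
third_party = [{"customer_id": "C1", "segment": "affluent", "credit_band": "A"}]

view = build_single_view(
    ("core_banking", core_banking),
    ("clickstream", clickstream),
    ("third_party", third_party),
)
```

Recording conflicts rather than overwriting is a deliberate choice in this sketch: a disagreement between the internal segment and the third-party segment is itself useful information for a marketing or retention program.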
Monetization Strategy #4 – Establish a Data Exchange…
The Data Exchange is a mechanism where firms can fill in holes in their existing data about customers, their behaviors, and preferences. Data exchanges can be created using a consortium based approach that includes companies that span various verticals. Companies in the consortium can elect to share specific datasets in exchange for data while respecting data privacy and regulatory constraints.
Online transactions in both Banking and Insurance are increasing in number year on year. If data is truly customer gold, then it is imperative for companies to collect as much of it as they can. The goal is to create products that can drive longer & continuous online interactions with global customers. Tools like Personal Financial Planning products and complementary banking and insurance services are examples of where firms can offer free products that augment existing offerings.
A recent topical example in Telecom is Verizon Up, a program from the wireless carrier where consumers can earn credits (that they can use for a variety of paid services – phone upgrades, concert tickets, Uber credits, movie premieres etc.) in exchange for providing access to their browsing history, app usage, and location data. Verizon also intends to use the data to deliver targeted advertising to its customers.
How Data Science Is a Core Capability for any Data Monetization Strategy…
Data Science and Machine learning approaches are the true differentiators and the key ingredients in any data monetization strategy. Further, it is a given that technological investments in Big Data Platforms, analytic investments in areas such as machine learning, artificial intelligence are also needed to stay on the data monetization curve.
How does this tie into practical use-cases discussed above? Let us consider the following usecases of common Data Science algorithms –
Customer Segmentation– For a given set of data, predict for each individual in a population, a discrete set of classes that this individual belongs to. An example classification is – “For all retail banking clients in a given population, who are most likely to respond to an offer to move to a higher segment”.
Pattern recognition and analysis – discover new combinations of business patterns within large datasets. E.g. combine a customer’s structured data with clickstream data analysis. A major bank in NYC is using this data to bring troubled mortgage loans to quick settlements.
Customer Sentiment analysis is a technique used to find degrees of customer satisfaction and how to improve them with a view of increasing customer net promoter scores (NPS).
Market basket analysis is commonly used to find associations between products that are purchased together, with a view to improving how products are marketed. E.g. recommendation engines that determine which banking products to recommend to customers.
Profiling algorithms aim to characterize the normal or typical behavior of an individual or group within a larger population. They are frequently used in anomaly detection systems such as those that detect AML (Anti Money Laundering) violations and Credit Card Fraud.
Clustering algorithms divide data into groups, or clusters, of items that have similar properties.
Causal Modeling algorithms attempt to find out what business events influence others.
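As an illustration of one item from the list above, market basket analysis, the following is a minimal Python sketch that scores product pairs by support, confidence, and lift. The product names and the `min_support` cutoff are assumptions chosen for the example. Production systems would typically use a full association-rule algorithm such as Apriori or FP-Growth rather than this pairwise count.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence, lift) tuples.

    support    = share of baskets containing both products
    confidence = share of baskets with the antecedent that also hold the consequent
    lift       = confidence divided by the consequent's overall share
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two directional rules.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            rules.append((x, y, support, confidence, lift))
    # Highest lift first: the strongest associations lead the list.
    return sorted(rules, key=lambda r: r[4], reverse=True)


# Hypothetical product holdings, one set per customer.
holdings = [
    {"checking", "savings", "credit_card"},
    {"checking", "credit_card"},
    {"checking", "mortgage", "home_insurance"},
    {"savings", "mortgage", "home_insurance"},
    {"checking", "savings"},
]
rules = pair_rules(holdings, min_support=0.4)
```

A lift above 1 suggests the two products are held together more often than chance would predict, which is the signal a cross-sell recommendation would build on. In this toy data the mortgage and home insurance pair ranks first.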
Banks and Insurers who develop data monetization capabilities will be positioned to create new service offerings and revenues. Done right (while maintaining data privacy & consumer considerations), the monetization of data represents a truly transformational opportunity for financial services players in the quest to become highly profitable.
This is the third in a series of blogs on Data Science that I am jointly authoring with Maleeha Qazi, (https://www.linkedin.com/in/maleehaqazi/). We have previously covered some of the inefficiencies that result from a siloed data science process @ http://www.vamsitalkstech.com/?p=5046 & the ideal way Data Scientists would like their models deployed for the maximal benefit and use – as a Service @ http://www.vamsitalkstech.com/?p=5321. As the name of this third blog post suggests, the success of a data science initiative depends on data. If the data going into the process is “bad” then the results cannot be relied upon. Our goal is to also suggest some practical steps that enterprises can take from a data quality & governance process standpoint.
“However, under the strong influence of the current AI hype, people try to plug in data that’s dirty & full of gaps, that spans years while changing in format and meaning, that’s not understood yet, that’s structured in ways that don’t make sense, and expect those tools to magically handle it.” – Monica Rogati (Data Science Advisor and ex-VP Jawbone – 2017)
Different posts in this blog have discussed Data Science and other Analytical approaches to some degree of depth. What is apparent is that whatever the kind of analytics – descriptive, predictive, or prescriptive – the availability of a wide range of quality data sources is key. However, along with volume and variety of data, the veracity, or the truth, in the data is as important. This blog post discusses the main factors that determine the quality of data from a Data Scientist’s perspective.
The Top Issues of Data Quality
As highlighted in the above illustration, the top quality issues that data assets typically face are the following:
Incomplete Data: The data provided for analysis should span the entire cross-section of known data about how the organization views its customers and products. This would include data generated from various applications that belong to the business, and external data bought from various vendors to enrich the knowledge base. The completeness criterion measures whether all of the information about the entities under consideration is available and usable.
Inconsistent & Inaccurate Data: Consistency measures which data values give conflicting information and must be fixed. It also measures whether all the data elements conform to specific and uniform formats and are stored in a consistent manner. Inaccurate data has duplicate, missing or erroneous values. It also does not reflect an accurate picture of the state of the business at the point in time it was pulled.
Lack of Data Lineage & Auditability: The data framework needs to support auditability, i.e. provide an audit trail of how the data values were derived from source to analysis point, including the various transformations performed to arrive at the data set being considered for analysis.
Lack of Contextuality: Data needs to be accompanied by meaningful metadata – data that describes the concepts within the dataset.
Temporally Inconsistent: This measures if the data was temporally consistent and meaningful given the time it was recorded.
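Some of the issues above lend themselves to simple automated checks. The Python sketch below computes a few illustrative measures over a batch of records: completeness, duplicate keys, inconsistent date formats, and temporal inconsistency. The field names, the ISO date convention, and the `as_of` cutoff are assumptions made for the example. Lineage and contextuality are not covered here because they depend on metadata that sits outside the records themselves.

```python
from datetime import date


def parse_iso_date(value):
    """Return a date for an ISO-formatted string, or None if it does not parse."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def profile_quality(records, required_fields, key_field, date_field, as_of):
    """Compute simple data quality measures over a list of dict records."""
    total = len(records)
    # Completeness: every required field is present and non-empty.
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields)
        for r in records
    )
    # Duplicates: more records than distinct key values.
    duplicate_keys = total - len({r.get(key_field) for r in records})
    # Format consistency: dates that do not follow the assumed convention.
    parsed = [parse_iso_date(r.get(date_field)) for r in records]
    bad_date_format = sum(d is None for d in parsed)
    # Temporal consistency: dates later than the extract date cannot be right.
    future_dated = sum(d is not None and d > as_of for d in parsed)
    return {
        "completeness": complete / total if total else 0.0,
        "duplicate_keys": duplicate_keys,
        "bad_date_format": bad_date_format,
        "future_dated": future_dated,
    }


# Hypothetical customer records pulled on 2018-01-01.
records = [
    {"customer_id": "C1", "email": "a@example.com", "opened_on": "2015-06-01"},
    {"customer_id": "C2", "email": "", "opened_on": "2016-02-11"},
    {"customer_id": "C2", "email": "b@example.com", "opened_on": "11/02/2016"},
    {"customer_id": "C3", "email": "c@example.com", "opened_on": "2031-01-01"},
]
report = profile_quality(
    records,
    required_fields=["customer_id", "email", "opened_on"],
    key_field="customer_id",
    date_field="opened_on",
    as_of=date(2018, 1, 1),
)
```

Numbers like these are only a starting point, but tracking them per source system over time gives a governance process something quantitative to act on.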
What Business Challenges does Poor Data Quality Cause…
Poor data quality causes the following business challenges in enterprises:
Customer dissatisfaction: Across industries like Banking, Insurance, Telecom & Manufacturing, the ability to get a unified view of the customer & their journey is at the heart of the enterprise’s ability to promote relevant offerings & detect customer dissatisfaction. Currently, most industry players are woeful at putting together this comprehensive Single View of their Customers (SVC). Due to operational silos, each department possesses its own siloed & limited view of the customer across multiple channels. These views are typically inconsistent, lack synchronization with other departments, & miss a high amount of potential cross-sell and upsell opportunities. This is a data quality challenge at its core.
Lost revenue: The Customer Journey problem is an age-old issue which has gotten exponentially more complicated over the last five years, as the staggering rise of mobile technology and the Internet of Things (IoT) has vastly increased the number of enterprise touch points through which customers can discover and purchase new products/services. In an OmniChannel world, an increasing number of transactions are being conducted online. In verticals like the Retail industry and the Banking & Insurance industries, the share of transactions conducted online approaches an average of 40%. Adding to the problem, more and more consumers are posting product reviews and feedback online. Companies thus need to react in real-time to piece together the source of consumer dissatisfaction.
Time and cost in data reconciliation: Every large enterprise nowadays runs expensive data re-engineering projects due to their data quality challenges. These are an inevitable first step in other digital projects which cause huge cost and time overheads.
Increased time to market for key projects: Poor data quality causes poor data agility, which increases the time to market for key projects.
Poor data means suboptimal analytics: Poor data quality causes the analytics done using it to be suboptimal – algorithms will end up giving wrong conclusions because the input provided to them is incorrect at best & inconsistent at worst.
Why is Data Quality a Challenge in Enterprises
The top reasons why data quality has been a huge challenge in the industry are:
Prioritization conflicts: For most enterprises, the focus of their business is the product(s)/service(s) being provided; book-keeping is a mandatory but secondary concern. And since keeping the business running is the most important priority, keeping the books accurate for financial matters is the only data aspect that gets the technical attention it deserves. Other data aspects are usually ignored.
Organic growth of systems: Most enterprises have gone through a series of book-keeping methods and applications, most of which have no compatibility with one another. Warehousing data from various systems as they are deprecated, merging in data streams from new systems, and fixing data issues as these processes happen is not prioritized till something on the business end fundamentally breaks. Band-aids are usually cheaper and easier to apply than to try and think ahead to what the business will need in the future, build it, and back-fill it with all the previous systems’ data in an organized fashion.
Lack of time/energy/resources: Nobody has infinite time, energy, or resources. Making all the systems an enterprise chooses to use at any point in time talk to one another, share information between applications, and keep a single consistent view of the business is a near-impossible task. Many well-trained resources, and much time & energy, are required to make sure this can be set up and successfully orchestrated on a daily basis. But how much is a business willing to pay for this? Most do not see short-term ROI and hence lose sight of the long-term problems that could be caused by ignoring the quality of data collected.
What do you want to optimize?: There are only so many balls an enterprise can have up in the air to focus on without dropping one, and prioritizing those can be a challenge. Do you want to optimize the performance of the applications that need to use, gather and update the data, OR do you want to make sure data accuracy/consistency (one consistent view of the data for all applications in near real-time) is maintained regardless? One will have to suffer for the other.
How to Tackle Data Quality
With the advent of Big Data and the need to derive value from ever increasing volumes and a variety of data, data quality becomes an important strategic capability. While every enterprise is different, certain common themes emerge as we consider the quality of data:
The sheer number of transaction systems found in a large enterprise causes multiple challenges across the data quality dimensions. Organizations need to have valid frameworks and governance models to ensure the data’s quality.
Data quality has typically been thought of as just data cleansing and fixing missing fields. However, it is very important to address the originating business processes that cause this data to take multiple dimensions of truth. For example, centralize customer onboarding in one system across channels rather than having every system do its own onboarding.
It is clear from the above that data quality and its management is not a one time or siloed application exercise. As part of a structured governance process, it is very important to adopt data profiling and other capabilities to ensure high-quality data.
Enterprises need to define both quantitative and qualitative metrics to ensure that data quality goals are captured across the organization. Once this is done, an iterative process needs to be followed to ensure that a set of capabilities dealing with data governance, auditing, profiling, and cleansing is applied to continuously ensure that data is brought up to, and kept at, a high standard. Doing so can have salubrious effects on customer satisfaction, product growth, and regulatory compliance.
This blog has from time to time discussed issues around the defensive portion of the financial services industry (Banking, Payment Processing, Insurance etc.). Anti Money Laundering (AML) is a critical area where institutions need to protect themselves and their customers from malicious activity. This post summarizes eight key blogs on the topic of AML published at VamsiTalksTech.com. It aims to serve as a handy guide for business and technology audiences tasked with implementing complex AML projects.
Money laundering has emerged as an umbrella crime which facilitates public corruption, drug trafficking, tax evasion, terrorism financing etc. Banks and other financial institutions are expected to conduct business in a manner that protects their countries of operations and consumers from security risks such as laundering, terrorist financing, and corruption (the ML/TF risks). Given the global reach of financial products, a variety of regulatory authorities are concerned about money laundering. Technology has become key to meeting regulatory expectations as well as reducing costs in these onerous programs. As the below graphic from PwC demonstrates, this is one of the most pressing issues facing the financial services industry.
The Six Critical Gaps in Global AML Programs…
From an industry standpoint, the highest priority issues that are being pointed out by regulators include the following –
Institutions failing to develop AML frameworks that are unique to the risks run by organizations given their product and geographic mix
Failure to develop real-time insights into business transactions and assigning them elevated risks based on their elements
Failure to develop AML models that draw from the widest possible sources of data – both internal and external – to understand a true picture of the business
Failure to demonstrate a consistent approach across geographies
Failure to leverage the latest developments in analytics, including Machine Learning, to enable the automation of AML programs
Lack of appropriate business governance & change management in setting, monitoring and managing AML compliance programs, policies and procedures
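The second gap in the list above, assigning elevated risk to transactions based on their elements, can be illustrated with a small rule-based scorer. The Python sketch below is purely hypothetical: the field names, the placeholder country codes, the threshold, and the weights are all assumptions made for the example and do not reflect any regulator's rules or any real watch list. Real AML programs combine rules like these with models trained on much richer internal and external data.

```python
# Placeholder codes only; a real program would load a maintained list.
HIGH_RISK_COUNTRIES = {"XX", "YY"}


def score_transaction(txn, reporting_threshold=10_000):
    """Return (score, reasons) for one transaction using illustrative rules."""
    score = 0
    reasons = []

    if txn["amount"] >= reporting_threshold:
        score += 40
        reasons.append("amount at or above reporting threshold")
    elif txn["amount"] >= 0.9 * reporting_threshold:
        # Amounts just under a threshold can indicate structuring.
        score += 30
        reasons.append("amount just below reporting threshold")

    if txn["counterparty_country"] in HIGH_RISK_COUNTRIES:
        score += 35
        reasons.append("counterparty in high-risk jurisdiction")

    if txn["channel"] == "cash":
        score += 15
        reasons.append("cash channel")

    return score, reasons


suspicious = {"amount": 9_500, "counterparty_country": "XX", "channel": "cash"}
routine = {"amount": 120, "counterparty_country": "US", "channel": "card"}

suspicious_score, suspicious_reasons = score_transaction(suspicious)
routine_score, routine_reasons = score_transaction(routine)
```

Returning the reasons alongside the score matters as much as the score itself, since investigators and auditors need to see why a transaction was escalated.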
With this background in mind, the complete list of AML blogs on VamsiTalksTech is included below.
# 1 – Why Banks should Digitize their Operations and how this will help their AML programs –
Digitization implies a mix of business models predicated on agile systems, rapid & iterative development and more importantly – a Data First strategy. These have significant impacts on AML programs as well in addition to helping increase market share.
# 2 – Why Data Silos are a huge challenge in many cross organization projects such as AML –
Organizational Data Silos inhibit the effectiveness of AML programs as compliance officers cannot gain a single view of a customer or single view of a suspicious transaction or view the social graph in critical areas such as trade finance. This blog discusses the Silo anti-pattern and ways to mitigate silos from proliferating.
The headline is self-explanatory but we discuss the five major work streams on global AML projects – Customer Due Diligence, Entity Analysis, Downstream Analytics, Ongoing Monitoring and Investigation Lifecycle.
According to PricewaterhouseCoopers, the estimates of global money laundering flows were between 2-5% of global GDP in 2016 – however, only 1% of these transactions were caught. Certainly, the global financial industry has a long way to go before it effectively stops these nefarious actors, but there should be no mistaking that technology is a huge part of the answer.