Why Linux Containers and Docker are the Runtime for the Software Defined Data Center (SDDC)..(4/7)

The third and previous blog in this seven part series (@ http://www.vamsitalkstech.com/?p=4659) discussed Apache Mesos, a project that aims to abstract away system resources – CPU, memory, network and disk – to present consuming digital applications with a giant cluster from which they can draw capacity – a key requirement of the Software Defined Datacenter (SDDC). In this fourth blog, we will discuss another important ecosystem technology & project – Linux Containers and Docker – which forms the foundational runtime component in the SDDC. The next blog will discuss Kubernetes – Google’s container orchestration platform.

Much like shipping goods in Containers over Oceans, Linux Containers offer a portable, lightweight & convenient way to ship business applications. (Image Credit – WallPapers 13)

Executive Summary…

We can agree that the Digital application is inherently a distributed application. Such applications have historically been extremely hard to develop, set up and manage across a large fleet of data center servers that are a mix of platforms and technologies. Thus it is no surprise that one of the most disruptive developments in the last five years has been the innovation in the Linux container space. Containers now enable the running of distributed applications at scale.

Due to business reasons, Digital applications demand constant updates, changes and incremental revisions in response to changing customer needs. The Software Defined Datacenter (SDDC) thus needs a runtime paradigm that enables not just efficient hardware usage but also supports standardized application environments that are portable, simplified and consistent across hybrid clouds and hypervisors. Containers fill this need and are thus emerging as the natural unit of deployment across the SDDC. Much has been written on the topic of Docker and Linux Container technology. My goal for this blog post is to distill key insights in the container ecosystem.

The Technologies of Linux Containers & Docker

Unlike Virtual Machines, Container Engines such as Docker share a common OS (Image Credit – MSFT Azure)

Linux Containers are both similar to and different from virtual machines. They are similar in the sense that each Container shares the system resources of the underlying hardware platform – CPU, RAM, and network – just as VMs do. However, while each VM maintains its own separate copy of the Operating System (OS), containers share the same OS kernel while keeping themselves separate from other containers running on the same OS. How do they do that?

Though the terms ‘Docker’ and ‘Container’ have become almost synonymous, it needs to be noted that Docker is also a company focused on developing technology enablement around containers in areas such as orchestration, networking, and management. Docker began as an open source project (since renamed Moby [1]) that provided capabilities such as a standard description of container formats and utilities for application packaging, deployment & lifecycle management of applications inside Linux Containers. It provides the Docker CLI command line tool for the lifecycle management of image-based containers.

Prior to the explosion of interest in Linux containers & the founding of Docker, traditional Linux distributions (with a minimum kernel level of 3.8) already supported two foundational kernel features – control groups (cgroups) and kernel namespaces. Linux containers use both these features to achieve their goal of isolation and portability. Cgroups enable the host to limit the resources each container process can use from a CPU, memory, block I/O and network standpoint. This ensures that containers running on a host cannot starve others of resources, thus avoiding the “Noisy Neighbor” problem that bedeviled a lot of cloud deployments.

Kernel Namespaces provide another kind of isolation for process interactions within the OS. Containers can only view and modify resources in their own namespaces. This provides a security mechanism whereby other containers and processes on the host cannot launch attacks on a given application running in a tenant container or on the host itself. Thus the combination of both these technologies ensures that multiple applications running within their individual containers can share CPU and memory without needing the overhead of virtualization. Docker also grants each container its own networking stack, ensuring that resources such as sockets and interfaces can also be protected.
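
To make this concrete, here is a minimal sketch using the Docker SDK for Python. The memory and CPU arguments below are translated by the Docker engine into cgroup limits, while the container itself is placed into its own kernel namespaces (PID, network, mount etc). The image and container names are purely illustrative.

```python
import docker

client = docker.from_env()

# Run an nginx container with explicit resource caps; the engine enforces these via cgroups
container = client.containers.run(
    "nginx:latest",
    detach=True,
    name="constrained-nginx",   # illustrative name
    mem_limit="256m",           # cgroup memory limit
    nano_cpus=500_000_000,      # cgroup CPU quota (~half a core)
)
print(container.name, container.status)

container.stop()
container.remove()
```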

Companies including Red Hat, IBM, Google, Cisco, VMware, and CoreOS have greatly aided the development and accessibility of containers in their platforms and products.

Layered Filesystems..

Various Image Layers in Docker. Each layer in the file system is mounted on the previous one. The topmost is the Writable Container layer. (Image Credit – Docker)

We discussed how Container Images are immutable. This is a key advantage of using container technology such as Docker & is made possible by the notion of a Union filesystem. What are Union filesystems and how do they enforce immutability? Much like a Virtual Machine image, a Container also runs from an image, which is typically a snapshot of a filesystem but tends to be much smaller than a VM image since the Container shares the host kernel.

Union filesystems are best described as a layered architecture – each layer is created independently and then added atop the previous layer. An example of such a layered image is a base Linux OS layer, then a database such as Oracle, then Tomcat, and finally a web application on top. The top layer is always the writable layer. The real advantage of using a union filesystem is that working with these images becomes highly efficient from a storage and execution standpoint. Union filesystems also help in sharing portions of the OS across containers. Simply put, an image contains everything an application needs – from its code to its dependencies and external libraries. When an image is run, it is called a container. In the case of Docker, it uses a layered copy-on-write filesystem called AUFS (Another Union Filesystem).
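
As a quick illustration of the layered image idea, the short sketch below (using the Docker SDK for Python) pulls an image and walks its history, where each entry corresponds to one read-only layer stacked on the one beneath it. The choice of image is arbitrary.

```python
import docker

client = docker.from_env()
image = client.images.pull("httpd:latest")   # any public image works here

# Each history entry maps to a layer/instruction stacked on the previous one
for layer in image.history():
    created_by = (layer.get("CreatedBy") or "").strip()
    print(f"{layer.get('Size', 0):>12} bytes  {created_by[:60]}")
```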

Containers and Developers..

Containers are possibly the first infrastructure software category created with developers in mind. The rise of Linux Containers and Docker has coincided with the onset of agile development practices under the DevOps umbrella – CI/CD etc. Containers are an excellent choice for creating agile delivery pipelines and continuous deployment. At their core, Containers enable the creation of multiple self-contained execution environments over the same operating system.

Developers are naturally excited about Linux Containers for five specific reasons –

  1. Containers allow for image consistency across OS environments. This is a huge help in accelerating the development process from development to debugging to production. Developers can just focus on building their applications (in dev environments that match the test and prod) and packaging them in containers. This just takes a lot of the inefficiency around environment dissimilarities out of the equation.
  2. Containers are treated as standard Linux processes by the kernel & are thus orders of magnitude quicker to start up than VMs. This means that developers can start their applications in a matter of seconds as long as they run them in a container (see the short timing sketch after this list).
  3. Containers also provide development organizations the ability to standardize application development workflows and update processes. This helps solve the scalability problems that digital applications pose for large organizations.
  4. Digital applications are leading the move to adopt microservices. Microservices offer a way to build applications as a collection of discrete services as opposed to a monolithic architecture. By their very nature, microservices can be built and managed by different teams. Containerization affords a lightweight way of building and deploying microservices.
  5. Containers offer a portable way of delivering applications (across Operating Systems) as well as provide horizontal scalability.
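
As a rough illustration of point #2 above, the sketch below (Docker SDK for Python) times how long it takes to start a throwaway container and let it exit. On a typical host this completes in well under a second, whereas booting a VM takes tens of seconds or more. The image choice is arbitrary.

```python
import time
import docker

client = docker.from_env()

start = time.time()
# Run a minimal container that exits immediately and is removed afterwards
client.containers.run("alpine:latest", "true", remove=True)
print(f"container started and exited in {time.time() - start:.2f}s")
```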

    Digital Application development using Containers..

Digital Application Development and Deployment Workflow using Containers.

There are a few key runtime components involved in operationalizing a small to medium to large scale container infrastructure as the above illustration depicts.

  1. Firstly, developers create container images. These images describe an application and its dependencies. An easy way to conceptualize an image is to think of it as a basic deployment template. Images are also immutable in that they are read only and any changes happen in the topmost layer, which is writable. Modifying an image creates a new one; images thus have a parent-child relationship. Developers create images by building their applications in their development environments, performing unit tests and then pushing the images to a repository. Once the container is built with the necessary dependencies, these tools run a battery of tests to validate business functionality. A large part of this process is usually best automated using CI/CD tools like Jenkins, CruiseControl or Buildbot etc (a minimal sketch of the build-and-push step follows this list).
  2. The built images are then made available in a Container Registry, which is either maintained internally or sourced from a trusted external source. As the name suggests, Registries maintain a catalog of container images of frequently used software – e.g. custom applications and other software packages such as WordPress, relational databases, web servers, Big Data technologies and application servers.
  3. The next step is to create and deploy (runtime) containers from these images on a set of servers. Once images are released as a result of application development, sys admins work on provisioning the servers to run these images. Once a Container engine is installed on a server, images are loaded onto it and take the runtime shape of containers. Images reach these servers through either a push or a pull mechanism.
  4. Scheduling of containers on servers is also a process that is usually handled by sys admins. This involves running containers of certain kinds on servers that match up to certain CPU, I/O and network capacity requirements.
  5. To create complex real world deployments, not only do the servers and networking have to be provisioned but the containers also have to be interconnected (e.g. a web server container to an application server container) using discovery mechanisms. These containers then need to connect to a host of enterprise services. Customer traffic is then routed to the clustered containers running on these servers, and operators monitor the logs and performance of these containers and the microservices running on them.
  6. The process repeats from step #1 above.
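
A hedged sketch of steps #1 to #3 above using the Docker SDK for Python is shown below. The registry address, repository name and tag are assumptions made purely for illustration; in practice a CI/CD tool such as Jenkins would drive these same calls.

```python
import docker

client = docker.from_env()

# Step 1: build the application image from a local Dockerfile and tag it for the registry
image, build_logs = client.images.build(
    path=".", tag="registry.example.com/payments/api:1.4.2"
)

# Step 2: push the image to the (internal) container registry
for line in client.images.push(
    "registry.example.com/payments/api", tag="1.4.2", stream=True, decode=True
):
    print(line)

# Step 3: on a target server, pull the image and run it as a container
client.images.pull("registry.example.com/payments/api", tag="1.4.2")
client.containers.run("registry.example.com/payments/api:1.4.2", detach=True)
```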

Industry Adoption of Containers.

In a few years, containers will deliver the bulk of compute workloads across public cloud providers such as Amazon AWS, Google Compute Engine and Microsoft Azure. Given that the VM options on these clouds can run multiple containers which can scale on demand, the industry will begin to gravitate to higher utilization density. The Software Defined Datacenter has thus already begun incorporating a hybrid model in which applications run on both Linux Containers and Virtual Machines in a complementary fashion.

Customers can choose traditional enterprise operating systems such as Red Hat Enterprise Linux or Microsoft Windows, or can run containers on OSs developed expressly for hosting containers at hyper scale. These OSs provide little more than the tools needed to manage containers; examples include Red Hat Atomic Platform and CoreOS. Moving up the stack, pioneers such as Google and Red Hat have added core support for containers in projects such as OpenStack, Kubernetes, Mesos, OpenShift & CloudFoundry by helping with networking and persistent storage. Kubernetes (which we will cover in the next post) also handles provisioning on multiple public cloud platforms. Configuration management platforms such as Ansible, Chef and Puppet now support containerized deployments.

Technical Considerations for Container Adoption

Some key considerations that industry players are tackling from the standpoint of running containers at scale –

  1. Container Orchestration – Organizing groups of containers into composable applications, scheduling them on servers that match their resource requirements, placement of containers based on network topology etc.
  2. Container Networking – Containers follow a pluggable model and the network is no different. Key considerations – an enterprise network connectivity stack is needed to not only provide the interconnect between different containers but also to integrate them with existing Layer 2/3 networks. Additionally, network isolation needs to be provided for microservices running on these containers using either a dedicated IP address for each or an overlay network.
  3. Management and Monitoring – Lifecycle management and monitoring encompass a range of questions – application patching with low downtime, graceful failures in cloud native applications, container scale up & scale down based on traffic patterns etc.

Containers and your Enterprise…

So what is the best way to adopt containers across a large enterprise?

  • Develop your container strategy in the context of the Nexus of Forces (i.e., information, mobile, social and cloud) initiatives in your organization — Containers are at the junction of these technologies.
  • Institute an organizational process to examine the business value of any initiative to adopt Containers. Understand what tools and platforms to adopt that will abstract away the complexities of using containers.
  • Understand the skills required to leverage containers. Containers represent a new way of working for both developers and SysOps. Dependency management moves to the developers, but they realize tremendous benefits in adopting containers for high-velocity Digital applications.
  • Identify, measure and benchmark key success metrics that capture the ROI of the overall container investments.

Conclusion..

To sum up, the Linux (and Windows) container space is exploding both from a mindshare as well as an adoption standpoint. What is hugely encouraging is that a host of next generation platform technologies (ranging from IaaS to PaaS) are not just choosing to support containers as their basic runtime unit but are also focusing on becoming the de facto solution supporting a host of container ecosystem use cases – provisioning, orchestration, management, CI/CD et al. The next two blogs will respectively discuss how Google Kubernetes and Red Hat OpenShift overcome these challenges and abstract away much of the complexity around container deployments.

The next blog post in this series will discuss Google Kubernetes, the dominant project in the container orchestration space.

References

[1] Introducing Moby Project –  https://blog.docker.com/2017/04/introducing-the-moby-project/

The Deployment Architecture of an Enterprise API Management Platform..

We discussed the emergence of Application Programming Interfaces (APIs) as providing a key business capability in Digital Platforms @ http://www.vamsitalkstech.com/?p=3834. The next post then discussed the foundational technology, integration & governance capabilities that any Enterprise API Platform must support @ http://www.vamsitalkstech.com/?p=5102.  This final post in the API series will discuss a deployment model for an API Management Platform.

Background..

The first two posts in this series discussed the business background to API Management and the need for an Enterprise API Strategy. While details will vary across vendor platforms, the intention of this post is to discuss the key runtime components of an API management platform, the overall developer workflow in creating APIs, and the runtime workflow that enables client applications to access them.

Architectural Components of an API Management Platform..

The important runtime components of an API management platform are depicted in the below illustration. Note that we have abstracted out network components (firewalls, reverse proxies, VLANs, switches etc) as well as the internal details of application architecture which would normally be impacted by an API Platform.

The major components of an API Management Platform and the request flow across the architecture.

Let us cover the core components of the above:

  1. API Gateway – The API Gateway has emerged as the dominant deployment artifact in API Architectures. As the name suggests, Gateways are based on a facade design pattern. The Gateway (or typically a set of highly available Gateways) acts as a proxy for traffic between client applications (used by customers, partners and employees) and back end services (ranging from mainframes to microservices). The Gateway is essentially an appliance or a software process that abstracts all API traffic into an organization and exposes business capabilities, typically via a REST interface. Clients are exposed to different views of the same API – coarse grained or granular – depending on the kind of client application (thick/thin) and access control permissions. Gateways include protocol translation and request routing as their core functionality. It is also not uncommon to deploy multiple Gateways – internal and external – depending on business requirements in terms of partner interactions etc. Gateways also include functionality such as caching requests for performance, load balancing, authentication, serving static content etc. The API Gateway can thus be managed using a set of policy controls. Performance characteristics such as throughput, scalability, caching, load balancing and failover are managed using a cluster of API Gateways. The introduction of an API Gateway also ensures that application design is impacted going forward. API Gateways can be implemented in many forms – as a software platform or as an appliance. Public cloud providers have also begun offering mature API Gateways that integrate well with the range of backend services that they provide from both an IaaS and a PaaS standpoint. For instance, Amazon’s API Gateway integrates natively with AWS Lambda and EC2 Container Service for microservice deployments on AWS. (A minimal facade-style sketch follows this list.)
  2. Security – Though it is not a standalone runtime artifact, Security tends to be called out as one of the most important logical requirements of an API Management platform. APIs have to follow the same access control mechanisms and security constraints for different user roles as their underlying datasources. This is key as backend applications and organizational data need to be protected from a variety of threats – denial of service attacks, malware, access control violations etc. Accordingly, policy based protection using API keys, JSON/XML signature scanning & threat protection, encryption for data in motion and at rest, OAuth support etc – all need to be provided as standard features.
  3. Developer portal -A Developer portal is the entry point for developers and can also serve as a developer onboarding tool. Thus, typically it is a web based portal integrated with the API Gateway. Developers use the portal to study API specs, download SDKs for different programming languages, register their APIs and to monitor their API performance. It also provides a visual interface to help developers build/test their APIs and also provides support for a high degree of automation using a continuous delivery model. For internal developers, the ability to provide self service consumption of API developer stacks (Node.js/ JavaScript frameworks/Java runtimes/ PaaS integration etc) is a highly desirable capability.
  4. Management and Monitoring – Ensuring that the exposed APIs maintain their QOS (Quality of Service) as well as helping admins monitor their quota of resource consumption is key from an Operations standpoint. Further, the M&M functionality should also aid operators in resolving complex systems issues and ensuring a high degree of availability during upgrades etc.
  5. Billing and Chargeback -Here we refer to the ability to tie in the usage of APIs to back office applications that can charge users based on their metered usage of the backend applications. This is typically provided through logging and auditing capability.
  6. Governance – From a Governance standpoint, the platform should provide the ability to track APIs across their lifecycle, a catalog of available APIs, an ability to audit their usage and the underlying assets they expose, and the ability for the business to set policies on their usage.
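
To make the facade idea behind the API Gateway concrete, here is a minimal, illustrative sketch in Python using Flask and requests. The backend URL, API key check and route are hypothetical; a real gateway would layer on protocol translation, throttling, caching and richer policy controls.

```python
from flask import Flask, jsonify, request
import requests

app = Flask(__name__)

BACKEND_URL = "http://internal-accounts-service:8080"   # hypothetical backend service
VALID_API_KEYS = {"demo-key-123"}                       # in practice, checked against a key store

@app.route("/api/v1/accounts/<account_id>", methods=["GET"])
def get_account(account_id):
    # Policy enforcement: reject callers without a registered API key
    if request.headers.get("X-API-Key") not in VALID_API_KEYS:
        return jsonify({"error": "unauthorized"}), 401
    # Routing: forward the request to the backend service and relay its response
    resp = requests.get(f"{BACKEND_URL}/accounts/{account_id}", timeout=5)
    return jsonify(resp.json()), resp.status_code

if __name__ == "__main__":
    app.run(port=8443)
```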

API Design Process..

Most API Platforms provide a developer toolkit with varying degrees of integration with a runtime platform. Handy SDKs for iOS, Android and JavaScript development are provided.

An internal developer uses the developer toolkit (e.g. Eclipse with an offline plugin) and/or an API Designer tool included with a vendor platform to create the API based on organizational policies. An extensive CLI (Command Line Interface) is also provided to perform all the functions that can be done using the GUI. These include local unit & system test capabilities and an ability to publish the tested APIs to a repository from where the runtime can access, deploy and update them.

From a data standpoint, multiple database types, including RDBMS and NoSQL, are supported for data access. During the creation of the API, depending on whether the developer already has an existing data model in mind, the business logic is mapped closely to the data schema; alternatively, one can work top down and create the backend once the API interface has been defined, using a model driven approach. This also includes settings for security permissions, with support for OAuth and any other third party authentication dependencies.

Once defined and tested, the API is published onto the runtime. During this process access control privileges, access policies and the endpoint itself are defined. The API is then ready for external consumption and discovery.

Runtime Flow Across the Architecture..

In the simplest case – once the API has been deployed and tested, it is made available for public discovery and consumption. Client Applications then begin to leverage the API, and this can be done in a variety of ways. For example – user interactions on mobile applications, webpages and B2B services trigger calls to the API Gateway. The Gateway performs a range of functions to process the request – from security authorization to load-balancing – before applying the policies set up for that particular API. The Gateway then invokes the API by calling the backend system, typically via message oriented middleware such as an ESB or a Message Broker. Once the backend responds with the appropriate payload, the data is sent to the requesting application. Systems and Administration teams can view detailed operational metrics and logs to monitor API performance.
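
From the client application’s point of view, the flow above might look like the following illustrative Python sketch: obtain an OAuth2 token, then call a business API exposed by the gateway with that token. The endpoints and credentials are hypothetical.

```python
import requests

# Obtain an OAuth2 access token using the client credentials grant (hypothetical endpoint)
token_resp = requests.post(
    "https://api.example-bank.com/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "mobile-app",
        "client_secret": "s3cr3t",
    },
    timeout=5,
)
access_token = token_resp.json()["access_token"]

# Call a business API through the gateway, presenting the token as a Bearer credential
api_resp = requests.get(
    "https://api.example-bank.com/v1/customers/42/balances",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=5,
)
print(api_resp.status_code, api_resp.json())
```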

A Note on Security..

It should come as no surprise that security is one of the most critical aspects of an API Management Platform implementation. While API Security is a good subject for a followup post and too exhaustive to be covered in a short blurb, several standards such as OAuth2, OpenID Connect, and JSON security & policy languages are all topics that need to be explored by both organizational developers and administrators. Extensive flow mapping and scenario testing are mandated here. Also, endpoint security from a client application standpoint is key: servers, desktops and supported mobile devices need to be updated and secured with the latest antivirus & other standard IT security/access control policies.

Conclusion..

In this post, we tried to highlight the major components of an API Management Platform from a technology standpoint. While there is a range of commercial & open source platforms, it is important to evaluate them from a feature standpoint as well as from an ecosystem capability perspective as developers begin implementing microservices based Digital Architectures.

Risk Management – Industry Insights & Reference Architectures…

Financial Risk Management as it pertains to different industries – Banking, Capital Markets and Insurance – has been one of the most discussed topics in this blog. The business issues and technology architecture of systems dedicated to aggregating, measuring & visualizing Risk are probably among the more complex undertakings in the worlds of finance & insurance. This post summarizes ten key blogs on the topic of Financial Risk published at VamsiTalksTech.com. It aims to serve as a handy guide for business and technology audiences tasked with implementing Risk projects.

Image Credit – ShutterStock

The twin effects of the global financial crisis & the FinTech boom have caused Financial Services, Insurance and allied companies to become laser focused on risk management. What was once a concern primarily of senior executives in the financial services sector has now become a top-management priority in nearly every industry.

Whatever the kind of Risk, certain themes are common from a regulatory intention standpoint –

  1. Limiting risks that may cause wider harm to the economy by restricting certain activities such as preventing banks with retail operations from engaging in proprietary trading activities
  2. Requiring that banks increase the amount of and quality of capital held on reserve to back their assets and by requiring higher liquidity positions
  3. Ensuring that banks put in place appropriate governance standards ensuring that boards and management interact not just internally but also with regulators and their clients
  4. Upgrading governance standards, enabling a fundamental change in bank governance and the way boards interact with both management and regulators. These ambitions were expressed in various new post‐crisis rules and approaches.
  5. Tackling the “too big to fail” challenge for highly complex businesses spanning multiple geographies, product lines and multifaceted customer segments. Accurate risk reporting ensures adequate capital conservation buffers.

With this background in mind, a complete list of Risk use case blogs on VamsiTalksTech is included below.

# 1 – Why Banks and Other Financial Institutions Should Digitize Risk Management –

Banks need to operate their IT across two distinct prongs – defense and offense. Defensive in areas like Risk, Fraud and Compliance (RFC) ; Offensive as in revenue producing areas of the business like Customer 360 (whether Institutional or Retail), Digital Marketing, Mobile Payments, Omni channel Wealth Management etc. If one really thinks about it – the biggest activity that banks do is manipulate and deal in information whether customer or transaction or general ledger etc.

Why Banks, Payment Providers and Insurers Should Digitize Their Risk Management..

# 2 – Case Study of a Big Data Enabled IT Architecture for Risk Data Measurement – Volcker Rule/Dodd Frank –

While industry analysts can discuss the implications of a certain Risk mandate, it is certainly most helpful for Business & IT audiences to find CIOs discussing overall strategy & specific technology tools. This blogpost discusses how two co-CIOs charged with an enterprise technology mandate are focused on growing and improving a global banking leader’s internal systems, platforms and applications, especially from a Risk standpoint.

How a Pioneering Bank leverages Hadoop for Enterprise Risk Data Aggregation & Reporting..



# 3 – A POV on Bank Stress Testing – CCAR and DFast

An indepth discussion of Bank Stress Testing from both a business and technology standpoint.

A POV on Bank Stress Testing – CCAR & DFAST..

# 4 – Capital Markets – Architectural Approaches to the practice of Risk Management

In Capital Markets, large infrastructures process millions of derivative trades on a typical day. The main implication is that there are a large number of data inserts and updates to handle. Once the data is loaded into the infrastructure, complex mathematical calculations need to be performed in near real time to calculate intraday positions. Most banks use techniques like Monte Carlo modeling and other computational simulations to build & calculate these exposures. Hitherto, these techniques were extremely expensive from the standpoint of both the hardware and software needed to run them, nor were tools & projects available that supported a wide variety of data processing paradigms – batch, interactive, realtime and streaming. This post examines a detailed reference architecture applicable to areas such as Market, Credit & Liquidity Risk Measurement.
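
To give a flavor of the Monte Carlo techniques mentioned above, here is a toy Value-at-Risk sketch in Python with NumPy. The portfolio value, drift and volatility are synthetic assumptions; a real risk engine would revalue full portfolios of instruments across far richer scenario sets.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

portfolio_value = 100_000_000   # assumed $100m portfolio
mu, sigma = 0.0002, 0.015       # assumed daily drift and volatility
n_scenarios = 1_000_000

# Simulate one-day portfolio returns and compute the profit-and-loss distribution
simulated_returns = rng.normal(mu, sigma, n_scenarios)
pnl = portfolio_value * simulated_returns

# 99% VaR = loss not exceeded in 99% of the simulated scenarios
var_99 = -np.percentile(pnl, 1)
print(f"1-day 99% VaR: ${var_99:,.0f}")
```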

Big Data architectural approaches to Financial Risk Mgmt..

# 5 – Risk Management in the Insurance Industry  – Solvency II 

A discussion of Solvency II – the Insurance industry’s equivalent of Basel III – from both a business and technology standpoint.

Why the Insurance Industry Needs to Learn from Banking’s Risk Management Nightmares..

# 6 – FRTB (Fundamental Review of the Trading Book)

An in-depth business and technology discussion of the highlights and key implications of the FRTB (Fundamental Review of the Trading Book).

A POV on the FRTB (Fundamental Review of the Trading Book)…

# 7 – Architecture and Data Management Antipatterns

How not to architect Financial Service IT platforms using Risk Applications as an example.

The Five Deadly Sins of Financial Services IT..

# 8 – The Intelligent Banker Needs Better Risk Management –

The Intelligent Banker needs better Risk Management

# 9 – Implications of Basel III

This blogpost discusses the key implications of Basel III.

Towards better Risk Management..Basel III

#10 The Implications of BCBS 239

This blogpost discusses the data management and governance implications of BCBS 239. The BCBS 239 provides guidelines to overhaul an organization’s risk data aggregation capabilities and internal risk reporting practices.

BCBS 239 and the need for smart data management

Conclusion..

The industry clearly requires a fresh way of thinking about Risk Management. Leading firms will approach Risk as a way to create customer value and a board level conversation rather than as a purely defensive and regulatory challenge. Surely, this will mean that budgets for innovation related spending in areas such as Digital Transformation will also slowly percolate over to Risk. As firms either digitize or deal with gradually eroding market share, business systems that work with and leverage risk will emerge as a strong enterprise capability over the upcoming 3-5 year horizon.

A Digital Reference Architecture for the Industrial Internet Of Things (IIoT)..

A few weeks ago on the invitation of DZone Magazine, I jointly authored a Big Data Reference Architecture along with my friend & collaborator, Tim Spann (https://www.linkedin.com/in/timothyspann/). Tim & I distilled our experience working on IIoT projects to propose an industrial strength digital architecture. It brings together several technology themes – Big Data , Cyber Security, Cognitive Applications, Business Process Management and Data Science. Our goal is to discuss a best in class architecture that enables flexible deployment for new IIoT capabilities allowing enterprises to build digital applications. The abridged article was featured in the new DZone Guide to Big Data: Data Science & Advanced Analytics which can be downloaded at  https://dzone.com/guides/big-data-data-science-and-advanced-analytics

How the Internet Of Things (IoT) leads to the Digital Mesh..

The Internet of Things (IoT) has become one of the four most hyped up technology paradigms affecting the world of business, the other usual suspects being Big Data, AI/Machine Learning & Blockchain. Cisco predicts that the IoT will encompass about 25 billion connected things by 2020 and affect about $2 trillion of economic value globally across a diverse range of verticals. These devices are not just consumer oriented devices such as smartphones and home monitoring systems but dedicated industry objects such as sensors, actuators, engines etc.

The interesting angle to all this is the fact that autonomous devices are already beginning to communicate with one another using IP based protocols. They largely exchange state & control information around various variables. With the growth of computational power on these devices, we are not far off from their sending over more granular and interesting streaming data – about their environment, performance and business operations – all of which will enable a higher degree of insightful analytics to be performed on the data. Gartner Research has termed this interconnected world, where decision making & manufacturing optimization can occur via IoT, the “Digital Mesh“.

The evolution of technological innovation in areas such as Big Data, Predictive Analytics and Cloud Computing now enables the integration and analysis of massive amounts of device data at scale while performing a range of analytics and business process workflows on the data.

Image Credit – Sparkling Logic

According to Gartner, the Digital Mesh will thus lead to an interconnected information deluge powered by the continuous data from these streams. These streams will encompass classical IoT endpoints (sensors, field devices, actuators etc) sending data in a variety of formats – text, audio, video & social data streams – along with new endpoints in areas as diverse as Industrial Automation, Remote Healthcare, Public Transportation, Connected Cars, Home Automation etc. These intelligent devices will increasingly begin communicating with their environments in a manner that will encourage collaboration in a range of business scenarios. The industrial cousin of IoT is the Industrial Internet of Things (IIoT).

Defining the Industrial Internet Of Things (IIoT)

The Industrial Internet of Things (IIoT) can be defined as an ecosystem of capabilities that interconnects machines, personnel and processes to optimize the industrial lifecycle. The foundational technologies that IIoT leverages are Smart Assets, Big Data, Realtime Analytics, Enterprise Automation and Cloud based services.

The primary industries impacted the most by the IIoT will include Industrial Manufacturing, the Utility industry, Energy, Automotive, Transportation, Telecom & Insurance.

According to Markets and Markets, the annual worldwide Industrial IoT market is projected to exceed $319 billion in 2020, which represents a compound annual growth rate (CAGR) of about 8%. The top four segments are projected to be manufacturing, energy and utilities, auto & transportation and healthcare.[1]

Architectural Challenges for Industrial IoT versus Consumer IoT..

Consumer based IoT applications generally receive the lion’s share of media attention. However, the ability of industrial devices (such as sensors) to send ever richer data about their operating environment and performance characteristics is driving a move to Digitization and Automation across a range of industrial manufacturing.

Thus, there are four distinct challenges that we need to account for in an Industrial IOT scenario as compared to Consumer IoT.

  1. The IIoT needs Robust Architectures that are able to handle millions of device telemetry messages per second. The architecture needs to account for all kinds of devices operating in environments ranging from the highly constrained to the well resourced.
  2. IIoT also calls for the highest degrees of Infrastructure and Application reliability across the stack. For instance, a lost or dropped message in a healthcare or a connected car scenario may mean life or death for a patient, or an accident.
  3. An ability to integrate seamlessly with existing Information Systems. Let’s be clear – these new age IIoT architectures need to augment existing systems such as Manufacturing Execution Systems (MES) or Traffic Management Systems. In Manufacturing, MES systems continually improve the product lifecycle and perform better resource scheduling and utilization. This integration helps these systems leverage the digital intelligence and insights across (potentially) millions of devices across complex areas of operation.
  4. An ability to incorporate richer kinds of analytics than have been possible before – analytics that provide a greater degree of context. This ability to reason around context is what makes it possible to design new business models which cannot currently be imagined due to lack of agility in the data and analytics space.

What will IIoT based Digital Applications look like..

Digital Applications are being designed for specific device endpoints across industries. While the underlying mechanisms and business models differ from industry to industry, all of these use predictive analytics based on a combination of real time data processing & data science algorithms. These techniques extract insights from streaming data to provide digital services on existing toolchains, provide value added customer service, predict device performance & failures, improve operational metrics etc.

Examples abound. For instance, a great example in manufacturing is the notion of a Digital Twin, which Gartner called out last year. A Digital Twin is a software personification of an intelligent device or system. It forms a bridge between the real world and the digital world. In the manufacturing industry, digital twins can be set up to function as proxies of Things like sensors and gauges, coordinate measuring machines, vision systems, and white light scanning. This data is sent to a cloud based system where it is combined with historical data to better maintain the physical system.

The wealth of data being gathered on the shop floor will ensure that Digital Twins will be used to reduce costs and increase innovation. Thus, in global manufacturing, data science will soon make its way onto the shop floor to enable the collection of insights from these software proxies. We covered the phenomenon of Servitization in manufacturing in a previous blogpost.

In the Retail industry, the ability to detect a customer’s location in realtime, combined with their historical buying patterns, can drive real time promotions and the ability to dynamically price retail goods.

Solution Requirements for an IIoT Architecture..

At a high level, the IIoT reference architecture should support six broad solution areas-

  1. Device Discovery – Discovering a range of devices (and their details)  on the Digital Mesh for an organization within and outside the firewall perimeter
  2. Performing Remote Lifecycle Configuration of these devices ranging from startup to modification to monitoring to shut down
  3. Performing Deep Security level introspection to ensure the patch levels etc are adequate
  4. Creating Business workflows on the Digital Mesh. We will do this by marrying these devices to enterprise information systems (EISs)
  5. Performing Business oriented Predictive Analytics on these devices; this is critical to extracting business value from the device data
  6. On a futuristic basis, support optional integration with the Blockchain to support a distributed organizational ledger that can coordinate activity across all global areas that an enterprise operates in.

Building Blocks of the Architecture

Listed below are the foundational blocks of our reference architecture. Though the requirements will vary across industries, an organization can reasonably standardize on a number of foundational components as depicted below and then incrementally augment them as the interactions between different components increase based on business requirements.

Our reference architecture includes the following major building blocks –

  • Device Layer
  • Device Integration Layer
  • Data & Middleware Tier
  • Digital Application Layer

It also includes the following cross cutting concerns which span across the above layers –

  • Device and Data Security
  • Business Process Management
  • Service Management
  • UX Design
  • Data Governance – Provenance, Auditing, Logging

The next section provides a brief overview of the reference architecture’s components at a logical level.

A Big Data Reference Architecture for the Industrial Internet depicting multiple functional layers

Device Layer – 

The first requirement of IIoT implementations is to support connectivity from the Things themselves, or the Device layer depicted at the bottom. The Device layer includes a whole range of sensors, actuators, smartphones, gateways, industrial equipment etc. The ability to connect with devices and edge devices like routers and smart gateways using a variety of protocols is key. These network protocols include Ethernet, WiFi, and Cellular, which can all directly connect to the internet. Other protocols that need a gateway device to connect include Bluetooth, RFID, NFC, Zigbee et al. Devices can connect directly with the data ingest layer shown above, but it is preferred that they connect via a gateway which can perform a range of edge processing.

This is important from a business standpoint. For instance, in certain verticals like healthcare and financial services, there exist stringent regulations that govern when certain identifying data elements (e.g. video feeds) can leave the premises of a hospital or bank etc. A gateway can not just perform intelligent edge processing but can also connect thousands of device endpoints and facilitate bidirectional communication with the core IIoT architecture.
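
As a simple illustration of the device-to-gateway hop, the sketch below publishes a telemetry reading over MQTT using the Eclipse paho-mqtt client (1.x style constructor). The broker address, topic and payload fields are assumptions for illustration only.

```python
import json
import time

import paho.mqtt.client as mqtt

# Connect to the plant's MQTT broker (hypothetical endpoint; paho-mqtt 1.x style constructor)
client = mqtt.Client(client_id="sensor-gateway-01")
client.connect("mqtt-broker.example.com", 1883)

reading = {
    "device_id": "temp-sensor-17",
    "ts": int(time.time()),
    "temperature_c": 72.4,
    "vibration_mm_s": 1.8,
}

# Publish the reading to a topic the ingest layer subscribes to
client.publish("plant1/line3/telemetry", json.dumps(reading), qos=1)
client.disconnect()
```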

The ideal tool for these constantly evolving devices, metadata, protocols, data formats and types is Apache NiFi.  These agents will send the data to an Apache NiFi gateway or directly into an enterprise Apache NiFi cluster in the cloud or on-premise.

Apache NiFi Eases Dataflow Management & Accelerates Time to Analytics In Banking (2/3)..

A subproject of Apache NiFi – MiNiFi – provides a complementary data collection approach that supplements the core tenets of NiFi in dataflow management. Due to its small footprint and low resource consumption, it is well suited to handle dataflow from sensors and other IoT devices. It provides central management of agents while providing full chain of custody information on the flows themselves.

For remote locations and more powerful devices like the Arrow BeagleBone Black Industrial and MyPi Industrial, it is very simple to run a tiny Java or C++ MiNiFi agent for secure connectivity needs.

The data sent by the device endpoints is then modeled into an appropriate domain representation based on the actual content of the messages. The data sent over also includes metadata around the message. A canonical model can optionally be developed (based on the actual business domain) which can support a variety of applications from a business intelligence standpoint.

 Apache NiFi supports the flexibility of ingesting changing file formats, sizes, data types and schemas. The devices themselves can send a range of feeds in different formats. E.g. XML now and based on upgraded capabilities – richer JSON tomorrow. NiFi supports ingesting any file type that the devices or the gateways may send.  Once the messages are received by Apache NiFi, they are enveloped in security with every touch to each flow file controlled, secured and audited.   NiFi flows also provide full data provenance for each file, packet or chunk of data sent through the system.  NiFi can work with specific schemas if there are special requirements for file types, but it can also work with unstructured or semi structured data just as well.  From a scalability standpoint, NiFi can ingest 50,000 streams concurrently on a zero-master shared nothing cluster that horizontally scales via easy administration with Apache Ambari.
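
As an illustrative (and assumption-laden) sketch, a field agent or gateway could post a telemetry record into a NiFi flow whose entry point is a ListenHTTP processor; the host, port and path below are purely hypothetical configuration.

```python
import json

import requests

record = {"device_id": "pump-204", "pressure_kpa": 311.7, "status": "OK"}

# Post the record to the NiFi flow's HTTP entry point (assumed ListenHTTP endpoint)
resp = requests.post(
    "http://nifi-gateway.example.com:9090/contentListener",
    data=json.dumps(record),
    headers={"Content-Type": "application/json"},
    timeout=5,
)
print(resp.status_code)   # the NiFi flow takes over from here
```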

Data and Middleware Layer – 

The IIoT Architecture recommends a Big Data platform with native message oriented middleware (MOM) capabilities to ingest device mesh data. This layer also processes device data in whatever fashion – batch or real-time – the business needs demand.

Application protocols such as AMQP, MQTT, CoAP, WebSockets etc are all deployed by many device gateways to communicate application specific messages. The reason for recommending a Big Data/NoSQL dominated data architecture for IIoT is quite simple. These systems provide Schema on Read, which is an innovative data handling technique. In this model, a format or schema is applied to data as it is accessed from a storage location, as opposed to doing the same while it is ingested. From an IIoT standpoint, one must deal not just with the data itself but also with metadata such as timestamps, device id, and other firmware data such as software version, device manufactured data etc. The data sent from the device layer will consist of time series data and individual measurements.

The IIoT data stream can thus be visualized as a constantly running data pump handled by a Big Data pipeline that takes the raw telemetry data from the gateways, decides which events are of interest and discards the ones not deemed significant from a business standpoint. Apache NiFi is your gateway and gate keeper. It ingests the raw data, manages the flow of thousands of producers and consumers, does basic data enrichment, sentiment analysis in stream, aggregation, splitting, schema translation, format conversion and other initial steps to prepare the data. It does all that with a user-friendly web UI and an easily extendible architecture. It will then send raw or processed data to Kafka for further processing by Apache Storm, Apache Spark or other consumers. Apache Storm is a distributed real-time computation engine that reliably processes unbounded streams of data. Storm excels at handling complex streams of data that require windowing and other complex event processing. While Storm processes stream data at scale, Apache Kafka distributes messages at scale. Kafka is a distributed pub-sub real-time messaging system that provides strong durability and fault tolerance guarantees. NiFi, Storm and Kafka naturally complement each other, and their powerful cooperation enables real-time streaming analytics for fast-moving big data. All the stream processing is handled by the NiFi-Storm-Kafka combination.
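
As a small stand-in for the NiFi-to-Kafka hop described above, the sketch below publishes an enriched telemetry event to a Kafka topic using the kafka-python client; the broker address and topic name are assumptions. Storm or Spark Streaming jobs would then consume from this topic.

```python
import json

from kafka import KafkaProducer

# Producer pointed at the (hypothetical) Kafka broker used by the ingest pipeline
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker.example.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"device_id": "turbine-9", "rpm": 1810, "bearing_temp_c": 94.2}
producer.send("iiot-telemetry-enriched", value=event)   # assumed topic name
producer.flush()
```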

Apache NiFi, Storm and Kafka integrate very closely to manage streaming dataflows.

 

Appropriate logic is built into the higher layers to support device identification, ID lookup, secure authentication and transformation of the data. This layer will process data (cleanse, transform, apply a canonical representation) to support Business Automation (BPM), BI (business intelligence) and visualization for a variety of consumers. The data ingest layer will also provide notifications and alerts via Apache NiFi.

Here are some typical uses for this event processing pipeline:

a. Real-time data filtering and pattern matching

b. Enrichment based on business context

c. Real-time analytics such as KPIs, complex event processing etc

d. Predictive Analytics

e. Business workflow with decision nodes and human task nodes

Digital Application Tier – 

Once IIoT knowledge has become part of the Hadoop based Data Lake, all the rich analytics, machine learning and deep learning frameworks, tools and libraries now become available to Data Scientists and Analysts.   They can easily produce insights, dashboards, reports and real-time analytics with IIoT data joined with existing data in the lake including social media data, EDW data, log data.   All your data can be queried with familiar SQL through a variety of interfaces such as Apache Phoenix on HBase, Apache Hive LLAP and Apache Spark SQL.   Using your existing BI tools or the open sourced Apache Zeppelin, you can produce and share live reports.   You can run TensorFlow in containers on YARN for deep learning insights on your images, videos and text data; while running YARN clustered Spark ML pipelines fed by Kafka and NiFi to run streaming machine learning algorithms on trained models.
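
A small PySpark sketch of this kind of lake-side analysis is shown below. The table names and columns are hypothetical and assume Hive tables have already been defined over the lake.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iiot-lake-analytics")
    .enableHiveSupport()
    .getOrCreate()
)

# Join device telemetry with asset master data and compute a simple weekly KPI
df = spark.sql("""
    SELECT a.plant, t.device_id, AVG(t.bearing_temp_c) AS avg_bearing_temp
    FROM telemetry t
    JOIN asset_master a ON t.device_id = a.device_id
    WHERE t.event_date >= date_sub(current_date(), 7)
    GROUP BY a.plant, t.device_id
""")
df.show(20)
```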

A range of predictive applications are suitable for this tier. The models themselves should seek to answer business questions around things like asset failure, the key performance indicators in a manufacturing process and how they’re trending, insurance policy pricing etc.
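
To illustrate the asset-failure style of model mentioned above, here is a toy scikit-learn sketch trained on synthetic sensor features. It is purely illustrative; a real model would be trained on historical telemetry and failure labels drawn from the lake.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features standing in for e.g. temperature, vibration and load readings
X = rng.normal(size=(5000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 1.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```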

Once the device data has been ingested into a modern data lake, key functions that need to be performed include data aggregation, transformation, enriching, filtering, sorting etc.

As one can see, this can get very complex very quickly – both from a data storage and a processing standpoint. A Cloud based infrastructure, with its ability to provide highly scalable compute, network and storage resources, is a natural fit to handle bursty IIoT applications. However, IIoT applications add their own diverse computing infrastructure requirements, namely the ability to accommodate hundreds of kinds of devices and network gateways – which means that IT must be prepared to support a large diversity of operating systems and storage types.

The tier is also responsible for the integration of the IIoT environment into the business processes of an enterprise. The IIoT solution ties into existing line-of-business applications and standard software solutions through adapters or Enterprise Application Integration (EAI) and business-to-business (B2B) gateway capabilities. End users in business-to-business or business-to-consumer scenarios will interact with the IIoT solution and the special-purpose IIoT devices through this layer. They may use the IIoT solution or line-of-business system UIs, including apps on personal mobile devices, such as smartphones and tablets.

Security Implementation

The topic of Security is perhaps the most important cross cutting concern across all layers of the IIoT architecture stack. Needless to say, each of the layers must support the strongest data encryption, authentication and authorization capabilities for devices, users and partner applications. Accordingly, capabilities must be provided to ingest and store security feeds, IDS logs for advanced behavioral analytics, server logs, and device telemetry. These feeds must be constantly analyzed across three domains – the Device domain, the Business domain and the IT domain. The below blogpost delves into some of these themes and is a good read to get a deeper handle on this issue from a SOC (security operations center) standpoint.

An Enterprise Wide Framework for Digital Cybersecurity..(4/4)

Conclusion

It is evident from the above that IIoT will create enormous opportunity for businesses globally. It will also create layers of complexity and opportunity for Enterprise IT. The creation of smart digital services on the data served up will further depend on the vertical industries. Whatever the kind of business model – whether tracking behavior, location sensitive pricing, business process automation etc – the end goal of the IT architecture should be to create enterprise business applications that are ultimately data native and analytics driven.


Demystifying Digital – Reference Architecture for Single View of Customer / Customer 360..(3/3)

The first post in this three part series on Digital Foundations @ http://www.vamsitalkstech.com/?p=2517 introduced the concept of Customer 360 or Single View of Customer (SVC). The second post in the series discussed the concept of Customer Journey Mapping (CJM) – http://www.vamsitalkstech.com/?p=3099 – and the specific benefits from both a business & operational standpoint that are enabled by SVC & CJM. This third & final post will focus on the technical design & architecture needed to achieve both these capabilities.

Business Requirements for Single View of Customer & Customer Journey Mapping…

The following key business requirements need to be supported for three key personas- Customer, Marketing & Customer Service – from a SVC and CJM standpoint.

  1. Provide an Integrated Experience: A fully integrated omnichannel experience for both the customer and internal stakeholder (marketing, customer service, regulatory, managerial etc) roles. This means a few important elements – consistent information across all touchpoints, the right information to the right user at the right time, an ability to view the CJM graph with realtime metrics on Customer Lifetime Value (CLV) etc.
  2. Continuously Learning Customer Facing System: An ability for the customer facing portion of the architecture to learn constantly to fine-tune its understanding of the customer’s real time picture. This includes an ability to understand the customer’s journey.
  3. Contextual yet Seamless Movement across Channels: The ability for customers to transition seamlessly from one channel to the other while conducting business transactions.
  4. Ability to introduce Marketing Programs for existing Customers: An ability to introduce marketing, customer retention and other loyalty programs in a dynamic manner. These include an ability to combine historical data with real time data about customer interactions and other responses like clickstreams – to provide product recommendations and real time offers.
  5. Customer Acquisition: An ability to perform low cost customer acquisition and to be able to run customized offers for segments of customers from a back-office standpoint.

Key Gaps in existing Single View (SVC) Architectures ..

It needs to be kept in mind that every organization is different from an IT legacy investment and operational standpoint. As such, a “one-size-fits-all” architecture is impossible to create. However, highlighted below are some common key data and application architecture gaps that I have observed from a data standpoint while driving to a SVC (Single View of Customer) with multiple leading enterprises.

  1. The lack of a single, unique & global customer identifier – The need to create a single universal customer identifier (based on various departmental or line of business identifiers) and to use it as a primary key in the customer master list
  2. Once the identifier is created in either the source system or in the datalake, organizations need to figure out a way to cascade that identifier into the Book of Record systems (CRM systems, webapps and ERP systems) so that the architecture can begin knitting together a single view of the customer. This may also involve periodically going out across the BOR systems, linking all the customer’s data and pulling the data into the lake;
  3. Many companies deal with multiple customer on-boarding systems. At some point, these on-boarding processes need to be centralized. For instance in Banking, especially in Capital Markets, customer on-boarding is done in six or seven different areas; all of these ideally need to be consolidated into one.
  4. Graph Data Semantics – Once created, the Master Customer identifier should be mapped to all the other identifiers that lines of business use to uniquely identify their customer; the ability to use simple or more complex matching techniques (rule based matching, machine learning based matching & search based matching) is highly called for.
  5. MDM (Master Data Management) systems have traditionally automated some of this process by creating & owning that unique customer identifier. However, Big Data capabilities help by linking that unique customer identifier to all the other ways the customer may be mapped across the organization. To this end, data may be exported into an MDM system backed by a traditional RDBMS; or, the computation of the unique identifier can be done in a data lake and then exported into an MDM system.

Let us discuss the generic design of the architecture (depicted above) with a focus on the following subsystems –

A Reference Architecture for Single View of Customer/ Customer 360
  1. At the very top, different channels are depicted with different touch points. In today’s connected world, the customer experience spans multiple touch points throughout the customer lifecycle. A customer should be able to move through multiple different touch points during the buying process. Customers should be able to start, pause transactions (e.g. an Auto Loan application) from one channel and restart/complete them from another.
  2. A Big Data enabled application architecture is chosen. This needs to account for two different data processing paradigms. The first is a realtime component. The architecture must be capable of handling events within a few milliseconds. The second is an ability to handle massive scale data analysis in a retrospective manner. Both these components are provided by a Hadoop stack. The real time component leverages – Apache NiFi, Apache HBase, HDFS, Kafka, Storm and Spark. The batch component leverages  HBase, Apache Titan, Apache Hive, Spark and MapReduce.
  3. The range of Book of Record and external systems sends data into the central datalake. Both the realtime and batch components highlighted above send data into the lake. The design of the lake itself is covered in more detail in the section below.
  4. Starting from the upper-left side, we have the Book of Record systems sending across transactions. These are ingested into the lake using any of the ingestion frameworks provided in Hadoop, e.g. Flume, Kafka, Sqoop, or the HDFS API for batch transfers. The ingestion layer depicted is based on Apache NiFi and is used to load data into the data lake. Functionally, it is made up of real time data loaders and end of day data loaders. The real time loaders load the data as it is created in the feeder systems; the EOD data loaders adjust the data at end of day based on the P&L sign off and the end of day close processes. The main data feeds for the system will be from the book of record transaction systems (BORTS) but there may also be multiple data feeds from transaction data providers and customer information systems.
  5. The UI Framework is standardized across all kinds of clients. For instance, this could be an HTML5 GUI framework that contains reusable widgets usable in both mobile and browser based applications. The framework also needs to deal with common mobile issues such as bandwidth and be able to automatically throttle the data back where bandwidth is limited. It also needs to facilitate the construction of large user defined pivot tables for ad hoc reporting. It utilizes UI framework components for its GUI construction and communicates with the application server via the web services layer.
  6. API access is also provided by Web Services for partner applications to leverage: this is the application layer that provides a set of RESTful web services that control the GUI behavior and that control access to the persistent data and the data that is cached on the data fabric.
  7. The transactions are taken through a pipeline of enrichment and the profiles of customers are stored in HBase (a minimal sketch of this step follows this list).
  8. The core data processing platform is then based on a datalake pattern which has been covered in this blog before. It includes the following pattern of processing.
    1. Data is ingested in real time into an HBase database (which uses HDFS as the underlying storage layer). Tables are designed in HBase to store the profile of a trade and its lifecycle.
    2. Producers are authenticated at the point of ingest.
    3. Once the data has been ingested into HDFS, it is taken through a pipeline of processing (L0 to L3) as depicted in the below blogpost.

      http://www.vamsitalkstech.com/?p=667

  9. Speed Layer: The computational grid that makes up the Speed layer can be a distributed in-memory data fabric like Infinispan or GemFire, or a computation process can be overlaid directly onto a stateful data fabric technology like Spark or GemFire. The choice depends on the language choices that have been made in building the other key analytic libraries. If multiple language bindings are required (e.g. C# & Java) then the data fabric will typically be a different product than the grid.
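As a minimal sketch of the enrichment step referenced in item 7 above, the snippet below consumes customer events from Kafka and upserts a profile row in HBase. It assumes the kafka-python and happybase client libraries, a running HBase Thrift gateway, a 'customer_events' topic and a 'customer_profile' table with column family 'p' – all of which are illustrative assumptions rather than fixed choices in the reference architecture.

```python
# Minimal sketch: consume customer events from Kafka and upsert a profile row in HBase.
# Topic name, table name, column family and event fields are illustrative assumptions.
import json
import happybase
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "customer_events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
connection = happybase.Connection("localhost")   # HBase Thrift gateway
profiles = connection.table("customer_profile")

for event in consumer:
    e = event.value
    row_key = e["customer_id"].encode("utf-8")
    # Each incoming event enriches the single-view profile row for that customer
    profiles.put(row_key, {
        b"p:last_channel": e.get("channel", "").encode("utf-8"),
        b"p:last_event_type": e.get("event_type", "").encode("utf-8"),
        b"p:last_event_ts": str(e.get("timestamp", "")).encode("utf-8"),
    })
```

In the full architecture, the same pipeline would typically run inside Storm or Spark Streaming rather than a standalone consumer, but the row-per-customer profile pattern is the same.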

Data Science for Customer 360

 Consider the following usecases that are all covered under Customer 360 –

  1. The ability to segment customers into categories based on granular data attributes
  2. Improve customer targeting for new promotions & increasing acquisition rate
  3. Increasing cross sell and upsell rates
  4. Understanding influencers among customer segments & helping these net promoters recommend products to other customers
  5. Performing market basket analysis of what products/services are typically purchased together
  6. Understanding customer risk profiles
  7. Creating realtime views of customer lifetime value (CLV)
  8. Reducing customer attrition

The obvious capability that underlies all of these is Data Science. Thus, Predictive Analytics is the key compelling paradigm that enables the buildout of the dynamic Customer 360.

The Predictive Analytics workflow always starts with a business problem in mind. Examples include “a marketing project to detect which customers are likely to buy new products or services in the next six months based on their historical & real time product usage patterns – which are denoted by x, y or z characteristics”, “detect realtime fraud in credit card transactions” or “trigger specific actions based on the predictions”. In usecases like these, the goal of the data science process is to segment & filter customers by corralling them into categories that enable easy ranking. Once this is done, the business is involved to set up easy and intuitive visualizations to present the results. In the machine learning process, an entire spectrum of algorithms can be tried to solve such business problems. A minimal segmentation sketch is shown below.
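The sketch below shows one common way to corral customers into rankable categories: clustering on a handful of behavioral features with scikit-learn. The RFM-style features, the toy data and the cluster count are illustrative assumptions, not a recommendation of a specific model.

```python
# A minimal customer segmentation sketch using scikit-learn's KMeans.
# Feature names (recency, frequency, monetary value) and cluster count are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy RFM-style features: [days_since_last_purchase, purchases_per_year, annual_spend]
X = np.array([
    [5,   40, 12000.0],
    [200,  2,   150.0],
    [30,  15,  3000.0],
    [7,   35, 10000.0],
    [180,  3,   400.0],
])

X_scaled = StandardScaler().fit_transform(X)          # put features on a common scale
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
print(model.labels_)   # cluster id per customer, used to rank and filter segments
```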

A lot of times, business groups working on Customer 360 projects have a hard time explaining what they would like to see – both the data and the visualization. In such cases, a prototype makes things much easier from a requirements gathering standpoint. Once the problem is defined, the data scientist/modeler identifies the raw data sources (both internal and external) needed to execute the business challenge. They spend a lot of time collating the data (from Oracle, DB2, Mainframe, Greenplum, Excel sheets, external datasets etc.). The cleanup process involves fixing missing values, corrupted data elements, formatting fields that indicate time and date etc.

The Data Scientist working with the business needs to determine how much of this raw data is useful and how much of it needs to be massaged to create a Customer 360 view. Some of this data needs to be extrapolated to form the features using formulas – so that a model can be created. The models created often involve using languages such as R and Python.

Feature engineering takes in business features in the form of feature vectors and creates predictive features from them. The Data Scientist takes the raw features and creates a model using a mix of various algorithms. Once the model has been repeatedly tested for accuracy and performance, it is typically deployed as a service.

The transformation phase involves writing code to join up like elements so that a single client’s complete dataset is gathered in the Data Lake from a raw features standpoint. If more data is obtained while the development cycle is underway, the Data Science team has no option but to go back & redo the whole process.

Models as a Service (MaaS) is the Data Science counterpart to Software as a Service. A MaaS takes in business variables (often hundreds of them as inputs) and provides as output business decisions/intelligence, measurements and visualizations that augment decision support systems. A minimal serving sketch is shown below.
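A minimal MaaS sketch, assuming Flask as the serving framework: the endpoint name, input fields and the scoring function are illustrative placeholders for whatever model the data science team has actually trained and serialized.

```python
# A minimal Models-as-a-Service sketch using Flask.
# The /score endpoint, input field names and scoring logic are illustrative assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

def score_customer(features: dict) -> float:
    # Placeholder for a real model (e.g. one loaded from a pickled/PMML/ONNX artifact)
    return min(1.0, 0.3 * features.get("recent_purchases", 0) / 10
                    + 0.7 * features.get("engagement_score", 0.0))

@app.route("/score", methods=["POST"])
def score():
    features = request.get_json(force=True)
    return jsonify({"propensity_to_buy": score_customer(features)})

if __name__ == "__main__":
    app.run(port=8080)
```

A downstream serving layer (or CRM application) would POST a JSON document of business variables to /score and receive a decision or score back, which is the essence of the MaaS pattern described above.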

Once these models are deployed and updated nightly based on their performance – the serving layer takes advantage of them to drive real time 360 decisioning.

To Sum Up…

In this short series we have discussed how customers, and data about their history, preferences, patterns of behavior and aspirations, are the most important corporate asset. Big Data technology and advances made in data storage, processing and analytics can help architect a dynamic Single View that can help maximize competitive advantage across every industry vertical.

The Definitive Reference Architecture for Market Surveillance (CAT, UMIR and MiFiD II) in Capital Markets..

We have discussed the topic of market surveillance reporting in some depth in previous blogs, e.g. http://www.vamsitalkstech.com/?p=2984. Over the last decade, global financial markets have embraced the high speed of electronic trading. This trend has only accelerated with the concomitant explosion in trading volumes. The diverse range of instruments & the proliferation of trading venues pose massive regulatory challenges in the area of market conduct supervision and abuse prevention. Banks, broker dealers, exchanges and other market participants across the globe are now shelling out millions of dollars in fines for failures to accurately report on market abuse violations. In response to this complex world of high volume & low touch electronic trading, global capital markets regulators have been hard at work across different jurisdictions & global hubs, e.g. FINRA in the US, IIROC in Canada and ESMA in the European Union. Regulators have created extensive reporting regimes for surveillance with a view to detecting suspicious patterns of trade behavior (e.g. dumping, quote stuffing & non-bonafide fake orders). The intent is to increase market transparency on both the buy and the sell side. Given the scrutiny Capital Markets players are under, a Big Data Analytics based architecture has become a “must-have” to ensure timely & accurate compliance with these mandates. This blog attempts to discuss such a reference architecture.

Business Technology Requirements for Market Surveillance..

The business requirements for the Surveillance architecture are covered at the below link in more detail but are reproduced below in a concise fashion.

A POV on European Banking Regulation.. MAR, MiFiD II et al

Some of the key business requirements that can be distilled from regulatory mandates include the below:

  • Store heterogeneous data – Both MiFiD II and MAR mandate the need to perform trade monitoring & analysis on not just real time data but also historical data spanning a few years. Among others this will include data feeds from a range of business systems – trade data, eComms, aComms, valuation & position data, order management systems, position management systems, reference data, rates, market data, client data, front, middle & back office data, voice, chat & other internal communications etc. To sum up, the ability to store a range of cross asset (almost all kinds of instruments), cross format (structured & unstructured, including voice) and cross venue (exchange, OTC etc.) trading data with a high degree of granularity is key.
  • Data Auditing – Such stored data needs to be fully auditable for 5 years. This implies not just being able to store it but also putting capabilities in place to ensure strict governance & audit trail capabilities.
  • Manage a huge volume increase in data storage requirements (5+ years) due to extensive Record keeping requirements
  • Perform Realtime Surveillance & Monitoring of data – Once data is collected, normalized & segmented, the platform will need to support realtime monitoring of data (around 5 seconds) to ensure that every trade can be tracked through its lifecycle. Detecting patterns that connote market abuse and monitoring for best execution are key.
  • Business Rules – Core logic that identifies some of the above trade patterns is created using business rules. Business rules have been covered in various areas of this blog but they primarily work on an IF..THEN..ELSE construct (a minimal sketch follows this list).
  • Machine Learning & Predictive Analytics – A variety of supervised and unsupervised learning approaches can be used to perform extensive behavioral modeling & segmentation to discover transaction behavior, with a view to identifying behavioral patterns of traders & any outlier behaviors that connote potential regulatory violations.
  • A Single View of an Institutional Client – From the firm’s standpoint, it would be very useful to have a single view capability for clients that shows all of their positions across multiple desks, risk positions, KYC score etc.
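To make the IF..THEN..ELSE rule style mentioned above concrete, here is a minimal sketch expressing surveillance rules as predicate/action pairs in Python. The thresholds, field names and actions are illustrative assumptions; a production system would externalize these into a proper rules engine.

```python
# A minimal sketch of the IF..THEN..ELSE business rule style, as predicate/action pairs.
# Thresholds, trade field names and action names are illustrative assumptions.
RULES = [
    # (rule name, predicate over a trade dict, action to take when it fires)
    ("large_notional",  lambda t: t["notional"] > 10_000_000,  "escalate_to_compliance"),
    ("off_hours_trade", lambda t: not (8 <= t["hour"] <= 18),  "flag_for_review"),
    ("self_match",      lambda t: t["buyer"] == t["seller"],   "raise_alert"),
]

def evaluate(trade: dict):
    """Return the list of actions triggered by a single trade record."""
    return [action for name, predicate, action in RULES if predicate(trade)]

if __name__ == "__main__":
    trade = {"notional": 25_000_000, "hour": 22, "buyer": "DESK-A", "seller": "DESK-B"}
    print(evaluate(trade))   # ['escalate_to_compliance', 'flag_for_review']
```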

A Reference Architecture for Market Surveillance ..

This reference architecture aims to provide generic guidance to banking Business IT Architects building solutions in the realm of Market & Trade Surveillance. This supports a host of hugely important global regulatory reporting mandates – CAT, MiFiD II, MAR etc – that Capital Markets firms need to comply with. While the concepts discussed in this solution architecture are definitely Big Data oriented, they are largely agnostic to any cloud implementation – private, public or hybrid.

A Market Surveillance system needs to include both real time surveillance of trading activity as well as a retrospective (batch oriented) analysis component. The real time component includes the ability to perform realtime calculations (concerning thresholds, breached limits etc.) and real time queries with the goal of triggering alerts. Both kinds of analytics span structured and unstructured data sources. For the batch component, the analytics range from data queries and simple to advanced statistics (min, max, avg, std deviation, sorting, binning, segmentation) to running data science models involving text analysis & search.

The system needs to process tens of millions to billions of events in a trading window while providing the highest uptime guarantees. Batch analysis is always running in the background.

A Hadoop distribution that includes components such as Kafka and HBase along with near real time components such as Storm & Spark Streaming provides a good fit for a responsive architecture. Apache NiFi, with its ability to ingest data from a range of sources, is preferred for its support for complex data routing, transformation, and system mediation logic in a complex event processing architecture. The capabilities of Hortonworks Data Flow (the enterprise version of Apache NiFi) are covered in the below blogpost in much detail.

Use Hortonworks Data Flow (HDF) To Connect The Dots In Financial Services..(3/3)

A Quick Note on Data Ingestion..

Data volumes in the area of regulatory reporting range from huge to insanely massive. For instance, at large banks, they can go up to hundreds of millions of transactions a day. At market venues such as stock exchanges, they easily run into hundreds of billions of messages every trading day. However, the data itself is extremely powerful & is really business gold in terms of allowing banks not just to file mundane regulatory reports but also to perform critical line of business processes such as Single View of Customer, Order Book Analysis, TCA (Transaction Cost Analysis), Algo Backtesting, Price Creation Analysis etc. The architecture thus needs to support multiple modes of storage, analysis and reporting, ranging from compliance reporting to data science to business intelligence.

Real time processing in this proposed architecture is powered by Apache NiFi. There are five important reasons for this decision – 

  • First of all, complex rules can be defined in NiFi in a very flexible manner. As an example, one can execute SQL queries in processor A against incoming data from any source (data that isn’t from a relational database but JSON, Avro etc.) and then route different results to different downstream processors based on the needs for processing, while enriching it. E.g. processor A could be event driven and, if any data is being routed there, a field can be added or an alert sent to XYZ. Essentially this can become very complex – equivalent to a nested rules engine, so to speak.
  • From a throughput standpoint, a single NiFi node can typically handle somewhere between 50MB/s and 150MB/s depending on the hardware spec and data structure. Assuming average messages of 100-500 KB and a required throughput of 600MB/s, the architecture can be sized to about 5-10 NiFi nodes (a back-of-the-envelope sizing sketch follows this list). It is important to note that the latency of inbound message processing depends on the network and can be extremely small: under the hood, data is sent from the source to a NiFi node (disk), some attributes are extracted in memory for processing, and the data is delivered to the target system.
  • Data quality can be handled via the aforementioned “nested rules engine” approach, consisting of multiple NiFi processors. One can even embed an entire rules engine into a single processor. Similarly, simple authentication rules can be defined at the event level. For instance, if Field A = English, route the message to an “authenticated” relationship; otherwise send it to an “unauthenticated” relationship.

  • One of the cornerstones of NiFi is “Data Provenance”, which allows end to end traceability. Not only can the event lifecycle of trade data be traced, but one can also track the time at which a change happened, the user role that made the change and metadata around why it happened.

  • Security – NiFi enables authentication at ingest. One can authenticate data via the rules defined in NiFi, or leverage target system authentication, which is implemented at the processor level. For example, the PutHDFS processor supports kerberized HDFS; the same applies for Solr and so on.
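As referenced in the throughput bullet above, the NiFi tier sizing is simple arithmetic. The sketch below works the numbers quoted above; the per-node throughput range is an assumption that depends heavily on hardware and data shape.

```python
# Back-of-the-envelope sizing for the NiFi tier, using the numbers quoted above.
# Per-node throughput (50-150 MB/s) is an assumption tied to hardware and data shape.
import math

def nifi_nodes_needed(target_mb_per_s: float, per_node_mb_per_s: float) -> int:
    return math.ceil(target_mb_per_s / per_node_mb_per_s)

target = 600.0                                # required aggregate throughput in MB/s
print(nifi_nodes_needed(target, 150.0))       # optimistic per-node estimate -> 4 nodes
print(nifi_nodes_needed(target, 50.0))        # conservative per-node estimate -> 12 nodes
```

These bounds bracket the 5-10 node range quoted above; the actual cluster size should be validated with a load test on representative message sizes.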

Overall Processing flow..

The below illustration shows the high-level conceptual architecture. The architecture is composed of core platform services and application-level components to facilitate the processing needs across three major areas of a typical surveillance reporting solution:

  • Connectivity to a range of trade data sources
  • Data processing, transformation & analytics
  • Visualization and business connectivity
Reference Architecture for Market Surveillance Reg Reporting – CAT, MAR, MiFiD II et al

The overall processing of data follows the order shown below –

  1. Data Production – Data related to trades and their lifecycle is produced from a range of business systems, including but not limited to trade data, valuation & position data, order management systems, position management systems, reference data, rates, market data, client data, front, middle & back office data, voice, chat & other internal communications.
  2. Data Ingestion – Data produced from the above layer is ingested using Apache NiFi from the range of sources described above. Data can also be filtered and alerts can be set up based on complex event logic. For time series data, HBase can be leveraged along with OpenTSDB. For CEP requirements such as sliding windows and complex operators, NiFi can be leveraged along with a Kafka and Storm pipeline. Using NiFi makes it easier to load data into the data lake while applying guarantees around the delivery itself. Data can be streamed in real time as it is created in the feeder systems. Data is also loaded at end of the trading day based on the P&L sign off and the end of day close processes. The majority of the data will be fed in from Book of Record trading systems as well as from market data providers.
  3. As trade and other data is ingested into the data lake, it is important to note that the route in which certain streams are processed will differ from how other streams are processed. Thus the ingest architecture needs to support multiple types of processing, ranging from in memory processing to intermediate transformation processing on certain data streams to produce a different representation of the stream. This is where NiFi adds critical support in not just handling a huge transaction throughput but also enabling “on the fly processing” of data in pipelines. As mentioned, NiFi does this via the concept of “processors”.
  4. The core data processing platform is then based on a datalake pattern which has been covered in this blog before. It includes the following pattern of processing.
    1. Data is ingested in real time into an HBase database (which uses HDFS as the underlying storage layer). Tables are designed in HBase to store the profile of a trade and its lifecycle.
    2. Producers are authenticated at the point of ingest.
    3. Once the data has been ingested into HDFS, it is taken through a pipeline of processing (L0 to L3) as depicted in the below blogpost.

      http://www.vamsitalkstech.com/?p=667

    4. Historical data (defined as T+1), once in the HDFS tier, is taken through layers of processing as discussed above. One of the key areas of processing is to run machine learning on the data to discover any hidden patterns in the trades themselves – patterns that can connote a range of suspicious behavior. Most surveillance applications are based on a search for data that breaches thresholds and seek to match sell & buy orders. The idea is that when these rules are breached, alerts are generated for compliance officers to conduct further investigation. However, this method falls short with complex types of market abuse. A range of supervised learning techniques can then be applied on the data, such as creating a behavioral profile of different kinds of traders (for instance junior and senior) by classifying & then scoring them based on their likelihood to commit fraud. Thus a range of surveillance analytics can be performed on the data. Apache Spark is highly recommended for near realtime processing, not only due to its high performance characteristics but also due to its native support for graph analytics and machine learning – both of which are critical to surveillance reporting. For a deeper look at data science, I recommend the below post.

      http://www.vamsitalkstech.com/?p=1846

    5. The other important value driver in deploying Data Science is to perform Advanced Transaction Monitoring Intelligence. The core idea is to get years worth of trade data in one location (i.e. the data lake) & then apply unsupervised learning to glean patterns in those transactions. The goal is then to identify profiles of actors with the intent of feeding them into existing downstream surveillance & TM systems (a minimal anomaly detection sketch follows this list).
    6. This knowledge can then be used to constantly learn transaction behavior for similar traders. This can be a very important capability in detecting fraud across traders, customer accounts and instruments. Some of the usecases are –
      • Profile trading activity of individuals with similar traits (types of customers, trading desks & instruments, geographical areas of operations etc.) to perform Know Your Trader
      • Segment traders by similar experience levels and behavior
      • Understand common fraudulent behavior typologies (e.g. spoofing) and cluster such (malicious) trading activities by trader, instrument, volume etc., the goal being to raise cases in the appropriate downstream investigation & case management systems
      • Using advanced data processing techniques like Natural Language Processing, constantly analyze electronic communications and join them up with trade data sources to detect under-the-radar activity while keeping the false positive rate low.
    7. Graph Database – Given that most kinds of trading fraud happen in groups of actors – traders acting in collusion with verification & compliance staff – the ability to view complex relationships of interactions and the strength of those interactions can be a significant monitoring capability
    8. Grid Layer – To improve performance, I propose the usage of a distributed in memory data fabric like JBOSS DataGrid or Pivotal GemFire. This can aid in two ways –

      a. Help with fast lookup of data elements by the visualization layer
      b. Help perform fast computation process by overlaying a framework like Spark or MapReduce directly onto a stateful data fabric.

      The choice of tools here depends on the language choices that have been made in building the pricing and risk analytic libraries across the bank. If multiple language bindings are required (e.g. C# & Java) then the data fabric will typically be a different product than the grid.
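The anomaly detection sketch referenced above: a minimal example of the unsupervised profiling idea using scikit-learn's IsolationForest to surface outlier trading behavior from per-trader behavioral features. The feature names, toy values and contamination setting are illustrative assumptions; production surveillance would use far richer features and feed the flagged profiles into the downstream case management systems mentioned earlier.

```python
# A minimal unsupervised trader-profiling sketch using scikit-learn's IsolationForest.
# Feature names and the toy rows are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

# One row per trader-day: [order_count, cancel_ratio, avg_notional, off_hours_ratio]
behavior = np.array([
    [120, 0.10, 1.2e6, 0.02],
    [115, 0.12, 1.1e6, 0.03],
    [130, 0.09, 1.3e6, 0.01],
    [480, 0.85, 9.5e6, 0.40],   # unusual: heavy cancellations, large size, off hours
])

model = IsolationForest(contamination=0.25, random_state=42).fit(behavior)
print(model.predict(behavior))   # -1 marks anomalous trader profiles for investigation
```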

      Data Visualization…

      The visualization solution chosen should enable the quick creation of interactive dashboards that provide KPIs and other important business metrics from a process monitoring standpoint. Various levels of dashboards need to be created, ranging from compliance officer toolboxes to executive dashboards that help identify trends and discover valuable insights.

      Compliance Officer Toolbox (Courtesy: Arcadia Data)

      Additionally, the visualization layer shall provide

      a) A single view of Trader or Trade or Instrument or Entity

      b) Investigative workbench with Case Management capability

      c) The ability to follow the lifecycle of a trade

      d) The ability to perform ad hoc queries over multiple attributes

      e) Activity correlation across historical and current data sets

      f) Alerting on specific metrics and KPIs

      To Sum Up…

      The solution architecture described in this blogpost is designed with peaceful enterprise co-existence in mind, in the sense that it interacts and integrates with a range of BORT systems and other shared enterprise systems – ERP, CRM, legacy surveillance systems and any other line of business solutions that typically exist as shared enterprise resources.

Design and Architecture of A Robo-Advisor Platform..(3/3)

This three part series explores the automated investment management or “Robo-advisor” (RA) movement. The first post in this series @ http://www.vamsitalkstech.com/?p=2329 discussed how Wealth Management has been an area largely untouched by automation as far as the front office is concerned. As a result, automated investment vehicles have begun changing that trend and are helping create a variety of business models in the industry, especially those catering to the Millennial mass affluent segment. The second post @ http://www.vamsitalkstech.com/?p=2418 focused on the overall business model & main functions of a Robo-Advisor (RA). This third and final post covers a generic technology architecture for an RA platform.

Business Requirements for a Robo-Advisor (RA) Platform…

Some of the key business requirements of a RA platform that confer it advantages as compared to the manual/human driven style of investing are:

  • Collect Individual Client Data – RA Platforms need to offer a high degree of customization from the standpoint of an individual investor. This means an ability to provide a preferably mobile and web interface to capture detailed customer financial background, existing investments as well as any historical data regarding customer segments etc.
  • Client Segmentation – Clients are to be segmented  across granular segments as opposed to the traditional asset based methodology (e.g mass affluent, high net worth, ultra high net worth etc).
  • Algorithm Based Investment Allocation – Once the client data is collected,  normalized & segmented –  a variety of algorithms are applied to the data to classify the client’s overall risk profile and an investment portfolio is allocated based on those requirements. Appropriate securities are purchased as we will discuss in the below sections.
  • Portfolio Rebalancing  – The client’s portfolio is rebalanced appropriately depending on life event changes and market movements.
  • Tax Loss Harvesting – Tax-loss harvesting is the mechanism of selling securities that have a loss associated with them. By taking the loss, the idea is that the client can offset taxes on both gains and income. The sold securities are replaced by similar securities by the RA platform, thus maintaining the optimal investment mix (a minimal sketch follows this list).
  • A Single View of a Client’s Financial History – From the WM firm’s standpoint, it would be very useful to have a single view capability for an RA client that shows all of their accounts, interactions & preferences in one view.
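The tax-loss harvesting sketch referenced above: the ticker symbols, cost bases and the replacement mapping are illustrative assumptions, and a real platform would also have to enforce wash-sale rules, minimum holding periods and client-level constraints.

```python
# A minimal tax-loss harvesting check, following the description above.
# Tickers, cost bases, prices and the replacement map are illustrative assumptions;
# wash-sale rules and other constraints are deliberately ignored in this sketch.
positions = [
    {"ticker": "VTI", "cost_basis": 220.0, "price": 195.0, "shares": 100},
    {"ticker": "BND", "cost_basis":  80.0, "price":  82.0, "shares": 300},
]
replacements = {"VTI": "SCHB", "BND": "AGG"}   # similar (but not identical) securities

def harvest_candidates(positions, min_loss=500.0):
    trades = []
    for p in positions:
        loss = (p["cost_basis"] - p["price"]) * p["shares"]
        if loss >= min_loss:                                          # sell to realize the loss...
            trades.append(("SELL", p["ticker"], p["shares"]))
            trades.append(("BUY", replacements[p["ticker"]], p["shares"]))  # ...and keep market exposure
    return trades

print(harvest_candidates(positions))   # [('SELL', 'VTI', 100), ('BUY', 'SCHB', 100)]
```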

User Interface Requirements for a Robo-Advisor (RA) Platform…

Once a customer logs in using any of the digital channels supported (e.g. Mobile, eBanking, Phone etc.), they are presented with a single view of all their accounts. This view has a few critical areas – the Summary View (showing an aggregated view of their financial picture) and the Transfer View (allowing one to transfer funds across accounts with other providers).

The Summary View lists the below

  • Demographic info: Customer name, address, age
  • Relationships: customer rating influence, connections, associations across client groups
  • Current activity: financial products, account interactions, any burning customer issues, payments missed etc
  • Customer Journey Graph: which products or services they have been associated with since the time they first became a customer etc.

Depending on the client’s risk tolerance and investment horizon, the weighted allocation of investments across these categories will vary. To illustrate this, a Model Portfolio and an example are shown below.

Algorithms for a Robo-Advisor (RA) Platform…

There are a variety of algorithmic approaches that could be taken to building out an RA platform. However the common feature of all of these is to –

  • Leverage data science & statistical modeling to automatically allocate client wealth across different asset classes (such as domestic/foreign stocks, bonds & real estate related securities) and to automatically rebalance portfolio positions based on changing market conditions or client preferences. These investment decisions are also made based on a detailed behavioral understanding of a client’s financial journey metrics – age, risk appetite & other related information.
  • A mixture of different algorithms can be used, such as Modern Portfolio Theory (MPT), the Capital Asset Pricing Model (CAPM), the Black-Litterman model, the Fama-French model etc. These are used to allocate assets as well as to adjust positions based on market movements and conditions (a minimal allocation sketch follows this list).
  • RA platforms also provide 24×7 tracking of market movements and use that to drive rebalancing decisions from not just a portfolio standpoint but also from a taxation standpoint.
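The allocation sketch referenced above: a minimal MPT-style example computing the closed-form minimum-variance weights w = (Σ⁻¹1) / (1ᵀΣ⁻¹1) over a toy covariance matrix with numpy. The asset names and covariance values are illustrative assumptions; a production RA engine would estimate the covariance from market data and add constraints (long-only, bounds, Black-Litterman views etc.).

```python
# A minimal Modern Portfolio Theory style sketch: closed-form minimum-variance weights.
# Asset names and covariance values are illustrative assumptions.
import numpy as np

assets = ["US Stock", "Foreign Stock", "Bonds", "Real Estate"]
cov = np.array([
    [0.040, 0.018, 0.002, 0.010],
    [0.018, 0.050, 0.001, 0.012],
    [0.002, 0.001, 0.005, 0.002],
    [0.010, 0.012, 0.002, 0.030],
])

ones = np.ones(len(assets))
inv = np.linalg.inv(cov)
weights = inv @ ones / (ones @ inv @ ones)   # fully-invested minimum-variance portfolio
# Note: without an explicit long-only constraint, some weights can come out negative (short).

for name, w in zip(assets, weights):
    print(f"{name:15s} {w:6.1%}")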

Model Portfolios…

  1. Equity
     A) US Domestic Stock – Large Cap, Medium Cap, Small Cap, Dividend Stocks
     B) Foreign Stock – Emerging Markets, Developed Markets
  2. Fixed Income
     A) Developed Market Bonds
     B) US Bonds
     C) International Bonds
     D) Emerging Markets Bonds
  3. Other
     A) Real Estate
     B) Currencies
     C) Gold and Precious Metals
     D) Commodities
  4. Cash

Sample Portfolios – for an aggressive investor…

  1. Equity – 85%
     A) US Domestic Stock (50%) – Large Cap – 30%, Medium Cap – 10%, Small Cap – 10%, Dividend Stocks – 0%
     B) Foreign Stock (35%) – Emerging Markets – 18%, Developed Markets – 17%
  2. Fixed Income – 5%
     A) Developed Market Bonds – 2%
     B) US Bonds – 1%
     C) International Bonds – 1%
     D) Emerging Markets Bonds – 1%
  3. Other – 5%
     A) Real Estate – 3%
     B) Currencies – 0%
     C) Gold and Precious Metals – 0%
     D) Commodities – 2%
  4. Cash – 5%
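Building on the aggressive model allocation shown above, the sketch below computes the trades needed to bring a drifted portfolio back to target. The current holdings and total portfolio value are illustrative assumptions; drift thresholds, taxes and trading costs are ignored here.

```python
# A minimal rebalancing sketch using the aggressive model allocation shown above.
# Current holdings are illustrative; drift bands, taxes and trading costs are ignored.
targets  = {"Equity": 0.85, "Fixed Income": 0.05, "Other": 0.05, "Cash": 0.05}
holdings = {"Equity": 780_000.0, "Fixed Income": 90_000.0, "Other": 60_000.0, "Cash": 70_000.0}

total = sum(holdings.values())
for asset_class, target_weight in targets.items():
    drift = holdings[asset_class] - target_weight * total   # positive = overweight
    action = "SELL" if drift > 0 else "BUY"
    print(f"{asset_class:13s} {action} {abs(drift):>10,.0f}")
```

On a $1,000,000 portfolio this prints BUY 70,000 of Equity funded by selling 40,000 of Fixed Income, 10,000 of Other and 20,000 of Cash, restoring the 85/5/5/5 target mix.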

Technology Requirements for a Robo-Advisor (RA) Platform…

An intelligent RA platform has a few core technology requirements (based on the above business requirements).

  1. A Single Data Repository – A shared data repository, called a Data Lake, is created that can capture every bit of client data (explained in more detail below) as well as external data. The RA data lake provides more visibility into all data to a variety of different stakeholders. Wealth advisors access processed data to view client accounts etc. Clients can access their own detailed positions, account balances etc. The Risk group accesses this shared data lake to process position, execution and balance data. Data Scientists (or Quants) who develop models for the RA platform also access this data to perform analysis on fresh data (from the current workday) or on historical data. All historical data is available for at least five years, much longer than before. Moreover, the Hadoop platform enables ingest of data across a range of systems despite their having disparate data definitions and infrastructures. All the data that pertains to trade decisions and lifecycle needs to be made resident in a general enterprise storage pool that is run on HDFS (the Hadoop Distributed Filesystem) or a similar cloud based filesystem. This repository is augmented by incremental feeds of intra-day trading activity data that are streamed in using technologies like Sqoop, Kafka and Storm.
  2. Customer Data Collection – Existing financial data across the below categories is collected & aggregated into the data lake. This data ranges from customer data and reference data to market data & other client communications. All of this data can be ingested using an API or pulled into the lake from a relational system using connectors supplied in the RA data platform. Examples of data collected include the customer’s existing brokerage accounts, the customer’s savings accounts, behavioral finance surveys and questionnaires etc. The RA Data Lake stores all internal & external data.
  3. Algorithms – The core of the RA platform is its data science algorithms. Whatever algorithms are used, a few critical workflows are common to them. The first is Asset Allocation, which takes the customer’s input in the “ADVICE” tab for each type of account and tailors the portfolio based on that input. The others include Portfolio Rebalancing and Tax Loss Harvesting.
  4. The RA platform should be able to store market data across years, both from a macro and from an individual portfolio standpoint, so that several key risk measures such as volatility (e.g. position risk, any residual risk and market risk), Beta, and R-Squared can be calculated at multiple levels. This applies to individual securities, a specified index, and the client portfolio as a whole (a short computation sketch follows this list).
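The computation sketch referenced above: annualized volatility, Beta and R-squared computed from daily return series with numpy. The return data here is synthetic and purely illustrative; in the platform these series would come from the stored market data described above.

```python
# A minimal sketch of the portfolio risk measures mentioned above (volatility, Beta,
# R-squared) computed from daily return series. The return data is synthetic.
import numpy as np

np.random.seed(7)
market = np.random.normal(0.0004, 0.01, 250)                  # one year of daily index returns
portfolio = 0.9 * market + np.random.normal(0, 0.004, 250)    # correlated portfolio returns

volatility = portfolio.std(ddof=1) * np.sqrt(252)             # annualized volatility
beta = np.cov(portfolio, market)[0, 1] / market.var(ddof=1)   # sensitivity to the index
r_squared = np.corrcoef(portfolio, market)[0, 1] ** 2          # fraction of variance explained

print(f"volatility {volatility:.2%}  beta {beta:.2f}  r^2 {r_squared:.2f}")
```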

Illustration: Architecture of a Robo-Advisor (RA) Platform

The overall logical flow of data in the system –

  • Information sources are depicted at the left. These encompass a variety of institutional, system and human actors potentially sending thousands of real time messages per hour or by sending over batch feeds.
  • A highly scalable messaging system helps bring these feeds into the RA platform architecture as well as normalize them and send them in for further processing. Apache Kafka is a good choice for this tier. Realtime data is published by a range of systems over Kafka queues. Each of the transactions could potentially include hundreds of attributes that can be analyzed in real time to detect business patterns. We leverage Kafka integration with Apache Storm to read one message at a time and perform some kind of storage, like persisting the data into an HBase cluster. In a modern data architecture built on Apache Hadoop, Kafka (a fast, scalable and durable message broker) works in combination with Storm, HBase (and Spark) for real-time analysis and rendering of streaming data.
  • Trade data is thus streamed into the platform (on a T+1 basis), which ingests, collects, transforms and analyzes core information in real time. The analysis can be both simple and complex event processing, based on pre-existing rules that can be defined in a rules engine invoked via Apache Storm. A Complex Event Processing (CEP) tier can process these feeds at scale to understand relationships among them, where the relationships among these events are defined by business owners in a non-technical language or by developers in a technical language. Apache Storm integrates with Kafka to process incoming data.
  • For real time or batch analytics, Apache HBase provides near real-time, random read and write access to tables (or ‘maps’) storing billions of rows and millions of columns. In this case, once we store this rapidly and continuously growing dataset from the information producers, we are able to perform super fast lookups for analytics irrespective of the data size.
  • Data that has analytic relevance and needs to be kept for offline or batch processing can be stored using the Hadoop Distributed Filesystem (HDFS) or an equivalent filesystem such as Amazon S3, EMC Isilon or Red Hat Gluster. The idea is to deploy Hadoop oriented workloads (MapReduce, or Machine Learning) directly on the data layer. This is done to perform analytics on small, medium or massive data volumes over a period of time. Historical data can be fed into Machine Learning models created above and commingled with streaming data as discussed in step 1.
  • Horizontal scale-out (read Cloud based IaaS) is preferred as a deployment approach as this helps the architecture scale linearly as the loads placed on the system increase over time. This approach enables the RA engine to distribute the load dynamically across a cluster of cloud based servers based on trade data volumes.
  • It is recommended to take an incremental approach to building the RA platform: once all data resides in a general enterprise storage pool, it becomes accessible to many analytical workloads including Trade Surveillance, Risk, Compliance, etc. A shared data repository across multiple lines of business provides more visibility into all intra-day trading activities. Data can also be fed into downstream systems in a seamless manner using technologies like Sqoop, Kafka and Storm. The results of the processing and queries can be exported in various data formats – a simple CSV/txt format, more optimized binary formats or JSON – or a custom SerDe can be plugged in for custom formats. Additionally, with Hive or HBase, data within HDFS can be queried via standard SQL using JDBC or ODBC (a query sketch follows this list). The results will be in the form of standard relational DB data types (e.g. String, Date, Numeric, Boolean). Finally, REST APIs in HDP natively support both JSON and XML output by default.
  • Operational data across a range of asset classes, risk types and geographies is thus available to investment analysts during the entire trading window while markets are still open, enabling them to reduce the risk of that day’s trading activities. The specific advantages of this approach are two-fold: existing architectures are typically only able to hold a limited set of asset classes within a given system, which means that the data is only assembled for risk processing at the end of the day; in addition, historical data is often not available in sufficient detail. Hadoop accelerates a firm’s speed-to-analytics and also extends its data retention timeline.
  • Apache Atlas is used to provide data governance capabilities in the platform that use both prescriptive and forensic models, which are enriched by a given business’s data taxonomy and metadata. This allows for tagging of trade data between the different businesses’ data views, which is a key requirement for good data governance and reporting. Atlas also provides audit trail management as data is processed in a pipeline in the lake.
  • Another important capability that Big Data/Hadoop can provide is the establishment and adoption of a lightweight Entity ID service, which aids dramatically in the holistic viewing & audit tracking of trades. The service will consist of entity assignment for both institutional and individual traders. The goal here is to get each target institution to propagate the Entity ID back into their trade booking and execution systems; transaction data will then flow into the lake with this ID attached, providing a way to do Client 360.
  • Output data elements can be written out to HDFS, and managed by HBase. From here, reports and visualizations can easily be constructed. One can optionally layer in search and/or workflow engines to present the right data to the right business user at the right time.  
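The SQL-over-HDFS query sketch referenced above: a minimal example assuming the PyHive client against HiveServer2. The host, database and table names are illustrative assumptions, not part of the reference architecture itself.

```python
# A minimal sketch of the SQL-over-HDFS access pattern mentioned above, using PyHive
# against HiveServer2. Host, database, table and column names are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="hive-server.example.internal", port=10000, database="ra_lake")
cursor = conn.cursor()
cursor.execute(
    "SELECT client_id, SUM(market_value) AS portfolio_value "
    "FROM positions WHERE trade_date = '2016-12-30' "
    "GROUP BY client_id ORDER BY portfolio_value DESC LIMIT 10"
)
for client_id, portfolio_value in cursor.fetchall():
    print(client_id, portfolio_value)

cursor.close()
conn.close()
```

The same query could equally be issued over ODBC from a BI tool, which is how the reporting and visualization layer described above would typically consume the lake.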

Conclusion…

As one can see clearly, though automated investing methods are still in the early stages of maturity, they hold out a tremendous amount of promise. As they are unmistakably the next big trend in the WM industry, industry players should begin developing such capabilities.

The Three Core Competencies of Digital – Cloud, Big Data & Intelligent Middleware

“Ultimately, the cloud is the latest example of Schumpeterian creative destruction: creating wealth for those who exploit it; and leading to the demise of those that don’t.” – Joe Weinman, author of Cloudonomics: The Business Value of Cloud Computing


The  Cloud As a Venue for Digital Workloads…

As 2016 draws to a close, it can safely be said that no industry leader questions the existence of the new Digital Economy and the fact that every firm out there needs to create a digital strategy. Myriad organizations are taking serious business steps to making their platforms highly customer-centric via a renewed operational metrics focus. They are also working on creating new business models using their Analytics investments. Examples of these verticals include Banking, Insurance, Telecom, Healthcare, Energy etc.

As a general trend, the Digital Economy brings immense opportunities while also exposing firms to risks. Customers are now demanding highly contextual products, services and experiences – all accessible via easy APIs (Application Programming Interfaces).

Big Data Analytics (BDA) software revenues will grow from nearly $122B in 2015 to more than $187B in 2019, according to Forbes [1]. At the same time, it is clear that exploding data generation across the global economy has become a clear & present business phenomenon. Data volumes are rapidly expanding across industries. However, not only has the production of data itself increased; that increase is also driving the need for organizations to derive business value from it. As IT leaders know well, digital capabilities need low cost yet massively scalable & agile information delivery platforms – which only Cloud Computing can provide.

For a more detailed technical overview- please visit below link.

http://www.vamsitalkstech.com/?p=1833

Big Data & Big Data Analytics drive consumer interactions.. 

The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just to provide engaging visualization but also to personalize services clients care about across multiple channels of interaction. The only way to attain digital success is to understand your customers at a micro level while constantly making strategic decisions on your offerings to the market. Big Data has become the catalyst in this massive disruption as it can help businesses in any vertical solve their need to understand their customers better & perceive trends before the competition does. Big Data thus provides the foundational platform for successful business platforms.

The three key areas where Big Data & Cloud Computing intersect are – 

  • Data Science and Exploration
  • ETL, Data Backups and Data Preparation
  • Analytics and Reporting

Big Data drives business usecases in Digital in myriad ways – key examples include  –  

  1. Obtaining a realtime Single View of an entity (typically a customer across multiple channels, product silos & geographies)
  2. Customer Segmentation by helping businesses understand their customers down to the individual micro level as well as at a segment level
  3. Customer sentiment analysis by combining internal organizational data, clickstream data, sentiment analysis with structured sales history to provide a clear view into consumer behavior.
  4. Product Recommendation engines which provide compelling personal product recommendations by mining realtime consumer sentiment, product affinity information with historical data.
  5. Market Basket Analysis, observing consumer purchase history and enriching this data with social media, web activity, and community sentiment regarding past purchase and future buying trends.

Further, Digital implies the need for sophisticated, multifactor business analytics that need to be performed in near real time on gigantic data volumes. The only deployment paradigm capable of handling such needs is Cloud Computing – whether public or private. Cloud was initially touted as a platform to rapidly provision compute resources. Now, with the advent of Digital technologies, the Cloud & Big Data will combine to process & store all this information. According to IDC, by 2020 spending on Cloud based Big Data Analytics will outpace on-premise by a factor of 4.5. [2]

Intelligent Middleware provides Digital Agility.. 

Digital applications are modular, flexible and responsive to a variety of access methods – mobile & non-mobile. These applications are also highly process driven and support the highest degree of automation. The need of the hour is to provide enterprise architecture capabilities around designing flexible digital platforms that are built around efficient use of data, speed, agility and a service oriented architecture. The choice of open source is key as it allows for a modular and flexible architecture that can be modified and adopted in a phased manner – as you will shortly see.

The intention in adopting a SOA (or even a microservices) architecture for Digital capabilities is to allow lines of business an ability to incrementally plug in lightweight business services like customer on-boarding, electronic patient records, performance measurement, trade surveillance, risk analytics, claims management etc.

Intelligent Middleware adds significant value in six specific areas –

  1. Supports a high degree of Process Automation & Orchestration thus enabling the rapid conversion of paper based business processes to a true digital form in a manner that lends itself to continuous improvement & optimization
  2. Business Rules help by adding a high degree of business flexibility & responsiveness
  3. Native Mobile Applications enable platforms to support a range of devices & consumer behavior across those front ends
  4. Platforms As a Service engines which enable rapid application & business capability development across a range of runtimes and container paradigms
  5. Business Process Integration engines which enable rapid application & business capability development
  6. Middleware brings the notion of DevOps into the equation. Digital projects bring several technology & culture challenges which can be solved by a greater degree of collaboration, continuous development cycles & new toolchains without giving up proven integration with existing (or legacy) systems.

Intelligent Middleware not only enables Automation & Orchestration but also provides an assembly environment to string different (micro)services together. Finally, it also enables less technical analysts to drive application lifecycle as much as possible.

Further, Digital business projects call out for mobile native applications – which a forward looking middleware stack will support. Middleware is thus a key component for driving innovation and improving operational efficiency.

Five Key Business Drivers for combining Big Data, Intelligent Middleware & the Cloud…

The key benefits of combining the above paradigms to create new Digital Applications are –

  • Enable Elastic Scalability Across the Digital Stack
    Cloud computing can handle the storage and processing of any amount and any kind of data. This calls for the collection & curation of data from dynamic and highly distributed sources such as consumer transactions, B2B interactions, machines such as ATMs & geo location devices, click streams, social media feeds, server & application log files and multimedia content such as videos. It needs to be noted that these data volumes consist of multi-varied formats, differing schemas, transport protocols and velocities. Cloud computing provides the underlying elastic foundation to analyze these datasets.
  • Support Polyglot Development, Data Science & Visualization
    Cloud technologies are polyglot in nature. Developers can choose from a range of programming languages (Java, Python, R, Scala and C# etc) and development frameworks (such as Spark and Storm). Cloud offerings also enable data visualization using a range of tools from Excel to BI Platforms.
  • Reduce Time to Market for Digital Business Capabilities
    Enterprises can avoid time consuming installation, setup & other upfront procedures; for instance, they can deploy Hadoop in the cloud without buying new hardware or incurring other up-front costs. In the same vein, big data analytics should be able to support self service across the lifecycle – from data acquisition and preparation to analysis & visualization.
  • Support a multitude of Deployment Options – Private/Public/Hybrid Cloud 
    A range of scenarios for product development, testing, deployment, backup or cloudbursting are efficiently supported in pursuit of cost & flexibility goals.
  • Fill the Talent Gap
    Open Source technology is the common thread across Cloud, Big Data and Middleware. The hope is that the ubiquity of open source will serve as a critical lever in closing the IT-business skills gap.

As opposed to building standalone or one-off business applications, a ‘Digital Platform Mindset’ is a more holistic approach capable of producing higher rates of adoption & thus revenues. Platforms abound in the web-scale world at shops like Apple, Facebook & Google. Digital applications are constructed like lego blocks and they reuse customer & interaction data to drive cross sell and up sell among different product lines. The key is to ensure that one starts off with products with high customer attachment & retention. While increasing brand value, it is also important to ensure that customers & partners can collaborate in the improvements to the various applications hosted on top of the platform.

References

[1] Forbes Roundup of Big Data Analytics (BDA) Report

http://www.forbes.com/sites/louiscolumbus/2016/08/20/roundup-of-analytics-big-data-bi-forecasts-and-market-estimates-2016/#b49033b49c5f

[2] IDC FutureScape: Worldwide Big Data and Analytics 2016 Predictions

Can Your CIO Do Digital?

“Business model innovation is the new contribution of IT”  — Werner Boeing, CIO, Roche Diagnostics

Digital Is Changing the Role of the Industry CIO…

A motley crew of somewhat interrelated technologies – Cloud Computing, Big Data Platforms, Predictive Analytics & Mobile Applications – is changing the enterprise IT landscape. The common paradigm that captures all of them is Digital. The immense business value of Digital technology is no longer in question, both from a customer as well as an enterprise standpoint. However, the Digital space calls for strong and visionary leadership, both from a business & IT standpoint.

Business boards and CXOs are now concerned about their organization’s overall level and maturity of digital investments. They care not just about the tangible business value in existing business operations (e.g. increasing sales & customer satisfaction, detecting fraud, driving down business & IT costs etc.) but also about help in fine-tuning or creating new business models by leveraging Digital paradigms. It is thus an increasingly accurate argument that smart applications & ecosystems built around Digitization will dictate enterprise success.

The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous micro level interactions with global consumers/customers/clients/stockholders or patients depending on the vertical you operate in. Initially enterprises viewed Digital as a bolt-on or a fresh color of paint on an existing IT operation.

How did that change over the last five years?

Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. We have seen how exploding data generation across the global economy has become a clear & present business & IT phenomenon. Data volumes are rapidly expanding across industries. However, not only has the production of data by mobile applications increased; that increase is also driving the need for organizations to derive business value from it, using advanced techniques such as Data Science and Machine Learning. As a first step, this calls for the collection & curation of data from dynamic and highly distributed sources such as consumer transactions, B2B interactions, machines such as ATMs & geo location devices, click streams, social media feeds, server & application log files and multimedia content such as videos – using Big Data. Often these workloads are run on servers hosted on an agile infrastructure such as a Public or Private Cloud.

As one can understand from the above paragraph, the Digital Age calls for a diverse set of fresh skills – both from IT leadership and the rank & file. The role of the Chief Information Officer (CIO) is thus metamorphosing from being an infrastructure service provider to being the overall organizational thought leader in the Digital Age.

The question is – Can Industry CIOs adapt?

The Classic CIO is a provider of IT Infrastructure services.. 

Illustration: The Concerns of a CIO..

So what do CIOs typically think about nowadays?

  1. Keep the core stable and running so IT delivers minimal services to the business and disarm external competition
  2. Are parts of my business really startups and should they be treated as such and should they be kept away from the shackles of inflexible legacy IT? Do I need a digital strategy?
  3. What does the emergence of the 3rd platform (Cloud, Mobility,Social and Big Data) imply?
  4. Where can I show the value of expertise and IT to the money making lines of business?
  5. How can one do all the above while keeping track of Corporate and IT security?

CIOs who do not adapt are on the road to irrelevance…

Where CIOs are perceived as merely managing complex legacy systems, the new role of Chief Digital Officer (CDO) has gained currency. The idea is that a parallel & more agile IT organization can be created and run to foster an ecosystem of innovation, and that the office of the CDO is the right place to drive these innovative applications.

Why is that?

  1. CIOs that cannot, or that seem dis-engaged with, creating innovation through IT are headed the way of the dodo. At the enterprise officer (CIO/CTO) level, it becomes very obvious that more than ever “IT is not just a complementary function or a supplementary service; IT is the Business”. If that was merely something that we all paid lip service to in the past, it is hard reality now. So it is not a case of which company can make the best widgets or has the fastest trading platforms or the most efficient electronic health records. It is whose enterprise IT can provide the best possible results within a given cost that will win. It’s up to the CIOs to deliver, and deliver in such a way that large established organizations can compete with upstarts who do not have the same kind of enterprise constraints & shackles.
  2. Innovation & information now follow an “outside in” model, as opposed to data and value being generated by internal functions (sales, engineering, customer fulfillment, core business processes etc.). Enterprise customers are beginning to operate in what I like to think of as the new normal: entropy. It’s these conditions that make it imperative for IT leadership to reconsider their core business applications at the enterprise level. Does internal IT infrastructure need to look more like that of the internet giants?
  3. As a result of the above trends, CIOs are clearly now business level stakeholders more than ever. This means that they need to engage & understand their business at a deep level from an ecosystem and competitive standpoint. Those that cannot do it are neither very effective nor in those positions for long.
  4. Also, it is not merely enough to be a passive stakeholder; CIOs have to deliver on two very broad fronts. The first is to deliver core services (aka standardized functions) on time and at a reasonable cost. These are things like core banking systems, email and data backups: ensuring the smooth operation of transactional systems like ERP/business processing systems in manufacturing, decision support systems, classic IT infrastructure, claims management systems in insurance and billing systems in healthcare – the systems that need to run to keep business operations going. The focus here is to deliver on these on time and within SLAs to increasingly demanding internal customers. Like running the NYC subway – no one praises you for keeping things humming day in and day out, but all hell breaks loose when the trains are nonoperational for any reason. A thankless task, but one essentially needed to win credibility with lines of business.
  5. The advent of public cloud means that internal IT no longer has a monopoly and a captive internal customer base, even with core services. If one cannot compete with the likes of Amazon AWS or any of the SaaS based clouds that are mushrooming on a quarterly basis, you will find that soon enough you have to co-exist with Not-So-Shadow IT. The industry has seen enough back-office CIOs who are perceived by their organizations as having a largely irrelevant role in the evolution of the larger enterprise.
  6. Despite the continued focus on running a strong core as the price of CIO admission to internal strategy discussions, transformation is starting to emerge as a key business driver and is making its way into the larger industry. It is no longer the province of Wall St trading shops or a Google or a Facebook. Innovation here means “adopt this strategy, reinvent your IT and change the business”. The operative word is incremental rather than disruptive innovation. More on this key point later.
  7. Most rank-and-file IT personnel cannot really keep up with all the nomenclature of technology. For instance, a majority do not really understand umbrella concepts like Cloud, Mobility and Big Data. They know what these mean at a high level, but the complex technology underpinnings, the various projects & the finer nuances are largely lost on them. Overworked IT personnel face two stark choices from a time perspective: a) do you increase your value to your corporation by learning to speak the lingua franca of your business and investing in those skills, moving away from a traditional IT employee mindset? or b) do you increase your IT depth in your area of expertise? The first makes one a valued collaborator and paves the way up within the chain; the second may well increase your marketability in the industry, but it is not easy to keep up. We find that an increasing number of employees choose the first path, which creates interesting openings and arbitrage opportunities for other groups in the organization. The CIO needs to step up and be the internal change agent.

CONCLUSION…

Enterprise-wide business innovation will continue to be designed around the four key technologies (Big Data, Cloud Computing, Technology & Platforms). Business platforms created by leveraging these technologies will create immense operational efficiency, better business models, increased relevance to customers and ultimately drive revenues. Such platforms will separate the visionaries and leaders from the laggards in the years to come. As often noted, the keyword accompanying transformation is digital. This means a renewed focus on making IT services appealing to millennials, the self-service generation, be they customers, employees or partners. This touches all areas of enterprise IT while leaving a significant impact on organizational culture.

This is the age of IT with no boundaries – the question is whether the role of the CIO will largely remain unscathed in the years to come.

A POV on the FRTB (Fundamental Review of the Trading Book)…

Regulatory Risk Management evolves…

The Basel Committee on Banking Supervision was put in place to ensure the stability of the financial system. The Basel Accords are the frameworks that essentially govern the risk-taking activities of a bank. To that end, they introduce minimum regulatory capital standards that banks must adhere to. The Bank for International Settlements (BIS), established in 1930, is the world’s oldest international financial organization, with 60+ member central banks representing countries from around the world that together make up about 95% of world GDP. The BIS stewards and maintains the Basel standards in conjunction with member banks.

The goal of the Basel Committee and the Financial Stability Board (FSB) guidelines is to strengthen the regulation, supervision and risk management of the banking sector by improving risk management and governance. These have taken on an increased focus to ensure that a repeat of the 2008 financial crisis does not come to pass. Basel III (building upon Basel I and Basel II) also sets new criteria for financial transparency and disclosure by banking institutions.

Basel III – the most recent prominent version of the Basel standards, published in 2012 and named for the town of Basel in Switzerland where the committee meets – prescribes enhanced measures for capital & liquidity adequacy and was developed by the Basel Committee on Banking Supervision with voluntary worldwide applicability. Basel III covers credit, market and operational risks as well as liquidity risks. Relatedly, the BCBS 239 guidelines do not just apply to the G-SIBs (Globally Systemically Important Banks) but also to the D-SIBs (Domestic Systemically Important Banks). Any financial institution deemed “too big to fail” needs to work with the regulators to develop a “set of supervisory expectations” that guide risk data aggregation and reporting.

Basel III & other Risk Management topics were covered in these previous posts – http://www.vamsitalkstech.com/?p=191 and http://www.vamsitalkstech.com/?p=667

Enter the FRTB (Fundamental Review of the Trading Book)…

In May 2012, the Basel Committee on Banking Supervision (BCBS) issued a consultative document with the intention of revising the way capital is calculated for the trading book. These guidelines, which can be found in their final form at [1], were repeatedly refined based on comments from various stakeholders & quantitative studies. In January 2016, the final version of the paper was released. These guidelines are now termed the Fundamental Review of the Trading Book (FRTB), or unofficially, as some industry watchers have termed it, Basel IV.

What is new with the FRTB…

The main changes the BCBS has made with the FRTB are – 

  1. Changed Measure of Market Risk – The FRTB proposes a fundamental change to the measure of market risk. Market risk will now be calculated and reported via Expected Shortfall (ES) as the new standard measure, as opposed to the venerated (& long-standing) Value at Risk (VaR). Instead of the older VaR at a 99% confidence level, Expected Shortfall at a 97.5% confidence level is proposed. It should be noted that for normal distributions the two metrics are broadly equivalent; however, ES is far superior at measuring the long tail. This is a recognition that in times of extreme economic stress there is a tendency for multiple asset classes to move in unison. Consequently, under the ES method capital requirements are anticipated to be much higher. (A short numerical sketch contrasting the two measures follows the illustration below.)
  2. Model Creation & Approval – The FRTB also changes how models are approved & governed. Banks that want to use the IMA (Internal Models Approach) need to pass a set of rigorous tests, failing which they are forced to use the Standardized Approach (SA) for capital calculations. The fear is that the SA will increase capital requirements. The old IMA approach has been revised and made more rigorous, in a way that enables supervisors to remove internal modelling permission for individual trading desks. The approach now enforces more consistent identification of material risk factors across banks, and constraints on hedging and diversification. All of this is now done at the desk level instead of the entity level. The FRTB moves the responsibility for demonstrating compliant models, significant backtesting & P&L attribution to the desk level.
  3. Boundaries between the Regulatory Books – The FRTB also draws explicit boundaries between the trading book (the instruments the bank intends to trade) and the banking book (the instruments held to maturity). The rules have been redefined in such a way that banks now have to contend with stringent restrictions on internal transfers between the two. The regulatory motivation is to eliminate a bank’s ability to arbitrarily designate individual positions as belonging to either book. Given the different accounting treatment of the two, there was a feeling that banks were resorting to capital arbitrage with the goal of minimizing regulatory capital reserves. The FRTB also introduces more stringent reporting and data governance requirements for both books which, in conjunction with the well-defined boundary between them, should lead to a much better regulatory framework & also a re-evaluation of the structure of trading desks.
  4. Increased Data Sufficiency and Quality – The FRTB also introduces Non-Modellable Risk Factors (NMRF). Risk factors are non-modellable if the availability and sufficiency of the underlying data are an issue. With the NMRF, banks now face increased data sufficiency and quality requirements for the data that goes into the models themselves. This is a key point, the ramifications of which we will discuss in the next section.
  5. The FRTB also upgrades the standardized approach with a new Sensitivities Based Approach (SBA), which is more sensitive to various risk factors across different asset classes as compared to the Basel II SA. Regulators now prescribe the sensitivities used in the calculations. Approvals will also be granted at the desk level rather than at the entity level. The revised SA should provide a consistent way to measure risk across geographies and regions, giving regulators a better way to compare and aggregate systemic risk. The sensitivities-based approach should also allow banks to share a common infrastructure between the IMA and the SA. There are a set of buckets and risk factors prescribed by the regulator to which instruments can then be mapped.
  6. Models must be seeded with real and live transaction data – Fresh & current transactions will now need to enter the calculation of capital requirements as of the date on which they were conducted. Moreover, though reporting will take place at regular intervals, banks are now expected to manage market risk on a continuous, almost daily, basis.
  7. Time Horizons for Calculation – There are also enhanced requirements for data granularity depending on the kind of asset. The FRTB replaces the generic 10-day time horizon for market variables in Basel II with time horizons based on the liquidity of the underlying assets. It proposes five different liquidity horizons – 10, 20, 60, 120 and 250 days.

Illustration: FRTB designated liquidity horizons for market variables (src – [1])
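To make the ES versus VaR contrast concrete, here is a minimal numerical sketch in Python. It uses a hypothetical fat-tailed daily P&L distribution and plain historical estimators; it is not the FRTB’s prescribed stressed-calibration ES with liquidity horizons, just an illustration of why a 97.5% ES responds to the long tail in a way a 99% VaR does not.

```python
# Minimal sketch only: one-day historical VaR (99%) vs Expected Shortfall (97.5%)
# on a hypothetical vector of portfolio P&L observations.
import numpy as np

def var_historical(pnl, confidence=0.99):
    """Historical VaR: the loss exceeded only (1 - confidence) of the time."""
    return np.quantile(-pnl, confidence)   # losses are the negative of P&L

def expected_shortfall(pnl, confidence=0.975):
    """ES: the average loss in the worst (1 - confidence) tail."""
    losses = -pnl
    threshold = np.quantile(losses, confidence)
    return losses[losses >= threshold].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Hypothetical fat-tailed daily P&L (Student-t), purely illustrative.
    pnl = rng.standard_t(df=3, size=250_000) * 1_000_000
    print(f"VaR 99.0% : {var_historical(pnl, 0.99):,.0f}")
    print(f"ES  97.5% : {expected_shortfall(pnl, 0.975):,.0f}")
```

On a fat-tailed distribution such as this one, the tail average captured by ES typically exceeds the corresponding VaR quantile, which is the behaviour the regulation is trying to capitalize against.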

To Sum Up the FRTB… 

The FRTB rules are now clear and they will have a profound effect on how market risk exposures are calculated. The FRTB clearly calls out which specific instruments belong in the trading book vs the banking book. The switch from VaR at a 99% confidence level to Expected Shortfall (ES) at 97.5% should cause increased reserve requirements. Furthermore, the ES calculations will take into account the liquidity of the underlying instruments, with a historical simulation approach spanning liquidity horizons from 10 to 250 days under stressed market conditions. Banks that use a pure IMA approach will now have to run IMA plus the SA method.

The FRTB compels Banks to create unified teams across departments – especially Risk, Finance, the Front Office (where trading desks sit) and Technology – to address the significant challenges of the regulation.

From a technology capabilities standpoint, the FRTB presents banks with a data volume, velocity and analysis challenge. Let us now examine the technology ramifications.

Technology Ramifications around the FRTB… 

The FRTB rules herald a clear shift in how IT architectures work across the Risk area and the Back office in general.

  1. The FRTB calls for a single source of data that pulls from silos across the front office, trade data repositories, a range of BORT (Book of Record Transaction) systems etc. With the FRTB, source data needs to be centralized and available in one location where every consuming application can trust its quality.
  2. With both the IMA and the SBA in the FRTB, many more detailed & granular data inputs (across desks & departments) need to be fed into the ES (Expected Shortfall) calculations, from varying asset classes (Equity, Fixed Income, Forex, Commodities etc.) and across multiple scenarios. The calculator frameworks developed or enhanced for the FRTB will need ready & easy access to real-time data feeds in addition to historical data. At the firm level, the data requirements and the calculation complexity will be even higher as the calculations need to include the entire position book.

  3. The various time horizons called out also increase the need to run a full spectrum of analytics across many buckets. The analytics themselves will be more complex than before, with multiple teams working on all of these areas. This calls for standardization of the calculations themselves across the firm.

  4. Banks will also have to provide complete audit trails, both for the data and for the processes that worked on the data to produce these risk exposures. Data lineage, audit and tagging will be critical.

  5. The number of runs required for regulatory risk exposure calculations will go up dramatically under the new regime. The FRTB requires that each risk class be calculated separately from the whole set. Couple this with the increased calculation windows discussed in #3 above, and the result is a need for more compute processing power and vectorization (a brief sketch of vectorizing these runs follows this list).

  6. The FRTB also implies, from an analytics standpoint, running a large number of scenarios on a large volume of data. Most Banks will need to standardize their analytics libraries across the house. If Banks do not move to a Big Data architecture, they will incur tens of millions of dollars in hardware spend.
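As a brief illustration of the vectorization point above, the sketch below evaluates a tail measure per risk class and scales it across the five FRTB liquidity horizons in a single broadcast operation rather than looping per desk and per run. The shapes, the normal scenario P&L and the simple square-root-of-time scaling are assumptions made purely for the sketch; this is not the regulatory aggregation formula.

```python
# Illustrative only: vectorizing tail-loss calculations across risk classes
# and liquidity horizons with numpy instead of looping per run.
import numpy as np

RISK_CLASSES = ["Equity", "FX", "Commodity", "Credit", "Rates"]
HORIZONS_DAYS = np.array([10, 20, 60, 120, 250])

rng = np.random.default_rng(7)
# Hypothetical scenario P&L: (risk classes, scenarios) at the 10-day base horizon.
scenario_pnl = rng.normal(0.0, 1_000_000, size=(len(RISK_CLASSES), 50_000))

def es_per_row(pnl, confidence=0.975):
    losses = np.sort(-pnl, axis=1)                 # ascending losses per risk class
    cutoff = int(np.ceil(confidence * pnl.shape[1]))
    return losses[:, cutoff:].mean(axis=1)         # mean of the tail per row

base_es = es_per_row(scenario_pnl)                 # one tail measure per risk class
# Broadcast across horizons in one shot: (classes, 1) * (1, horizons).
scaled = base_es[:, None] * np.sqrt(HORIZONS_DAYS / 10.0)[None, :]

for risk_class, row in zip(RISK_CLASSES, scaled):
    print(risk_class, np.round(row, 0))
```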

The FRTB is the most pressing in a long list of Data Challenges facing Banks… 

The FRTB is yet another regulatory mandate that lays bare the data challenges facing every Bank. Current regulatory risk architectures are based on traditional relational database (RDBMS) architectures with tens of feeds from Core Banking Systems, Loan Data, Book of Record Transaction Systems (BORTS) such as Trade & Position Data (e.g. Equities, Fixed Income, Forex, Commodities, Options etc.), Wire Data, Payment Data, Transaction Data etc.

These data feeds are then tactically placed in in-memory caches or in enterprise data warehouses (EDW). Once the data has been extracted, it is transformed using a series of batch jobs that prepare it for the Calculator Frameworks, which then run the risk models on it.

All of the above applications need access to medium to large amounts of data at the individual transaction level. The Corporate Finance function within the Bank then makes end-of-day adjustments to reconcile all of this data, and these adjustments need to be cascaded back to the source systems, down to the individual transaction or classes of transactions.

These applications are typically deployed on clusters of bare-metal servers that are not particularly suited to portability, automated provisioning, patching & management – in short, nothing that can be moved over automatically at a moment’s notice. These applications also run on legacy proprietary technology platforms that do not lend themselves to a flexible & DevOps style of development.

Finally, there is always a need for statistical frameworks to make adjustments to customer transactions that somehow need to be reflected back in the source systems. All of these frameworks need access to, and an ability to work with, terabytes (TBs) of data.

Each of the above-mentioned risk work streams has corresponding data sets, schemas & event flows that it needs to work with, with different temporal needs for reporting: some need to be run a few times a day (e.g. Traded Credit Risk), some daily (e.g. Market Risk) and some at the end of the week (e.g. Enterprise Credit Risk).

One of the chief areas of concern is that the FRTB may require a complete rewrite of analytics libraries. Under the FRTB, front office libraries will need to perform enterprise risk calculations – a large number of analytics on a vast amount of data. Front office models cannot make all the assumptions that enterprise risk can to price a portfolio accurately. Front office systems run a limited number of scenarios, trading off accuracy for timeliness – as opposed to enterprise risk.

Most banks have stringent model vetting processes in place, and all of the rewritten analytic assets will need to pass through them. Every aspect of the mathematics of the analytics needs to go through this rigorous process. All of this will add to compliance costs, as the vetting process typically costs multiples of the rewrite itself. The FRTB has put in place stringent model validation standards along with hypothetical portfolios to benchmark against.

The FRTB also requires data lineage and audit capabilities for the data. Banks will need to establish a visual representation of the overall process as data flows from the BORT systems to the reporting applications. All data assets have to be catalogued and a thorough metadata management process instituted.
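As a hedged sketch of what minimal lineage capture can look like, the snippet below has every transformation append a record describing its inputs, outputs and code version, so the path from a BORT feed to the regulatory report can be reconstructed and audited. The structure, field names and paths are illustrative assumptions, not a specific catalog product or the regulation’s prescribed format.

```python
# Illustrative lineage capture: append-only JSON records per pipeline run.
import datetime
import json
import os
import uuid

def record_lineage(step, inputs, outputs, code_version, store="./lineage"):
    """Write one immutable lineage record describing a pipeline step."""
    os.makedirs(store, exist_ok=True)
    entry = {
        "run_id": str(uuid.uuid4()),
        "step": step,
        "inputs": inputs,
        "outputs": outputs,
        "code_version": code_version,
        "executed_at": datetime.datetime.utcnow().isoformat() + "Z",
    }
    with open(os.path.join(store, f"{entry['run_id']}.json"), "w") as f:
        json.dump(entry, f, indent=2)
    return entry

# Hypothetical step: an L3 market-risk transform reading curated lake paths.
record_lineage(
    step="l3_market_risk_transform",
    inputs=["/lake/raw/trades", "/lake/reference/curves"],
    outputs=["/lake/curated/market_risk/pnl_vectors"],
    code_version="frtb-loaders@1.4.2",
)
```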

What Must Bank IT Do… 

Given all of the above data complexity and the need to adopt agile analytical methods – what is the first step that enterprises must take?

There is a need for Banks to build a unified data architecture – one which can serve as a cross-organizational repository of all desk-level, department-level and firm-level data.

The Data Lake is an overarching data architecture pattern. Let us define the term first. A data lake is two things: a data storage repository (small or massive) and a data processing engine. A data lake provides “massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs“. Data lakes are created to ingest, transform, process, analyze & finally archive large amounts of any kind of data – structured, semi-structured and unstructured.

The Data Lake is not just a data storage layer but one that allows different users (traders, risk managers, compliance etc.) to plug in calculators that work on data spanning intraday activity as well as data across years. Calculators can then be designed to work on the data with multiple runs, calculating Risk Weighted Assets (RWAs) across multiple calibration windows.
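A minimal PySpark sketch of such a calculator is shown below. It assumes trade-level P&L vectors stored in the lake as Parquet with hypothetical paths and column names (business_date, desk, pnl), and shows how the same calculator can span intraday data and multi-year history simply by widening the date filter – for example, a current window versus a stressed one. It is an approximation for illustration, not the FRTB ES methodology.

```python
# Sketch: one desk-level tail-loss calculator run over different calibration
# windows against a Parquet-backed data lake. Paths and columns are assumed.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("frtb-es-sketch").getOrCreate()

pnl = spark.read.parquet("/lake/risk/pnl_vectors")     # hypothetical lake location

def desk_level_tail_loss(df, start_date, end_date, confidence=0.975):
    """Approximate expected shortfall per desk over a calibration window."""
    window = df.filter((F.col("business_date") >= start_date) &
                       (F.col("business_date") <= end_date))
    # Per-desk tail cutoff via approximate percentile of losses, then tail mean.
    cutoffs = window.groupBy("desk").agg(
        F.expr(f"percentile_approx(-pnl, {confidence})").alias("cutoff"))
    return (window.join(cutoffs, "desk")
                  .filter(F.col("pnl") <= -F.col("cutoff"))
                  .groupBy("desk")
                  .agg(F.avg(-F.col("pnl")).alias("expected_shortfall")))

# Same calculator, different calibration windows (illustrative dates only).
current = desk_level_tail_loss(pnl, "2016-01-01", "2016-12-31")
stressed = desk_level_tail_loss(pnl, "2008-06-01", "2009-06-01")
current.show()
```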

The illustration below depicts the goal: a cross-company data lake containing all asset data, with compute applied to that data.

Illustration – Data Lake Architecture for FRTB Calculations

1) Data Ingestion: This encompasses creation of the L1 loaders to take in Trade, Position, Market, Loan, Securities Master, Netting and Wire Transfer data etc. across trading desks. Developing the ingestion portion will be the first step to realizing the overall architecture, as timely data ingestion is a large part of the problem at most institutions. Part of this process includes understanding, for example, a) data ingestion from the highest-priority systems and b) how to apply the correct governance rules to the data. The goal is to create these loaders for versions of the different source systems (e.g. Calypso 9.x) and to maintain them as part of the platform going forward. The first step is to understand the range of Book of Record transaction systems (lending, payments and transactions) and the feeds they send out. The goal would then be to map these feeds to loaders on a release of an enterprise-grade open source Big Data platform, e.g. HDP (Hortonworks Data Platform), so that the loaders can be maintained going forward. A minimal loader sketch follows.
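The sketch below shows what an “L1” loader might look like at its simplest: land a raw trade feed into the lake unchanged except for standard metadata columns. The file layout, column names and the Calypso-style source tag are assumptions made for illustration.

```python
# Illustrative L1 loader: land a raw CSV trade feed as partitioned Parquet,
# adding only source-system and ingestion metadata. Paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("l1-trade-loader").getOrCreate()

raw = (spark.read
            .option("header", True)
            .csv("/landing/calypso_9x/trades/2016-06-30/*.csv"))   # hypothetical feed drop

landed = (raw.withColumn("source_system", F.lit("CALYPSO_9X"))
             .withColumn("ingest_ts", F.current_timestamp())
             .withColumn("business_date", F.lit("2016-06-30")))

(landed.write
       .mode("append")
       .partitionBy("business_date", "source_system")
       .parquet("/lake/raw/trades"))
```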

2) Data Governance: These are the L2 loaders that apply rules to the critical fields for Risk and Compliance. The goal here is to look for gaps in the data and any obvious quality problems involving range checks or table-driven data. The purpose is to facilitate data governance reporting. A simple rule-check sketch follows.
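As a hedged illustration of an L2 governance pass, the snippet below flags rows failing a few simple rules on critical risk fields: null checks, range checks and a reference-table lookup. The field names, thresholds and reference values are assumptions, not a prescribed rule set.

```python
# Illustrative L2 governance pass over the landed trade data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("l2-governance").getOrCreate()
trades = spark.read.parquet("/lake/raw/trades")             # hypothetical lake path
valid_currencies = ["USD", "EUR", "GBP", "JPY", "CHF"]      # table-driven domain

checked = (trades
    .withColumn("dq_missing_notional", F.col("notional").isNull())
    .withColumn("dq_bad_maturity", F.col("maturity_date") < F.col("trade_date"))
    .withColumn("dq_unknown_ccy", ~F.col("currency").isin(valid_currencies)))

failures = checked.filter(F.col("dq_missing_notional") |
                          F.col("dq_bad_maturity") |
                          F.col("dq_unknown_ccy"))

# Feed the governance report: failure counts by source system.
failures.groupBy("source_system").count().show()
```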

3) Entity Identification: This step is the establishment and adoption of a lightweight entity ID service. The service will consist of entity assignment and batch reconciliation. A toy sketch of both pieces follows.
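The sketch below illustrates the idea in the simplest possible terms: derive a deterministic entity ID from normalized legal-entity attributes, then run a trivial batch reconciliation that reports counterparties appearing under multiple raw names. A production service would use richer matching, survivorship rules and reference data; this is purely illustrative.

```python
# Toy entity ID assignment and batch reconciliation.
import hashlib
import re
from collections import defaultdict

def normalize(name: str) -> str:
    """Uppercase and strip punctuation/whitespace so close variants collide."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())

def entity_id(name: str, country: str) -> str:
    key = f"{normalize(name)}|{country.upper()}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

def reconcile(records):
    """Group raw counterparty strings that resolve to the same entity ID."""
    grouped = defaultdict(set)
    for raw_name, country in records:
        grouped[entity_id(raw_name, country)].add(raw_name)
    return {eid: names for eid, names in grouped.items() if len(names) > 1}

# Hypothetical feed rows: two spellings of the same counterparty plus one other.
feed = [("ACME Capital LLC", "US"), ("Acme Capital, L.L.C.", "US"), ("Banque XYZ", "FR")]
print(reconcile(feed))
```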

4) Developing L3 Loaders: This phase will involve defining the transformation rules required in each risk, finance and compliance area to prepare the data for their specific processing.

5) Analytic Definition: Running the analytics that are to be used for FRTB.

6) Report Definition: Defining the reports that are to be issued for each risk and compliance area.

References..

[1] https://www.bis.org/bcbs/publ/d352.pdf