Data Lakes power the future of Industrial Analytics..(1/4)

The first post in this four part series on Data lakes will focus on the business reasons to create one. The second post will delve deeper into the technology considerations & choices around data ingest & processing in the lake to satisfy myriad business requirements. The third will tackle the critical topic of metadata management, data cleanliness & governance. The fourth & final post in the series will focus on the business justification to build out a Big Data Center of Excellence (COE).

“Business owners at the C level are saying, ‘Hey guys, look. It’s no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop.’” – Mike Lang, CEO of Revelytix

The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers, customers, clients or patients. The goal is not just to provide engaging visualization but also to personalize the services clients care about across multiple modes of interaction. Mobile applications first began forcing enterprises to support multiple channels of interaction with their consumers. For example, Banking now requires an ability to engage consumers in a seamless experience across an average of four to five channels – Mobile, eBanking, Call Center, Kiosk etc. Healthcare is a close second, where caregivers expect patient, medication & disease data at their fingertips with a few finger swipes on an iPad app.

Big Data has been the chief catalyst in this disruption. The Data Lake architectural & deployment pattern makes it possible to first store all this data & then enables the panoply of Hadoop ecosystem projects & technologies to operate on it to produce business results.

Let us consider a few of the major industry verticals and the sheer data variety that players in these areas commonly possess – 

The Healthcare & Life Sciences industry possesses some of the most diverse data across the spectrum, ranging from – 

  • Structured Clinical data e.g. Patient ADT information
  • Free hand notes
  • Patient Insurance information
  • Device Telemetry 
  • Medication data
  • Patient Trial Data
  • Medical Images – e.g. CAT Scans, MRIs, CT images etc

The Manufacturing industry players are leveraging the datasets below, and many others, to derive new insights in a highly process oriented industry –

  • Supply chain data
  • Demand data
  • Pricing data
  • Operational data from the shop floor 
  • Sensor & telemetry data 
  • Sales campaign data

Data In Banking – Corporate IT organizations in the financial industry have for many years been tackling data challenges caused by strict silo based approaches that inhibit data agility.
Consider some of the traditional sources of data in banking –

  • Customer Account data e.g. Names, Demographics, Linked Accounts etc
  • Core Banking Data
  • Transaction Data which captures the low level details of every transaction (e.g debit, credit, transfer, credit card usage etc)
  • Wire & Payment Data
  • Trade & Position Data
  • General Ledger Data e.g AP (accounts payable), AR (accounts receivable), cash management & purchasing information etc.
  • Data from other systems supporting banking reporting functions.

Industries have changed around us since the advent of relational databases & enterprise data warehouses. Relational Databases (RDBMS) & Enterprise Data Warehouses (EDW) were built with very different purposes in mind. RDBMS systems excel at online transaction processing (OLTP) use cases where massive volumes of structured data need to be processed quickly. EDWs, on the other hand, perform online analytical processing (OLAP) functions where data extracts are taken from OLTP systems, loaded & sliced in different ways. Neither kind of system is suited to handling both immense volumes of data and highly variable structures of data.


Let us consider the main reasons why legacy data storage & processing techniques are unsuited to the new business realities of today.

  • Legacy data technology enforces a vertical scaling method that is sorely unsuited to handling massive volumes of data in a scale up/scale down manner
  • The structure of the data needs to be modeled up front in a paradigm called ‘schema on write’, which sorely inhibits time to market for new business projects
  • Traditional data systems suffer bottlenecks when large amounts of high variety data are processed using them 
  • Limits on the types of analytics that can be performed. In industries like Retail, Financial Services & Telecommunications, enterprises need to build detailed models of customer accounts to predict overall service level satisfaction in realtime. These models are predictive in nature and use data science techniques as an integral component. The higher volumes of data, along with the attribute richness that can be provided to them (e.g. transaction data, social network data, transcribed customer call data), help make the models highly accurate so that they can provide an enormous amount of value to the business. Legacy systems are not a great fit here.

Given all of the above data complexity and the need to adopt agile analytical methods – what is the first step that enterprises must take?

The answer is the adoption of the Data Lake as an overarching data architecture pattern. Let’s define the term first. A data lake is two things – a small or massive data storage repository and a data processing engine. A data lake provides “massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs”.[1] Data Lakes are created to ingest, transform, process, analyze & finally archive large amounts of any kind of data – structured, semistructured and unstructured.


                                  Illustration – The Data Lake Architecture Pattern

What Big Data brings to the equation beyond its strength in data ingest & processing is a unified architecture. For instance, MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Apache Hadoop YARN opened Hadoop to other data processing engines (e.g. Apache Spark/Storm) that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. The result is that ANY kind of application processing can be run inside a Hadoop runtime – batch, realtime, interactive or streaming.

Visualization – The average enterprise user is familiar with BYOD in the age of self service. The Digital Mesh only exacerbates the gap in user experiences, as information consumers navigate applications and consume services across a mesh that is both multi-channel and provides Customer 360 across all these engagement points. While information management technology has grown at a blistering pace, the human ability to process and comprehend numerical data has not. Applications being developed in 2016 are beginning to adopt intelligent visualization approaches that are easy to use, highly interactive and enable the user to manipulate corporate & business data using their fingertips – much like an iPad app. Tools such as intelligent dashboards, scorecards, mashups etc are helping change a visualization paradigm that was based on histograms, pie charts and tons of numbers. Big Data improvements in data lineage & quality are greatly helping the visualization space.

The Final Word

Specifically, the Data Lake architectural pattern provides the following benefits – 

The ability to store enormous amounts of data with a high degree of agility & low cost: The Schema On Read architecture makes it trivial to ingest any kind of raw data into Hadoop in a manner that preserves its structure. Business analysts can then explore this data and then define a schema to suit the needs of their particular application.
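To make the Schema On Read idea concrete, here is a minimal Python sketch. It is not Hadoop code: the raw events, the field names and the `PAYMENTS_SCHEMA` mapping are all hypothetical, and a plain list of JSON strings stands in for files landed in the lake. The point is only that the raw records are kept untouched and a schema is applied when an analyst reads them.

```python
import json

# Raw events are stored exactly as they arrive; no schema is enforced on write.
# (Illustrative data; a list of JSON strings stands in for files in the lake.)
RAW_EVENTS = [
    '{"account": "A-100", "amount": "250.00", "channel": "mobile"}',
    '{"account": "A-200", "amount": "75.50"}',
    '{"account": "A-100", "amount": "19.99", "channel": "kiosk", "note": "fee"}',
]

# Hypothetical schema chosen by one analyst at read time: field -> (type, default).
PAYMENTS_SCHEMA = {
    "account": (str, None),
    "amount": (float, 0.0),
    "channel": (str, "unknown"),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project them onto a schema at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, PAYMENTS_SCHEMA)

# A simple aggregate over the typed view; fields outside the schema are ignored.
total_by_account = {}
for row in rows:
    account = row["account"]
    total_by_account[account] = total_by_account.get(account, 0.0) + row["amount"]
```

A second analyst could read the same raw events with a different schema (for example one that keeps the `note` field) without any reload of the data.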

The ability to run any kind of Analytics on the data: Hadoop supports multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set.  You are only restricted by your use case.

The ability to analyze, process & archive data while dramatically cutting cost: Since Hadoop was designed to work on low-cost commodity servers which have direct attached storage, it helps dramatically lower the overall cost of storage. Enterprises are thus able to retain source data for long periods, providing business applications with far greater historical context.

The ability to augment & optimize Data Warehouses: Data lakes & Hadoop technology are not a ‘rip & replace’ proposition. While they provide a much lower cost environment than data warehouses, they can also be used as the compute layer to augment these systems. Data can be stored, extracted and transformed in Hadoop. Then a subset of the data, i.e. the results, is loaded into the data warehouse. This enables the EDW to devote its compute cycles and storage to truly high value analytics.
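The offload pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: an in-memory SQLite database stands in for the EDW, a Python list stands in for raw data in the lake, and the table and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Raw, low-level transactions as they might sit in the lake (illustrative data).
raw_transactions = [
    ("2016-01-04", "A-100", 120.0),
    ("2016-01-04", "A-100", 30.0),
    ("2016-01-04", "A-200", 75.5),
    ("2016-01-05", "A-100", 10.0),
]


def summarize_daily(transactions):
    """Transformation done in the lake: aggregate amounts per day and account."""
    totals = defaultdict(float)
    for day, account, amount in transactions:
        totals[(day, account)] += amount
    return [(day, account, total) for (day, account), total in sorted(totals.items())]


def load_into_warehouse(conn, summary_rows):
    """Load only the aggregated result set into the warehouse table."""
    conn.execute("CREATE TABLE daily_totals (day TEXT, account TEXT, total REAL)")
    conn.executemany("INSERT INTO daily_totals VALUES (?, ?, ?)", summary_rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the EDW
summary = summarize_daily(raw_transactions)
load_into_warehouse(warehouse, summary)
row_count = warehouse.execute("SELECT COUNT(*) FROM daily_totals").fetchone()[0]
```

The warehouse only ever sees the three summary rows, while the four raw transactions remain in the lake for later reprocessing.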

The next post of the series will dive deeper into the architectural choices one needs to make while creating a high fidelity & business centric enterprise data lake.

References – 


The Data Science Continuum in Financial Services..(3/3)

“In God we trust. All others must bring data.” – Dr. W. Edwards Deming, statistician, professor, author, lecturer, and consultant.

The first post in this three part series described key ways in which innovative applications of data science are slowly changing a somewhat insular banking & financial services industry. The second post then delineated key business use cases enabled by a data driven or data native approach. This final post will examine foundational Data Science tasks & techniques that are commonly employed to get value from data, with financial industry examples. We will round off the discussion with recommendations for industry CXOs.


The Need for Data Science –

It is no surprise that Big Data approaches were first invented & then refined in web scale businesses such as Google, Yahoo, eBay, Facebook and Amazon. These web properties offer highly user friendly, contextual & mobile native application platforms which produce a large amount of complex and multi-varied data from consumers, sensors and other telemetry devices. All this data is constantly analyzed to drive higher rates of application adoption, thus driving a virtuous cycle. We have discussed the (now) augmented capability of financial organizations to acquire, store and process large volumes of data by leveraging HDFS (the Hadoop Distributed File System) running on commodity (x86) hardware.

One of the chief reasons that these webscale shops adopted Big Data is the ability to store the entire data set in Hadoop to build more accurate predictive models. The ability to store thousands of attributes at a much finer grain over a long historical period, instead of just depending on a statistically significant sample, is a significant gain over legacy data technology.

Every year Moore’s Law keeps driving the costs of raw data storage down. At the same time, compute technologies such as MapReduce, Tez, Storm and Spark have enabled the organization and analysis of Big Data at scale. The convergence of cost effective storage and scalable processing allows us to extract richer insights from data. These insights then need to be operationalized at scale to provide business value, as the use cases in the last post highlighted.

The differences between Descriptive & Predictive Analytics –

Business intelligence (BI) is a traditional & well established analytical domain that essentially takes a retrospective look at business data in systems of record. The goal for BI is primarily to look for macro or aggregate business trends across different aspects or dimensions such as time, product lines, business units & operating geographies.

BI is primarily concerned with “What happened and what trends exist in the business based on historical data?”. The typical use cases for BI include budgeting, business forecasts, reporting & key performance indicators (KPI).

On the other hand, Predictive Analytics (a subset of Data Science) augments & builds on the BI paradigm by adding a “What could happen” dimension to the data in terms of –

  • being able to probabilistically predict different business scenarios across thousands of variables
  • suggesting specific business actions based on the above outcomes

Predictive Analytics is not intended to replace the BI domain, nor will it; it adds significant business capabilities that lead to overall business success. It is not uncommon to find real world business projects leveraging both these analytical approaches.
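The contrast between the two approaches can be illustrated with a small Python sketch. The revenue figures are made up and the function names are hypothetical. The descriptive half summarizes what happened, while the predictive half fits a simple least-squares trend line and extrapolates it. A real predictive model would use many more variables and would report uncertainty; a single trend line is used here only to keep the example short.

```python
# Monthly revenue history (illustrative numbers, not real data).
monthly_revenue = [100.0, 110.0, 120.0, 130.0]


def describe(history):
    """Descriptive (BI-style) view: what happened?"""
    return {"total": sum(history), "average": sum(history) / len(history)}


def fit_trend(history):
    """Least-squares line through (month_index, revenue) points."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(history, periods_ahead=1):
    """Predictive view: what could happen if the fitted trend continues?"""
    slope, intercept = fit_trend(history)
    return intercept + slope * (len(history) - 1 + periods_ahead)


summary = describe(monthly_revenue)
next_month = forecast(monthly_revenue)
```

Both views run off the same history, which is why projects commonly use them side by side.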

Data Science  –

So, what exactly is Data Science?

Data Science is an umbrella concept that refers to the process of extracting business patterns from large volumes of structured, semi structured and unstructured data. Data Science is the key ingredient in enabling a predictive approach to the business.

Some of the key aspects that follow are  –

  1. Data Science is not just about applying analytics to massive volumes of data. It is also about exploring the patterns, associations & interrelationships of thousands of variables within the data. It does so by adopting an algorithmic approach to gleaning the business insights that are embedded in the data.
  2. Data Science is a standalone discipline that has spawned its own set of platforms, tools and processes across its lifecycle.
  3. Data science also aids in the construction of software applications & platforms to utilize such insights in a business context.  This involves the art of discovering data insights combined with the science of operationalizing them at scale. The word ‘scale’ is key. Any algorithm, model or deployment paradigm should support an expanding number of users without the need for unreasonable manual intervention as time goes on.
  4. A data scientist uses a combination of machine learning, statistics, visualization, and computer science to extract valuable business insights hiding in data and builds operational systems to deliver that value.
  5. The machine learning components are classified into two categories: ‘supervised’ and ‘unsupervised’ learning. In supervised learning, the model is trained on examples with known outcomes and captures the effect that a set of inputs has on the outputs. In unsupervised learning, there are no labelled outcomes, and the structure in the data is explained by so called latent variables. It is also possible to have a hybrid approach to certain types of mining tasks.
  6. Strategic business projects typically begin leveraging a Data Science based approach to derive business value. This approach then becomes integral and eventually core to the design and architecture of such a business system.
  7. Contrary to what some of the above may imply, Data Science is a cross-functional discipline and not just the domain of PhDs. A data scientist is part statistician, part developer and part business strategist.
  8. Working in small self sufficient teams, the Data Scientist collaborates with extended areas which include visualization specialists, developers, business analysts, data engineers, applied scientists, architects, LOB owners and DevOps. The success of data science projects often relies on the communication, collaboration, and interaction that take place with the extended team, both internally and possibly externally to the organization.
  9. It needs to be clarified that not every business project is a fit for a Data Science approach. The criteria for deciding whether such an advanced approach is called for include whether the business initiative needs to provide knowledge based decisions (beyond the classical rules engine/expert systems based approaches), deals with large volumes of relevant data, operates in a rapidly changing business climate, & finally requires scale beyond what can be supplied using human analysts.
  10. Indeed any project where hugely improved access to information & realtime analytics for customers, analysts (and other stakeholders) is a must for the business – is fertile ground for Data Science.

Algorithms & Models

The word ‘model‘ is highly overloaded and means different things to different IT specialities e.g. RDBMS models imply data schemas, statistical models are built by statisticians etc. However, it can safely be said that models are representations of a business construct or a business situation.

Data mining algorithms are used to create models from data.

To create a data science model, the data mining algorithm looks for key patterns in the data provided. The results of this analysis define the best parameters for the model. Once identified, these parameters are applied across the entire data set to extract actionable patterns and detailed statistics.

The model itself can take various forms ranging from a set of customers across clusters, a revenue forecasting model, a set of fraud detection rules for credit cards or a decision tree that predicts outcomes based on specific criteria.

Common Data Mining Tasks –

There are many different kinds of data mining algorithms but all of these address a few fundamental types of tasks. The popular ones are listed below along with relevant examples:

  • Classification & Class Probability Estimation – For a given set of data, predict, for each individual in a population, which of a discrete set of classes that individual belongs to. An example classification question is – “Among all wealth management clients in a given population, who is most likely to respond to an offer to move to a higher segment?”. Common techniques used in classification include decision trees, Bayesian models, k-nearest neighbors, induction rules etc. Class Probability Estimation (CPE) is a closely related concept in which a scoring model is created to predict the likelihood that an individual belongs to a given class.
  • Clustering is an unsupervised technique used to find classes or segments of populations within a larger dataset without being driven by any specific purpose. For example – “What are the natural groups our customers fall into?”. The most popular use of clustering techniques is to identify clusters to use in activities like market segmentation. A common algorithm used here is k-means clustering.
  • Market basket analysis is commonly used to find associations between entities based on the transactions that involve them, e.g. recommendation engines which use affinity grouping.
  • Regression algorithms estimate or predict the numerical value of some variable for an individual, e.g. the revenue expected from a given client.
  • Profiling algorithms aim to characterize the normal or typical behavior of an individual or group within a larger population. They are frequently used in anomaly detection systems such as those that detect AML (Anti Money Laundering) violations and Credit Card fraud.
  • Causal Modeling algorithms attempt to find out what business events influence others.
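As an illustration of the first task in the list above, the following Python sketch uses k-nearest neighbors both to score the likelihood that a client responds to an offer (class probability estimation) and to turn that score into a yes/no class (classification). The client features, labels and the 0.5 threshold are hypothetical, and the features are left unscaled for brevity; a real model would normalize them and be validated on held-out data.

```python
import math

# Hypothetical labelled clients: (assets in $M, trades per month) -> responded to offer?
TRAINING = [
    ((1.0, 2.0), False),
    ((1.2, 1.0), False),
    ((0.8, 3.0), False),
    ((5.0, 10.0), True),
    ((5.5, 12.0), True),
    ((4.8, 9.0), True),
]


def response_probability(client, training, k=3):
    """Class probability estimate: share of the k nearest neighbours that responded."""
    by_distance = sorted(training, key=lambda item: math.dist(client, item[0]))
    nearest = by_distance[:k]
    return sum(1 for _, responded in nearest if responded) / k


def classify(client, training, k=3, threshold=0.5):
    """Classification: turn the score into a discrete class."""
    return response_probability(client, training, k) >= threshold


likely_score = response_probability((5.1, 11.0), TRAINING)
unlikely_score = response_probability((1.0, 2.5), TRAINING)
```

The score can be used to rank clients for an offer, while the thresholded class answers the simpler "respond or not" question.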

There is no reason that one should be limited to one of the above techniques while forming a solution. An example is to use one algorithm (say clustering) to determine the natural groups in the data, and then to apply regression to predict a specific outcome based on that data. Another example is to use multiple algorithms within a single business project to perform related but separate tasks. Ex – Using regression to create financial reporting forecasts, and then using a neural network algorithm to perform a deep analysis of the factors that influence product adoption.
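The first combination mentioned above, clustering followed by regression, can be sketched as follows. This is a toy example under stated assumptions: the client data is invented, the k-means routine works on a single feature and starts from caller-supplied centroids (rather than a random initialization), and one least-squares line is fitted per cluster.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means on one feature, starting from caller-supplied centroids."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assignment, centroids


def least_squares(points):
    """Fit y = intercept + slope * x to a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical clients: (balance in $K, annual fee revenue in $K).
clients = [(10, 1.0), (20, 2.0), (30, 3.0), (500, 10.0), (600, 12.0), (700, 14.0)]

# Step 1: find the natural groups by balance.
balances = [b for b, _ in clients]
assignment, centroids = kmeans_1d(balances, centroids=[0.0, 900.0])

# Step 2: fit one regression model per group.
models = {}
for cluster in set(assignment):
    members = [p for p, a in zip(clients, assignment) if a == cluster]
    models[cluster] = least_squares(members)


def predict_revenue(balance):
    """Route a new client to the nearest cluster, then apply that cluster's model."""
    cluster = min(range(len(centroids)), key=lambda c: abs(balance - centroids[c]))
    slope, intercept = models[cluster]
    return intercept + slope * balance
```

Fitting separate models per segment lets each group keep its own revenue-to-balance relationship instead of forcing one line through very different populations.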

The Data Science Process  –

A general process framework for a typical Data Science project is depicted below. The process flow suggests a sequential waterfall but allows for Agile/DevOps loops in the core analysis & feedback phases. The process is also not a strictly one way pipeline; it allows for continuous improvement.


                                           Illustration: The Data Science Process 

  1. The central pool of data that hosts all the tiers of data processing in the above illustration is called the Data Lake. The Data Lake enables two key advantages – the ability to collect cross business unit data so that it can be sampled/explored at will & the ability to perform any kind of data access pattern across a shared data infrastructure: batch,  interactive, search, in-memory and custom etc.
  2. The Data Science process begins with a clear and contextual understanding of the granular business questions that need to be answered from the real world dataset. The Data Scientist needs to be trained in the nuances of the business to achieve the appropriate outcome. E.g. predicting fraudulent transactions in the credit cards space, or predicting which customers in the Retail Bank are likely to churn over the next few months based on their usage patterns.
  3. Once this is known, relevant data needs to be collected from the real world. These sources in Banking range from –
    1. Customer Account data e.g. Names, Demographics, Linked Accounts etc
    2. Transaction Data which captures the low level details of every transaction (e.g debit, credit, transfer, credit card usage etc),
    3. Wire & Payment Data,
    4. Trade & Position Data,
    5. General Ledger Data and Data from other systems supporting core banking functions.
    6. Unstructured data, e.g. social media feeds, server logs, clickstream data & mobile application data etc.
  4. Following the planning stage, Data Acquisition follows an iterative process of acquiring data from the actual sources by creating appropriate loaders and choosing appropriate technology components, e.g. Apache NiFi, Kafka, Sqoop, Flume, HDFS API, Java etc.
  5. The next step is to perform Data Cleansing. Here the goal is to look for gaps in the data (given the business context) and to ensure that the dataset is valid with no missing values, consistent in layout and as fresh as possible from a temporal standpoint. This phase also involves fixing any obvious quality problems involving range or table driven data. The purpose at this stage is also to facilitate & perform appropriate data governance.
  6. Exploratory Data Analysis (EDA) helps with trial & error analysis of data. This is a phase where plots and graphs are used to systematically go through the data. The importance of this cannot be overstated as it provides the Data Scientist and the business with a flavor of the data.
  7. Data Analysis: Generation of features or attributes that will be part of the model. This is the step of the process where actual data mining takes place leveraging models built using the above algorithms.

Further iterative steps exist within each of the above stages, particularly Data Cleansing and Data Analysis.
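The Data Cleansing step described above (missing values, range problems, freshness) can be sketched in Python as follows. The record layout, the rule that amounts must be non-negative and the cut-off date are all hypothetical; real checks would come from the business context and the governance policy in force.

```python
from datetime import date

# Hypothetical raw transaction records; field names are illustrative.
records = [
    {"account": "A-100", "amount": 120.0, "posted": date(2016, 1, 4)},
    {"account": "A-200", "amount": None, "posted": date(2016, 1, 4)},   # missing value
    {"account": "A-300", "amount": -5.0, "posted": date(2016, 1, 5)},   # out of range
    {"account": "A-400", "amount": 60.0, "posted": date(2014, 6, 1)},   # stale
]

REQUIRED_FIELDS = ("account", "amount", "posted")


def find_problems(record, oldest_allowed):
    """Return a list of data quality problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            problems.append(f"missing {field}")
    amount = record.get("amount")
    if amount is not None and amount < 0:
        problems.append("amount out of range")
    posted = record.get("posted")
    if posted is not None and posted < oldest_allowed:
        problems.append("stale record")
    return problems


def cleanse(records, oldest_allowed):
    """Split records into clean rows and rejected rows with their reasons."""
    clean, rejected = [], []
    for record in records:
        problems = find_problems(record, oldest_allowed)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected


clean, rejected = cleanse(records, oldest_allowed=date(2015, 1, 1))
```

Keeping the rejected rows together with their reasons, rather than silently dropping them, gives the governance function something to audit.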

Once the models have been tested and refined to the satisfaction of the business and their performance has been put through a rigorous test phase, they are deployed into production. Once deployed, they are constantly refined based on end user and system feedback.

The Big Data ecosystem (consisting of tools such as Pig, Scalding, Hive, Spark and MapReduce etc) enables sea changes of improvement across the entire Data Science lifecycle, from data acquisition to data processing to data analysis. Big Data/Hadoop can unify all data storage in one place, which renders data more accessible for modeling. Hadoop also scales up machine learning analysis through its inbuilt parallelism, which adds a tremendous amount of value in terms of training multiple models in parallel to improve their efficacy. The ability to collect a lot of data as opposed to small samples also helps greatly.

Recommendations – 

Developing a strategic mindset to Data Science and predictive analytics should be a board level concern. This entails –

  • To begin with – ensuring buy in & commitment in the form of funding at a Senior Management level. This support needs to extend across the entire lifecycle depicted above (from identifying business use cases).
  • Extensive but realistic ROI (Return On Investment) models built during due diligence with periodic updates for executive stakeholders
  • On a similar note, ensuring buy in using a strategy of co-opting & alignment with Quants and different high potential areas of the business (as covered in the usecases in the last blog)
  • Identifying leaders within the organization who can not only lead important projects but also create compelling content to evangelize the use of predictive analytics
  • Beginning to tactically bake in or embed data science capabilities across different lines of business and horizontal IT
  • Slowly moving adoption to the Risk, Fraud, Cybersecurity and Compliance teams as part of the second wave. This is critical in ensuring that analysts across these areas move from a spreadsheet intensive model to adopting advanced statistical techniques
  • Creating a Predictive Analytics COE (Center of Excellence) that enables cross pollination of ideas across the fields of statistical modeling, data mining, text analytics, and Big Data technology
  • Informing the regulatory authorities of one’s intentions to leverage data science across the spectrum of operations
  • Ensuring that issues related to data privacy, audit & compliance have been given a great deal of forethought
  • Identifying & developing human skills in toolsets (across open source and closed source) that facilitate adapting to data lake based architectures. A large part of this is to organically grow the talent pool by instituting a college recruitment process

While this ends the current series on Data Science in financial services, it is my intention to explore each of the above Data Mining techniques to a greater degree of depth as applied to specific business situations in 2016 & beyond. This being said, we will take a look at another pressing business & strategic concern – Cybersecurity in Banking – in the next series.

Big Data Drives Disruption In Wealth Management..(2/3)

The first post in this three part series brought to the fore critical strategic trends in the Wealth & Asset Management (WM) space – the most lucrative portion of Banking. This second post will describe an innovation framework for a forward looking WM institution. We will do this by means of concrete & granular use cases across the entire WM business lifecycle. The final post will cover technology architecture and business strategy recommendations for WM CXOs.


The ability to sign up wealthy individuals & families, and then to retain them over the years by offering them engaging, bespoke & contextual services, will largely drive growth in the Wealth Management industry in 2016 and beyond. However, WM as an industry sector has lagged other areas within banking from a technology & digitization standpoint. Multiple business forces, ranging from increased regulatory & compliance demands to technology savvy customers and new age FinTechs, have led firms to slowly begin a makeover process.

So all of this raises the question – what do WM firms need to do to grow their client base and ultimately revenues? I contend that there are four strategic goals that firms need to operate across – 

  1. Increase Client Loyalty by Digitizing Client Interactions – WM clients who use services like Uber, Zillow, Amazon etc in their daily lives are now very vocal in demanding a seamless experience across all WM services using digital channels. The vast majority of WM applications still lag the innovation cycle, are archaic & are still separately managed. The net issue is that the client is faced with distinct user experiences ranging from client onboarding to servicing to transaction management. There is a crying need across the industry for IT infrastructure modernization, ranging from Cloud Computing to Big Data to microservices to agile cultures promoting techniques such as a DevOps approach to building out these architectures. Such applications need to provide anticipatory or predictive capabilities at scale while understanding each customer’s lifestyle, financial needs & behavioral preferences.
  2. Generate Optimal Client Experiences – In most existing WM systems, siloed functions have led to brittle data architectures operating on custom built legacy applications. This problem is firstly compounded by inflexible core banking systems and secondly exacerbated by a gross lack of standardization in the application stacks underlying ancillary capabilities. These factors inhibit deployment flexibility across a range of platforms, thus leading to extremely high IT costs and technical debt. The consequence is that client facing applications are inhibited from using data in a manner that constantly & positively impacts the client experience. There is clearly a need to provide an integrated digital experience across a global customer base, and then to offer more intelligent functions based on existing data assets. Current players do possess a huge first mover advantage as they offer highly established financial products across their large (and largely loyal & sticky) customer bases, wide networks of physical locations, and rich troves of data that pertain to customer accounts & demographic information. However, it is not enough to just possess the data. They must be able to drive change through legacy thinking and infrastructures as the entire industry struggles to adapt to a major new segment – the millennials – who increasingly use mobile devices and demand more contextual services as well as a seamless, unified and highly analytics driven banking experience, akin to what they commonly experience on the internet at web properties like Facebook, Amazon, Google or Yahoo.
  3. Automate Back & Mid Office Processes Across the WM Value Chain – The need to forge a closer banker/client experience is not just driving demand around data silos & streams themselves but is also forcing players to move away from paper based models to a more seamless, digital & highly automated model, reworking a ton of existing back & front office processes – which are the weakest link in the chain. These processes range from risk data aggregation and supranational compliance (AML, KYC, CRS & FATCA) to financial reporting across a range of global regions & Cyber Security. Can the data architectures & the IT systems that leverage them be created in such a way that they permit agility while constantly learning & optimizing their behaviors across national regulations, InfoSec & compliance requirements? Can every piece of actionable data be aggregated, secured, transformed and reported on in such a way that its quality across the entire lifecycle is guaranteed?
  4. Tune existing business models based on client tastes and feedback – While Automation 1.0 focuses on digitizing processes, rules & workflow as stated above, Automation 2.0 implies strong predictive modeling capabilities working at large scale – systems that constantly learn and optimize products & services based on client needs & preferences. The clear ongoing theme in the WM space is constant innovation. Firms need to ask themselves if they are offering the right products that cater to an increasingly affluent yet dynamic clientele. This is the area where firms need to show that they can compete with the FinTechs (Wealthfront, Nutmeg, Fidor Bank et al) to attract younger customers.

Having set the stage for the capabilities that need to be added or augmented, let us examine what the WM firm of the future can look like.


                            Illustration – Technology Driven Wealth Management

Improve the Client Experience

The ability of clients to view their holistic portfolio, banking, bill pay data & advisor interactions in one intuitive user interface is a must-have. All this information needs to be available across multiple channels of banking & across all accounts the client owns with multiple FIs (Financial Institutions). Further, pulling in data from relevant social media properties like Twitter, Facebook etc. can help clients gauge the popularity of certain products across their networks, thus helping them make targeted, real-time decisions that increase market share. Easy access to investment advice, portfolio analytics and DIY (Do it Yourself) “what if” scenarios based on the client’s investment profile, past financial behavior & family commitments is highly desirable and encourages client loyalty.

Help the Advisor –

On the other side of the coin, most WM advisors lack a comprehensive view of their customers. Because of legacy IT constraints, their interactions with clients across multiple channels take up a lot of their work time, and collaboration within the bank when servicing client needs remains limited.

Other “must have” capabilities –

  • Predicting Customer Attrition & Churn for a single client as well as across an advisor’s entire client base
  • Portfolio Rebalancing & risk modeling across multiple dimensions
  • Single View of Customer Segments across multiple product offerings
  • Basket Analysis based on criteria like investment preferences, asset allocation etc., i.e. “what products are typically purchased in tandem”
  • Run in-place analytics on customer lifetime value (CLV) and yield per customer
  • Suggest Next Best Action for a given client and across a pool of managed clients
  • Provide multiple levels of dashboards ranging from the Descriptive (Business Intelligence) to the Prescriptive (business simulation as well as optimization)
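
As a rough illustration of the basket analysis item above, the following Python sketch counts how often two products are held by the same client. The function name, the product names and the sample holdings are all hypothetical; a production system would run this kind of co-occurrence count (or a full association rule algorithm) over the firm's actual account data in the data lake.

```python
from collections import Counter
from itertools import combinations


def product_pairs(holdings_by_client):
    """Count how often each pair of products is held by the same client."""
    pair_counts = Counter()
    for products in holdings_by_client.values():
        # Sort and de-duplicate so that (A, B) and (B, A) count as one pair.
        for pair in combinations(sorted(set(products)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical sample data: client id -> products held.
holdings = {
    "c1": ["ETF", "Bond Fund", "Annuity"],
    "c2": ["ETF", "Bond Fund"],
    "c3": ["ETF", "Annuity"],
    "c4": ["Bond Fund", "ETF", "Trust"],
}

# The two product pairs most often held in tandem.
top_pairs = product_pairs(holdings).most_common(2)
print(top_pairs)
```

The pairs that surface most often are natural candidates for the "Next Best Action" suggestions listed above, although raw co-occurrence counts say nothing about causality and would normally be normalized by how popular each product is on its own.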

Digitize Business Processes –

Since a high degree of WM technology still lives in the legacy age, it should not be a surprise that a lot of backend processes result in client dissatisfaction as well as an inability to provide lean & efficient operations. Strategic investments in Business Process Management (BPM) systems, Big Data architectures & processing techniques and Digital Signature systems, along with augmenting tactical document management systems, can result in a high degree of digitization. This leads to seamless business interoperability, efficient client operations and an ability to turn compliance information around to regulatory authorities more quickly & efficiently.

Invest In Technology to Drive the Business –

Strategic deployment of technology assets will be the differentiator in the WM business going forward. The technology investments that WM firms need to make are in three broad areas – Big Data & Predictive Analytics, Cloud Computing & in a DevOps based approach to building out these capabilities.

Big Data & Hadoop provide the foundation for an intelligent approach to unifying data (ingesting, mining & linking micro feeds with existing core banking data) and then fostering a deep analytical approach based on predictive analytics and machine learning.

So what kind of new age business capabilities can WM firms build on a Big Data & Advanced Analytics based foundation?

  • New Client Acquisition by creating client profiles and helping develop targeted leads across a population of individuals
  • Instrument and understand Risk at multiple levels (customer churn, client risk etc) in real time
  • Advanced Portfolio Analytics
  • Performance Management Metrics for the business across client segments, advisors and specific geographies
  • Better Client Advice based on portfolio optimization which takes client life journey details into account as opposed to static age based rebalancing
  • Promoting clients’ ability to self service their accounts, thus reducing load on advisors for mundane tasks
  • The biggest (and perhaps the most famous) capability is providing Robo Advisor functionality with advanced visualization capabilities. One of the goals here is to compete with Fintechs, which are automating their customer account servicing.
  • Help with Compliance and other reporting functions

Big Data and Hadoop seem to be emerging as the platform of choice for many reasons: the ability to handle any kind of data at scale; cost; the parallelization Hadoop provides for techniques like deep learning, which need a lot of computing power; and integration with SAS, Python and R. A high degree of data preprocessing could be done via advanced MapReduce techniques. Finally, additive to all of this is an agile infrastructure based on cloud computing principles, which calls for a microservice based approach to building out software architectures and for mobile platforms that accelerate customers’ ability to bank from anywhere. DevOps dictates an increased focus on automation, from business process to software system delivery, and fosters a culture that encourages risk taking & a “fail fast” approach.

The final post in this series will cover a high level technology architecture and then specific recommendations to WM CXOs.

Big Data #2 – The velocity of Big Data & CEP vs Stream Processing

The term ‘Data In Motion’ is widely used to represent the speed at which large volumes of data are moving into corporate applications in a broad range of industries. Big Data also needs to be Fast Data. Fast in terms of processing potentially millions of messages per second while also accommodating multiple data formats across all these sources.

Stream Processing refers to the ability to process these messages (typically sensor or machine data) as soon as they hit the wire with a high event throughput to query ratio.

Some common use-cases are processing set-top data from converged media products in the Cable & Telco space to analyze device performance or outage scenarios; fraud detection in the credit card industry; stock ticker data & market feed data regarding other financial instruments that must be analyzed in split second time; smart meter data in the Utilities space; driver monitoring in transportation etc. The data stream pattern could be batch or real-time owing to the nature of the business event.

While the structure and the ingress velocity of all this data vary depending on the industry & the use case, what stays common is that these data streams must all be analyzed & acted upon as soon as possible with an eye towards business advantage.
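
To make the idea of reacting to each message as it arrives more concrete, here is a minimal single-process sketch in Python using the smart meter use-case. It is not Storm code and does not use any stream processing framework; the class, function and parameter names (`RollingAverage`, `detect_spikes`, `size`, `factor`) and the sample readings are assumptions made purely for illustration. Each reading is compared against a rolling average of the previous readings for the same device, and a spike is flagged immediately.

```python
from collections import defaultdict, deque


class RollingAverage:
    """Keep a per-device average over the last `size` readings."""

    def __init__(self, size):
        self.windows = defaultdict(lambda: deque(maxlen=size))

    def update(self, device_id, value):
        window = self.windows[device_id]
        window.append(value)  # a full deque drops its oldest reading
        return sum(window) / len(window)


def detect_spikes(readings, size=3, factor=2.0):
    """Yield (device_id, value) when a reading exceeds `factor` times
    the rolling average of that device's earlier readings."""
    tracker = RollingAverage(size)
    averages = {}
    for device_id, value in readings:
        previous = averages.get(device_id)
        if previous is not None and value > factor * previous:
            yield device_id, value
        averages[device_id] = tracker.update(device_id, value)


# Hypothetical meter readings arriving in order: (device id, kWh).
stream = [("m1", 10), ("m1", 12), ("m2", 5), ("m1", 11), ("m1", 40), ("m2", 6)]
spikes = list(detect_spikes(stream))
print(spikes)
```

In a real deployment the per-device state would be partitioned across many worker nodes so that the same logic scales horizontally, which is the part a stream processing framework takes care of.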

The leading Big Data Stream processing technology is Storm. More on its architecture and design in subsequent posts. If you currently do not have a stream based processing solution in place, Storm is a great choice: one that is robust, proven in the field and well supported as part of the Hortonworks HDP.


A range of reference usecases listed here –

Now, on the surface a lot of this sounds awfully close to Complex Event Processing (CEP). For those of you who have been in the IT industry for a while, CEP is a mature application domain & is defined by Wikipedia as follows.

“Complex Event Processing, or CEP, is primarily an event processing concept that deals with the task of processing multiple events with the goal of identifying the meaningful events within the event cloud. CEP employs techniques such as detection of complex patterns of many events, event correlation and abstraction, event hierarchies, and relationships between events such as causality, membership, and timing, and event-driven processes.”

However, these two technologies differ along the following lines –

  1. Target use cases – CEP as a technology & product category has a very valid use-case when one has the ability to define analytics (based on simple business rules) on pre-modeled discrete (rather than continuous) events. Big Data Stream processing, on the other hand, enables one to apply a range of complex analytics that can infer across multiple aspects of every message.
  2. Scale – Compared to Big Data Stream Processing, CEP supports a small to medium scale. You are typically talking a few tens of thousands of messages while supporting latencies on the order of seconds to sub-seconds (depending on the CEP product). Big Data stream processing, on the other hand, operates across hundreds of thousands of messages every second while supporting a range of latencies. For instance, Hortonworks has benchmarked Storm as processing one million 100 byte messages per second per node. Thus, Stream processing technologies have been proven at web scale, whereas CEP is still a niche and high-end capability depending upon the vertical (low latency trading) or the availability of enlightened architects & developers who get the concept.
  3. Variety of data – CEP operates on structured data types where the schema of the arriving events has been predefined into some kind of fact model, e.g. a Java POJO. The business logic is then defined using a vendor provided API. Stream Processing, by contrast, is agnostic to the arriving data format.
  4. Type of analytics – CEP’s strong suit is its ability to take atomic events and correlate them into compound events (via the API). Most of the commercially viable CEP engines thus provide a rules based language to define these patterns, which are temporal and spatial in nature, e.g. JBoss Drools DRL. Big Data Stream Processing, on the other hand, supports a superset of such analytical capabilities, with map-reduce, predictive analytics, machine learning, advanced visualization etc.
  5. Scalability – CEP engines are typically small multi-node installs with vertical scalability being the core model of scaling clusters in production. Stream processing, on the other hand, provides scalable and auto-healing clusters across typical 2 CPU dual core boxes. The model of scalability & resilience is horizontal.

Having said all of this, it is important to note that these two technologies (Hadoop ecosystem & CEP) can also co-exist. Many customers are looking to build “live” data marts where data is kept in memory as it streams in, a range of queries are continuously applied and realtime views are then shown to end users.

Microsoft has combined their StreamInsight CEP engine with Apache Hadoop, where MapReduce is used to run scaled out reducers performing complex event processing over the data partitions.
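
To illustrate what correlating atomic events into a compound event means in the CEP sense discussed above, here is a small Python sketch of one temporal and spatial pattern: the same card being used in two different cities within a short time window. This is plain Python, not Drools DRL or any vendor's CEP API; the `CardSwipe` type, the `correlate` function, the ten minute window and the sample events are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardSwipe:
    """An atomic event with a predefined schema (the 'fact model')."""
    card_id: str
    city: str
    timestamp: float  # seconds since an arbitrary start


def correlate(swipes, window_seconds=600):
    """Return compound events: pairs of consecutive swipes of the same
    card in different cities no more than `window_seconds` apart."""
    last_seen = {}
    compound = []
    for swipe in sorted(swipes, key=lambda s: s.timestamp):
        previous = last_seen.get(swipe.card_id)
        if (
            previous is not None
            and previous.city != swipe.city
            and swipe.timestamp - previous.timestamp <= window_seconds
        ):
            compound.append((previous, swipe))
        last_seen[swipe.card_id] = swipe
    return compound


# Hypothetical atomic events.
events = [
    CardSwipe("A", "London", 0),
    CardSwipe("B", "Paris", 30),
    CardSwipe("A", "Tokyo", 120),
    CardSwipe("B", "Paris", 200),
    CardSwipe("A", "Tokyo", 5000),
]
suspicious = correlate(events)
print(suspicious)
```

A CEP engine would let a business analyst declare this pattern in its rules language instead of hand coding the loop, which is why the approach works best when the event schema is known up front.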
