Cybersecurity – The biggest threat to the Digital Economy..(1/4)

“We believe that data is the phenomenon of our time. It is the world’s new natural resource. It is the new basis of competitive advantage, and it is transforming every profession and industry. If all of this is true – even inevitable – then cyber crime, by definition, is the greatest threat to every profession, every industry, every company in the world.” – IBM Corp’s Chairman & CEO Ginni Rometty, Nov 2015, NYC

The first blog of this four-part series will focus on the cybersecurity challenge across industry verticals while recapping some of the major cyber attacks of previous years. We will also discuss what responses are being put in place by Corporate Boards. Part two of this series will focus on strategies for enterprises to achieve resilience in the face of these attacks – from a technology stack standpoint. Part three will focus on advances in Big Data Analytics that provide advanced security analytics capabilities. The final post of the series will focus on the steps corporate boards, exec leadership & IT leadership need to adopt from a governance & strategy standpoint to protect their organizations from this constant onslaught.

The Cybersecurity Challenge – 

This blog has, from time to time, noted the ongoing digital transformation across industry verticals. For instance, banking organizations are building digital platforms that aim to engage customers, partners and employees. Banks now recognize that the key to winning the customer of the future is to offer a seamless experience across billions of endpoints. Healthcare providers want to offer their stakeholders – patients, doctors, nurses, suppliers etc. – multiple avenues to access contextual data and services; the IoT (Internet of Things) domain is abuzz with the possibilities of Connected Car technology.

However, the innate challenge across all of the above scenarios is that the surface area of exposure across all of these assets rises exponentially. This rise increases security risks – the risk of system compromise, data breach and, worse, system takeover.

A cursory study of the top data breaches in 2015 reads like a “Who’s Who” of actors in society across Governments, Banks, Retailers, Health providers etc. The world of business now understands that cybersecurity has gone from being a cursory IT challenge a few years ago to a board level concern that demands a comprehensive & strategic approach.

The top two business cyber-risks are data loss & the concomitant disruption to smooth operations.  The British insurance major Lloyd’s estimates that cyber attacks cost businesses as much as $400 billion a year, which includes direct damage plus post-attack disruption to the normal course of business. Vendor and media forecasts put the cybercrime figure as high as $500 billion and more.[1]

The word Cybersecurity was not as prominent in the IT lexicon a few years ago as it is now. Cybersecurity and cybercrime have become not only a nagging but also an existential threat to enterprises across a whole range of verticals – retail, financial services, healthcare and government. The frequency and sophistication of these attacks have also increased year after year.

For instance, the classical cybercriminal of a few years ago would target a Bank, a Retailer or a Healthcare provider, but things have evolved as technology has opened up new markets. As an illustration of the expanding challenge around security, there are now threats emerging around automobiles, i.e. protecting cars from being taken over by cyber attackers. Is this borne out by industry research? Yes.

ABI Research forecasted that by 2020, we will have more than 20 million connected & inter-communicating cars & other automobiles with Internet of Anything (IoAT) data flow capabilities[3]. The key concern is not just about securing the endpoints (the cars) themselves but the fact that the data flows into a corporate datacenter where it is harnessed for business uses such as preventative maintenance, new product development, manufacturing optimization and even recall avoidance. The impact and risk of the threat are then magnified as they extend across the value chain along with data & information flows.

                       Illustration: Largest Hacks of 2014 (source – [2])

The biggest cyberattacks of recent times include some of the below –

  • Home Depot – 109 million user records stolen
  • JP Morgan Chase – 83 million user records compromised
  • Sony Pictures Entertainment – 47k records stolen with significant loss of intellectual property

Cybersecurity – A Board level concern – 

The world of business is now driven by complex software & information technology. IT is now enterprise destiny. Given all of this complexity across global operating zones, perhaps no other business issue has the potential to result in massive customer drain, revenue losses, reputational risks & lawsuits from affected parties as do breaches in Cybersecurity. A major breach in security is a quick gamechanger and has the potential to put one in defensive mode for years.

Thus, Corporate Boards, which have long been insulated from technology decisions, now want to understand from their officers how they’re overseeing and mitigating cybersecurity risk. Putting into place an exemplary program that can govern across a vast & quickly evolving cybersecurity threat landscape is a vital board level responsibility. The other important point to note is that the interconnected nature of these business ecosystems implies the need for external collaboration as well as a dedicated executive to serve as a Cyber czar.

Enter the formal role of the CISO (Chief Information Security Officer)….

The CISO typically heads an independent technology and business function with a dedicated budget & resources. Her or his mandate extends from physical security (equipment lockdown, fob based access control etc.) to setting architectural security standards for business applications as well as reviewing business processes. One of the CISO’s main goals is to standardize the internal taxonomy of cyber risk and to provide a framework for quantifying these risks across a global organization.

Cyber Threat is magnified in the Digital Age – 

As IBM’s CEO states above – “Data is the phenomenon of our time.” Enterprise business is built around data assets, and data is the critical prong of any digital initiative. For instance, Digital Banking platforms & Retail applications are evolving into collections of data based ecosystems. These need to natively support loose federations of partner applications and regulatory applications which are API based & Cloud native. These applications are largely based on a microservice architecture & need to support mobile clients from the get go. Because they support massive numbers of users and carry a high business priority, they tend to take a higher priority in the overall security equation.

It must naturally follow that more and more information assets are in danger of being targeted by extremely well funded and sophisticated adversaries ranging from criminals to cyber thieves to hacktivists.


                       Illustration – Enterprise Cybersecurity Vectors

How are Enterprises responding? – 

The PwC Global State of Information Security Survey (GSISS) for 2015 has the following key findings [4]. These are important as we will expand on some of these themes in the following posts –

  • An increased adoption of risk based security frameworks, e.g. ISO 27001, the US National Institute of Standards and Technology (NIST) Cybersecurity Framework and SANS Critical Controls. These frameworks offer a common vocabulary and a set of guidelines that enable enterprises to identify and prioritize threats, quickly detect and mitigate risks and understand security gaps.
  • Increased adoption of cloud based security platforms. Cloud Computing has emerged as an advanced method of deploying data protection, network security and identity & access management capabilities. These enable enterprises to improve threat intelligence gathering & modeling thus augmenting their ability to block attacks as well as to accelerate incident responses.
  • The rapid rise and adoption of Big Data analytics – A data-driven approach can shift enterprises away from a predominantly perimeter-based defense strategy towards analyzing realtime data streams, combined with historical data, to drive security analytics and help predict cybersecurity incidents. Data-driven cybersecurity allows companies to better understand anomalous network activity and more quickly identify and respond to cybersecurity incidents. Big Data is being combined with existing security information and event management (SIEM) technologies to generate holistic views of network activity. Other use cases include the use of data analytics for insider threat surveillance.
  • A huge increase in external collaboration on cybersecurity – working with industry peers, law enforcement and government agencies as well as Information Sharing and Analysis Centers (ISACs).
  • The emergence of Cyber insurance as one of the fastest growing sectors in the insurance market, according to PwC [3]. Cybersecurity insurance is designed to mitigate business losses that could occur from a variety of cyber incidents, including data breaches. This form of insurance should be factored into more and more Enterprise Risk Management programs.
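
As a minimal sketch of the data-driven approach mentioned in the Big Data analytics finding above, the following compares a realtime reading against a historical baseline using a z-score. The metric (failed logins per minute), the sample values, the function name and the threshold of three standard deviations are all illustrative assumptions, not figures from the survey; production SIEM analytics are considerably more involved.

```python
from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard
    deviations away from the mean of the historical readings."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unusual
        return current != mu
    return abs(current - mu) / sigma > threshold

# Hypothetical historical baseline: failed logins per minute
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]

print(is_anomalous(history, 6))   # a reading close to the baseline
print(is_anomalous(history, 40))  # a reading far outside the baseline
```

The historical data supplies the notion of "normal", and the realtime stream is scored against it, which is the combination the finding describes.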

Thus far, Enterprises are clearly waking up to the threat and spending big dollars on cybersecurity. According to Gartner, worldwide spending on information security in 2015 reached $75 billion, an increase of 4.7% over 2014[1]. However, it needs to be noted that Cybersecurity compliance comes at a huge cost in terms of both manpower and the time needed to certify projects as being compliant with a set of standards – both of which lead to delays and a rise in costs.

All said, the advantage remains with the attackers – 

The key issue here is that the attackers need to succeed only once, whereas the defenders must succeed every time. Factors like technology sophistication and the number of attack vectors ensure that the surface area of exposure remains high. This ensures that the advantage lies with the cyber attacker and will do so for the foreseeable future.

Summary – 

Given all of the above, there are five important questions Corporate leaders, CXOs & industry practitioners need to ask themselves –

  1. First and foremost, can an efficient security infrastructure be not only a defensive strategy but also a defining source of competitive advantage?
  2. What are the ideal organizational structure and processes that need to be put in place to ensure continued digital resilience in the face of concerted & sophisticated attacks?
  3. Can the above (#2) be navigated without hindering the pace of innovation? How do we balance both?
  4. Given that most cyber breaches are long running in nature, with systems slowly compromised over months, how does one leverage Cloud Computing, Big Data and Predictive Modeling to rewire applications with any security flaws?
  5. Most importantly, how can applications implement security in a manner that they constantly adapt and learn? How can the CISO’s team influence infrastructure, application & data development standards & processes? 

The next post will examine the answers to some of these questions but from a technology standpoint.


  1. Cybersecurity Ventures – “The Cybersecurity market report Q1 2016”
  2. Gemalto “Cybersecurity Breach Level Index for 2014”
  3. Forbes Magazine – “Cybersecurity Market Expected to Reach 75 billion by 2015” – Steve Morgan
  4. PwC Global State of Information Security Survey 2016 (GSIS)

The Data Science Continuum in Financial Services..(3/3)

“In God we trust. All others must bring data.” – Dr. W. Edwards Deming, statistician, professor, author, lecturer, and consultant.

The first post in this three part series described key ways in which innovative applications of data science are slowly changing a somewhat insular banking & financial services industry. The second post then delineated key business use cases enabled by a data driven or data native approach. This final post will examine foundational Data Science tasks & techniques that are commonly employed to get value from data, with financial industry examples. We will round off the discussion with recommendations for industry CXOs.


The Need for Data Science –

It is no surprise that Big Data approaches were first invented & then refined in web scale businesses such as Google, Yahoo, eBay, Facebook and Amazon. These web properties offer highly user friendly, contextual & mobile native application platforms which produce a large amount of complex and multi-varied data from consumers, sensors and other telemetry devices. All this data is constantly analyzed to drive higher rates of application adoption, thus driving a virtuous cycle. We have discussed the (now) augmented capability of financial organizations to acquire, store and process large volumes of data by leveraging the HDFS (Hadoop Distributed Filesystem) running on commodity (x86) hardware.

One of the chief reasons that these webscale shops adopted Big Data is the ability to store the entire data set in Hadoop to build more accurate predictive models. The ability to store thousands of attributes at a much finer grain over a long historical period, instead of just depending on a statistically significant sample, is a significant gain over legacy data technology.

Every year Moore’s Law keeps driving the costs of raw data storage down. At the same time, compute technologies such as MapReduce, Tez, Storm and Spark have enabled the organization and analysis of Big Data at scale. The convergence of cost effective storage and scalable processing allows us to extract richer insights from data. These insights then need to be operationalized at scale to provide business value, as the use cases in the last post highlighted.

The differences between Descriptive & Predictive Analytics –

Business intelligence (BI) is a traditional & well established analytical domain that essentially takes a retrospective look at business data in systems of record. The goal of BI is primarily to look for macro or aggregate business trends across different aspects or dimensions such as time, product lines, business units & operating geographies.

BI is primarily concerned with “What happened and what trends exist in the business based on historical data?”. The typical use cases for BI include budgeting, business forecasts, reporting & key performance indicators (KPI).

On the other hand, Predictive Analytics (a subset of Data Science) augments & builds on the BI paradigm by adding a “What could happen” dimension to the data in terms of –

  • being able to probabilistically predict different business scenarios across thousands of variables
  • suggesting specific business actions based on the above outcomes

Predictive Analytics does not intend to nor will it replace the BI domain but only adds significant business capabilities that lead to overall business success. It is not uncommon to find real world business projects leveraging both these analytical approaches.
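
To illustrate the contrast between the two approaches, here is a minimal sketch that first aggregates historical revenue by quarter (the descriptive, BI-style question) and then fits a simple linear trend to project the next quarter (a very basic predictive question). The revenue figures, product lines and the helper name `fit_line` are hypothetical; a real predictive model would be probabilistic and draw on far more variables than a single trend line.

```python
from collections import defaultdict

# Hypothetical records: (quarter index, product line, revenue in $M)
sales = [
    (1, "cards", 10.0), (1, "mortgages", 20.0),
    (2, "cards", 12.0), (2, "mortgages", 21.0),
    (3, "cards", 14.0), (3, "mortgages", 22.0),
    (4, "cards", 16.0), (4, "mortgages", 23.0),
]

# Descriptive (BI): "what happened?" -> total revenue per quarter
totals = defaultdict(float)
for quarter, _line, revenue in sales:
    totals[quarter] += revenue

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Predictive: "what could happen?" -> project the trend to quarter 5
quarters = sorted(totals)
slope, intercept = fit_line(quarters, [totals[q] for q in quarters])
forecast_q5 = slope * 5 + intercept

print(dict(totals))
print(round(forecast_q5, 2))
```

Both steps run over the same data, which reflects the point above that the two approaches complement rather than replace each other.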

Data Science  –

So, what exactly is Data Science ?

Data Science is an umbrella concept that refers to the process of extracting business patterns from large volumes of structured, semi structured and unstructured data. Data Science is the key ingredient in enabling a predictive approach to the business.

Some of the key aspects that follow are  –

  1. Data Science is not just about applying analytics to massive volumes of data. It is also about exploring the patterns, associations & interrelationships of thousands of variables within the data. It does so by adopting an algorithmic approach to gleaning the business insights that are embedded in the data.
  2. Data Science is a standalone discipline that has spawned its own set of platforms, tools and processes across its lifecycle.
  3. Data science also aids in the construction of software applications & platforms to utilize such insights in a business context.  This involves the art of discovering data insights combined with the science of operationalizing them at scale. The word ‘scale’ is key. Any algorithm, model or deployment paradigm should support an expanding number of users without the need for unreasonable manual intervention as time goes on.
  4. A data scientist uses a combination of machine learning, statistics, visualization, and computer science to extract valuable business insights hiding in data and builds operational systems to deliver that value.
  5. The machine learning components are classified into two categories: ‘supervised’ and ‘unsupervised’ learning. In supervised learning, the constructed model defines the effect that the inputs have on the outputs through the causal chain. In unsupervised learning, the outputs are affected by so called latent variables. It is also possible to have a hybrid approach to certain types of mining tasks.
  6. Strategic business projects typically begin leveraging a Data Science based approach to derive business value. This approach then becomes integral and eventually core to the design and architecture of such a business system.
  7. Contrary to what some of the above may imply, Data Science is a cross-functional discipline and not just the domain of PhDs. A data scientist is part statistician, part developer and part business strategist.
  8. Working in small self sufficient teams, the Data Scientist collaborates with extended areas which include visualization specialists, developers, business analysts, data engineers, applied scientists, architects, LOB owners and DevOps. The success of data science projects often relies on the communication, collaboration, and interaction that take place with the extended team, both internally and possibly externally to their organization.
  9. It needs to be clarified that not every business project is a fit for a Data science approach. The criteria that must be employed to understand if such an advanced approach is called for include if the business initiative needs to provide knowledge based decisions (beyond the classical rules engine/ expert systems based approaches), deal with volumes of relevant data, a rapidly changing business climate, & finally where scale is required beyond what can be supplied using human analysts.
  10. Indeed any project where hugely improved access to information & realtime analytics for customers, analysts (and other stakeholders) is a must for the business – is fertile ground for Data Science.

Algorithms & Models

The word ‘model’ is highly overloaded and means different things to different IT specialities, e.g. RDBMS models imply data schemas, statistical models are built by statisticians etc. However, it can safely be said that models are representations of a business construct or a business situation.

Data mining algorithms are used to create models from data.

To create a data science model, the data mining algorithm looks for key patterns in the data provided. The results of this analysis are used to define the best parameters for the model. Once identified, these parameters are applied across the entire data set to extract actionable patterns and detailed statistics.

The model itself can take various forms, ranging from a set of customers grouped into clusters to a revenue forecasting model, a set of fraud detection rules for credit cards or a decision tree that predicts outcomes based on specific criteria.
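
To make the idea of an algorithm deriving model parameters from data more concrete, here is a minimal sketch of a one-level decision tree (often called a decision stump). The algorithm searches the training data for the single threshold that best separates the two outcomes; that learned threshold is the model's parameter and can then be applied to new records. The transaction amounts, labels and function names are hypothetical and chosen only for illustration.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that misclassifies the fewest
    training rows. Learned rule: predict True when value > threshold."""
    best_threshold, best_errors = None, len(values) + 1
    for threshold in sorted(set(values)):
        errors = sum((v > threshold) != label
                     for v, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

def predict(threshold, value):
    """Apply the learned rule to a new value."""
    return value > threshold

# Hypothetical training data: transaction amount, and whether it was fraud
amounts = [20, 35, 50, 80, 900, 1200, 1500, 60]
is_fraud = [False, False, False, False, True, True, True, False]

threshold, training_errors = fit_stump(amounts, is_fraud)
print(threshold, training_errors)
print([predict(threshold, a) for a in (45, 2000)])
```

A full decision tree repeats this kind of search recursively over many features; the stump is only the smallest instance of the idea.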

Common Data Mining Tasks –

There are many different kinds of data mining algorithms but all of these address a few fundamental types of tasks. The popular ones are listed below along with relevant examples:

  • Classification & Class Probability Estimation – For a given set of data, predict, for each individual in a population, which of a discrete set of classes that individual belongs to. An example classification is – “For all wealth management clients in a given population, who are most likely to respond to an offer to move to a higher segment”. Common techniques used in classification include decision trees, Bayesian models, k-nearest neighbors, induction rules etc. Class Probability Estimation (CPE) is a closely related concept in which a scoring model is created to predict the likelihood that an individual belongs to a given class.
  • Clustering is an unsupervised technique used to find classes or segments of populations within a larger dataset without being driven by any specific purpose. For example – “What are the natural groups our customers fall into?”. The most popular use of clustering techniques is to identify clusters to use in activities like market segmentation. A common algorithm used here is k-means clustering.
  • Market basket analysis is commonly used to find associations between entities based on transactions that involve them, e.g. recommendation engines which use affinity grouping.
  • Regression algorithms estimate or predict, for each individual, the numerical value of some variable, e.g. how much a given customer is likely to spend on a product.
  • Profiling algorithms aim to characterize the normal or typical behavior of an individual or group within a larger population. Profiling is frequently used in anomaly detection systems such as those that detect AML (Anti Money Laundering) and Credit Card fraud.
  • Causal Modeling algorithms attempt to find out what business events influence others.
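
As a minimal sketch of the classification and class probability estimation tasks listed above, the following uses k-nearest neighbors: the estimated probability is simply the share of positive outcomes among the k most similar past clients. The client features, values and function names are hypothetical, and the features are left unscaled for brevity; in practice they would normally be normalized so that no single feature dominates the distance.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_class_probability(train, query, k=3):
    """Estimate P(label is True) for `query` as the share of positive
    labels among its k nearest training points."""
    neighbors = sorted(train, key=lambda row: dist(row[0], query))[:k]
    positives = sum(1 for _features, label in neighbors if label)
    return positives / k

def knn_classify(train, query, k=3, cutoff=0.5):
    """Turn the probability estimate into a discrete class."""
    return knn_class_probability(train, query, k) >= cutoff

# Hypothetical wealth management clients:
# (assets in $M, trades per month) -> responded to the upgrade offer?
clients = [
    ((0.2, 1), False), ((0.3, 2), False), ((0.4, 1), False),
    ((2.5, 10), True), ((3.0, 12), True), ((2.8, 9), True),
    ((1.0, 11), False),
]

prospect = (2.7, 10)
print(knn_class_probability(clients, prospect))
print(knn_classify(clients, prospect))
```

The first function is the scoring model (CPE) and the second applies a cutoff to obtain the class, which shows how closely the two tasks are related.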

There is no reason that one should be limited to one of the above techniques while forming a solution. An example is to use one algorithm (say clustering) to determine the natural groups in the data, and then to apply regression to predict a specific outcome based on that data. Another example is to use multiple algorithms within a single business project to perform related but separate tasks. Ex – Using regression to create financial reporting forecasts, and then using a neural network algorithm to perform a deep analysis of the factors that influence product adoption.
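
The first combination described above (clustering followed by regression) can be sketched as follows. This is an illustrative example under stated assumptions: the customer balances and spend figures are made up, k-means is run on a single feature with a deterministic starting point, and each cluster is assumed to hold at least two distinct points so that a line can be fitted.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Lloyd's algorithm on one feature (assumes k >= 2).
    Returns (centroids, cluster index for each value)."""
    ordered = sorted(values)
    # Deterministic start: spread initial centroids across the sorted values
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        assignments = [min(range(k), key=lambda c: abs(v - centroids[c]))
                       for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignments

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical customers: average balance ($K) and monthly card spend ($K)
balances = [1, 2, 3, 4, 50, 60, 70, 80]
spend = [0.5, 1.0, 1.5, 2.0, 10, 11, 12, 13]

# Step 1: clustering finds the natural groups
centroids, assignments = kmeans_1d(balances, k=2)

# Step 2: a separate regression model is fitted within each group
models = {}
for c in range(len(centroids)):
    xs = [x for x, a in zip(balances, assignments) if a == c]
    ys = [y for y, a in zip(spend, assignments) if a == c]
    models[c] = fit_line(xs, ys)

def predict_spend(balance):
    """Route a new customer to the nearest cluster, then apply its model."""
    cluster = min(range(len(centroids)), key=lambda c: abs(balance - centroids[c]))
    slope, intercept = models[cluster]
    return slope * balance + intercept

print(centroids)
print(predict_spend(65), predict_spend(2.5))
```

Fitting one model per segment lets each group have its own relationship between balance and spend, which a single regression over all customers would blur.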

The Data Science Process  –

A general process framework for a typical Data science project is depicted below. The process flow suggests a sequential waterfall but allows for Agile/DevOps loops in the core analysis & feedback phases. The process is also not a strictly one-way pipeline; it allows for continuous improvements.


                                           Illustration: The Data Science Process 

  1. The central pool of data that hosts all the tiers of data processing in the above illustration is called the Data Lake. The Data Lake enables two key advantages – the ability to collect cross business unit data so that it can be sampled/explored at will & the ability to perform any kind of data access pattern across a shared data infrastructure: batch, interactive, search, in-memory, custom etc.
  2. The Data science process begins with a clear and contextual understanding of the granular business questions that need to be answered from the real world dataset. The Data scientist needs to be trained in the nuances of the business to achieve the appropriate outcome, e.g. predicting fraudulent credit card transactions, or predicting which customers in the Retail Bank are likely to churn over the next few months based on their usage patterns.
  3. Once this is known, relevant data needs to be collected from the real world. These sources in Banking range from –
    1. Customer Account data e.g. Names, Demographics, Linked Accounts etc
    2. Transaction Data which captures the low level details of every transaction (e.g debit, credit, transfer, credit card usage etc),
    3. Wire & Payment Data,
    4. Trade & Position Data,
    5. General Ledger Data and Data from other systems supporting core banking functions.
    6. Unstructured data. E.g social media feeds, server logs, clickstream data & mobile application data etc.
  4. Following the planning stage, Data Acquisition is an iterative process of acquiring data from the actual sources by creating appropriate loaders and choosing appropriate technology components, e.g. Apache NiFi, Kafka, Sqoop, Flume, the HDFS API, Java etc.
  5. The next step is to perform Data Cleansing. Here the goal is to look for gaps in the data (given the business context) and to ensure that the dataset is valid with no missing values, consistent in layout and as fresh as possible from a temporal standpoint. This phase also involves fixing any obvious quality problems involving range or table driven data. The purpose at this stage is also to facilitate & perform appropriate data governance.
  6. Exploratory Data Analysis (EDA) helps with trial & error analysis of data. This is a phase where plots and graphs are used to systematically go through the data. The importance of this cannot be overstated as it provides the Data scientist and the business with a flavor of the data.
  7. Data Analysis: Generation of features or attributes that will be part of the model. This is the step of the process where actual data mining takes place leveraging models built using the above algorithms.

Further iterative steps exist within the above stages, particularly within Data Cleansing and Data Analysis.
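
As an illustration of the Data Cleansing step described above, here is a minimal sketch that checks records for missing values, out-of-range amounts, values not found in a lookup table and stale dates. The field names, valid transaction types, cutoff date and sample records are all hypothetical assumptions made for the example.

```python
from datetime import date

# Hypothetical raw transaction records; None marks a missing value
raw = [
    {"account": "A1", "amount": 120.0, "type": "debit", "posted": date(2016, 1, 5)},
    {"account": "A2", "amount": None, "type": "credit", "posted": date(2016, 1, 6)},
    {"account": "A3", "amount": -50.0, "type": "debit", "posted": date(2016, 1, 6)},
    {"account": "A4", "amount": 75.5, "type": "refund?", "posted": date(2016, 1, 7)},
    {"account": "A5", "amount": 300.0, "type": "transfer", "posted": date(2015, 1, 1)},
]

VALID_TYPES = {"debit", "credit", "transfer"}  # table driven check (assumed values)
REQUIRED = ("account", "amount", "type", "posted")

def problems(record, oldest_allowed):
    """Return the list of data quality problems found in one record."""
    if any(record.get(field) is None for field in REQUIRED):
        return ["missing value"]
    found = []
    if record["amount"] <= 0:
        found.append("amount out of range")
    if record["type"] not in VALID_TYPES:
        found.append("unknown transaction type")
    if record["posted"] < oldest_allowed:
        found.append("stale record")
    return found

cutoff = date(2016, 1, 1)
clean = [r for r in raw if not problems(r, cutoff)]
rejected = {r["account"]: problems(r, cutoff) for r in raw if problems(r, cutoff)}

print([r["account"] for r in clean])
print(rejected)
```

Keeping the rejected records together with the reason for rejection, rather than silently dropping them, also supports the data governance goal mentioned in the cleansing step.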

Once the models have been tested and refined to the satisfaction of the business and have been put through a rigorous performance test phase, they are deployed into production. Once deployed, they are constantly refined based on end user and system feedback.

The Big Data ecosystem (consisting of tools such as Pig, Scalding, Hive, Spark and MapReduce etc.) enables sea changes of improvement across the entire Data science lifecycle, from data acquisition to data processing to data analysis. The ability of Big Data/Hadoop to unify all data storage in one place renders data more accessible for modeling. Hadoop also scales up machine learning analysis due to its inbuilt parallelism, which adds a tremendous amount of value in terms of training multiple parallel models to improve their efficacy. The ability to collect a lot of data as opposed to small samples also helps greatly.

Recommendations – 

Developing a strategic mindset to Data science and predictive analytics should be a board level concern. This entails

  • To begin with – ensuring buy in & commitment in the form of funding at a Senior Management level. This support needs to extend across the entire lifecycle depicted above (from identifying business use cases).
  • Extensive but realistic ROI (Return On Investment) models built during due diligence with periodic updates for executive stakeholders
  • On a similar note, ensuring buy in using a strategy of co-opting & alignment with Quants and different high potential areas of the business (as covered in the usecases in the last blog)
  • Identifying leaders within the organization who can not only lead important projects but also create compelling content to evangelize the use of predictive analytics
  • Begin to tactically bake in or embed data science capabilities across different lines of business and horizontal IT
  • Slowly moving adoption to the Risk, Fraud, Cybersecurity and Compliance teams as part of the second wave. This is critical in ensuring that analysts across these areas move from a spreadsheet intensive model to adopting advanced statistical techniques
  • Creating a Predictive Analytics COE (Center of Excellence) that enables cross pollination of ideas across the fields of statistical modeling, data mining, text analytics, and Big Data technology
  • Informing the regulatory authorities of one’s intentions to leverage data science across the spectrum of operations
  • Ensuring that issues related to data privacy, audit & compliance have been given a great deal of forethought
  • Identifying & developing human skills in toolsets (across open source and closed source) that facilitate adapting to data lake based architectures. A large part of this is to organically grow the talent pool by instituting a college recruitment process

While this ends the current series on Data Science in financial services, it is my intention to explore each of the above Data Mining techniques to a greater degree of depth as applied to specific business situations in 2016 & beyond. This being said, we will take a look at another pressing business & strategic concern – Cybersecurity in Banking – in the next series.

Data Driven Decisions in Financial Services..(2/3)

“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay!” – Sherlock Holmes, in Conan Doyle’s “The Adventure of the Copper Beeches”

The first post in this three part series described the key ways in which innovative applications of data science are changing a somewhat insular and clubby banking & financial services industry. This disruption rages across the spectrum from both a business model as well as an organizational cultural standpoint. This second post examines key & concrete use cases enabled by a ‘Data Driven’ approach in the industry. The next & final post will examine foundational Data Science tasks & techniques commonly employed to get value from data.

Big Data platforms, powered by Open Source Hadoop, can not only economically store large volumes of structured, unstructured or semi-structured data but also help process it at scale. The result is a steady supply of continuous, predictive and actionable intelligence. With the advent of Hadoop and Big Data ecosystem technologies, Bank IT (across a spectrum of business services) is now able to ingest, onboard & analyze massive quantities of data at a much lower price point.

One can thus not only generate insights using a traditional ad-hoc querying (or descriptive intelligence) model but also build advanced statistical models on the data. These advanced techniques leverage data mining tasks (like classification, clustering, regression analysis, neural networks etc.) to perform highly robust predictive modeling. Owing to Hadoop’s natural ability to work with any kind of data, this can encompass the streaming and realtime paradigms in addition to the traditional historical (or batch) mode.

Further, Big Data also helps Banks capture and commingle diverse datasets that can improve their analytics in combination with improved visualization tools that aid in the exploration & monetization of data.

Now, let’s break the above summary down into specifics.

Data In Banking

Corporate IT organizations in the financial industry have for many years been tackling data challenges caused by strict silo based approaches that inhibit data agility.

Consider some of the traditional (or INTERNAL) sources of data in banking –

  • Customer Account data e.g. Names, Demographics, Linked Accounts etc
  • Core Banking Data
  • Transaction Data which captures the low level details of every transaction (e.g debit, credit, transfer, credit card usage etc)
  • Wire & Payment Data
  • Trade & Position Data
  • General Ledger Data e.g AP (accounts payable), AR (accounts receivable), cash management & purchasing information etc.
  • Data from other systems supporting banking reporting functions.

To provide the reader with a wider perspective, a vast majority of the above traditional data is almost all human generated. However, with the advent of smart sensors, enhancements in telemetry based devices like ATMs, POS terminals etc –  machines are beginning to generate even more data. Thus, every time a banking customer clicks a button on their financial provider’s website or makes a purchase using a credit card or calls her bank using the phone – a digital trail is created. Mobile apps drive a ever growing number of interactions due to the sheer nature of interconnected services – banking, retail, airlines, hotels etc. The result is lots of data and metadata that is MACHINE & App generated.

In addition to the above internal & external sources, commercially available 3rd party datasets are now widely available for purchase. These range from crop yields to car purchases to customer preference data (segmented by age or affluence categories) and social media feedback regarding financial & retail product usage. As financial services firms sign up partnerships in Retail, Government and Manufacturing, these data volumes will only begin to explode in size & velocity. The key point is that an ever growing number of customer facing interfaces are now available for firms to collect data in a manner that was never possible before.

Where can Predictive Analytics help – 

Let us now examine some of the main use cases out there, as depicted in the picture below –


                              Illustration – Data Science led disruption in Banking

Defensive Use Cases Across the Banking Spectrum (RFC) – Risk, Fraud & Security

Internal Risk & Compliance departments are increasingly turning to Data Science techniques to create & run models on aggregated risk data. Multiple types of models and algorithms are used to find patterns of fraud and anomalies in the data and to predict customer behavior. Examples include Bayesian filters, Clustering, Regression Analysis, Neural Networks etc. Data Scientists & Business Analysts have a choice of MapReduce, Spark (via Java, Python, R), Storm and SAS, to name a few, to create these models. Fraud model development, testing and deployment on fresh & historical data become very straightforward to implement on Hadoop.

  • Risk Data Aggregation and Measurement – Measure and project different kinds of banking risks (Market Risk, Credit Risk, Loan Default and Operational Risk). The applications for Data Science range from predicting different risk metrics across market and credit risk in Capital Markets, to classifying products & customers into different risk categories in Consumer Banking sectors like mortgage banking, credit cards & other financial products, to predicting risk scores and risk portfolio trends across thousands of variables.
  • Fraud Detection – Detect and predict institutional fraud for a range of usecases – Anti Money Laundering Compliance (AML), Know Your Customer (KYC), watchlist screening, tax evasion, Linked Entity Analysis etc. In the area of individual level fraud – credit card fraud & mortgage fraud – predictive models are developed which constantly analyze customer spending patterns, location & travel details, employment details and social networks to detect in real time if customer accounts are being compromised.
  • Cyber Security – Analyze clickstreams, network packet capture data, weblogs, image data, telemetry data to predict security compromises & to provide advanced security analytics.
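
As a rough illustration of the individual-level fraud checks described above, the following Python sketch flags a card transaction whose amount deviates sharply from a customer's own spending history. The function name `is_anomalous`, the sample amounts and the threshold of three standard deviations are assumptions chosen for illustration; a production model would also draw on location, travel, employment and social network features, as noted above.

```python
from statistics import mean, stdev


def is_anomalous(history, amount, threshold=3.0):
    """Return True if `amount` is more than `threshold` standard deviations
    away from the mean of the customer's past transaction amounts."""
    if len(history) < 2:
        # Not enough history to estimate a spread, so do not flag.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Every past amount was identical; any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Hypothetical spending history for one customer
history = [42.0, 38.5, 55.0, 47.25, 51.0]

print(is_anomalous(history, 49.0))   # a typical purchase
print(is_anomalous(history, 900.0))  # far outside the usual range
```

A per-customer baseline of this kind is only the simplest possible stand-in for the clustering, regression and neural network models mentioned earlier, but it shows how a spending pattern can be turned into a real time decision.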

Capital Markets, Consumer Banking, Payment Systems & Wealth Management

A) Capital Markets

  • Algorithmic Trading – Data Science augments trading infrastructures in several ways. It helps re-tool existing trading infrastructures so that they are more integrated yet loosely coupled and efficient, by helping plug in algorithm based complex trading strategies that are quantitative in nature across a range of asset classes like equities, forex, ETFs and commodities. It also helps with trade execution once Hadoop incorporates newer & faster sources of data (social media, sensor data, clickstream data) and not just the conventional sources (market data, position data, M&A data, transaction data etc). Examples include retrofitting existing trade systems to accommodate a range of mobile clients who have a vested interest in deriving analytics, and marrying tick data with market structure information to understand why certain securities dip or spike at certain points (e.g. institutional selling or equity linked trades with derivatives).
  • Trade Analytics – Trade Strategy development is now a complex process where heterogeneous data, ranging from market data and existing positions to corporate actions and social & sentiment data, is blended together to obtain insights into possible market movements, trader yield & profitability across multiple trading desks.
  • Market & Trade Surveillance – An intelligent surveillance system needs to store trade data, reference data, order data, and market data, as well as all of the relevant communications from all the disparate systems, both internally and externally, and then match these things appropriately. The system needs to account for multiple levels of detection capabilities starting with a) configuring business rules (that describe a fraud pattern) as well as b) dynamic capabilities based on machine learning models (typically thought of as being more predictive) to detect complex patterns that pertain to insider trading and other market integrity compromises. Such a system also needs to be able to parallelize model execution at scale to be able to meet demanding latency requirements.
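
The two detection levels described for surveillance, configured business rules plus a model based score, can be sketched as below. Everything here is hypothetical: the `Trade` fields, the two rules, and in particular the hand-set weights inside `model_score`, which merely stand in for a trained machine learning model.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Trade:
    trader: str
    quantity: int
    avg_quantity: float                # trader's historical average order size
    minutes_to_news: Optional[float]   # time until a price-moving announcement, if known


# Level (a): business rules that each describe a known suspicious pattern.
Rule = Tuple[str, Callable[[Trade], bool]]
RULES: List[Rule] = [
    ("large_order", lambda t: t.quantity > 10 * t.avg_quantity),
    ("trade_just_before_news",
     lambda t: t.minutes_to_news is not None and t.minutes_to_news < 30),
]


def model_score(trade: Trade) -> float:
    """Level (b): stand-in for a trained model. A logistic function with
    hand-picked (not learned) weights maps features to a score in (0, 1)."""
    size_ratio = trade.quantity / trade.avg_quantity
    near_news = 1.0 if (trade.minutes_to_news is not None
                        and trade.minutes_to_news < 120) else 0.0
    z = -4.0 + 0.3 * size_ratio + 2.0 * near_news
    return 1.0 / (1.0 + math.exp(-z))


def evaluate(trade: Trade, score_threshold: float = 0.5) -> List[str]:
    """Return the names of all rules that fired, plus 'model_flag' if the
    model score reaches the threshold."""
    alerts = [name for name, predicate in RULES if predicate(trade)]
    if model_score(trade) >= score_threshold:
        alerts.append("model_flag")
    return alerts


print(evaluate(Trade("t1", 100, 100.0, None)))   # routine order
print(evaluate(Trade("t2", 1500, 100.0, 20.0)))  # trips both rules and the model
print(evaluate(Trade("t3", 800, 100.0, 90.0)))   # passes the rules, caught by the model
```

The third trade is the interesting one: it stays under every fixed rule, yet the combined score still flags it, which is the argument for layering a predictive model on top of static rules.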

B) Consumer Banking & Wealth Management

Data Science has been proven in several applications in consumer banking ranging from a single view of customer to mapping customer journey across multiple financial products & channels. Techniques like pattern analysis (detecting new patterns within and across datasets), marketing analysis (across channels), recommendation analysis (across groups of products) are becoming fairly common. One can see a clear trend in early adopter consumer banking & private banking institutions in moving to an “Analytics first” approach to creating new business applications.
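
As a small illustration of recommendation analysis across groups of products, the sketch below counts which products are commonly held together and suggests the most frequent companion product that a customer does not yet hold. The product names, the sample holdings and the functions `co_occurrence` and `recommend` are assumptions made for this example; real recommenders use far richer signals.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(holdings):
    """Count how often each pair of products is held by the same customer.
    `holdings` maps a customer id to the set of products that customer holds."""
    counts = Counter()
    for products in holdings.values():
        for a, b in combinations(sorted(products), 2):
            counts[(a, b)] += 1
    return counts


def recommend(customer_products, counts, top_n=1):
    """Suggest products that most often co-occur with what the customer holds."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a in customer_products and b not in customer_products:
            scores[b] += n
        elif b in customer_products and a not in customer_products:
            scores[a] += n
    return [product for product, _ in scores.most_common(top_n)]


# Hypothetical product holdings for four customers
holdings = {
    "c1": {"checking", "credit_card"},
    "c2": {"checking", "credit_card", "mortgage"},
    "c3": {"checking", "credit_card"},
    "c4": {"checking", "mortgage"},
}

counts = co_occurrence(holdings)
print(recommend({"checking"}, counts))
```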

  • Customer 360 & Segmentation –
    Currently most Retail and Consumer Banks lack a comprehensive view of their customers. Each department has a limited view of the customer, so the offers and interactions with customers across multiple channels are typically inconsistent and vary a lot. This also results in limited collaboration within the bank when servicing customer needs. By leveraging the ingestion and predictive capabilities of a Hadoop based platform, Banks can provide a user experience that rivals Facebook, Twitter or Google, built on a full picture of the customer across all touch points.
  • Some of the more granular business usecases that span the spectrum in Consumer Banking include –
    • Improve profitability per retail or cards customer across the lifecycle by targeting at both micro and macro levels (customer populations). This is done by combining the rich diverse datasets – existing transaction data, interaction data, social media feeds, online visits, cross channel data etc – as well as understanding customer preferences across similar segments
    • Detect customer dissatisfaction by analyzing transaction and call center data
    • Cross sell and upsell opportunities across different products
    • Help improve the product creation & pricing process
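
The idea behind Customer 360 is to fold the limited, per-department views into one record per customer. A minimal Python sketch of that consolidation step follows; the source names, the fields and the function `build_customer_360` are hypothetical, and it assumes every department already shares a common `customer_id`, which in practice usually requires an entity resolution step first.

```python
from collections import defaultdict


def build_customer_360(*sources):
    """Merge records from several departmental sources into one view per customer.
    Each source is a (source_name, records) pair, where every record is a dict
    containing a 'customer_id' key."""
    view = defaultdict(dict)
    for source_name, records in sources:
        for record in records:
            customer_id = record["customer_id"]
            attributes = {k: v for k, v in record.items() if k != "customer_id"}
            view[customer_id].setdefault(source_name, []).append(attributes)
    return dict(view)


# Hypothetical departmental silos
cards = [{"customer_id": "C1", "card_type": "gold"}]
mortgages = [
    {"customer_id": "C1", "balance": 250000},
    {"customer_id": "C2", "balance": 90000},
]
call_center = [{"customer_id": "C2", "reason": "fee complaint"}]

view = build_customer_360(
    ("cards", cards), ("mortgages", mortgages), ("call_center", call_center)
)
print(view["C1"])
print(view["C2"])
```

Once the records sit side by side, the granular use cases above (profitability, dissatisfaction, cross sell) become queries over a single structure rather than reconciliations across departments.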

C) Payment Networks

The real time data processing capabilities of Hadoop allow it to process data in a continual, bursty, streaming or micro batching fashion. Once payment data is ingested, it must be processed in a very small time period (hundreds of milliseconds), which is typically termed near real time (NRT). When combined with predictive capabilities via behavioral modeling & transaction profiling, Data Science can provide significant operational, time & cost savings across the below areas.

  • Obtaining a single view of customer across multiple modes of payments
  • Detecting payment fraud by using behavior modeling
  • Understand which payment modes are used more by which customers
  • Realtime analytics support
  • Tracking, modeling & understanding customer loyalty
  • Social network and entity link analysis
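
To show what micro batching means in practice, here is a small Python sketch that groups a time-ordered stream of payment events into short fixed-length windows, so that each window can be scored together within a near real time budget. The generator `micro_batches`, the 100 millisecond window and the sample events are illustrative assumptions; engines in the Hadoop ecosystem such as Spark or Storm provide their own windowing mechanisms.

```python
def micro_batches(events, window_ms):
    """Group (timestamp_ms, payload) events into consecutive batches.
    Events are assumed to arrive sorted by timestamp. A new window opens at
    the first event after the previous window closed and lasts `window_ms`."""
    batch = []
    window_end = None
    for timestamp_ms, payload in events:
        if window_end is not None and timestamp_ms >= window_end:
            # The current window has closed: emit it and start a fresh one.
            yield batch
            batch = []
            window_end = None
        if window_end is None:
            window_end = timestamp_ms + window_ms
        batch.append((timestamp_ms, payload))
    if batch:
        # Emit whatever remains when the stream ends.
        yield batch


# Hypothetical payment events: (arrival time in ms, payment id)
events = [(0, "p1"), (50, "p2"), (120, "p3"), (130, "p4"), (400, "p5")]

for batch in micro_batches(events, window_ms=100):
    print([payment_id for _, payment_id in batch])
```

Smaller windows reduce latency at the cost of more frequent, smaller units of work, which is the basic trade-off when tuning a near real time payment pipeline.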

The road ahead – 

How can leaders in the Banking industry leverage a predictive analytics based approach across each of these industry segments?

I posit that this will take place in four ways –

  • Using data to create digital platforms that better engage customers, partners and employees
  • Capturing & analyzing any and all data streams from both conventional and newer sources to compile a 360 degree view of the retail customer, institutional client or payment or fraud etc. This is critical to be able to market to the customer as one entity and to assess risk across that one entity as well as populations of entities
  • Creating data products by breaking down data silos and other internal organizational barriers
  • Using data driven insights to support a culture of continuous innovation and experimentation

The next & final post will examine specific Data Science techniques covering key algorithms and other computational approaches. We will also cover business & strategy recommendations for industry CXOs embarking on Data Science projects.

Big Data & Advanced Analytics drive profits in Financial Services..(1/3)

“Silicon Valley is coming. There are hundreds of start-ups with a lot of brains and money working on various alternatives to traditional banking….the ones you read about most are in the lending business, whereby the firms can lend to individuals and small businesses very quickly and — these entities believe — effectively by using Big Data to enhance credit underwriting. They are very good at reducing the ‘pain points’ in that they can make loans in minutes, which might take banks weeks.” – Jamie Dimon, CEO of JP Morgan Chase, in Annual Letter to Shareholders, Feb 2016 [1].

If Jamie Dimon’s opinion is anything to go by, the Financial Services industry is undergoing a major transformation and it is very evident that Banking as we know it will change dramatically over the next few years. This blog has spent some time over the last year defining the Big Data landscape in Banking. However the rules of the game are changing from mere data harnessing to leveraging data to drive profits. With that background, let us begin examining the popular applications of Data Science in the financial industry. This blog covers the motivation for and need of data mining in Banking. The next blog will introduce key usecases and we will round off the discussion in the third & final post by covering key algorithms, and other computational approaches.

The Banking industry produces the most data of any vertical out there, with well defined & long standing business processes that have stood the test of time. Banks possess rich troves of data that pertain to customer transactions & demographic information. However, it is not enough for Bank IT to just possess the data. They must be able to drive change through legacy thinking and infrastructures as the entire industry changes around them, and not just from a risk & compliance standpoint. For instance, a major new segment is millennial customers, who increasingly use mobile devices and demand more contextual services as well as a seamless unified banking experience, akin to what they commonly experience on the internet at web properties like Facebook, Amazon, Uber, Google or Yahoo.

How do Banks stay relevant in this race? A large part of the answer is to make Big Data a strategic & boardroom level discussion and to take an industrial approach to predictive analytics. The approach currently in vogue, which treats these as one-off, tactical project investments, simply does not work or scale anymore. There are various organizational models that one could employ, ranging from a shared service to a line of business led approach. An approach that I have seen work very well is to build a Center of Excellence (COE) to create contextual capabilities, best practices and rollout strategies across the larger organization.

Banks need to lead with Business Strategy 

A strategic approach to industrializing analytics in a Banking organization can add massive value and competitive differentiation in five distinct categories –

  1. Exponentially improve existing business processes, e.g. risk data aggregation and measurement, financial compliance, fraud detection
  2. Help create new business models and go to market strategies – by monetizing multiple data sources – both internal and external
  3. Vastly improve customer satisfaction by generating better insights across the customer journey
  4. Increase security while expanding access to relevant data throughout the enterprise to knowledge workers
  5. Help drive end to end digitization

Financial Services gradually evolves from Big Data 1.0 to 2.0

Predictive analytics & data mining have only been growing in popularity in recent years. However, when coupled with Big Data, they are on their way to attaining a higher degree of business capability & visibility.

Let us take a quick walk down memory lane.

In Big Data 1.0 – (2009-2015), a large technical area of focus was to ingest huge volumes of data to process them in a batch oriented fashion to perform a limited number of business usecases. In the era of 2.0, the focus is on enabling applications to perform high, medium or low latency based complex processing.

In the age of 1.0, Banking organizations across the spectrum, ranging from the mega banks to smaller regional banks to asset managers, have used the capability to acquire, store and process large volumes of data using commodity hardware at a much lower price point. This has resulted in a huge reduction in CapEx & OpEx spend on data management projects (Big Data augments legacy investments in MPP systems, Data Warehouses, RDBMS’s etc).

The age of Big Data 1.0 in financial services is almost over and the dawn of Big Data 2.0 is now upon the industry. One may ask, “what is the difference?” I would contend that while Big Data 1.0 largely dealt with the identification, on-boarding and broad governance of the data, 2.0 will begin the redefinition of business based on the ability to deploy advanced processing techniques across a plethora of new & existing sources of data. 2.0 will thus be about extracting richer insights from the onboarded data to serve customers better, stay compliant with regulation & to create new businesses. The new role of the ‘Data Scientist’, an interdisciplinary expert (part business strategist, part programmer, part statistician, part data miner & part business analyst), has come to represent one of the most highly coveted job skills today.

Long before the term “Data Science” entered the technology lexicon, the Capital Markets employed advanced quantitative techniques. The emergence of Big Data has only opened up new avenues in machine learning, data mining and artificial intelligence.


                                                    Illustration: Data drives Banking

Why is that?

Hadoop, which is now really a platform ecosystem of 30+ projects as opposed to a standalone technology, has been reimagined twice and now forms the backbone of any financial services data initiative. Thus, Hadoop has now evolved into a dual persona: an Application platform in addition to being a platform for data storage & processing.

Why are Big Data and Hadoop the ideal platform for Predictive Analytics?

Big Data is dramatically changing the traditional approach to analytics with advanced analytic solutions that are powerful and fast enough not only to detect fraud in real time but also to build models based on historical data (and deep learning) to proactively identify risks.

The reasons why Hadoop is emerging as the best choice for predictive analytics are

  1. Access to advances in infrastructure & computing capabilities at a very low cost
  2. Monumental advances in the algorithmic techniques themselves, e.g. mathematical abilities, feature sets, performance etc
  3. Low cost & efficient access to tremendous amounts of data & the ability to store it at scale

Technologies in the Hadoop ecosystem such as ingestion frameworks (Flume, Kafka, Sqoop etc) and processing frameworks (MapReduce, Storm, Spark et al) have enabled the collection, organization and analysis of Big Data at scale. Hadoop supports multiple ways of running models and algorithms that are used to find patterns of customer behavior, business risks, cyber security violations, fraud and compliance anomalies in the mountains of data. Examples of these models include Bayesian filters, Clustering, Regression Analysis, Neural Networks etc. Data Scientists & Business Analysts have a choice of MapReduce, Spark (via Java, Python, R), Storm and SAS, to name a few, to create these models. Fraud model development, testing and deployment on fresh & historical data become very straightforward to implement on Hadoop.

However, the story around Big Data adoption in your average Bank is typically not all that revolutionary. It usually follows a more evolutionary cycle where a rigorous engineering approach is applied to gain small business wins before scaling up to more transformative projects. Leveraging an open enterprise Hadoop approach, Big Data centric business initiatives in financial services have begun realizing value in a range of areas, from the Defensive (Risk, Fraud and Compliance – RFC) to achieving Competitive Parity (e.g. Single View of Customer) to the Offensive (Digital Transformation across their Retail Banking business, unified Trade Data repositories in Capital Markets).

With the stage thus set, the next post will describe real world compelling usecases for Predictive Analytics across the spectrum of 21st century banking.