The Emerging Role for Big Data and Machine Learning on the Buy Side in Financial Services..

The Buy Side is perhaps the biggest segment of Wall St & the financial markets – there are more than 7,000 mutual funds and thousands of hedge funds, which invest across more than 40,000 instruments – stocks, bonds and other securities. Thus, one of the important business functions at Buy Side institutions such as Mutual Funds, Hedge Funds, Trusts, Asset Managers, Pension Funds & Private Equity is to constantly analyze a range of information about the companies underlying these instruments to determine their investment worthiness.

The Changing Nature of the Buy Side circa 2018…

When compared with the rest of the financial services industry, the investment and asset management sector has lagged behind on many of the business and technology shifts of recent decades, as we have cataloged in the series of blogs below.

The State of Global Wealth Management..(1/3)

Given the competitive nature of the market, commodified investment strategies will need to change rapidly to incorporate more advanced technology into the decision-making process. Combined with substandard performance across a crucial sector of the Buy Side – Hedge Funds – over the last couple of years, there is suddenly a need to incorporate innovative approaches to enhancing Alpha.

This is even more important in this age of real-time information. Market trends, sentiment shifts, operational risk issues and negative news seem to crop up virtually every day.

All of the above information sources can dramatically change the quality of an underlying financial instrument. At some point, a human portfolio manager can no longer keep up with the information onslaught, which calls for techniques around advanced intelligence and automation.

In this blog post, I will discuss key recommendations across the spectrum of Big Data and Artificial Intelligence techniques to help store, process and analyze hundreds of data points across a universe of millions of potential investments.

Recommendation #1 Focus on Non-Traditional Datasets…

Traditional investment management has tended to focus on financial analysis, that is, rigorous fundamental analysis of the investment worthiness of the underlying company. At larger Buy Side firms, especially the big mutual funds, tens of Portfolio Managers & Analysts constantly analyze a range of data – both quantitative (e.g. financial statements such as balance sheets and cash flow statements) and qualitative (e.g. industry trends, supply chain information etc). The trend analysis is typically broken up into three broad areas – Momentum, Value (relative to other players in the same segment) and Future Profitability. It is also not uncommon for large mutual funds to add and remove companies from their investment portfolio constantly – almost on a weekly basis. I propose that firms expand the underlying data beyond the traditional sources identified above to include some of the newer kinds depicted in the illustration below. The information asymmetry advantage conferred by using a wider range of data sources has the potential to produce outsize investment performance.

Recommendation #2 Leverage advances in Big Data storage and processing…

We start moving into the technology now. First, a range of non-traditional data has to be identified and then ingested into a set of commodity servers either in an on-premise data center or using a cloud provider such as Amazon AWS or Microsoft Azure. It then needs to be curated, by applying business level processing. This could include identifying businesses using fundamental analysis or applying algorithms that spot patterns in data that pertain to attractiveness based on certain trending themes etc.

As the table below captures, the advent of Big Data collection, storage, and processing techniques now enables a range of information-led capabilities that were simply not possible with older technology.

All of these non-traditional data streams shown above and depicted below can be stored on commodity hardware clusters. This can be done at a fraction of the cost of traditional SAN storage. The combined data can then be analyzed effectively in near real-time thus providing support for advanced business capabilities.

Driver | Business Value | Example
Data Volumes | Larger data sets allow analysts to query and conduct experiments with fewer iterations to understand which businesses fit certain investment strength criteria | Omnichannel data, customer engagement data, ticker data, pricing data, sales volumes across longer time horizons. Social media and third-party datasets etc
Data Variety | New data types spanning text, images, time series data and video | Business process data, audio data, images, sensor & device data. Publicly available statements and OTC contracts
Analytics and visualization | More powerful analytics and visualization tools to explain and explore investment themes and patterns | Complex Event Processing (CEP), predictive analytics. Portfolio and risk management dashboards
Data Velocity | Open source software tools. Lower server and enterprise storage costs | Hadoop, NoSQL. Commodity hardware. Elastic compute capacity.

Recommendation #3 For certain key areas of the process esp Portfolio Backtesting and Risk Management, adopt Parallel Processing Techniques…

We have covered how the rapidly flowing information across markets creates opportunities for buy-side firms that can exploit this data. In this context, a key capability is backtesting key algorithmic strategies against years' worth of historical data. These strategies can range from deciding when to trade away exposure to capital optimization. The scale of the analysis problem is immense, with tens of thousands of investment prospects (read companies) operating across 30+ countries on 6 continents. Every time an algorithm is tweaked, extensive backtesting must be performed on a few quarters or years of historical data to assess its performance.

Big Data has a huge inherent architectural advantage here: it minimizes data movement by bringing the processing to the data. This can cut the time taken to run these kinds of backtesting and risk analyses across terabytes of data to hours, as opposed to the day or two taken by older technology.
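
To make the idea concrete, here is a minimal sketch of fanning out independent backtest runs across a pool of workers. Everything in it is an assumption for illustration: the price series is made up, the moving-average crossover rule is just a stand-in strategy, and the names `backtest` and `run_backtests` are hypothetical. A thread pool keeps the sketch portable; for genuinely CPU-bound runs one would swap in a process pool or a cluster framework that runs each job next to its slice of the data.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical daily closing prices for one instrument (illustrative data only).
PRICES = [100, 101, 103, 102, 105, 107, 106, 110, 108, 111, 115, 113]


def moving_average(prices, window, end):
    """Average of the `window` prices ending at index `end` (inclusive)."""
    return sum(prices[end - window + 1:end + 1]) / window


def backtest(params):
    """Run one moving-average crossover backtest and return (params, pnl)."""
    fast, slow = params
    position, entry, pnl = 0, 0.0, 0.0
    for day in range(slow - 1, len(PRICES)):
        signal = moving_average(PRICES, fast, day) > moving_average(PRICES, slow, day)
        if signal and position == 0:
            # Fast average crossed above slow average: enter a long position.
            position, entry = 1, PRICES[day]
        elif not signal and position == 1:
            # Signal gone: close the position and book the profit or loss.
            pnl += PRICES[day] - entry
            position = 0
    if position == 1:
        # Mark any still-open position to the last available price.
        pnl += PRICES[-1] - entry
    return params, pnl


def run_backtests(param_grid, executor_cls=ThreadPoolExecutor, workers=4):
    """Evaluate every parameter set independently across a pool of workers."""
    with executor_cls(max_workers=workers) as pool:
        return dict(pool.map(backtest, param_grid))


# Each (fast, slow) pair is an independent job, so the grid parallelizes trivially.
grid = [(fast, slow) for fast, slow in product([2, 3], [4, 5, 6]) if fast < slow]
results = run_backtests(grid)
best = max(results, key=results.get)
print("best parameters:", best, "pnl:", results[best])
```

Because every tweak to an algorithm triggers a fresh sweep like this, the independence of the runs is what makes the workload a good fit for a parallel platform.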

Recommendation #4 Adopt & Accelerate Adoption of Machine Learning Techniques…

Given that the process of investment research is rapidly becoming a data and analytics challenge, what are the new techniques in the analytics space that can help?

Big Data & Advanced Analytics drive profits in Financial Services..(1/3)

  • Classification & Class Probability Estimation – For a given set of data, predict for each individual in a population which of a discrete set of classes that individual belongs to. An example classification is – “For all wealth management clients in a given population, who are most likely to respond to an offer to move to a higher segment”. Common techniques used in classification include decision trees, Bayesian models, k-nearest neighbors, induction rules etc. Class Probability Estimation (CPE) is a closely related concept in which a scoring model is created to predict the likelihood that an individual belongs to a given class. Classical machine learning techniques such as clustering, segmentation, and classification can be employed to create models that automatically segment investment prospects into key categories. These could be based on certain key investment criteria or factors.
  • Testing investment hypotheses by understanding hitherto hidden relationships among the underlying data
  • Constantly learning from the underlying data and then ranking companies based on investment metrics and criteria
  • Adopting Natural Language Processing (NLP) techniques to read and analyze thousands of text documents such as regulatory filings, research reports etc. A key use case is to understand what kinds of geopolitical events can cause movements in location-sensitive instruments such as heavy metals and commodities such as oil. This is very important as markets move in concert. Such events can be analyzed on the fly to rebalance not just exposures but also client portfolios. The use cases for NLP are myriad.
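
As a rough illustration of the first bullet, the sketch below uses k-nearest neighbors, one of the techniques named above, to estimate class probabilities for an investment prospect and then assign it to a category. The two features (a momentum score and a value score), the labels and the training points are all made up for the example, and the function names are hypothetical.

```python
from collections import Counter
from math import dist

# Hypothetical labelled prospects: (momentum score, value score) -> category.
TRAINING = [
    ((0.9, 0.2), "momentum"),
    ((0.8, 0.3), "momentum"),
    ((0.7, 0.1), "momentum"),
    ((0.2, 0.9), "value"),
    ((0.1, 0.8), "value"),
    ((0.3, 0.7), "value"),
]


def class_probabilities(point, training=TRAINING, k=3):
    """Class probability estimation: share of each label among the k nearest neighbors."""
    neighbours = sorted(training, key=lambda row: dist(point, row[0]))[:k]
    counts = Counter(label for _, label in neighbours)
    return {label: n / k for label, n in counts.items()}


def classify(point, training=TRAINING, k=3):
    """Classification: pick the label with the highest estimated probability."""
    probs = class_probabilities(point, training, k)
    return max(probs, key=probs.get)


print(class_probabilities((0.85, 0.25)))
print(classify((0.4, 0.6)))
```

The probability scores are what allow prospects to be ranked rather than merely bucketed, which is the link to the third bullet above.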

Recommendation #5 Leverage Partnerships…

We are aware of the fact that the above investments in technology may be a huge ask of small and mid-level buy-side firms which have viewed technology as a supporting function. However, there now exist service providers that provide the infrastructure, curated data feeds, and custom analytics as a SaaS (Software-as-a-service) to interested clients. Let not size and potential upfront CapEx investment deter these firms from driving their investment methodology to a data-driven process.

Recommendation #6 Increase Automation via Analytics but Human still stays in the loop…

None of the above technology recommendations are intended to displace a portfolio manager who has years of rich industry experience and expertise. The above technology stack can enable these expensive resources to focus their valuable time on activities that add meaningful business value, e.g. interviewing key investment prospects, real-time analysis/portfolio rebalancing, trade execution and management/strategic reporting. Technology is just an aid in that sense and serves as an assistant to the portfolio manager.


Leading actively managed funds are all about selection, allocation and risk/return assessments. The business goal is to ultimately generate insights that can drive higher investment returns or to shield from investment risk. As Buy Side firms across the board evolve in 2018, one of the key themes from a business and technological standpoint is leveraging AI & Big Data technologies to transform their internal research process from a resource-intensive process to a data-driven investment process.

The Big Data Landscape – My Predictions for 2018…

In 2018 we are rapidly entering what I would like to call ‘Big Data 3.0’. This is the age of ‘Converged Big Data’ where its various complementary technologies – Data Science, DevOps, Business Automation begin to all come together to solve complex industry challenges in areas as diverse as Manufacturing, Insurance, IoT, Smart Cities and Banking. 

First, we had Big Data 1.0…

In the first pass of the Big Data era, Hadoop was the low-cost storage solution. Companies saved tens of billions of dollars by avoiding costly and inflexible enterprise data warehouse (EDW) projects. Nearly every large organization has begun deploying Hadoop as an Enterprise Landing Zone (ELZ) to augment an EDW. The early corporate movers working with the leading vendors more or less figured out the kinks in the technology as applied to their business challenges.

We just passed Big Data 2.0…

As adoption patterns matured and Big Data came to include projects such as YARN, Spark, and Hive, customers began applying Big Data to business challenges such as Fraud Detection, Customer Journey et al and began to realize business value from it. Adoption has indeed begun skyrocketing in verticals like Banking, Telecom, Manufacturing & Insurance. The monolithic Big Data market has begun segmenting into well-defined categories – Infrastructure providers, Streaming Data companies, Data Analysis providers, SQL on Hadoop solutions, full-fledged machine learning toolsets etc.

With that said, let us look at my Big Data predictions for 2018.

Trend #1 Big Data 3.0 – where Data fuels Digital Transformation…

Fortune 5000 companies process large amounts of customer information daily. This is especially true in areas touched by IoT – power and utilities, manufacturing and connected car. However, they have been sorely lacking in their capacity to interpret this data in a form that is meaningful to their customers and their business. In areas such as Banking & Insurance, this can greatly help arrive at a real-time understanding of not just the risks posed by a customer/partner relationship (from a credit risk/AML standpoint) but also an ability to increase the returns per client relationship. Digital Transformation can only be fueled by data assets. In 2018, more companies will tie these trends together, moving projects from POC to production.

The Six Strategic Questions Every Bank Should Answer with Big Data & AI in 2018…

Trend #2 ‘Predictive Analytics on Hadoop’ projects begin to proliferate…

I have written extensively about efforts to infuse business processes with machine learning. Predictive analytics have typically resembled a line of business project or initiative. The benefits of the learning from localized application initiatives are largely lost to the larger organization if one doesn’t allow multiple applications and business initiatives to access the models built. In 2018, machine learning expands across more use cases, from the mundane (fraud detection, customer churn prediction, customer journey) to the new age (virtual reality, conversational interfaces, chatbots, customer behavior analysis, video/facial recognition etc). Demand for data scientists will increase.

In areas around Industrie 4.0, Oil & Energy, Utilities – billions of endpoints will send data over to edge nodes and backend AI services which will lead to better business planning, real-time decisions and a higher degree of automation & efficiency across a range of processes. The underpinning data capability around these will be a Data Lake.

This is an area both Big Data and AI have begun to influence in a huge way. 2018 will be the year in which every large and medium-sized company will have an AI strategy built on Big Data techniques. Companies will begin exposing their AI models over the cloud using APIs as shown above using a Models as a Service architecture.

Trend #3 Big Data begins to take baby steps towards replacing the Enterprise Data Warehouse

Infrastructure vendors have been aiming to first augment and then replace EDW systems. As projects that provide SQL-on-Hadoop, data governance and audit capabilities mature, Hadoop will slowly begin replacing the EDW footprint. The key capabilities that Data Lakes usually lack from an EDW standpoint – around OLAP and performance reporting – will be augmented by niche technology partners. While this is a change that will easily take years, 2018 is when it begins. Expect the first candidates for migration to be clients who have not really been using the full power of EDWs beyond simple relational schemas, log data etc.

Trend #4 Cybersecurity pivots into Big Data…

Big Data is now the standard by which forward-looking companies will perform their Cybersecurity and threat modeling. Let us take an example to understand what this means from an industry standpoint. For instance, in Banking, in addition to general network level security, we can categorize business level security considerations into four specific buckets – general fraud, credit card fraud, AML compliance, and cybersecurity. The current best practice in the banking industry is to encourage a certain amount of convergence in the back-end data silos/infrastructure across all of the fraud types – literally in the tens. Forward-looking enterprises are now building cybersecurity data lakes to aggregate & consolidate all digital banking information, wire data, payment data, credit card swipes, other telemetry data (ATM & POS) etc in one place to do security analytics. This pivot to a Data Lake & Big Data can pay off in a big way.

The reason this convergence is helpful is that across all of these different fraud types, the common thread is that the fraud is increasingly digital (or internet based) and the fraudster rings are becoming more sophisticated every day. To detect these infinitesimally small patterns, an analytic approach beyond the existing rules-based approach is key – for instance, to understand location-based patterns in terms of where transactions took place, Social Graph-based patterns, and patterns which commingle real-time & historical data to derive insights. This capability is only possible via a Big Data-enabled stack.
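
A small sketch of what commingling real-time and historical data can look like: an incoming transaction is scored against a profile built from the same customer's past transactions, instead of against a fixed rule such as "flag anything over $500". The customer IDs, amounts and the threshold of three standard deviations are assumptions chosen for illustration; a production system would use far richer features (location, device, social graph).

```python
from statistics import mean, stdev

# Hypothetical historical transaction amounts per customer (illustrative only).
HISTORY = {
    "cust_1": [42.0, 38.5, 51.0, 45.2, 40.3, 47.8],
    "cust_2": [900.0, 1100.0, 950.0, 1020.0, 980.0],
}


def anomaly_score(customer, amount, history=HISTORY):
    """Number of standard deviations the new amount sits from the customer's norm."""
    past = history[customer]
    mu, sigma = mean(past), stdev(past)
    if sigma == 0:
        # No variation in the history: any different amount is maximally unusual.
        return 0.0 if amount == mu else float("inf")
    return abs(amount - mu) / sigma


def is_anomalous(customer, amount, history=HISTORY, threshold=3.0):
    """Flag a real-time transaction that deviates strongly from the historical profile."""
    return anomaly_score(customer, amount, history) > threshold


# The same $900 payment is unusual for one customer and routine for another.
print(is_anomalous("cust_1", 900.0))
print(is_anomalous("cust_2", 900.0))
```

The point of the example is that the threshold is relative to each customer's own behavior, which is something a single static rule cannot express.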

Trend #5 Regulators Demand Big Data – PSD2, GDPR et al…

The common thread across a wide range of business processes in verticals such as Banking, Insurance, and Retail is that they are regulated by a national or supranational authority. In Banking, across the front, mid and back office, processes ranging from risk data aggregation/reporting, customer onboarding, loan approvals, financial crimes compliance (AML, KYC, CRS & FATCA), enterprise financial reporting & Cyber Security etc all need to produce verifiable, high fidelity and auditable reports. Regulators have woken up to the fact that all of these areas can benefit from universal access to accurate, cleansed and well-governed cross-organization data from a range of Book Of Record systems.

A POV on Bank Stress Testing – CCAR & DFAST..

Further, by applying data processing techniques such as in-memory processing, the process of scenario analysis, computation and reporting on this data (reg reports/risk scorecards/dashboards etc) can be vastly enhanced. These can be made closer to real time, using data about market movements to understand granular risk concentrations. Finally, model management techniques can be clearly defined and standardized across a large organization. RegTechs, or startups focused on the risk and compliance space, are already leveraging these techniques across a host of the areas identified above.

Trend #6 Data Monetization begins to take off…

The simplest and easiest way to monetize data is to begin collecting disparate data generated during the course of regular operations. An example in Retail Banking is to collect data on customer branch visits, online banking usage logs, clickstreams etc. Once collected, the newer data needs to be fused with existing Book of Record Transaction (BORT) data to obtain added intelligence on branch utilization, branch design & optimization, customer service improvements etc. It is very important to ensure that the right business metrics are agreed upon and tracked across the monetization journey. Expect Data Monetization projects to take off in 2018, with verticals like Telecom, Banking, and Insurance taking the lead on these initiatives.

The Tao of Data Monetization in Banking and Insurance & Strategies to Achieve the Same…

Trend #7 Data Native Architectures converge with Cloud Native Architectures…

Most Cloud Native Architectures are designed in response to Digital Business initiatives – where it is important to personalize and to track minute customer interactions. The main components of a Cloud Native Platform are shown below and the vast majority of these leverage a microservices based design. Given all this, it is important to note that a Big Data stack based on Hadoop (Gen 2) is not just a data processing platform. It has multiple personas – a real-time, streaming data, interactive platform that can perform any kind of data processing (batch, analytical, in memory & graph based) while providing search, messaging & governance capabilities. Thus, Hadoop provides not just massive data storage capabilities but also provides multiple frameworks to process the data resulting in response times of milliseconds with the utmost reliability whether that be real-time data or historical processing of backend data. My bet on 2018 is that these capabilities will increasingly be harnessed as part of a DevOps process to develop a microservices based deployment.


Big Data will continue to expand exponentially across global businesses in 2018. As with most disruptive innovation, it will also create layers of complexity and opportunity for Enterprise IT. Whatever the kind of business model – tracking user behavior, location-sensitive pricing, business process automation etc – the end goal of IT architecture should be to create enterprise business applications that are heavily driven by data insight and analytics.

The Tao of Data Monetization in Banking and Insurance & Strategies to Achieve the Same…

“We live in a world awash with data. Data is proliferating at an astonishing rate—we have more and more data all the time, and much of it was collected in order to improve decisions about some aspect of a business, government, or society. If we can’t turn that data into better decision making through quantitative analysis, we are both wasting data and probably creating suboptimal performance.”
Tom Davenport, 2013  – Professor Babson College, Best Selling Author and Leader at Deloitte Analytics

Data Monetization is the organizational ability to turn data into cost savings & revenues in existing lines of business and to create new revenue streams.

Digitization is driving Banks and Insurance companies to reinvent themselves…

Enterprises operating in the financial services and the insurance industry have typically taken a very traditional view of their businesses. As waves of digitization have begun slowly upending their established business models, firms have begun to recognize the importance of harnessing their substantial data assets which have been built over decades. These assets include fine-grained data about internal operations, customer information and external sources (as depicted in the below illustration). So what does the financial opportunity look like? PwC’s Strategy& estimates that the incremental revenue from monetizing data could potentially be as high as US$ 300 billion [1] every year beginning 2019. This is across all the important segments of financial services – capital markets, commercial banking, consumer finance & banking, and insurance. FinTechs are also looking to muscle into this massive data opportunity.

The compelling advantages of Data Monetization have been well articulated across new business lines, customer experience, cost reduction et al. One of the key aspects of Digital transformation is data and the ability to create new revenue streams or to save costs using data assets.

..Which leads to a huge Market Opportunity for Data Monetization…

In 2013, PwC estimated that the market opportunity in data monetization was a nascent US $175 million. This number is set to grow immensely over the next five years, with consumer banking and capital markets leading the way.

Digital first has been a reality in the Payments industry with Silicon Valley players like Google and Apple launching their own payments solutions (in the form of Google Pay and Apple Pay).

Visionary Banks & FinTechs are taking the lead in Data Monetization…

Leading firms such as Goldman Sachs & AIG have invested heavily in capabilities around data monetization. In 2012, Goldman purchased the smallest of the three main credit-reporting firms – TransUnion. In three years, Goldman has converted TransUnion into a data-mining machine. In addition to credit-reporting, TransUnion now gathers billions of data points about American consumers. This data is constantly analyzed and then sold to lenders, insurers, and others. Using data monetization, Goldman Sachs has made nearly $600 million in profit. It is expected to make about five times its initial $550 million investment. [2]

From the WSJ article…

By the time of its IPO in 2015, TransUnion had 30 million gigabytes of data, growing at 25% a year and ranging from voter registration in India to drivers’ accident records in the U.S. The company’s IPO documents boasted that it had anticipated the arrival of online lenders and “created solutions that catered to these emerging providers.”

As are forward looking Insurers …

The insurance industry is reckoning with a change in consumer behavior. Younger consumers are very comfortable with using online portals to shop for plans, compare them, purchase them and do other activities that increase the amount of data being collected by the companies. While data and the models that operate on it have long been critical in the insurance industry, their use has been strongest in the actuarial areas. The industry has now begun heavily leveraging data monetization strategies across areas such as new customer acquisition, customer underwriting, dynamic pricing et al. A new trend is for insurers to form partnerships with Automakers to tap into a range of telematics information such as driver behavior, vehicle performance, and location data. In fact, Automakers are already ingesting and constantly analyzing this data with the intention of leveraging it for a range of use-cases, which include selling this data to insurance companies.

Leading carriers such as AXA are leveraging their data assets to strengthen broker and other channel relationships. AXA’s EB360 platform helps brokers with a range of analytics-infused functions – e.g. tracking the status of applications, managing compensation and commissions, and monitoring progress on business goals. AXA has also optimized user interfaces to minimize data entry while supporting rapid quoting, which helps brokers manage their business easily and strengthens the broker-carrier relationship. [3]

Introducing Five Data Monetization Strategies across Financial Services & Insurance…

Let us now identify and discuss five strategies that enable financial services firms to progressively monetize their data assets. It must be mentioned that doing so requires an appropriate business transformation strategy to be put into place. Such a strategy includes clear business goals, ranging from improving core businesses to entering lateral business areas.

Monetization Strategy #1 – Leverage Data Collected during Business Operations to Ensure Higher Efficiency in Business Operations…

The simplest and easiest way to monetize data is to begin collecting disparate data generated during the course of regular operations. An example in Retail Banking is to collect data on customer branch visits, online banking usage logs, clickstreams etc. Once collected, the newer data needs to be fused with existing Book of Record Transaction (BORT) data to obtain added intelligence on branch utilization, branch design & optimization, customer service improvements etc. It is very important to ensure that the right metrics are agreed upon and tracked across the monetization journey.
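
The fusion step described above can be sketched in a few lines. The example joins newly collected branch-visit events with transaction records on a shared customer ID to produce a simple branch utilization summary. The record layouts, IDs and the `branch_utilization` function are hypothetical, and the metrics shown (visits, unique customers, total spend of visiting customers) are just one possible choice of what to track.

```python
from collections import defaultdict

# Hypothetical operational data: one (customer_id, branch_id) row per branch visit.
BRANCH_VISITS = [("c1", "b1"), ("c2", "b1"), ("c1", "b1"), ("c3", "b2")]

# Hypothetical Book of Record Transaction (BORT) data: (customer_id, amount).
TRANSACTIONS = [("c1", 250.0), ("c2", 40.0), ("c3", 900.0), ("c1", 60.0)]


def branch_utilization(visits, transactions):
    """Fuse visit events with transaction data into a per-branch summary."""
    # Step 1: aggregate the book-of-record data per customer.
    spend = defaultdict(float)
    for customer, amount in transactions:
        spend[customer] += amount

    # Step 2: walk the visit events and enrich each branch with customer spend.
    summary = {}
    seen = defaultdict(set)
    for customer, branch in visits:
        row = summary.setdefault(
            branch, {"visits": 0, "unique_customers": 0, "customer_spend": 0.0}
        )
        row["visits"] += 1
        if customer not in seen[branch]:
            # Count each customer's spend once per branch, however often they visit.
            seen[branch].add(customer)
            row["unique_customers"] += 1
            row["customer_spend"] += spend.get(customer, 0.0)
    return summary


for branch, metrics in sorted(branch_utilization(BRANCH_VISITS, TRANSACTIONS).items()):
    print(branch, metrics)
```

Agreeing up front on which of these metrics matter is the practical meaning of tracking the right metrics across the monetization journey.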

Monetization Strategy #2 – Leverage Data to Improve Customer Service and Satisfaction…

The next progressive step in leveraging both internal and external data is to use it to drive new revenue streams in existing lines of business. This requires fusing both internal and external data to create new analytics and visualization. This is used to drive use cases relating to cross sell and up-sell of products to existing customers.

Demystifying Digital – Reference Architecture for Single View of Customer / Customer 360..(3/3)

Monetization Strategy #3 – Use Data to Enter New Markets…

A range of third-party data needs to be integrated and combined with internal data to arrive at a true picture of a customer. Once the Single View of a Customer has been created, the Bank/Insurer has the ability to introduce marketing and customer retention and other loyalty programs in a dynamic manner. These include the ability to combine historical data with real time data about customer interactions and other responses like clickstreams – to provide product recommendations and real time offers.

Demystifying Digital – the importance of Customer Journey Mapping…(2/3)

An interesting angle on this is to provide new adjacent products much like the above TransUnion example illustrates.

Monetization Strategy #4 – Establish a Data Exchange…

The Data Exchange is a mechanism where firms can fill in holes in their existing data about customers, their behaviors, and preferences. Data exchanges can be created using a consortium based approach that includes companies that span various verticals. Companies in the consortium can elect to share specific datasets in exchange for data while respecting data privacy and regulatory constraints.

Monetization Strategy #5 – Offer Free Products to Gather Customer Data…

Online transactions in both Banking and Insurance are increasing in number year on year. If Data is truly customer gold, then it is imperative for companies to collect as much of it as they can. The goal is to create products that can drive longer & continuous online interactions with global customers. Tools like Personal Financial Planning products and complementary banking and insurance services are examples of where firms can offer free products that augment existing offerings.

A recent topical example in Telecom is Verizon Up, a program from the wireless carrier where consumers can earn credits (that they can use for a variety of paid services – phone upgrades, concert tickets, Uber credits, movie premieres etc) in exchange for providing access to their browsing history, app usage, and location data. Verizon also intends to use the data to deliver targeted advertising to its customers. [4]

How Data Science Is a Core Capability for any Data Monetization Strategy…

Data Science and Machine learning approaches are the true differentiators and the key ingredients in any data monetization strategy. Further, it is a given that technological investments in Big Data Platforms, analytic investments in areas such as machine learning, artificial intelligence are also needed to stay on the data monetization curve.

How does this tie into practical use-cases discussed above? Let us consider the following usecases of common Data Science algorithms –

  • Customer Segmentation – For a given set of data, predict for each individual in a population which of a discrete set of classes that individual belongs to. An example classification is – “For all retail banking clients in a given population, who are most likely to respond to an offer to move to a higher segment”.
  • Pattern recognition and analysis – discover new combinations of business patterns within large datasets. E.g. combine a customer’s structured data with clickstream data analysis. A major bank in NYC is using this data to bring troubled mortgage loans to quick settlements.
  • Customer Sentiment analysis is a technique used to find degrees of customer satisfaction and how to improve them with a view of increasing customer net promoter scores (NPS).
  • Market basket analysis is commonly used to find associations between products that are purchased together, with a view to improving product marketing. E.g. recommendation engines which work out what banking products to recommend to customers.
  • Profiling algorithms aim to characterize the normal or typical behavior of an individual or group within a larger population. Profiling is frequently used in anomaly detection systems such as those that detect AML (Anti Money Laundering) violations and Credit Card Fraud.
  • Clustering algorithms divide data into groups, or clusters, of items that have similar properties.
  • Causal Modeling algorithms attempt to find out what business events influence others.
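
To show how one of these algorithms ties back to a monetization use case, here is a minimal sketch of market basket analysis over banking products. It counts how often pairs of products are held together and derives support and confidence for simple "customers with A also hold B" rules. The product holdings are invented for the example, and `pair_rules` and the 40% minimum support are illustrative assumptions, not a reference implementation of any particular library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical product holdings, one set per customer (illustrative only).
BASKETS = [
    {"checking", "savings", "credit_card"},
    {"checking", "credit_card"},
    {"checking", "mortgage"},
    {"savings", "credit_card"},
    {"checking", "savings", "credit_card", "mortgage"},
]


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent product pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of customers holding both products
        if support < min_support:
            continue
        # Confidence: of the customers holding the antecedent, how many hold the consequent.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


for antecedent, consequent, support, confidence in pair_rules(BASKETS):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

High-confidence rules are candidates for cross-sell recommendations, which connects this technique to Monetization Strategy #2 above.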


Banks and Insurers who develop data monetization capabilities will be positioned to create new service offerings and revenues. Done right (while maintaining data privacy & consumer considerations), the monetization of data represents a truly transformational opportunity for financial services players in the quest to become highly profitable.


[1] PwC Strategy& – “The Data Gold Rush”

[2] WSJ – “How Goldman Sachs Made More Than $1 Billion With Your Credit Score”

[3] McKinsey Quarterly – “Harnessing the potential of data in insurance”

[4] “Verizon Wants to Build an Advertising Juggernaut. It Needs Your Data First”

Why Data Garbage-In means Analytics Garbage-Out..

This is the third in a series of blogs on Data Science that I am jointly authoring with Maleeha Qazi. We have previously covered some of the inefficiencies that result from a siloed data science process, and the ideal way Data Scientists would like their models deployed for maximal benefit and use – as a Service. As the name of this third blog post suggests, the success of a data science initiative depends on data. If the data going into the process is “bad” then the results cannot be relied upon. Our goal is also to suggest some practical steps that enterprises can take from a data quality & governance process standpoint.

“However, under the strong influence of the current AI hype, people try to plug in data that’s dirty & full of gaps, that spans years while changing in format and meaning, that’s not understood yet, that’s structured in ways that don’t make sense, and expect those tools to magically handle it.” – Monica Rogati (Data Science Advisor and ex-VP, Jawbone – 2017) [1]

Image Credit – The Daily Omnivore


Different posts in this blog have discussed Data Science and other Analytical approaches to some degree of depth. What is apparent is that whatever the kind of analytics – descriptive, predictive, or prescriptive – the availability of a wide range of quality data sources is key. However, along with volume and variety of data, the veracity, or the truth, in the data is as important. This blog post discusses the main factors that determine the quality of data from a Data Scientist’s perspective.  

The Top Issues of Data Quality

As highlighted in the above illustration, the top quality issues that data assets typically face are the following:

  1. Incomplete Data: The data provided for analysis should span the entire cross-section of known data about how the organization views its customers and products. This would include data generated from various applications that belong to the business, and external data bought from various vendors to enrich the knowledge base. The completeness criterion measures whether all of the information about the entities under consideration is available and usable.
  2. Inconsistent & Inaccurate Data: Consistency measures whether data values give conflicting information that must be fixed. It also measures whether all the data elements conform to specific and uniform formats and are stored in a consistent manner. Inaccurate data has duplicate, missing or erroneous values. It also does not reflect an accurate picture of the state of the business at the point in time it was pulled.
  3. Lack of Data Lineage & Auditability: The data framework needs to support auditability, i.e. provide an audit trail of how the data values were derived from source to analysis point, including the various transformations performed to arrive at the data set being considered for analysis.
  4. Lack of Contextuality: Data needs to be accompanied by meaningful metadata – data that describes the concepts within the dataset.
  5. Temporally Inconsistent: Data should be temporally consistent and meaningful given the time at which it was recorded.
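
Several of these dimensions can be measured with very little code. The sketch below, in plain Python, computes a completeness ratio, a duplicate count and a simple temporal check over a handful of customer records. The field names, the `profile_quality` helper and the sample records are hypothetical; a real enterprise would usually run such checks through a data profiling tool.

```python
from datetime import date

REQUIRED_FIELDS = ("customer_id", "email", "opened_on")  # hypothetical schema

def profile_quality(records, as_of):
    """Compute simple completeness, duplicate and temporal-consistency metrics."""
    total = len(records)
    # Completeness: a record counts only if every required field is populated
    complete = sum(
        all(r.get(f) not in (None, "") for f in REQUIRED_FIELDS) for r in records
    )
    # Accuracy/consistency: the same customer id appearing more than once
    ids = [r.get("customer_id") for r in records if r.get("customer_id")]
    duplicates = len(ids) - len(set(ids))
    # Temporal consistency: an account cannot be opened after the extract date
    future_dated = sum(
        1 for r in records if r.get("opened_on") and r["opened_on"] > as_of
    )
    return {
        "completeness": complete / total if total else 0.0,
        "duplicate_ids": duplicates,
        "future_dated": future_dated,
    }

records = [
    {"customer_id": "C1", "email": "a@example.com", "opened_on": date(2015, 3, 1)},
    {"customer_id": "C2", "email": "", "opened_on": date(2016, 7, 9)},
    {"customer_id": "C1", "email": "a@example.com", "opened_on": date(2015, 3, 1)},
    {"customer_id": "C3", "email": "c@example.com", "opened_on": date(2031, 1, 1)},
]
report = profile_quality(records, as_of=date(2018, 1, 1))
print(report)
```

Lineage and contextuality are harder to reduce to a single number, since they depend on metadata captured by the surrounding data platform rather than on the records themselves.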

What Business Challenges does Poor Data Quality Cause…

Image Credit – DataMartist

Poor data quality causes the following business challenges in enterprises:

  1. Customer dissatisfaction: Across industries like Banking, Insurance, Telecom & Manufacturing, the ability to get a unified view of the customer & their journey is at the heart of the enterprise’s ability to promote relevant offerings & detect customer dissatisfaction. Currently, most industry players are woeful at putting together this comprehensive Single View of their Customers (SVC). Due to operational silos, each department possesses its own siloed & limited view of the customer across multiple channels. These views are typically inconsistent, lack synchronization with other departments, & miss a large number of potential cross-sell and upsell opportunities. This is a data quality challenge at its core.
  2. Lost revenue: The Customer Journey problem is an age-old issue which has become far more complicated over the last five years, as the staggering rise of mobile technology and the Internet of Things (IoT) has vastly increased the number of enterprise touch points through which customers can discover and purchase new products/services. In an OmniChannel world, an increasing number of transactions are being conducted online. In verticals like Retail, Banking & Insurance, the share of transactions conducted online approaches an average of 40%. Adding to the problem, more and more consumers are posting product reviews and feedback online. Companies thus need to react in real time to piece together the source of consumer dissatisfaction.
  3. Time and cost in data reconciliation: Every large enterprise nowadays runs expensive data re-engineering projects due to their data quality challenges. These are an inevitable first step in other digital projects which cause huge cost and time overheads.
  4. Increased time to market for key projects: Poor data quality causes poor data agility, which increases the time to market for key projects.
  5. Poor data means suboptimal analytics: Poor data quality makes any analytics built on it suboptimal – algorithms will end up giving wrong conclusions because the input provided to them is inconsistent at best & incorrect at worst.

Why is Data Quality a Challenge in Enterprises

Image Credit – DataMartist

The top reasons why data quality has been a huge challenge in the industry are:

  1. Prioritization conflicts: For most enterprises, the focus of their business is the product(s)/service(s) being provided; book-keeping is a mandatory but secondary concern. And since keeping the business running is the most important priority, keeping the books accurate for financial matters is the only data aspect that gets the technical attention it deserves. Other data aspects are usually ignored.
  2. Organic growth of systems: Most enterprises have gone through a series of book-keeping methods and applications, most of which have no compatibility with one another. Warehousing data from various systems as they are deprecated, merging in data streams from new systems, and fixing data issues as these processes happen is not prioritized till something on the business end fundamentally breaks. Band-aids are usually cheaper and easier to apply than to try and think ahead to what the business will need in the future, build it, and back-fill it with all the previous systems’ data in an organized fashion.
  3. Lack of time/energy/resources: Nobody has infinite time, energy, or resources. Making all the systems an enterprise chooses to use at any point in time talk to one another, share information between applications, and keep a single consistent view of the business is a near-impossible task. Many well-trained people, and much time & energy, are required to make sure this can be set up and successfully orchestrated on a daily basis. But how much is a business willing to pay for this? Most do not see short-term ROI and hence lose sight of the long-term problems that could be caused by ignoring the quality of data collected.
  4. What do you want to optimize?: There are only so many balls an enterprise can have up in the air to focus on without dropping one, and prioritizing those can be a challenge. Do you want to optimize the performance of the applications that need to use, gather and update the data, OR do you want to make sure data accuracy/consistency (one consistent view of the data for all applications in near real-time) is maintained regardless? One will have to suffer for the other.

How to Tackle Data Quality

Image Credit – DataMartist


With the advent of Big Data and the need to derive value from ever-increasing volumes and varieties of data, data quality becomes an important strategic capability. While every enterprise is different, certain common themes emerge as we consider the quality of data:

  1. The sheer number of transaction systems found in a large enterprise causes multiple challenges across the data quality dimensions. Organizations need to have valid frameworks and governance models to ensure the data’s quality.
  2. Data quality has typically been thought of as just data cleansing and fixing missing fields. However, it is very important to address the originating business processes that cause this data to take multiple dimensions of truth. For example, centralize customer onboarding in one system across channels rather than having every system do its own onboarding.
  3. It is clear from the above that data quality management is not a one-time or siloed application exercise. As part of a structured governance process, it is very important to adopt data profiling and other capabilities to ensure high-quality data.


Enterprises need to define both quantitative and qualitative metrics to ensure that data quality goals are captured across the organization. Once this is done, an iterative process needs to be followed to ensure that a set of capabilities dealing with data governance, auditing, profiling, and cleansing is applied to continuously ensure that data is brought up to, and kept at, a high standard. Doing so can have salubrious effects on customer satisfaction, product growth, and regulatory compliance.


[1] Monica Rogati “The AI hierarchy of Needs” –

The Three Habits of Highly Effective Real Time Enterprises…

“All I do is sit at home and watch Netflix.” – Kylie Irving

The Power of Real Time

Anyone who has used Netflix to watch a movie or used Uber to hail a ride knows how simple, time efficient, inexpensive and seamless it is to do either. Chances are that most users of Netflix and Uber would never again use a physical video store or a taxi service unless they did not have a choice. Thus it should not come as a surprise that within a short span of a few years, these companies have acquired millions of delighted customers using their products (which are just apps) while developing market capitalizations of tens of billions of dollars.

As of early 2016, Netflix had about 60 million subscribers[1] and is finding significant success in producing its own content thus continuing to grab market share from the established players like NBC, Fox and CBS. Most Netflix customers opt to ditch Cable and are choosing to stream content in real time across a variety of devices.

Uber is nothing short of a game changer in the ride sharing business. Not just in busy cities but also in underserved suburban areas, Uber services save plenty of time and money in enabling riders to hail cabs. In congested metro areas, Uber also provides near instantaneous rides for a premium, which motivates more drivers to service riders. As someone who has used Uber on almost every continent, I am not surprised that as of 2016 Uber dominates in terms of market coverage, operating in 400 cities in 70+ countries.[2]

What is the common theme in ordering a cab using Uber or viewing a movie on Netflix?

Answer – Both services are available at the click of a button, they’re lightning quick and constantly build on their understanding of your tastes, ratings and preferences. In short, they are Real Time products.

Why is Real Time such a powerful business capability?

In the Digital Economy, the ability to interact intelligently with consumers in real time is what makes it possible to create new business models and to drive growth in existing lines of business.

So, what do Real Time Enterprises do differently

What underpins a real time enterprise are three critical factors or foundational capabilities as shown in the below illustration. For any enterprise to be considered real time, the presence of these three components is what decides the pace of consumer adoption. Real time capabilities are part business innovation and part technology.

Let us examine these…

#1 Real Time Businesses possess a superior Technology Strategy

First and foremost, business groups must be able to define a vision for what they would like their products and services to be able to do to acquire younger and more dynamic consumers.

As companies adopt new business models, the technologies that support them must also change along with the teams that deliver them. IT departments have to move to more of a service model while delivering agile platforms and technology architectures for business lines to develop products around.

Why Digital Disruption is the Cure for the Common Data Center..

It needs to be kept in mind that these new approaches should be incubated slowly and gradually. They must almost always be business or use case driven at first.

#2 Real Time Enterprises are extremely smart about how they leverage data

The second capability is an ability to break down data silos in an organization. Most organizations have no idea of what to do with all the data they generate. Sure, they use a fraction of it to perform business operations, but beyond that most of this data is simply let go. As a consequence, they fail to view their customer as a dynamic yet unified entity. Thus, they have no idea how to market more products or how to estimate the risk being run on their behalf. There is also a growing emphasis on the importance of the role of the infrastructure within service orientation. As the common factor that is present throughout an organization, the networking infrastructure is potentially the ideal tool for breaking down the barriers that exist between the infrastructure, the applications and the business. Consequently, adding greater intelligence into the network is one way of achieving the levels of virtualization and automation that are necessary in a real-time operation.

Across Industries, Big Data Is Now the Engine of Digital Innovation..

#3 Real Time Enterprises use Predictive Analytics and they automate the hell out of every aspect of their business

Real time enterprises get the fact that using only Business Intelligence (BI) dashboards is largely passé. BI implementations base their insights on data that is typically stale (even by days). BI operates in a highly siloed manner based on long cycles of data extraction, transformation, indexing etc.

However, if products are to be delivered over mobile and other non-traditional channels, then BI is ineffective at providing just-in-time analytics that can drive an understanding of a dynamic consumer’s wants and needs. The Real Time enterprise demands that workers at many levels, ranging from line of business managers to executives, have fresh, high quality and actionable information on which they can base complex yet high quality business decisions. These insights are only enabled by Data Science and Business Automation. When deployed strategically, these techniques can scale to enormous volumes of data and help reason over them, reducing manual costs. They can take on business problems that can’t be managed manually because of the huge amount of data that must be processed.

Why Big Data & Advanced Analytics are Strategic Corporate Assets..


Real time Enterprises do a lot of things right. They constantly experiment with creating new business capabilities and improving existing ones, with a view to making them appealing to a rapidly changing clientele. They refine these using constant feedback loops and create cutting edge technology stacks that dominate the competitive landscape. Enterprises need to make the move to becoming Real time.

Neither Netflix nor Uber is resting on its laurels. Netflix (which shifted its focus from mail-in DVDs to online streaming a few years ago) continues to expand globally, betting that the convenience of the internet will eventually turn it into a major content producer. Uber is prototyping self-driving cars in Pittsburgh and intends to roll out its own fleet of self-driving vehicles, thus replacing its current 1.5 million drivers, and eventually to begin a food delivery business around urban centers[4].

Sure, the ordinary organization is no Netflix or Uber and when a journey such as the one to real time capabilities is embarked on, things can and will go wrong in this process. However, the cost of continuing with business as usual can be incalculable over the next few years.  There is always a startup or a competitor that wants to deliver what you do at much lower cost and at a lightning fast clip. Just ask Blockbuster and the local taxi cab company.


[1] Netflix Statistics 2016 –

[2] “Just how dominant is Uber” –

[3] Expanded Ramblings – “Uber Statistics as of Oct 2016”

[4] Uber Self driving cars debut in Pittsburgh –

The Data Science Continuum in Financial Services..(3/3)

“In God we trust. All others must bring data.” – Dr. W. Edwards Deming, statistician, professor, author, lecturer, and consultant.

The first post in this three part series described key ways in which innovative applications of data science are slowly changing a somewhat insular banking & financial services industry. The second post then delineated key business use cases enabled by a data driven or data native approach. This final post will examine foundational Data Science tasks & techniques that are commonly employed to get value from data, with financial industry examples. We will round off the discussion with recommendations for industry CXOs.


The Need for Data Science –

It is no surprise that Big Data approaches were first invented & then refined in web scale businesses at Google, Yahoo, eBay, Facebook and Amazon etc. These web properties offer highly user friendly, contextual & mobile native application platforms which produce a large amount of complex and multi-varied data from consumers, sensors and other telemetry devices. All this data is constantly analyzed to drive higher rates of application adoption, thus driving a virtuous cycle. We have discussed the (now) augmented capability of financial organizations to acquire, store and process large volumes of data by leveraging HDFS (the Hadoop Distributed Filesystem) running on commodity (x86) hardware.

One of the chief reasons that these webscale shops adopted Big Data is the ability to store the entire data set in Hadoop to build more accurate predictive models. The ability to store thousands of attributes at a much finer grain over a long period of history, instead of just depending on a statistically significant sample, is a significant gain over legacy data technology.

Every year Moore’s Law keeps driving the costs of raw data storage down. At the same time, compute technologies such as MapReduce, Tez, Storm and Spark have enabled the organization and analysis of Big Data at scale. The convergence of cost effective storage and scalable processing allows us to extract richer insights from data. These insights then need to be operationalized at scale to provide business value, as the use cases in the last post highlighted.

The differences between Descriptive & Predictive Analytics –

Business intelligence (BI) is a traditional & well established analytical domain that essentially takes a retrospective look at business data in systems of record. The goal for BI is to primarily look for macro or aggregate business trends across different aspects or dimensions such as time, product lines, business units & operating geographies.

BI is primarily concerned with “What happened and what trends exist in the business based on historical data?“. The typical use cases for BI include budgeting, business forecasts, reporting & key performance indicators (KPI).

On the other hand, Predictive Analytics (a subset of Data Science) augments & builds on the BI paradigm by adding a “What could happen” dimension to the data in terms of –

  • being able to probabilistically predict different business scenarios across thousands of variables
  • suggesting specific business actions based on the above outcomes

Predictive Analytics does not intend to nor will it replace the BI domain but only adds significant business capabilities that lead to overall business success. It is not uncommon to find real world business projects leveraging both these analytical approaches.
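
The contrast between the two approaches can be shown with a toy example. In the sketch below, the descriptive step aggregates historical revenue by quarter (“what happened”), and the predictive step fits a least squares trend line and extrapolates one quarter ahead (“what could happen”). The revenue figures, product lines and the `fit_line` helper are made up for illustration; a real predictive model would draw on far more variables and would report uncertainty around its estimate.

```python
from collections import defaultdict

# Hypothetical quarterly revenue records: (quarter_index, product_line, revenue in $M)
sales = [
    (1, "cards", 10.0), (1, "loans", 20.0),
    (2, "cards", 12.0), (2, "loans", 21.0),
    (3, "cards", 14.0), (3, "loans", 22.0),
    (4, "cards", 16.0), (4, "loans", 23.0),
]

# Descriptive (BI): what happened? Aggregate revenue by quarter.
revenue_by_quarter = defaultdict(float)
for quarter, _line, revenue in sales:
    revenue_by_quarter[quarter] += revenue

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

# Predictive: what could happen? Extrapolate the trend one quarter ahead.
quarters = sorted(revenue_by_quarter)
totals = [revenue_by_quarter[q] for q in quarters]
slope, intercept = fit_line(quarters, totals)
forecast_q5 = slope * 5 + intercept

print("Historical totals:", dict(revenue_by_quarter))
print("Forecast for quarter 5:", forecast_q5)
```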

Data Science  –

So, what exactly is Data Science ?

Data Science is an umbrella concept that refers to the process of extracting business patterns from large volumes of structured, semi-structured and unstructured data. Data Science is the key ingredient in enabling a predictive approach to the business.

Some of its key aspects are –

  1. Data Science is not just about applying analytics to massive volumes of data. It is also about exploring the patterns, associations & interrelationships of thousands of variables within the data. It does so by adopting an algorithmic approach to gleaning the business insights that are embedded in the data.
  2. Data Science is a standalone discipline that has spawned its own set of platforms, tools and processes across its lifecycle.
  3. Data science also aids in the construction of software applications & platforms to utilize such insights in a business context.  This involves the art of discovering data insights combined with the science of operationalizing them at scale. The word ‘scale’ is key. Any algorithm, model or deployment paradigm should support an expanding number of users without the need for unreasonable manual intervention as time goes on.
  4. A data scientist uses a combination of machine learning, statistics, visualization, and computer science to extract valuable business insights hiding in data and builds operational systems to deliver that value.
  5. The machine learning components are classified into two categories: ‘supervised’ and ‘unsupervised’ learning. In supervised learning, the model is trained on examples whose outcomes are known, and it captures the effect of a set of inputs on those outputs. In unsupervised learning, there are no labeled outcomes; the algorithm looks for structure driven by so-called latent variables. It is also possible to have a hybrid approach to certain types of mining tasks.
  6. Strategic business projects typically begin leveraging a Data Science based approach to derive business value. This approach then becomes integral and eventually core to the design and architecture of such a business system.
  7. Contrary to what some of the above may imply, Data Science is a cross-functional discipline and not just the domain of PhDs. A data scientist is part statistician, part developer and part business strategist.
  8. Working in small self sufficient teams, the Data Scientist collaborates with extended areas which include visualization specialists, developers, business analysts, data engineers, applied scientists, architects, LOB owners and DevOps. The success of data science projects often relies on the communication, collaboration, and interaction that takes place with the extended team, both inside and possibly outside the organization.
  9. It needs to be clarified that not every business project is a fit for a Data Science approach. Such an advanced approach is called for when the business initiative needs to provide knowledge based decisions (beyond the classical rules engine/expert systems based approaches), deals with large volumes of relevant data, operates in a rapidly changing business climate, & requires scale beyond what human analysts can supply.
  10. Indeed any project where hugely improved access to information & realtime analytics for customers, analysts (and other stakeholders) is a must for the business – is fertile ground for Data Science.

Algorithms & Models

The word ‘model‘ is highly overloaded and means different things to different IT specialities e.g. RDBMS models imply data schemas, statistical models are built by statisticians etc. However, it can safely be said that models are representations of a business construct or a business situation.

Data mining algorithms are used to create models from data.

To create a data science model, the data mining algorithm looks for key patterns in the data provided. The results of this analysis define the best parameters for the model. Once identified, these parameters are applied across the entire data set to extract actionable patterns and detailed statistics.

The model itself can take various forms ranging from a set of customers across clusters, a revenue forecasting model, a set of fraud detection rules for credit cards or a decision tree that predicts outcomes based on specific criteria.

Common Data Mining Tasks –

There are many different kinds of data mining algorithms but all of these address a few fundamental types of tasks. The popular ones are listed below along with relevant examples:

  • Classification & Class Probability Estimation – For a given population, predict which of a discrete set of classes each individual belongs to. An example classification question is – “Of all wealth management clients in a given population, who are most likely to respond to an offer to move to a higher segment?” Common techniques used in classification include decision trees, Bayesian models, k-nearest neighbors, induction rules etc. Class Probability Estimation (CPE) is a closely related concept in which a scoring model is created to predict the likelihood that an individual belongs to a given class.
  • Clustering is an unsupervised technique used to find classes or segments of populations within a larger dataset without being driven by any specific purpose. For example – “What are the natural groups our customers fall into?”. The most popular use of clustering techniques is to identify clusters to use in activities like market segmentation. A common algorithm used here is k-means clustering.
  • Market basket analysis is commonly used to find associations between entities based on transactions that involve them. E.g. recommendation engines which use affinity grouping.
  • Regression algorithms estimate or predict a numeric value for each individual, e.g. how much a given customer is likely to use a service.
  • Profiling algorithms aim to characterize the normal or typical behavior of an individual or group within a larger population. They are frequently used in anomaly detection systems such as those that detect AML (Anti Money Laundering) and Credit Card fraud.
  • Causal Modeling algorithms attempt to find out what business events influence others.
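
As an illustration of the first task in the list, the sketch below uses k-nearest neighbors, one of the classification techniques named above, to estimate class probabilities for a single client. The client features, labels and the `knn_class_probabilities` helper are hypothetical, and the features are left unscaled for brevity; in practice one would normalize them so that no single attribute dominates the distance.

```python
import math
from collections import Counter

def knn_class_probabilities(train, query, k=3):
    """Estimate class probabilities for `query` from its k nearest labeled neighbors."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _features, label in nearest)
    # The share of neighbors in each class acts as a crude probability score
    return {label: count / k for label, count in votes.items()}

# Hypothetical wealth management clients:
# (assets in $M, trades per month) -> did they respond to an upgrade offer?
train = [
    ((0.2, 1), "no"), ((0.3, 2), "no"), ((0.5, 1), "no"),
    ((1.5, 8), "yes"), ((2.0, 10), "yes"), ((1.8, 7), "yes"),
]

probs = knn_class_probabilities(train, query=(1.0, 5), k=3)
predicted = max(probs, key=probs.get)  # classification: pick the most likely class
print(probs, predicted)
```

The dictionary of probabilities is the class probability estimate; taking the most likely class turns it into a plain classification.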

There is no reason that one should be limited to one of the above techniques while forming a solution. An example is to use one algorithm (say clustering) to determine the natural groups in the data, and then to apply regression to predict a specific outcome based on that data. Another example is to use multiple algorithms within a single business project to perform related but separate tasks. Ex – Using regression to create financial reporting forecasts, and then using a neural network algorithm to perform a deep analysis of the factors that influence product adoption.
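
The first example, clustering followed by regression, can be sketched in a few lines of plain Python. Here a one-dimensional k-means groups customers by balance, and a separate least squares line is then fitted within each group. The customer data, the initial cluster centers and the helper names are assumptions made for illustration only.

```python
def assign(value, centers):
    """Index of the center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))

def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one feature, starting from the given initial centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            groups[assign(v, centers)].append(v)
        # Move each center to the mean of its group; keep it if the group is empty
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def fit_line(points):
    """Least squares fit of y = slope * x + intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical customers: (monthly balance in $K, annual fee revenue in $)
customers = [(1, 12), (2, 14), (3, 16), (50, 300), (60, 350), (70, 400)]

# Step 1: find the natural groups based on balance
centers = kmeans_1d([b for b, _ in customers], centers=[0.0, 80.0])

# Step 2: fit a separate regression within each group
models = {}
for idx in range(len(centers)):
    members = [c for c in customers if assign(c[0], centers) == idx]
    models[idx] = fit_line(members)

def predict(balance):
    """Predict fee revenue using the regression of the customer's cluster."""
    slope, intercept = models[assign(balance, centers)]
    return slope * balance + intercept

print(centers, predict(4), predict(55))
```

Fitting one model per cluster lets each segment have its own relationship between balance and revenue, which a single global regression would blur.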

The Data Science Process  –

A general process framework for a typical Data Science project is depicted below. The process flow suggests a sequential waterfall but allows for Agile/DevOps loops in the core analysis & feedback phases. The process is not a strictly one-way pipeline; it also allows for continuous improvements.


                                           Illustration: The Data Science Process 

  1. The central pool of data that hosts all the tiers of data processing in the above illustration is called the Data Lake. The Data Lake enables two key advantages – the ability to collect cross business unit data so that it can be sampled/explored at will & the ability to perform any kind of data access pattern across a shared data infrastructure: batch,  interactive, search, in-memory and custom etc.
  2. The Data Science process begins with a clear and contextual understanding of the granular business questions that need to be answered from the real world dataset. The Data Scientist needs to be trained in the nuances of the business to achieve the appropriate outcome. E.g. predicting fraudulent transactions in the credit cards space, or predicting which customers in the Retail Bank are likely to churn over the next few months based on their usage patterns.
  3. Once this is known, relevant data needs to be collected from the real world. These sources in Banking range from –
    1. Customer Account data e.g. Names, Demographics, Linked Accounts etc
    2. Transaction Data which captures the low level details of every transaction (e.g debit, credit, transfer, credit card usage etc),
    3. Wire & Payment Data,
    4. Trade & Position Data,
    5. General Ledger Data and Data from other systems supporting core banking functions.
    6. Unstructured data. E.g social media feeds, server logs, clickstream data & mobile application data etc.
  4. Following the planning stage, Data Acquisition follows an iterative process of acquiring data from the actual sources by creating appropriate loaders using appropriate technology components. E.g. Apache NiFi, Kafka, Sqoop, Flume, HDFS API, Java etc
  5. The next step is to perform Data Cleansing. Here the goal is to look for gaps in the data (given the business context) and to ensure that the dataset is valid with no missing values, consistent in layout and as fresh as possible from a temporal standpoint. This phase also involves fixing any obvious quality problems involving range or table driven data. The purpose at this stage is also to facilitate & perform appropriate data governance.
  6. Exploratory Data Analysis (EDA) helps with trial & error analysis of data. This is a phase where plots and graphs are used to systematically go through the data. The importance of this cannot be overstated as it provides the Data Scientist and the business with a flavor of the data.
  7. Data Analysis: Generation of features or attributes that will be part of the model. This is the step of the process where actual data mining takes place leveraging models built using the above algorithms.
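
Steps 5 to 7 above can be illustrated with a very small sketch in plain Python: cleanse a handful of transactions, summarize them, and derive per-customer features. The transaction data, the valid amount range and the feature names are all hypothetical, and a real project would perform these steps at scale on the Data Lake rather than on an in-memory list.

```python
import statistics
from collections import defaultdict

# Hypothetical raw transactions: (customer_id, amount); None marks a missing value
raw = [("C1", 20.0), ("C1", 35.0), ("C2", None), ("C2", 500.0),
       ("C2", 520.0), ("C3", -10.0), ("C3", 15.0)]

# Step 5, cleansing: drop missing values and amounts outside an assumed valid range
clean = [(cid, amt) for cid, amt in raw if amt is not None and 0 < amt <= 10_000]

# Step 6, EDA: summary statistics give a first feel for the distribution
amounts = [amt for _, amt in clean]
summary = {
    "count": len(amounts),
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "stdev": statistics.stdev(amounts),
}

# Step 7, feature generation: per-customer attributes a model could consume
by_customer = defaultdict(list)
for cid, amt in clean:
    by_customer[cid].append(amt)
features = {
    cid: {"txn_count": len(amts), "avg_amount": statistics.mean(amts)}
    for cid, amts in by_customer.items()
}

print(summary)
print(features)
```

Note how far the mean sits from the median in this toy data: that gap is exactly the kind of skew EDA is meant to surface before any modeling begins.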

The Data Cleansing and Data Analysis stages each contain further iterative steps.

Once the models have been tested and refined to the satisfaction of the business, and their performance has been put through a rigorous test phase, they are deployed into production. Once deployed, they are constantly refined based on end user and system feedback.

The Big Data ecosystem (consisting of tools such as Pig, Scalding, Hive, Spark and MapReduce etc) enables sea changes of improvement across the entire Data Science lifecycle, from data acquisition to data processing to data analysis. The ability of Big Data/Hadoop to unify all data storage in one place renders data more accessible for modeling. Hadoop also scales up machine learning analysis due to its inbuilt parallelism, which adds a tremendous amount of value in terms of training multiple models in parallel to improve their efficacy. The ability to collect a lot of data, as opposed to small samples, also helps greatly.

Recommendations – 

Developing a strategic mindset to Data science and predictive analytics should be a board level concern. This entails

  • To begin with – ensuring buy in & commitment in the form of funding at a Senior Management level. This support needs to extend across the entire lifecycle depicted above (from identifying business use cases).
  • Extensive but realistic ROI (Return On Investment) models built during due diligence with periodic updates for executive stakeholders
  • On a similar note, ensuring buy in using a strategy of co-opting & alignment with Quants and different high potential areas of the business (as covered in the use cases in the last blog)
  • Identifying leaders within the organization who can not only lead important projects but also create compelling content to evangelize the use of predictive analytics
  • Begin to tactically bake in or embed data science capabilities across different lines of business and horizontal IT
  • Slowly moving adoption to the Risk, Fraud, Cybersecurity and Compliance teams as part of the second wave. This is critical in ensuring that analysts across these areas move from a spreadsheet intensive model to adopting advanced statistical techniques
  • Creating a Predictive Analytics COE (Center of Excellence) that enables cross pollination of ideas across the fields of statistical modeling, data mining, text analytics, and Big Data technology
  • Informing the regulatory authorities of one’s intentions to leverage data science across the spectrum of operations
  • Ensuring that issues related to data privacy, audit & compliance have been given a great deal of forethought
  • Identifying & developing human skills in toolsets (across open source and closed source) that facilitate adapting to data lake based architectures. A large part of this is to organically grow the talent pool by instituting a college recruitment process

While this ends the current series on Data Science in financial services, it is my intention to explore each of the above Data Mining techniques to a greater degree of depth as applied to specific business situations in 2016 & beyond. This being said, we will take a look at another pressing business & strategic concern – Cybersecurity in Banking – in the next series.

Data Driven Decisions in Financial Services..(2/3)

“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay!” – Sherlock Holmes, in Conan Doyle’s The Adventure of the Copper Beeches

The first post in this three part series described the key ways in which innovative applications of data science are changing a somewhat insular and clubby banking & financial services industry. This disruption rages across the spectrum from both a business model as well as an organizational cultural standpoint. This second post examines key & concrete usecases enabled by a ‘Data Driven’ approach in the industry. The next & final post will examine foundational Data Science tasks & techniques commonly employed to get value from data.

Big Data platforms, powered by Open Source Hadoop, can not only economically store large volumes of structured, unstructured or semi-structured data but also help process it at scale. The result is a steady supply of continuous, predictive and actionable intelligence. With the advent of Hadoop and Big Data ecosystem technologies, Bank IT (across a spectrum of business services) is now able to ingest, onboard & analyze massive quantities of data at a much lower price point.

One can thus not only generate insights using a traditional ad-hoc querying (or descriptive intelligence) model but also build advanced statistical models on the data. These advanced techniques leverage data mining tasks (like classification, clustering, regression analysis, neural networks etc) to perform highly robust predictive modeling. Owing to Hadoop’s natural ability to work with any kind of data, this can encompass the streaming and realtime paradigms in addition to the traditional historical (or batch) mode.

Further, Big Data also helps Banks capture and commingle diverse datasets that can improve their analytics in combination with improved visualization tools that aid in the exploration & monetization of data.

Now, let’s break the above summary down into specifics.

Data In Banking

Corporate IT organizations in the financial industry have for many years been tackling data challenges caused by strict silo based approaches that inhibit data agility.

Consider some of the traditional (or INTERNAL) sources of data in banking –

  • Customer Account data e.g. Names, Demographics, Linked Accounts etc
  • Core Banking Data
  • Transaction Data which captures the low level details of every transaction (e.g. debit, credit, transfer, credit card usage etc)
  • Wire & Payment Data
  • Trade & Position Data
  • General Ledger Data e.g. AP (accounts payable), AR (accounts receivable), cash management & purchasing information etc.
  • Data from other systems supporting banking reporting functions.

To provide the reader with a wider perspective, the vast majority of the above traditional data is human generated. However, with the advent of smart sensors and enhancements in telemetry based devices like ATMs, POS terminals etc – machines are beginning to generate even more data. Thus, every time a banking customer clicks a button on their financial provider’s website or makes a purchase using a credit card or calls her bank using the phone – a digital trail is created. Mobile apps drive an ever growing number of interactions due to the sheer nature of interconnected services – banking, retail, airlines, hotels etc. The result is lots of data and metadata that is MACHINE & App generated.

In addition to the above internal & external sources, commercially available 3rd party datasets ranging from crop yields to car purchases to customer preference data (segmented by age or affluence categories) and social media feedback regarding financial & retail product usage are now widely available for purchase. As financial services firms sign up partnerships in Retail, Government and Manufacturing, these data volumes will only explode in size & velocity. The key point is that an ever growing number of customer facing interfaces are now available for firms to collect data in a manner that they had never been able to before.

Where can Predictive Analytics help – 

Let us now examine some of the main use cases out there, as depicted in the picture below –


                              Illustration – Data Science led disruption in Banking

Defensive Use Cases Across the Banking Spectrum (RFC) – Risk, Fraud & Security

Internal Risk & Compliance departments are increasingly turning to Data Science techniques to create & run models on aggregated risk data. Multiple types of models and algorithms are used to find patterns of fraud and anomalies in the data and to predict customer behavior. Examples include Bayesian filters, Clustering, Regression Analysis, Neural Networks etc. Data Scientists & Business Analysts have a choice of MapReduce, Spark (via Java, Python, R), Storm etc and SAS, to name a few, to create these models. Fraud model development, testing and deployment on fresh & historical data become very straightforward to implement on Hadoop.

  • Risk Data Aggregation and Measurement – Measure and project different kinds of banking risks (Market Risk, Credit Risk, Loan Default and Operational Risk). In Capital Markets, the applications for Data Science include predicting different risk metrics across market & credit risk. In Consumer Banking sectors like mortgage banking, credit cards & other financial products, data science is heavily leveraged to classify products & customers into different risk categories, and then to predict risk scores and risk portfolio trends across thousands of variables.
  • Fraud Detection – Detect and predict institutional fraud for a range of usecases – Anti Money Laundering Compliance (AML), Know Your Customer (KYC), watchlist screening, tax evasion, Linked Entity Analysis etc. In the area of individual level fraud – credit card fraud & mortgage fraud – predictive models are developed which constantly analyze customer spending patterns, location & travel details, employment details and social networks to detect in real time if customer accounts are being compromised.
  • Cyber Security – Analyze clickstreams, network packet capture data, weblogs, image data, telemetry data to predict security compromises & to provide advanced security analytics.
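
To make the idea of analyzing customer spending patterns concrete, here is a minimal sketch that flags a card transaction when its amount deviates sharply from the customer's own history. It uses a simple z-score test, which is only a stand-in for the richer models listed above; the function name, the threshold of 3.0 and the sample amounts are all assumptions made for illustration.

```python
import statistics


def flag_unusual(history, new_amount, threshold=3.0):
    """Return True if new_amount lies more than `threshold` sample standard
    deviations away from the mean of the customer's past amounts."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # All past amounts were identical; any different amount stands out.
        return new_amount != mean
    return abs(new_amount - mean) / stdev > threshold


# Hypothetical past card transactions for one customer.
past = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]

print(flag_unusual(past, 44.0))   # an amount in line with past spending
print(flag_unusual(past, 900.0))  # an amount far outside past spending
```

A production model would combine many more signals (location, merchant, device, linked entities), but the shape is the same: build a per-customer profile from history, then score each new event against it.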

Capital Markets, Consumer Banking, Payment Systems & Wealth Management

A) Capital Markets

  • Algorithmic Trading – Data Science augments trading infrastructures in several ways. It helps re-tool existing trading infrastructures so that they are more integrated yet loosely coupled and efficient, by helping plug in algorithm based complex trading strategies that are quantitative in nature across a range of asset classes like equities, forex, ETFs and commodities etc. It also helps with trade execution once Hadoop incorporates newer & faster sources of data (social media, sensor data, clickstream data) and not just the conventional sources (market data, position data, M&A data, transaction data etc). Examples include retrofitting existing trade systems to accommodate a range of mobile clients who have a vested interest in deriving analytics, and marrying tick data with market structure information to understand why certain securities dip or spike at certain points (e.g. institutional selling or equity linked trades with derivatives).
  • Trade Analytics – Trade Strategy development is now a complex process where heterogeneous data – ranging from market data, existing positions and corporate actions to social & sentiment data – is blended together to obtain insights into possible market movements, trader yield & profitability across multiple trading desks.
  • Market & Trade Surveillance – An intelligent surveillance system needs to store trade data, reference data, order data, and market data, as well as all of the relevant communications from all the disparate systems, both internally and externally, and then match these things appropriately. The system needs to account for multiple levels of detection capabilities starting with a) configuring business rules (that describe a fraud pattern) as well as b) dynamic capabilities based on machine learning models (typically thought of as being more predictive) to detect complex patterns that pertain to insider trading and other market integrity compromises. Such a system also needs to be able to parallelize model execution at scale to be able to meet demanding latency requirements.
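
The two detection levels described for Market & Trade Surveillance can be sketched as follows: a set of configured business rules runs first, and a model score is evaluated alongside it. Everything here is hypothetical, including the `Trade` fields, the rule thresholds and the `model_score` function, which is a hand-written stand-in for a trained machine learning model.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    trader: str
    symbol: str
    quantity: int
    minutes_before_announcement: float  # hypothetical enrichment field


def rule_large_trade_before_news(trade):
    """Business rule: a large trade within an hour before a corporate announcement."""
    return trade.quantity >= 10_000 and 0 <= trade.minutes_before_announcement <= 60


# Level (a): configurable business rules, keyed by alert name.
RULES = {"large_trade_before_news": rule_large_trade_before_news}


def model_score(trade):
    """Level (b): stand-in for a trained model, returning a score in [0, 1]."""
    size = min(trade.quantity / 50_000, 1.0)
    if trade.minutes_before_announcement < 0:
        proximity = 0.0
    else:
        proximity = max(0.0, 1.0 - trade.minutes_before_announcement / 240)
    return 0.5 * size + 0.5 * proximity


def review(trade, score_threshold=0.7):
    """Return the names of all alerts raised for one trade."""
    alerts = [name for name, rule in RULES.items() if rule(trade)]
    if model_score(trade) >= score_threshold:
        alerts.append("model_score")
    return alerts


# A trade that falls outside the rule's one-hour window but still scores highly.
print(review(Trade("t4", "XYZ", 40_000, 90)))
```

Keeping the rules in a plain dictionary means compliance staff can add or retire patterns without touching the scoring path, while the model catches cases the fixed thresholds miss.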

B) Consumer Banking & Wealth Management

Data Science has been proven in several applications in consumer banking ranging from a single view of customer to mapping customer journey across multiple financial products & channels. Techniques like pattern analysis (detecting new patterns within and across datasets), marketing analysis (across channels), recommendation analysis (across groups of products) are becoming fairly common. One can see a clear trend in early adopter consumer banking & private banking institutions in moving to an “Analytics first” approach to creating new business applications.

  • Customer 360 & Segmentation –
    Currently most Retail and Consumer Banks lack a comprehensive view of their customers. Each department has a limited view of the customer, due to which the offers and interactions with customers across multiple channels are typically inconsistent. This also results in limited collaboration within the bank when servicing customer needs. Leveraging the ingestion and predictive capabilities of a Hadoop based platform, Banks can build a full picture of the customer across all touch points and provide a user experience that rivals Facebook, Twitter or Google.
  • Some of the more granular business usecases that span the spectrum in Consumer Banking include –
    • Improve profitability per retail or cards customer across the lifecycle by targeting at both micro and macro levels (customer populations). This is done by combining the rich diverse datasets – existing transaction data, interaction data, social media feeds, online visits, cross channel data etc – and by understanding customer preferences across similar segments
    • Detect customer dissatisfaction by analyzing transaction and call center data
    • Cross sell and upsell opportunities across different products
    • Help improve the product creation & pricing process
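
The Customer 360 idea above boils down to joining each department's partial records on a shared customer key. The sketch below shows that join in plain Python under loud assumptions: the three source lists, their field names and the `customer_id` key are invented for illustration, and a real bank would first need entity resolution to establish that shared key.

```python
from collections import defaultdict

# Hypothetical per-department extracts, each holding only a partial view.
cards = [{"customer_id": "C1", "balance": 1200.0}]
deposits = [
    {"customer_id": "C1", "balance": 5300.0},
    {"customer_id": "C2", "balance": 150.0},
]
call_center = [{"customer_id": "C2", "open_complaints": 2}]


def build_customer_view(**sources):
    """Merge records from named sources into one dictionary per customer.

    Field names are prefixed with the source name so that, for example,
    a card balance and a deposit balance do not overwrite each other.
    """
    view = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            customer_id = record["customer_id"]
            for field, value in record.items():
                if field != "customer_id":
                    view[customer_id][f"{source_name}.{field}"] = value
    return dict(view)


customer_360 = build_customer_view(
    cards=cards, deposits=deposits, call_center=call_center
)

for customer_id, profile in sorted(customer_360.items()):
    print(customer_id, profile)
```

With the merged profile in hand, the granular use cases listed above (dissatisfaction detection, cross sell, pricing) become queries over one record per customer rather than over several departmental systems.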

C) Payment Networks

The real time data processing capabilities of Hadoop allow it to process data in a continual, bursty, streaming or micro batching fashion. Once payment data is ingested, it must be processed in a very small time period (hundreds of milliseconds), which is typically termed near real time (NRT). When combined with predictive capabilities via behavioral modeling & transaction profiling, Data Science can provide significant operational, time & cost savings across the below areas.

  • Obtaining a single view of customer across multiple modes of payments
  • Detecting payment fraud by using behavior modeling
  • Understand which payment modes are used more by which customers
  • Realtime analytics support
  • Tracking, modeling & understanding customer loyalty
  • Social network and entity link analysis
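
Micro batching, mentioned above as one of the processing styles, can be illustrated with a small generator that groups a time-ordered stream of payment events into fixed-width windows. This is a sketch of the concept only, not how any particular streaming framework implements it; the function name, the 100 millisecond window and the sample events are assumptions.

```python
def micro_batches(events, window_ms):
    """Group (timestamp_ms, payload) events into consecutive fixed-width windows.

    Events are assumed to arrive sorted by timestamp. One list is yielded
    per non-empty window; empty windows are skipped.
    """
    batch = []
    window_end = None
    for timestamp_ms, payload in events:
        if window_end is None:
            # The first event opens the first window.
            window_end = timestamp_ms + window_ms
        if timestamp_ms >= window_end:
            yield batch
            batch = []
            # Advance until the current event falls inside the window.
            while timestamp_ms >= window_end:
                window_end += window_ms
        batch.append((timestamp_ms, payload))
    if batch:
        yield batch


# Hypothetical payment events: (arrival time in ms, payment identifier).
stream = [(0, "pay-a"), (50, "pay-b"), (120, "pay-c"), (130, "pay-d"), (450, "pay-e")]

batches = list(micro_batches(stream, window_ms=100))
for batch in batches:
    print(batch)
```

Each batch can then be handed to a scoring step such as behavior modeling, which keeps end to end latency bounded by roughly one window plus the scoring time.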

The road ahead – 

How can leaders in the Banking industry leverage a predictive analytics based approach across the industry?

I posit that this will take place in four ways –

  • Using data to create digital platforms that better engage customers, partners and employees
  • Capturing & analyzing any and all data streams from both conventional and newer sources to compile a 360 degree view of the retail customer, institutional client or payment or fraud etc. This is critical to be able to market to the customer as one entity and to assess risk across that one entity as well as populations of entities
  • Creating data products by breaking down data silos and other internal organizational barriers
  • Using data driven insights to support a culture of continuous innovation and experimentation

The next & final post will examine specific Data Science techniques covering key algorithms, and other computational approaches. We will also cover business & strategy recommendations to industry CXOs embarking on Data Science projects.

Big Data & Advanced Analytics drive profits in Financial Services..(1/3)

“Silicon Valley is coming. There are hundreds of start-ups with a lot of brains and money working on various alternatives to traditional banking….the ones you read about most are in the lending business, whereby the firms can lend to individuals and small businesses very quickly and — these entities believe — effectively by using Big Data to enhance credit underwriting. They are very good at reducing the ‘pain points’ in that they can make loans in minutes, which might take banks weeks.” Jamie Dimon – CEO JP Morgan Chase in Annual Letter to Shareholders Feb 2016[1].

If Jamie Dimon’s opinion is anything to go by, the Financial Services industry is undergoing a major transformation and it is very evident that Banking as we know it will change dramatically over the next few years. This blog has spent some time over the last year defining the Big Data landscape in Banking. However the rules of the game are changing from mere data harnessing to leveraging data to drive profits. With that background, let us begin examining the popular applications of Data Science in the financial industry. This blog covers the motivation for and need of data mining in Banking. The next blog will introduce key usecases and we will round off the discussion in the third & final post by covering key algorithms, and other computational approaches.

The Banking industry produces the most data of any vertical out there with well defined & long standing business processes that have stood the test of time. Banks possess rich troves of data that pertain to customer transactions & demographic information. However, it is not enough for Bank IT to just possess the data. They must be able to drive change through legacy thinking and infrastructures as things change around the entire industry, not just from a risk & compliance standpoint. For instance, a major new segment is millennial customers – who increasingly use mobile devices and demand more contextual services as well as a seamless unified banking experience – akin to what they commonly experience via the internet – at web properties like Facebook, Amazon, Uber, Google or Yahoo etc.

How do Banks stay relevant in this race? A large part of the answer is to make Big Data a strategic & boardroom level discussion and to take an industrial approach to predictive analytics. The approach currently in vogue – to treat these as one-off, tactical project investments – simply does not work or scale anymore. There are various organizational models that one could employ, ranging from a shared service to a line of business led approach. An approach that I have seen work very well is to build a Center of Excellence (COE) to create contextual capabilities, best practices and rollout strategies across the larger organization.

Banks need to lead with Business Strategy 

A strategic approach to industrializing analytics in a Banking organization can add massive value and competitive differentiation in five distinct categories –

  1. Exponentially improve existing business processes, e.g. risk data aggregation and measurement, financial compliance, fraud detection
  2. Help create new business models and go to market strategies – by monetizing multiple data sources – both internal and external
  3. Vastly improve customer satisfaction by generating better insights across the customer journey
  4. Increase security while expanding access to relevant data throughout the enterprise to knowledge workers
  5. Help drive end to end digitization

Financial Services gradually evolves from Big Data 1.0 to 2.0

Predictive analytics & data mining have only been growing in popularity in recent years. However, when coupled with Big Data, they are on their way to attaining a higher degree of business capability & visibility.

Let’s take a quick walk down memory lane.

In Big Data 1.0 – (2009-2015), a large technical area of focus was to ingest huge volumes of data to process them in a batch oriented fashion to perform a limited number of business usecases. In the era of 2.0, the focus is on enabling applications to perform high, medium or low latency based complex processing.

In the age of 1.0, Banking organizations across the spectrum, ranging from the mega banks to smaller regional banks to asset managers, have used the capability to acquire, store and process large volumes of data using commodity hardware at a much lower price point. This has resulted in a huge reduction in CapEx & OpEx spend on data management projects (Big Data helps augment legacy investments in MPP systems, Data Warehouses, RDBMS’s etc).

The age of Big Data 1.0 in financial services is almost over and the dawn of Big Data 2.0 is now upon the industry. One may ask, “what is the difference?” I would contend that while Big Data 1.0 largely dealt with the identification, on-boarding and broad governance of the data, 2.0 will begin the redefinition of business based on the ability to deploy advanced processing techniques across a plethora of new & existing sources of data. 2.0 will thus be about extracting richer insights from the onboarded data to serve customers better, stay compliant with regulation & to create new businesses. The new role of ‘Data scientist’, an interdisciplinary expert (part business strategist, part programmer, part statistician, part data miner & part business analyst), has come to represent one of the most highly coveted job skills today.

Long before the term “Data Science” entered the technology lexicon, the Capital Markets employed advanced quantitative techniques. The emergence of Big Data has only opened up new avenues in machine learning, data mining and artificial intelligence.


                                                    Illustration: Data drives Banking

Why is that?

Hadoop, which is now really a platform ecosystem of 30+ projects – as opposed to a standalone technology – has been reimagined twice and now forms the backbone of any financial services data initiative. Thus, Hadoop has now evolved into a dual persona – an Application platform in addition to being a platform for data storage & processing.

Why are Big Data and Hadoop the ideal platform for Predictive Analytics?

Big Data is dramatically changing the traditional approach with advanced analytic solutions that are not only powerful and fast enough to detect fraud in real time but can also build models based on historical data (and deep learning) to proactively identify risks.

The reasons why Hadoop is emerging as the best choice for predictive analytics are

  1. Access to advances in infrastructure & computing capabilities at a very low cost
  2. Monumental advances in the algorithmic techniques themselves, e.g. mathematical abilities, feature sets, performance etc
  3. Low cost & efficient access to tremendous amounts of data & the ability to store it at scale

Technologies in the Hadoop ecosystem such as ingestion frameworks (Flume, Kafka, Sqoop etc) and processing frameworks (MapReduce, Storm, Spark et al) have enabled the collection, organization and analysis of Big Data at scale. Hadoop supports multiple ways of running models and algorithms that are used to find patterns of customer behavior, business risks, cyber security violations, fraud and compliance anomalies in the mountains of data. Examples of these models include Bayesian filters, Clustering, Regression Analysis, Neural Networks etc. Data Scientists & Business Analysts have a choice of MapReduce, Spark (via Java, Python, R), Storm etc and SAS, to name a few, to create these models. Fraud model development, testing and deployment on fresh & historical data become very straightforward to implement on Hadoop.
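
For readers unfamiliar with the MapReduce processing model named above, the following single-machine sketch walks through its three phases (map, shuffle, reduce) to total transaction amounts by type. It only imitates the programming pattern in plain Python and does not use Hadoop's API; the function names and sample transactions are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical transaction records: (transaction type, amount).
transactions = [
    ("debit", 20.0),
    ("credit", 100.0),
    ("debit", 5.5),
    ("transfer", 250.0),
    ("credit", 40.0),
]


def map_phase(record):
    """Emit a (key, value) pair for one input record."""
    kind, amount = record
    return kind, amount


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Combine all values for one key into a single total."""
    return reduce(lambda left, right: left + right, values, 0.0)


grouped = shuffle(map(map_phase, transactions))
totals = {key: reduce_phase(values) for key, values in grouped.items()}
print(totals)
```

On a cluster, the map and reduce functions run in parallel across many nodes and the shuffle moves data between them, which is where the scale described above comes from.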

However, the story around Big Data adoption in your average Bank is typically not all that revolutionary – it typically follows a more evolutionary cycle where a rigorous engineering approach is applied to gain small business wins before scaling up to more transformative projects. Leveraging an open enterprise Hadoop approach, Big Data centric business initiatives in financial services have begun realizing value in a range of areas as diverse as the Defensive (Risk, Fraud and Compliance – RFC) to achieving Competitive Parity (e.g. Single View of Customer) to the Offensive (Digital Transformation across their Retail Banking business, unified Trade Data repositories in Capital Markets).

With the stage thus set, the next post will describe real world compelling usecases for Predictive Analytics across the spectrum of 21st century banking.