A Digital Reference Architecture for the Industrial Internet Of Things (IIoT)..

A few weeks ago on the invitation of DZone Magazine, I jointly authored a Big Data Reference Architecture along with my friend & collaborator, Tim Spann (https://www.linkedin.com/in/timothyspann/). Tim & I distilled our experience working on IIoT projects to propose an industrial strength digital architecture. It brings together several technology themes – Big Data , Cyber Security, Cognitive Applications, Business Process Management and Data Science. Our goal is to discuss a best in class architecture that enables flexible deployment for new IIoT capabilities allowing enterprises to build digital applications. The abridged article was featured in the new DZone Guide to Big Data: Data Science & Advanced Analytics which can be downloaded at  https://dzone.com/guides/big-data-data-science-and-advanced-analytics

How the Internet Of Things (IoT) leads to the Digital Mesh..

The Internet of Things (IoT) has become one of the four top hyped up technology paradigms affecting the world of business. The other usual suspects being Big Data, AI/Machine Learning & Blockchain. Cisco predicts that the IOT is expected to impact about 25 billion connected things by 2020 and affect about $2 trillion of economic value globally across a diverse range of verticals. These devices are not just consumer oriented devices such as smartphones and home monitoring systems but dedicated industry objects such as sensors, actuators, engines etc.

The interesting angle to all this is the fact that autonomous devices are already beginning to communicate with one another using IP based protocols. They largely exchanging state & control information around various variables. With the growth of computational power on these devices, we are not far off from their sending over more granular and interesting streaming data – about their environment, performance and business operations – all of which will enable a higher degree of insightful analytics to be performed on the data. Gartner Research has termed this interconnected world where decision making & manufacturing optimization can occur via IoT as the “Digital Mesh“.

The evolution of technological innovation in areas such as Big Data, Predictive Analytics and Cloud Computing now enables the integration and analysis of massive amounts of device data at scale while performing a range of analytics and business process workflows on the data.

Image Credit – Sparkling Logic

According to Gartner, the Digital Mesh will thus lead to an interconnected data information deluge powered by the continuous data from these streams. These streams will encompasses classical IoT endpoints (sensors, field devices, actuators etc) sending data in a variety of formats –  text, audio, video & social data streams – along with new endpoints in areas as diverse as Industrial Automation, Remote Healthcare, Public Transportation, Connected Cars, Home Automation etc. These intelligent devices will increasingly begin communicating with their environments in a manner that will encourage collaboration in a range of business scenarios. The industrial cousin of IoT is the Industrial Internet of Things (IIIoT).

Defining the Industrial Internet Of Things (IIoT)

The Industrial Internet of Things (IIoT) can be defined as a ecosystem of capabilities that interconnects machines, personnel and processes to optimize the industrial lifecycle.  The foundational technologies that IIoT leverages are Smart Assets, Big Data, Realtime Analytics, Enterprise Automation and Cloud based services.

The primary industries impacted the most by the IIoT will include Industrial Manufacturing, the Utility industry, Energy, Automotive, Transportation, Telecom & Insurance.

According to Markets and Markets, the annual worldwide Industrial IoT market is projected to exceed $319 billion in 2020, which represents an 8% a compound annual growth rate (CAGR). The top four segments are projected to be manufacturing, energy and utilities, auto & transportation and healthcare.[1]

Architectural Challenges for Industrial IoT versus Consumer IoT..

Consumer based IoT applications generally receive the lion’s share of media attention. However the ability of industrial devices (such as sensors) to send ever more richer data about their operating environment and performance characteristics is driving a move to Digitization and Automation across a range of industrial manufacturing.

Thus, there are four distinct challenges that we need to account for in an Industrial IOT scenario as compared to Consumer IoT.

  1. The IIoT needs Robust Architectures that are able to handle millions of device telemetry messages per second. The architecture needs to take into account that all kinds of devices operating in environments ranging from the constrained to
  2. IIoT also calls for the highest degrees of Infrastructure and Application reliability across the stack. For instance, a lost message or dropped messages in a healthcare or a connected car scenario may mean life or death for a patient, or, an accident.
  3. An ability to integrate seamlessly with existing Information Systems. Lets be clear, these new age IIOT architectures need to augment existing systems such as Manufacturing Execution Systems (MES) or Traffic Management Systems. In Manufacturing, MES systems continually improve the product lifecycle and perform better resource scheduling and utilization. This integration helps these systems leverage the digital intelligence and insights across (potentially) millions of devices across complex areas of operation.
  4. An ability to incorporate richer kinds of analytics than has been possible before that provide a great degree of context. This ability to reason around context is what provides an ability to design new business models which cannot be currently imagined due to lack of agility in the data and analytics space.

What will IIoT based Digital Applications look like..

Digital Applications are being designed for specific device endpoints across industries. While the underlying mechanisms and business models differ from industry to industry, all of these use predictive analytics based on a combination of real time data processing & data science algorithms. These techniques extract insights from streaming data to provide digital services on existing toolchains, provide value added customer service, predict device performance & failures, improve operational metrics etc.

Examples abound. For instance, a great example in manufacturing is the notion of a Digital Twin which Gartner called out last year. A Digital twin is a software personification of an Intelligent device or system.  It forms a bridge between the real world and the digital world. In the manufacturing industry, digital twins can be setup to function as proxies of Things like sensors and gauges, coordinate measuring machines, vision systems, and white light scanning. This data is sent over a cloud based system where it is combined with historical data to better maintain the physical system.

The wealth of data being gathered on the shop floor will ensure that Digital twins will be used to reduce costs and increase innovation. Thus, in global manufacturing – Data science will soon make it’s way into the shop floor to enable the collection of insights from these software proxies. We covered the phenomenon of Servitization in manufacturing in a previous blogpost.

In the Retail industry, an ability to detect a customer’s location in realtime and combining that information with their historical buying patterns can drive real time promotions and an ability to dynamically price retail goods.

Solution Requirements for an IIoT Architecture..

At a high level, the IIoT reference architecture should support six broad solution areas-

  1. Device Discovery – Discovering a range of devices (and their details)  on the Digital Mesh for an organization within and outside the firewall perimeter
  2. Performing Remote Lifecycle Configuration of these devices ranging from startup to modification to monitoring to shut down
  3. Performing Deep Security level introspection to ensure the patch levels etc are adequate
  4. Creating Business workflows on the Digital Mesh. We will do this by marrying these devices to enterprise information systems (EISs)
  5. Performing Business oriented Predictive Analytics on these devices, this is critical to 
  6. On a futuristic basis, support optional integration with the Blockchain to support a distributed organizational ledger that can coordinate activity across all global areas that an enterprise operates in.

Building Blocks of the Architecture

Listed below are the foundational blocks of our reference architecture. Though the requirements will vary across industries, an organization can reasonably standardize on a number of foundational components as depicted below and then incrementally augment them as the interactions between different components increase based on business requirements.

Our reference architecture includes the following major building blocks –

  • Device Layer
  • Device Integration Layer
  • Data & Middleware Tier
  • Digital Application Layer

It also includes the following cross cutting concerns which span across the above layers –

  • Device and Data Security
  • Business Process Management
  • Service Management
  • UX Design
  • Data Governance – Provenance, Auditing, Logging

The next section provides a brief overview of the reference architecture’s components at a logical level.

A Big Data Reference Architecture for the Industrial Internet depicting multiple functional layers

Device Layer – 

The first requirement of IIIoT implementations is to support connectivity from the Things themselves or the Device layer depicted at the bottom. The Device layer includes a whole range of sensors, actuators, smartphones, gateways and industrial equipment etc. The ability to connect with devices and edge devices like routers, smart gateways using a variety of protocols is key. These network protocols include Ethernet, WiFi, and Cellular which can all directly connect to the internet. Other protocols that need a gateway device to connect include Bluetooth, RFID, NFC, Zigbee et al. Devices can connect directly with the data ingest layer shown above but it is preferred that they connect via a gateway which can perform a range of edge processing.

This is important from a business standpoint for instance, in certain verticals like healthcare and financial services, there exist stringent regulations that govern when certain identifying data elements (e.g. video feeds) can leave the premises of a hospital or bank etc. A gateway cannot just perform intelligent edge processing but also can connect thousands of device endpoints and facilitate bidirectional communication with the core IIoT architecture. 

The ideal tool for these constantly evolving devices, metadata, protocols, data formats and types is Apache NiFi.  These agents will send the data to an Apache NiFi gateway or directly into an enterprise Apache NiFi cluster in the cloud or on-premise.

Apache NiFi Eases Dataflow Management & Accelerates Time to Analytics In Banking (2/3)..

 A subproject of Apache NiFi – MiNiFi provides a complementary data collection approach that supplements the core tenets of NiFi in dataflow management. However due to its small footprint and low resource consumption, is well suited to handle dataflow from sensors and other IOT devices. It provides central management of agents while providing full chain of custody information on the flows themselves.

For remote locations, more powerful devices like the Arrow BeagleBone Black Industrial and MyPi Industrial, it is very simple to run a tiny Java or C++ MiNiFi agent for secure connectivity needs.

The data sent by the device endpoints are then modeled into an appropriate domain representation based on the actual content of the messages. The data sent over also includes metadata around the message. A canonical model can optionally be developed (based on the actual business domain) which can support a variety of applications from a business intelligence standpoint.

 Apache NiFi supports the flexibility of ingesting changing file formats, sizes, data types and schemas. The devices themselves can send a range of feeds in different formats. E.g. XML now and based on upgraded capabilities – richer JSON tomorrow. NiFi supports ingesting any file type that the devices or the gateways may send.  Once the messages are received by Apache NiFi, they are enveloped in security with every touch to each flow file controlled, secured and audited.   NiFi flows also provide full data provenance for each file, packet or chunk of data sent through the system.  NiFi can work with specific schemas if there are special requirements for file types, but it can also work with unstructured or semi structured data just as well.  From a scalability standpoint, NiFi can ingest 50,000 streams concurrently on a zero-master shared nothing cluster that horizontally scales via easy administration with Apache Ambari.

Data and Middleware Layer – 

The IIIoT Architecture recommends a Big Data platform with native message oriented middleware (MOM) capabilities to ingest device mesh data. This layer will also process device data in such a fashion – batch or real-time – as the business needs demand.

Application protocols such as AMQP, MQTT, CoAP, WebSockets etc are all deployed by many device gateways to communicate application specific messages.  The reason for recommending a Big Data/NoSQL dominated data architecture for IIOT is quite simple. These systems provide Schema on Read which is an innovative data handling technique. In this model, a format or schema is applied to data as it is accessed from a storage location as opposed to doing the same while it is ingested. From an IIOT standpoint, one must not just deal with the data itself but also metadata such as timestamps, device id, other firmware data such as software version, device manufactured data etc. The data sent from the device layer will consist of time series data and individual measurements.

The IIoT data stream can thus be visualized as a constantly running data pump which is handled by a Big Data pipeline takes the raw telemetry data from the gateways, decides which ones are of interest and discards the ones not deemed significant from a business standpoint.  Apache NiFi is your gateway and gate keeper.   It ingests the raw data, manages the flow of thousands of producers and consumers, does basic data enrichment, sentiment analysis in stream, aggregation, splitting, schema translation, format conversion and other initial steps to prepare the data. It does that all with a user-friendly web UI and easily extendible architecture.  It will then send raw or processed data to Kafka for further processing by Apache Storm, Apache Spark or other consumers.  Apache Storm is a distributed real-time computation engine that reliably processes unbounded streams of data.  Storm excels at handling complex streams of data that require windowing and other complex event processing. While Storm processes stream data at scale, Apache Kafka distributes messages at scale. Kafka is a distributed pub-sub real-time messaging system that provides strong durability and fault tolerance guarantees. NiFi, Storm and Kafka naturally complement each other, and their powerful cooperation enables real-time streaming analytics for fast-moving big data. All the stream processing is handled by NiFi-Storm-Kafka combination.  

Apache Nifi, Storm and Kafka integrate very closely to manage streaming dataflows.


Appropriate logic is built into the higher layers to support device identification, ID lookup, secure authentication and transformation of the data. This layer will process data (cleanse, transform, apply a canonical representation) to support Business Automation (BPM), BI (business intelligence) and visualization for a variety of consumers. The data ingest layer will also providing notification and alerts via Apache NiFi.

Here are some typical uses for this event processing pipeline:

a. Real-time data filtering and pattern matching

b. Enrichment based on business context

c. Real-time analytics such as KPIs, complex event processing etc

d. Predictive Analytics

e. Business workflow with decision nodes and human task nodes

Digital Application Tier – 

Once IIoT knowledge has become part of the Hadoop based Data Lake, all the rich analytics, machine learning and deep learning frameworks, tools and libraries now become available to Data Scientists and Analysts.   They can easily produce insights, dashboards, reports and real-time analytics with IIoT data joined with existing data in the lake including social media data, EDW data, log data.   All your data can be queried with familiar SQL through a variety of interfaces such as Apache Phoenix on HBase, Apache Hive LLAP and Apache Spark SQL.   Using your existing BI tools or the open sourced Apache Zeppelin, you can produce and share live reports.   You can run TensorFlow in containers on YARN for deep learning insights on your images, videos and text data; while running YARN clustered Spark ML pipelines fed by Kafka and NiFi to run streaming machine learning algorithms on trained models.

A range of predictive applications are suitable for this tier. The models themselves should seek to answer business questions around things like -Asset failure, the key performance indicators in a manufacturing process and how they’re trending, insurance policy pricing etc. 

Once the device data has been ingested into a modern data lake, key functions that need to be performed include data aggregation, transformation, enriching, filtering, sorting etc.

As one can see, this can get very complex very quick – both from a data storage and processing standpoint. A Cloud based infrastructure with its ability to provide highly scalable compute, network and storage resources is a natural fit to handle bursty IIoT applications. However, IIoT applications add their own diverse requirements of computing infrastructure, namely the ability to accommodate hundreds of kinds of devices and network gateways – which means that IT must be prepared to support a large diversity of operating systems and storage types

The tier is also responsible for the integration of the IIoT environment into the business processes of an enterprise. The IIoT solution ties into existing line-of-business applications and standard software solutions through adapters or Enterprise Application Integration (EAI) and business-to-business (B2B) gateway capabilities. End users in business-to-business or business-to-consumer scenarios will interact with the IIOT solution and the special- purpose IIoT devices through this layer. They may use the IIoT solution or line-of-business system UIs, including apps on personal mobile devices, such as smartphones and tablets.

Security Implementation

The topic of Security is perhaps the most important cross cutting concern across all layers of the IIoT architecture stack. Needless to say, each of the layers must support the strongest data encryption, authentication and authentication capabilities for devices, users and partner applications. Accordingly, capabilities must be provided to ingest and store security feeds, IDS logs for advanced behavioral analytics, server logs, device telemetry. These feeds must be constantly analyzed across three domains – the Device domain, the Business domain and the IT domain. The below blogpost delves into some of these themes and is a good read to get a deeper handle on this issue from a SOC (security operations center) standpoint.

An Enterprise Wide Framework for Digital Cybersecurity..(4/4)


It is evident from the above that IIoT will enormous opportunity for businesses globally. It will also create layers of complexity and opportunity for Enterprise IT. The creation of smart digital services on the data served up will further depend on the vertical industries. Whatever be the kind of business model – whether tracking behavior, location sensitive pricing, business process automation etc – the end goal of IT architecture should be to create enterprise business applications that are ultimately data native and analytics driven.


Here Is What Is Causing The Great Brick-And-Mortar Retail Meltdown of 2017..(1/2)

Amazon and other pure plays are driving toward getting both predictive and prescriptive analytics. They’re analyzing and understanding information at an alarming rate. Brands have pulled products off of Amazon because they’re learning more about them than the brands themselves.” — Todd Michaud, Founder and CEO of Power Thinking Media

By April 2017,17 major retailers announced plans to close stores (Image Credit: Clark Howard)

We are witnessing a meltdown in Storefront Retail..

We are barely halfway through 2017, and the US business media is rife with stories of major retailers closing storefronts. The truth is inescapable that the Retail industry is in the midst of structural change. According to a research report from Credit Suisse, around 8,600 brick-and-mortar stores will shutter their doors in 2017. The number in 2016 was 2,056 stores and 5,077 in 2015 which points to industry malaise [1].

The Retailer’s bigger cousin – the neighborhood Mall – is not doing any better. There are around 1,200 malls in the US today and that number is forecast to decline to just about 900 in a decade.[3]

It is clear that in the coming years, Retailers (and malls) across the board will remain under pressure due to a variety of changes – technological, business model and demographic.

So what can legacy Retailers do to compete with and disarm the online upstart?

Six takeaways for Retail Industry watchers..

Six takeaways that should have industry watchers take notice from the recent headlines –

  1. The brick and mortar retail store pullback has accelerated in 2017 – an year of otherwise strong economic expansion. Typical consumer indicators that influence consumer spending on retail are generally pointing upwards. Just sample the financial data – the US has seen increasing GDP for eight straight years, the last 18 months have seen wage growth for middle & lower income Americans and gas prices are at all time lows.[3] These kinds of relatively strong consumer data trends cannot explain a slowdown in physical storefronts. Consumer spending is not shrinking to due to declining affordability/spending power.
  2. Retailers that have either declared bankruptcy or announced large scale store closings include marquee names across the different categories of retail. Ranging from Apparel to Home Appliances to Electronics to Sporting Goods. Just sample some of the names – Sports Authority, RadioShack, HHGregg, American Apparel, Bebe Stores, Aeropostale, Sears, Kmart, Macy’s, Payless Shoes, JC Penney etc. So this is clearly a trend across various sectors in retail and not confined to a given area, for instance, women’s apparel.
  3. Some of this “Storefront Retail bubble burst” can definitely be attributed to hitherto indiscriminate physical retail expansion. The first indicator is here is in the glut of residual excess retail space.  The WSJ points out that the retail expansion dates back almost 30 years ago when retailers began a “land grab” to open more stores – not unlike the housing boom a decade or so ago. [1] North America now has a glut of both retail stores and shopping malls while per capita sales has begun declining. The US especially has almost five times retail space per capita compared to the UK. American consumers are also swapping materialism for more experiences.[3]  Thus, an over-buildout of retail space is one of the causes of the ongoing crash.

    The US has way more shopping space compared to the rest of the world. (Credit – Cowan and Company)
  4. The dominant retail trend in the world is online ‘single click’ shopping. This is evidenced by declining in-store Black Friday sales in 2016 when compared with record Cyber Monday (online) sales. As online e-commerce volume increases year on year, online retailers led by Amazon are surely taking market share away from the struggling brick-and mortar Retailer who has not kept up with the pace of innovation. The uptick in online retail is unmistakeable as evidenced by the below graph (src – ZeroHedge) depicting the latest retail figures. Department-store sales rose 0.2% on the month, but were down 4.5% from a year earlier. Online retailers such as Amazon, posted a 0.6% gain from the prior month and a 11.9% increase from a year earlier.[3]

    Retail Sales – Online vs In Store Shopping (credit: ZeroHedge)
  5. Legacy retailers are trying to play catch-up with the upstarts who excel at technology. This has sometimes translated into acquisitions of online retailers (e.g. Walmart’s buy of Jet.com). However, the Global top 10 Retailers are dominated by the likes of Walmart, Costco, the Kroger, Walgreens etc. Amazon comes in only at #10 which implies that this battle is only in it’s early days. However, legacy retailers are saddled by huge fixed costs & their investors prefer dividend payouts to investments in innovations. Thus their CEOs are incentivized to focus on the next quarter, not the next decade like Amazon’s Jeff Bezos who is famously known to not evidence any signs of increasing Amazon’s profitability. Though traditional retailers have begun accelerating investments (both organic and via acquisition) in the critical areas of Cloud Computing, Big Data,Mobility and Predictive Analytics – the web scale majors such as Amazon are far far ahead of typical Retail IT shop.

  6. The fastest growing Retail industry brands are companies that use Data as a core business capability to impact the customer experience versus as just another component of an overall IT system. Retail is a game of micro customer interactions that drive sales and margin. This implies a Retailer’s ability to work with realtime customer data – whether it’s sentiment data, clickstream data and historical purchase data to drive marketing promotions, personally relevant services, order fulfillment, show-rooming, loyalty programs etc etc.On the back end, the ability to streamline operations by pulling together data from operations, supply chains are helping retailers fine-tune & automate operations especially from a delivery standpoint.

    In Retail, Technology Is King..

    So, what makes Retail somewhat of a unique industry in terms of it’s data needs? I posit that there are four important characteristics –

    • First and foremost, Retail customers especially the millennials are very open about sharing their brand preferences and experiences on social media. There is a treasure trove of untapped data and similar out there. Data needs to be collected and monetized on. We will explore this in more detail in the next post.
    • Secondly, leaders such as Amazon use customer, product data and a range of other technology capabilities to shape the customer experience versus the other way around for traditional retailers. They do this based on predictive analytic approaches such as machine learning and deep learning. Case in point is Amazon which has now morphed from an online retailer to a Cloud Computing behemoth with it’s market leading AWS (Amazon Web Services). In fact it’s best in class IT enabled it to experiment with retail business models. E.g. The Amazon Prime subscription at $99-a-year Amazon Prime subscription, which includes free two delivery, music and video streaming service that competes with Netflix. As of March 31, 2017 Amazon had 80 million Prime subscribers in the U.S , an increase of 36 percent from a year earlier, according to Consumer Intelligence Research Partners.[3]
    • Thirdly, Retail organizations need to become Data driven businesses. What does that mean or imply? They need to rely on data to drive every core business process – e.g. realtime insights about customers, supply chains, order fulfillment and inventory. This data however spans every kind from traditional structured data (sales data, store level transactions, customer purchase histories, supply chain data, advertising data etc) to non traditional data (social media feeds as there is a strong correlation between the products people rave about and what they ultimately purchase), location data, economic performance data etc). This Data variety represents a huge challenge to Retailers in terms of managing, curating and analyzing these feeds.
    • Fourth, Retail needs to begin aggressively adopting the IoT capabilities they already have in place in the area of Predictive Analytics. This implies tapping and analyzing data from in store beacons, sensors and actuators across a range of use cases from location based offers to restocking shelves.

      ..because it enables new business models..

      None of the above analysis claims that physical stores are going away. They serve a very important function in allowing consumers a way to try on products and allowing for the human experience. However, online definitely is where the growth primarily will be.

      The Next and Final Post in this series..

      It is very clear from the above that it now makes more sense to talk about a Retail Ecosystem which is composed of store, online, mobile and partner storefronts.

      In that vein, the next post in this two part series will describe the below four progressive strategies that traditional Retailers can adopt to survive and favorably compete in today’s competitive (and increasingly online) marketplace.

      These are –

    • Reinventing Legacy IT Approaches – Adopting Cloud Computing, Big Data and Intelligent Middleware to re-engineer Retail IT

    • Changing Business Models by accelerating the adoption of Automation and Predictive Analytics – Increasing Automation rates of core business processes and infusing them with Predictive intelligence thus improving customer and business responsiveness

    • Experimenting with Deep Learning Capabilities  –the use of Advanced AI such as Deep Neural Nets to impact the entire lifecycle of Retail

    • Adopting a Digital or a ‘Mode 2’ Mindset across the organization – No technology can transcend a large ‘Digital Gap’ without the right organizational culture

      Needless to say, the theme across all of the above these strategies is to leverage Digital technologies to create immersive cross channel customer experiences.


[1] WSJ – ” Three hard lessons the internet is teaching traditional stores” – https://www.wsj.com/articles/three-hard-lessons-the-internet-is-teaching-traditional-stores-1492945203

[2] The Atlantic  – “The Retail Meltdown” https://www.theatlantic.com/business/archive/2017/04/retail-meltdown-of-2017/522384/?_lrsc=2f798686-3702-4f89-a86a-a4085f390b63

[3] WSJ – ” Retail Sales fall for the second straight month” https://www.wsj.com/articles/u-s-retail-sales-fall-for-second-straight-month-1492173415

Why Big Data Analytics is the Future of CRM..

A  question that I get a lot from customers is around how Big Data can help augment CRM systems. The answer isn’t just about the ability to aggregate loads of information to produce much richer views of the data but also about feeding this data to produce richer digital analytics. 

Why Combine CRM with Big Data

Customer Relationship Management (CRM) systems primarily resolve around customer information and captures a customers interactions with a company.  The strength of CRM systems is their ability to work with structured data such as customer demographic information (Name, Identifiers, Address, product history etc)

Industry customers will want to use their core CRM customer profiles as a foundational capability and then augment it with additional data as shown in the below diagram –

  1. Core CRM Records as shown at the bottom layers storing structured customer contact data
  2. Extended attribute information from MDM systems,
  3. Customer Experience Data such as Social (sentiment, propensity to buy), Web clickstreams, 3rd party data, etc. (i.e. behavioral, demographics, lifestyle, interests, etc).
  4. Any Linked accounts for customers
  5. The ability to move to a true Customer 360 or Single View
How Big Data Can Augment CRM systems (credit – Mike Ger)

All of these non traditional data streams shown above and depicted below can be stored on commodity hardware clusters. This can be done at a fraction of the cost of traditional SAN storage. The combined data can then be analyzed effectively in near real time thus providing support for advanced business capabilities.

The Seven Kinds of Non Traditional Data that has become prevalent over the last five years

Seven Common Business Capabilities

Once all of this data has been ingested into a datalake from CRM systems, Book Of Record Transaction Systems (BORT), unstructured data sources etc the following kinds of analysis are performed on it. Big Data based on Hadoop can help join CRM data (customer demographics, sales information, advertising campaign info etc) with additional data. This rich view of a complete dataset can provide the below business capabilities.

  • Customer Segmentation– For a given set of data, predict for each individual in a population, a discrete set of classes that this individual belongs to. An example classification is – “For all retail banking clients in a given population, who are most likely to respond to an offer to move to a higher segment”.
  • Pattern recognition and analysis – discover new combinations of business patterns within large datasets. E.g. combine a customers structured data with clickstream data analysis. A major bank in NYC is using this data to settle mortgage loans.
  • Customer Sentiment analysis is an technique used to find degrees of customer satisfaction and how to improve them with a view of increasing customer net promoter scores (NPS).
  • Market basket analysis is commonly used to find out associations between products that are purchased together with a view to improving marketing products. E.g Recommendation engines which to understand what banking products to recommend to customers.
  • Regression algorithms aim to characterize the normal or typical behavior of an individual or group within a larger population. It is frequently used in anomaly detection systems such as those that detect AML (Anti Money Laundering) and Credit Card fraud.
  • Profiling algorithms divide data into groups, or clusters, of items that have similar properties.
  • Causal Modeling algorithms attempt to find out what business events influence others.

Four business benefits in combining Big Data with CRM Systems –

  1. Hadoop can make CRM systems more efficient and cost effective – Most CRM technology is based on an underlying relational database or enterprise data warehouse. These legacy data storage technologies suffer from data collection delays and processing challenges. Hadoop with it’s focus on Schema On Read (SOR) and parallelism can enable low cost storage combined with efficient processing
  2. This integration can focus on improving customer experience – Combining past interactions with historical data across both systems can provide a realtime single view of a customer thus helping agents work better with their customer.
  3. Combine Data in innovative ways to create new products – Once companies have deep insights into customer behavior and purchasing patterns, they can combine the data to create or modify existing service and products.
  4. Gain Realtime insights –  Online transactions are increasing in number year on year. The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just provide engaging visualization but also to personalize services clients care about across multiple channels of interaction. The only way to attain digital success is to understand your customers at a micro level while constantly making strategic decisions on your offerings to the market. This essentially means operating in a real time world – which leads to Big Data.

To Sum Up…

Combining CRM with Big Data can help maximize competitive advantage across every industry vertical. These advantages not only stem from cheaply storing and analyzing vastly richer data. These business insights are deployed in areas such as marketing, customer service and new product ideation.

How Modern Taxation can benefit from Big Data Analytics..

We have repeatedly discussed how Predictive Analytics built on Big Data capabilities can create a tremendous amount of differentiation and value for Banking and Insurance companies. However, one of the most important areas of public finance – taxation- is just waking up to the possibilities of using this game changing technology – with the goal of  increasing public revenues via the collection of indirect taxes. In this blogpost, we will examine both perspectives – the taxation authority as well as the Corporate under the burden of tax reporting.


National Tax Authorities begin information sharing..

The last few years have seen increased digitalization of national tax accounting systems. More and more citizens opt to file their returns electronically. Reporting rules for businesses (such as Banks) that have wealthy account holders who have substantial financial assets spanning continents have also gone up. Different legislations have been promulgated in an effort to ensure that the exchange of information on tax matters is as seamless as it can possibly be – with the goal of curbing tax evasion.

The US Government’s tax arm – the IRS (Internal Revenue Service)- introduced the FATCA (Foreign Account Tax Compliance Act) in 2010 with the intention of detecting and tracking US nationals that reside abroad and owe the US taxes. The FATCA intends to prevent US taxpayers who hold non-US financial assets from avoiding taxation by ensuring that foreign banking institutions report on the US account holders or face a withholding of 30% on all US income.

The Organization for Economic Cooperation and Development (OECD) in conjunction with the US authorities has adopted the Common Reporting Standard (CRS), as an information standard for the automatic exchange of taxpayer information. As part of CRS, more than 90 countries now share information on residents assets and incomes in conformance with the reporting standard. CRS is far wider than FATCA and imposes a significant burden on the compliance team in a bank.

As an example, both the US and the UK governments enforce compliance with CRS. This enables bilateral information sharing between both national taxation authorities. [1]

To that end, national and regional tax authorities have begun using Big Data techniques refined in the private sector to increase the collection of taxes from citizens while reducing fraud and waste in the system.

Across a range of national compliance regimes, Big Data can help improve both the tax collection and improve risk scoring process. It can do this  by way of advanced analytics that operates on a much richer and wider set of data than was possible before. On the business side it can help with ensuring compliance reporting and not overpaying.

Catalogued below are the important areas in which Big Data Analytics can present a huge impact in the field of public taxation.

# 1: Help both Taxation Authorities and Finance Departments Efficiently Store and Process Enormous Amounts of Taxation Data 

The point is well made that the entire indirect taxation process is a complex process in terms of both the breadth and the depth of documentation. This is for both corporate tax departments and of the authorities. Businesses need to look at a complex range of multifaceted tax data across all of the thousands of national, state and local jurisdictions they operate across. Not being able to store, process the data in one place induces all the challenges caused by data silos. The Data Lake is a natural fit to store historical tax documents, Accounts Payables/Receivables, Expense information, business receipts, emails, call transcripts etc. Data from various source systems that drive the tax process should be from Master Data, Reference Data (containing fresh & accurate tax rate/jurisdiction data), data from various Book of Record Systems, ERP, finance and legacy financial systems. The key point here is to automate this data movement and cut down manual ingestion processes thus improving both the speed and quality of the process at the first & most important step itself.

                                                    Click on Image to view a blogpost on Data Lakes 

# 2: Perform Accurate Customer Due Diligence (CDD) by Creating a Single View of Compliance

Given the complexity of both compliance regimes, Big Data can help automate Customer Due Diligence (CDD)/ know your customer (KYC). This is important in helping improve both the tax collection and improve risk scoring process;

Big Data Analytics can also make the CDD highly automated as opposed to a manual laborious process especially around tax avoidance watch lists or suspicious accounts. One of the first steps here is to help create a 360 degree view of entity whether a high risk individual or a corporate entity for taxation. Doing so enables better account servicing as well as provides a holistic activity of fraud. Adding a machine learning process on this can help detect micro patterns of fraud across 10s of accounts across geographies. KYC programs are becoming increasingly daunting undertakings due to issues such as difficulty in identifying customers across multiple lines of business, and lack of a consistent view of customer bank product use and transaction activity. Further complicating these challenges is the advent of new risks such as digital currencies, new and unique payment methods, and continued variation in global data privacy regulation—all of which are resulting in enhanced regulatory scrutiny of banks readiness in this area. 

# 3: Predicting Tax Yields (or Liabilities) More Accurately

Leveraging the ingestion and predictive capabilities of a Big Data based platform, tax authorities can create a full picture of an individual or entity across all accounts or geographies. Internal Bank compliance teams can do the same with their client accounts. This can be used to predict better tax yield and compliance numbers.

# 4: Vastly Improve Tax Fraud & Evasion by improving Risk Scoring & Detection

The Banking industry has already begun leveraging sophisticated fraud detection strategies using socio-demographic data and taxpayers behavior. Big Data can be used to look at 10s of attributes per individual that were previously missed out before even for internal structured data. Adding a machine learning process on this can help detect micro patterns of fraud across 10s of accounts across geographies. 

Big Data based analytics also operate at scale thus eliminating manual and cumbersome spreadsheet based analysis – which bedevils the ability to quickly and visually detect tax fraud and evasion.

# 5: Improve the Auditability & Accuracy of Regulatory Reporting

Most businesses currently use ERP systems and engines within those to help with their taxation process. An example is to calculate VAT (Value Added Tax) obligations using the same.  These traditional tax engines and tools suffer from all the issues that plagued traditional tax data storage – operating on limited data sets which may or may not be accurate, manual processes and reconciliations lead to audits more regularly.  Big Data with its focus on data quality and overall governance can help remedy these issues. It can help improve the quality, timeliness and overall confidence in the reporting thus leading to lower number of audits.


Opening the door to the latest data storage and processing techniques can help taxation authorities introduce a higher degree of automation into their core business functions. This will allow them to reduce manual data operations, avoid costly reconciliation & reporting discrepancies – thus reducing costly audits.  This will enable them to focus on tasks such as strategic forecasting and better tax planning.


[1] Indirect Tax Compliance in an Era of Big Data – Gillis, McStocker and Percival, KPMG –   http://www.bna.com/indirect-tax-compliance-n17179927659/

How Big Data & Advanced Analytics can help Real Estate Investment Trusts (REITS)

                                                         Image Credit – Kiplinger’s


Real Estate Investment Trust’s (REITS) are financial companies that own various forms of commercial and residential real estate. These assets include office buildings, retail shopping centers, hospitals, warehouses, timberland and hotels etc. Real estate is growing quite nicely as a component of the global financial business. Given their focus on real estate investments, REITS have always occupied a specialized position in global finance.

Fundamentally, there are three types of REITS –

  1. Equity REITS which exclusively deal in acquiring, improving and selling properties with the aim of higher returns for their investors
  2. Mortgage REITS only buy and sell mortgages
  3. Hybrid REITS which do both #1 and #2 above

REITS have a reasonably straightforward business model – you take the yields from the properties you own and reinvest the funds to be able to pay your investors (a mandated 95% of dividends). Most of the traditional REIT business processes are well handled by conventional types of technology. However more and more REITs are being challenged to develop a compelling Big Data strategy that leverages their tremendous data assets. 

The Five Key Big Data Applications for REITS… 

Let us consider at the five key areas where advanced analytics built on a Big Data foundation can immensely help REITS.

#1 Property Acquisition Modeling 

REITS owners can leverage the rich datasets available around renters demographics, preferences, seasonality, economic conditions in specific markets to better guide capital decisions on acquiring property. This modeling needs to take into account land costs, development costs, fixture costs & any other sales and marketing costs to appeal to tenants. I’d like to call this macro business perspective. Also from a micro business perspective, being able to better study individual properties using a variety of widely available data – MLS listings for similar properties, foreclosures, closeness to retail establishments, work sites, building profiles, parking spaces, energy footprint etc can help them match tenants to their property holdings. All this is critical to getting their investment mix right to meet profitability targets.

                                  Click on the Image for a blogpost discussing Predictive Analytics in Capital Markets

#2 Portfolio Modeling 

REITS can leverage Big Data to perform more granular modeling of their MBS portfolios. As an example, they can feed in a lot more data into their existing models as discussed above. E.g.  Demographic data, macroeconomic factors et al.  

A simple scenario would be if Interest Rates go up by X basis points – what does that mean for my portfolio exposure, Default Rate, Cost Picture, Optimal times to buy certain MBS’s etc ?  REITS can then use that info to enter hedges etc to protect against any downside. Big Data can also help with a range of predictive modeling across all of the above areas as discussed below.  An example is to build a 360 degree view of a given investment portfolio.

                                                         Click on Image for a Customer 360 discussion 

#3 Risk Data Aggregation & Calculations 

The instruments underlying the portfolios themselves carry large amounts of credit & interest rate risk. Big Data is a fantastic platform for aggregating and calculating many kinds of risk exposures as the below link discuss in detail. 


                                            Click on Image for a discussion of Risk Data Aggregation and Measurement 


#4 Detect and Prevent Money Laundering (AML)

Due to the global nature of investment funds flowing into real estate, REITS are highly exposed to money laundering and sanctions risks. Whether or not REITS operate in high risk geographies (India,China, South America, Russia etc) or have complex holding structures – they need to file SAR (Suspicious Activity Reports) with the FinCEN.  There has always been a strong case to be made that shady foreign entities and individuals were laundering ill gotten proceeds to buy US real estate. In early 2016, the FinCEN began implementing Geographic Targeting Orders (GTOs). Title companies based in the United States are now required to clearly identify the real owners of either limited liability companies (LLCs) or any other partnerships, and other legal entities being used to purchase high end residential real estate using cash.

AML as a topic is covered exhaustively in the below series of blogposts (please click on image to open the first one).

                                                         Click on Image for a Deepdive on AML

#5 Smart Cities, Net New Investments and Property Management

In the future, REITS would want to invest in Smart Cities which are positioned to be leading urban centers offering mobility, green technology, personalized medicine, safe services, clean water, traffic management and other forward looking urban amenities. These Smart Cities target a new kind of client- upwardly mobile, technologically savvy, environment conscious millenials. According to RBC Capital Markets, Smart Cities presents a massive investment opportunity for REITS. Such investments could provide REITS offering income yields of around 10-20%. (Source – Ben Forster @ Schroeders).

Smart Cities will be created using a number of high end technologies such as IoT, AI, Virtual Reality, Device Meshes etc. By 2020, it is estimated that these buildings will be generating an enormous amount of data that needs to be stored and analyzed by landlords.

As the below graphic from Cisco attests, the ability to work with IoT data to analyze a range of these micro investment opportunities is a Big Data challenge.

The ongoing maintenance and continuous refurbishment of rental properties is a large portion of the business operation of a REIT. The availability of smart sensors and such IoT devices that can track air quality, home appliance malfunction etc can help greatly with preventive maintenance.


As can be seen from some of the above business areas, most REITS data needs require a holistic approach across the value chain (capital sourcing, investment decisions, portfolio management & operations). This approach spans various horizontal functions like Customer Segmentation, Property Acquisition, Risk, Finance and Business Operations.
The need of the hour for larger REITS is to move to a common model for data storage, model building and testing.  It is becoming increasingly obvious that Big Data can provide massive business opportunities for REITS.

Why Big Data & Advanced Analytics are Strategic Corporate Assets..

The industry is all about Digital now. The explosion in data storage and processing techniques promises to create new digital business opportunities across industries. Business Analytics concerns itself from deriving insights from data that is produced as a byproduct of business operations as well as external data that reflects customer insights. Due to their critical importance in decision making, Business Analytics is now a boardroom matter and not just one confined to the IT teams. My goal in this blogpost is to quickly introduce the analytics landscape before moving on to the significant value drivers that only Predictive Analytics can provide.

The Impact of Business Analytics…

The IDC “Worldwide Big Data and Analytics Spending Guide 2016”, predicts that the big data and business analytics market will grow from $130 billion by the end of this year to $203 billion by 2020[1] at a   compound annual growth rate (CAGR) of 11.7%. This exponential growth is being experienced across industry verticals such as banking & insurance, manufacturing, telecom, retail and healthcare.

Further, during the next four years, IDC finds that large enterprises with 500+ employees will be the main driver in big data and analytics investing, accounting for about $154 billion in revenue. The US will lead the market with around $95 billion in investments during the next four years – followed by Western Europe & the APAC region [1].

The two major kinds of Business Analytics…

When we discuss the broad topic of Business Analytics, it needs to be clarified that there are two major disciplines – Descriptive and Predictive. Industry analysts from Gartner & IDC etc. will tell you that one also needs to widen the definition to include Diagnostic and Prescriptive. Having worked in the industry for a few years, I can safely say that these can be subsumed into the above two major categories.

Let’s define the major kinds of industrial analytics at a high level –

Descriptive Analytics is commonly described as being retrospective in nature i.e “tell me what has already happened”. It covers a range of areas traditionally considered as BI (Business Intelligence). BI focuses on supporting operational business processes like customer onboarding, claims processing, loan qualification etc via dashboards, process metrics, KPI’s (Key Performance Indicators). It also supports a basic level of mathematical techniques for data analysis (such as trending & aggregation etc.) to infer intelligence from the same.  Business intelligence (BI) is a traditional & well established analytical domain that essentially takes a retrospective look at business data in systems of record. The goal of the Descriptive disciplines is to primarily look for macro or aggregate business trends across different aspects or dimensions such as time, product lines, business units & operating geographies.

  • Predictive Analytics is the forward looking branch of analytics which tries to predict the future based on information about the past. It describes what “can happen based on the patterns in data”. It covers areas like machine learning, data mining, statistics, data engineering & other advanced techniques such as text analytics, natural language processing, deep learning, neural networks etc. A more detailed primer on both along with detailed use cases are found here –

The Data Science Continuum in Financial Services..(3/3)

The two main domains of Analytics are complementary yet different…

Predictive Analytics does not intend to, nor will it, replace the BI domain but only adds significant sophisticated analytical capabilities enabling businesses to be able to do more with all the data they collect. It is not uncommon to find real world business projects leveraging both these analytical approaches.

However from an implementation standpoint, the only common area of both approaches is knowledge of the business and the sources of data in an organization. Most other things about them vary.

For instance, predictive approaches both augment & build on the BI paradigm by adding a “What could happen” dimension to the data.

The Descriptive Analytics/BI workflow…

BI projects tend to follow a largely structured process which has been well defined over the last 15-20 years. As the illustration below describes it, data produced in operational systems is subject to extraction, transformation and eventually is loaded into a data warehouse for consumption by visualization tools.


                                                                       The Descriptive Analysis Workflow 

Descriptive Analytics and BI add tremendous value to well defined use cases based on a retrospective look at data.

However, key challenges with this process are –

  1. the lack of a platform to standardize and centralize data feeds leads to data silos which cause all kinds of cost & data governance headaches across the landscape
  2. complying with regulatory initiatives (such as Anti Money Laundering or Solvency II etc.) needs the warehouse to handle varying types of data which is a huge challenge for most of the EDW technologies
  3. the ability to add new & highly granular fields to the data feeds in an agile manner requires extensive relational modeling upfront to handle newer kinds of schemas etc.

Big Data platforms have overcome past shortfalls in security and governance and are being used in BI projects at most organizations. An example of the usage of Hadoop in classic BI areas like Risk Data Aggregation are discussed in depth at the below blog.


That being said, BI projects tend to follow a largely structured process which has been well defined over the last 20 years. This space serves a large existing base of customers but the industry has been looking to Big Data as a way of constructing a central data processing platform which can help with the above issues.

BI projects are predicated on using an EDW (Enterprise Data Warehouse) and/or RDBMS (Relational Database Management System) approach to store & analyze the data. Both these kinds of data storage and processing technologies are legacy in terms of both the data formats they support (Row-Column based) as well as the types of data they can store (structured data).

Finally, these systems fall short of processing data volumes generated by digital workloads which tend to be loosely structured (e.g mobile application front ends, IoT devices like sensors or ATM machines or Point of Sale terminals), & which need business decisions to be made in near real time or in micro batches (e.g detect credit card fraud, suggest the next best action for a bank customer etc.) and increasingly cloud & API based to save on costs & to provide self-service.

That is where Predictive Approaches on Big Data platforms are beginning to shine and fill critical gaps.

The Predictive Analytics workflow…

Though the field of predictive analytics has been around for years – it is rapidly witnessing a rebirth with the advent of Big Data. Hadoop ecosystem projects are enabling the easy ingestion of massive quantities of data thus helping the business gather way more attributes about their customers and their preferences.


                                                                    The Predictive Analysis Workflow

The Predictive Analytics workflow always starts with a business problem in mind. Examples of these would be “A marketing project to detect which customers are likely to buy new products or services in the next six months based on their historical & real time product usage patterns – which are denoted by x, y or z characteristics” or “Detect real-time fraud in credit card transactions.”

In use cases like these, the goal of the data science process is to be able to segment & filter customers by corralling them into categories that enable easy ranking. Once this is done, the business is involved to setup easy and intuitive visualization to present the results.

A lot of times, business groups have a hard time explaining what they would like to see – both data and the visualization. In such cases, a prototype makes things easier from a requirements gathering standpoint.  Once the problem is defined, the data scientist/modeler identifies the raw data sources (both internal and external) which comprise the execution of the business challenge.  They spend a lot of time in the process of collating the data (from Oracle/SQL Server, DB2, Mainframes, Greenplum, Excel sheets, external datasets, etc.). The cleanup process involves fixing a lot of missing values, corrupted data elements, formatting fields that indicate time and date etc.

The data wrangling phase involves writing code to be able to join various data elements so that a single client’s complete dataset is gathered in the Data Lake from a raw features standpoint.  If more data is obtained as the development cycle is underway, the Data Science team has no option but to go back & redo the whole process. The modeling phase is where algorithms come in – these can be supervised or unsupervised. Feature engineering takes in business concepts & raw data features and creates predictive features from them. The Data Scientist takes the raw & engineered features and creates a model using a mix of various algorithms. Once the model has been repeatedly tested for accuracy and performance, it is typically deployed as a service. Models as a Service (MaaS) is the Data Science counterpart to Software as a Service. The MaaS takes in business variables (often hundreds of inputs) and provides as output business decisions/intelligence, measurements, & visualizations that augment decision support systems.

 How Predictive Analytics changes the game…

Predictive analytics can bring about transformative benefits in the following six ways.

  1. Predictive approaches can be applied to a much wider & richer variety of business challenges thus enabling an organization to achieve outcomes that were not really possible with the Descriptive variety. For instance, these use cases range from Digital Transformation to fraud detection to marketing analytics to IoT (Internet of things) across industry verticals. Predictive approaches are real-time and not just batch oriented like the Descriptive approaches.
  2. When deployed strategically – they can scale to enormous volumes of data and help reason over them reducing manual costs.  It can take on problems that can’t be managed manually because of the huge amount of data that must be processed.
  3. They can predict the results of complex business scenarios by being able to probabilistically predict different outcomes across thousands of variables by perceiving minute dependencies between them. An example is social graph analysis to understand which individuals in a given geography are committing fraud and if there is a ring operating
  4. They are vastly superior at handling fine grained data of manifold types than can be handled by the traditional approach or by manual processing. The predictive approach also encourages the integration of previously “dark” data as well as newer external sources of data.
  5. They can also suggest specific business actions(e.g. based on the above outcomes) by mining data for hitherto unknown patterns. The data science approach constantly keeps learning in order to increase its accuracy of decisions
  6. Data Monetization–  they can be used to interpret the mined data to discover solutions to business challenges and new business opportunities/models


[1] IDC Worldwide Semiannual Big Data and Business Analytics Spending Guide – Oct 2016 “Double-Digit Growth Forecast for the Worldwide Big Data and Business Analytics Market Through 2020 Led by Banking and Manufacturing Investments, According to IDC



Demystifying Digital – Why Customer 360 is the Foundational Digital Capability – ..(1/3)

The first post in this three part series on Digital Foundations introduces the concept of Customer 360 or Single View of Customer (SVC). We will discuss the need for & the definition of the SVC as part of the first step in any Digital Transformation endeavor. We will also discuss specific benefits from both a business & operational state that are enabled by SVC. The second post in the series introduces the concept of a Customer Journey. The third & final post will focus on a technical design & architecture needed to achieve both these capabilities.
In an era of exploding organizational touch points, how many companies can truly claim that they know & understand their customers, their needs & evolving preferences deeply and from a realtime perspective?  
How many companies can claim to keep up as a customers product & service usage matures and keep them engaged by cross selling new offerings. How many can accurately predict future revenue from a customer based on their current understanding of their profile?
The answer is not at all encouraging.
Across industries like Banking, Insurance, Telecom & Manufacturing, the ability to get a unified view of the customer & their journey is at the heart of the the enterprise ability to promote relevant offerings & detect customer dissatisfaction. 
  • Currently most industry players are woeful at putting together this comprehensive Single View of their Customers (SVC). Due to operational silos, each department possess a siloed & limited view of the customer across multiple channels. These views are typically inconsistent, lack synchronization with other departments & miss a high amount of potential cross-sell and up-sell opportunities.
  • The Customer Journey problem has been an age old issue which has gotten exponentially more complicated over the last five years as the staggering rise of mobile technology and the Internet of Things (IoT) have vastly increased the number of enterprise touch points that customers are exposed to in terms of being able to discover & purchase new products/services. In an OmniChannel world, an increasing number of transactions are being conducted online. In verticals like Retail and Banking, the number of online transactions approaches an average of 40%. Adding to the problem, more and more consumers are posting product reviews and feedback online. Companies thus need to react in realtime to piece together the source of consumer dissatisfaction.
Another large component of customer outreach are Marketing analytics & the ability to run effective campaigns to recruit customers.

The most common questions that a lot of enterprises fail to answer accurately are –

  1. Is the Customer happy with their overall relationship experience?
  2. What mode of contact do they prefer? And at what time? Can Customers be better targeted at these channels at those preferred times?
  3. What is the overall Customer Lifetime Value (CLV) or how much profit we are able to generate from this customer over their total lifetime?
  4. By understanding CLV across populations, can we leverage that to increase spend on marketing & sales for products that are resulting in higher customer value?
  5. How do we increase cross sell and up-sell of products & services?
  6. Does this customer fall into a certain natural segment and if so, how can we acquire most customers like them?
  7. Can different channels (Online, Mobile, IVR & POS) be synchronized ? Can Customers begin a transaction in one channel and complete it in any of the others without having to resubmit their data?

The first element in Digital is the Customer Centricity & it must naturally follow that a 360 degree view is a huge aspect of that.


                                       Illustration – Customer 360 view & its benefits

So what information is specifically contained in a Customer 360 –

The 360 degree view is a snapshot of the below types of data –

  • Customer’s Demographic information – Name, Address, Age etc
  • Length of the Customer-Enterprise relationship
  • Products and Services purchased overall
  • Preferred Channel & time of Contact
  • Marketing Campaigns the customer has responded to
  • Major Milestones in the Customers relationship
  • Ongoing activity – Open Orders, Deposits, Shipments, Customer Cases etc
  • Ongoing Customer Lifetime Value (CLV) Metrics and the Category of customer (Gold, Silver, Bronze etc)
  • Any Risk factors – Likelihood of Churn, Customer Mood Alert, Ongoing issues etc
  • Next Best Action for Customer

How Big Data technology can help..

Leveraging the ingestion and predictive capabilities of a Big Data based platform, banks can provide a user experience that rivals Facebook, Twitter or Google and provide a full picture of customer across all touch points.

Big Data enhances the Customer 360 capability in the following ways  –  

  1. Obtaining a realtime Single View of the Customer (typically a customer across multiple channels, product silos & geographies) across years of account history 
  2. Customer Segmentation by helping businesses understand customer segments down to the individual level as well as at a segment level
  3. Performing Customer sentiment analysis by combining internal organizational data, clickstream data, sentiment analysis with structured sales history to provide a clear view into consumer behavior.
  4. Product Recommendation engines which provide compelling personal product recommendations by mining realtime consumer sentiment, product affinity information with historical data.
  5. Market Basket Analysis, observing consumer purchase history and enriching this data with social media, web activity, and community sentiment regarding past purchase and future buying trends.

Customer 360 can help improve the following operational metrics of a Retailer or a Bank or a Telecom immensely.

  1. Cost to Income ratio; Customers Acquired per FTE; Sales and service FTE’s (as percentage of total FTE’s), New Accounts Per Sales FTE etc
  2.  Sales conversion rates across channels, Decreased customer attrition rates etc.
  3. Improved Net promotor scores (NPS), referral based sales etc

Customer 360 is thus basic digital capability every organization needs to offer their customers, partners & internal stakeholders. This implies a re-architecture of both data management and business processes automation.

The next post will discuss the second critical component of Digital Transformation – the Customer Journey.

Embedding A Culture of Business Analytics into the Enterprise DNA..

IT driven business transformation is always bound to fail” – Amber Storey, Sr Manager, Ernst & Young

The value of Big Data driven Analytics is no longer in question both from a customer as well as an enterprise standpoint. Lack of investment in an analytic strategy has the potential to impact shareholder value negatively.  Business Boards and CXOs are now concerned about their overall levels and maturity of investments in terms of business value – i.e increasing sales, driving down business & IT costs & helping create new business models. It is thus an increasingly accurate argument that smart applications & ecosystems built around them will increasingly dictate enterprise success.

Such examples among forward looking organizations abound across industries. These range from realtime analytics in manufacturing using IoT data streams across the supply chain, the use of natural language processing to drive patient care decisions in healthcare, more accurate insurance fraud detection & driving Digital interactions in Retail Banking etc to quote a few. 

However , most global organizations currently adopt a fairly tactical approach to ensuring the delivery of of traditional business intelligence (BI) and predictive analytics to their application platforms.  This departmental is quite suboptimal in ways as scaleable data driven decisions & culture not only empower decision-makers with up to date and realtime information but also help them develop long term insights into how globally diversified business operations are performing.  Scale is the key word here due to rapidly changing customer trends, partner, supply chain realities & regulatory mandates.

Scale implies speed of learning,  business agility across the organization in terms of having globally diversified operations turn on a dime thus ensuring that the business feels empowered.

A quick introduction to Business (Descriptive & Predictive) Analytics –

Business intelligence (BI) is a traditional & well established analytical domain that essentially takes a retrospective look at business data in systems of record. The goal for BI is to primarily look for macro or aggregate business trends across different aspects or dimensions such as time, product lines, business unites & operating geographies.

BI is primarily concerned with “What happened and what trends exist in the business based on historical data?“. The typical use cases for BI include budgeting, business forecasts, reporting & key performance indicators (KPI).

On the other hand, Predictive Analytics (a subset of Data Science) augments & builds on the BI paradigm by adding a “What could happen” dimension to the data in terms of –

  • being able to probabilistically predict different business scenarios across thousands of variables
  • suggesting specific business actions based on the above outcomes

Predictive Analytics does not intend to nor will it replace the BI domain but only adds significant business capabilities that lead to overall business success. It is not uncommon to find real world business projects leveraging both these analytical approaches.

Creating an industrial approach to analytics – 

Strategic business projects typically begin imbibing a BI/Predictive Analytics based approach as an afterthought to the other aspects of system architecture and buildout. This dated approach then ensures that analytics becomes external to and eventually operating in a reactive mode in the operation of business system.

Having said that, one does need to recognize that an industrial approach to analytics is a complex endeavor that depends on how an organization tackles the convergence of the below approaches –

  1. Organizational Structure
  2. New Age Technology 
  3. A Platforms Mindset
  4. Culture


        Illustration – Embedding A Culture of Business Analytics into the Enterprise DNA..

Lets discuss them briefly – 

Organizational Structure – The historical approach has been to primarily staff analytics teams as a standalone division often reporting to a CIO. This team has responsibility for both the business intelligence as well as some silo of a data strategy. Such a piecemeal approach to predictive analytics ensures that business & application teams adopt a “throw it over the wall” mentality over time.

So what needs to be done? 

In the Digital Age, enterprises should look to centralize both data management as well as the governance of analytics as core business capabilities. I suggest a hybrid organizational structure where a Center of Excellence (COE) is created which reports to the office of the Chief Data Officer (CDO) as well as individual business analytic leaders within the lines of business themselves.

 This should be done to ensure that three specific areas are adequately tackled using a centralized approach- 

  • Investing in creating a data & analytics roadmap by creating a center of excellence (COE)
  • Setting appropriate business milestones with “lines of business” value drivers built into a robust ROI model
  • Managing Risk across the enterprise with detailed scenario planning

New Age Technology –

The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just provide engaging visualization but also to personalize services clients care about across multiple modes of interaction. Mobile applications first begun forcing the need for enterprise to begin supporting multiple channels of interaction with their consumers. We have seen how how exploding data generation across the global economy has become a clear & present business & IT phenomenon. Data volumes are rapidly expanding across industries. However, while the production of data itself that has increased but it is also driving the need for organizations to derive business value from it. This calls for the collection & curation of data from dynamic,  and highly distributed sources such as consumer transactions, B2B interactions, machines such as ATM’s & geo location devices, click streams, social media feeds, server & application log files and multimedia content such as videos etc – using Big Data.

Cloud Computing is the ideal platform to provide the business with self service as well as rapid provisioning of business analytics. Every new application designed needs to be cloud native from the get go.

The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just provide engaging Visualization but also to personalize services clients care about across multiple modes of interaction. Mobile applications first begun forcing the need for enterprise to begin supporting multiple channels of interaction with their consumers. For example Banking now requires an ability to engage consumers in a seamless experience across an average of four to five channels – Mobile, eBanking, Call Center, Kiosk etc.

A Platforms Mindset – 

As opposed to building standalone or one-off business applications, a Platform Mindset is a more holistic approach capable of producing higher revenues. Platforms abound in the webscale world at shops like Apple, Facebook & Google etc. Applications are constructed like lego blocks  and they reuse customer & interaction data to drive cross sell and up sell among different product lines. The key components here are to ensure that one starts off with products with high customer attachment & retention. While increasing brand value, it is key to ensure that customers & partners can also collaborate in the improvements in the various applications hosted on top of the platform.

Culture – Business value fueled by analytics is only possible if the entire organization operates on an agile basis in order to collaborate across the value chain. Cross functional teams across new product development, customer acquisition & retention, IT Ops, legal & compliance must collaborate in short work cycles to close the traditional business & IT innovation gap. Methodologies like DevOps who’s chief goal is to close the long-standing gap between the engineers who develop and test IT capability and the organizations that are responsible for deploying and maintaining IT operations – must be adopted. Using traditional app dev methodologies, it can take months to design, test and deploy software. No business today has that much time—especially in the age of IT consumerization and end users accustomed to smart phone apps that are updated daily. The focus now is on rapidly developing business applications to stay ahead of competitors that can better harness Big Data’s amazing business capabilities.


Enterprise wide business analytic approaches designed around the four key prongs  (Structure, Culture, Technology & Platforms)   will create immense operational efficiency, better business models, increased relevance and ultimately drive revenues. These will separate the visionaries, leaders from the laggards in the years to come.

Data Lakes power the future of Industrial Analytics..(1/4)

The first post in this four part series on Data lakes will focus on the business reasons to create one. The second post will delve deeper into the technology considerations & choices around data ingest & processing in the lake to satisfy myriad business requirements. The third will tackle the critical topic of metadata management, data cleanliness & governance. The fourth & final post in the series will focus on the business justification to build out a Big Data Center of Excellence (COE).

Business owners at the C level are saying, ‘Hey guys, look. It’s no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop.’”- Mike Lang, CEO of Revelytix

The onset of Digital Architectures in enterprise businesses implies the ability to drive continuous online interactions with global consumers/customers/clients or patients. The goal is not just provide engaging visualization but also to personalize services clients care about across multiple modes of interaction. Mobile applications first begun forcing the need for enterprise to begin supporting multiple channels of interaction with their consumers. For example Banking now requires an ability to engage consumers in a seamless experience across an average of four to five channels – Mobile, eBanking, Call Center, Kiosk etc. Healthcare is a close second where caregivers expect patient, medication & disease data at their fingertips with a few finger swipes on an iPad app.

Big Data has been the chief catalyst in this disruption. The Data Lake architectural & deployment pattern makes it possible to first store all this data & then enables the panoply of Hadoop ecosystem projects & technologies to operate on it to produce business results.

Let us consider a few of the major industry verticals and the sheer data variety that players in these areas commonly possess – 

The Healthcare & Life Sciences industry possess some of the most diverse data across the spectrum ranging from – 

  • Structured Clinical data e.g. Patient ADT information
  • Free hand notes
  • Patient Insurance information
  • Device Telemetry 
  • Medication data
  • Patient Trial Data
  • Medical Images – e.g. CAT Scans, MRIs, CT images etc

The Manufacturing industry players are leveraging the below datasets and many others to derive new insights in a highly process oriented industry-

  • Supply chain data
  • Demand data
  • Pricing data
  • Operational data from the shop floor 
  • Sensor & telemetry data 
  • Sales campaign data

Data In Banking– Corporate IT organizations in the financial industry have been tackling data challenges due to strict silo based approaches that inhibit data agility for many years now.
Consider some of the traditional sources of data in banking –

  • Customer Account data e.g. Names, Demographics, Linked Accounts etc
  • Core Banking Data
  • Transaction Data which captures the low level details of every transaction (e.g debit, credit, transfer, credit card usage etc)
  • Wire & Payment Data
  • Trade & Position Data
  • General Ledger Data e.g AP (accounts payable), AR (accounts receivable), cash management & purchasing information etc.
  • Data from other systems supporting banking reporting functions.

Industries have changed around us since the advent of relational databases & enterprise data warehouses. Relational Databases (RDBMS) & Enterprise Data Warehouses (EDW) were built with very different purposes in mind. RDBMS systems excel at online transaction processing (OLTP) use cases where massive volumes of structured data needs to be processed quickly. EDW’s on the other hand perform online analytical processing functions (OLAP) where data extracts are taken from OLTP systems, loaded & sliced in different ways to . Both these kinds of systems are not simply suited to handle not just immense volumes of data but also highly variable structures of data.


Let us consider the main reasons why legacy data storage & processing techniques are unsuited to new business realities of today.

  • Legacy data technology enforces a vertical scaling method that is sorely unsuited to handling massive volumes of data in a scale up/scale down manner
  • The structure of the data needs to be modeled in a paradigm called ’schema on write’ which sorely inhibits time to market for new business projects
  • Traditional data systems suffer bottlenecks when large amounts of high variety data are processed using them 
  • Limits in the types of analytics that could be performed. In industries like Retail, Financial Services & Telecommunications, enterprise need to build detailed models of customers accounts to predict their overall service level satisfaction in realtime. These models are predictive in nature and use data science techniques as an integral component. The higher volumes of data along with attribute richness that can be provided to them (e.g. transaction data, social network data, transcribed customer call data) ensures that the models are highly accurate & can provide an enormous amount of value to the business. Legacy systems are not a great fit here.

Given all of the above data complexity and the need to adopt agile analytical methods  – what is the first step that enterprises must adopt? 

The answer is the adoption of the Data Lake as an overarching data architecture pattern. Lets define the term first. A data lake is two things – a small or massive data storage repository and a data processing engine. A data lake provides “massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs“.[1] Data Lake are created to ingest, transform, process, analyze & finally archive large amounts of any kind of data – structured, semistructured and unstructured data.


                                  Illustration – The Data Lake Architecture Pattern

What Big Data brings to the equation beyond it’s strength in data ingest & processing is a unified architecture. For instance, MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Apache Hadoop YARN opened Hadoop to other data processing engines (e.g. Apache Spark/Storm) that can now run alongside existing MapReduce jobs to process data in many different ways at the same time. The result is that ANY kind of application processing can be run inside a Hadoop runtime – batch, realtime, interactive or streaming.

Visualization  – Mobile applications first begun forcing the need for enterprise to begin supporting multiple channels of interaction with their consumers. For example Banking now requires an ability to engage consumers in a seamless experience across an average of four to five channels – Mobile, eBanking, Call Center, Kiosk etc. The average enterprise user is also familiar with BYOD in the age of self service. The Digital Mesh only exacerbates this gap in user experiences as information consumers navigate applications as they consume services across a mesh that is both multi-channel as well as provides Customer 360 across all these engagement points.While information management technology has grown at a blistering pace, the human ability to process and comprehend numerical data has not. Applications being developed in 2016 are beginning to adopt intelligent visualization approaches that are easy to use,highly interactive and enable the user to manipulate corporate & business data using their fingertips – much like an iPad app. Tools such as intelligent dashboards, scorecards, mashups etc are helping change a visualization paradigms that were based on histograms, pie charts and tons of numbers. Big Data improvements in data lineage, quality are greatly helping the visualization space.

The Final Word

Specifically, Data Lake architectural pattern provide the following benefits – 

The ability to store enormous amounts of data with a high degree of agility & low cost: The Schema On Read architecture makes it trivial to ingest any kind of raw data into Hadoop in a manner that preserves it’s structure.  Business analysts can then explore  this data and then defined a schema to suit the needs of their particular application.

The ability to run any kind of Analytics on the data: Hadoop supports multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set.  You are only restricted by your use case.

the ability to analyze, process & archive data while dramatically cutting cost : Since Hadoop was designed to work on low-cost commodity servers which have direct attached storage – it helps dramatically lower the overall cost of storage.  Thus enterprises are able to retain source data for long periods, thus providing business applications with far greater historical context.

The ability to augment & optimize Data Warehouses: Data lakes & Hadoop technology are not a ‘rip & replace’ proposition. While they provide a much lower cost environment than data warehouses, they can also be used as the compute layer to augment these systems.  Data can be stored, extracted and transformed in Hadoop. Then a subset of the data i.e the results are loaded into the data warehouse. This enables the EDW to leverage compute cycles and storage to perform truly high value analytics.

The next post of the series will dive deeper into the architectural choices one needs to make while creating a high fidelity & business centric enterprise data lake.

References – 

[1] https://en.wikipedia.org/wiki/Data_lake

The Data Science Continuum in Financial Services..(3/3)

In God we trust. All others must bring data.” – DrEdwards Deming, statistician, professor, author, lecturer, and consultant.

The first post in this three part series described key ways in which innovative applications of data science are slowly changing a somewhat insular banking & financial services industry . The second post then delineated key business use cases enabled by a data driven or data native  approach. The final post will examine foundational Data Science tasks & techniques that are commonly employed to get value from data with financial industry examples. We will round off the discussion with recommendations for industry CXOs.


The Need for Data Science –

It is no surprise that Big Data approaches were first invented & then refined in web scale businesses at Google, Yahoo, eBay, Facebook and Amazon etc. These web properties offer highly user friendly, contextual & mobile native application platforms  which produce a large amount of complex and multi-varied data from consumers,sensors and other telemetry devices. All this data that is constantly analyzed to drive higher rates of application adoption thus driving a virtuous cycle. We have discussed the (now) augmented capability of financial organizations to acquire, store and process large volumes of data by leveraging the HDFS (Hadoop Distributed Filesystem) running on commodity (x86) hardware.

One of the chief reasons that these webscale shops adopted Big Data is the ability to store the entire data set in Hadoop to build more accurate predictive models. The ability store thousands of  attributes at a much finer grain over a historical amount of time instead of just depending on a statistically significant sample is a significant gain over legacy data technology.

Every year Moore’s Law keeps driving the costs of raw data storage down. At the same time, compute technologies such as MapReduce, Tez, Storm and Spark have enabled the organization and analysis of Big Data at scale. The convergence of cost effective storage and scalable processing allows us to extract richer insights from data. These insights need to then be operationalized@scale to provide business value as the use cases in the last post highlighted @ http://www.vamsitalkstech.com/?p=1582

The differences between Descriptive & Predictive Analytics –

Business intelligence (BI) is a traditional & well established analytical domain that essentially takes a retrospective look at business data in systems of record. The goal for BI is to primarily look for macro or aggregate business trends across different aspects or dimensions such as time, product lines, business unites & operating geographies.

BI is primarily concerned with “What happened and what trends exist in the business based on historical data?“. The typical use cases for BI include budgeting, business forecasts, reporting & key performance indicators (KPI).

On the other hand, Predictive Analytics (a subset of Data Science) augments & builds on the BI paradigm by adding a “What could happen” dimension to the data in terms of –

  • being able to probabilistically predict different business scenarios across thousands of variables
  • suggesting specific business actions based on the above outcomes

Predictive Analytics does not intend to nor will it replace the BI domain but only adds significant business capabilities that lead to overall business success. It is not uncommon to find real world business projects leveraging both these analytical approaches.

Data Science  –

So, what exactly is Data Science ?

Data Science is an umbrella concept that refers to the process of extracting business patterns from large volumes of both structured, semi structured and unstructured data. Data Science is the key ingredient in enabling a predictive approach to the business.

Some of the key aspects that follow are  –

  1. Data Science is not just about applying analytics to massive volumes of data. It is also about exploring the patterns,associations & interrelationships of thousands of variables within the data. It does so by adopting an algorithmic approach to gleaning the business insights that are embedded in the data.
  2. Data Science is a standalone discipline that has spawned its own set of platforms, tools and processes across it’s lifecycle.
  3. Data science also aids in the construction of software applications & platforms to utilize such insights in a business context.  This involves the art of discovering data insights combined with the science of operationalizing them at scale. The word ‘scale’ is key. Any algorithm, model or deployment paradigm should support an expanding number of users without the need for unreasonable manual intervention as time goes on.
  4. A data scientist uses a combination of machine learning, statistics, visualization, and computer science to extract valuable business insights hiding in data and builds operational systems to deliver that value.
  5. The machine learning components are classified into two categories: ‘supervised’ and ‘unsupervised’ learning. In supervised learning, the constructed model defines the effect one set the inputs on the outputs through the causal chain.In unsupervised learning, the ouputs are affected by so called latent variables. It is also possible to have a hybrid approach to certain types of mining tasks.
  6. Strategic business projects typically begin leveraging a Data Science based approach to derive business value. This approach then becomes integral and eventually core to the design and architecture of such a business system.
  7. Contrary to what some of the above may imply, Data Science is a cross-functional discipline and not just the domain of Phd’s. A data scientist is part statistician, part developer and part business strategist.
  8. Working in small self sufficient teams, the Data Scientist collaborates with extended areas which includes visualization specialists, developers, business analysts, data engineers, applied scientists, architects, LOB owners and DevOp. The success of data science projects often relies on the communication, collaboration, and interaction that takes place with the extended team, both internally and possibly externally to their organization.
  9. It needs to be clarified that not every business project is a fit for a Data science approach. The criteria that must be employed to understand if such an advanced approach is called for include if the business initiative needs to provide knowledge based decisions (beyond the classical rules engine/ expert systems based approaches), deal with volumes of relevant data, a rapidly changing business climate, & finally where scale is required beyond what can be supplied using human analysts.
  10. Indeed any project where hugely improved access to information & realtime analytics for customers, analysts (and other stakeholders) is a must for the business – is fertile ground for Data Science.

Algorithms & Models

The word ‘model‘ is highly overloaded and means different things to different IT specialities e.g. RDBMS models imply data schemas, statistical models are built by statisticians etc. However, it can safely be said that models are representations of a business construct or a business situation.

Data mining algorithms are used to create models from data.

To create a data science model, the data mining algorithm looks for key patterns in data provided. The results of this analysis are to define the best parameters to create the model. Once identified, these parameters are applied across the entire data set to extract actionable patterns and detailed statistics.

The model itself can take various forms ranging from a set of customers across clusters, a revenue forecasting model, a set of fraud detection rules for credit cards or a decision tree that predicts outcomes based on specific criteria.

Common Data Mining Tasks –

There are many different kinds of data mining algorithms but all of these address a few fundamental types of tasks. The pouplar ones are listed below along with relevant examples:

  • Classification & Class Probability Estimation– For a given set of data, predict for each individual in a population, a discrete set of classes that this individual belongs to. An example classification is – “For all wealth management clients in a given population, who are most likely to respond to an offer to move to a higher segment”. Common techniques used in classification include decision trees, bayesian models, k-nearest neighbors, induction rules etc. Class Probability Estimation (CPE) is a closely related concept in which a scoring model is created to predict the likelihood that an individual would belong to that class.
  • Clustering is an unsupervised technique used to find classes or segments of populations within a larger dataset without being driven by any specific purpose. For example – “What are the natural groups our customers fall into?”. The most popular use of clustering techniques is to identify clusters to use in activities like market segmentation.A common algorithm used here is k-means clustering.
  • Market basket analysis  is commonly used to find out associations between entities based on transactions that involve them. E.g Recommendation engines which use affinity grouping.
  • Regression algorithms aim to characterize the normal or typical behavior of an individual or group within a larger population. It is frequently used in anomaly detection systems such as those that detect AML (Anti Money Laundering) and Credit Card fraud.
  • Profiling algorithms divide data into groups, or clusters, of items that have similar properties.
  • Causal Modeling algorithms attempt to find out what business events influence others.

There is no reason that one should be limited to one of the above techniques while forming a solution. An example is to use one algorithm (say clustering) to determine the natural groups in the data, and then to apply regression to predict a specific outcome based on that data. Another example is to use multiple algorithms within a single business project to perform related but separate tasks. Ex – Using regression to create financial reporting forecasts, and then using a neural network algorithm to perform a deep analysis of the factors that influence product adoption.

The Data Science Process  –

A general process framework for a typical Data science project is depicted below. The process flow depicted below suggests a sequential waterfall but allows for Agile/DevOps loops in the core analysis & feedback phases. The process is also not a virtual one sided pipeline but also allows for continuous improvements.


                                           Illustration: The Data Science Process 

  1. The central pool of data that hosts all the tiers of data processing in the above illustration is called the Data Lake. The Data Lake enables two key advantages – the ability to collect cross business unit data so that it can be sampled/explored at will & the ability to perform any kind of data access pattern across a shared data infrastructure: batch,  interactive, search, in-memory and custom etc.
  2. The Data science process begins with a clear and contextual understanding of the granular business questions that need to be answered from the real world dataset. The Data scientist needs to be trained in the nuances of the business to achieve the appropriate outcome. E.g. Detecting customer churn, predicting fraudulent credit card transactions in the credit cards space, predicting which customers in the Retail Bank are likely to churn over the next few months based on their usage patterns etc.
  3. Once this is known, relevant data needs to be collected from the real world. These sources in Banking range from –
    1. Customer Account data e.g. Names,Demographics, Linked Accounts etc
    2. Transaction Data which captures the low level details of every transaction (e.g debit, credit, transfer, credit card usage etc),
    3. Wire & Payment Data,
    4. Trade & Position Data,
    5. General Ledger Data and Data from other systems supporting core banking functions.
    6. Unstructured data. E.g social media feeds, server logs, clickstream data & mobile application data etc.
  4. Following the planning stage, Data Acquisition follows an  iterative process of acquiring data from the actual sources by creating appropriate loaders choosing appropriate technology components. E.g. Apache NiFi, Kafka, Sqoop, Flume, HDFS API, Java etc
  5. The next step is to perform Data Cleansing. Here the goal is to look for gaps in the data  (given the business context), ensuring that the dataset is valid with no missing values, consistent in layout and as fresh as possible from a temporal standpoint. This phase also involves fixing any obvious quality problems involving range or table driven data. The purpose at this stage is also to facilitate & perform appropriate data governance.
  6. Exploratory Data Analysis (EDA) helps with trial & error analysis of data. This is a phase where plots and graphs are used to systematically go through the data. The importance of this cannot be overstated as it provide the Data scientist and the business with a flavor of the data.
  7. Data Analysis: Generation of features or attributes that will be part of the model. This is the step of the process where actual data mining takes place leveraging models built using the above algorithms.

Within each of the above there exist further iterative steps within the Data Cleansing and Data Analysis stages.

Once the models have been tested and refined to the satisfaction of the business and their performance been put through a rigorous performance test phase, they are deployed into production. Once deployed, these are constantly refined based on end user and system feedback.

The Big Data ecosystem (consisting of tools such as Pig, Scalding, Hive, Spark and MapReduce etc) enable sea changes of improvement across the entire Data science lifecycle from data acquisition to data processing to data analysis. The ability of Big Data/Hadoop to unify all data storage in one place which renders data more accessible for modeling. Hadoop also scales up machine learning analysis due to it’s inbuilt paralleism which adds a tremendous amount of value both in terms of training multiple parallel models to improve their efficacy. The ability to collect a lot of data as opposed to small samples also helps greatly.

Recommendations – 

Developing a strategic mindset to Data science and predictive analytics should be a board level concern. This entails

  • To begin with – ensuring buy in & commitment in the form of funding at a Senior Management level. This support needs to extend across the entire lifecycle depicted above (from identifying business use cases).
  • Extensive but realistic ROI (Return On Investment) models built during due diligence with periodic updates for executive stakeholders
  • On a similar note, ensuring buy in using a strategy of co-opting & alignment with Quants and different high potential areas of the business (as covered in the usecases in the last blog)
  • Identifying leaders within the organization who can not only lead important projects but also create compelling content to evangelize the use of predictive analytics
  • Begin to tactically bake in or embed data science capabilities across different lines of business and horizontal IT
  • Slowly moving adoption to the Risk, Fraud, Cybersecurity and Compliance teams as part of the second wave. This is critical in ensuring that analysts across these areas move from a spreadsheet intensive model to adopting advanced statistical techniques
  • Creating a Predictive Analytics COE (Center of Excellence) that enable cross pollination of ideas across the fields of statistical modeling, data mining, text analytics, and Big Data technology
  • Informing the regulatory authorities of one’s intentions to leverage data science across the spectrum of operations
  • Ensuring that issues related to data privacy,audit & compliance have been given a great deal of forethought
  • Identifying  & developing human skills in toolsets (across open source and closed source) that facilitate adapting to data lake based architectures. A large part of this is to organically grow the talent pool by instituting a college recruitment process

While this ends the current series on Data Science in financial services, it is my intention to explore each of the above Data Mining techniques to a greater degree of depth as applied to specific business situations in 2016 & beyond. This being said, we will take a look at another pressing business & strategic concern – Cybersecurity in Banking – in the next series.