My Last Post for 2015 .. A Series of Webinars on Issues Facing the Financial Services Industry

First off, I want to take a quick moment to wish each and every one of my readers, customers, colleagues (both past & present) & collaborators a very Merry Christmas, Happy Holidays and a Very Happy New Year! I feel completely blessed to have had the opportunity to meet, interact with & learn from you all around the world this year. These are tremendous times for us all in technology, open source and the industry in general. My goal for the blog in 2015 was to cover a lot of interesting industry themes (and thank you for all your indulgence), and I feel I have been able to accomplish that to some extent. 2016 should see the continued exploration of the motley crew of technologies remaking the world – Virtual Currencies, Containers, DevOps, Cloud Computing, Open Source Mobility, PaaS, Spark & Data Science – areas where I am spending a lot of my time with clients on several keystone & marquee initiatives. I have learnt so much from all of you readers-cum-teachers that I owe everyone a huge Thank You. I am forever thankful for the tons of calls, comments & camaraderie at in-person events – Thank You. Finally, a huge Thank You to my employers at Hortonworks for their constant support & insistence on keeping things jargon & marketing free.


I have spent a good bit of my time in November and December 2015 doing public-facing webinars on a host of issues the blog has spent time on. I am posting the links to all of these as this week's update. They are all hosted on BrightTALK and can be accessed at no cost. Each is focused on a business imperative & a deep dive into suitable technology approaches meant to solve these challenges in a comprehensive manner. Happy Viewing!!

How Big Data is Disrupting Financial Services 

https://www.brighttalk.com/webcast/9573/177855

There are very few industries that are as data-centric as banking and financial services. Every interaction that a client or partner system has with a banking institution produces actionable data that has potential business value associated with it. Areas across the FS spectrum – Retail Banking, Wealth Management, Stock Exchanges, Consumer Banking and Capital Markets – have historically had multiple sources and silos of data across the front, middle and back offices. Big Data technology (led by Hadoop) is changing the landscape in areas as diverse as Risk Management, AML Compliance, Fraud Detection, Cyber Security and Customer Analytics. In this webinar, we will explore some of these global themes and discuss specific use cases & business areas across which the largest global banks are leveraging Big Data.

Better Financial Risk Management with Hadoop

https://www.brighttalk.com/webcast/9573/177857

Improper and inadequate management of a major kind of financial risk – liquidity risk – was a major factor in the series of events in 2007 and 2008 that resulted in the failure of major investment banks such as Lehman Brothers and Bear Stearns, culminating in a full-blown liquidity crisis. Inadequate IT systems in terms of data management, reporting and agile methodologies are widely blamed for this lack of transparency into risk accounting – that critical function which makes all the difference between well and poorly managed banking conglomerates. Indeed, risk management is not just a defensive business imperative; the best managed banks understand their holistic risks much better and can deploy their capital to obtain the best possible business outcomes. Since 2008, a raft of regulation has been passed by global banking regulators like the BCBS, the US Fed and others. These include the Basel III regulations, the BCBS 239 principles on Risk Data Aggregation, the Dodd-Frank Act, the Volcker Rule, CCAR etc. Leading global banks are now leveraging Apache Hadoop and its ecosystem of projects to create holistic data management and governance architectures in support of efficient risk management across all the above areas. This webinar will discuss the business issues, technology architectures and best practices from an industry insider's perspective.

Anti Money Laundering (AML) Compliance done better with Hortonworks/Hadoop

https://www.brighttalk.com/webcast/9573/177859

In 2015, it goes without saying that banking is an increasingly complex as well as global business. Leading banks now generate a large amount of revenue in global markets, and this is generally true of all major worldwide banks. Financial crime is a huge concern for banking institutions given the complexity of the products they offer their millions of customers, large global branch networks and operations spanning the spectrum of financial services. The BSA (Bank Secrecy Act) requires U.S. financial institutions to assist U.S. government agencies in detecting and preventing money laundering. Specifically, the act requires financial institutions to keep records of cash purchases of negotiable instruments, to file reports of cash transactions exceeding $10,000 (daily aggregate amount), and to report suspicious activity that might signify money laundering, tax evasion, or other criminal activities. After the terrorist attacks of 2001, the US Patriot Act was passed into law by Congress. The Patriot Act augments the BSA with Know Your Customer (KYC) legislation, which mandates that banking institutions be completely aware of their customers' identities and transaction patterns with a view to monitoring account activity. This webinar discusses the key business issues and technology considerations in moving your AML regime to a Hadoop-based infrastructure and the key benefits in doing so.

Payment Card Fraud Detection

https://www.brighttalk.com/webcast/9573/177867

Credit & payment card fraud has mushroomed into a massive challenge for consumers, financial institutions, regulators and law enforcement. As the accessibility and usage of credit cards burgeons and transaction volumes increase, banks are losing tens of billions of dollars on an annual basis to fraudsters. The Nilson Report (as of 2013) estimated that of every dollar transacted, about 5 cents are lost to fraud, which makes it a massive drain on the overall system. Another pernicious side effect of payment card fraud is identity theft. Banks are increasingly turning to Hadoop & predictive analytics to predict and prevent fraud in real time. This webinar will discuss the overall business problem as well as the reasons Hadoop is becoming the platform of choice in tackling this challenge. We will finally discuss a real-world candidate architecture that illustrates the overall technology & data governance approach.

The Digital Disruption in Retail Banking 

https://www.brighttalk.com/webcast/9573/177869

Every day one hears more about how Big Data ecosystem technologies are helping create incremental innovation & disruption in any given industry vertical – be it in exciting new cross-industry areas like the Internet of Things (IoT) or in reasonably settled industries like Banking, Manufacturing & Healthcare. Big Data platforms, powered by open source Hadoop, can economically store large volumes of structured, unstructured or semi-structured data & help process it at scale, thus enabling predictive and actionable intelligence. The Digital trend in banking is now driving well established banking organizations to respond to the disruption being caused by emerging FinTechs. Given that a higher percentage of banking interactions are now being driven from digital channels – how are banks expected to respond? One major way is to turn their data assets into actionable nuggets. In this webinar, we will examine these trends from a business and a technology standpoint.


Use Hortonworks Data Flow (HDF) To Connect The Dots In Financial Services..(3/3)

This is the final blogpost in our three part series on Enterprise Dataflow Management and its applicability in Financial Services. This post discusses common business drivers and use cases to a good degree of depth.

As 2015 draws to a close, the Financial Services industry seems to be in the midst of a perfect storm – a storm that began blowing up on the horizon a few years ago. For an industry that has always enjoyed relatively high barriers to entry & safe incumbency – due to factors like highly diversified operations, access to a huge deposit base etc. – new-age Fintechs and other nimble competitors have begun upending business dynamics across many of the domains that make up the industry: Retail & Consumer Banking, Capital Markets, Wealth Management et al.

Fintechs are capturing market share using a mix of innovative technology, crowdfunding, digital wallets & currencies to create new products & services – all aimed at disintermediating & disrupting existing value chains. It is also interesting that incumbent firms still continue to spend billions of dollars on technology projects to maintain legacy platforms as well as to create lateral (& tactical) innovation.

However, large to medium sized banks (defined here as those with more than 5 million customer accounts), which have built up massive economies of scale over the years across a large geographical area, do hold a massive first mover advantage. This is due to a range of factors: well defined operating models, highly established financial products across their large (and largely loyal & sticky) customer bases, a wide network of branches & ATMs, and rich troves of data on customer transactions & demographics. However, it is not enough to just possess the data. Banks must be able to drive change through legacy thinking and infrastructures even as the entire industry struggles to adapt to a major new segment – millennial customers – who increasingly use mobile devices and demand more contextual services as well as a seamless, unified banking experience, akin to what they commonly experience on the internet at web properties like Facebook, Amazon, Google or Yahoo.

What are some of the exciting new business models that the new entrants in Fintech are pioneering at the expense of the traditional bank?

  • Offering targeted banking services to technology savvy customers at a fraction of the cost e.g. Mint.com in retirement planning in the USA
  • Lowering cost of access to discrete financial services for business customers in areas like foreign exchange transactions & payments e.g. Currency Cloud
  • Differentiating services like peer to peer lending among small businesses and individuals e.g. Lending Club
  • Providing convenience through the use of handheld devices like iPhones e.g. Square Payments

The core capability needed is the ability to deliver realtime services via Predictive Analytics @ low latency & massive scale. Predictive Analytics provides the most important capability in terms of facilitating speedier deposits, payments, risk mitigation, compliance transaction monitoring and fraud detection. This ability to improve customer engagement and retention while providing both realtime and deeper insight across myriad business scenarios is what separates industry leaders from the laggards.

The four common themes to becoming a Data driven & Predictive Bank –

  1. Constant product innovation based on an incremental approach – are we building the right products that cater to our dynamic clientele?
  2. A unified & seamless channel experience, as an ever higher share of transactions is performed over digital mediums
  3. A relentless drive to automation – replacing obsolete manual processes with automated operating processes across both business & IT
  4. A constant push to innovate across the BiModal world – the stable core as well as the new edge – by incorporating the latest advances. Please visit http://www.vamsitalkstech.com/?p=1244 for more detail.

So what do Banking CXOs & Architects need to do to drive an effective program of Predictive Analytics Enabled Transformation?

  • Drive innovation to provide personalized services and a seamless customer experience across multiple diffused channels
  • Eliminate data silos that have built up over the years which inhibit an ability to cross sell services that clients are interested in
  • Offer data driven capabilities that can detect customer preferences on the fly, match them with existing history and provide value added services. Services that not only provide a better experience but also help in building a longer term customer relationship
  • On the Defensive side, provide next generation and forward looking Data Ingestion & Processing capabilities in the highly regulated (and hugely vexing) areas of Risk Data Aggregation & Reporting, Various Types of Fraud Monitoring & Detection and AML (Anti Money Laundering) Compliance

So how can Hortonworks DataFlow (HDF) create business value?

What are some of the key requirements & desired capabilities such a technology category can provide?

  1. Provides strong lifecycle management capabilities for Data In Motion. This includes not just data flowing in from traditional sources – Customer Account data, Transaction data, Wire data, Trade data, Customer Relationship Management (CRM), General Ledger – but also streaming datasources such as Market Data Systems that feed instrument data, machine generated data from ATMs & POS terminals, and systems supporting Digital functions like Social Media and Customer Feedback. It is format and protocol agnostic
  2. Helps maintain a strong Chain of Custody capability for this data. Essentially, know and understand where every piece of data originated, how it was modified, and its lineage. This is key from both a regulatory and a market lineage perspective
  3. Provides Data and System Administrators the ability to visualize enterprise wide or departmental flows in one place and to modify them on the fly
  4. The ability to run stream based processing on the data, as HDF is primarily a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic

Hortonworks DataFlow – HDF (based on Apache NiFi) – is based on technology originally created at the NSA, which encountered big data collection and processing issues at a scale and stage that is beyond most enterprise implementations today. HDF was designed inherently to meet timely decision making needs by collecting and analyzing data from a wide range of disparate data sources, securely, efficiently and across geographically dispersed and possibly fragmented data silos – the likes of which are commonplace in financial services.

Banking organizations are beginning to leverage HDF and HDP to create a common cross-company data lake for data from different LOBs: mortgage, consumer banking, personal credit, wholesale and treasury banking. Internal Managers, Business Analysts, Data Scientists and, finally, Consumers are all able to derive immense value from the data. A single point of data management allows the bank to operationalize security and privacy measures such as de-identification, masking, encryption, and user authentication. From a data processing perspective, Hadoop supports multiple ways of running the models and algorithms used to find patterns of fraud and anomalies in the data and to predict customer behavior. Examples include Bayesian filters, Clustering, Regression Analysis, Neural Networks etc. Data Scientists & Business Analysts have a choice of MapReduce, Spark (via Java, Python, R), Storm and SAS, to name a few, to create these models.
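To make the modeling idea above a little more concrete, here is a minimal, illustrative sketch (not a production fraud model, and not tied to any particular Hadoop engine) of scoring card transactions with a logistic regression in Python/scikit-learn. The feature names and toy values are assumptions for illustration only.

```python
# Illustrative only: score card transactions for fraud likelihood
# with a simple logistic regression model. Features and values are
# assumptions, not a real bank's feature set.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy historical data: [amount_usd, distance_from_home_km, txns_last_hour]
X_train = np.array([
    [25.0,     2.0, 1],
    [40.0,     5.0, 2],
    [900.0,  800.0, 6],
    [1200.0, 950.0, 8],
])
y_train = np.array([0, 0, 1, 1])   # 0 = legitimate, 1 = confirmed fraud

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score a new transaction as it streams in from an ATM or POS endpoint.
new_txn = np.array([[1100.0, 700.0, 5]])
fraud_probability = model.predict_proba(new_txn)[0][1]
print("Fraud score: %.2f" % fraud_probability)
```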

With the stage thus set, let us now examine some concrete use cases and business drivers that are a great fit for HDF in finance, across four core domains spanning the spectrum – Retail & Consumer Banking, Capital Markets & Wealth Management, RFC (Risk, Fraud & Compliance) & Bank IT Ops.

                  Illustration – Enterprise Dataflow Management Business Drivers In Banking

Retail and Consumer Banking – 

According to McKinsey[1], there are a few ways to approach digital banking, but for leading banks there are typically four interconnected, mutually reinforcing elements: a) Improving Connectivity, b) Increased Automation, c) Accelerating Innovation, and d) Improved Decisioning.

Connectivity deals with being able to harness newer data sources, along with internal and 3rd party data, to build loyalty and competition-disrupting offerings. Automation deals with optimizing internal and external value chains. Innovation refers to how banks should continue to renew themselves, given the rapid pace of change in the industry. Decisioning refers to how big data can be used to make better, faster, and more accurate decisions regarding customer purchase choices as well as banks' decisions on issues such as risk.

The challenge in retail banking is to seamlessly combine data from hundreds of internal databases and external systems (like 3rd party services & data providers). For instance, Core Banking data needs to be combined with Payments data (payments made or missed), along with any notes from the retail banker, and with behavior data, in order to segment customers, predict loan defaults, optimize portfolios and so on.

HDF can be used to ingest data from multiple channel endpoints like ATMs, POS terminals, Mobile Banking clients etc. As the data is in motion, predictive models analyze transaction data and combine it with historical data on individuals to produce metrics known as scores. A given score indicates a range of metrics around business considerations like fraud detection and risk detection, as well as segmenting & ranking customers based on their likelihood of purchasing a certain product, the creditworthiness of a CDO (Collateralized Debt Obligation) etc.

Traditional banking algorithms cannot scale with this explosion of data, nor with the heterogeneity inherent in reporting across areas such as risk management. E.g. certain kinds of Credit Risk calculations need access to around 200 days of historical data, where one is looking at the probability of the counterparty defaulting and a statistical measure of the same.
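As a toy illustration of the kind of statistical measure mentioned above, the sketch below estimates a default probability from a trailing 200-day window of daily observations. The data, the window length and the naive frequency estimator are all simplifying assumptions, not an actual credit risk methodology.

```python
# Toy illustration: estimate a probability of default (PD) from a
# trailing window of daily 0/1 default observations for a counterparty
# segment. Window length and data are assumptions for illustration.
from collections import deque

WINDOW_DAYS = 200

def rolling_default_probability(daily_default_flags):
    """daily_default_flags: iterable of 0/1 per day
    (1 = a default event observed in the segment on that day)."""
    window = deque(maxlen=WINDOW_DAYS)
    for flag in daily_default_flags:
        window.append(flag)
    return sum(window) / float(len(window)) if window else 0.0

# 200 days of history with 3 observed default events in the segment
history = [0] * 197 + [1, 1, 1]
print("Estimated PD over %d days: %.3f"
      % (WINDOW_DAYS, rolling_default_probability(history)))
```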

  1. Customer Profitability & Lifetime Value –
    Banking is typically a long term relationship, with the provider realizing immense benefits from highly engaged customers. Understanding the journey of a single customer across multiple financial products and mapping that journey from a BI standpoint is a key capability. Accordingly, the business requirements are –

    • Integrate realtime transactions with core financial systems
    • Increase revenue & net yield per customer by showing them the right long term plan & financial products (e.g. a more cost effective mortgage or auto loan, or portfolio optimization)
    • Understand CLV across all products (savings, checking accounts, mortgages, auto loans & credit cards etc.) with a view to understanding P&L (Profit and Loss) on an individual customer and segment basis


  2. Customer 360 & Segmentation –
    Currently most Retail and Consumer Banks lack a comprehensive view of their customers. Each department has a limited view of the customer, due to which the offers and interactions with customers across multiple channels are typically inconsistent and vary a lot. This also results in limited collaboration within the bank when servicing customer needs. Leveraging the ingestion and predictive capabilities of a Hadoop based platform, banks can provide a user experience that rivals Facebook, Twitter or Google, with a full picture of the customer across all touch points.
  3. Sentiment & Social Media Analysis –
    Leveraging HDF's ability to ingest data from various Social Media and Digital Marketing datasources, one can mine Twitter, Facebook and other social media conversations for sentiment data about products, services and the competition, and use it to make targeted, real-time decisions that increase market share. HDF comes out of the box with around 90 data processors, including encoders, encrypters, compressors and converters, as well as processors for creating Hadoop sequence files from data flows, interacting with AWS, sending messages to Kafka, getting messages from Twitter, and more. One can configure the data processors through a drag & drop visual UI, chaining them and using back-pressure between them to control the data flow.

  4. Realtime Payments –
    The real time data processing capabilities of HDF allow it to process data in a continual, bursty, streaming or micro-batching fashion (a minimal micro-batching sketch follows below). Once ingested, payment data must be processed within a very small time period, typically termed near real time (NRT). When combined with predictive capabilities via behavioral modeling & transaction profiling, HDF can provide significant operational, time & cost savings.
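Here is a minimal sketch of the micro-batching idea from item 4, in plain Python rather than HDF itself: buffer incoming payment events briefly, then score each small batch within a near real time window. The field names and the placeholder scoring rule are assumptions for illustration.

```python
# Illustrative micro-batching sketch: buffer payment events and score
# each small batch. Field names and the scoring rule are assumptions.
import time

def score_payment(payment):
    # Placeholder behavioral rule; a real deployment would call a
    # trained model such as the one sketched earlier in this post.
    return 1.0 if payment["amount"] > 10000 else 0.1

def micro_batch(stream, batch_size=100, max_wait_seconds=0.5):
    batch, deadline = [], time.time() + max_wait_seconds
    for payment in stream:
        batch.append(payment)
        if len(batch) >= batch_size or time.time() >= deadline:
            yield batch
            batch, deadline = [], time.time() + max_wait_seconds
    if batch:
        yield batch

payments = [{"id": i, "amount": 50 * i} for i in range(1, 401)]
for batch in micro_batch(iter(payments)):
    flagged = [p["id"] for p in batch if score_payment(p) > 0.5]
    if flagged:
        print("Review payments:", flagged)
```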

According to McKinsey, the benefits of digital banking amount to much more than just the provision of financial services through the new facades of mobile and internet channels. HDF can augment Digital Banking with more powerful capabilities – around greater governance, strong data security, and privacy protection – all of which enable the creation & fine tuning of new products and services.

Capital Markets & Wealth Management –

Within large bulge bracket firms, Capital Markets groups engineer custom derivative trades that hedge exposure for their clients as well as for their own internal treasury groups. They may also do proprietary trading (on the bank’s behalf) for a profit (though this is the type of trading that the Volcker Rule seeks to eliminate). These groups typically lead the way in being forward looking from a high tech perspective.

Most of the compute intensive problems are generated by either this group or the enterprise risk group. They typically own the systems that interface with the exchange facing order management systems, the trade booking systems, the pricing libraries for the products the bank trades, as well as the tactical systems used to manage their market and credit risks, customer profitability, compliance and collateral. As a result, they usually get a large chunk of a bank's total IT budget and see technology as a key competitive advantage. The above business drivers are already being tackled in many areas within the Capital Markets spectrum.

  1. Simulations & Cross LOB Analytics –
    Currently most Cross Line Of Business analytics and simulations use limited data, as high storage costs mean only a few months of data can be kept; as a result simulations use only limited signals (data sources), which affects model accuracy. With a Hortonworks Data Platform (HDP) based operational store –

    • Data can now be kept indefinitely
    • Augmented with data from other LoBs
    • Provide an ability to simulate things like consumer demand and macro-trends

    HDF can constantly ingest, store and process market data, social media data, reference data, position data etc., and constantly precompute results that can be persisted into the batch layer.

  2. Algorithmic Trading – HDF can augment trading infrastructures in several ways –

    1. Re-tooling existing trading infrastructures so that they are more integrated yet loosely coupled and efficient
    2. Helping plug in algorithm based complex trading strategies that are quantitative in nature across a range of asset classes like equities, forex, ETFs and commodities
    3. Incorporating newer & faster sources of data (social media, sensor data, clickstream data) and not just the conventional sources (market data, position data, M&A data, transaction data etc.)
    4. Retrofitting existing trade systems to accommodate a range of mobile clients who have a vested interest in deriving analytics, e.g. marrying tick data with market structure information to understand why certain securities dip or spike at certain points and the reasons for the same (e.g. institutional selling or equity linked trades with derivatives)
    5. Helping traders integrate algorithms as well as customize them to generate constant competitive advantage

  3. Wealth Management Lifecycle – The lifecycle of Wealth Management ranging from Investment strategy development to Portfolio Optimization to Digital Marketing all depend on ingesting, analyzing and acting on complex data feeds. HDF augments Hadoop based capability by providing Data In Motion insights across this spectrum.
  4. Market & Trade Surveillance 

    An intelligent surveillance system needs to store trade data, reference data, order data, and market data, as well as all of the relevant communications from all the disparate systems, both internally and externally, and then match these things appropriately. The system needs to account for multiple levels of detection capabilities starting with a) configuring business rules (that describe a fraud pattern) as well as b) dynamic capabilities based on machine learning models (typically thought of as being more predictive). Such a system also needs to parallelize execution at scale to be able to meet demanding latency requirements for a market surveillance platform.

    HDF can augment existing systems ranging from CEP Engines to Trade Lifecycle Systems by –

    1. Supporting end to end monitoring across a variety of financial instruments and multiple venues of trading, along with a wide variety of analytics that enable the discovery of interrelationships between customers, traders & trades – the next major advance in surveillance technology
    2. Providing a platform that can ingest from tens of millions to billions of market events (spanning a range of financial instruments – Equities, Bonds, Forex, Commodities and Derivatives etc.) on a daily basis from thousands of institutional market participants
    3. Providing the ability to add new business rules (via either a business rules engine and/or a model based system that supports machine learning). As we can see from the first post, market manipulation is an activity that seems to constantly push the boundaries in new and unforeseen ways
    4. Providing advanced visualization techniques, thus helping Compliance and Surveillance officers manage the information overload
    5. Supporting the ability to perform deep cross-market analysis, i.e. to look at financial instruments & securities trading across multiple geographies and exchanges
    6. Supporting the ability to create views and correlate data that are both wide and deep. A wide view looks at related securities across multiple venues; a deep view looks for a range of illegal behaviors that threaten market integrity such as market manipulation, insider trading, watch/restricted list trading and unusual pricing (a small illustrative check for unusual pricing appears after this list)
    7. Supporting the ability to provide in-memory caches of data (based on Apache Spark) for rapid pre-trade compliance checks
    8. Supporting the ability to create prebuilt analytical models and algorithms that pertain to trading strategy (pre-trade models, e.g. best execution and analysis). The most popular way to link R and Hadoop is to use HDFS as the long-term store for all data, and use MapReduce jobs (potentially submitted from Hive or Pig) to encode, enrich, and sample data sets from HDFS into SAS/R
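    To illustrate the "unusual pricing" check referenced in item 6, here is a small, hedged sketch using pandas: a rolling z-score per instrument flags prices that deviate sharply from recent history. Column names, the window size and the 4-sigma threshold are assumptions, not a production surveillance rule set.

```python
# Illustrative only: flag 'unusual pricing' with a rolling z-score
# per instrument. Columns, window and threshold are assumptions.
import pandas as pd

def flag_unusual_prices(trades, window=50, threshold=4.0):
    """trades: DataFrame with columns ['instrument', 'timestamp', 'price']."""
    trades = trades.sort_values(["instrument", "timestamp"]).copy()
    grouped = trades.groupby("instrument")["price"]
    rolling_mean = grouped.transform(
        lambda s: s.rolling(window, min_periods=10).mean())
    rolling_std = grouped.transform(
        lambda s: s.rolling(window, min_periods=10).std())
    trades["zscore"] = (trades["price"] - rolling_mean) / rolling_std
    return trades[trades["zscore"].abs() > threshold]

# Example usage (assuming a trades_df DataFrame already exists):
# suspicious = flag_unusual_prices(trades_df)
```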

 

Risk, Fraud & Compliance –

Risk management is not just a defensive business imperative; the best managed banks deploy their capital to obtain the best possible business outcomes. The last few posts have more than set the stage from a business and regulatory perspective. This one takes a bit of a deep dive into the technology.

Existing data architectures are siloed with bank IT creating or replicating data marts or warehouses to feed internal lines of business. These data marts are then accessed by custom reporting applications thus replicating/copying data many times over which leads to massive data management & governance challenges.

Furthermore, the explosion of new types of data in recent years has put tremendous pressure on the financial services datacenter, both technically and financially, and an architectural shift is underway in which multiple LOBs can consolidate their data into a unified data lake.

All of the below areas have been exhaustively covered on the blog. By providing a scalable platform that enables simple, fast data acquisition, secure data transport, prioritized data flow and clear traceability of data from BORT (Book Of Record Transaction) systems, HDF is the perfect complement to HDP, helping bring together historical and perishable insights in the classic RFC areas.

  1. Risk Data Aggregation & Reporting – In depth discussion after the jump – http://www.vamsitalkstech.com/?p=667
  2. AML (Anti Money Laundering) Compliance – http://www.vamsitalkstech.com/?p=833
  3. Cyber Security – A cursory study of the significant data breaches in 2015 reads like a comprehensive list of enterprises across both the Banking and Retail verticals. The world of Banking now understands that a comprehensive & strategic approach to Cybersecurity has moved from being merely an IT challenge a few years ago to a "must have". As Digital and IoT ecosystems evolve into loose federations of API accessible and cloud native applications, more and more assets are in danger of being targeted by extremely well funded and sophisticated adversaries. In conjunction with frameworks like OpenSOC, HDF can provide a unified data ingestion platform that can onboard & combine SIEM data, advanced threat intelligence, geolocation and DNS information, and network packet capture to automate security threat detection while merging it all with telemetry data. More on this in the next set of blogposts.
  4. Fraud Monitoring & Detection – http://www.vamsitalkstech.com/?p=1098

IT Operations –

In addition to the business areas above, HDF shines in the below popular systems oriented usecases as well.

  1. Log Data Processing – The ability to process log data coming in from application endpoints and telemetry devices (e.g. ATMs, Point Of Sale terminals & IoT devices) is a tremendously useful capability across a range of use cases, from Customer Journey to Fraud Detection to Digital Marketing. HDF excels at log data aggregation and visualization at massive scale (a small parsing & aggregation sketch follows this list).
  2. Digitizing Endpoints – As larger Banks gradually adopt IoT technologies across Retail Banking, HDF can help take out the complexity of managing such large-scale systems, which encompass a variety of endpoints and platforms, e.g. IP cameras for security, wireless access points, industrial grade routers, HVAC equipment etc. that are commonly seen across physical locations.
  3. Cross Bank Report Generation – The ability to set up data pipelines that enable secure data sharing among different bank branches as well as data centers, when personal data is shared between offices or branches of a bank, is a key capability that HDF provides, as is helping produce fresh reports.
  4. Cross LOB (Line Of Business) Analytics – When combined with the Hortonworks Data Platform, HDF accelerates the firm's speed-to-analytics and also extends its data retention timeline. A shared data repository across multiple LOBs provides more visibility into all trading activities. The trading risk group accesses this shared data lake to process more position, execution and balance data. They can do this analysis on data from the current workday, and it is highly available for at least five years – much longer than before.
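As a small illustration of the log aggregation mentioned in item 1, the sketch below parses delimited ATM/POS log lines into structured events and counts errors per device. The log format shown is an assumption for illustration only.

```python
# Illustrative sketch: parse delimited device logs and aggregate
# error counts per endpoint. The log format is an assumption.
from collections import Counter

def parse_line(line):
    # Assumed format: "2015-12-22T19:16:02|ATM-0042|WITHDRAWAL|ERROR|timeout"
    timestamp, device, operation, status, detail = line.strip().split("|")
    return {"ts": timestamp, "device": device, "op": operation,
            "status": status, "detail": detail}

def error_counts(lines):
    counts = Counter()
    for line in lines:
        event = parse_line(line)
        if event["status"] == "ERROR":
            counts[event["device"]] += 1
    return counts

sample = [
    "2015-12-22T19:16:02|ATM-0042|WITHDRAWAL|ERROR|timeout",
    "2015-12-22T19:16:05|ATM-0042|WITHDRAWAL|OK|",
    "2015-12-22T19:16:09|POS-9001|PURCHASE|ERROR|declined",
]
print(error_counts(sample))
```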

Summary

Financial Services is clearly a data intensive business. Forward looking banks, insurance companies and securities firms have already begun to store and process huge amounts of data in Apache Hadoop, and they have better insight into both their risks and opportunities. However, a significant inhibitor to enabling Predictive Analytics is the lack of strong enterprise-wide dataflow management capabilities. Deployed at scale for almost a decade before being contributed to the Open Source community, Hortonworks DataFlow (HDF) has proven to be an excellent and effective tool that integrates the most common current and future needs of big data acquisition and ingestion for accurately informed, on-time decision making.

References

[1] McKinsey – Digital Banking in Asia: Winning Approaches in a New Generation of Financial Services – http://docplayer.net/1202013-Asia-financial-institutions-digital-banking-in-asia-winning-approaches-in-a-new-generation-of-financial-services.html

Apache NiFi Eases Dataflow Management & Accelerates Time to Analytics In Banking (2/3)..

The previous post did a somewhat comprehensive job of cataloging the issues and problems in financial services with data ingestion & flow management at large scale. The need for a technology platform to help ingest data at scale across many kinds of endpoints in an omni-channel world, while adhering to stringent regulations (security, chain of custody and encryption), is a crying need from both an Operator's and a Data Architect's perspective. Apache NiFi is a groundbreaking, 100% open source technology that can be leveraged to provide all of the above capabilities. This post explores NiFi from a technology standpoint. The next (and final) post will examine both the Defensive (Risk, Fraud & Compliance) as well as Offensive (Digital Transformation et al) use cases from a banking standpoint.

Why is Data Flow Management Such a Massive Challenge in Banking?

Banks need to operate their IT across two distinct prongs – defense and offense. Defensive in areas like Risk, Fraud and Compliance (RFC); offensive in revenue producing areas of the business like Customer 360 (whether Institutional or Retail), Digital Marketing, Mobile Payments, omni-channel Wealth Management etc. If one really thinks about it, the biggest activity banks perform is manipulating and dealing in information, whether related to a Customer, a Transaction or the General Ledger.

Thus, Financial Services is a data intensive business. Forward looking banks, insurance companies and securities firms have already begun to store and process huge amounts of data in Apache Hadoop, and they have better insight into both their risks and opportunities.

However, there are key challenges with current architectures in ingesting & processing the multi-varied data found in banking –

  1. A high degree of data is duplicated from system to system, leading to multiple inconsistencies at the summary as well as transaction levels. Because different groups perform different risk reporting functions (e.g. Credit and Market Risk), the feeds, the ingestion and the calculators end up being duplicated as well
  2. Traditional banking algorithms cannot scale with this explosion of data, nor with the heterogeneity inherent in reporting across areas such as risk management. E.g. certain kinds of Credit Risk calculations need access to around 200 days of historical data, where one is looking at the probability of the counterparty defaulting and a statistical measure of the same. The data inputs required are complex, multi-varied and multi source
  3. Agile and smooth data ingestion is a massive challenge at most banks. Approaching ingestion and dataflow management as an enterprise rather than an IT concern includes a) ingesting data from the highest priority systems, b) applying the correct governance rules to the data, and c) applying stream processing based analytics to realize business value at low temporal latency for many kinds of key use cases

The Hadoop ecosystem has lacked an Open Source alternative in this space. No more with the rapid entry & maturation of Apache NiFi.

Introducing Apache NiFi (Niagara Files)-

Incubated at the NSA (National Security Agency) and later open sourced at the Apache Foundation, Apache NiFi is a platform used to securely collect any and all enterprise data from virtually any source system (batch, realtime or streaming) – including from outside a firewall – while ensuring complete tracking of the history and provenance of that data.

Apache NiFi is the first integrated open source platform that solves the real time complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available. NiFi is a single combined platform providing the data acquisition, simple event processing, transport and delivery mechanisms designed to accommodate the highly diverse and complicated dataflows generated by a world of connected people, systems and things.

From a high level, Apache NiFi can be used in the financial services space in many scenarios, including Client 360, Fraud, Cybersecurity and AML Compliance.

As Banks launch new initiatives in response to business challenges (the typical RFC continuum) or invest in additional capabilities (around Customer Analytics or Digital), a plethora of issues usually show up around data architectures. For instance, Hadoop can now be used to generate insights (at a latency of a few milliseconds) that can assist Banks in detecting fraud as soon as it happens. This usually means rearranging or reordering existing data-flow patterns (from a protocol or message format translation standpoint) as well as ingesting data that was simply uncollected before.

When existing dataflow based architectures cannot scale up to handle new requests, the response always has been one or a combination of the below –

  • Add new systems to mediate between protocols
  • Add new systems to transform or reorder data
  • Add new capabilities to filter the data

Core Capabilities of Apache NiFi – 


                  Illustration – NiFi Eases Enterprise Dataflow Management In Banking

NiFi obviates all of the above.

The core value add provided by NiFi is as below –

  1. Provides strong lifecycle management capabilities for Data In Motion. This includes not just data flowing in from traditional sources – Customer Account data, Transaction data, Wire data, Trade data, Customer Relationship Management (CRM), General Ledger – but also streaming datasources such as Market Data Systems that feed instrument data, machine generated data from ATMs & POS terminals, and systems supporting Digital functions like Social Media and Customer Feedback. It is format and protocol agnostic
  2. Helps maintain a strong Chain of Custody capability for this data. Essentially, know and understand where every piece of data originated, how it was modified, and its lineage. This is key from both a regulatory and a market lineage perspective
  3. Provides Data and System Administrators the ability to visualize enterprise wide or departmental flows in one place and to modify them on the fly
  4. The ability to run stream based processing on the data, as NiFi is primarily a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic

Architecture –  

NiFi is built to tackle enterprise data flow problems and, as a technology, is augmentative & complementary to Hadoop. NiFi provides a web-based user interface for the design, control, feedback, and monitoring of dataflows. It is also highly configurable and supports an agile methodology for developing these additive capabilities in large, medium or small projects. Indeed, the core advantage of NiFi is its ability to shorten the time to deriving analytic value from business projects.

Note that NiFi does not seek to replace existing messaging brokers or data acquisition systems, but layers on top of them and augments their capabilities, adding several dimensions such as quality of service (e.g. loss-tolerant versus guaranteed delivery), low latency versus high throughput, and priority-based queuing. Further, NiFi also provides fine-grained data provenance for all data received, forked, joined, cloned, modified, sent, and ultimately dropped upon reaching its configured end-state.
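To make the provenance idea concrete, here is a conceptual Python sketch (this is not NiFi's API, just the chain-of-custody concept) of recording provenance events against a piece of data from receipt through modification to delivery.

```python
# Conceptual sketch only (not NiFi code): record provenance events so
# every piece of data carries its chain of custody.
import time
import uuid

class FlowFile:
    def __init__(self, content, source):
        self.id = str(uuid.uuid4())
        self.content = content
        self.provenance = []
        self._record("RECEIVE", "ingested from %s" % source)

    def _record(self, event_type, detail):
        self.provenance.append(
            {"event": event_type, "detail": detail, "at": time.time()})

    def modify(self, transform, detail):
        self.content = transform(self.content)
        self._record("MODIFY", detail)

    def send(self, destination):
        self._record("SEND", "delivered to %s" % destination)

ff = FlowFile(b"ISO8583 payment message", source="pos-gateway")
ff.modify(lambda c: c.upper(), "normalised encoding")
ff.send("hadoop-data-lake")
for event in ff.provenance:
    print(event["event"], "-", event["detail"])
```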



 NiFi ships with a wide variety of connectors to projects like Kafka, Storm, Sqoop, Flume etc.

The below illustration from Hortonworks (the lead committers to NiFi, as with most Hadoop projects) clarifies NiFi's place in the data ecosystem.


                  Illustration – NiFi & Other Data Movement/Processing Technologies

So what are the core features of NiFi from a runtime & platform standpoint?

  1. Implementation – The NiFi graphical flow designer provides a visual, code-free UI that enables easy development of live process flows, leaves less to code and empowers a broader workforce to positively effect change in the enterprise
  2. Architecture [2] – Centralized architecture. NiFi executes within a JVM living within a host operating system. A NiFi cluster is comprised of one or more NiFi Nodes controlled by a single NiFi Cluster Manager (NCM). The clustering design is a simple master/slave model where the NCM is the master and the Nodes are the slaves. The NCM's reason for existence is to keep track of which Nodes are in the cluster and their status, and to replicate requests to modify or observe the flow. Fundamentally, the NCM keeps the state of the cluster consistent. While the model is that of master and slave, if the master dies the Nodes are all instructed to continue operating as they were, to ensure the data flow remains live. The absence of the NCM simply means new nodes cannot join the cluster and cluster flow changes cannot occur until the NCM is restored.
    Illustration – NiFi Runtime

  3. Security, Governance and Management – NiFi provides coverage from the edge through the lake for security, governance and management
  4. Bi-directionality – Able to interact with the source of data, sending back actionable responses
  5. An extensible platform with custom processors, where developers can plug existing code into these custom processors – 90 (and growing) processors ship with pre-developed actions for ingest, forwarding etc.
  6. Supports a rich selection of data formats, protocols and schemas, and is designed for easy extension so organizations can add support for their own proprietary formats, schemas and protocols
  7. Adaptive to data flow conditions (latency, bandwidth) to maintain scalability and reliability
  8. Distributed – NiFi manages the data flow from the edge to the central data store with end to end security and governance
  9. Data Provenance – Provides visual, graph based tooling for data lineage and traceability; chain of custody graphs support governance and compliance end to end, all the way down to the edge
  10. Analysis of data in flight – Filter data in motion, analyze and prioritize, and process data based on predefined rules (modify, re-route, delay etc.)
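The sketch below illustrates the flow-based-programming idea behind these capabilities (again, plain Python rather than NiFi itself): small processors chained into a flow that filters, prioritises and routes records in flight. The record shape and routing rule are assumptions for illustration.

```python
# Illustrative sketch of flow-based processing: filter, prioritise
# and route records in flight. Not NiFi code; fields are assumptions.
def filter_in_motion(records):
    # Drop malformed records before they reach the data lake.
    return (r for r in records if "account" in r and "amount" in r)

def prioritise(records):
    # Higher-value payments are processed first within this batch.
    return sorted(records, key=lambda r: r["amount"], reverse=True)

def route(record):
    # Route on an attribute, e.g. send large wires to the compliance queue.
    return "compliance" if record["amount"] >= 10000 else "core-banking"

incoming = [
    {"account": "A1", "amount": 250},
    {"amount": 99},                      # malformed: no account
    {"account": "B7", "amount": 25000},
]
for record in prioritise(list(filter_in_motion(incoming))):
    print(route(record), record)
```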

An ideal solution for financial services, NiFi enables simple, fast data acquisition, secure data transport, prioritized data flow and clear traceability of data from the very edge of customer applications and endpoints all the way to the core data center.

The final (and third) post will then examine the use of NiFi around Financial Services usecases frequently discussed in this blog.

References

[1] Hortonworks Acquires Onyara – http://hortonworks.com/hdf/

[2] Apache NiFi – https://nifi.apache.org/

How Big Data Approaches Ease the Data Lifecycle in Finance (1/3)..

A Brief (and highly simplified) History of Banking Data

Financial Services organizations are unique in possessing the most diverse set of data of any industry vertical. Corporate IT organizations in the financial industry have been tackling data challenges at scale for many years now. Just take Retail Banking as an example. Traditional sources of data in this segment include Customer Account data, Transaction data, Wire data, Trade data, Customer Relationship Management (CRM), General Ledger and other systems supporting core banking functions.

Shortly after these "systems of record" became established, enterprise data warehouse (EDW) based architectures began to proliferate with the intention of mining the trove of real world data that banks possess – the primary intention being to provide conventional Business Intelligence (BI) capabilities across a range of use cases: Risk Reporting, Customer Behavior, Trade Lifecycle, Compliance Reporting etc. These operations were run by either centralized or localized data architecture groups responsible for maintaining a hodgepodge of systems for business metrics across the above business functions, and even systems-oriented use cases like application & log processing – all of which further adds to the maze of data complexity.

The advent of Social technology has added newer and wider sources of data, including social networks and IoT data, along with the need to collect time series data as well as detailed information from every transaction, every purchase and the channel it originated from.

Thus, the Bank IT world was a world of silos until Hadoop-led disruption happened. The catalyst for this is Predictive Analytics – which provides both realtime and deeper insight across myriad scenarios –

  1. Predicting customer behavior in realtime,
  2. Creating models of customer personas (micro and macro) to track their journey across a Bank’s financial product offerings,
  3. Defining 360 degree views of a customer so as to market to them as one entity,
  4. Fraud monitoring & detection
  5. Risk Data Aggregation (e.g Volcker Rule)
  6. Compliance  etc.

The net result is that Hadoop and Big Data are no longer unknowns in the world of high finance. Banking organizations are beginning to leverage Apache Hadoop to create a common cross-company data lake for data from different LOBs: mortgage, consumer banking, personal credit, wholesale and treasury banking. Internal Managers, Business Analysts, Data Scientists and, finally, Consumers are all able to derive immense value from the data. A single point of data management allows the bank to operationalize security and privacy measures such as de-identification, masking, encryption, and user authentication. From a data processing perspective, Hadoop supports multiple ways of running the models and algorithms used to find patterns of fraud and anomalies in the data and to predict customer behavior. Examples include Bayesian filters, Clustering, Regression Analysis, Neural Networks etc. Data Scientists & Business Analysts have a choice of MapReduce, Spark (via Java, Python, R), Storm and SAS, to name a few, to create these models.

Financial Application development,  model creation, testing and deployment on fresh & historical data become very straightforward to implement on Hadoop.

How does Big Data deliver value

Hadoop (and NoSQL databases) help Bank IT deliver value in five major ways as they –

  1. Enable more agile business & data development projects
  2. Enable exploratory data analysis to be performed on full datasets or samples within those datasets
  3. Reduce time to market for business capabilities
  4. Help store raw historical data at very short notice at very low cost
  5. Help store data for months and years at a much lower cost per TB compared to tape drives and other archival solutions

Why does data inflexibility hamper business value creation? The short answer is that anywhere from 60% to 80% of the time spent on data projects goes into ingesting and preparing the data into a format that can be consumed to realize insights, both analytical and predictive.[1]

While Hadoop has always been touted for its ability to process any kind of data (be it streaming, realtime or batch), its flexibility in enabling a speedier data acquisition lifecycle does not get nearly half as much attention. In banking, multiple systems send data into a Data Lake (an enterprise wide repository of data). These include systems for Accounting, Trade, Loan, Payment and Wire Transfer data etc.

A constant theme and headache in banking data management – some of the major issues and bottlenecks encountered on an almost daily basis as data is moved from Book Of Record Transaction (BORT) systems to Book Of Record Enterprise (BORE) systems are –

  • Hundreds of point-to point feeds to each enterprise system from each transaction system
  • Data being largely independently sourced leads to timing and data lineage issues
  • End of Day/Month Close processes are complicated and error prone due to dealing with incomplete and (worse) inaccurate data
  • The Reconciliation process then requires a large effort and also has significant data gaps from a granular perspective


                                       Illustration – Data Ingestion and Processing Lifecycle

I posit that five major streams of work encompass the lifecycle of every large Hadoop project in financial services –

1) Data Ingestion: Ingestion is almost always the first piece of a data lifecycle. Developing this portion will be the first step to realizing a highly agile & business focused architecture. The lack of timely data ingestion frameworks is a large part of the problem at most institutions. As part of this process, data is acquired a) typically from the highest priority systems, and b) initial transformation rules are applied to the data.

2) Data Governance: These are the L2 loaders that apply rules to the critical fields for Risk and Compliance. The goal here is to look for gaps in the data and any obvious quality problems involving range or table driven data, and the purpose is to facilitate data governance reporting (a small validation sketch follows this list).

3) Data Enrichment & Transformation: This involves defining the transformation rules required in each marketing, risk, finance and compliance area to prep the data for their specific processing.

4) Analytic Definition: Defining the analytics that are to be used for each Data Science or Business Intelligence Project

5) Report Definition: Defining the reports that are to be issued for each business area.
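As a toy example of the governance checks described in step 2 above, the sketch below validates that critical fields are present and within expected ranges before a record is released for risk or compliance processing. Field names and ranges are illustrative assumptions, not any bank's actual rule set.

```python
# Toy governance sketch: check critical fields and ranges before
# releasing a record downstream. Fields and ranges are assumptions.
REQUIRED_FIELDS = ["trade_id", "counterparty", "notional", "currency"]
VALID_CURRENCIES = {"USD", "EUR", "GBP", "JPY"}

def validate_record(record):
    issues = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            issues.append("missing field: %s" % field)
    if record.get("notional") is not None and record["notional"] <= 0:
        issues.append("notional out of range")
    if record.get("currency") and record["currency"] not in VALID_CURRENCIES:
        issues.append("unknown currency code")
    return issues

record = {"trade_id": "T-001", "counterparty": "", "notional": -5, "currency": "XXX"}
print(validate_record(record))
# ['missing field: counterparty', 'notional out of range', 'unknown currency code']
```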

As can be seen from the above, Data Ingestion is the first and one of the most critical stages of the overall data management lifecycle.

The question is: how do Big Data techniques help here?

From a high level, Hadoop techniques can help as they –

  • Help centralize data, business and operations functions by populating a data lake with a set of canonical feeds from the transaction systems
  • Incentivize technology to shrink not grow by leveraging a commodity x86 based approach
  • Create Cloud based linearly scalable platforms to host enterprise applications on top of this data lake, including hot, warm and cold computing zones

Thus, banks, insurance companies and securities firms that store and process huge amounts of data in Apache Hadoop have better insight into both their risks and opportunities. In the Capital Markets space, and also at Stock Exchanges, deeper data analysis and insight can not only improve operational margins but also protect against one-time events that might cause catastrophic losses. However, the story around Big Data adoption at your average financial services enterprise is not all that revolutionary – it typically follows a more evolutionary cycle where a rigorous engineering approach is applied to gain small business wins before scaling up to more transformative projects.

The other key area where Hadoop (and NoSQL databases) enable an immense amount of flexibility is what is commonly known as "Schema On Read".

Schema On Read (SOR) is a data ingestion technique that enables any kind of raw data to be ingested, at massive scale, without regard to the availability of a target schema/model (or the lack thereof), into the filesystem underpinning Hadoop – HDFS (Hadoop Distributed File System). Once the data is ingested, it is cleansed, transformed, normalized and encoded based on the requirements of the processing application.

Schema On Read is a radical departure from the way classical data modeling is performed. Historical data architectures are based on relational databases and warehouses. In these systems, upfront modeling has to be performed before the data can be ingested in relational form. Once done, this typically leads to lengthy cycles of transformation, development & testing before end users can access the data. It is a well known fact that most of the time spent in data science projects goes into data preparation – what data scientists call "data wrangling", "data munging" and "data janitor work". Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data before it can be explored for useful nuggets.[1]
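A brief sketch of what Schema On Read looks like in practice with PySpark (assuming a Spark 2.x or later environment; the HDFS path and field names are hypothetical): raw JSON lands in HDFS as-is, and a schema is derived only when the data is read for a specific application.

```python
# Schema-on-read sketch with PySpark. The HDFS path and field names
# are hypothetical; no table design is required before the data lands.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw wire-transfer events were dropped into HDFS without any modelling.
raw = spark.read.json("hdfs:///data/raw/wire_transfers/2015/12/")
raw.printSchema()          # schema inferred at read time, not at load time

# Apply application-specific shaping only now, for this consumer.
cleaned = (raw
           .select("transfer_id", "amount", "currency", "originator")
           .filter(raw.amount > 0))
cleaned.show(5)
```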


                                   Illustration – Schema on Read vs Schema On Write 

Data ingestion & simple event processing are massive challenges in financial services, as there is a clear lack of solutions that provide strong enterprise-wide dataflow management capabilities.

What are some of the key requirements & desired capabilities such a technology category can provide?

  1. A platform that can provide a standard for massive data ingest from all kinds of sources, ranging from databases to logfiles to device telemetry to real time messaging
  2. Centralized ingest & dataflow policy management across thousands of applications – e.g. ingesting data from hundreds of application feeds (that support millions of ATMs, hundreds of websites and mobile capabilities)
  3. The ability to pipe the ingested data over to a Hadoop data lake for complex processing
  4. Extensibility with custom processors for application specific processing as data flows into the lake
  5. Robust Simple Event Processing capabilities (compression, filtering, encryption etc.) – a minimal illustration follows this list
  6. The ability to model the data for consumption by different kinds of audiences – Business Analysts, Data Scientists, Domain Experts etc.
  7. The ability to apply appropriate governance and control policies on the data
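A minimal illustration of the Simple Event Processing called out in item 5: filter events in flight and compress them before handing them to the data lake. Encryption is omitted for brevity, and the event shape is an assumption for illustration.

```python
# Illustrative simple event processing: filter then compress events
# before transport to the data lake. Event fields are assumptions.
import gzip
import json

def process_events(events):
    # Filter: keep only completed transactions.
    completed = [e for e in events if e.get("status") == "COMPLETED"]
    # Compress: pack the surviving events for efficient transport.
    payload = json.dumps(completed).encode("utf-8")
    return gzip.compress(payload)

events = [
    {"id": 1, "status": "COMPLETED", "amount": 120.0},
    {"id": 2, "status": "PENDING", "amount": 75.5},
]
blob = process_events(events)
print("compressed bytes:", len(blob))
```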

The next (and second) post in this series will examine an emerging but potentially pathbreaking 100% Open Source technology (incubated now by Hortonworks) that satisfies all the above requirements and more in the area of scalable data ingestion & processing – Apache NiFi [2].

The final (and third) post will then examine the use of NiFi around Financial Services usecases frequently discussed in this blog.

References

[1] For Big Data Scientists, Janitor Work Is Key Hurdle To Insights – The New York Times

[2] Apache NiFi – https://nifi.apache.org/