Hadoop counters Credit Card Fraud..(2/3)

This article is the second installment in a three part series that covers one of the most critical issues facing the financial industry – Payment Card Fraud. The first (and previous) post discussed the global scope of the problem and its business ramifications; this post discusses a candidate Big Data Architecture that can help financial institutions turn the tables on Fraudster Networks. The final post will cover the evolving business landscape in this sector – in the context of disruptive technology innovation (predictive & streaming analytics) – and will make specific recommendations from a thought leadership standpoint.

Traditional Approach to Fraud Monitoring & Detection – 

Traditional fraud detection systems have focused on factors such as known bad IP addresses or unusual login times, based on Business Rules and Events. Advanced fraud detection systems augment that approach by building models of customer behavior at the macro level, and then using those models to detect anomalous transactions and flag them as potentially fraudulent. However, the scammers have learned to stay ahead of the scammed and are leveraging advances in computing to come up with ever newer ways of cheating the banks.

Case in point [1] –

In 2008 and 2009, PayPal tested several fraud detection packages, finding that none could provide correct analysis fast enough, Dr. Wang (head of Fraud Risk Sciences – PayPal) said. She declined to name the packages but said that the sheer amount of data PayPal must analyze slowed those systems down.

Why Big Data and Hadoop for Fraud Detection?

Big Data is dramatically changing that approach, with advanced analytic solutions that are powerful and fast enough to detect fraud in real time while also building models from historical data (and deep learning) to proactively identify risks.

The business reasons why Hadoop is emerging as the best choice for fraud detection are –

  1. Real-time insights – Hadoop can generate insights at a latency of a few milliseconds, helping Banks detect fraud as soon as it happens
  2. A Single View of Customer/Transaction & Fraud enabled by Hadoop
  3. Loosely coupled yet Cloud Ready Architecture
  4. Highly Scalable yet Cost effective

The technology reasons why Hadoop is emerging as the best choice for fraud detection are:

  1. Hadoop (Gen 2) is not just a data processing platform. It has multiple personas – it is a real-time, streaming, interactive platform for any kind of data processing (batch, analytical, in-memory & graph based), with search, messaging & governance capabilities built in – all of which support fraud detection architecture patterns
  2. Hadoop provides not just massive data storage capabilities but also multiple frameworks to process the data, resulting in response times of milliseconds with the utmost reliability, whether for real-time data or historical processing of back-end data
  3. Hadoop can ingest billions of events at scale thus supporting the most mission critical analytics irrespective of data size
  4. From a component perspective Hadoop supports multiple ways of running models and algorithms that are used to find patterns of fraud and anomalies in the data to predict customer behavior. Examples include Bayesian filters, Clustering, Regression Analysis, Neural Networks etc. Data Scientists & Business Analysts have a choice of MapReduce, Spark (via Java, Python, R), Storm and SAS, to name a few, to create these models (a brief model-scoring sketch follows this list). Fraud model development, testing and deployment on fresh & historical data become very straightforward to implement on Hadoop
  5. Hadoop is not all about highly scalable filesystems and processing engines. It also provides native integration with highly scalable NoSQL options including a database called HBase. HBase has been proven to support near real-time ingest of billions of data streams. HBase provides near real-time, random read and write access to tables containing billions of rows and millions of columns
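
To make point 4 above concrete, here is a minimal sketch of fitting and applying a simple fraud-scoring model with Spark's Python API. The HDFS paths and the feature columns (amount, merchant_risk, txn_velocity) are purely hypothetical; a production model would involve far richer feature engineering, tuning and validation.

```python
# A minimal sketch, assuming a labeled transaction history already sits in HDFS
# as Parquet with hypothetical columns: amount, merchant_risk, txn_velocity, label.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("fraud-model-sketch").getOrCreate()

# Historical, labeled transactions (label = 1 for confirmed fraud, 0 otherwise)
history = spark.read.parquet("hdfs:///data/card_txns/history")  # hypothetical path

# Assemble the numeric attributes into a single feature vector
assembler = VectorAssembler(
    inputCols=["amount", "merchant_risk", "txn_velocity"],
    outputCol="features",
)
train_df = assembler.transform(history)

# Fit a simple logistic-regression fraud scorer; a real deployment would tune and
# validate this, or swap in gradient-boosted trees, clustering, etc.
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_df)

# Score fresh transactions the same way and keep the fraud probability
fresh = assembler.transform(spark.read.parquet("hdfs:///data/card_txns/today"))
scored = model.transform(fresh).select("amount", "probability", "prediction")
scored.show(5)
```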

Again, from [1] –

PayPal processes more than 1.1 petabytes of data for 169 million active customer accounts, according to James Barrese, PayPal’s chief technology officer. During customer transactions, subsets of the data are analyzed in real-time.

Since 2009, PayPal has been building and modifying its fraud analytics systems, incorporating new open-source technologies as they have evolved. For example, the company uses Hadoop to store data, and related analytics tools, such as the Kraken. A data warehouse from Teradata Corp. stores structured data. The fraud analysis systems run on both grid and cloud computing infrastructures.

Several kinds of algorithms analyze thousands of data points in real-time, such as IP address, buying history, recent activity at the merchant’s site or at PayPal’s site and information stored in cookies. Results are compared with external data from identity authentication providers. Each transaction is scored for likely fraud, with suspicious activity flagged for further automated and human scrutiny, Mr. Barrese said.

After implementing multiple large real-time data processing applications using Big Data related technologies in financial services, we present a candidate architectural pattern & technology stack that has been proven in very large production deployments. The key goal is to process millions of events per second, tens of billions of events per day and tens of terabytes of financial data per day – as is to be expected in a large Payment Processor or Bank.

Business Requirements

  1. Ingest (& cleanse) real time Card usage data to get complete view of every transaction with a view to detecting potential fraud
  2. Support multiple ways of ingest across a pub-sub messaging paradigm, clickstreams, logfile aggregation and batch data – at a minimum
  3. Allow business users to specify thousands of rules that signal fraud, e.g. when the same credit card is used from multiple IP addresses within a very short span of time (see the rule sketch after this list)
  4. Support batch oriented analytics that provide predictive and historical models of performance
  5. As much as possible, eliminate false positives as these cause inconvenience to customers and also inhibit transaction volumes
  6. Support a very high degree of scalability – tens of millions of transactions a day and hundreds of TB of historical information
  7. Predict cardholder behavior (using a 360 degree view) to provide better customer service
  8. Help target customer transactions for personalized communications on transactions that raise security flags
  9. Deliver alerts the way customers want – web, text, email, mail etc
  10. Track these events end to end from a strategic perspective across dashboards and predictive models
  11. Help provide a complete picture of high value customers to help drive loyalty programs
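
As a concrete illustration of requirement 3, the sketch below flags a card that is used from more than a threshold number of distinct IP addresses inside a short sliding window. It is an in-memory Python toy; the field names, window size and threshold are illustrative assumptions, and a production rules engine would evaluate thousands of such rules against the streaming tier.

```python
# A minimal, in-memory sketch of the multi-IP rule in requirement 3.
from collections import defaultdict, deque

WINDOW_SECONDS = 300      # 5-minute sliding window (illustrative)
MAX_DISTINCT_IPS = 2      # more than this many IPs in the window raises a flag

recent = defaultdict(deque)   # card_id -> deque of (timestamp, ip)

def check_transaction(card_id, ip, ts):
    """Return True if this swipe violates the multi-IP rule."""
    events = recent[card_id]
    events.append((ts, ip))
    # Drop events that fell out of the sliding window
    while events and ts - events[0][0] > WINDOW_SECONDS:
        events.popleft()
    distinct_ips = {e[1] for e in events}
    return len(distinct_ips) > MAX_DISTINCT_IPS

# Example: three different IPs within a few seconds trips the rule
print(check_transaction("card-123", "1.2.3.4", 1000))   # False
print(check_transaction("card-123", "5.6.7.8", 1002))   # False
print(check_transaction("card-123", "9.9.9.9", 1005))   # True
```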

Design and Architecture

The architecture thus needs to consider two broad data paradigms — data in motion and data at rest.

Data in motion is defined as streaming data that is being sent into an information architecture in real time. Examples of data in motion include credit card swipes, e-commerce tickets, web-based interactions and social media feeds that are a result of purchases or feedback about services. The challenge in this area is to assimilate a huge volume of data, filter it, derive meaning from it and send it to downstream systems such as a business process management (BPM) engine or partner system(s). Managing the event data to make sure changing business rules/regulations are consistently integrated with the data is another key facet in this area.

Data at rest is defined as data that has been collected and ingested in a form that conforms to enterprise data architecture and governance specifications. This data needs to be assimilated or federated with pre-existing sources so that the business can query it in a read/write manner from a strategic and long-term perspective.

A Quick Note on Data Science and its applicability to Fraud Monitoring & Detection – 

Various posts in this blog have discussed the augmented capability of financial organizations to acquire, store and process large volumes of data using commodity (x86) hardware. At the same time, technologies such as Hadoop and Spark have enabled the collection, organization and analysis of Big Data at scale. The convergence of cost effective storage and scalable processing allows us to extract richer insights from data, and these insights can then be operationalized to provide commercial and social value. Data science refers to the scientific exploration of large volumes of structured and unstructured data to extract meaningful insights, and to the construction of software systems that utilize such insights in a business context. It combines the art of discovering data insights with the science of operationalizing them. A data scientist uses a combination of machine learning, statistics, visualization, and computer science to extract valuable business insights hiding in data and builds operational systems to deliver that value. Data Science based approaches are core to the design and architecture of a Fraud Detection System. Data Mining techniques, ranging from clustering and classification to association analysis, are used to find patterns across large groups of data. The machine learning components are classified into two categories – ‘supervised’ and ‘unsupervised’ learning. These methods seek out accounts, customers, suppliers, etc. that behave ‘unusually’ in order to output suspicion scores, rules or visual anomalies, depending on the method. (Ref – Wikipedia)

It needs to be kept in mind that Data science is a cross-functional discipline. A data scientist is part statistician, part developer and part business strategist. The Data Science team collaborates with an extended umbrella team which includes visualization specialists, developers, business analysts, data engineers, applied scientists, architects, LOB owners and DevOps (ref – Hortonworks). The success of data science projects often relies on the communication, collaboration, and interaction that takes place with the extended team, both internally and possibly externally to their organizations.

Reference Architecture


Illustration 1:  Candidate Architecture Pattern for a Fraud Detection Application 

 

The key technology components of the above reference architecture stack include:

  1. Information sources are depicted at the left. These encompass a variety of machine and human actors transmitting potentially thousands of real-time messages per second. These are your typical Credit Card swipes, online transactions, fraud databases and other core Banking data.
  2. A highly scalable messaging system to help bring these feeds into the architecture as well as normalize them and send them in for further processing. Apache Kafka is chosen for this tier. Real-time data is published by Payment Processing systems over Kafka queues. Each of the transactions has hundreds of attributes that can be analyzed in real time to detect patterns of usage. We leverage Kafka's integration with Apache Storm to read one value at a time and persist the data into an HBase cluster. In a modern data architecture built on Apache Hadoop, Kafka (a fast, scalable and durable message broker) works in combination with Storm, HBase (and Spark) for real-time analysis and rendering of streaming data. Kafka has been used to message everything from geospatial data from a fleet of long-haul trucks, to financial data, to sensor data from HVAC systems in office buildings.
  3. A Complex Event Processing tier that can process these feeds at scale to understand relationships among them; the relationships among these events are defined by business owners in a non-technical language or by developers in a technical one. Apache Storm integrates with Kafka to process incoming data. Storm architecture is covered briefly in the section below.
  4. Once the machine learning models are defined, incoming data received from the Storm/Spark tier will be fed into the models to predict outlier transactions or potential fraud. When specific patterns that indicate potential fraud are met, business process workflows are triggered that follow a well defined process modeled by the business.
    • Credit card transaction data comes as stream (typically through Kafka)
    • An external system has information about the credit card holder’s recent location (collected from GPS on mobile device and/or from mobile towers)
    • Each credit card transaction is looked up against user’s current location
    • If the geographic distance between the credit card transaction location and the user’s recent known location is significant (say 100 miles), the credit card transaction is flagged as potential fraud (a minimal distance-check sketch follows Illustration 2 below)

Illustration 2: External Lookup Pattern for a Fraud Detection Application (Sheetal Dolas – Hortonworks)
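
A minimal Python sketch of the distance check at the heart of this pattern follows. The 100-mile threshold, the transaction fields and the lookup of the cardholder's last known location are assumptions for illustration; in the architecture above this check would run inside a Storm bolt or Spark job against the external location store.

```python
# A minimal sketch of the external-lookup pattern: compare the transaction location
# against the cardholder's last known location and flag the transaction when the
# distance exceeds a threshold.
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))   # Earth radius ~3959 miles

def is_suspicious(txn, last_known_location, threshold_miles=100):
    """Flag a transaction whose location is far from the user's recent location."""
    distance = haversine_miles(txn["lat"], txn["lon"],
                               last_known_location["lat"], last_known_location["lon"])
    return distance > threshold_miles

# Example: card swiped in Boston while the phone was last seen near New York
txn = {"lat": 42.36, "lon": -71.06}
last_seen = {"lat": 40.71, "lon": -74.01}
print(is_suspicious(txn, last_seen))   # True (~190 miles apart)
```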

  1. Data that has business relevance and needs to be kept for offline or batch processing can be handled using the storage platform based on the Hadoop Distributed File System (HDFS). The idea is to deploy Hadoop oriented workloads (MapReduce, or Machine Learning) to understand fraud patterns as they occur over a period of time. Historical data can be fed into Machine Learning models created in Step 1 and commingled with streaming data as discussed in Step 2.
  2. Horizontal scale-out is preferred as a deployment approach as this helps the architecture scale linearly as the loads placed on the system increase over time
  3. Output data elements can be written out to HDFS, and managed by HBase. From here, reports and visualizations can easily be constructed.
  4. One can optionally layer in search and/or workflow engines to present the right data to the right business user at the right time.  


Messaging Broker Tier

The messaging broker tier (based on Apache Kafka) is the first point of entry in a system. It fundamentally hosts a set of message queues. The broker tier needs to be highly scalable while supporting a variety of cross language clients and protocols from Java, C, C++, C#, Ruby, Perl, Python and PHP. Using various messaging patterns to support real-time messaging, this tier integrates application, endpoints and devices quickly and efficiently. The architecture of this tier needs to be flexible so as to allow it to be deployed in various configurations to connect to customized solutions at every endpoint, payment outlet, partner, or device.
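
For illustration, the sketch below publishes a card-swipe event onto this tier using the kafka-python client. The broker addresses, topic name and event fields are assumptions, not a prescribed schema.

```python
# A minimal sketch of publishing card-swipe events onto the broker tier.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092", "kafka-broker-2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "card_id": "card-123",
    "amount": 842.50,
    "merchant_id": "m-991",
    "ip": "203.0.113.7",
    "ts": 1445993554,
}

# Key by card_id so all events for one card land on the same partition,
# preserving per-card ordering for downstream window calculations.
producer.send("card-transactions", key=b"card-123", value=event)
producer.flush()
```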

Pipeline

Illustration 3: Multistage Data Refinery Pipeline for a Fraud Detection Application

Apache Storm is an open source distributed, reliable, fault-tolerant system for real-time processing of large volumes of data. Spout and Bolt are the two main components in Storm, which work together to process streams of data.

  • Spout: Acts as the source of data streams. In this use case, the Spout reads real-time transaction data from Kafka topics.
  • Bolt: The Spout passes streams of data to a Bolt, which processes them and passes them on to either a data store or another Bolt.
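
Storm itself runs on the JVM, so the snippet below is not the Storm API; it is only a plain-Python sketch of the spout-to-bolt dataflow described above, with hypothetical class and field names.

```python
# A plain-Python sketch of the spout/bolt dataflow (not the Storm API).
class TransactionSpout:
    """Stands in for a spout reading from a Kafka topic."""
    def __init__(self, source_events):
        self.source_events = iter(source_events)

    def next_tuple(self):
        return next(self.source_events, None)

class FraudFilterBolt:
    """Stands in for a bolt that scores/filters each tuple."""
    def __init__(self, threshold):
        self.threshold = threshold

    def execute(self, txn):
        if txn["amount"] > self.threshold:
            txn["flagged"] = True
        return txn

# Wire the two together the way a topology would
spout = TransactionSpout([{"card_id": "c1", "amount": 120.0},
                          {"card_id": "c2", "amount": 9400.0}])
bolt = FraudFilterBolt(threshold=5000.0)

while (txn := spout.next_tuple()) is not None:
    print(bolt.execute(txn))
```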


Illustration 4: Kafka-Storm integration

Storage Tier

There are broad needs for two distinct data tiers that can be identified based on business requirements.

  1. Some data needs to be pulled in near realtime, accessed in a low latency pattern as well as have calculations performed on this data. The design principle here needs to be “Write Many and Read Many” with an ability to scale out tiers of servers
  2. In-memory technology based on Spark is very suitable for this use case as it not only supports a very high write rate but also gives users the ability to store, access, modify and transfer extremely large amounts of distributed data. A key advantage here is that Hadoop based architectures can pool memory and scale out across a cluster of servers in a horizontal manner. Further, computation can be pushed into the tiers of servers running the data grid as opposed to pulling data into the computation tier.
  3. As the data volumes increase in size, compute can scale linearly to accommodate them. The standard means of doing so is through techniques such as data distribution and replication. Replicas are nothing but copies of the same segment or piece of data that are stored across (aka distributed) a cluster of servers for purposes of fault tolerance and speedy access. Smart clients can retrieve data from a subset of servers by understanding the topology of the grid. This speeds up query performance for tools like business intelligence dashboards and web portals that serve the business community.
  4. The second data access pattern that needs to be supported is storage for data that is older. This is typically large scale historical data. The primary data access principle here is “Write Once, Read Many.” This layer contains the immutable, constantly growing master dataset stored on a distributed file system like HDFS. Besides being a storage mechanism, the data stored in this layer can be formatted in a manner suitable for consumption from any tool within the Apache Hadoop ecosystem like Hive or Pig or Mahout.
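
As a small illustration of the low-latency, random read/write access pattern that HBase provides in this architecture, here is a sketch using the happybase Python client. The Thrift gateway host, table name, column family and row-key scheme are assumptions for illustration only.

```python
# A minimal sketch of the "write many, read many" hot tier backed by HBase.
import happybase

connection = happybase.Connection("hbase-thrift-host")   # hypothetical Thrift gateway
table = connection.table("card_txns")                    # hypothetical table

# Row key = card id + reversed timestamp keeps a card's newest swipes adjacent
row_key = b"card-123|9998554006445"
table.put(row_key, {
    b"txn:amount": b"842.50",
    b"txn:merchant": b"m-991",
    b"txn:ip": b"203.0.113.7",
})

# Low-latency random read of the same row
print(table.row(row_key))
```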

The final word [1] – 


For example, “a very bad sign” is when one account shows IP addresses from 10 parts of the world, Dr. Wang said, because it suggests the account might have been hacked.

The system tags the account for review by human experts, she said. “They might discover that the IP addresses are at airports and this guy is a pilot,” she said. Once verified, that intelligence is fed back into PayPal’s systems. Humans don’t make the system faster, but they make real-time decisions as a check against, and supplement to, the algorithms, she said.

The combination of open-source technology, online caching, algorithms and “human detectives,” she said, “gives us the best analytical advantage.”

References – 

[1] “PayPal fights Fraud With Machine Learning and Human Detectives” – From WSJ.com

http://blogs.wsj.com/cio/2015/08/25/paypal-fights-fraud-with-machine-learning-and-human-detectives/

Big Data Counters Payment Card Fraud (1/3)…

This article is the first installment in a three part series that covers one of the most critical issues facing the financial industry – Payment Card Fraud. Payment Cards include Credit, ATM & Debit Cards. This first post discusses the origin and scope of the problem. The next post will discuss a candidate Big Data Architecture that can help financial institutions turn the tables on Fraudster Networks. The final post will cover the evolving technology landscape in this sector – in the context of disruptive technology innovation in predictive & streaming analytics driven by Big Data.

“We are confronting a criminal population that continues to improve its sophistication and its attack vectors, so we can’t stand still,” says Ellen Richey, chief enterprise risk officer at Visa Inc. “You see the criminal capability evolving on the technology side,” she said. “They are getting into the systems of [Visa] stakeholders and other companies that process payments, and they are able to encrypt their own movements on networks, sometimes for months, and exfiltrate the data.” (Source – The Wall Street Journal)

Payment Card fraud has mushroomed into a massive challenge for consumers, financial institutions, regulators and law enforcement. As the accessibility and usage of Credit Cards burgeons and transaction volumes increase, Banks are losing tens of billions of dollars on an annual basis to fraudsters. The annual loss is estimated at about $189 billion by Meridian Research.

The Nilson Report  depicts the global scale of the problem as of 2015. Nilson counted the Fraud losses incurred by banks and merchants on all credit, debit, and prepaid general purpose and private label payment cards issued worldwide. These reached $16.31 billion last year when global card volume totaled $28.844 trillion. This means that for every $100 in volume, 5.65¢ was fraudulent. Fraud, which grew by 19%, outpaced volume, which grew by 15%.


                 Figure 1 – Payment Card Fraud Worldwide 2015  (source – The Nilson Report)

In  2015, fraud losses incurred by banks and merchants on all credit, debit, and prepaid general purpose and private label payment cards (worldwide) reached $16 billion while global card volume totaled almost $29 trillion[1]. This means that for every $100 in volume, almost 6¢ was fraudulent. Fraud increases (up by 19%) also handily outpaced growth in transaction volume, which grew by 15%.

The US Federal Reserve defines credit card fraud as “Unauthorized account activity by a person for which the account was not intended. Operationally, this is an event for which action can be taken to stop the abuse in progress and incorporate risk management practices to protect against similar actions in the future.”

The US leads the world in Payment Card fraud, with 48% of the total fraud occurring in the States. The problem bedevils both Card Issuers and Merchants. High profile hacks at Target, TJX Companies and Sony Pictures only serve to illustrate the scale of the challenge.


                    Figure 2 – US Share of Payment Card Fraud (source – The Nilson Report)

Types of Credit Card Fraud – 

The various categories of credit card fraud include application fraud (where an unauthorized person opens a credit card using stolen personal information), lost or stolen payment card information (misplaced or stolen card details used, typically, to make online purchases), counterfeit cards, and account takeovers. Oftentimes Credit or Payment Card fraud also involves identity theft. According to the FTC, identity theft is escalating at 40 percent a year and is particularly problematic compared with more traditional forms of financial fraud.


                                              Figure 3 – US Card Fraud by Type

As can be seen from the above pie chart, the highest amount of fraud occurs online. Organized criminal networks now resemble sophisticated and agile IT operations. Gartner reports that online fraud is 12 times more likely than offline fraud. Why is this occurring at such an alarming clip, and why now?

The FTC (Federal Trade Commission) estimates that enhanced consumer access to various forms of payments, sophisticated technology &  high speed communications make it ever easier for fraudsters.

How do Big Data and Hadoop change the game in Fraud Detection?

Banks are increasingly turning to predictive analytics to predict and prevent fraud in real-time. That can sometimes be an inconvenience for customers who are traveling or making large purchases, but it is a necessary inconvenience today if banks are to reduce billions in losses.

A recent WSJ article highlights advances made in the area of fraud detection and management at Visa by using Big Data techniques. The company estimates that its models have helped identify at least $2 billion worth of annual fraud, and have also given it the chance to address those vulnerabilities before that money was lost.

In August 2011, Visa, as one of the early pioneers, moved to an analytic platform that harnesses the power of Big Data. The term may not have been coined yet, but the idea was to tackle larger and more varied sets of transaction data using intensive algorithms, running on hardware and software that performs calculations faster and more cheaply than traditional databases or analytic engines.

Big Data is dramatically changing that approach, with advanced analytic solutions that are powerful and fast enough to detect fraud in real time while also building models from historical data (and deep learning) to proactively identify risks.

Traditional (pre-Hadoop) fraud detection systems were designed for an older era and were primarily based on Business Rules and Complex Events. However, they fall short in the following ways.

  1. Static Data Analysis vs Advanced Predictive Analytics – Traditional systems have focused on a few static factors such as known bad IP addresses, unusual login times or excessive transaction amounts. These systems are typically based on hardcoded business rules and a barebones eventing model. Advanced fraud detection systems augment that approach by building models of customer behavior at the macro level and then using those models to detect anomalous transactions and flag them as potentially fraudulent. However, the scammers have also learned to stay ahead of the scammed and are leveraging computing advances to come up with ever newer ways of cheating the banks. To accommodate larger data sets, Visa has updated its database technology. In 2010, it began using Hadoop, a software framework that is based on open-source technology from Google. It is designed to quickly process huge amounts of information from disparate sets, and to work with clusters of lower-cost machines, instead of expensive servers[1].
  2. Scope and Precision of Data Coverage – Big Data enables Banks to incorporate far more information into the decisioning process than was possible before. Per Visa[1], their earlier analytic models studied as little as 2% of transaction data. Adopting Big Data gives them completeness and a massive breadth of attributes for every transaction. Now the company says it endeavors to analyze all of its data. In the past, the company based its security assumptions on average fraud rates for merchant categories, like grocery stores. Now it said it can analyze the actual market, right down to individual merchant terminals. That allows it to drill down on hundreds of attributes, such as average authorization volumes, average ticket sizes and frequency of purchases that turn out to be fraudulent, the company said[1].
  3. Fraud Detection in Realtime – As Visa points out, the ability to analyze much larger & richer data sets helps them identify fraud more quickly – virtually in milliseconds from the time that a payment card is used. While one transaction at a merchant might not look suspicious, a data set that includes hundreds or thousands of transactions makes it easier to spot a problem, such as a tampered PIN pad. The new analytic engine can study as many as 500 aspects of a transaction at once. That’s a sharp improvement from 2005, when the company’s previous analytic engine could study only 40 aspects at once[1].
  4. Fraud Detection via Machine Learning – Big Data brings machine learning to the table. Using a variety of techniques (both supervised and unsupervised learning methods), Banks and Payment Networks can build models which can detect anomalous transactions with a very high degree of surety (a small unsupervised-scoring sketch follows this list). These models can also be updated very quickly. From [1] – instead of using just one analytic model, as it did in 2005, Visa now operates 16 models, covering different segments of its market, such as geographic regions. The models can be updated much more quickly, too. An attribute can be added to a model in as little as an hour. Back in 2005, it would take two or three days to make that happen.
  5. Big Data now supports Cyber Security – As Hadoop undergoes multiple changes and evolves to becoming a true Application Platform – an important use-case emerges – Hadoop as a framework for security analytics via frameworks like OpenSOC. We will cover the detailed architecture in the next post but being able to make big data part of technical security strategy by providing a platform for the application of anomaly detection and incident forensics to the data loss problem has particular relevance to the Payment Card Fraud problem.
    In the future, Big Data will play a bigger role in authenticating users, reducing the need for the system to ask users for multiple proofs of their identity, according to Visa's Richey, and 90% or more of transactions will be processed without asking customers those extra questions, because algorithms that analyze their behavior and the context of the transaction will dispel doubts. “Data and authentication will come together,” Richey said. The data-driven improvement in security accomplishes two strategic goals at once, according to Richey. It improves security itself, and it increases trust in the brand, which is critical for the growth and well-being of the business, because consumers won’t put up with a lot of credit-card fraud. “To my mind, that is the importance of the security improvements we are seeing,” she said. “Our investments in data and analysis are baseline to our ability to thrive and grow as a company.”[1]
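
As a toy illustration of the unsupervised side of point 4, the sketch below scores a handful of transactions for anomalies with scikit-learn's IsolationForest. The features, contamination rate and data are invented for illustration and bear no relation to Visa's actual models.

```python
# A minimal sketch of unsupervised anomaly scoring over transaction attributes.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [amount, hour_of_day, distance_from_home_miles] (illustrative features)
transactions = np.array([
    [25.0, 12, 2.0],
    [40.0, 18, 5.0],
    [32.0, 9, 1.5],
    [12.0, 20, 3.0],
    [4800.0, 3, 900.0],   # unusually large, odd hour, far from home
])

model = IsolationForest(contamination=0.2, random_state=42)
model.fit(transactions)

# -1 marks an outlier; anomalous transactions go to analysts or a rules engine
print(model.predict(transactions))        # e.g. the last row comes back as -1
print(model.score_samples(transactions))  # lower scores = more anomalous
```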

Thus, from a pure technology stack perspective, Hadoop is emerging as the best choice for fraud detection, namely because –

  1. Hadoop (Gen 2) is not just a data processing platform. It has multiple personas – a real time, streaming data, interactive platform for any kind of data processing (batch, analytical, in memory & graph based) along with search, messaging & governance capabilities built in – all of which support fraud detection architecture patterns.
  2. Hadoop provides not just massive data storage capabilities but also multiple frameworks to process the data, resulting in response times of milliseconds with the utmost reliability, whether for real-time data or historical processing of back-end data.
  3. Hadoop can ingest billions of events at scale thus supporting the most mission critical analytics irrespective of data size.
  4. From a component perspective Hadoop supports multiple ways of running models and algorithms that are used to find patterns of fraud and anomalies in the data to predict customer behavior. Examples include Bayesian filters, Clustering, Regression Analysis, Neural Networks etc. Developers have a choice of MapReduce, Spark (via Java, Python, R), Storm and SAS, to name a few, to create these models. Fraud model development, testing and deployment on fresh & historical data become very straightforward to implement on Hadoop.
  5. Hadoop provides a highly scalable NoSQL option – HBase. HBase has been proven to support near real-time ingest of billions of data streams. HBase provides near real-time, random read and write access to tables containing billions of rows and millions of columns.

Visa estimates that its models have identified $2 billion in potential annual incremental fraud, and have also given it the chance to address those vulnerabilities before that money was lost[1].

Having set the stage, the next post will present a real-world reference architecture – spanning end-to-end infrastructure and application re-architecture – for any organization that is considering a Big Data initiative in the area of fraud detection and prevention.

References

[1] http://blogs.wsj.com/cio/2013/03/11/visa-says-big-data-identifies-billions-of-dollars-in-fraud/

[2] http://www.nilsonreport.com/

Big Data – Banking’s New Weapon In War Against Financial Crime..(2/2)

According to the US treasury – “Money laundering is the process of making illegally-gained proceeds (i.e. “dirty money”) appear legal (i.e. “clean”). Typically, it involves three steps: placement, layering and integration. First, the illegitimate funds are furtively introduced into the legitimate financial system. Then, the money is moved around to create confusion, sometimes by wiring or transferring through numerous accounts. Finally, it is integrated into the financial system through additional transactions until the “dirty money” appears “clean.” (Source – Wikipedia)


                        Figure 1 – How Money Laundering works

The basic money laundering process has three steps:

  1. Placement – At the first stage, the launderer (typically a frontman such as a drug trafficker, white collar criminal, corrupt public official, terrorist or con artist) inserts the ill-gotten funds into the cash stream of a legitimate financial institution. This is typically done at a Bank using a series of small cash deposits. It can also be done at a Securities Broker by inserting funds into securities purchases.
  2. Layering – Layering is the most complex stage of an ML operation. Here the funds inserted in the first step are converted into a seemingly legitimate holding – either monetary instruments or a physical asset, e.g. real estate.
  3. Integration – At the final stage, the funds are whitewashed and brought into the legitimate economy, for example via resale of the assets purchased.

As discussed in the first post, now more than ever an efficient & scalable technology implementation underpins an effective AML compliance program. As data volumes grow due to a high number of customers across multiple channels (ATM Kiosk, Online Banking, Branch Banking & Call Center) as well as more complex transactions, more types of data are on-boarded at an increasing ingress velocity.

Thus, the challenges for IT Organizations when it comes to AML (Anti Money Laundering) are now manifold:

  1. Ingest a variety of data on the fly – ranging from Core Banking Data to Terrorist watch-lists to Fraudulent Entity information to KYC information etc
  2. The need to monitor every transaction for Money Laundering (ML) patterns as depicted in Figure 1 –  right from customer on-boarding. This sometimes resembles the proverbial needles in a haystack
  3. The ability to perform entity linked analysis that can help detect relationships across entities that could signify organized money laundering rings
  4. The need to create aggregate and individual customer personas that adjust dynamically based on business rules
  5. Integrating with a BPM (Business Process Management) engine so that the correct information can be presented to the right users as part of an overall business workflow
  6. Integrating with other financial institutions to support complex business operations such as KYCC (Know Your Customer’s Customer)
  7. Provide a way to create and change Compliance policies and procedures on the fly as business requirements evolve
  8. Provide an integrated approach to enforce compliance and policy control around business processes and underlying data as regulation gets added/modified with the passage of time
  9. Need to enable Data Scientists and Statisticians to augment classical compliance analytics with model building (e.g. Fraud Scoring) through knowledge discovery and machine learning techniques. There is a strong need to adopt a mechanism of pro-active alerting using advanced predictive analytic techniques

Existing solutions (whether developed in house or purchased off the shelf) in the AML space clearly fall behind in almost all of the above areas. In the last few years, AML has evolved into a heavily quant based computational domain not too unlike Risk Management. Traditional Compliance approaches based on RDBMS’s cannot scale with this explosion of data, nor handle the heterogeneity inherent in reporting across multiple kinds of compliance – from both a compute and a storage perspective.

So what capabilities does Hadoop add to existing RDBMS based technology that did not exist before? The short answer is depicted in the picture below.


  Figure 2 – Big Data Reference Architecture for Banking

Banking organizations are beginning to leverage Apache Hadoop to create a common cross-company data lake for data from different LOBs: mortgage, consumer banking, personal credit, wholesale and treasury banking. Internal Managers, Business Analysts, Data Scientists and, ultimately, Consumers are all able to derive immense value from the data. A single point of data management allows the bank to operationalize security and privacy measures such as de-identification, masking, encryption, and user authentication.

Banks can not only generate insights using a traditional ad-hoc querying model but also build statistical models & leverage Data Mining techniques (like classification, clustering, regression analysis, neural networks etc) to perform highly robust predictive modeling. Such models encompass the Behavioral and Realtime paradigms in addition to the traditional Batch mode – a key requirement in every enterprisewide AML initiative.

Now, from a technology perspective, Hadoop helps the Compliance projects in five major ways –

  1. Enables easy ingestion of raw data from disparate business systems that contain core banking, wealth management, risk data, trade & position data, customer account data, transaction data, wire data, payment data, event data etc.
  2. Enables cost effective long term storage of Compliance data, while allowing for the daily incremental updates of data that are the norm in AML projects. Hadoop helps store data for months and years at a much lower cost per TB compared to tape drives and other archival solutions
  3. Supports multi-tenancy from the ground up, so that different lines of business can all be tenants of the data in the lake, creating their own views of underlying base data while running analytics & reporting on top of those views.
  4.  MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Apache Hadoop YARN opened Hadoop to other data processing engines (e.g. Apache Spark/Storm) that can now run alongside existing MapReduce jobs to process data in many different ways at the same time.
  5. Hadoop supports multiple ways of running models and algorithms that are used to find patterns of fraud and anomalies in the data to predict customer behavior. Users have a choice of MapReduce, Spark (via Java,Python,R) etc and SAS to name a few. Compliance model development, testing and deployment on fresh & historical data become very straightforward to do on Hadoop.


  Figure 3 – Big Data Reference Architecture for AML Compliance

One of the first tenets of the above architecture is to eliminate silos of information by consolidating all feeds from the above source systems into a massively scalable central repository, known as a data-lake, running on the HDFS (Hadoop Distributed File System). Data Redundancy (which is a huge problem at many institutions) is thus eliminated.

The flow of data is from left to right as depicted above and explained below –

1) Data Ingestion: This encompasses creation of the loaders to take in data from the above source systems. Hadoop provides multiple ways of ingesting this data, which makes it an extremely useful solution to have in a business area with heterogeneous applications that each have their own data transfer paradigms.

The most popular ones include –

  1. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as HP Vertica, Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB; this list of certified & fully supported connectors grows with every release of Hadoop. Sqoop provides a range of options & flags that describe how data is to be retrieved from a relational system, which connector to use, how many map tasks to use, split patterns, and final file formats. The one limitation, however, is that Sqoop (which gets translated into MapReduce internally) is essentially a batch process. Let us look at other modes of message delivery which need faster-than-batch processing, i.e. streaming mode.
  2. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.
  3. Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers like JMS and AMQP because of its higher throughput, reliability and replication. Kafka works in combination with Storm, HBase and Spark for real-time analysis and rendering of streaming data.
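
As a small illustration of the streaming leg of ingestion, the sketch below consumes transaction events from a Kafka topic with the kafka-python client and micro-batches them toward the lake. The topic name, consumer group and batch size are assumptions; in practice the hand-off to HDFS would typically go through a connector, Flume, or a Spark/Storm job.

```python
# A minimal sketch of consuming transaction events bound for the data lake.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "core-banking-txns",                       # hypothetical topic
    bootstrap_servers=["kafka-broker-1:9092"],
    group_id="aml-ingest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 1000:
        # In practice this batch would be written to HDFS (e.g. as Parquet/Avro);
        # here we only illustrate the hand-off point.
        print(f"flushing {len(batch)} events to the lake")
        batch.clear()
```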

Developing the ingest portion will be the first step in realizing the overall AML architecture, as timely data ingestion is a large part of the problem at most institutions. Part of this process includes understanding examples of a) data ingestion from the highest-priority systems and b) applying the correct governance rules to the data. The goal is to create these loaders for versions of different source systems and to maintain them as part of the platform moving forward. The first step is to understand the range of Book of Record transaction systems (lending, payments and transactions) and the feeds they send out. The goal would then be to map the loaders to a release of an enterprise-grade Open Source Big Data Platform, e.g. HDP (Hortonworks Data Platform), so these can be maintained as part of the implementation going forward.

 Quick note on AML Projects, Hadoop & ETL –

It is important to note that ETL (Extract, Transform and Load) platforms & tools are very heavily used in financial services, especially in the AML space. It then becomes very important to clarify that the above platform architecture does not look to replace these tools at the outset. The approach is incremental, beginning with integration using certified adapters that are developed & maintained by large vendors. These adapters enable developers to build new Hadoop applications that exchange data with ETL platforms in a bi-directional manner. Examples of these tools include IBM DataStage, Pentaho, Talend, Datameer, Oracle Data Integrator etc.

2) Data Governance & Metadata Management: These are the loaders that apply the rules to the critical fields for AML Compliance. The goal here is to look for gaps in data integrity and any obvious quality problems involving range or table driven data. The purpose is to facilitate data governance reporting as well as to achieve common data definitions across the compliance area. Right from the point that data is ingested into the data-lake, Hadoop maintains a rich set of metadata about where each piece of raw data was ingested from, what transformations were applied to them, what roles of users could operate on the data based on a rich set of ACL (Access Control Lists) of permissions etc. Apache Atlas is a project that helps not just with the above requirements but can also export this metadata to tools like SAS, BI, Enterprise Data Warehouses etc so that these toolsets can leverage this data to create execution plans to best access data in HDFS.

3) Entity Identification: This is the establishment and adoption of a lightweight entity ID service. The service will consist of entity assignment and batch reconciliation. The goal here is to get each target bank to propagate the Entity ID back into their booking and payment systems; transaction data will then flow into the lake with this ID attached, providing a way to build a Customer 360 view.
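
A minimal sketch of one possible entity-ID assignment scheme follows: normalize a few identifying attributes and hash them, so the same customer resolves to the same ID regardless of which source system supplied the record. The chosen attributes are illustrative; a real service would add survivorship rules, fuzzy matching and the batch reconciliation mentioned above.

```python
# A minimal sketch of deterministic entity-ID assignment.
import hashlib

def entity_id(name, date_of_birth, national_id):
    """Derive a stable ID from normalized identifying attributes (illustrative)."""
    normalized = "|".join(
        part.strip().lower() for part in (name, date_of_birth, national_id)
    )
    return "ENT-" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

# The same person keyed identically from two different source systems
print(entity_id("Jane Q. Public ", "1980-02-14", "XYZ123"))
print(entity_id("jane q. public", "1980-02-14", "XYZ123"))   # identical ID
```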

4) Data Cleansing & Transformation: Once the data is ingested into the lake, all transformation and analysis will happen on HDFS. This will involve defining the transformation rules that are required in the Compliance area to prep, encode and transform the data for its specific processing. Hadoop again provides a multitude of options from a transformation perspective – Apache Spark, MapReduce, Storm, Pig, Hive, Crunch, Cascading etc.

  1. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, and GraphX for graph processing. Sophisticated analysis (and calculations) can easily be implemented using Spark.
  2. MapReduce is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS).
  3. Apache Hive is the defacto standard for interactive SQL queries over petabytes of data in Hadoop. With the completion of the Stinger Initiative, and the next phase of Stinger.next, the Apache community has greatly improved Hive’s speed, scale and SQL semantics. Hive easily integrates with other BI & Analytic technologies using a familiar JDBC interface.
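
As an illustration of a cleansing and transformation step on the lake, here is a small PySpark sketch. The input feed, column names and encoding rules are hypothetical.

```python
# A minimal sketch of a cleansing/transformation step over a raw feed in the lake.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aml-cleanse-sketch").getOrCreate()

raw = spark.read.json("hdfs:///lake/raw/wire_transfers")   # hypothetical feed

cleansed = (
    raw
    .dropDuplicates(["txn_id"])                              # remove resent records
    .filter(F.col("amount").isNotNull() & (F.col("amount") > 0))
    .withColumn("currency", F.upper(F.trim(F.col("currency"))))
    .withColumn("txn_date", F.to_date(F.col("txn_ts")))      # normalize timestamps
)

# Write the conformed data back for downstream analytics and reporting
cleansed.write.mode("overwrite").parquet("hdfs:///lake/conformed/wire_transfers")
```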

5) Analytic Definition: Defining & executing the analytics that are to be used for each risk and compliance area. These analytics span the gamut from ad hoc queries, to predictive models (written in SAS/R/Python), to fuzzy logic matching, to customer segmentation (across both macro & micro populations) etc.
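
As a toy example of the fuzzy-logic matching mentioned above, the sketch below compares customer names against a watch-list using the Python standard library's SequenceMatcher. The names and the similarity threshold are invented for illustration; production systems would use richer entity-resolution techniques.

```python
# A minimal sketch of fuzzy name matching against a watch-list.
from difflib import SequenceMatcher

watchlist = ["Ivan Petrov", "Acme Shell Holdings Ltd", "Jon A. Smith"]

def best_watchlist_match(name, threshold=0.85):
    """Return (entry, score) for the closest watch-list entry above threshold, else None."""
    scored = [(entry, SequenceMatcher(None, name.lower(), entry.lower()).ratio())
              for entry in watchlist]
    entry, score = max(scored, key=lambda x: x[1])
    return (entry, score) if score >= threshold else None

print(best_watchlist_match("Ivan Petroff"))      # likely a hit
print(best_watchlist_match("Jane Doe"))          # None
```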

6) Report Definition: This is the stage where Reporting and Visualization take over. Defining the reports that are to be issued for each risk and compliance area as well as creating interesting & context sensitive visual BI are the key focus here. These could run the gamut from a BI tool to a web portal that gives internal personnel and regulators a quick view of a customer or an entity’s holistic pattern of behavior. The key is to provide views of macro aggregates (i.e normal behavior for a customer persona) as well as triggering transactions for a single entity whether a retail customer or an institution. The wide variety of tools depicted at the top in Figure 2 all integrate well with the Hadoop Data Lake including MicroStrategy, Tableau, QlikView et al.

Thus – If done right, Hadoop can form a strong backbone of an effective AML program.

Big Data – Banking’s New Weapon In War Against Financial Crime..(1/2)


Big data technologies led by Apache Hadoop can help financial services firms comply with a myriad of regulations, including US Anti-Money Laundering (AML) laws and requirements.

In 2015, it goes without saying that Banking is an increasingly complex as well as a global business. Leading North American Banks now generate a large amount of revenue in Global markets and this is generally true of all major worldwide banks. Financial crime is a huge concern for banking institutions given the complexity of the products they offer their millions of customers, large global branch networks and operations spanning the spectrum of financial services. The BSA (Bank Secrecy Act) requires U.S. financial institutions to assist U.S. government agencies to detect and prevent money laundering. Specifically, the act requires financial institutions to keep records of cash purchases of negotiable instruments, to file reports of cash transactions exceeding $10,000 (daily aggregate amount), and to report suspicious activity that might signify money laundering, tax evasion, or other criminal activities. It was passed by the United States Congress in 1970. After the terrorist attacks of 2001, the US Patriot Act was passed into law by Congress. The Patriot Act augments the BSA with Know Your Customer (KYC) legislation which mandates that Banking institutions be completely aware of their customer’s identities and transaction patterns with a view of monitoring account activity.

Thus, from a regulatory perspective, banks and other financial institutions now need to comply with legislation that governs financial crimes under the umbrella Anti Money Laundering (AML) legislation which covers a wide variety of compliance requirements. These must be set in place from a data management, process, culture and IT perspective. AML legislation has now expanded to include tax evasion as well with the advent of the FATCA (Foreign Account Tax Compliance Act). Thus, the important sections of the act that institutions must comply with include the BSA (Bank Secrecy Act), KYC (Know Your Customer) and the FATCA.

The US Government also formed the FinCEN (Financial Crimes Enforcement Network) in 1990 as the primary enforcing authority that collects and analyzes transactions flowing through the system to detect AML violations.

Financial institutions are required to file FinCEN SARs (Suspicious Activity Reports). Dealing with financial crimes also provides a significant social benefit in that drug money & other ill-gotten funds do not get laundered into the system.

For every transaction flowing through the retail banking system – Banks, BHC’s (Bank holding companies), and their subsidiaries are required by federal regulations to file a SAR with respect to:

  • Criminal violations involving insider abuse in any amount.
  • Criminal violations aggregating $5,000 or more when a suspect can be identified.
  • Criminal violations aggregating $25,000 or more regardless of a potential suspect.
  • Transactions conducted or attempted by, at, or through the bank (or an affiliate) and aggregating $5,000 or more, if the bank or affiliate knows, suspects, or has reason to suspect that the transaction:
    • May involve potential money laundering or other illegal activity (e.g., terrorism financing).
    • Is designed to evade the BSA or its implementing regulations.
    • Has no business or apparent lawful purpose or is not the type of transaction that the particular customer would normally be expected to engage in, and the bank knows of no reasonable explanation for the transaction after examining the available facts, including the background and possible purpose of the transaction.

A transaction includes a deposit; a withdrawal; a transfer between accounts; an exchange of currency; an extension of credit; a purchase or sale of any stock, bond, certificate of deposit, or other monetary instrument or investment security; or any other payment, transfer, or delivery by, through, or to a bank.

(Source – FFEC Online Manual for Bank Secrecy Act enforcement)
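
As a toy illustration of the $5,000 aggregation trigger above, the sketch below totals a day's transactions per customer and surfaces those that cross the threshold where suspicion exists. The data and the suspicion test are invented; real monitoring would evaluate the full set of SAR criteria across channels.

```python
# A minimal sketch of an aggregate SAR-threshold check.
from collections import defaultdict

SAR_AGGREGATE_THRESHOLD = 5000.00

transactions = [
    {"customer": "c-100", "amount": 2400.00, "suspicious_pattern": True},
    {"customer": "c-100", "amount": 2900.00, "suspicious_pattern": True},
    {"customer": "c-200", "amount": 9000.00, "suspicious_pattern": False},
]

totals = defaultdict(float)
suspect = defaultdict(bool)
for t in transactions:
    totals[t["customer"]] += t["amount"]
    suspect[t["customer"]] |= t["suspicious_pattern"]

for customer, total in totals.items():
    if total >= SAR_AGGREGATE_THRESHOLD and suspect[customer]:
        print(f"review for SAR filing: {customer} aggregated ${total:,.2f}")
```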

In the Capital markets space, FINRA (the Financial Industry Regulatory Authority) regulates the broker dealer industry and deploys hundreds of professional examiners to look for any suspicious activity across the entire range of traded instruments. Wall Street's focus on compliance dates back to 2003, following the passage of the Patriot Act. Since then Global Banks have put into place strong compliance functions to monitor their customers, bank accounts and transactions.

Implementing and re-engineering AML processes has been a focus for banks, especially as they adopt technologies around Enterprise Middleware, Cloud, Analytics and Big Data. As Banking is increasingly an omni-channel world, compliance architectures need to be able to adapt not just to Branch & ATM banking but also to Mobile, Call Center, IoT etc.

Technology underpins an effective compliance program. As data volumes grow and more types of data are on-boarded, the challenges for IT Organizations when it comes to AML are manifold:

  1. The need to monitor every transaction for fraudulent activity, such as money laundering, beginning right from customer on-boarding i.e looking for needles in a haystack
  2. The ability to glean insight from existing data sources as well as integrating new volumes of data from unstructured or semi structured feeds; and to achieve this in a world full of data silos
  3. The need to create aggregate and individual customer personas that adjust dynamically based on business rules
  4. Presenting information that matters to the right users as part of a business workflow
  5. Integrating with other financial institutions to support complex business operations such as KYCC (Know Your Customers Customer)
  6. Provide a way to create and change such policies and procedures on the fly as business requirements evolve
  7. Provide an integrated approach to enforce compliance and policy control around business processes and underlying data as more regulation gets added with the passage of time
  8. There is a need to enable Data Scientists and Statisticians to augment classical compliance analytics with model building (e.g. Fraud Scoring) through knowledge discovery and machine learning techniques. There is a strong need to adopt a mechanism of pro-active alerting using advanced predictive analytic techniques

Existing solutions in the AML space clearly fall behind in almost all of the above areas. AML has evolved into a heavily quant based computational domain not too unlike Risk Management. Traditional Compliance algorithms cannot scale with this explosion of data, nor with the heterogeneity inherent in reporting across multiple kinds of compliance.

The definition of Financial Crimes is fairly broad & encompasses a large area of definition – the traditional money laundering activity, financial fraud like identity theft/check fraud/wire fraud, terrorist activity, tax evasion, securities market manipulation, insider trading and other kinds of securities fraud. Financial institutions across the spectrum of the market now need to comply with the regulatory mandate at both the global as well as the local market level.

Finally, the harm done to a financial institution's reputation is immeasurable, especially if it has been sanctioned or otherwise penalized by the regulatory authorities for failure to institute appropriate supervisory guidelines. It is important to note that effectively tackling and implementing AML guidelines does not in itself provide a source of competitive advantage; rather, it needs to be done as a price of entry into the business of banking.

The next post will examine how Hadoop is proving to be the perfect platform in solving Compliance related business challenges. We will examine a real world reference architecture that can serve as a strong basis for any Global Scale Compliance Regime.


Leverage Open Source to Defend and Disrupt..


(Image Credit – Larry Putterman)

What Mark Zuckerberg (Facebook’s CEO) worries about the most is the lack of change, the lack of innovation, becoming the innovator’s dilemma company that gets big and stops moving and stops staying ahead. – Sheryl Sandberg (Facebook – COO)

I have found myself spending the vast majority of my career working with a range of marquee financial services, healthcare, business services & Telco clients. A vast percentage of these strategic discussions have centered around business transformation, enterprise architecture and overall strategy around Open Source initiatives & technology. These range from Middleware to BPM to Cloud Computing (IaaS/PaaS/SaaS) to DevOps practices.

In the last few years, more and more of those discussions have been focused on Cloud Computing, DevOps, Mobility & Big Data. We are at an inflection point; there is now an emerging sense of urgency in mainstream Financial Services organizations to create and expand their Open Source strategy.

Since the global economic crisis of 2008, seven years of steady economic growth and rising stock market indices have conferred a sense of prosperity on Banking while simultaneously increasing complexity across product lines. On the other hand, seismic advances in computing (namely in Big Data, Social, Cloud, Mobility & Analytics) and their convergence have increasingly begun to highlight technology-led innovation as the core strategic differentiator in sorting the winners from the laggards. With the emergence of FinTechs and their intent to dis-intermediate Banking business models, staid Banking increasingly resembles Silicon Valley.

In his pathbreaking “The Innovator’s Solution”, Clayton Christensen codifies the significant difference between sustaining and disruptive innovations. Sustaining innovations produce evolutionary improvements – tactical gains in cost, performance & features. Disruptive innovations create and operate in nascent markets, which incumbents find difficult to enter or engage with due to their legacy burden in terms of both culture and IT. Disruptive innovations typically attack the lower end of the market and work their way up.

While Open Source remains somewhat of an unknown quantity to the mass middle-market enterprise, it also represents a tremendous opportunity for Banks & FinTechs across the spectrum of Financial Services. As one examines business imperatives & use-cases across the seven key segments (Retail & Consumer Banking, Wealth Management, Capital Markets, Insurance, Credit Cards & Payment Processing, Stock Exchanges and Consumer Lending), it is clear that SMAC (Social, Mobile, Analytics, Cloud and Data) stacks can not only satisfy existing use-cases – in terms of cost and business requirements across that spectrum – but also help adopters build out Blue Oceans (i.e. new markets).

Segments of open source include the Linux OS, open source Middleware, Databases and the Big Data ecosystem. Technologies like these have disrupted proprietary, closed source products ranging from popular UNIX variants to Application Platforms, EDWs and RDBMSs. The rise of Open Standards and Open APIs has been the catalyst in this immense disruption.

Why are Open Source technologies becoming so popular & ubiquitous? I can think of eight key reasons –

  1. They are designed from the ground up to be highly performant & scalable, which dictates a minimalist design
  2. They have been incubated in Open Communities with thousands of contributors, and thus support a wealth of deployment options – On Premise/Cloud/Virtual/Hybrid etc.
  3. They’re Public and Private Cloud ready from the ground up; you dictate where and how to run them – the vast majority are Cloud, OS, hypervisor and language agnostic
  4. They are built with co-existence in mind as they subscribe to open standards (where applicable); they avoid feature competition and instead focus on what 80% of the market needs, at a much lower cost. Most proprietary products now offer extensive integration with complementary open source technologies
  5. Developers love Open Source projects as they’re a snap to install, a breeze to configure, easy to operate at scale & increasingly provide flexible monitoring frameworks from which a range of business metrics can easily be gleaned (see the short sketch after this list)
  6. They dramatically lower the bar to innovation and have open roadmaps that encourage corporate participation via co-development initiatives
  7. They enable communities of the under-served (customers and individuals) to design and launch revolutionary products & disruptive services that they would not otherwise have been able to build while working with established vendors – due to considerations of cost, zero roadmap influence, security issues and the lack of deployment flexibility imposed by bundling regimes (Oracle Cloud anyone?)
  8. Open source roadmaps often lead to feature-rich offerings that evolve over time. I have seen short-sighted technology decisions made in favor of proprietary stacks that were eventually outstripped in functionality, scalability & performance by the open source alternative
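
Point 5 above is easy to make concrete. The sketch below shows how a handful of lines can pull an operational metric out of an open source monitoring interface; the endpoint URL and the counter name transactions_processed_total are purely illustrative assumptions, not tied to any specific product.

```python
# Minimal sketch of scraping a Prometheus-style text metrics endpoint.
# Assumptions (illustrative only): a service exposes plain-text metrics at
# http://localhost:8080/metrics and publishes a counter named
# 'transactions_processed_total'; the 'requests' library is installed.
import requests

def fetch_metric(name, url="http://localhost:8080/metrics"):
    """Return the value of one metric from a Prometheus-style text endpoint."""
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    for line in response.text.splitlines():
        # Sample lines look like: metric_name{label="x"} 123.0
        if line.startswith(name + " ") or line.startswith(name + "{"):
            return float(line.rsplit(" ", 1)[-1])
    return None

if __name__ == "__main__":
    print("transactions processed:", fetch_metric("transactions_processed_total"))
```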


Banks leveraging & building solutions around Cloud, Analytics & Big Data (Hadoop, NoSQL, Data Science and other complementary technologies) see a panoply of benefits across a range of areas, which we have cataloged fairly comprehensively in various posts.

The enlightened architectures built in support of these business requirements improve business processes to make them more customer-journey focused, while laying robust & scalable security foundations.

One could argue that in the web-scale space, the Fab Four (Google, Facebook, Amazon & Apple) have taken this to another extreme by creating and building significant open source projects & communities that support their robust platforms (as opposed to standalone or loosely federated applications). This approach has contributed to their outstanding business success. The benefit to the overall industry has been the creation of (open source) software technologies that enable business systems to operate at massive scale – billions of users across millions of systems. They have done all this while constantly churning out innovative offerings and continuously adapting & learning from customer feedback. These four are followed by two other new-age giants – Netflix & LinkedIn. To further illustrate, the Netflix stack is one of the primary open source projects enabling the creation of micro-services & loosely coupled applications at scale – all running on Amazon AWS, yet another web-scale innovation.

Now, I typically like to bucket applications of technology to industry verticals into three broad (and somewhat simplistic) categories, which may or may not follow a progression, based on the culture & technology innovation in any given shop –

Category 1 –

This represents the entry point for a financial services organization in considering open source technology. Players in this stage are primarily focused on achieving lower TCO (Total Cost Of Ownership) & higher operational efficiency.

Here, emerging investments in the SMAC (Social, Mobile, Analytics, Cloud and Data) areas augment & eventually supplant existing, expensive legacy technology investments with cheaper, more efficient (and typically open source) alternatives.

Indeed one would be hard pressed to find a real world enterprise which did not have an open source strategy to offload some of the routine workloads (email, web, database etc) that make up mundane IT.

Category 2 –

In the second category, these investments do not merely address broad IT costs; they become crucial to meeting business requirements. Classic examples include the RFC (Risk, Fraud & Compliance) areas & Information Security – the areas that keep the CIO up at night. Spend in this category is important, but it is secondary to meeting the business requirement. IT investments become more and more strategic here, and the vendor moves from being a supplier to being a business partner.

Category 3 –

This is the realm of disruption and blue oceans, where inefficient markets become efficient or are transformed outright by new technology. In established enterprises, these business ideas range from the small (leveraging vast amounts of data & analytics to uncover customer engagement opportunities) to the medium impact (creating new business models based on Data Products) to the Big Bet (spinning off promising new lines of business). In Financial Services, FinTech-led innovations in areas ranging from Smart Trading to Robo-Advisors to Mobile Payments have all been predicated on creating & managing a new kind of customer.

To recap, some of the benefits of going Open Source –

  1. Create enormous Business Value in areas as diverse as the Defensive (Risk, Fraud and Compliance – RFC), Competitive Parity (e.g. Single View of Customer) and the Offensive (Digital Transformation across the Retail Banking business)
  2. Drastically Reduced Time to Market for new business projects
  3. Hugely Improved Quality & Access to information & realtime analytics for customers, analysts and other stakeholders
  4. Huge Reduction in CapEx & OpEx spend on IT projects & initiatives

Either an organization and its key leaders adopt a disruptive mindset, or they can expect to be disrupted. Which category does your organization belong to? That is the quintessential IT question we all have to grapple with in this Brave New Age. All said and done, if there is one key takeaway I want to leave readers with, it is this – neglect Open Source at your peril.