The Definitive Reference Architecture for Market Surveillance (CAT, UMIR and MiFID II) in Capital Markets..

We have discussed market surveillance reporting in some depth in previous blogs. Over the last decade, global financial markets have embraced high speed electronic trading, a trend that has only accelerated with the concomitant explosion in trading volumes. The diverse range of instruments & the proliferation of trading venues pose massive regulatory challenges in the area of market conduct supervision and abuse prevention. Banks, broker dealers, exchanges and other market participants across the globe are now paying millions of dollars in fines for failing to accurately report on market abuse violations. In response to this complex world of high volume & low touch electronic trading, capital markets regulators have been hard at work across jurisdictions & global hubs, e.g. FINRA in the US, IIROC in Canada and ESMA in the European Union. Regulators have created extensive reporting regimes for surveillance with a view to detecting suspicious patterns of trade behavior (e.g. dumping, quote stuffing & non bona fide orders). The intent is to increase market transparency on both the buy and the sell side. Given the scrutiny capital markets players are under, a Big Data analytics based architecture has become a “must-have” to ensure timely & accurate compliance with these mandates. This blog discusses such a reference architecture.

Business Technology Requirements for Market Surveillance..

The business requirements for the Surveillance architecture are covered at the below link in more detail but are reproduced below in a concise fashion.

A POV on European Banking Regulation.. MAR, MiFiD II et al

Some of the key business requirements that can be distilled from regulatory mandates include the below:

  • Store heterogeneous data – Both MiFID II and MAR mandate the need to perform trade monitoring & analysis on not just real time data but also historical data spanning a few years. Among others, this will include data feeds from a range of business systems – trade data, eComms, aComms, valuation & position data, order management systems, position management systems, reference data, rates, market data, client data, front, middle & back office data, voice, chat & other internal communications etc. To sum up, the ability to store a range of cross asset (almost all kinds of instruments), cross format (structured & unstructured including voice), cross venue (exchange, OTC etc) trading data at a higher degree of granularity is key.
  • Data Auditing – Such stored data needs to be fully auditable for 5 years. This implies not just being able to store it but also putting in place strict governance & audit trail capabilities.
  • Manage a huge volume increase in data storage requirements (5+ years) due to extensive Record keeping requirements
  • Perform Realtime Surveillance & Monitoring of data – Once data is collected, normalized & segmented, the platform needs to support realtime monitoring (with latencies of around 5 seconds) so that every trade can be tracked through its lifecycle. Detecting patterns that point to market abuse and monitoring for best execution are key.
  • Business Rules – The core logic that identifies some of the above trade patterns is created using business rules. Business Rules have been covered in various areas in the blog but they primarily work based on an IF..THEN..ELSE construct.
  • Machine Learning & Predictive Analytics – A variety of supervised and unsupervised learning approaches can be used to perform extensive behavioral modeling & segmentation of transactions, with a view to identifying behavioral patterns of traders & any outlier behaviors that connote potential regulatory violations.
  • A Single View of an Institutional Client – From the firm’s standpoint, it would be very useful to have a single view capability for clients that shows all of their positions across multiple desks, risk position, KYC score etc.
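To make the business rules and realtime monitoring requirements above a little more concrete, here is a minimal sketch of an IF..THEN..ELSE rule evaluated over a sliding time window. It flags a possible quote stuffing pattern when a trader's order-to-trade ratio inside the window exceeds a limit. The `Event` fields, the `QuoteStuffingRule` name, the 5 second window and the ratio of 50 are all illustrative assumptions, not values taken from any regulation or product.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    ts: float      # event time in seconds
    trader: str
    kind: str      # "order" or "trade"


class QuoteStuffingRule:
    """Hypothetical rule: alert when orders per trade in a window exceed a limit."""

    def __init__(self, window_secs=5.0, max_ratio=50.0):
        self.window_secs = window_secs
        self.max_ratio = max_ratio
        self.events = {}  # trader -> deque of that trader's recent events

    def evaluate(self, event):
        window = self.events.setdefault(event.trader, deque())
        window.append(event)
        # Evict events that have fallen out of the sliding window
        while window and event.ts - window[0].ts > self.window_secs:
            window.popleft()
        orders = sum(1 for e in window if e.kind == "order")
        trades = sum(1 for e in window if e.kind == "trade")
        ratio = orders / max(trades, 1)  # avoid division by zero
        # IF the ratio breaches the limit THEN alert ELSE let the event pass
        if ratio > self.max_ratio:
            return (f"ALERT {event.trader}: {orders} orders vs "
                    f"{trades} trades in {self.window_secs}s")
        else:
            return None
```

In a real deployment a rule like this would more likely live in a rules engine or a stream processor; the point here is only the shape of the logic.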

A Reference Architecture for Market Surveillance ..

This reference architecture aims to provide generic guidance to banking Business IT Architects building solutions in the realm of Market & Trade Surveillance. It supports a host of hugely important global reg reporting mandates (CAT, MiFID II, MAR etc) that Capital Markets firms need to comply with. While the concepts discussed in this solution architecture are definitely Big Data oriented, they are largely agnostic to any cloud implementation – private, public or hybrid.

A Market Surveillance system needs to include both real time surveillance of trading activity as well as a retrospective (batch oriented) analysis component. The real time component includes the ability to perform realtime calculations (concerning thresholds, breached limits etc) and real time queries with the goal of triggering alerts. Both these kinds of analytics span structured and unstructured data sources. For the batch component, the analytics range from data queries and simple to advanced statistics (min, max, avg, std deviation, sorting, binning, segmentation) to running data science models involving text analysis & search etc.
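As an illustration of the simple end of the batch analytics mentioned above, the sketch below computes min, max, average and standard deviation over a list of trade notionals, bins them into fixed width buckets, and flags values that sit far from the mean. The function names, the bin width and the two standard deviation cut-off are assumptions made for this example; a production system would run the equivalent logic on a distributed engine rather than on in-memory lists.

```python
import statistics
from collections import Counter


def summarize(notionals, bin_width):
    """Return basic descriptive statistics and a histogram of trade sizes."""
    # Each value is assigned to the lower edge of its fixed-width bin
    bins = Counter(int(n // bin_width) * bin_width for n in notionals)
    return {
        "min": min(notionals),
        "max": max(notionals),
        "avg": statistics.mean(notionals),
        "std": statistics.pstdev(notionals),
        "bins": dict(sorted(bins.items())),
    }


def breaches(notionals, k=2.0):
    """Flag values more than k population standard deviations from the mean."""
    mean = statistics.mean(notionals)
    std = statistics.pstdev(notionals)
    return [n for n in notionals if std > 0 and abs(n - mean) > k * std]
```

For example, `breaches([100, 120, 110, 130, 90, 1000])` flags only the 1000 trade, which could then be handed to the alerting path.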

The system needs to process tens of millions to billions of events in a trading window while providing highest uptime guarantees. Batch analysis is always running in the background.

A Hadoop distribution that includes components such as Kafka and HBase, along with near real time components such as Storm & Spark Streaming, provides a good fit for a responsive architecture. Apache NiFi, with its ability to ingest data from a range of sources, is preferred for its support for complex data routing, transformation, and system mediation logic in a complex event processing architecture. The capabilities of Hortonworks Data Flow (the enterprise version of Apache NiFi) are covered in much detail in the below blogpost.

Use Hortonworks Data Flow (HDF) To Connect The Dots In Financial Services..(3/3)

A Quick Note on Data Ingestion..

Data volumes in the area of Regulatory reporting can be huge to insanely massive. For instance, at large banks, they can go up to 100s of millions of transactions a day. At market venues such as stock exchanges, they easily enter into the hundreds of billions of messages every trading day. However the data itself is extremely powerful & is really business gold in terms of allowing banks to not just file mundane reg reports but also to perform critical line of business processes such as Single View of  Customer, Order Book Analysis, TCA (Transaction Cost Analysis), Algo Backtesting, Price Creation Analysis etc. The architecture thus needs to support multiple ways of storage, analysis and reporting ranging from compliance reporting to data scientists to business intelligence.

Real time processing in this proposed architecture is powered by Apache NiFi. There are five important reasons for this decision – 

  • First of all, complex rules can be defined in NiFi in a very flexible manner. As an example, one can execute SQL queries in processor A against incoming data from any source (data that isn’t from a relational database but in JSON, Avro etc.) and then route different results to different downstream processors based on the processing needs, enriching the data along the way. E.g. Processor A could be event driven and if any data is being routed there, a field can be added, or an alert sent to XYZ. Essentially this can be very complex, equivalent to a nested rules engine so to speak. 
  • From a throughput standpoint, a single NiFi node can typically handle somewhere between 50MB/s and 150MB/s depending on your hardware spec and data structure. Assuming average message sizes of 100-500 kbytes and a target throughput of 600MB/s, the architecture can be sized to about 5-10 NiFi nodes. It is important to note that the latency of inbound message processing depends on the network and can be extremely small. Under the hood, you are sending data from the source to a NiFi node (disk), extracting some attributes in memory for processing, and delivering to the target system.
  • Data quality can be handled via the aforementioned “nested rules engine” approach, consisting of multiple NiFi processors. One can even embed an entire rules engine into a single processor. Similarly, you can define simple authentication rules at the event level. For instance, if Field A = English, route the message to an “authenticated” relationship; otherwise send it to an “unauthenticated” relationship.

  • One of the cornerstones of NiFi is “Data Provenance“, which gives you end to end traceability. Not only can the event lifecycle of trade data be traced, but you can also track the time at which a change happened, the user role that made the change and metadata around why it happened.

  • Security – NiFi enables authentication at ingest. One can authenticate data via the rules defined in NiFi, or leverage target system authentication which is implemented at processor level. For example, the PutHDFS processor supports kerberized HDFS, the same applies for Solr and so on.
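As a rough check on the sizing arithmetic in the throughput point above, the following sketch divides a target throughput by an assumed per-node rate and rounds up. The function name and the `headroom` parameter are hypothetical; the 50-150 MB/s per-node range and the 600 MB/s target are the figures quoted above, and real capacity should be confirmed by benchmarking on your own hardware and data.

```python
import math


def nifi_nodes_needed(target_mb_per_s, per_node_mb_per_s, headroom=0.0):
    """Nodes required to sustain a target throughput, with optional spare capacity."""
    required = target_mb_per_s * (1.0 + headroom)
    return math.ceil(required / per_node_mb_per_s)


# Bounds implied by the 50-150 MB/s per-node range
best_case = nifi_nodes_needed(600, 150)   # every node at the top of the range
worst_case = nifi_nodes_needed(600, 50)   # every node at the bottom of the range
```

With these inputs the bounds come out at 4 and 12 nodes, so the 5-10 node estimate corresponds to nodes sustaining roughly 60-120 MB/s each.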

Overall Processing flow..

The below illustration shows the high-level conceptual architecture. The architecture is composed of core platform services and application-level components to facilitate the processing needs across three major areas of a typical surveillance reporting solution:

  • Connectivity to a range of trade data sources
  • Data processing, transformation & analytics
  • Visualization and business connectivity
Illustration – Reference Architecture for Market Surveillance Reg Reporting – CAT, MAR, MiFID II et al

The overall processing of data follows the order listed below, as depicted in the diagram –

  1. Data Production – Data related to Trades and their lifecycle is produced from a range of business systems. These data feeds include (but are not limited to) trade data, valuation & position data, order management systems, position management systems, reference data, rates, market data, client data, front, middle & back office data, voice, chat & other internal communications etc.
  2. Data Ingestion – Data produced from the above layer is ingested using Apache NiFi from the range of sources described above. Data can also be filtered and alerts can be set up based on complex event logic. For time series data support, HBase can be leveraged along with OpenTSDB. For CEP requirements, such as sliding windows and complex operators, NiFi can be leveraged along with a Kafka and Storm pipeline. Using NiFi makes it easier to load data into the data lake while applying guarantees around the delivery itself. Data can be streamed in real time as it is created in the feeder systems. Data is also loaded at the end of the trading day based on the P&L sign off and the end of day close processes. The majority of the data will be fed in from Book of Record Trading systems as well as from market data providers.
  3. As trade and other data is ingested into the data lake, it is important to note that the route by which certain streams are processed will differ from how other streams are processed. Thus the ingest architecture needs to support multiple types of processing, ranging from in memory processing to intermediate transformation processing on certain data streams to produce a different representation of the stream. This is where NiFi adds critical support in not just handling a huge transaction throughput but also enabling “on the fly processing” of data in pipelines. As mentioned, NiFi does this via the concept of “processors”.
  4. The core data processing platform is then based on a datalake pattern which has been covered in this blog before. It includes the following pattern of processing.
    1. Data is ingested in real time into an HBase database (which uses HDFS as the underlying storage layer). Tables are designed in HBase to store the profile of a trade and its lifecycle.
    2. Producers are authenticated at the point of ingest.
    3. Once the data has been ingested into HDFS, it is taken through a pipeline of processing (L0 to L3) as depicted in the below blogpost.

    4. Historical data (defined as T+1) once in the HDFS tier is taken through layers of processing as discussed above. One of the key areas of processing is to run machine learning on the data to discover any hidden patterns in the trades themselves, patterns that can connote a range of suspicious behavior. Most surveillance applications are based on a search for data that breaches thresholds and seek to match sell & buy orders. The idea is that when these rules are breached, alerts are generated for compliance officers to conduct further investigation. However, this method falls short with complex types of market abuse. A range of supervised learning techniques can then be applied on the data, such as creating a behavioral profile of different kinds of traders (for instance junior and senior) by classifying & then scoring them based on their likelihood to commit fraud. Thus a range of Surveillance Analytics can be performed on the data. Apache Spark is highly recommended for near realtime processing not only due to its high performance characteristics but also due to its native support for graph analytics and machine learning, both of which are critical to surveillance reporting. For a deeper look at data science, I recommend the below post.

    5. The other important value driver in deploying Data Science is to perform Advanced Transaction Monitoring Intelligence. The core idea is to get years’ worth of Trade data in one location (i.e. the datalake) & then apply unsupervised learning to glean patterns in those transactions. The goal is then to identify profiles of actors with the intent of feeding them into existing downstream surveillance & TM systems.
    6. This knowledge can then be used to constantly learn transaction behavior for similar traders. This can be a very important capability in detecting fraud in traders, customer accounts and instruments. Some of the usecases are –
      • Profile trading activity of individuals with similar traits (types of customers, trading desks & instruments, geographical areas of operations etc.) to perform Know Your Trader
      • Segment traders by similar experience levels and behavior
      • Understand common fraudulent behavior typologies (e.g. spoofing) and cluster such (malicious) trading activities by trader, instrument, volume etc. The goal is to raise appropriate cases in the downstream investigation & case management system.
      • Using advanced data processing techniques like Natural Language Processing, constantly analyze electronic communications and join them up with trade data sources, both to detect under the radar activity and to keep the false positive rate low.
    7. Graph Database – Given that most kinds of trading fraud happen in groups of actors (e.g. traders acting in collusion with verification & compliance staff), the ability to view complex relationships of interactions and the strength of those interactions can be a significant monitoring capability.
    8. Grid Layer – To improve performance, I propose the usage of a distributed in memory data fabric like JBoss Data Grid or Pivotal GemFire. This can aid in two ways –

      a. Help with fast lookup of data elements by the visualization layer
      b. Help perform fast computation process by overlaying a framework like Spark or MapReduce directly onto a stateful data fabric.

      The choice of tools here is dependent on the language choices that have been made in building the pricing and risk analytic libraries across the Bank. If multiple language bindings are required (e.g. C# & Java) then the data fabric will typically be a different product than the Grid.
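To illustrate the graph database step in the list above (viewing relationships between actors and the strength of those interactions), here is a small in-memory sketch. It treats each communication or shared trade as an event with a list of participants, counts how often each pair appears together, and returns the pairs above a threshold. The function names and participant labels are hypothetical, and a real deployment would hold this in a graph store or a graph analytics library rather than a Python dict.

```python
from collections import defaultdict
from itertools import combinations


def interaction_graph(events):
    """Build an undirected weighted graph: edge weight = number of shared events."""
    weights = defaultdict(int)
    for participants in events:
        # Sort and de-duplicate so (a, b) and (b, a) land on the same edge
        for a, b in combinations(sorted(set(participants)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def strongest_links(weights, min_weight):
    """Return pairs whose interaction count meets the threshold, strongest first."""
    hits = [(pair, w) for pair, w in weights.items() if w >= min_weight]
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```

Unusually strong links between, say, a trader and a member of the verification team could then be surfaced to the investigative workbench for review.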

Data Visualization…

The visualization solution chosen should enable the quick creation of interactive dashboards that provide KPIs and other important business metrics from a process monitoring standpoint. Various levels of dashboard need to be created, ranging from compliance officer toolboxes to executive dashboards, to help identify trends and discover valuable insights.

Compliance Officer Toolbox (Courtesy: Arcadia Data)

Additionally, the visualization layer shall provide

a) A single view of Trader or Trade or Instrument or Entity
b) Investigative workbench with Case Management capability
c) The ability to follow the lifecycle of a trade
d) The ability to perform ad hoc queries over multiple attributes
e) Activity correlation across historical and current data sets
f) Alerting on specific metrics and KPIs

To Sum Up…

The solution architecture described in this blogpost is designed with peaceful enterprise co-existence in mind. That is, it interacts and integrates with a range of BORT (Book of Record Trading) systems and other enterprise systems such as ERP, CRM and legacy surveillance systems, along with any other line of business solutions that typically exist as shared enterprise resources.

Capital Markets Pivots to Big Data in 2016

Previous posts in this blog have discussed how Capital markets firms must create new business models and offer superior client relationships based on their vast data assets. Firms that can infuse a data driven culture in both existing & new areas of operation will enjoy superior returns and raise the bar for the rest of the industry in 2016 & beyond. 

Capital Markets are the face of the financial industry to the general public and generate a large percent of the GDP for the world economy. Despite all the negative press they have garnered since the financial crisis of 2008, capital markets perform an important social function in that they contribute heavily to economic growth and are the primary vehicle for household savings. Firms in this space allow corporations to raise capital using the underwriting process. However, it is not just corporations that benefit from such money raising activity – municipal, local and national governments do the same as well. Only the mechanism differs: while business enterprises issue both equity and bonds, governments typically issue bonds. According to the Boston Consulting Group (BCG), the industry will grow to annual revenues of $661 billion in 2016 from $593 billion in 2015 – a healthy 12% increase. On the buy side, the asset base (AuM – Assets under Management) is expected to reach around $100 trillion by 2020, up from $74 trillion in 2014.[1]

Within large banks, the Capital Markets group and the Investment Banking Group perform very different functions. Capital Markets (CM) is the face of the bank to the street from a trading perspective. The CM group engineers custom derivative trades that hedge exposure for their clients (typically Hedge Funds, Mutual Funds, Corporations, Governments, high net worth individuals and Trusts) as well as for their own treasury group. They may also do proprietary trading on the bank’s behalf for a profit, although it is this type of trading that the Volcker Rule is seeking to eliminate.

If a Bank uses dark liquidity pools (DLP) they funnel their Brokerage trades through the CM group to avoid the fees associated with executing an exchange trade on the street.  Such activities can also be used to hide exchange based trading activity from the Street.  In the past, Banks used to make their substantial revenues by profiting from their proprietary trading or by collecting fees for executing trades on behalf of their treasury group or other clients.

Banking, and within it capital markets, continues to generate insane amounts of data. The producers range from news providers to electronic trading participants to stock exchanges, which are increasingly looking to monetize data. And it is not just the banks: regulatory authorities like FINRA in the US are processing peak volumes of 40-75 billion market events a day [2]. In addition to data volumes, Capital Markets has always possessed a variety challenge as well. Firms have tons of structured data around traditional banking data, market data, reference data & other economic data. You can then factor in semi-structured data around corporate filings, news, retailer data & other gauges of economic activity. An additional challenge now is the data created by social media, multimedia etc. Together, these present firms with significant technology challenges and business opportunities.

Within larger financial supermarkets, the capital markets group typically leads the way in being forward looking in terms of adopting cutting edge technology and high tech spends. Most of the compute intensive problems are generated out of either this group or the enterprise risk group. These groups own the exchange facing order management systems, the trade booking systems, the pricing libraries for the products the bank trades as well as the tactical systems that are used to manage their market and credit risks, customer profitability, compliance and collateral systems. They typically hold about one quarter of a Bank’s total IT budget. Capital Markets thus has the largest number of use cases for risk and compliance.

Players across the value chain on the buy side, the sell side, the intermediaries (stock exchanges & the custodians) & technology firms such as market data providers are all increasingly looking at leveraging these new data sets to unlock the value of data for business purposes beyond operational efficiency.

So what are the different categories of applications that are clearly leveraging Big Data in production deployments?


                      Illustration – How are Capital Markets leveraging Big Data In 2016

I have catalogued the major ones below based on my work with the majors in the spectrum over the last year.

  1. Client Profitability Analysis or Customer 360 view:  With the passing of the Volcker Rule, the large firms are now moving over to a model based on flow based trading rather than relying on prop trading. Thus it is critical for capital market firms to better understand their clients (be they institutional or otherwise) from a 360-degree perspective so they can be marketed to as a single entity across different channels—a key to optimizing profits with cross selling in an increasingly competitive landscape. The 360 view encompasses defensive areas like Risk & Compliance but also the ability to get a single view of profitability by customer across all of their trading desks, the Investment Bank and Commercial Lending.
  2. Regulatory Reporting – Dodd Frank/Volcker Rule Reporting: Banks have begun to leverage data lakes to capture every trade intraday and end of day across its lifecycle. They are then validating that no proprietary trading is occurring on the bank’s behalf.
  3. CCAR & DFAST Reporting: Big Data can substantially improve the quality of raw data collected across multiple silos. This improves the understanding of a Bank’s stress test numbers.
  4. Timely and accurate risk management: Running Historical, stat VaR (Value at Risk) or both to run the business and to compare with the enterprise risk VaR numbers.
  5. Timely and accurate liquidity management:  Look at the tiered collateral and their liquidity profiles on an intraday basis to manage the unit’s liquidity.  They also need to look at credit and market stress scenarios and be able to look at the liquidity impact of those scenarios.
  6. Timely and accurate intraday Credit Risk Management: Understanding when & if a deal breaches a tenor bucketed limit before it is booked. For FX trading this means that you have about 9 milliseconds to determine if you can do the trade. This is a great place to use in memory technology like Spark/Storm and a Hadoop based platform. These usecases are key in increasing the capital that can be invested in the business. To do this, the desks need to convince upper management that they are managing their risks very tightly.
  7. Timely and accurate intraday Market Risk Management: Leveraging Big Data for market risk computations ensures that Banks have a real time idea of any market limit breaches for any of the tenor bucketed market limits.
  8. Reducing Market Data costs: Market Data providers like Bloomberg, Thomson Reuters and other smaller agencies typically charge a fee each time data is accessed.  With a large firm, both the front office and Risk access this data on an ad-hoc fairly uncontrolled basis. A popular way to save on cost is to  negotiate the rights to access the data once and read it many times.  The key is that you need a place to put it & that is the Data Lake.
  9. Trade Strategy Development & Backtesting: Big Data is being leveraged to constantly backtest trading strategies and algorithms on large volumes of historical and real time data. The ability to scale up computations as well as to incorporate real time streams is key to
  10. Sentiment Based Trading: Today, large scale trading groups and desks within them have begun monitoring economic, political news and social media data to identify arbitrage opportunities. For instance, looking for correlations between news in the middle east and using that to gauge the price of crude oil in the futures space.  Another example is using weather patterns to gauge demand for electricity in specific regional & local markets with a view to commodities trading. The realtime nature of these sources is information gold. Big Data provides the ability to bring all these sources into one central location and use the gleaned intelligence to drive various downstream activities in trading & private banking.
  11. Market & Trade Surveillance: Surveillance is an umbrella term for monitoring a wide array of trading practices that serve to distort securities prices, thus enabling market manipulators to illicitly profit at the expense of other participants by creating information asymmetry. Market surveillance is generally carried out by Exchanges and Self Regulatory Organizations (SROs) in the US, all of which have dedicated surveillance departments set up for this purpose. However, capital markets players on the buy and sell side also need to conduct extensive trade surveillance to report up internally. Pursuant to this goal, the exchanges & the SROs monitor transaction data including orders and executed trades & perform deep analysis to look for any kind of abuse and fraud.
  12. Buy Side (e.g. Wealth Management) – A huge list of usecases that I have catalogued in a separate post.
  13. AML Compliance – Covered in various blogs and webinars.
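The intraday credit risk usecase in the list above (checking a deal against a tenor bucketed limit before it is booked) boils down to a very small piece of logic, sketched below. The class name, bucket labels and amounts are hypothetical, and this single process dict says nothing about meeting a millisecond level latency budget; that is the job of the in memory platform the check would run on.

```python
from dataclasses import dataclass, field


@dataclass
class CounterpartyLimits:
    """In-memory exposure and limits per tenor bucket (all names hypothetical)."""
    limits: dict                                   # bucket -> maximum exposure
    exposure: dict = field(default_factory=dict)   # bucket -> current exposure

    def check_and_book(self, bucket, amount):
        """Book the deal only if it keeps the bucket within its limit."""
        current = self.exposure.get(bucket, 0.0)
        if current + amount > self.limits[bucket]:
            return False  # would breach the limit: reject before booking
        self.exposure[bucket] = current + amount
        return True
```

A rejected deal leaves the recorded exposure unchanged, so the desk can retry with a smaller size or a different tenor.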

The Final Word

A few tactical recommendations to industry CIOs:

  • Firstly, capital markets players should look to create centralized trade repositories for Operations, Traders and Risk Management. This would allow consolidation of systems and a reduction in costs by providing a single platform to replace operations systems, compliance systems and desk centric risk systems. This would eliminate numerous redundant data & application silos, simplify operations, reduce redundant quant work and improve the understanding of risk.
  • Secondly, it is important to put in place a model to create sources of funding for discretionary projects that can leverage Big Data.
  • Third, Capital Markets groups typically have to fund their portion of AML, Dodd Frank, Volcker Rule, Trade Compliance, Enterprise Market Risk and Traded Credit Risk projects. These are all mandatory spends. After this they typically get to tackle discretionary business projects, e.g. funding their liquidity risk, trade booking and tactical risk initiatives. These defensive efforts always get the short end of the stick and are not to be neglected while planning out new initiatives.
  • Finally, an area in which a lot of current players are lacking is the ability to associate clients using a Legal Entity Identifier (LEI). Using a Big Data platform to assign logical and physical entity IDs to every human and business the bank interacts with can have salubrious benefits. Big Data can ensure that firms can do this without having to redo all of their customer on-boarding systems. This is key to achieving customer 360 views, AML and FATCA compliance as well as accurate credit risk reporting.
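On the LEI point in the last recommendation: an LEI (Legal Entity Identifier, ISO 17442) is a 20 character alphanumeric code whose final two characters are check digits computed with the ISO 7064 MOD 97-10 scheme. A platform that links client records by LEI can therefore cheaply reject mistyped identifiers before matching. The sketch below shows that check; the function names and the sample base string are made up for illustration and do not refer to any real entity.

```python
def _to_digits(text):
    """Map 0-9 to themselves and A-Z to 10-35, as in ISO 7064 MOD 97-10."""
    return "".join(str(int(ch, 36)) for ch in text)


def lei_check_digits(base18):
    """Compute the two check digits for an 18-character LEI base."""
    remainder = int(_to_digits(base18.upper() + "00")) % 97
    return f"{98 - remainder:02d}"


def is_valid_lei(lei):
    """True when the string is 20 ASCII alphanumerics with a MOD 97-10 remainder of 1."""
    lei = lei.upper()
    if len(lei) != 20 or not lei.isascii() or not lei.isalnum():
        return False
    return int(_to_digits(lei)) % 97 == 1


# Hypothetical 18-character base, used only to exercise the functions
sample = "HYPOTHETICALBANK01"
sample_lei = sample + lei_check_digits(sample)
```

Passing the checksum only shows that an identifier is well formed; confirming that it belongs to the right entity still requires a lookup against an authoritative LEI data source.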

It is no longer enough for CIOs in this space to think in terms of tactical Big Data projects; they must think about creating platforms, and ecosystems around those platforms, that can support a variety of pathbreaking activities and generate a much higher rate of return.


[1] “The State of Capital Markets in 2016” – BCG Perspectives