“Data matures like wine, applications like fish.” – James Governor, Principal Analyst & Founder of RedMonk, circa 2007
I would like to begin a series of posts on Data Science jointly authored with my friend, ex-colleague, & collaborator, Maleeha Qazi – Data Scientist (https://www.linkedin.com/in/maleehaqazi/). In these posts, we are intending to bring to light several technology themes around industrial use of Data Science and Deep Learning around Industrial Applications, Big Data , Cyber Security, Cognitive Applications, Business Process Management, and Cloud Computing. Our goal for this first post is to discuss typical issues that bedevil every Data Science initiative at the beginning. Namely, the top technical and cultural concerns to communicate to the IT Department every time a new project is begun.
With Data Science emerging as a key enabler in Digital Customer focused Applications, renewed focus is being placed on how the lifecycle of these new fangled applications happens alongside traditional IT development. This blogpost aims to highlight some of the key concerns involved when Data Science groups work with IT departments. Currently there is no “one size fits all model” in terms of how advanced models are developed and deployed so that they can be accessed and used at scale by customers. It is our wager that almost every large enterprise working on these projects encounters these issues. We wanted to share our experience with the enterprise community over a series of blog posts.
It is clear that Data Science teams, product teams and IT need to collaborate to create business applications that learn from customer needs.
So what are the top asks that Data Science has for their IT groups? There are at least nine important focus areas:
#1 Understanding of the business challenge and agreeing on a common vocabulary
It is a generally accepted fact that most IT/Data Science interactions are focused on the technology portion which include some of the following elements : the data sources within the organization, acquisition and access to external data sources, the availability of tools & infrastructure to begin supporting the data science development process, cloud or on-prem, data ingestion engines (e.g. Kafka, Flume, Sqoop) to ingest and process the data, etc. While this is certainly part of the process, there has begun to be a distinct anti pattern in how this interaction is working when solely driven by technology alone. The Data Science team is involved in creating models that typically reflect customer needs that drive business value for an organization’s customers, partners, regulators & employees. In that rather important context, technology at it’s core is just an engine and does not exist in a vacuum. The most vibrant enterprises understand this ground reality and always ensure that business needs drive both Data Scientists & IT and not the other way around. It is thus highly important for both the Data Science team and IT team to agree on the business challenge at hand to ensure that their interactions (long and short term) are being driven with business & competitive outcomes in mind. Examples of such goals are a common organization wide business language (so that definitions agree semantically) across products, customers, logistics, supply chains & business domains. The shared emphasis on both teams should be on overall goals such as increased customer profitability, enhanced customer segmentation, customer service productivity, etc. Setting this tone upfront will not only ensure that outcomes for both teams are aligned but will also ensure that critical gaps in knowledge and capabilities are filled. One of the approaches that is working well is increased cross pollination across both teams, collapsing artificial organizational barriers by adopting DevOps & ensuring that Data Science teams have a “slim IT” presence (e.g. an embedded data engineer and datacenter person) to rapidly be able to fill in gaps in IT’s business knowledge or capability.
#2 IT needs to help Data Scientists acquire a deep understanding of the overall Data Architecture
Once business requirements have been identified, Data Scientists get right to work in understanding the different data sources that will comprise inputs to their models. In large enterprises, it is not inconceivable to find out that there are many varied data sources from which data needs to be sourced. For instance, in Banking there are a range of Book of Record Transaction (BORT) systems from which data needs to be extracted. It is also key to supplement this data with external data sets. Models are only as good as the data they are given to work with. Garbage In, Garbage Out (GIGO) is the moniker given to bad data that ensures that models perform poorly. A lot of times, business groups have a hard time explaining what they would like to see – both in terms of data and visualization. In such cases, a prototype makes things easier from a requirements gathering standpoint. Once the problem is defined, the data scientist/modeler identifies the raw data sources (both internal and external) which are needed for the execution of the business challenge. They spend a lot of time in the process of collating the data (from Oracle/SQL Server, DB2, Mainframes, Greenplum, Excel sheets, external datasets, etc.). The cleanup/data-wrangling process includes fixing and standardizing missing value representations, identifying potentially corrupted data elements, formatting fields that indicate time and date in a consistent manner, etc.
#3 Infrastructure & IT Self Service Across Environments, Platforms and Tools
This one is huge. The traditional IT model of hardware acquisition and vetting is typically drawn out as a process. Even with public cloud, onerous security controls are sometime added to infrastructure which delay the Data Science team’s ability to develop their models in an agile manner. The dreaded term Shadow IT (where business & data science teams go around the IT team to procure compute and storage on the public cloud) is not just an issue with infrastructure software but is slowly creeping up to business intelligence and advanced analytics apps. The delays associated with provisioning legacy data silos combined with using tools that are neither intuitive nor able to scale to deal with the increasing data deluge are making timely business analysis almost impossible to perform. Insights delivered too late are not very valuable. Data Scientists dearly desire that the environments that they need for development and testing are made available as soon as possible and ideally via a self service user interface. This calls for IT investments in Cloud computing platforms that enable agility and speedy provisioning of dev/test environments across compute, network and storage.
#4 – Collaboration with IT around the DS development lifecycle
Organizations typically have well established development methodologies and processes. Currently most data science development and traditional application development happen in two distinct tracks. Software development typically follows a Agile/DevOps process (a combination of Scrum/XP). The development lifecycle is divided into several stages with each producing a working deliverable at the end. The deliverables are incrementally updated to arrive at an acceptable product at the end which is then deployed for customer use. In this model, team members typically follow a defined role.
The Data Science development cycle is different. Data scientists/modelers are given a certain business problem to solve. They proceed to find the appropriate data they need, pull it into Hadoop or a Data Warehouse, wrangle it, try various algorithms to create the best possible models, test the models, and ensure that they perform well for the problem at hand. If they get more data during the process, they will go back and retest the whole process. The issue is that IT needs to partner with and collaborate with the Data Science team to first strategize and then help provision different environments (dev, test, prod) to enable data scientists to do iterative model development. They then need to help the Data Science team deploy these models in the appropriate deployment architecture.
#5 – Help Improve the Data Science User Experience
Using traditional app dev methodologies, it can take months to design, test and deploy software – which is simply unsustainable. One of chief goals of the DevOps model is to close the long-standing gap between the engineers who develop and test IT capability and business requirements for such capabilities. Accordingly the data science teams need best practice recommendations on using IDEs that support iterative model development & debugging. It is important that these development tools support programming languages such as R and Python – the most common go-to languages for data science – to rapidly develop code. It is critical that the IT group partner with the Data Scientists to enable these capabilities both from a development and a deployment standpoint.
#6 – Model Deployment
The data wrangling phase involves writing code to be able to join various data sets so that a single complete dataset can be created from a raw features standpoint. If more data is obtained as the development cycle is underway, the Data Science team has no option but to go back and redo the whole process. Once the raw features are gathered, feature engineering can begin to create predictive features from the raw data, taking into account business concepts. The modeling phase is where the choice of algorithms comes into play. A Data Scientist takes the raw & engineered features and creates models using the most appropriate algorithms for the task. After the models have been repeatedly tested for accuracy and performance, the best one is typically deployed for use. Once the models have been developed it is critical to ensure that these can be deployed rapidly, run automatically, and changed as per business requirements and performance. How and where these models will get deployed depends on the business case, ideally they should be deployed as a service. Models as a Service (MaaS) is the Data Science counterpart to Software as a Service. The MaaS takes in business variables (often hundreds of inputs) and provides as output business decisions/intelligence, measurements, and visualizations that augment decision support systems. IT help is needed to ensure that the models can scale as customer usage of these Digital Platforms increases.
#7 Model Governance and Management
There needs to be appropriate checks put into place to allow for the monitoring and maintenance of the models once in production. Model versioning must be handled so that customers aren’t affected during a maintenance cycle – old models must still function while the new ones are being put into place. And by keeping a check on the performance of models in production, the IT team can tell when a model stops performing optimally & to call on the Data Science team to check on why.
#8 Security and Compliance
How are security constraints around different environments managed? Though IT maintains control over the vast domain of tools and environments in any organization, the Data Science team must maintain control of the models. Any random person updating the models could lead to performance degradations. This separation of concerns is akin to DB security over schemas/tables/columns – only certain individuals should be granted access to perform certain operations for the most optimal results.
#9 – Delivering Results to Business Users –
Once the model has been deployed the results need to be made available to business users. Depending on the application, model results might need to be served up in near-real-time, every day/week/month/year, ad-hoc on demand, or any other time frame in-between. Organizations need to deal with providing appropriate tools (e.g. apps, sandboxes, etc.) to enable end users to explore the results of the analysis, and to perform intelligent visualization of the data. Visualizations include trend analysis over time, KPIs, list of interesting customers/accounts, etc.
Digital applications will continue to incorporate Data Science at an increasing scale. However, traditional IT Departments need to collaborate in the above specific areas to ensure that the algorithms developed for specific business issues are effective, forward looking and scalable.