In Part 2 of our series on the AI Factory – https://www.vamsitalkstech.com/ai/industry-spotlight-engineering-the-ai-factory-inside-ubers-ai-infrastructure-part-2/ – we explored one of the most sophisticated implementations of an AI factory in production: Uber's machine learning infrastructure. In this blog post, we'll examine how Netflix architected its AI factory to handle its massive scale, the technical challenges the company overcame, and the lessons any organization can apply.
Netflix’s AI Infrastructure: Engineering for Global Scale Personalization
Netflix has built one of the world's most sophisticated AI platforms, powering recommendations for hundreds of millions of users worldwide. This post explores the cloud-native architecture, the unique workflow orchestration systems, and the data processing infrastructure that together enable machine learning at Netflix's scale.
Architecture Overview
Netflix’s system architecture is fundamentally cloud-native, built entirely on Amazon Web Services. This strategic decision, made early in their transition to streaming, provides elasticity and leverages managed services. The architecture relies heavily on microservices, allowing independent development, deployment, and scaling of various platform components, including those involved in AI and machine learning. This distributed approach aligns with Netflix’s culture of building resilient systems capable of handling failures gracefully, famously exemplified by tools like Chaos Monkey.
Within this cloud-native environment, Netflix has developed significant internal platforms for managing workflows and computations. Historically, Meson served as a general-purpose workflow orchestration and scheduling framework, initially crucial for managing the lifecycle of ML pipelines that built, trained, and validated personalization algorithms. Meson was designed to handle heterogeneous tasks involving technologies like Spark, Python, R, and Docker, orchestrating complex sequences like feature generation, model training, and validation. It operated on a single-leader architecture, managing tens of thousands of workflows and hundreds of thousands of jobs daily. However, as Netflix’s scale increased, Meson’s single-leader design encountered limitations, experiencing performance issues during peak traffic.
This scaling challenge led to the development of Maestro, Netflix’s next-generation, general-purpose workflow orchestrator designed to replace Meson. Maestro provides a fully managed Workflow-as-a-Service (WaaS) to a wide range of internal users, including data scientists, ML engineers, and analysts. It is engineered for high throughput, horizontal scalability, and enhanced usability, capable of handling hundreds of thousands of workflows and millions of jobs daily with strict Service Level Objectives (SLOs), even during traffic spikes. Maestro’s architecture features stateless microservices, distributed queuing, and support for complex workflow patterns, providing a robust foundation for ETL pipelines, ML workflows, A/B testing, and other automated processes across Netflix.
Core Technologies
Netflix’s AI infrastructure leverages a combination of open-source technologies and internally developed tools, predominantly within a Python-centric ecosystem for machine learning.
ML Frameworks/Libraries: Python is the primary language for ML development, facilitated by frameworks like Metaflow. Standard libraries such as Scikit-learn are likely used, and deep learning frameworks (e.g., PyTorch, TensorFlow) are employed, particularly in research contexts and potentially within Metaflow workflows. R is also utilized for specific modeling tasks, such as regional model building within Meson pipelines.
Data Processing Engines: Apache Spark is heavily used for large-scale data processing and ML tasks, integrated into workflows orchestrated by Meson and Metaflow. Apache Flink is the core engine for the Keystone real-time data pipeline, handling stream processing at massive scale. Hive is used for data warehousing queries, often as a source for ML pipelines.
Containerization: Docker serves as the standard for packaging applications, ML models, and pipeline steps, ensuring consistency across development and production environments and enabling deployment via Titus.
Internal Platforms:
Metaflow: The primary framework for ML workflow development and orchestration
Keystone: The unified real-time data pipeline for event transport and stream processing
Titus: The container management platform for running batch and service workloads, including model serving
Maestro: The current-generation, large-scale workflow orchestrator
Spinnaker: The continuous delivery platform, used for deploying services and potentially infrastructure components like Keystone’s Flink clusters
Atlas: Time-series telemetry system for monitoring
Eureka: Service discovery system
Compute Infrastructure
Netflix’s compute strategy is characterized by its complete reliance on AWS, leveraging the cloud provider’s elasticity and managed services rather than building custom hardware.
Cloud Provider: Amazon Web Services (AWS) exclusively.
Compute Services: The backbone is Amazon EC2, providing virtual machine instances for running various workloads. Netflix operates at a significant scale, historically cited as using over 100,000 EC2 instances. For managing batch computations, particularly those orchestrated by Metaflow, Netflix utilizes AWS Batch, which dynamically provisions EC2 resources. To address latency requirements for specialized workloads like remote visual effects (VFX) studios, Netflix employs AWS Local Zones, bringing compute resources closer to the artists and studios that use them.
Hardware Accelerators: GPUs are essential for many of Netflix’s ML workloads, particularly training recommendation models and potentially other tasks like content analysis. They utilize GPU-equipped EC2 instances provided by AWS. Historical accounts mention experimentation with instances featuring NVIDIA Tesla M2050 and GRID K520 GPUs. Current usage likely involves modern GPU instances such as the G4, G5 (powered by NVIDIA T4 and A10G respectively), or P-series instances, managed either directly, via Titus, or through AWS Batch. Heterogeneous clusters combining CPU and GPU instances are also employed, allowing different parts of a workload (e.g., data preprocessing vs. model training) to run on optimized hardware. CPU instances remain crucial for tasks less suited to GPUs, such as large-scale parameter sweeping or certain data processing steps.
Netflix’s decision to forgo custom hardware development and rely entirely on AWS infrastructure allows the company to focus its engineering efforts on software, algorithms, and its core streaming business. It benefits from AWS’s rapid hardware innovation cycles, global footprint, and the operational efficiencies of managed cloud services.
Data Pipeline Architectures
Netflix has engineered highly scalable and robust data pipeline infrastructure to handle the massive influx of data generated by its global user base and internal systems. This infrastructure is critical for feeding analytics, machine learning models, and operational monitoring.
Keystone Data Pipeline: Keystone is Netflix’s core, unified infrastructure for real-time data transport, processing, and routing. It evolved from earlier systems like Apache Chukwa and Suro (an internal data collection service) to a Kafka-fronted architecture.
Scale and Core Components: Keystone processes trillions of events daily (cited as over 700 billion messages per day across 36 Kafka clusters back in 2016). It uses Apache Kafka as the central message bus for ingestion and transport. For real-time stream processing tasks (filtering, enrichment, aggregation), Keystone leverages Apache Flink, having potentially used Apache Samza in earlier iterations.
Functionality: It acts as a central nervous system, collecting data from microservices and edge devices and routing it to various destinations, including data lakes (S3), search indexes (Elasticsearch), data warehouses (Hive), and other streaming systems (secondary Kafka, Mantis). It also provides a “Stream Processing as a Service” capability, allowing internal teams to deploy custom Flink jobs with abstracted infrastructure management.
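To make the ingestion side concrete, here is a rough sketch of how a microservice might publish a viewing event onto a Kafka topic. The topic name, event schema, and the kafka-python client are illustrative assumptions; Keystone's actual producer libraries and schemas are internal to Netflix.

```python
# Illustrative sketch: publishing an interaction event to a Kafka topic.
# Broker address, topic name, and event fields are hypothetical; this is not
# Keystone's real producer API.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],  # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "event_type": "playback_started",
    "profile_id": "profile-123",   # hypothetical identifiers
    "title_id": "title-456",
    "device": "smart-tv",
    "timestamp_ms": int(time.time() * 1000),
}

# Fire-and-forget send; downstream Flink jobs would consume, enrich,
# and route events like this to S3, Elasticsearch, or other sinks.
producer.send("playback-events", value=event)
producer.flush()
```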
Architecture and Operations: Keystone’s design emphasizes reliability and operability at scale. It employs a declarative reconciliation protocol where the desired state of the infrastructure (e.g., Kafka topics, Flink jobs) is stored in AWS RDS, acting as the single source of truth for recovery and configuration. Deployments are managed via Spinnaker. Each Flink job typically runs in its own independent cluster for isolation, sharing only Zookeeper for coordination and S3 for checkpoints. The platform embraces “failure as a first-class citizen,” building in mechanisms for handling subsystem failures and providing self-healing capabilities. Self-service tooling (UI, CLI) and managed connectors simplify usage for developers.
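The reconciliation idea is worth illustrating. Below is a minimal, generic sketch of a declarative control loop: compare the declared state (in Keystone's case, stored durably in AWS RDS) against what is actually running, and converge the difference. Every name in the sketch is hypothetical; this shows the pattern, not Netflix's control-plane code.

```python
# Conceptual sketch of a declarative reconciliation loop. All names here are
# hypothetical illustrations of the pattern, not Keystone's implementation.
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FlinkJobSpec:
    parallelism: int
    jar_version: str


def load_desired_state() -> dict[str, FlinkJobSpec]:
    # In Keystone this would be read from the durable source of truth (AWS RDS).
    return {
        "enrich-playback-events": FlinkJobSpec(parallelism=8, jar_version="1.4.2"),
        "route-search-events": FlinkJobSpec(parallelism=4, jar_version="2.0.1"),
    }


def observe_actual_state() -> dict[str, FlinkJobSpec]:
    # In practice this would inspect the independently running Flink clusters.
    return {
        "enrich-playback-events": FlinkJobSpec(parallelism=4, jar_version="1.4.2"),
    }


def converge(name: str, desired: FlinkJobSpec | None, actual: FlinkJobSpec | None) -> None:
    # Create, update, or tear down a job so reality matches the declaration.
    if desired is None:
        print(f"tearing down {name}")
    elif actual is None:
        print(f"deploying {name} at {desired}")
    else:
        print(f"updating {name}: {actual} -> {desired}")


def reconcile_once() -> None:
    desired = load_desired_state()
    actual = observe_actual_state()
    for name in sorted(desired.keys() | actual.keys()):
        if desired.get(name) != actual.get(name):
            converge(name, desired.get(name), actual.get(name))


if __name__ == "__main__":
    reconcile_once()
```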
Data Mesh: More recently, Netflix has introduced the concept of a “Data Mesh” alongside Keystone, representing an evolution in its data processing strategy. While details are emerging, the Data Mesh aims to enable more composable data processing, allowing for the creation and reuse of standardized processors and datasets across different domains, particularly noted in the context of Netflix Studio data movement. This approach seeks to abstract complex stream processing concepts (like filtering, windowing, sessionization, deduplication) and potentially offer higher-level interfaces like Streaming SQL.
Storage and Data Formats: The foundation of Netflix’s data storage is AWS S3, used as the primary data lake and for storing pipeline artifacts and model checkpoints. To manage large analytical datasets efficiently on S3, Netflix has adopted the Apache Iceberg open table format. AWS RDS serves as the persistent state store for Keystone’s configuration. Other data stores include Cassandra for operational microservice data and Elasticsearch as a sink for real-time analytics and search use cases.
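To illustrate how a pipeline step might read training data from an Iceberg table on S3, here is a hedged PySpark sketch. The catalog name, warehouse bucket, and table are hypothetical, and Netflix's internal "Fast Data" access layer is not shown.

```python
# Illustrative PySpark read from an Iceberg table; the catalog, warehouse path,
# and table name are hypothetical. Requires the iceberg-spark runtime jar.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-iceberg-training-data")
    .config("spark.sql.catalog.prodhive", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.prodhive.type", "hive")
    .config("spark.sql.catalog.prodhive.warehouse", "s3://example-warehouse/")  # hypothetical bucket
    .getOrCreate()
)

# Read a day of playback events and derive a trivial per-profile feature.
events = spark.table("prodhive.recsys.playback_events").where("ds = '2024-01-01'")
features = events.groupBy("profile_id").count().withColumnRenamed("count", "plays_last_day")
features.show(5)
```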
Machine Learning Workflow Orchestration
Netflix has invested significantly in building internal platforms to streamline and scale the orchestration of machine learning workflows, abstracting infrastructure complexities from data scientists and ML engineers.
Metaflow: This is Netflix’s flagship, open-source framework specifically designed for building and managing real-life data science, ML, and AI projects.
Design and Philosophy: Metaflow is human-centric and Python-native, aiming to provide a smooth path from local prototyping on a laptop to large-scale production deployment with minimal code changes. It allows practitioners to focus on modeling and business logic rather than infrastructure management.
Core Features: It manages the entire ML lifecycle, including:
Workflow Definition: Defining workflows as Directed Acyclic Graphs (DAGs) in Python (see the sketch after this list)
Infrastructure Abstraction: Seamlessly accessing compute resources (CPU, GPU, memory) via integrations with AWS Batch or Kubernetes, and managing data flow and access to data stores like S3
Versioning: Automatically tracking and storing code, data, parameters, and artifacts for reproducibility and experiment tracking
Dependency Management: Handling Python library dependencies across different execution environments
Scalability: Designed to scale out computations using cloud resources
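To give a feel for these features, here is a small, hypothetical Metaflow flow. The flow name, step logic, and resource sizes are illustrative only, not a real Netflix pipeline.

```python
# Minimal illustrative Metaflow flow; names and resource sizes are hypothetical.
from metaflow import FlowSpec, Parameter, resources, step


class ToyRankingFlow(FlowSpec):
    alpha = Parameter("alpha", default=0.01, help="learning rate")

    @step
    def start(self):
        # Load or generate training data; attributes assigned to self are
        # automatically versioned and stored as artifacts.
        self.rows = list(range(1000))
        self.next(self.train)

    @resources(cpu=4, memory=16000)  # scheduled onto cloud compute (e.g., AWS Batch)
    @step
    def train(self):
        # Placeholder "training": the resulting object is persisted and tracked
        # for reproducibility and experiment comparison.
        self.model = {"alpha": self.alpha, "n_rows": len(self.rows)}
        self.next(self.end)

    @step
    def end(self):
        print("trained", self.model)


if __name__ == "__main__":
    ToyRankingFlow()
```

Such a flow runs locally with python toy_ranking_flow.py run, can be handed to a production orchestrator with commands like python toy_ranking_flow.py argo-workflows create, and its versioned artifacts can later be retrieved through the Metaflow Client API (for example, Flow("ToyRankingFlow").latest_successful_run).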
Scale and Integration: Metaflow is battle-hardened at Netflix, supporting thousands of projects, executing hundreds of millions of compute jobs, and managing petabytes of data and artifacts. It integrates with production orchestrators like AWS Step Functions, Argo Workflows, Airflow, and crucially, Netflix’s internal orchestrator, Maestro. Integration with internal systems, like the “Fast Data” library for accessing the Iceberg/Spark data warehouse, is achieved through a custom extension mechanism.
Maestro: As the successor to Meson, Maestro serves as the robust, scalable backend orchestration engine for production workflows at Netflix, including those defined in Metaflow.
Role: While Metaflow provides the user-facing definition layer for ML workflows, Maestro handles the production scheduling (time-based, event-based triggers like data availability), execution management, resource allocation, monitoring, and ensuring SLOs for these and other workflows (ETL, A/B tests) at massive scale.
Architecture: Built for horizontal scalability and reliability using stateless microservices, distributed queues, and capable of handling millions of jobs daily. It provides abstractions and templates to simplify running common job types like Spark or Trino tasks. Maestro is also open-sourced.
Meson (Legacy): Netflix's initial internal orchestrator for ML pipelines, built on Apache Mesos. While foundational, its single-leader architecture limited scalability, leading to its replacement by Maestro.
Model Training & Serving Infrastructure
Netflix’s infrastructure for training and serving ML models leverages its general-purpose compute and container management platforms, integrated tightly with its ML workflow tools.
Model Training: Training is typically executed as steps within Metaflow workflows. These workflows dynamically provision the necessary compute resources, primarily through AWS Batch or potentially Kubernetes managed by Metaflow, utilizing various EC2 instance types (including GPUs) as needed. Distributed training for larger models is supported, leveraging frameworks like Spark or potentially Ray within the Metaflow environment. Training data is accessed from sources like S3/Iceberg via libraries like the “Fast Data” library, and model artifacts (checkpoints, final models) are persisted back to S3, with versioning managed by Metaflow.
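As a hedged sketch, a GPU-backed training step inside such a workflow might look like the following; the @batch parameters, data path, and model code are assumptions made for illustration, not a Netflix training job.

```python
# Illustrative GPU training step; queue sizes, paths, and model code are hypothetical.
from metaflow import FlowSpec, batch, step


class GpuTrainFlow(FlowSpec):

    @step
    def start(self):
        self.data_path = "s3://example-bucket/training-data/"  # hypothetical location
        self.next(self.train)

    @batch(gpu=1, cpu=8, memory=60000)  # provisions a GPU-equipped instance via AWS Batch
    @step
    def train(self):
        import torch  # resolved inside the containerized execution environment

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = torch.nn.Linear(128, 1).to(device)
        # ... real feature loading and training loop would go here ...
        self.model_state = {k: v.cpu() for k, v in model.state_dict().items()}
        self.next(self.end)

    @step
    def end(self):
        # self.model_state is persisted to the Metaflow datastore (S3) and versioned.
        print("saved model with", len(self.model_state), "tensors")


if __name__ == "__main__":
    GpuTrainFlow()
```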
Model Serving: Netflix utilizes its powerful, internally developed container management platform, Titus, for deploying and running ML models in production, alongside its other microservices.
Titus Architecture: Titus was originally built as a framework on top of Apache Mesos. It follows a master-agent architecture, with a replicated, leader-elected Master (using Zookeeper for election and Cassandra for state persistence) responsible for scheduling Docker containers onto a large pool of EC2-based Agents. A Gateway component provides the API layer for job submission. While built on Mesos, there are indications of potential migration or integration with Kubernetes to align with broader industry trends, although Titus maintains tight integration with AWS and Netflix’s internal ecosystem.
Functionality and Scale: Titus manages the entire lifecycle of Docker containers for both batch jobs and long-running services. It scales to manage thousands of EC2 instances and launch hundreds of thousands of containers daily. A key design principle was seamless integration with the existing Netflix cloud infrastructure (AWS services like VPC, IAM, ELB) and Netflix OSS components (Eureka for service discovery, Spinnaker for deployment, Atlas for monitoring), allowing existing applications and tooling to work with containers with minimal changes. This eased the adoption of containers within Netflix.
ML Serving via Titus: Instead of a dedicated ML serving system, models are typically packaged as Docker containers and deployed as services managed by Titus. This ensures operational consistency with how other Netflix microservices are managed and scaled.
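As a generic illustration of this pattern, rather than Titus-specific tooling, the snippet below sketches a minimal scoring service that could be packaged into a Docker image and run as a container. The FastAPI framework, model path, and request schema are assumptions.

```python
# Minimal illustrative model-serving service; the framework choice, model file
# location, and request schema are assumptions, not Netflix's serving stack.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Model artifact baked into the container image at a hypothetical path.
with open("/models/ranker.pkl", "rb") as f:
    model = pickle.load(f)


class ScoreRequest(BaseModel):
    profile_id: str
    candidate_title_ids: list[str]


@app.post("/score")
def score(req: ScoreRequest) -> dict:
    # A real service would assemble features per (profile, title) pair;
    # here we simply call a generic predict() on the candidate list.
    scores = model.predict(req.candidate_title_ids)
    ranked = sorted(zip(req.candidate_title_ids, scores), key=lambda p: -p[1])
    return {"ranking": [title for title, _ in ranked]}
```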
Metaflow Hosting: To simplify the deployment process for data scientists, Metaflow offers an integrated Metaflow Hosting service. This provides an easy-to-use interface that abstracts the underlying microservice infrastructure (likely Titus), allowing users to quickly deploy their Metaflow-trained models as production-grade REST APIs with minimal operational overhead.
Key Challenges & Solutions
Netflix’s primary business revolves around user engagement and retention, driven heavily by its sophisticated personalization and recommendation systems. Scaling these systems presents significant AI infrastructure challenges.
Challenge: Scale of Personalization: Delivering unique, relevant recommendations to hundreds of millions of diverse global subscribers, based on interactions with a massive and constantly changing content library, requires processing immense data volumes and running complex algorithms efficiently. User interaction data (views, ratings, skips, searches, time of day, device, location) must be processed in near real-time to keep recommendations fresh and responsive.
Solution – Infrastructure: The entire infrastructure stack is geared towards this challenge:
Scalable Compute: Elastic AWS compute (EC2, Batch, GPUs) allows scaling resources up or down based on demand for training and processing
High-Throughput Data Pipelines: Keystone, powered by Kafka and Flink, is designed to ingest and process trillions of events daily in near real-time, feeding fresh data into personalization models
Efficient ML Workflows: Metaflow enables rapid development, iteration, and deployment of numerous personalization models, managing dependencies, versioning, and compute resources. Maestro provides the scalable backend orchestration for these workflows in production
Robust Serving: Titus provides a scalable and reliable platform for deploying recommendation models as microservices, handling high request volumes
Challenge: Algorithmic Complexity & Experimentation: Finding the best algorithms requires continuous experimentation with different models (collaborative filtering, content-based filtering, deep learning), features, and hyperparameters. Visual presentation (e.g., personalized thumbnails) also requires optimization.
Solution – Platform Support: Metaflow is explicitly designed to facilitate rapid experimentation and A/B testing, allowing data scientists to easily create and compare variations of their models. The underlying infrastructure supports running thousands of experiments concurrently.
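One common way to express this kind of parallel experimentation in Metaflow is a foreach fan-out over model variants, sketched below with hypothetical parameters and a placeholder metric.

```python
# Illustrative hyperparameter fan-out with Metaflow's foreach; the variants and
# the "training" logic are placeholders.
from metaflow import FlowSpec, step


class ExperimentFlow(FlowSpec):

    @step
    def start(self):
        self.variants = [{"lr": 0.01}, {"lr": 0.05}, {"lr": 0.1}]
        self.next(self.train, foreach="variants")

    @step
    def train(self):
        self.variant = self.input                        # one variant per parallel branch
        self.metric = 1.0 / (1.0 + self.variant["lr"])   # stand-in for a validation score
        self.next(self.join)

    @step
    def join(self, inputs):
        self.best = max(inputs, key=lambda i: i.metric).variant
        self.next(self.end)

    @step
    def end(self):
        print("best variant:", self.best)


if __name__ == "__main__":
    ExperimentFlow()
```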
Challenge: Cold Start Problem: Effectively recommending newly added content or serving new users with limited interaction history is difficult.
Solution – Data & Features: This likely involves leveraging content metadata extensively (genres, actors, descriptions, tags) through content-based filtering approaches. Techniques like learning embeddings from existing content and user data can help map new items into the recommendation space. Netflix’s investment in data processing (Keystone, Spark) and feature engineering (within Metaflow) is crucial here.
Challenge: Filter Bubbles and Content Diversity: Over-personalization can limit discovery by trapping users in “filter bubbles.” Balancing relevance with exposure to novel and diverse content is essential for long-term user satisfaction.
Solution – Algorithmic Refinements: This is typically addressed in the ranking or re-ranking stages of the recommendation pipeline. Algorithms can be designed to explicitly introduce diversity, promote fresher content, or explore less-popular items, potentially informed by ongoing research.
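One hedged sketch of such a re-ranking step is a greedy, MMR-style heuristic that trades relevance against similarity to items already selected; the candidate data and similarity function below are placeholders, not Netflix's ranking system.

```python
# Generic MMR-style diversity re-ranking sketch; candidate scores, the similarity
# measure, and the trade-off weight are illustrative placeholders.
def rerank_with_diversity(candidates, similarity, lambda_relevance=0.7, k=10):
    """candidates: list of (item_id, relevance_score); similarity(a, b) in [0, 1]."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr(item):
            relevance = remaining[item]
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lambda_relevance * relevance - (1 - lambda_relevance) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected


# Toy usage: items sharing a genre are considered fully similar.
genres = {"a": "crime", "b": "crime", "c": "comedy", "d": "sci-fi"}
sim = lambda x, y: 1.0 if genres[x] == genres[y] else 0.0
print(rerank_with_diversity([("a", 0.9), ("b", 0.85), ("c", 0.6), ("d", 0.5)], sim, k=3))
```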
Challenge: Data Privacy and Ethics: Handling vast amounts of user interaction data requires adherence to privacy regulations (like GDPR, CCPA) and ethical considerations.
Solution – Governance and Transparency: While Netflix has not publicly detailed its internal privacy tooling, robust data governance practices, data anonymization where feasible, and user-facing controls are necessary at this scale. Secure infrastructure (AWS) and internal platforms likely incorporate privacy controls.
References
- Meson: Netflix Tech Blog articles on their workflow orchestration system
- Metaflow: The official Metaflow documentation at https://docs.metaflow.org
- Keystone/Data Mesh: Netflix Tech Blog posts on their data pipeline architecture
- Titus: Official documentation for Netflix’s container management platform
- Maestro: Netflix’s open-source documentation for their next-gen workflow orchestrator