In Part 4 of our series on the AI Factory – https://www.vamsitalkstech.com/ai/industry-spotlight-engineering-the-ai-factory-inside-ubers-ai-infrastructure-part-2/ – we explored one of the most sophisticated implementations of an AI factory in production: Netflix’s machine learning infrastructure. In this blog post, we’ll examine how Meta built one of the world’s most advanced AI infrastructures, powering everything from recommendation systems across its family of apps to its cutting-edge Llama large language models. We’ll explore Meta’s unique approach to AI infrastructure, highlighting its vertical integration strategy and technical innovations.
Meta’s approach to AI infrastructure stands in significant contrast to Netflix’s cloud-centric model. Meta pursues a strategy of deep vertical integration, investing heavily in designing and building custom hardware and software stacks optimized for its massive-scale AI workloads, primarily centered around deep learning recommendation models (DLRMs) for its social feeds and ads, and increasingly, large-scale generative AI (GenAI) models like Llama.
A core tenet of Meta’s strategy is hardware-software co-design. This involves tightly integrating the development of custom silicon (like the Meta Training and Inference Accelerator – MTIA), custom server and rack designs (Grand Teton, Open Rack), specialized network fabrics, and its primary software framework, PyTorch. This holistic approach allows for optimizations across the entire stack, aiming for maximum performance and efficiency for Meta’s specific needs.
Meta operates at hyperscale, reporting hundreds of trillions of AI model executions daily across its services. This necessitates enormous infrastructure investments, exemplified by plans to possess compute power equivalent to nearly 600,000 NVIDIA H100 GPUs by the end of 2024 and capital expenditures projected at $60-65 billion in 2025, largely focused on AI.
Despite the custom internal focus, Meta maintains a strong commitment to open source. It actively contributes hardware designs like Grand Teton and Open Rack to the Open Compute Project (OCP) and is the primary force behind the PyTorch framework. Furthermore, Meta releases its large language models, such as the Llama family, under open licenses. This open approach fosters collaboration, potentially accelerates innovation, and allows Meta to influence and benefit from broader industry ecosystems.
This strategy of building custom, vertically integrated infrastructure provides Meta with fine-grained control over performance, cost, and efficiency, tailored specifically to its dominant recommendation and generative AI workloads. However, it requires substantial, long-term investment in research and development, hardware engineering, supply chain management, and data center operations, a path feasible only for companies operating at Meta’s extreme scale.
Core Technologies
Meta’s AI infrastructure relies heavily on technologies developed or significantly contributed to by the company itself, alongside widely adopted open-source tools.
ML Framework: PyTorch is the cornerstone of Meta’s ML development. Originally developed by Meta AI and now governed by the PyTorch Foundation within the Linux Foundation, it is known for its flexibility (dynamic computation graphs), Pythonic interface, and strong GPU acceleration capabilities. It is used extensively for both research and large-scale production deployment within Meta. Key features relevant to Meta include its dynamic nature facilitating research, TorchScript for bridging research and production C++ environments, robust support for distributed training essential for large models, and experimental mobile deployment capabilities.
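To make the distributed-training point concrete, below is a minimal sketch of PyTorch’s DistributedDataParallel API, the style of training loop that underpins large multi-GPU jobs. The model, tensor sizes, and hyperparameters are placeholders, not anything from Meta’s production code.

```python
# Minimal DistributedDataParallel sketch; assumes a launcher such as torchrun has
# set MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE in the environment.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int):
    # NCCL is the usual backend for multi-GPU training on NVIDIA hardware.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)        # stand-in for a real model
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{rank}")  # placeholder batch
        loss = ddp_model(x).square().mean()
        loss.backward()                                    # gradients all-reduced across ranks
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()
```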
Data Processing: Presto, a distributed SQL query engine originally developed at Meta (then Facebook), is a key component for large-scale data analytics, capable of querying exabyte-scale data sources. It supports both interactive ad-hoc queries and large batch ETL workloads, and the Presto-on-Spark integration allows leveraging Spark’s execution engine where its batch-processing strengths are a better fit.
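As a concrete illustration of how such a system is queried, here is a short sketch using the open-source presto-python-client; the coordinator host, catalog, schema, and table names are hypothetical.

```python
# Hypothetical Presto query via the open-source presto-python-client (PyPI:
# presto-python-client); the endpoint and table names are illustrative only.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT event_type, count(*) AS n FROM events GROUP BY event_type")
for event_type, n in cur.fetchall():
    print(event_type, n)
```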
Storage Systems: Meta utilizes custom storage solutions tailored for its infrastructure. Tectonic, a distributed storage solution optimized for Flash media, handles data loading and checkpointing in Meta’s large AI training clusters. RocksDB, a high-performance embedded key-value store developed at Meta, is widely used within the company for application-level storage needs.
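For a flavor of RocksDB’s embedded key-value model, the sketch below uses the open-source python-rocksdb binding; the database path and key names are illustrative and say nothing about how Meta uses RocksDB internally.

```python
# Embedded key-value storage with the open-source python-rocksdb binding.
import rocksdb

opts = rocksdb.Options(create_if_missing=True)
db = rocksdb.DB("features.db", opts)            # local, embedded database file

db.put(b"user:42:embedding_version", b"v7")     # hypothetical key/value pair
print(db.get(b"user:42:embedding_version"))     # -> b'v7'
```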
Containerization/Orchestration: Tupperware, Meta’s internal container management system, forms the basis for its inference platform, with models deployed within Tupperware containers. Kubernetes also appears in the context of the Meta Workflow Service (MWFS), suggesting integration or adoption for certain orchestration tasks, though Meta has published few details.
Compute Infrastructure
Meta’s compute infrastructure is defined by its massive scale, custom-designed hardware components, and a multi-vendor strategy for accelerators, all housed within its own global data centers.
Data Centers: Meta designs, builds, and operates its own large-scale data centers globally. These facilities incorporate advanced designs, including AI-driven optimizations for efficiency (e.g., cooling systems) and support for high-power density racks required by AI hardware.
Hardware Platforms:
- Grand Teton: This is Meta’s open GPU hardware platform, contributed to the Open Compute Project (OCP). It features an integrated chassis design combining power, control, compute, and network fabric interfaces, simplifying deployment and improving performance, signal integrity, and thermal management compared to the previous Zion-EX generation. Grand Teton is designed to support both memory-bandwidth-bound workloads like Deep Learning Recommendation Models (DLRMs) and compute-bound workloads like content understanding. It supports accelerators from multiple vendors, including NVIDIA H100 GPUs and AMD Instinct MI300X GPUs.
- Open Rack: Meta utilizes its Open Rack standards, contributing designs like ORv3 to OCP. The ORv3 High Power Rack (HPR) is specifically designed for AI workloads, supporting up to 140kW per rack and incorporating liquid cooling solutions. Collaboration on disaggregated power racks (Mount Diablo) is also underway.
Accelerators: Meta employs a diverse strategy for AI acceleration:
- NVIDIA GPUs: Constitute the bulk of Meta’s current large-scale training infrastructure. Meta has made massive investments, building 24,000-GPU clusters based on NVIDIA H100s for training models like Llama 3, and aims to have infrastructure equivalent to nearly 600,000 H100s by the end of 2024.
- AMD GPUs: Meta supports AMD accelerators, specifically the Instinct MI300X, within its Grand Teton platform, providing vendor diversity.
- Meta Training and Inference Accelerator (MTIA): This is Meta’s own family of custom-designed ASICs (Application-Specific Integrated Circuits). The first generation focused on optimizing inference for Meta’s ubiquitous recommendation models. Subsequent generations (MTIA v2 and beyond) aim to expand capabilities, potentially including support for generative AI workloads, and are developed through close hardware-software co-design with PyTorch and target model architectures. MTIA represents a strategic effort to achieve greater performance and power efficiency for Meta’s most critical, high-volume workloads compared to general-purpose GPUs.
Networking: High-bandwidth, low-latency network fabrics are critical for interconnecting thousands of accelerators in large training clusters. Meta utilizes both RDMA over Converged Ethernet (RoCE) based on Arista switches and NVIDIA Quantum2 InfiniBand fabrics in its current clusters, supporting 400 Gbps endpoints. For future clusters, Meta is developing a Disaggregated Scheduled Fabric (DSF), aiming for greater scale, component choice, and power density. DSF will support open, standard Ethernet-based RoCE interfaces compatible with GPUs and NICs from multiple vendors (NVIDIA, Broadcom, AMD). The importance of RoCE networks for distributed AI training at scale is a key focus.
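In practice, a training job is pointed at such a fabric through the communication library rather than application code. The hedged sketch below shows common NCCL environment knobs used on RoCE or InfiniBand fabrics before initializing PyTorch’s process group; the interface names and values are typical defaults, not Meta’s actual fabric configuration.

```python
# Sketch of NCCL configuration for an RDMA-capable fabric; assumes a launcher
# (e.g. torchrun) has already set MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE.
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_IB_HCA", "mlx5")         # select RDMA-capable NICs
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # interface for bootstrap traffic
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")      # RoCE v2 GID index (typical value)

dist.init_process_group(backend="nccl", init_method="env://")
```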
Cooling: Given the high power density of AI accelerators, advanced cooling is essential. Meta utilizes liquid cooling extensively in its data centers, particularly for the high-power ORv3 racks housing GPU platforms like Grand Teton.
Data Pipelines
Meta’s data pipelines are built to handle the exabyte-scale data generated by its social media platforms and support its vast analytics and machine learning requirements.
Processing Engines: Presto is a cornerstone of Meta’s data analytics infrastructure. Developed initially at Facebook, this open-source distributed SQL query engine allows analysts and systems to run interactive and batch queries against massive datasets residing in various sources. It has evolved significantly over the past decade to handle Meta’s hyper-growth, incorporating features like hierarchical caching, native vectorized execution, materialized views, and integration with Spark (Presto on Spark) for ETL workloads.
Data Storage: For the demanding data loading and checkpointing needs of large-scale AI training clusters, Meta employs its homegrown Tectonic distributed storage solution, which is optimized for Flash media and integrated with the Grand Teton platform. RocksDB, another Meta-developed embedded key-value store, sees widespread use for various storage needs within Meta’s systems. The underlying physical storage relies on Meta’s custom data center infrastructure, utilizing high-capacity SSDs (e.g., the E1.S SSDs used in Tectonic deployments).
Data Understanding and Governance: Given the scale and sensitivity of user data on platforms like Facebook and Instagram, Meta has made substantial investments in data understanding technologies as part of its Privacy Aware Infrastructure (PAI) initiative. This is crucial for managing data privacy, ensuring compliance, and enabling responsible AI development. Key components include:
- Schematization: Using DataSchema (a standard format based on Thrift IDL) to define a canonical, logical structure for data assets across over 100 diverse data systems.
- Annotation: Employing a universal privacy taxonomy to provide a common semantic vocabulary for describing the meaning and context of data, independent of specific systems. ML models and heuristics are used to predict annotations, which are then validated or refined by developers.
- Inventorying: OneCatalog serves as Meta’s central data discovery and inventory system, registering and enumerating data assets across the company, assigning unique identifiers, and consolidating metadata for governance and compliance purposes. PAI aims to address the inherent challenges of understanding data meaning and structure across fragmented systems, inconsistent definitions, and context-dependent classifications, integrating these processes early into developer workflows.
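The toy sketch below illustrates the three PAI ideas just described (a schematized asset, a privacy annotation drawn from a shared taxonomy, and registration in a central inventory) using hypothetical class names; it is not Meta’s DataSchema or OneCatalog API.

```python
# Purely illustrative model of schematization, annotation, and inventorying;
# every name here is hypothetical.
from dataclasses import dataclass, field
from enum import Enum

class PrivacyCategory(Enum):           # stand-in for a universal privacy taxonomy
    USER_ID = "user_id"
    LOCATION = "location"
    PUBLIC_CONTENT = "public_content"

@dataclass
class FieldSchema:
    name: str
    type: str
    annotation: PrivacyCategory        # semantic meaning, independent of the storage system

@dataclass
class DataAsset:
    asset_id: str                      # unique identifier assigned at inventory time
    system: str                        # which underlying data system holds the asset
    fields: list = field(default_factory=list)

catalog: dict = {}                     # toy stand-in for a central inventory

def register(asset: DataAsset) -> None:
    catalog[asset.asset_id] = asset

register(DataAsset(
    asset_id="warehouse:events.checkins",
    system="warehouse",
    fields=[FieldSchema("user_id", "int64", PrivacyCategory.USER_ID),
            FieldSchema("place", "string", PrivacyCategory.LOCATION)],
))
```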
ML Workflow Orchestration
Meta has developed sophisticated internal platforms for orchestrating the complex workflows involved in training and deploying machine learning models at scale.
FBLearner Flow (Legacy/Current Foundation): For years, FBLearner Flow has been Meta’s primary internal ML platform, used by a significant portion (over 25%) of its engineering team.
- Components & Features: It provides an authorship and execution environment based on Python, allowing engineers to define ML pipelines (Workflows) composed of individual steps (Operators). Key features include a system of futures for automatic parallelization of non-dependent operators, dynamic DAG compilation, automatic resource allocation (CPU, GPU, memory) for operators, and an experimentation management UI for launching, visualizing, and comparing workflow runs. It also offers predefined pipelines for common algorithms. (A toy sketch of the operator-and-futures style appears after this list.)
- Architectural Limitations: Despite its success, FBLearner Flow’s monolithic architecture eventually presented challenges. It tightly coupled pipeline authoring, orchestration logic, and execution environment specifics. This made it difficult to integrate new execution backends (like GPU clusters) without wrappers, increased latency and build times due to dependency bloat, and relied on a large, monolithic database that faced scaling and reliability limits, restricting workflow size and complexity.
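Here is the toy sketch referenced above. It mimics the operator-and-futures pattern with Python’s standard concurrent.futures so that steps without mutual dependencies run in parallel; the decorator and step names are purely illustrative and are not FBLearner Flow’s real API.

```python
# Hypothetical operator/future sketch using only the standard library.
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def operator(fn):
    """Wrap a pipeline step so calling it returns a future instead of running inline."""
    def submit(*args, **kwargs) -> Future:
        return executor.submit(fn, *args, **kwargs)
    return submit

@operator
def load_features(table: str) -> list:
    return [1.0, 2.0, 3.0]             # placeholder feature loading

@operator
def load_labels(table: str) -> list:
    return [0, 1, 0]                   # placeholder label loading

@operator
def train_model(features: list, labels: list) -> str:
    return f"model trained on {len(features)} rows"

# The two loads have no dependency on each other, so they run concurrently;
# training starts only after both futures have resolved.
f_feat = load_features("hive.features")
f_lab = load_labels("hive.labels")
print(train_model(f_feat.result(), f_lab.result()).result())
```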
Meta Workflow Service (MWFS) (Next Generation): To overcome FBLearner Flow’s limitations, Meta re-architected its orchestration layer by developing the Meta Workflow Service (MWFS).
- Rationale: The goal was to create a more flexible, scalable, and reliable orchestration engine by establishing a clear separation of concerns between workflow definition, orchestration, execution, and observability.
- Architecture: MWFS is designed as a horizontally scalable, event-driven service built on SQL and distributed queues (FOQS). It focuses solely on orchestration: tracking workflow state, managing dependencies between nodes (steps) in a DAG, and invoking the appropriate actions for the next steps. It decouples orchestration from execution through an Action Service, which handles the interaction with various execution runtimes (e.g., submitting jobs to CPU/GPU clusters). An Event Router provides a pub/sub mechanism for real-time observability. MWFS supports both ephemeral (one-shot) and non-ephemeral (reusable definition) workflows and uses a state machine model for node execution (Pending -> Ready -> Running -> Finishing -> Finished) managed by NodeDecider and NodeExecutor components.
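To make the node lifecycle concrete, below is a small, hypothetical sketch of the Pending/Ready/Running/Finishing/Finished state machine with a decider and an executor; MWFS’s real NodeDecider and NodeExecutor interfaces are internal to Meta and considerably more involved.

```python
# Toy state machine for workflow nodes; all names are illustrative.
from enum import Enum

class NodeState(Enum):
    PENDING = 1
    READY = 2
    RUNNING = 3
    FINISHING = 4
    FINISHED = 5

class Node:
    def __init__(self, name: str, deps: list):
        self.name, self.deps, self.state = name, deps, NodeState.PENDING

def decide(node: Node, nodes: dict) -> None:
    """Promote a pending node to ready once every upstream dependency has finished."""
    if node.state is NodeState.PENDING and all(
        nodes[d].state is NodeState.FINISHED for d in node.deps
    ):
        node.state = NodeState.READY

def execute(node: Node) -> None:
    """Drive a ready node through running and finishing to finished (real work elided)."""
    if node.state is NodeState.READY:
        node.state = NodeState.RUNNING
        node.state = NodeState.FINISHING    # e.g. flush logs, release resources
        node.state = NodeState.FINISHED

nodes = {"extract": Node("extract", []), "train": Node("train", ["extract"])}
for _ in range(len(nodes)):                 # naive scheduling loop for illustration
    for n in nodes.values():
        decide(n, nodes)
        execute(n)
print({k: v.state.name for k, v in nodes.items()})
```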
This evolution from the monolithic FBLearner Flow to the decoupled, event-driven MWFS architecture represents a significant engineering effort to build a more modern, robust, and future-proof platform. It allows different parts of the ML ecosystem (SDKs, execution environments, monitoring tools) to evolve independently and provides the flexibility needed to orchestrate complex workflows across Meta’s diverse and large-scale compute infrastructure, including specialized hardware like MTIA.
Model Training & Serving Infrastructure
Meta’s infrastructure is designed to support both the massive computational demands of training state-of-the-art AI models and the high-throughput, low-latency requirements of serving these models to billions of users.
Model Training: Training, especially for large foundation models like Llama, occurs on dedicated, large-scale clusters built with custom hardware. These clusters can comprise tens of thousands of GPUs (e.g., NVIDIA H100s) housed in Grand Teton servers within ORv3 racks, interconnected by high-speed fabrics (RoCE or InfiniBand). PyTorch is the primary framework, utilizing its distributed training capabilities. The training process itself is orchestrated by Meta’s workflow systems (FBLearner Flow / MWFS). Efficient checkpointing, crucial for long-running training jobs on potentially unreliable hardware, is handled by the Tectonic distributed storage system.
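A minimal sketch of the checkpointing pattern this relies on appears below: periodically persisting model and optimizer state to shared storage and resuming from the newest checkpoint after a failure. The mount path and file layout are hypothetical; Tectonic’s actual interfaces are internal to Meta.

```python
# Generic PyTorch checkpointing sketch; the path below is a hypothetical mount
# point for a shared, flash-backed store.
import os
import torch

CKPT_DIR = "/mnt/training_checkpoints"

def save_checkpoint(step: int, model: torch.nn.Module, optimizer: torch.optim.Optimizer):
    os.makedirs(CKPT_DIR, exist_ok=True)
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        os.path.join(CKPT_DIR, f"step_{step:08d}.pt"),
    )

def load_latest_checkpoint(model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> int:
    """Resume from the newest checkpoint if one exists; return the step to restart at."""
    if not os.path.isdir(CKPT_DIR):
        return 0
    files = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".pt"))
    if not files:
        return 0
    state = torch.load(os.path.join(CKPT_DIR, files[-1]))
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```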
Model Serving/Inference: Meta’s serving infrastructure handles an enormous volume, measured in hundreds of trillions of model executions per day. Key aspects include:
- Containerization: Models are typically deployed in containers managed by Meta’s internal system, Tupperware.
- Optimization Focus: Significant effort is dedicated to optimizing inference latency and throughput, especially for LLMs. This involves strategies for fitting models onto various hardware configurations and implementing techniques like caching (a small caching sketch appears after this list).
- Custom Hardware: Meta leverages its custom MTIA silicon, particularly for high-volume inference workloads like ranking and recommendation models, aiming for improved efficiency and performance compared to general-purpose GPUs.
- Heterogeneous Hardware Management: The serving infrastructure must accommodate dynamic user requests and operate effectively across a diverse fleet of hardware, including different types of GPUs and custom accelerators.
- Scalable Deployment: Models like Llama 4 Maverick are designed with deployment flexibility in mind, capable of running on a single powerful host (like an NVIDIA DGX H100) or being distributed across multiple accelerators for maximum efficiency.
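As a small, concrete example of the caching technique referenced earlier in this list, the sketch below wraps a toy model’s forward pass in an in-process result cache. Production LLM serving more commonly caches attention key/value state rather than whole results, but the principle of avoiding recomputation is the same; the model and cache size here are illustrative.

```python
# Toy result-caching sketch around a placeholder model's forward pass.
from functools import lru_cache
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
model.eval()

@lru_cache(maxsize=10_000)
def cached_predict(feature_key: tuple) -> tuple:
    """Run the model once per distinct feature vector; repeats are served from cache."""
    with torch.no_grad():
        scores = model(torch.tensor(feature_key).unsqueeze(0))
    return tuple(scores.squeeze(0).tolist())

request = tuple(float(i) for i in range(16))
print(cached_predict(request))   # computed
print(cached_predict(request))   # served from the in-process cache
```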
The scale of Meta’s operations necessitates end-to-end control over its infrastructure. Custom hardware like Grand Teton and MTIA, combined with sophisticated software systems for orchestration (MWFS), training (PyTorch), and serving (Tupperware-based platform with advanced optimization techniques), are essential to achieve the required performance, efficiency, and cost-effectiveness for both training massive foundation models and serving billions of users daily.
Key Challenges & Solutions
Meta faces a unique set of challenges driven by the confluence of social media scale, cutting-edge AI research (especially GenAI), and the complexities of operating custom hyperscale infrastructure.
Challenge: Scaling AI Training and Inference: The sheer size of foundation models like Llama requires unprecedented computational power for training (tens of thousands of GPUs) and efficient infrastructure for inference at a global scale.
Solution: Massive investment in custom data centers, hardware (Grand Teton, ORv3), high-performance networking (RoCE, InfiniBand, DSF), and accelerators (large fleets of NVIDIA GPUs, development of MTIA for inference optimization). Continuous focus on hardware-software co-design with PyTorch and model architectures to maximize efficiency. Development of sophisticated serving strategies including caching and optimization for heterogeneous hardware.
Challenge: Social Media Data Complexity and Privacy: Managing and deriving insights from exabytes of diverse, unstructured, and highly sensitive user-generated content across multiple platforms poses immense technical and regulatory challenges. Ensuring AI models respect privacy and comply with global regulations like GDPR is paramount, often conflicting with the need for diverse, localized data for training relevant models.
Solution: Building the comprehensive Privacy Aware Infrastructure (PAI) with components like DataSchema, universal annotation taxonomies, and OneCatalog for systematic data understanding and governance. Ongoing navigation of complex regulatory environments (demonstrated by pauses and challenges in the EU and Brazil). Implementing responsible AI development practices, including pre- and post-training mitigations and red-teaming for models like Llama.
Challenge: Infrastructure Reliability at Extreme Scale: Operating massive, multi-tenant ML clusters with thousands of accelerators makes failures (hardware, software, network) inevitable. Ensuring job completion, especially for long-running, large-scale training, is critical.
Solution: Systematic analysis of failure modes and rates across different job scales. Designing reliability-aware infrastructure, system software (e.g., robust checkpointing via Tectonic), and algorithms. Implementing redundancy and rapid fault detection/mitigation strategies across the infrastructure stack.
Challenge: Developer Productivity and Platform Evolution: Enabling thousands of internal engineers and scientists to effectively utilize complex infrastructure and rapidly iterate on ML models requires user-friendly and efficient platforms. Legacy platforms like FBLearner Flow eventually required significant re-architecture to meet evolving needs.
Solution: Continuous development and evolution of internal platforms, moving from monolithic designs (FBLearner Flow) towards more modular, scalable, and flexible architectures (MWFS). Providing abstractions (SDKs, UIs) to simplify interaction with complex underlying systems.
Meta’s approach involves tackling these interconnected challenges through deep vertical integration, substantial R&D investment in both custom hardware and software platforms, and a continuous process of architectural evolution to keep pace with the demands of AI at hyperscale while navigating complex data privacy and reliability constraints.