Most discussions of AI chip performance focus on TFLOPS, the raw compute throughput number that vendors put on their datasheets. It is the wrong metric for most AI workloads in 2026. The binding constraint for language model inference is not compute; it is memory bandwidth. Understanding this distinction, which engineers call the ‘memory wall’, changes which chips look attractive, which benchmarks matter, and why AMD’s memory advantage over NVIDIA is operationally more significant than the raw performance comparison suggests.
The Roofline Model — Why Memory Bounds Inference
The roofline model is the framework engineers use to determine whether a given workload is compute-bound or memory-bound. The concept is straightforward: every computation has an arithmetic intensity — the ratio of mathematical operations performed to bytes of data moved. Every chip has two performance ceilings: peak compute (TFLOPS) and peak memory bandwidth (TB/s). Whichever ceiling you hit first determines your actual performance.
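The two ceilings can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark: the function names and the chip figures (1000 TFLOPS, 4 TB/s) are hypothetical and do not come from any datasheet.

```python
def attainable_tflops(arithmetic_intensity, peak_tflops, peak_bandwidth_tbs):
    """Roofline: attainable performance is the lower of the two ceilings.

    arithmetic_intensity: FLOPs performed per byte moved
    peak_tflops:          compute ceiling, in TFLOPS
    peak_bandwidth_tbs:   memory bandwidth ceiling, in TB/s
    """
    # (FLOP/byte) * (TB/s) gives TFLOP/s, so the units line up directly.
    memory_ceiling = arithmetic_intensity * peak_bandwidth_tbs
    return min(peak_tflops, memory_ceiling)


def ridge_point(peak_tflops, peak_bandwidth_tbs):
    """Arithmetic intensity (FLOPs/byte) at which the two ceilings meet."""
    return peak_tflops / peak_bandwidth_tbs


# Hypothetical chip used only for illustration.
PEAK_TFLOPS = 1000.0
PEAK_BW_TBS = 4.0

ridge = ridge_point(PEAK_TFLOPS, PEAK_BW_TBS)
low_intensity = attainable_tflops(2.0, PEAK_TFLOPS, PEAK_BW_TBS)     # memory-bound
high_intensity = attainable_tflops(500.0, PEAK_TFLOPS, PEAK_BW_TBS)  # compute-bound

print(f"ridge point: {ridge} FLOPs/byte")
print(f"intensity 2   -> {low_intensity} TFLOPS attainable")
print(f"intensity 500 -> {high_intensity} TFLOPS attainable")
```

Any workload whose arithmetic intensity falls below the ridge point is limited by bandwidth, regardless of how much peak compute the chip advertises.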
For large language model inference (generating tokens from a model), arithmetic intensity is low. The model’s weights must be loaded from memory for each token generated, and the key-value cache must be read and written continuously. Modern language models in production are almost universally memory-bandwidth-bound, not compute-bound. A chip with 2x the TFLOPS but the same memory bandwidth will not run inference 2x faster. It will run inference at approximately the same speed, with the extra compute sitting idle.
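A back-of-the-envelope estimate makes the "extra compute sits idle" point concrete. The sketch below rests on simplifying assumptions that are not from the article: batch size 1, roughly 2 FLOPs per parameter per generated token, KV cache traffic ignored, and a hypothetical 13B-parameter FP16 model on a hypothetical 500 TFLOPS, 2.0 TB/s chip.

```python
def decode_time_per_token_s(n_params, bytes_per_param, peak_tflops, bandwidth_tbs):
    """Rough lower bound on the time to generate one token at batch size 1.

    Assumptions (illustrative only): ~2 FLOPs per parameter per token,
    every weight is read from memory once per token, KV cache traffic ignored.
    """
    compute_s = 2 * n_params / (peak_tflops * 1e12)
    memory_s = n_params * bytes_per_param / (bandwidth_tbs * 1e12)
    # Whichever resource takes longer sets the pace.
    return max(compute_s, memory_s)


N_PARAMS = 13e9   # hypothetical 13B-parameter model
FP16_BYTES = 2

base = decode_time_per_token_s(N_PARAMS, FP16_BYTES, peak_tflops=500, bandwidth_tbs=2.0)
doubled_compute = decode_time_per_token_s(N_PARAMS, FP16_BYTES, peak_tflops=1000, bandwidth_tbs=2.0)
doubled_bandwidth = decode_time_per_token_s(N_PARAMS, FP16_BYTES, peak_tflops=500, bandwidth_tbs=4.0)

print(f"baseline:          {base * 1e3:.2f} ms/token")
print(f"2x TFLOPS:         {doubled_compute * 1e3:.2f} ms/token")
print(f"2x bandwidth:      {doubled_bandwidth * 1e3:.2f} ms/token")
```

Under these assumptions, doubling TFLOPS leaves the per-token time unchanged, while doubling bandwidth halves it. Larger batch sizes raise arithmetic intensity and shift the balance, so treat this as a single-stream estimate only.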
The GPU compute arms race (doubling TFLOPS every 18 months) has produced chips whose compute capacity far exceeds what inference workloads can actually use. Memory is the constraint: bandwidth sets how fast each token can be generated, and capacity sets how large a model and KV cache fit on one chip. This is why AMD’s 288 GB HBM3e capacity advantage matters more than raw TFLOPS comparisons suggest.

Figure 1: GPU compute (TFLOPS) has grown 25x since 2018; memory bandwidth has grown only 7x. The divergence is the Memory Wall — and it determines why inference is memory-bound for most production LLM workloads.
HBM — The Technology Closing the Gap
High Bandwidth Memory (HBM) is stacked DRAM that sits directly alongside the GPU die in the same package on AI chips. The DRAM dies are stacked vertically (3D stacking) and connected to the processor through thousands of parallel wires. It delivers dramatically higher bandwidth than the DDR memory used in conventional servers by placing memory physically adjacent to the processor and using a very wide interface rather than the much narrower bus of a DDR module.

Figure 2: HBM generation comparison by chip. AMD MI355X’s 288 GB capacity allows 70B+ models to run on a single chip — eliminating the inter-chip communication overhead that slows two-chip NVIDIA configurations.
The HBM generation progression explains more about AI chip capability evolution than the TFLOPS progression that dominates marketing materials:
- HBM2e (A100 era): 2.0 TB/s per chip, 80 GB capacity. Inference on 13B+ models is memory-bandwidth-bound. Two chips required for 30B+ models.
- HBM3 (H100): 3.35 TB/s, 80 GB. Bandwidth improves; capacity unchanged. The model size at which inference becomes memory-bound moves up to roughly 34B parameters.
- HBM3e (B200 and MI355X): The divergence point. NVIDIA B200: 8.0 TB/s, 192 GB. AMD MI355X: 5.2 TB/s, 288 GB. AMD sacrifices bandwidth headroom for capacity — the right trade-off for serving large models.
- HBM4 (Vera Rubin, late decade): Estimated 14–16 TB/s, ~384 GB. Compute-bound operation extends to 100B+ models. The memory wall retreats but does not disappear.
Why AMD’s 288 GB Changes the Operational Equation
A 70B-parameter model in FP16 precision requires approximately 140 GB of memory to load. NVIDIA B200 has 192 GB — it fits, but with only 52 GB remaining for the KV cache that accumulates during inference. AMD MI355X has 288 GB — it fits with 148 GB remaining for KV cache. That difference is not marginal. It determines whether you can serve long-context queries without evicting KV cache entries (which forces recomputation) and how many concurrent users a single chip can serve before memory pressure degrades performance.
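The arithmetic above can be reproduced with a short script. The capacities and model size are the ones quoted in this section; the function names are illustrative, and the calculation deliberately ignores activations and runtime overhead, so real headroom will be somewhat smaller.

```python
def weights_gb(n_params, bytes_per_param):
    """Memory needed just to hold the model weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_headroom_gb(hbm_capacity_gb, n_params, bytes_per_param):
    """Capacity left for KV cache after loading weights (ignores other overhead)."""
    return hbm_capacity_gb - weights_gb(n_params, bytes_per_param)


N_PARAMS = 70e9   # 70B-parameter model
FP16_BYTES = 2    # bytes per parameter in FP16

# Capacities as quoted in the text.
chips = {"NVIDIA B200": 192, "AMD MI355X": 288}

model_gb = weights_gb(N_PARAMS, FP16_BYTES)
headroom = {name: kv_headroom_gb(cap, N_PARAMS, FP16_BYTES) for name, cap in chips.items()}

print(f"weights: {model_gb:.0f} GB")
for name, gb in headroom.items():
    print(f"{name}: {gb:.0f} GB left for KV cache")
```

Changing `FP16_BYTES` to 1 models an 8-bit quantized deployment, which halves the weight footprint and shifts the comparison accordingly.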
The operational implication: for organizations serving 70B+ models at long context lengths or high concurrency, where weights plus KV cache exceed a single B200’s 192 GB, a single AMD MI355X can do the work that would otherwise require two NVIDIA B200s in a tensor-parallel configuration. Running on one chip eliminates the NVLink communication overhead between chips, reduces power consumption, simplifies the serving infrastructure, and cuts hardware cost. AMD cloud pricing is also 40–60% below H100-equivalent at comparable providers.
The KV Cache — Where Memory Pressure Lives in Production
The key-value cache is the mechanism by which transformer models maintain context during inference: it stores computed representations of all previous tokens in a conversation. Poor KV cache management is the most common source of memory waste in production AI inference systems, and understanding it is a prerequisite to understanding why memory capacity matters more than raw TFLOPS for real-world deployment.
Because KV cache size grows linearly with context length, a 10,000-token context requires approximately 5x as much KV cache memory as a 2,000-token context. In a system serving 100 concurrent users with 8,000-token average context lengths, the KV cache alone can consume 60–80% of available GPU memory, leaving little headroom for the model weights themselves without careful memory management.
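KV cache size can be estimated from a model's shape. The sketch below uses the standard per-token formula (keys plus values, for every layer and KV head). The configuration is an assumption chosen to resemble a 70B-class model with grouped-query attention; it is not taken from the article or from any specific model card.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache needed to hold n_tokens of context.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical 70B-class configuration with grouped-query attention.
CONFIG = dict(n_layers=80, n_kv_heads=8, head_dim=128)

short_ctx = kv_cache_bytes(2_000, **CONFIG)
long_ctx = kv_cache_bytes(10_000, **CONFIG)
# 100 concurrent users at an average of 8,000 tokens each.
fleet = kv_cache_bytes(100 * 8_000, **CONFIG)

print(f"2,000 tokens:  {short_ctx / 1e9:.2f} GB")
print(f"10,000 tokens: {long_ctx / 1e9:.2f} GB ({long_ctx // short_ctx}x)")
print(f"100 users x 8,000 tokens: {fleet / 1e9:.1f} GB")
```

With this assumed configuration, the 100-user scenario needs roughly 262 GB of KV cache on its own, which is why cache management and capacity dominate sizing decisions. Models without grouped-query attention have more KV heads and proportionally larger caches.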
PagedAttention, the technique developed by the vLLM project and now used in production at Meta, Mistral AI, Cohere, and IBM, manages cache memory in fixed-size pages rather than contiguous allocations. It targets the 60–80% of KV cache memory that the vLLM authors report is wasted by fragmentation and over-reservation under contiguous allocation. Enabling PagedAttention and prefix caching (which reuses computed KV cache entries for identical prompt prefixes) is the single most impactful software optimization available for memory-bound inference workloads.
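The allocation idea behind paging can be illustrated with a toy block-table allocator. This is a simplified sketch of the memory-management side only, not vLLM's implementation: the real technique also needs an attention kernel that reads non-contiguous pages. The class name, page size, and sequence lengths are all hypothetical.

```python
class PagedKVAllocator:
    """Toy allocator: KV slots are handed out one fixed-size page at a time."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token, taking a new page only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.page_size:  # no page yet, or last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all of a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Reserved but unused token slots (only the tail of each last page)."""
        reserved = sum(len(t) for t in self.block_tables.values()) * self.page_size
        return reserved - sum(self.seq_lens.values())


MAX_LEN = 2048  # what a contiguous scheme would reserve per sequence up front
lengths = {"a": 100, "b": 37, "c": 900}  # hypothetical actual sequence lengths

alloc = PagedKVAllocator(num_pages=128, page_size=16)
for seq_id, n in lengths.items():
    for _ in range(n):
        alloc.append_token(seq_id)

contiguous_waste = sum(MAX_LEN - n for n in lengths.values())
paged_waste = alloc.wasted_slots()
print(f"contiguous reservation wastes {contiguous_waste} slots")
print(f"paged allocation wastes       {paged_waste} slots")
```

In the paged scheme, waste is bounded by less than one page per sequence, whereas reserving the maximum length up front wastes everything a sequence never uses. Pages freed by a finished sequence are immediately available to any other sequence.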
Near-Memory Compute and In-Package Memory — The Next Frontier
The HBM stack represents one approach to closing the memory wall: move memory closer to the compute die. The next architectural evolution takes this further: placing compute elements inside the memory stack itself, or integrating memory and compute in the same package at manufacturing time.
Cerebras’s wafer-scale approach (900,000 cores on a single silicon wafer with on-chip SRAM) eliminates external memory access entirely for models that fit in that on-chip SRAM. The result is 2,700+ tokens/second inference on 120B models, 3x NVIDIA Blackwell throughput, driven entirely by eliminating the memory bandwidth constraint rather than adding more compute.
Samsung and SK Hynix are developing ‘Processing-in-Memory’ HBM variants that place compute elements within the memory stack. These are still research-stage for general AI workloads, but they represent the logical endpoint of the near-memory compute trajectory: a chip where the memory wall does not exist because computation happens where the data lives.
The organizations that will have the best AI infrastructure economics in 2028 are the ones that size their hardware on memory capacity and bandwidth first, compute TFLOPS second. The memory wall is real, it is measurable, and it is the constraint that determines production performance for every major language model workload today.
