Every GPU generation pushes against the same obstacle: memory. Models grow faster than memory capacity, forcing engineers into complex multi-GPU setups, aggressive quantization, or painful trade-offs.
NVIDIA’s Blackwell architecture, which succeeded Hopper in 2024, attacks this problem at the hardware level, rethinking not just how much memory a GPU has, but how it is structured and accessed.
Running Llama 3 70B is no longer a struggle: no model parallelism, no squeezing weights into tight memory limits. The same hardware footprint can now handle significantly larger parameter counts.
This article breaks down the memory enhancements that make Blackwell the most capable AI accelerator to date.
Prerequisites
This article assumes you are comfortable with a few GPU basics. If any of these feel shaky, the linked resources will get you up to speed in 10-15 minutes.
GPU anatomy — what an SM is, and the roles of registers, shared memory (L1), the L2 cache, and memory controllers. (Memory classification of GPUs)
Three memory metrics — capacity (how much fits), bandwidth (how fast data moves), and latency (how long a single access takes). These are not interchangeable, and Blackwell improves each of them differently. (GPU memory bandwidth)
Types of GPU memory — HBM, GDDR, and LPDDR5X, and the bandwidth/capacity/power trade-offs between them. (CUDA GPU Memory Types)
Chip interconnects — PCIe, NVLink, and the chip-to-chip (C2C) link concept. (AI Systems Game)
If you’re solid on all four, you’re ready.
The generational leap
Before diving into how Blackwell achieves its performance gains, here’s what’s changed from the previous GPU generation:
| Detail | Hopper H100 | Blackwell B200 | Change |
|---|---|---|---|
| HBM capacity | 80 GB (HBM3) | 192 GB (HBM3e) | 2.4× |
| HBM bandwidth | 3.35 TB/s | 8 TB/s | 2.4× |
| L2 cache | 50 MB | 126 MB | 2.5× |
| L1/shared memory per SM | 256 KB | 128 KB | 0.5× |
| Die design | Monolithic | Dual-die (MCM) | – |
| CPU integration | Separate (PCIe) | Unified (NVLink-C2C) | – |
The numbers tell a clear story: more memory, more bandwidth, bigger caches. The rest of this article explains how these pieces fit together.
GB200 Super Chip

The Grace Blackwell (GB200) extends the superchip design NVIDIA introduced with the Grace Hopper (GH200), where an ARM-based Grace CPU is combined with GPU chips in a single package to form a unified computing system.
In the Blackwell generation, the GB200 combines a Grace CPU with two Blackwell GPUs, connected via NVLink Chip-to-Chip (NVLink-C2C), a high-bandwidth interface that lets the CPU and GPUs share memory and operate as a single system.
Grace CPU
The Grace CPU is an ARM Neoverse V2-based processor designed by NVIDIA for bandwidth and efficiency. It handles general-purpose tasks, preprocessing, and tokenization, and feeds data to the GPUs via NVLink-C2C. It also acts as extended memory for the GPUs.
The Grace CPU runs at a moderate clock speed but compensates with massive memory bandwidth: its LPDDR5X memory (low-power double data rate 5X — more on that in a moment) delivers up to 500 GB/s, backed by around 100 MB of L3 cache.
LPDDR5X (Low Power Double Data Rate 5x)
LPDDR5X is a high-speed memory standard that delivers up to 10.7 Gbps per pin. Its low power consumption makes it ideal for this use case.
It strikes a strong balance between performance and power efficiency, delivering up to 500 GB/s while consuming only about 16 W, roughly one fifth the power of traditional DDR5 memory.
Blackwell GPU
The Blackwell GPU offers significant improvements over the previous Hopper GPU model, especially in terms of memory. Blackwell GPUs are designed as dual-die GPUs, with two GPU dies in a single module.
The two dies are connected via the super-fast NV-HBI (NVIDIA High-Bandwidth Interface) running at 10 TB/s, which lets them behave as one GPU at full performance. Each die has 104 billion transistors, for a total of 208 billion, and each carries 96 GB of HBM3e memory, for a total of 192 GB, of which 180 GB is usable (12 GB is reserved for error-correcting code (ECC), system firmware, and so on).
Alongside the capacity increase, the Blackwell GPU’s memory bandwidth is approximately 2.4× that of the Hopper generation.
The L2 cache was also increased to 126 MB. By expanding the L2 cache, Blackwell can store more neural network weights or intermediate results on-chip, avoiding additional trips to the HBM. This ensures that the GPU’s compute units are rarely starved for data.

High Bandwidth Interface (NV-HBI)
A high-bandwidth interface is a standard for die-to-die (D2D) communication. The NVIDIA High-Bandwidth Interface (NV-HBI) offers a 10 TB/s connection, combining the two GPU dies into a single, unified GPU.
NVLink-C2C (Chip-to-Chip)
NVLink-C2C provides up to ~900 GB/s of bandwidth between the Grace CPU and the Blackwell GPUs, eliminating the need to copy data from CPU memory to the GPU memory pool over the PCIe bus.
That is far faster than a normal PCIe bus: by comparison, a PCIe Gen6 x16 link offers only 128 GB/s per direction. NVLink-C2C is also cache-coherent, meaning the CPU and GPU share a coherent memory architecture in which the CPU can read and write GPU memory and vice versa.
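To make the link speeds concrete, here is a back-of-envelope sketch of how long it would take to move a 140 GB weight set (the FP16 Llama 3 70B footprint discussed later) over each interface. The bandwidth figures come from this article and are nominal peaks, not sustained real-world numbers.

```python
# Idealized transfer times over each interconnect; bandwidths are the
# nominal peak figures quoted in this article (no protocol overheads).

LINKS_GB_PER_S = {
    "PCIe Gen6 x16 (per direction)": 128,
    "NVLink-C2C": 900,
    "HBM3e (on-package)": 8000,
}

def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move `size_gb` at `bandwidth_gb_s`, ignoring latency."""
    return size_gb / bandwidth_gb_s

for name, bw in LINKS_GB_PER_S.items():
    print(f"{name}: {transfer_seconds(140, bw):.3f} s")
```

Even this crude model shows why avoiding PCIe matters: the same copy is roughly 7× faster over NVLink-C2C.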
This unified memory architecture is called Unified CPU-GPU Memory or Extended GPU Memory (EGM) by NVIDIA.
Memory Hierarchy and Bandwidth
Understanding how data flows through Blackwell’s memory system is key to optimizing AI workloads. The architecture follows a classic hierarchical principle: smaller, faster memory is closest to the compute units, with progressively larger but slower memory tiers spreading outward.
Classification at a glance
| Memory tier | Capacity | Bandwidth | Purpose |
|---|---|---|---|
| Registers | ~256 KB per SM | immediate | Active operands |
| L1/shared memory | ~128 KB per SM | ~40 TB/s aggregate | Data staging, inter-thread sharing |
| L2 cache | 63 MB per die (126 MB total) | ~20 TB/s | Cross-SM data reuse |
| HBM3e | 192 GB (180 usable) | 8 TB/s | Model weights, activations |
| LPDDR5X (CPU) | ~480 GB | ~500 GB/s (link: ~900 GB/s NVLink-C2C) | Overflow, large embeddings |

Registers and L1/Shared Memory
A streaming multiprocessor (SM) executes compute instructions on the GPU. At the lowest level, each SM has a register file and configurable L1/shared memory, as shown in the figure above. Registers hold the operands for active computations, i.e., the data the GPU cores are currently working on.
An SM runs threads in fixed-size groups known as warps; each warp contains exactly 32 threads that execute the same instruction in lockstep. L1/shared memory acts as a staging area, letting threads within an SM share data without a round trip to slower memory.
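The warp grouping is purely index-based, which a tiny sketch makes explicit. The helper name below is illustrative, but the arithmetic mirrors how CUDA derives a thread's warp and lane from its index within a block.

```python
# Threads are grouped into warps of 32; a thread's warp and its lane
# within that warp follow directly from its linear index.

WARP_SIZE = 32  # fixed across current NVIDIA architectures, incl. Blackwell

def warp_and_lane(thread_idx: int) -> tuple[int, int]:
    """Return (warp id, lane within warp) for a linear thread index."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE

# Threads 0-31 share warp 0, threads 32-63 share warp 1, and so on.
print(warp_and_lane(0))   # (0, 0)
print(warp_and_lane(45))  # (1, 13)
```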
Blackwell’s L1/shared memory is 128 KB per SM by default, down from Hopper’s 256 KB; in certain configurations it can be raised to 228 KB per SM. The aggregate bandwidth across all SMs is approximately 40 TB/s.
Why the reduction? NVIDIA moved capacity into dedicated tensor memory (TMEM) for tensor core operations and compensated with a larger L2 cache. General-purpose shared-memory workloads see less capacity per SM, but matrix-multiply workloads, the ones that matter most, get dedicated fast memory.
L2 cache: compensating for the smaller L1
The L2 cache sits between the SMs and the HBM and is shared across all compute units on a die. Blackwell provides 63 MB per die (126 MB total in the dual-die module), a 2.5× increase over Hopper’s 50 MB that compensates for the smaller per-SM L1.
In AI workloads, the same model weights are accessed repeatedly in different input batches. A larger L2 cache means more of these weights can be cached between batches, reducing expensive HBM visits. For inference serving, where a single model handles thousands of requests, this translates directly into lower latency and higher throughput.
The dual-die design introduces complexity here. Each die has its own 63 MB L2 partition, and accessing data cached on the other die requires traversing the NV-HBI interconnect, which, at 10 TB/s, is fast but still slower than a local L2 hit. NVIDIA’s software stack handles this transparently, but performance-conscious engineers should be aware that data placement across dies can affect cache behavior.
HBM3e: the main memory pool
High-bandwidth memory (HBM3e) serves as the primary storage for model weights, activations, gradients, and input data. Blackwell’s HBM3e provides 8 TB/s bandwidth per GPU, which is about 2.4× faster than Hopper’s 3.35 TB/s HBM3.
The physical implementation uses an 8-Hi stack design: eight DRAM dies are stacked vertically, each providing 3 GB, for 24 GB per stack. With eight stacks in total (four per GPU die), the B200 provides 192 GB of on-package memory, of which 180 GB is usable after accounting for ECC and system overhead.
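The stack arithmetic is easy to sanity-check. The figures below are the ones quoted in this article:

```python
# Sanity-checking the HBM3e stack arithmetic from the text.

DIES_PER_STACK = 8        # "8-Hi" vertical stack
GB_PER_DRAM_DIE = 3
STACKS_PER_MODULE = 8     # four stacks attached to each of the two GPU dies
ECC_AND_FIRMWARE_GB = 12  # reserved for ECC, firmware, etc.

total_gb = DIES_PER_STACK * GB_PER_DRAM_DIE * STACKS_PER_MODULE
usable_gb = total_gb - ECC_AND_FIRMWARE_GB
print(total_gb, usable_gb)  # 192 180
```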
This bandwidth increase is significant. Tensor core operations consume data at very high rates; if the HBM cannot supply data fast enough, the compute units stall, leaving expensive silicon idle. Blackwell’s 8 TB/s keeps the tensor cores fed even during the largest matrix multiplications.
LPDDR5X: the extended tier
Beyond the GPU’s HBM sits the Grace CPU’s LPDDR5X memory: around 480 GB, locally accessible at 500 GB/s and reachable from the GPU over the ~900 GB/s NVLink-C2C link.
GPU-to-LPDDR5X accesses have roughly 10× less bandwidth and higher latency than HBM, but they are still much faster than NVMe SSDs or network storage.
LPDDR5X acts as a high-speed overflow tier. Data that does not fit into HBM, such as large embedding tables, KV caches for long-context inference, or checkpoint buffers, can reside in CPU memory without a catastrophic performance penalty.
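To see why the overflow tier matters, consider the KV cache for long contexts. The model shape numbers below (80 layers, 8 KV heads via grouped-query attention, head dimension 128) are commonly published figures for a Llama-3-70B-class model, not values from this article, so treat the result as a rough estimate.

```python
# Rough KV-cache footprint for one sequence of a Llama-3-70B-like model.
# Shape parameters are assumed public figures, not from the article.

def kv_cache_gb(seq_len: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """GB needed for keys + values across all layers, one sequence."""
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
    return seq_len * per_token / 1e9

print(f"{kv_cache_gb(128_000):.1f} GB")  # one 128k-token context
```

A single 128k-token context already consumes tens of gigabytes; with many concurrent sequences, spilling cold cache entries to LPDDR5X is far preferable to evicting them to disk.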
Data Flow in Practice
When a Blackwell GPU executes an AI workload, data flows through this hierarchy in stages:
Model loading: weights move from storage → CPU memory → HBM (or stay in LPDDR5X if HBM is full).
Batch processing: input data streams into HBM, then into L2 as the SMs request it.
Computation: active data moves from L2 into L1/shared memory and registers.
Output: results flow back down the hierarchy into HBM or CPU memory.
Each level acts as a buffer for the level above it.
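The staged flow above can be mimicked with a toy placement policy: put each tensor in the fastest tier that still has room, spilling outward. The function is illustrative, not a real allocator; the capacities follow this article's GB200 figures.

```python
# Toy greedy placement mirroring the staged flow: fastest tier first,
# spill outward. Capacities follow the article's GB200 figures.

def place(tensors_gb: dict, hbm_free: float = 180.0,
          lpddr_free: float = 480.0) -> dict:
    """Assign each tensor to HBM3e, LPDDR5X, or storage, greedily."""
    placement = {}
    for name, size in tensors_gb.items():
        if size <= hbm_free:
            placement[name], hbm_free = "HBM3e", hbm_free - size
        elif size <= lpddr_free:
            placement[name], lpddr_free = "LPDDR5X", lpddr_free - size
        else:
            placement[name] = "storage"
    return placement

print(place({"weights": 140, "kv_cache": 60, "embeddings": 100}))
```

Here the 140 GB of weights land in HBM, while the KV cache and embeddings, which no longer fit on-package, overflow gracefully to CPU memory instead of disk.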
Practical example: Running Llama 3 70B
Consider deploying Llama 3 70B for inference. At FP16 precision, the model weights alone require about 140 GB of memory (and with this much capacity you could, if desired, quantize as low as FP4).
On a Hopper H100 (80 GB HBM3): the model does not fit. You must either quantize aggressively, use tensor parallelism across multiple GPUs, or offload layers to CPU memory over PCIe (slow at ~64 GB/s).
On a GB200 Superchip (~360 GB usable HBM3e + ~480 GB LPDDR5X): the full 140 GB model fits easily into one GPU’s HBM, leaving the other GPU’s HBM and all CPU memory available for KV cache, batching, or additional model instances. No model parallelism is required, and no aggressive quantization is forced by memory limits. The GB200 Superchip provides roughly 10× the usable memory of a single H100, in a single unit.
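The fit check is simple arithmetic: bytes per parameter times parameter count against each device's capacity. The sketch below uses the capacities quoted in this article.

```python
# Weight footprint of a 70B-parameter model at several precisions,
# checked against the HBM capacities quoted in this article.

PARAMS = 70e9
BYTES_PER_PARAM = {"FP16": 2, "FP8": 1, "FP4": 0.5}

H100_HBM_GB = 80    # Hopper H100 HBM3
B200_USABLE_GB = 180  # one Blackwell GPU's usable HBM3e

for prec, b in BYTES_PER_PARAM.items():
    gb = PARAMS * b / 1e9
    print(f"{prec}: {gb:.0f} GB  fits H100={gb <= H100_HBM_GB}  "
          f"fits B200={gb <= B200_USABLE_GB}")
```

At FP16 the 140 GB footprint overflows an H100 but fits a single Blackwell GPU with 40 GB to spare, which is exactly the deployment simplification the example describes.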
This is the practical effect of Blackwell’s memory architecture: models that previously required multi-GPU setups can now run on a single superchip, simplifying deployment and reducing inter-GPU communication overhead.
The takeaway
Memory has always been the limiting factor in AI hardware. Blackwell changes this equation.
By combining dual-die GPUs, the HBM3e with 8 TB/s bandwidth, and unified CPU-GPU memory via NVLink C2C, NVIDIA has delivered a system where a single superchip offers nearly 10× the usable memory of its predecessor. Models that once demanded complex multi-GPU orchestrations now fit on a single unit.
For AI engineers, this means less time working around memory bottlenecks and more time building better models. The architecture isn’t just fast; it’s fundamentally easier to work with.
As models continue to grow, Blackwell’s memory-first design philosophy points to where GPU architecture is headed: tight integration, unified memory pools, and specialized hardware for the workloads that matter most.