GPU Architecture
Mastering the Memory Pyramid: Bandwidth, VRAM, and On-Chip Cache
Examine the GPU memory hierarchy from HBM to L1/L2 caches and learn how developers manage data movement to prevent compute units from stalling.
The Architecture of Throughput: Why Memory is the Bottleneck
Modern software engineering often abstracts hardware away to allow developers to focus on logic and flow. However, in the realm of high-performance computing and AI, the physical constraints of data movement become the primary driver of application speed. While a CPU is designed to minimize the time it takes to finish a single task, a GPU is built to maximize the total work done per second.
This shift from latency-oriented to throughput-oriented design creates a significant challenge for the memory subsystem. Because a GPU might have thousands of cores running simultaneously, the demand for data is orders of magnitude higher than what a typical system bus can provide. If the memory cannot keep up with the compute units, those units will sit idle, wasting expensive silicon cycles.
To solve this problem, GPU architects employ a deeply tiered memory hierarchy that prioritizes bandwidth over simple access speed. Understanding this hierarchy allows developers to structure their algorithms so that data stays as close to the compute units as possible for as long as possible. This approach is known as maximizing data locality, and it is the secret to high-performance GPU kernels.
In the world of massively parallel computing, your performance is not limited by how fast you can calculate, but by how fast you can move data to the units that do the calculating.
We often refer to this disparity between compute capability and memory speed as the memory wall. While the number of floating-point operations per second has grown exponentially, memory bandwidth has struggled to keep pace. Consequently, most modern deep learning models and simulation engines are memory-bound rather than compute-bound.
Latency vs Throughput Mental Models
Think of a CPU as a high-speed sports car that can deliver a small package across town very quickly. If you need a single answer fast, the sports car is your best choice because it minimizes the wait time for that specific delivery. This is why CPUs have large, complex caches and sophisticated branch prediction to keep the single thread of execution moving forward without pauses.
A GPU is more like a massive fleet of cargo trucks moving millions of gallons of water. No individual truck is particularly fast, and it might take a long time for the first truck to arrive at its destination. However, once the fleet is in motion, the total volume of water delivered per hour is staggering compared to the sports car.
Mapping the Hierarchy: From HBM to Registers
The GPU memory hierarchy is a pyramid where the largest, slowest storage sits at the bottom and the smallest, fastest storage sits at the top. At the base is Global Memory, often implemented as High Bandwidth Memory or GDDR6. This is the main pool of VRAM where your large datasets, model weights, and textures reside during execution.
Moving up the pyramid, we encounter the L2 and L1 caches, which serve as intermediate buffers to reduce the frequency of trips to Global Memory. Unlike CPU caches, which are managed almost entirely by hardware, part of the GPU's L1 level is exposed to the developer as explicitly managed shared memory. This explicit control allows for fine-tuned optimizations that are impossible on standard general-purpose processors.
At the very peak of the hierarchy are the Registers, which are local to individual threads. Registers provide nearly instantaneous access but are extremely limited in quantity. If a kernel uses too many registers per thread, the GPU may be forced to reduce the number of active threads, leading to a drop in overall parallel efficiency.
- Global Memory: 16GB-80GB capacity, 1-2 TB/s bandwidth, 400-800 cycle latency.
- L2 Cache: 4MB-96MB capacity, 3-5 TB/s bandwidth, 100-200 cycle latency.
- Shared Memory / L1: 100KB-256KB per SM, 10-20 TB/s bandwidth, 20-30 cycle latency.
- Registers: on the order of 256KB per SM on recent NVIDIA GPUs (tens of MBs chip-wide), the highest bandwidth on the device, single-cycle latency.
Managing these layers requires a developer to be conscious of the occupancy of the device. If you load too much data into high-speed memory, you limit how many threads can run at once. Finding the balance between data proximity and thread count is the core challenge of GPU programming.
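To see this balance concretely, the CUDA runtime can report how many blocks of a kernel fit on an SM given its register and shared-memory footprint. The sketch below queries occupancy for a hypothetical kernel; the block size and dynamic shared-memory figure are illustrative assumptions, not recommendations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose resource usage we want to inspect.
__global__ void my_kernel(float* data) { /* ... */ }

int main() {
    int block_size = 256;            // threads per block (assumed)
    size_t dynamic_smem = 8 * 1024;  // dynamic shared memory per block (assumed)

    // Ask the runtime how many blocks of this kernel can be resident per SM.
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, dynamic_smem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Occupancy = active warps per SM / maximum warps per SM.
    int active_warps = max_blocks_per_sm * block_size / prop.warpSize;
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %.0f%%\n", 100.0 * active_warps / max_warps);
    return 0;
}
```

Increasing the dynamic shared-memory request or the per-thread register count will push the reported block count down, which is exactly the proximity-versus-parallelism trade-off described above.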
Optimizing Data Access: Coalescing and Alignment
To achieve maximum bandwidth, the GPU hardware attempts to combine multiple memory requests from a group of threads into a single transaction. This process is known as memory coalescing, and it requires that threads in a warp access contiguous memory addresses. When threads access consecutive 32-bit words, the hardware can fetch all the data in a single 128-byte burst.
If the access pattern is scattered or strided, the hardware must issue multiple memory transactions to satisfy the needs of a single warp. This results in wasted bandwidth because each transaction carries more data than the thread actually requested. In the worst-case scenario, your effective bandwidth could drop to a fraction of the hardware's theoretical peak.
Consider an array of structures vs a structure of arrays. If you use an array of structures, and each thread only needs one field, the hardware still fetches the entire structure for every thread. Switching to a structure of arrays ensures that all threads are reading consecutive values of the same field, maximizing the efficiency of every memory transaction.
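A minimal sketch of the two layouts, using a hypothetical particle record where each thread only needs the x field:

```cuda
// Array of Structures: each thread touching only .x still drags in y, z, w.
struct Particle { float x, y, z, w; };

__global__ void aos_kernel(const Particle* particles, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Adjacent threads read addresses 16 bytes apart: poor coalescing.
        out[idx] = particles[idx].x * 2.0f;
    }
}

// Structure of Arrays: all x values are contiguous in memory.
struct ParticlesSoA { float *x, *y, *z, *w; };

__global__ void soa_kernel(ParticlesSoA particles, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Adjacent threads read adjacent 4-byte words: fully coalesced.
        out[idx] = particles.x[idx] * 2.0f;
    }
}
```

In the AoS version only a quarter of every fetched cache line is useful; the SoA version makes every byte of the transaction count.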
Implementation: Coalesced vs Uncoalesced Access
The following code demonstrates the difference between an efficient coalesced read and an inefficient strided read. In the efficient version, each thread index directly maps to a sequential memory address. In the inefficient version, a stride causes the hardware to fetch unnecessary data between the requested points.
```cuda
__global__ void efficient_kernel(float* data, float* output, int n) {
    // Coalesced: thread 0 reads data[0], thread 1 reads data[1]
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = data[idx] * 2.0f;
    }
}

__global__ void inefficient_kernel(float* data, float* output, int n, int stride) {
    // Strided: thread 0 reads data[0], thread 1 reads data[stride]
    // This breaks coalescing and wastes bandwidth
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < n) {
        output[idx] = data[idx] * 2.0f;
    }
}
```
Advanced Memory Techniques: Tiling and Async Transfers
Tiling is the most common optimization for algorithms that exhibit high temporal data reuse. By loading a small sub-section of a larger matrix into shared memory, the threads in a block can perform many calculations using that local data. This pattern is essential for large-scale linear algebra operations that form the backbone of neural network training.
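The canonical example of tiling is matrix multiplication. The sketch below stages tiles of both inputs in shared memory so that each loaded value is reused TILE times; it assumes square matrices whose dimension is a multiple of the tile width.

```cuda
#define TILE 16

__global__ void tiled_matmul(const float* A, const float* B, float* C, int n) {
    // Each block computes one TILE x TILE tile of C from shared-memory staging.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // One coalesced global load per thread; each value is reused TILE times.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // all threads done with the tile before it is overwritten
    }
    C[row * n + col] = acc;
}
```

Without tiling, every multiply would trigger two global-memory reads; with it, the kernel's arithmetic intensity grows roughly in proportion to the tile width.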
Modern GPUs have also introduced asynchronous memory copy instructions that allow data to be moved from global memory to shared memory without involving the register file. This bypasses the compute cores entirely, allowing them to continue processing other data while the next batch is being fetched. This overlap of compute and memory transfer is key to achieving peak utilization.
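One way to express such an asynchronous copy on Ampere-class and newer hardware is the cooperative groups memcpy_async API. The following is a sketch under the assumption that n is a multiple of the tile size; the tile size itself is arbitrary.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

#define TILE 256

__global__ void async_copy_kernel(const float* global_in, float* out, int n) {
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        // Copy global -> shared without staging through the register file.
        cg::memcpy_async(block, tile, global_in + base, sizeof(float) * TILE);
        cg::wait(block);  // block until the asynchronous copy has landed

        out[base + threadIdx.x] = tile[threadIdx.x] * 2.0f;
        block.sync();  // ensure all threads are done before the tile is reused
    }
}
```

In production code the copy and the compute would typically be double-buffered across two shared-memory tiles so the wait overlaps with useful work.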
Furthermore, developers should utilize memory streams to overlap data transfers between the CPU host and the GPU device. By using multiple streams, you can upload the next batch of data while the GPU is still busy processing the current batch. This pipelining helps keep the GPU from stalling on the relatively slow PCIe bus.
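A host-side sketch of that double-buffered pipeline is shown below. The kernel, chunk sizes, and grid configuration are placeholders; note also that for the copy to truly overlap with compute, the host buffer must be pinned (allocated with cudaMallocHost rather than malloc).

```cuda
#include <cuda_runtime.h>

__global__ void process(float* data, int n);  // hypothetical kernel

void pipelined_upload(float* host_data, int n_chunks, int chunk_elems) {
    cudaStream_t streams[2];
    float* device_buf[2];
    size_t bytes = chunk_elems * sizeof(float);

    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&device_buf[i], bytes);
    }

    for (int c = 0; c < n_chunks; ++c) {
        int s = c % 2;  // alternate between the two streams
        // The copy for this chunk overlaps the kernel still running on the other stream.
        cudaMemcpyAsync(device_buf[s], host_data + (size_t)c * chunk_elems,
                        bytes, cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk_elems + 255) / 256, 256, 0, streams[s]>>>(
            device_buf[s], chunk_elems);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(streams[i]);
        cudaFree(device_buf[i]);
    }
}
```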
Detecting Stalls and Performance Profiling
Even with perfect logic, your kernel might still perform poorly if it is stalled by memory requests. Profiling tools like NVIDIA Nsight Compute allow you to visualize these stalls through metric analysis. High values for memory throttle or long scoreboard stalls usually indicate that your compute units are waiting for data to arrive from the hierarchy.
One useful metric is Arithmetic Intensity, which is the ratio of floating-point operations to memory bytes transferred. If your intensity is low, you are likely memory-bound, and adding more compute operations might actually be free in terms of time. If it is high, you are compute-bound, and you should focus on optimizing the math itself.
Always remember that hardware is a moving target and optimizations that work on one generation may not be as effective on the next. Continuously measuring the memory throughput against the theoretical maximum of your specific device is the only way to ensure your software remains efficient. Mastering the memory hierarchy is a lifelong skill for any systems-level developer.
