
GPU Architecture

Inside the Streaming Multiprocessor: How GPU Cores Execute Warps

Break down the internal components of a Streaming Multiprocessor (SM), including CUDA cores, warp schedulers, and register files, to see how instructions execute at scale.

Networking & Hardware · Intermediate · 12 min read

The Architecture of Throughput: Why SMs Exist

Traditional CPUs are designed to minimize latency for a single sequence of instructions. They use large caches and complex branch prediction logic to ensure that a single thread runs as fast as possible. This approach is ideal for general-purpose computing where tasks are often serial and unpredictable.

In contrast, GPUs are designed to maximize throughput by processing thousands of threads simultaneously. Instead of focusing on how fast one task can finish, the GPU focuses on how many tasks can finish in a given second. This shift in philosophy requires a completely different internal structure.

The fundamental building block of this throughput-oriented design is the Streaming Multiprocessor, or SM. While a CPU might have a few dozen powerful cores, a modern GPU contains anywhere from dozens to over a hundred SMs, each capable of managing hundreds of active threads. Understanding the SM is the key to writing performant GPU kernels.

A CPU is like a Ferrari that can deliver a small package across town very quickly. A GPU is like a massive cargo ship that moves thousands of containers at once; it takes longer to start, but the total volume of data moved is vastly higher.

The SM achieves this efficiency by sharing resources among many threads. By grouping execution units and instruction fetch hardware, the GPU reduces the overhead that would otherwise be required to manage each thread individually. This is why we refer to the GPU as a massively parallel processor.

The Trade-off Between Complexity and Parallelism

To fit thousands of cores onto a single chip, engineers must make hardware trade-offs. GPU cores are significantly simpler than CPU cores, lacking the sophisticated out-of-order execution logic found in processors like the Intel Core or AMD Ryzen series. This simplicity allows the GPU to dedicate more silicon area to arithmetic logic units.

Because these cores are simpler, they rely on the programmer and the compiler to provide massive amounts of parallelism. If a kernel does not have enough independent work to perform, the SM will sit idle, wasting its potential. This is why understanding the inner workings of the SM is not just theoretical; it is a requirement for optimization.

The Anatomy of a Streaming Multiprocessor

Each Streaming Multiprocessor is a self-contained processing unit with its own set of resources. Inside an SM, you will find several key components: CUDA cores, Special Function Units, warp schedulers, and a large register file. These components work in concert to execute instructions across a block of threads.

The CUDA cores, also known as Stream Processors, are the primary units for integer and floating-point arithmetic. Most modern architectures further divide these into specialized pipelines for FP32, FP64, and INT32 operations. This specialization ensures that different types of math can be processed with maximum efficiency.

  • CUDA Cores: The primary arithmetic units for standard math operations.
  • Special Function Units (SFU): Hardware dedicated to complex math like sines, cosines, and square roots.
  • Load/Store Units: Responsible for moving data between the SM and the various memory levels.
  • Register File: Extremely fast on-chip storage used to hold the local variables of active threads.

Beyond math units, the SM includes Load/Store units that handle memory requests. These units are critical because moving data is often the primary bottleneck in high-performance computing. Efficiently utilizing these units requires understanding how memory is laid out and accessed by the threads.

The Role of the Register File

The register file is the largest and fastest memory structure inside the SM. Unlike a CPU, where registers are a scarce resource, a GPU SM provides tens of thousands of registers. This massive capacity allows the SM to keep the state of thousands of threads locally available without needing to swap data to slower memory.

When a thread is scheduled, its local variables are stored directly in these registers. Because the state of every thread is already on the chip, the SM can switch between different groups of threads almost instantly. This zero-overhead context switching is the secret to how GPUs hide memory latency.

Warp Scheduling and SIMT Execution

GPUs do not actually manage threads one by one. Instead, they group threads into units of 32, known as a warp. The warp is the smallest unit of scheduling and execution in the NVIDIA architecture, and all threads in a warp execute the same instruction at the same time.

This execution model is known as Single Instruction, Multiple Threads, or SIMT. It is similar to the SIMD instructions found in CPUs but is more flexible for the programmer. In SIMT, each thread conceptually has its own program counter and register state, even though the hardware executes the threads of a warp in lock-step.

Understanding Warp Execution in a Kernel

```cpp
// A simple vector addition kernel demonstrating how threads map to indices
__global__ void vectorAdd(const float* A, const float* B, float* C, int numElements) {
    // Calculate the global thread index
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Check boundaries to ensure we do not access memory out of bounds
    if (i < numElements) {
        // All 32 threads in a warp execute this line simultaneously,
        // but each uses a unique index 'i'
        C[i] = A[i] + B[i];
    }
}
```

The Warp Scheduler is the component within the SM responsible for picking which warp to execute next. If a warp is waiting for a memory request to finish, the scheduler will simply skip it and pick another warp that is ready to run. This allows the GPU to stay busy even when data is being fetched from high-latency global memory.

The Cost of Branch Divergence

Because all threads in a warp must execute the same instruction, conditional logic can cause performance issues. If half of the threads in a warp take an 'if' branch and the other half take an 'else' branch, the hardware must execute both paths sequentially. This phenomenon is known as branch divergence.

During divergence, the threads not on the current path are masked out, meaning they do nothing while the other threads work. This effectively halves the throughput of the SM for the duration of the branch. Developers should try to organize data so that threads within the same warp follow the same execution path as much as possible.

Memory Hierarchy and Resource Constraints

In addition to registers, the SM contains a dedicated area for Shared Memory and L1 Cache. Shared Memory is a programmable cache that allows threads within the same block to communicate and share data at very high speeds. It is significantly faster than global memory but much smaller in size.

Using shared memory effectively is one of the most common ways to optimize a GPU kernel. By loading a piece of data from global memory once into shared memory, multiple threads can access it repeatedly without incurring the high latency of the main memory bus.

Optimizing with Shared Memory

```cpp
// Example of using shared memory to reduce global memory pressure.
// For simplicity, this assumes the block has exactly 256 threads and
// that 'data' holds at least 256 elements.
__global__ void sharedMemExample(float* data) {
    // Declare a shared array accessible by all threads in the block
    __shared__ float temp[256];

    int tid = threadIdx.x;

    // Cooperatively load data from global memory to shared memory
    temp[tid] = data[tid];

    // Ensure all loads are finished before proceeding
    __syncthreads();

    // Now perform calculations using the fast shared memory
    float val = temp[tid] * 2.0f;
    data[tid] = val;
}
```

Resource allocation within the SM is a balancing act. Every thread requires a certain number of registers and a certain amount of shared memory. If a kernel uses too many registers per thread, the SM can host fewer active warps, which reduces the ability of the GPU to hide latency.

Occupancy and Latency Hiding

Occupancy is a metric representing the ratio of active warps to the maximum possible warps an SM can support. High occupancy is generally desirable because it gives the warp scheduler more options to choose from when one warp is stalled. However, high occupancy is not an end in itself; sometimes a kernel with lower occupancy but more independent work per thread (better instruction-level parallelism) is faster.

The primary goal of a developer is to provide enough work to hide the latency of memory instructions. Global memory access can take hundreds of clock cycles. Without enough active warps to switch to, the SM will have no work to do, leading to poor performance and low hardware utilization.

Practical Implementation: Scaling Instructions

When writing kernels, developers must think about how the SM will schedule their code. Large kernels with many variables can cause register spilling, where the compiler is forced to store local variables in slow global memory because the register file is full. This usually results in a massive performance penalty.

To avoid this, keep kernels focused and modular. By minimizing the number of registers each thread needs, you allow more threads to reside on the SM simultaneously. This increases the parallelism and allows the hardware to manage execution more effectively.

Finally, always consider the memory access patterns of your threads. The Load/Store units are most efficient when threads in a warp access contiguous memory addresses. This is called memory coalescing, and it allows the hardware to combine multiple memory requests into a single transaction, saving bandwidth and cycles.

Analyzing Performance with Profilers

Tools like NVIDIA Nsight Systems and Nsight Compute are essential for understanding how your code interacts with the SM. These tools can show you exactly how many registers your kernel is using and whether you are limited by math throughput or memory bandwidth. They provide a direct look into the execution of warps and the efficiency of the schedulers.

By identifying bottlenecks such as shared memory bank conflicts or high branch divergence, you can iterate on your design. GPU programming is an iterative process of balancing compute intensity against memory access efficiency to reach the maximum theoretical throughput of the hardware.
