Go Concurrency
Deep Dive into the Go Scheduler G-M-P Model
Learn how the Go runtime orchestrates goroutines across OS threads using work-stealing and logical processors.
The Evolution of Concurrency and the Threading Problem
In traditional server environments, every incoming request often maps directly to an operating system thread. While this model is straightforward, it presents significant scalability issues as the number of concurrent connections grows into the thousands. Operating system threads are heavy resources that require a fixed amount of memory for their stack and involve expensive context switches managed by the kernel.
The primary challenge with OS threads is the overhead of saving and restoring register state during a context switch. This process consumes CPU cycles that could otherwise be spent on application logic. Additionally, the default stack size for a thread is large by goroutine standards, commonly one to eight megabytes depending on the operating system, which quickly exhausts memory on machines handling high volumes of concurrent traffic.
Go solves this problem by introducing goroutines, which are lightweight user-space threads managed by the Go runtime. Instead of relying on the operating system to schedule these units of work, the Go runtime uses its own internal scheduler. This allows a single OS thread to multiplex thousands of goroutines with minimal overhead.
Concurrency is not about doing many things at once, but rather about dealing with many things at once through a better structural design.
The Context Switching Tax
When the kernel switches between two OS threads, it must transition from user mode to kernel mode. This transition is relatively slow and flushes various CPU caches, leading to a performance penalty. Go minimizes this by keeping the OS threads active and switching goroutines entirely within user-space.
Because goroutines start with a very small stack size of only two kilobytes, the runtime can host millions of them in the same memory footprint required for a few thousand OS threads. The stacks grow and shrink dynamically as needed, ensuring that memory usage remains proportional to the actual work being performed.
Deconstructing the GMP Model
The core of the Go scheduler is defined by three main entities represented by the letters G, M, and P. Understanding the relationship between these three components is essential for mastering how Go handles high-performance tasks. This architectural pattern allows the runtime to distribute work efficiently across all available CPU cores.
The G represents a single goroutine, which contains the stack, the instruction pointer, and other metadata necessary for execution. It is the smallest unit of work in the Go ecosystem. Goroutines are essentially state machines that can be in states like runnable, running, or waiting.
The M represents a Machine, an actual operating system thread created and scheduled by the kernel. An M is responsible for executing code and must be associated with a logical processor to run Go code. If an M becomes blocked in a system call, the scheduler can spawn or retrieve another M to keep the remaining processors busy.
The P represents a Processor, which is a logical resource that acts as a context for scheduling. You can think of P as a permit that allows an OS thread to execute Go code. The number of P instances is determined by the GOMAXPROCS setting, which typically defaults to the number of logical CPU cores on the host machine.
- G (Goroutine): The unit of execution containing the function call and stack state.
- M (Machine): The actual OS thread that performs the physical computation.
- P (Processor): A logical resource that holds the local run queue and manages scheduling contexts.
The Role of the Logical Processor
The Processor is the most critical piece of the puzzle for maintaining high throughput. Each P maintains a local run queue of goroutines that are ready to be executed. By having a local queue for each processor, the scheduler reduces contention compared to a single global lock for all threads.
When an M wants to run a goroutine, it must first acquire a P from the idle pool. Once the M has a P, it can start popping goroutines from the local run queue of that P. This localization of data improves cache hits and significantly speeds up the scheduling process.
Managing the Global Run Queue
While local run queues handle the majority of the work, the scheduler also maintains a global run queue. This queue holds goroutines that were created without a specific processor assignment or those moved during specific load-balancing events. The scheduler periodically checks this global queue to ensure that goroutines do not starve while waiting for a processor.
To prevent the global queue from being ignored, the scheduler checks it once every sixty-one ticks. This specific prime number is used to avoid synchronization patterns that might align with other periodic tasks in the runtime. This ensures that even in highly busy systems, goroutines in the global queue eventually get their turn to run.
Strategies for Efficient Work Distribution
A static distribution of work is rarely efficient because tasks have varying completion times. Some goroutines might perform heavy calculations, while others might simply wait for a small network response. To solve this imbalance, Go implements a sophisticated work-stealing algorithm.
Work-stealing allows a processor that has exhausted its local run queue to take work from another processor. This mechanism ensures that all available CPU cores remain productive as long as there is work to be done in the system. It effectively balances the load without requiring a central coordinator that could become a bottleneck.
```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func processTask(id int, wg *sync.WaitGroup) {
	defer wg.Done()
	// Simulate a mix of compute and I/O work
	fmt.Printf("Starting task %d\n", id)
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("Completed task %d\n", id)
}

func main() {
	var wg sync.WaitGroup
	numTasks := 100

	wg.Add(numTasks)
	for i := 0; i < numTasks; i++ {
		// Each go statement creates a G and enqueues it on the current P's local run queue
		go processTask(i, &wg)
	}

	wg.Wait()
	fmt.Println("All tasks processed successfully")
}
```

The Work-Stealing Algorithm
When a logical processor P finds its local run queue empty, it first checks the global run queue for available work. If the global queue is also empty, the processor enters work-stealing mode. It randomly selects another processor from the pool and attempts to steal half of the goroutines from that processor's local queue.
This randomized approach is highly effective at distributing load across the entire machine. By stealing half of the work, the thief ensures it has enough tasks to stay busy for a while, reducing the frequency of future stealing attempts. This strategy minimizes the overhead of searching for work and keeps CPU utilization high.
Cooperative Preemption
The Go scheduler is primarily cooperative, meaning a goroutine is expected to yield control at specific points, such as function calls or I/O operations. However, to prevent a single tight loop from monopolizing a processor, Go also uses asynchronous preemption. The runtime can send a signal to a thread to stop the current goroutine and let others run.
Modern Go versions use a stack-based preemption mechanism that inserts checks at function entry points. If a goroutine has been running for too long, the scheduler sets a flag that triggers a yield the next time a function is called. This ensures that even compute-bound tasks do not block the progress of other concurrent operations.
Handling Blocking Operations and System Calls
In many languages, a blocking system call like reading from a file or a socket stops the entire thread. Since Go maps many goroutines to a few threads, a single blocking call could potentially freeze an entire logical processor. The Go runtime implements two distinct strategies to handle this: the Netpoller and system call handoffs.
The Netpoller is a specialized part of the runtime that uses non-blocking I/O primitives provided by the operating system, such as epoll on Linux or kqueue on macOS. When a goroutine performs a network operation, it is moved to the Netpoller and put into a waiting state. This frees up the logical processor to run other goroutines while the network response is pending.
```go
package main

import (
	"io"
	"log"
	"net"
)

func handleConnection(conn net.Conn) {
	defer conn.Close()
	// This call looks blocking, but the Go runtime uses the Netpoller
	// to park the goroutine and free the M and P for other work.
	buf := make([]byte, 1024)
	for {
		n, err := conn.Read(buf)
		if err != nil {
			if err != io.EOF {
				log.Printf("read error: %v", err)
			}
			return
		}
		conn.Write(buf[:n])
	}
}

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("accept error: %v", err)
			continue
		}
		go handleConnection(conn)
	}
}
```

System Call Handoff
For operations that cannot be made non-blocking, such as certain disk I/O tasks, the scheduler uses a handoff mechanism. When a goroutine enters a blocking system call, the M (thread) is blocked. The scheduler detects this and detaches the P (processor) from that M, allowing the P to find or create a new M to continue running other goroutines.
Once the blocking system call eventually completes, the original M tries to re-acquire a P to continue executing its goroutine. If no processors are available, it places the goroutine on the global run queue and goes into an idle state. This ensures that the total number of running OS threads matches the number of available CPU cores as closely as possible.
Practical Tuning for Production Environments
While the Go scheduler is designed to be self-tuning, developers should understand how to observe and influence its behavior. The most common knob is the GOMAXPROCS setting, exposed both as an environment variable and through the runtime.GOMAXPROCS function. It determines the number of logical processors and, by extension, how many OS threads can execute Go code simultaneously.
Increasing GOMAXPROCS beyond the number of physical cores can sometimes improve performance for applications with many blocking system calls. However, for most compute-heavy tasks, setting it higher than the core count leads to increased context switching and decreased performance. It is generally best to leave it at the default value unless benchmarking proves otherwise.
Monitoring the scheduler is made possible through the execution tracer tool. This tool provides a high-resolution view of goroutine creation, execution, and blocking events. By analyzing a trace, developers can identify bottlenecks where goroutines are waiting too long for a processor or where excessive work-stealing is occurring.
Do not over-optimize the scheduler settings until you have clear profiling data showing that the runtime overhead is your primary bottleneck.
Best Practices for Scalability
To get the most out of the Go scheduler, avoid creating long-running goroutines that perform tight loops without any function calls. If you must perform heavy computation, consider manually yielding control using the Gosched function from the runtime package. This allows the scheduler to give other goroutines a chance to run on that processor.
Another common pitfall is improper use of locks and channels, which can lead to goroutines being parked frequently. Use buffered channels when appropriate to reduce the frequency of sender-receiver synchronization. Minimizing lock contention ensures that goroutines stay in the running state and spend more time doing actual work.
