NVMe & Flash Storage

Scaling Data Centers with NVMe over Fabrics Architectures

Explore how NVMe-oF extends high-performance storage across networks using TCP or RDMA to enable flexible, disaggregated data center storage.

Networking & Hardware · Intermediate · 12 min read

Beyond the Local Bus: The Problem of Storage Silos

In the early days of solid state drives, we treated flash memory as a faster version of a spinning disk. We used the same SATA cables and the same AHCI protocols that were designed for mechanical platters. This approach created a massive performance bottleneck because the software stack could not keep up with the speed of the underlying NAND flash.

The NVMe protocol solved this by utilizing the PCIe bus, allowing for thousands of parallel queues and direct access to the CPU memory. However, this high-performance storage was physically tethered to the server chassis. If a database server ran out of local disk space while a neighboring web server had terabytes of idle flash, that capacity was effectively stranded.

This physical limitation created storage silos where resource utilization was inefficient and scaling required expensive hardware migrations. Engineers were forced to choose between the ultra-low latency of direct-attached storage and the flexibility of traditional network-attached storage. This is the specific gap that NVMe over Fabrics was designed to bridge.

NVMe-oF extends the NVMe protocol across a network fabric, allowing a host to access remote storage as if it were plugged into a local PCIe slot. It preserves the efficiency of the NVMe command set while removing the distance constraints of the motherboard. This enables a truly disaggregated architecture where compute and storage can scale independently of one another.

The primary goal of NVMe-oF is to keep remote access within roughly ten microseconds of local NVMe latency while providing the agility of a shared storage pool across the entire data center.

The Legacy Bottleneck

Before NVMe-oF, network storage relied heavily on the iSCSI protocol, which encapsulates SCSI commands inside TCP packets. SCSI was built in an era of high-latency spinning disks where sequential access was king and parallelism was non-existent. The overhead of the iSCSI stack often became the primary source of latency, regardless of how fast the underlying SSDs were.

NVMe-oF replaces these legacy SCSI commands with native NVMe submission and completion queues. By reducing the number of CPU instructions required to process an I/O request, the protocol significantly lowers the latency floor. This allows modern data centers to achieve millions of operations per second across a standard network switch.

The Transport Layer: Choosing Between RDMA and TCP

NVMe-oF is transport-agnostic, meaning it can run over different types of network fabrics depending on the performance requirements and existing infrastructure. The two most common choices for developers and infrastructure engineers are Remote Direct Memory Access and Transmission Control Protocol. Each comes with significant trade-offs regarding cost, complexity, and raw performance.

RDMA allows data to be transferred from the memory of one computer to another without involving either computer's operating system. This zero-copy mechanism bypasses the standard kernel networking stack, which dramatically reduces CPU utilization and latency. Popular implementations include InfiniBand and RDMA over Converged Ethernet, commonly known as RoCE.

While RDMA offers the highest possible performance, it often requires specialized network interface cards and switches that support Priority Flow Control. Without a lossless network, RoCE performance can degrade rapidly when congestion occurs. This hardware requirement can be a significant barrier for teams operating in standard cloud environments or smaller on-premises labs.

NVMe over TCP has emerged as a compelling alternative that works on existing standard Ethernet hardware. It encapsulates NVMe commands in standard TCP segments, making it compatible with every modern data center switch and network interface. Although it introduces more CPU overhead than RDMA, it is much easier to deploy and manage at scale.

  • RDMA (RoCE/iWARP): Lowest latency and lowest CPU overhead, but requires specialized, lossless network hardware.
  • TCP: High compatibility and lower cost, but requires more CPU cycles to process the network stack.
  • Fibre Channel: Extremely reliable and optimized for storage, but increasingly viewed as a niche legacy technology compared to Ethernet-based options.
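In practice, switching between the Ethernet-based transports amounts to loading a different kernel module and changing the transport flag passed to nvme-cli. The sketch below illustrates this, with a placeholder address and subsystem name that you would replace with your own:

```bash
# NVMe over TCP: works on any standard Ethernet NIC
modprobe nvme-tcp
nvme connect -t tcp -a 192.168.1.100 -s 4420 -n shared-storage

# NVMe over RDMA (RoCE or iWARP): requires RDMA-capable NICs
modprobe nvme-rdma
nvme connect -t rdma -a 192.168.1.100 -s 4420 -n shared-storage
```

Fibre Channel uses its own addressing scheme (WWNN/WWPN rather than IP) and requires FC host bus adapters, so it is not a drop-in swap like the two commands above.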

The Power of Zero-Copy

The magic of RDMA lies in its ability to place data directly into the application buffer on the remote host. In a traditional TCP transfer, the CPU must copy data from the network card buffer into kernel space and then again into user space. These memory copies consume cycles and increase the time it takes for a database to receive a stored record.

By using an RDMA-capable NIC, the hardware handles the packet assembly and memory placement directly. This allows the application to stay in user space while the hardware manages the heavy lifting. For high-frequency trading or massive distributed training workloads, this difference in nanoseconds is a critical competitive advantage.

Configuring an NVMe-oF Target and Initiator

To understand how this works in practice, we can look at a Linux-based implementation using the native kernel modules. In this scenario, one server acts as the target, which shares its physical NVMe drives over the network. The other server acts as the initiator, which discovers and mounts those drives as if they were local devices.

We use the nvmet module on the target side to create a subsystem and export a specific block device. This involves creating a directory structure in the configfs virtual filesystem, which the kernel uses to manage storage exports. Once the subsystem is created, we define an allowed host and a network port to listen for incoming connections.

On the initiator side, we use the nvme-cli utility to connect to the target. The initiator sends a discovery request to the target's IP address to see available subsystems. If authorized, the initiator can then connect, and the operating system will create a new block device entry such as /dev/nvme0n1.

The following script demonstrates the basic steps to set up an NVMe over TCP target on a modern Linux distribution. It assumes you have a spare partition or drive available for export and the necessary kernel modules loaded. This pattern is often automated using configuration management tools like Ansible or integrated into Kubernetes CSI drivers.

Practical Implementation Guide

Setting up the target involves creating a namespace and mapping it to a physical device. This allows you to carve up a single large NVMe drive into multiple virtual targets for different consumers. It is important to ensure that your firewall rules allow traffic on port 4420, which is the default for NVMe over Fabrics.
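How you open that port depends on your distribution's firewall tooling. A minimal sketch, assuming a firewalld-based system (with an iptables equivalent for older setups):

```bash
# Allow inbound NVMe-oF traffic on the default port (4420/tcp)
firewall-cmd --permanent --add-port=4420/tcp
firewall-cmd --reload

# Equivalent rule for iptables-based systems
iptables -A INPUT -p tcp --dport 4420 -j ACCEPT
```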

Linux NVMe-oF TCP Target Setup

```bash
# Load necessary kernel modules for TCP transport
modprobe nvmet
modprobe nvmet-tcp

# Create a new NVMe subsystem named 'shared-storage'
mkdir /sys/kernel/config/nvmet/subsystems/shared-storage
cd /sys/kernel/config/nvmet/subsystems/shared-storage

# Allow any host to connect (for demonstration purposes)
echo 1 > attr_allow_any_host

# Create a namespace and point it to a physical drive
mkdir namespaces/1
echo -n /dev/nvme0n1 > namespaces/1/device_path
echo 1 > namespaces/1/enable

# Create a port and bind it to a network interface
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo 192.168.1.100 > addr_traddr
echo tcp > addr_trtype
echo 4420 > addr_trsvcid
echo ipv4 > addr_adrfam

# Link the subsystem to the port to start listening
ln -s /sys/kernel/config/nvmet/subsystems/shared-storage \
      /sys/kernel/config/nvmet/ports/1/subsystems/shared-storage
```

Once the target is running, the initiator must connect to it using the IP address and transport type. After running the connect command, the remote disk should appear in the output of the lsblk command. You can then format it with XFS or ext4 and use it just like a local SSD.

Linux NVMe-oF Initiator Connection

```bash
# Load the initiator module for TCP
modprobe nvme-tcp

# Discover available subsystems on the target server
nvme discover -t tcp -a 192.168.1.100 -s 4420

# Connect to the specific subsystem identified in the discovery step
nvme connect -t tcp -a 192.168.1.100 -s 4420 -n shared-storage

# Verify the new block device is visible to the system
lsblk | grep nvme
# The output should show a new device, e.g., /dev/nvme1n1
```

Note that in a production environment, you should replace the allow_any_host attribute with specific Host NQNs. This ensures that only authorized servers can access the sensitive data stored on your flash fabric. You should also consider implementing multipathing to ensure storage remains available if a specific network path fails.
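The same configfs layout used in the target setup supports this per-host access control. A sketch, where the host NQN shown is an example placeholder; the real value can be read from /etc/nvme/hostnqn on the initiator:

```bash
# On the target: disable the open-access demo setting
echo 0 > /sys/kernel/config/nvmet/subsystems/shared-storage/attr_allow_any_host

# Register the initiator's host NQN (example placeholder value)
mkdir /sys/kernel/config/nvmet/hosts/nqn.2014-08.org.nvmexpress:uuid:example-host

# Grant that host access to the subsystem
ln -s /sys/kernel/config/nvmet/hosts/nqn.2014-08.org.nvmexpress:uuid:example-host \
      /sys/kernel/config/nvmet/subsystems/shared-storage/allowed_hosts/nqn.2014-08.org.nvmexpress:uuid:example-host
```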

Managing Performance and Latency in Disaggregated Environments

While NVMe-oF is incredibly fast, it is not immune to the laws of physics and networking. The introduction of a network switch and cables adds a layer of latency that does not exist in a local PCIe connection. In a well-optimized environment, this overhead is typically in the range of 10 to 100 microseconds.

One major factor that influences performance is the number of I/O queues and the queue depth. NVMe was built to support up to 65,535 I/O queues, each holding up to 65,536 commands, allowing it to leverage the many cores of modern CPUs. When using NVMe-oF, ensure that your initiator is configured with multiple queues, ideally matching the core count of your application server.
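nvme-cli exposes these settings at connect time. A sketch assuming a 16-core application server (the address, subsystem name, and counts are examples to adapt):

```bash
# Request one I/O queue per CPU core, each 128 entries deep
nvme connect -t tcp -a 192.168.1.100 -s 4420 -n shared-storage \
     --nr-io-queues=16 --queue-size=128
```

The kernel may allocate fewer queues than requested if the target advertises a lower limit, so it is worth checking the kernel log after connecting.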

Network congestion is another common pitfall that can cause latency spikes, also known as the tail latency problem. Even if the average latency is low, a few slow requests can hang an entire application thread. This is especially problematic in multi-tenant environments where a noisy neighbor might saturate the network bandwidth.

To mitigate these issues, engineers often implement Quality of Service rules on their switches. By prioritizing storage traffic over general background traffic, you can ensure that database writes and reads remain consistent. Monitoring tools should track the p99 latency to identify these intermittent bottlenecks before they impact users.
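A benchmarking tool such as fio can surface tail latency directly, since its output includes a completion-latency percentile table. A sketch of a read-only test against a fabric-attached device (the device name is an example; point it at your own remote namespace):

```bash
# 4k random reads over the fabric; the "clat percentiles" section
# of the report includes the 99th percentile (p99)
fio --name=p99-check --filename=/dev/nvme1n1 --direct=1 \
    --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
    --runtime=60 --time_based --group_reporting
```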

The Role of Multipathing

In a disaggregated storage architecture, the network becomes a single point of failure. If the cable connecting your compute node to the storage fabric breaks, your application loses its disk. NVMe Multipathing solves this by creating redundant paths between the host and the storage target.

The Linux kernel includes native support for NVMe multipathing, which can automatically switch paths if one becomes unavailable. Unlike older technologies that required complex third-party drivers, NVMe multipathing is lightweight and built directly into the protocol. It handles path discovery, failover, and load balancing across all available network links.
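Enabling a second path is a matter of connecting to the same subsystem over another network; the kernel merges both connections under one block device. A sketch, assuming a second target address of 192.168.2.100 as the redundant path:

```bash
# Confirm native NVMe multipathing is enabled in the kernel
cat /sys/module/nvme_core/parameters/multipath   # prints Y when active

# Connect to the same subsystem over a second network path
nvme connect -t tcp -a 192.168.2.100 -s 4420 -n shared-storage

# List controllers grouped by subsystem; both paths appear together
nvme list-subsys

# Switch the I/O policy from NUMA-based failover to round-robin
echo round-robin > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy
```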
