Experimenting with NVMe over TCP
What is NVMe over TCP
Like iSCSI, NVMe over TCP is a standard for network-attached block storage. It provides block-level access to NVMe drives over TCP/IP, with much higher performance than iSCSI.
iSCSI was originally designed for SCSI drives, i.e. slow spinning disks. It's an old protocol that does not suit the needs of today's SSDs.
The Linux kernel has native support for NVMe over TCP.
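To check that the required modules are actually available, you can query them with modinfo. This is just a quick sanity check; if NVMe over TCP support is compiled into your kernel instead of being built as modules, modinfo may not find a separate module file:
modinfo nvme_tcp | head -n 3
modinfo nvmet_tcp | head -n 3
The nvme_tcp module is used on the initiator (client) side and nvmet_tcp on the target side, as described in the configuration steps below.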
My setup
Target:
Motherboard: Supermicro H12SSL-NT
CPU: AMD EPYC 7282
RAM: plenty
Storage: 4 x 7.68TB Kioxia CD6 U.3 NVMe SSD, ZFS RAIDz1
Network: Onboard Broadcom 10GbE
OS: Debian 12 (bookworm)
Initiator:
Motherboard: a good one
CPU: AMD Ryzen 9 5950X
RAM: plenty
Network: TP-link AQC107 10GbE
OS: Arch Linux, kernel 6.3.y
They are connected to the switch ports of my home router (OpenWRT x86_64 on Ryzen 3 3100 with Intel X710 10GbE, promiscuous mode on).
MTU is set to 9000.
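For reference, the MTU can be set with iproute2 as below. The interface name enp5s0 is only a placeholder for your actual 10GbE interface (check ip link), and the change does not persist across reboots unless it is also added to the distro's network configuration:
ip link set dev enp5s0 mtu 9000
ip link show dev enp5s0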
Configure target
- Install the nvme-cli package from the distro's package manager, if not already installed.
- Load the nvmet-tcp kernel module if it is not loaded at boot:
modprobe nvmet-tcp
Alternatively, let the kernel load the nvmet-tcp module at boot:
echo "nvmet_tcp" > /etc/modules-load.d/nvmet_tcp.conf
- Create and configure an NVMe target subsystem, let’s call it “mysub”:
mkdir /sys/kernel/config/nvmet/subsystems/mysub
cd /sys/kernel/config/nvmet/subsystems/mysub
echo 1 > attr_allow_any_host
If the kernel module nvmet-tcp is loaded, the directory /sys/kernel/config/nvmet should exist.
If run with sudo, echo 1 > attr_allow_any_host might fail, because the redirection is performed by your unprivileged shell rather than by the elevated command. In this case, switch to root with su root, or try echo 1 | sudo tee -a attr_allow_any_host > /dev/null instead.
- Before creating a namespace for the target, use lsblk or nvme list to find out the name of the NVMe device to be attached to the target (e.g. /dev/nvme0n1).
- Next, create and configure namespace 1:
mkdir /sys/kernel/config/nvmet/subsystems/mysub/namespaces/1
cd /sys/kernel/config/nvmet/subsystems/mysub/namespaces/1
echo -n /dev/nvme0n1 > device_path
echo 1 > enable
- Configure port 1 for an initiator to connect to:
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo "ipv4" > addr_adrfam
echo "tcp" > addr_trtype
echo 4420 > addr_trsvcid
echo 192.168.0.231 > addr_traddr
Here, our target has a static IPv4 address, 192.168.0.231. 4420 is the standard port number for NVMe-oF connections.
- Create a symbolic link from the mysub subsystem to port 1:
ln -s /sys/kernel/config/nvmet/subsystems/mysub/ /sys/kernel/config/nvmet/ports/1/subsystems/mysub
Using dmesg | grep nvmet_tcp, we can see in the kernel log that the port has been enabled. The whole target-side setup is collected into a script sketch below.
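Here is a rough sketch that gathers the target-side steps above into a single script (run as root). The variable values are placeholders mirroring my setup; adjust them before use:
#!/bin/bash
# Sketch: configure an NVMe over TCP target via configfs.
set -e
SUBSYS=mysub                 # subsystem name (used as the subsystem NQN)
NSID=1                       # namespace ID
DEVICE=/dev/nvme0n1          # backing block device
ADDR=192.168.0.231           # target's IP address
PORT=4420                    # NVMe-oF TCP port

modprobe nvmet-tcp

# Create the subsystem and allow any host to connect
mkdir -p /sys/kernel/config/nvmet/subsystems/$SUBSYS
echo 1 > /sys/kernel/config/nvmet/subsystems/$SUBSYS/attr_allow_any_host

# Create the namespace and attach the backing device
mkdir -p /sys/kernel/config/nvmet/subsystems/$SUBSYS/namespaces/$NSID
echo -n $DEVICE > /sys/kernel/config/nvmet/subsystems/$SUBSYS/namespaces/$NSID/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/$SUBSYS/namespaces/$NSID/enable

# Create the TCP port
mkdir -p /sys/kernel/config/nvmet/ports/1
echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
echo tcp > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo $PORT > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
echo $ADDR > /sys/kernel/config/nvmet/ports/1/addr_traddr

# Expose the subsystem on the port
ln -s /sys/kernel/config/nvmet/subsystems/$SUBSYS /sys/kernel/config/nvmet/ports/1/subsystems/$SUBSYS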
Configure initiator
- Install the nvme-cli package from the distro's package manager, if not already installed.
- Load the nvme-tcp kernel module (NOT nvmet-tcp) if it is not loaded at boot:
modprobe nvme-tcp
Alternatively, let the kernel load the nvme-tcp module at boot:
echo "nvme_tcp" > /etc/modules-load.d/nvme_tcp.conf
- Discover the remote target. This step is optional:
nvme discover -t tcp -a 192.168.0.231 -s 4420
- Connect the NVMe device:
nvme connect -t tcp -a 192.168.0.231 -s 4420 -n mysub
In addition to the subsystem NQN argument -n mysub, you can pass the initiator's host NQN, which is stored in /etc/nvme/hostnqn on the initiator, via the --hostnqn option. The command then looks like this:
nvme connect -t tcp -a 192.168.0.231 -s 4420 -n mysub --hostnqn=nqn.2014-08.org.nvmexpress:uuid:1b4e28ba-2fa1-11d2-883f-0016d3ccabcd
- Check nvme list or lsblk for the attached device; a short format-and-mount sketch follows this list. To disconnect, first unmount the mounted NVMe device, then run nvme disconnect with the subsystem NQN:
nvme disconnect -n mysub
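Once connected, the remote namespace behaves like an ordinary local block device. A minimal format-and-mount sketch, assuming the device showed up as /dev/nvme2n1 and /mnt is the intended mount point (both are assumptions; check nvme list for the actual device name):
mkfs.ext4 /dev/nvme2n1    # destroys any existing data on the namespace
mount /dev/nvme2n1 /mnt
df -h /mnt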
Performance
Definitely faster than NFS, Samba, or iSCSI.
After formatting the zvol to ext4 and mounting it at /mnt, I used fio-3.34 to test the I/O performance. Here is the test profile I used:
[global]
bs=4k
ioengine=libaio
iodepth=32
size=1g
numjobs=16
direct=1
refill_buffers=1
runtime=60
directory=/mnt
group_reporting=1
filename=ssd.test.file
[rand-read]
rw=randread
stonewall
[rand-write]
rw=randwrite
stonewall
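To run the profile, save it to a file and pass it to fio. The file name nvme-tcp.fio is my own choice, and the directory=/mnt line in the profile assumes the NVMe-oF device is mounted at /mnt:
fio nvme-tcp.fio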
I got about 650MB/s sequential read/write and about 90K IOPS random read/write, only slightly better than a typical SATA SSD. The numbers were not great, probably because of the suboptimal zpool setup: mounting the zvol directly on the target, I only got about 1.5GB/s sequential read. Also, the Ethernet adapter on the initiator's side, the AQC107, is known to be not very stable and to overheat at times (and it has no RDMA support). It would be better if I had a 25GbE or 40GbE network adapter.