The reason for writing this article is that I read the paper TriCache: A User-Transparent Block Cache Enabling High-Performance Out-of-Core Processing with In-Memory Programs, in which SPDK is applied masterfully and I/O performance is raised to a new level.
SPDK basics
SPDK, the Storage Performance Development Kit, provides a set of tools and libraries for writing high-performance, scalable, user-mode storage applications.
Relevant information:
- Introduction to the Storage Performance Development Kit (SPDK)
- SPDK official website documentation
- SPDK source code on GitHub: spdk/spdk
- SPDK technical articles
There is also a technology called DPDK whose architecture is similar to SPDK's. If you want to study it, I recommend reading the book 《深入浅出DPDK》 (roughly, "DPDK in Depth, Made Simple").
Advantages:
- Move all required drivers to user space, which avoids system calls and enables zero-copy access for applications
- Polling hardware for completion rather than relying on interrupt preemption reduces overall latency and latency variance
- Avoid all locking in the I/O path and instead rely on message passing (a lock-free message queue is used)
Compile and install
git clone https://github.com/spdk/spdk
cd spdk
git submodule update --init # pull submodules
sudo scripts/pkgdep.sh # install dependencies
./configure
make
Run the unit tests; output like the following means they passed:
(base) root@nizai8a-desktop:~/tt/spdk# ./test/unit/unittest.sh
=====================
All unit tests passed
=====================
WARN: lcov not installed or SPDK built without coverage!
WARN: neither valgrind nor ASAN is enabled!
Allocate huge pages and unbind any NVMe devices from the native kernel driver
A common error here indicates that /dev/nvme1n1 is in use: if you want to unbind the device, you must umount it first. The user-space driver uses uio/vfio to map the device's PCI BAR into the current process, which lets the driver perform MMIO directly (uio is used by default).
(base) root@nizai8a-desktop:~/tt/spdk# sudo scripts/setup.sh
0000:81:00.0 (144d a808): Active mountpoints on nvme1n1:nvme1n1, so not binding PCI dev
0000:01:00.0 (15b7 5009): nvme -> uio_pci_generic
Why allocate huge pages?
Principle: see the DPDK huge-page memory internals.
- All huge pages and their page tables are kept in shared memory and are never swapped out to the disk swap partition under memory pressure.
- Since processes share the huge-page mappings, page-table overhead is reduced and less memory is occupied, allowing the system to support more concurrent processes.
- Reduces pressure on the TLB.
- Reduces the cost of page-table walks when resolving memory accesses.
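To make the TLB point concrete, here is some back-of-envelope arithmetic (a standalone sketch, not SPDK code): the number of page-table entries needed to map a region shrinks by a factor of 512 when moving from 4 KiB to 2 MiB pages.

```c
#include <assert.h>
#include <stdint.h>

/* Number of page-table entries needed to map `bytes` of memory with the
 * given page size. Fewer entries means fewer TLB misses and cheaper
 * page-table walks. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_size) {
    return (bytes + page_size - 1) / page_size; /* round up */
}
```

For example, mapping 1 GiB takes 262144 entries with 4 KiB pages but only 512 with 2 MiB pages, so a TLB of a few thousand entries can cover the whole region.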
By default, the script allocates 2048 MB of huge pages. To change this amount, set HUGEMEM:
sudo HUGEMEM=4096 scripts/setup.sh
Check the huge page sizes supported by the system:
(base) root@nizai8a-desktop:/sys/kernel/mm/hugepages# ls
hugepages-1048576kB hugepages-2048kB
SPDK architecture
This section focuses on how SPDK achieves the three advantages listed above.
User mode driver
After the emergence of NVMe, software became the bottleneck in I/O-intensive scenarios. There are two approaches to optimizing this:
- io_uring: a new set of system calls, an optimization of the system-call path
- kernel bypass: the entire I/O operation never traps into the kernel; this is the approach SPDK's storage acceleration takes
Intel defines SPDK as an acceleration library that uses user-mode, asynchronous, polled-mode NVMe drivers to accelerate applications that use NVMe SSDs as back-end storage.
The NVMe protocol is a specification defined for faster communication between solid-state drives and hosts. NVMe-oF extends this local storage performance to the network, supporting InfiniBand or Ethernet fabrics. Among network storage solutions, the main ones today are DAS, NAS, and SAN.
Based on UIO or VFIO, SPDK maps the storage device's address space directly into the application's address space, then uses the NVMe specification to initialize the NVMe SSD and implement basic I/O operations, building a user-mode driver; the whole process therefore never traps into the kernel.
In SPDK's user-mode driver, interrupt handling is replaced by asynchronous polling: the CPU polls continuously, and as soon as a completion is found, the callback is immediately triggered and handed to the upper-level user program, so the user program can keep multiple requests in flight to improve performance.
Another obvious benefit of removing interrupts is that context switches are avoided.
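The polling idea can be illustrated with a self-contained sketch (not SPDK's actual driver code; all names below are made up). A second thread plays the "device" and publishes a completion status; the submitting thread busy-polls for it instead of sleeping until an interrupt, then runs the callback inline on the same thread, so no context switch is involved.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

typedef void (*completion_cb)(int status, void *ctx);

static atomic_int g_completion; /* < 0 means still pending */

/* Simulated device: posts a successful completion. */
static void *device_sim(void *arg) {
    (void)arg;
    atomic_store(&g_completion, 0);
    return NULL;
}

/* Submit, busy-poll for the completion, and invoke the callback inline. */
int submit_and_poll(completion_cb cb, void *ctx) {
    pthread_t dev;
    atomic_store(&g_completion, -1);
    pthread_create(&dev, NULL, device_sim, NULL);
    int status;
    while ((status = atomic_load(&g_completion)) < 0)
        ; /* poll: latency depends on loop speed, not interrupt delivery */
    cb(status, ctx);
    pthread_join(dev, NULL);
    return status;
}

/* Sample callback: count successful completions. */
static void count_cb(int status, void *ctx) { *(int *)ctx += (status == 0); }
```

The trade-off is that the polling core is always busy; SPDK accepts this because a dedicated core per reactor is part of its design.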
In addition, SPDK applications use CPU affinity to bind threads to CPU cores and design the thread model so that, from the moment a core receives an I/O operation until it finishes, everything runs on that same core. This uses the cache more efficiently and avoids memory-synchronization issues between cores. Memory resources on each core are likewise managed with huge-page storage for further acceleration.
Threads
In SPDK's thread model, each CPU core has one kernel thread on which a reactor is initialized. Each reactor thread can hold zero or more spdk_threads, the lightweight user-mode threads abstracted by SPDK. To make communication and synchronization between reactors efficient, SPDK abandons traditional locking and instead sends messages between reactors through the spdk_ring abstraction. Each spdk_thread in turn holds pollers, which are used to register user functions.
Using SPDK
Background startup via rpc
Starting from the v20.x releases, SPDK's configuration file format switched to JSON. When running an executable, the JSON configuration file can be passed with the --json parameter; when a file is specified at startup, the SPDK application performs its initialization in subsystems mode and carries out the operations listed in the file.
rpc is a well-known way to perform operations dynamically and flexibly after a program has started; it mainly uses a unix socket to carry message data between client and server. SPDK integrates such an rpc channel, which supports dynamic operations.
Server
The SPDK rpc server is initialized in the spdk_rpc_initialize function. If no listening address is specified, the default /var/tmp/spdk.sock is used; this is also the default address the rpc client connects to. Each module or function that needs to be callable over rpc can register its handler into the g_rpc_methods linked list with SPDK_RPC_REGISTER. The function jsonrpc_handler is the entry point for all client requests; it looks up the specific handler in the g_rpc_methods list according to the request.
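The register-then-dispatch pattern can be sketched as follows; the types and names below are illustrative stand-ins, not SPDK's actual definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the SPDK_RPC_REGISTER pattern: each method registers a
 * name->handler pair into a linked list; the dispatcher walks the list
 * to find the handler for an incoming request. */
typedef int (*rpc_handler)(const char *params);

struct rpc_method {
    const char *name;
    rpc_handler fn;
    struct rpc_method *next;
};

static struct rpc_method *g_methods; /* stands in for g_rpc_methods */

void rpc_register(struct rpc_method *m) {
    m->next = g_methods;
    g_methods = m;
}

/* Entry point for a request, like jsonrpc_handler: match by name. */
int rpc_dispatch(const char *name, const char *params) {
    for (struct rpc_method *m = g_methods; m; m = m->next)
        if (strcmp(m->name, name) == 0)
            return m->fn(params);
    return -32601; /* JSON-RPC "method not found" */
}

/* Sample handler for the sketch. */
static int rpc_get_version(const char *params) { (void)params; return 0; }
```

Keeping the registry as a linked list built at static-initialization time is what lets each module add its methods without touching a central table.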
Client
Most of the rpc functions provided by SPDK are invoked through the ./spdk/scripts/rpc.py script as entry point. That script includes the scripts under the ./spdk/python/spdk/rpc directory; the client-side handling of each function and the common helpers for interacting with the server are defined in those included scripts. The client-side logic for each module's rpc functions is collected in a python file named after the module.
To see which rpc methods are supported, simply run:
./rpc.py -h
To add a new rpc function, register the new function with SPDK_RPC_REGISTER and add the corresponding python script logic on the client side.
Basic mechanism analysis
SPDK's per-core parallelism, lock-free design, and run-to-completion programming model are built mainly on the reactor, event, poller, and io_channel mechanisms.
Reactors
When DPDK's rte_eal_init function runs, it creates a thread on each specified available core except the one currently running (the main CPU), and binds each thread to its core by setting the thread's CPU-affinity attribute. Each thread's entry function is eal_thread_loop, which waits to receive work from a pipe and executes it.
/* Launch threads, called at application init(). */
int
rte_eal_init(int argc, char **argv)
{
// ...
RTE_LCORE_FOREACH_WORKER(i) {
/*
* create communication pipes between main thread
* and children
*/
if (pipe(lcore_config[i].pipe_main2worker) < 0)
rte_panic("Cannot create pipe\n");
if (pipe(lcore_config[i].pipe_worker2main) < 0)
rte_panic("Cannot create pipe\n");
lcore_config[i].state = WAIT;
/* create a thread for each lcore */
// create the thread; its entry function is eal_thread_loop
ret = pthread_create(&lcore_config[i].thread_id, NULL,
eal_thread_loop, NULL);
if (ret != 0)
rte_panic("Cannot create thread\n");
/* Set thread_name for aid in debugging. */
snprintf(thread_name, sizeof(thread_name),
"lcore-worker-%d", i);
rte_thread_setname(lcore_config[i].thread_id, thread_name);
// pin the thread to its core
ret = pthread_setaffinity_np(lcore_config[i].thread_id,
sizeof(rte_cpuset_t), &lcore_config[i].cpuset);
if (ret != 0)
rte_panic("Cannot set affinity\n");
}
// ...
}
DPDK provides the rte_mempool and rte_ring mechanisms to support memory needs.
Each rte_mempool instance is a memory pool built from huge-page memory and organized in a specific data structure in which every allocated unit can store caller data. When an rte_mempool is created, cache buffers are created on each available CPU at the same time, so rte_mempool_get can take an object directly from the local cache buffer, speeding up allocation (much like a per-cpu cache).
/**
* The RTE mempool structure.
*/
struct rte_mempool {
/*
* Note: this field kept the RTE_MEMZONE_NAMESIZE size due to ABI
* compatibility requirements, it could be changed to
* RTE_MEMPOOL_NAMESIZE next time the ABI changes
*/
char name[RTE_MEMZONE_NAMESIZE]; /**< Name of mempool. */
RTE_STD_C11
union {
void *pool_data; /**< Ring or pool to store objects. */
uint64_t pool_id; /**< External mempool identifier. */
};
void *pool_config; /**< optional args for ops alloc. */
const struct rte_memzone *mz; /**< Memzone where pool is alloc'd. */
unsigned int flags; /**< Flags of the mempool. */
int socket_id; /**< Socket id passed at create. */
uint32_t size; /**< Max size of the mempool. */
uint32_t cache_size;
/**< Size of per-lcore default local cache. */
uint32_t elt_size; /**< Size of an element. */
uint32_t header_size; /**< Size of header (before elt). */
uint32_t trailer_size; /**< Size of trailer (after elt). */
unsigned private_data_size; /**< Size of private data. */
/**
* Index into rte_mempool_ops_table array of mempool ops
* structs, which contain callback function pointers.
* We're using an index here rather than pointers to the callbacks
* to facilitate any secondary processes that may want to use
* this mempool.
*/
int32_t ops_index;
struct rte_mempool_cache *local_cache; /**< Per-lcore local cache */
uint32_t populated_size; /**< Number of populated objects. */
struct rte_mempool_objhdr_list elt_list; /**< List of objects in pool */
uint32_t nb_mem_chunks; /**< Number of memory chunks */
// backing memory chunks of the pool
struct rte_mempool_memhdr_list mem_list; /**< List of memory chunks */
} __rte_cache_aligned;
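The per-lcore cache idea behind local_cache above can be sketched as follows (simplified and single-threaded for illustration; the names are made up, not DPDK's API): allocation first tries a core-local array of free objects and only falls back to the shared pool when the cache is empty.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 4

struct local_cache {
    void *objs[CACHE_SIZE];
    size_t len; /* objects currently cached */
};

struct pool {
    void **free_objs; /* shared free list (backed by a ring in DPDK) */
    size_t free_len;
    struct local_cache cache; /* one per core in the real design */
};

void *pool_get(struct pool *p) {
    if (p->cache.len > 0)   /* fast path: no shared state touched */
        return p->cache.objs[--p->cache.len];
    if (p->free_len > 0)    /* slow path: shared pool */
        return p->free_objs[--p->free_len];
    return NULL;
}

void pool_put(struct pool *p, void *obj) {
    if (p->cache.len < CACHE_SIZE)
        p->cache.objs[p->cache.len++] = obj; /* keep it hot and local */
    else
        p->free_objs[p->free_len++] = obj;   /* assumes capacity available */
}
```

Because the fast path touches only core-local state, it needs no atomic operations at all, which is the whole point of the per-lcore cache.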
rte_ring is a queue used to deliver messages; each unit passed through it is a memory pointer. See its usage in the spdk_thread_send_msg function.
rte_ring uses a lock-free queue model that supports multiple producers and multiple consumers; SPDK currently uses it in multi-producer, single-consumer mode.
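The core of such a ring can be sketched in a few lines. This is a simplified single-producer/single-consumer version, not rte_ring itself (which additionally handles multiple producers with compare-and-swap on the head index); the capacity must be a power of two so the indexes can wrap with a mask.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8 /* power of two */

struct msg_ring {
    void *slots[RING_SIZE];
    atomic_size_t head; /* next slot the producer writes */
    atomic_size_t tail; /* next slot the consumer reads */
};

/* Producer side: publish a pointer, fail if the ring is full. */
bool ring_enqueue(struct msg_ring *r, void *msg) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false; /* full */
    r->slots[head & (RING_SIZE - 1)] = msg;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: pop the oldest pointer, or NULL if empty. */
void *ring_dequeue(struct msg_ring *r) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head)
        return NULL; /* empty */
    void *msg = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return msg;
}
```

The release/acquire pairing on head and tail is what replaces a lock: the consumer only sees a slot after the producer's store to it has been published.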
After SPDK starts, one reactor runs on each specified available core, so reactors and CPU cores correspond one-to-one; on cores other than the main core, the reactor runs inside the thread created for eal_thread_loop. reactor->events is implemented on top of rte_ring and is used to pass messages between reactors: with spdk_event_allocate and spdk_event_call, a message can be sent to any CPU core used by spdk in order to run private logic there.
Usage scenarios for this operation:
- You need to run a deferred action on the current CPU core, but no corresponding spdk_thread is available.
- You need to run an action on another CPU core used by SPDK, but no associated spdk_thread is available.
DEFINE_STUB(spdk_event_allocate, struct spdk_event *, (uint32_t core, spdk_event_fn fn, void *arg1,
		void *arg2), NULL);
/* DEFINE_STUB is for defining the implementation of stubs for SPDK funcs. */
#define DEFINE_STUB(fn, ret, dargs, val) \
	bool ut_ ## fn ## _mocked = true; \
	ret ut_ ## fn = val; \
	ret fn dargs; \
	ret fn dargs \
	{ \
		return MOCK_GET(fn); \
	}
DEFINE_STUB_V(spdk_event_call, (struct spdk_event *event));
/* DEFINE_STUB_V macro is for stubs that don't have a return value */
#define DEFINE_STUB_V(fn, dargs) \
	void fn dargs; \
	void fn dargs \
	{ \
	}
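Conceptually, an event just packages a function pointer with two arguments and is queued to a target reactor, which later runs it. A toy model of this (not SPDK's implementation; a plain array stands in for the rte_ring, and dispatch plays the reactor loop):

```c
#include <assert.h>
#include <stddef.h>

typedef void (*event_fn)(void *arg1, void *arg2);

struct event {
    event_fn fn;
    void *arg1;
    void *arg2;
};

#define EVENT_QUEUE_LEN 16
static struct event g_events[EVENT_QUEUE_LEN];
static size_t g_event_count;

/* Counterpart of spdk_event_call: queue the event for the reactor. */
void event_call(event_fn fn, void *arg1, void *arg2) {
    if (g_event_count < EVENT_QUEUE_LEN)
        g_events[g_event_count++] = (struct event){ fn, arg1, arg2 };
}

/* One reactor iteration: drain queued events, return how many ran. */
size_t reactor_dispatch_events(void) {
    size_t n = g_event_count;
    for (size_t i = 0; i < n; i++)
        g_events[i].fn(g_events[i].arg1, g_events[i].arg2);
    g_event_count = 0;
    return n;
}

/* Sample event function for the sketch. */
static void add_to(void *arg1, void *arg2) { *(int *)arg1 += *(int *)arg2; }
```

In SPDK the queue is per-reactor and lock-free, so any core can hand work to any other core without shared-state locking.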
SPDK's default scheduling policy is the static type, i.e. both the reactor and its threads run in polling mode.
SPDK’s thread model
spdk_thread is not a thread in the conventional sense; it is a logical concept with no execution function of its own, and all of its operations are carried out inside the reactor's execution function.
The relationship between spdk_thread and reactor is N:1: each reactor can host many spdk_threads, but every spdk_thread belongs to exactly one reactor.
Pollers
Each registered spdk_poller is stored in the spdk_thread->timed_pollers red-black tree or in the spdk_thread->active_pollers linked list, so to use a poller you first need to create a spdk_thread.
Once you have a spdk_thread, you can run a function repeatedly or periodically by registering a spdk_poller. If the period given at registration is 0, the poller's function is called on every reactor loop iteration; if it is non-zero, each reactor iteration first checks whether the poller's period has elapsed before executing it.
struct spdk_poller {
TAILQ_ENTRY(spdk_poller) tailq;
/* Current state of the poller; should only be accessed from the poller's thread. */
enum spdk_poller_state state;
uint64_t period_ticks;
uint64_t next_run_tick;
uint64_t run_count;
uint64_t busy_count;
spdk_poller_fn fn;
void *arg;
struct spdk_thread *thread;
int interruptfd;
spdk_poller_set_interrupt_mode_cb set_intr_cb_fn;
void *set_intr_cb_arg;
char name[SPDK_MAX_POLLER_NAME_LEN + 1];
};
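The period check described above can be sketched like this (field names follow struct spdk_poller, but the logic is a simplification of the real reactor loop):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A poller with period_ticks == 0 runs on every reactor iteration;
 * otherwise it runs only once the reactor's clock reaches
 * next_run_tick, which is then advanced by one period. */
struct timed_poller {
    uint64_t period_ticks;  /* 0 = run every iteration */
    uint64_t next_run_tick;
    uint64_t run_count;
};

bool poller_maybe_run(struct timed_poller *p, uint64_t now) {
    if (p->period_ticks != 0 && now < p->next_run_tick)
        return false; /* not due yet */
    p->next_run_tick = now + p->period_ticks;
    p->run_count++; /* the real reactor would invoke p->fn(p->arg) here */
    return true;
}
```

Keeping timed pollers in a red-black tree ordered by next_run_tick lets the reactor check only the earliest deadline instead of scanning every poller.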
Once a spdk_thread is created, you can use the spdk_thread_send_msg function to execute a specific function on it; by choosing the appropriate spdk_thread, you can run operations on the current CPU core or on other CPU cores used by SPDK. The msg passed into this function is allocated from g_spdk_msg_mempool (an rte_mempool instance), and the lock-free rte_ring queue is used to deliver it.
In general, reactors on different cores (or the same core) coordinate through the lock-free event queue (rte_ring), while spdk_threads on different cores communicate through the lock-free message queue (also rte_ring).
io_channel
The IO channel is an abstract mechanism for performing the same operation independently on each available CPU core; it will not be summarized here.
Backend (vhost) device lookup
Scan the bus, bind the device to the controller, and read and write data.
- probe_cb: called after an NVMe controller is found
- attach_cb: called once the controller has been attached to the userspace driver
The process of finding the device and binding the driver is implemented in the spdk_nvme_probe function. The code scans the devices on the specified PCI bus, keeps drivers and devices in two global linked lists, then traverses the drivers and matches the discovered devices against them by transport id.
int
spdk_nvme_probe(const struct spdk_nvme_transport_id *trid, void *cb_ctx,
spdk_nvme_probe_cb probe_cb, spdk_nvme_attach_cb attach_cb,
spdk_nvme_remove_cb remove_cb)
{
struct spdk_nvme_transport_id trid_pcie;
struct spdk_nvme_probe_ctx *probe_ctx;
if (trid == NULL) {
memset(&trid_pcie, 0, sizeof(trid_pcie));
spdk_nvme_trid_populate_transport(&trid_pcie, SPDK_NVME_TRANSPORT_PCIE);
trid = &trid_pcie;
}
probe_ctx = spdk_nvme_probe_async(trid, cb_ctx, probe_cb,
attach_cb, remove_cb);
if (!probe_ctx) {
SPDK_ERRLOG("Create probe context failed\n");
return -1;
}
/*
* Keep going even if one or more nvme_attach() calls failed,
* but maintain the value of rc to signal errors when we return.
*/
return nvme_init_controllers(probe_ctx);
}
A closer look at the read/write process
The HOST is the system into which the NVMe card is inserted. Interaction between the HOST and the Controller is carried out through Qpairs.
Qpairs are divided into IO Qpairs and Admin Qpairs; as the names suggest, the Admin Qpair is used to transmit control commands, while IO Qpairs transmit IO commands.
A Qpair is a pair of fixed-size circular queues: a submission queue (Submission Queue, SQ) and a completion queue (Completion Queue, CQ). The submission queue is a ring of fixed-size 64-byte commands plus two 16-bit integers (the head and tail indexes); the completion queue is a ring of fixed-size 16-byte completion entries plus, likewise, two 16-bit head and tail indexes. Each queue also has a 32-bit Doorbell register: the SQ Tail Doorbell and the CQ Head Doorbell.
When the HOST needs to write data to the NVMe device, it must specify the address of the data in host memory and the location on the device to write it to. Reading from the device is analogous: the HOST specifies the device location and a host memory address, so the device knows where to fetch data from and where to place it afterward. There are two ways to express the data address: PRP and SGL.
A PRP points to a physical memory page; similar to ordinary addressing, it is a base address plus an offset.
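Since only the first PRP entry may carry an offset within its page, a little arithmetic determines how much that entry covers and how many entries a transfer needs. A sketch assuming a 4 KiB device page size:

```c
#include <assert.h>
#include <stdint.h>

#define NVME_PAGE_SIZE 4096ULL

/* Bytes covered by the first PRP entry: from its in-page offset to the
 * end of that page. */
uint64_t prp1_bytes(uint64_t prp1) {
    uint64_t offset = prp1 & (NVME_PAGE_SIZE - 1);
    return NVME_PAGE_SIZE - offset;
}

/* Number of PRP entries needed for a transfer of `len` bytes starting
 * at physical address `prp1`; all entries after the first are
 * page-aligned, one page each. */
uint64_t prp_entries_needed(uint64_t prp1, uint64_t len) {
    uint64_t covered = prp1_bytes(prp1);
    if (len <= covered)
        return 1;
    return 1 + (len - covered + NVME_PAGE_SIZE - 1) / NVME_PAGE_SIZE;
}
```

This is why an unaligned 4 KiB buffer needs two PRP entries while an aligned one needs only one, and why transfers needing more than two entries switch to a PRP list.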
How SPDK submits I/O to a local PCIe device
An I/O is submitted by constructing a 64-byte command, placing it in the submission queue at the current SQ tail index, and then writing the new tail index to the SQ Tail Doorbell. You can also write multiple commands into the SQ and then submit them all with a single Doorbell write.
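The batched-submission idea can be modeled with ordinary memory standing in for the device registers (a sketch, not SPDK's submission code): enqueueing advances only the host-side tail, and a single doorbell write publishes everything queued since the last write.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SQ_DEPTH 8
#define CMD_SIZE 64

struct sub_queue {
    uint8_t ring[SQ_DEPTH][CMD_SIZE]; /* 64-byte command slots */
    uint16_t tail;                    /* host-side tail index */
    uint16_t tail_doorbell;           /* stand-in for the MMIO doorbell */
};

/* Place a command at the tail; does NOT ring the doorbell yet. */
void sq_enqueue(struct sub_queue *sq, const uint8_t cmd[CMD_SIZE]) {
    memcpy(sq->ring[sq->tail], cmd, CMD_SIZE);
    sq->tail = (sq->tail + 1) % SQ_DEPTH; /* wrap around the ring */
}

/* One MMIO write submits everything queued since the last ring. */
void sq_ring_doorbell(struct sub_queue *sq) {
    sq->tail_doorbell = sq->tail;
}
```

Batching doorbell writes matters because each one is an uncached MMIO store across PCIe, which is far more expensive than the memcpy into the queue.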
The command itself describes the operation, and also the location in host memory associated with it, i.e. where the data to be written comes from, or where data that is read should be placed. Data is transferred to or from this address via DMA.
The completion queue works similarly: the device writes response entries to the CQ. Each element in the CQ contains a Phase Tag bit that toggles between 0 and 1 on each full pass around the ring. The HOST is normally notified of CQ updates via interrupts, but SPDK does not enable interrupts and instead polls the phase bit to detect updates.
It's a bit like the io_uring mechanism.
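Phase-tag polling can be modeled like this (a simplified sketch of the NVMe CQ convention, not SPDK code): the host expects a given phase value and flips its expectation after wrapping around the ring, so a stale entry with the old phase simply means "nothing new", with no interrupt required.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CQ_DEPTH 4

struct cq_entry {
    uint16_t status;
    uint8_t phase; /* toggled by the device on each pass over the ring */
};

struct comp_queue {
    struct cq_entry ring[CQ_DEPTH];
    uint16_t head;
    uint8_t expected_phase; /* starts at 1; flips after each wrap */
};

/* Poll once: returns true and pops an entry if a new completion is seen. */
bool cq_poll(struct comp_queue *cq, uint16_t *status_out) {
    struct cq_entry *e = &cq->ring[cq->head];
    if (e->phase != cq->expected_phase)
        return false; /* old phase: no new completion */
    *status_out = e->status;
    if (++cq->head == CQ_DEPTH) {
        cq->head = 0;
        cq->expected_phase ^= 1; /* wrapped: expect the other phase now */
    }
    return true;
}
```

The phase bit is what lets the host detect new entries without the device ever writing a separate "count" register: the entry itself carries the evidence of its freshness.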
Asynchronous I/O
Asynchronous I/O is used extensively here. On Linux, AIO and io_uring are generally used; io_uring makes up for some of aio's deficiencies and is sometimes called aio_ring, io_ring, or ring_io. A small example:
/**
 * Read a file with io_uring
 **/
#include <cassert>
#include <fcntl.h>
#include <iostream>
#include <liburing.h>
#include <unistd.h>
char buf[1024] = {0};
int main() {
    int fd = open("1.txt", O_RDONLY, 0);
    io_uring ring;
    io_uring_queue_init(32, &ring, 0);                // initialize the ring
    auto sqe = io_uring_get_sqe(&ring);               // get a free slot in the ring
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); // describe the read in that slot
    io_uring_submit(&ring);                           // submit the request
    io_uring_cqe* res;                                // completion queue entry pointer
    io_uring_wait_cqe(&ring, &res);                   // block until one task completes
    assert(res);
    std::cout << "read bytes: " << res->res << " \n";
    std::cout << buf << std::endl;
    io_uring_cqe_seen(&ring, res);                    // remove the task from the CQ
    io_uring_queue_exit(&ring);                       // tear down
    return 0;
}
io_uring involves three things: the submission queue, the completion queue, and the task entities.