Brief analysis of SPDK technology

I wrote this article after reading the paper "TriCache: A User-Transparent Block Cache Enabling High-Performance Out-of-Core Processing with In-Memory Programs", which uses SPDK masterfully and raises I/O performance to a new level.

SPDK basics

SPDK (Storage Performance Development Kit) provides a set of tools and libraries for writing high-performance, scalable user-mode storage applications.

Related resources:

  • Introduction to the Storage Performance Development Kit (SPDK)
  • SPDK official website documentation
  • SPDK source code on GitHub: spdk/spdk
  • SPDK technical articles

There is also a related technology, DPDK, whose architecture is similar to SPDK's. If you want to learn about it, I recommend the book "In-depth and Simple DPDK".

Advantages:

  • Move all required drivers to user space, which avoids system calls and enables zero-copy access for applications
  • Polling hardware for completion rather than relying on interrupt preemption reduces overall latency and latency variance
  • Avoid all locking in the I/O path and instead rely on message passing (a lock-free message queue is used)

Compile and install

git clone https://github.com/spdk/spdk
cd spdk
git submodule update --init   # fetch the submodules
sudo scripts/pkgdep.sh        # install the dependencies
./configure
make

Run the unit tests. The following output means they passed:

(base) root@nizai8a-desktop:~/tt/spdk# ./test/unit/unittest.sh
=====================
All unit tests passed
=====================
WARN: lcov not installed or SPDK built without coverage!
WARN: neither valgrind nor ASAN is enabled!

Allocate huge pages and unbind any NVMe devices from the native kernel driver

The output below shows a common error: /dev/nvme1n1 is in use (it has active mountpoints), so to unbind that device you must first umount it. The user-space driver uses uio or vfio to map the device's PCI BARs into the current process, which lets the driver perform MMIO directly (uio is used by default here).

(base) root@nizai8a-desktop:~/tt/spdk# sudo scripts/setup.sh
0000:81:00.0 (144d a808): Active mountpoints on nvme1n1:nvme1n1, so not binding PCI dev
0000:01:00.0 (15b7 5009): nvme -> uio_pci_generi

Why allocate huge pages?
Background: the principle of huge page memory in DPDK

  • All huge pages and huge page tables are kept in shared memory and are never swapped out to the disk swap partition when memory runs low.
  • Because all processes share one huge page table, page table overhead is reduced and less memory is occupied, which lets the system run more processes at the same time.
  • Less pressure on the TLB.
  • Less pressure on memory lookups during address translation.

By default, the script allocates 2048 MB of huge pages. To change this amount, set HUGEMEM (in MB):

sudo HUGEMEM=4096 scripts/setup.sh

Check the huge page sizes supported by the system:

(base) root@nizai8a-desktop:/sys/kernel/mm/hugepages# ls
hugepages-1048576kB  hugepages-2048kB

SPDK architecture

This section focuses on the three main advantages of SPDK.


User mode driver

Since the emergence of NVMe, software has become the bottleneck in I/O-intensive scenarios. There are two ways to reduce the kernel overhead:

  • io_uring: provides a new set of system calls that optimize the system call path
  • Kernel bypass: the entire I/O operation never traps into the kernel. SPDK is a storage acceleration solution of this kind

Intel defines it as an acceleration library that uses user-mode, asynchronous, polled NVMe drivers to accelerate application software that uses NVMe SSDs as back-end storage.

NVMe is a protocol specification defined for faster communication between solid-state drives (SSDs) and hosts. The NVMe-oF protocol extends this local storage performance across the network, supporting InfiniBand, Fibre Channel, and Ethernet. Among storage solutions, the main ones today are DAS, NAS, and SAN.

Based on UIO or VFIO, SPDK maps the storage device's address space directly into the application's address space, then follows the NVMe specification to initialize the NVMe SSD and implement basic I/O operations. The result is a user-mode driver, so the whole I/O path never traps into the kernel.

In SPDK's user-mode driver, interrupt-based completion notification is replaced by asynchronous polling. The CPU polls continuously; as soon as it finds that an operation has completed, it immediately invokes the callback function and hands the result to the upper-level user program. This lets the user program issue multiple requests on demand, which improves performance.

Another obvious benefit of eliminating interrupts is that context switches are avoided.

In addition, SPDK uses CPU affinity to bind threads to CPU cores and designs its thread model around this: from the moment a core receives an I/O operation until that operation finishes, all of the work is done on that same core. This makes more efficient use of the cache and avoids memory synchronization between cores.

At the same time, the memory resources of each core are managed with huge pages to speed up access.


Thread

In SPDK's thread model, each CPU core has one kernel thread, which initializes a reactor.

Each reactor thread can hold zero or more spdk_thread objects, the lightweight user-mode threads that SPDK abstracts. To make communication and synchronization between reactors efficient, SPDK abandons traditional locking and instead sends messages to each reactor thread through spdk_ring. Each spdk_thread in turn owns pollers, which are used to register user functions.


Using SPDK

RPC background startup

Starting from version v20.x, SPDK switched its configuration file format to JSON. When running an executable, the JSON configuration file can be passed with the --json parameter. When a JSON file is specified at startup, the SPDK application initializes its subsystems in RPC mode and performs the operations specified in the JSON file.

RPC is a well-known way to perform operations dynamically and flexibly after a program has started. It mainly uses a unix socket to transfer message data between the client and the server.
SPDK also implements an RPC channel of its own, which supports dynamic operations.

Server

The SPDK RPC server is initialized in the spdk_rpc_initialize function. If no listening address is specified, the default /var/tmp/spdk.sock is used; this is also the default address that the RPC client connects to. Each module or function that needs to be callable over RPC uses SPDK_RPC_REGISTER to register its handler in the g_rpc_methods linked list. The function jsonrpc_handler is the entry point for all client requests; it matches each request against the g_rpc_methods list to find the specific handler.

Client

Most of the functions provided by the SPDK RPC client are invoked through the ./spdk/scripts/rpc.py script. This script imports the Python scripts in the ./spdk/python/spdk/rpc directory, which define the client-side handling of each function as well as the common functions for talking to the server. The client-side logic for the RPC functions of each module is collected in a Python file named after that module.

To list the supported RPC methods, run ./rpc.py -h

To add a new RPC method, register it on the server with SPDK_RPC_REGISTER and add the corresponding Python logic on the client

Basic mechanism analysis

SPDK's per-core parallelism, lock-free design, and run-to-completion programming model are built mainly on the reactor, event, poller, and io channel mechanisms


Reactors

When DPDK's rte_eal_init function runs, it creates a thread on every specified available CPU core except the one it is currently running on (the main CPU).

Each thread is bound to its CPU core by setting the thread's affinity. The entry function of each thread is eal_thread_loop, which waits for a command to arrive on its pipe and then executes it

/* Launch threads, called at application init(). */
int
rte_eal_init(int argc, char **argv)
{
	// ...
	RTE_LCORE_FOREACH_WORKER(i) {

		/*
		 * create communication pipes between main thread
		 * and children
		 */
		if (pipe(lcore_config[i].pipe_main2worker) < 0)
			rte_panic("Cannot create pipe\n");
		if (pipe(lcore_config[i].pipe_worker2main) < 0)
			rte_panic("Cannot create pipe\n");

		lcore_config[i].state = WAIT;

		/* create a thread for each lcore */
		// create the thread; its entry function is eal_thread_loop
		ret = pthread_create(&lcore_config[i].thread_id, NULL,
				     eal_thread_loop, NULL); 
		if (ret != 0)
			rte_panic("Cannot create thread\n");

		/* Set thread_name for aid in debugging. */
		snprintf(thread_name, sizeof(thread_name),
				"lcore-worker-%d", i);
		rte_thread_setname(lcore_config[i].thread_id, thread_name);
		// bind the thread to its core (CPU affinity)
		ret = pthread_setaffinity_np(lcore_config[i].thread_id,
			sizeof(rte_cpuset_t), &lcore_config[i].cpuset);
		if (ret != 0)
			rte_panic("Cannot set affinity\n");
	}
	// ...
}

DPDK provides the rte_mempool and rte_ring mechanisms to support memory requirements.

Each rte_mempool instance is a memory pool built from huge page memory and organized in a specific data structure; each unit allocated from it can be used to store the caller's data.

When an rte_mempool is created, cache buffers are also created for each available CPU, so that rte_mempool_get can take objects directly from the cache buffer, which speeds up allocation (somewhat like a per-CPU cache)

/**
 * The RTE mempool structure.
 */
struct rte_mempool {
	/*
	 * Note: this field kept the RTE_MEMZONE_NAMESIZE size due to ABI
	 * compatibility requirements, it could be changed to
	 * RTE_MEMPOOL_NAMESIZE next time the ABI changes
	 */
	char name[RTE_MEMZONE_NAMESIZE]; /**< Name of mempool. */
	RTE_STD_C11
	union {
		void *pool_data;         /**< Ring or pool to store objects. */
		uint64_t pool_id;        /**< External mempool identifier. */
	};
	void *pool_config;               /**< optional args for ops alloc. */
	const struct rte_memzone *mz;    /**< Memzone where pool is alloc'd. */
	unsigned int flags;              /**< Flags of the mempool. */
	int socket_id;                   /**< Socket id passed at create. */
	uint32_t size;                   /**< Max size of the mempool. */
	uint32_t cache_size;
	/**< Size of per-lcore default local cache. */

	uint32_t elt_size;               /**< Size of an element. */
	uint32_t header_size;            /**< Size of header (before elt). */
	uint32_t trailer_size;           /**< Size of trailer (after elt). */

	unsigned private_data_size;      /**< Size of private data. */
	/**
	 * Index into rte_mempool_ops_table array of mempool ops
	 * structs, which contain callback function pointers.
	 * We're using an index here rather than pointers to the callbacks
	 * to facilitate any secondary processes that may want to use
	 * this mempool.
	 */
	int32_t ops_index;

	struct rte_mempool_cache *local_cache; /**< Per-lcore local cache */

	uint32_t populated_size;         /**< Number of populated objects. */
	struct rte_mempool_objhdr_list elt_list; /**< List of objects in pool */
	uint32_t nb_mem_chunks;          /**< Number of memory chunks */
	// the memory backing the pool
	struct rte_mempool_memhdr_list mem_list; /**< List of memory chunks */
}  __rte_cache_aligned;

rte_ring is a queue used to deliver messages. Each element passed through an rte_ring is a memory pointer. For an example, see how it is used in the spdk_thread_send_msg function.

rte_ring uses a lock-free queue that supports the multi-producer, multi-consumer model. SPDK currently uses it in the multi-producer, single-consumer mode.

After SPDK starts, one reactor runs on each specified available CPU core, and on the CPU cores other than the main core there is a one-to-one correspondence between an SPDK reactor and eal_thread_loop.

reactor->events is implemented on top of rte_ring and is used to pass messages between reactors. Calling spdk_event_allocate and then spdk_event_call sends a message to any CPU core used by SPDK in order to run private logic there.

Usage scenarios for this operation:

  • You need to perform a deferred action on the current CPU core, but no corresponding spdk_thread is available.
  • You need to perform an action on another CPU core used by SPDK, but no associated spdk_thread is available.

The listing below shows the two functions as they are declared through the DEFINE_STUB macros. These macros generate stubs, so the listing shows the signatures rather than the real implementations:

DEFINE_STUB(spdk_event_allocate, struct spdk_event *, (uint32_t core, spdk_event_fn fn, void *arg1,
		void *arg2), NULL);

/* DEFINE_STUB is for defining the implementation of stubs for SPDK funcs. */
#define DEFINE_STUB(fn, ret, dargs, val) \
	bool ut_ ## fn ## _mocked = true; \
	ret ut_ ## fn = val; \
	ret fn dargs; \
	ret fn dargs \
	{ \
		return MOCK_GET(fn); \
	}

DEFINE_STUB_V(spdk_event_call, (struct spdk_event *event));

/* DEFINE_STUB_V macro is for stubs that don't have a return value */
#define DEFINE_STUB_V(fn, dargs) \
	void fn dargs; \
	void fn dargs \
	{ \
	}

SPDK's default scheduling policy is the static type, meaning that both reactors and threads run in polling mode


SPDK’s thread model

An spdk_thread is not a thread in the conventional sense; it is a logical concept. It has no execution function of its own, and all of its operations are carried out inside the reactor's execution function.

The relationship between spdk_thread and reactor is N:1: each reactor can host many spdk_threads, but each spdk_thread must belong to exactly one reactor


Poller

Each registered spdk_poller is stored either in the spdk_thread->timed_pollers red-black tree or in the spdk_thread->active_pollers linked list. So to use a poller, you first need to create an spdk_thread

Once you have an spdk_thread, you can register an spdk_poller on it to run a function repeatedly or periodically. If the period given at registration is 0, the poller's function is called on every reactor iteration; if the period is not 0, each reactor iteration first checks whether the period has elapsed before calling it.

struct spdk_poller {
	TAILQ_ENTRY(spdk_poller)	tailq;

	/* Current state of the poller; should only be accessed from the poller's thread. */
	enum spdk_poller_state		state;

	uint64_t			period_ticks;
	uint64_t			next_run_tick;
	uint64_t			run_count;
	uint64_t			busy_count;
	spdk_poller_fn			fn;
	void				*arg;
	struct spdk_thread		*thread;
	int				interruptfd;
	spdk_poller_set_interrupt_mode_cb set_intr_cb_fn;
	void				*set_intr_cb_arg;

	char				name[SPDK_MAX_POLLER_NAME_LEN + 1];
};

Once an spdk_thread exists, you can use the spdk_thread_send_msg function to execute a specific function on it. By choosing the appropriate spdk_thread, you can run the operation on the current CPU core or on another CPU core used by SPDK.

The msg passed by this function is allocated from g_spdk_msg_mempool (an rte_mempool instance), and a lock-free rte_ring queue is used to deliver it.

In summary, reactors communicate through events (rte_ring), while spdk_threads on the same CPU core or on different CPU cores communicate through messages (rte_ring); in both cases the coordination goes through lock-free queues.


io_channel

An IO channel is an abstraction for performing the same operation independently on each available CPU core. It is not covered in detail here.

backend vhost


device lookup

Scan for devices, bind each device to the driver, then read and write data

  • probe_cb: called back when an NVMe controller is found
  • attach_cb: called once the NVMe controller has been attached to the userspace driver

The process of finding devices and binding the driver is implemented in the spdk_nvme_probe function

In the code, we scan the devices on the PCI bus by specifying a transport id, keep the drivers and devices in two global linked lists, then traverse the drivers and match the discovered devices against them.

int
spdk_nvme_probe(const struct spdk_nvme_transport_id *trid, void *cb_ctx,
		spdk_nvme_probe_cb probe_cb, spdk_nvme_attach_cb attach_cb,
		spdk_nvme_remove_cb remove_cb)
{
	struct spdk_nvme_transport_id trid_pcie;
	struct spdk_nvme_probe_ctx *probe_ctx;

	if (trid == NULL) {
		memset(&trid_pcie, 0, sizeof(trid_pcie));
		spdk_nvme_trid_populate_transport(&trid_pcie, SPDK_NVME_TRANSPORT_PCIE);
		trid = &trid_pcie;
	}

	probe_ctx = spdk_nvme_probe_async(trid, cb_ctx, probe_cb,
					  attach_cb, remove_cb);
	if (!probe_ctx) {
		SPDK_ERRLOG("Create probe context failed\n");
		return -1;
	}

	/*
	 * Keep going even if one or more nvme_attach() calls failed,
	 *  but maintain the value of rc to signal errors when we return.
	 */
	return nvme_init_controllers(probe_ctx);
}

Now let's focus on the read and write process

The HOST is the system that the NVMe card is plugged into. The HOST and the Controller interact through Qpairs

Qpairs are divided into IO Qpairs and Admin Qpairs. As the names suggest, an Admin Qpair carries control commands, while an IO Qpair carries I/O commands

A Qpair consists of a submission queue (Submission Queue, SQ) and a completion queue (Completion Queue, CQ), each a circular queue with a fixed number of elements. The submission queue is an array of a fixed number of 64-byte commands, plus 2 integers (head and tail indices). The completion queue is a circular queue of a fixed number of 16-byte entries, plus 2 integers (head and tail indices). There are also two 32-bit registers (Doorbells): the Head Doorbell and the Tail Doorbell

When the HOST needs to write data to NVMe, it must specify the address of the data in HOST memory and the location on NVMe to write it to. The same is true when the HOST reads data from NVMe: it must specify the NVMe address and the HOST memory address, so that it is known where to fetch the data from and where to put it afterwards. There are two ways to express the data address: PRP and SGL

A PRP points to a physical memory page. PRP addressing resembles ordinary addressing: a base address plus an offset


How SPDK submits I/O to a local PCIe device

An I/O is submitted to an NVMe device by constructing a 64-byte command, placing it into the submission queue at the current submission queue tail index, and then writing the new tail index to the submission queue Tail Doorbell register. You can also write multiple commands to the SQ and then submit all of them with a single Doorbell write.

The command itself describes the operation and also describes the location in host memory of the data associated with the command, that is, where in memory the data to be written resides, or where the data that is read should be placed. Data is transferred to or from this address via DMA

The completion queue works similarly: the device writes the response to each command into the CQ. Each element in the CQ contains a Phase Tag that toggles between 0 and 1 on each pass through the entire ring. The device notifies the HOST of CQ updates via interrupts, but SPDK does not enable interrupts and instead polls the phase bit to detect CQ updates

This submission and completion queue design is somewhat similar to the io_uring mechanism

Asynchronous I/O

Asynchronous I/O is used extensively here. On Linux, the usual choices are AIO and io_uring

Of the two, io_uring makes up for some of the shortcomings of aio

io_uring is sometimes called aio_ring, io_ring, or ring_io

A small example

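/**
 * Read a file with io_uring
 **/
#include <cassert>
#include <cstdio>
#include <iostream>
#include <fcntl.h>
#include <liburing.h>
#include <unistd.h>

char buf[1024] = {0};

int main() {
  int fd = open("1.txt", O_RDONLY);
  if (fd < 0) {
    perror("open");
    return 1;
  }

  io_uring ring;
  if (io_uring_queue_init(32, &ring, 0) < 0) { // initialize the ring
    close(fd);
    return 1;
  }
  auto sqe = io_uring_get_sqe(&ring); // take a free slot from the submission ring
  assert(sqe);
  // describe the operation for this slot; leave room for the terminating '\0'
  io_uring_prep_read(sqe, fd, buf, sizeof(buf) - 1, 0);
  io_uring_submit(&ring); // submit the task
  io_uring_cqe* res; // completion queue entry pointer
  io_uring_wait_cqe(&ring, &res); // block until one task has completed
  assert(res);
  std::cout << "read bytes: " << res->res << " \n";
  std::cout << buf << std::endl;
  io_uring_cqe_seen(&ring, res); // remove the task from the completion queue
  io_uring_queue_exit(&ring); // tear down the ring
  close(fd);
  return 0;
}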

io_uring has three key parts: the submission queue, the completion queue, and the task entries


Origin blog.csdn.net/qq_48322523/article/details/128653182