SPDK IO Path Optimization in Practice: How RSSD Cloud Disks Reach 1.2 Million IOPS

1. Overview

The demands of highly concurrent users and large-scale computing keep pushing storage hardware forward: storage cluster performance keeps improving, latency keeps dropping, and the performance expectations on the whole IO path keep rising with them. In a cloud disk scenario, the path an IO request travels from the guest to the back-end storage cluster and back is complex, and the virtualization IO path in particular can become a bottleneck, because every IO issued inside a virtual machine has to pass through it before being sent to the back-end storage system. We used SPDK to optimize the virtualization IO path, solved the hot-upgrade and live-migration problems that open-source SPDK leaves unaddressed, and applied the result to our high-performance cloud disks with good results: RSSD cloud disks can reach up to 1.2 million IOPS. This article shares some of our experience in this area.

2. SPDK vhost Basics

SPDK (Storage Performance Development Kit) provides a set of libraries and tools for writing high-performance, scalable, user-mode storage applications. Its core is a user-mode, polled, asynchronous, lock-free NVMe driver that gives user-space applications zero-copy, highly parallel direct access to SSDs.

On the virtualization IO path, virtio is the commonly used para-virtualization solution, and at the bottom virtio communicates over vrings. Let us first go through the basics of a virtio vring. Each virtio vring mainly consists of the following parts:

[Figure: structure of a virtio vring]

desc table: an array whose size equals the device queue depth, typically 128. Each element describes one IO request and stores the memory address of the request's data, the length of the IO and other basic information. In general one IO request uses one desc element; when a request spans several memory pages, multiple desc elements are chained together through their next pointers. Unused desc elements are likewise chained from free_head through next, forming a free list for later use.

available ring: a circular array in which each entry is an index into the desc table. When handling IO requests, the back end takes an index from this ring and uses it to find the corresponding IO request in the desc table.

used ring: a circular array similar to the available ring, but used to indicate completed IO requests. When an IO request finishes processing, its desc index is placed into this ring; the front-end virtio driver scans the ring to determine which requests have completed, and for each completed request reclaims the corresponding desc entries for subsequent IOs.
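To make the layout above more concrete, the three parts can be pictured roughly as the C structures below. This is a simplified sketch with our own names; the real virtio and SPDK headers also carry flags, padding and event-suppression fields.

```c
/* Simplified sketch of a virtio vring; not the actual virtio/SPDK definitions. */
#include <stdint.h>

#define QUEUE_DEPTH 128

struct vring_desc {
    uint64_t addr;   /* guest-physical address of the data buffer      */
    uint32_t len;    /* length of the buffer in bytes                  */
    uint16_t flags;  /* e.g. "next descriptor follows", "device writes" */
    uint16_t next;   /* index of the next desc when chained            */
};

struct vring_avail {
    uint16_t idx;                  /* where the driver puts the next entry */
    uint16_t ring[QUEUE_DEPTH];    /* indices into the desc table          */
};

struct vring_used_elem {
    uint32_t id;     /* head desc index of the completed request */
    uint32_t len;    /* number of bytes written back              */
};

struct vring_used {
    uint16_t idx;                              /* where the device puts the next entry */
    struct vring_used_elem ring[QUEUE_DEPTH];
};
```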

The principle of SPDK vhost is then straightforward: during initialization, qemu's vhost driver passes the virtio vring information described above to SPDK; SPDK then continuously polls the available ring for new IO requests, processes any it finds, adds the completed indices to the used ring, and notifies the virtio front end through the corresponding eventfd.
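Conceptually, the back-end side boils down to the loop below, built on the structures sketched above. This is only an illustration: the real SPDK vhost poller runs on a reactor thread and also handles descriptor chains, indirect descriptors, memory barriers and notification suppression; submit_request is a hypothetical helper.

```c
#include <stdint.h>
#include <sys/eventfd.h>

/* Minimal virtqueue view for this sketch; not the SPDK definition. */
struct virtqueue {
    struct vring_desc  *desc;
    struct vring_avail *avail;
    struct vring_used  *used;
    uint16_t            last_avail_idx;  /* how far we have consumed the avail ring */
    int                 callfd;          /* eventfd used to notify the guest        */
};

void submit_request(struct virtqueue *vq, uint16_t head);   /* hypothetical helper */

/* Polled repeatedly: pick up any new requests from the available ring. */
void poll_virtqueue(struct virtqueue *vq)
{
    while (vq->last_avail_idx != vq->avail->idx) {
        uint16_t head = vq->avail->ring[vq->last_avail_idx % QUEUE_DEPTH];
        submit_request(vq, head);        /* translate addresses, send to the bdev */
        vq->last_avail_idx++;
    }
}

/* Called when the back end finishes a request: publish it and kick the guest. */
void complete_request(struct virtqueue *vq, uint16_t head, uint32_t len)
{
    struct vring_used_elem *e = &vq->used->ring[vq->used->idx % QUEUE_DEPTH];

    e->id  = head;
    e->len = len;
    vq->used->idx++;

    eventfd_write(vq->callfd, 1);        /* notify the virtio front end */
}
```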

When SPDK receives an IO request, it only gets a pointer to the request's data. It needs to access that memory directly while processing the request, but the pointer is an address in qemu's address space and obviously cannot be used as-is, so some conversion is needed.

When SPDK is used, the virtual machine's memory is backed by hugepages. During initialization, the VM's hugepage memory information is sent to SPDK, which parses it and mmaps the same hugepages into its own address space, so the two processes share the memory. When SPDK gets a pointer in qemu's address space, it can easily convert it into a pointer in its own address space by computing an offset.
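The conversion itself is just an offset calculation. The sketch below assumes a single contiguous memory region announced by qemu; the real vhost code keeps a table of such regions and first looks up the one containing the address.

```c
#include <stdint.h>
#include <stddef.h>

/* One guest memory region as announced by qemu over the vhost socket and
 * mapped into SPDK with mmap(); a real implementation keeps several. */
struct mem_region {
    uint64_t guest_phys_addr;  /* start of the region in guest physical space */
    uint64_t size;             /* region length                               */
    void    *host_vaddr;       /* where mmap() placed it in the SPDK process  */
};

/* Translate a guest physical address found in a vring descriptor into a
 * pointer usable inside the SPDK process. Returns NULL if out of range. */
static void *gpa_to_vva(const struct mem_region *r, uint64_t gpa, uint32_t len)
{
    if (gpa < r->guest_phys_addr ||
        gpa + len > r->guest_phys_addr + r->size)
        return NULL;

    return (uint8_t *)r->host_vaddr + (gpa - r->guest_phys_addr);
}
```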

From the principle above we can see that SPDK vhost exchanges IO requests between the two processes quickly through shared hugepage memory: only pointers are passed, no memory is copied at all, which greatly improves the performance of the IO path.

We compared the latency of the qemu network-disk driver we used for cloud disks previously with the latency after switching to SPDK vhost. To compare only the performance of the virtualization IO path, the back end simply completes each IO as soon as it receives it:

1. Single queue (iodepth=1, numjobs=1)

Latency of the qemu network-disk driver:

[Figure: fio latency results with the qemu network-disk driver, single queue]

Latency of SPDK vhost:

[Figure: fio latency results with SPDK vhost, single queue]

With a single queue the drop in latency is very obvious: average latency fell from 130us to 7.3us.

2. Multi-queue (iodepth=128, numjobs=1)

Latency of the qemu network-disk driver:

[Figure: fio latency results with the qemu network-disk driver, multi-queue]

Latency of SPDK vhost:

[Figure: fio latency results with SPDK vhost, multi-queue]

IO latency under multi-queue is generally higher than under a single queue. In the multi-queue scenario the average latency also dropped, from 3341us to 1090us, about one third of the original.

3. SPDK Hot Upgrade

When we first started using SPDK, we found that it lacked an important feature: hot upgrade. Since we use SPDK and develop custom bdev devices on top of it, version upgrades are inevitable, and we cannot guarantee 100% that the SPDK process will never crash. Once the back-end SPDK process restarts or crashes, IO inside the front-end qemu gets stuck, and it cannot recover even after SPDK comes back up.

Studying SPDK's initialization carefully, we found that early in SPDK vhost startup qemu sends down some configuration information, which is lost when SPDK restarts. Does that mean that simply re-sending this configuration after a restart would let SPDK work normally again? We added an automatic reconnection mechanism to qemu which, once reconnection completes, re-sends the configuration in the original initialization order. Initial testing showed that recovery did work, but stricter stress testing revealed it only worked when SPDK exited normally; after an SPDK crash, IO was still stuck and could not recover. Judging from the symptoms, some IOs had never been processed, so the guest on the qemu side was waiting forever for them to return.

Digging deeper into the virtio vring mechanism, we found that when SPDK exits normally it guarantees that all IOs have been processed and completed before exiting, i.e. the virtio vring is left clean. After an unexpected crash there is no such guarantee: some IOs in the virtio vring have not been processed yet, so after SPDK recovers it has to scan the virtio vring and resubmit the unprocessed requests. What makes this tricky is that requests are fetched from the virtio vring in order, but they do not complete in that order.

Suppose the available ring of a virtio vring contains six IOs with indices 1, 2, 3, 4, 5 and 6. SPDK fetches them in order and submits them to the device concurrently, but it may happen that requests 1 and 4 have already completed and returned successfully, as shown in the figure below, while 2, 3, 5 and 6 have not. If SPDK crashes at this point, then after the restart the four IOs 2, 3, 5 and 6 must be resubmitted, while 1 and 4 must not be processed again: they have already completed and returned, and their memory may already have been freed. In other words, we cannot simply scan the available ring to decide which IOs need resubmitting. We need a piece of memory that records the state of every request in the virtio vring, so that after a restart we can use it to decide which IOs to resubmit, and this memory must survive an SPDK restart; the qemu process's memory is clearly the best place for it. So in qemu we allocate a block of shared memory for each virtio vring and send it to SPDK during initialization. While processing IO, SPDK records the state of every virtio vring request in this memory, and after recovering from a crash it uses that information to find the requests that need to be resubmitted.

[Figure: completion state of the six example requests in the virtio vring]
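The sketch below illustrates the idea with a per-vring state array placed in shared memory. The names and layout are ours for illustration only; newer upstream SPDK provides a comparable inflight-IO tracking mechanism for vhost, but this is not its actual code.

```c
#include <stdint.h>

#define QUEUE_DEPTH 128

struct virtqueue;                                           /* as sketched earlier */
void submit_request(struct virtqueue *vq, uint16_t head);   /* hypothetical helper */

enum req_state {
    REQ_FREE     = 0,   /* slot not in use                                 */
    REQ_INFLIGHT = 1,   /* taken from the avail ring but not yet completed */
};

/* Lives in shared memory allocated by qemu for each vring, so its contents
 * survive an SPDK restart or crash. */
struct inflight_region {
    uint8_t state[QUEUE_DEPTH];    /* indexed by desc head index */
};

/* Record state transitions while processing IO. */
static void mark_inflight(struct inflight_region *inf, uint16_t head)
{
    inf->state[head] = REQ_INFLIGHT;
}

static void mark_done(struct inflight_region *inf, uint16_t head)
{
    inf->state[head] = REQ_FREE;
}

/* After a crash-restart: resubmit only what was still in flight. Requests
 * that already completed (1 and 4 in the example above) stay untouched. */
static void resubmit_after_restart(struct inflight_region *inf, struct virtqueue *vq)
{
    for (uint16_t head = 0; head < QUEUE_DEPTH; head++) {
        if (inf->state[head] == REQ_INFLIGHT)
            submit_request(vq, head);
    }
}
```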

4. SPDK Online Migration

The virtualization IO path provided by SPDK vhost performs very well, so could we use it to replace our existing virtualization IO path? We did some research and found that SPDK is less complete than the existing qemu IO path in some respects, the most important being live migration. The lack of this feature was the biggest obstacle to replacing the existing IO path with SPDK vhost.

SPDK was designed primarily for network storage, so it supports migrating device state but not live migration of the data on a device. qemu itself supports live migration of both device state and device data, but not when vhost is used: with vhost, qemu only controls the device's control path, while the data path has been handed over to the SPDK back end. In other words, qemu is no longer on the device's data IO path and therefore does not know which parts of the device have been written.

After examining qemu's existing live-migration functionality, we concluded that this difficulty was not insurmountable, so we decided to develop a live-migration facility for vhost storage devices inside qemu.

Live migration of a block device is conceptually simple and splits into two steps. Step one copies the whole disk, from start to end, to the destination VM. Because this copy takes a long time, some already-copied data will inevitably be written again; the blocks dirtied in this way are marked in a bitmap and left to step two. Step two uses the bitmap to find the remaining dirty blocks and sends them to the destination; finally all IO is blocked and the last few dirty blocks are synchronized to the destination, at which point the migration is complete.
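The dirty tracking needed by step one can be as simple as one bit per block, as in the sketch below; the block size and helper names are illustrative, not those of our actual implementation.

```c
#include <stdint.h>

#define MIGRATION_BLOCK_SIZE (1024 * 1024)   /* dirty-tracking granularity, illustrative */

struct dirty_bitmap {
    uint64_t *bits;      /* one bit per MIGRATION_BLOCK_SIZE-sized block */
    uint64_t  nbits;
};

/* Mark every block touched by a write of 'len' bytes at 'offset' as dirty. */
static void bitmap_mark_dirty(struct dirty_bitmap *bm, uint64_t offset, uint64_t len)
{
    uint64_t first = offset / MIGRATION_BLOCK_SIZE;
    uint64_t last  = (offset + len - 1) / MIGRATION_BLOCK_SIZE;

    for (uint64_t b = first; b <= last && b < bm->nbits; b++)
        bm->bits[b / 64] |= 1ULL << (b % 64);
}

/* Used by the second migration phase: pick up a dirty block and clear its bit. */
static int bitmap_test_and_clear(struct dirty_bitmap *bm, uint64_t b)
{
    uint64_t mask  = 1ULL << (b % 64);
    int      dirty = (bm->bits[b / 64] & mask) != 0;

    bm->bits[b / 64] &= ~mask;
    return dirty;
}
```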

Live migration with SPDK follows the same principle. The complication is that qemu has no data IO path, so we developed a driver in qemu that provides a migration-only data IO path, and created a bitmap in memory shared between qemu and SPDK, protected by an inter-process mutex, to track the block device's dirty pages. Since SPDK is an independent process that might crash unexpectedly, we enabled the PTHREAD_MUTEX_ROBUST attribute on the pthread mutex we use, to prevent a deadlock after such a crash. The overall architecture is shown in the figure below:

[Figure: overall architecture of live migration with SPDK vhost]
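The robust, process-shared mutex that protects the bitmap is set up with standard pthreads calls, roughly as follows; the consistency handling after a crash is only hinted at in a comment.

```c
#include <pthread.h>
#include <errno.h>

/* Initialize a mutex that lives in memory shared by qemu and SPDK.
 * PTHREAD_MUTEX_ROBUST ensures a lock held by a crashed SPDK process
 * does not deadlock qemu forever. */
static int shared_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);

    int rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock that tolerates the previous owner having died while holding the lock. */
static int shared_mutex_lock(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);

    if (rc == EOWNERDEAD) {
        /* The previous owner crashed; the protected bitmap may need a
         * consistency check here before marking the mutex usable again. */
        pthread_mutex_consistent(m);
        rc = 0;
    }
    return rc;
}
```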

5. Trying Out io_uring with SPDK

io_uring is a relatively new kernel technology, merged upstream only in kernel 5.1 and later. It optimizes the existing aio family of system calls by sharing memory between user space and the kernel, so that submitting IO no longer requires a system call every time. This reduces system-call overhead and thus delivers higher performance.
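For reference, a minimal stand-alone liburing read looks roughly like the following; this is independent of the SPDK bdev code and uses an illustrative file path.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    if (io_uring_queue_init(32, &ring, 0) < 0) {   /* 32-entry SQ/CQ pair */
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("/tmp/testfile", O_RDONLY);       /* illustrative path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Fill one submission queue entry and submit it: a 4 KiB read at offset 0. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);
    io_uring_submit(&ring);

    /* Wait for the completion and mark it as consumed. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```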

The recently released SPDK 19.04 already contains a bdev that supports io_uring, but the feature is only present in the code and not yet exposed; of course, we can modify SPDK to try it out.

First of all, the new SPDK version only includes the io_uring code and does not even enable it in the build by default, so we need to make a few changes:

1. Install the latest liburing library and modify SPDK's config file to enable compilation of io_uring;

2. Following the implementations of the other bdevs, add an RPC call for creating io_uring devices, so that io_uring bdevs can be created the same way as other bdev devices;

3. The latest liburing has replaced the io_uring_get_completion call with io_uring_peek_cqe, which must be paired with io_uring_cqe_seen, so we adjusted the io_uring code in SPDK accordingly to avoid the "io_uring_get_completion not found" error at compile time (a rough sketch of this change and the next one follows the list):

[Figure: completion-handling changes in the SPDK io_uring code]

4. Modify the open call to open the file in O_SYNC mode, which guarantees that data has reached stable storage before a write returns and is more efficient than calling fdatasync; we made the same change to the aio bdev and also added read-write mode:

[Figure: open() flag changes for the io_uring and aio bdevs]
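Since the original screenshots are not reproduced here, the two local changes from steps 3 and 4 amount to roughly the following. This is a hedged sketch in the style of the aio/uring bdev code, not the exact SPDK diff; the function names are ours.

```c
#include <fcntl.h>
#include <liburing.h>

/* Change 3: poll completions with io_uring_peek_cqe()/io_uring_cqe_seen()
 * instead of the removed io_uring_get_completion(). */
static int uring_reap_completions(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    int count = 0;

    while (io_uring_peek_cqe(ring, &cqe) == 0 && cqe != NULL) {
        /* hand cqe->res / cqe->user_data back to the bdev layer here */
        io_uring_cqe_seen(ring, cqe);   /* must be paired with the peek */
        count++;
    }
    return count;
}

/* Change 4: open the backing file read-write and O_SYNC so every completed
 * write is durable without an extra fdatasync() call. */
static int uring_open_file(const char *path)
{
    return open(path, O_RDWR | O_SYNC);
}
```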

With these modifications in place, the SPDK io_uring device can be created successfully, and we ran a performance comparison:

With the aio bdev:

[Figure: fio results with the aio bdev]

With the io_uring bdev:

[Figure: fio results with the io_uring bdev]

At peak performance, io_uring shows a clear advantage in both throughput and latency: IOPS improved by roughly 20% and latency dropped by roughly 10%. This result is in fact limited by the maximum performance of the underlying hardware; io_uring's own ceiling has not yet been reached.

6. Summary

With SPDK applied, the virtualization IO path is no longer a performance bottleneck, which has allowed UCloud's high-performance cloud disk products to get more out of the back-end storage. Of course, adopting a new technology is never completely smooth: we ran into a number of problems while using SPDK, and besides the work shared above we have also submitted a number of bug fixes to the SPDK community. SPDK is a fast-moving project, every release brings pleasant surprises, and there are many interesting features waiting for us to explore and use to further improve the performance of cloud disks and other products.


Origin: blog.51cto.com/13832960/2403168