NetEase Shufan uniquely optimizes the open source iSCSI server tgt to solve its single-thread performance problem

iSCSI is an important technology in modern enterprise storage systems. The open source iSCSI server tgt has a single-thread performance problem, and the existing optimization patches have had uneven effects and have not really solved it. This article describes how the NetEase Shufan storage team used a series of its own innovations to improve tgt performance significantly, and how the result is used in Curve, the open source cloud-native software-defined storage system, so that Curve block devices maintain high performance under different operating systems.

Background

1. CurveBS

Curve (github.com/opencurve/curve) is NetEase Shufan's open source, cloud-native distributed storage system. It comprises the block device system CurveBS and the file system CurveFS. CurveBS carries a large share of the workload, for example the cloud disks of KVM virtual machines and K8s PVs, which form the infrastructure of the cloud platform. CurveBS can also serve other business scenarios; for example, it can replace an enterprise SAN storage system. Because CurveBS uses Raft for replica consistency and the CopySet concept to repair damaged replicas automatically, its reliability and stability are sufficient to replace a traditional SAN, and it can even surpass a traditional SAN in scalability and cost performance.

2. iSCSI

Any discussion of traditional storage devices such as SAN has to mention SCSI. As a connection and transport protocol for external block devices, SCSI is the most widely used block device protocol. It was first proposed in 1979 as an interface technology for minicomputers and is now found everywhere on minicomputers, servers, and ordinary PCs. To transmit data blocks over TCP/IP, Cisco and IBM initiated the iSCSI protocol, which has been strongly supported by the major storage vendors. iSCSI carries the SCSI protocol over an IP network, enabling fast data access and backup over networks such as high-speed Ethernet. The iSCSI standard was approved by the IETF (Internet Engineering Task Force) on February 11, 2003. iSCSI builds on two long-established technologies, SCSI and TCP/IP, which gives it a solid foundation. An iSCSI-based storage system can provide SAN storage functions with little investment and can even use an existing TCP/IP network directly. This matters especially for non-Linux systems such as Windows: to use CurveBS as a Windows hard disk, CurveBS only needs to support the iSCSI protocol.

3. What is tgt

To support iSCSI, you need iSCSI server software. tgt (Linux target framework) is an open source iSCSI server. For details, see: https://github.com/opencurve/curve-tgt/blob/master/README

After comparing it with the other open source iSCSI servers in the industry, we found tgt to be the most suitable solution. It runs purely in user mode and does not depend on the operating system kernel, so there is no need to write different code for different kernel versions, which makes development much simpler.

When we developed the Curve block device server, we wanted more systems than just Linux to be able to use CurveBS block devices, so we modified tgt to let CurveBS provide iSCSI service. To do this we added a native CurveBS driver to tgt that reads and writes CurveBS directly through the curvebs part 1 interface. This bypasses the kernel block device layer and saves the system resources that calls into that layer would consume.

4. Problems encountered in using tgt

We observed that vanilla tgt uses a single main thread, running an epoll event loop, to process iSCSI commands, and that this same thread also handles the management plane's Unix domain socket. On a 10 Gbit/s or faster network, a single thread (that is, a single CPU) can no longer process iSCSI commands fast enough. When one thread serves multiple targets, a slightly higher request rate from several iSCSI initiators is enough to keep that thread 100% busy. Optimizing tgt was therefore put on the agenda. Industry peers in China have long said that tgt's single-threaded implementation has performance problems, and the sheepdog developers have also discussed how to optimize it. After carefully analyzing the tgt architecture, we found a way to optimize it. The method is the Curve team's own and does not draw on any external patches, because in our view those patches do not really solve the problem.

NetEase Shufan Optimization Strategy

1. Use multiple threads for epoll

The performance of modern CPUs still follows Moore's Law, but the way it is realized has changed: the clock frequency of a single core no longer increases, and the gains come from more physical CPU cores instead. We therefore need multiple epoll event loop threads, each responsible for processing the iSCSI commands on a certain number of socket connections, so that the processing power of multiple CPUs can be used.

2. Create an epoll thread for each target

In the implementation, to keep multiple targets that share one epoll thread from exceeding the capacity of a single CPU, we give each target its own epoll thread. The OS scheduler schedules the CPU time of these target epoll threads, so each target gets a fair share of the CPU. If there are enough CPUs, each target can be served by its own CPU, which greatly improves performance. Of course, with an even faster network and faster SSDs, there may still be cases where a single epoll thread cannot keep up with the requests on one iSCSI target. For now, however, this is the best solution available to us, because it does not require introducing other multithreading and locking mechanisms.
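The per-target event loop can be sketched as follows. This is a minimal illustration under stated assumptions, not tgt's code: `struct demo_target`, its fields, and the `demo_target_*` helpers are hypothetical names, and an eventfd stands in for an iSCSI connection socket. Each target owns a private epoll instance and a private thread, so the OS scheduler is free to run different targets on different CPUs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical per-target state; the real tgt structures differ. */
struct demo_target {
    int tid;          /* target id */
    int epfd;         /* this target's private epoll instance */
    int wake_fd;      /* eventfd standing in for an iSCSI connection */
    int handled;      /* events served by this target's thread */
    pthread_t thread;
};

/* Event loop for exactly one target: serve one event, then exit. */
static void *demo_target_loop(void *arg)
{
    struct demo_target *t = arg;
    struct epoll_event ev;
    uint64_t val;

    while (t->handled < 1) {
        int n = epoll_wait(t->epfd, &ev, 1, -1);
        if (n == 1 && read(ev.data.fd, &val, sizeof(val)) == sizeof(val))
            t->handled++;
    }
    return NULL;
}

/* Create the epoll instance and the thread for one target. */
static int demo_target_start(struct demo_target *t, int tid)
{
    struct epoll_event ev = { .events = EPOLLIN };

    assert(t != NULL);
    t->tid = tid;
    t->handled = 0;
    t->epfd = epoll_create1(0);
    t->wake_fd = eventfd(0, 0);
    if (t->epfd < 0 || t->wake_fd < 0)
        return -1;
    ev.data.fd = t->wake_fd;
    if (epoll_ctl(t->epfd, EPOLL_CTL_ADD, t->wake_fd, &ev) < 0)
        return -1;
    return pthread_create(&t->thread, NULL, demo_target_loop, t) ? -1 : 0;
}

/* Simulate one request arriving for this target. */
static int demo_target_kick(struct demo_target *t)
{
    uint64_t one = 1;
    return write(t->wake_fd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Wait for the target's thread; returns the number of events it served. */
static int demo_target_join(struct demo_target *t)
{
    int rc = pthread_join(t->thread, NULL);
    close(t->wake_fd);
    close(t->epfd);
    return rc ? -1 : t->handled;
}
```

A caller would start one `demo_target` per configured target, deliver work with `demo_target_kick`, and collect each thread with `demo_target_join`. Because no epoll instance is shared between targets, a busy target cannot stall the event loop of another.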

3. Management plane

The management plane remains compatible with the original tgt. Command line usage is unchanged, which reduces the support workload, and the original manuals and documents remain fully valid. The management plane is served on the program's main thread, which is also an epoll loop thread, exactly as in the original tgt. It is responsible for managing targets, LUNs, login/logout, discovery, sessions, connections, and so on. When an initiator connects to the iSCSI server, it is always served by the management plane thread first. If the connection eventually needs to create a session to access a target, the connection is migrated to the epoll thread of that target.

4. Locks on data structures

Each target has its own mutex, which protects the data structures inside the target. While the target epoll thread is running it holds the lock, so the thread can end any session or connection freely. The lock is released when the thread enters epoll_wait and taken again when epoll_wait returns. We modified the relevant code so that the target epoll thread never traverses the target list and only accesses the structures of the target it serves, so no target list lock is needed. The management plane also modifies the data structures in a target, for example when it moves a login connection or deletes a session or connection, and each of these operations must take the target lock. The management plane and the target epoll thread therefore exclude each other through this mutex, and the corresponding target can be accessed safely. Because the management plane is single-threaded and the target epoll threads never traverse the target list, the whole system needs no target list lock; the only locks we add are the per-target locks.
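The locking protocol described above can be sketched as follows. This is an illustration under assumptions rather than tgt's implementation: `struct locked_target`, `nr_sessions`, and the `mgmt_*` helpers are hypothetical names, and an eventfd is used only to wake the loop. The point is the shape of the loop: the target thread owns the lock whenever it is not blocked in `epoll_wait`, and the management plane takes the same lock before touching the target.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical target; field names are illustrative, not tgt's. */
struct locked_target {
    pthread_mutex_t lock; /* protects nr_sessions and stop */
    int epfd;             /* private epoll instance */
    int wake_fd;          /* eventfd used to wake the loop */
    int nr_sessions;      /* stand-in for the session list */
    int stop;             /* set by the management plane */
    pthread_t thread;
};

/* Target event loop: the lock is held except while blocked in epoll_wait. */
static void *locked_target_loop(void *arg)
{
    struct locked_target *t = arg;
    struct epoll_event ev;
    uint64_t val;

    pthread_mutex_lock(&t->lock);
    while (!t->stop) {
        pthread_mutex_unlock(&t->lock);   /* let the management plane in */
        int n = epoll_wait(t->epfd, &ev, 1, -1);
        pthread_mutex_lock(&t->lock);     /* own the target again */
        if (n == 1 && read(t->wake_fd, &val, sizeof(val)) != sizeof(val))
            break;                        /* unexpected read failure */
    }
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

static int locked_target_start(struct locked_target *t)
{
    struct epoll_event ev = { .events = EPOLLIN };

    assert(t != NULL);
    t->nr_sessions = 0;
    t->stop = 0;
    t->epfd = epoll_create1(0);
    t->wake_fd = eventfd(0, 0);
    if (t->epfd < 0 || t->wake_fd < 0)
        return -1;
    ev.data.fd = t->wake_fd;
    if (pthread_mutex_init(&t->lock, NULL) != 0 ||
        epoll_ctl(t->epfd, EPOLL_CTL_ADD, t->wake_fd, &ev) < 0)
        return -1;
    return pthread_create(&t->thread, NULL, locked_target_loop, t) ? -1 : 0;
}

/* Management plane: modify target data only under the target lock. */
static int mgmt_add_session(struct locked_target *t)
{
    int count;

    pthread_mutex_lock(&t->lock);
    count = ++t->nr_sessions;
    pthread_mutex_unlock(&t->lock);
    return count;
}

/* Management plane: ask the loop to stop, wake it, and wait for it. */
static int mgmt_stop(struct locked_target *t)
{
    uint64_t one = 1;

    pthread_mutex_lock(&t->lock);
    t->stop = 1;
    pthread_mutex_unlock(&t->lock);
    if (write(t->wake_fd, &one, sizeof(one)) != sizeof(one))
        return -1;
    return pthread_join(t->thread, NULL) ? -1 : 0;
}
```

Because the loop spends most of its time inside `epoll_wait` with the lock released, management calls such as `mgmt_add_session` normally acquire the lock without waiting. No global list lock appears anywhere in the sketch, which mirrors the design choice described in the text.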

5. A connection establishes a session

When login_finish succeeds, it creates a session if none exists yet. login_finish also records the iSCSI target to migrate to in the migrate_to field of the connection structure.

6. The connection is added to the session

Usually a new connection creates a new session, as in the login_finish case above. However, iSCSI allows multiple connections in one session, and in that case the connection is added directly to the existing session. This is done by login_security_done.

7. When to do connection migration

When the call returns to iscsi_tcp_event_handler, the handler sees that login_finish has set the migrate_to target. It locks that iSCSI target structure and inserts the connection's fd into the target's event loop, which completes the migration.
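The migration step can be sketched as moving a file descriptor from one epoll instance to another while holding the destination target's lock. This is an assumption-laden illustration, not the code of iscsi_tcp_event_handler: `struct mig_target`, `struct mig_conn`, `mig_conn_migrate`, and `mig_ready` are hypothetical names. Only the `migrate_to` field name comes from the text.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical target: a lock plus the target's own epoll instance. */
struct mig_target {
    pthread_mutex_t lock;
    int epfd;
};

/* Hypothetical connection: its socket and the pending migration target. */
struct mig_conn {
    int fd;
    struct mig_target *migrate_to; /* set by the login step, NULL if none */
};

/*
 * Move conn->fd from the main event loop to the loop of conn->migrate_to.
 * Returns 1 if migrated, 0 if no migration was pending, -1 on error.
 */
static int mig_conn_migrate(int main_epfd, struct mig_conn *conn)
{
    struct mig_target *dst = conn->migrate_to;
    struct epoll_event ev = { .events = EPOLLIN };
    int rc;

    if (dst == NULL)
        return 0;

    /* The main loop stops watching the connection. */
    if (epoll_ctl(main_epfd, EPOLL_CTL_DEL, conn->fd, NULL) < 0)
        return -1;

    /* Register with the target's loop under the target lock. */
    ev.data.ptr = conn;
    pthread_mutex_lock(&dst->lock);
    rc = epoll_ctl(dst->epfd, EPOLL_CTL_ADD, conn->fd, &ev);
    pthread_mutex_unlock(&dst->lock);
    if (rc < 0)
        return -1;

    conn->migrate_to = NULL; /* migration done */
    return 1;
}

/* Helper for experiments: number of ready events, without blocking. */
static int mig_ready(int epfd)
{
    struct epoll_event ev;
    return epoll_wait(epfd, &ev, 1, 0);
}
```

After a successful call, data arriving on the connection wakes only the target's event loop, so all later iSCSI commands on that connection are processed by the target's thread rather than by the management thread.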

8. Set pthread name

Each target's event loop thread is named tgt/n, where n is the target id, so that tools such as top make it easy to see which target is consuming CPU.
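A minimal sketch of the naming step, assuming Linux. The article does not show which call tgt uses; this sketch uses `prctl(PR_SET_NAME)`, which names the calling thread and is what `top -H` displays. The helper names `name_target_thread` and `current_thread_name` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

#define THREAD_NAME_MAX 16 /* Linux limit, including the trailing NUL */

/* Name the calling thread "tgt/<tid>"; returns 0 on success. */
static int name_target_thread(int tid)
{
    char name[THREAD_NAME_MAX];
    int len = snprintf(name, sizeof(name), "tgt/%d", tid);

    if (len < 0 || (size_t)len >= sizeof(name))
        return -1; /* would not fit in the kernel's name buffer */
    return prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0);
}

/* Read back the calling thread's name; buf must hold at least 16 bytes. */
static int current_thread_name(char *buf)
{
    return prctl(PR_GET_NAME, (unsigned long)buf, 0, 0, 0);
}
```

Each target event loop thread would call `name_target_thread` once at startup with its own target id. Running `top -H` then lists the threads by these names.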

9. An implementation example

If the management plane wants to delete a target, the following code illustrates the process:

/* called by mgmt */
tgtadm_err tgt_target_destroy(int lld_no, int tid, int force)
{
        struct target *target;
        struct acl_entry *acl, *tmp;
        struct iqn_acl_entry *iqn_acl, *tmp1;
        struct scsi_lu *lu;
        tgtadm_err adm_err;

        eprintf("target destroy\n");

        /*
         * The control plane is single-threaded and the SCSI I/O threads
         * never delete a target, so no lock is needed to look it up.
         */

        target = target_lookup(tid);                                  
        if (!target)                                            
                return TGTADM_NO_TARGET;

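        /*
         * Lock the target: we are about to delete its data structures, so
         * they must not be shared with the iSCSI I/O thread. This can only
         * proceed once the target event loop thread has released the lock.
         */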

        target_lock(target);                                            
        if (!force && !list_empty(&target->it_nexus_list)) {
                eprintf("target %d still has it nexus\n", tid);
                target_unlock(target);                 
                return TGTADM_TARGET_ACTIVE;
        }        
 …
        /* The steps above released all resources, so drop the lock */
        target_unlock(target);                                               
        if (target->evloop != main_evloop) {
                /* Tell the target's evloop to stop and wait for its thread to exit */
                tgt_event_stop(target->evloop);                         
                if (target->ev_td != 0)                                 
                        pthread_join(target->ev_td, NULL);
                /* Now clean up the evloop's resources */
                work_timer_stop(target->evloop);                      
                lld_fini_evloop(target->evloop);
                tgt_destroy_evloop(target->evloop);
        }

Optimization effect

1. Performance comparison

We configured three disks for tgt: one CurveBS volume and two local disks.

<target iqn.2019-04.com.example:curve.img01>
    backing-store cbd:pool//iscsi_test_
    bs-type curve
</target>

<target iqn.2019-04.com.example:local.img01>
    backing-store /dev/sde
</target>

<target iqn.2019-04.com.example:local.img02>
    backing-store /dev/sdc
</target>

Log in to the iSCSI targets from the same machine:

iscsiadm --mode node --portal 127.0.0.1:3260 --login

Then run fio against the block devices that the iSCSI login creates, using the following job file:

[global]
rw=randread
direct=1
iodepth=128
ioengine=aio
bsrange=16k-16k
runtime=60
group_reporting

[disk01]
filename=/dev/sdx

[disk02]
filename=/dev/sdy
size=10G

[disk03]
filename=/dev/sdz
size=10G

The test results are as follows.

Before the optimization, tgt reached 38.8K IOPS in the fio test.

With the multi-threaded optimization, tgt reached 60.9K IOPS.

That is about 57% more IOPS in this test, a substantial improvement. As the number of targets grows, the additional targets can use CPU cores that would otherwise sit idle.

2. Windows test

The system has been preliminarily tested on Windows and runs well. For details on how to configure an iSCSI client on Windows, see: https://jingyan.baidu.com/article/e4511cf37feade2b845eaff8.html

Note that when CHAP authentication is configured on Windows, the password must be 12 to 16 characters long and must match the one set in tgt. If the password is outside this length range, Windows reports confusing iSCSI errors.
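A small check like the following could catch a bad CHAP password before it is configured on both sides. It is a hypothetical helper, not part of tgt or Windows, and it assumes an ASCII password so that bytes and characters coincide.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CHAP_SECRET_MIN 12 /* shortest password the Windows initiator accepts */
#define CHAP_SECRET_MAX 16 /* longest password the Windows initiator accepts */

/* Return true if the (ASCII) secret has an acceptable length. */
static bool chap_secret_len_ok(const char *secret)
{
    size_t n;

    if (secret == NULL)
        return false;
    n = strlen(secret);
    return n >= CHAP_SECRET_MIN && n <= CHAP_SECRET_MAX;
}
```

The same password string must then be used both in the tgt account configuration and in the Windows iSCSI initiator.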

tgt and Curve

We provide a driver that lets tgt access CurveBS (see doc/README.curve for details), so users can use CurveBS block storage on any OS that supports iSCSI, such as Windows or Mac.

We also added expansion awareness to tgt. A CurveBS block device can be expanded, but the original tgt determines the size of a block device only when the LUN is created and cannot change it afterwards. We added code so that the tgtadm command can notify tgt of the new size. The iSCSI client can then re-read the size of the LUN to complete the expansion. For details on how to re-read the size of the LUN, see doc/README.curve.

Summary

This article described the problems the NetEase Shufan Curve team encountered when using tgt, the team's own optimization of tgt and the results it achieved, and how tgt is used with CurveBS. The iSER target service is still served by the main thread. Because RDMA is not yet widely deployed, this part of the code has not been modified. As RDMA environments become more common, we expect to optimize it further.

Learn more

  1. Curve project homepage: http://www.opencurve.io/
  2. tgt usage documentation: https://github.com/opencurve/curveadm/wiki/others#deploy-tgt
  3. NetEase Shufan experts present the tgt optimization and other Curve-related technologies at the Curve community's biweekly online meeting. You are welcome to scan the following QR code on WeChat to join the Curve user group, where you can find the schedule and topics of the biweekly meeting and other news.

The author of this article: NetEase Shufan Curve team


Origin my.oschina.net/u/4565392/blog/5450500