"In-depth explanation of SSD: solid-state storage core technology, principle and actual combat"----Learning Records (2)

Chapter 2 SSD Controller and All-Flash Array

An SSD is mainly composed of two modules: the controller (main control) and the flash media; in addition there is an optional cache unit. The controller is the brain of the SSD, responsible for command handling, computation, and coordination. Concretely it does three things: first, it communicates with the host over a standard host interface; second, it communicates with the flash memory; and third, it runs the FTL algorithm inside the SSD. The quality of the controller chip therefore directly determines the performance, lifespan, and reliability of the SSD.

2.1 SSD system architecture

As a data storage device, an SSD is really a typical stand-alone System on Chip (SoC). It contains a main control CPU, RAM, hardware accelerators, buses, data encoding/decoding modules, and more; see Figure 2-1. What it operates on are protocols, data commands, and the storage media, and the purpose of all this is to write and read user data.

Figure 2-1 Hardware diagram of the SSD main control module (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

Figure 2-1 is only a schematic view of the SSD system architecture. This controller uses an ARM CPU and is divided mainly into two parts: the front end and the back end. The front end (Host Interface Controller) talks to the host; the interface can be SATA, PCIe, SAS, etc. The back end (Flash Controller) talks to the flash memory and performs data encoding/decoding, i.e. ECC. In addition there are a buffer (Buffer) and DRAM. The modules are interconnected through the high-speed AXI bus and the low-speed APB bus to exchange control information and data. On top of this hardware, SSD firmware developers build the firmware that ties everything together: it implements the functions the SSD product requires, schedules each hardware module, and moves data between the host and the flash memory for writes and reads.

2.1.1 Front end

Host interface: the standard protocol interface used to communicate (exchange data) with the host; today this is mainly SATA, SAS, or PCIe. Table 2-1 lists the interface rates of the three.

Table 2-1 SATA, SAS, and PCIe interface rates (table source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

SATA stands for Serial Advanced Technology Attachment; it is an industry-standard serial hard drive interface, as shown in Figure 2-2.

Figure 2-2 SATA interface (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

SAS (Serial Attached SCSI) is the next generation of SCSI technology. Like the popular Serial ATA (SATA) interface, it uses serial signaling to obtain higher transfer speeds and to free up internal space with thinner cabling, as shown in Figure 2-3.

Figure 2-3 SAS port (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

PCIe (Peripheral Component Interconnect Express) is a high-speed serial computer expansion bus standard. Its main advantage is its high data transfer rate: the highest version at the time of writing, 4.0, reaches about 2GB/s per lane in one direction, as shown in Figures 2-4 and 2-5.

Figure 2-4 PCIe interface card (AIC) (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

Figure 2-5 U.2 interface (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

The front end is the interface responsible for communication between the host and the SSD. Commands and data flow into or out of the SSD through the front-end bus.

  • From the hardware point of view, the front end has a SATA/SAS/PCIe PHY layer, commonly called the physical layer, which receives the serial bit stream and converts it into digital signals for the modules behind it. Those modules handle the NVMe/SATA/SAS commands: they receive and process commands and data one by one, using DMA whenever data has to be moved. Command information is generally placed into queues, while data is placed in fast SRAM. If encryption or compression is required, the front end has dedicated hardware modules for it, because software alone cannot keep up with the throughput demands of compression and encryption and would become a performance bottleneck
  • From the protocol point of view, take a SATA Write FPDMA command as an example. The host file system issues a write request; the request reaches the AHCI controller registers in the motherboard's south bridge, which then carries out the write (the details of the path from the file system to AHCI are skipped here). On the SSD's front-end bus, the write appears as the following interactive sequence, shown in Figure 2-6 and sketched in code after the step list below

Figure 2-6 SATA Write FPDMA command protocol processing steps (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

  • Step 1: The host issues the Write FPDMA command FIS (Frame Information Structure, the packet SATA uses to transfer data blocks asynchronously) on the bus
  • Step 2: After receiving the command, the SSD checks whether its internal write buffer (Write Buffer) has room for new data. If it does, it sends a DMA Setup FIS to the host; if not, it sends nothing and the host waits (this is flow control)
  • Step 3: After receiving the DMA Setup FIS, the host sends a Data FIS of at most 8KB to the device
  • Step 4: Steps 2 and 3 repeat until all the data has been sent
  • Step 5: The device (SSD) sends a Status FIS to the host, indicating that at the protocol level this write command has completed all its operations. The status can be good or bad/error, i.e. the Write FPDMA command completed either normally or abnormally
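To make the flow control in steps 2-4 concrete, here is a toy model of the exchange. It is not a real SATA stack, and drain_buffer_to_flash is a made-up placeholder for the back end freeing buffer space.

```python
# Toy model of the Write FPDMA exchange above: the device paces the host via
# DMA Setup FIS, i.e. flow control.

MAX_DATA_FIS = 8 * 1024  # a single Data FIS carries at most 8KB

def write_fpdma(host_data: bytes, write_buffer_free: int) -> str:
    sent = 0
    while sent < len(host_data):
        if write_buffer_free <= 0:           # Step 2: no room -> host simply waits
            write_buffer_free = drain_buffer_to_flash()
            continue
        # Step 2/3: device sends DMA Setup FIS, host answers with one Data FIS
        chunk = min(MAX_DATA_FIS, len(host_data) - sent, write_buffer_free)
        write_buffer_free -= chunk
        sent += chunk                        # Step 4: repeat until all data moved
    return "Status FIS: good"                # Step 5: command completes at protocol level

def drain_buffer_to_flash() -> int:
    # Placeholder: in a real SSD the FTL/back end frees buffer space here.
    return MAX_DATA_FIS

print(write_fpdma(b"x" * (20 * 1024), write_buffer_free=8 * 1024))
```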

After the SSD has received the commands and data and placed them in its internal buffer, the front-end firmware must parse the commands and hand tasks to the mid-end FTL. The Command Decoder parses each command FIS into elements the firmware and the FTL (Flash Translation Layer) can understand:

  • What kind of command is this? Is the command attribute read or write?
  • The starting LBA and data length of this write command
  • Other attributes of this write command, such as whether it is an FUA command, and whether its LBA is contiguous with the previous command (i.e. a sequential command vs. a random one)

Once a command has been parsed, it is placed into the command queue to wait for the mid-end FTL to process it. With the two key pieces of information, starting LBA and data length, the FTL can map the LBA space precisely onto the physical space of the flash memory. At this point the front-end hardware and firmware have finished their part of the job
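A minimal sketch of what such a decoded command might look like to the FTL. The field names and the dictionary-shaped FIS are illustrative assumptions, not a real firmware interface; 0x61 is the ATA opcode for Write FPDMA Queued.

```python
from dataclasses import dataclass

SECTOR = 512  # bytes per LBA

@dataclass
class DecodedCommand:
    opcode: str        # "READ" or "WRITE"
    start_lba: int
    length: int        # in sectors
    fua: bool          # Force Unit Access requested?
    sequential: bool   # contiguous with the previous command's LBA range?

def decode(fis: dict, prev_end_lba: int) -> DecodedCommand:
    return DecodedCommand(
        opcode="WRITE" if fis["cmd"] == 0x61 else "READ",  # 0x61 = Write FPDMA Queued
        start_lba=fis["lba"],
        length=fis["count"],
        fua=bool(fis.get("fua", False)),
        sequential=(fis["lba"] == prev_end_lba),
    )

cmd = decode({"cmd": 0x61, "lba": 1000, "count": 8}, prev_end_lba=1000)
print(cmd, "->", cmd.length * SECTOR, "bytes for the FTL to map")
```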

2.1.2 Main CPU

The SSD controller SoC is not essentially different from any other embedded SoC. It generally consists of one or more CPU cores, plus on-chip I-RAM, D-RAM, PLL, IO, UART, high- and low-speed buses, and other peripheral circuits. The CPU handles computation and system scheduling, the IO blocks handle the necessary input and output, and the buses connect the front-end and back-end modules

What we call firmware runs on the CPU cores, which have a code storage area (I-RAM) and a data storage area (D-RAM). With a multi-core CPU, the software can be organized as symmetric multiprocessing (SMP) or asymmetric multiprocessing (AMP). In SMP, all cores share the same OS and execution code; in AMP, each core executes its own code. In the former, the cores share one copy of I-RAM and D-RAM; in the latter, each core has its own copy of I-RAM and D-RAM and runs independently, so execution never slows down because of memory contention. When the SSD needs more computing power, beyond adding cores or raising the per-core frequency, an AMP design suits independent computation and tasks better, since it removes the slowdowns caused by contention for code and data resources

The firmware is designed around the number of CPU cores, and exploiting the full computing power of a multi-core CPU is one aspect of that design. The firmware also has to divide the work, distributing tasks across the CPUs so that processing is parallel and the load is reasonably balanced, rather than some CPUs being busy while others sit idle. This is an important consideration in the firmware architecture; the goal is to let the SSD deliver its maximum read and write performance

The CPU's peripheral modules in an SSD include UART, GPIO, and JTAG, which are essential debug ports, plus timers (Timer) and other internal modules such as DMA engines, temperature sensors, and power regulation modules

2.1.3 Backend

The two modules at the back end are the ECC module and the flash memory controller, as shown in Figure 2-7

Figure 2-7 ECC module and flash controller in the SSD (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

The ECC module is the data encoding and decoding unit. Because flash memory has an inherent bit error rate, ECC protection must be added to the raw data during writes to keep the data accurate; this is the encoding step. On reads, the decoding step detects and corrects errors. If the number of erroneous bits exceeds the ECC correction capability, the data is returned to the host as an "uncorrectable error". This encoding and decoding is done entirely by the ECC module. The ECC algorithms used in SSDs are mainly BCH and LDPC, with LDPC gradually becoming the mainstream
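The flow, if not the mathematics, can be sketched with a toy code. The example below uses 3x repetition with majority voting purely to show the encode-on-write / decode-on-read path; real SSDs use BCH or LDPC, which are far stronger and can also flag failures beyond their correction limit so the drive can report an "uncorrectable error" to the host.

```python
def ecc_encode(data_bits):
    """Write path: add redundancy before the bits go to flash (3x repetition)."""
    return [b for bit in data_bits for b in (bit, bit, bit)]

def ecc_decode(coded_bits):
    """Read path: majority vote corrects a single flipped bit per triplet."""
    return [1 if sum(coded_bits[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded_bits), 3)]

page = [1, 0, 1, 1]
stored = ecc_encode(page)
stored[4] ^= 1                     # one bit flips inside the flash cells
assert ecc_decode(stored) == page  # the error is corrected transparently
```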

The flash controller manages data movement between the cache and the flash memory, using flash commands that comply with the ONFI or Toggle standards

From the perspective of a single flash chip, a Die/LUN is the basic unit that executes flash commands. The pins connecting the flash controller and the flash memory work as follows, as shown in Figure 2-8

Figure 2-8 Flash chip interface (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

  • External interface: 8 IO interfaces, 5 enable signals (ALE, CLE, WE#, RE#, CE#), 1 status pin (R/B#), 1 write protection pin (WP#)
  • Commands, addresses, and data are input and output through 8 IO interfaces
  • When writing commands, addresses, and data, it is necessary to pull down the WE# and CE# signals at the same time, and the data is latched at the rising edge of WE#
  • CLE and ALE are used to distinguish whether data or address is transmitted on the IO pin

From the flash controller's point of view, multiple flash Dies/LUNs have to operate concurrently to meet performance requirements, so multiple channels are usually provided. How many Dies/LUNs hang off one channel depends on the SSD's capacity and performance targets: the more Dies/LUNs, the more concurrency and the better the performance.
The Die/LUN is the smallest management unit of flash communication, and each channel bus consists of the signals listed above: 8 I/O lines, 5 enable signals (ALE, CLE, WE#, RE#, CE#), 1 status pin (R/B#), and 1 write-protect pin (WP#)

When multiple flash Dies/LUNs share one channel, they all share that channel's bus, and the flash controller selects which Die it is talking to with the chip-enable signal CE#. Before sending read/write commands and data to a Die at a particular address, the controller first asserts that Die's CE#, then sends the commands and data. A channel can carry several CE lines; SSD controllers are generally designed with 4 to 8 per channel, which gives some flexibility in capacity configurations.
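A sketch of that die-selection logic, with an invented Channel class standing in for the controller's channel hardware (the structure is illustrative, not a real controller register map):

```python
class Channel:
    def __init__(self, num_ce):
        self.ce = [1] * num_ce            # CE# is active low: 1 = not selected

    def select(self, die):
        self.ce = [1] * len(self.ce)
        self.ce[die] = 0                  # assert only the target die's CE#

    def program_page(self, die, block, page, data):
        self.select(die)
        # ...then issue the 80h/10h program command, the row/column address cycles
        # and the data over the shared IO[7:0] bus; only the selected die latches them.
        return f"die {die}: program block {block}, page {page}, {len(data)} bytes"

ch = Channel(num_ce=4)                    # controllers typically wire 4-8 CE# per channel
print(ch.program_page(die=2, block=17, page=3, data=b"\xff" * 16384))
```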

2.2 SSD controller vendors

The SSD controller is a chip product with great technical depth and a broad market

2.2.1 Marvell controllers

Marvell leads in high-end SoC design. It has built technical barriers ahead of its competitors through sophisticated SoC architectures, leading error-correction technology, interface technology, low power consumption, and other advantages

2.2.2 Samsung controllers

Samsung's controllers are used essentially only in Samsung's own SSDs. The 830 series uses the MCX controller, the 840 and 840 Pro use MDX, the 850 Pro/840 EVO use MEX, the 750 EVO uses MGX, and the 650 uses MFX

2.3 Case study: the SiliconGo SG9081 controller

This section takes the SG9081, a SATA 3.2 SSD controller from the Chinese controller vendor SiliconGo, as an example to analyze how a controller achieves high performance. Figure 2-14 is a block diagram of the SG9081

Figure 2-14 Structural block diagram of the SG9081 controller (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

1. HAM+GoCache accelerates random read and write IOPS

HAM stands for hardware acceleration module. Besides the MCU, the SSD controller contains a HAM that implements part of the algorithmic processing in hardware, which both frees up MCU resources and speeds up the algorithms, especially for small-data workloads. In addition, GoCache (a technology unique to SiliconGo) is integrated into the controller to manage the mapping tables efficiently, which further improves small-data transfer capability. Together the two raise the performance of the finished SSD module

2. DMAC accelerates sequential reads and writes

DMAC stands for Direct Memory Access Controller. With this module, the SSD does not need to tie up MCU resources during large, continuous data transfers. When a DMA request is issued, the DMAC takes control of the internal bus arbitration and the high-speed transfer begins; while it runs, the MCU can handle other work, and when the transfer finishes the DMAC releases the bus back to the MCU. This mechanism greatly improves the efficiency of SSD read and write operations, which shows up as excellent sequential read/write performance
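Conceptually the hand-off looks like the toy model below, where a thread plays the role of the DMAC and the main loop plays the MCU. This only illustrates the idea, it is not how the SG9081 is implemented.

```python
import threading, time

done = threading.Event()

def dmac_transfer(src: bytearray, dst: bytearray):
    """Stand-in for the DMAC: it 'owns the bus' while copying, then signals completion."""
    dst[:] = src
    done.set()                      # completion interrupt back to the MCU

src, dst = bytearray(b"a" * (1 << 20)), bytearray(1 << 20)
threading.Thread(target=dmac_transfer, args=(src, dst)).start()

while not done.is_set():            # meanwhile the MCU handles other work,
    time.sleep(0.001)               # e.g. parsing the next host command
print("transfer complete:", bytes(dst[:4]))
```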

3. LDPC + RAID improve reliability, flash endurance, and data retention

Flash memory is currently shifting from 2D to 3D architectures, and its error-correction requirements keep rising; early BCH codes can no longer meet the needs of flash built on advanced process nodes. The SG9081 implements its ECC with LDPC. For the same amount of user data, an LDPC code can correct more errors than a BCH code, which also extends the flash's service life. The RAID function adds a second layer of insurance: it can be understood as an extra layer of parity protection over the data, from which the original data can be rebuilt when necessary. Together, LDPC and RAID greatly improve data stability
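The RAID idea can be illustrated with simple XOR parity across a stripe of dies: if ECC gives up on one member, the parity rebuilds it. This is a generic sketch, not SiliconGo's actual scheme.

```python
from functools import reduce

def xor_parity(chunks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

stripe = [b"\x11" * 4, b"\x22" * 4, b"\x33" * 4]    # data written to three dies
parity = xor_parity(stripe)                         # written to a fourth die

lost_index = 1                                      # die 1 returns an uncorrectable error
survivors = [c for i, c in enumerate(stripe) if i != lost_index] + [parity]
assert xor_parity(survivors) == stripe[lost_index]  # parity recovers the lost chunk
```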

2.4 Case study: a unified controller design for enterprise and consumer requirements

SSDs are divided into enterprise grade and consumer grade. Enterprise SSDs care more about random performance, latency, IO QoS guarantees, and stability; consumer SSDs care more about sequential performance, power consumption, price, and so on, as shown in Table 2-3

Table 2-3 Comparison of enterprise-level and consumer-level SSDs (table source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

Can a single SSD controller meet both enterprise and consumer needs? The question is whether cost, power consumption, and functionality can be unified in one controller hardware architecture

  • 1) Cost: enterprise SSDs are less sensitive to controller cost, so a unified controller mainly has to fit the cost budget of consumer SSDs. Adopting a common hardware architecture and trimming hardware resource overhead keeps the controller cost in check, while differentiated firmware meets the different performance requirements of enterprise and consumer products
  • 2) Performance: as the market has matured, NVMe U.2 and M.2 SSDs have gradually become mainstream, and the performance requirements of the two product classes have converged. As a replacement for the add-in-card (AIC) form factor, a 1U server typically carries 8 or more U.2 SSDs, so a 4KB random performance of 300-400 KIOPS per U.2 drive satisfies most applications. On the consumer side, NVMe M.2 SSDs for high-end gaming platforms already reach a theoretical 3.5GB/s, similar to the sequential IO of some enterprise SSDs. Some Internet companies have even deployed M.2 SSDs in their IDC data centers: the upper software layers heavily optimize the data flow and write to the SSD in a largely sequential pattern, which lowers the demand for enterprise-class random performance.
  • 3) Lifespan: the endurance requirements of enterprise and consumer SSDs differ considerably, but the main factor limiting SSD life is the endurance of the flash itself, and what the controller contributes is strong error correction for that flash. In this respect the design goals of enterprise and consumer controllers are the same
  • 4) Capacity: enterprise and consumer SSDs differ greatly in capacity. The controller needs to support large flash capacities at relatively low cost so that it can cover both markets
  • 5) Reliability: enterprise SSDs generally require two layers of data protection, ECC and die-level RAID. As 3D flash spreads, flash vendors have begun recommending die-RAID on consumer SSDs as well, so the reliability design goals of the two classes of controllers are converging.
  • 6) Power consumption: consumer products are the most power-sensitive, especially battery-powered devices such as tablets and laptops, which impose strict limits. The controller must therefore be designed for complex low-power requirements, supporting multiple power states and fast wake-up. Enterprise SSDs are relatively insensitive to power, but electricity already accounts for nearly 20% of a data center's operating cost, so with the large-scale deployment of SSDs, low-power design has become a goal for enterprise controllers as well

From the points above it is clear that, as the design targets of enterprise and consumer SSDs converge, a unified hardware specification is quite feasible, with product differentiation expressed through the firmware running on the controller. Starchip Technology's STAR1000 chip is a fairly successful attempt at such a design, as shown in Figure 2-15

Figure 2-15 STAR1000 key technologies (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

2.5 Case study: the DERA NVMe controller TAI and NVMe SSD products

The NVMe protocol is designed around modern multi-core computing systems; it exploits the high concurrency and low latency of NVM media and lays a solid ecosystem foundation for high-throughput, low-latency storage devices. DERA Storage follows the protocol standards and develops high-performance, highly reliable NVMe SSD solutions for the enterprise computing market

The controller is the core component of an NVMe SSD, the bridge between the host bus and the flash memory. An NVMe SSD essentially has to process a very large number of IO transactions with high concurrency, and each transaction involves a variety of hardware operations and event handling. Some features also require computationally intensive work, such as error detection and correction coding/decoding or data encryption/decryption, all under strict power constraints, so dedicated hardware acceleration units are unavoidable. In general, an NVMe SSD controller is a highly customized ASIC (Application Specific Integrated Circuit) designed hand in hand with the NAND flash management software; only when the data paths and computing resources are properly arranged and allocated can the resulting NVMe SSD reach a well-balanced combination of reliability, performance, and power consumption

The DERA NVMe controller is the core component of DERA's NVMe SSD products, and TAI is DERA's first controller, as shown in Figure 2-16. The TAI front end supports a PCIe Gen3 x8 or x4 interface; it integrates multiple NAND interface channels and a high-strength ECC hardware codec, and every data path is protected by multiple hardware mechanisms (ECC and CRC). On top of the TAI controller, a tightly coupled Flash Translation Layer (FTL) handles scheduling and management, combining a range of techniques to reach enterprise-class storage reliability while fully exploiting the high-speed access of NAND flash, thereby meeting requirements for high reliability, low latency, and high throughput

Figure 2-16 DERA TAI controller (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

Flash ECC is a core SSD function. To cope with the high raw bit error rate of flash chips built on new structures and new process nodes, and to satisfy the low-latency demands of highly concurrent access, the DERA TAI controller provides an independent ECC unit for each flash channel, with a correction capability of 100 bits per 1KB. This meets the correction strength mainstream flash devices require of a controller while striking a good balance between complexity, area, power consumption, and the determinism and controllability of decoding latency. In addition, TAI's ECC protection and CRC checking across the full data path provide a further baseline guarantee of data reliability without hurting performance

DERA SSDs include complete hardware for continuously monitoring the power supply. When the supply becomes abnormal, a protection strategy is triggered that automatically switches to backup capacitors or another uninterruptible power source; working with the overall software strategy, this maximizes the integrity of user data in the event of an unexpected power failure

2.6 All Flash Array AFA

2.6.1 Overall anatomy

1 Structure

Figure 2-17 shows a standard XtremIO all-flash array, which contains two X-Bricks interconnected via Infiniband. Clearly the X-Brick is the core building block

Figure 2-17 XtremIO all-flash array structure (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

An X-Brick includes:

  • 1 advanced UPS power supply
  • 2 storage controllers
  • A disk array enclosure (DAE) holding many SSDs, each connected to the storage controllers over SAS
  • If the system has multiple X-Bricks, then two Infiniband switches are required to realize high-speed interconnection of storage controllers

2 Storage controllers

As shown in Figure 2-18, the storage controller is actually an Intel server with dual power supplies. Inside are two independent NUMA-node Intel E5 CPUs, two Infiniband controllers, and two SAS HBA cards; each CPU is paired with 256GB of memory

Figure 2-18 Inside the storage controller chassis (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

Various cables plug into the rear, which looks messy, as shown in Figure 2-19. The architecture is designed for clustering, so many of the cables are there for redundancy

Figure 2-19 X-Brick rear connection diagram (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

From the front of the array, an LCD shows the status of the UPS. Figure 2-20 shows the vertically mounted SSDs

Figure 2-20 Front view of the XtremIO all-flash storage array (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

3 Configuration

As shown in Table 2-5, an X-Brick has a raw capacity of 10TB and a usable capacity of 7.5TB. Factoring in a data deduplication and compression ratio of roughly 5:1, the effective usable capacity comes to 37.5TB.

Table 2-5 XtremIO configuration table (table source: 《深入浅出SSD:固态存储核心技术、原理与实战》)
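Spelling out the arithmetic behind that last number:

```python
raw_tb, usable_tb, data_reduction = 10, 7.5, 5     # per X-Brick, ~5:1 dedup + compression
print(usable_tb * data_reduction, "TB effective")  # -> 37.5 TB effective capacity
```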

2.6.2 Hardware Architecture

EMC XtremIO is EMC's play for the all-flash array market, designed from the ground up around the characteristics of flash memory. As shown in Figure 2-23, one X-Brick contains two storage controllers, one DAE holding 25 SSDs, and two battery backup units (BBU). The 25 SSDs are 400GB each, for a raw capacity of 10TB, and use high-end eMLC flash, whose program/erase endurance is typically an order of magnitude better than ordinary MLC. A single X-Brick ships with two BBUs, one of which is for redundancy; additional X-Bricks need only one BBU each

Figure 2-23 X-Brick dimensions (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

2.6.3 Software Architecture

As the storage industry has developed, hardware has become increasingly standardized, so it is hard to stand out on hardware alone. If you can manufacture the memory chips yourself, as in Samsung's model, you can build everything from the bottom up and rely on huge shipment volumes to capture the hardware profits.

1 The XIO software's killer features

  • Deduplication: improves performance, and at the same time extends flash life and improves reliability because write amplification is reduced
  • Thin provisioning: a volume's capacity grows automatically as it is used (until the array is full), without hurting performance at critical moments
  • Mirroring: an advanced mirroring architecture ensures that neither capacity nor performance is compromised
  • XDP data protection: protects data with RAID6
  • VAAI integration

2 Core design ideas of the XIO software

  • 1) Everything serves random performance: accessing any data block from any node incurs no extra cost; all resources are accessed uniformly. As a result, performance scales roughly linearly as nodes are added, giving good scalability.
  • 2) Reduce write amplification as much as possible: for an SSD, write amplification not only shortens its life but also degrades data reliability because the flash is erased more often. XIO's design goal is to write as little data in the background as possible, in effect attenuating the data that actually reaches the flash
  • 3) No array-level garbage collection: XIO is built from SSDs whose enterprise-class controllers already perform garbage collection very efficiently, so XIO does not repeat the job globally. This reduces write amplification, since little data is moved in the background, and it saves time and system resources for other software features, data services, and VAAI.
  • 4) Store data by content: the storage address is derived from the content of the data, not from its logical address. Data can therefore live anywhere, which improves random performance, allows SSD-specific optimizations, and spreads data evenly across the system
  • 5) True Active/Active data access: a LUN has no owner and every node can serve any volume, so a problem on one node does not degrade performance
  • 6) Good scalability: performance, capacity, etc. can be linearly expanded

3 Why does the XIO software run in the Linux user mode

As shown in Figure 2-28, in the XIO all-flash array software architecture both the XIO OS and the XIO software run in Linux user mode. Linux is divided into kernel mode and user mode: applications run in user mode, system resources such as hardware interfaces are managed in kernel mode, and user mode reaches kernel resources through system calls. Running the XIO software in user mode has several advantages:

  • It avoids switching into kernel mode, so it is fast
  • Development is simpler: there are no kernel interfaces, complex kernel memory management, or kernel exception handling to deal with
  • It is not bound by the GPL: Linux is open source, and a program running in the kernel must use kernel code, which under the GPL must itself be open-sourced; an application developed in user mode has no such restriction.

Figure 2-28 XIO software architecture (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

A XIOS program called X-ENV runs on each CPU; if you run the top command you will see that this program takes up all of the CPU and memory resources

  • The first function is to enable XIO to use 100% of hardware resources
  • The second function is not to give other processes the opportunity to affect the performance of XIO, to ensure the stability of performance
  • The third function is to keep open a possibility: in the future the software could be ported to UNIX or Windows platforms with simple modifications, or from x86 CPUs to ARM, PowerPC, and other architectures, because these are upper-level programs that do not touch the low-level interfaces.

XIO is software completely decoupled from the hardware. The XIO hardware contains essentially no special components: no FPGAs, no in-house chips, SSD cards, or firmware, only standard parts. The advantage is that it can always adopt the latest and best x86 hardware, as well as the latest interconnect technologies, including anything that turns out to be faster than Infiniband

2.6.4 Workflow

1 The six modules

To implement its complex functionality, the XIO software is divided into six modules: three data modules (R, C, D) and three control modules (P, M, L)

  • P (Platform, platform module): monitors the system hardware; every node runs a P module
  • M (Management, management module): implements the various system configurations. It carries out tasks by communicating with the XMS management server, such as creating volumes, LUN masking, and other commands issued from the command line or the GUI. One node runs the active M module and another node runs a standby M module
  • L (Cluster, cluster module): manages cluster membership; every node runs an L module
  • R (Routing, routing module): translates incoming SCSI commands into XIO internal commands; handles the two FC and two iSCSI interfaces; splits all read and write data into 4KB blocks and computes a hash of each 4KB block using the SHA-1 algorithm. Every node runs one R module
  • C (Control, control module): holds the A2H mapping table (logical block address -> hash value) and implements advanced data services such as mirroring, deduplication, and automatic expansion
  • D (Data, data module): holds the other mapping table, H2P (hash value -> SSD physical address). Notice that where data is stored depends only on the data itself, not on its logical address, because the hash is computed from the content. The D module performs the actual reads and writes to the SSDs and implements the RAID data protection scheme, XDP (XtremIO Data Protection). A minimal sketch of these two tables follows this list
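A minimal in-memory model of the two tables and the R module's 4KB/SHA-1 fingerprinting. The structure and names are illustrative, not XtremIO internals.

```python
import hashlib

BLOCK = 4096

def fingerprint(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()     # content address of a 4KB block

def split_4k(data: bytes):
    """R module: carve host I/O into 4KB blocks (padding the tail for simplicity)."""
    data = data.ljust(-(-len(data) // BLOCK) * BLOCK, b"\x00")
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

a2h = {}   # C module: logical block address -> content hash
h2p = {}   # D module: content hash -> SSD physical address (plus a reference count)

for i, blk in enumerate(split_4k(b"hello" * 3000)):   # ~15KB of host data -> 4 blocks
    print(i, fingerprint(blk)[:12])
```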

2 Read process

The read process is as follows (a toy code sketch follows the steps):

  • 1) The host sends the read command to the R module through the FC or iSCSI interface, and the command includes the logical address and size of the data block
  • 2) The R module splits the command into 4KB data blocks and forwards them to the C module
  • 3) The C module checks the A2H table, gets the Hash value of the data block, and forwards it to the D module
  • 4) The D module checks the H2P table, gets the physical address of the data block in the SSD, and reads it out
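The same four steps as a toy read function over pre-populated tables (in-memory stand-ins for the C and D modules, not the real data structures):

```python
import hashlib

flash = {0x100: b"A" * 4096}                            # physical page -> data
h2p   = {hashlib.sha1(b"A" * 4096).hexdigest(): 0x100}  # D module table
a2h   = {42: hashlib.sha1(b"A" * 4096).hexdigest()}     # C module table

def read_4k(lba: int) -> bytes:
    h = a2h[lba]              # step 3: C module, A2H lookup
    phys = h2p[h]             # step 4: D module, H2P lookup
    return flash[phys]        # read the block from the SSD

assert read_4k(42) == b"A" * 4096
```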

3 Non-duplicate write process

The write process for non-duplicate data is as follows; see Figure 2-29:

  • 1) The host sends the write command to the R module through the FC or iSCSI interface, and the command includes the logical address and size of the data block
  • 2) The R module splits the command into 4KB data blocks, calculates the Hash value, and forwards it to the C module
  • 3) The C module sees that the hash value is new, records it in its own table, and forwards the request to the D module
  • 4) The D module assigns an SSD physical address to the data block and writes the data

Figure 2-29 Non-duplicate write process (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

4 Deduplicated write process

The deduplicated write process is as follows (see Figure 2-30; a toy sketch covering both write cases follows the figure):

  • 1) The host sends the write command to the R module through the FC or iSCSI interface, and the command includes the logical address and size of the data block
  • 2) The R module splits the command into 4KB data blocks, calculates the Hash value, and forwards it to the C module
  • 3) The C module checks its tables (presumably there is also a hash-indexed structure such as an H2A table, a tree, or a hash array), finds that the hash already exists, and forwards the request to the D module
  • 4) The D module knows the data block is a duplicate, so it writes nothing and simply increments the block's reference count

Figure 2-30 Deduplicated write process (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

5 ESXi and VAAI

ESXi is VMware's embedded hypervisor operating system; it can be regarded as a virtual machine platform on which many virtual machines run

VAAI (vStorage APIs for Array Integration) is one of the standard "languages" of the virtualization world; it is essentially a protocol through which hosts such as ESXi send commands to the storage array.

6 Copy process

Figure 2-31 shows the state of the data before the copy, and Figure 2-32 shows the copy process, which runs as follows (a metadata-only code sketch is given after Figure 2-32):

  • 1) The virtual host on ESXi sends a virtual machine (VM) copy command in VAAI language
  • 2) The R module receives the command through iSCSI or FC, and selects a C module to perform the copy
  • 3) The C module parses out the command content, copies the address range 0~6 of the original VM to the new address 7~D, and sends the result to the D module
  • 4) The D module queries the Hash table and finds that the data is duplicated, so no data is written, and only the number of references is increased by 1

Figure 2-31 Data state before copying (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

Figure 2-32 Copy process (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

7 Review R, C, D modules

R faces the upper layer, C is the middle layer, and D deals with the SSDs at the bottom. Each X-Brick storage controller has two CPUs, and each CPU runs one instance of XIOS. As shown in Figure 2-33, the R and C modules run on one CPU while D runs on the other

Figure 2-33 X-Brick internal interconnection diagram (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

Intel's Sandy Bridge CPUs integrate a PCIe controller (the enterprise Sandy Bridge parts integrate PCIe 3.0, so traffic does not have to pass through the south bridge). In a multi-CPU system, a device attached directly to a CPU's PCIe lanes therefore performs best, and the placement of R, C, and D is designed around this. For example, the SAS adapter sits in a PCIe slot belonging to CPU 2, so the D module must run on CPU 2 for optimal performance. This shows one advantage of XIO's architecture: the software is laid out to match standardized hardware, and the layout extracts the best performance. If the CPU topology changes, the software placement is simply re-tuned for the new architecture

8 Inter-module communication: excellent scalability

How do the modules communicate? They are not required to sit on the same CPU; as Figure 2-33 shows, R and C do not have to share a CPU. All inter-module communication goes over Infiniband: the data path uses RDMA and the control path uses RPC. XIO's total IO latency is 600-700μs, of which Infiniband accounts for only 7-16μs. The point of using Infiniband is scalability: adding X-Bricks does not increase latency, because the communication path stays the same and any two modules still talk over Infiniband. With many R, C, and D modules in the system, when a 4KB block arrives at a front-end R module, R computes its hash, and that hash lands on any C module with equal probability; none is special. Everything is therefore linear, and adding or removing X-Bricks scales performance up or down linearly

2.7 Solid-state drives with computing capability

The foundation supporting today's huge data networks is IT infrastructure, which has three main parts: network, computing, and storage. As Figure 2-34 suggests, IT infrastructure works much like processing trade: the network is the porter of data, computing is its processor, and storage is its home. Since the arrival of solid-state drives, storage is no longer the problem; the latest PCIe 3.0 x8 SSDs offer read/write bandwidth above 4GB/s. Storage is advancing quickly, while the CPU is constrained by the slowdown of Moore's Law and slow process-node progress. Computing has therefore become the bottleneck, especially for image and video processing, deep learning, and similar workloads: massive data can be read from a PCIe SSD at high speed, but the CPU cannot keep up

Figure 2-34 IT infrastructure (image source: 《深入浅出SSD:固态存储核心技术、原理与实战》)

One answer is an SSD with an FPGA on board: CFS (Computing Flash System). It uses a high-speed PCIe 3.0 x8 interface and can reach 5GB/s. The SSD provides high-speed data storage while the FPGA provides compute acceleration, so data coming off the SSD can be processed by the FPGA on the way through, freeing up the CPU. Everything returns to its proper place: the CPU handles control, the FPGA handles computation, and the SSD handles storage

Its advantages show up in high-speed storage of massive data combined with artificial-intelligence computing, and many scenarios come to mind, self-driving cars for instance. Today's self-driving cars carry sensors such as millimeter-wave radar, lidar, and high-speed cameras that generate on the order of 1GB of data per second, and analyzing that much data takes enormous computing power. Many self-driving cars still use GPUs for this; a CPU+GPU compute box on the market today can draw up to 5000W, and in a car the heat from such a box is a real safety risk as well as a heavy power drain. An FPGA-based solution can cut the power consumption, and with the algorithms tuned for the self-driving use case its compute performance is still sufficient; Audi's self-driving car, for example, uses an FPGA computing platform. The sensor data is currently thrown away, which is a pity: once such cars are commercialized, both governments and manufacturers will want to store valuable driving data and back it up to the cloud, since it is extremely useful for improving autonomous driving and for reconstructing accidents. Saving that data requires write speeds above 1GB/s, which only PCIe SSDs can deliver. An FPGA SSD can therefore store the driving data at high speed on one hand and supply the FPGA for data analysis on the other, which fits the computing and storage needs of autonomous driving very well.

Since the latest wave of artificial intelligence began, many companies have started using FPGAs for AI computation. With CFS, they can run the AI hardware algorithms in the FPGA directly against the massive data inside the SSD at high speed, and then send only the analysis results to the host

Thanks to 《深入浅出SSD:固态存储核心技术、原理与实战》


I hope this article is helpful to everyone. If there is anything wrong with the above, please correct me.

Sharing determines the height, and learning widens the gap


Origin blog.csdn.net/qq_42078934/article/details/131406343