2023: The Latest Developments and Trends in Generative AI and Storage (Part 2)

1. Overview of new storage developments

The biggest milestone in storage over the past two years is that flash memory has captured more than half of the market; Gartner's market data has confirmed this over several consecutive quarters. The trend of solid-state storage replacing mechanical hard drives is irreversible. Against this backdrop, three new directions have been attracting growing attention: new storage media, computational storage (integrated storage and compute), and the further pursuit of extreme performance.

2. New media

Intel once used Optane to push a revolution at the media layer: it created a new SCM/PMem tier between DRAM and SSD, adding a level to the storage pyramid. With good performance, low latency, and non-volatility, this phase-change memory had almost no weaknesses other than high cost and small capacity. In 2022, however, Optane was abruptly withdrawn from the market, pulling the rug out from under the ecosystem: much of the product research and ecosystem built on the new medium was left stranded. Yet after several years of market cultivation, the demand it seeded has become an objective reality, leaving a vacuum that needs to be filled. The industry has been buzzing for more than a year about who, whether Samsung, Kioxia, Dapu Micro, or another manufacturer, can actually ship a mature alternative to resolve this supply-demand gap; it is well worth watching.


Before an alternative product emerges, the industry has two ideas for filling the gap. One is to revisit the NVDIMM (non-volatile memory module) route; the other is to return to DRAM+SSD and redesign the software and hardware architecture around it. Neither is simple work that can be done overnight. From a hardware perspective, NVDIMM-P and NVDIMM-H both belong to SCM, and the design of NVDIMM-P is similar to Optane's and can serve as a reference. However, NVDIMM relies on DRAM, which makes it expensive and inherently less competitive as a product. As for returning to a DRAM+SSD solution, the caching mechanism and the protection against data loss have to be rebuilt, with the attendant risks in time and product maturity.
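To illustrate why falling back to DRAM+SSD is not trivial, here is a minimal, purely illustrative Python sketch of a write-back DRAM cache in front of an SSD-backed store. Without persistent memory, dirty data in DRAM is lost on power failure unless it is explicitly flushed (or journaled), which is exactly the caching and crash-consistency logic that would have to be redesigned. All names here are hypothetical, not any vendor's implementation.

```python
import os

class WriteBackCache:
    """Illustrative DRAM write-back cache over an SSD-backed file store.

    Dirty entries live only in DRAM until flush(); a power loss before
    flush() loses them, which is the crash-consistency gap PMem avoided.
    """

    def __init__(self, backing_dir: str, capacity: int = 1024):
        self.backing_dir = backing_dir
        self.capacity = capacity
        self.cache = {}          # key -> bytes (DRAM copy)
        self.dirty = set()       # keys not yet persisted to SSD
        os.makedirs(backing_dir, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.backing_dir, key)

    def read(self, key: str) -> bytes:
        if key not in self.cache:                 # cache miss: load from SSD
            with open(self._path(key), "rb") as f:
                self.cache[key] = f.read()
        return self.cache[key]

    def write(self, key: str, value: bytes) -> None:
        self.cache[key] = value                   # fast path: DRAM only
        self.dirty.add(key)                       # volatile until flushed

    def flush(self) -> None:
        for key in list(self.dirty):              # persist dirty entries to SSD
            with open(self._path(key), "wb") as f:
                f.write(self.cache[key])
                f.flush()
                os.fsync(f.fileno())              # force data onto the device
            self.dirty.discard(key)
```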

3. Computational storage (integrated storage and compute)

Strictly speaking, the programmable SSDs that computational storage relies on are not themselves a medium, but they are very tightly bound to the medium. In recent years, two interesting and opposite ideas have coexisted. The first is the "offload" idea behind integrated storage-and-compute / computational storage / programmable SSDs: proactively offloading part of the data-processing load that used to run on the host onto the storage side (including smart NICs), providing compute close to the medium by adding an ARM CPU or simply an FPGA, which is the so-called bringing computation closer to the data. The computations it can handle include data compression, video encoding and decoding, encryption and decryption, and other functions needed by IO-intensive applications. There are many participants in this direction, and it is currently a hot spot.
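A purely conceptual sketch of where that offload boundary sits, assuming a hypothetical device_side_compress() standing in for work done next to the medium (a real drive would expose this through NVMe commands or a computational-storage framework, not Python); the host path uses the standard zlib library:

```python
import zlib

def host_side_compress(data: bytes) -> bytes:
    """Traditional path: raw data crosses PCIe to the host, the host CPU
    compresses it, and the result is written back to the drive."""
    return zlib.compress(data, level=6)

def device_side_compress(data: bytes) -> bytes:
    """Computational-storage path (hypothetical): the host only issues a
    command; the drive's ARM core or FPGA compresses data in place, so the
    raw data never crosses the PCIe link."""
    return zlib.compress(data, level=6)   # stand-in: same result, different location

payload = b"example block of cold log data " * 1024
assert host_side_compress(payload) == device_side_compress(payload)
```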

The other idea goes in the opposite direction: lift the management and control logic that is tightly integrated on the SSD media side up to the host. An example is the open-channel SSD widely discussed in the industry over the past two years. Functions that used to be hard-wired into the flash controller are exposed through an interface to the host, so the host can optimize them in software according to its own workload characteristics. In essence, the work of the storage firmware's FTL (flash translation layer) is raised to the upper layer, so the system can see the underlying state, co-design the file system software with the media hardware, and improve performance by various means. This forms an interesting contrast with the offload idea above.
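As a rough illustration of what "raising the FTL to the host" means, here is a toy Python sketch of the core bookkeeping an FTL performs: a host-side logical-to-physical mapping with append-only page writes. Real open-channel stacks (for example the former LightNVM work in Linux) are far more involved; all names here are illustrative.

```python
class HostSideFTL:
    """Toy host-side FTL: maps logical block addresses (LBAs) to physical
    pages, always writing to a fresh page because NAND cannot be
    overwritten in place."""

    def __init__(self, num_pages: int):
        self.l2p = {}                     # logical block -> physical page
        self.free_pages = list(range(num_pages))
        self.invalid = set()              # stale pages awaiting garbage collection

    def write(self, lba: int, data: bytes) -> int:
        if not self.free_pages:
            raise RuntimeError("no free pages: garbage collection needed")
        page = self.free_pages.pop(0)     # the host decides where data lands
        if lba in self.l2p:
            self.invalid.add(self.l2p[lba])   # old copy becomes stale
        self.l2p[lba] = page
        # a real implementation would now issue the page-program command
        return page

    def read(self, lba: int) -> int:
        return self.l2p[lba]              # physical page to read from

ftl = HostSideFTL(num_pages=8)
ftl.write(0, b"hello")
ftl.write(0, b"hello v2")                 # rewrite goes to a new page
print(ftl.l2p, ftl.invalid)               # {0: 1} {0}
```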

4. DNA storage

In terms of personal interest, the medium I find genuinely fascinating is DNA storage, an interdisciplinary combination of biotechnology and information technology, BT+IT. So far, all electronic information technology and its industries have been built on physics; energy band theory gave birth to the semiconductor. The ability of DNA base-pair sequences to store genetic information belongs to biology, an entirely different discipline. High school biology covers the DNA double helix and the four bases A, T, C, and G (two purines and two pyrimidines). Mapping A, C, T, and G to the binary values 00, 01, 10, and 11 respectively turns DNA into a data store: DNA coding and synthesis technology implements writes, and DNA sequencing technology implements reads.
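A minimal sketch of that 2-bits-per-base mapping (A=00, C=01, T=10, G=11). Real DNA storage codecs add constraints such as avoiding long homopolymer runs and balancing GC content, plus error-correcting codes, all of which this toy version omits.

```python
BASE_FOR_BITS = {0b00: "A", 0b01: "C", 0b10: "T", 0b11: "G"}
BITS_FOR_BASE = {base: bits for bits, base in BASE_FOR_BITS.items()}

def encode(data: bytes) -> str:
    """Encode bytes as a DNA sequence, 4 bases per byte (2 bits per base)."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):                 # most-significant pair first
            bases.append(BASE_FOR_BITS[(byte >> shift) & 0b11])
    return "".join(bases)

def decode(seq: str) -> bytes:
    """Decode a DNA sequence back to bytes (length must be a multiple of 4)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BITS_FOR_BASE[base]
        out.append(byte)
    return bytes(out)

message = b"DNA"
strand = encode(message)          # 'CACACAGTCAAC'
assert decode(strand) == message
```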


DNA storage has several outstanding characteristics. First, its storage density is extremely high: per unit volume it can hold roughly three orders of magnitude (about 1,000 times) more data than flash memory. Mark Bathe, a professor of biological engineering at MIT, has a famous framing, "the world in a mug": with DNA storage, a single coffee mug could hold all 175ZB of the world's data.

Second, retention is long and archival cost is low. Disks and flash typically retain data for ten to a few tens of years, whereas DNA can retain it for at least a century, and, if stored properly, for thousands of years; after all, everyone has heard stories of extracting insect genes from amber tens of thousands of years old. More striking still, a paper in Nature reported that genetic material from a mammoth preserved in permafrost for 1.2 million years could be extracted and analyzed.

However, DNA storage's biggest problems are slow reads and expensive writes. Synthesizing 1MB of data can cost more than US$100,000, and although high-throughput sequencing is indeed called "high-speed," it is nowhere near what the storage industry means by high speed.

Overall, DNA storage research has made some progress in the past two years but has not yet produced a major breakthrough. Around the end of 2021 and the beginning of 2022, Microsoft and the University of Washington published a paper on a method for concurrent reads and writes; Southeast University used electrochemical methods to accelerate synthesis (writing) and sequencing (reading); and in September 2022, a team at Tianjin University combined BT and IT to deal with DNA strand-breakage errors after room-temperature storage: using a sequence-reconstruction algorithm from bioscience together with fountain codes (a class of erasure codes) from information technology, they perfectly recovered Dunhuang murals that had previously been written into DNA. The same team had earlier used yeast propagation to replicate data biologically, which is fascinating.

In addition, the DNA Data Storage Alliance, led internationally by Microsoft and Western Digital, released a white paper last year; domestically, BGI, the Shenzhen Institute of Advanced Technology of the Chinese Academy of Sciences, and other organizations jointly released a "DNA Storage Blue Book" in July 2022 and proposed establishing an industry-academia alliance for DNA data storage.

5. Extreme high-performance storage

Achieving extreme performance is not easy. It involves every part of the data path: media, interfaces, protocols, and the caching mechanisms at every level, all working in concert. Upgrading or optimizing only one or two links sometimes fails to deliver the expected results; the performance bottleneck is a cunning, constantly drifting target that takes a global view and careful practice to pin down.

Measuring storage performance comes down to bandwidth, IOPS, and latency, plus the stability of that output, performance QoS. No matter how high the peak, performance that fluctuates wildly is unacceptable.

From the media point of view, flash, SCM, and DRAM may all appear on the data path, with corresponding caching mechanisms to raise absolute performance. From the interface point of view, in the PCIe 4.0 era, M.2 and U.2 devices used PCIe x4, reaching sequential read bandwidth above 7GB/s and 4K random IOPS of 1 to 1.6 million (add-in-card storage uses the PCIe slot directly, supports x8 and x16, and its theoretical bandwidth can exceed 20GB/s). In the current PCIe 5.0 era, the new E1.S/E1.L and E3.S/E3.L form factors not only increase capacity but also double bandwidth, since they support PCIe 5.0 x8 and x16. When PCIe 6.0 arrives, channel bandwidth will double again to roughly 128GB/s at x16, and new interfaces will need to think harder about how to exploit this unprecedented channel performance.
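As a quick sanity check on these figures, a small back-of-envelope Python calculation of theoretical per-direction PCIe bandwidth from the per-lane signaling rate and line-encoding efficiency (my own arithmetic, ignoring protocol overheads beyond encoding):

```python
# Per-lane raw rate in GT/s and line-encoding efficiency per PCIe generation.
GENERATIONS = {
    "PCIe 4.0": (16.0, 128 / 130),   # 128b/130b encoding
    "PCIe 5.0": (32.0, 128 / 130),
    "PCIe 6.0": (64.0, 242 / 256),   # PAM4 + FLIT, roughly 242/256 efficiency
}

def bandwidth_gbps(gen: str, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (1 GT/s ~ 1 Gbit/s per lane)."""
    rate, eff = GENERATIONS[gen]
    return rate * eff * lanes / 8    # bits -> bytes

for gen in GENERATIONS:
    print(gen, "x4 ≈ %.1f GB/s," % bandwidth_gbps(gen, 4),
               "x16 ≈ %.1f GB/s" % bandwidth_gbps(gen, 16))
# PCIe 4.0 x4 ≈ 7.9 GB/s,  x16 ≈ 31.5 GB/s
# PCIe 5.0 x4 ≈ 15.8 GB/s, x16 ≈ 63.0 GB/s
# PCIe 6.0 x4 ≈ 30.2 GB/s, x16 ≈ 121.0 GB/s
# (the raw PCIe 6.0 x16 rate before encoding is 64 GT/s * 16 / 8 = 128 GB/s)
```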

On the protocol side, NVMe has been widely adopted. Within NVMe-oF, NVMe over RDMA (InfiniBand) is worth studying for extreme performance, while RoCE may face harder-to-overcome latency issues and is better suited to cost-effective solutions. The protocol drawing the most industry attention recently is CXL 3.0: through its three sub-protocols, cxl.io, cxl.mem, and cxl.cache, it enables bidirectional access between host memory and device memory, supports system memory expansion, and provides memory-level interconnect capability. At the Flash Memory Summit (FMS) in the US in August 2023, a Korean manufacturer used CXL pooled memory to demonstrate application performance 3.32 times that of a traditional RDMA solution. For research into extreme storage performance, CXL is a protocol worth watching.

Although we have discussed media, interfaces, and protocols separately, achieving extreme end-to-end performance requires considering them together: exploring the co-design of high-speed networks, new media, and new protocols, and matching every layer within each specific system, is the only way to fully realize the performance potential.

6. What is distributed storage doing?

Distributed storage has long been my area of focus and research. Over the past two years, distributed all-flash and high-end distributed storage covering all media types have shown a clear upward trend and performed well in data-center and high-performance-computing scenarios, where high performance, massive numbers of small files, and mixed data requirements coexist. Meanwhile, some advanced features of centralized storage, such as deduplication, now have distributed counterparts ("distributed dedup"), and enhanced distributed indexing and retrieval features have been introduced for industries such as finance.

This year I also noticed LDPC, a forward error-correcting code originally used mainly in communications and audio/video coding, appearing in the underlying data-protection layer. Compared with Reed-Solomon, the typical erasure code I am already familiar with, LDPC offers better encoding and decoding performance, mainly because its core algorithm uses a sparse parity matrix and only XOR operations. Trading a slight probability of decoding failure for lower encode/decode time is a bold technical choice.
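To show why sparse, XOR-only codes are cheap to compute, here is a minimal sketch of encoding with a sparse binary parity matrix over GF(2). It is not a real LDPC code (there is no proper graph construction or iterative belief-propagation decoder), just the XOR arithmetic that makes such codes fast; the parity rows are made up for illustration.

```python
import numpy as np

# Sparse generator rows: each parity symbol XORs only a few data symbols.
PARITY_ROWS = [
    [0, 2, 5],       # parity 0 = d0 ^ d2 ^ d5
    [1, 3, 4],       # parity 1 = d1 ^ d3 ^ d4
    [0, 1, 6, 7],    # parity 2 = d0 ^ d1 ^ d6 ^ d7
]

def encode(data_blocks: list[bytes]) -> list[bytes]:
    """Compute parity blocks by XOR-ing the data blocks each row selects."""
    parities = []
    for row in PARITY_ROWS:
        acc = np.zeros(len(data_blocks[0]), dtype=np.uint8)
        for i in row:
            acc ^= np.frombuffer(data_blocks[i], dtype=np.uint8)
        parities.append(acc.tobytes())
    return parities

def recover_single_loss(data_blocks, parities, lost: int) -> bytes:
    """Recover one lost data block from any parity row that covers it."""
    for row, parity in zip(PARITY_ROWS, parities):
        if lost in row:
            acc = np.frombuffer(parity, dtype=np.uint8).copy()
            for i in row:
                if i != lost:
                    acc ^= np.frombuffer(data_blocks[i], dtype=np.uint8)
            return acc.tobytes()
    raise ValueError("no parity row covers the lost block")

blocks = [bytes([i]) * 16 for i in range(8)]          # 8 data blocks of 16 bytes
parities = encode(blocks)
assert recover_single_loss(blocks, parities, lost=5) == blocks[5]
```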

In addition, the concept of distributed converged storage was officially launched this year; some vendors call it distributed intelligent converged storage. The word "converged" thus reappears in distributed storage products, and its definition has three main points. Media convergence: a preset, scalable tiering mechanism supports existing and future media types, from HDD to SCM. Service convergence: broad support for various storage protocols and big-data protocols unifies storage services. Data convergence: multi-protocol interoperability and unified data management let different applications access the same data through different protocols, forming a truly unified resource pool. Together, service, data, and media convergence make up distributed converged storage, a product concept worth watching, though productization and engineering pose the bigger challenges.

After talking about storage, let’s look at the collision between AI and storage.

7. Infrastructure requirements for large models

For storage systems, generative AI is just another application, so it is important to understand how large models actually operate and what they really need.


What do large models really need at this stage? Without question, every competitor is focused on one thing: getting GPU clusters built and deployed as quickly as possible. As analyzed in the previous article, due to production capacity, policy, and other factors, Nvidia's high-end H100 and A100, the parts best suited to large-model workloads, have been in short supply and hard to procure. The computing demands of large AI models have been doubling every two to three months; an Alibaba Cloud architect estimated a 275x increase every two years. Against such strong demand, Nvidia's share price broke through $500 and hit a new high. Beyond the purchase price, clusters are also expensive to operate and are billed by the hour. For such precious compute, maximizing utilization is the first consideration, and leading players have devised many algorithmic techniques, such as increasing parallelism and avoiding GPU idling caused by pipeline bubbles.

You cannot cook a meal without rice. For large models, compute comes first and ultra-high-speed networking second, because today's generative AI is essentially a compute-intensive workload, very similar to traditional scientific computing and high-performance computing (HPC). HPC experience says that when building this kind of infrastructure, compute and the high-speed network are the hardest and most troublesome problems to solve, and the same holds for large-model deployments: 90% of the energy and budget goes into those two problems. Interconnecting tens of thousands of H100/A100 cards at high speed over an InfiniBand network is a genuinely troublesome problem.

At the same time, because of the "wide computing" architecture mentioned earlier, memory has in fact become a high-priority problem. The trillions of parameters and gradients of the Transformer architecture need to live in the fastest medium available; the GPU memory built from HBM (High Bandwidth Memory) is clearly not enough, so the industry has for some time been advancing technologies to extend capacity beyond GPU memory. Ordering by speed and latency, the hierarchy runs GPU memory -> DRAM -> NVMe devices, and the top AI players are still concentrating on the first two tiers; storage is clearly not their top priority.

Finally, even when the supply and technical issues of all this infrastructure are solved, finding suitable data-center capacity for deployment is not easy. GPUs consume far more power than CPUs: AI servers from Dell and H3C, for example, already ship with 2400W or even 3000W power supplies, drawing far more than an ordinary server. A large share of IDC capacity on the market today uses standard 4kW cabinets, and even 6kW cabinets struggle to meet AI deployment requirements. This is a practical problem we have to face.

8. Data volume and storage requirements for large models


The actual volume of large-model training data is not that astonishing. From GPT's 5GB to GPT-3's 570GB of training data, the totals are small. Public information shows that Inspur's Source 1.0 model collected nearly the entire Chinese internet as its training corpus, yet the total was only around 5TB. And if, per the GPT-4 analysis material that surfaced in July, roughly 13 trillion tokens were used for training, then at 4 bytes per token the whole training set is only about 52TB. For today's storage industry, that is not a daunting capacity requirement: a single high-end all-flash array typically provides 50 to 100TB, and hybrid-flash and mid-range arrays can provide an order of magnitude more.

However, before training starts, the data set must go through two preparatory steps: collection and cleaning.


Take GPT-3 as an example: the raw training data was 45TB of public internet data obtained via the CommonCrawl web-crawl corpus, containing roughly 1 trillion tokens. After cleaning, the data volume shrank by a factor of about 80 to 570GB, and the token count dropped to roughly 40%, about 410 billion. In this collection-and-cleaning stage, the demands on storage capacity and concurrent access are real, but they are essentially the classic requirements of big-data and data-lake workloads from previous years.
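A quick back-of-envelope check on these figures (my own arithmetic, using only the numbers quoted above):

```python
# GPT-4 (per the July analysis material): tokens -> bytes at 4 bytes/token
gpt4_tokens = 13e12
print(gpt4_tokens * 4 / 1e12, "TB")        # 52.0 TB of training data

# GPT-3: raw crawl vs. cleaned data set
raw_tb, cleaned_gb = 45.0, 570.0
print(raw_tb * 1000 / cleaned_gb)          # ~79x reduction in volume
print(410e9 / 1e12)                        # 0.41 -> ~40% of the raw tokens kept
```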

In addition, because multimodal data sets were not enabled in training until GPT-4, the explosive growth of unstructured data in generative AI has not yet truly arrived; that could change dramatically in the next six months to a year.

9. Mechanisms related to large models and storage

As discussed earlier, the two most important stages of a large-model application are training and inference. In the training stage, two aspects of how large models operate are closely tied to storage.

The first is the initial loading of the training data set. The training cluster for a large model of any serious scale is considerable, and the way neural-network training works requires all the data to be loaded before training can start. In this process, the data set is split in a manner similar to database sharding, generating a large volume of concurrent reads and writes against the storage. Today, large models predominantly access storage through file protocols, and the cleaned data set consists mostly of large numbers of small files, so the concurrency of NAS storage, including its metadata performance, is put to the test.
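A minimal sketch of this sharded, concurrent loading pattern, assuming a data set of many small JSONL files on a NAS mount at a hypothetical path; real training frameworks use their own dataset and dataloader abstractions, this only illustrates why metadata and concurrency get stressed.

```python
import concurrent.futures
from pathlib import Path

def shard_for_rank(files: list[Path], rank: int, world_size: int) -> list[Path]:
    """Give each training worker (rank) a disjoint slice of the file list,
    similar in spirit to database sharding."""
    return files[rank::world_size]

def load_shard(files: list[Path], io_threads: int = 32) -> int:
    """Read a shard's small files concurrently; with millions of small files
    this pattern hammers NAS metadata and concurrency performance."""
    def read_one(path: Path) -> int:
        return len(path.read_bytes())

    with concurrent.futures.ThreadPoolExecutor(max_workers=io_threads) as pool:
        return sum(pool.map(read_one, files))

# Hypothetical layout: every worker loads only its own slice of the data set.
all_files = sorted(Path("/mnt/nas/cleaned_dataset").glob("**/*.jsonl"))
my_files = shard_for_rank(all_files, rank=0, world_size=64)
# bytes_loaded = load_shard(my_files)
```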

The second mechanism is that training runs for weeks or months and errors occur regularly along the way. With no better option, AI engineers long ago adopted a coping mechanism called checkpointing. It is essentially a passive response: if an error is expected, say, every 8 hours, then set a checkpoint interval of 6 hours, dump all the intermediate state every 6 hours, and when the next error occurs roll back to the most recent checkpoint and continue from there. I call this a home-grown form of backup, and it has effectively forced AI engineers to design a degree of backup software themselves.
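A minimal sketch of that checkpoint-and-roll-back loop, with the state serialized via Python's standard pickle; the 6-hour interval and the local path are just the assumptions from the text, and real frameworks have their own checkpoint formats.

```python
import pickle
import time
from pathlib import Path

CKPT_DIR = Path("/local_nvme/checkpoints")      # hypothetical fast local path
CKPT_INTERVAL_S = 6 * 3600                      # checkpoint every 6 hours

def save_checkpoint(step: int, state: dict) -> None:
    """Dump the intermediate training state (weights, optimizer, step)."""
    CKPT_DIR.mkdir(parents=True, exist_ok=True)
    with open(CKPT_DIR / f"step_{step:08d}.pkl", "wb") as f:
        pickle.dump(state, f)

def load_latest_checkpoint() -> dict | None:
    """Roll back to the most recent checkpoint after a failure."""
    ckpts = sorted(CKPT_DIR.glob("step_*.pkl"))
    if not ckpts:
        return None
    with open(ckpts[-1], "rb") as f:
        return pickle.load(f)

def train(resume_state: dict | None) -> None:
    state = resume_state or {"step": 0, "weights": {}}
    last_ckpt = time.time()
    while state["step"] < 1_000_000:
        state["step"] += 1                       # stand-in for one training step
        if time.time() - last_ckpt >= CKPT_INTERVAL_S:
            save_checkpoint(state["step"], state)
            last_ckpt = time.time()

# On (re)start: resume from the newest checkpoint if one exists.
# train(load_latest_checkpoint())
```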

The sheer number of parameters is the defining characteristic of large models, so this intermediate state is also enormous. If it is written back to the centralized or distributed storage that served the original training data, the read and write process can be as slow as the initial data load; if that stretches to ten-plus hours, you get the classic problem of "the backup hasn't finished and the production system has already crashed." The better-funded solution is therefore to plug a handful (five to eight) of NVMe SSDs directly into each training node and cache checkpoint data locally: with none of the network and concurrent-IO headaches, "backup" and "restore" are very fast, at a higher cost.

Compared with training, the storage difficulty of the inference stage is basically negligible. In inference, the model has already been trained and fine-tuned, so most of the workload is computation. New data may come in, such as user input, and inference results generate data too, but the volumes are those of an ordinary application, with none of the enormous challenges described above.

Some large-model applications also want to keep optimizing continuously after going live, adjusting based on real user feedback. This can involve full-lifecycle data management, such as processing feedback data and archiving data from different stages, which touches storage as well; but little of this has surfaced so far, and technically it is all conventional application demand that today's storage systems handle easily.

10. Summary of AI from a storage perspective

It is undeniable that, Nvidia being the biggest beneficiary aside, this round of generative AI has driven demand growth across the entire IT supply chain, and vendors large and small are celebrating the new orders brought by large models. As a senior industry strategist put it, against a global IT market of roughly US$2 trillion, storage accounts for only a single-digit percentage, a relatively small piece; and from the perspective of AI applications, whether judged by resource scarcity, technical urgency, or budget share, storage, especially external storage, is not the priority right now. But for the storage industry, even though generative AI is just one of many applications to be supported, its growth prospects deserve priority attention.

At the current stage, generative AI's first storage requirement is high performance and low latency, but it is not hard to meet. Nvidia's official recommendation puts the required bandwidth at roughly 40GB/s for reads and 20GB/s for writes, and its recommended compute-node configuration includes only two 40GB/s InfiniBand ports; once network redundancy is accounted for, the bandwidth of a single port is enough. If flash performance can be fully exploited, a few million IOPS will satisfy generative AI's needs.

The second requirement is concurrent access and data sharing, but this is only a hard requirement when loading training data. Moreover, if a vector database is used as the data store, the storage requirements collapse back to traditional ones such as performance and reliability.

Finally, there are some advanced features worth studying, such as GDS (GPUDirect Storage) support in Nvidia's CUDA stack, which lets the GPU bypass the CPU and access storage directly, improving performance and responsiveness. And, as discussed earlier, some storage functions that AI engineers have re-implemented through engineering workarounds, checkpointing among them, could be handed over to a more professional storage-system implementation and offloaded to the storage layer. That is an interesting research direction.

In addition, generative AI today is characterized by relatively small absolute data volumes and insensitivity to cost. Given all of the above, the two storage products that fit best right now are new NVMe SSDs and high-performance distributed all-flash file storage, and that is largely what is used in practice.


Generally speaking, AI applications evolve extremely fast, and tipping points appear from time to time. "Large models" are still on the rise, and the new concept of "AI agents" is already on the table: the startup Imbue has yet to launch a product but has already raised US$200 million in venture capital, with Nvidia among its backers, along with 10,000 H100s, at a valuation of US$1 billion, and revolutionary technical iterations keep arriving one after another. In 2023, global competition over general-purpose large models remains fierce, with leading groups making open-source moves from time to time, each of which, in the view of the investment community, may trigger a reshuffle. The "war of a hundred models" in domestic vertical industries is also in full swing, and demand for the relevant technologies, products, solutions, and talent is huge. Before the dust settles, the storage industry has at least a window of opportunity worth seizing.

————————————————

Copyright statement: This is an original article by Chen Xuefei, member of the Storage Committee of the Shanghai Computer Society.


Reprinted from: blog.csdn.net/iamonlyme/article/details/132962557