Storage computing hardware


With the development of cloud storage, the Internet of Things, consumer electronics, aerospace, earth-resource information, scientific computing, medical imaging and life sciences, military equipment, and other important electronic information applications, today's society has entered an era of big data and information explosion. Demand for ultra-high-performance computing with ultra-high speed, high bandwidth, large capacity, high density, low power consumption, and low cost is growing explosively. The traditional computer uses the von Neumann architecture, in which computing and storage are separated and handled by the central processing unit (CPU) and the memory, respectively. With the rapid development of microelectronics technology, CPU and memory performance metrics such as speed and capacity have improved rapidly, but the bus speed for transmitting data and instructions has improved very little, so frequent data transfers between the CPU and the memory have become the bottleneck of information processing, known as the "storage wall" (memory wall).

As soon as the "storage wall" became apparent, computer researchers began looking for ways to eliminate or mitigate it. The approach used in industry to this day is the well-known memory hierarchy, whose core idea is to buffer the speed mismatch between the processor and dynamic memory by inserting a series of cache memories. Although the memory hierarchy reduces the average latency of computation to a certain extent, it does not fundamentally eliminate the "storage wall" problem.

At present, many scholars and institutions have begun to study computing in storage (processing in memory). The core idea is to integrate the computing (processing) function and the storage function on the same chip, so that all computation is carried out inside the memory itself and data no longer needs to be shuttled back and forth over a bus.

Facing the increasingly serious storage-wall and memory-access power-consumption problems, and driven by artificial intelligence applications, computational storage / in-memory computing / storage-computing integration offers a promising approach. Judging from current implementations, it follows two routes: one based on mature volatile memory and one based on still-immature non-volatile memory. Both routes face challenges:

1) Based on mature volatile memory: this route requires fusing processor technology and memory technology. However, because processors and memories use different manufacturing processes, implementing memory functions on a processor die reduces storage density, while implementing processor functions on a memory die reduces processor speed. It is difficult to reach a good compromise between performance and capacity.

2) Based on immature non-volatile memory: non-volatile memory is a natural fusion of storage and computing and is arguably the best device for building in-memory computing. However, its manufacturing ecosystem and processes are still immature. Economically, beyond the additional investment needed to adapt existing memory fabs to produce these new technologies, it is also difficult to persuade users to migrate as long as DRAM or Flash still meets their needs.

Overview

Modern electronic equipment is developing rapidly toward intelligence, light weight, and portability, but the challenge of intelligent big-data processing and the bottleneck of the von Neumann computing architecture have become one of the key contradictions in the electronic information field. At the same time, the power-consumption and reliability problems brought about by shrinking device sizes (the failing of Moore's Law) further aggravate this contradiction. In recent years, new data-centric computing architectures, such as compute-in-memory (storage-computing integrated) chip technology, have received widespread attention, especially in edge and terminal intelligent scenarios. However, given the constraints of edge devices in terms of resources, latency, cost, and power consumption, the industry places stringent requirements on such chips, so the choice of storage medium and computing paradigm is particularly important. At the same time, device-chip-algorithm-application cross-layer collaboration is critical to the industrial application and ecosystem building of compute-in-memory chips. This article gives a brief overview of the demand, current status, mainstream directions, application prospects, and challenges of edge-side intelligent compute-in-memory chips. We have reason to believe that, supported by high-efficiency, low-cost compute-in-memory chips, and with the maturing of 5G communications and Internet of Things (IoT) technology, the era of the Artificial Intelligence of Things (AIoT) is coming.
Since the fourth information revolution, modern electronic equipment has developed rapidly toward intelligence, light weight, and portability. In recent years especially, with the deepening research and popularization of artificial intelligence algorithms represented by deep neural networks, smart electronic devices and related application scenarios have become ubiquitous, including face recognition, speech recognition, smart homes, security monitoring, autonomous driving, and so on. At the same time, with the maturing of 5G communications and Internet of Things (IoT) technology, it is foreseeable that the AIoT era is coming. As shown in Figure 1, future AIoT devices will fall mainly into three categories: cloud, edge, and terminal [1], with edge and terminal devices showing explosive growth. As is well known, the three pillars of artificial intelligence are computing power, data, and algorithms. The spread of the Internet and 5G communications has solved the big-data problem, the rapid development of deep neural networks has solved the algorithm problem, and the large-scale industrialization of high-performance hardware such as NVIDIA GPUs and Google TPUs has solved the cloud computing-power problem. However, the computing power of resource-constrained edge and terminal devices remains the missing link, and because of their special requirements for latency, power consumption, cost, and security (especially in subdivided scenarios), it has become the core bottleneck for large-scale industrial application of AIoT. Therefore, on the road to AIoT, the core challenge to be solved is edge-side intelligent chips with high energy efficiency, low cost, and long standby time.

Figure 1. Schematic diagram of the future AIoT scenario, comprising three layers: cloud data center, edge, and terminal [1]

The von Neumann computing architecture bottleneck and the challenges of intelligent big-data processing

With the rapid rise of applications such as big data, the Internet of Things, and artificial intelligence, data volumes are growing at an explosive rate. Relevant research reports indicate that the amount of data generated worldwide every day is enormous and is still doubling roughly every 40 months [2]. Efficiently storing, moving, and processing this massive data has become one of the major challenges in the electronic information field. However, under the classic von Neumann computing architecture [3, 4], data storage and processing are separated, and data is transferred between the memory and the processor over a data bus, as shown in Figure 2 (a). In application scenarios such as big-data analytics, this architecture has become one of the main bottlenecks of high-performance, low-power computing systems. On the one hand, the limited bandwidth of the data bus severely restricts processor performance and efficiency, and there is a serious performance mismatch between memory and processor, as shown in Figure 2 (b). No matter how fast the processor runs, the data still resides in memory: for every operation, data must be moved from the memory to the processor over the bus and then moved back after processing. The situation resembles an hourglass, with the two bulbs representing the memory and the processor, the sand representing the data, and the narrow neck representing the data bus. Memory bandwidth therefore severely limits processor performance, which is known as the storage-wall (memory-wall) challenge. At the same time, Moore's Law is gradually failing, and the technical path of improving chip performance by shrinking device sizes faces huge power-consumption and reliability challenges, so the traditional von Neumann architecture struggles to meet the fast, accurate, and intelligent response requirements of intelligent big-data applications.

On the other hand, the frequent migration of data between memory and processor causes severe transmission power consumption, known as the power-wall challenge. NVIDIA's research reports point out that the power consumed by data movement can exceed that of the actual data processing; for example, at the 22-nanometer process node, the data-transfer power consumption associated with a one-bit floating-point operation is about 200 times the data-processing power consumption [5]. In the electronic information field, the storage-wall and power-wall problems together are also called the von Neumann computing architecture bottleneck. The challenge of intelligent big-data processing is therefore essentially a contradiction between the processing capacity of the hardware and the scale of the data being processed. Building efficient hardware and computing architectures to address the von Neumann bottleneck in the context of intelligent big-data applications, especially in resource-constrained AIoT edge and terminal devices, is of great scientific significance and application prospect.
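As a back-of-the-envelope illustration of this imbalance, the short Python sketch below compares the compute energy of a fully connected layer with the energy of moving its operands over an off-chip bus. The per-operation energy values are placeholder assumptions chosen only for illustration, not the figures from the cited reports.

```python
# Back-of-the-envelope comparison of compute vs. data-movement energy for one
# neural-network layer. The per-operation energies below are illustrative
# placeholders, not figures from the cited reports.

def layer_energy(n_in, n_out, bits=32,
                 e_mac_pj=1.0,         # assumed energy per MAC (picojoules)
                 e_dram_bit_pj=20.0):  # assumed energy per bit moved off-chip
    macs = n_in * n_out                                 # one MAC per weight
    moved_bits = (n_in * n_out + n_in + n_out) * bits   # weights + activations
    compute_pj = macs * e_mac_pj
    movement_pj = moved_bits * e_dram_bit_pj
    return compute_pj, movement_pj

if __name__ == "__main__":
    c, m = layer_energy(1024, 1024)
    print(f"compute: {c/1e6:.2f} uJ, data movement: {m/1e6:.2f} uJ, "
          f"ratio: {m/c:.0f}x")
```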

Figure 2. (a) Schematic diagram of the von Neumann computing architecture; (b) performance gap between memory and processor
To break the von Neumann bottleneck and reduce the overhead of data movement, the most straightforward approach is to raise the data-bus bandwidth or clock frequency, but this inevitably brings greater power consumption and hardware cost, and its scalability is severely limited. The mainstream solution currently adopted by industry is to achieve high-speed, high-bandwidth data communication through high-speed interfaces, optical interconnects, 3D stacking, and on-chip caches, while placing the memory as close to the processor as possible to shorten the data-transfer distance. Optical interconnect technology is still at the research and pilot stage, whereas 3D stacking and larger on-chip caches are already widely used in real products. Many leading companies at home and abroad are developing and applying these technologies, such as Google, Intel, AMD, NVIDIA, and Cambricon. For example, using 3D stacking to integrate large-capacity memory on the processor chip can increase the data bandwidth from tens of GB/s to hundreds of GB/s; based on 3D-stacked DRAM technology, IBM released a supercomputing system [6]; and Baidu's Kunlun chip in China and the UK company Graphcore integrate 200-400 MB of on-chip cache to improve performance. It is worth noting that these solutions inevitably incur power and cost overheads and are therefore difficult to apply to energy- and cost-constrained edge and terminal AIoT devices; moreover, they do not change the fundamental separation of data storage and data processing, so they can only alleviate, not eliminate, the von Neumann bottleneck.
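The bandwidth figures above can be made concrete with a small sketch: the time needed to stream a set of network weights once from memory at different bus bandwidths. The model size and bandwidth values used here are assumptions chosen only to illustrate the scaling.

```python
# Rough illustration of how memory bandwidth bounds a memory-bound workload:
# time to stream one full pass of a model's weights at different bandwidths.
# Model size and bandwidth figures are assumptions for illustration only.

model_bytes = 100e6  # assume a 100 MB set of weights
for name, gb_per_s in [("conventional DDR bus", 25),
                       ("3D-stacked / HBM-class", 300)]:
    t_ms = model_bytes / (gb_per_s * 1e9) * 1e3
    print(f"{name:>25}: {t_ms:.2f} ms per full weight pass")
```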

Fundamental Principles of Compute-in-Memory and the Current State of Development at Home and Abroad


Compute-in-memory chip technology aims to transform the traditional computing-centric architecture into a data-centric one: the memory itself is used for data processing, so that data storage and computing are integrated on the same chip. This can fundamentally remove the von Neumann bottleneck and is especially suitable for massively parallel applications such as deep learning neural networks. It should be noted that many similar English terms coexist in academia and industry, such as Computing-in-Memory, In-Memory Computing, Logic-in-Memory, In-Memory Processing, and Processing-in-Memory, and the naming is not unified across research fields (devices, circuits, architectures, database software, etc.); the corresponding Chinese translations also vary. In addition, in a broad sense, near-memory computing is also counted as one of the technical paths of storage-computing integration.

The basic concept of storage-computing integration can be traced back to the 1970s; Kautz et al. of the Stanford Research Institute proposed the idea as early as 1969 [7, 8]. Subsequent research was carried out at the levels of chip circuits, computing architectures, operating systems, and system applications. For example, Patterson et al. at the University of California, Berkeley successfully integrated a processor into a DRAM memory chip, implementing an intelligent storage-computing integrated architecture [9]. However, owing to chip design complexity and manufacturing cost, as well as the lack of killer big-data applications to drive it, early storage-computing integration remained at the research stage and was not put into practical use. In recent years, with the growth of data volumes and the advance of memory chip technology, the concept has regained attention and begun to be applied to commercial-grade DRAM main memory. Especially around 2015, with the rise of big-data applications such as the Internet of Things and artificial intelligence, the technology has been widely researched by academia and industry at home and abroad. At the 2017 International Symposium on Microarchitecture (MICRO 2017), NVIDIA, Intel, Microsoft, Samsung, ETH Zurich, the University of California, Santa Barbara, and others all presented prototypes of storage-computing integrated systems [10-12].
In particular, in recent years, non-volatile memory technologies such as flash memory (Flash), memristors (resistive random-access memory, RRAM), phase-change memory (PCM), and spintronic magnetic memory (MRAM) [13-17] have brought new hope for efficient compute-in-memory chips. The resistive storage principle of these non-volatile memories provides inherent computing capability, so data storage and data processing can be integrated at the same physical cell. In addition, non-volatility allows data to be kept directly on the system-on-chip, enabling instant power-on/power-off without additional off-chip memory. Professor Williams' team at Hewlett-Packard Labs proposed and verified the use of memristors to implement simple Boolean logic functions in 2010 [18], and a large body of related research has followed. In 2016, Professor Xie Yuan's team at the University of California, Santa Barbara (UCSB) proposed using RRAM to build a deep learning neural network on a compute-in-memory architecture (PRIME [19]), which attracted widespread attention. Test results show that, compared with a traditional von Neumann implementation, PRIME can reduce power consumption by about 20 times and increase speed by about 50 times [20]. This approach efficiently realizes vector-matrix multiplication and has huge application prospects for deep learning neural network accelerators.

Internationally, Duke University, Purdue University, Stanford University, the University of Massachusetts, Nanyang Technological University in Singapore, Hewlett-Packard, Intel, Micron, and others have carried out related research and released test-chip prototypes [21-24]. China has also achieved a series of innovative results: the team of Professor Liu Ming at the Institute of Microelectronics of the Chinese Academy of Sciences, the teams of Professors Huang Ru and Kang Jinfeng at Peking University, the teams of Professors Yang Huazhong and Wu Huaqiang at Tsinghua University, the team of Professor Song Zhitang at the Shanghai Institute of Microsystems, and the team of Professor Miao Xiangshui at Huazhong University of Science and Technology have all released prototypes of related devices and chips, verified on applications such as image and speech recognition [25-27]. PCM has multi-level characteristics similar to RRAM, so vector-matrix multiplication can be implemented on similar principles. MRAM, however, stores data in binary physical states, which makes cross-point vector-matrix multiplication difficult, so MRAM-based compute-in-memory usually adopts a Boolean-logic computing paradigm [28-30]. Owing to issues of technology and process maturity, compute-in-memory chips based on phase-change, resistive, and spin memories have not yet been industrialized. Meanwhile, compute-in-memory chip technology based on NOR Flash has recently received particular attention from industry: since UCSB released the first sample in 2016, a number of start-ups have pursued it, such as Mythic and Syntiant in the United States and Zhikun Technology in China.
These efforts have attracted industrial investment from mainstream semiconductor companies and investors at home and abroad, including Lam Research, Applied Materials, Intel, Micron, ARM, Bosch, Amazon, Microsoft, SoftBank, Walden, and SMIC. By comparison, NOR Flash has advantages in the edge-side AIoT field in terms of technology/process maturity and cost, and the three companies mentioned above all announced mass production by the end of 2019.

Edge-side intelligent application characteristics and compute-in-memory chip requirements


With the rapid development of AIoT, users have special application requirements in terms of latency, bandwidth, power consumption, and privacy/security, as shown in Figure 3 (a), driving an explosion of intelligent application scenarios on the edge side. First, latency is the most direct component of user experience and a hard requirement for certain applications, such as autonomous driving, real-time interactive games, and AR/VR. Considering the amount of data generated in real time, the practically available transmission bandwidth, and the energy budget of edge devices, it is impossible to rely on the cloud for all computation. For example, Intel estimates that each autonomous vehicle generates up to 400 GB of data per day [1]; likewise, each high-definition security camera generates 40-200 GB of data per day. If the data from all vehicles, let alone all cameras, were sent to the cloud for processing, it would be a disaster not only for user experience but also for the transmission network and the cloud infrastructure. Moreover, edge data usually has a short half-life: out of this huge volume, the data that is truly meaningful may be very small, so sending everything to the cloud makes little sense. In addition, data produced by the same type of device tends to follow highly similar patterns, so even the limited processing power of the edge or terminal can filter out most of the useless data, greatly improving user experience and reducing overhead. Another parameter that shapes user experience is standby time, which is especially critical for portable and wearable devices; smart glasses and earphones, for example, need at least a full day of standby under load, so the power consumption and energy efficiency of terminal devices is a major challenge. Second, users are increasingly concerned about privacy and security and are reluctant to send data to the cloud, making local processing an indispensable capability of terminal devices. With the spread of speech recognition and face recognition, more and more people worry about privacy leaks; even as smart homes become popular, many users choose to turn off voice processing. Finally, in scenarios without network connectivity, edge and terminal processing becomes a necessity.

Correspondingly, unlike cloud chips, edge-side intelligent chips place the highest priority on cost and power consumption, with versatility, computing power, and speed coming second, as shown in Figure 3 (b). Traditional technology paths that rely on device-size scaling to keep improving chip performance therefore face huge challenges in power consumption and cost, while paths that rely on device and architecture innovation are attracting more and more attention. In 2018, the US DARPA Electronics Resurgence Initiative explicitly proposed a path that no longer relies on Moore's-Law scaling: reducing the need to move data within data-processing circuits and studying new computing topologies that combine data storage and processing, so as to bring significant improvements in computing performance. The industry generally believes that compute-in-memory chip technology offers one of the feasible technical paths toward this goal.
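A rough calculation, sketched below, shows why shipping all raw camera data to the cloud is untenable and why local filtering matters. Apart from the 40-200 GB per day range quoted above, the camera count and the fraction of data kept are illustrative assumptions.

```python
# Quick estimate of the uplink bandwidth a fleet of surveillance cameras would
# need if all raw data were shipped to the cloud, versus after on-device
# filtering. Figures other than the 40-200 GB/day range are assumptions.

cameras = 10_000
gb_per_day = 100              # within the 40-200 GB/day range quoted above
keep_fraction = 0.01          # assume 99% of frames are filtered out locally

raw_gbps = cameras * gb_per_day * 8 / 86_400      # gigabits per second
print(f"raw uplink:      {raw_gbps:,.0f} Gb/s")
print(f"after filtering: {raw_gbps * keep_fraction:,.1f} Gb/s")
```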

Figure 3. (a) Demand characteristics of edge and terminal intelligent application scenarios (adapted from Gartner, 2019); (b) different performance requirements for cloud-side versus edge-side intelligent chips

Main research directions for compute-in-memory chips


According to the storage medium, current mainstream R&D on compute-in-memory chips focuses on traditional volatile memories, such as SRAM and DRAM, and non-volatile memories, such as RRAM, PCM, MRAM, and flash memory; SRAM and MRAM are also the representative media for the general near-memory computing architecture. It is worth noting that this chapter mainly discusses deep learning neural network accelerators built on compute-in-memory chips. In such applications, more than 95% of the operations are the multiply-accumulate (MAC) operations that make up vector-matrix multiplication, so compute-in-memory is mainly used to accelerate this part of the workload.
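As a reminder of why MAC operations dominate, the minimal sketch below expresses a fully connected layer as a vector-matrix multiply plus a cheap element-wise activation; the shapes and values are arbitrary illustrations.

```python
# A fully connected layer reduces to a vector-matrix multiply (a batch of MAC
# operations), which is exactly the operation compute-in-memory arrays
# accelerate. Shapes and values are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)          # input activations
W = rng.standard_normal((512, 256))   # layer weights (would live in the array)
b = rng.standard_normal(256)

y = x @ W + b                         # the MAC-dominated part of the layer
y = np.maximum(y, 0.0)                # element-wise ReLU is comparatively cheap
print(y.shape)                        # (256,)
```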

(1) General near-memory computing architecture

This scheme usually adopts a homogeneous many-core architecture. Each memory-processing unit (MPU) includes a processing engine (PE), a cache, control logic (CTRL), and input/output (I/O); the cache may be SRAM, MRAM, or a similar high-speed random-access memory. The MPUs are connected through a network-on-chip (NoC), and each MPU accesses its own cache, enabling high-performance parallel computation. Typical cases include the UK company Graphcore, whose test chip integrates 200-400 MB of SRAM cache, and the US company Gyrfalcon Technology, whose test chip integrates 40 MB of embedded MRAM cache.
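A minimal functional sketch of this dataflow is given below: the weight matrix is partitioned so that each core computes only on the slice held in its local cache, and the partial results are concatenated. The core count and matrix sizes are arbitrary assumptions, and the sketch models the partitioning only, not the NoC or the hardware.

```python
# Minimal sketch of the near-memory many-core idea: the weight matrix is
# partitioned across cores so each core only touches its locally cached slice.
# Core count and sizes are arbitrary; this models the dataflow, not the hardware.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(512)
W = rng.standard_normal((512, 256))

n_cores = 4
slices = np.array_split(W, n_cores, axis=1)     # each slice lives in one core's cache
partials = [x @ w_local for w_local in slices]  # each core works on local data only
y = np.concatenate(partials)

assert np.allclose(y, x @ W)                    # same result as the monolithic multiply
```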
(2) SRAM-based compute-in-memory
Because SRAM is a binary memory, a binary MAC operation is equivalent to an XNOR-accumulate operation, which can be used for binary neural network inference. Figure 4 (a) and Figure 4 (b) show two typical design schemes. The core idea is that the network weights are stored in the SRAM cells, the input activations are fed in through extra word lines, and the XNOR-accumulate operation is realized by the peripheral circuits, with the result read out by a counter or as an analog current; see [31, 32] for the specific implementations. The main difficulty of this scheme is to scale to large arrays while maintaining computational accuracy.
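The equivalence between a binary MAC and XNOR-plus-popcount, which these SRAM schemes exploit, can be checked with the short sketch below; the bit encoding (0/1 standing for -1/+1) and the vector length are illustrative assumptions.

```python
# Sketch of why a binary MAC reduces to XNOR plus a population count, the
# operation the SRAM schemes above implement in the bit cells and peripherals.
# Encoding: logical 0/1 represents the values -1/+1.
import numpy as np

rng = np.random.default_rng(2)
w_bits = rng.integers(0, 2, 256)           # weights stored in SRAM cells
x_bits = rng.integers(0, 2, 256)           # input activations on the word lines

xnor = ~(w_bits ^ x_bits) & 1              # per-cell XNOR
popcount = xnor.sum()                      # accumulated by a counter / analog sum
mac_from_xnor = 2 * popcount - len(w_bits) # map back to the -1/+1 domain

w = 2 * w_bits - 1                         # reference computation in +/-1 values
x = 2 * x_bits - 1
assert mac_from_xnor == int(np.dot(w, x))
```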

Figure 4. SRAM compute-in-memory cell designs: (a) 12-transistor design [31]; (b) 8-transistor design [32]

(3) DRAM-based compute-in-memory

DRAM-based compute-in-memory designs mainly exploit the charge-sharing mechanism between DRAM cells [33, 34]. Figure 5 shows a typical implementation [33]: when multiple rows of cells are activated simultaneously, charge is exchanged and shared between cells holding different data, thereby implementing logic operations. One problem with this scheme is that the computation is destructive, that is, the data stored in the DRAM cells is corrupted by each operation and must be refreshed or restored afterwards, which incurs significant power consumption. Another difficulty, as with SRAM, is to scale to large arrays while maintaining computational accuracy.
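A toy behavioral model of this charge-sharing computation is sketched below, assuming an idealized cell and sense amplifier: activating three rows equalizes their charge, the sense amplifier resolves the majority value, and presetting one control row to 0 or 1 yields AND or OR of the other two. It also makes the destructive nature of the operation explicit. This is a functional illustration, not the specific circuit of [33].

```python
# Toy model of the charge-sharing scheme described above: when three DRAM
# cells are activated together their charges equalize, and the sense amplifier
# resolves the shared level to the majority value. Fixing one "control" cell
# to 0 or 1 turns majority into AND or OR of the other two. Note the operation
# is destructive: all participating cells end up holding the result.

def triple_row_activate(a, b, c):
    shared = (a + b + c) / 3.0          # charge equalization across the rows
    result = 1 if shared >= 0.5 else 0  # sense amplifier decision, then restore
    return result                       # all three cells are rewritten with this

def and2(a, b):  # control cell preloaded with 0
    return triple_row_activate(a, b, 0)

def or2(a, b):   # control cell preloaded with 1
    return triple_row_activate(a, b, 1)

for a in (0, 1):
    for b in (0, 1):
        assert and2(a, b) == (a & b)
        assert or2(a, b) == (a | b)
```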

Figure 5. Principle of a typical DRAM-based compute-in-memory design [33]

(4) RRAM/PCM/Flash multi-level compute-in-memory

The basic principle of the multi-level compute-in-memory schemes based on RRAM/PCM/Flash is to use the multi-level characteristics of the storage cells and the intrinsic physical and electrical behavior of the devices (Kirchhoff's law and Ohm's law) to realize multi-valued MAC operations [13, 21-25], as shown in Figure 6. Each memory cell can be regarded as a variable conductance/resistance that stores a network weight; when a voltage/current (the input) is applied to each row, each column yields the voltage/current value of a MAC operation. In actual chips, the specific implementation varies with the physical principles and operating modes of the different storage media. Since RRAM/PCM/Flash are themselves non-volatile memories that can store the network weights directly, no off-chip memory is required, which reduces chip cost; at the same time, non-volatility ensures that data is not lost on power-down, enabling instant power-on/power-off operation, reducing static power consumption, and extending standby time. This makes them well suited to power-constrained edge and terminal devices. At present, RRAM/PCM-based compute-in-memory is a very active research direction in academia; unfortunately, because of device maturity and other issues, it has not yet been industrialized, although its future potential is great. Flash-based compute-in-memory is comparatively mature and has received broad industrial attention, with mass production expected by the end of 2019.
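The following idealized sketch models this crossbar MAC: each cell is a conductance storing a weight, each row is driven by an input voltage, and by Ohm's law plus Kirchhoff's current law each column current equals a weighted sum. Array size, conductance range, and read voltages are arbitrary assumptions, and device non-idealities (wire resistance, nonlinearity, variation) are ignored.

```python
# Idealized model of the crossbar MAC described above: each cell stores a
# conductance G[i, j] (the weight), an input voltage V[i] drives each row, and
# by Ohm's law and Kirchhoff's current law the current collected on column j
# is I[j] = sum_i G[i, j] * V[i]. Device non-idealities are ignored here.
import numpy as np

rng = np.random.default_rng(3)
G = rng.uniform(1e-6, 1e-4, size=(128, 64))  # cell conductances in siemens
V = rng.uniform(0.0, 0.2, size=128)          # row read voltages in volts

I = V @ G                                    # column currents = analog MAC result
print(I.shape)                               # (64,)
```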

Figure 6. Basic principle of MAC operation based on RRAM/PCM/Flash [13]

(5) RRAM/PCM/MRAM binary compute-in-memory

There are two main schemes for binary compute-in-memory based on RRAM/PCM/MRAM. The first uses auxiliary peripheral circuits, similar to the SRAM approach above; Figure 7 (a) shows a typical reconfigurable implementation [35] that can switch between storage mode and computing mode. Because RRAM/PCM/MRAM rely on non-volatile resistive storage, the circuit implementations differ; see [35-37] for details. The second scheme uses the storage cells themselves to perform Boolean logic [28, 38-40], as shown in Figure 7 (b): the input and output operations of the storage cells directly carry out the logic operation, and different cell structures and operating modes lead to different implementations; see [28, 38-40] for details.
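For the second scheme, the abstract sketch below treats each resistive cell as holding a logic value and models a MAGIC-style in-array NOR operation, from which other gates compose. It is a behavioral illustration under stated assumptions, not a circuit-level model of the referenced designs.

```python
# Abstract functional sketch of in-cell Boolean logic in the second scheme:
# bits are encoded as resistance states (low = 1, high = 0), and a single
# array operation writes the NOR of several input cells into an output cell
# that was pre-set to 1 (as in MAGIC-style gates). NOR is functionally
# complete, so other gates compose from it. This models behavior only, not
# the underlying device physics or timing.

def nor_op(*input_cells):
    # Output cell starts at logic 1 (low resistance); it is switched
    # to 0 whenever any input cell holds a 1.
    return 0 if any(input_cells) else 1

def not_op(a):    return nor_op(a)
def or_op(a, b):  return not_op(nor_op(a, b))
def and_op(a, b): return nor_op(not_op(a), not_op(b))

for a in (0, 1):
    for b in (0, 1):
        assert or_op(a, b) == (a | b)
        assert and_op(a, b) == (a & b)
```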

Figure 7. Basic principle of binary compute-in-memory based on RRAM/PCM/MRAM: (a) peripheral-circuit scheme [35]; (b) storage-cell scheme [40]

Application prospects and challenges


Compute-in-memory chip technology, especially its non-volatile variants, offers high computing power, low power consumption, and low cost, and will therefore have great application prospects in the AIoT field. The challenges to its large-scale industrialization come mainly from two aspects. (1) The technical level: compute-in-memory chips involve device-chip-algorithm-application cross-layer collaboration, as shown in Figure 8. For example, the different performance requirements of subdivided application scenarios determine the design of the neural network algorithms and the chip; the algorithms rely on the collaboration of neural network frameworks, compilers, drivers, mapping tools, and the like with the chip architecture, which in turn depends on the devices, circuits, and foundry processes. All of this poses considerable challenges to the research, development, and fabrication of compute-in-memory chips, especially in terms of foundry support. In particular, compute-in-memory based on new storage media involves different physical principles, behavioral characteristics, and integration processes, and requires cross-layer collaboration to achieve optimal performance (accuracy, power consumption, latency, etc.) and cost.

(2) The industrial-ecosystem level: as an emerging technology, large-scale adoption is inseparable from building an industrial ecosystem, which requires vigorous collaboration in research, development, promotion, and application among chip manufacturers, software tool vendors, application integrators, and others, in order to match chip performance to real scenarios. Especially given that traditional chips already occupy most existing application scenarios, how to break into new markets and attract new users is the key to rapid industrialization. The success of NVIDIA GPUs provides a good reference: on the one hand, tools and services must be optimized to make the chips easy to use; on the other hand, head-on competition should be avoided as much as possible, using the advantages of compute-in-memory chips to open up new applications, new scenarios, and new markets that traditional chips cannot cover.

Figure 8. Schematic diagram of device-chip-algorithm-application cross-layer collaboration for compute-in-memory
