Embedded memory provides an implementation framework for AI

Research activity in the field of brain-inspired computing has grown enormously in recent years, driven largely by the attempt to move beyond the limitations of traditional von Neumann architectures, which are increasingly constrained by the bandwidth and latency of memory-logic communication. In neuromorphic architectures, memory is distributed and can be co-located with logic; since new resistive memory technologies can be integrated into the interconnect layers of CMOS processes, this co-location is readily achievable.

Although much of the current attention in embedded AI deployment focuses on implementing deep learning algorithms on large conventional computing systems, the impact on device and circuit technologies has been mixed. Advanced standard CMOS technology has been used to develop GPUs and dedicated accelerator circuits, but there has been no real push toward "biologically inspired" hardware. Emerging resistive memory (RRAM) devices could open new avenues, as they can emulate biologically plausible synaptic behavior at the nanoscale by modulating conductance at relatively low bias voltages; due to their (perceived) immaturity, however, this approach has so far been confined to research groups.

But these new devices could provide a solution to one of the main problems facing mass deployment of embedded AI in consumer and industrial products: energy efficiency. If the use of AI scales up, the energy overhead of transferring all data to cloud/server systems for analysis will quickly reach the limit of AI's economic viability. Furthermore, for real-time systems such as self-driving cars and industrial controls, latency remains an issue if the servers connected to the 5G infrastructure are concentrated in well-defined areas rather than distributed throughout the infrastructure. For these reasons, and in Europe also for privacy reasons, it will become increasingly important to have edge/point-of-use AI-enabled systems that are highly energy-efficient and, eventually, have incrementally improved local learning capabilities.

Embedded AI systems are ideal for processing data that requires real-time response and where energy is a major concern. Interest in such systems is growing, as evidenced by the success of the tinyML initiative. When dealing with sparse, time-domain, sensor-generated data streams (from microphones, LIDAR, ultrasound, etc.), biologically inspired approaches (i.e., where the memory element also acts as an interconnect and computational element) have an additional advantage. These systems can perform most of their operations in the analog domain, avoiding unnecessary multiple analog-to-digital conversions, and can use clockless, data-driven architectures to simplify data flow. Because there is no clock dissipation and the storage elements dissipate only during signal pulses, power consumption is very low in the absence of input (well suited to sparse signals), and a dedicated sleep mode may not be needed for battery-powered operation. In addition, non-volatility means parameters need to be set only at first power-up or at a system update, with no transfers from external sources on every power-up.
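As a rough illustration of why event-driven operation suits sparse signals, here is a minimal Python sketch (all names and the per-event energy figure are illustrative assumptions, not measured values) in which work and energy are spent only when an input event arrives:

```python
# Minimal sketch of event-driven (clockless) processing of a sparse stream.
# All names and the per-event energy number are illustrative assumptions.

def run_event_driven(events, weight, threshold, e_per_event_pj=1.0):
    """Integrate weighted input events; fire when the threshold is crossed.

    events: iterable of (timestamp, value) pairs -- a sparse sensor stream.
    Energy is charged per event only; silence costs (ideally) nothing.
    """
    membrane = 0.0
    energy_pj = 0.0
    output_spikes = []
    for t, value in events:
        membrane += weight * value      # work happens only on an event
        energy_pj += e_per_event_pj     # no clock -> no idle dissipation
        if membrane >= threshold:
            output_spikes.append(t)     # emit a spike downstream
            membrane = 0.0              # reset after firing
    return output_spikes, energy_pj

# A one-second stream with only 5 events costs 5 units of energy,
# regardless of how finely time would be discretized in a clocked design.
spikes, e = run_event_driven([(0.1, 1), (0.3, 1), (0.31, 1), (0.7, 1), (0.9, 1)],
                             weight=1.0, threshold=3.0)
print(spikes, e)  # -> [0.31], 5.0
```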

However, the use of novel resistive memories is not limited to such "edge" or "bio-inspired" applications; they can also benefit traditional all-digital clocked systems such as neural accelerators, where they can serve as an intermediate level of the memory hierarchy, acting as a slow non-volatile cache or a fast mass-storage layer. In this case, the benefit would be to reduce the area of fast DRAM and SRAM caches while also reducing the latency of accessing mass storage.

Hardware Platforms for Biologically Inspired Computing

From a technical perspective, RRAM is a good candidate for neuromorphic applications due to its CMOS compatibility, high scalability, strong endurance, and good retention characteristics. However, defining practical implementation strategies and useful applications of large-scale hybrid-integrated neuromorphic systems (CMOS neurons with resistive memory synapses) remains a formidable challenge.

Resistive memory devices such as phase-change memory (PCM), conductive-bridge RAM (CBRAM), and oxide RAM (OxRAM) have been proposed to emulate the biologically inspired synaptic functions that are essential for realizing neuromorphic hardware. Among the synaptic features emulated, spike-timing-dependent plasticity (STDP) is the most commonly used, but it is certainly not the only possibility, and others may prove more useful in real-world implementations.
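For concreteness, here is a minimal sketch of the pair-based STDP rule commonly emulated with such devices; the amplitudes and time constants are generic textbook values, not parameters of any specific RRAM:

```python
import math

# Pair-based STDP: the weight change depends on the relative timing of
# pre- and post-synaptic spikes. All constants below are illustrative.
A_PLUS, A_MINUS = 0.01, 0.012     # potentiation / depression amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0  # time constants in ms

def stdp_dw(t_pre, t_post):
    """Return the weight update for one pre/post spike pair."""
    dt = t_post - t_pre
    if dt > 0:   # pre before post: causal -> potentiate (LTP)
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    else:        # post before pre: anti-causal -> depress (LTD)
        return -A_MINUS * math.exp(dt / TAU_MINUS)

print(stdp_dw(10.0, 15.0))   # small positive update (LTP)
print(stdp_dw(15.0, 10.0))   # small negative update (LTD)
```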

An example of a circuit implementing these ideas and validating the approach is SPIRIT, presented at IEDM 2019 [2]. The implemented SNN topology is a single-layer, fully connected network aimed at performing inference on the MNIST database, with 10 output neurons, one per class. To reduce the number of synapses, images were scaled down to 12 × 12 pixels (144 synapses per neuron). Synapses are implemented using single-level-cell (SLC) RRAM, i.e., only the low- and high-resistance states are used. The structure is of the 1T-1R type, with one access transistor per cell, and multiple cells are connected in parallel to realize the various weights. Synapse quantization experiments in the learning framework show that integer weights between -4 and +4 are a good compromise between classification accuracy and the amount of RRAM. Since the aim is to obtain weighted currents, 4 RRAMs are used for the positive weights. For negative weights, an RRAM could also be used to encode a sign bit; however, since fault-tolerant triple redundancy would then be required, negative weights are better implemented with 4 additional RRAMs.
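A behavioral sketch of how such signed integer weights could map onto two groups of parallel SLC devices is shown below; the conductance values and the differential read are our assumptions for illustration, not the actual SPIRIT design:

```python
# Sketch: encode integer weights in [-4, +4] with two groups of 4 SLC RRAMs
# (one group contributes positive current, the other negative).
# G_LRS / G_HRS are illustrative conductances, not measured SPIRIT values.
G_LRS, G_HRS = 50e-6, 1e-6   # siemens; SLC: each device is either LRS or HRS

def weight_to_cells(w):
    """Return (positive_cells, negative_cells): how many of the 4 devices
    in each group are programmed to LRS for an integer weight w in [-4, 4]."""
    assert isinstance(w, int) and -4 <= w <= 4
    return (w, 0) if w >= 0 else (0, -w)

def synapse_current(w, v_read=0.1):
    """Net current for a read voltage limited to 100 mV (below SET)."""
    n_pos, n_neg = weight_to_cells(w)
    i_pos = (n_pos * G_LRS + (4 - n_pos) * G_HRS) * v_read
    i_neg = (n_neg * G_LRS + (4 - n_neg) * G_HRS) * v_read
    return i_pos - i_neg   # differential read cancels the HRS leakage

print(synapse_current(+3))  # net positive current, ~3 LRS units
print(synapse_current(-2))  # net negative current, ~2 LRS units
print(synapse_current(0))   # ~0: HRS leakage cancels between the two groups
```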

The guiding principle for the design of the "integrate-and-fire" (IF) analog neurons is the need to be mathematically equivalent to the tanh activation function used in supervised offline learning. The specifications are as follows: (1) stimuli with a synaptic weight of ±4 must produce spikes; (2) neurons must produce both positive and negative spikes; (3) they must have a refractory period during which they cannot spike but must keep integrating. The neuron is designed around a 200 fF MOM capacitor, and two comparators compare its voltage with the positive and negative thresholds. Since the RRAM has to be read with a voltage drop limited to 100 mV across its terminals, to prevent setting the device to the LRS, the resulting currents cannot be directly integrated by the neuron; they are therefore replicated by a current injector. The impact of programming conditions was evaluated, and conditions ensuring a sufficiently large memory window were used. Relaxation mechanisms do appear, but only over a short time frame (less than an hour); consequently, the classification accuracy does not decrease over time. Read stability was also verified by sending up to 800M spikes to the circuit.
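A behavioral sketch of such an IF neuron follows; the 200 fF capacitor value comes from the text, while the threshold voltage, refractory time, and pulse parameters are assumed for illustration:

```python
# Behavioral model of the IF neuron described above: bipolar thresholds and
# a refractory period during which it cannot spike but keeps integrating.
# C = 200 fF comes from the text; V_TH and T_REFRACT are assumptions.
C = 200e-15          # membrane capacitor (200 fF MOM)
V_TH = 0.5           # +/- firing thresholds in volts (assumed)
T_REFRACT = 1e-6     # refractory period in seconds (assumed)

def simulate_if(current_events):
    """current_events: list of (time_s, current_A, duration_s) pulses."""
    v, spikes, refract_until = 0.0, [], -1.0
    for t, i, dt in current_events:
        v += i * dt / C                     # integrate charge onto the capacitor
        if t < refract_until:
            continue                        # refractory: keep integrating, no spike
        if v >= V_TH:
            spikes.append((t, +1)); v = 0.0; refract_until = t + T_REFRACT
        elif v <= -V_TH:
            spikes.append((t, -1)); v = 0.0; refract_until = t + T_REFRACT
    return spikes

# Three small positive pulses, then one large negative pulse:
events = [(0e-6, 20e-9, 2e-9), (2e-6, 20e-9, 2e-9),
          (4e-6, 20e-9, 2e-9), (6e-6, -80e-9, 2e-9)]
print(simulate_if(events))  # -> [(4e-06, 1), (6e-06, -1)]
```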

The classification accuracy on the 10K test images of the MNIST database was measured to be 84%. This value has to be compared with the 88% accuracy obtained in ideal simulation, which is limited by the simple network topology (1 layer with 10 output neurons). The energy dissipation per synaptic event is 3.6 pJ; when the circuit logic and the SPI interface are included, it amounts to 180 pJ (which can be reduced by optimizing the communication protocol). Measurements show that image classification requires an average of 136 input spikes (for ΔS = 10): less than one spike accumulation per input, a 5× energy gain compared to equivalent MAC operations in classical coding at the 130 nm node. The energy gain comes from (1) the simplicity of the basic operation (accumulation rather than multiply-accumulate, as in classical coding) and (2) the sparsity of activity due to spike coding. The sparsity benefit will increase with the number of layers.
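To make the energy accounting concrete, here is a back-of-the-envelope sketch using the figures quoted above; note that the fan-out assumption (each input spike reaching all 10 output neurons) is ours, not a figure from the paper:

```python
# Back-of-the-envelope energy estimate per MNIST classification.
# The fan-out assumption (each input spike reaches all 10 output neurons)
# is ours; the per-event and per-image figures come from the text.
E_SYNAPTIC_EVENT = 3.6e-12   # J per synaptic event (from text)
E_WITH_LOGIC_SPI = 180e-12   # J per event incl. logic + SPI (from text)
SPIKES_PER_IMAGE = 136       # average input spikes per image (from text)
N_OUTPUT_NEURONS = 10        # one per MNIST class

events_per_image = SPIKES_PER_IMAGE * N_OUTPUT_NEURONS  # assumed fan-out
print(f"synaptic energy / image: {events_per_image * E_SYNAPTIC_EVENT * 1e9:.2f} nJ")
print(f"incl. logic + SPI      : {events_per_image * E_WITH_LOGIC_SPI * 1e9:.1f} nJ")
# -> ~4.90 nJ for the core array, ~245 nJ including the (unoptimized) SPI
```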

This small demonstrator shows how such an approach can be on par with traditional embedded approaches in accuracy, but with significantly lower power consumption. In fact, the rate code used in the SNN demonstration makes this implementation equivalent to a classical-coded one: transcoding from the classical domain to the spiking domain does not induce any loss in accuracy. However, the simple topology used in the proof of concept (a single-layer perceptron) explains the slightly lower classification accuracy compared to state-of-the-art deep learning models, which use larger networks with more layers. To close this gap, a more complex (MobileNet-like) topology is currently being implemented; classification accuracy should increase accordingly while retaining the same energy efficiency.
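As an illustration of how a classical activation can be transcoded to the spiking domain with bounded error, here is a generic rate-coding sketch; the maximum spike count is an assumed parameter, not taken from the demonstrator:

```python
# Rate coding: a classical activation (e.g., a pixel intensity in [0, 1])
# is transcoded into a spike count over a fixed observation window.
# MAX_SPIKES is an illustrative parameter; larger values reduce the
# quantization error of the rate code at the cost of more events (energy).
MAX_SPIKES = 16  # spikes emitted for a full-scale input (assumed)

def to_spike_count(activation):
    """Deterministic rate code: spike count proportional to the activation."""
    return round(max(0.0, min(1.0, activation)) * MAX_SPIKES)

def from_spike_count(count):
    """Decoding recovers the activation up to the quantization error."""
    return count / MAX_SPIKES

x = 0.73
n = to_spike_count(x)
print(n, from_spike_count(n))  # -> 12, 0.75 (error bounded by 1/MAX_SPIKES)
```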

The same approach can be extended to circuits embedded with microphones or LIDAR to analyze data streams locally and in real time, eliminating the need for transmission over a network. Both rate-coding and time-coding strategies can be used to optimize the network, depending on the information content of the signal. Initially, learning will be centralized and only inference will be integrated into the system, but some degree of incremental learning will be introduced in later generations.

Another way to take advantage of RRAM's attributes for embedded AI products is to use an analog architecture based on an RRAM crossbar array. Such arrays can provide denser implementations of the multiply-accumulate (MAC) function than conventional digital implementations, which is critical in both inference and learning circuits. If the further step of moving into the time domain and eliminating clocks is taken, compact low-power systems beyond the current state of the art can be obtained. Although very promising and extensively studied by academia, this approach has not yet been widely adopted by industry, pointing to the difficulty of designing, verifying, characterizing, and certifying analog asynchronous designs, and to the difficulty of scaling analog solutions.
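The crossbar idea itself is simple: Ohm's law performs the multiplications and Kirchhoff's current law performs the accumulation, all in one parallel analog step. A minimal sketch (with an illustrative conductance scale and differential columns for signed weights) follows:

```python
import numpy as np

# Analog crossbar MAC sketch: Ohm's law does the multiply (I = G * V),
# Kirchhoff's current law does the accumulate (currents sum on each line).
# Signed weights use differential pairs: w ~ (G_pos - G_neg).
G_SCALE = 10e-6  # siemens per unit weight (illustrative)

def crossbar_mac(weights, v_in):
    """weights: (n_out, n_in) signed matrix; v_in: (n_in,) input voltages."""
    g_pos = np.maximum(weights, 0.0) * G_SCALE   # positive-side conductances
    g_neg = np.maximum(-weights, 0.0) * G_SCALE  # negative-side conductances
    i_out = (g_pos - g_neg) @ v_in               # all MACs in one parallel step
    return i_out

W = np.array([[ 1.0, -2.0,  0.0],
              [ 0.5,  0.0,  3.0]])
V = np.array([0.1, 0.1, 0.0])   # note: zero inputs draw no current at all
print(crossbar_mac(W, V))       # output currents proportional to W @ V
```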

From our perspective, part of the perceived difficulty of these memories comes from the observed variability, but this is partly a reflection of experimental conditions: we observed better distributions when working on 300 mm wafers with a more mature integration process, so we expect the variability issue to be resolved during industrialization. Design tools are also on the horizon, and more accurate models are becoming available. Temperature changes certainly have an impact, but the statistical nature of this type of computation and its inherent robustness to moderate parameter changes during the inference phase make the ultimate impact far less significant than in the conventional analog designs the community is used to. One advantage of the analog crossbar approach is that when "zero" data is applied, no current flows at all.

Some problems are more fundamental. The first is that power efficiency and high parallelism come from a tradeoff between time multiplexing (frequency of operation) and area: what is the limit on network size (number of inputs or classes) for which the tradeoff remains favorable, and how does it depend on the implementation node? Another is the cyclability (endurance) of these memories. While endurance is sufficient for the inference phase, where the crossbar can be programmed during initialization with acceptable overhead, on-chip learning with the classical backpropagation scheme is out of reach because of the excessive write load implied by the number of iterations. However, very promising avenues using other learning methods are being explored and are expected to provide effective solutions in the coming years.
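To see why the write load of classical backpropagation collides with endurance, here is a rough order-of-magnitude sketch with assumed but typical figures:

```python
# Rough endurance arithmetic for on-chip backpropagation (all figures are
# illustrative assumptions, chosen to show the order of magnitude only).
ENDURANCE = 1e6             # typical RRAM write-cycle endurance (assumed)
EPOCHS = 20                 # training epochs (assumed)
UPDATES_PER_EPOCH = 60_000  # e.g. per-sample updates on an MNIST-sized set

writes_per_cell = EPOCHS * UPDATES_PER_EPOCH  # each cell rewritten per update
print(f"writes per cell: {writes_per_cell:.1e}")   # -> 1.2e+06
print("exceeds endurance!" if writes_per_cell > ENDURANCE else "ok")
# Inference-only use programs each cell once at initialization, so its
# write load is negligible; on-chip backprop is the hard case.
```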

Even before the introduction of this type of circuitry, technologies such as RRAM and 3D integration can be used in conventional implementations to provide solutions with smaller power budgets and smaller form factors. FPGA implementations for highly customized applications, software-only implementations running on MCUs or CPUs, and software running on highly parallel multi-cores/accelerators (such as general-purpose GPUs) are mainstream today. All of these can also benefit from local non-volatile memory, which can make FPGAs more compact and improve the memory hierarchy of MCU/CPU and multi-core/accelerator chips. In particular, with a dedicated version of monolithic 3D integration, RRAM planes can be inserted between CMOS neuron planes, opening the way to dense multilayer neural networks.

Within the framework of the European H2020 project NeuRAM3, we investigated this approach, leading an interdisciplinary team of EU R&D institutions working on the best match between advanced device technologies, circuit architectures, and algorithms for neuromorphic chips. Among the many results of this project, an example of OxRAM fabricated in the CoolCube 3D monolithic process, connected to the top and bottom CMOS layers, is shown in the figure below. Going forward, this technique could be used to integrate very dense arrays into complex CMOS circuit structures dedicated to AI.

Figure: OxRAM monolithically integrated in the CoolCube 3D process, within the interconnect between the top and bottom CMOS layers, opening the way to dense multilayer neural networks.

3D TSVs and Cu-Cu bonding are also promising candidates for compact neuromorphic systems that combine various elements in a highly integrated architecture, where partitioning is optimized for the application, or where embedded AI elements are tightly coupled to imagers or other sensing or actuating elements.
