Long short-term memory (LSTM) neural networks in memristor crossbar arrays

Original: Long short-term memory networks in memristor crossbar arrays
Author: Can Li et al.
Journal: Nature Machine Intelligence

One of the winter-vacation tasks assigned by my instructor was to translate several specified papers. The most interesting of them are this one and the 2020 Tsinghua Nature paper on a memristor-based compute-in-memory system. Below is a translation of the sections before the experimental results, plus the part of the results devoted to the memristor crossbar array.

Abstract:

Recent breakthroughs in recurrent deep neural networks based on long short-term memory (LSTM) units have driven major advances in artificial intelligence. However, because of their significantly increased complexity and large number of parameters, state-of-the-art LSTM models suffer from computational bottlenecks arising from limited memory capacity and limited data-transfer bandwidth. Here we experimentally demonstrate that the synaptic weights shared across time steps in an LSTM can be implemented in a memristor crossbar array, which has a small circuit footprint, can store a large number of parameters, and provides in-memory computing capability that helps overcome the "von Neumann bottleneck". We demonstrate the capability of our crossbar array as the core component in solving real-world regression and classification problems, showing that the memristor-based LSTM is a promising low-power, low-latency hardware platform for edge inference.

Main text:

The recent success of artificial intelligence has largely benefited from advances in deep neural networks. Among the many neural-network architectures, the LSTM is an important one. By learning to remember or forget previously observed data, LSTM-based recurrent neural networks (RNNs) are effective at analysing time-series data in tasks such as data prediction, natural-language understanding, machine translation, speech recognition and video surveillance. However, when an LSTM is implemented on conventional digital hardware, its complex architecture leads to degraded inference latency and power consumption. In the Internet of Things (IoT) era, as more and more applications require processing temporal data at the place where it is generated, these problems are becoming increasingly prominent. Although growing effort has been invested in new architectures to accelerate LSTM-based recurrent neural networks, low parallelism and the limited bandwidth between the computing and memory units remain outstanding issues. Finding an alternative computing paradigm for LSTM networks is therefore an urgent task.

The memristor is a two-terminal "memory resistor" that, by the laws of physics, can compute in the same place where information is stored (in-memory computing). This colocation of storage and computation eliminates the need to shuttle data between memory and processor. Built into crossbar structures, memristors have been successfully applied to fully connected feedforward neural networks and, compared with their CMOS-based counterparts, show large advantages in power consumption and inference latency. The short-term memory effect of some memristors has also been exploited for reservoir computing. On the other hand, state-of-the-art deep neural networks, including the LSTMs that have driven the recent success in temporal data processing, are built on structures more complex than fully connected networks. Implementing LSTM on memristor crossbar arrays had yet to be demonstrated, mainly because of the relative scarcity of large memristor arrays.

In this article, we experimentally demonstrate a core part of an LSTM network on a memristor crossbar array. The memristors are monolithically integrated on top of transistors to form one-transistor-one-memristor (1T1R) cells. By connecting a recurrent LSTM layer to a fully connected layer, we implemented in-situ training and inference of this LSTM-based multilayer recurrent neural network for regression and classification problems. All the matrix multiplications needed for inference and weight updates are physically performed on the memristor crossbar array. In experiments on the memristor hardware, these LSTM networks successfully predicted airline passenger numbers and identified individuals from their gait. This work shows that LSTM networks built on memristor crossbar arrays represent a promising alternative computing paradigm that is efficient in both speed and power consumption.

Results:

Memristor crossbar array for LSTM: Neural networks containing LSTM units are recurrent; that is, they not only have full connections between nodes in different layers, but also have recurrent connections between nodes of the same layer across different time steps, as shown in Fig. 1a. The recurrent connections within an LSTM unit also include gating units that control memorizing and forgetting, which allows the LSTM to learn long-term dependencies. The data flow in a standard LSTM unit is shown in Fig. 1b and can be expressed by equation (1) (the linear matrix operations) and equation (2) (the gated non-linear activations), or equivalently by equations (3) to (5).

$$\begin{pmatrix}\hat{a}^t\\\hat{i}^t\\\hat{f}^t\\\hat{o}^t\end{pmatrix}=W x^t+U h^{t-1}+b \qquad (1)$$

$$c^t=\tanh(\hat{a}^t)\odot\sigma(\hat{i}^t)+c^{t-1}\odot\sigma(\hat{f}^t),\qquad h^t=\tanh(c^t)\odot\sigma(\hat{o}^t) \qquad (2)$$

where $x^t$ is the input vector at the current time step, $h^t$ and $h^{t-1}$ are the output vectors at the current and previous time steps respectively, $c^t$ is the internal cell state, and $\odot$ denotes element-wise multiplication. $\sigma$ is the logistic sigmoid function, which produces the input, forget and output gates from $\hat{i}^t$, $\hat{f}^t$ and $\hat{o}^t$. The model parameters are stored in the weights $W$, the recurrent weights $U$ and the bias parameters $b$, for the cell activation ($a$) and for each gate ($i$, $f$, $o$) respectively. Because of this complex structure, state-of-the-art deep RNNs containing LSTM units have very large numbers of model parameters, which usually exceed the capacity of on-chip memory (typically static random-access memory, SRAM) and sometimes even of off-chip main memory (typically dynamic random-access memory, DRAM). As a result, network inference and training require parameters to be transferred from a separate memory chip to the processing unit, and this inter-chip data transfer greatly limits the performance of LSTM-based RNNs on conventional hardware.
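To make the data flow concrete, below is a minimal NumPy sketch of a single LSTM time step following equations (1) and (2). It illustrates the standard LSTM cell described above, not the paper's code; the row-wise stacking of the four blocks (a, i, f, o) into one matrix and the toy sizes are assumptions made here for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W (4n x m), U (4n x n) and b (4n,) hold the parameters for the cell
    activation (a) and the input (i), forget (f) and output (o) gates,
    stacked row-wise so one matrix product yields all four pre-activations.
    """
    z = W @ x_t + U @ h_prev + b                                        # equation (1)
    a_hat, i_hat, f_hat, o_hat = np.split(z, 4)
    c_t = np.tanh(a_hat) * sigmoid(i_hat) + c_prev * sigmoid(f_hat)     # equation (2)
    h_t = np.tanh(c_t) * sigmoid(o_hat)
    return h_t, c_t

# toy usage: input size m = 3, hidden size n = 2, 5 time steps
rng = np.random.default_rng(0)
m, n = 3, 2
W = rng.standard_normal((4 * n, m))
U = rng.standard_normal((4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.standard_normal((5, m)):
    h, c = lstm_step(x, h, c, W, U, b)
```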

To address this problem, we adopted memristor crossbar arrays for RNNs and stored the large number of parameters required by an LSTM-RNN as memristor conductances. The topology and data flow of this neural network are shown in Fig. 1c. The linear matrix multiplications are computed in situ in the memristor crossbar array, eliminating the need to move weight values back and forth: the model parameters are stored in the same crossbar array that performs the analog matrix multiplication. For the experiments described here, we connect the LSTM layer to a fully connected layer; in the future, such layers can be cascaded into more complex structures. For verification purposes, the gating units in the LSTM layer and the non-linear units in the fully connected layer are implemented in software in the current work, but they could be implemented with analog circuits that require no digital signal conversion, which would greatly reduce energy consumption and inference latency.
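As a sketch of this mapping (all shared LSTM parameters stored in one crossbar so that the linear part of equation (1) becomes a single vector-matrix multiplication per time step), the helpers below pack W, U and b into one matrix and concatenate the input, the recurrent output and a constant bias entry into one vector. The exact layout is an illustrative assumption, not the row/column assignment used on the chip.

```python
import numpy as np

def pack_lstm_weights(W, U, b):
    """Store [W | U | b] as one weight matrix, as a crossbar would hold it."""
    return np.hstack([W, U, b[:, None]])

def pack_crossbar_input(x_t, h_prev):
    """Concatenate input, recurrent output and a constant-1 entry for the bias,
    corresponding to bias rows driven by a fixed voltage at every time step."""
    return np.concatenate([x_t, h_prev, [1.0]])

# one analog multiply then yields all four pre-activation blocks of equation (1):
# z = pack_lstm_weights(W, U, b) @ pack_crossbar_input(x_t, h_prev)
```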

The analog matrix unit of our LSTM is implemented in a $128\times64$ 1T1R crossbar array, in which the memristors are monolithically integrated on top of a transistor array fabricated in a commercial foundry. The integrated Ta/HfO$_2$ memristors exhibit stable multilevel conductance, enabling matrix multiplication in the analog signal domain. With the current compliance controlled by the transistors, the integrated memristor array can be programmed to a predefined conductance matrix either by a write-and-verify scheme (previously used for analog signal and image processing and for ex-situ training of fully connected neural networks) or by a simple two-pulse scheme (previously used for in-situ training of fully connected neural networks, and also used for in-situ training of the LSTM in our work).

Inference in the LSTM layer is performed by applying voltages to the row wires of the memristor array and reading the currents from the virtually grounded column wires. The read current vector is the dot product of the memristor conductance matrix and the input voltage amplitude vector, obtained directly from the laws of physics (Ohm's law for the multiplications, Kirchhoff's current law for the summations). Each LSTM model parameter is encoded as the difference between the conductances of two memristors in the same column; by applying voltages of equal amplitude but opposite polarity to the corresponding pair of row wires, the crossbar array implements the subtraction. The voltages applied to dedicated row wires represent the biases and are fixed for all samples and time steps. The current read from the memristor crossbar array consists of four parts, representing the vectors $\hat{a}^t$, $\hat{i}^t$, $\hat{f}^t$ and $\hat{o}^t$ described in equation (1); these are non-linearly activated and gated, then converted into voltages (implemented in software in the current work). The voltage vector $h^t$ is then fed to the next (fully connected) layer and looped back to the LSTM layer itself at the next time step (as $h^{t-1}$).

The neural network is trained in situ on the memristor crossbar array to compensate for possible hardware imperfections, such as limited device yield, variation and noise in the conductance states, wire resistance and asymmetries in the analog peripheral circuitry. Before training, all memristor conductances are initialized by applying SET voltage pulses across the memristor devices together with fixed-amplitude pulses on the transistor gates. During training, inference is first performed on a batch of time-series data (a mini-batch), producing a time-series output. The memristor conductances are then adjusted to bring the inferred output closer to the target output (evaluated by a loss function; see the Methods section). The desired conductance changes are computed with the backpropagation-through-time (BPTT) algorithm on off-chip electronics (see Methods for details) and then applied to the memristor array. For memristors whose conductance needs to be decreased, we first apply a RESET voltage pulse to their bottom electrodes (with the top electrodes grounded) to initialize them to their low-conductance state. We then apply synchronized SET voltage pulses to the top electrodes and analog voltage pulses to the transistor gates ($\Delta V_{gate}\propto\Delta G$), with the bottom electrodes held at zero volts (grounded), to update the conductance values of all memristors in the array. The conductance update can be carried out row by row or column by column, as in previous work, and this two-pulse scheme has been shown in previous work to be effective at achieving linear and symmetric memristor conductance updates.
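The read scheme just described (signed weights as the difference of two conductances in a column, equal-amplitude voltages of opposite polarity on the paired rows, Ohm's law for the products and Kirchhoff's current law for the sums) can be modelled in a few lines. The conductance range and the simple positive/negative split below are illustrative assumptions, not the device parameters reported in the paper.

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4          # assumed conductance range in siemens

def weights_to_conductance_pairs(weights):
    """Encode signed weights as the difference of two non-negative conductances."""
    scale = (G_MAX - G_MIN) / np.abs(weights).max()
    g_pos = G_MIN + scale * np.clip(weights, 0, None)
    g_neg = G_MIN + scale * np.clip(-weights, 0, None)
    return g_pos, g_neg, scale

def crossbar_read(g_pos, g_neg, v_in):
    """Drive one row of each pair with +v and the other with -v; the current
    collected on each virtually grounded column is the signed dot product
    (Ohm's law for the products, Kirchhoff's current law for the sums)."""
    return g_pos @ v_in + g_neg @ (-v_in)

# toy check: the analog read recovers W @ v up to the encoding scale
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))
v = 0.2 * rng.standard_normal(3)   # read-voltage amplitudes
g_pos, g_neg, scale = weights_to_conductance_pairs(W)
assert np.allclose(crossbar_read(g_pos, g_neg, v) / scale, W @ v)
```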

The current work focuses on exploring the feasibility of neural networks with various structures (in particular LSTM networks) built on emerging analog devices such as memristors. For this purpose, we built a neural-network framework in MATLAB with a Keras-style interface, so that arbitrary network architectures can be configured, in particular the LSTM-fully connected network used in this work (detailed structure shown in Fig. 2). The matrix multiplications in the forward and backward passes and the weight updates are performed by the experimental memristor crossbar, which can be interchanged with a simulated memristor crossbar array or with a software backend using 32-bit floating-point arithmetic. This architecture enables a direct comparison between the crossbar-based neural network and a digital approach using the same algorithm and dataset. In the crossbar implementation, the framework communicates with our custom off-chip measurement system (sending signals to and receiving them from the memristor crossbar array), which can supply up to 128 independent analog voltages and simultaneously sense 64 current channels to complete the matrix multiplications and weight updates.
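The interchangeable-backend idea can be illustrated with a small interface sketch: the network code only calls matvec and update_weights, and the concrete backend decides whether those operations run in 32-bit floating point, in a device simulation, or on the physical crossbar through the measurement system. The class and method names here are hypothetical (the paper's framework is written in MATLAB); this Python sketch only mirrors the structure.

```python
import numpy as np
from abc import ABC, abstractmethod

class MatrixBackend(ABC):
    """Minimal interface the network code relies on; concrete backends may be
    a floating-point reference, a simulated crossbar or the real hardware."""

    @abstractmethod
    def matvec(self, v: np.ndarray) -> np.ndarray: ...

    @abstractmethod
    def update_weights(self, delta: np.ndarray) -> None: ...

class Float32Backend(MatrixBackend):
    """Software reference backend using 32-bit floating-point arithmetic."""
    def __init__(self, weights: np.ndarray):
        self.w = weights.astype(np.float32)

    def matvec(self, v):
        return self.w @ v.astype(np.float32)

    def update_weights(self, delta):
        self.w += delta.astype(np.float32)

class SimulatedCrossbarBackend(MatrixBackend):
    """Toy crossbar model: weights read back with multiplicative noise.
    The noise level is an arbitrary illustrative value."""
    def __init__(self, weights: np.ndarray, read_noise: float = 0.01):
        self.w = weights.copy()
        self.read_noise = read_noise

    def matvec(self, v):
        noisy_w = self.w * (1 + self.read_noise * np.random.randn(*self.w.shape))
        return noisy_w @ v

    def update_weights(self, delta):
        self.w += delta
```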

Origin blog.csdn.net/weixin_45358177/article/details/113896316