Deep Learning: A Closer Look at the Echo State Network (ESN)

                                                            Abstract

This blog grew out of a question I ran into while learning RNNs (Recurrent Neural Networks): how do we solve the long-term dependency problem caused by the recurrent structure and parameter sharing? In (1) I briefly review RNNs and show how they lead to the protagonist of this article, the ESN (Echo State Network). In (2) we start from the most basic question, what an ESN is, and work up to the point where we can use it to tackle the long-term dependency problem from one particular angle. Along the way I compare it with approaches that keep the network model unchanged (adjusting the weight initialization and activation function, adding regularization terms) and with approaches that change the architecture (such as LSTM), looking at the advantages and open problems of each so that they can complement one another, and I show how the model can be implemented in code. The key component of an ESN is the reservoir (also translated as "reserve pool"). Reservoirs can also be realized at the physical level, for example by putting the logic into gate arrays, implementing delay loops on lasers, or using superconducting electronics, and traditional reservoirs can do many things; here I only cover the recurrent neural network level. At the end, as a small gift, (3) summarizes some suggestions on learning sequence problems and RNNs (for reference only).

Note: This blog summarizes my own understanding. Where I have learned from experts, I mark the source. If you spot any problems, please point them out so we can discuss them together. Many thanks.

(1) Introduction to ESN: RNN long-term dependence problem

Because this article focuses on ESN, I will only say a few words about RNNs here. Newcomers who have never studied RNNs should learn them first. There are many resources online, e.g. Stanford's cs231n and the "flower book" (Huashu, i.e. the Deep Learning textbook by Goodfellow, Bengio and Courville). My own approach was the flower book plus papers plus Google, looking up experts' explanations whenever I got stuck on a point. Here is a brief review of RNNs:

1.RNN

1. Brief review of Recurrent Neural Network (RNN)

1.1 RNN structure diagram

For RNNs we work with computational graphs that map inputs and parameters to outputs and losses, and we need to understand the different ways the network structure can be drawn (Figure 1 is an example taken from the flower book, and Figure 2 is an example that is not in the flower book).

                                                                          figure 1

                                                                          figure 2

RNNs also come in unidirectional form (weights shared across time steps) and bidirectional form (which depends not only on past information but also on what follows, i.e. on the full context).

1.2 Three modes of RNN:

(1) Tensor to variable-length sequence

(2) Variable-length sequence to tensor

(3) Variable-length sequence to variable-length sequence (e.g. seq2seq, which consists of two parts: a many-to-one encoder and a one-to-many decoder)

1.3 Parameter learning in RNN

a^{(t)}=b+Wh^{(t-1)}+Ux^{(t)}

h^{(t)}=\tanh(a^{(t)})

o^{(t)}=c+Vh^{(t)}

\hat{y}^{(t)}=softmax(o^{(t)})
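
For readers who prefer code to formulas, the four forward-propagation equations above can be sketched in numpy roughly as follows. This is only a minimal sketch: the function name rnn_forward, the layer sizes and the random weights are my own illustrative assumptions and do not come from the flower book.

```python
import numpy as np

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Run the forward recurrence over a sequence; return hidden states and outputs."""
    h = h0
    hs, y_hats = [], []
    for x in x_seq:
        a = b + W @ h + U @ x        # a^(t) = b + W h^(t-1) + U x^(t)
        h = np.tanh(a)               # h^(t) = tanh(a^(t))
        o = c + V @ h                # o^(t) = c + V h^(t)
        e = np.exp(o - o.max())      # numerically stable softmax
        y_hats.append(e / e.sum())   # y_hat^(t) = softmax(o^(t))
        hs.append(h)
    return np.array(hs), np.array(y_hats)

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 5, 4, 6
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.5, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
x_seq = rng.normal(size=(T, n_in))

hs, y_hats = rnn_forward(x_seq, np.zeros(n_hidden), U, W, V, b, c)
print(hs.shape, y_hats.shape)
```

The same W, U and V are reused at every time step, which is exactly the parameter sharing discussed in this article.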

Readers who have read the flower book will know the forward propagation process. It is not hard to understand, so I will not go into detail.

Next, the loss is computed and its gradients are backpropagated, and the parameters are optimized step by step by gradient descent.

Two algorithms are involved:

BPTT (Backpropagation Through Time): a full forward pass is run first, the parameter gradients are then computed by backpropagation, and the parameters are updated. The intermediate states must be stored, so the space complexity is high.

RTRL (Real-Time Recurrent Learning): the gradient of the loss with respect to the parameters is computed online as the sequence is processed, so the history of intermediate states does not need to be stored and the memory cost does not grow with the sequence length.

I will not go through the backpropagation derivation here, but here is a numpy implementation of RNN backpropagation to share with you (real code looks quite different from a hand derivation):

http://gist.github.com/karpathy/d4dee566867f8291f086 (an example of min-char-rnn)

1.4 Practical applications of RNN

Examples include sentiment analysis, machine translation, processing sequence data (text in articles, audio in speech, price trends in the stock market), and generating image captions, all of which are quite interesting. I will not list more examples.

2.1 Analysis of long-term dependency issues

2.1.1 From the perspective of forward propagation :

Consider an RNN without inputs and without an activation function. The forward propagation formula is

                                            h_{t}=Wh_{t-1}, t=1,2,...

so, relative to some earlier time t_{0},

                                            h_{t}=W^{t-t_{0}}h_{t_{0}}

From the Jordan decomposition of W it follows that, as t grows, the components of h associated with eigenvalues of magnitude less than 1 decay toward zero, while the components associated with eigenvalues of magnitude greater than 1 diverge. In other words, information either vanishes or blows up during forward propagation.
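
To see this numerically, here is a small sketch that repeatedly applies a random matrix W to a random vector, once rescaled to spectral radius 0.5 and once to 1.5. The matrix, its size and the two radii are arbitrary choices of mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
A = rng.normal(size=(N, N))
rho_A = np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius of A
h0 = rng.normal(size=N)

def norm_after(W, h, steps):
    """Return ||W^steps h|| computed by repeated multiplication."""
    for _ in range(steps):
        h = W @ h
    return np.linalg.norm(h)

W_small = A * (0.5 / rho_A)   # all eigenvalue magnitudes <= 0.5
W_large = A * (1.5 / rho_A)   # largest eigenvalue magnitude = 1.5

n_small = norm_after(W_small, h0, 200)
n_large = norm_after(W_large, h0, 200)
print("spectral radius 0.5:", n_small)
print("spectral radius 1.5:", n_large)
```

After 200 steps the first norm has collapsed to practically zero and the second has grown astronomically, which is the vanishing/diverging behavior described above.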

2.1.2 From the perspective of gradient descent and backpropagation:

For the same network, let L_{t} denote the loss at time t. Then

                           \frac{\partial L_{t}}{\partial h_{t_{0}}}=\frac{\partial L_{t}}{\partial h_{t}}\cdot \frac{\partial h_{t}}{\partial h_{t-1}}\cdot \frac{\partial h_{t-1}}{\partial h_{t-2}}\cdot ...\cdot \frac{\partial h_{t_{0}+1}}{\partial h_{t_{0}}}=\frac{\partial L_{t}}{\partial h_{t}}\cdot W^{t-t_{0}}

A power of the weight matrix appears here as well, because the same weights are shared at every time step, so gradients either explode or vanish. Vanishing is generally the more common case: when a nonlinear function is composed with itself many times (in an RNN, a linear map followed by a tanh at every step), the result is highly nonlinear, and most of its values are associated with a tiny derivative, which makes the gradient vanish.

2.2 For ordinary RNN:

An ordinary RNN has input data and an activation function during forward propagation. Because of them, the divergence or decay of information may be alleviated. However, if the activation function is ReLU and its input stays greater than 0, the unit simply passes the value through and the previous problem remains.

In backpropagation, due to the formula:

                                       \frac{\partial h_{t}}{\partial h_{t-1}}=diag(\sigma ^{'}(U*x_{t}+W*h_{t-1}))*W

This Jacobian contains the derivative of the activation function. Since that derivative generally does not exceed 1 (for tanh and sigmoid), it can alleviate gradient explosion to some extent, but it may accelerate gradient vanishing, so the long-term dependency problem still exists.
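
The effect of the activation derivative can be checked with a small experiment. The sketch below assumes tanh as the activation \sigma and uses random U, W and inputs of my own choosing; it compares the one-step Jacobian diag(\sigma'(a))W with W itself and accumulates the product of Jacobians over 100 steps.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, T = 30, 3, 100                               # illustrative sizes
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # spectral radius 0.9
U = rng.normal(size=(N, K))
xs = rng.normal(size=(T, K))
W_norm = np.linalg.norm(W, 2)

h = np.zeros(N)
J_tanh = np.eye(N)      # running product of the Jacobians diag(tanh'(a_t)) W
J_linear = np.eye(N)    # same product for the linear network (W multiplied T times)
step_norms = []
for x in xs:
    h = np.tanh(U @ x + W @ h)
    D = np.diag(1.0 - h ** 2)          # tanh'(a) = 1 - tanh(a)^2, always in [0, 1]
    step_norms.append(np.linalg.norm(D @ W, 2))
    J_tanh = D @ W @ J_tanh
    J_linear = W @ J_linear

print("largest one-step norm:", max(step_norms), "vs ||W||:", W_norm)
print("||J_tanh||:", np.linalg.norm(J_tanh, 2))
print("||J_linear||:", np.linalg.norm(J_linear, 2))
```

Because every diagonal entry of D lies in [0, 1], no single step can amplify the gradient more than W alone does. That is the "alleviates explosion" part; the printed product norms show how quickly the gradient shrinks in this particular run.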

2.3 How to solve the long-term dependency problem of RNN

Without changing the network framework: choose a suitable weight initialization and activation function, add regularization terms, and so on.

By changing the network structure: skip connections across time steps can build an RNN with longer delays, or the network model itself can be changed, e.g. gated RNNs, the echo state network, and the most widely used one, LSTM (which does not mean that the echo state network is not good).

http://proceedings.mlr.press/v28/pascanu13.pdf

The link above is a paper. After reading it, I think it is quite good; it addresses the gradient problem in RNNs.

The above was just the appetizer; now comes the main course. Let's dive into the ocean of echo state networks together. Off we go!


(2) The door to the new world: Echo State Network (ESN)

1. Echo State Network (ESN)

1.1 What is the echo state network and how is it derived?

Many bloggers explain and analyze this network directly without going back to where it came from and which problems it solves, so I will elaborate on that first:

Back in the 1990s, several researchers independently discovered that RNNs suffer from vanishing and exploding gradients. One might hope to avoid the problem simply by staying in a region of parameter space where the gradients neither vanish nor explode. Unfortunately, in order to store memories robustly, so that small perturbations do not have a catastrophic effect on the model, an RNN must enter a region of parameter space where gradients vanish. In experiments, as the span of the dependencies to be captured increases, gradient-based optimization becomes more and more difficult, and the probability of successfully training the network with SGD quickly drops to zero for sequences beyond a certain length.

Recurrent neural networks are trained by directly optimizing the weights. Convergence is slow, training easily gets stuck in local optima, and the optimization problem is troublesome. It was also found that the recurrent (hidden-to-hidden) weights and the input-to-hidden weights are the parameters that are hardest to learn in an RNN. To avoid this, researchers proposed fixing these two sets of weights and learning only the output weights. This sidesteps the difficulty of determining the RNN weights and indirectly addresses the vanishing of information in forward propagation and the vanishing/exploding of gradients in backpropagation.

1.2 Network structure of echo state network

As the structure diagram shows, an ESN basically consists of three parts: the input layer, the reservoir (also translated as "reserve pool"), and the output layer.

Input layer: takes a K×1 input vector and feeds it into the reservoir by multiplying it with the weight matrix W_{in}.

Reservoir (a network of N nodes): each node corresponds to one state, written x_{n}. Inside the reservoir there is also a hidden->hidden connection weight matrix W between the internal neurons.

Output layer: the output (target y) is obtained by multiplying the reservoir state with the weight matrix W_{out}.

The values of the input layer u(t), the reservoir hidden state x(t), and the output layer y(t) at time t are:

                                        u(t)=[u_{1}(t),u_{2}(t),...,u_{K}(t)]^{T} (K-dimensional)

                                        x(t)=[x_{1}(t),x_{2}(t),...,x_{N}(t)]^{T} (N-dimensional)

                                        y(t)=[y_{1}(t),y_{2}(t),...,y_{L}(t)]^{T} (L-dimensional)

1.3 The specific process of echo state network propagation

Forward propagation process:

(1)(input->reservoir):W_{in}*u(t)

(2) reservoir (N node network): 

                                      x(t+1)=f(Wx(t)+W_{in}u(t)+W_{back}y(t))

The f function is the internal activation function of the reservoir

(3)  (reservoir->output):

                                     W_{out}*x(t)

The update iteration process of y:

                                     y(t+1)=f_{out}(W_{out}[x(t+1),u(t+1),y(t)]+W_{bias}^{out})
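
The forward propagation steps above can be sketched in numpy as follows. The sizes K, N, L, the uniform random weights, the choice f = tanh and an identity f_out are my own assumptions for illustration, and W_out here is just an untrained random placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, L = 2, 50, 1                                  # input, reservoir, output sizes
W_in = rng.uniform(-1, 1, size=(N, K))
W = rng.uniform(-1, 1, size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))     # rescale to spectral radius 0.9
W_back = rng.uniform(-1, 1, size=(N, L))
W_out = rng.uniform(-1, 1, size=(L, N + K + L))     # untrained placeholder
b_out = np.zeros(L)                                 # plays the role of W_bias^out

def esn_step(x, u, y, u_next, f=np.tanh, f_out=lambda z: z):
    """One update: x(t+1) = f(W x(t) + W_in u(t) + W_back y(t)), then y(t+1)."""
    x_next = f(W @ x + W_in @ u + W_back @ y)
    y_next = f_out(W_out @ np.concatenate([x_next, u_next, y]) + b_out)
    return x_next, y_next

# Drive the network with a short random input sequence
us = rng.uniform(-1, 1, size=(11, K))
x, y = np.zeros(N), np.zeros(L)
for t in range(10):
    x, y = esn_step(x, us[t], y, us[t + 1])
print(x.shape, y.shape)
```

Only W_out (and the output bias) would be fitted during training; everything else stays at its random initial value.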

(4) Loss function calculation:

                                     Loss=\sum_{t=1}^{N_{t}}\left \| y(t)-W_{out} x(t)\right \|^{2}+\eta \left \| W_{out} \right \|^{2}

      Derivation of the loss function:

First we write the loss (without the regularization term) in matrix form, where the columns of X are the reservoir states x(t) and the columns of Y are the targets y(t):

                                       L=\left \| W_{out}X-Y \right \|^{2}

Then expand it (the trace tr makes the expression a scalar when there are several outputs; for a single output it can be dropped):

                                       L=tr\left [ \left (W_{out}X-Y\right )^{T}\left (W_{out}X-Y\right ) \right ]

                                       L=tr\left ( X^{T}W_{out}^{T}W_{out}X-X^{T}W_{out}^{T}Y-Y^{T}W_{out}X+Y^{T}Y \right )

The two cross terms are transposes of each other and therefore contribute the same scalar, so they can be merged:

                                       L=tr\left ( X^{T}W_{out}^{T}W_{out}X \right )-2tr\left ( Y^{T}W_{out}X \right )+tr\left ( Y^{T}Y \right )

Matrix derivative rules:

                                        \frac{\partial tr(X^{T}W_{out}^{T}W_{out}X)}{\partial W_{out}}=2W_{out}XX^{T}

                                        \frac{\partial tr(Y^{T}W_{out}X)}{\partial W_{out}}=YX^{T}

so that

                                        \frac{\partial L}{\partial W_{out}}=2W_{out}XX^{T}-2YX^{T}

Setting the derivative to zero:

                                        W_{out}XX^{T}=YX^{T}

                                       W_{out}=YX^{T}(XX^{T})^{-1}

The regularization term was left out above. Adding it back contributes 2\eta W_{out} to the derivative, and the same steps give:

                                        W_{out}=YX^{T}(XX^{T}+\eta I)^{-1}
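
As a sanity check on the closed-form solution W_{out}=YX^{T}(XX^{T}+\eta I)^{-1}, the sketch below computes it on random data (the sizes and the value of \eta are arbitrary choices of mine) and verifies that the gradient of the regularized loss, 2W_{out}XX^{T}-2YX^{T}+2\eta W_{out}, is numerically zero there.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, T = 20, 2, 200                 # reservoir size, output size, time steps
X = rng.normal(size=(N, T))          # stand-in for collected reservoir states
Y = rng.normal(size=(L, T))          # stand-in for targets
eta = 1e-2                           # regularization constant

A = X @ X.T + eta * np.eye(N)

# Closed form exactly as written in the text
W_out = Y @ X.T @ np.linalg.inv(A)

# Same solution via a linear solve, usually preferable numerically
W_solve = np.linalg.solve(A, X @ Y.T).T

# Gradient of ||W X - Y||^2 + eta ||W||^2 evaluated at the solution
grad = 2 * W_out @ X @ X.T - 2 * Y @ X.T + 2 * eta * W_out
print("max |gradient|:", np.max(np.abs(grad)))
```

The transpose in the second variant works because XX^{T}+\eta I is symmetric.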

The first term of the loss function is an ordinary linear regression error and is a convex function of the output weights. The second term is the regularization (penalty) term, which is used to prevent problems such as overfitting.

Note: W_{in}, W and W_{back} are all set in advance: they are generated randomly and are not trained. What we want to learn is W_{out}. Since the readout is a standard linear function, the output matrix can be found by least squares (minimizing the MSE). The optimal solution has a simple closed form and does not require gradient descent or backpropagation.

 1.4 Main parameters in the echo state network

1.4.1 Spectral radius

What is the spectral radius? The spectrum of a matrix and its spectral radius play an important role in eigenvalue estimation, generalized inverse matrices, numerical analysis, and numerical algebra. The spectral radius of a matrix is the maximum modulus of its eigenvalues.
An important property is that the spectral radius of any complex matrix is not greater than any of its induced norms.

For a small matrix you can compute all the eigenvalues and take the one with the largest modulus as the spectral radius. For matrices of larger dimension you need an iterative algorithm.
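
Both ways of computing the spectral radius can be tried out in a few lines. The sketch below uses np.linalg.eigvals for the direct computation and plain power iteration as one possible iterative algorithm. Note my assumption: the test matrix has only positive entries, so its dominant eigenvalue is real and positive and power iteration converges; for general matrices with a complex dominant pair this simple version would not be enough.

```python
import numpy as np

def spectral_radius(W):
    """Maximum modulus of the eigenvalues, computed from all eigenvalues."""
    return np.max(np.abs(np.linalg.eigvals(W)))

def power_iteration_radius(W, n_iter=200, seed=0):
    """Estimate the dominant eigenvalue magnitude by repeated multiplication."""
    rng = np.random.default_rng(seed)
    v = rng.random(W.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iter):
        w = W @ v
        lam = np.linalg.norm(w)   # current estimate of |lambda_max|
        v = w / lam
    return lam

rng = np.random.default_rng(0)
W = rng.random((50, 50))          # positive entries (illustrative choice)

rho_exact = spectral_radius(W)
rho_iter = power_iteration_radius(W)
norms = [np.linalg.norm(W, p) for p in (1, 2, np.inf)]   # induced norms
print(rho_exact, rho_iter, norms)
```

The printed induced norms (1-norm, 2-norm, infinity-norm) are all at least as large as the spectral radius, in line with the property stated above.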

Here the spectral radius refers to the largest eigenvalue magnitude of the hidden->hidden Jacobian J^{(t)}, which is determined by the hidden->hidden weight matrix. The original idea was to make the eigenvalues of the Jacobian of the state-to-state transition function close to 1.

When the spectral radius is less than 1 and tanh is used as the activation function, the influence of past inputs and of the initial reservoir state on the network fades away after a long enough time. Otherwise, the state or input of the network may be amplified over many iterations, which leads to chaotic behavior or even failure of the ESN. Since the reservoir is randomly connected, keeping the spectral radius below 1 is the usual way to keep the network stable.
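
This fading of the initial state can be observed directly. In the sketch below, two copies of the same reservoir start from different random states and receive the same input sequence. One caveat about my setup: I rescale W by its largest singular value rather than by its spectral radius. This is a stricter condition (it forces the spectral radius to be at most 0.9 as well), chosen because with tanh it guarantees that the two trajectories contract at every step.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 100, 200
W = rng.normal(size=(N, N))
W *= 0.9 / np.linalg.norm(W, 2)      # largest singular value 0.9
W_in = rng.uniform(-1, 1, size=N)
u = rng.uniform(-1, 1, size=T)       # one input sequence shared by both runs

x_a = rng.uniform(-1, 1, size=N)     # two different initial reservoir states
x_b = rng.uniform(-1, 1, size=N)
gap = []
for t in range(T):
    x_a = np.tanh(W @ x_a + W_in * u[t])
    x_b = np.tanh(W @ x_b + W_in * u[t])
    gap.append(np.linalg.norm(x_a - x_b))

rho = np.max(np.abs(np.linalg.eigvals(W)))
print("spectral radius:", rho)
print("state gap after 1 step:", gap[0], "after", T, "steps:", gap[-1])
```

After enough steps the two states coincide up to numerical precision, so the reservoir state depends only on the input history and no longer on where it started.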

For the derivation of the spectral radius and how to choose it in practice, my personal suggestion is to learn from papers. I have read several, and I recommend the following one.

Ying-Cheng Lai at ASU, a leading researcher in nonlinear dynamics and complex networks, has an article that studies the role of the reservoir's spectral radius in echo state networks: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=42cK_xMAAAAJ&cstart=100&pagesize=100&citation_for_view=42cK_xMAAAAJ:4QKQTXcH0q8C

The content is very good, I recommend reading it.

1.4.2 Reservoir size and depth

Reservoirs are usually large, i.e. the state x(n) is an N-dimensional vector with a large N. The larger N is, the stronger the fitting ability and the higher the accuracy tends to be, but efficiency drops and overfitting can occur. The number of neurons has to be fixed at initialization, since it determines the size of the reservoir. Because an ESN only learns the output weights, which fit the target linearly from the reservoir state, an ESN generally needs many more nodes than a conventional neural network, i.e. N>K.

2. Build an echo state network

(1) Initialization: set up W_{in}, W and W_{back}, the number of neuron nodes in the reservoir (its size), the spectral radius, and the random sparse connections inside the reservoir. The only parameters that need to be learned are W_{out}, which linearly fit the target from the reservoir state.

(2) Randomly generate the connection matrix, which fixes which neurons are connected as well as the direction and weight of each connection, and then scale it. The scaling is essentially a normalization. Sometimes the randomly generated matrix is simply multiplied by a fixed scaling factor. This is faster than scaling with the eigenvalues, but loses a good deal of accuracy. Why scale at all? The reason is similar to weight initialization in other neural networks, where weights are usually initialized between 0 and 1 (or -1 and 1) for two reasons: (a) the sigmoid and tanh activation functions vary noticeably for inputs between 0 and 1, but change very little once the input is much greater than 1; (b) the derivative of the activation function is close to 0 in that flat region, so the gradient becomes too small and the weights cannot be updated properly. An ESN does not use gradients to update weights, so the first reason is the main one. Finally, the input weights (and the feedback weights, if used) are randomly generated as well; the output weights are the only ones that are learned.
These parameters affect the length of the network's short-term memory. The smaller the input weights and the closer the spectral radius of the internal matrix is to 1, the longer the short-term memory of the network. However, longer memory also reduces the network's ability to model "rapidly changing" systems.
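
Steps (1) and (2) can be sketched as a small helper that generates a sparse random matrix and rescales it with its eigenvalues to a target spectral radius. The function name make_reservoir and the default density and radius are hypothetical choices of mine, not values prescribed by any particular ESN paper.

```python
import numpy as np

def make_reservoir(N, density=0.1, rho=0.9, seed=0):
    """Sparse random reservoir matrix rescaled to spectral radius rho."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, size=(N, N))
    W[rng.random((N, N)) > density] = 0.0          # keep roughly `density` of the entries
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    if radius == 0:
        raise ValueError("matrix too sparse: all eigenvalues are zero")
    return W * (rho / radius)                      # eigenvalue-based scaling

W = make_reservoir(100, density=0.1, rho=0.9)
achieved = np.max(np.abs(np.linalg.eigvals(W)))
fill = np.count_nonzero(W) / W.size
print("spectral radius:", achieved, "fraction of nonzero weights:", fill)
```

Multiplying by a fixed constant instead (the faster alternative mentioned above) skips the eigenvalue computation, but the resulting spectral radius then varies from one random matrix to the next.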

(3) Then train: feed the K-dimensional input vectors into the reservoir one step at a time and collect the resulting states (this is a forward pass only, no backpropagation is involved). Note that there is an "idling" (washout) phase, which serves to initialize the state of the reservoir. Why is it needed? Because the internal connections of the reservoir are random and the reservoir starts from an arbitrary state, the states produced by the first part of the input sequence are relatively noisy, so the first part of the data is used only to warm up the reservoir and is discarded before fitting.

Linear regression then determines W_{out} (written W here):

                                        \min_{W}\left \| WX-Y\right \|_{2}^{2}+\lambda \left \| W \right \|_{2}^{2}

                                        W=YX^{T}(XX^{T}+\lambda I)^{-1}

For the detailed derivation, see the derivation of the loss function in 1.3 (4) above.

The article below derives this a little differently from my approach, but it is quite good. It solves the least squares regression with an L2-norm penalty term. For a more detailed derivation of how linear regression determines the output weights, you can read that blog:
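
Putting step (3) together, here is a toy sketch: a reservoir is driven by a sine wave, the first T_washout states are discarded, and the readout is fitted with the ridge formula above to predict the next value of the signal. The signal, all sizes and the value of \lambda (lam in the code) are illustrative assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(42)
N, T, T_washout, lam = 100, 1000, 100, 1e-6

# Fixed random reservoir and input weights (never trained)
W = rng.uniform(-1, 1, size=(N, N))
W[rng.random((N, N)) > 0.1] = 0.0
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, size=N)

# Toy task: predict the next sample of a sine wave
signal = np.sin(2 * np.pi * np.arange(T + 1) / 50)
u, target = signal[:-1], signal[1:]

# Forward pass only: collect the reservoir states
x = np.zeros(N)
states = np.zeros((N, T))
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])
    states[:, t] = x

# Washout: drop the first T_washout steps, then fit the ridge readout
X = states[:, T_washout:]
Y = target[None, T_washout:]
W_out = np.linalg.solve(X @ X.T + lam * np.eye(N), X @ Y.T).T

mse = np.mean((W_out @ X - Y) ** 2)
print("training MSE:", mse)
```

The whole "training" is a single linear solve, which is why ESNs are so fast to fit.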

https://blog.csdn.net/cassiePython/article/details/80389394

3. Review of some necessary knowledge and discussion of complexity in reservoirs (reservoirs are quite critical, so I list some points separately)

(1) The hidden->hidden weight matrix in the reservoir must be set so that the input is echoed in the recurrent state of the reservoir. Whether the system is linear or nonlinear, the magnitude of the weights and the spectral radius both have an impact. Generally, we use a nonlinear function inside the reservoir and a linear regression for the output.

(2) It is important to use sparse connections in the reservoir, i.e. to set almost all hidden->hidden weights to 0 and keep a few fairly large ones, rather than using a dense matrix of medium-sized weights. This creates a loosely coupled internal structure, so that information can hang around in one part of the network without propagating to other parts too quickly.

(3) It is also important to choose the scale of the input->hidden connections carefully. These connections need to drive the state of the reservoir without wiping out the relevant recent history that the reservoir already contains.

(4) Characteristics of the reservoir: randomly generated, large, sparsely connected, and not subject to nonlinear optimization, which makes training very fast.

(5) The weight matrix in the reservoir is random, but it may still need to follow a suitable (stable) distribution, e.g. uniform, Gaussian, or a mixture of Gaussians. This, along with the sparsity of the matrix, is open to discussion.

4. List some other questions

(1) Because the mapping from the reservoir to the output layer is just a linear regression, there is no need to run gradient descent and backpropagation through the network to optimize the parameters. Note that there is also a connection from the output layer back to the reservoir, the weight matrix W_{back}. These connections are not always necessary, but they help tell the reservoir what output it has produced so far.

(2) Reservoir computing research treats the recurrent network as a dynamical system and sets the input and recurrent weights so that the system is close to the edge of stability. The system is described by a discrete map, so the state at time t+1 depends on the past state and the input.

(3) The activation function here is usually the hyperbolic tangent (tanh). The regularization term and the bias term are very important, and they need to be tuned for the specific task to achieve the best performance.

5. Analysis of advantages and disadvantages of ESNs:

5.1 Good aspects:

Very fast to train (only a linear model is fitted)

They show that sensible weight initialization is very important

Very impressive modeling of one-dimensional time series

5.2 Bad aspects:

They often need many more hidden units (i.e. units in the reservoir) than an RNN that learns its hidden->hidden weights, because only the output weight matrix is learned.

They do not handle high-dimensional data such as frames of acoustic coefficients or video frames well

6. Practical applications of ESNs

This is an example of predicting the time series of a one-dimensional dynamical system, forecasting 1000 steps ahead. The code (from the repository linked below) is as follows:

https://github.com/bnuliujing/EchoStateNetworks

import numpy as np
import matplotlib.pyplot as plt


class ESN():
    def __init__(self, data, N=1000, rho=1, sparsity=3, T_train=2000, T_predict=1000, T_discard=200, eta=1e-4, seed=2050):
        self.data = data
        self.N = N  # reservoir size
        self.rho = rho  # spectral radius
        self.sparsity = sparsity  # average degree (expected number of nonzero connections per node)
        self.T_train = T_train  # training steps
        self.T_predict = T_predict  # prediction steps
        self.T_discard = T_discard  # discard the first T_discard steps (washout)
        self.eta = eta  # regularization constant
        self.seed = seed  # random seed

    def initialize(self):
        """
        Initialize the connection weight matrices W_IR and W_res.
        W_IR (N*1) maps the input to the reservoir; W_res (N*N) holds the
        recurrent connections inside the reservoir.
        """
        if self.seed > 0:
            np.random.seed(self.seed)
        # N x 1 matrix with entries drawn uniformly from [-1, 1]
        self.W_IR = np.random.rand(self.N, 1) * 2 - 1  # [-1, 1] uniform
        # N x N matrix with entries drawn uniformly from [0, 1]
        W_res = np.random.rand(self.N, self.N)
        # zero out every entry greater than sparsity / N, keeping about `sparsity` connections per row
        W_res[W_res > self.sparsity / self.N] = 0
        # divide by the largest eigenvalue modulus so that the spectral radius becomes 1
        W_res /= np.max(np.abs(np.linalg.eigvals(W_res)))
        # then multiply by the target spectral radius
        W_res *= self.rho  # set spectral radius = rho
        self.W_res = W_res

    def train(self):
        u = self.data[:, :self.T_train]  # training data, T_train = 2000
        assert u.shape == (1, self.T_train)
        r = np.zeros((self.N, self.T_train + 1))  # initialize reservoir state r(N*(T_train + 1))
        for t in range(self.T_train):
            # @ is the matrix multiplication operator (Python 3.5+)
            r[:, t+1] = np.tanh(self.W_res @ r[:, t] + self.W_IR @ u[:, t])
        # discard the first T_discard steps of r to obtain r_p
        self.r_p = r[:, self.T_discard+1:]  # length=T_train-T_discard
        v = self.data[:, self.T_discard+1:self.T_train+1]  # target
        self.W_RO = v @ self.r_p.T @ np.linalg.pinv(
            self.r_p @ self.r_p.T + self.eta * np.identity(self.N))
        train_error = np.sum((self.W_RO @ self.r_p - v) ** 2)
        print('Training error: %.4g' % train_error)

    def predict(self):
        u_pred = np.zeros((1, self.T_predict))  # predicted outputs, shape (1, T_predict)
        r_pred = np.zeros((self.N, self.T_predict))  # reservoir states, shape (N, T_predict)
        r_pred[:, 0] = self.r_p[:, -1]  # warm start from the last training state
        for step in range(self.T_predict - 1):
            u_pred[:, step] = self.W_RO @ r_pred[:, step]
            r_pred[:, step + 1] = np.tanh(self.W_res @
                                          r_pred[:, step] + self.W_IR @ u_pred[:, step])
        u_pred[:, -1] = self.W_RO @ r_pred[:, -1]
        self.pred = u_pred

    def plot_predict(self):
        ground_truth = self.data[:,
                                 self.T_train: self.T_train + self.T_predict]
        plt.figure(figsize=(12, 4))
        plt.plot(self.pred.T, 'r', label='predict', alpha=0.6)
        plt.plot(ground_truth.T, 'b', label='True', alpha=0.6)
        plt.legend()  # show the labels defined above
        plt.show()

    def calc_error(self):
        ground_truth = self.data[:,
                                 self.T_train: self.T_train + self.T_predict]
        rmse_list = []
        for step in range(1, self.T_predict+1):
            error = np.sqrt(
                np.mean((self.pred[:, :step] - ground_truth[:, :step]) ** 2))
            rmse_list.append(error)
        return rmse_list


if __name__ == "__main__":
    # http://minds.jacobs-university.de/mantas/code
    data = np.load('mackey_glass_t17.npy')  # data.shape = (10000,)
    data = np.reshape(data, (1, data.shape[0]))  # data.shape = (1, 10000)
    print(data.shape)
    esn = ESN(data)
    esn.initialize()
    esn.train()
    esn.predict()
    esn.plot_predict()

The data set used by the code can be found at the bottom of this blog:

https://blog.csdn.net/comli_cn/article/details/109394553

import torch.nn
from torchvision import datasets, transforms
from torchesn.nn import ESN
import time


def Accuracy_Correct(y_pred, y_true):
    labels = torch.argmax(y_pred, 1).type(y_pred.type())
    correct = len((labels == y_true).nonzero())
    return correct


def one_hot(y, output_dim):
    onehot = torch.zeros(y.size(0), output_dim, device=y.device)

    for i in range(output_dim):
        onehot[y == i, i] = 1

    return onehot


def reshape_batch(batch):
    batch = batch.view(batch.size(0), batch.size(1), -1)
    return batch.transpose(0, 1).transpose(0, 2)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available
dtype = torch.float
torch.set_default_dtype(dtype)
loss_fcn = Accuracy_Correct

batch_size = 256  # Tune it according to your VRAM's size.
input_size = 1
hidden_size = 500
output_size = 10
washout_rate = 0.2

if __name__ == "__main__":
    train_iter = torch.utils.data.DataLoader(
        datasets.MNIST('./datasets', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))])),
        batch_size=batch_size, shuffle=True, num_workers=1, pin_memory=True)

    test_iter = torch.utils.data.DataLoader(
        datasets.MNIST('./datasets', train=False,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))])),
        batch_size=batch_size, shuffle=False, num_workers=1, pin_memory=True)

    start = time.time()

    # Training
    model = ESN(input_size, hidden_size, output_size,
                output_steps='mean', readout_training='cholesky')
    model.to(device)

    # Fit the model
    for batch in train_iter:
        x, y = batch
        x = x.to(device)
        y = y.to(device)

        x = reshape_batch(x)
        target = one_hot(y, output_size)
        washout_list = [int(washout_rate * x.size(0))] * x.size(1)

        model(x, washout_list, None, target)
        model.fit()

    # Evaluate on training set
    tot_correct = 0
    tot_obs = 0

    for batch in train_iter:
        x, y = batch
        x = x.to(device)
        y = y.to(device)

        x = reshape_batch(x)
        washout_list = [int(washout_rate * x.size(0))] * x.size(1)

        output, hidden = model(x, washout_list)
        tot_obs += x.size(1)
        tot_correct += loss_fcn(output[-1], y.type(torch.get_default_dtype()))

    print("Training accuracy:", tot_correct / tot_obs)

    # Test (reset the counters so that the training results are not included)
    tot_correct = 0
    tot_obs = 0
    for batch in test_iter:
        x, y = batch
        x = x.to(device)
        y = y.to(device)

        x = reshape_batch(x)
        washout_list = [int(washout_rate * x.size(0))] * x.size(1)

        output, hidden = model(x, washout_list)
        tot_obs += x.size(1)
        tot_correct += loss_fcn(output[-1], y.type(torch.get_default_dtype()))

    print("Test accuracy:", tot_correct / tot_obs)

    print("Ended in", time.time() - start, "seconds.")

This code uses the MNIST data set and PyTorch (together with the torchesn package). You can find it on GitHub and run it.

After crossing mountains and rivers, this journey is finally at an end. Before closing the second part, I recommend one more very good paper. Geoffrey Hinton, the academic leader I admire most, is one of its authors, and as I understand it the paper connects ESN ideas with training based on momentum and rmsprop. Without further ado, here is the link:

http://proceedings.mlr.press/v28/sutskever13.pdf


(3) Sharing of RNN learning experience:

When I was exchanging experiences with senior students during the holidays, one idea struck me as a good one: learning is not just a forward pass; we often need to backpropagate to find the optimum. The former, needless to say, is the solid step-by-step learning process. The latter means starting from a problem, by asking questions or similar methods, then sorting out everything we have learned before and supplementing and updating it. Take the question of how to solve the long-term dependency problem of RNNs as an example. We approach it from two directions: the cause and the solution. To know the cause, we need a comprehensive grasp of the structure, modes, mechanism, loss function, and mathematical derivation of RNNs, so that we do not miss anything when analyzing it. This way we get a firmer grip on what we have learned and consolidate it again every time we think it through. We might call this curiosity.

One small point: asking questions and thinking more is the best way to digest what you have learned. That's all. Everyone is welcome to exchange experiences with me at any time.

Origin blog.csdn.net/Tomorrow_bitter/article/details/126356552