Deep Learning for Text Processing (NLP)

Introduction


Deep learning for text is mainly the domain of natural language processing (NLP). As deep learning techniques have advanced, the field of NLP has made great progress. Here are some major applications and techniques of deep learning for processing text:

  1. Word Embeddings: Word embeddings map the words in a vocabulary to dense vectors. Commonly used methods are Word2Vec, GloVe and FastText. These vectors capture semantic similarity between words.
  2. Recurrent Neural Networks (RNN): An RNN is a neural network structure for processing sequential data such as text. It captures temporal dependencies in a sequence by carrying state over from previous time steps.
  3. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU): Because plain RNNs suffer from a short-term memory problem, LSTM and GRU were introduced to capture long-distance dependencies.
  4. Transformer architecture: Built on the self-attention mechanism, Transformer-based models such as BERT, GPT and T5 currently achieve the best performance on many NLP tasks.
  5. Sequence-to-sequence models (Seq2Seq): Often used for tasks such as machine translation and text summarization, usually built on LSTM or Transformer structures.
  6. Convolutional Neural Networks (CNN): Although CNNs are primarily used for image processing, they can also capture local patterns in text, for example in sentiment analysis and text classification tasks.
  7. Transfer Learning: With pre-trained models such as BERT, we can easily fine-tune for a specific task, reducing training time and data requirements.
  8. Attention Mechanisms: Attention enables a model to focus on specific parts of an input sequence, which is particularly useful in tasks such as machine translation and text summarization.

Common applications for processing text include: text classification, sentiment analysis, named entity recognition, text summarization, machine translation, question answering systems, and more.

Before text can be used in a deep learning model, it usually has to be preprocessed: tokenization (word segmentation), stemming, stop-word removal, conversion to word embeddings, and so on. The preprocessed text can then be fed to the model as input.
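
A minimal preprocessing sketch in plain Python (the corpus, stop-word list and variable names here are purely illustrative):

corpus = ["The cat sat on the mat", "Dogs chase cats"]
stop_words = {"the", "on"}

# tokenize by whitespace, lowercase, and remove stop words
tokenized = [[w for w in sentence.lower().split() if w not in stop_words]
             for sentence in corpus]

# build a vocabulary that maps each token to an integer id
vocab = {}
for sentence in tokenized:
    for token in sentence:
        vocab.setdefault(token, len(vocab))

# convert each sentence to a list of ids, ready for an embedding layer
encoded = [[vocab[token] for token in sentence] for sentence in tokenized]
print(encoded)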

1. Backpropagation

1.1 Example process implementation


1.2 Forward Propagation

  • Starting with the input layer, pass data through each layer.
  • At each layer, the output of the node is computed using the current weights.
  • These outputs become the input to the next layer.

1.3 Calculate the loss

  • Once the data has passed through all the layers and produced an output, a loss function (such as mean squared error) can be calculated to measure the accuracy of the network's predictions.

1.4 Backpropagating the error

  • Compute the output layer error and backpropagate it to the previous layers.
  • Calculate the error for each layer using the chain rule.

1.5 Updating weights

  • The weights of each layer are adjusted using the learning rate and previously calculated gradients.
  • The basic formula is: Δw = −η · ∇J, where η is the learning rate and ∇J is the gradient of the loss function J with respect to the weight w.
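
For example, with learning rate η = 0.1 and ∇J = 0.4 for a particular weight, the update is Δw = −0.1 × 0.4 = −0.04, i.e. that weight is nudged slightly in the direction that reduces the loss.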

1.6 Iteration

  • Repeat the above steps multiple times until the loss converges to a minimum or other stopping criteria are met.

The core of backpropagation is to calculate the partial derivative of the loss function with respect to each weight through the chain rule. These partial derivatives provide direction on how the weights should be adjusted to reduce error.

To implement backpropagation, an optimization algorithm, such as gradient descent or its variants (such as stochastic gradient descent, Adam, RMSprop, etc.), is usually used to update the weights in the network.

1.7 Backpropagation and Adam code example

#coding:utf8

import torch
import torch.nn as nn
import numpy as np
import copy

"""
基于pytorch的网络编写
手动实现梯度计算和反向传播
加入激活函数
"""

class TorchModel(nn.Module):
    def __init__(self, hidden_size):
        super(TorchModel, self).__init__()
        self.layer = nn.Linear(hidden_size, hidden_size, bias=False)
        self.activation = torch.sigmoid
        self.loss = nn.functional.mse_loss  #use mean squared error as the loss

    #if ground-truth labels are passed in, return the loss; otherwise return the prediction
    def forward(self, x, y=None):
        y_pred = self.layer(x)
        y_pred = self.activation(y_pred)
        if y is not None:
            return self.loss(y_pred, y)
        else:
            return y_pred


#Hand-written model that takes a weight matrix as its argument
class DiyModel:
    def __init__(self, weight):
        self.weight = weight

    def forward(self, x, y=None):
        x = np.dot(x, self.weight.T)
        y_pred = self.diy_sigmoid(x)
        if y is not None:
            return self.diy_mse_loss(y_pred, y)
        else:
            return y_pred

    #sigmoid
    def diy_sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    #manual implementation of MSE (mean squared error) loss
    def diy_mse_loss(self, y_pred, y_true):
        return np.sum(np.square(y_pred - y_true)) / len(y_pred)

    #manual gradient computation
    def calculate_grad(self, y_pred, y_true, x):
        #forward pass
        # wx = np.dot(self.weight, x)
        # sigmoid_wx = self.diy_sigmoid(wx)
        # loss = self.diy_mse_loss(sigmoid_wx, y_true)
        #backward pass
        # derivative of the MSE loss (y_pred - y_true) ^ 2 / n is 2 * (y_pred - y_true) / n ; the result is a 2-dim vector
        grad_mse = 2/len(x) * (y_pred - y_true)
        # derivative of sigmoid y = 1/(1+e^(-x)) is y * (1 - y); the result is a 2-dim vector
        grad_sigmoid = y_pred * (1 - y_pred)
        # the matrix product wx, expanded: wx = [w11*x0 + w21*x1, w12*x0 + w22*x1]
        #chain rule: multiply the partial derivatives
        grad_w11 = grad_mse[0] * grad_sigmoid[0] * x[0]
        grad_w12 = grad_mse[1] * grad_sigmoid[1] * x[0]
        grad_w21 = grad_mse[0] * grad_sigmoid[0] * x[1]
        grad_w22 = grad_mse[1] * grad_sigmoid[1] * x[1]
        grad = np.array([[grad_w11, grad_w12],
                         [grad_w21, grad_w22]])
        #PyTorch stores the weight transposed, so transpose the output to match
        return grad.T

#SGD gradient update
def diy_sgd(grad, weight, learning_rate):
    return weight - learning_rate * grad

#Adam gradient update
def diy_adam(grad, weight):
    #these parameters and states should live outside the function; kept here for a simple one-step illustration
    alpha = 1e-3  #learning rate
    beta1 = 0.9   #hyperparameter
    beta2 = 0.999 #hyperparameter
    eps = 1e-8    #hyperparameter
    t = 0         #initialization
    mt = 0        #initialization
    vt = 0        #initialization
    #start the computation
    t = t + 1
    gt = grad
    mt = beta1 * mt + (1 - beta1) * gt
    vt = beta2 * vt + (1 - beta2) * gt ** 2
    mth = mt / (1 - beta1 ** t)
    vth = vt / (1 - beta2 ** t)
    weight = weight - (alpha / (np.sqrt(vth) + eps)) * mth
    return weight

x = np.array([-0.5, 0.1])  #input
y = np.array([0.1, 0.2])  #expected output

#torch experiment
torch_model = TorchModel(2)
torch_model_w = torch_model.state_dict()["layer.weight"]
print(torch_model_w, "initial weights")
numpy_model_w = copy.deepcopy(torch_model_w.numpy())
#numpy array -> torch tensor; unsqueeze adds a batch-size dimension
torch_x = torch.from_numpy(x).float().unsqueeze(0)
torch_y = torch.from_numpy(y).float().unsqueeze(0)
#torch forward pass, returns the loss
torch_loss = torch_model(torch_x, torch_y)
print("torch model loss:", torch_loss)
# #manual loss computation
diy_model = DiyModel(numpy_model_w)
diy_loss = diy_model.forward(x, y)
print("diy model loss:", diy_loss)

# #set up the optimizer
learning_rate = 0.1
optimizer = torch.optim.SGD(torch_model.parameters(), lr=learning_rate)
# optimizer = torch.optim.Adam(torch_model.parameters())
optimizer.zero_grad()
#
# #pytorch backpropagation
torch_loss.backward()
print(torch_model.layer.weight.grad, "torch gradient")  #inspect the gradient of this layer's weight

# #manual backpropagation
grad = diy_model.calculate_grad(diy_model.forward(x), y, x)
print(grad, "diy gradient")
#
# #torch weight update
# optimizer.step()
# #inspect the updated weights
# update_torch_model_w = torch_model.state_dict()["layer.weight"]
# print(update_torch_model_w, "torch updated weights")
#
# #manual weight update
# diy_update_w = diy_sgd(grad, numpy_model_w, learning_rate)
# diy_update_w = diy_adam(grad, numpy_model_w)
# print(diy_update_w, "diy updated weights")

2. Optimizer – Adam

2.1 Analysis of Adam

Adam (Adaptive Moment Estimation) is a widely used optimization algorithm designed for deep learning models. It combines the advantages of two other optimizers: AdaGrad and RMSProp.

Here are Adam's features and how it works:

  1. Momentum: Similar to traditional momentum methods, Adam keeps a moving average of past gradients. This helps the algorithm move in the relevant descent direction, especially on highly non-convex optimization problems.
  2. Learning rate adaptation: Like RMSProp and AdaGrad, Adam adjusts the learning rate of each parameter based on past squared gradients. This lets the algorithm move quickly in the early stages of learning and slow down as it approaches a minimum.

The algorithm, step by step:

  1. Initialize the parameters.
  2. Compute the gradient.
  3. Compute the first moment estimate mt.
  4. Compute the second moment estimate vt.
  5. Apply bias correction to mt and vt.
  6. Update the weights.

Advantages:
Computationally efficient and memory-light. Suitable for most deep learning applications, and for non-stationary objective functions, large problems, and noisy or sparse gradients.

Disadvantages:
Adam may take more time to converge, especially in the later stages of training.
On some problems it may not be as stable as other algorithms such as SGD or RMSProp.

2.2 Code example

class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        
        # initialize the first and second moment estimates
        self.m = 0
        self.v = 0
        self.t = 0

    def step(self, gradient):
        self.t += 1  # increment the time step

        # update the first moment estimate (moving average of the gradient)
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradient

        # update the second moment estimate (moving average of the squared gradient)
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradient**2

        # bias-corrected first moment estimate
        m_corr = self.m / (1 - self.beta1**self.t)

        # bias-corrected second moment estimate
        v_corr = self.v / (1 - self.beta2**self.t)

        # compute the parameter update
        update = self.learning_rate * m_corr / (v_corr**0.5 + self.epsilon)

        return update
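
As written, step only returns the update; a minimal usage sketch (assuming numpy and an illustrative quadratic objective, not part of the original example) could apply it like this:

import numpy as np

opt = AdamOptimizer(learning_rate=0.01)
w = np.array([0.5, -0.3])
target = np.array([1.0, 2.0])
for _ in range(2000):
    grad = 2 * (w - target)   # gradient of ||w - target||^2
    w = w - opt.step(grad)    # apply the returned update to the parameters
print(w)                      # w moves toward [1.0, 2.0]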

3. NLP tasks

Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence and linguistics that enables machines to read, interpret, generate and respond to human language. As technology advances, NLP tasks have become more and more sophisticated and play a role in many practical applications.

Here are some major NLP tasks:

  1. Text Classification: Classify a given text into predefined categories. For example, sentiment analysis (judging whether a text is positive, negative or neutral).
  2. Named Entity Recognition (NER): Identify specific categories of entities such as person names, place names, and institution names from text.
  3. Part-of-speech tagging: Assign a part-of-speech tag to each word in the text, such as noun, verb, adjective, etc.
  4. Syntactic analysis: constructing the syntactic structure of the text, usually represented by a tree diagram.
  5. Semantic role labeling: Determine the semantic role of each constituent in the sentence, such as agent or patient.
  6. Language Model: Predict the probability of the next word or character. This is useful in many applications such as machine translation and speech recognition.
  7. Machine Translation: Translating text from one language to another.
  8. Question Answering Systems: Extract or generate answers from documents based on user questions.
  9. Text Generation: Generate new text based on a given input.
  10. Speech Recognition: Convert audio to text.

Now consider a concrete task.
Task: string classification - determine whether a specified character appears in the string.
Example:
specified character: a
"abcd" - positive sample
"bcde" - negative sample

4. Neural Networks Process Text

Task
Current input: a string, such as "abcd"
Expected output: a probability value; positive sample = 1, negative sample = 0, with 0.5 as the decision boundary
X = "abcd" Y = 1
X = "bcde" Y = 0

Modeling goal: find a mapping f(x) such that f("abcd") = 1 and f("bcde") = 0

4.1 Step 1: character numericalization

Intuitively, would mapping a -> 1, b -> 2, c -> 3, ..., z -> 26 make sense?
Clearly not: such an encoding imposes an arbitrary ordering and magnitude on the characters.

Instead, each character is converted into a vector of the same dimension:
a -> [0.32618175 0.20962898 0.43550067 0.07120884 0.58215387]
b -> [0.21841921 0.97431001 0.43676452 0.77925024 0.7307891 ]
...
z -> [0.72847746 0.72803551 0.43888069 0.09266955 0.65148562]

So "abcd" -> a 4 x 5 matrix:
[[0.32618175 0.20962898 0.43550067 0.07120884 0.58215387]
[0.21841921 0.97431001 0.43676452 0.77925024 0.7307891 ]
[0.95035602 0.45280039 0.06675379 0.72238734 0.02466642]
[0.86751814 0.97157839 0.0127658 0.98910503 0.92606296]]
matrix shape = text length x vector length

4.2 Step 2: convert the matrix to a vector

Take the average over the rows:
[[0.32618175 0.20962898 0.43550067 0.07120884 0.58215387]
[0.21841921 0.97431001 0.43676452 0.77925024 0.7307891 ]
[0.95035602 0.45280039 0.06675379 0.72238734 0.02466642]
[0.86751814 0.97157839 0.0127658 0.98910503 0.92606296]]
->
[0.59061878 0.65207944 0.2379462 0.64048786 0.56591809]
4 x 5 matrix -> 1 x 5 vector; shape = 1 x vector length

4.3 Step 3: vector to value

Take the simplest linear formula y = w * x + b,
where w has dimension 1 x vector-dimension and b is a real number.
Example:
w = [1, 1], b = -1, x = [1, 2]  ->  y = 1*1 + 1*2 - 1 = 2

4.4 Step 4: numerical normalization

The sigmoid function σ(x) = 1 / (1 + e^(−x)) maps any real number into (0, 1).
For example, x = 3 gives σ(x) ≈ 0.9526.

4.5 Summary

The overall mapping:
"abcd" ---- each character converted to a vector ----> 4 x 5 matrix
4 x 5 matrix ---- average over rows ----> 1 x 5 vector
1 x 5 vector ---- linear formula w * x + b ----> real number
real number ---- sigmoid normalization ----> real number between 0 and 1

The character vectors and the parameters w and b are the parts that need to be optimized through training; a small end-to-end sketch follows.
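
A minimal numpy sketch of the whole pipeline, with randomly initialized (untrained) character vectors, w and b (all names and values are illustrative):

import numpy as np

np.random.seed(0)
chars = "abcdefghijklmnopqrstuvwxyz"
char_vectors = {ch: np.random.random(5) for ch in chars}  # each character -> 5-dim vector
w = np.random.random(5)   # 1 x vector-dimension
b = -0.5                  # a real number

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def f(s):
    matrix = np.array([char_vectors[ch] for ch in s])  # text length x 5 matrix
    pooled = matrix.mean(axis=0)                       # 1 x 5 vector
    return sigmoid(np.dot(w, pooled) + b)              # real number in (0, 1)

print(f("abcd"), f("bcde"))   # before training, the outputs carry no information about the label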

5. Embedding layer

5.1 Analysis

The Embedding layer is used in the neural network to convert discrete data (usually integer identifiers of text data) into continuous vector representations, that is, word embeddings. This layer is often used in Natural Language Processing (NLP) tasks such as text classification, sentiment analysis, or machine translation.

In the embedding layer, each unique identifier (such as the integer ID of a word) is mapped to a fixed-size vector. These vectors are optimized during model training so that they serve the given task better.


The embedding layer is used together with a vocabulary file.
For Chinese, the units are usually characters; for English, tokens (words or subwords).
Multiple languages and symbols can appear in the same vocabulary.
Purpose: "abc" --vocabulary--> 0, 1, 2 --embedding layer--> 3 x n matrix --> model. A small sketch of this mapping follows.
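
A small sketch of the mapping above (the vocabulary dictionary here is illustrative; in practice it is loaded from a vocabulary file):

import torch
import torch.nn as nn

vocab = {"a": 0, "b": 1, "c": 2}                       # toy vocabulary
embedding = nn.Embedding(len(vocab), 4)                # 3 tokens, 4-dimensional vectors

ids = torch.LongTensor([[vocab[ch] for ch in "abc"]])  # "abc" -> [[0, 1, 2]]
print(embedding(ids).shape)                            # torch.Size([1, 3, 4]): a 3 x 4 matrix per sample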

5.2 Code Examples

#coding:utf8
import torch
import torch.nn as nn

'''
How the embedding layer works
'''

num_embeddings = 6  #for NLP tasks this is usually the total number of characters/tokens in the vocabulary
embedding_dim = 5   #dimension of the vector each character is mapped to
embedding_layer = nn.Embedding(num_embeddings, embedding_dim)
print(embedding_layer.weight, "randomly initialized weights")

#build an input
x = torch.LongTensor([[1,2,3],[2,2,0]])
embedding_out = embedding_layer(x)
print(embedding_out)


6. Network Structure – Fully Connected Layer

The fully connected layer, also known as the linear layer.
Calculation formula: y = w * x + b
W and b are the parameters learned during training.
The dimension of W determines the output dimension of the hidden layer, generally called the number of hidden units (hidden size).
Example:
Input: x (dimension 1 x 3)
Hidden layer 1: w (dimension 3 x 5)
Hidden layer 2: w (dimension 5 x 2)
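
A minimal PyTorch sketch of the example above (layer and variable names are illustrative):

import torch
import torch.nn as nn

x = torch.rand(1, 3)        # input, dimension 1 x 3
layer1 = nn.Linear(3, 5)    # hidden layer 1: maps 3 -> 5 (PyTorch stores the weight as 5 x 3)
layer2 = nn.Linear(5, 2)    # hidden layer 2: maps 5 -> 2
y = layer2(layer1(x))
print(y.shape)              # torch.Size([1, 2])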

7. Network structure – RNN

7.1 Analysis

A Recurrent Neural Network (RNN) is a neural network structure for processing sequence data. Unlike a traditional feedforward network, an RNN has a form of memory and can retain information from one or more previous steps. This makes RNNs particularly suitable for tasks that depend on earlier information, such as time series data, text, and speech.

In RNNs, neurons not only receive new inputs, but also maintain a state (usually represented by a hidden layer) that is updated at each iteration of the network. Simply put, an RNN takes an input and generates an output each time it updates its internal state.

A typical application of RNNs is natural language processing (NLP), such as text generation, text classification, and machine translation. However, the underlying RNN structure may encounter difficulties in processing long sequences due to the vanishing or exploding gradient problem. To address these issues, more complex RNN variants such as Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) were developed.

The main idea of the recurrent neural network is to divide the sequence into multiple time steps, feed the information of each time step into the model in turn, and pass the model's output on to the next time step.

Formula (in the code example below bias=False, so the bias terms drop out):
h_t = tanh(W_ih · x_t + b_ih + W_hh · h_(t-1) + b_hh)

7.2 Code example

#coding:utf8

import torch
import torch.nn as nn
import numpy as np


"""
手动实现简单的神经网络
使用pytorch实现RNN
手动实现RNN
对比
"""

class TorchRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(TorchRNN, self).__init__()
        self.layer = nn.RNN(input_size, hidden_size, bias=False, batch_first=True)

    def forward(self, x):
        return self.layer(x)

#Hand-written RNN model
class DiyModel:
    def __init__(self, w_ih, w_hh, hidden_size):
        self.w_ih = w_ih
        self.w_hh = w_hh
        self.hidden_size = hidden_size

    def forward(self, x):
        ht = np.zeros((self.hidden_size))
        output = []
        for xt in x:
            ux = np.dot(self.w_ih, xt)
            wh = np.dot(self.w_hh, ht)
            ht_next = np.tanh(ux + wh)
            output.append(ht_next)
            ht = ht_next
        return np.array(output), ht


x = np.array([[1, 2, 3],
              [3, 4, 5],
              [5, 6, 7]])  #network input

#torch experiment
hidden_size = 4
torch_model = TorchRNN(3, hidden_size)
# print(torch_model.state_dict())
w_ih = torch_model.state_dict()["layer.weight_ih_l0"]
w_hh = torch_model.state_dict()["layer.weight_hh_l0"]
print(w_ih, w_ih.shape)
print(w_hh, w_hh.shape)
#
torch_x = torch.FloatTensor([x])
output, h = torch_model.forward(torch_x)
print(h)
print(output.detach().numpy(), "torch model output")
print(h.detach().numpy(), "torch model final hidden state")
print("---------------")
diy_model = DiyModel(w_ih, w_hh, hidden_size)
output, h = diy_model.forward(x)
print(output, "diy model output")
print(h, "diy model final hidden state")

8. Network structure – CNN

8.1 Analysis

A network structure based on the convolution operation; each convolution kernel can be regarded as a feature extractor.

8.2 Code Examples

#coding:utf8

import torch
import torch.nn as nn
import numpy as np


"""
手动实现简单的神经网络
使用pytorch实现CNN
手动实现CNN
对比
"""
#一个二维卷积
class TorchCNN(nn.Module):
    def __init__(self, in_channel, out_channel, kernel):
        super(TorchCNN, self).__init__()
        self.layer = nn.Conv2d(in_channel, out_channel, kernel, bias=False)

    def forward(self, x):
        return self.layer(x)

#Hand-written CNN model
class DiyModel:
    def __init__(self, input_height, input_width, weights, kernel_size):
        self.height = input_height
        self.width = input_width
        self.weights = weights
        self.kernel_size = kernel_size

    def forward(self, x):
        output = []
        for kernel_weight in self.weights:
            kernel_weight = kernel_weight.squeeze().numpy() #shape : 2x2
            kernel_output = np.zeros((self.height - kernel_size + 1, self.width - kernel_size + 1))
            for i in range(self.height - kernel_size + 1):
                for j in range(self.width - kernel_size + 1):
                    window = x[i:i+kernel_size, j:j+kernel_size]
                    kernel_output[i, j] = np.sum(kernel_weight * window) # np.dot(a, b) != a * b
            output.append(kernel_output)
        return np.array(output)


x = np.array([[0.1, 0.2, 0.3, 0.4],
              [-3, -4, -5, -6],
              [5.1, 6.2, 7.3, 8.4],
              [-0.7, -0.8, -0.9, -1]])  #network input

#torch experiment
in_channel = 1
out_channel = 3
kernel_size = 2
torch_model = TorchCNN(in_channel, out_channel, kernel_size)
print(torch_model.state_dict())
torch_w = torch_model.state_dict()["layer.weight"]
# print(torch_w.numpy().shape)
torch_x = torch.FloatTensor([[x]])
output = torch_model.forward(torch_x)
output = output.detach().numpy()
print(output, output.shape, "torch model output\n")
print("---------------")
diy_model = DiyModel(x.shape[0], x.shape[1], torch_w, kernel_size)
output = diy_model.forward(x)
print(output, "diy模型预测结果")

#######################

#1D convolution layer, more commonly used in NLP
kernel_size = 3
input_dim = 5
hidden_size = 4
torch_cnn1d = nn.Conv1d(input_dim, hidden_size, kernel_size)
# for key, weight in torch_cnn1d.state_dict().items():
#     print(key, weight.shape)

def numpy_cnn1d(x, state_dict):
    weight = state_dict["weight"].numpy()
    bias = state_dict["bias"].numpy()
    sequence_output = []
    for i in range(0, x.shape[1] - kernel_size + 1):
        window = x[:, i:i+kernel_size]
        kernel_outputs = []
        for kernel in weight:
            kernel_outputs.append(np.sum(kernel * window))
        sequence_output.append(np.array(kernel_outputs) + bias)
    return np.array(sequence_output).T

#compare the torch and numpy 1D convolutions (length is chosen here for illustration)
length = 6
x = torch.from_numpy(np.random.random((length, input_dim))).float()
x = x.transpose(1, 0)  #shape: (input_dim, length)
print(torch_cnn1d(x.unsqueeze(0)))
print(numpy_cnn1d(x.numpy(), torch_cnn1d.state_dict()))

9. Network structure – LSTM

9.1 Analysis

Making the RNN's hidden unit more elaborate avoids, to a certain extent, the problems of vanishing gradients and of forgetting earlier information.

The Long Short-Term Memory (LSTM) network is a special type of recurrent neural network (RNN) designed to solve the vanishing or exploding gradient problems that a basic RNN can run into when processing long sequences. LSTM was first proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997 and has been widely used in tasks involving sequence data, such as natural language processing, speech recognition, and time series forecasting.

The core idea of LSTM is to introduce a memory cell and three gates: an input gate, an output gate and a forget gate. The gates and the memory cell work together to decide how the network's state is updated.

  1. Input Gate: Decides which parts of the new input need to update the memory state.
  2. Forget Gate: Decide which old information needs to be forgotten or retained.
  3. Output gate: Based on the current input and memory unit, decide what information to output.

This three-gate structure enables LSTM to capture long-range dependencies more effectively; the update equations are written out below.
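
Written out as formulas (matching the numpy implementation in 9.2 below, which concatenates [h_(t-1), x_t]; σ is the sigmoid function and * is element-wise multiplication):

f_t = σ(W_f · [h_(t-1), x_t] + b_f)     (forget gate)
i_t = σ(W_i · [h_(t-1), x_t] + b_i)     (input gate)
g_t = tanh(W_c · [h_(t-1), x_t] + b_c)  (candidate memory)
c_t = f_t * c_(t-1) + i_t * g_t         (memory cell update)
o_t = σ(W_o · [h_(t-1), x_t] + b_o)     (output gate)
h_t = o_t * tanh(c_t)                   (hidden state)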

9.2 Code Example


import torch
import torch.nn as nn
import numpy as np

'''
Reproduce some basic model structures with explicit matrix operations.
Being clear about the computation details deepens understanding of the model and helps with work such as model conversion.
'''

#build an input
length = 6
input_dim = 12
hidden_size = 7
x = np.random.random((length, input_dim))
# print(x)

#PyTorch LSTM layer
torch_lstm = nn.LSTM(input_dim, hidden_size, batch_first=True)
for key, weight in torch_lstm.state_dict().items():
    print(key, weight.shape)

def sigmoid(x):
    return 1/(1 + np.exp(-x))

#take the weights out of the PyTorch LSTM and reproduce the LSTM computation in numpy with matrix operations
def numpy_lstm(x, state_dict):
    weight_ih = state_dict["weight_ih_l0"].numpy()
    weight_hh = state_dict["weight_hh_l0"].numpy()
    bias_ih = state_dict["bias_ih_l0"].numpy()
    bias_hh = state_dict["bias_hh_l0"].numpy()
    #PyTorch stores the weights of the four gates concatenated; split them apart
    w_i_x, w_f_x, w_c_x, w_o_x = weight_ih[0:hidden_size, :], \
                                 weight_ih[hidden_size:hidden_size*2, :],\
                                 weight_ih[hidden_size*2:hidden_size*3, :],\
                                 weight_ih[hidden_size*3:hidden_size*4, :]
    w_i_h, w_f_h, w_c_h, w_o_h = weight_hh[0:hidden_size, :], \
                                 weight_hh[hidden_size:hidden_size * 2, :], \
                                 weight_hh[hidden_size * 2:hidden_size * 3, :], \
                                 weight_hh[hidden_size * 3:hidden_size * 4, :]
    b_i_x, b_f_x, b_c_x, b_o_x = bias_ih[0:hidden_size], \
                                 bias_ih[hidden_size:hidden_size * 2], \
                                 bias_ih[hidden_size * 2:hidden_size * 3], \
                                 bias_ih[hidden_size * 3:hidden_size * 4]
    b_i_h, b_f_h, b_c_h, b_o_h = bias_hh[0:hidden_size], \
                                 bias_hh[hidden_size:hidden_size * 2], \
                                 bias_hh[hidden_size * 2:hidden_size * 3], \
                                 bias_hh[hidden_size * 3:hidden_size * 4]
    w_i = np.concatenate([w_i_h, w_i_x], axis=1)
    w_f = np.concatenate([w_f_h, w_f_x], axis=1)
    w_c = np.concatenate([w_c_h, w_c_x], axis=1)
    w_o = np.concatenate([w_o_h, w_o_x], axis=1)
    b_f = b_f_h + b_f_x
    b_i = b_i_h + b_i_x
    b_c = b_c_h + b_c_x
    b_o = b_o_h + b_o_x
    c_t = np.zeros((1, hidden_size))
    h_t = np.zeros((1, hidden_size))
    sequence_output = []
    for x_t in x:
        x_t = x_t[np.newaxis, :]
        hx = np.concatenate([h_t, x_t], axis=1)
        # f_t = sigmoid(np.dot(x_t, w_f_x.T) + b_f_x + np.dot(h_t, w_f_h.T) + b_f_h)
        f_t = sigmoid(np.dot(hx, w_f.T) + b_f)
        # i_t = sigmoid(np.dot(x_t, w_i_x.T) + b_i_x + np.dot(h_t, w_i_h.T) + b_i_h)
        i_t = sigmoid(np.dot(hx, w_i.T) + b_i)
        # g = np.tanh(np.dot(x_t, w_c_x.T) + b_c_x + np.dot(h_t, w_c_h.T) + b_c_h)
        g = np.tanh(np.dot(hx, w_c.T) + b_c)
        c_t = f_t * c_t + i_t * g
        # o_t = sigmoid(np.dot(x_t, w_o_x.T) + b_o_x + np.dot(h_t, w_o_h.T) + b_o_h)
        o_t = sigmoid(np.dot(hx, w_o.T) + b_o)
        h_t = o_t * np.tanh(c_t)
        sequence_output.append(h_t)
    return np.array(sequence_output), (h_t, c_t)


# torch_sequence_output, (torch_h, torch_c) = torch_lstm(torch.Tensor([x]))
# numpy_sequence_output, (numpy_h, numpy_c) = numpy_lstm(x, torch_lstm.state_dict())
#
# print(torch_sequence_output)
# print(numpy_sequence_output)
# print("--------")
# print(torch_h)
# print(numpy_h)
# print("--------")
# print(torch_c)
# print(numpy_c)

#############################################################

#PyTorch GRU layer
torch_gru = nn.GRU(input_dim, hidden_size, batch_first=True)
# for key, weight in torch_gru.state_dict().items():
#     print(key, weight.shape)


#take the weights out of the PyTorch GRU and reproduce the GRU computation in numpy with matrix operations
def numpy_gru(x, state_dict):
    weight_ih = state_dict["weight_ih_l0"].numpy()
    weight_hh = state_dict["weight_hh_l0"].numpy()
    bias_ih = state_dict["bias_ih_l0"].numpy()
    bias_hh = state_dict["bias_hh_l0"].numpy()
    #PyTorch stores the weights of the three gates concatenated; split them apart
    w_r_x, w_z_x, w_x = weight_ih[0:hidden_size, :], \
                        weight_ih[hidden_size:hidden_size * 2, :],\
                        weight_ih[hidden_size * 2:hidden_size * 3, :]
    w_r_h, w_z_h, w_h = weight_hh[0:hidden_size, :], \
                        weight_hh[hidden_size:hidden_size * 2, :], \
                        weight_hh[hidden_size * 2:hidden_size * 3, :]
    b_r_x, b_z_x, b_x = bias_ih[0:hidden_size], \
                        bias_ih[hidden_size:hidden_size * 2], \
                        bias_ih[hidden_size * 2:hidden_size * 3]
    b_r_h, b_z_h, b_h = bias_hh[0:hidden_size], \
                        bias_hh[hidden_size:hidden_size * 2], \
                        bias_hh[hidden_size * 2:hidden_size * 3]
    w_z = np.concatenate([w_z_h, w_z_x], axis=1)
    w_r = np.concatenate([w_r_h, w_r_x], axis=1)
    b_z = b_z_h + b_z_x
    b_r = b_r_h + b_r_x
    h_t = np.zeros((1, hidden_size))
    sequence_output = []
    for x_t in x:
        x_t = x_t[np.newaxis, :]
        hx = np.concatenate([h_t, x_t], axis=1)
        z_t = sigmoid(np.dot(hx, w_z.T) + b_z)
        r_t = sigmoid(np.dot(hx, w_r.T) + b_r)
        h = np.tanh(r_t * (np.dot(h_t, w_h.T) + b_h) + np.dot(x_t, w_x.T) + b_x)
        h_t = (1 - z_t) * h + z_t * h_t
        sequence_output.append(h_t)
    return np.array(sequence_output), h_t

# torch_sequence_output, torch_h = torch_gru(torch.Tensor([x]))
# numpy_sequence_output, numpy_h = numpy_gru(x, torch_gru.state_dict())
#
# print(torch_sequence_output)
# print(numpy_sequence_output)
# print("--------")
# print(torch_h)
# print(numpy_h)


10. Network structure – GRU

10.1 Analysis

A simplified variant of LSTM.
Depending on the task, it can perform better or worse than LSTM.

The Gated Recurrent Unit (GRU) is a variant of the recurrent neural network (RNN), proposed by Kyunghyun Cho et al. in 2014. GRU is designed to address the vanishing gradient problem that the standard RNN structure faces when processing long sequences, and serves a similar purpose to the Long Short-Term Memory (LSTM) network.

Compared with LSTM, GRU has a simpler structure and mainly consists of two gates:

  1. Update Gate: Decides which information is passed from the previous state to the current state.
  2. Reset Gate: Determines which past information is used with current inputs to update the current state.

Because GRUs have fewer gates and parameters, they are generally faster to train, especially on datasets that do not need to capture extremely long dependencies.
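
In formula form (matching the numpy GRU implementation in section 9's code; σ is the sigmoid function and * is element-wise multiplication):

z_t = σ(W_z · [h_(t-1), x_t] + b_z)                          (update gate)
r_t = σ(W_r · [h_(t-1), x_t] + b_r)                          (reset gate)
h̃_t = tanh(W_x · x_t + b_x + r_t * (W_h · h_(t-1) + b_h))    (candidate state)
h_t = (1 − z_t) * h̃_t + z_t * h_(t-1)                        (new hidden state)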

10.2 Code Examples

import torch
import torch.nn as nn

# define the model structure
class GRUModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(GRUModel, self).__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        out, _ = self.gru(x)
        out = self.fc(out[:, -1, :])
        return out

# parameter settings
input_dim = 64  # input feature dimension
hidden_dim = 50  # hidden state dimension of the GRU layer
output_dim = 1   # output layer dimension (for binary classification)

# initialize the model, loss function and optimizer
model = GRUModel(input_dim, hidden_dim, output_dim)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# suppose we have input data of shape (batch_size, seq_len, input_dim) and corresponding labels
# here we just use random data as an example
batch_size = 32
seq_len = 10
x = torch.randn(batch_size, seq_len, input_dim)
y = torch.randint(0, 2, (batch_size, output_dim), dtype=torch.float32)

# forward pass
outputs = model(x)

# compute the loss
loss = criterion(outputs, y)

# backpropagation and optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()

11. Network structure – TextCNN

11.1 Analysis

One-dimensional convolutions are used to encode the text; the encoded text matrix is then pooled into a vector and used for classification.

11.2 Code Examples

import torch
import torch.nn as nn
import torch.nn.functional as F

# define the TextCNN model
class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, kernel_sizes=[3, 4, 5], num_filters=100):
        super(TextCNN, self).__init__()
        
        # embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # convolution layers
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=embed_dim, out_channels=num_filters, kernel_size=k)
            for k in kernel_sizes
        ])
        
        # fully connected layer
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)

    def forward(self, x):
        # input x shape: [batch size, sequence length]

        # embed
        x = self.embedding(x)  # output shape: [batch size, sequence length, embedding dim]

        # Conv1d expects input shape: [batch size, embedding dim, sequence length]
        x = x.permute(0, 2, 1)
        
        # convolution
        x = [F.relu(conv(x)) for conv in self.convs]

        # max pooling over the sequence dimension
        x = [F.max_pool1d(c, c.size(-1)).squeeze(-1) for c in x]

        # concatenate
        x = torch.cat(x, 1)

        # fully connected layer
        x = self.fc(x)
        
        return x

# hyperparameters
vocab_size = 5000  # vocabulary size
embed_dim = 300  # embedding dimension
num_classes = 2  # number of classes
kernel_sizes = [3, 4, 5]  # convolution kernel sizes
num_filters = 100  # number of convolution kernels

# initialize the model
model = TextCNN(vocab_size, embed_dim, num_classes, kernel_sizes, num_filters)

# example input (in practice this should be a preprocessed text sequence encoded as integers)
# input shape: [batch size, sequence length]
input_batch = torch.randint(0, vocab_size, (32, 50))

# forward pass
output = model(input_batch)

12. Pooling layer

12.1 Analysis

  • Reduces the input dimension of subsequent network layers, shrinking the model and speeding up computation
  • Improves the robustness of the feature map and helps prevent overfitting

12.2 Code Examples

#coding:utf8
import torch
import torch.nn as nn

'''
How the pooling layer works
'''

#by default, pooling operates on the last dimension of the input tensor
#the argument 5 means a window of 5 values is pooled down to 1
layer = nn.AvgPool1d(5)
#randomly generate a tensor of shape 3x4x5
#think of it as 3 samples, each with text length 4 and vector length 5
x = torch.rand([3, 4, 5])
print(x)
print(x.shape)
#pass through the pooling layer
y = layer(x)
print(y)
print(y.shape)
#squeeze removes dimensions of size 1
y = y.squeeze()
print(y)
print(y.shape)

13. Normalization


Normalization is a technique used in machine learning and statistics to scale features so that they are comparable and interpretable. Common normalization methods include Min-Max Normalization and Z-score Normalization (also called standardization).

Benefits of normalization

  1. Faster training: normalized data converges faster during training.
  2. Comparable features: different features become comparable with one another.
  3. Helps certain algorithms: methods such as k-NN and SVM rely on distances between data points, and normalization helps in these cases.
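
A short numpy sketch of the two methods mentioned above (the values are illustrative):

import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0])

# Min-Max normalization: scales values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit variance
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)   # [0.        0.2105... 0.4736... 1.       ]
print(x_zscore)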

14. Dropout layer

14.1 Analysis

  • Effect: reduces overfitting
  • With a specified probability p, randomly zeroes out some of the elements (neurons)
  • The remaining elements are multiplied by 1 / (1 − p) to scale them up


  • How to understand its effect
  • Forcing a neural unit to work together with randomly selected other units removes and weakens the joint adaptation between neuron nodes and enhances generalization
  • It can be regarded as a kind of model averaging: the hidden nodes dropped each time are different, so every training pass effectively trains a different "new" model
  • Takeaway: a more complicated computation is not necessarily better

14.2 Code Example

#coding:utf8

import torch
import torch.nn as nn
import numpy as np


"""
基于pytorch的网络编写
测试dropout层
"""

import torch

x = torch.Tensor([1,2,3,4,5,6,7,8,9])
dp_layer = torch.nn.Dropout(0.5)
dp_x = dp_layer(x)
print(dp_x)
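
Note that with p = 0.5 the surviving elements are scaled by 1 / (1 − p) = 2, and that dropout is only active in training mode; switching the layer to evaluation mode passes the input through unchanged:

dp_layer.eval()
print(dp_layer(x))  #identical to x: dropout is disabled in eval mode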



Origin blog.csdn.net/m0_63260018/article/details/132463952