Deep Learning|Linear Regression and Linear Classification

1. Linear neural network

A linear neural network is structurally very similar to a perceptron; the main difference lies in the activation function. The activation function of the perceptron can only output two possible values, -1 and 1, while the output of a linear neural network can take any value, since its activation function is a linear function such as the purelin function shown in the figure below. A linear neural network usually uses the LMS (Least Mean Square) algorithm to adjust the weights and biases of the network.

The LMS algorithm was proposed by Widrow and Hoff during their work on pattern-recognition schemes for adaptive linear elements, and it is also known as the least mean square algorithm. The LMS algorithm builds on the Wiener filter and uses the steepest-descent method to approach the optimal solution recursively. It therefore has low computational complexity and remains stable even with limited-precision arithmetic, which makes it one of the most stable and most widely used adaptive algorithms. The specific procedure of the LMS algorithm is: (1) determine the parameters, including the global step size and the filter order (number of taps); (2) initialize the weights; (3) iterate, computing the filter output, the error signal, and the weight update, as sketched below.
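
As a minimal sketch of these three steps (the function name lms_filter, the tap count, and the step size here are illustrative choices, not taken from any particular reference):

import numpy as np

def lms_filter(x, d, order=4, mu=0.01):
    '''Minimal LMS sketch: x is the input signal, d the desired signal,
    order the number of taps, mu the global step size.'''
    w = np.zeros(order)                    ## (2) initialize the weights
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(order, len(x)):         ## (3) run the recursion
        x_n = x[n - order:n][::-1]         ## most recent order input samples
        y[n] = w @ x_n                     ## filter output
        e[n] = d[n] - y[n]                 ## error signal
        w = w + 2 * mu * e[n] * x_n        ## weight update
    return w, y, e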

The most common uses of linear neural networks are solving linear regression problems and linear classification problems, both of which belong to supervised learning in machine learning. Supervised learning relies on known correct answers: a hypothesis is used to approximate the mapping between an input and a given output, with regression and classification as the typical problem types.

The following is an expanded description of linear regression and linear classification problems.

2. Linear regression

Regression is one of the most powerful tools in statistics. It is suited to predicting continuously distributed data: given an input value, a specific value can be predicted. The purpose of regression is to establish a regression equation for predicting the target value; once the regression coefficients of that equation have been solved for by some method, the mapping between the output value and the input value is obtained.

Based on the functional form of the mapping, regression can be divided into linear regression, quadratic regression, nonlinear regression, and so on; according to the dimension of the input, it can be divided into univariate regression, bivariate regression, and more generally multivariate regression. Linear regression, then, takes a point set D, fits it with a linear function while minimizing the error between the point set and the fitted function; the resulting linear function is the linear regression equation.

2.1 Linear regression model

Linear regression starts from the following assumptions: (1) the relationship between the independent variable x and the dependent variable y is linear, that is, y can be expressed as a weighted sum of the elements of x; (2) a random disturbance term (noise) exists, and it is assumed to be random, for example normally distributed.

Putting this in a concrete setting, suppose researchers want to predict the market price of used cars from brand value, car performance, and vehicle age. To develop such a model, they need to collect a real data set containing the brand value, performance, age, and market price of used cars. In machine-learning terminology, this data set is called a training data set, each piece of data is called a sample or data point, the target of prediction (the market price of a used car) is called a label, and the independent variables on which the prediction is based (brand value, performance, vehicle age) are called features or covariates. We use n to denote the number of samples in the data set. For the sample with index i, the input can be expressed as:

$$\mathbf{x}^{(i)} = \left[x_1^{(i)}, x_2^{(i)}, \ldots, x_d^{(i)}\right]^\top$$

The corresponding label is $y^{(i)}$.

Based on the above assumptions, this relationship can be envisioned as:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$

In terms of vectors, it can be expressed as:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

The vector $\mathbf{x}$ holds the features of a single data sample, while the matrix $\mathbf{X}$ represents all n samples of the data set: each row of $\mathbf{X}$ is a sample and each column is a feature.

For a feature set $\mathbf{X}$, the predicted values can be expressed as:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$$

Given the training features $\mathbf{X}$ and the corresponding known labels $\mathbf{y}$, the goal of linear regression is to find a weight vector $\mathbf{w}$ and a bias $b$ such that, for new samples drawn from the same distribution as $\mathbf{X}$, the error in predicting their labels is as small as possible.

Before starting to find the best model parameters w and b , we also need a way to measure the quality of the model and a way to improve the quality of the model.

2.2 Loss function

The loss function quantifies the difference between the target's actual value and the predicted value. As shown in the figure below, we usually choose a non-negative number as the loss; the smaller the value, the smaller the loss, and a perfect prediction has a loss of 0. The loss function most commonly used in regression problems is the squared error. When the predicted value for sample i is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is defined as:

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

2.3 Analytical solution

The solution of linear regression can be expressed by a simple formula; this kind of solution is called an analytical solution. First, we fold the bias b into the parameter vector w by appending a column of all ones to the matrix containing all inputs. Our aim is then to minimize the following expression:

$$\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$$

The loss surface has only one critical point, and it corresponds to the minimum of the loss over the whole domain. Setting the derivative of the loss with respect to $\mathbf{w}$ to zero yields the analytical solution:

$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}$$

Simple problems like linear regression have analytical solutions, but not every problem does. Analytical solutions lend themselves to clean mathematical analysis, but they impose strict requirements on the problem, which prevents them from being used widely in deep learning.
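
As a small illustration (a sketch only; the synthetic data and variable names below are made up for this example, with the bias folded into w by appending a column of ones):

import torch

X = torch.randn(100, 2)
X_aug = torch.cat([X, torch.ones(100, 1)], dim=1)   ## append a column of ones for the bias
w_true = torch.tensor([[2.0], [-3.4], [4.2]])       ## the last entry plays the role of b
y = X_aug @ w_true + 0.01 * torch.randn(100, 1)

w_hat = torch.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y   ## w* = (X^T X)^{-1} X^T y
print(w_hat)                                              ## should be close to w_true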

2.4 Stochastic Gradient Descent

Gradient descent works by computing the derivative (gradient) of the loss function with respect to the model parameters. In its plain form, the entire data set has to be traversed before every parameter update, so in practice a small batch of samples is randomly selected each time an update is computed. This variant is called mini-batch stochastic gradient descent.

In each iteration, we first randomly sample a mini-batch consisting of a fixed number of training samples. Then we compute the derivative of the average loss over the mini-batch with respect to the model parameters. Finally, we multiply this gradient by a predetermined positive number η (the learning rate) and subtract the result from the current parameter values.
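
Written out, a single update takes the following form (with $\mathcal{B}$ the sampled mini-batch, $|\mathcal{B}|$ its size, and $\eta$ the learning rate); this is exactly the rule implemented by the sgd function later in this section:

$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)}\, l^{(i)}(\mathbf{w}, b)$$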

2.5 Implementing linear regression with Python

First we generate the data set, using the linear model parameters $\mathbf{w} = [2, -3.4]^\top$, $b = 4.2$ and a noise term $\varepsilon$ to generate the data and its labels:

$$\mathbf{y} = \mathbf{X}\mathbf{w} + b + \varepsilon$$

We assume that ε follows a normal distribution with mean 0; to simplify the problem, its standard deviation is set to 0.01.

import random   ## used below by data_iter
import torch
import matplotlib.pyplot as plt

# Construct an artificial data set from a linear model with additive noise.
# The data set and its labels are generated from the model parameters w, b and the noise term ε.
def synthetic_data(w, b, num_examples):
    X = torch.normal(0, 1, (num_examples, len(w)))    ### tensor with mean=0, std=1, size=(num_examples, len(w))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))   ### reshape y into a column vector
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
print('features:', features[0], '\nlabel:', labels[0])
plt.figure()
plt.scatter(features[:, 1].detach().numpy(),
            labels.detach().numpy(), 1)
plt.show()

After defining the data-generation function, we obtain features and labels for X and y; the data are plotted in the figure below. From the scatter points in the figure, the data distribution meets the linearity requirement.

Once the data are generated, we need a function that reads them in mini-batches, taking the batch size, the features, and the labels as arguments.

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  ## shuffle the indices so samples are read in random order
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

batch_size = 10

for X, y in data_iter(batch_size,features, labels):
    print(X, '\n',y )
    break

Note that because the data will be used for training, they must be shuffled, hence the call to random.shuffle. Printing one batch of the data we read shows that the features have shape (10, 2).

tensor([[-0.9799, -0.5394],
        [-0.1818,  0.4705],
        [-1.0967, -2.5218],
        [-1.4719,  0.4218],
        [-0.7889, -1.4477],
        [-0.2622, -0.1918],
        [-1.1138, -0.8647],
        [-0.5958, -0.3762],
        [-1.6837, -2.3087],
        [-1.5623, -0.2522]]) 
 tensor([[ 4.0597],
        [ 2.2267],
        [10.5830],
        [-0.1603],
        [ 7.5412],
        [ 4.3141],
        [ 4.9181],
        [ 4.2847],
        [ 8.6720],
        [ 1.9464]])

After reading the data, we can enter the training phase. The main steps are: (1) initialize the weights w with requires_grad=True, because w must participate in the gradient computation; (2) initialize the bias scalar b; (3) define the model linreg; (4) define the loss function; (5) define the gradient-based optimization function.

### Initialize the model parameters w and b
w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(size=(1, 1), requires_grad=True)   ## the bias is a scalar, so its size is 1*1

def linreg(X, w, b):
    '''Linear regression model'''
    return torch.matmul(X, w) + b

def squared_loss(y_hat, y):
    '''Squared loss'''
    return (y_hat - y.reshape(y_hat.shape))**2 / 2    ### the shapes may differ, so reshape y to match y_hat

def sgd(params, lr, batch_size):
    '''
    Mini-batch stochastic gradient descent
    1. params: all model parameters
    2. lr: learning rate
    3. batch_size: size of the input mini-batch
    '''
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()  ## torch does not reset gradients to 0 automatically, so do it manually here

After that, you can enter the training phase, and you need to set the learning rate and the number of iterations.

lr = 0.03
num_epochs = 20
net = linreg
loss = squared_loss

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)
        l.sum().backward()
        sgd([w, b], lr, batch_size)
    with torch.no_grad():
        train_1 = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_1.mean()):f}')

## with an artificial data set, the estimation errors can be inspected directly
print(f'estimation error of w: {true_w - w.reshape(true_w.shape)}')
print(f'estimation error of b: {true_b - b}')   ## b is a scalar, no reshape needed

3. Linear Classification

3.1 Classification

A linear neural network can also be used to solve classification problems. For any given input, the task of classification is to assign it to one of K classes. The set of all samples x assigned to a class is called the decision region of that class, and the boundaries between decision regions are called decision boundaries.

A linear classification model is a classification model whose decision boundaries are linear. That is, the decision boundaries of a linear classification model take the following form:

$$\mathbf{w}^\top \mathbf{x} + b = 0$$

The figure below shows linear and nonlinear classification models in a two-dimensional input space.

Simply put, a classification problem is about dividing samples into classes, something that comes up constantly in real life. For example, cats and dogs can be separated into different classes, and within each species the different breeds again form different types; this is a classification problem. Abstracted to scatter points, the points in the figure can be divided into two classes by their coordinates, with the two classes roughly occupying their own regions. To solve this problem, the trained network needs two input nodes, receiving the x and y coordinates, and two output nodes producing the class code. The class code usually adopted is what computer science calls one-hot encoding: the scatter points in panel (b) of the figure above fall into two classes, so the output 10 can stand for the first class and 01 for the second. Generalizing to multi-class problems, 100 can stand for the first class, 010 for the second, and 001 for the third.
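
For reference, a minimal sketch of producing such a one-hot code in PyTorch, using torch.nn.functional.one_hot (the example labels here are arbitrary):

import torch
import torch.nn.functional as F

labels = torch.tensor([0, 1, 2, 1])          ## class indices for four samples
one_hot = F.one_hot(labels, num_classes=3)   ## 100 -> class 1, 010 -> class 2, 001 -> class 3
print(one_hot)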

For binary classification, the output does not have to be one-hot encoded: a single output in the range 0 to 1 can represent the probability that the input belongs to the first or the second class. In that case the output layer usually uses the Sigmoid activation function.
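
A minimal sketch of this single-output variant might look as follows (assuming float labels of 0./1. and the nn.BCELoss loss; the layer sizes here are only illustrative):

import torch.nn as nn

binary_net = nn.Sequential(
    nn.Linear(2, 15),
    nn.ReLU(),
    nn.Linear(15, 1),
    nn.Sigmoid()          ## single output in (0, 1): the probability of one of the two classes
)
loss_func = nn.BCELoss()  ## expects float labels of shape (N, 1) with values 0. or 1.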

3.2 Implementing classification with Python

For simplicity and generality, instead of a real problem such as cat-vs-dog recognition, the demonstration here uses scatter points whose distribution contains some randomness.

3.2.1 Preparation

First, import the required modules:

import torch                        ## import the torch module
import torch.nn as nn               ## abbreviate the nn module
import matplotlib.pyplot as plt     ## import and abbreviate matplotlib

We want to build sets of scatter points that fall roughly into two regions while still carrying some randomness. For this we use the normal() interface in torch, which generates normally distributed random numbers with the given mean and standard deviation std; mean and std do not have to be single values, they can also be tensors, but the two tensors must then have the same size.

torch.normal(mean, std)

When both arguments are tensors, the size of the generated Tensor is the same as theirs.

x = torch.ones(5)
y = torch.normal(x, 1)
print(y)

Running the code above produces a one-dimensional random tensor whose values follow a normal distribution with mean 1 and standard deviation 1; the result is:

tensor([1.5439, 0.8622, 1.4256, 0.0724, 1.1194])

If x is changed to a two-dimensional tensor, e.g.:

x = torch.ones(2, 2)

the result is as follows:

tensor([[ 0.4945, 1.3048],
        [-0.4960, 1.2658]])

We want to construct two different sample sets, each with its own overall characteristics. This can be done by letting the two scatter sets follow different normal distributions, for example:

data = torch.ones(100, 2)        ## overall shape of the data (100 points, 2 coordinates)
x0 = torch.normal(2 * data, 1)   ## coordinates of the first class
x1 = torch.normal(-2 * data, 1)  ## coordinates of the second class
## plot the two point sets
for item in x0:
    plt.scatter(item[0], item[1])
for item in x1:
    plt.scatter(item[0], item[1])
plt.show()

Both x0 and x1 inherit the size of data: each is a two-dimensional tensor of shape 100*2, so each can represent a point set. Because the normal-distribution parameters differ, the two point sets are different; the resulting scatter plot is shown in the figure below.

When training the network we want the training set to be a single whole, so x0 and x1 are merged into one sample set. The interface for concatenating tensors is torch.cat(), for example:

import torch

x0 = torch.ones(2, 2)  ## create 2-D tensors
x1 = torch.zeros(2, 2)
print(x0)
print(x1)
x = torch.cat((x0, x1), 0)  ## concatenate along dim 0 (stack vertically)
print(x)
x = torch.cat((x0, x1), 1)  ## concatenate along dim 1 (stack side by side)
print(x)
x = torch.cat((x0, x1))  ## dim 0 is the default
print(x)

The output is as follows:

tensor([[1., 1.],
        [1., 1.]])
tensor([[0., 0.],
        [0., 0.]])
tensor([[1., 1.],
        [1., 1.],
        [0., 0.],
        [0., 0.]])
tensor([[1., 1., 0., 0.],
        [1., 1., 0., 0.]])
tensor([[1., 1.],
        [1., 1.],
        [0., 0.],
        [0., 0.]])

After the samples are stacked along dim 0, the first column holds all the x coordinates and the second column all the y coordinates. After merging, the result is converted to a Float-type Tensor to avoid data-type errors:

x = torch.cat((x0, x1)).type(torch.FloatTensor)

Next, the scatter sets are labelled so that the computer knows which points belong to the first class and which to the second: the first class is labelled 0, the second class 1, and the labels are stored in y.

y0 = torch.zeros(100)  ## labels of the first class are stored as 0
y1 = torch.ones(100)  ## labels of the second class are stored as 1
y = torch.cat((y0, y1)).type(torch.LongTensor)

3.2.2 Building the network

From the analysis of the classification problem, the network needs two inputs and two outputs. We use a hidden layer with 15 nodes, the ReLU function as the hidden-layer activation, and the Softmax function as the output-layer activation.

The network is built in much the same way as for the regression problem:

class Net(nn.Module):  ## class that stores the network structure
    def __init__(self):
        super(Net, self).__init__()
        self.classify = nn.Sequential(  ## build the network with the nn module
            nn.Linear(2, 15),  ## fully connected layer, 2 inputs, 15 outputs
            nn.ReLU(),  ## ReLU activation
            nn.Linear(15, 2),  ## fully connected layer, 15 inputs, 2 outputs
            nn.Softmax(dim=1)
        )

    def forward(self, x):  ## define the forward pass
        classification = self.classify(x)  ## pass x through the network
        return classification  ## return the prediction

3.2.3 Training the network

After the network is built, it needs to be trained. The training setup is almost the same as before, using the SGD algorithm for optimization. Classification problems usually use the cross-entropy function as the loss function, available through the CrossEntropyLoss() interface:

net = Net()
optimizer = torch.optim.SGD(net.parameters(), lr=0.03)  ## set up the optimizer
loss_func = nn.CrossEntropyLoss()  ## set up the loss function
for epoch in range(100):  # training loop
    out = net(x)  ## actual output
    loss = loss_func(out, y)  ## pass the actual and expected outputs to the loss function
    optimizer.zero_grad()  ## clear the gradients
    loss.backward()  ## back-propagate the error
    optimizer.step()  ## let the optimizer update the parameters

3.2.4 Full program (with visualization)

The complete program for solving the classification problem with a neural network is as follows:

import torch  ## import the torch module
import torch.nn as nn  ## abbreviate the nn module
import matplotlib.pyplot as plt  ## import and abbreviate matplotlib

data = torch.ones(100, 2)  ## overall shape of the data
x0 = torch.normal(2 * data, 1)  ## coordinates of the first class
x1 = torch.normal(-2 * data, 1)  ## coordinates of the second class
y0 = torch.zeros(100)  ## labels of the first class are stored as 0
y1 = torch.ones(100)  ## labels of the second class are stored as 1
x = torch.cat((x0, x1)).type(torch.FloatTensor)
y = torch.cat((y0, y1)).type(torch.LongTensor)


class Net(nn.Module):  ## class that stores the network structure
    def __init__(self):
        super(Net, self).__init__()
        self.classify = nn.Sequential(  ## build the network with the nn module
            nn.Linear(2, 15),  ## fully connected layer, 2 inputs, 15 outputs
            nn.ReLU(),  ## ReLU activation
            nn.Linear(15, 2),  ## fully connected layer, 15 inputs, 2 outputs
            nn.Softmax(dim=1)
        )

    def forward(self, x):  ## define the forward pass
        classification = self.classify(x)  ## pass x through the network
        return classification  ## return the prediction


net = Net()
optimizer = torch.optim.SGD(net.parameters(), lr=0.03)  ## set up the optimizer
loss_func = nn.CrossEntropyLoss()  ## set up the loss function
plt.ion()  ## turn on interactive mode
for epoch in range(100):  # training loop
    out = net(x)  ## actual output
    loss = loss_func(out, y)  ## pass the actual and expected outputs to the loss function
    optimizer.zero_grad()  ## clear the gradients
    loss.backward()  ## back-propagate the error
    optimizer.step()  ## let the optimizer update the parameters
    if epoch % 2 == 0:  ## redraw every 2 epochs
        plt.cla()  ## clear the previous plot
        classification = torch.max(out, 1)[1]  ## index of the maximum in each row
        class_y = classification.data.numpy()  ## convert to a numpy array
        target_y = y.data.numpy()  ## convert the labels to a numpy array as well
        plt.scatter(x.data.numpy()[:, 0], x.data.numpy()[:, 1], c=class_y,
                    s=100, cmap='RdYlGn')  ## draw the scatter plot
        accuracy = sum(class_y == target_y) / 200  ## compute the accuracy
        plt.text(1.5, -4, f'Accuracy={accuracy}',
                 fontdict={'size': 20, 'color': 'red'})  ## display the accuracy
        plt.pause(0.4)  ## pause for 0.4 s
plt.ioff()  ## turn off interactive mode
plt.show()

The output changes roughly as shown in the figure below:

When the accuracy approaches 1, the classification is complete and the effect is clear, showing that the trained model is effective.
