Machine Learning & Deep Learning Practical Notes (1): PyTorch Basics and Linear Regression

Recently I started learning and programming machine learning and deep learning, and I stumbled along the way, so I am recording the process and my experience here.

Linear Regression Theory

Overview of Linear Regression Models

In linear regression we assume that for some data there is a linear relationship between the output (label) and the input, i.e. it can be represented as $y=w^Tx+b$. We then need to find the most suitable pair of parameters $w$ and $b$ so that the model matches the real situation as closely as possible. Here we call our model the $H$ model, because in Andrew Ng's course it is called a hypothesis; we want our hypothesis to be as close as possible to the real situation.
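As a small sketch (the function name is my own, not standard), the hypothesis can be written directly as a function of the parameters $w$ and $b$:

```python
import torch

def hypothesis(x, w, b):
    """The H model: predict y from input x with parameters w and b."""
    return w @ x + b  # w^T x + b; @ is a dot product for 1-D tensors

# Toy check: with w = (2, 3) and b = 1, the input (1, 1) maps to 2 + 3 + 1 = 6
w = torch.tensor([2.0, 3.0])
b = torch.tensor(1.0)
x = torch.tensor([1.0, 1.0])
print(hypothesis(x, w, b))  # tensor(6.)
```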

Data Set

Here, we define the dataset as
$\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(N)},y^{(N)})\}$
The mathematical form of each sample is $(x^{(i)},y^{(i)}),\ i=1,2,\cdots,N$, where $x$ is called the input, $y$ is called the label or output, and $i$ indicates the sample number.
Of course, some tutorials or books use subscripts instead:
$\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}$
Here $x$ and $y$ generally have different shapes. For example, in object detection or image classification, $x$ is image data (roughly, a matrix or a three-dimensional array) and $y$ is the label of that image; if there are five image classes, $y$ indicates which class the image belongs to. In house price prediction, $x$ is a vector containing the quantifiable features of a house (such as area, location, etc.) and $y$ is its selling price. This may not match everyone's intuition about vectors, but each element can be regarded as a data pair, that is, two pieces of data related to each other. This is also the mathematical definition of the dataset used in the program below.
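In code, such a dataset of (input, label) pairs might look like the following sketch; the features and prices are made-up numbers for illustration only:

```python
import torch

# x: each row is one house's features (area in m^2, distance to center in km)
x = torch.tensor([[80.0, 5.0],
                  [120.0, 2.0],
                  [60.0, 10.0]])
# y: the selling price of each house, one label per sample
y = torch.tensor([300.0, 550.0, 180.0])

# Sample i is the pair (x[i], y[i])
print(x[0], y[0])
```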

Loss Function

We have a hypothetical model, so how do we judge whether this model is close to, or even equal to, the real model? For example, if we want to predict house prices, we make a prediction based on the various features of a house; is the predicted value close to or equal to its actual selling price? If the $H$ model is close to or even equal to the real model, we consider it a good model; if the gap is large, the model is not good. So we need a way to judge whether the model is good or bad.
Because the input and output are continuous variables, we can think of the concept of distance. Can we use the distance between the predicted value and the real value as a judgment indicator?
We call this function that judges whether the model is good or bad $L$, the loss function. It can be understood as the gap between the real model and the $H$ model, that is, the loss between the different models.
So at first glance we might define the loss function as a direct difference:
$L=\hat{y}-y=w^Tx+b-y$
That is, we use the distance between the $H$ model's predicted value $\hat{y}$ and the real value $y$ as the loss. But measuring the loss on only one sample is meaningless; we need to measure it over the dataset:
$L=\sum\limits^N_{i=1}(\hat{y}^{(i)}-y^{(i)})=\sum\limits^N_{i=1}(w^Tx^{(i)}+b-y^{(i)})$
But this has a drawback: the distances can be positive or negative, and the errors on different samples cancel each other out. If there are two samples $(0,1)$ and $(1,0)$, then $y=-x+1$ is obviously the most suitable $H$ model, but we find that $y=x$ also achieves a loss of 0, and the loss can even be negative, so we need to improve the loss. What comes to mind is the absolute value function, which makes the distance non-negative; the square function does the same, so we adopt a new loss function.
We want the loss function to be as small as possible, so that we obtain the best $H$ model. So how do we minimize the loss function?
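The cancellation problem can be checked numerically. The small sketch below uses the two samples $(0,1)$ and $(1,0)$ and shows that under the signed loss the bad model $y=x$ looks just as good as the true model $y=-x+1$, while the squared loss exposes it:

```python
# Two samples (x, y): (0, 1) and (1, 0)
samples = [(0.0, 1.0), (1.0, 0.0)]

def signed_loss(w, b):
    """Sum of signed differences (w*x + b - y) over the samples."""
    return sum(w * x + b - y for x, y in samples)

def squared_loss(w, b):
    """Sum of squared differences; negatives can no longer cancel."""
    return sum((w * x + b - y) ** 2 for x, y in samples)

print(signed_loss(1.0, 0.0))    # 0.0: y = x looks "perfect" under signed loss
print(signed_loss(-1.0, 1.0))   # 0.0: the true model y = -x + 1 scores the same
print(squared_loss(1.0, 0.0))   # 2.0: squared loss exposes the bad fit
print(squared_loss(-1.0, 1.0))  # 0.0: only the true model reaches zero
```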
We find that in this linear model, the value of the loss function depends entirely on the two parameters $w$ and $b$, so the loss function can also be written as
$L=L(w,b)$
This is a multivariate function, which brings to mind the derivative, that is, the rate of change of a function with respect to its independent variable; so we can take partial derivatives.
Here we find that the absolute-value loss is difficult to differentiate, so we use the squared-difference loss instead:
$L(w,b)=\sum\limits^N_{i=1}\Vert(w^Tx_i+b)-y_i\Vert^2$
Taking the partial derivatives with respect to $w$ and $b$:
$$\begin{aligned} L(w,b)&=\sum\limits^N_{i=1}\Vert(w^Tx_i+b)-y_i\Vert^2 \\ &=\sum\limits^N_{i=1}\left[(w^Tx_i+b)^2-2y_i(w^Tx_i+b)+y_i^2\right]\\ \frac{\partial L(w,b)}{\partial w}&=2\sum\limits^N_{i=1}x_i\left(w^Tx_i+b-y_i\right)\\ \frac{\partial L(w,b)}{\partial b}&=2\sum\limits^N_{i=1}\left(b+w^Tx_i-y_i\right) \end{aligned}$$
It can be seen that the partial derivative with respect to $w$ is a vector of the same shape as $x$.
Now think about it: at $x=x_0$, if the derivative of a function is positive, the function value increases as $x$ increases and decreases as $x$ decreases; when the derivative is negative, the function value decreases as $x$ increases and increases as $x$ decreases. So can we minimize the loss function by repeatedly computing the derivative and stepping against it? The answer is yes:
$w=w-\frac{\partial L(w,b)}{\partial w},\qquad b=b-\frac{\partial L(w,b)}{\partial b}$
In this way, we can move the model in the direction that minimizes the loss function.
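The update rule above can be sketched as a manual gradient-descent loop. Note that in practice each gradient is scaled by a small learning rate `lr` (my addition, not shown in the derivation above) so the steps do not overshoot:

```python
import torch

# Toy data generated from the "true" model y = 2x + 1
x = torch.tensor([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

w, b = 0.0, 0.0  # initial parameters
lr = 0.02        # learning rate scaling each gradient step

for _ in range(2000):
    # Partial derivatives of the squared loss, summed over the samples
    grad_w = (2 * x * (w * x + b - y)).sum()
    grad_b = (2 * (w * x + b - y)).sum()
    w -= lr * grad_w.item()
    b -= lr * grad_b.item()

print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
```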

PyTorch Basics

We want computers to do these calculations instead of doing them by hand; that is what makes artificial intelligence practical, so we need to program. Fortunately, with the development of artificial intelligence, many programming frameworks have appeared. These very powerful frameworks let us avoid getting entangled in program details and focus on the overall structure and theory.
We choose the PyTorch framework, which is friendly to beginners and easy to deploy. It is also currently the mainstream framework in deep learning research. As for why we don't choose TensorFlow: its current API is confusing, which creates a huge obstacle for beginners. In contrast, PyTorch's API is very consistent.
We create a file and import the package at the beginning:

import torch

First of all, the basic data type of PyTorch is the tensor, and PyTorch's operations are all based on tensors, so understanding tensors is the most important step in deep learning programming.
What is a tensor? A tensor is a multi-dimensional array that can take part in mathematical operations. Stated that bluntly it may not be clear, so let's start from the simplest concept.
Do you remember number sequences? A sequence indexed by integers. You may have encountered sequences in various programming languages, such as one-dimensional arrays in C++ or lists in Python. The characteristics of these data structures are obvious: there is a subscript or index, and a sequence composed of elements of the same or different types. To avoid ambiguity, we only consider the case where the element types are the same.
The list in Python is like this

nums = [10, 20, 30, 40, 50]  # avoid naming it "list", which shadows the built-in
print(nums[0])  # 10

In fact, this can be regarded as a number sequence with a single dimension. A dimension can represent different attributes and thus different information. For example, we can let this dimension represent time (at equal intervals), and then this sequence can represent temperature changes, house price changes, and so on.
For example, if we let the subscript $i$ represent the hour, then the temperature over a day can be expressed as
$q_0,q_1,\cdots,q_i,\cdots,q_{23},\quad i=0,1,\cdots,23$
This represents the temperature measured every hour between 0:00 and 23:00.
This dimension can also represent other attributes. For example, if we have 100 houses for sale, the dimension can represent the sample number: we can use a sequence of length 100 to represent the selling prices of these hundred houses, with the subscript indicating which sample it is:
$a_0,a_1,\cdots,a_{99}$
At this point things should be clear: a sequence is a one-dimensional tensor. It has one dimension that holds different data sharing the same attribute. The data can come from one varying attribute of a single object (such as the temperature over a day), or from the same feature of different objects (such as the 100 selling prices of 100 houses), but it is all one kind of information: a day of temperatures is temperature, not humidity; the selling prices of a hundred houses are prices, not other information.
Then we manually create a one-dimensional tensor

t1=torch.tensor([1,2,3,4])

Here, we use the tensor method in the torch library to create a tensor. We can pass in a list to initialize it, and it returns a one-dimensional tensor of length 4. We can use subscript indexing to access elements at different positions.
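A quick sketch of such indexing, along with a couple of basic tensor attributes:

```python
import torch

t1 = torch.tensor([1, 2, 3, 4])
print(t1[0])     # tensor(1): first element by subscript
print(t1[-1])    # tensor(4): negative indices count from the end
print(t1.shape)  # torch.Size([4]): one dimension of length 4
print(t1.dtype)  # torch.int64: integer inputs give an integer tensor
```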
So, what about 2D tensors? They simply have two dimensions, each representing one kind of information. For example, suppose we want to represent the temperature of ten regions over one day. We can use a two-dimensional tensor: the first dimension represents the different regions and the second represents the time of day, similar to a matrix.
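As a sketch, the regions-by-hours example could look like this; random numbers stand in for real temperature readings:

```python
import torch

# 10 regions x 24 hourly readings; random values stand in for temperatures
temps = torch.randn(10, 24)

print(temps.shape)  # torch.Size([10, 24])
print(temps[2])     # all 24 readings for region 2
print(temps[2, 5])  # region 2's temperature at hour 5
```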


Origin blog.csdn.net/qq_46202265/article/details/129121523