"Hands-on Learning Deep Learning" Notes (1) Introduction and Preliminary Knowledge

Bilibili | [MXNet/Gluon] Dive into Deep Learning, Lesson 1: from getting started to multi-class classification
Course materials: http://zh.gluon.ai
Interactive forum: http://discuss.gluon.ai
Douyu live stream: https://www.douyu.com/jiangmen

1. Introduction to deep learning

Inclusion relationship: deep learning ⊂ machine learning ⊂ artificial intelligence.

Areas where deep learning is being applied: reinforcement learning (AlphaGo), object recognition (autonomous driving, unmanned stores), speech recognition, machine translation, recommendation systems, and click prediction (advertising).

The purpose of this lecture series: to understand deep learning through hands-on implementation. Compared with industrial applications, the main differences are data scale and model complexity.

[Figure] Left: deep learning "wheels" (frameworks); the course uses MXNet/Gluon. Right: course contents.

  Part of the reason for deep learning's rapid development over the past decade: better capacity-control algorithms, attention mechanisms, memory networks and neural encoder-decoders, generative adversarial networks, distributed training algorithms, parallel computing, and deep learning frameworks.

  Deep learning is a representation learning method with multiple levels of representation. At each level (starting from the raw data), a simple function transforms that level's representation into a higher-level representation. An outward characteristic of deep learning is end-to-end training: the system is trained as a whole rather than assembled from individually tuned parts.
  Compared with other classical machine learning methods, deep learning differs in its tolerance of non-optimal solutions, its use of non-convex nonlinear optimization, and its willingness to try methods that have not been proven to work.

2. Preliminary knowledge (installation and use of the library)

Install MXNet library
ndarray: matrix creation and operations, broadcasting
autograd: automatic differentiation, the chain rule
file: linear-regression-scratch

The preliminary knowledge here includes data manipulation, data preprocessing, linear algebra, calculus, automatic differentiation, probability, and so on.

2.1-2.2 Data manipulation/preprocessing

The specific code for this part can be found at: Preliminary Knowledge | Data Manipulation

  1. Basic operations: tensors, creating a tensor, accessing a tensor's shape, querying the total number of elements, and reshaping a tensor.
  2. Ways of creation: all-zero and all-one tensors of a given shape, tensors whose elements are drawn by random sampling from a specific probability distribution, and tensors created directly from Python lists.
  3. Operators: element-wise operations on tensors of the same shape (+, -, *, /, **), element-wise unary operations (such as exponentiation), linear algebra operations (vector dot product, matrix multiplication), tensor concatenation, logical operations (constructing binary tensors), summing tensor elements, and so on.
  4. Broadcasting mechanism: arrays are expanded by copying elements so that both tensors have the same shape after the transformation. For example, when a 3×1 matrix and a 1×2 matrix are added, both are broadcast to 3×2 matrices and then added element-wise.
  5. Indexing and slicing: the same as Python arrays.
  6. Saving memory: some operations allocate new memory for their results. Using Python's id() function, you can see that the address of the referenced object changes before and after the operation. This is often undesirable; performing the operation in place is simple with slice notation, e.g. Y[:] = <expression>.
  7. Conversion to other Python objects: to convert a size-1 tensor to a Python scalar, call item() or Python's built-in functions such as int() and float().
  8. Data preprocessing: use pandas' read_csv to read data stored in CSV files, where NaN entries denote missing values. Missing values can be handled by imputation or deletion. After preprocessing, convert the numeric data to tensor format (torch.tensor). A minimal sketch of these steps follows this list.
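
A minimal sketch of the operations above (tensor creation, broadcasting, in-place assignment, scalar conversion, and pandas preprocessing), assuming PyTorch and pandas are installed; the values and column names are illustrative, not from the course:

import torch
import pandas as pd

x = torch.arange(12).reshape(3, 4)        # create a tensor and reshape it
print(x.shape, x.numel())                 # shape and total number of elements

a = torch.arange(3).reshape(3, 1)
b = torch.arange(2).reshape(1, 2)
print(a + b)                              # broadcasting: both become 3x2 before adding

y = torch.zeros(3, 4)
before = id(y)
y[:] = y + x                              # in-place update via slice notation
print(id(y) == before)                    # True: the memory address is unchanged

s = torch.tensor([3.5])
print(s.item(), float(s), int(s))         # size-1 tensor to Python scalar

# Data preprocessing: an illustrative table with a missing value (NaN)
df = pd.DataFrame({'NumRooms': [float('nan'), 2.0, 4.0],
                   'Price': [127500, 106000, 178100]})
df['NumRooms'] = df['NumRooms'].fillna(df['NumRooms'].mean())  # impute with the mean
t = torch.tensor(df.values, dtype=torch.float32)               # convert to a tensor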

2.3 Linear Algebra

  1. Addition, subtraction, multiplication, and division of scalars; length, dimension, and shape of vectors; creation, reshaping, and transposition of matrices.
  2. Vectors generalize scalars and matrices generalize vectors, so data structures with more axes can be built; tensors are a general way to describe n-dimensional arrays with an arbitrary number of axes.
  3. Element-wise unary and binary operations on tensors do not change the shape of the result. Element-wise multiplication is called the Hadamard product, written A * B.
  4. Use sum(), sum(axis=0), and sum(axis=[0,1]) to sum tensor elements, and mean() to average (the averaging dimension can also be specified); non-reducing sums and means are obtained with keepdims=True.
  5. Dot product of vectors with dot(), matrix-vector product with torch.mv(A, x), matrix-matrix multiplication with torch.mm(A, B).
  6. Norms: the L1 norm is the sum of absolute values, the L2 norm is the square root of the sum of squares, and more generally the Lp norm is the 1/p-th power of the sum of the p-th powers of the absolute values.
  7. Deep learning often involves optimization problems, such as maximizing the probability assigned to observed data or minimizing the distance between predictions and true observations; these objectives are often expressed using norms. A short sketch of these operations follows this list.
  8. More on linear algebra: the online appendix on linear algebra operations, or other good sources (Kolter, 2008; Petersen et al., 2008; Strang, 1993).
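
A short sketch of the linear algebra operations listed above, with arbitrarily chosen shapes and values for illustration:

import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = torch.ones(2, 3)
x = torch.arange(3, dtype=torch.float32)

print(A * B)                          # Hadamard (element-wise) product
print(A.sum(), A.sum(axis=0))         # total sum and column-wise sum
print(A.mean(axis=1, keepdims=True))  # row means, keeping the reduced dimension
print(torch.dot(x, x))                # vector dot product
print(torch.mv(A, x))                 # matrix-vector product, shape (2,)
print(torch.mm(A, B.T))               # matrix-matrix product, shape (2, 2)
print(torch.abs(x).sum())             # L1 norm: sum of absolute values
print(torch.norm(x))                  # L2 norm: square root of the sum of squares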

2.4-2.5 Calculus and automatic differentiation

  1. The method of approximation is the origin of integral calculus; differential calculus was invented more than two thousand years later, and its most important application is optimization. In deep learning, training a model means minimizing a loss function, and the task of fitting a model can be decomposed into two key issues:
    (1) Optimization: the process of fitting the model to the observed data. (2) Generalization: guidance on producing models whose validity extends beyond the training data set itself.
  2. The symbols d/dx and D are differentiation operators. Differentiation has its own rules for constant multiples, sums, products, and quotients.
  3. The special comment tag #@save saves the marked function, class, or statement into the d2l package.
  4. Some ways to use matplotlib
  5. Partial derivatives of multivariate functions ∂y/∂x_i, and the gradient ∇_x f(x).
  6. The chain rule.
  7. Automatic differentiation: manually deriving gradients for complex models is tedious and error-prone. Automatic differentiation lets the system back-propagate gradients, as follows:
import torch
x = torch.arange(4.0)
x.requires_grad_(True)   # at this point x.grad = None
y = 2 * torch.dot(x, x)  # y = tensor(28., grad_fn=<MulBackward0>)
y.backward()             # x.grad = tensor([ 0.,  4.,  8., 12.])
  1. When y is not a scalar, the derivative of vector y with respect to vector x is a matrix. In deep learning the goal is usually not this differentiation matrix but the sum of the partial derivatives over each sample in the batch, computed with y.sum().backward().
  2. Detaching computations: use u = y.detach() to treat part of the computation as a constant during backpropagation, separating it from the computational graph.
  3. Another benefit of automatic differentiation is that gradients can still be computed even when the function contains Python control flow (conditionals, loops, or function calls). A sketch of these three points follows.
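
A small sketch of the three points above (non-scalar backpropagation, detaching, and control flow); the function f below is only illustrative:

import torch

# 1. Non-scalar y: back-propagate the sum of the per-sample outputs
x = torch.arange(4.0, requires_grad=True)
y = x * x
y.sum().backward()
print(x.grad)                 # tensor([0., 2., 4., 6.])

# 2. Detaching: u = y.detach() is treated as a constant, so d(u*x)/dx = u
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x
z.sum().backward()
print(x.grad == u)            # tensor([True, True, True, True])

# 3. Python control flow: gradients still flow through loops and conditionals
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    return b if b.sum() > 0 else 100 * b

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()
print(a.grad == d / a)        # True, since f(a) = k * a for some constant k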

2.6 Probability

  1. Simply put, machine learning is about making (probabilistic) predictions. Probability is a flexible language for specifying degrees of certainty that is useful in a wide variety of domains.
  2. Basic probability operations in Python:
import torch
from torch.distributions import multinomial
# Multinomial distribution: simulating rolls of a fair die
fair_probs = torch.ones([6]) / 6  # fair_probs = tensor([0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667])
multinomial.Multinomial(1, fair_probs).sample()  # tensor([0., 0., 1., 0., 0., 0.])
multinomial.Multinomial(10, fair_probs).sample()  # tensor([2., 4., 1., 0., 1., 2.])
  1. The axioms of probability theory: probabilities are non-negative, the probability of the entire sample space is 1, and the probability of a sequence of mutually exclusive events equals the sum of their individual probabilities.
  2. Random variables, joint probability, conditional probability, Bayes' theorem.
  3. Marginalization: the probability of B is obtained by summing the joint probability of A and B over all possible choices of A.
  4. Independence, expectation, variance. A small sketch relating to the dice example and Bayes' theorem follows this list.
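
Two small sketches with illustrative numbers: the empirical dice frequencies approach the true probability 1/6 as the number of throws grows (law of large numbers), and Bayes' theorem applied to made-up probabilities:

import torch
from torch.distributions import multinomial

fair_probs = torch.ones([6]) / 6
counts = multinomial.Multinomial(10000, fair_probs).sample()
print(counts / 10000)            # each entry is close to 0.1667

# Bayes' theorem with made-up values: P(A|B) = P(B|A) * P(A) / P(B)
p_A, p_B_given_A, p_B = 0.01, 0.9, 0.05
print(p_B_given_A * p_A / p_B)   # P(A|B) = 0.18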

2.7 How to consult the documentation

import torch
# List the functions and classes that can be called in a module
print(dir(torch.distributions))
# Look up the usage of a specific function or class
help(torch.ones)
