Hands-on Deep Learning v2, Part 1: Introduction; Supervised Learning vs. Unsupervised Learning

1. Introduction

1.2. Key Components in Machine Learning

First, we introduce some core components. No matter what type of machine learning problem you tackle, you will encounter these components:

  1. Data that can be used for learning (data);

  2. A model for transforming the data (model);

  3. An objective function to quantify the effectiveness of the model;

  4. An algorithm for tuning model parameters to optimize the objective function.

1. Data

It is not enough just to have a lot of data; we also need the right data. If the data is full of errors, or if its features are not predictive of the task's target, the model is likely to be ineffective. There is an old saying that captures this phenomenon well: "Garbage in, garbage out."

2. Model

Most machine learning involves transforming data. One example is a system that takes photos and predicts whether they contain smiling faces; another ingests a set of sensor readings and predicts whether the readings are normal or abnormal. While simple models can solve such simple problems, the problems we focus on in this book push past the limits of classical methods. The main difference between deep learning and classical methods is that the former focuses on powerful models built from neural networks that transform the data layer by layer, hence the name deep learning. In discussing deep models, this book will also mention some traditional methods.

3. Objective function

The preceding sections introduced machine learning as "learning from experience". "Learning" here means autonomously improving the model's performance on certain tasks. But what counts as real improvement? In machine learning, we need to define a measure of how good the model is. This measure is optimizable in most cases and is called the objective function. We usually define an objective function and wish to optimize it down to its lowest point. Because lower is better, these functions are sometimes called loss functions (or cost functions). But this is merely a convention: we could just as well take a new function and optimize it up to its highest point. The two formulations are essentially the same; only the sign is flipped.

When the task is to predict a numeric value, the most common loss function is squared error, the square of the difference between the predicted value and the actual value. For classification, the most common objective is to minimize the error rate, i.e., the proportion of samples whose predictions disagree with the ground truth. Some objectives (such as squared error) are easy to optimize, while others (such as error rate) are difficult to optimize directly because of non-differentiability or other complications. In these cases, a surrogate objective is usually optimized instead.
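As a minimal sketch (in PyTorch, with made-up predictions and targets; not code from the book), the two objectives can be computed like this:

import torch

preds = torch.tensor([2.5, 0.0, 2.1])     # hypothetical predicted values
targets = torch.tensor([3.0, -0.5, 2.0])  # hypothetical ground truth
squared_error = ((preds - targets) ** 2).mean()  # mean squared error

labels_pred = torch.tensor([1, 0, 1, 1])  # hypothetical predicted classes
labels_true = torch.tensor([1, 1, 1, 0])
error_rate = (labels_pred != labels_true).float().mean()  # fraction misclassified
print(squared_error, error_rate)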

Usually, the loss function is defined in terms of the model parameters and depends on the dataset. We learn optimal values of the model parameters by minimizing the total loss over a dataset. That dataset consists of samples collected for training and is called the training dataset (or training set). However, a model that performs well on the training data does not necessarily perform as well on "new data"; such a held-out dataset is usually called the test dataset (or test set).

To sum up, the available data is usually split into two parts: the training dataset is used to fit the model parameters, and the test dataset is used to evaluate the fitted model. We then observe the model's performance on both parts. A model's performance on the training dataset is like a student's score on a mock exam: it is a reference for the real final exam, but even encouraging mock scores do not guarantee success on the final. In other words, test performance may deviate significantly from training performance. When a model performs well on the training set but fails to generalize to the test set, we say the model is overfitting. Just as in real life, doing well on the mock exams does not guarantee acing the real one.

4. Optimization algorithm

Once we have a data source and its representation, a model, and an appropriate loss function, we need an algorithm that searches for the best parameters to minimize the loss function. In deep learning, most popular optimization algorithms are based on one basic method: gradient descent. In short, at each step gradient descent examines each parameter, asking in which direction the training loss would move if the parameter were perturbed by a small amount. It then updates the parameters in the direction that reduces the loss.
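As a minimal sketch of gradient descent (a toy quadratic loss with a single parameter; the numbers below are assumptions chosen for illustration, not from the text):

import torch

w = torch.tensor(2.0, requires_grad=True)  # the parameter to tune
lr = 0.1                                   # learning rate (step size)
for step in range(20):
    loss = (w - 5.0) ** 2      # toy loss, minimized at w = 5
    loss.backward()            # compute d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad       # move against the gradient
        w.grad.zero_()         # clear the accumulated gradient
print(w)  # approaches tensor(5.)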

1.3 Supervised Learning

Supervised learning is good at predicting labels given input features. Each "feature-label" pair is called an example. Sometimes "example" refers just to the input features, even when the label is unknown. Our goal is to produce a model that maps any input features to a label (i.e., a prediction).

To give a concrete example: suppose we need to predict whether a patient will have a heart attack. Then the observation "heart attack" or "no heart attack" is the label, and the input features could be vital signs such as heart rate and diastolic and systolic blood pressure.

Supervised learning works because, when training the parameters, we provide the model with a dataset in which each example has a true label. In probabilistic terms, we wish to estimate the conditional probability of the label given the input features. Although supervised learning is only one of several broad classes of machine learning problems, most of the successful industrial applications of machine learning use it. This is because, to some extent, many important tasks can be described cleanly as estimating the probability of something unknown given a specific set of available data.

The learning process of supervised learning can generally be divided into three steps:

  1. Randomly select a subset from a large collection of data samples and obtain the ground-truth label for each. Sometimes the samples are already labeled (e.g., did the patient recover within the following year?); sometimes they must be labeled manually (e.g., image classification). Together, these inputs and their labels form the training dataset;

  2. Choose a supervised learning algorithm that takes a training dataset as input and outputs a "learned model";

  3. Put previously unseen sample features into this "learned model" and use the model's output as a prediction for the corresponding label.

  1. Regression: A regression problem is the problem of predicting continuous values. For example, predicting house prices, stock prices, or people's heights. These are continuous values, and our goal is to find the relationship between the input features and the continuous target. For instance, we might use features such as a house's size, location, and year of construction to predict its price.

  2. Classification: Classification problems are problems of predicting discrete values . For example, determining whether an email is spam or not spam, or whether a picture is a cat or a dog. In these cases, our goal is to classify samples into two or more classes based on input features.

  3. Labeling Problems: Labeling problems are predictions about multiple properties of objects , which are not mutually exclusive. For example, in natural language processing, we may need to label the part of speech (noun, verb, adjective, etc.) of each word in a sentence. In this case, each word can have multiple tags.

  4. Search: In supervised learning, search can be seen as learning a policy for finding the best candidate among a large number of possible solutions. For example, an AI for a board game needs to decide the best move at each turn.

  5. Recommender systems: A recommender system is an information-filtering system that predicts a user's "rating" of or "preference" for an item. For example, based on a user's purchase history, browsing history, and other information, it predicts which new products or services the user may like; Netflix, for instance, suggests new movies based on films the user has watched in the past.

1.3.2. Unsupervised Learning

All the examples so far have involved supervised learning, where the model is fed huge datasets in which each sample contains features and a corresponding label. Just for fun: a "supervised learning" model is like a wage earner with an extremely specialized job and an extremely micromanaging boss. The boss stands behind the model and says exactly what to do in every situation until the model learns the mapping from situation to action. Pleasing this boss is easy: just recognize the pattern and imitate its behavior as quickly as possible.

Conversely, if the work has no very specific goal, you have to learn "spontaneously". For example, the boss might hand us a pile of data and ask for some data science research with it, with no requirements on the result. Machine learning problems in which the data contains no "target" are referred to as unsupervised learning; the techniques will be discussed in later chapters of this book. So what kinds of questions can unsupervised learning answer? Consider the examples below.

  • Clustering: Can we classify data without labels? For example, given a set of photos, can we separate them into photos of landscapes, dogs, babies, cats, and mountains? Similarly, given a group of users' web-browsing records, can we cluster users with similar behavior?

  • Principal component analysis: Can we find a small number of parameters that accurately capture the linearly dependent properties of the data? For example, the trajectory of a ball can be described by its velocity, diameter, and mass. As another example, tailors have developed a small set of parameters that describe the shape of the human body fairly accurately for fitting clothes. Yet another example: is there a representation of (arbitrarily structured) objects in Euclidean space such that their symbolic properties are well matched? This can be used to describe entities and their relations, e.g., "Rome" − "Italy" + "France" = "Paris".

  • Causality and probabilistic graphical models: Can we describe the root causes behind much of the data we observe? For example, given demographic data about housing prices, pollution, crime, geography, education, and wages, can we discover the relationships among them simply from empirical data?

  • Generative adversarial networks: These give us a way to synthesize data, even complex unstructured data such as images and audio. The underlying statistical mechanism is a test of whether real and fake data can be distinguished; this is another important and exciting area of unsupervised learning.

1.3.3. Interacting with the environment

You may have wondered: where does the input data for machine learning come from, and where does the output of machine learning go? So far, whether supervised or unsupervised, we fetched a large amount of data up front and then ran the model without further interaction with the environment. All the learning here happens after the algorithm is disconnected from the environment; this is called offline learning. For supervised learning, the process of collecting data from the environment is similar to Figure 1.3.6.

1.3.4. Reinforcement Learning

If you are interested in using machine learning to develop agents that interact with and act on an environment, you may end up focusing on reinforcement learning. This includes artificial intelligence (AI) applied to robots and dialogue systems, and even the development of video games. Deep reinforcement learning, which applies deep learning to reinforcement learning problems, is a very popular research field. The groundbreaking deep Q-network (DQN) that beat humans at Atari games using only visual input, and the AlphaGo program that beat a world champion at the board game Go, are two prominent examples of reinforcement learning.

In reinforcement learning problems, an agent interacts with an environment over a series of time steps. At each time step, the agent receives an observation from the environment and must choose an action, which is then transmitted back to the environment by some mechanism (sometimes called an actuator); finally, the agent receives a reward from the environment. Then a new cycle begins: the agent receives the next observation, chooses the next action, and so on. The reinforcement learning process is illustrated in Figure 1.3.7. Note that the goal of reinforcement learning is to produce a good policy. The "actions" a reinforcement learning agent chooses are governed by a policy, a function that maps observations of the environment to actions.
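A minimal sketch of this observation-action-reward loop (the ToyEnv class and the random policy below are hypothetical stand-ins invented for illustration, not anything defined in the text):

import random

class ToyEnv:
    """A hypothetical environment: action 1 yields reward 1, action 0 yields 0."""
    def reset(self):
        self.t = 0
        return 0.0                          # initial observation
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 10                 # the episode ends after 10 steps
        return 0.0, reward, done            # (observation, reward, done)

def policy(observation):
    return random.choice([0, 1])            # a random policy, for illustration

env = ToyEnv()
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = policy(observation)                  # the policy maps observations to actions
    observation, reward, done = env.step(action)  # the environment responds
    total_reward += reward                        # rewards accumulate over the episode
print(total_reward)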

The generality of the reinforcement learning framework is very powerful. For example, we can convert any supervised learning problem into a reinforcement learning problem. Given a classification problem, we can create a reinforcement learning agent with one "action" per class, then create an environment that gives the agent a reward consistent with the loss function of the original supervised learning problem.

Of course, reinforcement learning can also solve many problems that supervised learning cannot. For example, in supervised learning, we always want an input to be associated with the correct label. But in reinforcement learning, we don't assume that the environment tells the agent the optimal action for each observation. In general, the agent is just given some reward. Furthermore, the environment may not even tell which behaviors lead to rewards.

Take applying reinforcement learning to chess as an example. The only true reward signal arrives at the end of the game: when the agent wins, it receives a reward of 1; when it loses, it receives -1. Reinforcement learners must therefore deal with the credit assignment problem: deciding which actions should be credited or blamed for the outcome. It is like an employee's promotion: the promotion likely reflects many actions over the previous year, and earning more promotions in the future requires figuring out which actions along the way led to it.

Reinforcement learning may also have to deal with partial observability. That is, the current observation may not tell us everything about the current state. Say a cleaning robot finds itself trapped in one of many identical closets in a house. Inferring the robot's precise location (and thus its state) requires considering the observations it made before stepping into the closet.

Finally, at any point in time a reinforcement learning agent may know one good policy while many better policies remain untried. The agent must constantly choose whether to exploit the best policy currently known or to explore the space of policies (giving up some short-term reward in exchange for knowledge).

1.4. Summary

  • Machine learning is the study of how computer systems use experience (usually data) to improve performance on specific tasks. It combines ideas from statistics, data mining, and optimization. Often, it is used as a means to implement artificial intelligence solutions.

  • As a type of machine learning, representation learning focuses on how to automatically find the appropriate data representation. Deep learning is multi-level representation learning by learning multi-level transformations.

  • Deep learning not only replaces the shallow models of traditional machine learning, but also replaces labor-intensive feature engineering.

  • Much of the recent progress in deep learning has been triggered by the massive amounts of data generated by cheap sensors and internet-scale applications, as well as breakthroughs in computing power (via GPUs).

  • Optimizing the whole system end to end is key to obtaining high performance. The open-source availability of effective deep learning frameworks makes this much easier to design and implement.

2. Preliminary knowledge

2.1. Data manipulation

In order to perform various operations on data, we need some way to store and manipulate it. Generally, there are two important things we need to do: (1) acquire the data; (2) process it once it has been read into the computer. There is no point in acquiring data without a way to store it.

First, we introduce the n-dimensional array, also known as a tensor. This section will be familiar to readers who have used the NumPy computing package in Python. No matter which deep learning framework is used, its tensor class (ndarray in MXNet, Tensor in PyTorch and TensorFlow) is similar to NumPy's ndarray. But the deep learning frameworks offer two important capabilities beyond NumPy: first, they support accelerated computation on GPUs, while NumPy only supports CPU computation; second, the tensor class supports automatic differentiation. These features make the tensor class better suited to deep learning. Unless otherwise specified, "tensor" in this book refers to an instance of the tensor class.

Automatic differentiation (AD) is a technique for computing derivatives programmatically.

Suppose we have the following function:

f(x) = x^2 + 2x + 1

We want to compute the derivative of f(x) at x = 2. Computing by hand:

f'(x) = 2x + 2
f'(2) = 6

Now we can use automatic differentiation to compute this derivative. Below is example code that uses the tensor class in TensorFlow:

import tensorflow as tf

# Define the variable x and assign it the value 2
x = tf.Variable(2.0)

# Define the function f(x)
def f(x):
    return x**2 + 2*x + 1

# Use TensorFlow's GradientTape to record gradient information
with tf.GradientTape() as tape:
    # Compute the function value
    y = f(x)
    
# Compute the derivative
dy_dx = tape.gradient(y, x)

# Print the derivative
print(dy_dx)
Running the above code prints:
tf.Tensor(6.0, shape=(), dtype=float32)

2.1.1. Getting Started


import torch
x = torch.arange(12)
x
tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

print(x.shape)
print(x.numel())
x=x.reshape(3, 4)
print(x.shape)
print(x.numel())
torch.Size([12])
12
torch.Size([3, 4])
12

Random values:

torch.randn(3,4)
tensor([[ 0.5627, -0.0208,  0.7325,  0.4197],
        [ 1.6485, -2.6882, -2.7821,  0.7676],
        [ 0.8092,  0.0832,  1.0177,  0.6758]])

2.1.2. Operators, broadcasting mechanism, indexing and slicing, saving memory

x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y  # the ** operator performs exponentiation
torch.exp(x)

In addition to element-wise calculations, we can also perform linear algebra operations, including vector dot products and matrix multiplication. We explain the essentials of linear algebra in Section 2.3.

X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
print(X.shape,Y.shape)
print(torch.cat((X, Y), dim=0).shape, torch.cat((X, Y), dim=1).shape)

torch.Size([3, 4]) torch.Size([3, 4])
torch.Size([6, 4]) torch.Size([3, 8])
Broadcasting:
x=torch.arange(6).reshape(1,6)
y=torch.arange(6).reshape(6,1)

x,y
(tensor([[0, 1, 2, 3, 4, 5]]),
 tensor([[0],
         [1],
         [2],
         [3],
         [4],
         [5]]))

x + y
tensor([[ 0,  1,  2,  3,  4,  5],
        [ 1,  2,  3,  4,  5,  6],
        [ 2,  3,  4,  5,  6,  7],
        [ 3,  4,  5,  6,  7,  8],
        [ 4,  5,  6,  7,  8,  9],
        [ 5,  6,  7,  8,  9, 10]])

Indexing and slicing: start:end:step; commas separate dimensions

x[0,0:6:2]

tensor([0, 2, 4])

Saving memory: in-place updates such as x += y avoid allocating a new tensor.
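A quick check (plain Python id(); a small sketch, not from the book's text): in-place addition reuses the same memory, while x = x + y allocates a new tensor.

import torch

x = torch.arange(6)
y = torch.ones(6, dtype=torch.long)
before = id(x)
x += y                   # in-place: reuses x's memory
print(id(x) == before)   # True
x = x + y                # allocates a new tensor
print(id(x) == before)   # False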

Convert to other Python objects

x=torch.arange(6).reshape(1,6)

y=x.numpy()
type(x),type(y)
(torch.Tensor, numpy.ndarray)

2.2. Data preprocessing

2.2.1. Reading the dataset

Create the dataset:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row is a data sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

# If pandas is not installed, uncomment the following line to install it (e.g., in a Jupyter Notebook)
# !pip install pandas
import pandas as pd

data = pd.read_csv(data_file)
data
   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000

Handle missing values

Note that "NaN" entries represent missing values. To deal with missing data, typical methods include interpolation and deletion , where interpolation compensates for missing values ​​with a surrogate value, and deletion simply ignores missing values. Here we will consider interpolation.

inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = inputs.fillna(inputs.mean())
inputs
   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN

For categorical or discrete values in inputs, we treat "NaN" as a category. Since the "Alley" column takes only two kinds of categorical values, "Pave" and "NaN", pandas can automatically convert this column into two columns, "Alley_Pave" and "Alley_nan". Rows whose alley type is "Pave" get "Alley_Pave" = 1 and "Alley_nan" = 0; rows missing the alley type get 0 and 1, respectively.

inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1

Convert to tensor format

import torch

X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y

2.3 Linear Algebra

Vector dot product:

A·B = A1B1 + A2B2 + ... + AnBn

Two vectors are orthogonal iff A·B = 0.

Norm: see the Norm subsection later in this chapter.

 2.3.1. Scalars

import torch

x = torch.tensor(3.0)
y = torch.tensor(2.0)
x + y, x * y  # scalar arithmetic -> (tensor(5.), tensor(6.))

x = torch.arange(4)  # a vector
x
tensor([0, 1, 2, 3])

a=torch.arange(10).reshape(2,5)
tensor([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])
Transpose: swap rows and columns
a.T
tensor([[0, 5],
        [1, 6],
        [2, 7],
        [3, 8],
        [4, 9]])

Symmetric matrix: \mathbf{A} = \mathbf{A}^\top

Dimensionality reduction: by default, calling the sum function reduces a tensor along all of its axes, producing a scalar. We can also specify the axis along which the tensor is reduced by summation. Taking a matrix as an example, to reduce along axis 0 by summing the elements of all rows, we pass axis=0 when calling the function. Since the input matrix is reduced along axis 0 to produce the output vector, the size of axis 0 disappears from the output shape.

A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
A
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.],
        [16., 17., 18., 19.]])

A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape
(tensor([40., 45., 50., 55.]), torch.Size([4]))
Specifying axis=1 reduces the dimension along axis 1 by summing the elements of all columns; the size of axis 1 thus disappears from the output shape.
A_sum_axis1 = A.sum(axis=1)
A_sum_axis1, A_sum_axis1.shape
(tensor([ 6., 22., 38., 54., 70.]), torch.Size([5]))
A quantity related to the sum is the mean (or average):
A.mean(), A.sum() / A.numel()
(tensor(9.5000), tensor(9.5000))
Likewise, the function that computes the mean can also reduce a tensor's dimension along a specified axis.
A.mean(axis=0), A.sum(axis=0) / A.shape[0]
Non-reducing sum (keepdims):
A.sum(axis=1, keepdims=True),A.sum(axis=1)
(tensor([[ 6.],
         [22.],
         [38.],
         [54.],
         [70.]]),
 tensor([ 6., 22., 38., 54., 70.]))
Cumulative sum along an axis:
A.cumsum(axis=1)
tensor([[ 0.,  1.,  3.,  6.],
        [ 4.,  9., 15., 22.],
        [ 8., 17., 27., 38.],
        [12., 25., 39., 54.],
        [16., 33., 51., 70.]])

Note: whichever dimension axis selects is the dimension that disappears, unless keepdim=True is passed.

x = torch.arange(24).reshape(2,3,4)
x0=x.sum(axis=0)
x1=x.sum(axis=1)
x.shape,x0.shape,x1.shape,x0,x1
(torch.Size([2, 3, 4]),
 torch.Size([3, 4]),
 torch.Size([2, 4]),
 tensor([[12, 14, 16, 18],
         [20, 22, 24, 26],
         [28, 30, 32, 34]]),
 tensor([[12, 15, 18, 21],
         [48, 51, 54, 57]]))
x0 = x.sum(axis=0, keepdim=True)  # shape: torch.Size([1, 3, 4])
x1 = x.sum(axis=1, keepdim=True)  # shape: torch.Size([2, 1, 4])
x0 = x.sum(axis=[0, 1])           # shape: torch.Size([4])
x1 = x.sum(axis=[1, 2])           # shape: torch.Size([2])

2.3.7. Dot product

\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i

x = torch.arange(5, dtype = torch.float32)
y = torch.arange(5, dtype = torch.float32)
x, y, torch.dot(x, y)  # dot product
(tensor([0., 1., 2., 3., 4.]), tensor([0., 1., 2., 3., 4.]), tensor(30.))

Matrix-vector product

x = torch.arange(5)
A=torch.arange(15).reshape(3,5)
x.shape,A.shape,torch.mv(A, x)
(torch.Size([5]), torch.Size([3, 5]), tensor([ 30,  80, 130]))

Matrix-matrix multiplication:

\mathbf{C} = \mathbf{A}\mathbf{B} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_{n} \end{bmatrix} \begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_{1}\mathbf{b}_{1} & \mathbf{a}^\top_{1}\mathbf{b}_{2} & \cdots & \mathbf{a}^\top_{1}\mathbf{b}_{m} \\ \mathbf{a}^\top_{2}\mathbf{b}_{1} & \mathbf{a}^\top_{2}\mathbf{b}_{2} & \cdots & \mathbf{a}^\top_{2}\mathbf{b}_{m} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}^\top_{n}\mathbf{b}_{1} & \mathbf{a}^\top_{n}\mathbf{b}_{2} & \cdots & \mathbf{a}^\top_{n}\mathbf{b}_{m} \end{bmatrix}.

A=torch.arange(16).reshape(4,4)

torch.mm(A, A)
tensor([[ 56,  62,  68,  74],
        [152, 174, 196, 218],
        [248, 286, 324, 362],
        [344, 398, 452, 506]])

Norm

Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big the vector is; the notion of size here concerns not the number of dimensions but the magnitude of the components.

In linear algebra, a vector norm is a function f that maps a vector to a scalar. Given an arbitrary vector x, a vector norm satisfies certain properties. The first is that if we scale all elements of a vector by a constant factor \alpha, its norm scales by the absolute value of the same constant factor:

f(\alpha \mathbf{x}) = |\alpha| f(\mathbf{x}).

The second property is the familiar triangle inequality: f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}).

The third property states that the norm must be non-negative: f(\mathbf{x}) \geq 0.

This makes sense: in most cases the smallest size of anything is 0. The final property requires that the norm is 0 if and only if the vector consists entirely of zeros: \forall i, [\mathbf{x}]_i = 0 \Leftrightarrow f(\mathbf{x}) = 0.

A norm sounds a lot like a measure of distance, and the non-negativity and triangle inequality of Euclidean distance and the Pythagorean theorem may shed some light: in fact, the Euclidean distance is an L2 norm. Suppose the elements of an n-dimensional vector x are x1, ..., xn; its L2 norm is the square root of the sum of the squares of the vector's elements:

\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2},
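In code, torch.norm computes the L2 norm of a vector (a quick sanity check: the norm of [3, -4] is 5):

u = torch.tensor([3.0, -4.0])
torch.norm(u)
tensor(5.)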

The square of the L2 norm is used more often in deep learning. The L1 norm is also frequently encountered; it is the sum of the absolute values of the vector's elements:

\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.

torch.abs(x).sum()  # L1 norm

Both the L2 norm and the L1 norm are special cases of the more general Lp norm:

\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.

Analogous to the L2 norm of a vector, the Frobenius norm of a matrix \mathbf{X} \in \mathbb{R}^{m \times n} is the square root of the sum of the squares of the matrix's elements:

\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.

The Frobenius norm satisfies all the properties of a vector norm, it is like the L2 norm of a matrix-shaped vector. Calling the following function will compute the Frobenius norm of a matrix.
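torch.norm(torch.ones((4, 9)))  # Frobenius norm of a 4x9 all-ones matrix: sqrt(36)
tensor(6.)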

Norms and objectives

In deep learning, we often try to solve optimization problems: maximize the probability assigned to the observed data; minimize the distance between predictions and the true observations; represent items (such as words, products, or news articles) with vectors such that the distance between similar items is minimized and the distance between dissimilar items is maximized. The objective, perhaps the most important ingredient of a deep learning algorithm (besides the data), is often expressed as a norm.

Summary

  • Scalars, vectors, matrices, and tensors are fundamental mathematical objects in linear algebra.

  • Vectors generalize from scalars, and matrices generalize from vectors.

  • Scalars, vectors, matrices, and tensors have zero, one, two, and any number of axes, respectively.

  • A tensor can be reduced along specified axes with sum and mean.

  • The element-wise multiplication of two matrices is called their Hadamard product. It is not the same as matrix multiplication.

  • In deep learning, we often use norms such as L1 norm, L2 norm and Frobenius norm.

  • We can perform various operations on scalars, vectors, matrices, and tensors.

Exercises

Prove that the transpose of the transpose of a matrix is the matrix itself, i.e., (A^T)^T = A.

a = torch.arange(8).reshape(2,4)
a,a.T,a.T.T
(tensor([[0, 1, 2, 3],
         [4, 5, 6, 7]]),
 tensor([[0, 4],
         [1, 5],
         [2, 6],
         [3, 7]]),
 tensor([[0, 1, 2, 3],
         [4, 5, 6, 7]]))

Given two matrices A and B, show that the sum of their transposes equals the transpose of their sum, that is, A^T + B^T = (A + B)^T.

a = torch.arange(6).reshape(2,3)
b = torch.arange(6,0,-1).reshape(2,3)
a,b,(a+b).T,a.T+b.T
(tensor([[0, 1, 2],
         [3, 4, 5]]),
 tensor([[6, 5, 4],
         [3, 2, 1]]),
 tensor([[6, 6],
         [6, 6],
         [6, 6]]),
 tensor([[6, 6],
         [6, 6],
         [6, 6]]))


Given any square matrix A, is A+A^T always symmetric? Why?

Yes. (A + A^T)^T = A^T + (A^T)^T = A^T + A = A + A^T, so the sum equals its own transpose and is always symmetric.

a = torch.arange(9).reshape(3, 3)
a + a.T, (a + a.T) == (a + a.T).T

A tensor X of shape (2,3,4) is defined in this section. What is the output of len(X)?

a = torch.arange(24).reshape(2,3,4)
len(a)
2
It turns out to be the size of the first dimension (axis 0).


For a tensor X of arbitrary shape, does len(X) always correspond to the length of a particular axis of X? What is this axis?

Yes: len(X) always equals the size of the first axis, axis 0.
Run A/A.sum(axis=1) and see what happens. Can you analyze the reason?

An error is raised: RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 1. A.sum(axis=1) drops axis 1 and has shape (3,), which cannot broadcast against A's shape (3, 4); A / A.sum(axis=1, keepdims=True) works as intended.
Consider a tensor of shape (2,3,4). What are the shapes of the summed outputs along axes 0, 1, and 2?

x = torch.arange(24).reshape(2, 3, 4)
x.sum(axis=0).shape, x.sum(axis=1).shape, x.sum(axis=2).shape
(torch.Size([3, 4]), torch.Size([2, 4]), torch.Size([2, 3]))

Give the linalg.norm function a tensor of 3 or more axes and observe its output. What does this function compute for tensors of arbitrary shape?

torch.linalg.norm(A, ord=None, dim=None, keepdim=False, *, out=None, dtype=None)
A is the tensor whose norm is computed; ord is the order of the norm; dim specifies the dimensions along which to compute it; keepdim controls whether the reduced dimensions are kept; out is the output tensor; dtype is the output dtype. If neither ord nor dim is given, the 2-norm of the flattened input is computed.
# compute the 2-norm of a vector
x = torch.randn(3)
norm_x = torch.linalg.norm(x)

# compute the Frobenius norm of a matrix
A = torch.randn(3, 4)
norm_A = torch.linalg.norm(A)
norm_x,norm_A
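Given a tensor with three or more axes and the default arguments, torch.linalg.norm flattens the input and returns the 2-norm of the resulting vector, i.e., the square root of the sum of all squared entries (a quick check with a tensor of ones):

B = torch.ones(2, 3, 4)
torch.linalg.norm(B)  # sqrt(2*3*4) = sqrt(24) ≈ 4.8990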

2.4. Matrix calculus, automatic differentiation, calculus

Derivative

[Figures: the basic matrix-calculus derivative rules were shown here as images.]

To sum up: the dimensions stay unchanged, and the numerator does not move into the denominator.

 2.4.1. Derivatives and differentials

Suppose we have a function f: R → R whose input and output are both scalars. If the derivative of f exists, it is defined as the limit

f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}.
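A quick numerical check of this limit (using f(x) = 3x² - 4x, the same example function plotted below, for which f'(1) = 2):

def f(x):
    return 3 * x ** 2 - 4 * x

h = 0.1
for _ in range(5):
    print(f'h={h:.5f}, numerical limit={(f(1 + h) - f(1)) / h:.5f}')
    h *= 0.1  # the difference quotient approaches f'(1) = 2 as h shrinks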

Let's look at a few equivalent notations for derivatives.
Given y = f(x), where x and y are the independent and dependent variables of the function f, the following expressions are equivalent:

f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x),

To visualize this interpretation of derivatives, we will use matplotlib, a popular plotting library in Python. To configure the properties of the figures matplotlib generates, we need to define a few functions. Below, the use_svg_display function tells matplotlib to output svg figures, which look crisper.

Note that the comment #@save is a special tag: the corresponding function, class, or statement is saved in the d2l package, so it can be called directly later (for example, d2l.use_svg_display()) without being redefined.

With these three functions for figure configuration, we define a plot function to plot multiple curves succinctly, since we will need to visualize many curves throughout the book.

import numpy as np
from matplotlib_inline import backend_inline
from d2l import torch as d2l

def use_svg_display():  #@save
    """Use the svg format to display plots in Jupyter"""
    backend_inline.set_matplotlib_formats('svg')
def set_figsize(figsize=(4, 3)):  #@save
    """Set the figure size for matplotlib"""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize
#@save
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """设置matplotlib的轴"""
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()
#@save
def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """绘制数据点"""
    if legend is None:
        legend = []

    set_figsize(figsize)
    axes = axes if axes else d2l.plt.gca()

    # Return True if X has one axis
    def has_one_axis(X):
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X):
        X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        if len(x):
            axes.plot(x, y, fmt)
        else:
            axes.plot(y, fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
def f(x):
    return 3 * x ** 2 - 4 * x  # the example function; f'(1) = 2, so the tangent at x=1 is y = 2x - 3

x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])

Partial derivative

Let y=f(x₁,x₂,...,xₙ) be a function with n variables. The partial derivative of y with respect to the i-th parameter xi is:

\frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i+h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}.

To compute ∂y/∂xᵢ, we can simply treat xᵢ as the variable and all the other variables as constants, then compute dy/dxᵢ (only xᵢ is perturbed by h while the others stay fixed; then we divide by h).

For the representation of partial derivatives, the following are equivalent:

\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = f_{x_i} = f_i = D_i f = D_{x_i} f.

Gradient

We can concatenate the partial derivatives of a multivariate function with all its variables to get the gradient vector of that function. Specifically, let the input of a function f: ℝ^n → ℝ be an n-dimensional vector x = [x₁,x₂,…,xₙ]ᵀ, and the output be a scalar. The gradient of a function f with respect to x is a vector of n partial derivatives:

\nabla_{\mathbf{x}} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\bigg]^\top,

Chain rule

However, gradients can be difficult to find this way: in deep learning, multivariate functions are usually composites, so the rules above rarely apply directly. Fortunately, the chain rule can be used to differentiate composite functions.

Let's consider univariate functions first. Assuming that the functions y = f(u) and u = g(x) are both differentiable, according to the chain rule:

\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.

Now consider a more general scenario, the case where a function has an arbitrary number of variables. Suppose a differentiable function y has variables u₁,u₂,…,uₘ, where each differentiable function uᵢ has variables x₁,x₂,…,xₙ. Note that y is a function of x₁,x₂,…,xₙ. For any i=1,2,…,n, the chain rule gives:

\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \cdots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i}
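A small check of the univariate chain rule with PyTorch autograd (the functions u = x² and y = sin(u) are chosen just for this illustration; dy/dx = cos(x²) · 2x):

import torch

x = torch.tensor(1.5, requires_grad=True)
u = x ** 2                        # inner function u = g(x)
y = torch.sin(u)                  # outer function y = f(u)
y.backward()
print(x.grad)                     # autograd's dy/dx
print(torch.cos(x ** 2) * 2 * x)  # chain rule by hand: the two values agree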

Summary

  • Differentiation and integration are two branches of calculus, the former can be applied to optimization problems in deep learning.

  • The derivative can be interpreted as the instantaneous rate of change of a function with respect to its variable, which is also the slope of the tangent to the curve of the function.

  • A gradient is a vector whose components are the partial derivatives of a multivariate function with respect to all of its variables.

  • The chain rule can be used to differentiate composite functions.

Exercises

  1. Plot the function y = f(x) = x³ - 1/x and its tangent at x=1. (f(1) = 0 and f'(x) = 3x² + 1/x², so f'(1) = 4 and the tangent line is y = 4x - 4.)
def f(x):
    return x ** 3 - 1 / x

x = np.arange(0.1, 2, 0.1)  # start at 0.1, since 1/x is undefined at x = 0
plot(x, [f(x), 4 * x - 4], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])
  1. Find the gradient of the function f(x) = 3x₁² + 5eˣ₂.

∇f = [6x₁, 5eˣ₂]ᵀ

  1. What is the gradient of the function f(x) = ‖x‖²?

2x

  1. Write out the chain rule for the function u = f(x, y, z), where x = x(a, b), y = y(a, b), z = z(a, b):
∂u/∂a = (∂u/∂x)(∂x/∂a) + (∂u/∂y)(∂y/∂a) + (∂u/∂z)(∂z/∂a), and similarly for ∂u/∂b.

 2.5. Automatic differentiation

Deep learning frameworks speed up differentiation by computing derivatives automatically, a technique known as automatic differentiation. In practice, based on the designed model, the system builds a computational graph to track which data are combined through which operations to produce the output. Automatic differentiation lets the system subsequently backpropagate gradients. Here, backpropagating means traversing the computational graph and filling in the partial derivatives with respect to each parameter.

Example

First, we create the variable x and assign it an initial value. Before we can compute the gradient of y with respect to x, we need a place to store it. Importantly, we do not allocate new memory every time we take a derivative with respect to a parameter: since we often update the same parameters thousands of times, allocating fresh memory each time would quickly exhaust memory. Note that the gradient of a scalar function with respect to a vector x is itself a vector with the same shape as x.

import torch

x = torch.arange(4.0)
x.requires_grad_(True)  # equivalent to x = torch.arange(4.0, requires_grad=True)
x.grad, x  # the default value of x.grad is None
(None, tensor([0., 1., 2., 3.], requires_grad=True))

# x is a vector of length 4; the dot product of x with itself yields the scalar we assign to y.
# Next, we call the backpropagation function to automatically compute the gradient of y with
# respect to each component of x, and print those gradients.
y = 2 * torch.dot(x, x)
y
tensor(28., grad_fn=<MulBackward0>)

y.backward()
x.grad
tensor([ 0.,  4.,  8., 12.])

# The gradient of the function y = 2xᵀx with respect to x should be 4x.
# Let's quickly verify that this gradient was computed correctly.
x.grad == 4 * x
# tensor([True, True, True, True])

# Gradients accumulate by default: recomputing y and backpropagating again adds to x.grad
y = 2 * torch.dot(x, x)
y.backward()
x.grad
# tensor([ 0.,  8., 16., 24.])

# Now compute another function of x.
# Since PyTorch accumulates gradients, we need to clear the previous values first
x.grad.zero_()
y = x.sum()
y.backward()
x.grad
# tensor([1., 1., 1., 1.])

Backpropagation for nonscalar variables

When y is not a scalar, the most natural interpretation of the derivative of a vector y with respect to a vector x is a matrix. For higher-order, higher-dimensional y and x, the result of differentiation can be a higher-order tensor.

However, while such exotic objects do appear in advanced machine learning (including deep learning), when we call backward on a vector we are usually trying to compute the derivatives of the loss for each sample in a batch of training examples. Here our intent is not to compute the differentiation matrix but the sum of the partial derivatives computed individually for each sample.

# Calling backward on a non-scalar requires passing a gradient argument, which specifies the
# gradient of the differentiated function with respect to self.
# Here we only want the sum of the partial derivatives, so passing a gradient of ones is appropriate
x.grad.zero_()
y = x * x
# equivalent to y.backward(torch.ones(len(x)))
y.sum().backward()
x.grad
tensor([0., 2., 4., 6.])

Detaching computation

Sometimes we want to move some computations outside of the recorded computational graph. For example, suppose y is computed as a function of x, and z is then computed as a function of both y and x. Imagine we want to compute the gradient of z with respect to x but, for some reason, want to treat y as a constant, accounting only for the role x plays after y has been computed.

Here we can detach y to obtain a new variable u that has the same value as y but discards all information about how y was computed in the graph. In other words, gradients do not flow backwards through u to x. Thus the backpropagation below computes the partial derivative of z = u * x with respect to x, treating u as a constant, rather than the partial derivative of z = x * x * x with respect to x.

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x
z.sum().backward()
x,x.grad,u,x.grad == u
(tensor([0., 1., 2., 3.], requires_grad=True),
 tensor([0., 1., 4., 9.]),
 tensor([0., 1., 4., 9.]),
 tensor([True, True, True, True]))

Since the computation of y was recorded, we can subsequently call backpropagation on y to obtain the derivative of y = x * x with respect to x, namely 2 * x.

x.grad.zero_()
y.sum().backward()
x.grad,x,x.grad == 2 * x
(tensor([0., 2., 4., 6.]),
 tensor([0., 1., 2., 3.], requires_grad=True),
 tensor([True, True, True, True]))

Computing gradients through Python control flow

One benefit of automatic differentiation is that even when building a function's computational graph requires passing through Python control flow (e.g., conditionals, loops, or arbitrary function calls), we can still compute the gradient of the resulting variable. In the code below, both the number of iterations of the while loop and the outcome of the if statement depend on the value of the input a.

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()
a,a.grad ,d,a.grad == d / a
(tensor(0.1172, requires_grad=True),
 tensor(16384.),
 tensor(1919.7633, grad_fn=<MulBackward0>),
 tensor(True))

Summary

  • Deep learning frameworks can compute derivatives automatically: we first attach gradients to the variables with respect to which we want partial derivatives, then record the computation of the target value, execute its backpropagation function, and finally access the resulting gradients.

Exercises

  1. Why is computing the second derivative more expensive than the first derivative?

The second derivative is the derivative of the first derivative, so computing it requires building and differentiating through the graph of the first backward pass, roughly doubling the work.
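A minimal sketch of actually computing a second derivative in PyTorch via torch.autograd.grad with create_graph=True (the function y = x³ is an assumption for illustration; y' = 3x², y'' = 6x):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
dy_dx, = torch.autograd.grad(y, x, create_graph=True)  # first derivative: 3x^2 = 12
d2y_dx2, = torch.autograd.grad(dy_dx, x)               # second derivative: 6x = 12
print(dy_dx, d2y_dx2)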

  1. Immediately after running the backpropagation function, run it again and see what happens.

# Exercise
import torch
x = torch.arange(5.,requires_grad=True)
y = 2 * torch.dot(x**2,torch.ones_like(x))
y.backward()
print(x.grad,x)
y.backward()
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

This error typically occurs when using PyTorch's automatic differentiation (autograd). It means that during backpropagation an attempt was made to traverse the computational graph a second time, or to access intermediate values that have already been freed. This happens when parts of the computational graph are backpropagated through more than once.

The torch.dot() function takes two tensors as arguments; the two tensors must be one-dimensional and have the same shape.

  1. In the control-flow example, we compute the derivative of d with respect to a. What happens if we change the variable a to a random vector or matrix?

import torch

x = torch.randn(3, requires_grad=True)
y = torch.dot(x,x)
y.backward()
x.grad

import torch: import the PyTorch library.
x = torch.randn(3, requires_grad=True): create a random vector x of shape (3,) with requires_grad=True, so that we can differentiate with respect to x.
y = torch.dot(x, x): compute the sum of squares of x, i.e., y = x[0]^2 + x[1]^2 + x[2]^2.
y.backward(): since y is a scalar, backpropagation can be applied directly; the gradient with respect to x is stored in x.grad.
x.grad: by the rules of vector differentiation, dy/dx[i] = 2 * x[i], so x.grad is a tensor of the same shape as x in which every element is twice the corresponding element of x, i.e., x.grad = [2*x[0], 2*x[1], 2*x[2]].

In the original control-flow example, d = f(a) with a vector or matrix a is no longer a scalar, so d.backward() requires an explicit gradient argument (or a reduction first, e.g., d.sum().backward()); the gradients computed then differ accordingly, as in the workaround above.

  1. Redesign an example of finding the gradient of the control flow, run it and analyze the results.

A simple control-flow example is a conditionally weighted sum. Suppose we have two vectors x and y and we want to compute a weighted sum in which an element of x gets weight 1 when it is greater than the corresponding element of y, and the element of y gets weight 2 otherwise. This can be implemented with the following code:
import torch

x = torch.tensor([1, 3, 2], dtype=torch.float32, requires_grad=True)
y = torch.tensor([2, 2, 1], dtype=torch.float32, requires_grad=True)
Next, we define the variable s, initialize it to 0, and loop over the elements of x and y, accumulating according to the condition: if x[i] > y[i], add x[i] to s; otherwise add 2*y[i].

s = 0
for i in range(len(x)):
    if x[i] > y[i]:
        s = s + x[i]
    else:
        s = s + 2*y[i]

s.backward()

print(x.grad)
print(y.grad)
These values are the gradients of the weighted sum s with respect to the corresponding elements of x and y. For example, x.grad[1] is 1: if x[1] increases by a small amount, s increases by the same amount (and decreases likewise if x[1] decreases). Similarly, y.grad[0] is 2: if y[0] increases by a small amount, s increases by twice that amount. Such gradient information can be used to optimize a model, e.g., to minimize the weighted sum.
The gradients:
tensor([0., 1., 1.])
tensor([2., 0., 0.])
  1. Let f(x) = sin(x). Plot the graphs of f(x) and of its derivative df(x)/dx, where the latter is computed by automatic differentiation rather than by using f'(x) = cos(x).

import torch
import matplotlib.pyplot as plt
import numpy as np

# Define the function f(x) = sin(x)
def f(x):
    return torch.sin(x)

# Build the input tensor x and set requires_grad to True
x = torch.tensor(np.linspace(-3*np.pi, 3*np.pi, 100), requires_grad=True)

# Compute the value of f(x)
y = f(x)

# Compute the derivative of f(x) with respect to x
y.backward(torch.ones_like(x))

# Plot f(x) and its derivative
plt.plot(x.detach().numpy(), y.detach().numpy(), label="f(x)")
plt.plot(x.detach().numpy(), x.grad.detach().numpy(), label="f'(x)")
plt.legend()
plt.show()

Alternatively:
%matplotlib inline
import matplotlib.pylab as plt
from matplotlib.ticker import FuncFormatter, MultipleLocator
import numpy as np
import torch

f,ax=plt.subplots(1)

x = np.linspace(-3*np.pi, 3*np.pi, 100)
x1= torch.tensor(x, requires_grad=True)
y1= torch.sin(x1)
y1.sum().backward()

ax.plot(x,np.sin(x),label='sin(x)')
ax.plot(x,x1.grad,label="gradient of sin(x)")
ax.legend(loc='upper center', shadow=True)

ax.xaxis.set_major_formatter(FuncFormatter(
    lambda val, pos: '{:.0g}π'.format(val / np.pi) if val != 0 else '0'
))
ax.xaxis.set_major_locator(MultipleLocator(base=np.pi))

plt.show()

2.6 Probability

The law of large numbers tells us that this estimate gets closer and closer to the true underlying probability as the number of tosses increases.

Basic Probability Theory

In statistics, we call the process of drawing samples from a probability distribution sampling . Loosely speaking, a distribution can be thought of as a distribution of probabilities to events, which we will give a more formal definition later on. A distribution that assigns probabilities to some discrete choices is called a multinomial distribution.

To draw a sample, i.e., to roll a die, we simply pass in a vector of probabilities. The output is another vector of the same length: the value at index i is the number of times i occurred in the sample.

%matplotlib inline
import torch
from torch.distributions import multinomial
from d2l import torch as d2l

fair_probs = torch.ones([6]) / 6
torch.ones([6]) / 6 , multinomial.Multinomial(1, fair_probs).sample()
# multinomial.Multinomial creates a multinomial random variable. The first argument, 1, means we
# draw a single sample; the second, fair_probs, is a length-6 tensor giving the probability of each
# outcome. The returned length-6 tensor has a single 1 (at the outcome drawn) and 0s elsewhere.
(tensor([0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]),
 tensor([1., 0., 0., 0., 0., 0.]))


# To estimate the fairness of a die, we want to draw many samples from the same distribution.
# A Python for loop would be painfully slow, so we use the framework's function to draw multiple
# samples at once, obtaining an array of independent samples of any desired shape.
multinomial.Multinomial(100, fair_probs).sample()
# tensor([22., 16., 15., 18., 18., 11.])
# Now that we know how to sample a die, we can simulate 1000 rolls and count how many
# times each number comes up. Specifically, we compute the relative frequency as an
# estimate of the true probability.

# Store the results as 32-bit floats for division
counts = multinomial.Multinomial(1000, fair_probs).sample()
counts / 1000  # relative frequencies as estimates
# tensor([0.1760, 0.1650, 0.1690, 0.1640, 0.1560, 0.1700])
# Because the data were generated from a fair die, every outcome has true probability 1/6 ≈ 0.1667,
# so the estimates above look reasonable.
# We can also watch these estimates converge to the true probability over time.
# Let's run 500 groups of experiments, drawing 10 samples per group.

counts = multinomial.Multinomial(10, fair_probs).sample((500,))
cum_counts = counts.cumsum(dim=0)
estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)

d2l.set_figsize((6, 4.5))
for i in range(6):
    d2l.plt.plot(estimates[:, i].numpy(),
                 label=("P(die=" + str(i + 1) + ")"))
d2l.plt.axhline(y=0.167, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Groups of experiments')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();

Axioms of Probability Theory

When dealing with die rolls, we call the set S = {1, 2, 3, 4, 5, 6} the sample space or outcome space, where each element is an outcome. An event is a set of outcomes from a given sample space. For example, "seeing a 5" ({5}) and "seeing an odd number" ({1, 3, 5}) are both valid events for a die roll. Note that event A has occurred if the outcome of the random experiment is in A. That is, if we roll a 3, then since 3 ∈ {1, 3, 5}, we can say the event "seeing an odd number" has occurred.

Probability can be thought of as a function that maps sets to real values. In a given sample space S, the probability of an event A, denoted P(A), satisfies the following properties:

  • For any event A, its probability is never negative, that is, P(A)≥0;
  • The probability of the entire sample space is 1, that is, P(S)=1;
  • For any countable sequence A₁, A₂, … of mutually exclusive events (Aᵢ ∩ Aⱼ = ∅ for all i ≠ j), the probability that any of them happens equals the sum of their individual probabilities: P(⋃_{i=1}^∞ Aᵢ) = ∑_{i=1}^∞ P(Aᵢ).

These are the axioms of probability theory, proposed by Kolmogorov in 1933. With this axiom system we can avoid any philosophical debate about randomness and instead reason rigorously in the language of mathematics. For example, letting event A₁ be the entire sample space and Aᵢ = ∅ for all i > 1, we can prove that P(∅) = 0, i.e., the probability of an impossible event is 0.

Random Variables

In our random die-rolling experiment we introduced the notion of a random variable. A random variable can take almost any quantity as its value; in a random experiment it takes one value from a set of possibilities. Consider a random variable X whose values lie in the sample space S = {1, 2, 3, 4, 5, 6} of a die roll. We can denote the event "seeing a 5" as {X=5} or X=5, and its probability as P({X=5}) or P(X=5). By writing P(X=a) we distinguish between the random variable X and the values X can take (such as a).

However, this can lead to cumbersome notation. To simplify it, on the one hand we can write P(X) for the distribution of the random variable X: the distribution tells us the probability of X taking any particular value. On the other hand, we can simply write P(a) for the probability that a random variable takes the value a. Since an event in probability theory is a set of outcomes from the sample space, we can also specify a range of values for a random variable. For example, P(1≤X≤3) denotes the probability of the event {1≤X≤3}, that is, {X = 1, 2, or 3}. Equivalently, P(1≤X≤3) is the probability that X takes a value from {1, 2, 3}.

Note that there is a subtle distinction between discrete random variables (such as the faces of a die) and continuous ones (such as a person's weight or height). In real life it makes little sense to ask whether two people have exactly the same height: with precise enough measurements, no two people on the planet have exactly the same height. In such cases it makes more sense to ask whether someone's height falls into a given interval, say between 1.79m and 1.81m. Here we quantify the likelihood of seeing a value as a density. The probability of a height of exactly 1.80 meters is 0, but the density is not; over any interval between two different heights we have nonzero probability. In the rest of this section we consider probability in discrete spaces; for continuous random variables, see the random-variables section of the book's mathematics appendix.

Handling multiple random variables

Often we must consider several random variables at once. For example, we may need to model the relationship between diseases and symptoms: given a disease and a symptom, say "flu" and "cough", each is present or absent in a patient with some probability. We need to estimate these probabilities and the relationships among them, so that our inferences can lead to better medical care.

Here is a more complex example: an image contains millions of pixels and thus millions of random variables. In many cases an image comes with a label identifying the objects in it; we can treat the label as a random variable too. We can even treat all the metadata as random variables: position, time, aperture, focal length, ISO, focus distance, camera type. All of these are random variables that occur jointly. When we deal with multiple random variables, several quantities are of interest.

2.6.2.1. Joint probability

The first is called the joint probability P(A=a, B=b). Given any values a and b, the joint probability answers: what is the probability that A=a and B=b hold simultaneously? Note that for any values of a and b, P(A=a, B=b) ≤ P(A=a). This must be so, because for A=a and B=b both to happen, A=a has to happen (and B=b as well). Thus A=a and B=b occurring together is no more likely than A=a or B=b occurring alone.

2.6.2.2. Conditional Probability

The inequality of joint probability brings us an interesting ratio: 0≤P(A=a, B=b)/P(A=a)≤1. We call this ratio conditional probability, and denote it by P(B=b|A=a): it is the probability of B=b, provided that A=a has occurred.

2.6.2.3. Bayes' Theorem

Using the definition of conditional probability, we can derive one of the most useful equations in statistics: Bayes' theorem. By the multiplication rule, P(A, B) = P(B|A)P(A). By symmetry, P(A, B) = P(A|B)P(B). Assuming P(B) > 0 and solving for one of the conditional probabilities, we get

P(A|B) = P(B|A)P(A)/P(B).

Note that here we use the compact notation in which P(A, B) is a joint distribution and P(A∣B) is a conditional distribution; these distributions can be evaluated at particular values A=a, B=b.
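A quick numeric illustration (the probabilities below are made up for this example): suppose P(B|A) = 0.9, P(A) = 0.01, and P(B) = 0.05. Then Bayes' theorem gives P(A|B) = 0.9 × 0.01 / 0.05 = 0.18.

p_b_given_a, p_a, p_b = 0.9, 0.01, 0.05  # hypothetical values
p_a_given_b = p_b_given_a * p_a / p_b    # Bayes' theorem
print(p_a_given_b)                       # 0.18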

2.6.2.4. Marginalization

The law of total probability can be written as: P(A) = ∑ₙ P(A | Bₙ) P(Bₙ).

To be able to sum event probabilities, we need the sum rule: the probability of B is obtained by aggregating the joint probabilities over all possible choices of A:

P(B) = ∑_A P(A, B).

This is also known as marginalization. The probability or distribution of marginalized outcomes is called marginal probability or marginal distribution.

Expectation and variance

To summarize the key characteristics of probability distributions, we need some measures. The expectation (or mean) of a random variable X is written

E[X] = ∑_x x P(X = x).

When the input of the function f(x) is a random variable drawn from the distribution P, the expected value of f(x) is

E_{x∼P}[f(x)] = ∑_x f(x) P(x).

In many cases we wish to measure the deviation of a random variable X from its expectation. This can be quantified by the variance

Var[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2.

The square root of the variance is called the standard deviation. The variance of a function of a random variable measures how far the function's values deviate from the function's expectation as different values x are sampled from the random variable's distribution:

Var[f(x)] = E[(f(x) - E[f(x)])^2].
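As a small worked example of these definitions (a fair six-sided die: E[X] = 3.5 and Var[X] = 35/12 ≈ 2.9167):

import torch

probs = torch.ones(6) / 6
x = torch.arange(1., 7.)               # outcomes 1..6
mean = (probs * x).sum()               # E[X] = 3.5
var = (probs * (x - mean) ** 2).sum()  # Var[X] = E[(X - E[X])^2] ≈ 2.9167
print(mean, var)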

Summary

  • We can sample from a probability distribution.

  • We can analyze multiple random variables using joint distributions, conditional distributions, Bayes theorem, marginalization, and independence assumptions.

  • Expectation and variance provide useful measures for summarizing key characteristics of probability distributions.

Exercises

  1. Carry out m=500 groups of experiments, each group draws n=10 samples. Change m and n, observe and analyze the experimental results.
import torch
from torch.distributions import multinomial

fair_probs = torch.ones([10]) / 10                       # a fair 10-sided die
counts = multinomial.Multinomial(500, fair_probs).sample()
counts                                                   # raw counts over 500 draws
tensor([44., 54., 49., 40., 52., 52., 53., 55., 58., 43.])
counts / 500                                             # relative frequencies (from a separate run)
tensor([0.0940, 0.1160, 0.0980, 0.1140, 0.0760, 0.1100, 0.0840, 0.1080, 0.1040,
        0.0960])
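
To actually vary m and n as the exercise asks, one possible sketch (the variable names are ours) draws m groups of n samples each and pools the counts:

import torch
from torch.distributions import multinomial

m, n = 500, 10
fair_probs = torch.ones([10]) / 10
counts = multinomial.Multinomial(n, fair_probs).sample((m,))   # shape (m, 10)

estimates = counts.sum(dim=0) / (m * n)   # pooled relative frequencies
print(estimates)                          # each entry should be close to 0.1

Increasing m or n tightens the estimates around the true probability 0.1, in line with the law of large numbers.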

  2. Given two events with probabilities P(A) and P(B), compute the upper and lower bounds of P(A∪B) and P(A∩B). (Hint: use a Venn diagram to illustrate these situations.)

Starting from

P(A∪B) = P(A) + P(B) - P(A∩B)

together with 0 ≤ P(A∩B), P(A∩B) ≤ P(A), P(A∩B) ≤ P(B), and P(A∪B) ≤ 1, we obtain the bounds

max(0, P(A) + P(B) - 1) ≤ P(A∩B) ≤ min(P(A), P(B))

max(P(A), P(B)) ≤ P(A∪B) ≤ min(1, P(A) + P(B))
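
For instance, with purely illustrative values P(A) = 0.6 and P(B) = 0.7: P(A∩B) lies between 0.6 + 0.7 - 1 = 0.3 and min(0.6, 0.7) = 0.6, and correspondingly P(A∪B) lies between max(0.6, 0.7) = 0.7 and 1.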

  3. Suppose we have a sequence of random variables, say A, B, and C, where B depends only on A, and C depends only on B. Can the joint probability P(A, B, C) be simplified? (Hint: this is a Markov chain.)

P(A, B, C) = P(A)P(B|A)P(C|B)

This is consistent with the conditional independence property of Markov chains.
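
As a quick numeric check (with hypothetical binary variables), the three factors multiply into a valid joint distribution:

import torch

P_A = torch.tensor([0.3, 0.7])            # P(A)
P_B_given_A = torch.tensor([[0.9, 0.1],   # rows: value of A; columns: value of B
                            [0.2, 0.8]])
P_C_given_B = torch.tensor([[0.6, 0.4],   # rows: value of B; columns: value of C
                            [0.5, 0.5]])

# joint[a, b, c] = P(A=a) P(B=b|A=a) P(C=c|B=b)
joint = P_A[:, None, None] * P_B_given_A[:, :, None] * P_C_given_B[None, :, :]
print(joint.sum())   # tensor(1.), so this factorization defines a valid distribution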

  4. In Section 2.6.2.6, the first test is more accurate. Why not simply run the first test twice, instead of running both the first and second tests?

Some background first, to help understand the answer: a positive test result only means the test indicator is positive; it does not mean the patient is necessarily ill. The indicator can be abnormal for reasons other than the disease in question (that is, P(D = 1, H = 0) ≠ 0). Furthermore, if detection method 1 returns a positive result the first time, the second run is very likely to return a positive result as well. In other words, two successive results from the same detection method are, with high probability, strongly correlated.

When calculating the joint probability of D1 and D2, P(D1 = 1, D2 = 1 | H = 1) or P(D1 = 1, D2 = 1 | H = 0), note that:

  • If the two test results are independent, then
P(D1 = 1, D2 = 1 | H = 1) = P(D1 = 1 | H = 1) * P(D2 = 1 | H = 1)
  • If the two test results are not independent, then
P(D1 = 1, D2 = 1 | H = 1) ≠ P(D1 = 1 | H = 1) * P(D2 = 1 | H = 1)

For example, suppose Xiao Ming and Xiao Hei are classmates in the same class (Xiao Hei has told Xiao Ming that he will copy Xiao Ming's answers in the exam), while Xiao Hong is a student in another class.

  • The probability that Xiao Ming scores above 80 in the exam is P(Xiao Ming > 80) = 0.8;
  • The probability that Xiao Hei scores above 80 in the exam is P(Xiao Hei > 80) = 0.1;
  • The probability that Xiao Hong scores above 80 in the exam is P(Xiao Hong > 80) = 0.1.

In a given exam, is the probability that Xiao Ming and Xiao Hei both score above 80 the same as the probability that Xiao Ming and Xiao Hong both score above 80?

Clearly not. The former probability is smaller than 0.8 but greater than 0.8 * 0.1 = 0.08, because the two events are positively correlated (though Xiao Hei may not manage to copy everything); the latter probability, the two events being independent, is exactly 0.8 * 0.1 = 0.08.

It is then conceivable that, because the two runs of the test are not independent, the joint probability P(D1 = 1, D2 = 1) in the denominator can be large (the results are strongly correlated, so D2 = 1 occurs almost whenever D1 = 1 does), and the posterior probability of the patient being ill comes out barely higher than after a single test. Two different, independent tests carry far more information.
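
A numeric sketch of this argument, using hypothetical test characteristics (not necessarily those of Section 2.6.2.6): with two independent positive tests the posterior rises sharply, while a test perfectly correlated with itself adds nothing over one run.

P_H = 0.0015                  # hypothetical prior probability of being ill
sens1, fp1 = 0.99, 0.01       # test 1: P(D1=1|H=1), P(D1=1|H=0)
sens2, fp2 = 0.98, 0.03       # test 2: P(D2=1|H=1), P(D2=1|H=0)

# Two independent tests, both positive:
num = sens1 * sens2 * P_H
den = num + fp1 * fp2 * (1 - P_H)
print(num / den)              # ~0.83

# The same test twice, assuming perfect correlation:
# P(D1=1, D1=1 | H) = P(D1=1 | H), so the posterior equals that of a single test.
num = sens1 * P_H
den = num + fp1 * (1 - P_H)
print(num / den)              # ~0.13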

Check out the documentation

 Find all functions and classes in a module

To find out which functions and classes can be called in a module, we can invoke the dir function. For example, we can query all the properties of the module for generating random numbers:

import torch

print(dir(torch.distributions))
['AbsTransform', 'AffineTransform', 'Bernoulli', 'Beta', 'Binomial', 'CatTransform', 'Categorical', 'Cauchy', 'Chi2', 'ComposeTransform', 'ContinuousBernoulli', 'CorrCholeskyTransform', 'CumulativeDistributionTransform', 'Dirichlet', 'Distribution', 'ExpTransform', 'Exponential', 'ExponentialFamily', 'FisherSnedecor', 'Gamma', 'Geometric', 'Gumbel', 'HalfCauchy', 'HalfNormal', 'Independent', 'IndependentTransform', 'Kumaraswamy', 'LKJCholesky', 'Laplace', 'LogNormal', 'LogisticNormal', 'LowRankMultivariateNormal', 'LowerCholeskyTransform', 'MixtureSameFamily', 'Multinomial', 'MultivariateNormal', 'NegativeBinomial', 'Normal', 'OneHotCategorical', 'OneHotCategoricalStraightThrough', 'Pareto', 'Poisson', 'PowerTransform', 'RelaxedBernoulli', 'RelaxedOneHotCategorical', 'ReshapeTransform', 'SigmoidTransform', 'SoftmaxTransform', 'SoftplusTransform', 'StackTransform', 'StickBreakingTransform', 'StudentT', 'TanhTransform', 'Transform', 'TransformedDistribution', 'Uniform', 'VonMises', 'Weibull', 'Wishart', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'bernoulli', 'beta', 'biject_to', 'binomial', 'categorical', 'cauchy', 'chi2', 'constraint_registry', 'constraints', 'continuous_bernoulli', 'dirichlet', 'distribution', 'exp_family', 'exponential', 'fishersnedecor', 'gamma', 'geometric', 'gumbel', 'half_cauchy', 'half_normal', 'identity_transform', 'independent', 'kl', 'kl_divergence', 'kumaraswamy', 'laplace', 'lkj_cholesky', 'log_normal', 'logistic_normal', 'lowrank_multivariate_normal', 'mixture_same_family', 'multinomial', 'multivariate_normal', 'negative_binomial', 'normal', 'one_hot_categorical', 'pareto', 'poisson', 'register_kl', 'relaxed_bernoulli', 'relaxed_categorical', 'studentT', 'transform_to', 'transformed_distribution', 'transforms', 'uniform', 'utils', 'von_mises', 'weibull', 'wishart']

Functions that start and end with "__" (double underscore) are special objects in Python, and functions that start with a single "_" are usually internal functions; both can generally be ignored. Based on the remaining function or attribute names, we might guess that this module offers various methods for generating random numbers, including sampling from a uniform distribution (uniform), a normal distribution (normal), and a multinomial distribution (multinomial).

Find usages of specific functions and classes

For more specific instructions on how to use a given function or class, we can call the help function. As an example, let's look at the usage instructions for the torch.ones function.

help(torch.ones)
Help on built-in function ones in module torch:

ones(...)
    ones(*size, *, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False) -> Tensor
    
    Returns a tensor filled with the scalar value `1`, with the shape defined
    by the variable argument :attr:`size`.
    
    Args:
        size (int...): a sequence of integers defining the shape of the output tensor.
            Can be a variable number of arguments or a collection like a list or tuple.
    
    Keyword arguments:
        out (Tensor, optional): the output tensor.
        dtype (:class:`torch.dtype`, optional): the desired data type of returned tensor.
            Default: if ``None``, uses a global default (see :func:`torch.set_default_tensor_type`).
        layout (:class:`torch.layout`, optional): the desired layout of returned Tensor.
            Default: ``torch.strided``.
        device (:class:`torch.device`, optional): the desired device of returned tensor.
            Default: if ``None``, uses the current device for the default tensor type
            (see :func:`torch.set_default_tensor_type`). :attr:`device` will be the CPU
            for CPU tensor types and the current CUDA device for CUDA tensor types.
        requires_grad (bool, optional): If autograd should record operations on the
            returned tensor. Default: ``False``.
...
    
        >>> torch.ones(5)
        tensor([ 1.,  1.,  1.,  1.,  1.])


summary

  • The official documentation provides extensive descriptions and examples outside of this book.

  • The usage documentation of an API can be viewed by calling the dir and help functions, or with ? and ?? in Jupyter notebooks.
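
In Jupyter/IPython specifically, appending ? to a name displays its documentation in a separate pane, and ?? additionally shows the source when available:

torch.ones?    # IPython/Jupyter syntax: display the docstring
torch.ones??   # additionally display the source, when available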

Common problems:

ModuleNotFoundError: No module named 'd2l'

pip install -U d2l --user

No module named '_lzma'

Step 1: Install lzma for the Ubuntu environment, following the official installation tutorial:
(1) sudo apt-get install liblzma-dev
(2) pip install backports.lzma
If you are on Python 3.6, replace the second command with:
pip3 install backports.lzma (may require sudo)

Step 2: Modify the existing lzma.py file.
Change line 27 of /usr/local/lib/python3.6/lzma.py as follows:

try:
    # Use the standard library's _lzma extension module when it is available
    from _lzma import *
    from _lzma import _encode_filter_properties, _decode_filter_properties
except ImportError:
    # Otherwise fall back to the backports.lzma package installed above
    from backports.lzma import *
    from backports.lzma import _encode_filter_properties, _decode_filter_properties
