Hands-on Deep Learning (PyTorch version) (1): Introduction & Preliminaries

References

0. Environment installation

1 Introduction

  • Machine learning (ML) is a powerful class of techniques that learn from experience. As a machine learning algorithm accumulates more experience, usually in the form of observational data or interaction with an environment, its performance gradually improves

1.1 Machine Learning in Everyday Life

  • If you need to write a program to respond to a "wake word", such as "Alexa"

    • Collect a dataset containing a large number of audio samples and label the samples with and without wake words
    • Using machine learning, instead of designing a system that "explicitly" recognizes wake words, it is enough to define a flexible program whose behavior is determined by many parameters, and then use the dataset to determine the "optimal set of parameters", i.e. the parameters that achieve the best performance on the task according to some performance measure
    • What are parameters?
      • Any program whose parameters have been adjusted is called a model. The collection of all the distinct programs (input-output mappings) obtained by manipulating the parameters is called a "model family". The meta-program that uses a dataset to choose the parameters is called a learning algorithm
    • The problem must be precisely defined, the nature of the inputs and outputs determined, and an appropriate model family selected. In this example, the model receives a piece of audio as input and produces a choice between yes or no as output
      • If you want to handle completely different input or output, for example: mapping from image to subtitle, or from English to Chinese, you may need a completely different model family
  • In machine learning, learning is the process by which a model is trained. Through this process, the correct set of parameters can be discovered so that the model exhibits the desired behavior. In other words, the model is trained with data. The training process usually includes the following steps (a minimal PyTorch sketch follows the list)

    • (1) Start with a model with random initialization parameters, which has basically no "intelligence"
    • (2) Get some data samples (e.g. audio clips and corresponding yes or no labels)
    • (3) Adjust the parameters to make the model perform better in these samples
    • (4) Repeat steps (2) and (3) until the model performs satisfactorily on the task
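  • A minimal PyTorch sketch of this generic loop (the model, data, and hyperparameters below are placeholders, not the wake-word detector from the text):
    import torch
    from torch import nn
    
    model = nn.Linear(10, 1)                              # (1) a model with randomly initialized parameters
    loss_fn = nn.BCEWithLogitsLoss()                      # objective for a yes-or-no decision
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    
    for epoch in range(5):                                # (4) repeat until performance is satisfactory
        features = torch.randn(32, 10)                    # (2) grab a batch of (synthetic) data samples
        labels = torch.randint(0, 2, (32, 1)).float()     #     with yes-or-no labels
        loss = loss_fn(model(features), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                  # (3) adjust the parameters to perform better on these samples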

1.2 Key Components in Machine Learning

1.2.1 Data (data)

  • Each dataset is composed of samples, and most of the time they follow an independent and identical distribution. A sample is sometimes called a data point or a data instance, and usually each sample consists of a set of attributes called features (or covariates), from which the machine learning model makes its predictions. When a special attribute is to be predicted, it is called a label (or target)

    • When working with image data, each individual photo is a sample, and its features are represented by an ordered list of numerical values for each pixel
  • When the number of features of each sample is the same, its feature vector has a fixed length, which is called the dimensionality of the data. However, not all data can be represented by "fixed-length" vectors, and a major advantage of deep learning over traditional machine learning methods is that it can handle data of varying lengths

  • More data can be used to train more powerful models. But having massive data is not enough; the right data is also needed. If the data is full of errors, or if the chosen features are not predictive of the target, the model is likely to fail

    • For example, when training a skin-cancer recognition model on a medical dataset that has never "seen" people with dark skin, the model will be at a loss when it encounters them
    • When the data is not sufficiently representative, or even contains some social biases, the model is likely to be biased

1.2.2 Models

  • Most machine learning involves transforming the data in some way
    • For example, a system that takes in photos and predicts whether the faces in them are smiling
    • Another example: a system that ingests a set of sensor readings and predicts whether the readings are normal or abnormal

    The main difference between deep learning and classical methods is that the former focuses on powerful models built from intricately intertwined neural networks, involving layer-by-layer data transformations, hence the name deep learning.

1.2.3 Objective function

  • In machine learning, a measure of how good or bad the model is needs to be defined. This measure is "optimizable" in most cases and is called the objective function. It is common to define an objective function and wish to optimize it to the lowest possible point. Because lower is better, these functions are sometimes called loss functions (or cost functions); a short numeric example follows the list below

    • When the task is trying to predict a value , the most common loss function is the squared error (squared error) , which is the square of the difference between the predicted value and the actual value
    • When trying to solve classification problems, the most common objective function is to minimize the error rate , that is, the proportion of samples whose predictions do not match the actual situation
    • Some objectives, such as error rate, are difficult to optimize directly due to non-differentiability or other complexities, in which case alternative objectives are often optimized
  • Usually, the loss function is defined in terms of model parameters and depends on the dataset. On a dataset, the optimal values ​​of the model parameters can be learned by minimizing the total loss

    • The data set consists of some samples collected for training, called training data set (training dataset, or called training set (training set))
    • However, a model that performs well on the training data does not necessarily have the same performance on the "new data set", where the "new data set" is usually called the test data set (test dataset, or test set (test set))
  • The available datasets can usually be divided into two parts: the training dataset is used to fit the model parameters, and the test dataset is used to evaluate the fitted model

    • "A model's performance on the training dataset" can be thought of as "a student's score on a mock exam", this score is used as a reference for some real final exam, even if the grades are encouraging, the final exam is not guaranteed success
    • When a model performs well on the training set but fails to generalize to the test set, the model is said to be overfitting . Just like in real life, even though you do well on the mock exams, the real exams don't always hit the mark
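  • A short numeric example (the values are made up, not from the text) of the two common objectives mentioned above, squared error and error rate:
    import torch
    
    y_hat = torch.tensor([2.5, 0.0, 2.1])                 # predicted values
    y = torch.tensor([3.0, -0.5, 2.0])                    # actual values
    squared_error = ((y_hat - y) ** 2).mean()             # mean squared error
    
    pred_classes = torch.tensor([0, 2, 1, 1])             # predicted classes
    true_classes = torch.tensor([0, 1, 1, 1])             # actual classes
    error_rate = (pred_classes != true_classes).float().mean()
    print(squared_error, error_rate)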

1.2.4 Optimization algorithm (optimization algorithm)

  • Once there are some data sources and their representations, a model, and an appropriate loss function, an algorithm is needed that can search for the best parameters to minimize the loss function. In deep learning, most popular optimization algorithms are based on one basic method: gradient descent (a toy sketch follows this list)
    • At each step, gradient descent examines each parameter to see in which direction the training set loss would shift if only small changes were made to that parameter
    • Then, it optimizes the parameters in the direction that can reduce the loss
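  • A toy sketch of gradient descent (the function f(w) = (w - 3)^2 and the learning rate are illustrative, not from the text):
    import torch
    
    w = torch.tensor(0.0, requires_grad=True)
    lr = 0.1
    for step in range(50):
        loss = (w - 3) ** 2
        loss.backward()            # how the loss would shift for a small change in w
        with torch.no_grad():
            w -= lr * w.grad       # move in the direction that reduces the loss
            w.grad.zero_()
    print(w)                       # approaches 3.0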

1.3 Various machine learning problems

1.3.1 Supervised learning

  • Supervised learning is good at predicting labels given input features. Each "feature-label" pair is called an example. Sometimes, even when the label is unknown, a sample may still refer to the input features. The goal is to produce a model that can map any input features to a label (i.e., make a prediction)

    • Example: Suppose you need to predict whether a patient is going to have a heart attack or not, then the observations "heart attack" or "heart attack not happening" would be the labels of the samples. Input features may be vital signs such as heart rate, diastolic and systolic blood pressure, etc.
  • Supervised learning works because, when training the parameters, the model is given a dataset in which each example has a true label. Most successful applications of machine learning in industry use supervised learning, because many important tasks can be described as estimating the probability of something unknown given a specific set of available data. For example:

    • Given an English sentence, predict the correct French translation
    • Predict the price of a stock for the next month based on this month's financial report data
  • The learning process of supervised learning can generally be divided into three steps:

    • 1. Randomly select a subset from a known large number of data samples, and obtain true labels for each sample. Sometimes, these samples are already labeled (e.g., will the patient recover within the next year?); sometimes, these samples may need to be manually labeled (e.g., image classification). Together these inputs and corresponding labels form the training dataset
    • 2. Choose a supervised learning algorithm that takes a training dataset as input and outputs a "learned model"
    • 3. Feed sample features that have never been seen before into this "learned model", and use the model's output as the prediction of the corresponding label


1.3.1.1 Regression
  • Regression is one of the simplest supervised learning tasks. Many problems in everyday life can be framed as regression problems. A good rule of thumb: any question about "how much" or "how many" is likely to be a regression problem. For example:
    • how many hours does this operation take
    • How much rain will this town expect in the next 6 hours
1.3.1.2 Classification
  • Design an application that can automatically understand the text seen in an image and map handwritten characters onto the corresponding known characters. This kind of "which one" question is called a classification problem. A classification problem asks the model to predict which category (class) a sample belongs to

    • The simplest classification problem has only two classes; this is called binary (binomial) classification. When there are more than two classes, the problem is called multiclass classification
    • Regression trains a regression function to output a numerical value, whereas classification trains a classifier to output a predicted class. Unlike in regression, the common loss function for classification problems is called cross-entropy (see the sketch after this list)
  • Understanding the model in the language of probability: given sample features, the model assigns a probability to each possible class. For example, in a cat-dog classification task, the classifier might output a probability of 0.9 that an image is a cat: the classifier is 90% sure that the image depicts a cat. The magnitude of the predicted class probability conveys a notion of the model's uncertainty

  • Classification can become far more complicated than binary or multiclass classification. For example, some variants of classification tasks deal with hierarchies, which assume that relationships exist among the many classes. As a result, not all errors are equal: people would rather misclassify into a related class than into a distant one. This is usually called hierarchical classification
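  • A minimal sketch (the logits and label are made up, not from the text) of class probabilities and the cross-entropy loss for a 3-class problem:
    import torch
    from torch import nn
    
    logits = torch.tensor([[2.0, 0.5, -1.0]])   # unnormalized scores for one sample
    label = torch.tensor([0])                   # index of the true class
    probs = torch.softmax(logits, dim=1)        # per-class probabilities, e.g. "90% sure it is a cat"
    loss = nn.CrossEntropyLoss()(logits, label)
    print(probs, loss)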

1.3.1.3 Tagging problems
  • The problem of learning to predict classes that are not mutually exclusive is called multi-label classification. For example, consider the tags people attach to posts on a technical blog: "machine learning", "technology", "gadgets", "programming languages", "Linux", "cloud computing", "AWS". A typical article might carry 5 to 10 tags, because these concepts are interrelated: a post about "cloud computing" is likely to mention "AWS", and a post about "machine learning" may also touch on "programming languages"

1.3.2 Unsupervised learning

  • All of the examples above involve supervised learning, i.e., the model is fed a huge dataset in which every sample contains features and a corresponding label value. If the work has no very specific goal, learning has to happen "spontaneously". For example, a boss might hand over a large pile of data and ask for some data science to be done with it, with no requirement on the result. Machine learning problems in which the data contains no "target" are usually called unsupervised learning
    • Clustering
      • Can data be grouped without labels? For example, given a collection of photos, can they be divided into landscape photos and photos of dogs, babies, cats, and mountain peaks? Likewise, given the web-browsing records of a group of users, can users with similar behavior be clustered?
    • Principal component analysis:
      • Can a small number of parameters be found that accurately capture the linearly correlated properties of the data? For example, the trajectory of a ball can be described by its velocity, diameter, and mass. As another example, tailors have developed a small set of parameters that describe the shape of the human body fairly accurately for the purpose of fitting clothes
    • Causality and probabilistic graphical models
      • Can the root causes of a large body of observed data be described? For example, given demographic data on house prices, pollution, crime, geography, education, and wages, can the relationships among them be discovered simply from empirical data?

1.3.3 Reinforcement learning

  • Whether in supervised or unsupervised learning, a large amount of data is acquired up front, the model is then launched, and it no longer interacts with the environment. Here all learning takes place after the algorithm has been disconnected from the environment; this is called offline learning

  • Reinforcement learning does not require an input dataset; it interacts with an environment and takes actions. Applications include robotics, dialogue systems, and even AI for video games. Deep reinforcement learning applies deep learning to reinforcement learning problems

  • In a reinforcement learning problem, an agent interacts with an environment over a series of time steps. At each particular point in time, the agent receives an observation from the environment and must choose an action, which is then transmitted back to the environment through some mechanism (sometimes called an actuator); finally, the agent receives a reward from the environment. A new cycle then begins: the agent receives the next observation, chooses the next action, and so on

The goal of reinforcement learning is to produce a good policy. The "actions" chosen by a reinforcement learning agent are governed by the policy, i.e., a function that maps observations of the environment to actions


1.4 The origins of deep learning

  • Deep learning provides powerful tools for solving a wide variety of machine learning problems. Although many deep learning methods have achieved major breakthroughs only recently, the core ideas of programming with data and neural networks have been studied for centuries

  • Neural networks get their name from biological inspiration. Researchers have long tried to assemble computational circuits that resemble networks of interacting neurons. At their core are a few key principles that can be found in most networks today:

    • Alternating linear and nonlinear processing units, often called layers
    • Using the chain rule (also known as backpropagation) to adjust all of the parameters of the network at once

1.5 The development of deep learning

  • New methods for capacity control, such as dropout (Srivastava et al., 2014), which help mitigate the risk of overfitting
  • Attention mechanisms: increasing the memory and complexity of a system without increasing the number of learnable parameters
  • Multi-stage designs: for example, memory networks (Sukhbaatar et al., 2015) and the neural programmer-interpreter (Reed and De Freitas, 2015)
  • Generative adversarial networks (Goodfellow et al., 2014). Their key innovation is replacing the sampler with an arbitrary algorithm that has differentiable parameters. The generated data is then adjusted so that the discriminator (in effect, a two-sample test) cannot distinguish fake data from real data
  • Building parallel and distributed training algorithms: in many cases a single GPU is insufficient to handle the large amounts of data available for training
  • Deep learning frameworks: these have played a crucial role in disseminating ideas
    • First-generation frameworks: Caffe, Torch, and Theano
    • Second-generation frameworks: TensorFlow, CNTK, Caffe2, and Apache MXNet
    • Third-generation frameworks: PyTorch, MXNet's Gluon API, and JAX

1.6 Characteristics of deep learning

  • Machine learning is a branch of (an approach within) artificial intelligence, and deep learning is a subset of machine learning

  • Machine learning can use data to learn transformations between inputs and outputs, such as converting audio into text in speech recognition. Doing so usually requires representing the data in a form suited to the algorithm, so that this representation can be transformed into the output

    • Deep learning is "deep" in that the model learns many "layers" of transformations, each layer providing one level of representation
    • For example, layers close to the input can represent low-level details of the data, while layers close to the classification output can represent more abstract concepts used for discrimination
    • Since representation learning aims to find the representation itself, deep learning can be described as "multi-level representation learning"
  • The most notable commonality of deep learning methods is end-to-end training: rather than assembling a system from individually tuned components, the system is built as a whole and its parts are tuned jointly. Hence a key advantage of deep learning: it replaces not only the shallow models at the end of traditional learning pipelines, but also the labor-intensive feature engineering process

2. Preliminaries

2.1 Data manipulation

  • An n-dimensional array, also called a tensor. Whichever deep learning framework is used, its tensor class (ndarray in MXNet, Tensor in PyTorch and TensorFlow) is similar to NumPy's ndarray
    • Deep learning frameworks offer more functionality than NumPy's ndarray: first, GPUs are well supported for accelerating computation, whereas NumPy only supports CPU computation; second, the tensor class supports automatic differentiation. These features make the tensor class better suited to deep learning (unless stated otherwise, the tensors mentioned in this article all refer to instances of the tensor class)

2.1.1 Getting started

  • A tensor represents an array of values, possibly with multiple dimensions. A tensor with one axis corresponds to a mathematical vector, a tensor with two axes corresponds to a mathematical matrix, and tensors with more than two axes have no special mathematical name
    import torch
    
    x = torch.arange(12) # use arange to create a row vector containing the first 12 integers starting from 0
    print(x)
    print(x.shape) # access the shape of the tensor (its length along each axis) via the shape attribute
    
    # To know the total number of elements in the tensor, i.e. the product of all the shape entries, check its size
    # Since a vector is being handled here, shape and size coincide
    print(x.numel())
    
    # To change the shape of a tensor without changing the number or values of its elements, call the reshape function
    # Although the shape of the tensor changes, its element values do not; reshaping does not change the tensor's size
    X = x.reshape(3, 4) # reshape x from a row vector of shape (12,) into a matrix of shape (3, 4)
    # -1 can be used to have a dimension inferred automatically,
    # e.g. x.reshape(-1, 4) or x.reshape(3, -1) instead of x.reshape(3, 4)
    print(X)
    
    # Create constant tensors of all zeros or all ones
    y = torch.zeros((2, 3, 4))
    y = torch.ones((2, 3, 4))
    print(y)
    
    # Each element is sampled at random from a standard Gaussian (normal) distribution with mean 0 and standard deviation 1
    z = torch.randn(3, 4)
    print(z)
    
    # Provide a Python list (or nested list) of values to assign a definite value to each element of the tensor
    # The outermost list corresponds to axis 0, the inner lists to axis 1
    w = torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
    print(w)
    

2.1.2 Operators

  • The common standard arithmetic operators (+, -, *, /, and **) can all be upgraded to element-wise operations on arbitrary tensors of the same shape
  • Multiple tensors can also be concatenated end-to-end to form a larger tensor: just provide a list of tensors and specify along which axis to concatenate
    import torch
    
    x = torch.tensor([1.0, 2, 4, 8])
    y = torch.tensor([2, 2, 2, 2, ])
    print(x + y, x - y, x * y, x / y, x ** y) # the ** operator performs exponentiation
    print(torch.exp(x))
    
    x = torch.arange(12, dtype=torch.float32).reshape(3, 4) # create a tensor x of shape (3, 4) containing the consecutive numbers 0 to 11, with dtype torch.float32
    y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
    print(torch.cat((x, y), dim=0)) # concatenate x and y along dimension 0, i.e. stack y below x; the result has shape (6, 4)
    print(torch.cat((x, y), dim=1)) # concatenate x and y along dimension 1, i.e. place y to the right of x; the result has shape (3, 8)
    # Build a binary tensor with a logical operator:
    # if x and y are equal at a position, the corresponding entry of the new tensor is 1, otherwise 0
    print(x == y)
    # Summing all the elements of a tensor yields a single-element tensor
    print(x.sum())
    

2.1.3 Broadcast mechanism

  • Even if the shapes are different, you can still perform element-wise operations by calling the broadcasting mechanism, which works like this
    • Extends one or both arrays by duplicating elements appropriately so that after the transformation, both tensors have the same shape
    • Perform element-wise operations on the resulting array
import torch

a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
print(a)
print(b)

# broadcast the two matrices into a larger 3×2 matrix: matrix a replicates its columns and matrix b replicates its rows, then add element-wise
print(a + b)

2.1.4 Indexing and Slicing

  • Like Python arrays, elements in tensors can be accessed by index: the first element has index 0 and the last element has index -1; ranges can be specified to include the first element and the element before the last
    import torch
    
    x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
    
    
    print(x[-1]) # select the last element with [-1]
    print(x[1:3]) # select the second and third elements with [1:3]
    
    # write an element into the matrix by specifying its index
    x[1, 2] = 9
    print(x)
    
    # assign the same value to multiple elements: index all of them, then assign to them
    x[0:2, :] = 12 # : selects all elements along axis 1 (the columns)
    print(x)
    

2.1.5 Save memory

  • Running some operations may cause memory to be allocated for new results. For example, if you use y = x + y, the tensor pointed to by y will be dereferenced and will instead point to the tensor at the newly allocated memory
    import torch
    
    x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
    y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
    
    # Use slice notation to assign the result of an operation to a previously allocated array, e.g. Y[:] = <expression>
    z = torch.zeros_like(y) # use zeros_like to allocate a block of all zeros with the same shape as y
    print('id(z):', id(z))
    z[:] = x + y
    print('id(z):', id(z))
    

2.1.6 Conversion to other python objects

  • It is easy to convert tensors defined by deep learning frameworks to NumPy tensors (ndarray) and vice versa. torch tensors and numpy arrays will share their underlying memory, and in-place operations changing one tensor will also change the other

    import torch
    
    x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
    A = x.numpy()
    B = torch.tensor(A)
    
    print(type(A), type(B))
    
  • To convert a tensor of size 1 to a Python scalar, call the item function or Python's built-in function

    import torch
    
    a = torch.tensor([3.5])
    
    print((a, a.item(), float(a), int(a)))
    

2.2 Data preprocessing

  • Among the data analysis tools commonly used in Python, the pandas package is usually used, which is compatible with tensors

2.2.1 Read dataset

  • First create an artificial dataset and store it in the CSV (comma-separated values) file ../data/house_tiny.csv. Data stored in other formats can be processed in a similar way. Below, the dataset is written to the CSV file row by row
  • To load the raw dataset from the created CSV file, import the pandas package and call the read_csv function. This dataset has four rows and three columns, where each row describes the number of rooms ("NumRooms"), the alley type ("Alley"), and the price of the house ("Price")
    import os
    import pandas as pd
    
    os.makedirs(os.path.join('..', 'data'), exist_ok=True)
    data_file = os.path.join('..', 'data', 'house_tiny.csv')
    with open(data_file, 'w') as f:
        f.write('NumRooms,Alley,Price\n')
        f.write('NA,Pave,127500\n')
        f.write('2,NA,106000\n')
        f.write('4,NA,178100\n')
        f.write('NA,NA,140000\n')
    
    data = pd.read_csv(data_file)
    print(data)
    
    # Output
       NumRooms Alley   Price
    0       NaN  Pave  127500
    1       2.0   NaN  106000
    2       4.0   NaN  178100
    3       NaN   NaN  140000
    

2.2.2 Handling missing values

  • "NaN" entries represent missing values. Typical methods for handling missing data include imputation and deletion: imputation replaces missing values with a substitute value, while deletion simply ignores them. Imputation is considered here
    • Using the position-based index iloc, data is split into inputs and outputs, where the former takes the first two columns of data and the latter takes the last column. For the numerical values missing in inputs, the "NaN" entries are replaced with the mean of the same column
    import os
    import pandas as pd
    
    os.makedirs(os.path.join('..', 'data'), exist_ok=True)
    data_file = os.path.join('..', 'data', 'house_tiny.csv')
    with open(data_file, 'w') as f:
        f.write('NumRooms,Alley,Price\n')
        f.write('NA,Pave,127500\n')
        f.write('2,NA,106000\n')
        f.write('4,NA,178100\n')
        f.write('NA,NA,140000\n')
    
    data = pd.read_csv(data_file)
    # Split data into inputs and outputs with the position-based index iloc:
    # inputs takes the first two columns of data, outputs takes the last column
    inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
    # fill the missing numeric values with the column means
    # (numeric_only=True skips the string-valued 'Alley' column, as required by recent pandas versions)
    inputs = inputs.fillna(inputs.mean(numeric_only=True))
    
    print(inputs)
    
    # Output
       NumRooms Alley
    0       3.0  Pave
    1       2.0   NaN
    2       4.0   NaN
    3       3.0   NaN
    

2.2.3 Conversion to the tensor format

  • All of the entries in inputs and outputs can be converted to the tensor format once they are numeric (the string-valued "Alley" column in inputs first needs to be turned into numeric indicator columns, e.g. with pd.get_dummies). Once the data is in tensor format, it can be manipulated further with tensor functions
    import torch
    import pandas as pd
    
    # continues from the previous snippet, which defined inputs and outputs
    inputs = pd.get_dummies(inputs, dummy_na=True)  # one-hot encode the categorical "Alley" column
    x = torch.tensor(inputs.to_numpy(dtype=float))
    y = torch.tensor(outputs.to_numpy(dtype=float))
    print(x, y)
    

2.3 Linear algebra

2.3.1 Scalars

  • A quantity that contains just one numerical value is called a scalar. In this article, scalar variables are denoted by ordinary lowercase letters, and a scalar is represented by a tensor with just one element
    import torch
    
    x = torch.tensor(3.0)
    y = torch.tensor(2.0)
    
    print(x + y, x * y, x / y, x**y)
    

2.3.2 Vectors

  • A vector can be viewed as a list of scalar values; these scalars are called the elements or components of the vector. Vectors are usually written as bold, lowercase symbols
  • A vector is represented by a one-dimensional tensor. In general, a tensor can have arbitrary length, subject to the memory limits of the machine
    import torch
    
    x = torch.arange(4)
    
    print(x)
    
  • A column vector is taken as the default orientation of a vector, so a vector $\mathbf{x}$ can be written as
    $$\mathbf{x}=\begin{bmatrix}x_1\\x_2\\\vdots\\x_n\end{bmatrix}$$
Length, Dimensions and Shape
  • A vector is just an array of numbers, and just as every array has a length, so does every vector. In mathematical notation, if a vector $\mathbf{x}$ consists of $n$ real-valued scalars, this can be expressed as $\mathbf{x}\in\mathbb{R}^{n}$. The length of a vector is usually called the dimension of the vector
    import torch
    
    x = torch.arange(4)
    print(len(x)) # call Python's built-in len() function to access the length of the tensor
    # When a tensor represents a vector (it has only one axis), its length can also be accessed via the .shape attribute
    # The shape is a tuple listing the length (dimensionality) of the tensor along each axis
    print(x.shape)
    

    Dimension : The dimension of a vector or axis is used to represent the length of the vector or axis, that is, the number of elements of the vector or axis. However, the dimensionality of a tensor is used to indicate the number of axes the tensor has. So, in this sense, the dimension of an axis of a tensor is the length of that axis

2.3.3 Matrix

  • Vectors generalize scalars from order zero to order one, and matrices generalize vectors from order one to order two, usually in bold, capital letters , and in code as tensors with two axes
  • When a matrix has the same number of rows and columns, its shape becomes square and is called a square matrix

$$\mathbf{A}=\begin{bmatrix}a_{11}&a_{12}&\cdots&a_{1n}\\a_{21}&a_{22}&\cdots&a_{2n}\\\vdots&\vdots&\ddots&\vdots\\a_{m1}&a_{m2}&\cdots&a_{mn}\end{bmatrix}$$

  • When calling the function to instantiate a tensor, a matrix of shape m×n can be created by specifying the two components m and n
  • When the rows and columns of a matrix are swapped, the result is called the transpose of the matrix, written $\mathbf{A}^\top$
    $$\mathbf{A}^\top=\begin{bmatrix}a_{11}&a_{21}&\cdots&a_{m1}\\a_{12}&a_{22}&\cdots&a_{m2}\\\vdots&\vdots&\ddots&\vdots\\a_{1n}&a_{2n}&\cdots&a_{mn}\end{bmatrix}$$
  • As a special type of square matrix, a symmetric matrix $\mathbf{A}$ is equal to its transpose: $\mathbf{A}=\mathbf{A}^{\top}$
    import torch
    
    A = torch.arange(20).reshape(5, 4)
    B = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]]) # a symmetric matrix
    
    print(A)
    print(A.T) # access the transpose of the matrix
    print(B == B.T)
    

2.3.4 Tensors

  • Vectors are a generalization of scalars, matrices are a generalization of vectors, and data structures with even more axes can be built: tensors are a general way of describing n-dimensional arrays with an arbitrary number of axes. For example, a vector is a first-order tensor and a matrix is a second-order tensor. Tensors are denoted by capital letters in a special font (for example, X, Y, and Z), and their indexing mechanism is similar to that of matrices
  • Tensors become even more important when you start working with images, which come as n-dimensional arrays with 3 axes corresponding to height, width, and a channel axis , which represents the color channels (red, green, and blue)
    import torch
    
    a = 2
    X = torch.arange(24).reshape(2, 3, 4)
    
    print(X)
    print(a + X)
    print(a * X)
    

2.3.5 Basic properties of tensor algorithms

  • Any element-wise unary operation does not change the shape of its operands. Likewise, given any two tensors with the same shape, the result of any element-wise binary operation will be a tensor of the same shape
  • The element-wise multiplication of two matrices is called the Hadamard product (mathematical symbol $\odot$)
    $$\mathbf{A}\odot\mathbf{B}=\begin{bmatrix}a_{11}b_{11}&a_{12}b_{12}&\cdots&a_{1n}b_{1n}\\a_{21}b_{21}&a_{22}b_{22}&\cdots&a_{2n}b_{2n}\\\vdots&\vdots&\ddots&\vdots\\a_{m1}b_{m1}&a_{m2}b_{m2}&\cdots&a_{mn}b_{mn}\end{bmatrix}$$
  • Multiplying or adding a tensor by a scalar will not change the shape of the tensor, where each element of the tensor will be added or multiplied by the scalar
    import torch
    
    A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
    B = A.clone() # assign a copy of A to B by allocating new memory
    
    # print(A + B)
    print(A * B)
    

2.3.6 Dimensionality reduction

  • The sum of the elements of a tensor of arbitrary shape can be expressed; the sum of the elements of a matrix $\mathbf{A}$ is written $\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}$
  • By default, calling the sum function reduces the dimensions of the tensor along all axes, making it a scalar . It is also possible to specify along which axis the tensor is reduced by summation
    • Taking the matrix as an example, in order to reduce the dimension ( axis 0 ) by summing the elements of all rows , you can specify axis=0 when calling the function
    • Specifying axis=1 will reduce dimensionality by summing the elements of all columns ( axis 1 )
  • The quantity associated with the summation is the mean (or average), which is calculated by dividing the sum by the total number of elements. Callable function to compute the average of tensors of arbitrary shape
  • Functions that compute the mean can also reduce the dimensionality of tensors along the specified axis
    import torch
    
    A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
    A_sum_axis0 = A.sum(axis=0)
    A_sum_axis1 = A.sum(axis=1)
    
    print(A.sum()) # sum of the elements of matrix A
    print(A_sum_axis0)
    print(A_sum_axis1)
    print(A.mean())
    print(A.mean(axis=0))
    
non-reduced summation
  • Sometimes it is useful to keep the number of axes constant when calling the function to calculate the sum or mean
  • If you want to calculate the cumulative sum of the elements of A along some axis, say axis=0 (by row), you can call the cumsum function. This function does not reduce the dimensionality of the input tensor along any axis
    import torch
    
    A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
    sum_A = A.sum(axis=1, keepdims=True) # sum the elements of A along axis 1 (columns) without reducing the number of axes
    
    print(sum_A)
    print(A.cumsum(axis=0)) 
    

2.3.7 Dot product

  • Given two vectors $\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$, their dot product $\mathbf{x}^\top\mathbf{y}=\sum_{i=1}^{d}x_iy_i$ is the sum of the element-wise products at the same positions
    import torch
    
    x = torch.arange(4, dtype=torch.float32)
    y = torch.ones(4, dtype=torch.float32)
    
    print(torch.dot(x, y))
    
  • Dot products are useful in many situations. For example, given a set of values denoted by a vector $\mathbf{x}\in\mathbb{R}^{d}$ and a set of weights denoted by $\mathbf{w}\in\mathbb{R}^{d}$, the weighted sum of the values in $\mathbf{x}$ according to the weights $\mathbf{w}$ can be expressed as the dot product $\mathbf{x}^{\top}\mathbf{w}$ (a small sketch follows this list)
    • When the weights are non-negative and sum to 1 (i.e., $\sum_{i=1}^{d}w_{i}=1$), the dot product expresses a weighted average
    • After normalizing two vectors to unit length, the dot product expresses the cosine of the angle between them
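  • A minimal sketch (the vectors and weights below are made up, not from the text) of a weighted average and a cosine similarity, both written as dot products:
    import torch
    
    x = torch.tensor([1.0, 2.0, 3.0])
    w = torch.tensor([0.2, 0.3, 0.5])  # non-negative weights that sum to 1
    weighted_avg = torch.dot(x, w)     # weighted average of the values in x
    
    y = torch.tensor([3.0, 2.0, 1.0])
    cosine = torch.dot(x, y) / (torch.norm(x) * torch.norm(y))  # cosine of the angle between x and y
    print(weighted_avg, cosine)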

2.3.8 Matrix-vector product

  • Define a matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ and a vector $\mathbf{x}\in\mathbb{R}^{n}$; the matrix-vector product $\mathbf{A}\mathbf{x}$ is a column vector of length $m$
  • Matrix-vector products describe the complex calculations required to compute each layer of a neural network given the values of the previous layer (here $\mathbf{a}_i^\top$ denotes the $i$-th row of $\mathbf{A}$)
    $$\mathbf{A}\mathbf{x}=\begin{bmatrix}\mathbf{a}_1^\top\\\mathbf{a}_2^\top\\\vdots\\\mathbf{a}_m^\top\end{bmatrix}\mathbf{x}=\begin{bmatrix}\mathbf{a}_1^\top\mathbf{x}\\\mathbf{a}_2^\top\mathbf{x}\\\vdots\\\mathbf{a}_m^\top\mathbf{x}\end{bmatrix}$$
    import torch
    
    A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
    x = torch.arange(4, dtype=torch.float32)
    
    print(torch.mv(A, x)) # use the mv function to perform the matrix-vector product
    

2.3.9 Matrix-matrix multiplication

  • Suppose two matrices $\mathbf{A}\in\mathbb{R}^{n\times k}$ and $\mathbf{B}\in\mathbb{R}^{k\times m}$:
    $$\mathbf{A}=\begin{bmatrix}a_{11}&a_{12}&\cdots&a_{1k}\\a_{21}&a_{22}&\cdots&a_{2k}\\\vdots&\vdots&\ddots&\vdots\\a_{n1}&a_{n2}&\cdots&a_{nk}\end{bmatrix},\quad\mathbf{B}=\begin{bmatrix}b_{11}&b_{12}&\cdots&b_{1m}\\b_{21}&b_{22}&\cdots&b_{2m}\\\vdots&\vdots&\ddots&\vdots\\b_{k1}&b_{k2}&\cdots&b_{km}\end{bmatrix}$$

$$\mathbf{C}=\mathbf{A}\mathbf{B}=\begin{bmatrix}\mathbf{a}_1^\top\\\mathbf{a}_2^\top\\\vdots\\\mathbf{a}_n^\top\end{bmatrix}\begin{bmatrix}\mathbf{b}_1&\mathbf{b}_2&\cdots&\mathbf{b}_m\end{bmatrix}=\begin{bmatrix}\mathbf{a}_1^\top\mathbf{b}_1&\mathbf{a}_1^\top\mathbf{b}_2&\cdots&\mathbf{a}_1^\top\mathbf{b}_m\\\mathbf{a}_2^\top\mathbf{b}_1&\mathbf{a}_2^\top\mathbf{b}_2&\cdots&\mathbf{a}_2^\top\mathbf{b}_m\\\vdots&\vdots&\ddots&\vdots\\\mathbf{a}_n^\top\mathbf{b}_1&\mathbf{a}_n^\top\mathbf{b}_2&\cdots&\mathbf{a}_n^\top\mathbf{b}_m\end{bmatrix}$$

  • The matrix-matrix multiplication $\mathbf{AB}$ can be viewed as simply performing $m$ matrix-vector products (here $\mathbf{a}_i^\top$ denotes the $i$-th row of $\mathbf{A}$ and $\mathbf{b}_j$ the $j$-th column of $\mathbf{B}$)
    import torch
    
    A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
    B = torch.ones(4, 3)
    
    print(torch.mm(A, B)) # use the mm function to perform matrix-matrix multiplication
    

2.3.10 Norm

  • Informally, the norm of a vector tells how big the vector is. The notion of size considered here is not about dimensionality, but about the magnitude of the components

  • In linear algebra, a vector norm is a function $f$ that maps a vector to a scalar. Given any vector $\mathbf{x}$, a vector norm must satisfy a few properties

    • The first property: if all elements of the vector are scaled by a constant factor $\alpha$, the norm is also scaled by the absolute value of the same constant factor
      $$f(\alpha\mathbf{x})=|\alpha|f(\mathbf{x})$$
    • The second property is the triangle inequality
      $$f(\mathbf{x}+\mathbf{y})\leq f(\mathbf{x})+f(\mathbf{y})$$
    • The third property is that the norm must be non-negative (the norm is at least 0, and is 0 if and only if the vector consists entirely of zeros)
      $$f(\mathbf{x})\geq 0$$
  • The Euclidean distance is an $L_2$ norm. Suppose the elements of an $n$-dimensional vector $\mathbf{x}$ are $x_1,\ldots,x_n$; its $L_2$ norm is the square root of the sum of the squares of the vector's elements

    • The subscript 2 is often omitted from the $L_2$ norm, i.e., $\|\mathbf{x}\|$ is equivalent to $\|\mathbf{x}\|_2$
      $$\|\mathbf{x}\|_2=\sqrt{\sum_{i=1}^{n}x_i^2}$$
  • In deep learning, the square of the $L_2$ norm is used more often; the $L_1$ norm is also frequently encountered, expressed as the sum of the absolute values of the vector's elements
    $$\|\mathbf{x}\|_1=\sum_{i=1}^{n}|x_i|$$

  • The $L_2$ norm and the $L_1$ norm are both special cases of the more general $L_p$ norm
    $$\|\mathbf{x}\|_p=\left(\sum_{i=1}^{n}|x_i|^p\right)^{1/p}$$

  • Analogous to the $L_2$ norm of a vector, the Frobenius norm of a matrix $\mathbf{X}\in\mathbb{R}^{m\times n}$ is the square root of the sum of the squares of the matrix's elements

    • The Frobenius norm satisfies all the properties of a vector norm; it behaves like the $L_2$ norm of a matrix-shaped vector
      $$\|\mathbf{X}\|_F=\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}x_{ij}^2}$$
    import torch
    
    u = torch.tensor([3.0, 4.0])
    
    print(torch.norm(u))      # compute the L2 norm
    print(torch.abs(u).sum()) # compute the L1 norm
    print(torch.norm(torch.ones((4, 9)))) # compute the Frobenius norm of a matrix
    

    In deep learning, optimization problems are constantly being solved: maximizing the probability assigned to observed data, or minimizing the distance between predictions and true observations. Items (such as words, products, or news articles) are represented by vectors so that the distance between similar items is minimized and the distance between dissimilar items is maximized. These objectives, perhaps the most important components of a deep learning algorithm (besides the data), are usually expressed as norms

2.4 Calculus

  • The method of approximation is the origin of integral calculus. The other branch of calculus is differential calculus, and the most important application of differential calculus is to optimization problems


  • In deep learning, models are "trained" by continually updating them so that they get better and better as they see more and more data. Often, getting better means minimizing a loss function . The end result is a model that performs well on never-before-seen data. But "training" a model can only fit the model to data that can actually be seen . Therefore, the task of fitting a model can be decomposed into two key problems
    • Optimization: The process of fitting a model to observed data
    • Generalization: Use mathematical principles and the wisdom of practitioners to guide the generation of models that are more effective than the data set used for training

2.4.1 Derivatives and differentials

  • In deep learning, it is common to choose a loss function that is differentiable with respect to the model parameters. In short, for each parameter, if you increase or decrease this parameter by an infinitesimal amount, you can know how fast the loss will increase or decrease

  • Suppose there is a function $f:\mathbb{R}\to\mathbb{R}$ whose input and output are both scalars. If the derivative of $f$ exists, this limit is defined as
    $$f'(x)=\lim_{h\to 0}\frac{f(x+h)-f(x)}{h}$$

  • If $f'(a)$ exists, then $f$ is said to be differentiable at $a$. If $f$ is differentiable at every number in an interval, then the function is differentiable on that interval. The derivative $f'(a)$ is interpreted as the instantaneous rate of change of $f(a)$ with respect to $x$. The so-called instantaneous rate of change is based on the change $h$ in $x$, as $h$ approaches 0

  • Several equivalent notations for the derivative
    $$f'(x)=y'=\frac{dy}{dx}=\frac{df}{dx}=\frac{d}{dx}f(x)=Df(x)=D_xf(x)$$

  • Here $\frac{d}{dx}$ and $D$ are differentiation operators, indicating the operation of differentiation. Common differentiation rules are as follows:
    $$\frac{d}{dx}[Cf(x)]=C\frac{d}{dx}f(x)$$

$$\frac{d}{dx}[f(x)+g(x)]=\frac{d}{dx}f(x)+\frac{d}{dx}g(x)$$

$$\frac{d}{dx}[f(x)g(x)]=f(x)\frac{d}{dx}[g(x)]+g(x)\frac{d}{dx}[f(x)]$$

$$\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right]=\frac{g(x)\frac{d}{dx}[f(x)]-f(x)\frac{d}{dx}[g(x)]}{[g(x)]^2}$$

  • Below, $f(x)=3x^2-4x$ is used as an example to better explain derivatives
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 3 * x ** 2 - 4 * x

def numerical_lim(f, x, h):
    return (f(x + h) - f(x)) / h

# Setting x = 1 and letting h approach 0, the result of numerical_lim approaches 2
h = 0.1
for i in range(5):
    # format the variable h as a float with five digits after the decimal point
    print(f'h={h:.5f}, numerical limit={numerical_lim(f, 1, h):.5f}')
    h *= 0.1

# Define a set_figsize function to set the figure size
# The figsize parameter is a tuple of length 2 specifying the width and height of the figure
# plt.rcParams is a global dictionary in Matplotlib that stores global default values
# figure.figsize is one of its keys, used to set the figure size; its default value is (6.0, 4.0)
def set_figsize(figsize=(3.5, 2.5)):
    plt.rcParams['figure.figsize'] = figsize  # change the global default figsize

# Set the properties of the axes of a figure produced by matplotlib
# axes: the axes object of the figure   legend: the legend
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    axes.set_xlabel(xlabel)  # label of the x-axis
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)  # scaling of the x-axis, either linear or log
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)      # range of the x-axis
    axes.set_ylim(ylim)
    if legend:               # if the legend parameter is not empty
        axes.legend(legend)  # call the axes object's legend() method to set the legend
    axes.grid()              # add a grid to the figure with grid()

# Define a plot function to plot multiple curves concisely
# fmts: the curve styles, either a single style (string) or styles for several curves (a list of strings)
#     '-'   : a solid line
#     'm--' : a magenta dashed line
#     'g-.' : a green dash-dot line
#     'r:'  : a red dotted line
def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
        ylim=None, xscale='linear', yscale='linear',
        fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):

    # plot the data points
    if legend is None:
        legend = []       # use an empty list as the legend
    set_figsize(figsize)  # set the figure size
    
    # Return True if X has only one axis
    def has_one_axis(X):
        # check whether X has an attribute ndim equal to 1 (i.e. X is a one-dimensional object),
        # or whether X is a list whose first element has no __len__ attribute (i.e. its first element is one-dimensional)
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))
    
    if has_one_axis(X):    # if X has only one axis
        X = [X]            # wrap X in a list
    if Y is None:
        X, Y = [[]] * len(X), X  # X becomes a list of len(X) empty lists and Y takes the original X
    elif has_one_axis(Y):  # if Y has only one axis
        Y = [Y]
    if len(X) != len(Y):   # if X and Y have different lengths, replicate X to match the length of Y
        X = X * len(Y)

    fig, ax = plt.subplots()  # create a new figure window and return a tuple (fig, ax) of Figure and Axes objects
    # iterate over the lists X, Y and fmts, pairing their elements with zip
    for x, y, fmt in zip(X, Y, fmts):
        if len(x):              # if the x list is not empty
            ax.plot(x, y, fmt)  # plot the data points of x against y with the style given by fmt
        else:                   # if the x list is empty
            ax.plot(y, fmt)     # only a one-dimensional y array was passed in
    set_axes(ax, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
    plt.show()  # show the figure window

# Plot the function f(x) and its tangent line y = 2x - 3 at x = 1, where the coefficient 2 is the slope of the tangent
x = np.arange(0, 3, 0.1)  # an array from 0 to 3 with step 0.1
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])
# Output
h=0.10000, numerical limit=2.30000
h=0.01000, numerical limit=2.03000
h=0.00100, numerical limit=2.00300
h=0.00010, numerical limit=2.00030
h=0.00001, numerical limit=2.00003


2.4.2 Partial derivatives

  • In deep learning, functions often depend on many variables. Therefore, it is necessary to extend the idea of ​​differentiation to multivariate functions

  • Let $y=f(x_{1},x_{2},\ldots,x_{n})$ be a function of $n$ variables. The partial derivative of $y$ with respect to its $i$-th parameter $x_i$ is (a numerical sketch follows this list)
    $$\frac{\partial y}{\partial x_i}=\lim_{h\to 0}\frac{f(x_1,\ldots,x_{i-1},x_i+h,x_{i+1},\ldots,x_n)-f(x_1,\ldots,x_i,\ldots,x_n)}{h}$$

  • Several equivalent notations for partial derivatives
    $$\frac{\partial y}{\partial x_i}=\frac{\partial f}{\partial x_i}=f_{x_i}=f_i=D_if=D_{x_i}f$$
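  • A minimal numerical sketch (the function is made up, not from the text): estimate the partial derivatives of f(x1, x2) = 3*x1**2 + 5*x2 at (1, 2) by perturbing one variable at a time, and compare with the analytic values 6*x1 = 6 and 5:
    def f(x1, x2):
        return 3 * x1 ** 2 + 5 * x2
    
    h = 1e-4
    x1, x2 = 1.0, 2.0
    print((f(x1 + h, x2) - f(x1, x2)) / h)  # approximately 6.0
    print((f(x1, x2 + h) - f(x1, x2)) / h)  # approximately 5.0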

2.4.3 Gradient

  • The partial derivatives of a multivariate function with respect to all of its variables can be concatenated to obtain the gradient vector of the function. Specifically, let the function $f:\mathbb{R}^n\to\mathbb{R}$ take as input an $n$-dimensional vector $\mathbf{x}=[x_1,x_2,\ldots,x_n]^\top$ and return a scalar. The gradient of the function $f(\mathbf{x})$ with respect to $\mathbf{x}$ is a vector of $n$ partial derivatives
    $$\nabla_{\mathbf{x}}f(\mathbf{x})=\left[\frac{\partial f(\mathbf{x})}{\partial x_1},\frac{\partial f(\mathbf{x})}{\partial x_2},\ldots,\frac{\partial f(\mathbf{x})}{\partial x_n}\right]^\top$$
  • Assume $\mathbf{x}$ is an $n$-dimensional vector; the following rules are often used when differentiating multivariate functions (the last one is checked with autograd below)
    $$\nabla_{\mathbf{x}}\mathbf{A}\mathbf{x}=\mathbf{A}^\top$$

$$\nabla_{\mathbf{x}}\mathbf{x}^{\top}\mathbf{A}=\mathbf{A}$$

$$\nabla_{\mathbf{x}}\mathbf{x}^\top\mathbf{A}\mathbf{x}=(\mathbf{A}+\mathbf{A}^\top)\mathbf{x}$$

$$\nabla_{\mathbf{x}}\|\mathbf{x}\|^2=\nabla_{\mathbf{x}}\mathbf{x}^\top\mathbf{x}=2\mathbf{x}$$
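  • A minimal autograd sketch (not from the text) that checks the last rule, the gradient of $\|\mathbf{x}\|^2$ being $2\mathbf{x}$, numerically:
    import torch
    
    x = torch.arange(4.0, requires_grad=True)
    y = (x * x).sum()       # ||x||^2
    y.backward()            # compute the gradient with autograd
    print(x.grad == 2 * x)  # tensor([True, True, True, True])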

2.4.4 The chain rule

  • With the rules above alone it can be difficult to find gradients, because in deep learning multivariate functions are usually composite, so it is hard to apply any single rule to differentiate them. Fortunately, the chain rule can be used to differentiate composite functions (an autograd check follows this list)
    • Consider univariate functions first. Suppose the functions $y=f(u)$ and $u=g(x)$ are both differentiable; then, by the chain rule,
      $$\frac{dy}{dx}=\frac{dy}{du}\frac{du}{dx}$$
    • Now consider functions with an arbitrary number of variables. Suppose the differentiable function $y$ has variables $u_{1},u_{2},\ldots,u_{m}$, and each differentiable function $u_i$ has variables $x_{1},x_{2},\ldots,x_{n}$. Note that $y$ is then a function of $x_{1},x_{2},\ldots,x_{n}$, and
      $$\frac{\partial y}{\partial x_i}=\frac{\partial y}{\partial u_1}\frac{\partial u_1}{\partial x_i}+\frac{\partial y}{\partial u_2}\frac{\partial u_2}{\partial x_i}+\cdots+\frac{\partial y}{\partial u_m}\frac{\partial u_m}{\partial x_i}$$
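  • A minimal sketch (the functions are made up, not from the text) verifying the univariate chain rule with autograd, for $y=u^2$ with $u=3x+1$, so $dy/dx=2u\cdot 3=6(3x+1)$:
    import torch
    
    x = torch.tensor(2.0, requires_grad=True)
    u = 3 * x + 1
    y = u ** 2
    y.backward()                             # autograd applies the chain rule
    print(x.grad, 6 * (3 * x.detach() + 1))  # both print 42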

2.5 Automatic differentiation

  • Deep learning frameworks speed up derivatives by automatically computing derivatives, known as automatic differentiation. In practice, according to the designed model, the system will build a computational graph (computational graph) to track which data is calculated and combined with which operations to generate output. Automatic differentiation enables the system to subsequently backpropagate gradients . Backpropagation here means tracing the entire computational graph, filling in the partial derivatives with respect to each parameter

2.5.1 A Simple Example

  • Suppose we want to differentiate the function $y=2\mathbf{x}^{\top}\mathbf{x}$ with respect to the column vector $\mathbf{x}$
    import torch
    
    # Before computing the gradient of y with respect to x, a place to store the gradient is needed,
    # so that new memory is not allocated every time a derivative is taken with respect to a parameter
    # The gradient of a scalar function with respect to a vector x is itself a vector with the same shape as x
    x = torch.arange(4.0, requires_grad=True)  # equivalent to x.requires_grad_(True)
    y = 2 * torch.dot(x, x)
    y.backward()  # call the backpropagation function to automatically compute the gradient of y with respect to each component of x
    
    print(y)
    print(x.grad)  # the default value is None
    print(x.grad == 4 * x)  # the gradient of y = 2(x^T)x with respect to x should be 4x; this line verifies the computed gradient
    
    # Output
    tensor(28., grad_fn=<MulBackward0>)
    tensor([ 0.,  4.,  8., 12.])
    tensor([True, True, True, True])
    

2.5.2 Backpropagation for non-scalar variables

  • When $y$ is not a scalar, the most natural interpretation of the derivative of a vector $y$ with respect to a vector $x$ is a matrix. For higher-order, higher-dimensional $y$ and $x$, the result of differentiation can be a higher-order tensor
  • However, when backward is called on a vector, the goal is usually to compute the derivatives of the loss function for each sample in a batch of training examples. The purpose here is not to compute the differentiation matrix but the sum of the partial derivatives computed individually for each sample in the batch
  • The purpose of backpropagation is to compute the derivative of $y$ with respect to $x$, i.e., to find $dy/dx$. By the chain rule, $dy/dx$ equals $dy/dz$ times $dz/dx$, where $z$ is an intermediate variable between $y$ and $x$. Also, since the partial derivative of $y$ with respect to $x$ is a vector, the result returned by backpropagation is also a vector, and it is stored in x.grad
    import torch
    
    x = torch.arange(4.0, requires_grad=True)
    y = 2 * torch.dot(x, x)
    y.backward()
    
    # Calling backward on a non-scalar requires passing a gradient argument, which specifies the gradient of the differentiated function with respect to self
    x.grad.zero_()  # by default PyTorch accumulates gradients, so the previous values need to be cleared
    y = x * x
    # Passing torch.ones(len(x)) as the input to backward means backpropagating an all-ones vector of the same length as x as the derivative of y with respect to itself
    # By the chain rule this amounts to differentiating y with respect to y; since y is a function of x, the returned result is the vector of derivatives with respect to x
    y.sum().backward()  # equivalent to y.backward(torch.ones(len(x)))
    
    print(x.grad)
    
    # Output
    tensor([0., 2., 4., 6.])
    

2.5.3 Gradient calculation of Python control flow

  • One benefit of using automatic differentiation is that even if building the computational graph of a function requires control flow through Python (e.g., conditionals, loops, or arbitrary function calls), gradients of the resulting variables can still be computed. In the code below, both the number of iterations of the while loop and the result of the if statement depend on the value of the input a
    import torch
    
    def f(a):
        b = a * 2
        while b.norm() < 1000:
            b = b * 2
        if b.sum() > 0:
            c = b
        else:
            c = 100 * b
        return c
    
    a = torch.randn(size=(), requires_grad=True)
    d = f(a)
    d.backward()
    
    print(a.grad == d / a)
    
    # Output
    tensor(True)
    

2.6 Probability

2.6.1 Basic probability theory

  • Suppose a die is rolled and we want to know the chance of seeing a 1. If the die is fair, then all six outcomes {1,...,6} are equally likely to occur, so the probability of seeing a 1 is 1/6. The law of large numbers tells us that this estimate gets closer and closer to the true underlying probability as the number of rolls increases
  • In statistics, the process of drawing samples from a probability distribution is called sampling.
    • Generally speaking, the distribution can be regarded as the probability distribution of events
    • A distribution that assigns probabilities to some discrete choices is called a multinomial distribution
  • To draw a sample for rolling a dice, simply pass in a vector of probabilities. Output another vector of the same length: its value at index i is the number of occurrences of i in the sampled results. 1000 throws can be simulated. Then count how many times each number has been hit after 1000 throws, and calculate the relative frequency as an estimate of the true probability
    import torch
    from torch.distributions import multinomial
    
    fair_probs = torch.ones([6]) / 6
    counts = multinomial.Multinomial(1000, fair_probs).sample()
    print(counts / 1000)  # relative frequencies as the estimates
    
    # Output (the true probability of each outcome is about 0.167)
    tensor([0.1600, 0.1840, 0.1700, 0.1650, 0.1670, 0.1540])
    
  • These probabilities can be seen to converge to the true probabilities over time for 500 sets of experiments with 10 samples in each set. Each solid line corresponds to one of the 6 values ​​of the dice and gives the estimated probability of the dice taking that value after each set of experiments. These 6 solid curves converge towards the true probability when more data is obtained through more experiments
    import torch
    import matplotlib.pyplot as plt
    from torch.distributions import multinomial
    
    fair_probs = torch.ones([6]) / 6
    counts = multinomial.Multinomial(10, fair_probs).sample((500,))
    cum_counts = counts.cumsum(dim=0)
    estimates = cum_counts / cum_counts.sum(dim=1, keepdim=True)
    
    def set_figsize(figsize=(6, 4.5)):
        plt.rcParams['figure.figsize'] = figsize
    
    for i in range(6):
        plt.plot(estimates[:, i].numpy(), label=("P(die=" + str(i + 1) + ")"))
    
    set_figsize((6, 4.5))
    plt.axhline(y=0.167, color='black', linestyle='dashed')
    plt.xlabel('Groups of experiments')
    plt.ylabel('Estimated probability')
    plt.legend()
    plt.show()  # plt.show() must be called for the figure and the legend set by plt.legend() to be displayed
    


Axioms of Probability Theory
  • When dealing with die rolls, the set $\mathcal{S}=\{1,2,3,4,5,6\}$ is called the sample space or outcome space, and each of its elements is an outcome. An event is a set of outcomes from a given sample space. For example, "seeing a 5" ({5}) and "seeing an odd number" ({1,3,5}) are both valid events when rolling a die
  • Probability can be thought of as a function that maps a set to a real value. In a given sample space $\mathcal{S}$, the probability of an event $A$ is denoted $P(A)$ and satisfies the following properties
    • For any event $A$, its probability is never negative, i.e., $P(A)\geq 0$
    • The probability of the entire sample space is 1, i.e., $P(\mathcal{S})=1$
    • For mutually exclusive events: $P(A_1\cup A_2\cup A_3\cup\cdots)=P(A_1)+P(A_2)+P(A_3)+\cdots$
Random Variables
  • The concept of a random variable is introduced for the random experiment of rolling a die. A random variable can take on almost any quantity, and in a random experiment it takes one value from a set of possibilities. Consider a random variable $X$ whose values lie in the sample space $\mathcal{S}=\{1,2,3,4,5,6\}$. The event "seeing a 5" can be written $\{X=5\}$ or $X=5$, and its probability as $P(\{X=5\})$ or $P(X=5)$

    • $P(X)$ can be used to denote the distribution of the random variable $X$; the distribution gives the probability that $X$ takes on any particular value
    • $P(a)$ can be used to denote the probability that the random variable takes the value $a$
  • Since an event in probability theory is a set of outcomes from the sample space, a range of values can be assigned to a random variable. For example, $P(1\leq X\leq 3)$ denotes the probability of the event $\{1\leq X\leq 3\}$, i.e., $\{X=1,2,\mathrm{or}\ 3\}$. Equivalently, $P(1\leq X\leq 3)$ denotes the probability that the random variable $X$ takes a value from $\{1,2,3\}$

  • There is a subtle distinction between discrete random variables (such as the sides of a die) and continuous random variables (such as a person's weight and height). The following subsections mainly consider probabilities over discrete spaces; for the probabilities of continuous random variables, see: Continuous Random Variables

2.6.2 Dealing with Multiple Random Variables

Joint Probability
  • Joint probability: $P(A=a, B=b)$
    • Given any values $a$ and $b$, the joint probability asks: what is the probability that $A=a$ and $B=b$ hold simultaneously?
    • For any values of $a$ and $b$: $P(A=a,B=b)\leq P(A=a)$ and $P(A=a,B=b)\leq P(B=b)$
Conditional Probability
  • The inequality for the joint probability leads to an interesting ratio: $0\leq\frac{P(A=a,B=b)}{P(A=a)}\leq 1$. This ratio is called the conditional probability and is denoted $P(B=b\mid A=a)$: the probability of $B=b$, given that $A=a$ has occurred
Bayes' Theorem
  • By the definition of conditional probability and the multiplication rule
    $$P(A,B)=P(B\mid A)P(A)$$

$$P(A,B)=P(A\mid B)P(B)$$

  • From these, Bayes' theorem follows (a numeric sketch comes after this note)
    $$P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}$$

    $P(A,B)$ is a joint distribution, and $P(A\mid B)$ is a conditional distribution
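  • A minimal numeric sketch of Bayes' theorem (the probabilities below are made up, not from the text):
    # assume P(B|A) = 0.9, P(A) = 0.01 and P(B) = 0.1
    p_b_given_a = 0.9
    p_a = 0.01
    p_b = 0.1
    p_a_given_b = p_b_given_a * p_a / p_b  # Bayes' theorem
    print(p_a_given_b)  # 0.09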

Marginalization
  • To sum event probabilities, the sum rule is needed: the probability of $B$ amounts to enumerating all possible choices of $A$ and aggregating the joint probabilities over all of them. This is also called marginalization, and the resulting probability or distribution is called the marginal probability or marginal distribution
    $$P(B)=\sum_A P(A,B)$$
Independence
  • If two random variables $A$ and $B$ are independent, the occurrence of event $A$ is unrelated to the occurrence of event $B$
    • In this case this is usually written as $A\perp B$
    • By Bayes' theorem, it immediately follows that $P(A\mid B)=P(A)$
  • Since $P(A\mid B)=\frac{P(A,B)}{P(B)}=P(A)$ is equivalent to $P(A,B)=P(A)P(B)$,
    • two random variables are independent if and only if the joint distribution of the two random variables is the product of their individual distributions
  • Given another random variable $C$, two random variables $A$ and $B$ are conditionally independent if and only if $P(A,B\mid C)=P(A\mid C)P(B\mid C)$
    • This is written as $A\perp B\mid C$

2.6.3 Expectation and variance

  • To summarize the key characteristics of a probability distribution, some measures are needed. The expectation (or average) of a random variable $X$ is
    $$E[X]=\sum_x xP(X=x)$$

  • When the input of a function $f(x)$ is a random variable drawn from the distribution $P$, the expected value of $f(x)$ is
    $$E_{x\sim P}[f(x)]=\sum_x f(x)P(x)$$

  • In many cases it is desirable to measure how much a random variable $X$ deviates from its expectation; this can be quantified by the variance (estimated numerically in the sketch below)
    $$\mathrm{Var}[X]=E\left[(X-E[X])^2\right]=E[X^2]-E[X]^2$$

    The square root of the variance is called the standard deviation

  • The variance of a function of a random variable measures how far the function value deviates from the function's expectation, as different values $x$ are sampled from the distribution of the random variable
    $$\operatorname{Var}[f(x)]=E\left[\left(f(x)-E[f(x)]\right)^2\right]$$
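  • A minimal sketch (not from the text) that estimates the expectation and variance of a fair die roll from samples and compares them with the exact values 3.5 and 35/12:
    import torch
    
    samples = torch.randint(1, 7, (100000,)).float()    # simulate 100000 fair die rolls
    print(samples.mean(), samples.var(unbiased=False))  # close to 3.5 and about 2.92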
