Deep Learning_Andrew Ng_Study Notes (2022)

Deep Learning_Andrew Ng_DeepLearning.ai

Andrew Ng deep learning video link

P10


Syntax for superscripts and subscripts: H₂O, CO₂, Popcorn™, C₂¹⁴

Loss Function for Logistic Regression

$L(\hat{y}, y) = -[y\log\hat{y} + (1-y)\log(1-\hat{y})]$, where $\hat{y}$ is the predicted label and $y$ is the true label.

When $y=1$, $L(\hat{y}, y) = -\log\hat{y}$. Since $\hat{y}$ lies between 0 and 1, $\log\hat{y}$ is negative and $-\log\hat{y}$ is positive; as $\hat{y}$ tends to 1, $L(\hat{y}, y) = -\log\hat{y}$ becomes smaller.

Similarly, when $y=0$, $L(\hat{y}, y) = -\log(1-\hat{y})$; as $\hat{y}$ tends to 0, $L(\hat{y}, y) = -\log(1-\hat{y})$ becomes smaller.
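
As a quick sketch (not from the lecture itself), the loss can be computed in NumPy as follows; the function name and the small epsilon added for numerical stability are my own:

import numpy as np

def logistic_loss(y_hat, y, eps=1e-12):
    # Cross-entropy loss for one example or an array of examples.
    # y_hat: predicted probability in (0, 1); y: true label in {0, 1}.
    # eps keeps log() away from zero (a numerical-stability tweak, not part of the formula).
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# When y = 1, the loss shrinks as y_hat approaches 1:
print(logistic_loss(0.9, 1))   # ~0.105
print(logistic_loss(0.99, 1))  # ~0.010
# When y = 0, the loss shrinks as y_hat approaches 0:
print(logistic_loss(0.1, 0))   # ~0.105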

P18


When computing with large amounts of data, vectorization is much faster and more time-saving than an explicit for loop.

import numpy as np
import time

# Two random vectors with one million entries each
a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Vectorized dot product
tic = time.time()
c = np.dot(a, b)
toc = time.time()

print(c)
print("Vectorized version: " + str(1000 * (toc - tic)) + "ms")

# Explicit for loop computing the same dot product
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()

print(c)
print("For loop: " + str(1000 * (toc - tic)) + "ms")

P23


Using assertions like assert(a.shape == (5,1)) will allow you to be more certain about the dimensionality of your vectors and catch errors in time.

Use assertions liberally to double-check the dimensions of matrices and arrays. Also, don't be afraid to use the reshape operation to make sure your matrices and vectors are the dimensions you need.
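
A short sketch of this habit (the shapes and variable names below are just for illustration):

import numpy as np

a = np.random.randn(5)        # rank-1 array of shape (5,), a common source of bugs
a = a.reshape(5, 1)           # force an explicit column vector
assert a.shape == (5, 1)      # fail fast if the shape is not what we expect

b = np.random.randn(1, 5)
assert b.shape == (1, 5)
print((a * b).shape)          # (5, 5): broadcasting is now unambiguous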


P29

  • In the logistic regression on the left, a circle represents a two-step operation: first compute z by a matrix multiplication, then apply the activation function to get a (see the forward-pass sketch after this list).

  • The neural network on the right simply repeats this logistic-regression computation many times.

  • $a^{[l]}_i$ denotes the activation of the i-th node of layer $l$.
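
A minimal sketch of the two-step computation for a single layer, assuming a sigmoid activation; the layer sizes and variable names below are arbitrary:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, m = 3, 4, 5                 # input size, hidden units, batch size
X = np.random.randn(n_x, m)           # inputs, one column per example
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))

Z1 = np.dot(W1, X) + b1               # step 1: linear part
A1 = sigmoid(Z1)                      # step 2: activation
print(A1.shape)                       # (4, 5)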


P33


Why do we need nonlinear activation functions?

If you use a linear activation function (also called the identity activation function), the output of the neural network is just a linear transformation of the input.

In a deep network with many, many hidden layers, it turns out that if you use linear activation functions, then no matter how many layers the network has, all it computes is a linear function of the input, so you might as well remove all the hidden layers.
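
A tiny demonstration of this point: two stacked layers with identity (linear) activations collapse into a single linear layer. All names and sizes here are made up for the demo:

import numpy as np

np.random.seed(0)
x = np.random.randn(3, 1)

W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(2, 4), np.random.randn(2, 1)

# Two "layers" with identity (linear) activation
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# One equivalent linear layer
W = W2 @ W1
b = W2 @ b1 + b2
print(np.allclose(a2, W @ x + b))   # True: the hidden layer added nothing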

P37


Why do the weight parameters of neural networks need to be initialized randomly?

If all weight parameters of a neural network are initialized to zero, then all the neurons in a layer compute exactly the same function and exert exactly the same influence on the output neuron. After one iteration of gradient descent the result is still the same; the neurons remain "symmetric". Therefore the weights must be randomly initialized to different values.
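
A quick sketch of the symmetry problem (toy sizes, biases omitted for brevity): with zero initialization every hidden unit produces the same activation, while random initialization breaks the tie.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.random.randn(3, 1)

# Zero initialization: every hidden unit computes the same value
W1_zero = np.zeros((4, 3))
print(sigmoid(W1_zero @ x).ravel())    # four identical activations (all 0.5 here)

# Random initialization breaks the symmetry
W1_rand = np.random.randn(4, 3) * 0.01
print(sigmoid(W1_rand @ x).ravel())    # four different activations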

P45


The concept of parameters and hyperparameters

Parameters: quantities such as W and b that are updated automatically during training.
Hyperparameters: quantities such as the learning rate, the number of iterations, the number of hidden layers, the number of units per layer, and the choice of activation function. These must be set by hand and are therefore called hyperparameters.
Relationship between the two: the choice of hyperparameters affects the final values the parameters converge to.

P47


training set/dev set (validation set)/test set

The traditional approach is to take one part of the full dataset as the training set, set aside another part as the hold-out cross-validation set (sometimes also called the development set), and then take a part from the end as the test set.

The overall workflow is: first, keep training your algorithms on the training set; then test them on the dev set (the hold-out cross-validation set) to determine which of many different models works best on the dev set. When this process has gone on long enough, you may want to evaluate the final result; you can then use the test set to evaluate the best model, so that no bias is introduced when assessing the performance of the algorithm.

P48


High Variance and High Bias

  • When the training-set error is 1% and the dev-set error is 11%, it is a case of high variance (the training-set error is low, but the gap between the two errors is too large).
  • When the training-set error is 15% and the dev-set error is 16%, it is a case of high bias (both errors are relatively high, and the gap between them is small).
  • When the training-set error is 15% and the dev-set error is 30%, it is a case of high bias and high variance (both errors are relatively high, and the gap is also large).
  • When the training-set error is 0.5% and the dev-set error is 1%, it is a case of low bias and low variance, the ideal situation (both errors are low, and the gap is small).

Note: all the situations above assume that the human error in recognizing these images is roughly zero. If the human eye cannot recognize the images well, you cannot expect the trained model to perform well either; after all, the data determines the upper limit of machine learning. When the data is good enough that people can easily identify it by eye, the upper limit for machine learning will also be high.


P49


Basic recipe for machine learning

  • To deal with high bias (the model performs poorly on the training set), the following measures can be taken:
    • Try a larger network (more hidden layers, more hidden units, etc.)
    • Train for longer
    • Choose another network architecture
  • To deal with high variance (the performance on the training set and the dev set differ considerably), the following measures can be taken:
    • Try to get more data
    • Regularization
    • Choose another network architecture

P69


learning rate decay

In the initial steps of learning you can afford to take much larger steps, but as the learning starts to converge, a lower learning rate lets you take smaller steps, so the updates oscillate within a tighter region around the minimum instead of wandering far from it; that is, the cost function keeps being driven down. The formula for learning-rate decay is as follows:

$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \times \alpha_0$

where $\alpha$ is the current learning rate, decay_rate is the decay rate (a hyperparameter you set), epoch_num is the current number of training epochs, and $\alpha_0$ is the initial learning rate.
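
The formula translated directly into a small helper function (a sketch; the function name is my own):

def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    # alpha = 1 / (1 + decay_rate * epoch_num) * alpha0
    return alpha0 / (1 + decay_rate * epoch_num)

alpha0 = 0.2
for epoch in range(5):
    print(epoch, decayed_learning_rate(alpha0, decay_rate=1.0, epoch_num=epoch))
# prints 0.2, 0.1, 0.0666..., 0.05, 0.04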


P71


hyperparameter search

  • Use random sampling rather than a grid, so that the hyperparameter space is searched adequately
  • Then apply a coarse-to-fine search process (see the sampling sketch after this list)
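
A sketch of random sampling on an appropriate scale; the log-scale range for the learning rate is the example used in the course, everything else here is illustrative:

import numpy as np

# Sample the learning rate uniformly on a log scale between 1e-4 and 1
r = -4 * np.random.rand(10)              # r in [-4, 0]
learning_rates = 10 ** r
print(np.sort(learning_rates))

# Coarse-to-fine: after spotting a promising region, sample more densely inside it
r_fine = np.random.uniform(-3, -2, 10)   # e.g. zoom into [1e-3, 1e-2]
print(np.sort(10 ** r_fine))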

P76


Why choose Batch Norm algorithm?

  • Normalizing the input features (to mean 0 and variance 1) greatly speeds up the learning process. The same holds for the values of the hidden layers (a forward-pass sketch follows this list).
  • The Batch Norm algorithm reduces the instability in the distribution of hidden-unit values.
  • Batch Norm limits the effect that changes in earlier-layer values have on the inputs a later layer sees, keeping those values stable, so the later layers of the network have a more stable foundation.
  • In fact, even though the earlier layers keep learning, the later layers' need to adapt to those changes is reduced. Batch Norm weakens the coupling between the parameters of earlier layers and those of later layers, allowing each layer to learn somewhat independently of the others, which effectively speeds up learning in the whole network.
  • The Batch Norm algorithm also has a slight regularization effect.
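
A minimal sketch of the Batch Norm forward computation on one layer's pre-activations Z, assuming the usual gamma/beta scale-and-shift formulation; the array shapes are toy values:

import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    # Normalize Z across the mini-batch (columns), then scale and shift
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta

Z = np.random.randn(4, 32) * 5 + 3          # 4 units, mini-batch of 32
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))
Z_tilde = batchnorm_forward(Z, gamma, beta)
print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))   # ~0 and ~1 per unit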

P80


Criteria for a good deep learning framework

  • Ease of programming (both during development and during deployment)
  • Running speed
  • Whether it is truly open source (and well governed)

P82


machine learning strategy

Assume the system reaches 90% accuracy but that still does not meet your expectations. There are several possible ideas for improving the accuracy:

  • Collect more training data
  • Collect more diverse data, or more diverse counterexamples
  • Train longer with gradient descent
  • Use Adam optimizer instead of other optimizers
  • try a larger network
  • try a smaller network
  • try dropout
  • Increase $L_2$ regularization
  • new network architecture
    • activation function
    • The number of neurons in the hidden layer

P87


Size of training set, dev set (validation set) and test set

  • If the amount of data is small (say 100, 1,000, or 10,000 images), it can be split in the following ratios:
    • Training set : test set = 70% : 30%;
    • Training set : dev set : test set = 60% : 20% : 20%;
  • If you have a large dataset (say 1,000,000 images):
    • 10,000 samples are enough for the dev set or the test set, so training set : dev set : test set = 98% : 1% : 1%.

P93


Improve model performance

Getting a supervised learning algorithm to work well basically means you can do two things:

  1. Fit the training set well (i.e. achieve low avoidable bias);
  2. Have the training-set performance generalize well to the dev set or the test set (i.e. keep the variance low).

If you want to improve your machine learning system, it is recommended to first look at the gap between the training error and the estimate of Bayes error, which lets you estimate the avoidable bias; then look at the gap between the dev-set error and the training error to estimate how large your variance is.

Strategies to reduce avoidable bias

  • train a larger model
  • train longer
  • use a better optimization algorithm
  • Better Neural Network Architecture
    • More precise hyperparameter values (hyperparameter search)

Strategies to reduce variance

  • train with more data
  • Regularization (L2 regularization or random dropout method, etc.)
  • Better Neural Network Architecture
    • More precise hyperparameter values (hyperparameter search)

P99


When moving from training set error to dev set error, two things change:

  1. The data that the algorithm sees is only the training set and not the development set.
  2. The development set and training set data distributions are different.

What to do when the training set and the dev set have different data distributions

Assume the dev set and the test set come from the same distribution while the training set comes from a different one. What we do is randomly shuffle the training set and carve out a small slice of it as the training-dev set. Just as the dev set and the test set share a distribution, the training set and the training-dev set also share a distribution. The difference is that now you train your network on only part of the training set, and you do not let the network train on the training-dev set.

Error Analysis

When doing error analysis you need to compare the classifier's errors, namely the training-set error, the training-dev-set error and the dev-set error.

  • Case 1

    Assume a training-set error of 1%, a training-dev-set error of 9%, and a dev-set error of 10%. This indicates a variance problem: the training set and the training-dev set come from the same distribution, yet the training-dev-set error is much larger than the training-set error.

  • Case 2

    Assume a training-set error of 1%, a training-dev-set error of 1.5%, and a dev-set error of 10%. Since the training-dev-set error is close to the training-set error while the dev-set error is much larger, this is a data-mismatch problem: the training set and the dev set follow different distributions.

  • Case 3

    Assume a human-level (Bayes) error of 0%, a training-set error of 10%, a training-dev-set error of 11%, and a dev-set error of 12%. This is an avoidable-bias (high-bias) problem, because the performance falls far short of human-level performance.

  • Case 4

    Assume a human-level (Bayes) error of 0%, a training-set error of 10%, a training-dev-set error of 11%, and a dev-set error of 20%. There are two problems: first, the avoidable bias is quite high, because the training-set error is much larger than the human-level (Bayes) error; second, the data mismatch is quite large, because the dev-set error is much larger than the training-dev-set error, i.e. the training set and the dev set follow different distributions.


P101


When can transfer learning be used (take the transfer from task A to task B as an example)

  • Task A and task B must have the same input (for example, the input is a picture or audio)
  • The data volume of task A should be much larger than that of task B
  • Low-level features learned from task A (such as edge detection, curve detection, or detection of light and dark objects) will help task B achieve its goal (a minimal fine-tuning sketch follows this list)
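
A minimal fine-tuning sketch of the idea, assuming TensorFlow/Keras with an ImageNet-pretrained MobileNetV2 playing the role of task A; the input shape, layer choices and the task_b_dataset name are illustrative only:

import tensorflow as tf

# Task A: features learned on ImageNet (large dataset)
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                      # freeze the low-level feature layers

# Task B: a small binary classification problem reusing those features
inputs = tf.keras.Input(shape=(160, 160, 3))
x = base(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(task_b_dataset, epochs=5)       # task_b_dataset is hypothetical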

P116


Why use convolution

  • Parameter sharing can be realized and the amount of parameters can be reduced. The parameters of the fully connected layer are generally huge, while the parameters of the convolutional layer are relatively small. A feature detector (such as a vertical edge detector) useful in one part of the image is usually also useful in another part of the image.
  • Sparse connections can be established. The value of each output unit depends only on a small part of the input (a parameter-count comparison follows this list).
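
A quick back-of-the-envelope comparison; the 32×32×3 input and 28×28×6 output are the sizes used in the course's example, so treat the exact numbers as an illustration:

# Input: 32 x 32 x 3 image; output: 28 x 28 x 6 feature map (6 filters of size 5x5)
n_in  = 32 * 32 * 3      # 3072
n_out = 28 * 28 * 6      # 4704

fc_params   = n_in * n_out                 # fully connected: ~14.4 million weights
conv_params = (5 * 5 * 3 + 1) * 6          # conv layer: 456 parameters (incl. biases)

print(fc_params, conv_params)              # 14450688 456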

P121


The role of $1 \times 1$ convolution

Suppose we have a $28 \times 28 \times 192$ feature map. How do we reduce it to $28 \times 28 \times 32$? The answer is to convolve it with 32 filters of size $1 \times 1 \times 192$, which reduces the number of channels from 192 to 32 while leaving the width and height unchanged.

So how do we reduce the width and height of a feature map? Suppose we have a $28 \times 28 \times 192$ feature map. How do we reduce it to $14 \times 14 \times 192$? The answer is to use pooling with a size of 2 and a stride of 2; the width and height are then halved and the number of channels stays the same.


P124


Computational cost of ordinary convolution

$\text{Computational cost} = \#\text{filter params} \times \#\text{filter positions} \times \#\text{of filters}$

where Computational cost represents the cost of the computation; #filter params is the size of each filter, i.e. $3 \times 3 \times 3$; #filter positions is the number of positions the filter takes in the feature map, i.e. $4 \times 4$; and #of filters is the number of filters, which the $4 \times 4 \times 5$ output shows to be 5.
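
Plugging the numbers from this example into the formula as a quick check:

filter_params    = 3 * 3 * 3    # 27 multiplications per filter position
filter_positions = 4 * 4        # 16 positions in the output feature map
num_filters      = 5

print(filter_params * filter_positions * num_filters)   # 2160 multiplications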


P130


Data vs. Human Design

Throughout the history of machine learning, when there is a lot of data, people have often been able to reach their goal with simpler algorithms and less manual design, so it is not necessary to carefully engineer features for the problem. When you have a lot of data, you can use a huge neural network, or even a simpler architecture, and let the network learn whatever it needs to learn.

In contrast, when there is not enough data, you usually see people doing more manual design, that is, more hand engineering. I think careful manual design is actually the best way to get good results when the amount of data is small.

Therefore, learning algorithms usually have two sources of knowledge: one is labeled data, and the other is manual design. There are many ways to hand-design a system, whether it is carefully engineering the features, carefully designing the network architecture, or designing other components of the system. So when there is not much labeled data, more effort has to go into manual design.


P142


transposed convolution

As shown above, we take a $2 \times 2$ input and want to produce a $4 \times 4$ output, so we choose a $3 \times 3$ filter, apply a padding of $p = 1$ to the output, and in this example use a stride of $s = 2$ (a NumPy sketch follows these steps).

  1. First, the upper-left entry of the $2 \times 2$ input is 2. We multiply every element of the $3 \times 3$ filter by 2 and write the result into the $p = 1$ padded output; from left to right and top to bottom the values are 2, 4, 2, 4, 0, 2, 0, 4, 2.

  2. Shift two cells to the right in the output, because the stride is $s = 2$. The upper-right entry of the $2 \times 2$ input is 1. We multiply every element of the $3 \times 3$ filter by 1 and write the result into the $p = 1$ padded output; from left to right and top to bottom the values are 1, 2, 1, 2, 0, 1, 0, 2, 1. Where the placements overlap the values are added; here two cells overlap, giving 2 + 2 and 2 + 0.

  3. Repeat the above steps for the remaining entries of the input.
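
A small NumPy sketch that follows the placement-and-sum procedure described above. The 4×4 output size and the p = 1 trimming come from the example; the filter values are inferred from the products listed in the steps, and the bottom row of the input is made up:

import numpy as np

def transposed_conv2d(x, f, stride, pad, out_size):
    # Place f * x[i, j] on a padded output canvas at steps of `stride`,
    # sum the overlapping cells, then trim `pad` cells from each border.
    canvas = np.zeros((out_size + 2 * pad, out_size + 2 * pad))
    k = f.shape[0]
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            r, c = i * stride, j * stride
            canvas[r:r + k, c:c + k] += x[i, j] * f
    return canvas[pad:pad + out_size, pad:pad + out_size]

x = np.array([[2., 1.],     # top row taken from the steps above
              [3., 2.]])    # bottom row is an arbitrary choice for the demo
f = np.array([[1., 2., 1.],
              [2., 0., 1.],
              [0., 2., 1.]])
print(transposed_conv2d(x, f, stride=2, pad=1, out_size=4))

The printed 4×4 matrix reproduces the overlap-and-add behaviour from step 2: the cells where the two placements meet contain 2 + 2 and 2 + 0.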


P158


recurrent neural network

What the recurrent neural network does: when it goes on to read the second word in the sentence, say $x_2$, the network not only uses $x_2$ to predict $y_2$; it also takes the activation computed at the previous step as part of its input. At every step, the recurrent neural network passes its activation on to the next step for it to use.
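
A sketch of a single forward step of a basic RNN cell; the tanh activation and the Waa/Wax/ba names follow the course's convention, and the sizes are toy values:

import numpy as np

n_a, n_x = 5, 3                # hidden size, input size
Waa = np.random.randn(n_a, n_a)
Wax = np.random.randn(n_a, n_x)
ba  = np.zeros((n_a, 1))

def rnn_cell_step(x_t, a_prev):
    # The new activation depends on the current input AND the previous activation
    return np.tanh(Waa @ a_prev + Wax @ x_t + ba)

a0 = np.zeros((n_a, 1))
x1 = np.random.randn(n_x, 1)
x2 = np.random.randn(n_x, 1)
a1 = rnn_cell_step(x1, a0)
a2 = rnn_cell_step(x2, a1)     # the step for x2 reuses a1 from the previous step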


P160


Types of Recurrent Neural Networks

  • One-to-one (a standard neural network; a recurrent structure is rarely needed here)
  • One-to-many (e.g. music generation)
  • Many-to-one (e.g. sentence sentiment classification)
  • Many-to-many with $T_x = T_y$ (e.g. named entity recognition)
  • Many-to-many with $T_x \neq T_y$ (e.g. machine translation)

P177


Removing bias from word embeddings

  1. Identify the direction of the specific bias that needs to be eliminated or reduced.
  2. Neutralize: for every word that is not definitional for the bias (e.g. gender-neutral words), remove the bias by projection (a sketch of this step follows the list).
  3. Equalize: for example, make the words girl and boy equidistant from the word doctor.
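
A sketch of the neutralize (projection) step, assuming the bias direction g has already been identified (for example as a difference of gender-pair embeddings); the vectors below are random stand-ins for real embeddings:

import numpy as np

def neutralize(e, g):
    # Remove from embedding e its component along the bias direction g
    e_bias = (np.dot(e, g) / np.dot(g, g)) * g   # projection of e onto g
    return e - e_bias

g = np.random.randn(50)          # bias direction (e.g. e_woman - e_man)
e_doctor = np.random.randn(50)   # embedding of a word that should be neutral

e_debiased = neutralize(e_doctor, g)
print(np.dot(e_debiased, g))     # ~0: no component left along the bias direction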

P185


attention model

The attention mechanism is essentially a process of finding weights. Suppose we need to translate a sentence of 5 French words; we do not have to translate it word by word. For each of the five positions in the sentence, we compute a rich set of features about the word at that position and the words around it. The question then is: when you want to generate the first output word, which part of the French sentence should you look at? Intuition says you should look at the first French word and the few words around it, and you probably do not need to pay attention to the words at the end of the sentence (after all, we have only just started translating the first word).

The attention model therefore computes a series of attention weights (parameters). Say we use $a(1,1)$ to denote how much attention you should pay to the first French word when generating the first output word; likewise, $a(1,2)$ denotes how much attention to pay to the second French word when generating the first output word, and so on for the third word and the rest. Taken together, these weights tell us exactly which parts of the sentence context corresponding to this position we should be paying attention to.


P189

self-attention mechanism

My summary: in general, the attention mechanism is a process of finding weights. Think of the source as a series of <Key, Value> pairs. Given an element of the target (the Query), compute the similarity or relevance between the Query and each Key; that similarity becomes the weight coefficient of the corresponding Value. Finally, take the weighted sum of all the Values to get the attention value. So in essence, the attention mechanism is a weighted-summation process.


In machine translation, each word must first be associated with three vectors: a Query, a Key and a Value. As shown in the figure below, suppose $x^{<3>}$ is the word embedding of l'Afrique; then $q^{<3>}$, $k^{<3>}$ and $v^{<3>}$ are obtained by multiplying $x^{<3>}$ by the weight matrices $W^Q$, $W^K$ and $W^V$ respectively, where the matrices $W$ are learnable parameters. With these parameters you can obtain the Query, Key and Value of every word. So what do the Query, Key and Value vectors actually do? $q^{<3>}$ is a question about l'Afrique; it expresses something like: what is happening in l'Afrique? We know that l'Afrique (Africa in English) is a destination, so what happens when we compute $A^{<3>}$? First we compute the inner product of $q^{<3>}$ and $k^{<1>}$ (its value reflects similarity); it tells us how relevant Jane (a person) is to what is happening in Africa. Then we compute the inner product of $q^{<3>}$ and $k^{<2>}$, which tells us how relevant visite is to what is happening in Africa. And so on for the other words in the sentence. The purpose of this operation is to gather the information that is most useful for computing $A^{<3>}$, the most useful representation of this word.

Similarly, to build intuition: if $k^{<1>}$ represents a person (Jane) and $k^{<2>}$ represents the second word, visite, which is an action, then you will find that the product of $q^{<3>}$ and $k^{<2>}$ is the largest, which means that visite provides the most relevant context for what is happening in Africa; in other words, Africa is the destination of a visit.
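
A minimal NumPy sketch of this computation for a single query, using the standard scaled dot-product attention formula (softmax of the scaled inner products, then a weighted sum of the Values); the dimensions are toy values, and A3 corresponds to $A^{<3>}$:

import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

d, n_words = 4, 5                      # embedding size, sentence length
X  = np.random.randn(n_words, d)       # word embeddings x^<1> .. x^<5>
Wq = np.random.randn(d, d)
Wk = np.random.randn(d, d)
Wv = np.random.randn(d, d)

Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Attention output for the third word: weighted sum of all Values,
# weighted by the similarity between q^<3> and every k^<t>
scores = K @ Q[2] / np.sqrt(d)         # inner products q^<3> . k^<t>, scaled
weights = softmax(scores)
A3 = weights @ V                       # A^<3>
print(weights, A3.shape)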


P192


The End 🎉



Finally, thank you for reading~


Source: blog.csdn.net/weixin_43800577/article/details/125718906