Andrew Ng Deep Learning Course Notes - 1

Week 2: Neural Network Basics

2.1 Binary Classification

Simply put, the task is to decide which of two categories an input belongs to.

For example, input: a color image (three matrices for the three RGB channels, unrolled into a feature vector); output: whether it is a cat or not (1 or 0).

Notation

A training sample: \( (x, y), x \in R^n, y \in \{0, 1\} \)

A data set of m samples: \( \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\} \)

For convenience, we commonly use the matrix representation \( \boldsymbol{X} = (x^{(1)}, \dots, x^{(m)}) \), i.e., each column is one feature vector, and call \( \boldsymbol{X} \in R^{n \times m} \) the data matrix.

Similarly, \( \boldsymbol{Y} = (y^{(1)}, \dots, y^{(m)}), Y \in R^{1 \times m} \)

2.2 Logistic regression

Not much to say here: given an input feature vector, we want to output the probability of the positive class, i.e., \( \hat{y} = P(y = 1 \mid x) \). There are many ways to achieve this, depending on the model we choose. In logistic regression we use \( \hat{y} = \sigma(w^T x + b) \), where \( \sigma(z) = \frac{1}{1 + e^{-z}} \).
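A minimal numpy sketch of this model (my own illustration; the variable names and toy numbers are not from the course):

```python
import numpy as np

def sigmoid(z):
    """Map any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Logistic regression output: estimated P(y = 1 | x)."""
    return sigmoid(np.dot(w.T, x) + b)

# Toy usage with made-up numbers
w = np.array([[0.2], [-0.5]])   # weights, shape (n, 1)
b = 0.1                         # bias, a scalar
x = np.array([[1.0], [2.0]])    # one feature vector, shape (n, 1)
print(predict_proba(w, b, x))   # a probability in (0, 1)
```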

2.3 Logistic regression loss function

Here Andrew Ng distinguishes between the loss function and the cost function.

The loss function (same as the error function) generally refers to a single sample. For example, the mean squared error commonly used in regression: \( L(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2 \), and the cross-entropy commonly used in binary classification: \( L(\hat{y}, y) = -y \ln \hat{y} - (1 - y) \ln (1 - \hat{y}) \);

The cost function generally refers to the entire data set: \( J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) \)

The benefit of choosing cross-entropy as the loss function for binary classification is that the resulting optimization problem is convex; with mean squared error this is not guaranteed.
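As a sketch of these two definitions in numpy (my own code, with made-up toy values for y_hat and y):

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Loss on a single sample: -y*ln(y_hat) - (1-y)*ln(1-y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def cost(y_hat, y):
    """Cost over the whole data set: average of the per-sample losses."""
    m = y.shape[1]
    return np.sum(cross_entropy_loss(y_hat, y)) / m

# y_hat and y are row vectors of shape (1, m)
y_hat = np.array([[0.9, 0.2, 0.7]])
y     = np.array([[1,   0,   1  ]])
print(cost(y_hat, y))
```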

2.4 Gradient descent

This section is also very simple. Solving most machine learning problems eventually comes down to an optimization problem, i.e., finding an extremum. Some optimization problems have a closed-form solution, such as linear regression, while others do not, such as logistic regression. The remedy is an iterative optimization algorithm; gradient descent is the most common one, and it is a first-order method.

The gradient at a point is a vector whose direction is the direction in which the function value grows fastest at that point. Since the problem we want to solve is usually a minimization problem, we just step along the negative gradient direction with a suitable step size, and eventually we reach a minimum of the function (for a convex optimization problem, the global minimum). From this description, the result of gradient descent depends on the step size and the initial position.
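A small sketch of the idea on a made-up one-dimensional function (the function, learning rate, and starting point are all assumptions for illustration):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step in the negative gradient direction."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # close to 3; the result depends on lr and x0
```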

2.5 & 2.6 Derivatives

These two sections are where Andrew Ng supplements the class with some calculus (really just derivatives). Skipped ~

2.7 & 2.8 Computation graphs

A complex computation on the input variables can be represented as a computation graph, in which each node represents an intermediate variable. This makes the chain rule easier to see when back-propagating the gradient.

 

Since what we care about is always the derivative of the final output with respect to the various intermediate variables, for a more compact notation (also convenient in code), \( \frac{dJ}{dv} \) is written as \( dv \), and similarly for the others.
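As a toy illustration of this notation (my own numbers, in the spirit of the lecture's small computation graph), take \( J = 3(a + bc) \) with intermediate nodes \( u = bc \) and \( v = a + u \):

```python
# Forward pass on the toy graph J = 3 * (a + b*c)
a, b, c = 5.0, 3.0, 2.0
u = b * c          # intermediate node
v = a + u          # intermediate node
J = 3 * v          # final output

# Backward pass: dX below means dJ/dX, following the notation in the text
dv = 3.0           # dJ/dv
du = dv * 1.0      # dJ/du = dJ/dv * dv/du
da = dv * 1.0      # dJ/da
db = du * c        # dJ/db = dJ/du * du/db
dc = du * b        # dJ/dc
print(J, da, db, dc)   # 33.0 3.0 6.0 9.0
```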

2.9 & 2.10 Gradient descent for logistic regression

This section derives gradient descent for logistic regression on a single sample. It is really just an example of computing a gradient with the chain rule, and it can also be worked out on the computation graph.

 

The derivative of the sigmoid function has a very simple form; combined with the cross-entropy loss function, we can see that

\( \frac{dL(a, y)}{dz} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) \times a(1-a) = a - y \)

Here \( a \) is actually \( \hat{y} \), so this works out very nicely.

The above is for a single sample. For the gradient of the cost function over the entire training set, accumulate the gradients of the m samples and average them. Gradient descent then has to traverse the whole training set on every update; in code we can use vectorization instead of a for loop.
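A sketch of this non-vectorized version, accumulating the per-sample gradients in a for loop (variable names are mine; the vectorized code below replaces exactly this loop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_for_loop(w, b, X, Y):
    """Non-vectorized gradients; X has shape (n, m), Y has shape (1, m)."""
    n, m = X.shape
    dw = np.zeros((n, 1))
    db = 0.0
    for i in range(m):
        x_i = X[:, i:i+1]                  # i-th sample as an (n, 1) column
        z_i = np.dot(w.T, x_i).item() + b  # scalar z for this sample
        a_i = sigmoid(z_i)                 # scalar prediction
        dz_i = a_i - Y[0, i]               # dL/dz for this sample
        dw += x_i * dz_i                   # accumulate dL/dw = x * dz
        db += dz_i                         # accumulate dL/db = dz
    return dw / m, db / m                  # average over the m samples
```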

2.11 & 2.12 Vectorization

Replacing explicit for loops with vectorized computations (mostly matrix multiplications) greatly improves computation speed: libraries such as numpy in Python exploit the parallel computing capability of the CPU or GPU, so the for loop is no longer needed. For example, use numpy to replace various loop computations:

for + multiply-accumulate → matrix multiplication (np.dot)

for + math.exp() → np.exp()

There are also np.abs(), np.log(), np.maximum(), v ** 2 (for an ndarray v), and so on.
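A couple of sketches of this replacement with toy data (not from the course):

```python
import math
import numpy as np

v = np.random.rand(1_000_000)

# explicit for loop
u = np.zeros(v.shape)
for i in range(len(v)):
    u[i] = math.exp(v[i])

# vectorized: one call into optimized native code
u = np.exp(v)

# other element-wise helpers mentioned above
np.abs(v), np.log(v), np.maximum(v, 0), v ** 2
```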

2.13 & 2.14 Vectorizing logistic regression

Now apply the vectorization techniques introduced above to logistic regression, for example computing Z:

\( Z = w^T X + [b, b, \dots, b] \)

In Python this is a single expression:

Z = np.dot(w.T, X) + b

Note that b is just a real number, but in the computation Python automatically expands it into a vector; this is Python's broadcasting mechanism.

Here is the complete vectorized implementation of one iteration of logistic regression gradient descent:

Z = np.dot(w.T, X) + b     # forward pass: row of logits, shape (1, m)
A = sigmoid(Z)             # predictions for all m samples
dZ = A - Y                 # dL/dZ for all m samples at once
dw = np.dot(X, dZ.T) / m   # average gradient w.r.t. w
db = np.sum(dZ) / m        # average gradient w.r.t. b
w = w - alpha * dw         # gradient descent update (alpha = learning rate)
b = b - alpha * db

In fact, np.dot() in numpy can also be written with the @ operator, e.g. Z = w.T @ X + b.

2.15 Broadcasting in Python

Note that the matrix and vector operations here are all based on numpy arrays (ndarray).

Broadcasting simply means that when the two arrays involved in +, -, *, / do not have the same shape, Python automatically expands (copies) one array to the shape of the other and then performs the operation element-wise. Of course, this only works within certain limits, which I summarize as follows:

  • If the two arrays have the same number of dimensions, then each dimension must either be the same in both or be 1 in one of them;
  • If the two arrays have different numbers of dimensions, then their trailing dimensions must be the same.

For example, array A has shape (2, 3, 4), array B has shape (1, 3, 1), and array C has shape (1, 4). A has 3 dimensions, of sizes 2, 3, 4. A and B have the same number of dimensions, and each dimension is either equal or 1, so A + B and similar operations work directly; A and C have different numbers of dimensions, but their trailing dimensions are the same, so A + C also works directly.

Of course, in practice things are rarely this complicated. We mostly use two-dimensional arrays, e.g. A with shape (m, n), B with shape (m, 1), C with shape (1, n); then A + B, A + C and similar operations work directly.
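A quick check of these cases in numpy (shapes as described above, values made up):

```python
import numpy as np

A = np.ones((2, 3, 4))   # 3 dimensions, of sizes 2, 3, 4
B = np.ones((1, 3, 1))   # same number of dimensions; each is equal or 1
C = np.ones((1, 4))      # fewer dimensions, but trailing dimensions match
print((A + B).shape)     # (2, 3, 4)
print((A + C).shape)     # (2, 3, 4)

# The common 2-D cases
A2 = np.ones((5, 3))                 # shape (m, n)
print((A2 + np.ones((5, 1))).shape)  # (5, 3): column vector broadcast
print((A2 + np.ones((1, 3))).shape)  # (5, 3): row vector broadcast
```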

In this lesson Andrew Ng did some live coding, and there was still a small issue:

import numpy as np

A = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
])
cal = A.sum(axis=0)                 # column sums; shape (4,)
per = 100 * A / cal.reshape(1, 4)   # each entry as a percentage of its column sum

Actually, after computing the sum, the shape of cal is (4,); after reshape it is (1, 4). Although the two look similar, they are not the same thing: for example, a (4,) array cannot be transposed (transposing it changes nothing).
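A small demonstration of the difference, reusing the array A from the snippet above:

```python
import numpy as np

A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])
cal = A.sum(axis=0)
print(cal.shape)                  # (4,)  -- a rank-1 array, not a matrix
print(cal.T.shape)                # (4,)  -- transposing a rank-1 array changes nothing
print(cal.reshape(1, 4).shape)    # (1, 4)
print(cal.reshape(1, 4).T.shape)  # (4, 1) -- now transposition behaves as expected
```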

From my personal experience, it is best to represent all vectors as two-dimensional arrays, i.e., matrices with one row or one column such as (1, 4); this unified form makes later matrix operations convenient. However, matrix operations often reduce the dimensionality of the result: you expect a (1, 4) matrix but get a (4,) array, and if you are not careful, bugs may show up later. As Andrew Ng teaches, if you are confused about shapes, use reshape more often to make sure the result has the shape you want; the overhead of reshape is small anyway.

2.16 A note on python/numpy vectors

Uh... the awkward problem I mentioned above gets a dedicated explanation from Andrew Ng in this section. Clearly, if I had watched this tutorial earlier, I would not have been burned so many times figuring it out on my own. Sad!

The teacher's points are clear:

  • When representing vectors, do not use (n,) arrays; use (n, 1) or (1, n) matrices instead;
  • Make liberal use of assert a.shape == (n, 1) to check shapes;
  • Use a = a.reshape((n, 1)) often to make sure a vector has the shape you want;

assert and reshape have very little overhead; these two are really useful, so use them!!
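A sketch of the recommended pattern (the names and shapes are placeholders):

```python
import numpy as np

a = np.random.randn(5)     # accidentally a rank-1 array of shape (5,)
a = a.reshape(5, 1)        # force it into an explicit column vector
assert a.shape == (5, 1)   # cheap sanity check; fails loudly if wrong
```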

2.17 Quick tour of Jupyter/iPython notebooks

If you follow the course on Coursera, you need to use this to submit the assignments. Jupyter is really nice, but I'm not used to it; I still prefer writing code in VSCode.

 


Source: www.cnblogs.com/tofengz/p/12228548.html