[Andrew Ng Deep Learning] Neural networks and deep learning


Introduction to deep learning

What is a neural network

A neural network is a powerful learning algorithm, inspired by how the brain works.

Single-neuron network: given one-dimensional data such as house size, fit a \(ReLU\) (rectified linear unit) function that maps the size to the house price.

Multi-feature neural network: predictions are built from several input features; the hidden units are generated automatically, so given only a set of input/output data the model learns the mapping by itself.

Supervised learning with neural networks

  • Real estate price prediction: standard neural network
  • Image recognition: convolutional neural network \(CNN\)
  • Audio, language: recurrent neural network \(RNN\)
  • Radar signal recognition: complex hybrid neural network

Structured data: database-style records or sequences, where each feature has a clearly defined meaning.

Unstructured data: audio, images, text; the features may be the pixel values of an image or individual words within the text.

Why deep learning is taking off

  1. Traditional models hit a performance ceiling. They suit small datasets; once the amount of data grows past a point, adding more cannot significantly improve performance.
  2. The amount of data keeps increasing, and traditional models cannot handle huge datasets.
  3. A deep learning model's performance depends on the amount of data: the more data it learns from, the better it performs.
  4. New algorithms push computation speed forward, which spawns new ideas that in turn keep driving computation speed up.

Scale (of both the model and the data) drives the growth of deep learning performance.

Basics of neural networks

Binary classification

Example: given a 64 * 64 image, perform binary classification to determine whether it is a cat, outputting 1 (cat) or 0 (not cat).

An image is generally stored in the computer as three 64 * 64 matrices, corresponding to the pixel intensities of the Red, Green, and Blue channels.

The three matrices are mapped into a feature vector \(x\): the values of each matrix are read out row by row into one large 12288 * 1 vector (64 * 64 * 3 = 12288). \(n = 12288\) denotes the dimension of the feature vector.
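As a concrete illustration, here is a minimal numpy sketch of this unrolling (the random image simply stands in for real pixel data):

```python
import numpy as np

# A stand-in 64 * 64 RGB image: three stacked 64 * 64 channel matrices.
image = np.random.randint(0, 256, size=(64, 64, 3))

# Unroll the values row by row into a single 12288 * 1 feature vector.
x = image.reshape(-1, 1)
print(x.shape)  # (12288, 1), since 64 * 64 * 3 = 12288
```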

The goal of binary classification, then, is to learn a classifier that takes the input feature vector \(x\) and predicts whether the output label \(y\) says the image is a cat.

The training set is generally said to have \(m\) training samples, where \((x^{(1)}, y^{(1)})\) denotes the first sample with input \(x^{(1)}\) and output \(y^{(1)}\). To distinguish the two sets, the training-set size is written \(m_{train}\) and the test-set size \(m_{test}\). The matrix \(X = (x^{(1)}, x^{(2)}, ..., x^{(m)})\) collects all the feature vectors as columns, so \(X\) is \(n * m\), where \(n\) is the feature dimension and \(m\) is the number of training samples.
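A small sketch of this layout (the dimensions are assumed for illustration):

```python
import numpy as np

n, m = 12288, 100  # feature dimension and number of samples (assumed)
samples = [np.random.randn(n, 1) for _ in range(m)]

# Stack the m column vectors side by side: X has shape n * m.
X = np.hstack(samples)
print(X.shape)  # (12288, 100)
```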

Logistic Regression

Compute \(\widehat{y} = P(y=1|x) \in (0,1)\), where \(x\) is the feature vector. Given \(x\), the parameter \(w\) (also a vector), and the parameter \(b\), the output is \(\widehat{y} = \sigma(w^Tx + b)\), where \(\sigma(z)\) is the \(Sigmoid\) function, which maps its input into (0,1).

\[\widehat{y}=\sigma(z)=\frac{1}{1+e^{-z}} \quad z=w^Tx+b\]
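A minimal sketch of this forward computation, with \(w\), \(b\), and \(x\) filled in with assumed values:

```python
import numpy as np

def sigmoid(z):
    # Maps any real z into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

n = 4                      # feature dimension (assumed)
w = np.random.randn(n, 1)  # parameter vector
b = 0.0                    # bias parameter
x = np.random.randn(n, 1)  # one input feature vector

z = float(np.dot(w.T, x)) + b  # z = w^T x + b
y_hat = sigmoid(z)             # estimate of P(y = 1 | x)
print(y_hat)
```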

Loss function: the loss function is defined on a single sample and measures how well the model does on that one sample. Since the \(L2\) norm makes the subsequent gradient-descent search for optimal values perform poorly (the optimization problem becomes non-convex), \(Logistic\ Regression\) uses

\[L(\widehat{y}, y) = -(y\log\widehat{y} + (1-y)\log(1-\widehat{y}))\]

  • If \(y = 1\): \(L(\widehat{y}, y) = -\log\widehat{y}\). To make \(L(\widehat{y}, y)\) as small as possible, \(\log\widehat{y}\) must be as large as possible; since \(\widehat{y}\) is the output of the \(Sigmoid\) mapping, \(\widehat{y} \in (0,1)\), so the larger \(\widehat{y}\) is, the closer it is to the true value 1.
  • Similarly, if \(y = 0\): \(L(\widehat{y}, y) = -\log(1-\widehat{y})\); making \(\log(1-\widehat{y})\) as large as possible means \(\widehat{y}\) should be as small as possible.
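Both cases can be checked numerically with a tiny sketch (the prediction values are made up):

```python
import numpy as np

def loss(y_hat, y):
    # Cross-entropy loss on a single sample.
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(loss(0.9, 1))  # small loss: confident prediction matching the true label 1
print(loss(0.9, 0))  # large loss: confident prediction of the wrong label
```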

Cost function: the total cost as a function of the parameters, reflecting performance across all the training samples.

\[J(w,b)=\frac{1}{m}\sum_{i=1}^mL(\widehat{y}^{(i)},y^{(i)})\]

  • \(m\) is the number of training samples
  • \(L(\widehat{y}^{(i)}, y^{(i)})\) is the loss on the \(i\)-th training sample
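A vectorized sketch of this average over \(m\) samples (the labels and predictions are made up):

```python
import numpy as np

def cost(Y_hat, Y):
    # Average of the single-sample losses over all m training samples.
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

Y = np.array([[1, 0, 1, 1]])              # true labels, shape (1, m)
Y_hat = np.array([[0.9, 0.2, 0.7, 0.6]])  # model predictions
print(cost(Y_hat, Y))
```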

Forward and backward propagation

Forward propagation computes the outputs in the normal order; backward propagation computes the derivatives in the reverse order (chain rule).

In code, \(dvar\) denotes the derivative of the final output with respect to the variable \(var\).

Logistic regression gradient descent

\[ \begin{array}{l}{z=w^{T} x+b} \\ {\hat{y}=a=\sigma(z)} \\ {\mathcal{L}(a, y)=-(y \log (a)+(1-y) \log (1-a))}\end{array} \]

Single sample: suppose there are only two feature values, so \(w = (w_1, w_2)^T\) and the inputs are \(w_1, x_1, w_2, x_2, b\). Then \(z = w_1x_1 + w_2x_2 + b\), then \(\widehat{y} = a = \sigma(z)\), and finally \(L(a, y)\) is computed. What \(Logistic\) regression does is adjust the values of \(w_1, w_2, b\) so that \(L(a, y)\) is minimized.

One gradient-descent update step:
\[\begin{array}{l}{w_{1}:=w_{1}-\alpha\, dw_{1}} \\ {w_{2}:=w_{2}-\alpha\, dw_{2}} \\ {b:=b-\alpha\, db}\end{array}\]
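A minimal sketch of one such update on a single two-feature sample (the data values and learning rate are assumed; \(dz = a - y\) follows from applying the chain rule to the loss above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2, y = 1.0, 2.0, 1.0  # one training sample (assumed values)
w1, w2, b = 0.0, 0.0, 0.0  # initial parameters
alpha = 0.1                # learning rate (assumed)

# Forward pass.
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# Backward pass (chain rule): dz = a - y for the cross-entropy loss.
dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# One gradient-descent step.
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db
```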

Vectorization

Vectorization removes the explicit \(for\) loops that programs usually have. Vectorized computation is roughly 300 times faster than the equivalent \(for\) loop. Avoid \(for\) loops whenever possible!

The numpy library has many built-in vectorized functions. In code, \(w\) is vectorized by calling \(np.zeros()\), and the training set \(X\) and bias \(b\) are vectorized as well.
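A small timing sketch comparing an explicit loop with the vectorized call (the array size is arbitrary; the exact speedup depends on the machine):

```python
import time
import numpy as np

a = np.random.randn(1_000_000)
b = np.random.randn(1_000_000)

# Explicit for loop.
tic = time.time()
total = 0.0
for i in range(len(a)):
    total += a[i] * b[i]
loop_ms = (time.time() - tic) * 1000

# Vectorized dot product.
tic = time.time()
total_vec = np.dot(a, b)
vec_ms = (time.time() - tic) * 1000

print(f"loop: {loop_ms:.1f} ms, vectorized: {vec_ms:.1f} ms")
```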

Python broadcasting

  1. Do not use rank-1 arrays, whose shape is \((n,)\). Declare an explicit size instead, e.g. \(a = np.random.randn(5, 1)\), as in the sketch after this list.
  2. If you already have a rank-1 array of shape \((n,)\), convert it with reshape.
  3. Use assert statements liberally to check matrix shapes and catch bugs early.
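A short sketch of the rank-1 pitfall and the recommended fixes:

```python
import numpy as np

a = np.random.randn(5)     # rank-1 array, shape (5,): avoid this
print(a.shape)             # (5,)

a = a.reshape(5, 1)        # convert it into an explicit column vector
assert a.shape == (5, 1)   # assertions catch shape bugs early

b = np.random.randn(5, 1)  # better: declare the intended shape directly
assert b.shape == (5, 1)
```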

Shallow neural network

Neural network representation

Input layer, hidden layers, and output layer; the input layer is generally not counted as an official layer, so the number of layers is counted starting from the hidden layers. The notation \(a^{[n]}_i\) denotes the output of the \(i\)-th node in the \(n\)-th layer of the network.

Computing a neural network's output

Analogous to the \(Logistic\) regression computation: \(Logistic\) regression computes \(z = w^Tx + b\) and then \(a = \sigma(z) = \widehat{y}\); a single-hidden-layer neural network repeats that computation once per node, so each node \(a^{[1]}_i\) computes \(z^{[1]}_i = w^{[1]T}_i x + b^{[1]}_i\).
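A minimal sketch of this per-node computation done for all hidden nodes at once (the layer sizes are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h = 3, 4                        # input size and hidden units (assumed)
x = np.random.randn(n_x, 1)

W1 = np.random.randn(n_h, n_x) * 0.01  # row i holds w^{[1]T}_i
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h) * 0.01
b2 = np.zeros((1, 1))

# Layer 1: every node i computes z^{[1]}_i = w^{[1]T}_i x + b^{[1]}_i at once.
Z1 = np.dot(W1, x) + b1
A1 = np.tanh(Z1)

# Output layer: same pattern, with sigmoid for the binary label.
Z2 = np.dot(W2, A1) + b2
y_hat = sigmoid(Z2)
```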

Activation function

\(Sigmoid\), \(tanh\), \(ReLU\), \(Leaky\ ReLU\)

  • Sigmoid: rarely used except for binary-classification output layers; its output interval (0,1) is not zero-centered, which is unfavorable for the data passed to the next layer
  • tanh: performs very well; it is symmetric about the origin and usually works better than sigmoid
  • ReLU: the default activation function; very friendly to gradient descent
  • Leaky ReLU: partially compensates for ReLU's zero gradient when \(z < 0\), but rarely used
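Minimal numpy definitions of these four functions (the 0.01 Leaky ReLU slope is a common but assumed choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # Keep a small positive slope for z < 0 instead of a flat zero.
    return np.where(z > 0, z, slope * z)
```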

Random initialization

The weight matrix should not be initialized too large; a scale of 0.01 is generally appropriate. A weight matrix that is too large makes \(z\) too large, landing on the flat parts of the activation function where the gradient is small and learning slows down.
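A common sketch of this rule for one layer (the layer sizes are assumed):

```python
import numpy as np

n_x, n_h = 3, 4  # layer sizes (assumed)

# Small random weights keep z small, away from the flat parts of
# sigmoid/tanh where the gradient vanishes and learning slows down.
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))  # the biases can safely start at zero
```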

DNN

Technically, \(Logistic\) regression is a single-layer neural network.

Forward propagation still has to use a \(for\) loop over the layers; this one cannot be avoided.
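A sketch of that unavoidable loop for an \(L\)-layer forward pass (the layer sizes, ReLU hidden activations, and sigmoid output are all assumed):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

layer_sizes = [3, 4, 4, 1]  # n^[0] .. n^[L] (assumed)
params = {}
for l in range(1, len(layer_sizes)):
    params["W" + str(l)] = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * 0.01
    params["b" + str(l)] = np.zeros((layer_sizes[l], 1))

A = np.random.randn(layer_sizes[0], 1)  # input feature vector x
L = len(layer_sizes) - 1
# The explicit loop over layers: layer l needs layer l-1's output first.
for l in range(1, L + 1):
    Z = np.dot(params["W" + str(l)], A) + params["b" + str(l)]
    A = relu(Z) if l < L else sigmoid(Z)
print(A)  # the network's output \widehat{y}
```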

The early layers recognize simple features and the later layers recognize more complex ones, e.g. face edges \(\rightarrow\) facial features \(\rightarrow\) faces, or phonemes \(\rightarrow\) words \(\rightarrow\) phrases \(\rightarrow\) sentences.

"Deep learning" sounds intimidating, but it is just a new name for what used to be called neural networks with many hidden layers..

Hyperparameters: the parameters that control the actual parameters.

There is no way to know the best hyperparameter values in advance; it takes many experiments and intuition to tune them.. The learning rate and other hyperparameters are basically black magic...


Source: www.cnblogs.com/ColleenHe/p/11704342.html