Neural networks and deep learning
Introduction to deep learning
What is a neural network
A neural network is a powerful learning algorithm inspired by the way the brain works.
Single-neuron network: given a one-dimensional input such as house size, fit a \(ReLU\) (rectified linear unit) function that maps size to house price.
Multi-feature neural network: given several input features, the hidden units are formed automatically; given only input/output data, the model learns the mapping by itself.
Supervised learning with neural networks
Problem | Network type |
---|---|
Real estate price prediction | Standard neural network |
Image recognition | Convolutional neural network \(CNN\) |
Audio, language | Recurrent neural network \(RNN\) |
Radar signal recognition | Complex hybrid neural network |
Structured data : databases or data tables, where each feature has a well-defined meaning.
Unstructured data : audio, images, text; the features may be the pixel values of an image or individual words within a text.
Why deep learning is taking off
- Traditional models hit a performance ceiling. They suit small datasets; once the amount of data passes a certain point, adding more no longer improves their performance significantly.
- The amount of data keeps growing, and traditional models cannot exploit huge datasets.
- The performance of a deep learning model depends on the amount of data: the more data a deep model learns from, the better it performs.
- New algorithms speed up computation, which gives birth to new ideas, which in turn keep driving computation speed forward.
Scale, of both the model and the data, drives the growth of deep learning performance.
Basics of neural networks
Binary classification
Example: given a 64 × 64 image, determine whether it is a cat, outputting 1 (cat) or 0 (not cat).
An image is generally stored in the computer as three 64 × 64 matrices, holding the pixel intensities of the red, green, and blue channels.
The three matrices are unrolled into a feature vector \(x\): reading the values of each matrix row by row produces one long 12288 × 1 vector (64 × 64 × 3 = 12288). We write \(n = 12288\) for the dimension of the feature vector.
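As a minimal sketch of the unrolling step (using random values as a stand-in for real pixel data):

```python
import numpy as np

# Stand-in for a 64x64 RGB image: three channels of pixel intensities.
image = np.random.rand(64, 64, 3)

# Unroll into a single column feature vector x of dimension n = 64*64*3 = 12288.
x = image.reshape(-1, 1)   # shape (12288, 1)
```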
The model's job in binary classification is therefore to predict, from the input feature vector \(x\), the output label \(y\): whether the image is a cat.
The training set generally contains \(m\) samples, where \((x^{(1)}, y^{(1)})\) denotes the first sample with input \(x^{(1)}\) and output \(y^{(1)}\). To distinguish the two sets, the training-set size is written \(m_{train}\) and the test-set size \(m_{test}\). The matrix \(X = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})\) collects all the feature vectors, so \(X\) is \(n \times m\): \(n\) is the feature dimension and \(m\) the number of training samples.
Logistic Regression
Compute \(\widehat{y} = P(y=1 \mid x) \in (0,1)\), where \(x\) is the feature vector. Given \(x\), the parameter vector \(w\) and the scalar parameter \(b\), the output is \(\widehat{y} = \sigma(w^Tx + b)\), where \(\sigma(z)\) is the \(Sigmoid\) function, which maps its input into (0,1):
\[\widehat{y} = \sigma(z) = \frac{1}{1+e^{-z}}, \quad z = w^Tx + b\]
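A minimal sketch of this forward computation in numpy; the parameter values here are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one input sample, shapes (n, 1) with n = 2.
w = np.array([[0.5], [-0.3]])
b = 0.1
x = np.array([[1.0], [2.0]])

z = np.dot(w.T, x) + b   # z = w^T x + b, shape (1, 1)
y_hat = sigmoid(z)       # prediction in (0, 1)
```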
Loss function : the loss function is defined on a single sample and measures how well the model does on that sample. Because the squared-error (\(L2\) norm) loss behaves poorly when later searching for the optimum with gradient descent (it makes the problem non-convex), \(Logistic\ Regression\) uses
\[L(\widehat{y}, y) = -(y\log\widehat{y} + (1-y)\log(1-\widehat{y}))\]
- If \(y = 1\): \(L(\widehat{y}, y) = -\log\widehat{y}\). To make \(L(\widehat{y}, y)\) as small as possible, \(\log\widehat{y}\) must be as large as possible. Since \(\widehat{y}\) is the result of the \(Sigmoid\) mapping, \(\widehat{y} \in (0,1)\), so the larger \(\widehat{y}\) is, the closer it is to the true value.
- Similarly, if \(y = 0\): \(L(\widehat{y}, y) = -\log(1-\widehat{y})\), so \(\log(1-\widehat{y})\) should be as large as possible, which means \(\widehat{y}\) should be as small as possible.
Cost function : the total cost given the parameters; it reflects how well the model does on the training samples overall.
\[J(w,b)=\frac{1}{m}\sum_{i=1}^mL(\widehat{y}^{(i)},y^{(i)})\]
- \(m\) is the number of training samples
- \(L(\widehat{y}^{(i)}, y^{(i)})\) is the loss on the \(i\)-th training sample
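A small sketch of the cost computation over \(m\) samples; the predictions and labels below are invented for illustration:

```python
import numpy as np

def cost(y_hat, y):
    """Average logistic loss over m samples; y_hat and y have shape (1, m)."""
    m = y.shape[1]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(losses) / m

# Illustrative predictions and labels for m = 3 samples.
y_hat = np.array([[0.9, 0.2, 0.8]])
y     = np.array([[1.0, 0.0, 1.0]])
J = cost(y_hat, y)
```

Note that the better the predictions match the labels, the smaller \(J\) becomes.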
Forward and backward propagation
The forward pass computes the output in the normal order; the backward pass computes derivatives in the reverse order, using the chain rule.
In code, \(dvar\) denotes the derivative of the final output with respect to the variable \(var\).
Logistic regression gradient descent
\[ \begin{array}{l}{z=w^{T} x+b} \\ {\hat{y}=a=\sigma(z)} \\ {\mathcal{L}(a, y)=-(y \log (a)+(1-y) \log (1-a))}\end{array} \]
Single sample : suppose there are only two features, with \(w = (w_1, w_2)^T\) and inputs \(w_1, x_1, w_2, x_2, b\). Then \(z = w_1x_1 + w_2x_2 + b\), then \(\widehat{y} = a = \sigma(z)\), and finally \(L(a, y)\) is computed. What \(Logistic\) regression does is adjust the values of \(w_1, w_2, b\) so that \(L(a, y)\) is minimized.
One gradient descent update step:
\[ \begin{array}{l}{w_{1}:=w_{1}-\alpha\, d w_{1}} \\ {w_{2}:=w_{2}-\alpha\, d w_{2}} \\ {b:=b-\alpha\, d b}\end{array} \]
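A minimal sketch of this loop for one sample with two features, using the standard logistic-regression derivatives \(dz = a - y\), \(dw_i = x_i\,dz\), \(db = dz\); the feature values and learning rate are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One illustrative training sample with two features.
x1, x2, y = 1.0, 2.0, 1.0
w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.1  # learning rate

for _ in range(100):
    # Forward pass.
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    # Backward pass (chain rule): dz = a - y, dw_i = x_i * dz, db = dz.
    dz = a - y
    dw1, dw2, db = x1 * dz, x2 * dz, dz
    # One gradient descent update step.
    w1 -= alpha * dw1
    w2 -= alpha * dw2
    b  -= alpha * db
```

After repeated updates the prediction \(a\) moves toward the label \(y = 1\).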
Vectorization
Vectorization means eliminating explicit \(for\) loops from the code. Vectorized computation can be roughly 300 times faster than the equivalent \(for\) loop, so avoid \(for\) loops whenever possible!
The numpy library has many built-in vectorized functions: vectorize \(w\) by calling \(np.zeros()\) in the program, and likewise vectorize the training set \(X\) and the bias \(b\).
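A small sketch contrasting the two styles on a dot product (the exact speedup depends on the machine, so none is asserted here):

```python
import numpy as np

n = 100_000
a = np.random.rand(n)
b = np.random.rand(n)

# Explicit for loop: one Python-level multiply-add per element.
total = 0.0
for i in range(n):
    total += a[i] * b[i]

# Vectorized: a single call into numpy's compiled code.
vectorized = np.dot(a, b)
```

Both compute the same value; the vectorized call is dramatically faster because the loop runs in compiled code rather than in the Python interpreter.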
Python broadcasting
- Do not use arrays of shape \((n,)\), i.e. rank-1 arrays. Instead declare a specific size with a statement such as \(a = np.random.randn(5,1)\).
- If you already have a rank-1 array of shape \((n,)\), convert it with reshape.
- Use assert statements freely to check matrix shapes and catch bugs early.
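The three points above can be sketched as:

```python
import numpy as np

# A rank-1 array: shape (5,), neither a row nor a column vector.
a = np.random.randn(5)
assert a.shape == (5,)

# Prefer declaring a specific size: a proper column vector.
b = np.random.randn(5, 1)
assert b.shape == (5, 1)

# Convert an existing rank-1 array with reshape.
c = a.reshape(5, 1)
assert c.shape == (5, 1)

# With a rank-1 array, np.dot(a, a) collapses to a scalar; with a column
# vector, the shapes behave as linear algebra predicts (an outer product).
outer = np.dot(b, b.T)
assert outer.shape == (5, 5)
```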
Shallow neural network
Neural network representation
A network consists of an input layer, hidden layers, and an output layer. The input layer is conventionally not counted as a layer; the layer count starts from the first hidden layer. The notation \(a^{[n]}_i\) denotes the result of the \(i\)-th node in layer \(n\) of the network.
Computing the network's output
By analogy with \(Logistic\) regression, which computes \(z = w^Tx + b\) and then \(a = \sigma(z) = \widehat{y}\), a neural network with a single hidden layer performs that computation once for each node \(a^{[1]}_i\): it computes \(z^{[1]}_i = w^{[1]T}_ix + b^{[1]}_i\) and then \(a^{[1]}_i = \sigma(z^{[1]}_i)\).
Activation function
\(Sigmoid\), \(tanh\), \(ReLU\), \(Leaky\ ReLU\)
Function name | Pros and cons |
---|---|
Sigmoid | Rarely used except for the output layer in binary classification; the (0,1) output range is not zero-centered, which is unhelpful for the next layer |
tanh | Performs very well, usually better than Sigmoid because its output is centered around zero |
ReLU | The default activation function; very friendly to gradient descent |
Leaky ReLU | Partially fixes ReLU's problem of zero gradient when \(z < 0\), but is rarely used |
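The four functions from the table, sketched in numpy (the 0.01 slope for Leaky ReLU is one common choice, not a fixed rule):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # output in (0, 1)

def tanh(z):
    return np.tanh(z)                      # output in (-1, 1), zero-centered

def relu(z):
    return np.maximum(0, z)                # zero for z < 0, identity for z > 0

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)   # small nonzero slope for z < 0
```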
Random initialization
The weight matrix should not be initialized with large values; scaling by 0.01 is generally appropriate. If the weights are too large, \(z\) becomes large and falls on the flat parts of the activation function, where the gradient is small and learning slows down.
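A minimal sketch of this initialization scheme; the layer sizes and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_x = 4, 3  # illustrative layer sizes

# Small random weights break the symmetry between hidden units while keeping
# z small, so sigmoid/tanh stay in their steep region and gradients flow.
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))  # biases can safely start at zero
```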
DNN
Technically, \(Logistic\) regression is a single-layer neural network.
Forward propagation still needs an explicit \(for\) loop over the layers.
The earlier layers recognize simple features and the later layers recognize more complex ones, e.g. face edges \(\rightarrow\) facial features \(\rightarrow\) faces, or phonemes \(\rightarrow\) words \(\rightarrow\) phrases \(\rightarrow\) sentences.
"Deep learning" sounds impressive, but it is just the new name for what used to be called neural networks with many hidden layers.
Hyperparameters: parameters that control the actual parameters.
There is no way to know the best hyperparameter values in advance; finding them takes repeated experiments and good intuition about hyperparameters. Tuning the learning rate and the other hyperparameters remains something of a black art...