Computer vision & deep learning: organized notes

1. Computer Vision

1.1 Development history

  1. Deep learning development history for image classification (see figure).
  2. Classification models and their accuracy (see figure).
  3. LeNet
    Serial structure: 2 convolutional layers and 3 fully connected layers; first used for handwritten digit recognition.
  4. AlexNet (2012 ImageNet champion)
  • Serial structure: 5 convolutional layers and 3 fully connected layers; multiple small convolutions instead of a single large convolution;
  • Uses the ReLU activation function to alleviate the vanishing-gradient problem;
  • Introduces dropout to avoid model overfitting;
  • Uses max pooling;
  1. ZF-Net (2013 ImageNet champion)
  • Uses a dense connection structure on a single GPU;
  • Changes AlexNet's first-layer convolution kernel from 11×11 to 7×7, and the stride from 4 to 2.
  1. VGG-Net (2nd place in the 2014 ImageNet classification challenge)
  • A deeper network; convolutional layers use smaller filter sizes and strides;
  • Multiple small convolutions make the network more nonlinear while using fewer parameters.
  1. GoogLeNet (1st place in the 2014 ImageNet classification challenge)
  • Introduces the Inception module: convolution kernels of different sizes correspond to receptive fields of different sizes, and concatenating their outputs fuses features at different scales.
  • Average pooling is used to replace the fully connected layer;
  • To mitigate vanishing gradients, the network adds 2 auxiliary softmax classifiers that help propagate gradients to earlier layers.
  1. ResNet:
    (1) Introduces the residual unit, which simplifies the learning objective and difficulty, speeds up training, and avoids the degradation problem when the model is deepened; it also effectively mitigates vanishing and exploding gradients during training.
    (2) Second version (v2): BN and ReLU are applied before the convolution. The advantages: backpropagation better matches the underlying assumptions and information flows unimpeded; the BN layer acts as a pre-activation and provides a regularization effect.
  2. DenseNet: dense connections; strengthens feature propagation, encourages feature reuse, and greatly reduces the number of parameters.
  3. Inception: parallel branches, proposed by Google.
  4. Inception v1 – GoogLeNet

Why not run filters of multiple sizes at the same level? Doing so makes the network slightly wider rather than deeper; this is the idea behind the proposed Inception module.

(1) GoogLeNet adopts a modular structure with 9 Inception modules and 22 layers in total;
(2) To mitigate vanishing gradients, the network adds 2 auxiliary softmax classifiers that help propagate gradients to earlier layers.
(3) Convolution kernels of different sizes correspond to receptive fields of different sizes, and the final concatenation fuses features at different scales.
(4) The common convolutions in CNNs (1×1, 3×3, 5×5) and a pooling operation (3×3) are stacked together (their outputs keep the same spatial size, and the channels are concatenated). On the one hand this increases the width of the network; on the other hand it increases the network's adaptability to scale. However, in the original (naive) version of Inception, all convolutions operate directly on the output of the previous layer, so the 5×5 convolutions require too much computation (roughly 120 million operations), and the resulting feature maps are very thick (have many channels).
(5) To reduce the amount of computation, 1×1 convolutions are added before the 3×3 and 5×5 convolutions to reduce the number of channels, forming the Inception v1 structure (a sketch follows below).
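
As an illustration of point (5), here is a minimal sketch of an Inception-v1-style module written with the Keras functional API; the filter counts and input shape are placeholders, not necessarily the values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    """Four parallel branches whose outputs are concatenated along the channel axis."""
    # 1x1 branch
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)
    # 1x1 reduction followed by 3x3
    b2 = layers.Conv2D(f3_reduce, 1, padding='same', activation='relu')(x)
    b2 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b2)
    # 1x1 reduction followed by 5x5
    b3 = layers.Conv2D(f5_reduce, 1, padding='same', activation='relu')(x)
    b3 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b3)
    # 3x3 max pooling followed by a 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding='same')(x)
    b4 = layers.Conv2D(pool_proj, 1, padding='same', activation='relu')(b4)
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)   # placeholder filter counts
model = tf.keras.Model(inputs, outputs)
```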


  1. Inceptionv2: Three optimizations

One is to add BN (Batch Normalization) at the input of each layer:
(1) All layer outputs are kept in a stable, normalized range rather than drifting during training.
(2) The normalized outputs have a mean close to 0 and a standard deviation close to 1, i.e. an approximately normal distribution, so they fall into the sensitive region of the activation function, which avoids vanishing gradients and speeds up convergence. In a sense BN also acts as regularization: it standardizes each mini-batch internally so that the outputs approach the normal distribution N(0,1), which speeds up network training and also allows a larger learning rate.
(3) It accelerates model convergence and provides some generalization ability.
(4) Dropout can be reduced or removed, simplifying the network structure. Inception v2 reaches the accuracy of v1 14 times faster, and its final converged accuracy is also higher than v1's.

The second is to replace each 5×5 convolution kernel with two 3×3 kernels to reduce the number of parameters: 25 weights become 2 × 3 × 3 = 18.
With this convolution factorization, the 5×5 convolutional layer is replaced by a small network of two consecutive 3×3 convolutional layers; the receptive field is preserved, the number of parameters is reduced, and the network is also deepened.

The third is to factorize an n×n convolution into two convolutions of size 1×n and n×1.
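
A rough back-of-the-envelope check of the savings from these factorizations (a sketch; C = 192 input/output channels is an arbitrary example, and biases are ignored):

```python
# Parameter counts for one layer with C input channels and C output channels (biases ignored)
C = 192
p_5x5     = 5 * 5 * C * C            # single 5x5 convolution
p_two_3x3 = 2 * (3 * 3 * C * C)      # two stacked 3x3 convolutions, same receptive field
p_1x7_7x1 = 2 * (7 * 1 * C * C)      # 1x7 followed by 7x1 (n = 7 factorization)
print(p_5x5, p_two_3x3, p_1x7_7x1)   # 921600 663552 516096
```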

  1. Inception V3
    integrates all the upgrades mentioned for Inception v2, and additionally factorizes the 7×7 convolutions.

(1) Factorizing into small convolutions is very effective. A larger two-dimensional convolution is split into two smaller one-dimensional convolutions (e.g. 7×7 is split into 7×1 and 1×7). On the one hand this saves a large number of parameters, speeds up computation and reduces overfitting; at the same time it deepens the network further and increases its nonlinearity.
(2) The structure of the Inception module is optimized. From input to output, the convolutional network gradually reduces the spatial size of the feature maps and gradually increases the number of output channels, i.e. it structures the space and transforms spatial information into high-level abstract feature information (common to all Inception variants). The Inception module's idea of using multiple branches to extract high-level features at different levels of abstraction is very effective and enriches the expressive power of the network.

  1. InceptionV4

Use Residual Connection to improve the V3 structure.

  1. Advantages of Inception:

1. Uses 1×1 convolution kernels, which are cost-effective: they add a layer of feature transformation and nonlinearity with very little computation.
2. Proposes Batch Normalization: the input distribution of each layer's neurons is pulled toward a normal distribution with mean 0 and variance 1, so that it falls into the sensitive region of the activation function, which avoids vanishing gradients and accelerates convergence.
3. Introduces the Inception module, a structure combining four branches.

  1. MobileNet:https://www.guyuehome.com/37658

MobileNet is an efficient model proposed for mobile and embedded devices. MobileNet is based on a streamlined architecture and uses depthwise separable convolutions (the structure also used in the Xception variant) to build a lightweight deep neural network. A width multiplier α is used to control the number of input and output channels, and a resolution multiplier ρ controls the input resolution. As an example of depthwise separable convolution, a standard convolution with kernel shape (4, 4, 3, 5) (width, height, input channels, output channels) is decomposed into:
(1) Depthwise part: kernel shape (4, 4, 1, 3), applied to each input channel separately; suppose the output feature map is (3, 3, 3).
(2) Pointwise part: kernel shape (1, 1, 3, 5), applied to the output feature map of the depthwise convolution; the final output is (3, 3, 5).
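
A minimal Keras sketch of one depthwise separable block in the MobileNet style (the BN/ReLU placement follows the common pattern; the kernel size, input shape and channel counts are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, out_channels, stride=1):
    # Depthwise part: one 3x3 filter per input channel, no cross-channel mixing
    x = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Pointwise part: 1x1 convolution mixes channels and sets the output depth
    x = layers.Conv2D(out_channels, 1, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = tf.keras.Input(shape=(224, 224, 32))
outputs = depthwise_separable_block(inputs, out_channels=64)
model = tf.keras.Model(inputs, outputs)
```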

  1. ShuffleNet: designed for mobile devices with very limited computing power; it mainly introduces two operations: pointwise group convolution (to reduce computational cost) and channel shuffle (to help information flow across groups)

1.2 Target detection

  1. R-CNN pipeline:

    (1) Pre-training: train a classification network on ImageNet.
    (2) Use selective search to generate candidate regions (region proposals).
    (3) Resize each candidate region to the CNN input size.
    (4) Fine-tuning: fine-tune the CNN on your own training dataset, treated as a classification problem with K+1 categories, where K is the number of target categories of interest and 1 is the background category. Fine-tuning uses a smaller learning rate and oversamples the positive samples (most candidate regions from selective search are background).
    (5) After fine-tuning, remove the last classification layer of the CNN; pass each candidate region through the CNN and output it as a feature vector.
    (6) Use the feature vectors to train a binary SVM classifier for each category (positive samples are regions whose IoU with the ground-truth box is greater than or equal to 0.6; the rest are negative samples).
    (7) To reduce the localization error of the selective search proposals, use a regression model to predict refined locations.

  2. BBox (bounding box) regression
    Loss function: the output of the regression model is $d_{i}(p)$, where $p=(p_{x},p_{y},p_{w},p_{h})$ gives the center coordinates, width and height of the Selective Search candidate region, and $g=(g_{x},g_{y},g_{w},g_{h})$ gives the center coordinates, width and height of the ground-truth box.
    $L_{reg}=\sum_{i\in \{x,y,w,h\}}\left ( t_{i}-d_{i}(p) \right )^{2}+\lambda \left \| w \right \|^{2}$
    $t_{x}=\frac{g_{x}-p_{x}}{p_{w}}$
    $t_{y}=\frac{g_{y}-p_{y}}{p_{h}}$
    $t_{w}=\log\left ( \frac{g_{w}}{p_{w}} \right )$
    $t_{h}=\log\left ( \frac{g_{h}}{p_{h}} \right )$
    After training, the predicted center coordinates, width and height of the target are:
    $\hat{g}_{x}=p_{w}d_{x}(p)+p_{x}$
    $\hat{g}_{y}=p_{h}d_{y}(p)+p_{y}$
    $\hat{g}_{w}=p_{w}\exp(d_{w}(p))$
    $\hat{g}_{h}=p_{h}\exp(d_{h}(p))$

The regularization weight λ is determined by cross-validation.
Not all candidate regions output by Selective Search contain real targets; such regions do not need to be included in the computation of the bbox regression.
R-CNN only includes candidate regions whose IoU with a ground-truth box is greater than or equal to 0.6 in the bbox regression computation.
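
A small NumPy sketch of the regression targets and the inverse transform defined above (boxes are given as center x, center y, width, height; the example numbers are arbitrary):

```python
import numpy as np

def bbox_targets(p, g):
    """Regression targets t from a proposal p and a ground-truth box g (cx, cy, w, h)."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return np.array([(gx - px) / pw, (gy - py) / ph, np.log(gw / pw), np.log(gh / ph)])

def apply_deltas(p, d):
    """Inverse transform: predicted box g_hat from proposal p and model outputs d(p)."""
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    return np.array([pw * dx + px, ph * dy + py, pw * np.exp(dw), ph * np.exp(dh)])

p = np.array([50.0, 50.0, 20.0, 40.0])
g = np.array([54.0, 48.0, 24.0, 36.0])
t = bbox_targets(p, g)
print(t, apply_deltas(p, t))   # the inverse transform recovers g
```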

  1. Fast R-CNN

Fast R-CNN integrates the following three independent modules of R-CNN to reduce computation:
CNN: extract image features
SVM: target classification and recognition
Regression model: localization

Instead of extracting features with the CNN independently for each candidate region, the entire image is passed through the CNN once, and region features are then extracted from the CNN feature map by the RoI Pooling layer according to the Selective Search candidate regions.
Loss function: (see figure)
Summary of Fast R-CNN:

 - Since the image passes through the CNN only once, rather than each candidate region passing through the CNN independently, the amount of computation is reduced.
 - The multiple SVM classifiers of R-CNN are merged into a single DNN, so classification and localization can be trained jointly.
 - However, it still relies on Selective Search to generate candidate regions.
  1. Improvements of Faster R-CNN:
    (1) Remove Selective Search and integrate the selection of candidate regions into the deep learning network model (the Region Proposal Network (RPN) combined with Fast R-CNN).
    (2) RPN: a sliding window with k anchor boxes per position; each box is classified into 2 categories (object / background), giving 2k classification outputs and 4k localization outputs per position.

    Faster R-CNN steps:
    (1) Pre-train a CNN for classification.
    (2) Using the CNN's feature map, fine-tune the RPN + CNN end to end. Anchors with IoU > 0.7 are positive samples, and anchors with IoU < 0.3 are negative samples.
    A sliding window is applied on the feature map.
    For each sliding-window position, multiple anchors are generated (playing the role of the candidate regions produced by Selective Search). An anchor is determined by the center position of the sliding window, the window size, and the window aspect ratio. The paper uses 3 sizes and 3 aspect ratios, so each sliding-window position corresponds to 9 (3×3) anchors (see the sketch below).
    (3) Fix the weights of the RPN and use the current RPN to train a Fast R-CNN.
    (4) Fix the weights of the CNN and Fast R-CNN, and train the RPN.
    (5) Fix the CNN and RPN, and train the weights of Fast R-CNN.
    (6) Repeat steps 4 and 5 until satisfied.
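
A small sketch of anchor generation for one sliding-window position; the scales and the width/height convention here are illustrative assumptions, not the exact values of every implementation:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 anchors (3 scales x 3 aspect ratios) centered at (cx, cy) as (cx, cy, w, h)."""
    anchors = []
    for s in scales:
        for r in ratios:            # r is treated here as height / width
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)      # each anchor keeps the same area s*s
            anchors.append((cx, cy, w, h))
    return np.array(anchors)

print(make_anchors(8, 8).shape)     # (9, 4)
```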

  2. SSD network structure:
    (1) A network without fully connected layers;
    (2) borrows from the VGG16 network model;
    (3) replaces the first and second fully connected layers of VGG16 with convolutional layers;
    (4) removes the last fully connected layer;
    (5) adds 4 groups of convolutional layers (Conv8, Conv9, Conv10, Conv11);
    (6) the output feature map of Conv4 is used to detect the smallest objects;
    (7) the output feature map of Conv11 is used to detect the largest objects.
    SSD classification loss function: softmax.
    SSD's bounding box predictions use relative (offset) values.

     Smooth L1 loss: it combines the advantage of L1 (as x grows, the gradient stays constant) with the advantage of L2 (when x is very small, the gradient shrinks quickly, reducing oscillation); a minimal sketch is given below.
    

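A minimal NumPy sketch of the smooth L1 loss described above (beta = 1 is the usual default):

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Quadratic (L2-like) near zero, linear (L1-like) for large |x|."""
    absx = np.abs(x)
    return np.where(absx < beta, 0.5 * absx ** 2 / beta, absx - 0.5 * beta)

print(smooth_l1(np.array([-3.0, -0.2, 0.0, 0.5, 2.0])))
```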

  1. How to determine the width, height and center of the anchors (default boxes):
    (figures: default-box scale, width/height and center formulas)
    In the matching indicator, i indexes the anchor (default box) and j indexes the ground-truth object bounding box.

  2. The number of filters in the prediction layer:

    • The width and height of the prediction layer equal the width and height of the input feature map.
    • For each pixel of the feature map, 4 values encoding position information must be predicted, plus c values for the class probabilities, where c is the total number of categories (including the background category).
    • Each convolution kernel is a 3×3×p tensor, where p is the number of channels of the input feature map.
    • If k anchor boxes are placed at each location of the feature map, then (c+4)k convolution kernels are applied, producing (c+4)kmn outputs for an m×n feature map.
  3. VGG16: not counting max pooling, the Conv and FC layers add up to 16 layers.

  4. Dilated (atrous) convolution
    (1) Atrous convolution is also called dilated convolution.
    (2) SSD's base network VGG16 is pre-trained on the ILSVRC classification data; FC6 and FC7 are replaced by atrous convolutional layers, and pool5 is changed from 2×2-s2 to 3×3-s1.
    (3) Atrous convolution can enlarge the receptive field with fewer parameters: for the same receptive field, the dilated kernel needs fewer weights, which saves memory.
    (4) Moreover, the effectiveness of atrous convolution rests on an assumption: closely adjacent pixels carry almost the same information, so including all of them is redundant; it is better to skip H pixels (the hole size) and take one.

    The traditional way to predict category and position: convolutional layers + fully connected layers, predicting the object's center position and width/height; only one object position can be predicted.
    The current way to predict category and position: convolutional layers + convolutional layers, predicting object class probabilities and bounding boxes simultaneously.

  5. SSD vs YOLO
    (1) Uses smaller convolution kernels.
    (2) Uses different predictors (default boxes) for objects with different aspect ratios.
    (3) Uses the outputs of different convolutional layers to predict objects of different sizes.

  6. Summary of SSD
    (1) Localizes multiple categories simultaneously.
    (2) Uses the output feature maps of multiple convolutional layers to predict targets at different scales.
    (3) The richer the set of anchors used, the better the effect.
    (4) Accuracy comparable to Faster R-CNN, and faster.

  7. GAN (Generative Adversarial Networks)
    (1) Proposed by Ian Goodfellow in 2014.
    (2) An unsupervised learning task.
    (3) Uses two deep neural networks: a Generator and a Discriminator.
    Basic steps for training a GAN:
    (1) Sample m points from the noise distribution and from the real dataset.
    (2) Use these data to train the discriminator.
    (3) Sample a different noise set of size m.
    (4) Train the generator on this data.
    (5) Repeat from step 1.

  8. Binarized neural networks
    (1) At prediction time, the network's weights and activations are binary (-1/+1).
    (2) At training time, the binarized weights and activations are used to compute the parameter gradients.
    (3) Because the network's weights and activations are binary, there are two benefits:

    1. The model size is reduced by a factor of 32.
    2. Arithmetic can be implemented with bit-wise operations, which makes hardware implementation easier, faster and more energy-efficient.

1.3 Image Segmentation

  1. Semantic segmentation of image segmentation:
    (1) Definition: The goal of semantic image segmentation is to mark the category of each pixel of the image. Because we need to predict every pixel in an image, this task is often called dense prediction.
    (2) Application:

    (1) Autonomous Driving—Autonomous driving is a complex task that requires perception in a changing environment. Since safety is paramount, this task also needs to be performed with the utmost precision. Semantic segmentation provides information about spaces on the road, as well as detecting lane markings and traffic signs.
    (2) Medical image processing: It can assist radiologists in image analysis, greatly reducing the time required for diagnosis.
    (3) Geo-sensing (e.g. the SpaceNet dataset): satellite imagery is used to detect land-cover information, for applications such as detecting areas of deforestation.
    (4) Precision agriculture: Precision agriculture robots can reduce the amount of herbicides that need to be sprayed in fields, and semantic segmentation of crops and weeds can help them trigger weeding behaviors in real time.

  2. Instance segmentation: instance segmentation goes one step further than semantic segmentation; in addition to pixel-level classification, each instance of a class must be distinguished separately. Semantic segmentation does not distinguish between different instances of the same class.

  3. U-Net:

Problems with Conv2DTranspose:
(1) Compared with interpolation methods (e.g. bicubic interpolation) or nearest-neighbor interpolation, Conv2DTranspose is a learned (supervised) operation that requires training.
(2) It can produce a checkerboard effect; one of the solutions is to interpolate first and then apply Conv2DTranspose.

2. Deep Learning

  1. Why use 1×1 convolution kernels

    (1) The purpose of 1×1 convolution is to reduce the dimensionality (number of channels), and it is also followed by a ReLU nonlinearity. For example, suppose the output of the previous layer is 100×100×128. Passing it through a 5×5 convolutional layer with 256 channels (stride=1, pad=2) gives an output of 100×100×256, and the convolutional layer has 128×5×5×256 = 819,200 parameters. If the output of the previous layer first passes through a 1×1 convolutional layer with 32 channels and then through the 5×5 convolutional layer with 256 outputs, the output is still 100×100×256, but the number of convolution parameters is reduced to 128×1×1×32 + 32×5×5×256 = 208,896, roughly 4 times fewer (see the check below).
    (2) It deepens the network and also increases its nonlinearity.
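
A quick check of the parameter counts in the example above (biases ignored):

```python
direct     = 128 * 5 * 5 * 256                       # 819200
bottleneck = 128 * 1 * 1 * 32 + 32 * 5 * 5 * 256     # 4096 + 204800 = 208896
print(direct, bottleneck, round(direct / bottleneck, 1))   # ~3.9x fewer parameters
```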

  2. Pooling layer backpropagation

    Another non-differentiable part of a CNN is the pooling operation, because pooling changes the size of the feature map. With 2×2 pooling (stride 2), if the feature map of layer l+1 has 16 gradient values, then layer l has 64 positions, so the gradient cannot be propagated element-for-element. The idea for solving this is simple: distribute the gradient of 1 output pixel over the corresponding 4 input pixels, while ensuring that the total amount of propagated loss (gradient) stays unchanged. Following this principle, the backpropagation of mean pooling and max pooling differ.

    (1) Mean pooling: the forward pass averages the values in a patch, so the backward pass divides the gradient of each output element into n equal parts and assigns them to the corresponding positions in the previous layer, keeping the sum of gradients (residuals) before and after pooling unchanged.
    (2) Max pooling: this also satisfies the principle that the sum of gradients stays unchanged. The forward pass passes the largest value in each patch to the next layer and directly discards the other values. The backward pass therefore passes the gradient directly to the pixel in the previous layer that held the maximum, while the other pixels receive zero gradient. The difference from mean pooling is that max pooling must record which pixel held the maximum during the forward pass (the "max id"); this variable records the position of the maximum because it is needed in backpropagation. A minimal sketch follows.
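
A minimal NumPy sketch of the two backward rules for a single 2×2 patch (an illustrative toy, not a full pooling layer):

```python
import numpy as np

def max_pool_backward_patch(patch, grad_out):
    """Route the upstream gradient to the recorded position of the maximum ('max id')."""
    grad_in = np.zeros_like(patch, dtype=float)
    idx = np.unravel_index(np.argmax(patch), patch.shape)
    grad_in[idx] = grad_out
    return grad_in

def mean_pool_backward_patch(patch, grad_out):
    """Spread the upstream gradient evenly so the total gradient is preserved."""
    return np.full(patch.shape, grad_out / patch.size)

patch = np.array([[1.0, 3.0], [2.0, 0.5]])
print(max_pool_backward_patch(patch, 1.0))
print(mean_pool_backward_patch(patch, 1.0))
```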

  3. Meta-learning: learning to learn

Meta-learning is also known as "learning to learn", i.e. using previous knowledge and experience to guide the learning of new tasks, so that the network acquires the ability to learn how to learn. It is one of the common approaches to the few-shot learning problem.

 What does "meta" mean in meta-learning? The essence of meta-learning is to increase the learner's generalization ability across multiple tasks. Meta-learning samples both tasks and data, so the learned F(x) can quickly (relying on very few samples) build a mapping for tasks that have not appeared before. The "meta" is thus reflected in how the network learns on each task: by continually adapting to each specific task, the network acquires an abstract learning ability.

 Training and testing in meta-learning: to distinguish the concepts, meta-learning calls the training process "meta-training" and the testing process "meta-testing". Unlike the end-to-end training of ordinary neural networks, the training and testing processes of meta-learning each require two kinds of datasets (a support set and a query set).

 A few-shot classification task is an "N-way k-shot" problem, where N is the number of classes sampled from the testing data and k is the number of samples per class selected from those N classes; in general N is smaller than the total number of classes in the testing data.

 How to build S' and Q': we randomly select N classes from the testing data. Then, from these N classes, we randomly select k+x samples per class (x can be any number); the k samples are used as the support set S', and the other x samples are used as the query set Q'. S and Q are built in the same way, except that the number of classes and the number of samples per class selected from the training data are not constrained.

 How to train: meta-learning usually adopts a method called episodic training.
  1. The difference and connection between "meta-learning" and "transfer learning":

From the perspective of the goal, the essence of both meta-learning and transfer learning is to increase the learner's generalization ability across multiple tasks, but meta-learning focuses more on the double sampling of tasks and data, i.e. both tasks and data are sampled. Concretely, for a 10-class task, meta-learning might only build a 5-way classifier, and each training episode can be regarded as a sub-task; the learned F(x) can help quickly build a mapping in unseen tasks. Transfer learning refers more to the ability to transfer from one task to other tasks, and does not place much emphasis on the concept of a task space.

Meta-learning: learning how to learn; the idea is quite a breakthrough, but its applications are not yet very broad and remain to be examined.
NAS: neural architecture search, which can be used to search for the optimal network architecture, similar to Bayesian search for optimal hyperparameters; it does not count as a breakthrough of deep learning.
Self-supervised learning: learning data features without labels; it has some practical significance, but the idea is not hard to come up with, so it does not count as a breakthrough.
  1. Two common types of transfer learning scenarios:

1. The convolutional network is used as a feature extractor. Take a network pre-trained on ImageNet, remove the last fully connected layer, and use the rest as a feature extractor (for example, AlexNet yields a 4096-dimensional feature vector before the final classifier). Features extracted this way are called CNN codes. Once you have such features, you can use a linear classifier (linear SVM, softmax, etc.) to classify images.
2. Fine-tuning the convolutional network. Replace the input layer (data) of the network and continue training on the new data. When fine-tuning, you can choose to fine-tune all layers or only some of them. Usually, the earlier layers extract generic image features (e.g. edge detection, color detection) that are useful for many tasks, while the later layers extract features related to specific categories, so it is often enough to fine-tune only the later layers. A minimal sketch of both scenarios follows.
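
A hedged Keras sketch of the two scenarios, using ResNet50 pre-trained on ImageNet as an example base network; the class count, the number of frozen layers and the learning rate are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Scenario 1: frozen feature extractor + linear classifier on top
base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, pooling='avg')
base.trainable = False
model = tf.keras.Sequential([base, layers.Dense(10, activation='softmax')])  # 10 = placeholder class count

# Scenario 2: fine-tuning only the later layers with a small learning rate
base.trainable = True
for layer in base.layers[:-20]:          # freeze everything except the last ~20 layers (arbitrary choice)
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```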

  1. Transfer learning and fine-tuning:

1. Fine-tuning is one of the means of achieving transfer learning.
2. Generally speaking, when the amount of data is sufficient, the effect of transfer learning is not as good as completely retraining the model.
3. However, the training time and cost required for transfer learning are far less than training a complete model from scratch.

  1. Feature engineering:

1. Performs well on small and medium datasets.
2. Manual processing and extraction of data features; a process of manual extraction, sometimes also referred to as "cleaning the data".

  1. Representation learning:

1. Performs well on large, complex datasets.
2. Automatically learns useful data features; this is a process of automatic learning by the model.
3. Another benefit of representation learning is that the highly abstract features it learns can be applied to other related problems through transfer learning.

  1. Transfer learning in engineering practice:

1. The structures most commonly used in engineering are VGG, ResNet and Inception. Designers usually first train the original model on the data directly, then select the model with the best results for fine-tuning and model reduction.
2. A model used in engineering must be fast while still having high accuracy.
3. The common methods of model reduction are to reduce the number of convolutions and to reduce the number of ResNet modules.

  1. RoIAlign:

    RoIAlign was proposed because the earlier RoI Pooling operation causes large misalignments in instance segmentation; RoIAlign solves the mis-alignment problem introduced by the two quantizations in the RoI Pooling operation.

  2. Replacing the fully connected layers at the end of a CNN with convolutional layers
    breaks through the restriction on the input size.

  3. FPN:
    FPN handles small targets very well: it uses low-level features to obtain more accurate position information and high-level features to obtain strong semantic information, fuses the multi-level feature information, and produces outputs at different feature levels, improving detection performance.

  4. Principle of YOLOv3

  5. How to solve the size and dimension mismatch of the residual structure:

  6. mAP:

  7. Why go deep in deep learning?
    1. An intuitive explanation comes from model complexity. If we increase the complexity of a learning model, its learning capacity improves. How do we increase the complexity of a neural network? Either make it wider, i.e. increase the number of neurons in the hidden layers, or make it deeper, i.e. increase the number of hidden layers. Making it wider only adds more computing units and increases the number of functions; making it deeper not only increases the number of functions but also increases the degree of nesting between them.
    2. Through the transformations of many layers, deep learning can learn higher-level features and solve more complex tasks.
    3. Why can we use such models now? There are many factors: first, we have larger datasets; second, we have powerful computing devices; third, we have many effective training techniques.
    4. As already shown by the ZFNet network, features form a hierarchy: the deeper the level, the stronger the feature invariance and the stronger the class discrimination ability, so learning complex tasks requires a deeper network.

  8. How to improve the results of model training?

1. Improve the data: data preprocessing (zero mean and unit variance), data expansion or augmentation.
2. Diagnose whether the network is overfitting or underfitting, via the bias-variance trade-off. Regularization addresses overfitting, and early stopping prevents overfitting.
3. Choose the learning rate and activation function carefully; adjust the number of fully connected layers and the total number of layers; choose the optimization algorithm (stochastic gradient descent, RMSprop, momentum, Adam); use batch normalization.
4. Weight initialization: Xavier initialization keeps the variance of the inputs and outputs consistent and avoids all outputs tending to 0.

  1. Why gradient disappearance and gradient explosion occur:
    The current methods of optimizing neural networks are all based on BP, that is, the error calculated according to the loss function is backpropagated through the gradient to guide the update and optimization of the deep network weights. The process of passing the error from the last layer forward requires the help of the chain rule (Chain Rule), so the backpropagation algorithm can be said to be the application of gradient descent in the chain rule.

    The chain rule is a product of factors, so in deeper networks the gradient propagates multiplicatively, and the vanishing and exploding gradient problems generally become more obvious as the number of layers increases. When the error computed from the loss function is propagated backward through gradients to update the weights of a deep network, the resulting gradient values may be close to 0 or very large, i.e. the gradient vanishes or explodes. Vanishing and exploding gradients are essentially the same phenomenon.

  2. Analysis of the reasons for gradient vanishing and gradient explosion
    [Gradient vanishing] often occurs for two reasons: a deep network, or the use of an unsuitable activation function such as sigmoid. When gradients vanish, the hidden layers close to the output layer have relatively normal gradients, so their weight updates are relatively normal; but closer to the input layer, because of the vanishing gradient, the weight updates of the hidden layers become slow or stagnant. Training then effectively learns only a shallow network consisting of the last few layers.

    [Gradient explosion] generally occurs with deep networks and overly large weight initializations. In a deep neural network or recurrent neural network, the error gradients are accumulated and multiplied during the update. If the gradient factors between network layers are greater than 1.0, repeated multiplication makes the gradient grow exponentially; the gradient becomes very large, which causes large updates of the network weights and makes the network unstable.

    Gradient explosion is accompanied by some telltale signals, for example: (1) the model is unstable, with the loss changing dramatically during updates; (2) during training, in extreme cases, the weight values become so large that they overflow, causing the model loss to become NaN, and so on.

  3. How to solve gradient explosion and vanishing
    Both the vanishing and exploding gradient problems arise because the network is too deep and the weight updates are unstable, which is essentially due to the multiplicative effect in gradient backpropagation. The main methods for dealing with them are:
    1. Pre-training plus fine-tuning
    2. Gradient clipping: set a threshold on the gradient
    3. Weight regularization (for gradient explosion)
    4. Use a different activation function: choose activations such as ReLU whose gradient is mostly constant
    5. Use batch normalization
    6. Use residual structures
    7. Use an LSTM network

Specifically:
(1) Pre-training + fine-tuning

This method comes from a paper published by Hinton in 2006. To solve the gradient problem, Hinton proposed an unsupervised layer-by-layer training method: each layer is trained using the output of the previous layer's hidden nodes as its input, and the output of its own hidden nodes becomes the input of the next layer; this layer-by-layer process is the "pre-training". After pre-training is completed, the whole network is "fine-tuned". The idea is equivalent to first finding local optima and then combining them to find a global optimum. The method has certain advantages, but it is not widely used today.

(2) Gradient clipping: set a threshold for the gradient

The gradient clipping scheme is mainly aimed at gradient explosion. The idea is to set a clipping threshold; when updating, any gradient that exceeds the threshold is forced back into this range, which prevents exploding gradients. A hedged sketch is given below.
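
A hedged TensorFlow/Keras sketch of two common ways to clip gradients (the threshold values are placeholders):

```python
import tensorflow as tf

# Option 1: let the optimizer clip every gradient to a fixed value (or norm)
opt = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)   # or clipnorm=1.0

# Option 2: clip the global norm manually inside a custom training step
def clipped_apply(optimizer, grads, variables, max_norm=5.0):
    grads, _ = tf.clip_by_global_norm(grads, max_norm)
    optimizer.apply_gradients(zip(grads, variables))
```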

(3) Weight regularization

Another way to address gradient explosion is weight regularization. Regularization mainly limits overfitting by penalizing the network weights, but if a gradient explosion occurs the weights become very large; conversely, limiting the size of the weights through the regularization term can also prevent gradient explosion to a certain extent. The most common forms are L1 and L2 regularization, and every deep learning framework provides corresponding APIs for them.

For details about L1 and L2 regularization, please refer to my previous article - underfitting, overfitting and how to prevent overfitting

(4) Choose an activation function, such as ReLU, whose gradient is mostly constant

The derivative of the ReLU function is identically 1 on the positive side, so using the ReLU activation in a deep network does not by itself cause vanishing or exploding gradients.

For details about activation functions such as relu, please refer to my previous article-Review the past and learn the new-Activation functions and their respective advantages and disadvantages

(5) batch normalization

BN standardizes the output of each layer to a consistent mean and variance, which removes the effect of weight scaling and thereby mitigates vanishing and exploding gradients; it can also be understood as BN pulling the outputs from the saturated region of the activation back into the unsaturated region.

For details about Batch Normalization (BN), please refer to my previous article - Commonly used Normalization methods: BN, LN, IN, GN

(6) Shortcut for residual network (shortcut)

Speaking of the residual structure, I have to mention this paper: Deep Residual Learning for Image Recognition. Paper link: http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

Compared with the earlier plain feed-forward network structures, residual networks contain many such cross-layer shortcut connections. This structure is very beneficial during backpropagation and helps avoid vanishing gradients.

(7) "gate" structure of LSTM

The full name of LSTM is long short-term memory network. The structural design of LSTM mitigates the vanishing-gradient problem of RNNs, mainly thanks to the complex "gates" inside the LSTM.

Through its internal "gates", the LSTM can carry ("remember") information from earlier steps into subsequent updates.
15. Have you done any other projects related to the position you are applying for? Explain the content of your current master's research: has it produced any results?

Suggested answer: relate the things you have done and any skills you have learned to the position you are applying for.
  1. Why use many small convolution kernels (such as 3×3) instead of a few large convolution kernels?

This is well explained in the original VGGNet paper.
There are two reasons:
First, you can use several smaller kernels instead of a few larger kernels to obtain the same receptive field and capture more spatial context, while using fewer parameters and less computation.
Second, because with smaller kernels you use more layers and therefore more activation functions, your CNN can learn a more discriminative mapping function.

  1. Why CNNs usually have an encoder-decoder structure in image segmentation?
The encoder CNN can basically be regarded as a feature extraction network,
while the decoder uses that information to predict the image segmentation regions by "decoding" the features and upsampling back to the original image size.
  1. Why do we use convolutions on images and not just FC layers?

This one is interesting because companies don't usually ask this question. As you might expect, I got this question from a company that specializes in computer vision. There are two parts to this answer.

First, convolutions preserve, encode and actually use the spatial information from the image. If we only use FC layers, we will have no relative spatial information.

Second, Convolutional Neural Networks (CNNs) have partial built-in translation invariance because each convolution kernel acts as its own filter/feature detector. And this reduces a large number of parameters and reduces overfitting.

  1. Is the convolution kernel flipped by 180 degrees?

(1) The convolution used for image processing in convolutional neural networks does not rotate the kernel by 180 degrees: its purpose is to extract image features, and it really only borrows the idea of "weighted summation".
(2) Mathematical convolution, as in signal processing or traditional image processing, does rotate the kernel by 180 degrees; which form is used is determined by the needs of the problem.
(3) In mathematics the convolution kernel is known or given; in a convolutional neural network the kernel is a trainable parameter that is not given in advance but learned from data, so whether it is flipped or not, the weights are simply parameters at their positions.

  1. dropout

Dropout comes from the following idea:

when the network fits the training set well but the validation set poorly, can each iteration randomly update only part of the network parameters (weights) instead of all of them? Introducing this kind of randomness can improve the network's ability to generalize; hence dropout.

How does dropout work?

The neurons of each layer represent learned intermediate features; when the amount of data is too small and overfitting occurs, the features represented by the neurons clearly become mutually repetitive and redundant.
The direct effect of dropout is to reduce this redundancy, i.e. to increase the orthogonality between the features of each layer. Since only a part of the parameters is updated in each iteration, convergence is slowed down.

The specific method of dropout?

Dropout is generally used in fully connected layers; the dropped neurons do not take part in the parameter update of the current round. Dropout is only applied at training time; at test time all neurons participate in the computation.

During training, we randomly sample the parameters of a weight layer with a certain retaining probability p and use this sub-network as the target network for the current update. If the whole network has n parameters, the number of available sub-networks is 2^n.

At test time, no parameters are dropped; the dropout layer simply outputs whatever comes in. The output of the dropout layer at test time is then multiplied by the retaining probability p used during training, so that the input of the layer after dropout has the same "meaning" and "magnitude" as during training.

How is dropout called in code?

In TensorFlow: tf.nn.dropout(output of the previous layer, drop rate). Note that in TensorFlow 2 the second argument is the probability of dropping a unit (rate), whereas the older TF 1.x API took keep_prob instead. A minimal sketch of inverted dropout follows.
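
A minimal NumPy sketch of inverted dropout, the variant used by most modern frameworks; unlike the "multiply by p at test time" scheme described above, it rescales at training time so nothing needs to be multiplied at inference:

```python
import numpy as np

def dropout_forward(x, rate=0.5, training=True):
    """Inverted dropout: zero each unit with probability `rate`, scale the rest by 1/(1-rate)."""
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob
    return x * mask

x = np.ones((4, 8))
print(dropout_forward(x, rate=0.5))          # training: ~half the units zeroed, rest scaled to 2.0
print(dropout_forward(x, training=False))    # inference: unchanged
```
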
  1. L2 normalization: solves the problem that the magnitudes of features in different feature maps differ too much.

  2. Hard negative mining: how to generate negative samples
    (1) Randomly select some non-target regions as negative samples.
    (2) Train an object detector.
    (3) Run the object detector on some randomly selected images, collect the regions that are mistakenly detected as objects, and add them as negative samples.
    (4) Return to step (2).

  3. Non-max suppression (NMS)
    We may detect the same target multiple times with different sizes and aspect ratios. To avoid multiple detections of the same target, non-max suppression is used:
    (1) Sort the detections by their output probability.
    (2) Discard predicted boxes whose probability is too low.
    (3) Repeat: select the predicted box with the highest probability; if it overlaps another predicted box too much (for example, IoU greater than 0.5), keep the box with the higher probability and discard the other one.
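
A minimal NumPy sketch of the greedy NMS procedure just described (boxes as [x1, y1, x2, y2]):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression; returns the indices of the kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the best box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2] -- the second box is suppressed
```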

  4. Convolution and deconvolution
    Transposed convolution / deconvolution / up-convolution:
    (1) The output of convolution and pooling is smaller than the input.
    (2) Pooling helps us understand what the objects in the image are by increasing the field of view, but this operation also loses the position information of the objects.
    (3) In semantic segmentation, we not only need to know what the objects in the image are, but also where they are; we need an operation that enlarges the image while preserving object position information.
    (4) Transposed convolution is a good choice for upsampling: through error backpropagation it learns the weights that best convert low-resolution feature maps into high-resolution ones.



  2. Do you understand common loss functions, common activation functions, and ELU functions:

Common loss functions: 0-1 loss, absolute value loss, log loss, squared loss, exponential loss, hinge loss, and cross-entropy loss.

 (1) 0-1 loss function

$L(Y,f(X))=\left\{\begin{matrix} 1, & Y\neq f(X) \\ 0, & Y=f(X) \end{matrix}\right.$

(2) Absolute value loss function

L(Y, f(X)) = |Y - f(X)|

(3) Log loss function

L(Y, P(Y|X)) = -log P(Y|X)

(4) Squared loss function

$L(Y,f(X))=\sum_{N}(Y-f(X))^2$

(5) Exponential loss function

L(y, f(x)) = exp[-y f(x)]

(6) Hinge loss function

L(y, f(x)) = max(0, 1 - y f(x))

(7) Cross-entropy loss function

$C=-\frac{1}{n}\sum_{x}[y\ln a+(1-y)\ln(1-a)]$

Common activation functions are: Sigmoid, Tanh, ReLU, leaky ReLU

 (1) Sigmoid function:

$f(x)=\frac{1}{1+e^{-x}}$
Features: it maps a continuous real-valued input to an output between 0 and 1; in particular, a very large negative input gives an output of 0, and a very large positive input gives an output of 1.
Disadvantages:

  • Disadvantage 1: gradients vanish when backpropagating through deep neural networks; the probability of gradient explosion is very small, while the probability of vanishing gradients is relatively high.

  • Disadvantage 2: the output of sigmoid is not zero-mean (not zero-centered).

  • Disadvantage 3: its analytic expression contains exponentiation, which is relatively time-consuming to compute; for large deep networks this can significantly increase training time.

    (2) Tanh function:

    $\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$
    Features: it solves the non-zero-centered output problem of the sigmoid function and converges faster than sigmoid. However, the problems of vanishing gradients and exponentiation remain.

    (3) ReLU function:

    f(x) = max(0, x)
    Features:

  • 1. ReLU thresholds its input to produce the output, so its computational complexity is lower than that of the other two functions (which involve exponentiation).

  • 2. The non-saturating property of ReLU effectively mitigates the vanishing-gradient problem and provides a relatively wide activation boundary.

  • 3. The one-sided suppression of ReLU gives the network a sparse representation ability.

  • 4. Limitation of ReLU: neurons can "die" during training.
    This is because when a negative input passes through a ReLU unit its gradient is set to 0, and the unit is no longer activated by any data afterwards; that is, the gradient flowing through that neuron is always 0 and it no longer responds to any data. In practice, if the learning rate is set too high, more than a certain proportion of the neurons will die irreversibly, their parameter gradients can no longer be updated, and the whole training process fails.

     (4) Leaky ReLU function:

    $f(x)=\left\{\begin{matrix} x & (x>0) \\ ax & (x\leq 0) \end{matrix}\right.$
    The difference between LReLU and ReLU is that for x < 0 its value is not 0 but a linear function with slope a. Usually a is a small positive constant, which both achieves one-sided suppression and retains part of the negative gradient information so that it is not completely lost. On the other hand, choosing the value of a makes the problem harder: a strong prior or repeated training is needed to determine an appropriate value.

      (5) PReLU
    

    Based on this, the parametric PReLU (Parametric ReLU) came into being. Its main difference from LReLU is that the slope a of the negative part is a learnable parameter, trained by backpropagation and optimized jointly with the other parameterized network layers. Another variant of LReLU adds a "randomization" mechanism: during training, the slope a is randomly sampled from a given distribution and then fixed during testing. Randomized ReLU (RReLU) can play a regularizing role to a certain extent.

      (6) ELU function:

    $f(x)=\left\{\begin{matrix} x, & x>0 \\ a(e^{x}-1), & x\leq 0 \end{matrix}\right.$
    The ELU function is an improved version of ReLU. Compared with ReLU, it has a nonzero output for negative inputs, and this part of the output also has some robustness to noise. This eliminates the dying-ReLU problem, but the problems of gradient saturation (for very negative inputs) and exponentiation remain.
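
Minimal NumPy definitions of the activation functions discussed above (a is a small constant for Leaky ReLU / ELU):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, f(x))
```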


  1. Why is MSE not used for classification problems, while cross entropy is?
    The basic expression of logistic regression (LR) is:
    $h_\theta(x)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta^{T}x}}$
    Using cross entropy as the loss function, the loss is:
    $C=-\frac{1}{n}\sum [y\ln\hat{y}+(1-y)\ln(1-\hat{y})]$
    If we instead use MSE (mean squared error) as the loss function, the loss and its derivative are:
    $C=\frac{\left ( y-\hat{y} \right )^{2}}{2}$
    $\frac{\partial C}{\partial w}=(\hat{y}-y)\,{\sigma}'(z)\,x$
    With the squared loss, the speed of the gradient update is strongly tied to the gradient of the sigmoid function itself; the gradient of sigmoid never exceeds 0.25 over its domain, so training is very slow. With cross entropy this does not happen: its derivative is simply a difference, so when the error is large the update is fast and when the error is small the update is slow, which is exactly what we want.
    Moreover, when the sigmoid output is used as the probability of the positive class and the squared loss is used at the same time, the resulting loss function is non-convex, which is hard to optimize and easily yields only a local optimum. If maximum likelihood is used instead, the objective is the log-likelihood, and the loss function is a continuously differentiable convex function of the unknown parameters, which makes it convenient to find the global optimum. (Whether a function is convex is determined by the definition of convexity: for a univariate function, its second derivative must be non-negative everywhere; for a multivariate function, convexity is judged by the positive semi-definiteness of its Hessian matrix, the square matrix of second derivatives.)
  2. Calculation formula of the F1-score
    To calculate the F1-score, first calculate Precision and Recall:
    $Precision=\frac{TP}{TP+FP}$
    $Recall=\frac{TP}{TP+FN}$
    $F_{1}=\frac{2PR}{P+R}$
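
A quick sketch computing these metrics from binary predictions:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = np.array([1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1])
print(precision_recall_f1(y_true, y_pred))   # (0.75, 0.75, 0.75)
```
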
  3. The idea of ​​distillation, why distillation
    Knowledge distillation is to distill the knowledge contained in the trained model into another model. Specifically, knowledge distillation can transfer the knowledge of one network to another network, and the two networks can be homogeneous or heterogeneous. The method is to train a teacher network first, and then use the output of the teacher network and the real label of the data to train the student network.
    During training, complex models and massive computing resources are required to extract information from very large, highly redundant datasets. In experiments, the best-performing models are often large in scale, and sometimes are even ensembles of multiple models. However, large models are inconvenient to deploy as services; the common bottlenecks are:
    (1) Slow inference speed.
    (2) High resource requirements for deployment (memory, GPU memory, etc.), while deployment often imposes strict restrictions on latency and available resources.
    Therefore, model compression (reducing the number of parameters of the model while maintaining performance) has become an important issue, and "model distillation" is one method of model compression.
  4. Ablation experiments
    An ablation experiment is similar to the "controlled variable" method.
    Suppose that in an object detection system, components A, B and C together achieve a good result, but you do not know which of A, B and C is responsible for the good effect. You then keep A and B, remove C, and run the experiment to see what role C plays in the overall system.

3. Machine Learning

3.1 KNN

The idea of ​​KNN algorithm:

Given a training set whose data and labels are known, input the test data, compare the features of the test data with the corresponding features in the training set, and find the K training samples most similar to it; the category of the test data is then the category that appears most often among these K samples. The algorithm is described as:
(1) Compute the distance between the test data and each training sample;
(2) Sort by increasing distance;
(3) Select the K points with the smallest distances;
(4) Determine the frequency of each category among these K points;
(5) Return the most frequent category among the K points as the predicted class of the test data (a minimal sketch follows below).
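
A minimal NumPy sketch of the five steps above (Euclidean distance, majority vote; the toy data are placeholders):

```python
import numpy as np
from collections import Counter

def knn_predict(x_test, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)       # (1) distances to all training points
    nearest = np.argsort(dists)[:k]                        # (2)(3) the K closest points
    votes = Counter(y_train[nearest].tolist())             # (4) class frequencies among the K
    return votes.most_common(1)[0][0]                      # (5) most frequent class

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([0.5, 0.5]), X_train, y_train))   # 0
print(knn_predict(np.array([5.5, 5.0]), X_train, y_train))   # 1
```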

What happens if K in KNN is set too large or too small?

The choice of K has a major impact on the result of the K-nearest-neighbor algorithm.

If a smaller K is chosen, it is equivalent to making predictions using training instances in a smaller neighborhood. The approximation error of "learning" (which can be understood as the training error on the existing training set) decreases, since only training instances close or similar to the input instance contribute to the prediction; but the estimation error of "learning" increases. In other words, a smaller K means the overall model becomes more complex and prone to overfitting.
If a larger K is chosen, it is equivalent to making predictions using training instances in a larger neighborhood. The advantage is that the estimation error decreases, but the disadvantage is that the approximation error increases: training instances far from (dissimilar to) the input also affect the prediction and cause errors, and a larger K means the overall model becomes simpler.
In practice, K usually takes a relatively small value, chosen for example by cross-validation. A rule of thumb: K is generally below the square root of the number of training samples.

3.2 The difference between GBDT and Bagging, and why the sample weights change

GBDT is a Boosting-based algorithm.
The difference between Bagging and Boosting:

(1) Sample selection:
 Bagging: each training set is drawn from the original set with replacement, and the training sets of different rounds are independent of each other.
 Boosting: the training set stays the same in every round, but the weight of each sample in the classifier changes; the weights are adjusted according to the classification results of the previous round.
(2) Sample weights:
Bagging: uniform sampling; every sample has equal weight.
Boosting: the sample weights are continually adjusted according to the error rate; the larger the error, the larger the weight.
(3) Prediction functions:
Bagging: all prediction functions have equal weight.
Boosting: each weak classifier has a corresponding weight, and classifiers with smaller classification error receive larger weights.
(4) Parallel computation:
Bagging: the individual prediction functions can be generated in parallel.
Boosting: the prediction functions can only be generated sequentially, because the parameters of each model depend on the results of the previous round.

Reasons for sample weight changes in Boosting:

By increasing the weights of the samples misclassified by the weak classifier in the previous round and decreasing the weights of the correctly classified samples, the classifier is made to perform better on the misclassified data.

3.3 Gradient descent idea

Gradient descent is a very general optimization algorithm capable of finding optimal solutions for a wide range of problems. The central idea of gradient descent is to iteratively adjust the parameters so as to minimize the loss function.
Suppose you are lost in the fog on a mountain and can only feel the slope of the ground under your feet. A good strategy to reach the foot of the mountain quickly is to go downhill in the steepest direction, which is exactly what gradient descent does: it measures the local gradient of the loss function with respect to the parameter vector and keeps adjusting the parameters in the direction of decreasing gradient until the gradient reaches 0, i.e. a minimum.

Update rule (vector form):
$\Theta := \Theta - \eta \nabla_{\Theta}\, loss$
For each individual weight:
$w_{i} := w_{i} - \eta \frac{\partial loss}{\partial w_{i}}$
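
A minimal sketch of this update rule applied to a simple quadratic loss (the learning rate and starting point are arbitrary):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, n_steps=100):
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta -= lr * grad_fn(theta)      # theta := theta - eta * d(loss)/d(theta)
    return theta

# loss(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3); minimum at theta = 3
grad = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent(grad, theta0=[0.0]))   # converges toward [3.]
```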

3.4 Least square method

The least squares method is a mathematical optimization technique. It finds the best function fit to the data by minimizing the sum of the squared errors. With least squares, unknown quantities can be obtained conveniently such that the sum of the squared errors between the fitted values and the actual data is minimal. Least squares can also be used for curve fitting and to solve regression problems. The loss function most commonly used in regression learning is the squared loss, in which case the regression problem can be solved by the famous least squares method. Least squares is one of the best-known and most effective algorithms in machine learning and one of its most basic algorithms.

The "squares" in the name refer to squaring the errors.

$S_{\epsilon^{2}}=\sum \left ( y-y_{i} \right )^{2}\rightarrow \min$ (minimize the sum of squared differences between the true values and the fitted values)

For the same set of data, choosing different f(x) gives different fitted curves via least squares.
For different data, one can likewise choose different f(x) and obtain different fitted curves via least squares.

3.5 Linear models

In supervised learning, if the predicted variable is discrete, we call it classification (such as decision tree, support vector machine, etc.), and if the predicted variable is continuous, we call it regression. In regression analysis, if only one independent variable and one dependent variable are included, and the relationship between the two can be approximated by a straight line, this regression analysis is called unary linear regression analysis. If two or more independent variables are included in the regression analysis, and there is a linear relationship between the dependent variable and the independent variable, it is called multiple linear regression analysis. For two-dimensional space linearity is a straight line; for three-dimensional space linearity is a plane, for multi-dimensional space linearity is a hyperplane...

Multiple linear regression (n variables, degree 1): y = a·x1 + b·x2 + c·x3 + …

Polynomial regression (degree n): $y=ax_{1}^{n}+bx_{2}^{n-1}+\dots$

Both the least squares method and the gradient descent method find the minimum value of the loss function by derivation.

Similarities:

1. Same essence: both compute a general estimation function for the dependent variable given known data (independent & dependent variables), and then use it to estimate the dependent variable for given data.

2. Same goal: within the framework of the known data, both try to make the total squared difference between the estimated values and the actual values as small as possible (in fact, using the square is not strictly required).

3.6 Comparison of FM vs SVM

The biggest difference between FM and SVM lies in how the weights of feature interactions are computed.
SVM's pairwise feature-interaction parameters are independent, whereas FM's pairwise interaction parameters are two K-dimensional vectors v_i and v_j, so the interaction parameters are not independent but influence each other.
FM can be optimized and learned in its primal form, while a kernel-based nonlinear SVM usually has to be optimized in the dual form.
FM's model prediction is independent of the training samples, whereas SVM's prediction depends on part of the training samples, namely the support vectors.
The FM model has two advantages:
(1) Feature interactions can still be estimated under high sparsity, and the model generalizes to interaction parameters that were never observed.
(2) The time complexity of the model's prediction is linear.

3.7 Randomness of Random Forest

The randomness of the random forest is reflected in the fact that the training samples of each tree are random, and the split attribute set of each node in the tree is also randomly selected and determined. With these two random guarantees, the random forest will not produce overfitting.

3.8 k-means and spectral clustering

(1) Clustering algorithms are unsupervised machine learning algorithms: there is no class label y, and similar data must be grouped together according to their features. The K-means clustering algorithm randomly selects K points as cluster centers, computes the distance between every other point and the centers, assigns each point to the nearest center, recomputes the center of each cluster after the assignment, then reassigns points to the nearest new centers, and repeats this process until the centers no longer change (a minimal sketch follows below).
(2) The idea of spectral clustering is to regard the samples as vertices and the similarities between samples as weighted edges, thus turning the clustering problem into a graph-partitioning problem: find a partition of the graph such that the weights of the edges connecting different groups are as low as possible (meaning the similarity between groups is as low as possible) and the weights of the edges within a group are as high as possible (meaning the similarity within a group is as high as possible), thereby achieving the purpose of clustering.
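
A minimal NumPy sketch of the K-means loop described in (1) (plain random initialization, no k-means++ tricks; the toy data are placeholders):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # random initial centers
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                        # stop when centers no longer move
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
print(centers)
```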

4. Data processing

4.1 Normalization + Standardization + Centralization

1. Definition

Normalization: converts the feature values of the samples to the same scale and maps the data to the [0,1] or [-1,1] interval. The result is determined only by the extreme values (min and max) of the variable; interval (min-max) scaling is one kind of normalization.

Standardization: the data is processed column by column in the feature matrix and rescaled via the z-score to zero mean and unit variance (so that normally distributed data becomes standard normal). It is related to the overall sample distribution, and every sample point can influence the standardization.

What they have in common is that both remove the error caused by features having different scales (units); both are linear transformations that scale the vector X proportionally and then translate it, and linear transformations do not change the ordering of the original data values.

Centralization: the average value is 0, and there is no requirement for standard deviation.

The difference between standardization and centering: Standardization is the raw score minus the mean and then divided by the standard deviation, and centering is the raw score minus the mean. Therefore, the general process is first centralized and then standardized.

2. Why normalization/standardization:

(1) Certain model solution requirements:

When using the gradient descent method to solve the optimization problem, after normalization/standardization, the solution speed of the gradient descent can be accelerated, that is, the convergence speed of the model can be improved. The contour line formed when it is not normalized/standardized is elliptical, and it is likely to take a "zigzag" route (vertical long axis) during iteration, which leads to many iterations before convergence. After the features are normalized, the corresponding contour lines will become round, which can converge faster when the gradient descent is used to solve the problem.

Some classifiers need to calculate the distance between samples (such as Euclidean distance), such as KNN. If the value range of a feature is very large, the distance calculation mainly depends on this feature, which is contrary to the actual situation (for example, the actual situation is that the feature with a small value range is more important).

(2) Making the data dimensionless (removing units): normalization.

(3) Avoid numerical problems: too large a number will cause numerical problems.

3. During data preprocessing:

3.1 Normalization:

(1) Min-Max normalization:

x’ = (x - X_min) / (X_max - X_min)

(2) Average normalization:

x’ = (x - μ) / (MaxValue - MinValue)

One flaw of the above two normalizations is that when new data is added, it may lead to changes in max and min, which need to be redefined.

(3) Nonlinear normalization:

1) Logarithmic normalization: y = log10(x)

2) Arctangent normalization: y = atan(x) * 2 / pi

Nonlinear normalization is often used when the data values vary over a very wide range: some values are large and some are small. The original values are mapped through a mathematical function; options include log, exponential, and tangent transforms. The curve of the nonlinear function should be chosen according to the data distribution, e.g. log(v, 2) or log(v, 10).

3.2 Standardization

Z-score normalization (standard deviation normalization/zero mean normalization): x' = (x - μ)/σ

The expectation, mean, median, and mode of the normal distribution are the same, and they are all equal to μ, which is the position parameter of the normal distribution and describes the central tendency position of the normal distribution. The law of probability is that the probability of taking a value close to μ is greater, and the probability of taking a value farther away from μ is smaller. The normal distribution takes X=μ as the axis of symmetry, and the left and right are completely symmetrical.

σ describes the degree of dispersion of data distribution of normal distribution data. The larger σ is, the more dispersed the data distribution is, and the smaller σ is, the more concentrated the data distribution is. Also known as the shape parameter of the normal distribution, the larger the σ, the flatter the curve, and conversely, the smaller the σ, the thinner and taller the curve.

3.3 Centralization:

x’ = x - μ
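A minimal numpy sketch of the three transforms on a made-up feature column:

import numpy as np

x = np.array([1.0, 2.0, 5.0, 10.0, 20.0])  # made-up feature values

# Min-Max normalization -> [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization -> zero mean, unit variance
x_zscore = (x - x.mean()) / x.std()

# Centralization -> zero mean only
x_centered = x - x.mean()

print(x_minmax)
print(x_zscore.mean().round(6), x_zscore.std().round(6))  # ~0 and ~1
print(x_centered.mean().round(6))                          # ~0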

4. When to use normalization/standardization

(1) If there is a requirement for the range of output results, use normalization.

(2) If the data is relatively stable and there are no extreme maximum and minimum values, use normalization.

(3) If there are outliers and a lot of noise in the data, use standardization, which can indirectly avoid the influence of outliers and extreme values through centering. In general, standardization is recommended first.

(4) In classification and clustering algorithms, when it is necessary to use distance to measure similarity, or when PCA technology is used for dimensionality reduction, it is better to use z-score standardization.

(5) When distance measures and covariance calculations are not involved, or the data do not follow a normal distribution, min-max scaling or other normalization methods can be used. For example, in image processing, after converting an RGB image to grayscale, its values are limited to the range [0, 255].

5. Which models must be normalized/standardized

(1)SVM

(2)KNN

(3) neural network

4.2 pandas

How does pandas read a very large file?

 data_path=r"E:\demo.csv"
    def read_bigfile(path):
        #分块,每一块是一个chunk,之后将chunk进行拼接
        df=pd.read_csv(path,engine='python',encoding="gbk",iterator=True)
        
        loop=True
        chunkSize=10000
        chunks=[]
        while loop:
            try:
                chunk= df.get_chunk(chunkSize)
                chunks.append(chunk)
            except StopIteration:
                loop =False
                print("Iteration is stopped.")
        df=pd.concat(chunks,ignore_index=True)
    after_df=read_bigfile(path=data_path)

4.3 What optimizations does Python do in memory?

Python uses memory pools to reduce memory fragmentation and improve execution efficiency. Garbage collection is mainly done by reference counting; the problems caused by circular references among container objects are handled by mark-and-sweep, and the efficiency of garbage collection is improved by generational collection.

4.4 How to save memory

Manually release variables that are no longer used;
convert numeric columns to 32-bit or 16-bit types (restrict the data type).
A code example follows:

import numpy as np
import pandas as pd

def reduce_mem_usage(props):
    # Current memory usage
    start_mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage of the dataframe is:", start_mem_usg, "MB")

    # Columns containing NaN are recorded and their NaN filled with -999 (np.nan is treated as float)
    NAlist = []
    for col in props.columns:
        # Only object columns are skipped here; filter out other non-numeric types as well if your data has them
        if props[col].dtype != object:

            print('*******************')
            print("column:", col)
            print("dtype before:", props[col].dtype)

            # Integer dtypes do not support NaN, so NaN must be filled first
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col] = props[col].fillna(-999)  # fill with -999

            # Will the column be converted to an integer type?
            isInt = False
            mmax = props[col].max()
            mmin = props[col].min()

            # Test whether the column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = np.fabs(props[col] - asint)
            result = result.sum()
            if result < 0.01:  # a total absolute error below 0.01 is treated as convertible; adjust per task
                isInt = True

            # Choose an integer / unsigned integer dtype
            if isInt:
                if mmin >= 0:  # minimum >= 0: use an unsigned integer type
                    if mmax <= 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mmax <= 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mmax <= 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:  # otherwise use a signed integer type
                    if mmin > np.iinfo(np.int8).min and mmax < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mmin > np.iinfo(np.int16).min and mmax < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mmin > np.iinfo(np.int32).min and mmax < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mmin > np.iinfo(np.int64).min and mmax < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)
            else:  # note: all floats are converted to float16 here; change this as needed
                props[col] = props[col].astype(np.float16)

            print('dtype after:', props[col].dtype)
            print("****************")

    print("__MEMORY USAGE AFTER COMPLETION:__")
    mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage is:", mem_usg, "MB")
    print("This is", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return props, NAlist

if __name__ == "__main__":
    # Hypothetical input: reuse the CSV from section 4.2 as an example
    props = pd.read_csv(r"E:\demo.csv", engine='python', encoding="gbk")
    props, na_cols = reduce_mem_usage(props)

4.5 How to solve the problem of data imbalance?

Answer:
1. Use undersampling and oversampling (resampling). Oversample the minority class by duplicating samples to increase the data, but this easily causes overfitting, so data augmentation is recommended instead: flip, rotate, translate, and scale the original data, and vary its contrast, brightness, and color to create more samples. For the majority class, remove some of the samples (undersampling).
2. Combine different resampled datasets: for example, build ten models by taking the 1,000 samples of the minority class, splitting the 10,000 samples of the majority class into ten parts of 1,000 each, and training ten different models.
3. Change the evaluation metric of the classifier: in traditional classification, accuracy is the usual metric, but for imbalanced data accuracy is no longer appropriate. Use precision P = TP / (TP + FP), recall R = TP / (TP + FN), or the F1 score, the harmonic mean of precision and recall: F1 = 2PR / (P + R) (see the sketch below).
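A minimal sketch of point 3 with scikit-learn metrics; the label arrays below are invented for illustration:

from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up imbalanced ground truth and predictions (1 = minority class)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))         # 2PR / (P + R)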

4.6 How to deal with different distributions of training set and verification set test set

1. The training set and the validation set may come from different distributions, for example high-resolution images crawled from the web versus blurry images taken with a phone, or artificially synthesized images such as blurred or very bright images.
2. Suppose one source provides a large amount of data, say 200,000 images, and the other source provides little data, say 5,000 images, and the small dataset is the one we actually want to optimize for. One option is to merge the two sources and then randomly split them into training / validation / test sets; the advantage is that the three sets come from the same distribution, but the drawback is that the target is then dominated by the large-source data rather than by our small target dataset. The other option is to use the large dataset entirely for training and build the dev and test sets only from the small dataset; the advantage is that we aim at the right target, the drawback is that training and dev/test then follow different distributions.
3. Analyzing bias and variance in this setting differs from the same-distribution case: add a training-dev set (a portion held out from the training set), giving four sets in total: training, training-dev, dev, and test. By comparing the accuracy on the training-dev set with the accuracy on the training set and on the dev set, one can tell whether the error is caused by variance or by a data distribution mismatch.
See the following link for details: https://blog.csdn.net/koala_tree/article/details/78319908

4.7 What is data regularization/normalization? Why do we need it?

I think this point is very important. Data normalization is a crucial preprocessing step used to rescale the input values into a specific range so that back-propagation converges better. The usual approach is to subtract the mean from each data point and divide by the standard deviation. If we do not do this, some features (those with large magnitudes) get a larger weight in the cost function (a 1% change in a high-magnitude feature is a large absolute change, while for a small-magnitude feature it is tiny). Data normalization puts all features on an equal footing.

4.8 Dimensionality reduction

Explain dimensionality reduction: where is it used, and what are its benefits?

Dimensionality reduction is the process of reducing the number of feature variables under consideration by obtaining a set of principal variables that essentially capture the important features. The importance of a feature depends on how much the feature variable contributes to representing the information in the data.

The benefits of reducing the dimensionality of a dataset include:

(1) Reducing the required storage space.

(2) Speeding up computation (e.g. in machine learning algorithms): fewer dimensions mean less computation, and fewer dimensions also make it possible to use algorithms that are unsuitable for high-dimensional data.

(3) Reducing the data to 2D or 3D allows us to plot and visualize it, possibly revealing patterns and giving us intuition.

(4) Too many features or an overly complex model can lead to overfitting.
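A minimal PCA sketch with scikit-learn on made-up high-dimensional data, illustrating benefits (2) and (3):

import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 200 samples with 50 correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50)) @ rng.normal(size=(50, 50))

pca = PCA(n_components=2)        # keep only 2 principal components
X_2d = pca.fit_transform(X)

print(X_2d.shape)                              # (200, 2) -> easy to plot
print(pca.explained_variance_ratio_.sum())     # fraction of variance kept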

4.9 opencv-python image data type

When an image is read with OpenCV, the result is by default a numpy array (matrix) of dtype uint8:

import cv2
import numpy as np

src = cv2.imread('1.jpg')
print('0', type(src), src.shape, src.dtype)
# 0 <class 'numpy.ndarray'> (1405, 2500, 3) uint8

filter = np.array([[1, 1], [1, 2]])
print(filter.dtype)
# int32

res = cv2.filter2D(src, -1, filter)
print('1', type(res))
# 1 <class 'numpy.ndarray'>

cv2.imshow('res', res)
print(res)
cv2.waitKey(0)


4.10 Logistic regression

Logistic regression is a classification algorithm, so what is it regressing?
Logistic regression assumes that the data follow a Bernoulli distribution; it estimates the parameters by maximum likelihood, typically solved with gradient descent, so as to classify the data into two categories.
Logistic regression is a generalized linear model that narrows the prediction range and limits the predicted value to [0,1], thereby solving the classification problem.
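A minimal sketch (made-up data, arbitrary learning rate) of the points above: the sigmoid squashes predictions into [0,1], and the parameters are fitted by gradient descent on the negative log-likelihood:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 1-D, two-class data
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Xb = np.hstack([X, np.ones((len(X), 1))])  # add a bias column

w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p = sigmoid(Xb @ w)              # predicted probabilities in [0, 1]
    grad = Xb.T @ (p - y) / len(y)   # gradient of the negative log-likelihood
    w -= lr * grad

print("weights:", w)
print("P(y=1 | x=2.25):", sigmoid(np.array([2.25, 1.0]) @ w))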

4.11 GBDT

Do you understand GBDT? What does it use as the base learner? Can it also be used for classification?

  • GBDT is a gradient boosting decision tree. It is a Boosting-based algorithm. It uses an additive model based on a decision tree as a learner. By continuously fitting the residual of the previous weak learner, it finally realizes a classification or regression model. The key is to use the value of the negative gradient of the loss function in the current model as an approximation of the residual to fit a regression tree.
  • The base classifier of GBDT uses a decision tree, which is also used for classification.
  • For classification problems, the exponential loss function is often used; for regression problems, the squared error loss function is often used (in that case the negative gradient is exactly the residual in the usual sense). For a general loss function, the negative gradient serves as an approximation of the residual (see the sketch below).
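A minimal sketch (not a production implementation; data and hyperparameters are made up) of the residual-fitting idea with squared loss, where each new tree is fit to the residuals of the current ensemble:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

lr = 0.1                            # shrinkage / learning rate
pred = np.full_like(y, y.mean())    # initial prediction
trees = []

for _ in range(100):
    residual = y - pred                       # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                     # fit a tree to the residuals
    pred += lr * tree.predict(X)              # add the new tree with shrinkage
    trees.append(tree)

print("training MSE:", np.mean((y - pred) ** 2))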

4.12 What are the improvements of XGBoost compared to GBDT?

The improvements are mainly in the following aspects:

  • The traditional GBDT uses the CART tree as the base learner, and XGBoost also supports linear classifiers. At this time, XGBoost is equivalent to L1 and L2 regularized logistic regression (classification) or linear regression (regression);
  • Traditional GBDT only uses first-order derivative information during optimization, while XGBoost performs second-order Taylor expansion on the cost function to obtain first-order and second-order derivatives;
  • XGBoost adds a regularization term to the cost function to control the complexity of the model. From a bias-variance tradeoff perspective, this reduces the variance of the model, makes the learned model simpler, and prevents overfitting; this is also an advantage of XGBoost over traditional GBDT;
  • shrinkage (reduction), equivalent to the learning rate (eta in XGBoost). When XGBoost completes an iteration, it will multiply the weight of the leaf node by this coefficient, mainly to weaken the influence of each tree and allow more learning space later. (GBDT also has a learning rate);
  • Column sampling: XGBoost borrows from the practice of random forest and supports column sampling, which not only prevents overfitting, but also reduces calculations;
  • Handling of missing values: For samples with missing feature values, XGBoost can also automatically learn its splitting direction;
  • The XGBoost tool supports parallelism. Isn't boosting a serial procedure, so how can it be parallel? Note that XGBoost's parallelism is not at the tree level: each iteration can only start after the previous one finishes (the cost function of the t-th iteration contains the predictions of the previous t-1 iterations). XGBoost's parallelism is at the feature level. One of the most time-consuming steps in learning a decision tree is sorting the feature values (in order to determine the best split point). XGBoost pre-sorts the data before training and stores the result in a block structure that is reused in later iterations, greatly reducing the amount of computation. This block structure also makes parallelism possible: when splitting a node, the gain of each feature must be computed and the feature with the largest gain is chosen for the split, so the gain computation for the different features can run in multiple threads (a parameter-level sketch follows below).
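A minimal sketch (assuming the `xgboost` Python package and scikit-learn are installed; data and values are made up) showing how several of the points above map to parameters: `learning_rate` is the shrinkage eta, `reg_lambda`/`reg_alpha` are the regularization terms, `colsample_bytree` is the column sampling, missing values can be left as NaN, and `n_jobs` controls the feature-level parallelism:

import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X[::50, 3] = np.nan   # leave some missing values; XGBoost learns a default split direction

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,      # shrinkage (eta)
    max_depth=4,
    reg_lambda=1.0,         # L2 regularization on leaf weights
    reg_alpha=0.0,          # L1 regularization
    colsample_bytree=0.8,   # column (feature) sampling, as in random forests
    subsample=0.8,
    n_jobs=4,               # feature-level parallelism
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))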

Reference link:
https://blog.csdn.net/comway_Li/article/details/82532573
