Summary of machine learning and deep learning interview knowledge points

 
  

Click on " Xiaobai Learning Vision " above , and choose to add " Star " or " Top "


Author: Oldpan

Source: oldpan blog

Editor: Extreme Market Platform

Guide

 

This article summarizes questions and key knowledge points that come up in autumn-recruitment interviews. It is suitable for a quick pre-interview review and for consolidating the basics.

Foreword

With autumn recruitment approaching, this article collects the key knowledge points Lao Pan compiled while he was job hunting; it is also useful for consolidating the basics. In addition, I recommend the book "Hundred Faces of Machine Learning" (published in August 2018), which covers many machine learning and deep learning questions that come up during interviews. It is well suited to interview preparation for machine learning and deep learning algorithm engineers, and of course also good for consolidating fundamentals. Books worth reading when you have time:

  • The Mathematics for Programmers series: good for reviewing basic linear algebra and probability theory.

  • Deep Learning (the "flower book"): a summary-style reference; its explanation of the fundamentals is fairly comprehensive.

  • Statistical Learning Methods: a summary-style book, not long, and covers the core material.

  • Pattern Recognition and Machine Learning: clear and well organized; explains machine learning from a Bayesian perspective.

  • Machine Learning (the "watermelon book"): suitable as a textbook; broad but not deep.

[Figure: Unbreakable machine learning]

Common knowledge questions

  • L1 regularization drives most weights to exactly zero while keeping a few large, producing sparse weights; L2 regularization shrinks weights toward zero without making them exactly zero, producing smooth weights;

  • In AdaBoost, all misclassified samples have their weights scaled up by the same factor, determined by the error rate of the current weak classifier;

  • Both Boosting and Bagging combine multiple classifiers by voting, but Boosting weights each classifier according to its accuracy, while Bagging can simply give all classifiers equal weight;

  • The EM algorithm is not guaranteed to find the global optimum;

  • In SVR with an RBF kernel, a very small kernel bandwidth (small σ, i.e. large γ) makes the model highly flexible and prone to overfitting, while a very large bandwidth over-smooths and tends to underfit.

  • Both PCA and LDA are classic dimensionality-reduction algorithms. PCA is unsupervised (training samples need no labels); LDA is supervised (training samples require labels). PCA removes redundant dimensions from the original data, while LDA looks for a direction such that, after projection onto it, data from different classes are separated as well as possible.

PCA is an orthogonal projection whose idea is to maximize the variance of the original data along each dimension of the projection subspace. Suppose we want to project N-dimensional data onto an M-dimensional space (M < N). PCA first computes the covariance matrix of the N-dimensional data and then finds the eigenvectors corresponding to the M largest eigenvalues; these M eigenvectors form the basis of the projection space. After an LDA projection, the within-class variance is smallest and the between-class variance is largest. As shown in the figure below, there are two projection directions: after the projection on the left, the red and blue data still overlap; after the projection on the right, they are just separated. LDA's projection resembles the one on the right: after projection, data from different classes are separated as much as possible, while data from the same class are packed as compactly as possible.

[Figure: PCA vs. LDA projection directions]
  • Reference link: Comparison of PCA and LDA
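
Following the description above, here is a minimal numpy sketch of PCA by eigendecomposition of the covariance matrix (the function name pca is just for illustration):

import numpy as np

def pca(X, m):
    # Project N-dimensional data X (shape: samples x N) onto its top-m principal components.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)            # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigh: symmetric matrix
    top = eigvecs[:, np.argsort(eigvals)[::-1][:m]]   # eigenvectors of the m largest eigenvalues
    return X_centered @ top                           # projected data, shape: samples x m

X = np.random.default_rng(0).normal(size=(200, 5))
print(pca(X, 2).shape)                                # (200, 2)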

KNN K-Nearest Neighbors

There is a lot to know about the K-nearest-neighbor algorithm, such as the steps of the algorithm, its application areas, and practical precautions. Many people, however, are not clear about the precautions for using KNN. This section answers that question and also covers the advantages and disadvantages of the algorithm.

  • Notes on the K-nearest neighbor algorithm

The main precaution when using KNN is that, since distance is used as the metric, all features must be on the same order of magnitude; otherwise the distance computation is dominated by the features with the largest magnitudes. When standardizing the data, the training set and the test set must use the same standardization statistics, for two reasons. First, standardization can be regarded as part of the algorithm: every data point has one number subtracted and is divided by another, and those two numbers should be the same for all data, so that all samples are treated equally. Second, new samples at prediction time are often pitifully scarce; if a new sample is a single data point, its mean is itself and its standard deviation is 0, which makes no sense at all.
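
A minimal numpy sketch of the point above (the synthetic data is a placeholder): the scaling statistics are computed on the training set only and reused unchanged for test samples.

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3)) * [1.0, 10.0, 100.0]   # features on very different scales
X_test = rng.normal(size=(20, 3)) * [1.0, 10.0, 100.0]

mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8          # avoid division by zero
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std     # same statistics applied to new samples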

  • What are the advantages of the K nearest neighbor algorithm?

The advantages of KNN show up in four aspects. First, KNN is an online technique: new data can be added to the data set directly without retraining. Second, it is theoretically simple and easy to implement. Third, it achieves good accuracy and is tolerant to outliers and noise. Fourth, KNN inherently supports multi-class classification, unlike perceptrons, logistic regression, and SVMs.

  • What are the disadvantages of the K nearest neighbor algorithm?

The disadvantages of KNN are that the basic algorithm performs a global search every time it classifies a point, so the computation is heavy for large data sets; it suffers from the curse of dimensionality, since distances computed in high-dimensional spaces become less and less meaningful; and when the classes are unbalanced, the prediction bias can be large. The choice of k depends on experience or on cross-validation (cross-validation or grid search can be used). A larger k gives a model with higher bias that is less sensitive to noisy data and may underfit; a smaller k gives a model with higher variance and can overfit.
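
As a small illustration (assuming scikit-learn is available; the dataset and parameter range are arbitrary), k can be chosen by grid search with cross-validation, with standardization fitted inside the pipeline:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": list(range(1, 16))}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)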

Two-dimensional Gaussian kernel function

If you were asked to write a Gaussian blur function, how would you write it?

import numpy as np

def gaussian_2d_kernel(kernel_size=3, sigma=0):
    # Returns a normalized kernel_size x kernel_size Gaussian kernel.
    kernel = np.zeros([kernel_size, kernel_size])
    center = kernel_size // 2

    # rule of thumb for sigma when it is not specified (same as OpenCV)
    if sigma == 0:
        sigma = ((kernel_size - 1) * 0.5 - 1) * 0.3 + 0.8

    s = 2 * (sigma ** 2)
    sum_val = 0
    for i in range(kernel_size):
        for j in range(kernel_size):
            x = i - center
            y = j - center
            kernel[i, j] = np.exp(-(x ** 2 + y ** 2) / s)
            sum_val += kernel[i, j]
            # the 1/(pi*s) factor is dropped because the kernel is normalized below
    sum_val = 1 / sum_val
    return kernel * sum_val
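
A quick sanity check of the function above: the kernel is normalized so its entries sum to 1.

kernel = gaussian_2d_kernel(5, 1.0)
print(kernel.shape, kernel.sum())   # (5, 5), sum ≈ 1.0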

Training sampling methods

  • Cross-validation

  • leave one out

  • Bootstrap: sampling with replacement, so the same sample may be drawn more than once (a minimal sketch follows this list)
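
A minimal numpy sketch of a single bootstrap draw (synthetic data):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)
sample = rng.choice(data, size=len(data), replace=True)   # sampling with replacement
print(sample)   # duplicates are possible; some items may never be drawn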

Principles, differences, and application scenarios of k-means and GMM

Convergence of kmeans?

  • You can see it here https://zhuanlan.zhihu.com/p/36331115

  • You can also see Baimian Machine Learning P93, P102

How to run k-means on multiple machines

The idea is as follows. First distribute the data to n machines and make sure every machine starts from the same k initial centers. After one local iteration you get k*n new cluster means, which are gathered on one machine. Because the initializations are identical, the clusters line up across machines, so for each of the k clusters you take a weighted average (weighted by the number of points assigned on each machine) of the n local means, and then send the result back to every machine for the next iteration.
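
A minimal numpy sketch of the aggregation step described above (the function name and shapes are illustrative): each machine reports, for every cluster, its local mean and the number of points assigned to it, and the global means are the count-weighted averages.

import numpy as np

def aggregate_kmeans_step(local_means, local_counts):
    # local_means:  list of (k, d) arrays, per-cluster means from each machine
    # local_counts: list of (k,)   arrays, points assigned per cluster on each machine
    means = np.stack(local_means)                      # (n_machines, k, d)
    counts = np.stack(local_counts).astype(float)      # (n_machines, k)
    totals = counts.sum(axis=0)                        # (k,)
    return (means * counts[..., None]).sum(axis=0) / totals[:, None]   # (k, d) global means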

KNN algorithm and process

Selection of K value:

  • If K is small, the model is more complex and prone to overfitting; the estimation error of learning increases and the prediction is very sensitive to the neighboring instance points.

  • A larger K reduces the estimation error of learning but increases the approximation error, because training instances far from the input also influence the prediction and can make it wrong; model complexity decreases as K grows.

  • In practice, K is usually set to a relatively small value, and cross-validation is typically used to select the optimal K.

The choice of K is very important to the classification result: if K is too small the model is too complex, and if K is too large the classification becomes blurred. So how do we choose K? Some people use cross-validation, some use Bayesian methods, and some use the bootstrap. Distance measurement is another issue; the Euclidean distance is the most common choice. But is this distance really universal? "Pattern Classification" points out that the Euclidean distance is sensitive to translation, which seriously affects the result. A distance metric that is insensitive to known transformations (such as translation, rotation, and scaling) should be chosen; the book proposes using the tangent distance in place of the traditional Euclidean distance.

Difference Between Unsupervised Learning and Supervised Learning

Supervised:

  • Perceptron

  • K nearest neighbor method

  • Naive Bayes

  • decision tree

  • logistic regression

  • Support Vector Machines

  • Boosting methods

  • Hidden Markov Model

  • conditional random field

Unsupervised:

  • clustering-kmeans

  • SVD singular value decomposition

  • PCA principal component analysis

Generative models: LDA, KNN, Gaussian mixture models, naive Bayes, hidden Markov models, deep belief networks. Discriminative models: SVM, neural networks, logistic regression, CRF, CART.

The difference between logistic regression and SVM

Logistic regression (LR) outputs, for each sample, the probability that it belongs to the positive class; this probability is obtained by mapping w^T x to [0,1] through the sigmoid function. When w^T x is very large (the sample can be seen as lying far from the decision boundary on the positive side), the positive-class probability approaches 1; when w^T x is very negative (far on the other side), it approaches 0. Beyond this, LR does nothing more with the "distance to the decision boundary": when solving for the parameters w, no notion of distance appears at all and all samples are treated equally. The difference from the perceptron is that LR uses this distance to attach a visible confidence to the prediction, whereas the perceptron only judges by the sign. SVM goes one step further: in solving for its parameters it discards the points that lie too far from the decision boundary. Both LR and the perceptron are prone to overfitting; it is SVM's structural risk minimization strategy (the added L2-norm term) that addresses overfitting. To sum up:

  • The concept of "distance" from the hyperplane is not introduced before and after the perceptron, it only cares whether it is on one side of the hyperplane;

  • LR introduces distance, but not when training the model to find its parameters; distance only appears at the final prediction stage, to represent the confidence of the classification;

  • SVM uses the concept of distance in two places: first, when computing the hyperplane parameters, it focuses only on points within a certain distance of the hyperplane and ignores the rest (the points it attends to are called "support vectors"); second, when predicting new samples, the distance represents confidence, as in LR.

Logistic regression itself only solves binary classification; softmax is used for multi-class classification. Related reference links:

  • https://blog.csdn.net/maymay_/article/details/80016175

  • https://blog.csdn.net/jfhdd/article/details/52319422

  • https://www.cnblogs.com/eilearn/p/9026851.html

Bagging, boosting, and boosted trees

  • Bagging reduces generalization error by combining several models: train several different models separately, then let all of them vote on the output for a test sample. Model averaging works because different models usually do not make exactly the same errors on the test set. Training sets are drawn from the original sample set: in each round, n training samples are drawn by bootstrapping (some samples may be drawn multiple times and some not at all), and after k rounds we obtain k training sets, which are independent of each other.

  • Bagging is a parallel learning algorithm. The idea is simple: each time, draw from the original data, with uniform probability and with replacement, a data set of the same size as the original (sample points may repeat); then build a classifier on each generated data set and combine the classifiers. For classification, the k models vote to produce the result; for regression, the mean of the models' outputs is taken as the final result.

  • Boosting is a family of algorithms that turn weak learners into a strong learner. The sample distribution used in each round is different: each iteration increases the weights of the samples misclassified in the previous iteration, so the model pays more attention to hard samples in later iterations. This process of continual learning and improvement is the essence of boosting. After the iterations, the base classifiers of each round are combined; how to adjust the sample weights and how to combine the classifiers are the key issues to consider.

The difference between Bagging and Boosting:

  • 1) Sample selection: Bagging draws each training set from the original set with replacement, and the training sets of different rounds are independent. Boosting uses the same training set in every round, but the weight of each sample changes, adjusted according to the classification results of the previous round.

  • 2) Sample weights: Bagging uses uniform sampling, so every sample has equal weight. Boosting continually adjusts sample weights according to the error rate: the larger the error, the larger the weight.

  • 3) Prediction functions: in Bagging, all prediction functions have equal weight. In Boosting, each weak classifier has its own weight, and classifiers with smaller error get larger weights.

  • 4) Parallelism: in Bagging the prediction functions can be generated in parallel; in Boosting they can only be generated sequentially, because each model's parameters depend on the results of the previous round.

Bagging is short for Bootstrap Aggregating: resample the data (bootstrap) and average the models trained on each sample, which reduces the variance of the model; bagging methods such as random forest have this effect. Boosting is an iterative algorithm: each iteration reweights the samples according to the predictions of the previous iteration, so the error keeps shrinking and the bias of the model keeps decreasing. High variance means the model is too complex and overfits, memorizing too many details and noise and being strongly affected by outliers; high bias means underfitting, with a model that is too simple and a cost function that is not driven low enough. Boosting combines many weak classifiers into a strong one: the weak classifiers have high bias, the strong classifier has low bias, so boosting reduces bias, and variance is not its main concern. Bagging averages many strong (even too strong) classifiers: each individual classifier already has low bias, and averaging keeps the bias low; each individual classifier is strong enough to overfit (high variance), and the averaging operation is what reduces this variance. A representative bagging algorithm is random forest; points to note:

  • No pruning is required during the construction of the decision tree.

  • The number of trees in the forest and the number of features used by each tree need to be set manually.

  • When building a decision tree, the selection of split nodes is based on the minimum Gini coefficient.

In the random forest chapter of our upgraded machine learning course, I wrote this formula on the whiteboard: p = 1 - (1 - 1/N)^N. It is the probability that a given sample is selected at least once when building one tree's bootstrap training set; when N is large enough it approaches 1 - 1/e ≈ 63.2%. In short, about 63.2% of the distinct samples appear in a given bootstrap training set, and the remaining 36.8% may not appear in it at all. A random forest is a classifier consisting of multiple decision trees, and its output class is the mode of the classes output by the individual trees. The randomness of a random forest lies in the fact that the training samples of each tree are drawn randomly and the set of candidate split attributes at each node is also chosen randomly; with these two sources of randomness, the random forest does not easily overfit. In other words, a random forest is a forest built in a random way, composed of many decision trees, where both the samples assigned to each tree and the split-attribute sets at each node are random.
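
The 63.2% figure is easy to check numerically (a minimal sketch with an arbitrary N):

import numpy as np

N = 10000
print(1 - (1 - 1 / N) ** N, 1 - np.exp(-1))   # both ≈ 0.632

rng = np.random.default_rng(0)
idx = rng.integers(0, N, size=N)              # one bootstrap draw of N samples
print(len(np.unique(idx)) / N)                # fraction of distinct samples, also ≈ 0.632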

SVM

In addition to cs231n, related notebooks can also be viewed here.

https://momodel.cn/workspace/5d37bb9b1afd94458f84a521?type=module

Convex sets, convex functions, convex optimization

This comes up relatively rarely in interviews; those interested can look at:

https://blog.csdn.net/feilong_csdn/article/details/83476277

Why image segmentation in deep learning needs to be encoded first and then decoded

Downsampling is a means not an end:

  • Reduce the amount of video memory and calculation. The smaller the image is, the smaller the video memory is, and the amount of calculation is also small;

  • Increase the receptive field, so that the same 3x3 convolution extracts features over a larger region of the image. A large receptive field is very important for segmentation; with a small receptive field, multi-class segmentation is impossible and the result is very coarse

  • There are several sub-sampling branches with different degrees, which can facilitate the fusion of multi-scale features. Multi-level semantic fusion will make the classification more accurate.

As for the theoretical significance of downsampling, briefly: it increases robustness to small perturbations of the input image such as translation and rotation, reduces the risk of overfitting, reduces computation, and increases the receptive field. Related link: Why does image segmentation in deep learning need to encode first and then decode?

The difference between (global) average pooling and (global) max pooling

  • Max pooling preserves texture features

  • Average pooling preserves the overall data characteristics

  • Global average pooling has a positioning effect (see Zhihu)

Max pooling extracts the "most important" features such as edges, while average pooling extracts features more smoothly; with image data you can see the difference. Although both are used for the same reason, max pooling is arguably better at extracting extreme features, while average pooling sometimes fails to extract good features because it averages everything, which may not work well for tasks like object detection. One motivation for average pooling is to have a detector for each spatial location and then average over spatial locations, which behaves like averaging the predictions over different translations of the input image (somewhat like data augmentation). Instead of using traditional fully connected layers for classification, networks such as NIN and ResNet output the spatial average of the feature maps from the last convolutional layer via a global average pooling layer as the class confidence, and feed the resulting vector to softmax. Compared with fully connected layers, global average pooling is more meaningful and interpretable, since it enforces a correspondence between feature maps and categories, made possible by the stronger local modeling of the network. In addition, fully connected layers are prone to overfitting and rely heavily on dropout, while global average pooling itself acts as a regularizer and prevents overfitting of the overall structure.
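
A minimal PyTorch sketch of a global-average-pooling classification head of the kind described above (shapes and layer sizes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(1, 512, 7, 7)                    # final feature map of a CNN
gap = nn.AdaptiveAvgPool2d(1)(x).flatten(1)      # global average pooling -> (1, 512)
gmp = nn.AdaptiveMaxPool2d(1)(x).flatten(1)      # global max pooling     -> (1, 512)
logits = nn.Linear(512, 10)(gap)                 # GAP feeding a small classifier
print(gap.shape, gmp.shape, logits.shape)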

  • https://zhuanlan.zhihu.com/p/42384808

  • https://www.zhihu.com/question/335595503/answer/778307744

  • https://www.zhihu.com/question/309713971/answer/578634764

The role of fully connected layers, and their relationship with 1x1 convolution

In practice, a fully connected layer can be implemented by convolution: a fully connected layer whose previous layer is also fully connected can be converted into a convolution with 1x1 kernels; a fully connected layer whose previous layer is convolutional can be converted into a global convolution whose kernel is h x w, where h and w are the height and width of the previous layer's output. Global average pooling can also be used in its place.

  • Fully connected (FC) layers act as the "classifier" of a convolutional network. If the convolutional, pooling, and activation layers map the raw data into a hidden feature space, the fully connected layer maps the learned "distributed feature representation" to the sample label space.

Then, the main functions of 1*1 convolution are as follows:

  • Dimensionality reduction. For example, convolving a 500x500 feature map with depth 100 with 20 1x1 filters yields an output of size 500x500x20.

  • Adding nonlinearity. The 1x1 convolution, followed by the activation layer, adds a nonlinear activation on top of the previous layer's representation and improves the expressive power of the network. Put differently, it turns a simple linear transformation of the feature maps into nonlinear combinations across feature maps, achieving a higher level of feature abstraction, rather than merely adding another activation function.

  • Personally, I see its main uses as reducing or increasing the channel dimension to cut the number of parameters, increasing network depth, and aggregating features across channels.

  • It can replace a fully connected layer (a minimal sketch of this equivalence follows this list)
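
A minimal PyTorch sketch of the equivalence mentioned above: on a 1x1 feature map, a 1x1 convolution with the same weights computes exactly what the fully connected layer computes.

import torch
import torch.nn as nn

x = torch.randn(2, 512, 1, 1)                        # 1x1 spatial feature map with 512 channels
fc = nn.Linear(512, 10)
conv1x1 = nn.Conv2d(512, 10, kernel_size=1)
conv1x1.weight.data = fc.weight.data.view(10, 512, 1, 1).clone()   # share the same weights
conv1x1.bias.data = fc.bias.data.clone()
out_fc = fc(x.flatten(1))
out_conv = conv1x1(x).flatten(1)
print(torch.allclose(out_fc, out_conv, atol=1e-6))   # True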

See the answer to this question https://www.zhihu.com/question/56024942/answer/369745892 

See the answer to this question https://www.zhihu.com/question/41037974/answer/150522307

The difference between concat and add(sum)

For two inputs with the same number of channels followed by a convolution, add is equivalent to concat followed by a convolution whose kernels are shared between the corresponding channels. In detail: since the kernels of each output channel are independent, consider a single output channel, and let the channels of the two inputs be X1, X2, ..., Xc and Y1, Y2, ..., Yc. A single output channel after concat is (* denotes convolution):

Z_concat = X1*K1 + X2*K2 + ... + Xc*Kc + Y1*K(c+1) + Y2*K(c+2) + ... + Yc*K(2c)

while a single output channel after add is:

Z_add = (X1+Y1)*K1 + (X2+Y2)*K2 + ... + (Xc+Yc)*Kc = X1*K1 + Y1*K1 + ... + Xc*Kc + Yc*Kc

Therefore, add amounts to adding a prior: when the two inputs can be assumed to have "semantically similar feature maps in corresponding channels" (perhaps not rigorous), add can replace concat, saving parameters and computation (the convolution after a concat costs roughly twice as much as after an add). The pyramid in FPN [1] upsamples the lowest-resolution but semantically strongest feature maps and merges them in, which is by nature addition; using concat instead would add a lot of computation, because the small-resolution features have more channels. https://www.zhihu.com/question/306213462/answer/562776112
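
A minimal PyTorch check of the identity above: convolving the concatenation with duplicated (shared) kernels gives the same result as convolving the sum (shapes are arbitrary).

import torch
import torch.nn.functional as F

c = 4
X = torch.randn(1, c, 8, 8)
Y = torch.randn(1, c, 8, 8)
K = torch.randn(1, c, 3, 3)                          # one output channel

out_add = F.conv2d(X + Y, K, padding=1)
out_concat = F.conv2d(torch.cat([X, Y], dim=1),
                      torch.cat([K, K], dim=1),      # shared kernels for the two halves
                      padding=1)
print(torch.allclose(out_add, out_concat, atol=1e-5))   # True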

  • Changing concat to sum indeed works noticeably better in some cases. Both are feature fusion; what is the essential difference? When I use them there is no principle, I just try both (in fact I prefer sum, since it saves GPU memory).

  • I have done an ASPP-like experiment before, fusing pyramid-shaped dilated convolutions, and the final result was better with sum than with concat, though I don't know how to explain why.

  • I have read some papers where concat is better than sum; it may depend on the specific data set.

  • What is the point of summing different features? The features get mixed together and lost; directly concatenating them and letting the later layers learn should be better, since more of the features are used.

How to change SSD to FasterRCNN

SSD classifies directly into object classes, while Faster R-CNN first judges whether a region is background and then classifies it: one does fine-grained classification directly, the other does coarse classification followed by fine classification.

The principle of backpropagation

For the principle of backpropagation, see the BP derivation in CS231n and how the Jacobian propagates.

The difference between GD, SGD, and mini batch GD

There is a corresponding chapter in Baimian Deep Learning.

Bias and variance

There is an article that has a good introduction, and it is also in the electronic version of CNNbook.

  • http://scott.fortmann-roe.com/docs/BiasVariance.html

  • The generalization error can be decomposed into bias squared + variance + noise

  • Bias measures the deviation between the learning algorithm's expected prediction and the true result, and characterizes the fitting ability of the algorithm itself

  • Variance measures the change in performance caused by changes in a training set of the same size, and characterizes the effect of data perturbations

  • Noise expresses the lower bound of the expected generalization error achievable by any algorithm on the current task, and characterizes the difficulty of the problem itself

  • Generally, the more thoroughly a model is trained, the smaller the bias and the larger the variance; the generalization error has a minimum somewhere in between

  • If bias is large and variance small, the model is underfitting; if bias is small and variance large, it is overfitting

Why does the gradient explode and how to prevent it

The loss surface of a multilayer neural network often contains cliff-like structures, caused by the product of several large weights. When a gradient update meets such a steep cliff, the parameters can change drastically, often jumping over the cliff structure entirely; gradient clipping (see the section below) is the usual remedy. See the flower book, p. 177.

Distributed training, multi-card training

http://ai.51cto.com/art/201710/555389.htm 

https://blog.csdn.net/xs11222211/article/details/82931120#commentBox

Precision and recall and PR curve

This is better (TP and FP and ROC curve):

  • https://segmentfault.com/a/1190000014829322

Precision is the ratio of correctly classified positive samples to the number of samples the classifier declares positive. Recall is the ratio of correctly classified positive samples to the number of truly positive samples. Precision and recall are two contradictory yet unified metrics: to raise precision, the classifier should only predict positive when it is "more confident", but being too conservative then misses many "uncertain" positives and lowers recall. To trade the two off, criteria such as the PR curve, the ROC curve, and the F1 score are used. https://www.cnblogs.com/xuexuefirst/p/8858274.html
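
A minimal numpy sketch of the definitions above, with made-up labels:

import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)                    # correct positives / predicted positives
recall = tp / (tp + fn)                       # correct positives / true positives
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)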

In YOLOv2, thanks to anchor boxes, recall improves greatly compared with YOLOv1, while mAP drops slightly (by about 0.2 points).

https://segmentfault.com/a/1190000014829322 

https://www.cnblogs.com/eilearn/p/9071440.html 

https://blog.csdn.net/zdh2010xyz/article/details/54293298

dilated convolution

Dilated (atrous) convolution usually comes with padding; for a 3x3 kernel, if dilation=6 then padding is also set to 6, so the feature-map size after the dilated convolution is unchanged, while its receptive field is larger than that of an ordinary convolution of the same size. The number of channels can still change.

  • In DeepLabv3+, the final ASPP module applies one 1x1 convolution and three 3x3 dilated convolutions, then concatenates them with a global-average-pooled feature map that is bilinearly upsampled back to the same spatial size.

Note, however, that although the dilated convolution itself does not increase the amount of computation, the resolution of subsequent layers is no longer reduced, so the computation of the following layers indirectly increases. https://zhuanlan.zhihu.com/p/52476083
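
A minimal PyTorch sketch of the size bookkeeping described above (input size and channel counts are arbitrary): with a 3x3 kernel, dilation=6 and padding=6 preserve the spatial size while the channels can change.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)
conv = nn.Conv2d(64, 128, kernel_size=3, dilation=6, padding=6)
print(conv(x).shape)   # torch.Size([1, 128, 65, 65])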

What to do if the data is not good, how to deal with unbalanced data, and how to deal with only a small amount of labels

Analyze specific issues.

What to do about overfitting during training

  • Deep Learning - General Model Debugging Tips

  • How to diagnose our CNN based on training/validation loss curves

  • Many tips about training neural networks Tricks (full summary version)

  • What kind of experience is a small data set in deep learning

If the actual capacity of the model is relatively large, the model can in principle fit the entire data set, and overfitting will occur; at that point, adding new data still improves performance, which shows that the model's capacity has not been exhausted. The expected risk is the model's expected loss with respect to the joint distribution, while the empirical risk is its average loss on the training data. By the law of large numbers, as the sample size N tends to infinity the empirical risk converges to the expected risk. But when the sample size is small, empirical risk minimization may not work well and "overfitting" occurs. Structural risk minimization is a strategy proposed precisely to prevent overfitting.

https://lilianweng.github.io/lil-log/2019/03/14/are-deep-neural-networks-dramatically-overfitted.html 

https://www.jianshu.com/p/97aafe479fa1 (important)

Regularization

In PyTorch, weight decay can only be set in the optimizer (optim); currently it implements only L2 regularization, and it is applied to all parameters passed to the optimizer, whether weights or biases, including the weights and biases of BN layers.
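
A minimal sketch (the model and hyperparameters are arbitrary) of applying weight decay only to the convolution/linear weights by putting biases and BN parameters into a separate parameter group:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

decay, no_decay = [], []
for name, p in model.named_parameters():
    # biases and BN affine parameters are 1-D tensors; exclude them from weight decay
    (no_decay if p.ndim == 1 else decay).append(p)

optimizer = torch.optim.SGD([{"params": decay, "weight_decay": 1e-4},
                             {"params": no_decay, "weight_decay": 0.0}],
                            lr=0.1, momentum=0.9)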

What are the consequences of BN layer and L2 regularization together

After batch normalization, the scale of the weights no longer matters much, so L2 weight decay loses its apparent effect. It can be shown that, combined with normalization, L2 regularization has no direct regularizing effect; instead it only constrains the range of the weights and thereby changes the effective learning rate.

https://www.cnblogs.com/makefile/p/batch-norm.html?utm_source=debugrun&utm_medium=referral

The difference between ROIPooling and ROIAlign

Spatial pyramid pooling (SPP) lets images of different sizes produce outputs of a fixed dimension. A side question: why is the RoI pooling in Fast R-CNN a max pooling? After RoI pooling, each RoI is classified, so why is this pooling not the same as the pooling used for classification? My intuition, looking at one channel of the feature map: to extract global features (e.g. for classification), average pooling gathers global information; to extract local features (e.g. RoI pooling), max pooling should be used to pick out the most salient local features, and after the RoI becomes a 7x7 grid it is handed to the following fc layers for classification. Related introduction:

  • SPPNet-introducing spatial pyramid pooling to improve RCNN

Implement image enhancement algorithm yourself

https://zhuanlan.zhihu.com/p/71231560

Image classification tricks

  • Amazon: Tricks for image classification with CNN (https://mp.weixin.qq.com/s/e4m_LhtqoUiGJMQfEZHcRA)

Ablation experiment

Because the author proposed a scheme that changes several conditions/parameters at once, in the ablation experiments he then keeps all but one condition/parameter fixed in turn and looks at the results, to see which condition/parameter has the larger effect on the outcome. The following analogy is quoted from Zhihu (@人民艺术): a friend says you look handsome today, and you want to know how much your hairstyle, shirt, and trousers contribute. So you change a few hairstyles, and the friend says you are still handsome; you change your shirt, and the friend says you are no longer handsome — it seems the shirt matters a lot.

Manual NMS and soft-NMS

https://oldpan.me/archives/write-hard-nms-c

Logistic Regression and Linear Regression

Linear regression: find the optimal parameters by minimizing the mean squared error J(w) = (1/2m) * Σ_i (w^T x_i - y_i)^2, estimated either with the closed-form least squares solution or by gradient descent.

The prototype of logistic regression is log-odds regression: logistic regression and log-odds regression are the same model and can be obtained from each other by a simple transformation. Logistic regression estimates its parameters by maximum likelihood. Brief summary:
  • Both linear regression and logistic regression are special cases of generalized linear regression models

  • Linear regression can only be used for regression problems, while logistic regression is used for classification (and extends from binary to multi-class)

  • Linear regression has no link function (or, equivalently, an identity link); the link function of logistic regression is the log-odds (logit) function, whose inverse is the sigmoid

  • Linear regression uses least squares for parameter estimation, while logistic regression uses maximum likelihood

  • Both can be trained with gradient descent (a minimal sketch contrasting the two estimation methods follows this list)
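
A minimal numpy sketch on synthetic data: closed-form least squares for linear regression versus gradient ascent on the log-likelihood for logistic regression.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# linear regression: least squares in closed form
y_reg = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)
w_ls, *_ = np.linalg.lstsq(X, y_reg, rcond=None)

# logistic regression: maximum likelihood fitted by gradient updates
y_cls = (X @ np.array([2.0, -1.0]) > 0).astype(float)
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))                     # sigmoid
    w += 0.1 * X.T @ (y_cls - p) / len(y_cls)        # gradient of the log-likelihood

print(w_ls, w)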

Notice:

  • The gradient descent method of linear regression is actually the same as our training neural network. First, the parameters need to be initialized, and then the parameters are updated by stochastic gradient descent: https://zhuanlan.zhihu.com/p/33992985

  • Linear regression and least square method: https://zhuanlan.zhihu.com/p/36910496

  • Maximum Likelihood https://zhuanlan.zhihu.com/p/33349381

Source article:

  • https://segmentfault.com/a/1190000014807779

  • https://zhuanlan.zhihu.com/p/39363869

  • https://blog.csdn.net/hahaha_2017/article/details/81066673

For convex functions, the local optimum is the global optimum, related link: http://sofasofa.io/forum_main_post.php?postid=1000329 

http://sofasofa.io/forum_main_post.php?postid=1000322 (logistic classification with cross-entropy)

What is attention and what types are there

https://zhuanlan.zhihu.com/p/61440116 

https://www.zhihu.com/question/65044831/answer/227262160

Linear and Nonlinear in Deep Learning

  • Convolution is linear

  • The activation function is non-linear

The problem of gradient disappearance and gradient explosion

[Figure: notes on vanishing and exploding gradients]

The role of the Batch-norm layer

A must-read, or you will fall into the pit: Batch Normalization pitfalls that you will hit in both training and deployment.

If the batch size is too small, the loss curve oscillates a lot. The batch size is usually chosen as a power of 2. Why? I did not answer this; the interviewer later explained it is for hardware computing efficiency, and Haige later said the number of threads launched during GPU training is a power of 2. The essence of a neural network is to learn the distribution of the data; if the training data and the test data have different distributions, the network's generalization ability drops sharply. Moreover, as training progresses, changes in each hidden layer shift the inputs of the following layer, so the distribution of each batch keeps changing and the network has to fit a different distribution at every iteration, which increases the difficulty of training and the risk of overfitting.


BN makes the network resistant to drastic changes in the data distribution.


Note that in convolutional layers, because of the parameter-sharing mechanism, the parameters of each kernel are shared across spatial positions, so normalization is also done per channel across those positions (see the implementation for details): https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/batch_norm_layer.html. If the batch size during training is small, however, BN may be better left out (as noted in the Mask R-CNN work). To summarize the theory and practice of Batch Normalization: BN normalizes the input of every layer so that the mean and variance of the input distribution stay within a fixed range, which reduces the Internal Covariate Shift problem, alleviates vanishing gradients to some extent, and speeds up convergence; BN also makes the network more robust to parameter scales and activation functions, reducing the effort of training and tuning; finally, because the mini-batch mean/variance are used during training as estimates of the population statistics, random noise is introduced, which to some extent acts as a regularizer on the model.
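
A minimal numpy sketch of the training-time BN computation described above (per-channel statistics over batch and spatial positions; running statistics and the backward pass are omitted):

import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); normalize each channel with batch statistics, then scale and shift
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

out = batchnorm_train(np.random.randn(8, 16, 4, 4), np.ones(16), np.zeros(16))
print(out.mean(), out.std())   # ≈ 0 and ≈ 1 after normalization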

  • https://zhuanlan.zhihu.com/p/34879333

The relationship between BN and Bayesian:

  • Analysis of Batch Normalization from the perspective of Bayesian

How to ensure the same mean and var in BN cross-card training

In practice, synchronizing BN across cards helps performance quite a bit, especially for detection and segmentation tasks, where the batch size is small to begin with. Synchronizing Batch Norm across cards is equivalent to enlarging the batch size used by BN, so the mean and variance are estimated more accurately, which improves performance.

How to implement SyncBN

The key to cross-card BN is to obtain the global mean and variance in the forward pass and the corresponding global gradients in the backward pass. The simplest implementation synchronizes the mean first, sends it back to every card, and then synchronizes the variance, but that requires two synchronizations. In fact one is enough: using the identity Var[x] = E[x^2] - (E[x])^2, each card only needs to compute its local sum and sum of squares in the forward pass, and a single cross-card reduction then yields the correct global mean and variance. Similarly, only one synchronization is needed in the backward pass to obtain the corresponding global gradients. We also describe this synchronization method in our recent paper Context Encoding for Semantic Segmentation. With cross-card BN we no longer need to worry that training a large model on many cards will hurt convergence, because no matter how many cards are used, as long as the global batch size is the same, the result is the same.
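
A minimal numpy sketch of the single-synchronization trick (per-card tensors are simulated as plain arrays): each card sends only its count, sum, and sum of squares, and the global mean and variance follow from Var[x] = E[x^2] - (E[x])^2.

import numpy as np

def local_stats(x):                       # one card's activations for one channel
    return np.array([x.size, x.sum(), (x ** 2).sum()])

def global_mean_var(stats_list):          # one all-reduce over the stacked statistics
    n, s, sq = np.sum(stats_list, axis=0)
    mean = s / n
    return mean, sq / n - mean ** 2

cards = [np.random.randn(32, 16, 16) for _ in range(4)]
print(global_mean_var([local_stats(x) for x in cards]))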

Why ResNet is easy to use

Appearing factors:

  • With the deepening of the network, the optimization function is more and more trapped in the local optimal solution

  • As the number of network layers increases, the problem of gradient disappearance becomes more serious, because the gradient will gradually decay during backpropagation

The reason is that the error-propagation formula can be written as a product of the parameters W and the derivatives F. When the error propagates from the L-th layer back to the first hidden layer, many parameters and derivatives are multiplied together, so the error easily vanishes or explodes, making learning difficult and giving poor fitting and generalization ability. The F layers in a residual block only need to fit the residual H(x) - x between the target output H(x) and the input x. If some layer's output already fits the desired result well, adding one more block will not make the model worse, because the block's input is passed directly to its output through the shortcut; this amounts to learning an identity mapping, and the skipped layers only need to fit the residual between the previous layer's output and the target.

  • https://zhuanlan.zhihu.com/p/42706477

  • https://zhuanlan.zhihu.com/p/31852747


Disadvantages of Resnet

In fact, ResNet does not truly eliminate vanishing gradients; it contains strong prior assumptions. The layers that really do the work are mainly in the middle, while the deep layers contribute relatively little (they are close to identity mappings), so features are under-utilized; and the add operation can hinder the flow of gradients and information.

L1 norm and L2 norm application scenarios

L1 regularization drives most weights to exactly zero while keeping a few large, producing sparse weights; L2 regularization shrinks weights toward zero without making them exactly zero, producing smooth weights. https://zhuanlan.zhihu.com/p/35356992

Network weight initialization methods and their formulas

Current weight-initialization methods mainly fall into the following categories:

  • Set all to 0 - almost never used

  • Random initialization (uniform random, normal distribution)

  • Xavier initialization: Glorot, its author, argues that a good initialization should keep the variances of the activations and of the state gradients consistent across layers during propagation. It suits sigmoid, but not ReLU.

  • He initialization is suitable for Relu.

Initialization, to put it bluntly, constructs a smooth local geometry so that optimization becomes easier. Analysis of the Xavier distribution:

  • https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/

Assume the sigmoid activation is used. When the weights are too small (in absolute value), the variance of the signal shrinks every time it passes through a layer; the weighted sum at each layer is then tiny and lands in the near-linear region of the sigmoid around 0, so the network loses its nonlinear power. When the weights are too large, the variance of the signal grows rapidly layer by layer, the layer outputs become very large, and the gradients of the sigmoid approach 0. Xavier initialization keeps the variance of the output y of a layer equal to the variance of its input x.

  • https://blog.csdn.net/winycg/article/details/86649832

  • https://zhuanlan.zhihu.com/p/57454669

The default weight initialization method in pytorch is He Kaiming's, for example:

Initialization of weights in resnet

for m in self.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)


# Zero-initialize the last BN in each residual branch,
# so that the residual branch starts with zeros, and each residual block behaves like an identity.
# This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
if zero_init_residual:
    for m in self.modules():
        if isinstance(m, Bottleneck):
            nn.init.constant_(m.bn3.weight, 0)
        elif isinstance(m, BasicBlock):
            nn.init.constant_(m.bn2.weight, 0)

Solving for Model Parameter Quantities

def model_info(model):  # Plots a line-by-line description of a PyTorch model
    n_p = sum(x.numel() for x in model.parameters())  # number parameters
    n_g = sum(x.numel() for x in model.parameters() if x.requires_grad)  # number gradients
    print('\n%5s %50s %9s %12s %20s %12s %12s' % ('layer', 'name', 'gradient', 'parameters', 'shape', 'mu', 'sigma'))
    for i, (name, p) in enumerate(model.named_parameters()):
        name = name.replace('module_list.', '')
        print('%5g %50s %9s %12g %20s %12.3g %12.3g' % (
            i, name, p.requires_grad, p.numel(), list(p.shape), p.mean(), p.std()))
    print('Model Summary: %d layers, %d parameters, %d gradients' % (i + 1, n_p, n_g))
    print('Model Size: %f MB parameters, %f MB gradients\n' % (n_p*4/1e6, n_g*4/1e6))

Convolution calculation amount

You should be able to work out all of these:

  • Ordinary convolution

  • separable convolution

  • full connection

  • point convolution

You can also read Lao Pan's article on this topic.
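
Separately from that article, here is a minimal sketch of the bookkeeping (bias terms ignored, stride 1, 'same' output size assumed) for an ordinary convolution versus a depthwise-separable one:

def conv_cost(h, w, c_in, c_out, k):
    ordinary_params = k * k * c_in * c_out
    ordinary_macs = ordinary_params * h * w                  # multiply-accumulates
    separable_params = k * k * c_in + c_in * c_out           # depthwise + 1x1 pointwise
    separable_macs = separable_params * h * w
    return ordinary_params, ordinary_macs, separable_params, separable_macs

print(conv_cost(56, 56, 128, 256, 3))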

Multi-label and multi-classification

So, how to use softmax and sigmoid for multi-class classification and multi-label classification?

1. Multi-class and multi-label classification with softmax. Suppose the final output of the network is the vector logits = [1, 2, 3, 4], i.e. the output of the last fully connected layer, and assume there are 4 classes in total. Multi-class classification with softmax: tf.argmax(tf.softmax(logits)) first converts the logits into a probability distribution with softmax and then takes the class with the largest probability as the prediction. It looks as if tf.argmax(logits) alone would give the same class, so softmax seems to matter little; in fact the main role of softmax is in computing the cross entropy. The label y in the training set is a one-hot vector; if the raw logits were used directly against y, the computed cross entropy would be very large and wrong (logits = [1, 2, 3, 4]), so the logits must first be converted into a probability distribution, i.e. the cross entropy is computed between tf.softmax(logits) and y. Of course we can also call TensorFlow's softmax_cross_entropy_with_logits directly and pass in the raw logits, because, as the name suggests, the method applies softmax internally. Taking the single largest probability as the result gives multi-class classification; taking the top few probabilities, or all classes whose probability exceeds a threshold, gives multi-label classification with softmax.

2. Multi-label classification with sigmoid. Sigmoid is normally used for binary rather than multi-class classification: it maps a scalar to [0, 1], and if the value exceeds a probability threshold (usually 0.5) the sample is judged to belong to the class, otherwise not. For multi-label classification, a sigmoid is applied to each entry of the logits separately to decide whether the sample belongs to that class. Again suppose the final output is logits = [1, 2, 3, 4] with 4 classes: tf.sigmoid(logits) turns every entry into a probability in [0, 1], say [0.01, 0.05, 0.4, 0.6]; with a probability threshold of, e.g., 0.3, any class whose probability exceeds 0.3 is predicted, so here the sample is judged to belong to both class 3 and class 4.
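
The same two recipes in a minimal numpy sketch (the logits and the 0.9 threshold are arbitrary):

import numpy as np

logits = np.array([1.0, 2.0, 3.0, 4.0])

# multi-class: softmax -> probability distribution -> argmax
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.argmax())                 # predicted class: 3

# multi-label: per-class sigmoid -> threshold
sig = 1 / (1 + np.exp(-logits))
print(np.where(sig > 0.9)[0])         # every class whose probability exceeds the threshold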

Why should the data input be normalized?

To eliminate the effect of different feature scales. In practice, models solved by gradient descent, such as linear regression, logistic regression, support vector machines, and neural networks, usually require normalized inputs, whereas decision-tree models do not really need it.

Why is Naive Bayes high bias and low variance?

First, assume you know the relationship between the training set and the test set. To put it simply, we need to learn a model on the training set, and then use it on the test set. Whether the effect is good or not depends on the error rate of the test set. But in many cases, we can only assume that the test set and the training set conform to the same data distribution, but we cannot get the real test data. At this time, how to measure the test error rate when only seeing the training error rate?

Since training samples are few (or at least not enough), the model learned from the training set is not always truly correct. (Even 100% accuracy on the training set does not mean the model captures the real data distribution; capturing the real distribution is our goal, not merely fitting the finite points of the training set.) Moreover, real training samples usually contain some noise, so if we pursue perfection on the training set with an overly complex model, the model will treat the noise in the training set as genuine structure of the data distribution and mis-estimate it; it then performs poorly on the real test set (this phenomenon is overfitting). But we cannot use a model that is too simple either, otherwise when the data distribution is complex the model cannot describe it (even the training error is high; this phenomenon is underfitting). Overfitting means the model is more complex than the true data distribution; underfitting means it is simpler than the true data distribution.

Under the framework of statistical learning, when we describe the complexity of the model, we have such a view that Error = Bias + Variance. Error here can probably be understood as the prediction error rate of the model, which is composed of two parts, one part is the inaccurate part (Bias) caused by the model being too simple, and the other part is caused by the model being too complex Greater room for change and uncertainty (Variance).

Therefore, Naive Bayes is easy to analyze in this way: it naively assumes that the features are independent of each other, which is a severely simplified model. For such a simple model, the Bias part is larger than the Variance part in most cases, that is, high bias and low variance.

In practice, in order to make the Error as small as possible, we need to balance the proportion of Bias and Variance when selecting a model, that is, balance over-fitting and under-fitting.

Canny edge detection, what are the edge detection algorithms

https://zhuanlan.zhihu.com/p/42122107 

https://zhuanlan.zhihu.com/p/59640437

Traditional Object Detection

Traditional target detection is generally divided into the following steps:

  • Region selection: the image is first over-segmented (clustered) by selective search, then similar adjacent regions are merged by computing their similarity until about 2000 candidate boxes are obtained; positives and negatives are assigned by comparing them with the ground truth.

  • Feature extraction: the ~2000 proposals are converted into feature vectors with SIFT or other hand-crafted descriptors.

  • Classification: the feature vectors are fed into SVMs for training, with one classifier trained per class.

Classic structure:

  • HoG + SVM

Disadvantages of traditional methods:

  • The region selection strategy based on the sliding window is not targeted, the time complexity is high, and the window is redundant

  • Handcrafted features are not very robust to changes in environmental diversity

Erosion and dilation, opening and closing

You can read the relevant content in the third edition of OpenCV, search for erode, dilation

some filters

  • https://blog.csdn.net/qq_22904277/article/details/53316415

  • https://www.jianshu.com/p/fbe8c24af108

  • https://blog.csdn.net/qq_22904277/article/details/53316415

  • https://blog.csdn.net/nima1994/article/details/79776802

  • https://blog.csdn.net/jiang_ming_/article/details/82594261

  • High-frequency and low-frequency information in the image and high-pass filter, low-pass filter

  • In an image, pixels with sharp changes, such as edges, carry the high-frequency information; content whose pixel values change gently, away from the edges, is the low-frequency information.

  • A high-pass filter highlights sharp changes (edges) and removes the low-frequency part, i.e. it acts as an edge extractor. A low-pass filter mainly smooths pixel intensities and is used for denoising and blurring; Gaussian blur is one of the most commonly used blur (smoothing) filters, a low-pass filter that attenuates high-frequency components.

Resize bilinear interpolation

When fusing features in a network, bilinear interpolation is often preferable to transposed convolution, because a big problem with transposed convolution is that, if its parameters are not configured properly, obvious checkerboard artifacts appear in the output feature map.

  • Note that nearest-neighbor interpolation generally gives the worst quality.

Bilinear interpolation is also divided into two categories:

  • align_corners=True

  • align_corners=False

Generally speaking, align_corners=True maps the corner pixels of the input exactly onto the corner pixels of the output, so the edges stay aligned, while align_corners=False treats pixels as unit areas, so the sampling grid extends slightly beyond the corners. This post explains it well (a small PyTorch sketch also follows the links):

  • https://blog.csdn.net/qq_37577735/article/details/80041586

Explanation of code implementation:

  • https://blog.csdn.net/love_image_xie/article/details/87969405

  • https://www.zhihu.com/question/328891283/answer/717113611
  • See the image comparison here: https://discuss.pytorch.org/t/what-we-should-use-align-corners-false/22663
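A minimal PyTorch check of the two modes (F.interpolate is the standard API; the tensor values are just an illustration):

```python
import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

up_true  = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)
up_false = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

# align_corners=True : input corner pixels map exactly onto output corner pixels
# align_corners=False: pixels are treated as 1x1 areas, so corner values differ slightly
print(up_true[0, 0, 0, :3])
print(up_false[0, 0, 0, :3])
```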

Gradient clipping

Gradient clipping is an improvement made to avoid gradient explosion, and it should be distinguished from early stopping. (Early stopping is a regularization method: when training a large model with enough capacity to overfit, the training error keeps decreasing over time while the validation error eventually rises again, so we simply return the parameter setting that gave the lowest validation error.) The first clipping method is easy to understand: set a range such as (-1, 1), clip any gradient component below -1 to -1, and clip any component above 1 to 1. The other common method rescales the whole gradient vector whenever its norm exceeds a threshold (see the PyTorch sketch after the reference link).

  • https://wulc.me/2018/05/01/%E6%A2%AF%E5%BA%A6%E8%A3%81%E5%89%AA%E5%8F%8A%E5%85%B6%E4%BD%9C%E7%94%A8/
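A minimal PyTorch sketch of both clipping styles (the tiny nn.Linear model, dummy loss, and thresholds are placeholders, not a recipe):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 10)).pow(2).mean()
optimizer.zero_grad()
loss.backward()

# method 1: clip every gradient component into [-1, 1]
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
# method 2: rescale the whole gradient if its global L2 norm exceeds 5
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)

optimizer.step()
```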

Implement a simple convolution

Convolution is generally implemented via im2col, but in an interview a simple sliding-window implementation is enough, e.g., applying a 3x3 kernel (filter) directly. The CPU-side convolution source code in NCNN follows the same direct idea.

```cpp
/*
Input:  input[IC][IH][IW]
        IC = input.channels, IH = input.height, IW = input.width
Kernel: kernel[OC][IC][KH][KW]
        KH = kernel.height, KW = kernel.width
Output: output[OC][OH][OW]
        OC = output.channels, OH = output.height, OW = output.width
With padding = VALID and stride = 1:
        OH = IH - KH + 1
        OW = IW - KW + 1
i.e., compute OH and OW in advance, then slide the kernel over the input
and accumulate the element-wise products at each output position.
*/
for (int oc = 0; oc < output.channels; oc++)
{
    for (int oh = 0; oh < output.height; oh++)
    {
        for (int ow = 0; ow < output.width; ow++)
        {
            float sum = 0;
            // accumulate over all input channels and the KH x KW window
            for (int ic = 0; ic < input.channels; ic++)
            {
                for (int kh = 0; kh < kernel.height; kh++)
                {
                    for (int kw = 0; kw < kernel.width; kw++)
                    {
                        sum += input[ic][oh + kh][ow + kw] * kernel[oc][ic][kh][kw];
                    }
                }
            }
            // if (bias) sum += bias[oc];
            output[oc][oh][ow] = sum;
        }
    }
}
```

reference:

  • https://www.cnblogs.com/hejunlin1992/p/8686838.html

The process of convolution

Looking at the PyTorch and Caffe source code, both convert the convolution computation into matrix multiplication: im2col first, then GEMM. https://blog.csdn.net/mrhiuser/article/details/52672824
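A minimal NumPy sketch of the im2col + GEMM idea (stride 1, no padding; shapes follow the pseudocode above, and the helper name `conv2d_im2col` is just for illustration):

```python
import numpy as np

def conv2d_im2col(x, w):
    """x: (IC, IH, IW) input, w: (OC, IC, KH, KW) kernel; stride 1, no padding."""
    IC, IH, IW = x.shape
    OC, _, KH, KW = w.shape
    OH, OW = IH - KH + 1, IW - KW + 1

    # im2col: each output position becomes one column of the patch matrix
    cols = np.empty((IC * KH * KW, OH * OW), dtype=x.dtype)
    idx = 0
    for oh in range(OH):
        for ow in range(OW):
            cols[:, idx] = x[:, oh:oh + KH, ow:ow + KW].reshape(-1)
            idx += 1

    # GEMM: (OC, IC*KH*KW) x (IC*KH*KW, OH*OW) -> (OC, OH*OW)
    out = w.reshape(OC, -1) @ cols
    return out.reshape(OC, OH, OW)
```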

Calculation process of transposed convolution

https://cloud.tencent.com/developer/article/1363619
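For a quick sanity check of the output-size rule (with no dilation, OH = (IH - 1) * stride - 2 * padding + kernel_size + output_padding), a small PyTorch example; the layer sizes here are purely illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)
deconv = nn.ConvTranspose2d(in_channels=3, out_channels=16,
                            kernel_size=4, stride=2, padding=1)
y = deconv(x)
print(y.shape)  # torch.Size([1, 16, 16, 16]) -> (8 - 1) * 2 - 2 * 1 + 4 = 16
```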

What is the use of a 1*1 convolution kernel, and what is the difference between a 3*3 kernel and a 1*3 plus a 3*1?

A 1x1 convolution can change the number of channels of the previous layer (and mix information across channels). A kernel larger than 1x1 means spatial neighborhood information is used for feature extraction.

  • If you are extracting horizontal texture, the information density of the horizontal neighborhood is higher than that of the vertical neighborhood.

  • So a wide, flat kernel is the most sensible choice there; for vertical texture, by the same reasoning, a tall, thin kernel is best.

  • If you want to extract a rich variety of textures, the expected horizontal neighborhood information density is roughly equal to the expected vertical one.

So for the lazy, the expected optimal kernel shape is square. As for 1*n and n*1 kernels, they are generally used together to approximate the receptive field of an n*n kernel, which reduces parameters while adding layers (and nonlinearity), and this tends to pay off in the higher layers of a CNN. Not all convolution kernels have to be square; rectangular kernels such as 3*5 are used in text detection and license plate detection, where the design matches the elongated shape of text lines or plates and learns better features. In practice, square versus rectangular often matters little because the network's learning capacity is strong. We can even learn the shape of the convolution, similar to the deformable convolution that Lao Pan will talk about later.
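A small PyTorch check of the two points above (the channel counts are arbitrary examples): a 1x1 convolution only changes channels, and a 1x3 + 3x1 pair covers a 3x3 receptive field with fewer parameters.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# 1x1 convolution: changes the channel count only, no spatial neighborhood
pointwise = nn.Conv2d(256, 64, kernel_size=1)

# a full 3x3 convolution vs. the factorized 1x3 + 3x1 pair
conv3x3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
factorized = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),
)

print(n_params(conv3x3))     # 64*64*3*3 + 64   = 36928
print(n_params(factorized))  # 2*(64*64*3 + 64) = 24704, roughly a third fewer
```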

Comparison of the bottleneck structure in ResNet and the inverted residual structure in MobileNetV2

Note that in ResNet the bottleneck first reduces the dimension and then restores it, while in MobileNetV2 the block first expands the dimension and then reduces it (hence "inverted"); a side-by-side sketch follows the links below.

  • https://zhuanlan.zhihu.com/p/67872001

  • https://zhuanlan.zhihu.com/p/32913695
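A rough side-by-side sketch in PyTorch (the channel counts are illustrative, not taken from any specific configuration):

```python
import torch.nn as nn

# ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 restore (e.g. 256 -> 64 -> 256)
resnet_bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, 1), nn.BatchNorm2d(256),
)

# MobileNetV2 inverted residual: 1x1 expand -> depthwise 3x3 -> 1x1 project
# (e.g. 64 -> 384 -> 64), with a linear (no ReLU) projection at the end
mbv2_inverted = nn.Sequential(
    nn.Conv2d(64, 384, 1), nn.BatchNorm2d(384), nn.ReLU6(inplace=True),
    nn.Conv2d(384, 384, 3, padding=1, groups=384), nn.BatchNorm2d(384), nn.ReLU6(inplace=True),
    nn.Conv2d(384, 64, 1), nn.BatchNorm2d(64),  # linear bottleneck
)
```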

Calculation of Convolutional Feature Map Size

Simple but easy to get wrong question:

  • Conv2D output-size example (figure); the general rule is checked in the small sketch below.
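For a Conv2D layer with input size I, kernel size K, padding P, and stride S, the output size is O = floor((I + 2P - K) / S) + 1. A quick PyTorch check (the layer here is just an example):

```python
import torch
import torch.nn as nn

# O = floor((I + 2P - K) / S) + 1
x = torch.randn(1, 3, 32, 32)
y = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)(x)
print(y.shape)  # torch.Size([1, 16, 16, 16]) -> floor((32 + 2*1 - 3) / 2) + 1 = 16
```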

The difference between dynamic graph and static graph

  • A static graph is created once and then reused; it can be serialized to disk, so the whole network structure can be saved and reloaded, which is very practical for deployment. Conditions and loops in TensorFlow static graphs require dedicated syntax, whereas PyTorch can express them with plain Python.

  • A dynamic graph is rebuilt every time it is used, which makes it harder to optimize and means the graph-building code runs repeatedly, but dynamic-graph code is more concise than static-graph code.

Depending on whether they use dynamic or static computation, deep learning frameworks fall into two camps; some frameworks offer both mechanisms (such as MXNet and recent TensorFlow). Dynamic computation means the program executes in the order we write it, which makes debugging easier and makes it easier to turn the ideas in our heads into actual code. Static computation means the network structure is built first when the program is compiled, and the corresponding operations are executed afterwards. In theory, static computation lets the compiler optimize more aggressively, but it also means a larger gap between what you expect the program to do and what the compiler actually executes, so errors in the code are harder to find (for example, a problem in the computation graph may only surface when the corresponding operation is executed). Although static computation graphs should in theory perform better than dynamic ones, in practice we often find that this is not the case.

All networks over the years

This is covered in the ninth lecture of CS231n and in:

  • https://ucbrise.github.io/cs294-ai-sys-sp19/assets/lectures/lec02/classic_neural_architectures.pdf

  • https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202

Formal summary:

  • LeNet-5: the first convolutional network, used to recognize handwritten digits. It uses 5x5 convolutions with s=1, a stack of ordinary convolution and pooling layers, and fully connected layers at the end.

  • AlexNet: uses an 11x11 convolution in the first layer, was the first to use ReLU, and uses a normalization layer (LRN), not the BN we usually mean today. It uses dropout, and training was carried out on two GPUs with model parallelism.

  • ZFNet: an enhanced version of AlexNet that changes the 11x11 convolution to 7x7 and deepens the convolution channels on top of AlexNet, so it performed better in the classification competition.

  • VGGNet: uses only small 3x3 convolutions (s=1) and conventional pooling layers, but is deeper than its predecessors, with a few fully connected layers followed by softmax at the end. The reason for 3x3 convolutions is that three stacked 3x3 convolutions have the same effective receptive field as one 7x7, while being deeper and more nonlinear with fewer convolution parameters, so the network can run faster and be made appropriately deeper.

  • GoogLeNet: drops the FC layers, so the parameter count is greatly reduced compared with earlier networks. It proposes the Inception module, a network-in-network (NIN) structure. The original Inception module is computationally heavy, so a 1x1 conv "bottleneck" is added to each branch (see the figure in the reference). To avoid vanishing gradients, two auxiliary softmax losses are added at two intermediate positions, so there are three losses in total, and the overall network loss is the weighted sum of the three. Related article: https://zhuanlan.zhihu.com/p/42704781. Characteristics of the Inception structure: 1. It increases the width of the network and improves adaptability to different scales. 2. It uses 1x1 convolutions to reduce the dimensionality of the input feature maps, which greatly reduces the number of parameters and therefore the amount of computation. 3. In V3, multiple small kernels replace large kernels; besides the regular square decomposition, there is also the factorization 3x3 = 3x1 + 1x3, which works better in deeper layers. 4. The core bottleneck idea is to replace one large kernel with several small kernels and to let 1x1 convolutions take over part of the large kernel's work: first a 1x1 to reduce channels, then the normal 3x3, then a 1x1 to restore channels.

  • Xception: an improvement on Inception; the proposed depthwise separable convolution is the highlight. https://www.jianshu.com/p/4708a09c4352

  • ResNet: the deeper the network, the harder it is to optimize. The key intuition is that a deeper network should perform at least as well as a shallower one, not worse. For deeper ResNets (50+), the bottleneck layer (two 1x1 convolutions for dimension reduction and restoration around a 3x3) is used to improve efficiency. For a more detailed description see "Hundred Faces of Machine Learning" and the course slides. Related explanation: https://zhuanlan.zhihu.com/p/42706477

  • DenseNet: one cannot simply say that DenseNet is better. Comparing the two, ResNet is the more general model and DenseNet the more specialized one; DenseNet may outperform ResNet on image tasks, essentially because its dense connections better match the information distribution of images and it uses multi-scale features. But it also has drawbacks: the most direct cost is the number of feature maps produced in a single forward pass. Some frameworks automatically free the feature maps of earlier layers (or trade recomputation for memory via in-place tricks) to reduce GPU memory, but because DenseNet needs to reuse earlier feature maps through its concat connections, they cannot be freed, leading to high GPU memory usage.

  • SENet: short for Squeeze-and-Excitation Networks. It is a form of channel attention: a global pooling (GP) layer, two FC layers, a sigmoid, and a scale operation generate an attention mask that multiplies the input x to produce a new x. The core idea is to learn the importance of each feature channel, then boost useful channels and suppress channels that are less useful for the current task; each channel is multiplied by the importance coefficient produced by the sigmoid, somewhat like inspecting which BN scale coefficients matter. Drawback: because of the 0~1 scaling on the backbone, gradients near the input layer easily vanish when the network is deep, making the model harder to optimize. A minimal SE-block sketch follows at the end of this list. http://www.sohu.com/a/161633191_465975

  • Wide Residual Networks

  • ResNeXt: a combination of ResNet and Inception. The residual connection is the x term in the formula, and the rest consists of 32 independent groups with the same structure that are finally merged, following the split-transform-merge pattern. Although split into 32 groups, each group is a 1x1 point convolution to reduce dimension, then an ordinary 3x3 convolution, then a 1x1 convolution to restore dimension (the opposite of MobileNetV2). Related introduction: https://zhuanlan.zhihu.com/p/51075096

  • Densely Connected Convolutional Networks (DenseNet): helps alleviate vanishing gradients and strengthens feature propagation and reuse.

  • ShuffleNet: https://blog.csdn.net/u011974639/article/details/79200559
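As promised above, a minimal SE-block sketch (reduction ratio 16 is the common default; the channel count is illustrative):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # squeeze: global average pooling
        self.fc = nn.Sequential(                  # excitation: FC -> ReLU -> FC -> sigmoid
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                              # scale: reweight each channel

print(SEBlock(64)(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```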

some knowledge of statistics

Normal distribution: https://blog.csdn.net/yaningli/article/details/78051361

About how to train (some problems in the training process)

Training oscillation caused by MaxPool (mitigated by adding an L2Norm after MaxPool): https://mp.weixin.qq.com/s/QR-KzLxOBazSbEFYoP334Q

A good companion to the fully connected layer: Spatial Pyramid Pooling (SPP)

https://zhuanlan.zhihu.com/p/64510297
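A minimal sketch of the SPP idea using adaptive pooling (the pyramid levels 1/2/4 and channel count are illustrative): whatever the spatial size of the input feature map, the pooled-and-concatenated vector has a fixed length, so it can feed a fully connected layer.

```python
import torch
import torch.nn as nn

def spp(x, levels=(1, 2, 4)):
    # pool the feature map to several fixed grid sizes and concatenate the results
    n, c = x.shape[:2]
    feats = [nn.AdaptiveMaxPool2d(l)(x).view(n, -1) for l in levels]
    return torch.cat(feats, dim=1)   # length = c * (1 + 4 + 16)

print(spp(torch.randn(2, 256, 13, 13)).shape)  # torch.Size([2, 5376])
print(spp(torch.randn(2, 256, 20, 17)).shape)  # same length, regardless of H x W
```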

Receptive field calculation

There are two forms of the receptive field calculation: a recursive formula and a closed-form (general-term) formula:

$$
RF_l = RF_{l-1} + (k_l - 1)\prod_{i=1}^{l-1} s_i, \qquad RF_0 = 1
$$

$$
RF_L = 1 + \sum_{l=1}^{L}\Big[(k_l - 1)\prod_{i=1}^{l-1} s_i\Big]
$$

where k_l and s_l are the kernel size and stride of layer l.

Note that both convolution and pooling increase the receptive field (see the small calculator sketch below).

http://zike.io/posts/calculate-receptive-field-for-vgg-16/
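A tiny sketch of the recursive formula applied to a hypothetical list of (kernel, stride) layers:

```python
# RF_l = RF_{l-1} + (k_l - 1) * jump_{l-1}, where jump is the product of all previous strides
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# three 3x3 stride-1 convolutions have the same receptive field as one 7x7
print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # 7
# conv3x3/s1 -> pool2x2/s2 -> conv3x3/s1 (pooling also enlarges the receptive field)
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 8
```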

