Essential knowledge for computer vision algorithm interviews (2022 must-read)

Part 1: Deep Learning

1. Basic problems of neural network

(1) Backpropagation (you must be able to derive it)
Backpropagation is the method used to compute the derivative of the loss function L with respect to the parameters w. It applies the chain rule layer by layer, from the output back to the input. Note that the parameters should be initialized randomly rather than all set to 0; otherwise every hidden unit in a layer computes the same function of the input and receives the same gradient, which is called failure to break symmetry.
The general process is:
. First, the forward pass computes the activation values and output values of all nodes.
. Compute the overall loss function.
. Then compute the residual for each node of layer l, starting from the output layer. (UFLDL calls this the residual; in essence it is the derivative of the overall loss function with respect to the weighted input z of each layer.) To get the derivative with respect to W, multiply the residual by the derivative of z with respect to W, which is the activation of the previous layer.
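The three steps above can be sketched for a small network. This is a minimal illustration, not the article's own derivation: the two-layer sigmoid network, the squared loss, and all names (`forward_backward`, `delta1`, `delta2`) are assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    """One forward and backward pass for a 2-layer sigmoid network with squared loss."""
    # Forward pass: weighted inputs z and activations a of every layer.
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    loss = 0.5 * np.sum((a2 - y) ** 2)

    # Residual of the output layer: dL/dz2 = (a2 - y) * sigmoid'(z2).
    delta2 = (a2 - y) * a2 * (1.0 - a2)
    # Residual of the hidden layer, propagated back through W2.
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)

    # Gradient w.r.t. each weight matrix: residual times the previous layer's activation.
    grads = {
        "W1": np.outer(delta1, x), "b1": delta1,
        "W2": np.outer(delta2, a1), "b2": delta2,
    }
    return loss, grads

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])
# Small random initialization (not zeros) so the hidden units break symmetry.
W1, b1 = rng.normal(scale=0.5, size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(1, 4)), np.zeros(1)
loss, grads = forward_backward(x, y, W1, b1, W2, b2)
```

A common way to check a hand-derived gradient like this is to compare one entry against a finite-difference estimate of the loss.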

(2) Vanishing gradients, exploding gradients
Vanishing gradients: this is essentially caused by the choice of activation function. Take the sigmoid, the simplest example: its derivative is very small at both ends (the saturation regions). During backpropagation the derivative of the activation function is multiplied in once per layer, so the product becomes smaller and smaller and the gradient vanishes.
Exploding gradients: similarly, this occurs when the activation function is in its active (non-saturated) region and the weights W are too large. Exploding gradients are less likely to occur than vanishing gradients.
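To make the shrinking product concrete, here is a small numeric sketch. It only looks at the sigmoid derivative factor and ignores the weights, which is a simplifying assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative peaks at 0.25 (at z = 0) and is tiny in the saturated regions.
peak = sigmoid_grad(0.0)
saturated = sigmoid_grad(10.0)

# Backpropagating through n sigmoid layers multiplies n such factors together,
# so even in the best case the factor shrinks like 0.25 ** n.
best_case_factor = {n: peak ** n for n in (1, 5, 10, 20)}
```

With ten layers the best-case factor is already below one in a million, which is why the early layers of a deep sigmoid network learn so slowly.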
(3) Commonly used activation functions

[Figure: commonly used activation functions]
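The original figure for this item is not available. As a stand-in, the sketch below lists a few activation functions that are standard in the literature; which ones the figure actually showed is an assumption.

```python
import math

def sigmoid(z):
    """Squashes z into (0, 1); saturates for large |z|."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes z into (-1, 1); zero-centered but still saturating."""
    return math.tanh(z)

def relu(z):
    """Passes positive values unchanged and zeroes out negative ones."""
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but keeps a small slope for negative inputs."""
    return z if z > 0 else slope * z
```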

(4) Parameter update method
[Figure: parameter update methods]

(5) Methods to address overfitting
  Dropout, regularization, and batch normalization. Note that dropout is only applied during training, when some neurons are randomly deactivated; at inference time all neurons are used.
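The train/inference difference can be sketched as follows. This uses the "inverted dropout" variant (rescaling during training), which is one common convention and an assumption here; the function name and arguments are illustrative.

```python
import random

def dropout(activations, drop_prob, training, rng=random):
    """Inverted dropout on a list of activations."""
    if not training or drop_prob == 0.0:
        # Inference: every neuron is kept and no rescaling is needed.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Training: zero each activation with probability drop_prob and rescale the
    # survivors by 1 / keep_prob so the expected value stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10000
train_out = dropout(hidden, drop_prob=0.5, training=True, rng=rng)
eval_out = dropout(hidden, drop_prob=0.5, training=False)
```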

2. CNN and RNN problems

(1) Idea
  Replace full connections with local connections. This is justified by a special property of images (the statistics of one part of an image are the same as those of other parts). Local connections and parameter sharing greatly reduce the number of parameters. Different features of the image can be extracted by using multiple filters (multiple convolution kernels).
(2) Selection of filter size
  Usually the size is an odd number (1, 3, 5, 7) 
(3) Output size calculation formula
  Output size = (N - F + padding * 2) / stride + 1
  The stride can be chosen freely; zero padding is used to make the filter positions fit the input size.
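The output-size formula translates directly into a helper. The function name is illustrative, and the use of integer (floor) division for strides that do not divide evenly is an assumption that matches the common convention.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output width for input width n and filter width f, given padding and stride."""
    # Integer division: positions where the filter would overhang are dropped.
    return (n - f + 2 * padding) // stride + 1

same_size = conv_output_size(32, 3, padding=1, stride=1)    # 3x3 filter, padding 1
shrunk = conv_output_size(32, 5)                            # 5x5 filter, no padding
strided = conv_output_size(227, 11, padding=0, stride=4)    # large filter, stride 4
```

The first case shows why odd filter sizes are convenient: padding of (F - 1) / 2 keeps the output the same size as the input at stride 1.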
(4) The role of pooling
  Although convolution greatly reduces the output size (number of features), the result is still expensive to compute and prone to overfitting. Pooling therefore exploits the stationarity of image statistics once more to reduce the size further.

(5) Several commonly used models, it is best to remember the approximate size parameters of the model.
[Figure: commonly used models and their approximate size parameters]
1. The principle of RNN:
  In an ordinary fully connected network or CNN, the signal of each layer of neurons can only propagate to the next layer, and samples are processed independently at each time step, so these are called feed-forward neural networks. In an RNN, the output of a neuron can feed directly into itself at the next time step: the input of the i-th layer at time m includes, in addition to the output of layer (i-1) at that time, the layer's own output at time (m-1). Hence the name recurrent neural network.
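The recurrence described above can be sketched as a single step function. The tanh nonlinearity, the weight names (`W_xh`, `W_hh`, `b_h`), and the sizes are assumptions for illustration.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step of a vanilla RNN layer."""
    # The new state mixes the current input with the layer's own previous output.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 5
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

xs = rng.normal(size=(steps, input_dim))
h = np.zeros(hidden_dim)
states = []
for x_t in xs:
    # The same weights are reused at every time step.
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    states.append(h)
```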

2. The difference between RNN, LSTM, and GRU
. RNN introduces recurrence, but in practice the early information fades over time, that is, the problem of long-term dependencies, so LSTM was introduced.
. LSTM: the LSTM has input and output gates, and the current cell information is added to the cell state after being controlled by the input gate, whereas a plain RNN updates its state by repeated multiplication. This additive update is the key to why LSTM mitigates vanishing gradients. You should be able to derive the forget gate, input gate, cell state, hidden state, and so on.
. GRU is a variant of LSTM that combines the forget gate and the input gate into a single update gate.

3. LSTM and vanishing/exploding gradients
  LSTM replaces the product with a sum, which makes vanishing gradients unlikely. Exploding gradients, however, become correspondingly more likely, but this can be handled by putting a threshold on the gradient (gradient clipping).
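Thresholding the gradient is usually done by rescaling its norm. The sketch below shows that idea on a plain list; the function name and the choice of clipping by L2 norm (rather than clipping each component) are assumptions.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

# A gradient of norm 50 is scaled down to norm 5; a small gradient is left alone.
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
untouched = clip_by_norm([0.3, 0.4], max_norm=5.0)
```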
  
4. Lead word2vec

3. GAN

1. The idea of ​​GAN
GAN combines a generative model and a discriminative model, like a contest between a spear and a shield. The generative model is responsible for generating data good enough to fool the discriminative model, and the discriminative model is responsible for telling what is real from what was produced by the generative model. That is the intuition, but why does it work?
  Suppose the real data follow the distribution Pdata(x) and we want to build a generative model that imitates it. Let the generative model be Pg(x; θ); our goal is to find θ, which we would usually do by maximum likelihood estimation. The problem is that when a neural network is used to imitate Pdata(x), the likelihood is hard to compute because we cannot write down an explicit expression for the generative model. GAN therefore uses the discriminative model in place of the maximum likelihood procedure.
  In the ideal state, G generates a picture G(z) that passes for a real one. D then cannot determine whether a picture generated by G is real, so D(G(z)) = 0.5. At that point the goal is achieved: we have a generative model G that can be used to generate pictures.

2. Expression of GAN
  The GAN objective is essentially a minimax problem, min_G max_D V(D, G). V(D, G) can be regarded as the difference between the generated distribution and the real data distribution: max_D V(D, G) is the largest difference a discriminator can expose, and min_G says that the smaller this maximum difference, the better. For the optimal discriminator, this way of measuring difference is the Jensen-Shannon divergence (up to constants).
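For reference, the minimax objective in its standard form is written out below. The symbol P_z for the noise prior is an assumption, since the article's own formula image is missing; the second and third lines give the well-known optimal discriminator and the resulting link to the Jensen-Shannon divergence.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim P_{data}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim P_z}\big[\log\big(1 - D(G(z))\big)\big]

D^{*}(x) = \frac{P_{data}(x)}{P_{data}(x) + P_g(x)}

V(D^{*}, G) = 2\,\mathrm{JSD}\big(P_{data} \,\|\, P_g\big) - 2\log 2
```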
3. The actual calculation method of GAN
Because we do not have the distribution Pdata(x) itself, in practice the difference is estimated from samples (the expectation becomes a sample average).

There are a few key points: in each iteration the discriminator is trained k times while the generator is trained only once; the discriminator step maximizes V (gradient ascent) and the generator step then minimizes it (gradient descent).
  In practice, however, the second term of V, log(1 - D(G(z))), gives the generator a very small gradient when D(G(z)) is small, so it updates very slowly. The generator therefore usually minimizes -log D(G(z)) instead.
  It was also found in practice that no matter how good the generator is, the discriminator can always tell real from fake, that is, its loss is almost 0. This may be caused by sampling: the overlap between the generated data and the real data is too small, so the discriminator can always separate them. There are two remedies: 1. use WGAN; 2. add noise that decays over time.
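The slow-update problem and its fix can be seen by comparing the two generator terms. This sketch assumes the discriminator output is a sigmoid of a logit `a`, so D(G(z)) = sigmoid(a), and compares the gradient of each term with respect to that logit; the function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original generator term."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the replacement generator term."""
    return -(1.0 - sigmoid(a))

# Early in training the discriminator rejects fakes confidently: a is very negative.
a = -5.0
early_saturating = grad_saturating(a)          # close to 0: almost no learning signal
early_non_saturating = grad_non_saturating(a)  # close to -1: strong learning signal
```

Both terms push the generator in the same direction; the replacement only changes where the gradient is large.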

4. There are a number of improvements to GAN, including replacing the Jensen-Shannon divergence with a general f-divergence, and many more. Here we mainly introduce WGAN.
5. WGAN
  As mentioned above, an f-divergence can be used to measure the difference between two distributions. The idea of WGAN is to use the earth mover's distance (also known as the Wasserstein distance) instead.

Part 2: Machine Learning Preparation

1. Issues related to decision trees
(1) Calculation of various entropies
  Entropy, joint entropy, conditional entropy, cross entropy, KL divergence (relative entropy)
. Entropy measures uncertainty, so it is largest for a uniform distribution.
. KL divergence measures the dissimilarity of two distributions: KL(p||q) = H(p,q) - H(p), the cross entropy minus the entropy of p. The cross entropy is the number of bits needed to encode p using a code built for q; subtracting the number of bits p itself requires, the KL divergence is the extra bits needed to encode p with q.
. Mutual information: I(x,y) = H(x) - H(x|y) = H(y) - H(y|x), which indicates how much the entropy of y decreases after observing x.
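The relationships above are easy to check numerically. A minimal sketch for discrete distributions given as probability lists, measured in bits; the example distributions `p` and `q` are made up for illustration.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # uniform: maximum entropy for two outcomes
q = [0.9, 0.1]   # skewed: lower entropy

h_p = entropy(p)
extra_bits = kl_divergence(p, q)   # equals cross_entropy(p, q) - entropy(p)
```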
(2) Commonly used tree construction methods: ID3, C4.5, and CART
  These trees use information gain, information gain ratio, and the Gini index, respectively, as the splitting criterion.
  Information gain measures how much the entropy drops after splitting on a given feature. The more distinct values a feature has, the purer the resulting subsets and the higher the gain, so the information gain ratio was proposed to reduce the advantage of features with many values.
  The Gini index is used by CART for classification. CART only builds binary trees, and unlike ID3 and C4.5 it does not discard a feature after using it in a split.
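The three criteria can be sketched on label lists. The helper names and the toy labels are assumptions; the example shows how a feature with one value per sample gets the maximum information gain but is penalized by the gain ratio.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy before the split minus the size-weighted entropy after it (ID3)."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

def gain_ratio(labels, groups):
    """Information gain divided by the entropy of the split itself (C4.5)."""
    n = len(labels)
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups if g)
    return information_gain(labels, groups) / split_info

labels = ["yes", "yes", "no", "no"]
fine_split = [["yes"], ["yes"], ["no"], ["no"]]      # one value per sample
binary_split = [["yes", "yes"], ["no", "no"]]        # clean two-way split
```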

(3) Prevent overfitting: pruning
  Pruning is divided into pre-pruning and post-pruning. Pre-pruning is essentially early stopping; post-pruning usually decides whether to prune by measuring the change in the loss function after pruning. Post-pruning methods include reduced-error pruning, pessimistic pruning, and cost-complexity pruning.
(4) Several stopping conditions for pre-pruning
. All samples in the node belong to the same class.
. No features remain: return the majority class.
. A branch has no samples: return the majority class of the parent node.
. The number of samples is below a threshold: return the majority class.
2. Questions related to logistic regression
(1) You must be able to derive the formulas

(2) The basic concept of logistic regression
  This is best analyzed from the perspective of a generalized linear model. Logistic regression assumes that y obeys the Bernoulli distribution.
(3) L1-norm and L2-norm
  The L0-norm directly counts the number of nonzero parameters as the regularization term and so yields sparsity, but it is hard to optimize in practice, so the L1-norm is introduced as a substitute. In essence, L1 regularization assumes a Laplace prior on the parameters, while L2 regularization assumes a Gaussian prior. This is the principle behind the pictures commonly used online to answer this question.
   However, the L1-norm is harder to optimize because it is not differentiable at 0; it can be solved with coordinate descent or least angle regression.
(4) Comparison between LR and SVM
  First, the biggest difference between LR and SVM is the loss function: LR uses log loss (also called logistic loss), while SVM uses hinge loss.
  Second, both are linear models.
  Finally, SVM only depends on the support vectors (the few points that determine the decision boundary), whereas LR uses all samples.
  
(5) The difference between LR and random forest
  Tree algorithms such as random forest are nonlinear, while LR is linear. LR focuses more on global optimization, while the tree model is mainly local optimization.
  
(6) Commonly used optimization methods
   Logistic regression has no closed-form solution (unlike linear regression, where the normal equation exists but inverting the matrix is expensive), so iterative methods such as gradient descent are used.
  First-order methods: gradient descent, stochastic gradient descent, mini-batch gradient descent. Stochastic gradient descent is not only faster per update than full gradient descent, but its noise also helps to some extent in escaping local optima.
  Second-order methods: Newton's method, quasi-Newton methods.
  Newton's method: as a root finder, it repeatedly takes the tangent to the curve at the current point and moves to where the tangent crosses the x-axis, until it reaches the point where the curve itself crosses the x-axis, which is the solution of the equation. In practice we often need to solve convex optimization problems, that is, to find where the first derivative of the function is 0, and Newton's method applies directly to that equation. It picks a starting point, takes the second-order Taylor expansion there, jumps to the point where the derivative of the expansion is 0, and repeats until the stopping criterion is met. Because it uses second-order information, it converges faster than first-order methods. The x we usually see is a multidimensional vector, which leads to the Hessian matrix (the matrix of second partial derivatives with respect to x). Disadvantages: Newton's method takes a full step without a step-size factor, so it cannot guarantee that the function value decreases steadily and may even fail in bad cases. It also requires the function to be twice differentiable, and computing the inverse of the Hessian is very expensive.
  Quasi-Newton methods: methods that build a positive definite symmetric approximation of the Hessian without computing second-order partial derivatives. The idea is to use a special update formula to approximate the Hessian or its inverse so that the approximation satisfies the quasi-Newton condition. The main variants are DFP (approximates the inverse of the Hessian), BFGS (approximates the Hessian directly), and L-BFGS (reduces the storage required by BFGS).
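The Newton update described above can be sketched in one dimension, where the Hessian is just the second derivative. The test function f(x) = exp(x) - 2x is a hypothetical convex example chosen because its minimizer, ln 2, is known exactly.

```python
import math

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Find a stationary point of a 1-D function by Newton's method on its derivative."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)   # full Newton step, no step-size factor
        x -= step
        if abs(step) < tol:
            break
    return x

# f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x).
x_star = newton_minimize(grad=lambda x: math.exp(x) - 2.0,
                         hess=lambda x: math.exp(x),
                         x0=0.0)
```

In several dimensions the division becomes a linear solve with the Hessian, which is the expensive step that quasi-Newton methods avoid.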

3. SVM-related issues
(1) Why can SVM with a kernel classify nonlinear problems?
  A kernel function is essentially the inner product of two mapped inputs, where the mapping sends the SVM input values into a high-dimensional space. Note that the kernel does not compute the mapping explicitly; it only gives the inner product.
 
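(2) Does the RBF kernel always make the data linearly separable?
  The answer comes from its Taylor expansion: the RBF kernel corresponds to a mapping into an infinite-dimensional feature space.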
  
(3) Commonly used kernel functions and conditions of kernel functions:
  When selecting a kernel, start with the linear kernel; when there are many features a Gaussian kernel is usually unnecessary. Choose models from simple to complex. The kernel functions we usually refer to are positive definite kernels. The necessary and sufficient condition is that for any finite set of points x in X, the Gram matrix corresponding to K is positive semi-definite.

. RBF (radial basis function) kernel: the value of this type of function depends only on the distance between the points, so the Laplacian kernel is also a radial basis kernel.
. Linear kernel: mainly used for linearly separable cases.
. Polynomial kernel.
(4) The basic idea of ​​SVM:
   Maximize the margin to obtain the optimal separating hyperplane. The problem is formalized as a convex quadratic program, which is equivalent to minimizing a regularized hinge loss. There are two types of SVM: hard-margin and soft-margin. The first thing to settle is how to define the margin, which leads to the functional margin and the geometric margin (only the idea is given here). We choose the geometric margin as the distance measure (you need to know why, and how to compute it): we maximize the geometric margin between the hyperplane and the data while requiring every point to be at least that far away. After some rewriting we get the familiar SVM formulation, and we find that the margin is determined by only a few support vectors. The primal problem can be solved with an off-the-shelf convex optimization package, but solving the dual problem turns out to be easier and also allows kernel functions to be introduced. Converting the primal problem into the dual relies on the KKT conditions (think about these carefully), and the derivation is relatively easy. As mentioned, the problem can be relaxed to a soft margin by introducing a penalty coefficient, which also leads to the equivalent hinge loss form (so that SVM can be solved with gradient descent). I personally think the hardest part is the SMO algorithm for solving the parameters.

(5) Whether all optimization problems can be transformed into dual problems:
  This is a good question; it involves the concepts of strong duality and weak duality. See the explanation on Zhihu.
(6) Dealing with data skew (class imbalance):
  Use a smaller penalty coefficient C for the class with many samples, which means paying it less attention, and conversely a larger penalty coefficient for the class with few samples.

4. Boosting and Bagging
(1) Random Forest
  Random forest addresses the decision tree's tendency to overfit, mainly through two operations: 1. bootstrap sampling, that is, drawing training samples with replacement; 2. randomly drawing a subset of the features each time (usually sqrt(n) of them).
  . Classification: use voting, selecting the class predicted most often.
  . Regression: take the average of the results of the trees.
(2) The essence of Boosting's AdaBoost
  Boosting is an additive model: it learns multiple classifiers by changing the weights of the training samples and combines them linearly. AdaBoost is the additive model + exponential loss function + forward stagewise algorithm. AdaBoost starts from a weak classifier and trains repeatedly; in each round the data weights (the probability distribution over samples) are adjusted so that the samples misclassified by the previous round's weak classifier get larger weights. Finally the classifiers vote, each weighted by its importance.
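One reweighting round can be sketched as follows, using the standard AdaBoost formulas for binary labels in {-1, +1}. The function name and the toy predictions are assumptions, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step; labels and predictions are in {-1, +1}."""
    # Weighted error of the current weak classifier.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Classifier importance: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (y * h = -1) are scaled up, correct ones scaled down.
    new_w = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]   # hypothetical weak classifier with one mistake
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.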
(3) Boosting: GBDT
   GBDT uses binary trees as base learners: binary regression trees for regression and binary classification trees for classification. Compared with AdaBoost above, the regression tree uses squared loss, and the classification problem can still be defined with the exponential loss. But what about a general loss function? GBDT (Gradient Boosting Decision Tree) handles a general loss by using the negative gradient of the loss function at the current model as an approximation of the residual in the regression problem, and fitting the next tree to it.
  Note: since GBDT overfits easily, the recommended tree depth for GBDT is at most 6, while random forest trees can be deeper than 15.
(4) The difference between GBDT and Random Forest
This is similar to the above.
(5) The tool XGBoost
  Its main features are:
. Supports linear classifiers as base learners.
. The loss function can be customized, and second-order partial derivatives are used.
. Regularization terms are added: the number of leaf nodes and the L2-norm of the output scores of the leaf nodes.
. Supports feature sampling.
. Supports parallelism under certain conditions: only in the tree-building stage, where the search for the split feature at each node can run in parallel.
5. KNN and K-means
(1) Disadvantages of KNN and K-means
  Both require a large number of distance computations and are therefore slow (though there are corresponding optimizations). KNN in particular is a lazy learner.
(2) KNN
  KNN needs no training: it uses the labels of the K nearest points to decide the label of an unseen point. KNN amounts to majority voting, which is equivalent to empirical risk minimization. KNN search can be accelerated with a k-d tree.
(3) K-means
  K-means requires choosing the number of clusters K and initializing the K cluster centers manually. It then repeatedly assigns points to the nearest center and moves each center to the mean of its points until the result stabilizes. Its sensitivity to initialization can be reduced with k-means++ (the idea is to spread the initial cluster centers as far apart as possible).
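The assign-then-update loop of K-means can be sketched in a few lines. This is a plain version with manually supplied initial centers and a fixed iteration count, both simplifying assumptions; the toy points are made up.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on 2-D points with manually chosen initial centers."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its old center).
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# Deliberately poor initialization: both centers start inside the same group.
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (1.0, 0.0)])
```

Here the algorithm recovers the two groups despite the poor start, but with less separated data a bad initialization can lead to a worse local optimum, which is the problem k-means++ targets.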

6. EM algorithm, HMM, CRF
   Strictly speaking the EM algorithm, HMM, and CRF do not belong together, but they are related, so they are covered together here. Focus on the ideas behind the algorithms.
(1) EM algorithm
   The EM algorithm is used for maximum likelihood estimation or maximum a posteriori estimation in models with hidden variables. It has two steps: the E step (expectation) and the M step (maximization). In essence EM is an iterative algorithm: it uses the parameters from the previous iteration to estimate the hidden variables, then re-estimates the parameters, and repeats until convergence.
  Note: EM is sensitive to the initial values. EM repeatedly maximizes a lower bound in order to approach the maximum of the log-likelihood, so it is not guaranteed to find the global optimum. You should also master the derivation of EM.
(2) HMM Algorithm
  The hidden Markov model is a generative model for sequence labeling problems. It has the parameters (π, A, B): the initial state probability vector π, the state transition matrix A, and the observation probability matrix B. These are called the three elements of the hidden Markov model.
The three basic problems of the hidden Markov model:
. Probability calculation problem: given the model and an observation sequence, compute the probability of the observation sequence under the model. -> Forward and backward algorithms.
. Learning problem: given the observation sequence, estimate the model parameters by maximum likelihood. -> Baum-Welch (an instance of the EM algorithm) and maximum likelihood estimation.
. Prediction problem: given the model and the observation sequence, find the corresponding state sequence. -> Approximate algorithm (greedy) and the Viterbi algorithm (dynamic programming to find the optimal path).
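The prediction problem can be sketched with a small Viterbi implementation over (π, A, B), here named `pi`, `A`, `B`. The two-state model and its probabilities are toy values made up for illustration.

```python
def viterbi(obs, pi, A, B):
    """Most likely hidden state path for an observation sequence.

    pi[i]: initial probability of state i; A[i][j]: transition i -> j;
    B[i][o]: probability that state i emits observation o.
    """
    n_states = len(pi)
    # delta[i]: probability of the best path that ends in state i at the current step.
    delta = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        prev = delta
        # For each state j, remember which predecessor i gives the best path.
        best_prev = [max(range(n_states), key=lambda i: prev[i] * A[i][j])
                     for j in range(n_states)]
        delta = [prev[best_prev[j]] * A[best_prev[j]][j] * B[j][o]
                 for j in range(n_states)]
        backpointers.append(best_prev)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda i: delta[i])
    best_prob = delta[state]
    path = [state]
    for best_prev in reversed(backpointers):
        state = best_prev[state]
        path.append(state)
    return path[::-1], best_prob

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, prob = viterbi([0, 1, 2], pi, A, B)
```

For long sequences the products underflow, so practical implementations usually work with log probabilities and sums instead.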
(3) Conditional random field CRF
   A CRF models the conditional probability distribution of a set of output random variables given a set of input random variables. It assumes that the output variables form a Markov random field; what we usually see is the linear-chain CRF, a discriminative model that predicts the output sequence from the input. It is trained by maximum likelihood estimation or regularized maximum likelihood estimation.
   HMM and CRF are often compared mainly because both use graphical models, but CRF is based on a Markov random field (undirected graph), while HMM is based on a Bayesian network (directed graph). CRF also has a probability calculation problem, a learning problem, and a prediction problem. The algorithms are broadly similar to those of HMM, except that the learning problem does not need the EM algorithm.
(4) Comparison between HMM and CRF
  The fundamental difference is conceptual: one is a generative model and the other is a discriminative model, which leads to different solution methods.

7. Common basic problems
(1) Why normalize data (or standardize; note the difference between normalization and standardization)
  It should be emphasized that if you can do without normalization, it is best not to normalize. The reason for normalizing is that the features have different units and scales, and whether it is needed depends on the situation.

. For some models, rescaling the dimensions unevenly changes the optimal solution (for example SVM), so normalization is needed.
. For some models the optimum is equivalent under rescaling (for example LR), so in principle normalization is not needed. In practice, however, the parameters are usually found iteratively, and if the objective function is too flat (imagine a very flat Gaussian) the iterative algorithm may fail to converge, so it is best to normalize.
Supplement: the underlying cause is the different loss functions. SVM relies on Euclidean distance, so a feature with a very large scale dominates the other dimensions, whereas LR can keep the loss unchanged by adjusting the weights.
(2) Measuring the quality of the classifier:
  First you need to know the four outcomes: TP, FN (a positive judged negative), FP (a negative judged positive), and TN (you can draw a table).
Several commonly used metrics:
. Precision = TP/(TP+FP) = TP/~P (~P is the number of samples predicted positive)
. Recall = TP/(TP+FN) = TP/P (P is the number of actual positives)
. F1 score: 2/F1 = 1/recall + 1/precision
. ROC curve: ROC space is the plane with the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis, where TPR = TP/P = recall and FPR = FP/N (N is the number of actual negatives).
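The metrics above follow directly from the four counts. A minimal sketch; the function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Precision, recall, F1, and false positive rate from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also the true positive rate (TPR)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)             # false positive rate, the ROC x-axis
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

m = classification_metrics(tp=8, fn=4, fp=2, tn=6)
```

Each classification threshold gives one (FPR, TPR) point; sweeping the threshold traces out the ROC curve.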
(3) SVD and PCA
  The idea of PCA is to find the projection vector that maximizes the variance of the data after projection. After removing the mean, SVD can be used to find this projection vector: select the direction with the largest singular value (equivalently, the largest eigenvalue of the covariance matrix).
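The procedure above (center, decompose, keep the top directions) can be sketched with NumPy's SVD. The function name and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal directions using SVD."""
    X_centered = X - X.mean(axis=0)          # remove the mean first
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Hypothetical data: points spread along the direction (1, 1) plus small noise.
X = np.column_stack([t, t]) + rng.normal(scale=0.05, size=(200, 2))
projected, components, variance = pca(X, n_components=1)
```

The sign of a principal direction is arbitrary, so the recovered component may point along (1, 1) or (-1, -1).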
(4) Methods to prevent overfitting
  Causes of overfitting: the learning capacity of the algorithm is too strong; some assumptions (such as samples being independent and identically distributed) may not hold; too few training samples to estimate the distribution over the whole space.
  Remedies:
. Early stopping: stop training if model performance has not improved significantly after several iterations.
. Data set expansion: add more original data, add random noise to the original data, resample.
. Regularization
. Cross-validation
. Feature selection / feature dimensionality reduction
(5) Data imbalance problem
  This is mainly caused by an imbalanced class distribution. Remedies:
. Sampling: oversample the minority class (with added noise) and downsample the majority class.
. Special weighting, as in AdaBoost or SVM.
. Use an algorithm that is insensitive to imbalanced data sets.
. Change the evaluation criterion: use AUC/ROC.
. Use bagging, boosting, or other ensemble methods.
. Consider the prior distribution of the data.

Origin blog.csdn.net/weixin_45529272/article/details/127979186