Summary of machine learning topics

1. What properties do activation functions in deep learning need to have?

  • simple calculation
  • nonlinear
  • with saturation region
  • almost everywhere differentiable

ABD. Analysis: (1) Nonlinearity: the derivative cannot be a constant. (2) Differentiable almost everywhere: sigmoid is differentiable everywhere, and ReLU is non-differentiable only at a finite number of points. (3) The calculation is simple. (4) Non-saturating: sigmoid has saturation regions and suffers from the vanishing gradient problem, which is why ReLU was proposed later. (5) Monotonicity. (6) A bounded output range. (7) Close to the identity transformation. (8) Few parameters. (9) Normalization of the outputs helps stabilize training.
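As a quick illustration of the saturation point, here is a minimal NumPy sketch (function names are my own) comparing the derivatives of sigmoid and ReLU: sigmoid's gradient vanishes for large |x|, while ReLU's stays at 1 on the positive side.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # at most 0.25; vanishes for large |x| (saturation)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # no saturation on the positive side

for x in [0.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```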

2. The BatchNorm layer computes the mean and variance of the input batch (and tracks them with an exponential moving average, EMA). If the shape of the input batch is (B, C, H, W), the shapes of the computed mean and variance are:

  •     B * 1 * 1 * 1
  •     1 * C * 1 * 1
  •     B * C * 1 * 1
  •     1 * 1 * 1 * 1

b. Analysis: BN normalizes each channel across all images in the batch, so there are as many means and variances as there are channels.
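A minimal NumPy sketch of the per-channel statistics (shapes are illustrative):

```python
import numpy as np

B, C, H, W = 8, 3, 32, 32
x = np.random.randn(B, C, H, W)

# Per-channel statistics: reduce over the batch and spatial dims, keep the channel dim.
mean = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)
print(mean.shape, var.shape)   # (1, 3, 1, 1) -> i.e. 1 * C * 1 * 1
```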

3. Which of the following is not a common CNN loss function ()

  • softmax_loss
  • sigmoid_loss
  • Contrastive_Loss
  • siamese_loss

d. Analysis: Contrastive_Loss (contrastive loss) is used in the siamese network (twin neural network) to handle the relationship between paired data effectively; "siamese_loss" is not a standard loss function. Contrastive loss is mainly used in dimensionality reduction: it ensures that samples that are similar in the original space remain similar after dimensionality reduction (in the new feature space). A standard form of the expression is

L = (1/2N) * Σ [ y·d² + (1 − y)·max(margin − d, 0)² ]

where d is the distance between the two samples of a pair. When the samples are similar (y = 1), only the first term remains: the larger the distance, the worse the model and the greater the loss. When the samples are dissimilar (y = 0), only the second term remains: the smaller the distance (within the margin), the worse the model and the greater the loss.
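A minimal NumPy sketch of this loss under the form above (names and the margin value are illustrative):

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """d: pairwise distances, y: 1 for similar pairs, 0 for dissimilar pairs."""
    similar = y * d ** 2                                   # penalize similar pairs that are far apart
    dissimilar = (1 - y) * np.maximum(margin - d, 0) ** 2  # penalize dissimilar pairs that are close
    return 0.5 * np.mean(similar + dissimilar)

d = np.array([0.1, 2.0, 0.1, 2.0])
y = np.array([1,   1,   0,   0  ])
print(contrastive_loss(d, y))  # the similar-far and dissimilar-close pairs dominate the loss
```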

4. Regarding the Attention-based Model, which of the following statements is correct ()

  • It is a similarity-measurement model
  • It is a new kind of deep learning network
  • It is a model of input-to-output scaling
  • None of the above is correct

a. Analysis: An Attention-based Model is essentially a similarity measure: the more similar the current input is to the target state, the larger the weight of that input, meaning the current output depends more on that input.
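A minimal sketch of this idea as dot-product attention (all names and values are illustrative, not any specific model's API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy example: one query compared against 4 keys.
query = np.array([1.0, 0.0])
keys = np.array([[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]])
values = np.arange(4.0).reshape(4, 1)

scores = keys @ query       # dot-product similarity, one score per key
weights = softmax(scores)   # more similar keys receive larger weights
output = weights @ values   # the output depends most on the most similar input
print(weights.round(3), output)
```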

5. Suppose you have only a small amount of data for a specific problem, but luckily you have a neural network pre-trained on a similar problem. Which of the following methods can be used to take advantage of this pre-trained network?

  •     Freeze all but the last layer and retrain the last layer
  •     Retrain the entire model on new data
  •     Only tune the last few layers (fine tune)
  •     Evaluate each layer model and select a few of them to use

c. Analysis: Model fine-tuning method:

Small amount of data, high data similarity: change the last few layers or only the output layer.

Small amount of data, low data similarity: freeze some layers, train the rest.

Large amount of data, low data similarity: train from scratch.

Large amount of data, high data similarity: the ideal case; retrain the whole model starting from the pretrained weights.
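A minimal PyTorch sketch of option C's freeze-and-retrain recipe, assuming a recent torchvision and a hypothetical 10-class target task:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained backbone

# Freeze everything, then replace and train only the output layer.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)    # new head; trainable by default

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))          # only the new head's weights
```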

6. What are the advantages of the HK algorithm based on the quadratic criterion function compared to the perceptron algorithm?

  •     small amount of calculation
  •     Can determine whether the problem is linearly separable
  •     Its solution is fully applicable to the case of nonlinear separability
  •     The solution has better adaptability

BD. Analysis: The HK (Ho-Kashyap) algorithm obtains the weight vector under the minimum mean squared error criterion. It applies to both the linearly separable and the linearly non-separable case: when the data are linearly separable it yields the optimal weight vector; when they are not, it can detect this and exit the iterative process.
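A minimal NumPy sketch of the Ho-Kashyap iteration under these assumptions (step size and tolerances are illustrative; Y is the augmented sample matrix with the negative-class rows negated):

```python
import numpy as np

def ho_kashyap(Y, rho=0.5, max_iter=1000, tol=1e-6):
    """Minimal Ho-Kashyap sketch: minimize ||Y a - b||^2 over a and b > 0."""
    n, d = Y.shape
    b = np.ones(n)
    Y_pinv = np.linalg.pinv(Y)
    a = Y_pinv @ b
    for _ in range(max_iter):
        e = Y @ a - b                         # error vector
        if np.all(np.abs(e) < tol):
            return a, True                    # separable: a solution was found
        if np.all(e <= tol) and np.any(e < -tol):
            return a, False                   # e <= 0: problem is non-separable, exit
        b = b + rho * (e + np.abs(e))         # only increase b where e > 0
        a = Y_pinv @ b
    return a, False
```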

7. The basic calculation unit in caffe is

  • blob
  • layer
  • net
  • solver

b. Analysis: In Caffe, the blob is the data storage unit and the layer is the basic computing unit; a net stacks layers, and a solver drives training.

8. What are the advantages of the Inception structure proposed by GoogLeNet?

  •     It keeps each layer's receptive field unchanged while deepening the network, making the network more accurate
  •     It increases each layer's receptive field, improving the ability to learn small features
  •     It effectively extracts and processes high-level semantic information, effectively improving accuracy
  •     It effectively reduces the network's weight (parameter count)

d. Analysis: Inception is a multi-branch structure: large convolution kernels are split into series of small ones (e.g. a 3*3 convolution split into 3*1 and 1*3 convolutions), and 1*1 convolutions are used as bottlenecks to reduce the network's parameter count.
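A quick sanity check of the parameter savings from a 1*1 bottleneck (channel counts are illustrative):

```python
# Parameters of a conv layer ~ k_h * k_w * C_in * C_out (ignoring biases).
C_in, C_out, bottleneck = 256, 256, 64

direct = 5 * 5 * C_in * C_out               # plain 5x5 convolution
reduced = (1 * 1 * C_in * bottleneck        # 1x1 reduction ...
           + 5 * 5 * bottleneck * C_out)    # ... then 5x5 on fewer channels
print(direct, reduced, direct / reduced)    # ~1.6M vs ~0.4M parameters
```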

9. The figure below is a gradient descent diagram of a neural network training with four hidden layers using the sigmoid function as the activation function. This neural network suffers from the problem of vanishing gradients. Which of the following statements is correct? 

The first hidden layer corresponds to D, the second hidden layer corresponds to C, the third hidden layer corresponds to B, and the fourth hidden layer corresponds to A

Analysis: In backpropagation, the gradient flows from the output layer backward toward the input. With sigmoid activations it shrinks at every layer (the derivative of sigmoid is at most 0.25), and can effectively become zero before reaching the front of the network. The weights near the output layer therefore update relatively normally, while the layers near the input receive almost no gradient: their weights barely change and stay close to their initialization, so the earlier hidden layers learn far more slowly than the later ones, and as the number of hidden layers increases, classification accuracy can even decrease. This phenomenon is called the vanishing gradient problem; when the network is very deep, learning becomes very slow or impossible. Curve D has the slowest learning speed and therefore corresponds to the first hidden layer.
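A minimal PyTorch sketch that makes the effect visible (layer sizes and the loss are illustrative): the gradient norm of the first hidden layer comes out smallest.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Four hidden layers with sigmoid, as in the question.
layers = [nn.Sequential(nn.Linear(64, 64), nn.Sigmoid()) for _ in range(4)]
model = nn.Sequential(*layers, nn.Linear(64, 1))

x = torch.randn(32, 64)
loss = model(x).pow(2).mean()
loss.backward()

for i, block in enumerate(layers, 1):
    grad_norm = block[0].weight.grad.norm().item()
    print(f"hidden layer {i}: grad norm {grad_norm:.2e}")  # shrinks toward layer 1
```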

10. Which of the following statements about deep neural networks is false

A. Using gradient clipping helps to mitigate the gradient explosion problem
B. If the batch size is too small, the effect of batch normalization degrades
C. When training with SGD, if the training loss plateaus and no longer drops significantly, it can usually be reduced further by lowering the learning rate
D. Increasing the coefficient of the L2 regularization term helps to mitigate the vanishing gradient problem

d. Analysis: L2 regularization reduces model complexity and prevents overfitting; it does not alleviate vanishing gradients. Residual connections, suitable activation functions, BN, and careful initialization can alleviate vanishing gradients (gradient clipping targets gradient explosion).

11. Which of the following statements about neural networks is correct

A. The loss function must be non-convex or non-concave with respect to the input
B. There exists some deep neural network (with at least one hidden layer) in which every local optimum is a global optimum
C. Deep neural networks easily fall into local optima
D. None of the above is correct

Analysis: B. Neural networks tend to get stuck at saddle points rather than local optima; in a high-dimensional loss landscape there are very few true local minima.

12. Convolutional neural networks are commonly used as the basic structure in image mining. Which of the following statements about convolution operations (conv) and pooling (pooling) are correct?

A. conv is based on translation invariance; pooling is based on local correlation

Analysis: A. Translation invariance: the same image and a translated version of it produce the same result. Local correlation: the pooling layer exploits local correlation to downsample the image, reducing the amount of data to process while retaining useful information, which is analogous to image compression.

13. Which approach in RNN can better deal with the gradient explosion problem?

A. Use LSTM or GRU

B. Gradient clipping

C. Dropout

Analysis: B. The additive gradient paths in LSTM alleviate the vanishing gradient problem, but they do not prevent (and can even aggravate) gradient explosion. The best practice is to limit the gradient's range with gradient clipping.
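A minimal PyTorch sketch of gradient clipping inside a training step (the model and sizes are illustrative):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)

x = torch.randn(4, 100, 8)   # long sequences make explosion more likely
out, _ = rnn(x)
loss = head(out[:, -1]).pow(2).mean()

opt.zero_grad()
loss.backward()
# Rescale all gradients so their global norm is at most 1.0 before the update.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
opt.step()
```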

14. Regarding the optimizer classically used in neural networks, which of the following statements is correct

A. Adam converges more slowly than RMSprop
B. Compared with optimizers such as SGD or RMSprop, Adam converges to the best result
C. For lightweight neural networks, Adam is more suitable than RMSprop
D. Compared with optimizers such as Adam or RMSprop, SGD converges to the best result

Analysis: D. SGD can converge to the minimum, but it takes a long time. If you care about faster convergence and need to train a deep, complex network, an optimizer with an adaptive learning rate is recommended.

15. (Multiple choice) What are the main reasons that affect the effect of the clustering algorithm?

A. Feature selection
B. Pattern similarity measure
C. Classification criteria
D. Sample quality of known categories

Analysis: ABC. Clustering is unsupervised, so it does not use labeled samples; option D concerns supervised classification.

16. (Multiple choice) In data cleaning, what are the methods for dealing with missing values?

A. Estimation (imputation)
B. Listwise (whole-case) deletion
C. Variable deletion
D. Pairwise deletion

 Analysis: ABCD.
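A minimal pandas sketch of methods A-D on a toy table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [3.0, 5.0, np.nan]})

imputed = df.fillna(df.mean())            # A: estimation (here, mean imputation)
listwise = df.dropna()                    # B: drop any row containing a missing value
no_income = df.drop(columns=["income"])   # C: drop a variable with many gaps
# D (pairwise deletion) is what pandas does by default in e.g. df.corr(),
# using all available pairs of values for each statistic.
print(imputed, listwise, no_income, sep="\n\n")
```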

17. In HMM, if the observation sequence and the state sequence that generate the observation sequence are known, which of the following methods can be used to directly estimate the parameters?

A. EM algorithm
B. Viterbi algorithm
C. Forward and backward algorithm
D. Maximum likelihood estimation

Analysis: D. EM algorithm: learns model parameters when only the observation sequence is available and the state sequence is hidden (the Baum-Welch algorithm). Viterbi algorithm: uses dynamic programming to solve the HMM prediction (decoding) problem, not parameter estimation. Forward-backward algorithm: used to compute probabilities. Maximum likelihood estimation: the supervised case; when both the observation sequence and the corresponding state sequence are given, the parameters can be estimated directly by maximum likelihood. If only the observation sequence is given, EM treats the state sequence as unobserved hidden data.
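A minimal NumPy sketch of the supervised (maximum likelihood) case: with both sequences known, the parameters are just normalized counts (toy data; 2 states, 2 observation symbols):

```python
import numpy as np

states = np.array([0, 0, 1, 1, 0, 1, 1, 0])
obs = np.array([0, 1, 1, 1, 0, 1, 0, 0])
S, V = 2, 2

A = np.zeros((S, S))   # transition counts
B = np.zeros((S, V))   # emission counts
for t in range(len(states) - 1):
    A[states[t], states[t + 1]] += 1
for s, o in zip(states, obs):
    B[s, o] += 1

A /= A.sum(axis=1, keepdims=True)  # P(next state | state)
B /= B.sum(axis=1, keepdims=True)  # P(observation | state)
print(A, B, sep="\n")
```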

18. What happens if L1 and L2 norms are added to Logistic Regression at the same time?

A. Can do feature selection and prevent overfitting to a certain extent 

B. Can solve the curse of dimensionality problem 

C. Can speed up calculation 

D. More accurate results can be obtained

Analysis: A. The L1 norm produces sparse solutions (it drives some coefficients to exactly zero), but note that a feature not selected by L1 is not necessarily unimportant: of two highly correlated features, only one may be retained. If you need to determine which features are important, use cross-validation.
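A minimal scikit-learn sketch of combining both penalties (the elastic net); the dataset is synthetic and the hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Elastic net = L1 + L2 penalties together; l1_ratio balances the two.
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
clf.fit(X, y)
print((clf.coef_ == 0).sum(), "coefficients zeroed out by the L1 part")
```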

19. (Multiple choice) Suppose a classmate accidentally duplicated two feature dimensions of the training data when using the Naive Bayes (NB) classification model. Which of the following statements about NB are then correct?

A. The decisive role of this repeated feature in the model will be strengthened 

B. The accuracy of the model effect will be reduced compared to the case of no repeated features 

C. If all the features are repeated, the obtained model prediction results are the same as the model prediction results without repetition.

D. When the two columns of features are highly correlated, the conclusions obtained when the two columns of features are the same cannot be used to analyze the problem 

E. NB can be used for least squares regression 

F. None of the above statements are correct 

Analysis: BD. Duplicating a feature violates NB's conditional-independence assumption (the duplicated feature is effectively counted twice), which generally lowers accuracy (B); and a conclusion derived for two identical columns cannot be carried over to the case of two merely highly correlated columns (D).

20. (Multiple choice) Which of the following model methods belong to the discriminative model (Discriminative Model)?

A. Gaussian mixture model

B. Conditional random field model

C. Discriminative training

D. Hidden Markov model

Analysis: BC. Common discriminative models: logistic regression, linear discriminant analysis, support vector machines, boosting (ensemble learning), conditional random fields, linear regression, neural networks.

Common generative models: Gaussian mixture models and other mixture models, hidden Markov models, naive Bayes, AODE (averaged one-dependence estimators), latent Dirichlet allocation (LDA topic model), restricted Boltzmann machines.

A generative model learns the joint probability distribution and derives the prediction from it, while a discriminative model directly computes the output from the given input.

21. There are two sample points: the first is a positive sample with feature vector (0, -1); the second is a negative sample with feature vector (2, 3). Constructing a linear SVM classifier from the training set consisting of these two points, what is the equation of the classification boundary?

A. 2x+y=4 

B. x+2y=5 

C. x+2y=3 

D. 2x-y=0

Analysis: C. For two points, the maximum-margin boundary is the perpendicular bisector of the segment joining them: the midpoint is (1, 1) and the normal direction is (2, 3) - (0, -1) = (2, 4), proportional to (1, 2), which gives (x - 1) + 2(y - 1) = 0, i.e. x + 2y = 3.
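A quick numerical check with scikit-learn (a sketch; the large C approximates a hard margin):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, -1], [2, 3]])
y = np.array([1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(w / w[0], -b / w[0])   # normal ~ (1, 2), boundary x + 2y = 3
```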

22. Which of the following statements about Logit regression and SVM is incorrect?

A. Logit regression objective function is to minimize the posterior probability 

B. Logit regression can be used to predict the probability of event occurrence 

C. The goal of SVM is to minimize structural risk 

D. SVM can effectively avoid model overfitting

Analysis: A. Logit regression essentially performs maximum likelihood estimation of the weights from the samples; the posterior probability is proportional to the product of the prior probability and the likelihood function. Logit regression only maximizes the likelihood function; it neither maximizes nor, much less, minimizes the posterior probability. (Maximizing the posterior probability is what the naive Bayes classifier does.)

 23. (Multiple choice) Which of the following statements is correct ?

A. SVM is robust to noise (such as noisy samples from other distributions) 

B. In the AdaBoost algorithm, the weight update ratio of all misclassified samples is the same 

C. Boosting and Bagging are both methods of combining multiple classifier votes, both of which determine their weight based on the correct rate of a single classifier 

D. Given n data points, if half are used for training and the other half for testing, the gap between training error and testing error will decrease as n increases

Analysis: BD. SVM itself has some robustness to noise, but experiments show that while noise below a certain rate has little impact on SVM, the classifier's recognition rate drops as the noise rate keeps increasing, so A is too strong. In Bagging each prediction function carries no weight, whereas in Boosting the predictors are weighted, so C is incorrect.

24. (Multiple choice) Which of the following is the best criterion for a linear classifier?

A. Perceptual criterion function 

B. Bayesian Classification 

C. Support vector machine 

D. Fisher criterion

Analysis: ACD. There are three classic criteria for linear classifiers: the perceptron criterion function, SVM, and the Fisher criterion. A Bayesian classifier is not a linear classifier.

Perceptron criterion function: the principle is to minimize the sum of the distances from misclassified samples to the decision boundary. Its advantage is that the classifier is corrected using the information provided by misclassified samples. This criterion is the basis of the multilayer perceptron in artificial neural networks.

Support vector machine: the basic idea is that, under the condition that the two classes are linearly separable, the designed decision boundary maximizes the margin between the two classes; its basic starting point is to minimize the expected generalization risk. (Nonlinear problems can be handled with kernel functions.)

Fisher criterion: more broadly known as linear discriminant analysis (LDA), it projects all samples onto a line so that samples of the same class end up as close together as possible and samples of different classes as far apart as possible; concretely, it maximizes the "generalized Rayleigh quotient".
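For reference, the generalized Rayleigh quotient that the Fisher criterion maximizes can be written (with S_B the between-class scatter matrix and S_W the within-class scatter matrix) as:

$$ J(w) = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w} $$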

25. Which of the following time series models can better fit the analysis and prediction of volatility?

A. AR model 

B. MA model 

C. ARMA model 

D. GARCH model

Analysis: D.

The AR (autoregressive) model is a kind of linear prediction: given N data points, the model can infer the data before or after the Nth point, so in essence it is similar to interpolation.

The MA (moving average) model establishes a linear trend prediction model using the trend moving-average method.

The ARMA (autoregressive moving average) model is one of the high-resolution spectral analysis methods of the model-parameter approach, and a typical method for studying the rational spectra of stationary stochastic processes. Compared with the AR and MA model methods, it provides more accurate spectral estimation and better spectral resolution, but its parameter estimation is more cumbersome.

The GARCH (generalized ARCH) model is an extension of the ARCH model; the GARCH(p, 0) model is equivalent to ARCH(p). GARCH is a regression model tailored to financial data: beyond what an ordinary regression model does, it further models the variance of the error term, which makes it especially suitable for analyzing and forecasting volatility. Such analysis can play an important guiding role in investors' decisions, and its significance often exceeds the analysis and prediction of the values themselves.
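As a concrete instance, the variance equation of the widely used GARCH(1,1) model is:

$$ \sigma_t^2 = \omega + \alpha\, \varepsilon_{t-1}^2 + \beta\, \sigma_{t-1}^2 $$

where ε(t−1) is the previous period's shock (residual) and σ²(t−1) the previous conditional variance; ARCH(p) is the special case without the β term.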

26. Suppose we encounter a problem during training: after a few epochs, the error suddenly drops. You suspect something is wrong with the data, so you plot it and find that the data may be too skewed, which could be causing the problem. What will you do to deal with this?

A. Normalize the data 

B. Take the logarithm of the data 

C. Neither 

D. Perform principal component analysis (PCA) and normalization on the data

 Analysis: D.

27. Which decision boundary below is generated by a neural network?

Analysis: ABCD. A neural network with enough capacity can produce any of these decision boundaries.

28. The figure below shows that when training starts, the error stays high at first because the neural network is stuck in a local minimum before making progress toward the global minimum. To avoid this, which of the following strategies can we adopt?

A. Change the learning rate, e.g. vary it continuously over the first few training epochs 

B. Initially reduce the learning rate by a factor of 10, then use the momentum term (momentum) 

C. Increase the number of parameters so that the neural network does not get stuck in a local optimum 

D. None of the above

 Analysis: A.

29. For a classification task, suppose the neural network's weights are not randomly initialized at the start but are instead all set to 0. Which of the following statements is correct?

A. None of the other options are correct 

B. No problem, the neural network will start training normally 

C. Neural networks can be trained, but all neurons end up recognizing the same thing 

D. The neural network will not start training because no gradient changes

Analysis: C. With all weights equal to zero, every neuron in a layer receives the same gradient and remains identical to its peers; the symmetry is never broken, so the network trains but all neurons end up recognizing the same thing.

30. Suppose we have trained a convolutional neural network on the ImageNet dataset (object recognition), and then feed this network an all-white picture. Is the output for this input equally likely to be any class of object?

Analysis: No. The output is determined by the learned weights and biases, so even for an all-white input the class scores will generally not be uniform.

31. Which of the following statements about model capacity is correct? (Refers to the ability of the neural network model to fit complex functions)

A. The number of hidden layers increases, and the model ability increases 

B. The proportion of Dropout increases, and the model ability increases 

C. As the learning rate increases, the model capacity increases 

D. None of the above is correct

Analysis: A. Strictly speaking, this is only approximately right: adding hidden layers generally increases capacity, while increasing the dropout ratio reduces effective capacity, and the learning rate does not determine capacity.

32. The neural network model got its name because it was inspired by the human brain. The neural network is composed of many neurons (Neuron), each neuron accepts an input, processes the input and gives an output. Which of the following statements about neurons is correct?

A. Each neuron has only one input and one output 

B. Each neuron has multiple inputs and one output 

C. Each neuron has one input and multiple outputs 

D. Each neuron has multiple inputs and multiple outputs 

E. All of the above are correct

Analysis: E. Depending on the architecture, a neuron may have any of these input/output configurations.
