Machine Learning Final Question Bank
1. The machine learning algorithm that belongs to supervised learning is: Bayesian classifier
2. The machine learning algorithm that belongs to unsupervised learning is: hierarchical clustering
3. The conjugate distribution of the binomial distribution is: Beta distribution
4. The conjugate distribution of the multinomial distribution is: Dirichlet distribution
5. The defining characteristic of the naive Bayes classifier is: it assumes that the sample's attributes are mutually independent given the class
6. Which of the following methods does not consider the prior distribution: maximum likelihood estimation
7. For a Bayesian classifier with normal class densities, when all classes share the same covariance matrix, the decision function is: a linear decision function
8. The following are linear classification methods: Perceptron
9. The following methods are not affected by data normalization: decision tree
10. The gradient descent method will not be used in the following classification methods: minimum distance classifier
11. The following method uses maximum likelihood estimation: Logistic regression
12. The most accurate description of linear discriminant analysis is to find a projection direction such that: the intra-class distance is the smallest and the inter-class distance is the largest
13. The principle of SVM can be briefly summarized as: maximum-margin classification
14. The performance of SVM depends on: all of the above (choice of kernel function, kernel function parameters, soft-margin parameter C)
15. The dual problem of support vector machine is: convex quadratic optimization
16. The following description of the support vectors in a support vector machine is correct: they are the vectors lying on the maximum-margin boundaries
17. Suppose you use an SVM with a polynomial kernel of degree 2. After applying the model to a real data set, the training accuracy and test accuracy are both 100%. If you now increase the complexity of the model (increase the degree of the kernel function), which of the following will happen: overfitting
18. To avoid direct complex nonlinear transformation, the method of using linear means to realize nonlinear learning is: Kernel function method
19. The correct description of the decision tree node division index is: the greater the information gain, the better
20. In the following description, which belongs to the decision tree strategy is: maximum information gain
21. In ensemble learning, which choice of base classifiers usually gives better results: classifiers that are diverse and differ substantially from one another
22. In ensemble learning, the minimum requirement for the correct rate of each base classifier: more than 50%
23. A characteristic of the Bagging method is: bootstrapping is used when constructing the training sets
25. The random forest method belongs to: Bagging method
26. Suppose a data set S contains many errors and you train a soft-margin SVM with penalty parameter C. If the value of C is very small, which of the following statements is correct: misclassification will occur
27. The penalty parameter C of a soft-margin SVM tends to infinity. Which of the following statements is correct: as long as the optimal separating hyperplane exists, it can correctly classify all the data
28. In general, under what circumstances does the k-nearest-neighbor (k-NN) method work well: few samples that are highly representative
29. The difference between the regression problem and the classification problem: the former predicts the function value as a continuous value, and the latter as a discrete value
30. Least squares regression is equivalent to: maximum likelihood regression with a linear mean and normally distributed errors
31. Regularized regression analysis can avoid: overfitting
32. The "beer and diapers" story describes how analysis of supermarket shopping lists revealed that men who buy diapers often also buy beer. What kind of problem is this: association analysis
33. KL divergence constructs a separability criterion based on: class probability densities
34. The relationship between samples that density-based clustering fully takes into account is: density-reachability
35. In Gaussian mixture clustering, which of the following procedures is used: EM algorithm
36. What method is principal component analysis: dimensionality reduction method
37. When PCA performs dimensionality reduction, which directions are preferred: the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the centered samples
38. In the phenomenon of overfitting: the error on the training samples is the smallest, but the correct recognition rate on the test samples is very low
39. As shown in the directed graph on the right, the Markov blanket of node G is: {D, E, F, H, I, J}
40. As shown in the undirected graph on the right, the Markov blanket of node G is: {D, E, I, J}
41. In a multi-layer perceptron, which function can be used as the nonlinear activation function of neurons: logistic function
42. On the finite support set, the entropy of the following distribution is the largest: uniform distribution
43. Knowing the mean and variance, which of the following distributions has the largest entropy: Gaussian distribution
44. Among the following models, the probabilistic graphical model is: Restricted Boltzmann Machine
45. As shown in the directed graph on the right, the following statement is correct: B and G are conditionally independent given {C, F}
46. In the standardization formula, a small constant (commonly written ε) is added to the denominator in order to prevent the denominator from being zero
47. What are the correct steps for the gradient descent algorithm: 4, 3, 1, 5, 2 (initialize, feed input, compute error, change weights to reduce error, iterate)
(1) Calculate the difference between the predicted value and the real value
(2) Update iteratively until the optimal weights are found
(3) Pass the input into the network to get the output value
(4) Initialize random weights and biases
(5) For each neuron that generates an error, change the corresponding weight value to reduce the error
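The five steps above can be sketched for a single linear neuron. This is only a minimal illustration: the squared-error loss, the fixed learning rate, the function name `train_linear_neuron`, and the toy data are all assumptions and do not come from the question.

```python
import random

def train_linear_neuron(xs, ys, lr=0.05, epochs=2000, seed=0):
    """Fit y ~ w*x + b with plain gradient descent (illustrative only)."""
    rng = random.Random(seed)
    # (4) initialize a random weight and bias
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    m = len(xs)
    for _ in range(epochs):  # (2) repeat the update many times
        # (3) pass the inputs through the model to get outputs
        preds = [w * x + b for x in xs]
        # (1) difference between predicted and real values
        errors = [p - y for p, y in zip(preds, ys)]
        # (5) change the weights in the direction that reduces the error
        grad_w = sum(e * x for e, x in zip(errors, xs)) / m
        grad_b = sum(errors) / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (an assumption for the demo)
w, b = train_linear_neuron([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

On this toy data the learned parameters should end up close to w = 2 and b = 1.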
48. Suppose you fit the sample data with a fairly complex regression model and use ridge regression, tuning the regularization parameter λ to reduce the complexity of the model. If λ is large, which statement about bias and variance is correct: the bias increases and the variance decreases
49. Which of the following methods will increase the risk of underfitting: data augmentation
50. The following statement is correct: in addition to the EM algorithm, gradient descent can also be used to find the parameters of a Gaussian mixture model
51. When training a neural network, if the training error is too high, which of the following methods cannot greatly reduce the training error: increase training data
52. Which of the following activation functions can cause vanishing gradients: Tanh
53. Increasing which of the following hyperparameters may cause the random forest model to overfit the data: (2) Depth of the decision tree
54. Which of the following statements about deep network training is correct: D
A. The training process requires the use of gradients, which measure the rate of change of the loss function relative to the model parameters
B. The loss function measures the difference between the model prediction results and the real values
C. The training process is based on a technique called backpropagation
D. All other options are correct
55. Which of the following introduces nonlinearity in a neural network: ReLU
56. When using a regularization term in linear regression, you find that many coefficients of the solution are 0. This regularization term may be: L0-norm, L1-norm
57. Regarding CNNs, the following conclusion is correct: the pooling layer is used to reduce the spatial resolution of images
58. Regarding the k-means algorithm, the correct description is: the initial value is different, and the final result may be different
59. Which of the following descriptions about the phenomenon of overfitting is correct: the training error is small, and the test error is large
60. The following statement about the convolutional neural network is correct: the convolutional neural network can have multiple convolution kernels, which can be of different sizes
61. The loss function of the LR model is: cross entropy
62. The statement of GRU and LSTM is correct: GRU has fewer parameters than LSTM
63. The following methods cannot be used for feature dimensionality reduction: Monte Carlo method
64. Which of the following functions cannot be used as an activation function: y=2x
65. There are two sample points: the first is a positive sample with feature vector (0,-1), and the second is a negative sample with feature vector (2,3). The separating hyperplane of the linear SVM classifier built from a training set consisting of these two points is: x+2y=3
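The answer to question 65 can be checked with a few lines of code. With only one positive and one negative sample, the maximum-margin separator is the perpendicular bisector of the segment joining them. The helper name `two_point_svm` is hypothetical, and the returned weights are not rescaled to the canonical margin; only the hyperplane itself is being checked.

```python
def two_point_svm(pos, neg):
    """Max-margin separator for one positive and one negative point.

    The normal vector w is parallel to (pos - neg) and the hyperplane
    passes through the midpoint of the two samples.
    """
    w = [p - n for p, n in zip(pos, neg)]
    mid = [(p + n) / 2 for p, n in zip(pos, neg)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b  # decision function: w . x + b

w, b = two_point_svm((0, -1), (2, 3))
# w = [-2, -4], b = 6: the plane -2x - 4y + 6 = 0, i.e. x + 2y = 3
```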
66. Under the premise that other conditions remain unchanged, which of the following practices is likely to cause overfitting problems in machine learning: Gaussian kernels are used instead of linear kernels in the SVM algorithm
67. The following method belongs to the unsupervised learning algorithm: K-Means clustering
68. What does Bootstrap data mean: sample n samples from a total of N samples with replacement
69. The following description of the Bayesian classifier is wrong: the prior probability is derived based on the posterior probability
70. The following description of the AdaBoost algorithm is wrong: it learns multiple weak classifiers independently and in parallel
71. For which of the following machine learning models is normalization unnecessary during data preprocessing: tree models
72. In a binary classification task, there are three classifiers h1, h2, h3 and three test samples x1, x2, x3. Let 1 mean the classification result is correct and 0 mean it is wrong. The results of h1 on x1, x2, x3 are (1,1,0), and those of h2 and h3 are (0,1,1) and (1,0,1). If the three classifiers are combined by majority voting, the following statement is correct: the ensemble improves performance
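The voting result in question 72 can be verified directly. The sketch below relies on the fact that, in binary classification, the majority vote is correct on a sample exactly when more than half of the classifiers are correct on it; the variable names are illustrative.

```python
# Rows: classifiers h1, h2, h3; columns: test samples x1, x2, x3.
# 1 = classified correctly, 0 = classified incorrectly (from the question).
results = {
    "h1": (1, 1, 0),
    "h2": (0, 1, 1),
    "h3": (1, 0, 1),
}
n_samples = 3

# Majority vote per sample: correct if more than half the classifiers are correct
ensemble = tuple(
    1 if sum(r[i] for r in results.values()) * 2 > len(results) else 0
    for i in range(n_samples)
)

individual_acc = {name: sum(r) / n_samples for name, r in results.items()}
ensemble_acc = sum(ensemble) / n_samples
```

Each classifier alone is correct on 2 of 3 samples, while the vote is correct on all 3, because the classifiers make their errors on different samples.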
73. Regarding the Precision and Recall of the machine learning classification algorithm, which of the following definitions is correct (assuming tp = true positive, tn = true negative, fp = false positive, fn = false negative):
Precision= tp / (tp + fp), Recall = tp / (tp + fn)
74. Which of the following is not a commonly used feature selection algorithm for text classification: principal component analysis
75. In HMM, if the observation sequence and the state sequence that produces the observation sequence are known, which of the following methods can be used to directly estimate the parameters: maximum likelihood estimation
76. Which of the following distances will focus on the direction of the vector: cosine distance
77. The algorithm that solves the prediction (decoding) problem in the hidden Markov model is: Viterbi algorithm
78. In Logistic Regression, if L1 and L2 norms are added at the same time, what effect will it have: feature selection can be done, and overfitting can be prevented to a certain extent
79. What is the technical difference between the ordinary backpropagation algorithm and the backpropagation over time algorithm (BPTT): Unlike ordinary backpropagation, BPTT will superimpose the gradient of all corresponding weights in each time step
80. The exploding gradient problem means that when training a deep neural network, the gradients become too large and the loss function becomes infinite. In RNNs, which of the following methods deals with exploding gradients well: gradient clipping
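One common form of gradient clipping rescales the whole gradient when its L2 norm exceeds a threshold. The sketch below shows that idea on a flat list of gradient values; the function name `clip_by_global_norm` and the threshold are assumptions for illustration, not the API of a particular library.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient with norm 50 is rescaled to norm 5; its direction is unchanged
clipped = clip_by_global_norm([30.0, 40.0], max_norm=5.0)
```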
81. When training a neural network for image recognition tasks, it is common to plot the training set error and validation set error for debugging. In the figure below, when is the best time to stop training: C
[Figure: training set error and validation set error curves; image unavailable in the source]
Question 1
A computer program learns a task T from experience E and uses P to measure performance. Also, the performance P of T increases with the increase of experience E.
Suppose we feed a learning algorithm a lot of historical weather data and let it learn to predict the weather. What is a reasonable choice for P?
A. The process of calculating a large amount of historical weather data
B. None of the above
C. The probability of correctly predicting the weather for a future date
D. The task of weather forecasting
Question 2
Suppose you are doing weather forecasting and using an algorithm to predict the temperature (Celsius/Fahrenheit) tomorrow, would you treat this as a classification problem or a regression problem?
A. Classification
B. Regression
Question 3
Let's say you're doing stock market predictions. You want to predict whether a company will declare bankruptcy within the next 7 days (by training on data from similar companies that were previously at risk of bankruptcy). Would you treat this as a classification problem or a regression problem?
A. Classification
B. Regression
Question 4
Some of the problems below are best solved using supervised learning algorithms, while others should be solved using unsupervised learning algorithms. For which of the following would you use supervised learning? (Select all that apply.) In each case, assume that an appropriate data set is available for the algorithm to learn from.
A. Based on a person's genetic (DNA) data, predict his/her chance of developing diabetes in the next 10 years
B. Based on a large dataset of medical records of cardiac patients, try to understand whether there are different patient groups for whom we can tailor different treatment options
C. Have a computer examine a piece of audio and classify whether the audio contains a human voice (i.e. someone singing) or only instruments (no vocals)
D. Given data on the responses of 1000 medical patients to an experimental drug (such as treatment effects, side effects, etc.), discover whether there are different categories or "types" of patient response to the drug, and if so, what are those categories
Question 5
Which is a reasonable definition of machine learning?
A. Machine learning learns from labeled data
B. Machine learning enables computers to learn without being explicitly programmed
C. Machine learning is the science of computer programming
D. Machine learning is the field that allows robots to act intelligently
Question 6
Based on a student's performance in his freshman year, predict how he will perform in his sophomore year.
Let x be equal to the number of "A's" (including A-, A, and A+ grades) the student gets in his first year of college. Predict the value of y: the number of "A" grades obtained in the second year
Each row of the table below is one training example. In linear regression, our hypothesis is hθ(x) = θ0 + θ1x, and we use m to denote the number of training examples.
| x | y |
|---|---|
| 3 | 2 |
| 1 | 2 |
| 0 | 1 |
| 4 | 3 |
For the training set given above (note that this training set can also be referenced in other questions on this quiz), what is the value of m?
Question 7
For this problem, assume we use the training set from Question 6, and define the cost function as $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$. Find $J(0,1)$.
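The cost can be evaluated directly from its definition. The sketch below uses the four training examples from Question 6 and the squared-error cost with the 1/(2m) factor; the function name `cost` is illustrative.

```python
# Training set from Question 6
xs = [3, 1, 0, 4]
ys = [2, 2, 1, 3]

def cost(theta0, theta1, xs, ys):
    """J = (1 / 2m) * sum((h(x) - y)^2), with h(x) = theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# With theta0 = 0 and theta1 = 1 the residuals are 1, -1, -1, 1,
# so J = 4 / (2 * 4) = 0.5
j = cost(0, 1, xs, ys)
```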
Question 8
Using the linear regression hypothesis from Question 6, suppose θ0=−1 and θ1=2. What is hθ(6)?
Question 9
The relationship between the cost function J(θ0,θ1) and θ0,θ1 is shown in Figure 2. A contour plot of the same cost function is given in "Fig. 1". According to the illustration, choose the correct option (check all the correct items)
A. Starting from point B, a gradient descent algorithm with an appropriate learning rate will eventually help us reach or approach point A, that is, the cost function J(θ0,θ1) has a minimum value at point A
B. Point P (global minimum in Figure 2) corresponds to point C in Figure 1
C. Starting from point B, a gradient descent algorithm with an appropriate learning rate will eventually help us reach or approach point C, that is, the cost function J(θ0,θ1) has a minimum value at point C
D. Starting from point B, a gradient descent algorithm with an appropriate learning rate will eventually help us reach or approach point A, that is, the cost function J(θ0,θ1) has a maximum value at point A
E. Point P (global minimum in Figure 2) corresponds to point A in Figure 1
Question 10
Suppose for some linear regression problem (such as predicting housing prices), we have some training sets. For our training set, we can find some θ0, θ1 such that J(θ0, θ1)=0.
Which of the following statements is true? (check all correct items)
A. In order to achieve this, we must have θ0=0,θ1=0, so that J(θ0,θ1)=0
B. For the values of θ0 and θ1 satisfying J(θ0,θ1)=0, we have hθ(x(i))=y(i) for every training example (x(i), y(i))
C. This is impossible: by the definition of J(θ0,θ1), there cannot be θ0,θ1 such that J(θ0,θ1)=0
D. We can perfectly predict the value of y even for new examples we haven't seen yet (e.g. we can perfectly predict the price of a new house we haven't seen yet)
Question 11
[Question image unavailable in the source]
Question 12
[Question image unavailable in the source]
Question 13
[Question image unavailable in the source]
Question 14
[Question image unavailable in the source]
Question 15
Let A and B be 3x3 matrices, which of the following must be correct (choose all correct items)
A. A+B=B+A
B. If v is a 3D vector, then A∗B∗v is a 3D vector
C. A∗B∗A=B∗A∗B
D. If C=A∗B, then C is a 6x6 matrix
Question 16
Assume m=4 students take a class with midterm and final exams. You have collected a dataset of their scores on two exams as follows:
| midterm score | (midterm score)^2 | final score |
|---|---|---|
| 89 | 7921 | 96 |
| 72 | 5184 | 74 |
| 94 | 8836 | 87 |
| 69 | 4761 | 78 |
You want to use polynomial regression to predict a student's final exam score from their midterm score. Specifically, suppose you want to fit a model of the form hθ(x)=θ0+θ1x1+θ2x2, where x1 is the midterm score and x2 is (midterm score)^2. Also, you plan to use both feature scaling (dividing by the "max-min", or range, of the feature) and mean normalization.
What is the normalized value of the feature $x_2^{(4)}$? (Hint: midterm=89, final=96 is training example 1)
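The normalization described above (subtract the mean, then divide by the range) can be worked through in code. This is a small sketch using the midterm scores from the table; the variable names are illustrative.

```python
midterm = [89, 72, 94, 69]
x2 = [s ** 2 for s in midterm]       # 7921, 5184, 8836, 4761

mean = sum(x2) / len(x2)             # 6675.5
value_range = max(x2) - min(x2)      # 8836 - 4761 = 4075

# Mean normalization followed by scaling by the range
x2_scaled = [(v - mean) / value_range for v in x2]
x2_4 = x2_scaled[3]                  # fourth training example, about -0.47
```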
Question 17
You run 15 iterations of gradient descent with α = 0.3 and compute J(θ) after each iteration. You find that the value of J(θ) decreases slowly and is still decreasing after 15 iterations. Based on this, which of the following conclusions seems most plausible?
A. α=0.3 is a valid choice for the learning rate.
B. Instead of using the current value of α, it is better to try a smaller α value (such as α=0.1)
C. Instead of using the current value of α, it is better to try a larger α value (such as α=1.0)
Question 18
Suppose you have m=14 training examples with n=3 features (not counting the additional intercept feature that is always 1). The normal equation is $\theta=(X^TX)^{-1}X^Ty$. For the given values of m and n, what are the dimensions of θ, X, and y in this equation?
A. X 14×3, y 14×1, θ 3×3
B. X 14×4, y 14×1, θ 4×1
C. X 14×3, y 14×1, θ 3×1
D. X 14×4, y 14×4, θ 4×4
Question 19
Suppose you have a dataset with m=1000000 examples and n=200000 features per example. You want to fit the parameters θ to our data using multiple linear regression. Should you use gradient descent or normal equations?
A. Gradient descent, because computing $(X^TX)^{-1}$ in the normal equation is very slow
B. The normal equation because it provides an efficient way to solve directly
C. Gradient descent because it always converges to the optimal θ
D. Normal equation, because gradient descent may not find the optimal θ
Question 20
Which of the following are reasons to use feature scaling?
A. It prevents gradient descent from falling into a local optimum
B. It speeds up gradient descent by reducing the computational cost of each iteration of gradient descent
C. It speeds up gradient descent by reducing the number of iterations to get a good solution
D. It prevents the matrix $X^TX$ (used in the normal equation) from being non-invertible (singular/degenerate)
Question 26
Suppose you have trained a logistic classifier that outputs a prediction hθ(x) = 0.4 on a new example x. This means (choose all correct items):
A. Our estimate of P(y=0∣x;θ) is 0.4
B. Our estimate of P(y=1∣x;θ) is 0.6
C. Our estimate of P(y=0∣x;θ) is 0.6
D. Our estimate of P(y=1∣x;θ) is 0.4
Question 27
Suppose you have the following training set, and fit a logistic regression classifier hθ(x)=g(θ0+θ1x1+θ2x2)
Which of the following is correct? Check all correct items
A. Adding polynomial features (for example, using hθ(x)=g(θ0+θ1x1+θ2x2+θ3x1^2+θ4x1x2+θ5x2^2)) can improve how well we fit the training data
B. At the optimal value of θ (e.g. found by fminunc), J(θ)≥0
C. Adding polynomial features (e.g. using hθ(x)=g(θ0+θ1x1+θ2x2+θ3x1^2+θ4x1x2+θ5x2^2)) will increase J(θ) because we are now summing more terms
D. If we run gradient descent for enough iterations, then for some examples x(i) in the training set it is possible to get hθ(x(i))>1
Question 28
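For logistic regression, the gradient is given by $\frac{\partial}{\partial\theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$. Which of the following is the correct gradient descent update for logistic regression with learning rate α? Check all correct items
A. $\theta:=\theta-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(\theta^Tx-y^{(i)}\right)x^{(i)}$
B. $\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(\frac{1}{1+e^{-\theta^Tx^{(i)}}}-y^{(i)}\right)x_j^{(i)}$ (update all j at the same time)
C. $\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}$ (update all j at the same time)
D. $\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$ (update all j at the same time)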
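The per-component update with the sigmoid hypothesis can be sketched as a single gradient step. The toy data, the learning rate, and the function name `gradient_step` are made up for illustration; the point is that every partial derivative is computed before any parameter changes, which is what "update all j at the same time" means.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j for logistic regression.

    X is a list of feature rows (each starting with the intercept term 1),
    y is a list of 0/1 labels.
    """
    m = len(X)
    # h_theta(x^(i)) for every training example
    h = [sigmoid(sum(t * xj for t, xj in zip(theta, row))) for row in X]
    # Compute all partial derivatives before changing any theta_j
    grad = [
        sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
        for j in range(len(theta))
    ]
    return [t - alpha * g for t, g in zip(theta, grad)]

# Toy data (made up): intercept term plus one feature
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = gradient_step([0.0, 0.0], X, y, alpha=0.1)
```

Starting from θ = 0, every prediction is 0.5, so the intercept gradient is 0 and the feature gradient is −0.5; one step with α = 0.1 should therefore give θ close to [0, 0.05].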
Question 29
Which of the following statements is true? Check all correct items
A. For logistic regression, gradient descent sometimes converges to a local minimum (and fails to find a global minimum). That's why we prefer more advanced optimization algorithms like fminunc (conjugate gradient/BFGS/L-BFGS/etc)
B. The value of the sigmoid function $g(z)=\frac{1}{1+e^{-z}}$ will never be greater than 1
C. The cost function J(θ) of logistic regression trained with m≥1 examples is always greater than or equal to zero
D. Using linear regression + threshold method for classification prediction is always very effective
Question 30
Suppose you train a logistic regression classifier hθ(x)=g(θ0+θ1x1+θ2x2). Assuming θ0=6, θ1=−1, θ2=0, which of the following graphs represents the decision boundary found by the classifier?
A.
B.
C.
D.
Week 3 | 2 Regularization
Question 31
You are training a logistic regression model for classification. Which of the following statements is true? Check all correct items
A. Introducing regularization into the model always achieves the same or better performance on the training set
B. Adding many new features to the model helps prevent overfitting on the training set
C. Introducing regularization into the model will always achieve the same or better performance for examples not in the training set
D. Adding new features to the model will always result in equal or better performance on the training set
Question 32
Suppose you run logistic regression twice, once with λ=0 and once with λ=1. One run yields the parameters θ=[81.47; 12.69], and the other yields θ=[13.01; 0.91].
However, you forget which value of λ corresponds to which value of θ. Which do you think corresponds to λ=1?
A. θ=[13.01; 0.91]
B. θ=[81.47; 12.69]
Question 33
Which of the following statements about regularization is true? Check all correct items
A. Using too large a value for λ may cause your hypothesis to overfit the data; this can be avoided by reducing λ
B. Using a very large value for λ does not affect the performance of the hypothesis; the only reason we do not set λ too large is to avoid numerical problems
C. Consider a classification problem. Adding regularization may cause the classifier to misclassify some training examples (when no regularization is used, i.e. when λ=0, it correctly classifies these examples)
D. Since the output value of logistic regression is 0≤hθ(x)≤1, the range of its output value can only be "shrunk" by regularization anyway, so regularization usually does not help it
Question 34
Which of the following figures shows a hypothesis that overfits the training set?
A.
B.
C.
D.
Question 35
Which of the following figures shows a hypothesis that underfits the training set?
A.
B.
C.
D.
Question 36
Which of the following statements is true? select all correct
A. The activation value of the hidden unit in the neural network, after applying the sigmoid function, is always in the range (0, 1)
B. Logical functions on binary values (0 or 1) can be (approximately) represented by some neural networks
C. A two-layer (one input layer, one output layer, no hidden layer) neural network can represent an XOR function
D. Suppose there is a multi-class classification problem with three classes, trained with a three-layer network. Let $a_1^{(3)}=(h_\Theta(x))_1$ be the activation of the first output unit, and similarly let $a_2^{(3)}=(h_\Theta(x))_2$ and $a_3^{(3)}=(h_\Theta(x))_3$. Then for any input x, it must be the case that $a_1^{(3)}+a_2^{(3)}+a_3^{(3)}=1$
Question 37
Consider the following neural network with two binary inputs x1,x2∈{0,1} and output hΘ(x). Which of the following logical functions does it (approximately) compute?
A. OR
B. AND
C. NAND (not AND)
D. XOR (exclusive or)
Question 38
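Consider the neural network given below. Which of the following equations correctly computes the activation $a_1^{(3)}$? Note: g(z) is the sigmoid activation function
A. $a_1^{(3)}=g\left(\Theta_{1,0}^{(2)}a_0^{(2)}+\Theta_{1,1}^{(2)}a_1^{(2)}+\Theta_{1,2}^{(2)}a_2^{(2)}\right)$
B. $a_1^{(3)}=g\left(\Theta_{1,0}^{(1)}a_0^{(1)}+\Theta_{1,1}^{(1)}a_1^{(1)}+\Theta_{1,2}^{(1)}a_2^{(1)}\right)$
C. $a_1^{(3)}=g\left(\Theta_{1,0}^{(1)}a_0^{(2)}+\Theta_{1,1}^{(1)}a_1^{(2)}+\Theta_{1,2}^{(1)}a_2^{(2)}\right)$
D. There is no activation $a_1^{(3)}$ in this network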
Question 39
You have the following neural network:
You want to compute the activations of the hidden layer $a^{(2)}\in\mathbb{R}^3$. One way is to use the following Octave code:
You want a vectorized implementation (i.e., one that doesn't use loops). Which of the following implementations correctly computes a(2)? Check all correct items
A. z = Theta1 * x; a2 = sigmoid (z)
B. a2 = sigmoid (x * Theta1)
C. a2 = sigmoid (Theta2 * x)
D. z = sigmoid(x); a2 = sigmoid (Theta1 * z)
Question 40
You are using the neural network shown below and have learned the parameters Θ(1)=[1 1 2.4; 1 1.7 3.2] (used to compute a(2)) and Θ(2)=[1 0.3 −1.2] (used to compute a(3) as a function of a(2)).
Suppose you swap the parameters of the 2 units of the first hidden layer, giving Θ(1)=[1 1.7 3.2; 1 1 2.4], and also swap the output layer parameters, giving Θ(2)=[1 −1.2 0.3]. How will this change the value of the output hΘ(x)?
A. No change
B. Bigger
C. Smaller
D. Incomplete information, may become larger or smaller
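The effect of swapping the two hidden units can be checked numerically. The sketch below assumes the parameter values read as Θ(1) = [1 1 2.4; 1 1.7 3.2] and Θ(2) = [1 0.3 −1.2] (the source text is garbled here), a network with two inputs, two sigmoid hidden units, and one sigmoid output, and an arbitrary test input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta1, theta2, x):
    """Forward pass: inputs x (without bias) -> 2 hidden units -> 1 output."""
    a1 = [1.0] + list(x)  # add the bias unit
    a2 = [1.0] + [sigmoid(sum(w * a for w, a in zip(row, a1))) for row in theta1]
    return sigmoid(sum(w * a for w, a in zip(theta2, a2)))

theta1 = [[1, 1, 2.4], [1, 1.7, 3.2]]
theta2 = [1, 0.3, -1.2]

# Swap the two hidden units: swap the rows of theta1 and the matching
# (non-bias) entries of theta2.
theta1_swapped = [theta1[1], theta1[0]]
theta2_swapped = [theta2[0], theta2[2], theta2[1]]

x = (0.5, -1.0)  # arbitrary test input
before = forward(theta1, theta2, x)
after = forward(theta1_swapped, theta2_swapped, x)
```

The two outputs agree up to floating-point rounding, because each hidden unit still feeds the output through the same weight; only the order of the terms in the sum changes.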
Question 41
You are training a three-layer neural network and want to use backpropagation to compute the gradient of the cost function.
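In the backpropagation algorithm, one of the steps is to update
$\Delta_{ij}^{(2)}:=\Delta_{ij}^{(2)}+\delta_i^{(3)}*(a^{(2)})_j$
for each i, j. Which of the following is a correct vectorization of this step?
A. $\Delta^{(2)}:=\Delta^{(2)}+(a^{(2)})^T*\delta^{(3)}$
B. $\Delta^{(2)}:=\Delta^{(2)}+(a^{(3)})^T*\delta^{(2)}$
C. $\Delta^{(2)}:=\Delta^{(2)}+\delta^{(3)}*(a^{(2)})^T$
D. $\Delta^{(2)}:=\Delta^{(2)}+\delta^{(3)}*(a^{(3)})^T$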
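The element-wise update is an outer product of the layer-3 error vector and the layer-2 activation vector. The NumPy sketch below compares the loop form with the vectorized form; the layer sizes (4 and 6) and the random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
delta3 = rng.normal(size=(4, 1))  # layer-3 errors as a column vector
a2 = rng.normal(size=(6, 1))      # layer-2 activations as a column vector

# Element-wise accumulation, as in the per-(i, j) update
Delta_loop = np.zeros((4, 6))
for i in range(4):
    for j in range(6):
        Delta_loop[i, j] += delta3[i, 0] * a2[j, 0]

# Vectorized form: the outer product delta3 * a2^T
Delta_vec = np.zeros((4, 6)) + delta3 @ a2.T
```

Multiplying in the other order, `a2.T @ delta3`, fails here because the shapes (1, 6) and (4, 1) do not align, which is one quick way to rule out the transposed variants.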
Question 42
Suppose Theta1 is a 5x3 matrix and Theta2 is a 4x6 matrix. Let thetaVec=[Theta1(:);Theta2(:)]. Which of the following correctly recovers Theta2?
A. reshape(thetaVec(16:39),4,6)
B. reshape(thetaVec(15:38),4,6)
C. reshape(thetaVec(16:24),4,6)
D. reshape(thetaVec(15:39),4,6)
E. reshape(thetaVec(16:39),6,4)
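The same unrolling and reshaping can be reproduced in NumPy as a cross-check. Two assumptions matter: Octave unrolls matrices column by column, which corresponds to `order="F"` in NumPy, and Octave's 1-based inclusive range 16:39 corresponds to the Python slice 15:39. The matrix contents are placeholders.

```python
import numpy as np

Theta1 = np.arange(15).reshape(5, 3)        # placeholder values, 15 elements
Theta2 = np.arange(100, 124).reshape(4, 6)  # placeholder values, 24 elements

# Octave's A(:) unrolls column by column, which is order="F" in NumPy
thetaVec = np.concatenate([Theta1.flatten(order="F"), Theta2.flatten(order="F")])

# Octave indices 16:39 (1-based, inclusive) are the slice 15:39 in Python
Theta2_restored = thetaVec[15:39].reshape((4, 6), order="F")
```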
Question 43
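Let $J(\theta)=2\theta^3+2$, and let $\theta=1$, $\epsilon=0.01$. Use the formula $\frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon}$ to numerically approximate the derivative at $\theta=1$. What value will you get? (When $\theta=1$, the exact derivative is $\frac{dJ(\theta)}{d\theta}=6$)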
A. 8
B. 6
C. 5.9998
D. 6.0002
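The two-sided difference can be evaluated directly. For this cubic the approximation works out to 6 + 2ε², so with ε = 0.01 the result is about 6.0002. The function names below are illustrative.

```python
def J(theta):
    return 2 * theta ** 3 + 2

def numerical_gradient(f, theta, eps):
    """Two-sided (central) difference approximation of f'(theta)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

# About 6.0002; the exact derivative at theta = 1 is 6
approx = numerical_gradient(J, 1.0, 0.01)
```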
Question 44
Which of the following statements is true? select all correct
A. Using a large value of λ does not affect the performance of the neural network; the only reason we do not set λ too large is to avoid numerical problems
B. Gradient checking is useful if we use gradient descent as an optimization algorithm. However, it is not very useful if we use an advanced optimization method (such as in fminunc)
C. Using gradient checking can help verify that the implementation of backpropagation is bug-free
D. If our neural network is overfitting the training set, a reasonable step is to increase the regularization parameter λ
Question 45
Which of the following statements is true? select all correct
A. Assume that the parameter Θ(1) is a square matrix (that is, the number of rows is equal to the number of columns). If we replace Θ(1) with its transpose (Θ(1))T, then we have not changed the function that the network is computing.
B. Suppose we have a correct implementation of backpropagation and are training a neural network using gradient descent. Suppose we plot J(Θ) as a function of the number of iterations and find that it increases rather than decreases. One possible reason is that the learning rate α is too large.
C. Suppose we use gradient descent with a learning rate α. For logistic regression and linear regression, J(Θ) is a convex optimization problem, so we don't want to choose an excessively large learning rate α. However, for neural networks, J(Θ) may not be convex, so choosing a very large value for α can only speed up convergence.
D. If we are training a neural network using gradient descent, a reasonable debugging step is to plot J(Θ) as a function of the number of iterations and ensure that it is decreasing (or at least not increasing) after each iteration.
Question 46
You train a learning algorithm and find that it has a high error on the test set. Plot the learning curve and get the graph below. Does the algorithm have high bias, high variance, or neither?
A. High bias
B. High variance
C. Neither
Question 47
Suppose you have implemented regularized logistic regression to classify which object is in an image (i.e., to perform image recognition). However, when you test your model on a new set of images, you find that its predictions on the new images are very wrong, even though your hypothesis fits the training set well. Which of the following steps might help? Check all correct items
A. Try adding polynomial features
B. Get more training examples
C. Try using fewer features
D. Use fewer training examples
Question 48
Suppose you have implemented regularized logistic regression to predict which items a customer will buy on a shopping website. However, when you test your model on a new set of customers, you discover that it has a large error in its predictions. Also, the model does not perform well on the training set. Which of the following steps might help? Check all correct items
A. Try to get and use other features
B. Try to add polynomial features
C. Try to use fewer features
D. Try to increase the regularization parameter λ
Question 49
Which of the following statements is true? Check all correct items
A. Suppose you are training a regularized linear regression model. The recommended way to choose a value for the regularization parameter λ is to choose the value of λ that minimizes the cross-validation error.
B. Suppose you are training a regularized linear regression model. The recommended way to choose the value of λ for the regularization parameter is to choose the value of λ that gives the smallest test set error.
C. Assuming you are training a regularized linear regression model, the recommended way to choose the value of the regularization parameter λ is to choose the value of λ that gives the smallest training set error.
D. The performance of the learning algorithm on the training set is usually better than that on the test set.
Question 50
Which of the following statements is true? Check all correct items
A. When debugging a learning algorithm, it is helpful to plot learning curves to see if you have high bias or high variance issues.
B. If a learning algorithm suffers from high variance, adding more training examples may improve test error.
C. We always prefer models with high variance (rather than models with high bias) because they fit the training set better.
D. If a learning algorithm has high bias, simply adding more training examples may not significantly improve test error.
Question 51
You are working on a spam classification system using regularized logistic regression. "Spam" is the positive class (y=1), and "Not Spam" is the negative class (y=0). You have trained a classifier and there are m=1000 examples in the cross-validation set. The table of predicted classes vs actual classes is:
| | Actual Class: 1 | Actual Class: 0 |
| Predicted Class: 1 | 85 | |
| Predicted Class: 0 | 15 | |
For reference:
Accuracy = (True Positives + True Negatives) / (Total Examples)
Precision = (True Positives) / (True Positives + False Positives)
Recall = (True Positives) / (True Positives + False Negatives)
F1 Score = (2 × Precision × Recall) / (Precision + Recall)
What is the recall of the classifier?
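As a quick check on the reference formulas above, here is a minimal Python sketch that computes the metrics from confusion-matrix counts. The function names and the `tp`, `fp`, `fn`, `tn` parameters are illustrative and not part of the original question; only the counts 85 and 15 come from the table.

```python
def accuracy(tp, fp, fn, tn):
    """(TP + TN) / total number of examples."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# From the table: 85 spam emails predicted as spam (TP),
# 15 spam emails predicted as not spam (FN).
print(recall(tp=85, fn=15))  # 0.85
```

Recall depends only on the "Actual Class: 1" column, so it can be computed even when the other column is unavailable.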
Question 52
Suppose a huge dataset can be used to train a learning algorithm. Training on large amounts of data may yield good performance when the following two conditions hold. What are the two conditions?
A. Feature x contains enough information to accurately predict y. (For example, one way to test this is whether human experts can confidently predict y when given only x).
B. We train a learning algorithm with a small number of parameters (so less likely to overfit).
C. We train learning algorithms with a large number of parameters (capable of learning/representing fairly complex functions).
D. We train a model without regularization.
Question 53
Suppose you have trained a logistic regression classifier that outputs hθ(x).
Currently, 1 is predicted if hθ(x) ≥ threshold, and
0 is predicted if hθ(x) < threshold, where the threshold is currently set to 0.5.
Say you increase the threshold to 0.9. Which of the following is correct? Check all correct items
A. Now the classifier may be less accurate.
B. The precision and recall of the classifier may be unchanged, but the accuracy is lower.
C. The accuracy and recall of the classifier may not change, but the precision is higher.
D. The classifier may now have a lower recall.
Say you lower the threshold to 0.3. Which of the following is correct? Check all correct items
A. The classifier may now have a higher recall.
B. The accuracy and recall of the classifier may not change, but the precision is higher.
C. The classifier may now have higher accuracy.
D. The precision and recall of the classifier may be unchanged, but the accuracy is lower.
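To see how the threshold trades precision against recall, the sketch below counts outcomes at three thresholds. All scores and labels are made up for illustration, and the helper `precision_recall` is a hypothetical name, not something from the question.

```python
def precision_recall(scores, labels, threshold):
    """Predict 1 when score >= threshold, then return (precision, recall)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical classifier outputs h(x) and true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for t in (0.3, 0.5, 0.9):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold can only shrink the set of examples predicted positive, so recall never goes up; precision often rises, although that is not guaranteed on every dataset.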
Question 54
Suppose you are working with a spam classifier where spam is a positive example (y=1) and non-spam is a negative example (y=0). You have a training set of emails where 99% of emails are not spam and 1% are spam. Which of the following statements is true? Check all correct items
A. A good classifier should have both high precision and high recall on the cross-validation set.
B. If you always predict non-spam (output y=0), then your classifier will have 99% accuracy on the training set, and it will probably perform similarly on the cross-validation set.
C. If you always predict non-spam (output y=0), then your classifier will have an accuracy of 99%.
D. If you always predict non-spam (output y=0), then your classifier will be 99% accurate on the training set, but worse on the cross-validation set because it overfits the training data.
E. If spam is always predicted (output y=1), the recall rate of the classifier is 0% and the precision is 99%.
F. If it always predicts non-spam (output y=0), the recall rate of the classifier is 0%.
G. If you always predict spam (output y=1), then your classifier will have recall 100% and precision 1%.
H. If you always predict non-spam (output y=0), then your classifier will have an accuracy of 99%.
Question 55
Which of the following statements is true? Check all correct items
A. Before building the first version of a learning algorithm, it is a good idea to spend a lot of time collecting a lot of data.
B. On skewed datasets (e.g. when there are more positive examples than negative examples), accuracy is not a good performance measure, you should use F1 score based on precision and recall.
C. After training the logistic regression classifier, you must use 0.5 as the threshold for predicting whether the example is positive or negative.
D. Using a very large training set makes the model less likely to overfit the training data.
E. If your model doesn't fit the training set, getting more data might help.
Question 56
Suppose you use an SVM trained with a Gaussian kernel that learns the following decision boundary on the training set:
You suspect that the SVM is underfitting. Should you try increasing or decreasing C? Increasing or decreasing $\sigma^2$?
A. Decrease C, increase $\sigma^2$
B. Decrease C, decrease $\sigma^2$
C. Increase C, increase $\sigma^2$
D. Increase C, decrease $\sigma^2$
Question 57
The Gaussian kernel is given by $\text{similarity}(x, l^{(1)}) = \exp\left(-\frac{\|x - l^{(1)}\|^2}{2\sigma^2}\right)$.
The figure below shows the plot of $f_1 = \text{similarity}(x, l^{(1)})$ when $\sigma^2 = 1$.
When $\sigma^2 = 0.25$, which of the following is the graph of $f_1$?
A.
B.
C.
D.
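The effect of σ² on the Gaussian kernel exp(−‖x − l‖² / (2σ²)) can be checked numerically. This is a minimal sketch; the landmark and the test point are hypothetical values chosen only for illustration.

```python
import math

def gaussian_similarity(x, landmark, sigma_sq):
    """exp(-||x - l||^2 / (2 * sigma^2)) for two equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

landmark = (0.0, 0.0)  # hypothetical landmark l(1)
point = (1.0, 0.0)     # a point at distance 1 from the landmark

wide = gaussian_similarity(point, landmark, sigma_sq=1.0)
narrow = gaussian_similarity(point, landmark, sigma_sq=0.25)
print(wide, narrow)
```

At the same distance from the landmark, the smaller σ² gives a smaller similarity, so the bump is narrower while its peak value at the landmark stays at 1.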
Question 58
The SVM solves $\min_\theta \; C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \sum_{j=1}^{n}\theta_j^2$, where the functions $\text{cost}_0(z)$ and $\text{cost}_1(z)$ are plotted as follows:
The first term in the objective is $C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right]$. This term is zero if two of the following four conditions hold. Which two conditions make this term equal to zero?
A. For every example with $y^{(i)} = 1$, we have $\theta^T x^{(i)} \geq 1$
B. For every example with $y^{(i)} = 0$, we have $\theta^T x^{(i)} \leq -1$
C. For every example with $y^{(i)} = 1$, we have $\theta^T x^{(i)} \geq 0$
D. For every example with $y^{(i)} = 0$, we have $\theta^T x^{(i)} \leq 0$
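The first term can be explored numerically. The plots of cost0 and cost1 are not available here, so this sketch assumes hinge-style shapes, cost1(z) = max(0, 1 − z) and cost0(z) = max(0, 1 + z), which is an assumption about the missing figure. The values of θᵀx below are made up.

```python
def cost1(z):
    """Assumed cost for y = 1: zero once z >= 1, linear below that."""
    return max(0.0, 1.0 - z)

def cost0(z):
    """Assumed cost for y = 0: zero once z <= -1, linear above that."""
    return max(0.0, 1.0 + z)

def first_term(C, scores, labels):
    """C * sum_i [ y_i * cost1(z_i) + (1 - y_i) * cost0(z_i) ], with z_i = theta^T x_i."""
    return C * sum(y * cost1(z) + (1 - y) * cost0(z) for z, y in zip(scores, labels))

labels = [1, 1, 0, 0]
# Hypothetical theta^T x values that satisfy z >= 1 (y=1) and z <= -1 (y=0)
print(first_term(1.0, [1.5, 1.0, -1.0, -2.0], labels))  # 0.0
# Correct sign only (z >= 0 for y=1, z <= 0 for y=0) is not enough
print(first_term(1.0, [0.5, 1.0, -1.0, -0.5], labels))  # 1.0
```

Under these assumed shapes, a score with the right sign but magnitude below 1 still contributes a positive cost.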
Question 59
Suppose you have a dataset with n=10 features and m=5000 examples. After training a logistic regression classifier with gradient descent, you find that it underfits the training set and does not achieve the desired performance on the training set or the cross-validation set. Which of the following steps might be expected to help? Check all correct items
A. Try using a neural network with a large number of hidden units.
B. Reduce the number of examples in the training set.
C. Use a different optimization method, since training logistic regression with gradient descent may lead to local minima.
D. Create/add new polynomial features.
Question 60
Which of the following statements is true? Check all correct items
A. Suppose you are using support vector machines for multiclass classification and wish to use a one-vs-all approach. If you have K different classes, you will train K−1 different SVMs.
B. If the data is linearly separable, then an SVM with a linear kernel will return the same parameter θ regardless of the value of C (i.e., the resulting value of θ does not depend on C).
C. The maximum value of the Gaussian kernel (i.e., $\text{sim}(x, l^{(1)})$) is 1.
D. It is important to perform feature normalization before using the Gaussian kernel.
Question 61
For which of the following tasks, K-means clustering might be an appropriate algorithm? Check all correct items
A. Given a database of user information, automatically group users into different market segments.
B. Based on the sales data of a large number of products in the supermarket, find out which products can be combined (eg often bought together) and therefore should be placed on the same shelf.
C. Predict tomorrow's rainfall based on historical weather records
D. Given sales data for a large number of products in a supermarket, estimate future sales of those products.
E. Given a set of news articles from many different news sites, find the main topics covered.
F. Based on many emails, determine whether they are spam or not spam.
G. From user usage patterns on the site, find out which different user groups exist.
H. Based on historical weather records, predict whether tomorrow's weather will be sunny or rainy.
Question 62
Suppose we have three cluster centers $\mu_1 = [1, 2]^T$, $\mu_2 = [-3, 0]^T$, $\mu_3 = [4, 2]^T$. Also, we have a training example $x^{(i)} = [-2, 1]^T$. What will $c^{(i)}$ be after a cluster assignment step?
A. $c^{(i)} = 2$
B. $c^{(i)}$ is not assigned
C. $c^{(i)} = 1$
D. $c^{(i)} = 3$
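The cluster assignment step picks the centroid with the smallest squared Euclidean distance to the example. This is a minimal sketch; it assumes the flattened vectors in the question are two-component column vectors, so [12] is read as (1, 2), [−30] as (−3, 0), [42] as (4, 2), and [−21] as (−2, 1). The function name is illustrative.

```python
def assign_cluster(x, centroids):
    """Return the 1-based index of the centroid closest to x (squared Euclidean distance)."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, mu)) for mu in centroids]
    return sq_dists.index(min(sq_dists)) + 1

centroids = [(1, 2), (-3, 0), (4, 2)]  # mu_1, mu_2, mu_3 under the reading above
x = (-2, 1)

# Squared distances are 10, 2 and 37, so the second centroid is closest.
print(assign_cluster(x, centroids))  # 2
```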
Question 63
K-means is an iterative algorithm that repeats the following two steps in its inner loop. Which two?
A. Move-centroid step, where the cluster centroids $\mu_k$ are updated.
B. Cluster assignment step, where the parameters $c^{(i)}$ are updated.
C. Move each cluster centroid $\mu_k$, setting it equal to the nearest training example $c^{(i)}$.
D. Cluster centroid assignment step, where each cluster centroid $\mu_i$ is assigned (by setting $c^{(i)}$) to the nearest training example $x^{(i)}$.
Question 64
Suppose you have an unlabeled dataset $\{x^{(1)},\dots,x^{(m)}\}$. You run K-means with 50 different random initializations and obtain 50 different clusterings. What is the recommended way to choose which of these 50 clusterings to use?
A. The only way is to use data labels $y^{(i)}$.
B. For each clustering, compute $\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}-\mu_{c^{(i)}}\|^2$, and choose the one with the smallest value.
C. The answer is ambiguous and there is no good way to choose.
D. Always choose the last (50th) clustering found, as it is more likely to have converged to a good solution.
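The distortion (mean squared distance from each example to its assigned centroid) can be computed for each run and compared. This is a minimal sketch; the function names, the `(assignments, centroids)` tuple layout and the toy data are all hypothetical.

```python
def distortion(points, assignments, centroids):
    """Mean squared distance from each point to its assigned centroid (0-based indices)."""
    total = 0.0
    for x, c in zip(points, assignments):
        total += sum((a - b) ** 2 for a, b in zip(x, centroids[c]))
    return total / len(points)

def pick_best(runs, points):
    """runs: list of (assignments, centroids) tuples, one per K-means run."""
    return min(runs, key=lambda run: distortion(points, run[0], run[1]))

# Toy data: two well-separated groups
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
run_a = ([0, 0, 1, 1], [(0, 1), (10, 1)])  # groups left vs right
run_b = ([0, 1, 0, 1], [(5, 0), (5, 2)])   # groups bottom vs top

print(distortion(points, *run_a), distortion(points, *run_b))  # 1.0 25.0
print(pick_best([run_a, run_b], points) is run_a)              # True
```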
Question 65
Which of the following statements is true? Check all correct items
A. If we are worried about K-means getting stuck in local optima, one way to improve (reduce) this problem is to try to use multiple random initializations.
B. The standard way to initialize K-means is to set $\mu_1 = \dots = \mu_K$ equal to a vector of zeros.
C. Since K-means is an unsupervised learning algorithm, it cannot overfit the data, so it is always better to use as many clusters as is computationally feasible.
D. For some datasets, the "correct" value of K (the number of clusters) may be ambiguous and difficult to decide, even for a human expert looking carefully at the data.
E. K-means gives the same result regardless of the initialization of the cluster centers.
F. A good way to initialize K-means is to select K (distinct) examples from the training set and set cluster centroids equal to these selected examples.
G. In each iteration of K-means, the cost function $J(c^{(1)},\dots,c^{(m)}, \mu_1,\dots,\mu_K)$ (the distortion function) either remains constant or decreases; in particular, it should never increase.
H. Once an example is assigned to a particular cluster center, it will never be reassigned to a different cluster center.
Question 66
Consider the following two-dimensional dataset:
Which of the following pictures corresponds to a possible value of $u^{(1)}$ (the first eigenvector/first principal component) returned by PCA? Check all correct items
A.
B.
C.
D.
Question 67
Which of the following is a reasonable way to choose the number of principal components k? (n is the dimension of the input data and m is the number of input examples)
A. Choose the smallest value of k that retains at least 99% of the variance
B. Choose k so that the approximation error $\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}-x_{\text{approx}}^{(i)}\|^2$.
C. Choose the smallest value of k that retains at least 1% of the variance
D. Choose k to be 99% of n (that is, k = 0.99 × n, rounded to the nearest integer).
Question 68
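Suppose someone tells you that they ran PCA in such a way that "95% of the variance was retained." What is an equivalent statement?
A. $\frac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}-x_{\text{approx}}^{(i)}\|^2} \geq 0.05$
B. $\frac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}-x_{\text{approx}}^{(i)}\|^2} \leq 0.05$
C. $\frac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}-x_{\text{approx}}^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2} \leq 0.05$
D. $\frac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}-x_{\text{approx}}^{(i)}\|^2} \leq 0.95$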
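The link between "variance retained" and the ratios above can be written out explicitly. The short derivation below assumes mean-normalized data and the usual definition of variance retained as one minus the ratio of average squared projection error to total variation. That definition is standard for PCA but is not spelled out in the question itself.

```latex
\text{variance retained}
  \;=\; 1 - \frac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - x_{\text{approx}}^{(i)}\|^2}
                 {\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2}
  \;\ge\; 0.95
\quad\Longleftrightarrow\quad
\frac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - x_{\text{approx}}^{(i)}\|^2}
     {\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2}
  \;\le\; 0.05
```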
Question 69
Which of the following statements is true? Select all correct items
A. Given only $z^{(i)}$ and $U_{\text{reduce}}$, there is no way to reconstruct any reasonable approximation of $x^{(i)}$.
B. Even if all input features are on very similar scales, we should still perform mean normalization (so that each feature has a mean of zero) before running PCA.
C. PCA is susceptible to local optima; trying multiple random initializations may help.
D. Given input data $x \in \mathbb{R}^n$, it makes sense to run PCA only with values of k satisfying k ≤ n (in particular, running PCA with k = n is possible but not helpful, and k > n does not make sense).
Question 70
Which of the following is a recommended application of PCA? Select all correct items
A. As an Alternative to Linear Regression: For most model applications, PCA and linear regression give essentially similar results.
B. Data Compression: Reduce the dimensionality of the data, thereby reducing the memory/disk space occupied.
C. Data Visualization: Take 2D data and find different ways to plot it in 2D (using k=2).
D. Data Compression: Reduce the dimensionality of the input data x(i), which will be used in the supervised learning algorithm (i.e., use PCA to make the supervised learning algorithm run faster).
Week 9 | 1 Anomaly Detection
Question 71
For which of the following problems is anomaly detection an appropriate algorithm?
A. Given an image of a face, determine whether it is the face of a particular celebrity.
B. Given a dataset of credit card transactions, identify unusual transactions and flag them as potentially fraudulent.
C. Given data on credit card transactions, classify each transaction by type of purchase (eg: food, transportation, clothing).
D. Identify individuals who may have abnormal health conditions from a large number of primary care patient records.
Question 72
Suppose you have trained an anomaly detection system to flag anomalies when p(x) < ϵ, and you find that it has too many false positives (it flags too many things as anomalies) on the cross-validation set. What should you do?
A. Increase ϵ
B. Decrease ϵ
Question 73
Suppose you are developing an anomaly detection system to catch manufacturing defects in aircraft engines. Your model uses $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$.
You have two features, $x_1$ = vibration intensity and $x_2$ = heat generated. The values of $x_1$ and $x_2$ both lie between 0 and 1 (and are strictly greater than 0).
For most "normal" engines, you would expect $x_1 \approx x_2$. One suspected anomaly is an engine that vibrates violently without generating much heat (large $x_1$, small $x_2$), even though the individual values of $x_1$ and $x_2$ may not fall outside their typical ranges.
Which additional feature $x_3$ should you construct to capture these kinds of anomalies?
A. $x_3 = x_1^2 \times x_2$
B. $x_3 = \frac{x_1}{x_2}$
C. $x_3 = x_1 + x_2$
D. $x_3 = x_1 \times x_2$
Question 74
Which of the following is correct? Select all correct items
A. If there is no labeled data (or if all data has label y=0), p(x) can still be learned, but it may be harder to evaluate the system or choose a good value of ϵ.
B. If you have a training set with many positive examples and many negative examples, then anomaly detection algorithms may perform as well as supervised learning algorithms such as support vector machines.
C. If you are developing an anomaly detection system, you cannot use labeled data to improve your system.
D. When selecting features for an anomaly detection system, it is best to look for features with unusually large or small values for anomalous examples.
Question 75
You have a 1D dataset $\{x^{(1)},\dots,x^{(m)}\}$ and you want to detect outliers in the dataset. You first plot the dataset, and it looks like this:
Suppose you fit a Gaussian distribution with parameters $\mu_1$ and $\sigma_1^2$ to this dataset. Which of the following values for $\mu_1, \sigma_1^2$ might you get?
A. $\mu_1 = -3, \sigma_1^2 = 4$
B. $\mu_1 = -6, \sigma_1^2 = 4$
C. $\mu_1 = -3, \sigma_1^2 = 2$
D. $\mu_1 = -6, \sigma_1^2 = 4$
1. A prison face recognition access system is used to identify the identity of the person to enter. This system includes the identification of 4 different types of personnel: prison guards, thieves, food delivery staff, and others. Which of the following learning methods is most suitable for this application need:
A. Regression problem
B. Binary classification problem
C. Multiple Classification Problems
D. K-means clustering problem
2. Which of the following techniques would be better for reducing the dimensionality of a dataset?
A. Drop columns with too many missing values
B. Delete columns with large data differences
C. Delete columns with different data trends
D. None of the above
3. Which of the following steps covers the tasks of integration, transformation, dimensionality reduction, and numerosity reduction of the original data?
A. Frequent Pattern Mining
B. Classification and Prediction
C. Data preprocessing
D. Data Stream Mining
4. Which of the following is not an SVM kernel function? ( )
A. Polynomial kernel function
B. Logical kernel function
C. Radial Basis Kernel Function
D. Linear kernel function
5. Data scientists may use multiple algorithms (models) to make predictions at the same time and then combine the results of these algorithms into a final prediction (ensemble learning). Which of the following statements about ensemble learning is correct?
A. High correlation between individual models
B. There is low correlation between individual models
C. It is better to use "average weight" instead of "voting" in ensemble learning
D. All individual models use the same algorithm
6. In which of the following scenarios is the analysis method used incorrect? ( )
A. Based on merchants' business and service data from the last year, use a clustering algorithm to determine the level of Tmall merchants within their respective main categories
B. Based on merchants' transaction data from recent years, use a clustering algorithm to fit a formula for users' likely spending in the next month
C. Use an association rule algorithm to analyze whether car mats should be recommended to buyers who bought car seats
D. Based on the products a user has recently purchased, use a decision tree algorithm to identify whether a Taobao buyer is likely male or female
7. Bootstrap sampling of data means:
A. Sampling m features from a total of M features with replacement
B. Sampling m features from a total of M features without replacement
C. Sampling n samples from a total of N samples with replacement
D. Sampling n samples from a total of N samples without replacement
8. In logistic regression, if L1 and L2 regularization are both added, which of the following effects will not be produced?
A. It can perform feature selection and prevent overfitting to some extent
B. It can address the curse of dimensionality
C. Can speed up calculation
D. Can get more accurate results
9. For problems that are not linearly separable in the original space, support vector machines ( ).
A. Find a nonlinear function in the original space to partition the data
B. Cannot handle them
C. Find a linear function to divide the data in the original space
D. Map data into kernel space
10. What is the difference between a regression problem and a classification problem?
A. Regression problems have labels, classification problems do not
B. The output value of the regression problem is discrete, and the output value of the classification problem is continuous
C. The output value of the regression problem is continuous, and the output value of the classification problem is discrete
D. Regression problems and classification problems require different input attribute values
11. Which of the following statements about dimensionality reduction is incorrect?
A. Dimensionality reduction is to convert training samples from high-dimensional space to low-dimensional space
B. Dimensionality reduction will not damage the data
C. Through dimensionality reduction, meaningful data structures can be discovered more effectively
D. Dimensionality reduction will help in data visualization
12. What is the L1 norm of the vector x=[1,2,3,4,-9,0]?
A.1
B.19
C.6
D.
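The L1 norm is the sum of the absolute values of the components. A minimal sketch (the function name is illustrative):

```python
def l1_norm(v):
    """Sum of absolute values of the components of v."""
    return sum(abs(c) for c in v)

print(l1_norm([1, 2, 3, 4, -9, 0]))  # 19
```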
13. Assuming that X and Y both follow normal distributions, P(X<5, Y<0) is a ( ): the probability that the two conditions X<5 and Y<0 hold at the same time, that is, the probability that the two events occur together.
A. Prior probability
B. Posterior probability
C. Joint probability
D. None of the above statements are correct
14. Suppose that 15% of undergraduate students can drive and 23% of graduate students can drive. If graduate students make up 20% of the students at a university, what is the probability that a student who can drive is a graduate student?
A.80%
B.16.6%
C.23%
D.27.71%
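This is a direct application of Bayes' rule. The sketch below assumes the student body consists only of undergraduates (80%) and graduate students (20%), which the question implies but does not state. The function and parameter names are illustrative.

```python
def posterior(prior_a, likelihood_a, likelihood_not_a):
    """P(A | B) from P(A), P(B | A), and P(B | not A) via Bayes' rule."""
    evidence = prior_a * likelihood_a + (1 - prior_a) * likelihood_not_a
    return prior_a * likelihood_a / evidence

# A = "is a graduate student", B = "can drive"
p = posterior(prior_a=0.20, likelihood_a=0.23, likelihood_not_a=0.15)
print(f"{p:.2%}")  # 27.71%
```

The numerator is 0.20 × 0.23 = 0.046 and the evidence is 0.046 + 0.80 × 0.15 = 0.166.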
15. Assume that there are 100 photos, of which, there are 60 photos of cats and 40 photos of dogs.
Recognition results: TP=40, FN=20, FP=10, TN=30, then you can get: ( ).
A.Accuracy=0.8
B.Precision=0.8
C.Recall=0.8
D. None of the above is correct
16. The following statement about the training set, validation set, and test set is incorrect ( ).
A. The test set is purely used to test the generalization ability of the model
B. The training set is used to train and evaluate model performance
C. The validation set is used to tune the model parameters
D. None of the above statements are correct
17. Which of the following methods can be used to alleviate the occurrence of overfitting: ( ).
A. Add more features
B. Regularization
C. Increase the complexity of the model
D. All of the above
18. Suppose there are 6 two-dimensional data points: D={(2,3),(5,7),(9,6),(4,5),(6,4),(7,2) }, when splitting for the first time, the splitting line is ( ).
A.x=5
B.x=6
C.y=5
D.y=6
19. The lengths of the two vectors are 1 and 2 respectively, and the angle between them is 60 degrees, then the following option is wrong ( ).
A. The cosine similarity is 0.5
B. Cosine similarity is positive
C. The cosine similarity cannot be calculated because no specific coordinate values are given
D. The value of cosine similarity has nothing to do with the length of the vector, but only with the angle between the vectors
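Cosine similarity can be checked with any pair of vectors that has the stated lengths and angle. The coordinates below are one hypothetical choice and are not given in the question.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms (2D vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))

angle = math.radians(60)
a = (1.0, 0.0)                                      # length 1
b = (2.0 * math.cos(angle), 2.0 * math.sin(angle))  # length 2, 60 degrees from a

print(round(cosine_similarity(a, b), 6))  # 0.5
```

Rescaling either vector leaves the result unchanged, since the norms cancel the scale factors.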
20. Compared with XGBoost, the main advantages of LightGBM do not include ( )
A. Faster training speed
B. Lower memory consumption
C. Better accuracy
D. Use second-order Taylor expansion to speed up convergence
21. Which statement about the advantages and disadvantages of the BP algorithm is wrong? ( )
A. The BP algorithm cannot be used for nonlinear classification problems
B. The BP algorithm takes a long time to train
C. The BP algorithm easily falls into local minima
D. During training of the BP algorithm, the activation function may saturate because the weights are adjusted too much
22. Neural network algorithms sometimes overfit. Which of the following methods is the most feasible way to address overfitting? ( )
A. Select multiple sets of initial values for the parameters, train them separately, and then select one set as the optimal value
B. Increase the learning rate
C. Reduce the amount of data in the training dataset
D. Add a regularization term to reduce the complexity of the model
23.The minimum time complexity of the SVM algorithm is O(n^2). Based on this, which of the following data sets is not suitable for this algorithm? ( )
A. Large data sets
B. Small dataset
C. Medium dataset
D. Not affected by the size of the dataset
24.A positive example (2,3), a negative example (0,-1), which of the following is the SVM hyperplane? ( )
A.2x+y-4=0
B.2y+x-5=0
C.x+2y-3=0
D. cannot be calculated
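With exactly one example per class, the maximum-margin separating line is the perpendicular bisector of the segment joining the two points. This is a minimal sketch of that special case only, not a general SVM solver; the function name is illustrative.

```python
def max_margin_line(pos, neg):
    """Return (w1, w2, b) such that w1*x + w2*y + b = 0 is the perpendicular
    bisector of the segment between one positive and one negative 2D point."""
    w = (pos[0] - neg[0], pos[1] - neg[1])                   # normal vector
    mid = ((pos[0] + neg[0]) / 2, (pos[1] + neg[1]) / 2)     # midpoint lies on the line
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w[0], w[1], b

# Positive example (2, 3), negative example (0, -1)
print(max_margin_line((2, 3), (0, -1)))  # (2, 4, -6.0)
```

Dividing 2x + 4y − 6 = 0 by 2 gives x + 2y − 3 = 0.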
25. Which of the following statements about the K-means clustering algorithm is wrong? ( )
A. It is efficient and scalable for large data sets
B. It is an unsupervised learning method
C. The value of K cannot be determined automatically, and the initial cluster centers are randomly selected
D. The selection of the initial cluster centers has little effect on the clustering results
26. Simply divide the data object set into non-overlapping subsets so that each data object is in exactly one subset. This type of clustering is called ( ).
A. Hierarchical clustering
B. Partitioning clusters
C. Non-mutually exclusive clustering
D. Density clustering
27. The following statement about PCA is correct ( ).
A. PCA is a supervised learning algorithm
B. PCA selects the direction with the smallest variance in the original data for the first new coordinate axis after transformation
C. The first direction selected after PCA transformation is the most dominant feature
D. PCA does not need to normalize the data
28. The statement about Apriori and FP-growth algorithm is correct ( ).
A. Apriori is more cumbersome to run than FP-growth
B. The FP-growth algorithm needs to pair items, so its processing speed is slow
C. FP-growth only needs to traverse the data once, so its scanning efficiency is high
D. The FP-growth algorithm is not suitable for shared memory when the database is large
29. A supermarket researched sales record data and found that people who bought beer had a high probability of buying diapers. What kind of data mining problem does this belong to? ( )
A. Association rule discovery
B. Clustering
C. Classification
D. Natural Language Processing
30. Confidence is an indicator that measures the ( ) aspect of interestingness.
A. Simplicity
B. Certainty
C. Practicality
D. Novelty
2. Multiple choice (2 points for each question)
31. Which of the following are classification problems?
A. Judging benign or malignant based on the volume of the tumor and the age of the patient?
B. According to the user's age, occupation, and deposit amount, determine whether the credit card will default?
C. What size T-shirt does a man with a height of 1.85m and a weight of 100kg wear?
D. Estimate the house price based on the size of the house, the number of bathrooms and other characteristics
32. Which of the following are reasons for using data normalization (feature scaling)?
A. It speeds up gradient descent by reducing the computational cost of each iteration of gradient descent
B. It speeds up gradient descent by reducing the number of iterations to get a good solution
C. It does not prevent gradient descent from falling into a local optimum
D. It prevents matrix irreversibility (singular/degenerate)
33. The main factors affecting the performance of the KNN algorithm include ( ).
A. The value of K
B. Distance Metrics
C. Decision Rules
D. The distance of the nearest neighbor data
34. What are the commonly used kernel functions of support vector machines ( ).
A. Gaussian kernel
B. Laplace kernel
C. Linear Kernel
D. Polynomial Kernel
35. Which of the following statements about support vector machines is correct ( ).
A. SVM is suitable for large-scale data sets
B. The idea of SVM classification is to minimize the interval between classification surfaces
C. The SVM method is simple and robust
D. The SVM classification surface depends on the support vectors
36. The correct statement about the advantages of the BP algorithm is ( ).
A. BP algorithm can learn adaptively
B. The BP algorithm has a strong nonlinear mapping ability
C. Backpropagation in the BP algorithm uses the chain rule, and the derivation process is rigorous
D. The generalization ability of the BP algorithm is not strong
37. Which of the following descriptions about support vector machines is correct ( ).
A. It is a supervised learning method
B. It can be used for multi-class classification problems
C. It supports nonlinear kernel functions
D. It is a generative model
38. The following are commonly used techniques for dimensionality reduction: ( ).
A. Principal Component Analysis
B. Feature extraction
C. Singular value decomposition
D. Discretization
39. What properties should the hyperplane obtained by the PCA algorithm have? ( )
A. Nearest reconstructability
B. Information Gain Maximization
C. Maximum Separability
D. Local minima
40. Regarding association rules, the correct one is: ( ).
A. The main algorithms for mining association rules are: Apriori and FP-Growth
B. An itemset that satisfies the minimum support, we call it a frequent itemset
C. The story of beer and diapers is a typical example of cluster analysis
D. Support is an indicator to measure the importance of association rules
3. True or false (1 point for each question)
41. Support vectors are the data points closest to the decision plane.
A. Correct
B. Wrong
42. The correlation coefficient of correlated variables can be zero.
A. Correct
B. Wrong
43. PCA chooses the direction with the least amount of information for projection.
A. Correct
B. Wrong
44. In most machine learning projects, the three steps of data collection, data cleaning, and feature engineering take up most of the time, while data modeling takes up a smaller share of the total time.
A. Correct
B. Wrong
45. Stochastic gradient descent uses one sample in each iteration.
A. Correct
B. Wrong
46. The basic assumption of Naive Bayes is conditional independence.
A. Correct
B. Wrong
47. The SMOTE algorithm uses an upsampling method.
A. Correct
B. Wrong
48. The solution obtained by L2 regularization is more sparse.
A. Correct
B. Wrong
49. The ID3 algorithm can only be used to handle discrete features.
A. Correct
B. Wrong
50. The data for ensemble learning does not need to be normalized or standardized.
A. Correct
B. Wrong
51. The BP algorithm "likes the new and dislikes the old": after it learns new samples, the old samples are gradually forgotten.
A. Correct
B. Wrong
52. The accuracy of logistic regression classification is not high enough, so this algorithm is rarely used in industry.
A. Correct
B. Wrong
53. The SMOTE algorithm uses upsampling.
A. Correct
B. Wrong
54. When 1 million data records are divided into a training set, a validation set, and a test set, the data can be split 98%, 1%, and 1%.
A. Correct
B. Wrong
55. K-means is a density-based clustering algorithm that produces partitioned clusters, and the number of clusters is determined automatically by the algorithm.
A. Correct
B. Wrong
56. The basic assumption of the Naive Bayes method is conditional independence.
A. Correct
B. Wrong
57. The larger the feature space, the greater the possibility of overfitting.
A. Correct
B. Wrong
58. The closer the cosine similarity of two vectors is to 1, the more similar they are.
A. Correct
B. Wrong
59. K-means is a density-based clustering algorithm that generates partitioned clusters, and the number of clusters is automatically determined by the algorithm.
A. Correct
B. Wrong
60. The core idea of the ID3 algorithm is to use information gain as the criterion for feature selection and to select the feature with the largest information gain for splitting.
A. Correct
B. Wrong