Machine Learning Final Question Bank

1. The machine learning algorithm that belongs to supervised learning is: Bayesian classifier

2. The machine learning algorithm that belongs to unsupervised learning is: hierarchical clustering

3. The conjugate distribution of the binomial distribution is: Beta distribution

4. The conjugate distribution of the multinomial distribution is: Dirichlet distribution
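For intuition behind questions 3 and 4, the Beta-binomial conjugacy can be written out explicitly (a standard identity, added here for reference):

$$p \sim \mathrm{Beta}(\alpha, \beta), \quad k \mid p \sim \mathrm{Binomial}(n, p) \;\Longrightarrow\; p \mid k \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k)$$

The Dirichlet plays the same role for the multinomial distribution, adding one pseudo-count per category.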

5. The defining characteristic of the naive Bayes classifier is: the attributes of each dimension of a sample are assumed to be conditionally independent given the class

6. Which of the following methods does not consider the prior distribution: maximum likelihood estimation

7. For a Bayesian classifier with normal class-conditional densities, when all class covariance matrices are equal, the decision function is: a linear decision function

8. The following are linear classification methods: Perceptron

9. The following methods are not affected by data normalization: decision tree

10. Which of the following classification methods does not use gradient descent: minimum distance classifier

11. The following method uses maximum likelihood estimation: Logistic regression

12. The most accurate description of linear discriminant analysis is to find a projection direction such that: the intra-class distance is the smallest and the inter-class distance is the largest

13. The principle of SVM can be briefly summarized as: maximum-margin classification

14. The performance of an SVM depends on: all of the above (choice of kernel function, kernel function parameters, soft-margin penalty parameter C)

15. The dual problem of support vector machine is: convex quadratic optimization

16. The correct description of support vectors in a support vector machine is: the vectors lying on the maximum-margin boundary hyperplanes

17. Suppose you use an SVM with a polynomial kernel of order 2. After applying the model to the actual data set, the training accuracy and test accuracy are both 100%. If you now increase the complexity of the model (increase the order of the kernel function), which of the following will happen: overfitting

18. To avoid direct complex nonlinear transformation, the method of using linear means to realize nonlinear learning is: Kernel function method

19. The correct description of the decision tree node division index is: the greater the information gain, the better

20. Which of the following describes the node-splitting strategy of decision trees: maximize information gain

21. How should base classifiers be chosen in ensemble learning to obtain better performance: the classifiers should be diverse, with large differences among them

22. In ensemble learning, the minimum requirement for the correct rate of each base classifier: more than 50%

23. The following is a characteristic of the Bagging method: bootstrapping is used when constructing the training sets

25. The random forest method belongs to: Bagging method

26. Suppose there is a data set S with many labeling errors, trained with a soft-margin SVM with penalty parameter C. If the value of C is very small, which of the following statements is correct: misclassification will occur

27. As the penalty parameter C of the soft-margin SVM tends to infinity, which of the following statements is correct: as long as an optimal separating hyperplane exists, it will correctly classify all the data

28. In general, under what circumstances does the k-nearest-neighbor method work well: few samples, but highly typical ones

29. The difference between the regression problem and the classification problem: the former predicts the function value as a continuous value, and the latter as a discrete value

30. The least-squares regression method is equivalent to: maximum-likelihood regression with a linear mean and normally distributed errors

31. Regularized regression analysis can avoid: overfitting

32. The "beer and diapers" story: when analyzing supermarket shopping lists, it was found that men who buy diapers often also buy beer. What kind of problem is this: association analysis

33. KL divergence constructs a separability criterion based on: class probability densities

34. Density-based clustering methods fully consider which relationship between samples: density-reachability

35. In mixed Gaussian clustering, which of the following processes is used: EM algorithm

36. What method is principal component analysis: dimensionality reduction method

37. When PCA performs dimensionality reduction, which directions are preferred: the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the centered samples
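To make question 37 concrete, here is a minimal numpy sketch of PCA via eigendecomposition of the covariance matrix (the function name and interface are illustrative, not from the question bank):

```python
import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Project X (m samples x n features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(X_centered, rowvar=False)             # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigh returns ascending eigenvalues
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # eigenvectors of the largest eigenvalues
    return X_centered @ top_k                          # reduced m x k representation
```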

38. In the phenomenon of overfitting: the error on the training samples is minimal, but the recognition rate on the test samples is very low

39. As shown in the directed graph on the right, the Markov blanket of node G is: {D, E, F, H, I, J}

40. As shown in the undirected graph on the right, the Markov blanket of node G is: {D, E, I, J}

41. In the multi-layer perceptron method, it can be used as a nonlinear activation function of neurons: Logistic function

42. On the finite support set, the entropy of the following distribution is the largest: uniform distribution

43. Knowing the mean and variance, which of the following distributions has the largest entropy: Gaussian distribution

44. Among the following models, the probabilistic graphical model is: Restricted Boltzmann Machine

45. As shown in the directed graph on the right, the following statement is correct: B and G are conditionally independent given {C, F}

46. In normalization formulas, a small constant is added for the purpose of preventing the denominator from being zero

47. What is the correct order of the steps of the gradient descent algorithm: 4, 3, 1, 5, 2 (initialize, feed the input, compute the error, change the weights to reduce the error, iterate)
(1) Calculate the difference between the predicted value and the real value
(2) Iterate until the optimal weights are found
(3) Pass the input through the network to get the output value
(4) Initialize random weights and biases
(5) For each neuron contributing to the error, change the corresponding weight values to reduce the error
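As a concrete illustration of these five steps, here is a minimal numpy sketch of one gradient-descent training loop for linear regression (the data, learning rate, and variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = rng.normal(size=3)                              # (4) initialize random weights
alpha = 0.1                                         # learning rate
for _ in range(500):                                # (2) iterate until converged
    y_hat = X @ w                                   # (3) pass the input through the model
    error = y_hat - y                               # (1) predicted minus real value
    w -= alpha * X.T @ error / len(y)               # (5) change weights to reduce the error
```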
48. Suppose you fit the sample data with a fairly complex regression model, then use ridge regression and tune the regularization parameter λ to reduce model complexity. If λ is large, the correct statement about bias and variance is: the bias increases and the variance decreases

49. Which of the following methods will increase the risk of underfitting the model: data augmentation

50. The following statement is correct: In addition to the EM algorithm, gradient descent can also find the parameters of the mixed Gaussian model

51. When training a neural network, if the training error is too high, which of the following methods cannot greatly reduce the training error: increase training data

52. Which of the following activation functions can cause the gradient to disappear: Tanh

53. Increasing which of the following hyperparameters may cause the random forest model to overfit the data: the depth of the decision trees

54. Which of the following statements about deep network training is correct: D
A. The training process requires the use of gradients, which measure the rate of change of the loss function with respect to the model parameters
B. The loss function measures the difference between the model's predictions and the true values
C. The training process is based on a technique called backpropagation
D. All of the above are correct

55. Which of the following introduces nonlinearity into a neural network: ReLU

56. Using a regularization term in linear regression, you find that many coefficients of the solution are 0; this regularization term may be: the L0-norm or the L1-norm

57. Regarding CNNs, the correct conclusion is: the pooling layer is used to reduce the spatial resolution of the feature maps

58. Regarding the k-means algorithm, the correct description is: different initial values may lead to different final results

59. Which of the following descriptions about the phenomenon of overfitting is correct: the training error is small, and the test error is large

60. The following statement about the convolutional neural network is correct: the convolutional neural network can have multiple convolution kernels, which can be of different sizes

61. The loss function of the LR model is: cross entropy

62. The statement of GRU and LSTM is correct: GRU has fewer parameters than LSTM

63. The following methods cannot be used for feature dimensionality reduction: Monte Carlo method

64. Which of the following functions cannot be used as an activation function: y=2x

65. There are two sample points: the first is a positive sample with feature vector (0, -1); the second is a negative sample with feature vector (2, 3). The linear SVM classifier constructed from the training set of these two sample points has the classification surface equation: x + 2y = 3
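For reference, a worked derivation (not part of the original question): with one point per class, the maximum-margin hyperplane is the perpendicular bisector of the segment joining the two points. The midpoint is $(1, 1)$ and the direction between the points is $(2,3) - (0,-1) = (2,4) \propto (1,2)$, so the boundary is

$$1 \cdot (x - 1) + 2 \cdot (y - 1) = 0 \iff x + 2y = 3.$$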

66. Under the premise that other conditions remain unchanged, which of the following practices is likely to cause overfitting problems in machine learning: Gaussian kernels are used instead of linear kernels in the SVM algorithm

67. The following method belongs to the unsupervised learning algorithm: K-Means clustering

68. What does Bootstrap data mean: sample n samples from a total of N samples with replacement
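A minimal numpy sketch of bootstrap sampling as described in question 68 (the function name and interface are illustrative):

```python
import numpy as np

def bootstrap_sample(data: np.ndarray, n: int) -> np.ndarray:
    """Draw n samples with replacement from the data set; indices may repeat."""
    idx = np.random.randint(0, len(data), size=n)
    return data[idx]
```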

69. The following description of the Bayesian classifier is wrong: the prior probability is derived based on the posterior probability

70. In the following descriptions of the AdaBoost algorithm, the erroneous one is: the weak classifiers are learned independently and simultaneously

71. In the following machine learning, when data preprocessing does not need to consider normalization processing: tree model

72. In a binary classification task there are three classifiers h1, h2, h3 and three test samples x1, x2, x3. Suppose 1 means the classification result is correct and 0 means it is wrong. The results of h1 on x1, x2, x3 are (1,1,0), and those of h2 and h3 are (0,1,1) and (1,0,1). If the three classifiers are combined by majority voting, the correct statement is: the ensemble improves performance

73. Regarding the Precision and Recall of the machine learning classification algorithm, which of the following definitions is correct (assuming tp = true positive, tn = true negative, fp = false positive, fn = false negative):

Precision= tp / (tp + fp), Recall = tp / (tp + fn)
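A one-line Python helper implementing the definitions in question 73 (the function name is illustrative):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) = (tp/(tp+fp), tp/(tp+fn))."""
    return tp / (tp + fp), tp / (tp + fn)
```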

74. Which of the following is not a commonly used feature selection algorithm for text classification: principal component analysis

75. In HMM, if the observation sequence and the state sequence that produces the observation sequence are known, which of the following methods can be used to directly estimate the parameters: maximum likelihood estimation

76. Which of the following distances will focus on the direction of the vector: cosine distance

77. The algorithm that solves the prediction (decoding) problem in the hidden Markov model is: the Viterbi algorithm

78. In Logistic Regression, if L1 and L2 norms are added at the same time, what effect will it have: feature selection can be done, and overfitting can be prevented to a certain extent

79. What is the technical difference between ordinary backpropagation and backpropagation through time (BPTT): unlike ordinary backpropagation, BPTT sums the gradients of each shared weight across all time steps

80. The gradient explosion problem means that when training a deep neural network the gradients become too large and the loss function diverges. In an RNN, which of the following methods deals best with the gradient explosion problem: gradient clipping
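A minimal numpy sketch of norm-based gradient clipping (the threshold and function name are illustrative assumptions):

```python
import numpy as np

def clip_gradient(grad: np.ndarray, max_norm: float = 5.0) -> np.ndarray:
    """Rescale the gradient so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```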

81. When training a neural network for image recognition tasks, it is common to plot the training-set error and validation-set error for debugging. In the figure below, when is the best time to stop training: C

[Figure: training-set error and validation-set error versus training time]

Question 1

A computer program is said to learn a task T from experience E, with performance measured by P, if its performance on T, as measured by P, improves with experience E.
Suppose we feed a learning algorithm a lot of historical weather data and let it learn to predict the weather. What would be a reasonable choice for P?

A. The process of calculating a large amount of historical weather data
B. None of the above
C. The probability of correctly predicting the weather for a future date
D. The task of weather forecasting

Question 2

Suppose you are doing weather forecasting and using an algorithm to predict the temperature (Celsius/Fahrenheit) tomorrow, would you treat this as a classification problem or a regression problem?

A. Classification
B. Regression

Question 3

Let's say you're doing stock market predictions. You want to predict whether a company will declare bankruptcy within the next 7 days (by training on data from similar companies that were previously at risk of bankruptcy). Would you treat this as a classification problem or a regression problem?

A. Classification
B. Regression

Question 4

Some of the problems below are best solved using supervised learning algorithms, while others should be solved using unsupervised learning algorithms. Which of the following would you use supervised learning? (Select all that apply.) In each case, assume that an appropriate data set is available for the algorithm to learn from.

A. Based on a person's genetic (DNA) data, predict his/her chance of developing diabetes in the next 10 years

B. Based on a large dataset of medical records of cardiac patients, try to understand whether there are different patient groups for whom we can tailor different treatment options

C. Have a computer examine a piece of audio and classify whether the audio has a human voice (i.e. a vocal singing) or only instruments (no vocals)

D. Given data on the responses of 1000 medical patients to an experimental drug (such as treatment effects, side effects, etc.), discover whether there are different categories or "types" of patient response to the drug, and if so, what are those categories

Question 5

Which is a reasonable definition of machine learning?

A. Machine learning learns from labeled data

B. Machine learning enables computers to learn without being explicitly programmed

C. Machine learning is the science of computer programming

D. Machine learning is the field that allows robots to act intelligently

Question 6

Based on a student's performance in his freshman year, predict how he will perform in his sophomore year.
Let x be equal to the number of "A's" (including A-, A, and A+ grades) the student gets in his first year of college. Predict the value of y: the number of "A" grades obtained in the second year
Here each row is a training data. In linear regression, our hypothesis hθ(x) = θ0 + θ1x, and we use m to denote the number of training examples.

| x | y |
|---|---|
| 3 | 2 |
| 1 | 2 |
| 0 | 1 |
| 4 | 3 |

For the training set given above (note that this training set can also be referenced in other questions on this quiz), what is the value of m?

Question 7

For this problem, assume we use the training set from the first problem. Our cost function is defined as $J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. Find $J(0,1)$.
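For reference, a worked evaluation on the table above (so m = 4, which also answers Question 6): with $\theta_0 = 0$ and $\theta_1 = 1$, $h_\theta(x) = x$, so

$$J(0,1) = \frac{1}{2 \cdot 4}\left[(3-2)^2 + (1-2)^2 + (0-1)^2 + (4-3)^2\right] = \frac{4}{8} = 0.5.$$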

Question 8

Using the linear regression hypothesis from the first problem with $\theta_0 = -1$ and $\theta_1 = 2$, what is $h_\theta(6)$?
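Worked answer for reference: $h_\theta(6) = \theta_0 + \theta_1 \cdot 6 = -1 + 2 \cdot 6 = 11$.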

Question 9

The relationship between the cost function J(θ0,θ1) and θ0,θ1 is shown in Figure 2. A contour plot of the same cost function is given in "Fig. 1". According to the illustration, choose the correct option (check all the correct items)

Image Name

A. Starting from point B, a gradient descent algorithm with an appropriate learning rate will eventually help us reach or approach point A, that is, the cost function J(θ0,θ1) has a minimum value at point A

B. Point P (global minimum in Figure 2) corresponds to point C in Figure 1

C. Starting from point B, a gradient descent algorithm with an appropriate learning rate will eventually help us reach or approach point C, that is, the cost function J(θ0,θ1) has a minimum value at point C

D. Starting from point B, a gradient descent algorithm with an appropriate learning rate will eventually help us reach or approach point A, that is, the cost function J(θ0,θ1) has a maximum value at point A

E. Point P (global minimum in Figure 2) corresponds to point A in Figure 1

Question 10

Suppose for some linear regression problem (such as predicting housing prices), we have some training sets. For our training set, we can find some θ0, θ1 such that J(θ0, θ1)=0.
Which of the following statements is true? (check all correct items)

A. In order to achieve this, we must have θ0=0,θ1=0, so that J(θ0,θ1)=0

B. For values of θ0 and θ1 satisfying J(θ0,θ1)=0, every training example (x(i), y(i)) satisfies hθ(x(i)) = y(i)

C. This is impossible: by the definition of J, there cannot exist θ0 and θ1 such that J(θ0,θ1)=0

D. We can perfectly predict the value of y even for new examples we haven't seen yet (e.g. we can perfectly predict the price of a new house we haven't seen yet)

Question 11

[Question content was an image that is no longer available]

Question 12

[Question content was an image that is no longer available]

Question 13

[Question content was an image that is no longer available]

Question 14

[Question content was an image that is no longer available]

Question 15

Let A and B be 3x3 matrices, which of the following must be correct (choose all correct items)

A. A+B=B+A
B. If v is a 3D vector, then A∗B∗v is a 3D vector
C. A∗B∗A=B∗A∗B
D. If C=A∗B, Then C is a 6x6 matrix

Question 16

Assume m=4 students take a class with midterm and final exams. You have collected a dataset of their scores on two exams as follows:

| midterm score | (midterm score)^2 | final score |
|---|---|---|
| 89 | 7921 | 96 |
| 72 | 5184 | 74 |
| 94 | 8836 | 87 |
| 69 | 4761 | 78 |

You want to use polynomial regression to predict a student's final exam score. Specifically, suppose you fit a model $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$, where $x_1$ is the midterm score and $x_2$ is (midterm score)^2. You also plan to use both feature scaling (dividing by the "max minus min", i.e., the range of the feature) and mean normalization.

What is the normalized feature value $x_2^{(4)}$? (Hint: midterm = 89, final = 96 is training example 1.)
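For reference, a worked computation: the mean of $x_2$ is $(7921+5184+8836+4761)/4 = 6675.5$ and its range is $8836 - 4761 = 4075$, so

$$x_2^{(4)} = \frac{4761 - 6675.5}{4075} \approx -0.47.$$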

Question 17

15 iterations of gradient descent were performed with α = 0.3, and J(θ) was calculated after each iteration. You will find that the value of J(θ) decreases slowly and is still decreasing after 15 iterations. Based on this, which of the following conclusions seems most plausible?

A. α=0.3 is a valid choice for the learning rate.

B. Instead of using the current value of α, it is better to try a smaller α value (such as α=0.1)

C. Instead of using the current value of α, it is better to try a larger α value (such as α=1.0)

Question 18

Assume you have m = 14 training examples with n = 3 features (not counting the intercept feature, which is constant at 1). The normal equation is $\theta = (X^T X)^{-1} X^T y$. For the given values of m and n, what are the dimensions of θ, X, and y in this equation?

A. X 14×3, y 14×1, θ 3×3
B. X 14×4, y 14×1, θ 4×1
C. X 14×3, y 14×1, θ 3×1
D. X 14×4, y 14×4, θ 4×4
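A minimal numpy sketch of the normal equation, matching the dimensions in option B (the random data is illustrative):

```python
import numpy as np

def normal_equation(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve theta = (X^T X)^(-1) X^T y via a linear solve (more stable than inverting)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

m, n = 14, 3
X = np.hstack([np.ones((m, 1)), np.random.randn(m, n)])  # 14 x 4 design matrix (with intercept)
y = np.random.randn(m)                                   # 14 x 1 target vector
theta = normal_equation(X, y)                            # shape (4,): a 4 x 1 parameter vector
```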

Question 19

Suppose you have a dataset with m=1000000 examples and n=200000 features per example. You want to fit the parameters θ to our data using multiple linear regression. Should you use gradient descent or normal equations?

A. Gradient descent, because computing $(X^T X)^{-1}$ in the normal equation is very slow

B. The normal equation because it provides an efficient way to solve directly

C. Gradient descent because it always converges to the optimal θ

D. Normal equation, because gradient descent may not find the optimal θ

Question 20

Which of the following are reasons to use feature scaling?

A. It prevents gradient descent from falling into a local optimum

B. It speeds up gradient descent by reducing the computational cost of each iteration of gradient descent

C. It speeds up gradient descent by reducing the number of iterations to get a good solution

D. It prevents the matrix XTX (for normal equations) from being irreversible (singular/degenerate)

Question 26

Suppose you have trained a logistic classifier that outputs a prediction hθ(x) = 0.4 on a new example x. This means (choose all correct items):

A. Our estimate of P(y=0∣x;θ) is 0.4

B. Our estimate of P(y=1∣x;θ) is 0.6

C. Our estimate of P(y=0∣x;θ) is 0.6

D. Our estimate of P(y=1∣x;θ) is 0.4

Question 27

Suppose you have the following training set, and fit a logistic regression classifier hθ(x)=g(θ0+θ1x1+θ2x2)

Image Name

Image Name

Which of the following is correct? Check all correct items

A. Adding polynomial features (for example, using $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)$) can improve how well we fit the training data

B. At the optimal value of θ (e.g., found by fminunc), $J(\theta) \ge 0$

C. Adding polynomial features (e.g., using $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)$) will increase $J(\theta)$ because we are now summing more terms

D. If we run enough iterations of gradient descent, it is possible to get $h_\theta(x^{(i)}) > 1$ for some example $x^{(i)}$ in the training set

Question 28

For logistic regression, the gradient is given by $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$. Which of the following is the correct gradient descent update for logistic regression with learning rate α? Check all correct items

A. $\theta := \theta - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(\theta^T x - y^{(i)}\right)x^{(i)}$

B. $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(\frac{1}{1+e^{-\theta^T x^{(i)}}} - y^{(i)}\right)x_j^{(i)}$ (simultaneously update for all j)

C. $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$ (simultaneously update for all j)

D. $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ (simultaneously update for all j)

Question 29

Which of the following statements is true? Check all correct items

A. For logistic regression, gradient descent sometimes converges to a local minimum (and fails to find a global minimum). That's why we prefer more advanced optimization algorithms like fminunc (conjugate gradient/BFGS/L-BFGS/etc)

B. The value of the sigmoid function $g(z) = \frac{1}{1+e^{-z}}$ will never be greater than 1

C. The cost function J(θ) of logistic regression trained with m≥1 examples is always greater than or equal to zero

D. Using linear regression + threshold method for classification prediction is always very effective

Question 30

Suppose you train a logistic regression classifier hθ(x)=g(θ0+θ1x1+θ2x2). Assuming θ0=6, θ1=−1, θ2=0, which of the following graphs represents the decision boundary found by the classifier?

A.
Image Name

B.

Image Name

C.

Image Name

D.

Image Name

Week 3 | 2 Regularization

Question 31

You are training a classification model with logistic regression. Which of the following statements is true? Check all correct items

A. Introducing regularization into the model always achieves the same or better performance on the training set

B. Adding many new features to the model helps prevent overfitting on the training set

C. Introducing regularization into the model will always achieve the same or better performance for examples not in the training set

D. Adding new features to the model will always result in equal or better performance on the training set

Question 32

Suppose you run two logistic regressions, one with λ=0 and one with λ=1. One run gives the parameters θ = [81.47, 12.69]; the other gives θ = [13.01, 0.91].
However, you forget which value of λ corresponds to which θ. Which do you think corresponds to λ=1?

A. θ = [13.01, 0.91]

B. θ = [81.47, 12.69]

Question 33

Which of the following statements about regularization is true? Check all correct items

A. Using too large a value for λ may cause your hypothesis to overfit the data; this can be avoided by reducing λ

B. Using a very large value for λ does not affect the performance of the hypothesis; the only reason we do not set λ too large is to avoid numerical problems

C. Consider a classification problem. Adding regularization may cause the classifier to misclassify some training examples (when no regularization is used, i.e. when λ=0, it correctly classifies these examples)

D. Since the output value of logistic regression is 0≤hθ(x)≤1, the range of its output value can only be "shrunk" by regularization anyway, so regularization usually does not help it

Question 34

Which of the following image hypotheses overfit to the training set?

A.

Image Name

B.

Image Name

C.

Image Name

D.

Image Name

Question 35

Which of the following image hypotheses underfit the training set?

A.

Image Name

B.

Image Name

C.

Image Name

D.

Image Name

Question 36

Which of the following statements is true? select all correct

A. The activation value of the hidden unit in the neural network, after applying the sigmoid function, is always in the range (0, 1)

B. Logical functions over binary values (0 or 1) can be (approximately) represented by some neural network

C. A two-layer (one input layer, one output layer, no hidden layer) neural network can represent an XOR function

D. Suppose there is a multi-class classification problem with three classes, trained with a three-layer network. Let $a_1^{(3)} = (h_\Theta(x))_1$ be the activation of the first output unit, and similarly $a_2^{(3)} = (h_\Theta(x))_2$ and $a_3^{(3)} = (h_\Theta(x))_3$. Then for any input x, we must have $a_1^{(3)} + a_2^{(3)} + a_3^{(3)} = 1$

Question 37

Consider the following neural network with two binary inputs $x_1, x_2 \in \{0, 1\}$ and output $h_\Theta(x)$. Which of the following logical functions does it (approximately) compute?

Image Name

A. OR
B. AND
C. NAND (and not)
D. XOR (exclusive or)

Question 38

Consider the neural network given below. Which of the following equations correctly computes the activation $a_1^{(3)}$? Note: g(z) is the sigmoid activation function.

Image Name

A. $a_1^{(3)} = g(\Theta_{1,0}^{(2)} a_0^{(2)} + \Theta_{1,1}^{(2)} a_1^{(2)} + \Theta_{1,2}^{(2)} a_2^{(2)})$

B. $a_1^{(3)} = g(\Theta_{1,0}^{(1)} a_0^{(1)} + \Theta_{1,1}^{(1)} a_1^{(1)} + \Theta_{1,2}^{(1)} a_2^{(1)})$

C. $a_1^{(3)} = g(\Theta_{1,0}^{(1)} a_0^{(2)} + \Theta_{1,1}^{(1)} a_1^{(2)} + \Theta_{1,2}^{(1)} a_2^{(2)})$

D. There is no activation $a_1^{(3)}$ in this network

Question 39

You have the following neural network:

Image Name

You want to compute the activations of the hidden layer $a^{(2)} \in \mathbb{R}^3$. One way is to use the following Octave code:

Image Name

You need a vectorized implementation (ie, one that doesn't use loops). Which of the following implementations correctly computes a(2)? Check all correct items

A. z = Theta1 * x; a2 = sigmoid (z)
B. a2 = sigmoid (x * Theta1)
C. a2 = sigmoid (Theta2 * x)
D. z = sigmoid(x); a2 = sigmoid (Theta1 * z)

Question 40

You are using the neural network shown below and have learned the parameters $\Theta^{(1)} = \begin{bmatrix} 1 & 1 & 2.4 \\ 1 & 1.7 & 3.2 \end{bmatrix}$ (used to compute $a^{(2)}$) and $\Theta^{(2)} = \begin{bmatrix} 1 & 0.3 & -1.2 \end{bmatrix}$ (used to compute $a^{(3)}$ as a function of $a^{(2)}$).

Suppose you swap the rows of $\Theta^{(1)}$ corresponding to the two hidden units, giving $\begin{bmatrix} 1 & 1.7 & 3.2 \\ 1 & 1 & 2.4 \end{bmatrix}$, and correspondingly swap the last two entries of the output layer, giving $\Theta^{(2)} = \begin{bmatrix} 1 & -1.2 & 0.3 \end{bmatrix}$. How will this change the value of the output $h_\Theta(x)$?

Image Name

A. No change
B. Bigger
C. Smaller
D. Incomplete information, may become larger or smaller

Question 41

You are training a three-layer neural network and want to use backpropagation to compute the gradient of the cost function.
In the backpropagation algorithm, one of the steps is to update $\Delta_{ij}^{(2)} := \Delta_{ij}^{(2)} + \delta_i^{(3)} \cdot (a^{(2)})_j$ for every i, j. Which of the following is the correct vectorization of this step?

A. $\Delta^{(2)} := \Delta^{(2)} + (a^{(2)})^T \delta^{(3)}$
B. $\Delta^{(2)} := \Delta^{(2)} + (a^{(3)})^T \delta^{(2)}$
C. $\Delta^{(2)} := \Delta^{(2)} + \delta^{(3)} (a^{(2)})^T$
D. $\Delta^{(2)} := \Delta^{(2)} + \delta^{(3)} (a^{(3)})^T$

Question 42

Assume Theta1 is a 5x3 matrix and Theta2 is a 4x6 matrix. Let thetaVec = [Theta1(:); Theta2(:)]. Which of the following correctly recovers Theta2?

A. reshape(thetaVec(16:39),4,6)
B. reshape(thetaVec(15:38),4,6)
C. reshape(thetaVec(16:24),4,6)
D. reshape(thetaVec(15:39),4,6)
E. reshape(thetaVec(16:39),6,4)

Question 43

Let $J(\theta) = 2\theta^3 + 2$, with $\theta = 1$ and $\epsilon = 0.01$. Use the formula $\frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2\epsilon}$ to numerically approximate the derivative at $\theta = 1$. What value do you get? (At $\theta = 1$, the exact derivative is $\frac{dJ(\theta)}{d\theta} = 6\theta^2 = 6$.)

A. 8
B. 6
C. 5.9998
D. 6.0002
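A quick numerical check in Python (a sketch of two-sided gradient checking):

```python
def J(theta: float) -> float:
    return 2 * theta**3 + 2

eps = 0.01
approx = (J(1 + eps) - J(1 - eps)) / (2 * eps)
print(approx)  # 6.000199... , matching option D
```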

Question 44

Which of the following statements is true? select all correct

A. Using a large value of λ does not affect the performance of the neural network; the only reason we do not set λ too large is to avoid numerical problems

B. Gradient checking is useful if we use gradient descent as an optimization algorithm. However, it is not very useful if we use an advanced optimization method (such as in fminunc)

C. Using gradient checking can help verify that the implementation of backpropagation is bug-free

D. If our neural network is overfitting the training set, a reasonable step is to increase the regularization parameter λ

Question 45

Which of the following statements is true? select all correct

A. Assume that the parameter Θ(1) is a square matrix (that is, the number of rows is equal to the number of columns). If we replace Θ(1) with its transpose (Θ(1))T, then we have not changed the function that the network is computing.

B. Suppose we have a correct implementation of backpropagation and are training a neural network using gradient descent. Suppose we plot J(Θ) as a function of the number of iterations and find that it increases rather than decreases. One possible reason is that the learning rate α is too large.

C. Suppose we use gradient descent with a learning rate α. For logistic regression and linear regression, J(Θ) is a convex optimization problem, so we don't want to choose an excessively large learning rate α.
However, for neural networks, J(Θ) may not be convex, so choosing a very large value for α can only speed up convergence.

D. If we are training a neural network using gradient descent, a reasonable debugging step is to plot J(Θ) as a function of the number of iterations and ensure that it is decreasing (or at least not increasing) after each iteration.

Question 46

You train a learning algorithm and find that it has a high error on the test set. Plot the learning curve and get the graph below. Does the algorithm have high bias, high variance, or neither?

Image Name

A. High bias
B. High variance
C. Neither

Question 47

Let's say you've implemented regularized logistic regression to classify objects in images (i.e., object recognition). However, when you test your hypothesis on a new set of images, you find that it makes very poor predictions on the new images, even though it fits the training set well. Which of the following are promising things to try? Check all correct items

A. Try adding multinomial features
B. Get more training examples
C. Try using fewer features
D. Use fewer training examples

Question 48

Suppose you have implemented regularized logistic regression to predict which items a customer will buy on a shopping website. However, when you test your model on a new set of customers, you discover that its predictions have large errors. Moreover, the model does not perform well on the training set either. Which of the following are promising things to try? Check all correct items

A. Try to get and use other features
B. Try to add polynomial features
C. Try to use fewer features
D. Try to increase the regularization parameter λ

Question 49

Which of the following statements is true? Check all correct items

A. Suppose you are training a regularized linear regression model. The recommended way to choose a value for the regularization parameter λ is to choose the value of λ that minimizes the cross-validation error.

B. Suppose you are training a regularized linear regression model. The recommended way to choose the value of λ for the regularization parameter is to choose the value of λ that gives the smallest test set error.

C. Assuming you are training a regularized linear regression model, the recommended way to choose the value of the regularization parameter λ is to choose the value of λ that gives the smallest training set error.

D. The performance of the learning algorithm on the training set is usually better than that on the test set.

Question 50

Which of the following statements is true? Check all correct items

A. When debugging a learning algorithm, it is helpful to plot learning curves to see if you have high bias or high variance issues.

B. If a learning algorithm suffers from high variance, adding more training examples may improve test error.

C. We always prefer models with high variance (rather than models with high bias) because they fit the training set better.

D. If a learning algorithm has high bias, simply adding more training examples may not significantly improve test error.

Question 51

You are working on a spam classification system using regularized logistic regression. "Spam" is the positive class (y=1), and "Not Spam" is the negative class (y=0). You have trained a classifier with m=1000 examples in the cross-validation set. The plot of predicted classes vs actual classes is:

|                    | Actual Class: 1 | Actual Class: 0 |
|--------------------|-----------------|-----------------|
| Predicted Class: 1 | 85              |                 |
| Predicted Class: 0 | 15              |                 |

For reference:
Accuracy = (True Positives + True Negatives) / (Total Examples)
Precision = (True Positives) / (True Positives + False Positives)
Recall = (True Positives) / (True Positives + False Negatives)
F1 Score = (2 * Precision * Recall) / (Precision + Recall)

What is the recall of the classifier?
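Worked answer for reference, using only the values shown: Recall = TP / (TP + FN) = 85 / (85 + 15) = 0.85.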

Question 52

Suppose a huge dataset can be used to train a learning algorithm. Training on large amounts of data may yield good performance when the following two conditions hold. What are the two conditions?

A. Feature x contains enough information to accurately predict y. (For example, one way to test this is whether human experts can confidently predict y when given only x).

B. We train a learning algorithm with a small number of parameters (so less likely to overfit).

C. We train learning algorithms with a large number of parameters (capable of learning/representing fairly complex functions).

D. We train a model without regularization.

Question 53

Suppose you have trained a logistic regression classifier that outputs hθ(x). Currently it predicts 1 if hθ(x) ≥ threshold and 0 if hθ(x) < threshold, with the threshold currently set to 0.5.

Say you increase the threshold to 0.9. Which of the following is correct? Check all correct items

A. Now the classifier may be less accurate.

B. The precision and recall of the classifier may be unchanged, but the accuracy is lower.

C. The accuracy and recall of the classifier may not change, but the precision is higher.

D. The classifier may now have a lower recall.

Say you lower the threshold to 0.3. Which of the following is correct? Check all correct items

A. The classifier may now have a higher recall.

B. The accuracy and recall of the classifier may not change, but the precision is higher.

C. The classifier may now have higher accuracy.

D. The precision and recall of the classifier may be unchanged, but the accuracy is lower.

Question 54

Suppose you are working with a spam classifier where spam is a positive example (y=1) and non-spam is a negative example (y=0). You have a training set of emails where 99% of emails are not spam and 1% are spam. Which of the following statements is true? Check all correct items

A. A good classifier should have both high precision and high recall on the cross-validation set.

B. If you always predict non-spam (output y=0), then your classifier will have 99% accuracy on the training set, and it will probably perform similarly on the cross-validation set.

C. If you always predict non-spam (output y=0), then your classifier will have an accuracy of 99%.

D. If you always predict non-spam (output y=0), then your classifier will be 99% accurate on the training set, but worse on the cross-validation set, because it overfits the training data.

E. If spam is always predicted (output y=1), the recall rate of the classifier is 0% and the precision is 99%.

F. If it always predicts non-spam (output y=0), the recall of the classifier is 0%.

G. If you always predict spam (output y=1), then your classifier will have recall 100% and precision 1%.

Question 55

Which of the following statements is true? Check all correct items

A. Before building the first version of a learning algorithm, it is a good idea to spend a lot of time collecting a lot of data.

B. On skewed datasets (e.g. when there are more positive examples than negative examples), accuracy is not a good performance measure, you should use F1 score based on precision and recall.

C. After training the logistic regression classifier, you must use 0.5 as the threshold for predicting whether the example is positive or negative.

D. Using a very large training set makes the model less likely to overfit the training data.

E. If your model doesn't fit the training set, getting more data might help.

Question 56

Suppose you use an SVM trained with a Gaussian kernel that learns the following decision boundary on the training set:

Image Name

Do you think the SVM is underfitting, should you try increasing or decreasing C? Or increase or decrease σ2?

A. Decrease C, increase σ2
B. Decrease C, decrease σ2
C. Increase C, increase σ2
D. Increase C, decrease σ2

Question 57

The formula for the Gaussian kernel is $\mathrm{similarity}(x, l^{(1)}) = \exp\left(-\frac{\|x - l^{(1)}\|^2}{2\sigma^2}\right)$.

The figure below shows the plot of f1=similarity(x,l(1)) when σ2=1.

Image Name

When σ2=0.25, which of the following is the graph of f1?

A.

Image Name

B.

Image Name

C.

Image Name

D.

Image Name

Question 58

The support vector machine solves

$$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^T x^{(i)})\right] + \sum_{j=1}^{n}\theta_j^2,$$

where the functions $\mathrm{cost}_0(z)$ and $\mathrm{cost}_1(z)$ are plotted as follows:

Image Name

The first term in the objective is $C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^T x^{(i)})\right]$. This first term is zero if two of the following four conditions hold. Which two conditions make this term equal to zero?

A. For every example with $y^{(i)} = 1$, we have $\theta^T x^{(i)} \ge 1$

B. For every example with $y^{(i)} = 0$, we have $\theta^T x^{(i)} \le -1$

C. For every example with $y^{(i)} = 1$, we have $\theta^T x^{(i)} \ge 0$

D. For every example with $y^{(i)} = 0$, we have $\theta^T x^{(i)} \le 0$

Question 59

Suppose you have a dataset with n=10 features and m=5000 examples. After training a logistic regression classifier with gradient descent, you find that it underfits the training set and does not achieve the desired performance on the training set or the cross-validation set. Which of the following steps is expected to improve? Check all correct items

A. Try using a neural network with a large number of hidden units.

B. Reduce the number of examples in the training set.

C. Use a different optimization method, since training logistic regression with gradient descent may lead to a local minimum.

D. Create/add new polynomial features.

Question 60

Which of the following statements is true? Check all correct items

A. Suppose you are using support vector machines for multiclass classification and wish to use a one-vs-all approach. If you have K different classes, you will train K−1 different SVMs.

B. If the data is linearly separable, then an SVM with a linear kernel will return the same parameter θ regardless of the value of C (i.e., the resulting value of θ does not depend on C).

C. The maximum value of the Gaussian kernel (ie sim(x,l(1))) is 1.

D. It is important to perform feature normalization before using the Gaussian kernel.

Question 61

For which of the following tasks, K-means clustering might be an appropriate algorithm? Check all correct items

A. Given a database of user information, automatically group users into different market segments.

B. Based on the sales data of a large number of products in the supermarket, find out which products can be combined (eg often bought together) and therefore should be placed on the same shelf.

C. Predict tomorrow's rainfall based on historical weather records

D. Given sales data for a large number of products in a supermarket, estimate future sales of those products.

E. Given a set of news articles from many different news sites, find the main topics covered.

F. Based on many emails, determine whether they are spam or not spam.

G. From user usage patterns on the site, find out which different user groups exist.

H. Based on historical weather records, predict whether tomorrow's weather will be sunny or rainy.

Question 62

Suppose we have three cluster centroids $\mu_1 = \begin{bmatrix}1\\2\end{bmatrix}$, $\mu_2 = \begin{bmatrix}-3\\0\end{bmatrix}$, $\mu_3 = \begin{bmatrix}4\\2\end{bmatrix}$, and a training example $x^{(i)} = \begin{bmatrix}-2\\1\end{bmatrix}$. What will $c^{(i)}$ be after one cluster assignment step?

A. c(i)=2
B. c(i) is not assigned
C. c(i)=1
D. c(i)=3
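Worked check for reference: $\|x^{(i)} - \mu_1\|^2 = 9 + 1 = 10$, $\|x^{(i)} - \mu_2\|^2 = 1 + 1 = 2$, and $\|x^{(i)} - \mu_3\|^2 = 36 + 1 = 37$, so the nearest centroid is $\mu_2$ and $c^{(i)} = 2$ (option A).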

Question 63

K-means is an iterative algorithm that repeats the following two steps in its inner loop. which two?

A. Move the cluster center and update the cluster center μk.

B. Allocation of clusters where parameter c(i) is updated.

C. Move the cluster center μk, setting it equal to the nearest training example c(i)

D. Cluster center assignment step, where each cluster centroid μi is assigned (by setting c(i)) to the nearest training example x(i).

Question 64

Suppose you have an unlabeled dataset {x(1),...,x(m)}. You run K-means 50 times with different random initializations and obtain 50 different clusterings. What is the right way to choose among these 50 results?

A. The only way is that we need data labels y(i).

B. For each of the 50 clusterings, compute the distortion $\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - \mu_{c^{(i)}}\|^2$ and choose the one with the smallest value.

C. The answers are ambiguous and there is no good way to choose them.

D. Always choose the last (50th) cluster found, as it is more likely to converge to a good solution.

Question 65

Which of the following statements is true? Check all correct items

A. If we are worried about K-means getting stuck in local optima, one way to improve (reduce) this problem is to try to use multiple random initializations.

B. The standard way to initialize K-means is to set μ1=…=μk to a vector equal to zero.

C. Since K-means is an unsupervised learning algorithm, it cannot overfit the data, so it is always better to use as many clusters as is computationally feasible.

D. For some datasets, the "correct" value of K (the number of clusters) may be ambiguous and difficult to decide, even for a human expert looking carefully at the data.

E. K-means gives the same result regardless of the initialization of the cluster centers.

F. A good way to initialize K-means is to select K (distinct) examples from the training set and set cluster centroids equal to these selected examples.

G. In each iteration of K-means, the cost function $J(c^{(1)},\dots,c^{(m)}, \mu_1,\dots,\mu_k)$ (the distortion function) either remains constant or decreases; in particular, it should never increase.

H. Once an example is assigned to a particular cluster center, it will never be reassigned to a different cluster center.

Question 66

Consider the following two-dimensional dataset:

Image Name

Which of the following pictures corresponds to the possible value of u(1) (first eigenvector/first principal component) returned by PCA? Check all correct items

A.

Image Name

B.

Image Name

C.

Image Name

D.

Image Name

Question 67

Which of the following is a reasonable way to choose the number of principal components k? (n is the dimension of the input data, m is the number of input examples)

A. Choose the smallest value of k that retains at least 99% of the variance

B. Choose k so that the approximation error $\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - x_{\mathrm{approx}}^{(i)}\|^2$ is minimized.

C. Choose the smallest value of k that preserves at least 1% of the variance

D. Choose k to be 99% of n (that is, k = 0.99n, rounded to the nearest integer).

Question 68

Suppose someone tells you that the way they run PCA is that "95% of the variance is preserved", what is the equivalent of that?

A. $\dfrac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - x_{\mathrm{approx}}^{(i)}\|^2} \ge 0.05$
B. $\dfrac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - x_{\mathrm{approx}}^{(i)}\|^2} \le 0.05$
C. $\dfrac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - x_{\mathrm{approx}}^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2} \le 0.05$
D. $\dfrac{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - x_{\mathrm{approx}}^{(i)}\|^2} \le 0.95$

Question 69

Which of the following statements is true? select all correct

A. Given only z(i) and Ureduce, there is no way to reconstruct any reasonable approximation of x(i).

B. Even if all input features are on very similar scales, we should still perform mean normalization (so that each feature has a mean of zero) before running PCA.

C. PCA is susceptible to local optima; trying multiple random initializations may help.

D. Given input data x ∈ Rn, it makes sense to run PCA only with values of k satisfying k ≤ n (in particular, running PCA with k = n is possible but not helpful, and k > n does not make sense)

Question 70

Which of the following is a recommended application of PCA? select all correct

A. As an Alternative to Linear Regression: For most model applications, PCA and linear regression give essentially similar results.

B. Data Compression: Reduce the dimensionality of the data, thereby reducing the memory/disk space occupied.

C. Data Visualization: Take 2D data and find different ways to plot it in 2D (using k=2).

D. Data Compression: Reduce the dimensionality of the input data x(i), which will be used in the supervised learning algorithm (i.e., use PCA to make the supervised learning algorithm run faster).

Week 9 | 1 Anomaly Detection

Question 71

For which of the following problems is anomaly detection an appropriate algorithm?

A. Given an image of a face, determine whether it is the face of a particular celebrity.

B. Given a dataset of credit card transactions, identify unusual transactions and flag them as potentially fraudulent.

C. Given data on credit card transactions, classify each transaction by type of purchase (eg: food, transportation, clothing).

D. Identify individuals who may have abnormal health conditions from a large number of primary care patient records.

Question 72

Suppose you have trained an anomaly detection system to flag anomalies when p(x) < ϵ, and you find that it has too many false positives (flag too many things as anomalies) in the cross-validation set. what should you do?

A. Increase ϵ
B. Decrease ϵ

Question 73

Suppose you are developing an anomaly detection system to catch manufacturing defects in aircraft engines. Your model uses $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$.
There are two features: x1 = vibration intensity and x2 = heat generated; both take values between 0 and 1 (and are strictly greater than 0).
For most "normal" engines you would expect x1 ≈ x2. One suspicious type of anomaly is an engine that vibrates strongly (large x1, small x2) without generating much heat, even though the individual values of x1 and x2 may not be outside their typical ranges.
Which feature x3 should you construct to catch these types of anomalies:

A. $x_3 = x_1^2 \times x_2$
B. $x_3 = \frac{x_1}{x_2}$
C. $x_3 = x_1 + x_2$
D. $x_3 = x_1 \times x_2$

Question 74

Which of the following is correct? select all correct

A. If there is no labeled data (or if all data is labeled y=0), p(x) can still be learned, but it may be harder to evaluate the system or to choose a good value of ε.

B. If you have a training set with many positive examples and many negative examples, then anomaly detection algorithms may perform as well as supervised learning algorithms such as support vector machines.

C. If you are developing an anomaly detection system, you cannot use labeled data to improve your system.

D. When selecting features for an anomaly detection system, it is best to look for features that take unusually large or small values on anomalous examples.

Question 75

You have a 1D dataset {x(1),…,x(m)} and you want to detect outliers in the dataset. First plot the dataset, it looks like this:

Image Name

Suppose a Gaussian distribution with parameters $\mu_1$ and $\sigma_1^2$ is fitted to this dataset. Which of the following values of $\mu_1, \sigma_1^2$ might be obtained?

A. $\mu_1 = -3, \sigma_1^2 = 4$
B. $\mu_1 = -6, \sigma_1^2 = 4$
C. $\mu_1 = -3, \sigma_1^2 = 2$
D. $\mu_1 = -6, \sigma_1^2 = 2$

1. A prison face recognition access system is used to identify the identity of the person to enter. This system includes the identification of 4 different types of personnel: prison guards, thieves, food delivery staff, and others. Which of the following learning methods is most suitable for this application need:

A. Regression problem

B. Binary classification problem

C. Multi-class classification problem

D. K-means clustering problem

2. Which of the following techniques would be better for reducing the dimensionality of a dataset

A. Drop columns with too many missing values

B. Delete columns with large data differences

C. Delete columns with different data trends

D. None of the above

3. Which of the following steps is the task of integrating, transforming, dimensionally reducing, and numerically reducing the original data?

A. Frequent Pattern Mining

B. Classification and Prediction

C. Data preprocessing

D. Data Stream Mining

4. Which of the following is not an SVM kernel function ( )

A. Polynomial kernel function

B. Logical kernel function

C. Radial Basis Kernel Function

D. Linear kernel function

5. Data scientists may use multiple algorithms (models) to make predictions at the same time and finally combine the results of these algorithms to make the final prediction (ensemble learning). Which statement about ensemble learning is correct?

A. High correlation between individual models

B. There is low correlation between individual models

C. It is better to use "average weight" instead of "voting" in ensemble learning

D. A single model uses an algorithm

6. In the following different scenarios, the analysis method used is incorrect ( )

A. Based on merchants' business and service data from the past year, use a clustering algorithm to determine the tier of Tmall merchants within their respective main categories

B. Based on merchants' transaction data in recent years, use a clustering algorithm to fit a formula for the amount a user may spend in the next month

C. Use the association rule algorithm to analyze whether the buyer who bought the car seat is suitable for recommending the car mat

D. According to the product information recently purchased by the user, use the decision tree algorithm to identify whether the Taobao buyer may be male or female

7. The meaning of bootstrap sampling is ( )

A. Sampling m features from the whole M with replacement

B. Sampling m features from the population M without replacement

C. Sampling n samples from the whole N with replacement

D. Sampling n samples from a population of N without replacement

8. In logistic regression, if the L1 and L2 norms are added at the same time, what effect will this have? ( )

A. To do feature selection and prevent overfitting to a certain extent

B. Can solve the dimension disaster problem

C. Can speed up calculation

D. Can get more accurate results

9. For linearly inseparable problems in the original space, support vector machines ( ).

A. Find a nonlinear function in the original space to partition the data

B. can't handle

C. Find a linear function to divide the data in the original space

D. Map data into kernel space

10. What is the difference between a regression problem and a classification problem?

A. Regression problems have labels, classification problems do not

B. The output value of the regression problem is discrete, and the output value of the classification problem is continuous

C. The output value of the regression problem is continuous, and the output value of the classification problem is discrete

D. Regression problems and classification problems require different input attribute values

11. Which of the following statements about dimensionality reduction is incorrect?

A. Dimensionality reduction is to convert training samples from high-dimensional space to low-dimensional space

B. Dimensionality reduction will not damage the data

C. Through dimensionality reduction, meaningful data structures can be discovered more effectively

D. Dimensionality reduction will help in data visualization

12. What is the L1 norm of the vector x=[1,2,3,4,-9,0]?

A.1

B.19

C.6

D.
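Worked answer for reference: $\|x\|_1 = |1| + |2| + |3| + |4| + |-9| + |0| = 19$ (option B).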

13. Assuming X and Y both obey normal distributions, P(X<5, Y<0) is a ( ): it denotes the probability that the two conditions X<5 and Y<0 hold at the same time, i.e., the probability that the two events occur together.

A. Prior probability

B. Posterior probability

C. Joint probability

D. None of the above statements are correct

14. Suppose that the proportion of undergraduate students who can drive is 15%, and the proportion of graduate students who can drive is 23%. If the proportion of graduate students in a university is 20%, what is the probability that the students who can drive are graduate students?

A. 80%

B.16.6%

C.23%

D.27.71%
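Worked answer for reference, by Bayes' rule:

$$P(\text{grad} \mid \text{drive}) = \frac{0.23 \times 0.20}{0.23 \times 0.20 + 0.15 \times 0.80} = \frac{0.046}{0.166} \approx 27.71\%$$

(option D).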

15. Assume there are 100 photos, of which 60 are photos of cats and 40 are photos of dogs.

Recognition results: TP=40, FN=20, FP=10, TN=30. Then we can get: ( ).

A. Accuracy = 0.8

B. Precision = 0.8

C. Recall = 0.8

D. None of the above is correct
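Worked check for reference: Accuracy = (40+30)/100 = 0.7, Precision = 40/(40+10) = 0.8, Recall = 40/(40+20) ≈ 0.67, so option B holds.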

16. Which of the following statements about the training set, validation set, and test set is incorrect ( ).

A. The test set is used purely to test the generalization ability of the model

B. The training set is used to train and evaluate model performance

C. The validation set is used to tune the model parameters

D. None of the above statements are correct

17. Which of the following methods can be used to alleviate the occurrence of overfitting: ( ).

A. Add more features

B. Regularization

C. Increase the complexity of the model

D. All of the above

18. Suppose there are 6 two-dimensional data points: D = {(2,3), (5,7), (9,6), (4,5), (6,4), (7,2)}. When splitting for the first time, the splitting line is ( ).

A. x = 5

B. x = 6

C. y = 5

D. y = 6

19. Two vectors have lengths 1 and 2 respectively, and the angle between them is 60 degrees. Which of the following options is wrong ( ).

A. The cosine similarity is 0.5

B. Cosine similarity is positive

C. The cosine similarity cannot be calculated because no specific coordinate values are given

D. The value of cosine similarity has nothing to do with the length of the vector, but only with the angle between the vectors

20. Compared with XGBoost, the main advantages of LightGBM do not include ( )

A. Faster training speed

B. Lower memory consumption

C. Better accuracy

D. Use second-order Taylor expansion to speed up convergence

21. Which statement about the advantages and disadvantages of the BP algorithm is wrong ( ).

A. The BP algorithm cannot be used to deal with nonlinear classification problems

B. The BP algorithm takes a long time to train

C. The BP algorithm easily falls into local minima

D. During BP training, the activation function may saturate due to excessively large weight adjustments

22. The neural network algorithm sometimes overfits; which of the following methods is more feasible to solve the overfitting ( ).

A. Select multiple sets of initial values ​​for the parameters, train them separately, and then select one set as the optimal value

B. Increase the step size of learning

C. Reduce the amount of data in the training dataset

D. Set a regular term to reduce the complexity of the model

23. The minimum time complexity of the SVM algorithm is O(n^2). Based on this, which of the following data sets is not suitable for this algorithm? ( )

A. Large data sets

B. Small dataset

C. Medium dataset

D. Not affected by the size of the dataset

24.‍A positive example (2,3), a negative example (0,-1), which of the following is the SVM hyperplane? ( )

‎A.2x+y-4=0

B.2y+x-5=0

C.x+2y-3=0

D. cannot be calculated
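The answer is C: with one point per class, the maximum-margin hyperplane is the perpendicular bisector of the segment joining them, which is x + 2y - 3 = 0. A minimal sketch (assuming scikit-learn) confirms it:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 3], [0, -1]])
y = np.array([1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print(w / w[0], b / w[0])  # [1. 2.] and -3.0 -> x + 2y - 3 = 0, option C
```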

25. Which of the following statements about the K-means clustering algorithm is wrong ( ).

A. High efficiency and scalability for large data sets

B. It is an unsupervised learning method

C. The K value cannot be obtained automatically, and the initial cluster centers are randomly selected

D. The selection of the initial cluster center has little effect on the clustering results
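D is the wrong statement: K-means is sensitive to the choice of initial centers, and K itself must be supplied by the user. A minimal sketch (assuming scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))
# n_clusters (K) must be given explicitly; the initial centers are chosen with
# randomness (k-means++ by default), so different seeds can give different results.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```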

26. Simply divide the data object set into non-overlapping subsets so that each data object is in exactly one subset. This type of clustering is called ( ).

A. Hierarchical clustering

B. Partitional clustering

C. Non-mutually exclusive clustering

D. Density clustering

27. Which of the following statements about PCA is correct ( ).

A. PCA is a supervised learning algorithm

B. PCA selects the direction with the smallest variance in the original data for the first new coordinate axis after transformation

C. The first direction selected after PCA transformation is the most dominant feature

D. PCA does not need to normalize the data
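C is correct: the first principal component is the direction of largest variance. A minimal sketch (assuming scikit-learn):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])  # axis 0 dominates

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)  # first component explains the most variance
print(pca.components_[0])             # close to [1, 0, 0], the dominant direction
```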

28. Which statement about the Apriori and FP-growth algorithms is correct ( ).

A. Apriori is more cumbersome to run than FP-growth

B. The FP-growth algorithm needs to pair items, so its processing speed is slow

C. FP-growth only needs to traverse the data once, so its scanning efficiency is high

D. The FP-growth algorithm is not suitable for shared memory when the database is large

29. A supermarket studied its sales record data and found that people who bought beer had a high probability of also buying diapers. What kind of data mining problem does this belong to? ( )

A. Association rule discovery

B. Clustering

C. Classification

D. Natural Language Processing

30. Confidence is a measure of which aspect of a rule's interestingness ( ).

A. Simplicity

B. Certainty

C. Practicality

D. Novelty

2. Multiple choice (2 points for each question)

31. Which of the following are classification problems?

A. Judging benign or malignant based on the volume of the tumor and the age of the patient

B. Determining whether a credit card will default based on the user's age, occupation, and deposit amount

C. What size T-shirt does a man with a height of 1.85m and a weight of 100kg wear?

D. Estimate the house price based on the size of the house, the number of bathrooms and other characteristics

32. Which of the following are reasons for using data normalization (feature scaling)?

A. It speeds up gradient descent by reducing the computational cost of each iteration of gradient descent

B. It speeds up gradient descent by reducing the number of iterations to get a good solution

C. It does not prevent gradient descent from falling into a local optimum

D. It prevents matrix irreversibility (singular/degenerate)

33. The main factors affecting the effect of the KNN algorithm include ( ).

A. K value

B. Distance Metrics

C. Decision Rules

D. The distance of the nearest neighbor data

34. Which of the following are commonly used kernel functions of support vector machines ( ).

A. Gaussian kernel

B. Laplace kernel

C. Linear Kernel

D. Polynomial Kernel
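All four options are standard SVM kernels. A minimal numpy sketch of each (the parameters sigma, c, and d are illustrative choices, not fixed by the question):

```python
import numpy as np

def linear(x, z):
    return x @ z

def polynomial(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def gaussian(x, z, sigma=1.0):  # a.k.a. the RBF kernel
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def laplace(x, z, sigma=1.0):   # L1 distance, as in sklearn's laplacian_kernel
    return np.exp(-np.abs(x - z).sum() / sigma)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(x, z), polynomial(x, z), gaussian(x, z), laplace(x, z))
```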

35. Which of the following statements about support vector machines is correct ( ).

A. SVM is suitable for large-scale data sets

B. The idea of SVM classification is to minimize the interval between classification surfaces

C. The SVM method is simple and robust

D. The SVM classification surface depends on the support vectors

36. Which statements about the advantages of the BP algorithm are correct ( ).

A. BP algorithm can learn adaptively

B.BP algorithm has a strong nonlinear mapping ability

C.BP algorithm backpropagation adopts the chain rule, and the derivation process is rigorous

D. The generalization ability of the BP algorithm is not strong

37. Which of the following descriptions about support vector machines is correct ( ).

A. It is a supervised learning method

B. It can be used for multi-classification problems

C. It supports non-linear kernel functions

D. It is a generative model

38. The following are commonly used techniques for dimensionality reduction: ( ).

A. Principal Component Analysis

B. Feature extraction

C. Singular value decomposition

D. Discretization

39. What properties should the hyperplane obtained by the PCA algorithm have ( ).

A. Nearest reconstructability (minimum reconstruction error)

B. Information Gain Maximization

C. Maximum Separability

D. Local minima

40. Regarding association rules, the correct statements are: ( ).

A. The main algorithms for mining association rules are: Apriori and FP-Growth

B. An itemset that satisfies the minimum support, we call it a frequent itemset

C. The story of beer and diapers is a typical example of cluster analysis

D. Support is an indicator to measure the importance of association rules

3. True or false (1 point for each question)

41. Support vectors are the data points closest to the decision plane.

A. Correct

B. Wrong

42. The correlation coefficient of two dependent variables can be zero.

A. Correct

B. Wrong

43. PCA chooses the direction with the least amount of information for projection.

A. Correct

B. Wrong

44. In most machine learning projects, the three steps of data collection, data cleaning, and feature engineering take up most of the time, while data modeling takes up a smaller share of the total time.

A. Correct

B. Wrong

45. Stochastic gradient descent uses a single sample in each iteration.

A. Correct

B. Wrong
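The statement is correct. A minimal numpy sketch of SGD for least squares, drawing exactly one sample per iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w, lr = np.zeros(3), 0.1
for _ in range(1000):
    i = rng.integers(len(X))         # one randomly chosen sample per iteration
    grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x_i . w - y_i)^2
    w -= lr * grad
print(w)  # approaches [1, -2, 0.5]
```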

46. The basic assumption of Naive Bayes is conditional independence.

A. Correct

B. Wrong

47. The SMOTE algorithm uses an oversampling method.

A. Correct

B. Wrong

48. The solution obtained by L2 regularization is more sparse.

A. Correct

B. Wrong
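The statement is wrong: it is L1 regularization that yields sparse solutions, while L2 only shrinks coefficients toward zero. A minimal sketch (assuming scikit-learn):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)  # one informative feature

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: zeroes out the noise features
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: small but nonzero coefficients
print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())  # e.g. 9 vs. 0
```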

49. The ID3 algorithm can only handle discretely distributed features.

A. Correct

B. Wrong

50. The data for ensemble learning does not need to be normalized or standardized.

A. Correct

B. Wrong

51. The BP algorithm "prefers the new and forgets the old": after learning new samples, it gradually forgets old samples.

A. Correct

B. Wrong

52. The classification accuracy of logistic regression is not high enough, so this algorithm is rarely used in industry.

A. Correct

B. Wrong

53. The SMOTE algorithm uses oversampling.

A. Correct

B. Wrong

54. 1 million pieces of data are divided into training, validation, and test sets. The data can be divided like this: 98%, 1%, and 1%.

A. Correct

B. Wrong

55. K-means is a density-based clustering algorithm that produces partitional clusters, and the number of clusters is determined automatically by the algorithm.

A. Correct

B. Wrong

56. The basic assumption of the Naive Bayes method is conditional independence.

A. Correct

B. Wrong

57. The larger the feature space, the greater the possibility of overfitting.

A. Correct

B. Wrong

58. The closer the cosine similarity of two vectors is to 1, the more similar they are.

A. Correct

B. Wrong

59. K-means is a density-based clustering algorithm that generates partitional clusters, and the number of clusters is automatically determined by the algorithm.

A. Correct

B. Wrong

60. The core idea of the ID3 algorithm is to use information gain to measure feature selection, and to select the feature with the largest information gain for splitting.

A. Correct

B. Wrong
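The statement is correct. A minimal numpy sketch of the information gain that ID3 maximizes, gain(S, A) = H(S) - sum_v (|S_v|/|S|) H(S_v):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

y = np.array([0, 0, 1, 1, 1, 1])        # class labels (toy data)
feature = np.array([0, 0, 0, 1, 1, 1])  # values of one candidate feature

gain = entropy(y) - sum(
    (feature == v).mean() * entropy(y[feature == v]) for v in np.unique(feature)
)
print(round(gain, 3))  # 0.459; ID3 splits on the feature maximizing this
```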
