Machine Learning Final Exam Multiple Choice Questions

1. Neural network optimization suffers from which of the following problems (ABC).

A. The instability of the solution B. The parameters are difficult to determine  

C. It is difficult to guarantee the optimal solution D. There are a large number of local maxima in the energy function

2. Among the following Python data types, the mutable data types are (AC).

A. Dictionary B. Tuple C. List D. String

  Mutable data types: the value can change while the object's id stays the same; immutable data types: when the value changes, the id changes as well (a new object is created). Tuple elements cannot be modified, so tuples are immutable. Common immutable types: numbers, strings, booleans, and tuples.
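A minimal sketch (standard CPython) of the id() behaviour described above:

```python
lst = [1, 2, 3]      # list: mutable
print(id(lst))
lst.append(4)        # modified in place ...
print(id(lst))       # ... id is unchanged

s = "abc"            # string: immutable
print(id(s))
s = s + "d"          # a new object is created ...
print(id(s))         # ... id changes

t = (1, 2, 3)
# t[0] = 9           # would raise TypeError: tuple elements cannot be modified
```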

3. Which of the following Python data types are ordered sequences (ABD).

A. Tuple B. List C. Dictionary D. String

A list in Python is a mutable sequence, usually used to store a collection of items of the same type, although it can also hold items of different types. A tuple is an immutable sequence, usually used to store heterogeneous collections. "Ordered" here means that the elements keep their positions and can be accessed by index.
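A short sketch of what "ordered" means in practice, using made-up values:

```python
t = (10, 20, 30)          # tuple: ordered, immutable
l = [10, 20, 30]          # list: ordered, mutable
s = "abc"                 # string: ordered, immutable
print(t[0], l[1], s[2])   # prints: 10 20 c (elements keep their positions)
# A dict is a mapping, not a sequence; it has no positional indexing,
# which is why it is not counted as an ordered sequence here.
```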

4. The factors that determine the performance of artificial neural networks are (ABC).

A. Properties of neurons

B. The topology formed by the interconnections between neurons

C. The learning rules by which the network adapts to improve performance

D. Data size

5. The application fields of the Python language are (ABCD).

A. Web development B. Automated scripts for operating system management and server operation and maintenance

C. Scientific Computing D. Game Development

6. Feedforward neural networks are commonly used in (AD).

A. Image recognition B. Text processing C. Question answering system D. Image detection

In a feedforward neural network, each neuron belongs to a distinct layer. Neurons in each layer receive signals from the previous layer and send their outputs to the next layer. Layer 0 is called the input layer, the last layer the output layer, and the intermediate layers the hidden layers. When neurons in adjacent layers are fully connected, the network is also called a fully connected neural network (FNN). Its main characteristics are that it is unidirectional and multi-layered, so the network can be represented as a directed acyclic graph. Feedforward networks fall into three types: perceptron networks, mainly used for pattern classification and multimodal perception; back-propagation networks (BP networks), often used for nonlinear mapping; and radial basis function networks (RBF networks), often used to recognize cells, images, and sounds.
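A minimal sketch of one forward pass through a fully connected feedforward network, using NumPy with made-up layer sizes and random (untrained) weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # layer 0: input layer
W1 = rng.normal(size=(5, 4))  # input -> hidden weights (fully connected)
b1 = np.zeros(5)
W2 = rng.normal(size=(3, 5))  # hidden -> output weights (fully connected)
b2 = np.zeros(3)

h = sigmoid(W1 @ x + b1)      # hidden layer activations
y = sigmoid(W2 @ h + b2)      # output layer: signals flow one way only
print(y)
```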

7. The implementation process of machine learning includes data collection, (ABCD), and other steps.

A. Data analysis and processing B. Algorithm selection C. Training model D. Model adjustment

The general machine learning process: define the problem; collect data; analyze and prepare the data (formatting, deviation detection and noise reduction, cleaning to remove dirty or useless records, noise, and missing values, standardization or visualization, and splitting the data into sets; this stage mainly covers extracting features from the available data); select a model; set a loss function (for example, the 0-1 loss, which is 1 when the prediction is wrong and 0 when it is correct); set the learning rate (which differs across data sets); then train, evaluate, tune parameters (hyperparameter adjustment), and finally apply the predictive model.
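A minimal sketch of this workflow with scikit-learn; the iris data set and logistic regression are illustrative choices only:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                     # collect data
X_train, X_test, y_train, y_test = train_test_split(  # analyze/split data
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000)             # select an algorithm
model.fit(X_train, y_train)                           # train the model
print(accuracy_score(y_test, model.predict(X_test)))  # evaluate, then tune
```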

8. Which of the following belong to the application areas of artificial neural networks (ABCD).

A. Automatic control B. Signal processing C. Soft measurement D. Intelligent calculation

Soft measurement (soft sensing): for an important variable that is difficult or currently impossible to measure directly, select other variables that are easy to measure and infer or estimate it through a mathematical relationship among them, replacing the function of hardware with software.

9. The characteristics of the Python language are (ABD).

A. Easy to learn B. Open source C. Process-oriented D. Portability

10. The application fields of traditional machine learning are (ABD).

A. Credit risk detection B. Sales forecast C. Speech synthesis D. Product recommendation

Speech synthesis, also known as Text-to-Speech (TTS) technology, converts arbitrary text into standard, fluent speech in real time. It draws on acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a cutting-edge technology in the field of Chinese information processing.

11. Which of the following statements are incorrect (CD).

A. When handling missing data with the Pandas library, dropna is often used to remove records with missing values

B. isnull in the Pandas library determines whether data is missing

C. Pandas cannot read CSV files

D. Pandas can read Word files
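A quick sketch of the Pandas behaviour behind these options, on a tiny made-up CSV string:

```python
import io
import pandas as pd

csv_text = "a,b\n1,2\n3,\n"               # one missing value in column b
df = pd.read_csv(io.StringIO(csv_text))   # Pandas can read CSV, so C is wrong
print(df.isnull())                        # True where data is missing
print(df.dropna())                        # rows with missing values removed
# Pandas has no built-in reader for Word files, so D is also incorrect.
```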

12. A complete artificial neural network includes (AC).

A. One input layer B. Multiple analysis layers C. Multiple hidden layers D. Two output layers

13. According to different learning methods, machine learning can be divided into which of the following categories (ABC).

A. Supervised learning B. Unsupervised learning C. Semi-supervised learning D. Self-directed learning

14. The following are deep learning frameworks: (ABCD).

A. Keras  B. TensorFlow  C. PaddlePaddle  D. PyTorch

15. (B) and (C) are the two most commonly used evaluation metrics in classification tasks.

A. Recall B. Error rate C. Accuracy D. Precision

16. The core elements of machine learning include (ACD).

A. Data B. Operators C. Algorithms D. Computing power

17. Regarding the sigmoid function, which of the following descriptions are correct (ABD).

A. The output value is a real number between 0 and 1

B. Where the input value is close to 0, the relationship between input and output is approximately linear

C. Where the input value is close to 0, the slope is approximately 0

D. The input value is any real number
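A small numeric check of these properties; note that near 0 the sigmoid's slope reaches its maximum of 0.25, not approximately 0, which is why option C is wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-10.0, -0.1, 0.0, 0.1, 10.0):   # any real input is valid
    print(z, sigmoid(z))                  # outputs stay strictly in (0, 1)
# Near z = 0 the outputs change almost linearly with z (slope about 0.25).
```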

18. In multi-class learning, the classic decomposition strategies are (ACD).

A. One vs Rest (OvR) B. Two vs Two

C. Many vs Many (MvM) D. One vs One (OvO)

19. Given a = numpy.array([[1,2,3],[4,5,6]]), the number 5 can be selected by the indices (AC).

A. a[1][1]  B. a[2][2]  C. a[1,1]  D. a[2,2]   (Indexing starts from 0.)
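A quick check of the two valid indexing forms in NumPy:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a[1][1])   # 5, option A: chained indexing, row 1 then column 1
print(a[1, 1])   # 5, option C: tuple indexing
# a[2][2] and a[2, 2] raise IndexError: axis 0 only has size 2.
```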

20. Which of the following are classification problems: (BCD).

A. Multi-label single classification B. Single-label multi-classification C. Two-class classification D. Multi-label multi-classification

21. How to judge an ideal training set? (ABC).

A. The ideal training set has a balanced diversity distribution and is not prone to overfitting

B. Compared with the number of samples, the representativeness and quality of the samples themselves are more important

C. The content of the data set is highly consistent with the goals that the model needs to achieve

D. The cross-validation method can make up for the shortcomings of the data set

22. The relationship and difference between machine learning and data mining are (ABC).

A. Data mining can be seen as the intersection of machine learning and databases.

B. Data mining mainly uses the technology provided by the machine learning field to analyze massive data, and uses the technology provided by the database field to manage massive data.

C. Machine learning is partial to theory, and data mining is partial to application.

D. The two are two independent data processing technologies.

Analysis: D is incorrect; data mining can be regarded as the intersection of machine learning and databases, so the two are not independent of each other.

23. Which of the following function calls can set the ticks on a coordinate axis (AB).

A. plt.xticks()  B. plt.yticks()  C. plt.xlabel()  D. plt.ylabel()
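A minimal sketch (illustrative values) showing that xticks/yticks set tick positions while xlabel/ylabel only set axis titles:

```python
import matplotlib.pyplot as plt

plt.plot([0, 1, 2, 3], [0, 1, 4, 9])
plt.xticks([0, 1, 2, 3])   # tick positions on the x axis (option A)
plt.yticks([0, 3, 6, 9])   # tick positions on the y axis (option B)
plt.xlabel("x")            # axis title only, not the scale (option C)
plt.ylabel("y")            # axis title only, not the scale (option D)
plt.show()
```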

24. In real-world data, missing values are common; the general handling methods are (ABCD).

A. Ignore B. Delete C. Average fill D. Max fill
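A sketch of how each strategy might look in Pandas, on a made-up column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0]})
ignored = df                               # A: ignore, leave NaN in place
dropped = df.dropna()                      # B: delete rows with missing values
mean_filled = df.fillna(df["x"].mean())    # C: fill with the column mean
max_filled = df.fillna(df["x"].max())      # D: fill with the column maximum
print(mean_filled["x"].tolist(), max_filled["x"].tolist())
```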

25. Which of the following methods can be used to evaluate the performance of classification algorithms (ABC).

A. F1 Score B. Accuracy rate C. AUC D. Prediction result distribution

Performance metrics for evaluating classifiers: accuracy (ACC), precision, recall, F1-score, and AUC.

On the test data set, a classifier's predictions can be divided into four cases:

TP: the number of positive samples predicted as positive

FN: the number of positive samples predicted as negative

FP: the number of negative samples predicted as positive

TN: the number of negative samples predicted as negative

1. Accuracy (ACC)

For a given test set, the ratio of the number of correctly classified samples to the total number of samples: ACC = (TP + TN) / (TP + FP + TN + FN).

However, for binary classification with unbalanced positive and negative examples, especially when we care more about the minority class, accuracy has little reference value.

2. Precision

Also called positive predictive value: Precision = TP / (TP + FP).

3. Recall

Also called sensitivity or true positive rate (TPR): Recall = TP / (TP + FN).

4. F1-Score

The harmonic mean of precision and recall: F1 = 2PR / (P + R).

When both precision and recall are high, the F1 value is also high.

5. AUC

5.1 ROC curve

The ROC curve describes how the two quantities FPR and TPR, derived from the classification confusion matrix, change relative to each other.

If a binary classifier outputs a probability of belonging to the positive class, each choice of threshold yields a different confusion matrix, corresponding to one point on the ROC curve.

The ROC curve reflects the trade-off between FPR and TPR; informally, it asks how much faster TPR grows than FPR as the threshold varies.

The faster TPR grows, the more the curve bows upward, reflecting better classification performance. When positive and negative samples are unbalanced, the advantage of this evaluation method over plain accuracy is particularly significant.

5.2 AUC

AUC (Area Under Curve) is the area under the ROC curve; for a classifier better than random guessing, its value falls between 0.5 and 1.

AUC is used as an evaluation standard because in many cases the ROC curve alone cannot clearly indicate which classifier is better, whereas AUC summarizes the quality of a classifier as a single number: the larger the value, the better.

Method 1: compute the area under the ROC curve directly. Since the test samples are finite, the ROC curve is a step function, and the AUC is simply the sum of the areas of these steps. The precision of the result depends on the granularity of the thresholds.

Method 2: an interesting property of AUC is that it is equivalent to the Wilcoxon-Mann-Whitney statistic (rank-sum test), which measures the probability that a randomly chosen positive sample scores higher than a randomly chosen negative sample.

From this definition we obtain another way to compute AUC: estimate this probability directly. Enumerate all M×N positive-negative pairs (M positives, N negatives) and count how many pairs have the positive sample scored above the negative one; a tied pair counts as 0.5. Divide the count by M×N. The complexity of this method is O(n^2), where n = M + N is the number of samples.

Method 3: essentially the same as Method 2, but with lower complexity. Sort the scores in descending order, assign rank n to the sample with the largest score, rank n-1 to the second largest, and so on. Sum the ranks of all positive samples and subtract M(M+1)/2, the contribution of positive-positive comparisons; this gives the number of positive-negative pairs in which the positive sample scores higher. Finally divide by M×N.
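A sketch of Method 3 under these definitions; for simplicity it assumes no tied scores (a tie would contribute 0.5 per pair):

```python
import numpy as np

def auc_by_ranks(scores, labels):
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    order = np.argsort(scores)                    # ascending scores
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # largest score gets rank n
    pos_ranks = ranks[labels == 1]
    M = len(pos_ranks)
    N = len(scores) - M
    # positive-rank sum minus the M(M+1)/2 positive-positive comparisons
    return (pos_ranks.sum() - M * (M + 1) / 2) / (M * N)

print(auc_by_ranks([0.9, 0.8, 0.35, 0.6, 0.2], [1, 1, 0, 1, 0]))  # 1.0
```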

The ROC (Receiver Operating Characteristic) curve is a useful visualization tool for comparing two classification models. It shows the trade-off between a model's true positive rate (TPR), on the vertical axis, and its false positive rate (FPR), on the horizontal axis.

In Figure (a), two curves are shown; the ROC curve traces how the false positive rate and true positive rate change as the threshold varies. The point in the lower left corner corresponds to judging all samples negative, the point in the upper right corner to judging all samples positive, and the dashed diagonal is the curve produced by random guessing.

In real tasks, a finite number of test samples is usually used to draw the ROC graph. Only a finite number of (true positive rate, false positive rate) coordinate pairs can then be obtained, so the smooth ROC curve of Figure (a) cannot be produced; only an approximate ROC curve like that of Figure (b) can be drawn.

Drawing process: given m+ positive examples and m- negative examples, sort the samples by the learner's predicted score. First set the classification threshold to the maximum, so that all samples are predicted negative; the true positive rate and false positive rate are both 0, and a point is marked at (0, 0). Then set the threshold to each sample's predicted value in turn, i.e., classify the samples as positive one by one. Let the previously marked point be (x, y): if the new positive prediction is a true positive, mark (x, y + 1/m+); if it is a false positive, mark (x + 1/m-, y). Finally connect adjacent points with line segments.
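A sketch of this drawing procedure on made-up scores and labels, producing the list of ROC points:

```python
import numpy as np

def roc_points(scores, labels):
    order = np.argsort(-np.asarray(scores))  # descending predicted scores
    labels = np.asarray(labels)[order]
    m_pos = labels.sum()                     # m+ positives
    m_neg = len(labels) - m_pos              # m- negatives
    x, y = 0.0, 0.0                          # max threshold: all negative
    points = [(x, y)]
    for lab in labels:                       # lower the threshold step by step
        if lab == 1:
            y += 1.0 / m_pos                 # true positive: move up
        else:
            x += 1.0 / m_neg                 # false positive: move right
        points.append((round(x, 3), round(y, 3)))
    return points

print(roc_points([0.9, 0.8, 0.35, 0.6, 0.2], [1, 1, 0, 1, 0]))
```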

If one learner's ROC curve is completely "wrapped" by another's, the latter can be asserted to perform better; if the curves cross, the learners can be compared by the area under the ROC curve, i.e., the AUC (Area Under ROC Curve) value.

AUC can be obtained by summing the areas of the strips under the ROC curve. Assuming the ROC curve is formed by connecting the points {(x1, y1), (x2, y2), ..., (xm, ym)} in order (with x1 = 0 and xm = 1), the AUC can be estimated as

AUC = (1/2) × Σ_{i=1}^{m-1} (x_{i+1} - x_i) × (y_i + y_{i+1})

AUC gives the average performance of a classifier, but it does not replace inspection of the whole curve. A perfect classifier has an AUC of 1.0, while random guessing has an AUC of 0.5.

AUC reflects the ranking quality of the predictions, so it is closely related to ranking error. Given m+ positive examples and m- negative examples, let D+ and D- denote the positive and negative example sets respectively; the ranking "loss" is then defined as

ℓ_rank = (1 / (m+ × m-)) × Σ_{x+ ∈ D+} Σ_{x- ∈ D-} ( I(f(x+) < f(x-)) + 0.5 × I(f(x+) = f(x-)) )

i.e., each positive-negative pair incurs a penalty of 1 if the positive example scores below the negative one, and 0.5 if they score equally.

ℓ_rank corresponds to the area above the ROC curve: if a positive example is marked at (x, y) on the ROC curve, then x is exactly the proportion of negative examples ranked before it, i.e., the false positive rate, so:

AUC = 1 - ℓ_rank

The AUC value is a probability: if one positive sample and one negative sample are drawn at random, AUC is the probability that the classifier's score ranks the positive sample ahead of the negative one. The larger the AUC, the more likely the algorithm ranks positives ahead of negatives, and the better it classifies.

26. In class-imbalanced datasets, (A) and (B) are often used as more appropriate performance measures.

A. Recall B. Precision C. Error D. Accuracy

1. Empirical error and overfitting

Error rate: the ratio of the number of misclassified samples to the total number of samples. If a out of m samples are misclassified, the error rate is E = a/m.

Accuracy: 1 - a/m

Training error (empirical error): The error of the learner on the training set

Generalization error: error on new samples

Overfitting: learning too much about certain characteristics of the training samples, resulting in a decline in generalization performance

Underfitting: The general properties of the training samples have not been learned well

2. Data set division

1. Hold-out method

The hold-out method directly divides the data set D into two mutually exclusive sets, one of which is used as the training set S and the other as the test set T.

A common practice is to use 1/5 to 1/3 of the samples as the test set.

The division of the training/test set should maintain the consistency of the data distribution as much as possible, and avoid the additional bias introduced by the data division.

Estimates obtained from a single hold-out split are therefore often unstable and unreliable. When using the hold-out method, it is usual to repeat the evaluation over several random splits and report the average as the hold-out result.
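A sketch of repeated hold-out evaluation with scikit-learn; the data set, model, and number of repetitions are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
scores = []
for seed in range(5):                       # several random divisions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y,    # stratify keeps class proportions
        random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))
print(sum(scores) / len(scores))            # average as the final estimate
```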

2. Cross-validation method

The cross-validation method first divides the data set D into K mutually exclusive subsets of similar size, each preserving the data distribution as far as possible, i.e., obtained from D by stratified sampling. Each time, the union of K-1 subsets is used as the training set and the remaining one as the test set; K rounds of training and testing are carried out, and the mean of the K test results is returned. This is usually called K-fold cross-validation.

As with the hold-out method, there are many ways to divide D into K subsets. To reduce the variance caused by different splits, K-fold cross-validation is usually repeated P times with different random partitions, and the final evaluation is the mean of the P runs of K-fold cross-validation; 10 times 10-fold cross-validation is common (see the sketch after this subsection).

Assuming D contains m samples, setting K = m gives a special case of cross-validation: the leave-one-out method.

The leave-one-out method is unaffected by how the samples are randomly divided, so its estimates are often quite accurate. Its disadvantage is the large computational cost on large data sets.
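A sketch of 10-fold cross-validation with scikit-learn; StratifiedKFold keeps each fold's class distribution consistent, as described above (data set and model are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())   # mean of the K test results
```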

3. Bootstrap method

Given a data set D with m samples, bootstrap sampling generates a data set D2: each time, randomly select one sample from D, copy it into D2, and then put it back into D so that it may be selected again in later draws. After this process is repeated m times, D2 contains m samples.

Taking the mathematical limit, about 36.8% of the samples in the initial data set D never appear in D2, since (1 - 1/m)^m tends to 1/e ≈ 0.368 as m grows. Therefore D2 can be used as the training set and D\D2 as the test set ("\" denotes set subtraction).

The bootstrap method is useful when the data set is small and it is hard to divide it effectively into training and test sets. In addition, it can generate multiple diverse training sets from the initial data, which benefits methods such as ensemble learning. However, bootstrap sampling changes the distribution of the initial data set and so introduces estimation bias; when the initial data are plentiful, the hold-out method and cross-validation are more commonly used.
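A small simulation confirming the roughly 36.8% out-of-bag fraction:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
D = np.arange(m)                          # the original data set
D2 = rng.choice(D, size=m, replace=True)  # m draws with replacement
out_of_bag = np.setdiff1d(D, D2)          # D \ D2, usable as a test set
print(len(out_of_bag) / m)                # close to 1/e, about 0.368
```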

3. Parameter tuning and the final model

1. When evaluating and selecting models, in addition to choosing the learning algorithm, its parameters must also be set.

2. After model selection is complete, the learning algorithm and parameter configuration are fixed. The model is then retrained on the full data set D rather than only the training set; this retrained model is the final one submitted to the user.

4. Performance Metrics

1. Regression

The most commonly used performance measure for regression tasks is the mean squared error: MSE = (1/m) × Σ_{i=1}^{m} (f(x_i) - y_i)^2.

2. Classification

1. Error rate and precision

These are the two most commonly used performance measures in classification tasks, and are applicable to both binary and multi-classification tasks.
The error rate is the proportion of misclassified samples to the total number of samples; the accuracy is the proportion of correctly classified samples to the total number of samples.

2. Precision, recall and F1

Because the error rate and accuracy weight every class equally, they are not suitable for analyzing class-imbalanced data sets.

In class-imbalanced data sets, correctly classifying the rare class matters more than correctly classifying the majority classes; precision and recall are then more appropriate than accuracy and error rate. In binary problems, the rare class is usually labeled positive and the majority class negative.

Combining the ground truth with the predicted results yields the following confusion matrix (reconstructed here from the TP/FN/FP/TN definitions above):

                      Predicted positive   Predicted negative
Actually positive     TP                   FN
Actually negative     FP                   TN

  • Precision (P): the proportion of samples classified as positive that are actually positive.
  • Recall (R): the proportion of actually positive samples that are classified as positive.
  • P = TP / (TP + FP)
  • R = TP / (TP + FN)

The P-R curve is plotted with precision on the vertical axis and recall on the horizontal axis. When two learners' curves intersect, they can be compared as follows:

Break-Even Point (BEP): the value at which precision = recall.

F1 = 2PR / (P + R) = 2×TP / (total number of samples + TP - TN)

If precision and recall are weighted differently, the more general F_β measure can be used:

F_β = ((1 + β^2) × P × R) / ((β^2 × P) + R)

When β > 1, recall has greater weight; when β < 1, precision has greater weight; when β = 1, it reduces to the standard F1.

If there are multiple binary confusion matrices (global):

First compute the precision and recall on each confusion matrix, denoted (P1, R1), (P2, R2), (P3, R3), (P4, R4), ..., and then average them. This yields the macro-precision (macro-P), macro-recall (macro-R), and the corresponding macro-F1:

macro-P = (1/n) × Σ P_i,  macro-R = (1/n) × Σ R_i,  macro-F1 = 2 × macro-P × macro-R / (macro-P + macro-R)

Alternatively, average the corresponding elements of the confusion matrices to obtain mean values of TP, FP, TN, and FN, and compute the micro-precision, micro-recall, and micro-F1 from these averages.
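A sketch of macro vs. micro averaging using scikit-learn's metrics on a made-up 3-class prediction:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
for avg in ("macro", "micro"):   # per-class average vs. pooled counts
    print(avg,
          precision_score(y_true, y_pred, average=avg),
          recall_score(y_true, y_pred, average=avg),
          f1_score(y_true, y_pred, average=avg))
```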

3. ROC and AUC

True positive rate (TPR): TPR = TP / (TP + FN)

False positive rate (FPR): FPR = FP / (TN + FP)

AUC: area under the ROC curve

 

4. Cost-sensitive error rate and cost curve

In order to balance the different losses caused by different types of errors, unequal costs can be assigned to them. Taking binary classification as an example, a cost matrix is introduced (figure omitted), in which the cost of misclassifying a class-0 sample as class 1 generally differs from the cost of the reverse error.

Under unequal costs, what we want is no longer to simply minimize the number of errors, but to minimize the overall cost.

A cost curve can then be drawn to visualize the expected overall cost (figure omitted).

Analysis of the test paper: the multiple-choice questions are relatively basic, mainly testing fundamental concepts and their application. The knowledge points cover Python basics; the third-party scientific computing package NumPy; the structured data analysis tool Pandas; the plotting library Matplotlib; the scientific computing toolkit SciPy; the basic machine learning workflow; data set division; basic machine learning knowledge and English terminology; and AUC. The main difficulty lies in understanding the general performance measures of machine learning algorithms. Overall, the questions are moderately difficult.


Source: blog.csdn.net/anmin8888/article/details/121307057