Practice exercises

  1. Which of the following statements about the K-means clustering algorithm is incorrect: (D)

A. Highly efficient and scalable for large data sets

B. It is an unsupervised learning method

C. The K value cannot be obtained automatically, and the initial clustering center is randomly selected.

D. The selection of the initial clustering center has little impact on the clustering results.

  2. Problems studied by clustering algorithms include (multiple choice): (ABC)

A. Make the final category distribution more reasonable

B. Rapid clustering

C. High accuracy

D. Can automatically identify the number of cluster centers

3. Briefly describe the clustering implementation process of the K-means algorithm.

① Randomly initialize K center points;

② Calculate the distance D from each sample point to each of the K center points;

③ Assign each sample point to the cluster of the center point for which D is smallest;

④ Compute the mean of the points in each of the K clusters and use these means as the new center points;

⑤ Repeat steps ②-④ until the new center points coincide with the old ones; the iteration then stops and the final partition is taken as the clustering result.
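
The steps above can be illustrated with a short NumPy sketch (a minimal illustration on assumed toy data and K; it also assumes no cluster becomes empty during the iterations):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # ① randomly pick K initial center points from the samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # ② distances D from every sample point to the K centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # ③ assign each sample to the cluster of its nearest center
        labels = dists.argmin(axis=1)
        # ④ recompute each center as the mean of its cluster (assumes no empty cluster)
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # ⑤ stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # two toy blobs
centers, labels = kmeans(X, k=2)
print(centers)
```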

4. Please briefly describe the advantages and disadvantages of the K-means algorithm.

Answer:

Advantages:
① The principle is simple, the implementation is easy, and convergence is fast;

② The clustering results are generally good;

③ The algorithm is highly interpretable;

④ The only main parameter that needs tuning is the number of cluster centers k.

Disadvantages:
① The K value is difficult to choose;

② It is difficult to converge on data sets that are not convex;

③ If the underlying clusters are unbalanced, for example their sizes differ greatly or their variances differ, the clustering result will be poor;

④ Because an iterative method is used, the result obtained may only be a local optimum;

⑤ It is sensitive to noise and outliers.

5. Please briefly describe what evaluation indicators and methods can be used to evaluate clustering algorithms.

① SSE (sum of squared errors): after each clustering, compute the sum of squared distances between the sample points in each cluster and that cluster's center point. The smaller the value, the better the clustering result. This method is simple and crude, and if the initial center points are chosen poorly it can get stuck in a local optimum.

② Elbow method: plot the within-cluster sum of squared distances against K as a line chart; the K at the "elbow" of the curve, where the decrease suddenly slows down, is taken as the best number of clusters for the current sample set.
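
A brief sketch of both indicators with scikit-learn (toy data assumed; KMeans' `inertia_` attribute is exactly the within-cluster SSE described above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# SSE (inertia_) for a range of K values; the "elbow" of this curve suggests the best K
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```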

6. In the K-means algorithm, which of the following can be used to obtain the global optimal solution: (D)

① Try running the algorithm with different centroid initializations

② Adjust the number of iterations

③ Find the optimal number of clusters

A. ② and ③

B. ① and ③

C. ① and ②

D. All of the above

Answer analysis:

The traditional K-means algorithm selects the initial cluster centers randomly, which often causes the clustering result to fall into a local optimum.
Improving the way the initial cluster centers are selected can improve the clustering result of the K-means algorithm and help it reach the global optimum,
so trying different initial centroids is in effect a search for better initial cluster centers in order to approach the global optimum;
if the number of iterations is too small the algorithm may not have converged, so the number of iterations also needs to be adjusted;
finally, the optimal number of clusters, i.e. the K value, is defined by the user, and since we do not know in advance which K value achieves the best result, the K value also needs to be tuned.

To sum up, option D is the correct answer.

7. Which of the following statements about SVM application scenarios is correct (multiple choices): (ABC)

A. SVM performs outstandingly on binary classification problems

B. SVM can solve multi-classification problems

C. SVM can solve regression problems

8. Which of the following statements about the hard margin and soft margin of SVM is incorrect: (B)

A. A hard margin performs better on linearly separable samples.

B. A hard margin is not very sensitive to outliers.

C. A soft margin performs better on linearly inseparable samples.

D. A soft margin requires a trade-off between margin violations and model complexity.

9. Please briefly describe the purpose of introducing kernel functions in SVM in linearly inseparable samples, common kernel functions, and their usage scenarios and effects.

Answer:
① When we encounter linearly inseparable samples, a common approach is to map the sample features into a higher-dimensional space in which they become linearly separable. However, the dimensionality of that space can become terrifyingly high. This is where the kernel function shows its value: although it also corresponds to mapping features from a low-dimensional to a high-dimensional space, it performs its computation in the low-dimensional space while producing, through inner products, exactly the quantities the classifier needs in the high-dimensional space. This avoids expensive computation directly in the high-dimensional space and genuinely solves SVM's linear inseparability problem.

② Commonly used kernels include the linear kernel, the polynomial kernel, and the RBF (Gaussian) kernel. The linear kernel suits data that are (nearly) linearly separable or very high-dimensional and sparse; the polynomial and RBF kernels suit non-linearly separable data, with the RBF kernel being the most common default choice.
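
A minimal sketch of this idea with scikit-learn (the data set, kernel choices, and parameters below are assumptions for illustration only):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# a toy non-linearly separable data set
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# a linear kernel struggles here, while the RBF kernel handles it well
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(x_train, y_train)
    print(kernel, clf.score(x_test, y_test))
```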

1. Among the following options, the incorrect statement about the KNN algorithm is: (D)

A. Can find K samples that are similar to the sample to be tested

B. KNeighborsClassifier in sklearn uses the Euclidean distance metric by default

C. The implementation process is relatively simple, but the interpretability is not strong

D. Very efficient

2. Among the following options, the description is incorrect. (B)

A. The acquired data sets built into sklearn are generally in dictionary format.

B. The corresponding small-scale data set can be obtained through a method similar to sklearn.datasets.fetch_*

C. The corresponding feature values and target values can be obtained through the data and target attributes of the obtained data set.

D. You can obtain the corresponding feature names and target names through the feature_names and target_names attributes of the obtained data set.

3. Regarding the data split by train_test_split(data), the correct receiving method is (B)

A. x_train, y_train, x_test, y_test

B. x_train, x_test, y_train, y_test

C. x_test, y_test, x_train, y_train

D. The order does not matter; the values can be received in any order

4. Briefly describe the advantages and disadvantages of the K nearest neighbor algorithm.

Answer:

Advantages:

① The algorithm theory is simple, easy to understand and easy to implement;

② It supports multi-class classification and has high accuracy;

③ It has high prediction accuracy and is not strongly affected by outliers.

Disadvantages:

① The calculation complexity is high and the amount of calculation is large;

② Sensitive to the value of K;

③ The interpretability of the prediction results is not strong.

5. Briefly describe the commonly used methods of feature preprocessing in feature engineering and the differences between them.

Answer:

Commonly used feature preprocessing methods include normalization and standardization.

Similarities: both transform the training data into a form that is more suitable for the algorithm model, and both eliminate the influence of scale differences between dimensions.

Differences: the implementations are different. Normalization uses the maximum and minimum values of each dimension of the sample data, and we need to specify the target range of the scaling; standardization uses the mean and standard deviation of each dimension of the sample data to compute the final scale of the data.
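
A small sketch of the difference (toy data assumed for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]])

# normalization: rescale each column into a chosen range using its min and max
print(MinMaxScaler(feature_range=(0, 1)).fit_transform(X))

# standardization: rescale each column using its mean and standard deviation
print(StandardScaler().fit_transform(X))
```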

6. Briefly describe the application scenarios of K nearest neighbor algorithm.

Answer:

① The K nearest neighbor algorithm can be used for both classification tasks and regression tasks;

② It is suitable for small and medium-sized data sets with numerical features;

③ It can be used in both binary classification and multi-class classification scenarios.

7. Briefly describe the conveniences that cross-validation and grid search can bring to us in machine learning.

Answer: Cross-validation lets us make full use of the available data to find a more credible model when training data are relatively scarce. Grid search lets us obtain, in a shorter time, hyperparameter tuning results that would otherwise take a long time to find by hand. Both improve our development efficiency in the machine learning workflow.

8. Use the iris flower data set to train the KNN classification model.

Requirements:

  1. Use sklearn’s built-in iris data set;
  2. Divide the data set, and the proportion of the verification set can be customized to ensure that the data set used by the program is the same every time;
  3. Use appropriate feature preprocessing methods to process raw data;
  4. Use cross-validation and grid search to tune hyperparameters (including but not limited to K value);
  5. Evaluate the trained model;
  6. Get the accuracy of the best performing model on the test set.
  7. Get the best performing model in cross-validation and its parameters.
  8. If conditions permit, compare the models trained by classmates to see whose model has the highest accuracy, think and discuss.

Code example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the iris data set
iris = load_iris()

# 2. Split the data set (fixed random_state so every run uses the same split)
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=666)

# 3. Feature preprocessing: standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)


# 4. Instantiate an estimator
estimator = KNeighborsClassifier()

# 5. Cross-validation and grid search
param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "p": [1, 2, 10, 20]}
estimator = GridSearchCV(estimator, param_grid=param_grid, cv=5)

# 6. Train and tune the model
estimator.fit(x_train, y_train)


# 7. Predict on the test set
y_pre = estimator.predict(x_test)
print("Predictions:\n", y_pre)

# 8. Accuracy of the best model on the test set
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)

# 9. Best score in cross-validation and the corresponding model
print("Best cross-validation score:\n", estimator.best_score_)
print("Best model found in cross-validation:\n", estimator.best_estimator_)

9. Among the following options, the incorrect statement about generalization error is: (C)

A. Generalization error refers to the error of the model on new samples.

B. Error refers to the difference between the actual output of the model and the true label of the sample

C. The purpose of machine learning is to obtain a model with larger generalization error

10. Which of the following methods can be used to alleviate the occurrence of overfitting: (B)

A. Add more features

B. Regularization

C. Increase model complexity

D. All of the above

11. Regarding regularization, which of the following statements is correct (A)

A. The solution obtained by L1 regularization is more sparse

B. L2 regularization technology is also called Lasso Regularization

C. The solution obtained by L2 regularization is more sparse

D. L2 regularization can prevent overfitting and improve the generalization ability of the model, but L1 regularization cannot do this.

12. Regarding feature selection, which of the following statements about Ridge regression and Lasso regression is correct: (B)

A. Ridge regression is suitable for feature selection

B. Lasso regression is suitable for feature selection

C. Both are suitable for feature selection

D. None of the above statements are correct

13. In linear regression, we can use the normal equation (Normal Equation) to solve for the coefficients. Which of the following statements about the normal equation is correct (multiple choices): (ABC)

A. No need to choose learning rate

B. When the number of features is large, the amount of calculation will increase

C. No need for iterative training

14. Regarding the description of gradient descent method, the correct one is: (ABC)

A. The stochastic gradient descent method uses one sample of data each time to iterate the weights

B. The calculation amount of the full gradient descent method increases as the number of samples increases.

C. The mini-batch gradient descent method combines the advantages of the stochastic gradient descent method and the full gradient descent method.

15. Regarding the saving and loading of models, which of the following statements is correct: (B)

A. Model loading allows us to seamlessly connect the already trained model with more data.

B. Saving the model can be done using the externals.joblib.dump() method in sklearn

C. Model loading can be completed using the externals.joblib.dump() method in sklearn
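
A short sketch of saving and loading a model (newer scikit-learn versions use the standalone `joblib` package rather than `sklearn.externals.joblib`, so the import below assumes that environment):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier().fit(X, y)

joblib.dump(model, "knn.pkl")      # save the trained model to disk
loaded = joblib.load("knn.pkl")    # load it back later and use it directly
print(loaded.score(X, y))
```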

16. Briefly describe the commonly used methods for model selection, their characteristics and application scenarios.

Answer:

① Commonly used methods for model selection are the hold-out method, the cross-validation method, and the bootstrap method.

② The hold-out method directly splits the data set D into two mutually exclusive sets used as the training set and the validation set respectively. The cross-validation method splits D into k mutually exclusive subsets of similar size and, in each round, uses one (or several) of them as the validation set and the rest as the training set. The bootstrap method repeatedly draws one sample at random from D, puts a copy of it into the training set, and then returns the sample to D so that it may be drawn again in later rounds; repeating this sampling m times yields the training set, and the samples that were never drawn can be used as the test set.

③ The hold-out method is generally suitable when the amount of data is large: it is simple and time-saving but sacrifices a small amount of accuracy. The cross-validation method is also applicable when the amount of data is small, since it lets us make full use of the limited data to select a more reliable model.
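
A small sketch of the bootstrap sampling described above (hypothetical toy data; m is taken equal to the data set size, as in the usual description):

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.arange(10)  # toy data set of 10 sample indices

# draw m = len(D) samples with replacement to form the training set
train_idx = rng.choice(len(D), size=len(D), replace=True)
# samples that were never drawn form the test set
test_idx = np.setdiff1d(np.arange(len(D)), train_idx)

print("train:", D[train_idx])
print("test: ", D[test_idx])
```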

17. Briefly describe what linear regression is and what kind of problems we can use linear regression to solve.

Answer:

① Linear regression is a statistical analysis method that uses regression analysis in mathematical statistics and is widely used to determine the interdependent quantitative relationship between two or more variables.

② Expression form: the case with only one independent variable is called simple regression (form: $y = wx + b$); the case with more than one independent variable is called multiple regression (form: $y = w_{1}x_{1} + w_{2}x_{2} + \cdots + b$).

③ Such a statistical model is a linear combination of one or more model parameters called regression coefficients. The linearity here means that, given the independent variables $x$ and the dependent variable $y$, we solve for a set of regression coefficients $w_{1}, w_{2}, \cdots, w_{n}$ such that the linear combination $w_{1}x_{1} + w_{2}x_{2} + \cdots + b$ approximates the underlying relationship between $x$ and $y$.

④ In machine learning, all regression problems that require determining quantitative relationships can be solved using linear regression, including but not limited to scenarios such as housing price prediction and stock return prediction.

18. Briefly describe the loss measurement method of linear regression model and how to optimize the loss.

Answer:

① If we use $X(x_{0}, x_{1}, x_{2}, \cdots, x_{n})$ to denote the features of a sample, $W(w_{0}, w_{1}, w_{2}, \cdots, w_{n})$ to denote the weight coefficients, and $y$ to denote the true target value, then we can use the least squares method to measure the error (loss) of the model over the $m$ training samples:

$$J(W) = \frac{1}{2}\sum_{i=1}^{m}\left(W^{T}x^{(i)} - y^{(i)}\right)^{2}$$

where $x^{(i)}$ and $y^{(i)}$ are the features and the true target value of the $i$-th training sample.

② To optimize this loss function we can either solve for the optimal parameters directly with the normal equation, or solve iteratively with gradient descent.
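
A minimal gradient-descent sketch for this loss (the toy data, learning rate, and iteration count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.uniform(0, 10, (100, 1))])  # bias column + one feature
y = 3.0 + 2.0 * X[:, 1] + rng.normal(0, 0.5, 100)                 # true relation y = 3 + 2x + noise

W = np.zeros(2)
lr, epochs = 0.02, 5000
for _ in range(epochs):
    grad = X.T @ (X @ W - y) / len(y)   # gradient of the least squares loss
    W -= lr * grad

print(W)                                 # should be close to [3, 2]
# the normal equation gives the closed-form solution directly:
print(np.linalg.inv(X.T @ X) @ X.T @ y)
```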

19. Briefly describe the causes and solutions of underfitting and overfitting.

Answer:

Reasons for underfitting:

① Too few training iterations;

② The model is too simple.

Solution:

① Increase the number of training iterations;

② Increase the complexity of the model, such as adding polynomial features;

Causes of overfitting:

① There are too many training sample features;

② The model is too complex.

Solution:

① Increase data cleaning efforts on original data;

② Increase the number of training samples until the number of samples is much greater than the number of features;

③ Use regularization;

④ Filter features and reduce feature dimensions;

20. Briefly describe what methods can be used to alleviate overfitting at the code level in actual development.

Answer:

① Use Ridge regression, a linear regression model with L2 regularization. It produces smooth weight coefficients, shrinking the weights of some features and reducing their influence on the model.

② Use Lasso regression, a linear regression model with L1 regularization. It produces sparse weight coefficients, driving the weights of some features directly to zero, eliminating their influence on the model and thereby performing feature selection.

③ Use an elastic net, which combines L1 and L2 regularization linearly and inherits the advantages of both; by adjusting the coefficients of the combination we can obtain models with different behavior.

④ Use early stopping: specify a threshold and, during training, stop the model promptly once the validation error falls below that threshold instead of continuing to train.
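
A small sketch comparing the three regularized models (synthetic data and arbitrary parameters, for illustration only):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# 10 features, only 3 of which are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    # Lasso / ElasticNet typically drive some coefficients exactly to zero (feature selection)
    print(type(model).__name__, np.round(model.coef_, 2))
```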

21. Use the normal equation, the linear regression model and the ridge regression model of the stochastic gradient descent optimization method to complete the prediction of Boston housing prices.

Requirements:

  1. Use sklearn’s built-in Boston housing price data set;
  2. Divide the data set, and the proportion of the verification set can be customized to ensure that the data set used by the program is the same every time;
  3. Use appropriate feature preprocessing methods to process raw data;
  4. Use ridge regression model with model selection;
  5. Evaluate each trained model, compare the effects, think and discuss.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor, RidgeCV, Ridge
from sklearn.metrics import mean_squared_error


def linear_reg_model():
    # 1. Load the data (note: load_boston was removed in scikit-learn 1.2; this example assumes an older version)
    boston = load_boston()

    # 2. Basic data handling
    # 2.1 Split the data
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=8)

    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)  # reuse the scaler fitted on the training set

    # Machine learning - linear regression solved with the normal equation
    estimator1 = LinearRegression()
    estimator1.fit(x_train, y_train)

    print("Bias of the normal-equation model:\n", estimator1.intercept_)
    print("Coefficients of the normal-equation model:\n", estimator1.coef_)

    # Model evaluation - normal equation
    y_pre = estimator1.predict(x_test)
    ret = mean_squared_error(y_test, y_pre)
    print("Mean squared error of the normal-equation model:\n", ret)

    # Machine learning - linear regression solved with stochastic gradient descent
    estimator2 = SGDRegressor(max_iter=1000)
    estimator2.fit(x_train, y_train)

    print("Bias of the SGD model:\n", estimator2.intercept_)
    print("Coefficients of the SGD model:\n", estimator2.coef_)

    # Model evaluation - stochastic gradient descent
    y_pre = estimator2.predict(x_test)
    ret = mean_squared_error(y_test, y_pre)
    print("Mean squared error of the SGD model:\n", ret)

    # Machine learning - ridge regression
    # estimator = Ridge(alpha=1.0)
    estimator3 = RidgeCV(alphas=(0.001, 0.01, 0.1, 1, 10, 100))
    estimator3.fit(x_train, y_train)

    print("Bias of the cross-validated ridge regression model:\n", estimator3.intercept_)
    print("Coefficients of the cross-validated ridge regression model:\n", estimator3.coef_)

    # Model evaluation - ridge regression
    y_pre = estimator3.predict(x_test)
    ret = mean_squared_error(y_test, y_pre)
    print("Mean squared error of the cross-validated ridge regression model:\n", ret)
    print("Best alpha found in cross-validation:\n", estimator3.alpha_)


linear_reg_model()


22. Suppose there are N samples, half of which are used for training and half of which are used for testing. If the value of N is increased, how will the gap between training error and test error change? (B)

A. increase

B. reduce

Answer analysis:
If you add more data, you can effectively alleviate overfitting and reduce the gap between the training sample error and the test sample error.

23. Among the following options, the incorrect statement about logistic regression is: (B)

A. Logistic regression is a classification algorithm

B. Logistic regression uses the idea of regression

C. Logistic regression is a classification model

D. Logistic regression uses the sigmoid function as the activation function to map the regression results.

24. Which of the following descriptions of the evaluation methods of classification models is incorrect: (B)

A. We often comprehensively evaluate classification models through multiple evaluation indicators.

B. Accuracy is precision

C. The sample populations referenced by precision rate and recall rate are different

D. AUC is only suitable for evaluating classification models in binary classification scenarios.

25. Which of the following descriptions of the imbalanced sample category scenario is correct: (A)

A. Imbalanced sample categories will affect the final results of the classification model

B. We do not have a better solution in the scenario of unbalanced sample categories

C. Undersampling is to copy samples with a smaller number of categories to expand the sample set.

D. Oversampling will cause the loss of some information in the data set.

26. Regarding information gain and decision tree splitting nodes, which of the following statements is correct (multiple choice) (BC)

A. Nodes with high purity require more information to distinguish

B. Information gain can be obtained as "entropy(before split) - entropy(after split)"

26. We want to train a decision tree model on a large data set. In order to use less time, we can: (C)

A. Increase the depth of the tree

B. Increase the learning rate

C. Reduce tree depth

D. Reduce the number of trees

27. Assume that the sample categories used in model training are very unbalanced, and the main categories account for 99% of the training data. Now your model has an accuracy of 99% on the training set, then which of the following statements is correct (multiple choices) ? (AC)

A. Accuracy is not suitable for measuring imbalanced class problems

B. Accuracy is suitable for measuring imbalanced category problems

C. Precision and recall are suitable for measuring imbalanced class problems

D. Precision and recall are not suitable for measuring imbalanced class problems

28. In which of the following situations would information gain rate be preferable to information gain? (A)

A. When the number of attribute categories is particularly large

B. When the number of attribute categories is particularly small

C. It has nothing to do with the number of attribute categories

29. Briefly describe the characteristics of logistic regression.

Answer:

① It is a classification algorithm;
② It is a generalized linear model;
③ It uses the sigmoid function to map the result of the regression computation so that the final value falls within the range (0, 1);
④ It is often used in binary classification scenarios, where it performs very well.

30. Briefly describe the loss function and optimization method of logistic regression.

Answer:

① The loss function of logistic regression is the log-likelihood loss; the loss is reduced by increasing the predicted probability of the class each sample actually belongs to.

② The optimization of logistic regression is similar to that of linear regression: gradient descent can be used to quickly locate the optimal solution.
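
A tiny sketch of the log-likelihood loss (toy values for illustration; scikit-learn's `log_loss` computes the same quantity):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.8, 0.6])   # predicted probability of class 1

# log-likelihood loss: -[y*log(p) + (1-y)*log(1-p)], averaged over samples
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(manual, log_loss(y_true, y_prob))    # the two values match
```
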
31. Briefly describe the model evaluation method and sampling method in the scenario of imbalanced sample categories.

Answer:

① When the sample classes are imbalanced, accuracy alone cannot tell whether a model is good. In that case we can use the confusion matrix to compute precision, recall, F1-score and other indicators to evaluate the model comprehensively; in binary classification scenarios we can also draw the ROC curve and use the model's AUC value for evaluation.

② Class imbalance has a large impact on both the results and the evaluation of a classification model. We can rebalance the data set by undersampling or oversampling. In general, oversampling is preferred; undersampling discards part of the data and may cause the model to underfit, so it is rarely used.
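
A short evaluation sketch on an imbalanced data set (synthetic data assumed for illustration; the point is that overall accuracy can look high while per-class precision/recall and AUC tell the real story):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# roughly 95% negative / 5% positive samples
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
print(classification_report(y_test, clf.predict(x_test)))             # precision / recall / F1 per class
print("AUC:", roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1]))
```
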
32. Briefly describe the principle and construction process of decision trees.

Answer:

A decision tree uses the ID3 algorithm (information gain), the C4.5 algorithm (information gain ratio) or the CART algorithm (Gini index) to compute how important each feature is under different conditions. It uses a tree structure in which each internal node represents a test condition, each branch represents one outcome of that test, and each leaf node represents a classification result; the whole tree is a decision logic made up of multiple decision nodes.

33. Briefly describe the implementation, advantages and disadvantages of the ID3 algorithm, C4.5 algorithm, and CART algorithm respectively.

answer:

① The ID3 algorithm uses information gain as the basis for judging the importance of features: the greater the information gain, the more information the feature brings and the more important it is. However, for features with a large number of distinct values the information gain tends to be overestimated, so the result can be inaccurate.
The formula for information gain is: G(D, a) = H(D) - H(D|a)

② The C4.5 algorithm inherits the advantages of the ID3 algorithm and alleviates its disadvantage: when measuring the importance of a feature, the information gain is divided by the feature's intrinsic value (Intrinsic Value, IV). The intrinsic value is related to the number of distinct values of the feature; the more values, the larger the intrinsic value, which acts as a "penalty" on top of the information gain so that features with too many values no longer receive an exaggerated gain that distorts the result.
The formula for the information gain ratio is: Gain_ratio(D, a) = Gain(D, a) / IV(a)
where IV(a) is the entropy of the value distribution of feature a.

The C4.5 algorithm implements post-pruning internally, which also brings certain usage restrictions: post-pruning has to wait until the decision tree has been fully built, then traverse every node, compute each node's importance with a cost-complexity criterion, and delete the unimportant nodes, which consumes a lot of memory.

③ The CART algorithm uses a more concise calculation method, the Gini index, to measure the importance of features. The smaller the Gini index, the greater the amount of information and the higher the importance. It can be used to solve both classification and regression problems. No logarithmic operation is introduced in the calculation, making the calculation more convenient.

The Gini value represents the probability that two samples drawn at random from the data set D belong to different classes (i.e., the probability that a sample would be misclassified). Its formula is:

$$Gini(D) = 1 - \sum_{k=1}^{n} p_k^{2}$$

where D is the data set, n is the number of classes in D, and $p_k$ is the proportion of samples belonging to the k-th class.

The Gini index, also called Gini impurity, reflects the probability of a sample being selected times the probability of it being misclassified; for a feature a it is the Gini value of each subset weighted by the proportion of samples falling into that subset:

$$Gini\_index(D, a) = \sum_{v=1}^{V} \frac{|D^{v}|}{|D|}\, Gini(D^{v})$$

where V is the number of distinct values of feature a and $D^{v}$ is the subset of samples of D taking the v-th value.

Another reason why the CART algorithm is relatively efficient is that the trees it constructs are all binary trees, which simplifies the tree structure.
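
A small sketch of these impurity measures on a toy label array (assumed data, just to show the formulas in code):

```python
import numpy as np

def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(labels):
    p = np.bincount(labels) / len(labels)
    return 1 - np.sum(p ** 2)

labels = np.array([1, 1, 1, 0, 0, 1, 0, 1])
print(entropy(labels), gini(labels))
```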

34. Briefly describe the purpose and common methods of feature extraction in feature engineering.

Answer:

① Feature extraction converts feature data that follow certain patterns into numeric types that an algorithm model can recognize more easily; in text processing in particular, the TF-IDF technique is used very often.

② Commonly used feature extraction methods are dictionary feature extraction and text feature extraction. Dictionary feature extraction requires the feature data to be in dictionary format, otherwise the features cannot be converted; text feature extraction requires the text to be tokenized first, and the most widely used technique is TF-IDF, which characterizes text features by computing how important each word is.
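
A brief sketch of both extraction methods with scikit-learn (toy records and documents assumed; for Chinese text a tokenizer such as jieba would be applied before TF-IDF):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# dictionary feature extraction: each sample is a dict of feature -> value
records = [{"city": "Beijing", "temp": 30}, {"city": "Shanghai", "temp": 27}]
dv = DictVectorizer(sparse=False)
print(dv.fit_transform(records))
print(dv.feature_names_)

# text feature extraction with TF-IDF: weights reflect word importance across documents
docs = ["machine learning is fun", "learning python is fun too"]
tv = TfidfVectorizer()
print(tv.fit_transform(docs).toarray())
print(tv.vocabulary_)
```
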
35. The figure below lists 17 watermelons together with their feature values. Using the ID3 algorithm of decision trees, calculate which feature should be used first to judge whether a melon is a good melon.

[Figure: watermelon data set (samples with their feature values and good/bad labels); original image not available]

Answer:

Using the ID3 algorithm:

The overall entropy is:

$$H(\text{good melon}) = -\frac{8}{17}\log_2\frac{8}{17} - \frac{9}{17}\log_2\frac{9}{17} = 0.998$$

Now suppose the feature "color" is known. The proportions of its values are: "green" 6/17, "light white" 5/17, and "dark" 6/17.

The entropy of the melons whose color is "green" is:

$$H(\text{green}) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1$$

Similarly, the entropies for "light white" and "dark" are:

$$H(\text{light white}) = -\frac{1}{5}\log_2\frac{1}{5} - \frac{4}{5}\log_2\frac{4}{5} = 0.722$$

$$H(\text{dark}) = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} = 0.918$$

Then the information gain of the feature "color" is:

$$G(\text{good melon} \mid \text{color}) = 0.998 - \frac{6}{17}\times 1 - \frac{5}{17}\times 0.722 - \frac{6}{17}\times 0.918 = 0.109$$

In the same way we can compute the information gain of the other features:

$$G(\text{good melon} \mid \text{root}) = 0.143$$

$$G(\text{good melon} \mid \text{knocking sound}) = 0.141$$

$$G(\text{good melon} \mid \text{texture}) = 0.381$$

$$G(\text{good melon} \mid \text{navel}) = 0.289$$

$$G(\text{good melon} \mid \text{touch}) = 0.006$$

Since "texture" has the largest information gain, that feature should be used first to identify good melons.
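
The numbers above can be checked with a few lines of Python (only the counts from the worked example are used):

```python
from math import log2

def H(*counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)

H_D = H(8, 9)                                                 # overall entropy
gain_color = H_D - (6/17)*H(3, 3) - (5/17)*H(1, 4) - (6/17)*H(4, 2)
print(round(H_D, 3), round(gain_color, 3))   # ~0.998 and ~0.108, matching the values above up to rounding
```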

36. The symbol sets a, b, c, and d are independent of each other, and the corresponding probabilities are 1/2, 1/4, 1/8, and 1/16 respectively. The symbol with the smallest amount of information is: (A)

A. a

B. b

C. c

D. d

Answer analysis:

The greater the probability, the more certain the matter is, and the less information it brings to us; conversely, the smaller the probability, the more uncertain the matter is, and the greater the amount of information it brings to us.

According to the formula for information content (self-information):

$$H(x_i) = -\log_2 p_i$$

we get:

$$H(a) = -\log_2\frac{1}{2} = 1\ \text{bit}$$

$$H(b) = -\log_2\frac{1}{4} = 2\ \text{bits}$$

$$H(c) = -\log_2\frac{1}{8} = 3\ \text{bits}$$

$$H(d) = -\log_2\frac{1}{16} = 4\ \text{bits}$$

So the symbol carrying the least information is a.

37. Please explain the similarities and differences between logistic regression and linear regression.

Answer:

Logistic regression, judging from its name, seems to be a close relative of the linear regression problem in statistics, but in essence the two are quite different.
First of all, logistic regression deals with classification problems while linear regression deals with regression problems; this is the most essential difference between the two. In logistic regression the dependent variable follows a binary (Bernoulli) distribution, and what the model learns is $E[y \mid x; \theta]$, i.e. the expectation of the dependent variable given the independent variables and the parameters, and this expectation is used to make classification predictions. What linear regression actually solves is $y' = \theta^{T}x$, an approximation of the assumed true relationship $y = \theta^{T}x + \epsilon$, where $\epsilon$ is an error term; this approximation is used to handle the regression problem.
Classification and regression are two different tasks in today's machine learning, and the fact that logistic regression, a classification algorithm, carries "regression" in its name has historical reasons: the method was proposed by the statistician David Cox in his 1958 paper "The regression analysis of binary sequences", when the definitions of regression and classification differed somewhat from today's, and the name stuck. In fact, rearranging the logistic regression formula gives $\log\frac{p}{1-p} = \theta^{T}x$, where $p = P(y=1 \mid x)$ is the predicted probability that a given input $x$ is a positive sample. If the odds of an event are defined as the ratio of the probability that it occurs to the probability that it does not, $\frac{p}{1-p}$, then logistic regression can be viewed as a linear regression on the log-odds of "$y = 1 \mid x$".
This leads to the biggest difference between logistic regression and linear regression: the dependent variable in logistic regression is discrete, while in linear regression it is continuous. With the independent variables $x$ and the parameters $\theta$ given, logistic regression can be regarded as the special case of generalized linear models in which the dependent variable $y$ follows a binary distribution; when we solve linear regression with least squares, we assume that $y$ follows a normal distribution.
Of course, logistic regression and linear regression also have similarities. First, both can be viewed as using maximum likelihood estimation to model the training samples: the least squares method used in linear regression is a simplification of maximum likelihood estimation under the assumption that, given $x$ and $\theta$, the dependent variable $y$ follows a normal distribution; logistic regression learns the optimal parameters $\theta$ by maximizing the likelihood $L(\theta) = \prod_{i=1}^{N} P(y_i \mid x_i; \theta) = \prod_{i=1}^{N} (\pi(x_i))^{y_i} (1-\pi(x_i))^{1-y_i}$. In addition, both can use gradient descent when solving for the parameters, which is a common point shared by many supervised learning methods.

38. You used random forest to generate hundreds of trees (T1, T2, …, Tn), and then synthesized the prediction results of these trees. Which of the following statements is correct: (D)

1. Each tree is constructed from a subset of all data

2. The sample data learned by each tree is obtained through random sampling with replacement.

3. Each tree is constructed from a subset of the data set and a subset of features.

4. Each tree is constructed from all data

A. 1 and 2

B. 2 and 4

C. 1, 2 and 3

D. 2 and 3

39. Briefly describe what ensemble learning is and what kind of problems it solves.

Answer:

① Ensemble learning, as the name suggests, uses multiple weak learners to build a strong learner so as to obtain better generalization performance.
② There are two relatively mature lines of thought in ensemble learning: bagging and boosting. Bagging focuses on alleviating model overfitting and improving generalization ability; boosting focuses on alleviating model underfitting, improving generalization ability while also making better use of the data set.

40. Briefly describe the idea of ​​bagging algorithm.

Answer:

Bagging is short for Bootstrap Aggregating. It relies on the idea of bootstrap sampling (random sampling with replacement): each base learner is trained on a different sample of the data, yet all of those samples come from the same overall data set, and the final output of the ensemble is obtained by combining the results of all the base learners. Random forest (RandomForest), for example, is an ensemble algorithm based on the bagging idea, with the decision tree as its base learner.
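
A short sketch of bagging in practice (scikit-learn's RandomForestClassifier on a toy data set; the parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# each tree is trained on a bootstrap sample and considers a random subset of features at each split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```
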
41. Briefly describe why the training samples in random forests need to be randomly sampled.

Answer:

Random sampling ensures that the data each decision tree learns from is different, which avoids overly strong correlation between the weak learners. If every decision tree learned from exactly the same feature data, every tree built would be identical, which would run counter to the original intent of bagging.

42. Briefly describe the advantages and disadvantages of the random forest algorithm (RF).

Answer:

Advantages:
① Training can be highly parallelized, which is an advantage for training speed on the large samples of the big-data era. This is the main advantage.

② Because the features used to split decision tree nodes can be selected at random, the model can still be trained efficiently even when the feature dimensionality of the samples is very high.

③ After training, the importance of each feature for the output can be reported.

④ Because of random sampling, the trained model has low variance and strong generalization ability.

⑤ It is not sensitive to some missing feature values.

Disadvantages:

① On sample sets with relatively high noise, the RF model is prone to overfitting.

② Features with many distinct values tend to have a larger influence on RF's decisions and thus affect the quality of the fitted model.

43. Suppose you are working with categorical features and have not looked at the distribution of the categorical variable in the test set. You now want to apply one-hot encoding (OHE) to the categorical features. What difficulties might you face when applying OHE fitted on the training set to the categorical variable? (D)

A. Not all categories of the categorical variable appear in the test set

B. The frequency distribution of categories is different in the training set and the test set

C. The training set and the test set usually have the same distribution

D. Both A and B are correct

Answer analysis:

Both A and B are correct. If a category appears in the test set but did not appear in the training set, OHE will not be able to encode that category; this is the main difficulty of applying OHE. Option B is also correct: when applying OHE, if the frequency distributions of the training set and the test set differ, we need to be extra careful, because the final results may be biased.
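
A short sketch of the difficulty and one common mitigation in scikit-learn (the `handle_unknown="ignore"` option avoids errors on categories unseen during fitting; the data are toy examples):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["red"]])
test = np.array([["green"], ["blue"]])      # "blue" never appears in the training set

enc = OneHotEncoder(handle_unknown="ignore").fit(train)
print(enc.transform(test).toarray())        # the unseen category becomes an all-zero row
```
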
44. Please explain what bias and variance are in machine learning.

Answer:

In supervised learning, the generalization error of a model comes from two sources: bias and variance. Specifically, they are defined as follows:

① Bias refers to the deviation between the expected (average) output of all models trained on training sets of size m sampled from the population and the output of the true model.

Bias is usually caused by making wrong assumptions about the learning algorithm, for example assuming the model is a linear function when the true model is a quadratic function.
The error caused by bias is usually reflected in the training error.

② Variance refers to the variance of the outputs of all models trained on training sets of size m sampled from the population.

Variance is usually caused by the model being too complex relative to the number of training samples m, for example assuming a polynomial model of degree up to 200 when there are only 100 training samples in total.

The error caused by variance is usually reflected in the increase of the test error relative to the training error.

The definitions above are precise but not very intuitive. To understand bias and variance more clearly, we use a target-shooting example to further describe the difference and connection between the two.

Suppose one shot corresponds to a machine learning model making a prediction on one sample: hitting the bull's-eye means the prediction is accurate, and the further a shot lands from the bull's-eye, the larger the prediction error. We draw n training sets of size m, train n models, and let each predict the same sample, which is equivalent to shooting n times. The shooting results are shown in the figure below:
[Figure: target-shooting illustration of the four combinations of low/high bias and low/high variance; original image not available]

The result we hope for most is the one in the upper left: the shots are both accurate and concentrated, showing that the model has small bias and small variance. In the upper right, the shots are centred around the bull's-eye but scattered, showing that the bias is small but the variance is large; similarly, the lower left shows small variance but large bias, and the lower right shows both large variance and large bias.
