Alibaba Cloud Tianchi Competition - Analysis of Machine Learning Questions (Question 1)

5. Model validation
5.1 Concepts and methods of model evaluation
(1) Underfitting and overfitting
When a model captures the underlying relationship in the data appropriately, we say the model fits well.
Underfitting, also known as high bias, means that the trained model cannot fully capture the relationship in the data.
Overfitting, also called high variance, means that the trained model captures the training data too closely; at this point it is likely to be fitting the noise in the data as well as the signal.
(2) Generalization and regularization of the model
Generalization refers to how well the concepts learned by a machine learning model carry over to samples not encountered during training, that is, the model's ability to handle new samples.
Regularization adds rules (constraints, penalties) to the objective function being optimized, in order to prevent overfitting.
L1 and L2 regularization (L1/L2 Regularization) use the L1 norm and L2 norm, respectively, as the regularization term.
The L1 norm is the sum of the absolute values of the vector elements; the L2 norm is the square root of the sum of the squared elements; and, more generally, the Lq norm is the sum of the q-th powers of the absolute element values, raised to the power 1/q.
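In standard notation (added here for clarity), for a vector x:

$$\|x\|_1=\sum_i |x_i|,\qquad \|x\|_2=\Big(\sum_i x_i^2\Big)^{1/2},\qquad \|x\|_q=\Big(\sum_i |x_i|^q\Big)^{1/q}$$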
Adding a regularization term to the original loss function makes the model generalize better.
A linear model with an L2-norm penalty on the parameter space is called Ridge Regression, and a linear model with an L1-norm penalty on the parameter space is called LASSO Regression.
The difference between ridge regression and LASSO regression: the L2 penalty only shrinks coefficients toward zero, so ridge regression rarely reduces to a straight line, while the L1 penalty tends to drive some coefficients exactly to zero, so LASSO regression is more inclined to produce a straight line (a sparse model).
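As a minimal sketch (assuming training arrays train_data and train_target as used later in this article; sklearn's Ridge and Lasso estimators):

from sklearn.linear_model import Ridge, Lasso
# alpha controls the strength of the regularization penalty
ridge = Ridge(alpha=1.0).fit(train_data, train_target)  # L2 penalty
lasso = Lasso(alpha=0.1).fit(train_data, train_target)  # L1 penalty
print(ridge.coef_)  # shrunk toward zero, rarely exactly zero
print(lasso.coef_)  # some coefficients exactly zero (sparse)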
(3) Evaluation metrics for regression models and how to call them
There are four common metrics for evaluating regression models: mean absolute error, mean squared error, root mean squared error, and the R-squared value.
①Mean absolute error
The mean absolute error (MAE) is the mean of the absolute differences between the predicted values and the true values.
The code for calculating MAE is as follows:

from sklearn.metrics import mean_absolute_error
score=mean_absolute_error(y_test,y_pred)

②Mean square error
Mean squared error (MSE) is the expected value of the squared difference between the estimated value and the true value. The smaller the MSE, the more accurately the prediction model describes the experimental data.

from sklearn.metrics import mean_squared_error
score=mean_squared_error(y_test,y_pred)

③Root mean square error
The root mean squared error (RMSE) is the square root of the MSE.

from sklearn.metrics import mean_squared_error
import numpy as np  # Sqrt is not a Python built-in; use numpy's sqrt
Pred_Error=mean_squared_error(y_test,y_pred)
score=np.sqrt(Pred_Error)

④ R square value
The R-squared value (R-Squared) reflects the proportion of the variation in the dependent variable that the regression model explains, i.e., how well the model fits the observed values.

from sklearn.metrics import r2_score
score=r2_score(y_test,y_pred)

(4) Cross-validation
Cross-validation is a statistical method for assessing the performance of a model. The basic idea is to split the original data into groups, using one part as a training set and another part as a validation set.
Commonly used cross-validation methods include simple cross-validation, K-fold cross-validation, leave-one-out cross-validation, and leave-P-out cross-validation.
① Simple cross-validation
Simple cross-validation randomly splits the original data into two groups, one used as the training set and the other as the validation set; typically 30% of the data is set aside as test data.

# Split the data set
from sklearn.model_selection import train_test_split
# 70% of the data for training, 30% for validation
train_data,test_data,train_target,test_target=train_test_split(train,target,test_size=0.3,random_state=0)

②K-fold cross-validation
K-fold cross-validation divides the original data into K groups (usually of equal size); each subset in turn serves as the validation set while the remaining K-1 subsets form the training set. The average of the K validation-set accuracies is taken as the performance index of the classifier. K is usually set to 3 or more.

from sklearn.model_selection import KFold
# KFold(n_splits, shuffle, random_state)
# n_splits: number of folds; shuffle: whether to shuffle before splitting;
# random_state: seed controlling the shuffling
kf=KFold(n_splits=10)
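A minimal usage sketch (assuming a feature array X and label array y, as numpy arrays):

from sklearn.model_selection import KFold
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit and evaluate a model on each of the 10 folds here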

③Leave-one-out cross-validation
In leave-one-out cross-validation (LOO-CV), each training set consists of all samples except one, and the held-out sample forms the test set. For a data set of N samples this yields N different training sets and N different test sets, so LOO-CV trains N models; the average accuracy over the N single-sample test sets is used as the performance index of the classifier.

from sklearn.model_selection import LeaveOneOut
loo=LeaveOneOut()

④ Leave-P-out cross-validation
Leave-P-out cross-validation is similar to leave-one-out. It removes P samples from the complete data set to form the test set, generating all possible training and test sets: for N samples there are C(N, P) training-test pairs, with N−P samples used for training and P for testing each time.

from sklearn.model_selection import LeavePOut
lpo=LeavePOut(p=5)

5.2 Model parameter adjustment
(1) Parameter tuning
Parameter tuning means adjusting the parameters of the model to find the setting that optimizes model performance. The parameters can be divided into two categories: process-influencing parameters and submodel-influencing parameters. Process-influencing parameters, such as the "number of submodels (n_estimators)" and the "learning rate (learning_rate)", are tuned while the submodels themselves stay unchanged; submodel-influencing parameters, such as the "maximum tree depth (max_depth)" and the "split criterion (criterion)", change the performance of each submodel. The training process of bagging aims to reduce variance, while that of boosting aims to reduce bias; the process-influencing parameters can cause large changes in overall model performance.
(2) Grid search
Grid search loops over all candidate parameter values and tries every combination; the combination that performs best gives the final parameters. Example:



from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
iris=load_iris()
X_train,X_test,y_train,y_test=train_test_split(iris.data,
                                               iris.target,
                                               random_state=0)
print("Size of training set:{} size of testing set:{}".format(
    X_train.shape[0],X_test.shape[0]))
#### grid search start
best_score=0
for gamma in [0.001,0.01,0.1,1,10,100]:   # candidate range for gamma
	for C in [0.001,0.01,0.1,1,10,100]:   # candidate range for C
		# train one support vector machine for every parameter combination
		svm=SVC(gamma=gamma,C=C)
		svm.fit(X_train,y_train)
		score=svm.score(X_test,y_test)
		if score>best_score: # keep the best-performing parameters
			best_score=score
			best_parameters={'gamma':gamma,'C':C}
#### grid search end
print("Best score:{:.2f}".format(best_score))   # print the best accuracy
print("Best parameters:{}".format(best_parameters))  # print the best parameters

(3) Learning curve
The learning curve plots the accuracy on the training set and on the cross-validation set as the size of the training set varies. It shows how the model performs on new data, lets us judge whether the model's variance or bias is too high, and indicates whether enlarging the training set can reduce overfitting. (In the referenced figure, the left panel showed high bias and the right panel low bias.)
(4) Validation curve
Unlike the learning curve, the horizontal axis of the validation curve (or complexity curve) is a range of values of some hyperparameter, so the model's accuracy is compared under different parameter settings rather than different training-set sizes (which form the horizontal axis of the learning curve). As a hyperparameter (a parameter that must be set manually) changes, the model may pass from underfitting through a good fit to overfitting, and an appropriate setting can then be chosen to improve the performance of the model.
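Both curves can be computed with sklearn; a minimal sketch (assuming a feature array X and label array y):

import numpy as np
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.svm import SVC

# learning curve: scores as the training-set size grows
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# validation curve: scores as one hyperparameter (here gamma) varies
train_scores, valid_scores = validation_curve(
    SVC(), X, y, param_name='gamma',
    param_range=np.logspace(-6, -1, 5), cv=5)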

6. Feature optimization
Features can be optimized by constructing synthetic features, applying simple transformations to features, using decision trees to create new features, combining features, and so on.
(1) Synthetic features
Synthetic features are features that are not among the input features but are derived from one or more input features.
(Features created separately by standardization or scaling are not synthetic features, but feature transformations. For details, see the previous article)
Synthetic features include the following types: multiplying a feature by itself or by other features (called a feature cross), dividing one feature by another, and bucketing (binning) a continuous feature into multiple intervals.
(2) Simple transformation of features
① Transformation and combination of numerical features:
polynomial features, proportional features, absolute value, maximum value, minimum value
② Combination of categorical features and numerical features
Let N1 and N2 denote numerical features and C1 and C2 denote categorical features; commonly derived statistics include the following (a pandas sketch follows the list):
median: median(N1)_by (C1)
arithmetic mean: mean(N1)_by (C1)
mode: mode (N1)_by (C1)
min: min(N1)_by (C1)
max: max(N1)_by (C1)
standard deviation: std(N1)_by (C1)
variance: var(N1)_by (C1)
Frequency: freq (C2)_by (C1)
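A sketch of these statistics with pandas (hypothetical DataFrame df with columns N1, C1, C2):

import pandas as pd
# per-category statistics of numerical feature N1 grouped by C1
df['mean_N1_by_C1'] = df.groupby('C1')['N1'].transform('mean')
df['std_N1_by_C1'] = df.groupby('C1')['N1'].transform('std')
df['max_N1_by_C1'] = df.groupby('C1')['N1'].transform('max')
# frequency of each C2 value within each level of C1
df['freq_C2_by_C1'] = (df.groupby(['C1', 'C2'])['C2'].transform('size')
                       / df.groupby('C1')['C2'].transform('size'))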
(3) Create new features with decision tree
In decision-tree family algorithms (single decision tree, GBDT, random forest), every sample is mapped to a leaf of each tree. We can therefore take the leaf indices assigned by each tree, as natural numbers or dummy-coded into a sparse vector, and add them to the model as new features.
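A sketch of the idea (hypothetical variable names; a random forest's apply() returns, for each sample, the index of the leaf it lands in for every tree):

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
leaves = rf.apply(X_train)                # leaf indices, shape (n_samples, n_trees)
enc = OneHotEncoder(handle_unknown='ignore')
new_features = enc.fit_transform(leaves)  # sparse dummy-coded leaf features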
(4) Feature combination
Feature combination refers to a composite feature formed by combining (multiplying or Cartesian product) individual features, which helps to represent nonlinear relationships.
① Coding the nonlinear law
[A x B]: a combination of features formed by multiplying the values ​​of two features
[A x B x C x D x E]: A combination of features formed by multiplying the values ​​of five features
[A x A]: A combination of features formed by squaring the values ​​of a single feature
② Crossing one-hot vectors
The feature cross of two one-hot vectors can be viewed as a logical conjunction (AND) of the corresponding categories.
③ Using bucketed feature columns to train the model
Bucketing divides a continuous numerical feature into different buckets (bins) according to some rule; it can be understood as a way of discretizing a continuous feature (grouping).
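A bucketing sketch with pandas (hypothetical column name 'age' in DataFrame df):

import pandas as pd
# equal-width binning into 4 buckets; pd.qcut would give equal-frequency buckets
df['age_bucket'] = pd.cut(df['age'], bins=4, labels=False)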
7. Model fusion
Model optimization methods:
① Study the model's learning curve to judge whether the model is overfitting or underfitting, and adjust accordingly.
② Analyze the model's weight parameters; for features whose weights have very high or very low absolute values, more detailed work can be done (parameter tuning) or feature combinations can be built (new features).
③ Perform Bad-Case analysis: examine the wrongly predicted examples to determine whether something in the pipeline should be modified.
④ Perform model fusion (also called ensemble learning).
Model fusion first generates a set of individual learners and then combines them with some strategy to strengthen the overall effect. Model fusion and improvement techniques fall into two categories:
① Parallel methods, in which the individual learners have no strong dependencies on one another and can be generated simultaneously; representative examples are the Bagging method and random forests.
② Serial methods, in which strong dependencies exist between the individual learners, which must therefore be generated sequentially; the representative example is the Boosting method.
7.1 Model fusion and improvement techniques
(1) Bagging and random forests
The Bagging method draws the sub-training set needed by each base model by sampling from the training set, then combines the predictions of all base models to produce the final prediction. Bagging uses bootstrap sampling: samples are drawn uniformly at random with replacement, and the outputs of the base models are averaged. Random forests improve on Bagging in two ways: the base learner is restricted to decision trees, and besides perturbing the samples as Bagging does, perturbation is also added to the attributes, which amounts to introducing random attribute selection into the decision-tree learning process. Random forest sample code:






from sklearn.metrics import mean_squared_error # evaluation metric
from sklearn import ensemble  # sklearn ensemble-learning models
# define the random forest model: 200 trees, random seed 1234
clf=ensemble.RandomForestRegressor(n_estimators=200, random_state=1234)
# fit the model on the training-set features and targets
clf.fit(train_data,train_target)
# feed the test-set features into the model to get predictions
test_pred=clf.predict(test_data) 
# compute the model's MSE score
score=mean_squared_error(test_target,test_pred)
print("RandomForestRegressor: ",score)

(2) Boosting method
The training process of the Boosting method is staged: the base models are trained one by one in sequence (each new base model depends on the ones before it), the training set is transformed according to some strategy before each round, and the predictions of all base models are finally combined linearly to produce the final prediction.
Well-known Boosting algorithms include the AdaBoost algorithm and the Boosting Tree family of algorithms; the most widely used is the gradient boosting tree (GBDT).
Gradient boosted tree example code:

from sklearn.metrics import mean_squared_error # evaluation metric
from sklearn import ensemble  # sklearn ensemble-learning models
# define the GBDT model: 200 trees, random seed 1234
clf=ensemble.GradientBoostingRegressor(n_estimators=200, random_state=1234)
# fit the model on the training-set features and targets
clf.fit(train_data,train_target)
# feed the test-set features into the model to get predictions
test_pred=clf.predict(test_data) 
# compute the model's MSE score
score=mean_squared_error(test_target,test_pred)
print("GradientBoostingRegressor: ",score)

A more detailed treatment of ensemble-learning models can be found at: https://blog.csdn.net/u013166817/article/details/84913372
7.2 Prediction result fusion strategy
(1) Voting
Voting comes in two flavors, hard voting and soft voting. Its principle is majority rule, and it can be used for classification problems.
Hard voting: the models vote directly, and the class receiving the most votes is the final prediction.
Soft voting: the same principle as hard voting, with weighting added, so different models can be given different weights that reflect their relative importance.
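A sketch with sklearn's VotingClassifier (hypothetical base models; SVC needs probability=True for soft voting):

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

clf = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('svc', SVC(probability=True))],
    voting='soft',        # 'hard' = majority vote on predicted classes
    weights=[1, 2, 1])    # optional per-model weights for soft voting
clf.fit(X_train, y_train)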
(2) Averaging and Ranking
Averaging takes the average of the model results as the final prediction; a weighted average can also be used. Ranking follows the same idea as Averaging but averages the models' rankings instead; if there are weights, the weighted sum of the n models' rank ratios is computed and taken as the final result.
(3) Blending
Blending divides the original training set into two parts, for example 70% of the data as a new training set and the remaining 30% as a holdout set.
(4) Stacking
The basic principle of Stacking is to have every trained base model predict on the training set: the j-th base model's prediction for the i-th training sample becomes the j-th feature value of the i-th sample in a new training set, and the final model is then trained on this new training set.
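A stacking sketch using sklearn's StackingRegressor (hypothetical base models; reuses the train_data/train_target names from the earlier code):

from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(n_estimators=100)),
                ('gbdt', GradientBoostingRegressor(n_estimators=100))],
    final_estimator=Ridge(),  # meta-model trained on the base models' predictions
    cv=5)                     # out-of-fold predictions form the new training set
stack.fit(train_data, train_target)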






7.3 Other improvement methods
By analyzing the weights or feature importances, you can pinpoint the important data, fields, and related feature directions and keep refining in those directions; finding more data along such a direction, or building related feature combinations, can improve the model's performance.
Bad-Case analysis effectively locates the sample points whose predictions are inaccurate; tracing them back through the data to find the underlying causes suggests ways to improve the model's accuracy.

Source: blog.csdn.net/weixin_47970003/article/details/123589781