Alibaba Cloud Tianchi Competition - Analysis of Machine Learning Questions (Question 1) Part 1

This article follows the book "Analysis of Competition Questions in the Aliyun Tianchi Competition"; reading it alongside is recommended for better results.
1. Competition question understanding
(1) Competition background
The basic principle of thermal power generation is that fuel is burned to produce steam, the steam drives a steam turbine, and the turbine drives a generator to produce electricity. The core factor affecting the efficiency of thermal power generation is the combustion efficiency of the boiler (the amount of steam produced per unit time). Combustion efficiency is influenced by many factors, including adjustable boiler parameters such as fuel feed rate, primary and secondary air, induced draft, return air, and feed-water volume, as well as boiler operating conditions such as bed temperature, bed pressure, furnace temperature and pressure, and superheater temperature.
The goal of this competition is to predict the amount of steam generated from the data collected by the given boiler sensors (combustion rate, boiler operating conditions, etc.).
(2) Data overview
According to the training data provided for the competition on the Alibaba Cloud Tianchi official website, there are 38 feature variables (field names V0~V37) and 1 target variable (field name target).
(3) Evaluation index
Prediction error is measured by the mean squared error (MSE). The smaller the MSE value, the more accurately the prediction model describes the experimental data.
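For n samples with true values y_i and predicted values ŷ_i, the metric is:

MSE = (1/n) * Σ_{i=1..n} (y_i − ŷ_i)²
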
(4) Question model
Commonly used models include regression prediction models and classification prediction models. Regression prediction models include linear regression, ridge regression, decision tree regression, and gradient boosting tree regression. Classification prediction models include binary classification and multi-class classification.
In this question, the predicted steam volume is a continuous numerical variable, so a regression prediction model is used.
2. Data exploration
(1) Univariate analysis
For continuous variables, perform descriptive statistics to summarize their central tendency and dispersion, including the mean, median, maximum, minimum, variance, standard deviation, etc.
For categorical variables, frequencies or proportions are generally used to describe the distribution of each category, while histograms and box plots can be used to visualize distributions.
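As a quick sketch, assuming the competition data is loaded into a pandas DataFrame named data (the file name matches the model-training code later in this article), pandas computes these statistics in one call:

import pandas as pd

# Load the training data
data = pd.read_csv('./data.csv')
# Mean, standard deviation, min, max, median (50%) and quartiles for every column
print(data.describe())
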
(2) Bivariate analysis
Bivariate analysis covers three combinations: continuous vs. continuous, categorical vs. categorical, and categorical vs. continuous, each described with different statistical methods and plots (a short code sketch follows the list below).
① Continuous vs. continuous
Statistical method: calculation of the correlation coefficient
Graphical representation: scatter plot
② Categorical vs. categorical
Statistical methods:
Two-way table: analyze the relationship by building a two-way table of the counts (frequencies) and proportions of the two variables.
Chi-square test: mainly used to analyze the correlation between two or more sample rates (composition ratios) and between two binary discrete variables.
Graphical representation: stacked bar chart
③ Categorical vs. continuous
Graphical representation: violin plot, showing the distribution of the continuous variable within each category of the categorical variable
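A minimal sketch of the continuous-vs-continuous case, reusing the DataFrame data loaded above (V0 and target are field names from this competition's data; matplotlib is one common plotting choice, not mandated by the competition):

import matplotlib.pyplot as plt

# Correlation coefficient between one feature and the target
print(data['V0'].corr(data['target']))
# Scatter plot of the same pair
plt.scatter(data['V0'], data['target'])
plt.xlabel('V0')
plt.ylabel('target')
plt.show()
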
3. Feature engineering
Feature engineering is the process of finding and constructing features from the raw data (that is, processing the variables) so that they describe the data well and achieve optimal predictive performance.
Processing flow: remove useless features, remove redundant features, generate new features, feature conversion, feature processing
3.1 Feature Transformation
Transform the form and value range of variables so that they fall within a reasonable range, better describe the shape and characteristics of the feature, or are easier to feed into model computations.
This includes standardization, normalization, binarization of quantitative features, dummy variables for qualitative features, missing-value handling, data transformation, etc.
(1) Standardization
Convert the features to a standard normal distribution by calculating the standard score (z-score), i.e., subtracting the mean and dividing by the standard deviation.
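A minimal scikit-learn sketch on toy data (StandardScaler implements exactly this z-score computation):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0]])
# z = (x - mean) / std: each column ends up with zero mean and unit variance
print(StandardScaler().fit_transform(x))
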
(2) Normalization
Scale the feature values of the samples to a common scale, mapping the data into the [0,1] or [a,b] interval.
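The corresponding scikit-learn sketch on toy data (MinMaxScaler maps each column into [0, 1] by default):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0], [5.0], [10.0]])
# (x - min) / (max - min) maps each column into [0, 1]
print(MinMaxScaler().fit_transform(x))
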
Usage scenarios for normalization and standardization:
If the output is required to fall within a specific range, use normalization.
If the data is relatively stable, without extreme maximum or minimum values, use normalization.
If the data contains outliers or a lot of noise, use standardization, whose centering step indirectly reduces the influence of outliers and extreme values.
Models such as support vector machines, k-nearest neighbors, and principal component analysis require normalization or standardization.
(3) Quantitative feature binarization
Set a threshold; values greater than the threshold become 1, and values less than or equal to the threshold become 0.
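A scikit-learn sketch of this rule on toy data (the 0.5 threshold here is just an example):

import numpy as np
from sklearn.preprocessing import Binarizer

x = np.array([[0.2], [0.5], [0.9]])
# Values > 0.5 become 1, values <= 0.5 become 0
print(Binarizer(threshold=0.5).fit_transform(x))
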
(4) Dummy variables for qualitative features
Dummy variables, also called indicator variables, are usually artificial variables taking the value 0 or 1 to reflect the different attributes of a variable. Converting a categorical variable into dummy variables is called dummy coding. For a variable with n categories, one category is usually used as the reference, generating n - 1 dummy variables.
The purpose of introducing dummy variables is to quantify variables that cannot otherwise be treated quantitatively, so as to evaluate the impact of qualitative factors on the dependent variable.
Usually the original multi-valued variable is converted into dummy variables; when building a regression model, each dummy variable gets its own estimated regression coefficient, making the regression results easier to interpret.
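A pandas sketch with a hypothetical three-category column; drop_first=True keeps the n - 1 dummies, using the dropped category as the reference:

import pandas as pd

# Hypothetical categorical column with n = 3 categories
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
# Dummy coding: n - 1 = 2 dummy variables ('blue', first alphabetically, is the reference)
print(pd.get_dummies(df['color'], prefix='color', drop_first=True))
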
(5) Missing value and outlier processing
① Missing values
Handling methods: deletion; filling with the mean, mode, or median; filling with a prediction model.
② Outliers
Detection: box plots, histograms, and scatter plots can be used to detect outliers.
Handling methods: deletion, transformation, filling, treating separately, etc.
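A pandas sketch of median filling and box-plot-style outlier detection, continuing with the DataFrame data loaded earlier (V0 is one of the feature fields; the 1.5 * IQR rule is the same one a box plot draws):

# Fill missing values in one column with its median
data['V0'] = data['V0'].fillna(data['V0'].median())

# Flag values outside 1.5 * IQR of the quartiles as outliers
q1, q3 = data['V0'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = data[(data['V0'] < q1 - 1.5 * iqr) | (data['V0'] > q3 + 1.5 * iqr)]
print(len(outliers))
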
(6) Data transformation
When analyzing feature distributions with tools such as histograms and kernel density estimates, some variables may turn out to have unevenly distributed values, which strongly affects prediction; such variables need to be transformed so that their distribution falls within a reasonable interval.
Commonly used transformation methods:
① Logarithmic transformation: take the logarithm of the variable, which changes the shape of its distribution.
② Take the square root or cube root: taking the square root or cube root of a variable also reshapes its distribution.
③ Variable grouping: variables can be binned based on raw values, percentages, or frequencies.
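A numpy sketch of the first two transformations on toy right-skewed data:

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])
# The logarithm strongly compresses large values
print(np.log(x))   # [0.    2.303 4.605 6.908]
# The cube root has a similar but milder effect
print(np.cbrt(x))  # [ 1.     2.154  4.642 10.   ]
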
3.2 Feature Dimensionality Reduction
In variable dimension processing, dimensionality reduction refers to using a certain mapping method to map data points in a high-dimensional vector space to a low-dimensional space.
In the original high-dimensional space, the vector data contains redundant information and noise, which causes errors in practical model applications. It is therefore necessary to remove useless or redundant information to reduce errors, by performing feature selection or linear dimensionality reduction.
(1) Feature selection
Directly delete unimportant features.
Feature selection methods: filter methods, wrapper methods, embedded methods
① Filter methods: select features according to the relationship between each feature variable and the target variable, including the variance threshold method, correlation coefficient method, chi-square test, maximum information coefficient method, etc. (see the filter-method sketch after this list).
② Wrapper methods: use algorithms such as genetic algorithms or simulated annealing to select a subset of features at each iteration.
③ Embedded methods: train models such as decision trees or deep learning algorithms to obtain a weight coefficient for each feature, then select features in descending order of their coefficients.
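A minimal sketch of one filter method, assuming train and target are the feature matrix and target column loaded as in the model-training code later in this article (f_regression scores each feature by its linear relationship with the target):

from sklearn.feature_selection import SelectKBest, f_regression

# Filter method: keep the 10 features with the strongest linear relationship to the target
selector = SelectKBest(score_func=f_regression, k=10)
train_selected = selector.fit_transform(train, target)
print(train_selected.shape)  # (n_samples, 10)
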
(2) Linear dimensionality reduction
Commonly used methods are principal component analysis and linear discriminant analysis.
① Principal component analysis
PCA maps high-dimensional data into a lower-dimensional space, choosing projection directions that maximize the variance of the projected data, so that fewer dimensions retain as much of the original data's characteristics as possible.
②Linear discriminant analysis
Unlike principal component analysis, which retains as much of the data's information as possible, the goal of linear discriminant analysis is to make the data points as easy to distinguish by class as possible after dimensionality reduction.
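A scikit-learn PCA sketch under the same assumption about train; passing a float to n_components keeps just enough components to explain that fraction of the variance:

from sklearn.decomposition import PCA

# Keep enough principal components to explain 90% of the variance
pca = PCA(n_components=0.9)
train_pca = pca.fit_transform(train)
print(train_pca.shape, pca.explained_variance_ratio_.sum())
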
4. Model training
Regression is a statistical technique used to predict the value of a desired target quantity when the target quantity is continuous.
Steps: import the required libraries - preprocess the data - train the model - predict the results
(1) Linear regression model
Linear regression assumes that the dependent variable Y is linearly related to the independent variables X. A linear model can then capture the relationship between X and Y and predict Y for new values of X.
First, the data needs to be imported:

import os
import pandas as pd

# Read the data
os.chdir("E:\\")
data = pd.read_csv('./data.csv')
# Select the independent variables (the 38 feature columns V0-V37)
train = data.iloc[:, 0:38]
# Select the dependent variable
target = data['target']

Before using any machine learning model, the data set must be split into training data (training set) and validation data (test set).
The code for splitting data is as follows:

# Split the dataset
from sklearn.model_selection import train_test_split
# 80% of the data is used for training, 20% for validation
train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)

Use sklearn to call the linear regression model for prediction; the code is as follows:

from sklearn.metrics import mean_squared_error  # evaluation metric
# Import the linear regression model from sklearn
from sklearn.linear_model import LinearRegression
# Define the linear regression model
clf = LinearRegression()
# Train the model on the training-set features and target
clf.fit(train_data, train_target)
# Feed the test-set features into the trained model to get predictions
test_pred = clf.predict(test_data)
# Compute the model's MSE score on the test set
score = mean_squared_error(test_target, test_pred)
print("LinearRegression: ", score)

(2) K nearest neighbor regression model
The k-nearest neighbors algorithm can be used for both classification and regression. It finds the k nearest neighbors of a sample and assigns the average of those neighbors' values for some attribute to the sample, yielding the value of that attribute for the sample.
Calling the k-nearest neighbors regression model:

from sklearn.metrics import mean_squared_error  # evaluation metric
# Import the k-nearest neighbors regression model from sklearn
from sklearn.neighbors import KNeighborsRegressor
# Define the KNN regression model: the prediction is the average of the
# target values of the 3 nearest "neighbors"
clf = KNeighborsRegressor(n_neighbors=3)
# Train the model on the training-set features and target
clf.fit(train_data, train_target)
# Feed the test-set features into the trained model to get predictions
test_pred = clf.predict(test_data)
# Compute the model's MSE score on the test set
score = mean_squared_error(test_target, test_pred)
print("KNeighborsRegressor: ", score)

(3) Decision tree regression model
Decision tree regression can be understood as dividing the feature space into several subspaces according to certain criteria, then using the information of all points in a subspace to represent the value of that subspace. The split points can be selected with the least squares criterion to obtain the corresponding subspaces, and the mean of the samples within each subspace is used as its output value.
Calling the decision tree regression model:

from sklearn.metrics import mean_squared_error  # evaluation metric
# Import the decision tree regression model from sklearn
from sklearn.tree import DecisionTreeRegressor
# Define the decision tree regression model:
# max depth 3, at least 4 samples per leaf, at least 2 samples to split a node
clf = DecisionTreeRegressor(max_depth=3, min_samples_leaf=4, min_samples_split=2)
# For more decision tree parameters see https://blog.csdn.net/qq_41577045/article/details/79844709
# Train the model on the training-set features and target
clf.fit(train_data, train_target)
# Feed the test-set features into the trained model to get predictions
test_pred = clf.predict(test_data)
# Compute the model's MSE score on the test set
score = mean_squared_error(test_target, test_pred)
print("DecisionTreeRegressor: ", score)

(4) Random forest regression model
Random forest is an algorithm that integrates multiple trees through the idea of ensemble learning. Its basic unit is the decision tree, and it belongs to the ensemble learning branch of machine learning.
In a regression problem, a random forest outputs the average of the outputs of all its decision trees.
Calling the random forest regression model:

from sklearn.metrics import mean_squared_error  # evaluation metric
# Import the random forest regression model from sklearn
from sklearn.ensemble import RandomForestRegressor
# Define the random forest regression model with 200 decision trees
clf = RandomForestRegressor(n_estimators=200)
# Train the model on the training-set features and target
clf.fit(train_data, train_target)
# Feed the test-set features into the trained model to get predictions
test_pred = clf.predict(test_data)
# Compute the model's MSE score on the test set
score = mean_squared_error(test_target, test_pred)
print("RandomForestRegressor: ", score)

(5) LightGBM regression model
LightGBM is a GBDT (gradient boosting decision tree) framework developed by Microsoft. It supports efficient parallel training and offers faster training speed, lower memory consumption, better accuracy, and distributed support, allowing it to process massive amounts of data quickly.
Calling the LightGBM regression model:

from sklearn.metrics import mean_squared_error  # evaluation metric
# Import the LightGBM library (a separate package, not part of sklearn)
import lightgbm as lgb
# Define the LightGBM regression model
clf = lgb.LGBMRegressor(
    learning_rate=0.01,
    max_depth=-1,
    n_estimators=5000,
    boosting_type='gbdt',
    random_state=2019,
    objective='regression')
# For more LightGBM parameters see https://www.cnblogs.com/jiangxinyang/p/9337094.html
# or https://blog.csdn.net/qq_24591139/article/details/100085359
# Train the model on the training-set features and target
clf.fit(train_data, train_target)
# Feed the test-set features into the trained model to get predictions
test_pred = clf.predict(test_data)
# Compute the model's MSE score on the test set
score = mean_squared_error(test_target, test_pred)
print("lightgbm: ", score)
