[Mathematical Modeling] Random Forest Prediction (Python Code Implementation)

Table of contents

1 Parameters

2 Implementation

2.1 Example problem

2.2 Single-target prediction - DecisionTreeRegressor

2.3 Multi-target prediction - MultiOutputRegressor


1 Parameters

n_estimators : The number of decision trees in the forest, i.e. the number of base estimators. The default is 100. The effect of this parameter on model accuracy is monotonic: the larger n_estimators, the better the model tends to perform. But every model has its ceiling, and once n_estimators reaches a certain level the accuracy of the random forest stops rising or begins to fluctuate. Moreover, the larger n_estimators, the more computation and memory are required, and the longer training takes. For this parameter we therefore want to strike a balance between training cost and model performance (see the sketch after this list).

criterion : The criterion used to split a node, either "gini" or "entropy". The default is "gini".

max_depth : The maximum depth of the tree. If None, nodes are expanded until all leaves are pure (only one class), or until all leaves contain fewer than min_samples_split samples. The default is None.

min_samples_split : The minimum number of samples required to split an internal node. If int, min_samples_split is that minimum number. If float, min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split. The default is 2.

min_samples_leaf : The minimum number of samples required at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If int, min_samples_leaf is that minimum number. If float, min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node. The default is 1.

min_weight_fraction_leaf : The minimum weighted fraction of the total sum of sample weights (over all input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. The default is 0.

max_features : The number of features to consider when looking for the best split. If int, max_features features are considered at each split. If float, max_features is a fraction and int(max_features * n_features) features are considered at each split. If "auto", max_features = sqrt(n_features). If "sqrt", max_features = sqrt(n_features). If "log2", max_features = log2(n_features). If None, max_features = n_features. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if that requires effectively inspecting more than max_features features.

max_leaf_nodes : The maximum number of leaf nodes (int). The default is None (unlimited).

min_impurity_decrease : A node will be split if the split induces a decrease of the impurity greater than or equal to this value. The default is 0.

min_impurity_split : Threshold for early stopping in tree growth; a node will split only if its impurity is above this threshold. The default is 0. Deprecated since version 0.19 in favor of min_impurity_decrease; the default value was changed from 1e-7 to 0 in 0.23, and the parameter will be removed in 0.25.

bootstrap : Whether to use bootstrap samples when building trees (bool). The default is True: for each tree, samples are drawn randomly with replacement. If False, the whole dataset is used to build each tree (as in extra-trees, where bootstrap=False).

oob_score : Whether to use out-of-bag samples to estimate generalization accuracy. The default is False.

n_jobs : Number of parallel computations. The default is None.

random_state : Controls both the randomness of the bootstrap sampling of the training samples and the sampling of the features considered at each split.

verbose : Controls the verbosity when fitting and predicting. The default is 0.

class_weight : The weight of each class, which can be passed as a dictionary of the form {class_label: weight}. If "balanced", class weights are computed as n_samples / (n_classes * np.bincount(y)).

ccp_alpha : Complexity parameter for minimal cost-complexity pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed.

max_samples : If bootstrap is True, the number of samples to draw from X to train each base estimator. If None (default), X.shape[0] samples are drawn. If int, max_samples samples are drawn. If float, max_samples * X.shape[0] samples are drawn, so max_samples should be in the interval (0, 1]. New in version 0.22.
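To see how these parameters fit together in practice, here is a minimal sketch on synthetic placeholder data (not the contest data) that builds a RandomForestRegressor with a few of the parameters above and uses the out-of-bag score to sanity-check the choice of n_estimators:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, purely for illustration
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 4))
y_demo = X_demo[:, 0] * 2 + rng.normal(scale=0.1, size=200)

# A forest combining several of the parameters described above
rf = RandomForestRegressor(
    n_estimators=200,      # more trees help, but the gain plateaus eventually
    max_depth=None,        # grow until leaves are pure or too small to split
    min_samples_leaf=1,
    max_features='sqrt',   # consider sqrt(n_features) features per split
    bootstrap=True,
    oob_score=True,        # use out-of-bag samples as a built-in validation set
    random_state=0,
    n_jobs=-1)
rf.fit(X_demo, y_demo)
print(rf.oob_score_)       # out-of-bag R^2; compare it across n_estimators values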

 

2 Implementation

2.1 Example problem

Melt-blown nonwoven materials are an important raw material for mask production: they filter well, are simple to produce, cost little, and are light, so they have attracted wide attention from companies at home and abroad. However, because the fibers of melt-blown nonwovens are very fine, poor compression resilience during use means their performance cannot be guaranteed. Researchers therefore developed the intercalation melt-blown method: during the preparation of polypropylene (PP) melt-blown material, fibers such as polyester (PET) staple fibers are inserted into the melt-blown fiber stream, producing a melt-blown nonwoven material with a "Z-shaped" intercalated-layer structure. The preparation of intercalated melt-blown nonwovens involves many process parameters with interactive effects between them, and the situation becomes more complicated once the intercalation airflow is added. As a result, studying how the process parameters determine the structural variables (thickness, porosity, compression resilience) and how the structural variables determine the final product properties (filtration resistance, filtration efficiency, air permeability) has also become more complicated. If relationship models can be established between process parameters and structural variables, and between structural variables and product performance, they will help provide a theoretical basis for a product performance regulation mechanism. Please consult the relevant literature, understand the professional background, study the topic data, and answer the following question:

Question: Investigate the relationship between the process parameters and the structural variables. Table 1 gives 8 combinations of process parameters; fill the predicted structural variable data into Table 1 and present the results in tabular form.

2.2 Single-target prediction - DecisionTreeRegressor

As an example of single-target prediction, we use a decision tree and predict only the compression resilience rate (压缩回弹性率):

import pandas as pd
from sklearn.tree import DecisionTreeRegressor  # decision tree

# Step 1: read the data
data = pd.read_excel('C题数据.xlsx', sheet_name=2)
chuli = data.iloc[:, :5]
# Columns: receiving distance, hot-air speed, thickness, porosity, compression resilience rate
chuli.columns = ['接收距离', '热风速度', '厚度', '孔隙率', '压缩回弹性率']
# chuli

# Step 2: extract the variables (features = process parameters, target = compression resilience rate)
X = chuli.drop(['压缩回弹性率', '孔隙率', '厚度'], axis=1)
y = chuli['压缩回弹性率']
# X

# Step 3: split the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 4: build and train the decision tree model
model = DecisionTreeRegressor(max_depth=2)  # set the tree depth to 2
dtr_fit = model.fit(x_train, y_train)

# Step 5: read the data to predict (prepare it yourself: test.xlsx)
test1 = pd.read_excel('test.xlsx')
# test1

# Step 6: predict with the trained model
ya = dtr_fit.predict(test1)
print(pd.DataFrame(ya, columns=['压缩回弹性率']))

Result: [output figure: the predicted 压缩回弹性率 values]
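Before filling the predictions into Table 1, it is worth a quick sanity check of the fitted tree on the held-out split from Step 3; score returns the R² of the predictions:

# Optional sanity check on the held-out split from Step 3 (score = R^2)
print('Train R^2:', dtr_fit.score(x_train, y_train))
print('Test R^2:', dtr_fit.score(x_test, y_test))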

2.3 Multi-target prediction - MultiOutputRegressor

That completes the single-target prediction. Next we try multiple inputs and multiple outputs, i.e. predicting everything at once. Continuing from the code above, we now use MultiOutputRegressor to predict the compression resilience rate (压缩回弹性率), porosity (孔隙率), and thickness (厚度) together. When several targets must be predicted from the same features, the MultiOutputRegressor wrapper handles the multi-target regression.

# Step 7: re-extract the variables, now with three targets
X2 = chuli.drop(['压缩回弹性率', '孔隙率', '厚度'], axis=1)
y2 = chuli[['压缩回弹性率', '孔隙率', '厚度']]
# y2

# Step 8: split the extracted data
from sklearn.model_selection import train_test_split
x_train2, x_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=0)

# Step 9: train a multi-input, multi-output model based on XGBoost
# (if the results are not good enough, tune the XGBoost parameters yourself)
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# 'reg:linear' is deprecated; 'reg:squarederror' is the equivalent current name
mor = MultiOutputRegressor(XGBRegressor(objective='reg:squarederror'))
mor.fit(x_train2, y_train2)
# mor

# Step 10: predict
pre = mor.predict(test1)
print(pd.DataFrame(pre, columns=['压缩回弹性率', '孔隙率', '厚度']))

Result: [output figure: the predicted 压缩回弹性率, 孔隙率, and 厚度 values]
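A note on the design: MultiOutputRegressor simply fits one clone of the base estimator per target column, so the three targets are modeled independently of each other. The fitted per-target models are exposed through the estimators_ attribute; a small sketch of a per-target check on the held-out split:

# One fitted XGBRegressor per target column, in the order of y2's columns
for name, est in zip(['压缩回弹性率', '孔隙率', '厚度'], mor.estimators_):
    print(name, est.score(x_test2, y_test2[name]))  # per-target R^2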

Origin: blog.csdn.net/weixin_61181717/article/details/127046671