Basic Learning of Machine Learning Algorithms: Random Forest in Ensemble Learning

  • Random forest is an ensemble learning algorithm. Ensemble learning completes a learning task by combining multiple learners; a random forest combines many decision trees to train on and predict samples, and it decorrelates the trees by randomly perturbing them.
  • Random forests can use a huge number of predictors, even more predictors than observations. The most significant advantage of the method is that it can draw on more information to reduce the bias of the fitted values and estimated splits. Often a few predictors dominate the fitting of a single decision tree because, on average, they consistently perform better than the other competing predictors.
  • Random forests have three main hyperparameters to tune (the sketch after this list shows how they map onto scikit-learn arguments):
    • Node size: Unlike a single decision tree, a random forest may allow each leaf node to contain very few observations; the goal of this hyperparameter is to keep the bias as small as possible when growing the trees.
    • Number of trees: In practice several hundred trees are generally a good choice.
    • Number of predictors sampled per split: If there are D predictors in total, a common choice is to sample D/3 predictors in regression tasks and D^(1/2) predictors in classification tasks.
      [Note: this refers to randomly sampling a subset of predictors as candidate split variables at each node while the trees are being grown, not to sampling the trees themselves after training.]
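
For reference, a minimal sketch of how these three hyperparameters map onto scikit-learn's arguments (min_samples_leaf for node size, n_estimators for the number of trees, max_features for the number of predictors sampled per split); the concrete values below are illustrative, not recommendations.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Node size -> min_samples_leaf, number of trees -> n_estimators,
# predictors sampled per split -> max_features (values here are only illustrative).
clf_rf = RandomForestClassifier(
    n_estimators=500,       # several hundred trees
    min_samples_leaf=1,     # small leaves keep the bias low
    max_features='sqrt',    # about D^(1/2) predictors per split for classification
)
reg_rf = RandomForestRegressor(
    n_estimators=500,
    min_samples_leaf=1,
    max_features=1/3,       # a float is a fraction of D, i.e. about D/3 for regression
)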

Code: 5-fold cross-validation grid search to optimize random forest hyperparameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid: number of trees, tree depth, and node-size controls.
parameters = {
    'n_estimators': (100, 500, 1000),
    'max_depth': (None, 24, 16),
    'min_samples_split': (2, 4, 8),
    'min_samples_leaf': (16, 4, 12),
}

# 5-fold cross-validated grid search over the random forest.
clf = GridSearchCV(RandomForestClassifier(), parameters, cv=5, n_jobs=8)
clf.fit(x_train, y_train)
clf.best_score_, clf.best_params_
best_rf_model = clf.best_estimator_   # the refitted model with the best parameters

(0.86606676699118579,
 {'max_depth': 24,
  'min_samples_leaf': 4,
  'min_samples_split': 4,
  'n_estimators': 1000})
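
The tuned model can then be checked on held-out data. A minimal sketch, assuming an x_test / y_test split that mirrors the x_train / y_train naming used above:

from sklearn.metrics import accuracy_score

# x_test / y_test are assumed here, mirroring the x_train / y_train naming above.
test_pred = best_rf_model.predict(x_test)
print('Test accuracy:', accuracy_score(y_test, test_pred))
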
  • Advantages of random forests:
    • Data does not need to be normalized
  • Limitations of random forests:
    • Random forests are poor at extrapolating beyond the range of the independent or dependent variables seen in training; the MARS algorithm is better suited to that;
    • The random forest algorithm is relatively slow in training and prediction;
    • The random forest algorithm does not work well when the number of categories is very large;
  • The direction of improvement in the next step: Generally speaking, problems that a random forest can solve can also be handled by the gradient boosting tree algorithm, and the random forest is worse than gradient boosting trees in both accuracy and computational complexity (a minimal comparison sketch follows below).
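
For comparison, a minimal sketch of swapping the random forest for scikit-learn's gradient boosting trees on the same classification data (x_train / y_train from the grid search above); the settings are illustrative and untuned.

from sklearn.ensemble import GradientBoostingClassifier

# Same data as the random forest grid search above; settings are untuned defaults.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gb.fit(x_train, y_train)
print('Training accuracy:', gb.score(x_train, y_train))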

Relevant code examples (only the key code is kept):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=100, random_state=42)
X, Y = create_dataset(cur_data, global_look_back)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
rf.fit(X_train, Y_train)
trainPredict = rf.predict(X_train)
testPredict = rf.predict(X_test)
trainMSE = mean_squared_error(Y_train, trainPredict)    # training-set error
testMSE = mean_squared_error(Y_test, testPredict)       # test-set error
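
The create_dataset helper is not included in the excerpt above. A minimal sliding-window sketch is given below, assuming cur_data is a one-dimensional series and global_look_back is the number of lagged values used as features; the original implementation may differ.

import numpy as np

def create_dataset(series, look_back):
    """Turn a 1-D series into (X, Y) pairs: each row of X holds `look_back`
    consecutive values and Y is the value that immediately follows them."""
    series = np.asarray(series).ravel()
    X, Y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        Y.append(series[i + look_back])
    return np.array(X), np.array(Y)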

Model saving and loading:

import os
import pickle

# Save the trained model to disk
with open(os.path.join(save_model_dir, sname), 'wb') as file:
    pickle.dump(rf, file)

# Load the saved model back
with open(fpath, 'rb') as file:
    model = pickle.load(file)
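
Once reloaded, the model behaves like the original fitted estimator; for example, reusing X_test and Y_test from the regression split above:

# The unpickled model predicts just like the original fitted estimator.
loadedPredict = model.predict(X_test)
print(mean_squared_error(Y_test, loadedPredict))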

