Using a random forest classifier

There are many classification methods: logistic regression, KNN, decision trees, SVM, random forests, and so on.

The one that is relatively easy to use and easy to understand is the random forest, and there are common implementations in both Python and R. I won't explain the principle here; without further ado, show me the code.

import csv
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error, explained_variance_score

def load_dataset(filename):
    file_reader = csv.reader(open(filename, 'rt'), delimiter=',')
    X, y = [], []
    for row in file_reader:
        X.append(row[0:4])   # first four columns are the features
        y.append(row[-1])    # last column is the target
    # The first row is the header: extract the feature names from it
    feature_names = np.array(X[0])
    # Skip the header row and convert the remaining values to float
    return np.array(X[1:]).astype(np.float32), np.array(y[1:]).astype(np.float32), feature_names

if __name__ == '__main__':
    X, y, feature_names = load_dataset("D:\\yudata.csv")
    # Shuffle samples and targets together so the pairs stay aligned
    X, y = shuffle(X, y, random_state=7)
    # X, y = random.shuffle(X, y)

    # Use 90% of the data for training, the rest for testing
    num_training = int(0.9 * len(X))
    X_train, y_train = X[:num_training], y[:num_training]
    X_test, y_test = X[num_training:], y[num_training:]

    rf_regressor = RandomForestRegressor(n_estimators=1000, max_depth=10, min_samples_split=2)
    rf_regressor.fit(X_train, y_train)
    y_pred = rf_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    evs = explained_variance_score(y_test, y_pred)
    print(y_test)
    print(y_pred)
    print(rf_regressor.feature_importances_)

The training data is a CSV file. The loader above expects a header row, four feature columns, and the target value in the last column; for example:
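(The original post does not reproduce the actual contents of D:\\yudata.csv; the lines below are only a made-up illustration of the layout the loader expects. The column names and values are hypothetical.)

feat1,feat2,feat3,feat4,target
1.2,0.5,3.1,0.8,10.5
0.7,1.9,2.4,1.1,8.2
2.3,0.2,1.8,0.9,12.0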

There are several points to note:

1: Use open(filename, 'rt') rather than 'rb'; plain 'r' also works.

2: There are several shuffle functions: one in the random module, one in numpy, and one in sklearn.utils. The one used here is the last of these: from sklearn.utils import shuffle.

For the details you can refer to https://blog.csdn.net/huanhuan_Coder/article/details/82787923, which explains it well. The first two shuffle functions take only a single list as their argument,

while the sklearn version can take several arrays at once, as in the code above. Conveniently, when several arrays are passed they are all permuted in the same order, so the sample/target pairs stay aligned.

3: When converting values with astype(np.float32) I hit "ValueError: could not convert string to float". Debugging showed that the CSV contained empty cells; anything that is not a number triggers this error (a small standalone sketch covering points 2 and 3 appears after this list).

4: min_samples_split=1 also raises an error; change it to 2.

5: rf_regressor.feature_importances_ holds the weight (importance) of each feature. If you don't trust it, swap the column order in the CSV and check that the results do not change.

6: The code above is adapted from https://blog.csdn.net/weixin_42039090/article/details/80640890
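Here is a minimal standalone sketch of points 2 and 3 above, using made-up arrays rather than the original data: sklearn.utils.shuffle permutes several arrays in the same order, and rows containing empty cells can be dropped before calling astype(np.float32).

import numpy as np
from sklearn.utils import shuffle

X = np.array([['1', '2'], ['3', ''], ['5', '6']])   # note the empty cell in the second row
y = np.array(['10', '20', '30'])

# Drop rows that contain an empty string; otherwise astype(np.float32) raises
# "ValueError: could not convert string to float"
mask = ~(X == '').any(axis=1)
X, y = X[mask].astype(np.float32), y[mask].astype(np.float32)

# sklearn's shuffle permutes X and y with the same random order, so the pairs stay aligned
X, y = shuffle(X, y, random_state=7)
print(X)
print(y)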

R implementation: for details see https://blog.csdn.net/nieson2012/article/details/51279332

The R version is obviously more compact, and it is also easy to understand.

library("randomForest")
data(iris)
set.seed(100)
# Split roughly 80/20 into training (ind == 1) and test (ind == 2) sets
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., iris[ind == 1, ], ntree = 50, nPerm = 10,
                        mtry = 3, proximity = TRUE, importance = TRUE)
print(iris.rf)
iris.pred <- predict(iris.rf, iris[ind == 2, ])
table(observed = iris[ind == 2, "Species"], predicted = iris.pred)
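For comparison, here is a rough Python counterpart of the R example above, written as a sketch of my own (not taken from the referenced posts), using sklearn's built-in iris data and RandomForestClassifier:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=100)

# ntree=50 and mtry=3 in R correspond roughly to n_estimators=50 and max_features=3 here
clf = RandomForestClassifier(n_estimators=50, max_features=3, random_state=100)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Rough equivalent of R's table(observed=..., predicted=...)
print(confusion_matrix(y_test, pred))
print(clf.feature_importances_)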

Finally, one more reference: https://blog.csdn.net/qq_37423198/article/details/76922207

The main advantages of RF are:
1) Training can be highly parallelized, which gives a speed advantage on large samples in the big-data era.
2) Because the features considered at each tree node split are a random subset, the model can still be trained efficiently when the samples have a high feature dimension.
3) After training, the model can report the importance of each feature to the outcome.
4) Because of random sampling, the trained model has low variance and strong generalization ability.
5) Compared with boosting methods such as AdaBoost and GBDT, RF is relatively simple to implement.
6) It is insensitive to missing values in some of the features.
7) It adapts well to different datasets: it handles both discrete and continuous data, and the data does not need to be standardized.
8) When building a random forest, an unbiased estimate of the generalization error is obtained from the out-of-bag samples (see the small sketch after this list).
9) The training process can detect interactions between features.
The main disadvantages of RF are:
1) On sample sets with relatively high noise, an RF model is prone to overfitting.
2) Features with many distinct split values have a larger influence on the RF's decisions, which can distort the model fit.
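As a small sketch of point 8 above: sklearn exposes this through the oob_score parameter. The snippet below reuses X_train and y_train from the earlier script and is only illustrative.

from sklearn.ensemble import RandomForestRegressor

# oob_score=True makes the forest score itself on the samples each tree did not see
rf_oob = RandomForestRegressor(n_estimators=1000, max_depth=10,
                               min_samples_split=2, oob_score=True)
rf_oob.fit(X_train, y_train)
print(rf_oob.oob_score_)   # R^2 estimated from the out-of-bag predictions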

Finally, plot the feature weights as a bar chart:

 import matplotlib.pyplot as plt
 import seaborn as sns

 color = sns.color_palette()
 sns.set_style('darkgrid')
 features_list = feature_names
 feature_importance = rf_regressor.feature_importances_
 sorted_idx = np.argsort(feature_importance)

 plt.figure(figsize=(5, 7))
 plt.barh(range(len(sorted_idx)), feature_importance[sorted_idx], align='center')
 plt.yticks(range(len(sorted_idx)), features_list[sorted_idx])
 plt.xlabel('Importance')
 plt.title('Feature importances')
 plt.draw()
 plt.show()

  

One more scattered point to add.

This article is well written: https://www.cnblogs.com/pinard/p/6156009.html

Here is the passage on the distinction between the subsampling in Bagging and in GBDT:

GBDT's subsampling is sampling without replacement, while Bagging's subsampling is sampling with replacement.

    For a single sample, in one round of random sampling that draws a training set of m samples with replacement, the probability of being picked on any one draw is 1/m, so the probability of not being picked on that draw is 1 - 1/m. The probability of never being picked in all m draws is (1 - 1/m)^m, and as m → ∞ this tends to 1/e ≈ 0.368. In other words, in each round of Bagging's random sampling, roughly 36.8% of the training data never makes it into the sampled set (a quick numerical check is sketched after these excerpts).

    This roughly 36.8% of the data that is never sampled is commonly called out-of-bag data (Out Of Bag, abbreviated OOB). Since these samples take no part in fitting the model, they can be used to check the model's generalization ability.

    Bagging places no restriction on the weak learner, just like AdaBoost. The most common choices, however, are decision trees and neural networks.

    Bagging's aggregation strategy is relatively simple. For classification, simple majority voting is commonly used: the category that receives the most votes from the weak learners is the final model output. For regression, simple averaging is usually used: the arithmetic mean of the predictions of the T weak learners is the final model output.

    Because the Bagging algorithm trains each model on a bootstrap sample rather than on all the data, it generalizes well and is effective at reducing the variance of the model. The trade-off is that the fit to the training set is worse, i.e. the bias of the model is larger.
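As a quick numerical check of the 36.8% figure quoted above (a tiny sketch, not from the referenced article):

import math

for m in (10, 100, 1000, 100000):
    print(m, (1 - 1 / m) ** m)   # approaches 1/e as m grows
print(1 / math.e)                # about 0.3679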
