The sklearn Python library

 First, install sklearn

conda install scikit-learn
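To confirm the install worked, a minimal sanity check (assuming a standard Python environment) is to import the package and print its version:

# Minimal sanity check that scikit-learn is installed.
import sklearn
print(sklearn.__version__)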

references:

 

[1] Overall description of sklearn

 

https://blog.csdn.net/u014248127/article/details/78885180

 

 

Second, introduction to RandomForestRegressor

 

sklearn.ensemble.RandomForestRegressor(n_estimators=10,
                                       criterion='mse',
                                       max_depth=None,
                                       min_samples_split=2,
                                       min_samples_leaf=1,
                                       min_weight_fraction_leaf=0.0,
                                       max_features='auto',
                                       max_leaf_nodes=None,
                                       min_impurity_split=1e-07,
                                       bootstrap=True,
                                       oob_score=False,
                                       n_jobs=1,
                                       random_state=None,
                                       verbose=0,
                                       warm_start=False)
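A minimal sketch of how this constructor is typically used; the synthetic make_regression data and the specific parameter values here are illustrative assumptions, not part of the original post:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative synthetic regression data.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a forest with 100 trees and evaluate on the held-out split.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))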

 

 

 

criterion: "gini" or "entropy" (default = "gini"). Whether to use Gini impurity or entropy (information gain) as the measure for choosing the best attribute at each node.
splitter: "best" or "random" (default = "best"). Whether to split on the best attribute or on a randomly selected one; the default is recommended.
max_features: the number of features considered when looking for the best split; the attribute chosen for a split cannot exceed this limit.
  If an integer, it is the maximum number of features considered;
      If "auto", then all features are available, so no restriction is placed on the individual trees.
      If "sqrt", then max_features = sqrt(n_features): each individual tree may use only the square root of the total number of features. For example, if there are 100 variables (features) in total, each individual tree can use only 10 of them. "log2" is a similar option.
      If "log2", then max_features = log2(n_features).
      If None, then max_features = n_features. If a float, it is treated as a fraction of the training features: for example, 0.2 lets each tree in the random forest use 20% of the variables (features). If you want to study the effect of using x% of the features, you can use the "0.x" format.

Increasing max_features generally improves model performance, because at each node there are more candidate features to consider. However, this is not guaranteed, since it also reduces the diversity of the individual trees, which is the distinctive advantage of a random forest. What is certain is that increasing max_features slows the algorithm down. So you need to strike the right balance and choose the best max_features.
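To see this trade-off in practice, a rough sketch (the synthetic data and the specific candidate values are assumptions for illustration) is to cross-validate the same forest with different max_features settings:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=20, n_informative=10, noise=5.0, random_state=0)

# Compare a few max_features settings by 5-fold cross-validated R^2.
for mf in ["sqrt", "log2", 0.5, None]:
    model = RandomForestRegressor(n_estimators=100, max_features=mf, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(mf, round(scores.mean(), 3))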


max_depth: (default = None) the maximum depth of the tree. With the default of None, nodes are expanded until every leaf contains samples of only one class or until fewer than min_samples_split samples remain.
min_samples_split: the minimum number of samples required to split an internal node.
min_samples_leaf: the minimum number of samples required at a leaf node. If you have built a decision tree before, you can appreciate the importance of the minimum leaf size. A leaf is a terminal node of the decision tree. Smaller leaves make the model more likely to capture noise in the training data. In general, I prefer to set the minimum number of samples at a leaf to more than 50. In your own case, you should try several leaf sizes to find the optimal one (a grid-search sketch follows this list).
max_leaf_nodes: (default = None) the maximum number of leaf nodes in the tree.
min_weight_fraction_leaf: (default = 0) the minimum weighted fraction of samples required at a leaf node.
verbose: (default = 0) whether to display task progress.
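As suggested for min_samples_leaf above, one way to try several leaf sizes (and tree depths) is a grid search. This is only a sketch on synthetic data, with the candidate values chosen arbitrarily for illustration:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=15, noise=8.0, random_state=0)

# Illustrative grid over tree depth and minimum leaf size.
param_grid = {"max_depth": [None, 5, 10], "min_samples_leaf": [1, 10, 50]}
search = GridSearchCV(RandomForestRegressor(n_estimators=100, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))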


Some parameters specific to random forests:
n_estimators = 10: the number of trees. More trees generally give better accuracy, but training becomes slower; at least around 100 (I forget the exact figure) is needed to reach an acceptable error rate. This is the number of sub-trees built before taking the maximum vote or the average of the predictions. More sub-trees give the model better performance, but they also make your code slower. You should choose as high a value as your processor can handle, because it makes the predictions stronger and more stable.
bootstrap = True: whether samples are drawn with replacement.
oob_score = False: whether to use out-of-bag (OOB) data, i.e. the samples not selected by the bootstrap when training a particular decision tree. To tune the hyper-parameters of a single model we usually rely on cross-validation (CV), but that is quite time-consuming, and for random forests it is largely unnecessary; instead we can use the OOB data to validate the decision tree model, which amounts to a simple cross-validation. The performance cost is small and the results are good. This is random forest's own cross-validation method; it is very similar to leave-one-out validation but much faster. It simply records which observations each sub-tree was trained on, and then scores every observation by a vote over only those sub-trees that did not use that observation for training (a usage sketch follows this list).
n_jobs = 1: the number of parallel jobs. This is very important for ensemble algorithms, in particular bagging (rather than boosting, because boosting iterations depend on one another and are hard to parallelize), since parallelism improves performance. 1 means no parallelism; n means n parallel jobs; -1 launches as many jobs as there are CPU cores.
warm_start = False: warm start; decides whether to reuse the result of the previous call and add new estimators to it.
class_weight = None: the weight of each label.
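A minimal sketch of the oob_score and n_jobs parameters described above (the synthetic data is purely illustrative): with bootstrap=True and oob_score=True, the fitted regressor exposes oob_score_, the R^2 estimated on the out-of-bag samples.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# bootstrap=True draws samples with replacement; oob_score=True scores each tree
# on the samples it never saw; n_jobs=-1 builds trees on all CPU cores in parallel.
model = RandomForestRegressor(n_estimators=200, bootstrap=True, oob_score=True,
                              n_jobs=-1, random_state=0)
model.fit(X, y)
print(model.oob_score_)  # R^2 estimated from the out-of-bag samples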

random_state: this parameter makes the results easy to reproduce. A fixed random value produces the same results as long as the parameters and training data are unchanged. I have personally tried ensembling the optimally tuned models obtained with different random states; this method sometimes works better than any individual random state.
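A rough sketch of that idea (the seeds and data are chosen arbitrarily for illustration): fit the same model with several random states and average the predictions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Average the predictions of otherwise identical forests with different seeds.
preds = [RandomForestRegressor(n_estimators=100, random_state=seed)
         .fit(X_train, y_train).predict(X_test)
         for seed in (0, 1, 2)]
blended = np.mean(preds, axis=0)
print(blended[:5])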


Prediction can take several forms (a small sketch follows this list):
predict_proba(X): returns the result as probability values. For each sample, the probabilities over all labels sum to 1.
predict(X): returns the predicted result directly. Internally it also calls predict_proba() and reports whichever class has the highest probability as the predicted value.
predict_log_proba(X): essentially the same as predict_proba, but the results are passed through log().
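Note that predict_proba and predict_log_proba are methods of the classifier variant (RandomForestClassifier); RandomForestRegressor only exposes predict. A minimal sketch on synthetic classification data (all values are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(clf.predict(X[:3]))            # class with the highest probability
print(clf.predict_proba(X[:3]))      # per-class probabilities, each row sums to 1
print(clf.predict_log_proba(X[:3]))  # log of those probabilities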

references:

[2] How to use GBM / GBDT / GBRT well - an introduction to the parameters of gradient boosted regression trees

https://zwang1986.github.io/2016/04/24/%E5%A6%82%E4%BD%95%E7%94%A8%E5%A5%BDgbdt%EF%BC%88gradient_boosted_regression_trees%EF%BC%89/

[3] How to find the optimal parameters of a random forest

https://blog.csdn.net/qq_16633405/article/details/61200502

https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/

 

Third, introduction to GradientBoostingRegressor
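A minimal usage sketch of GradientBoostingRegressor; the synthetic data and parameter values below are illustrative assumptions, not recommendations:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Boosted regression trees: many shallow trees fitted sequentially.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)
gbr.fit(X_train, y_train)
print(mean_squared_error(y_test, gbr.predict(X_test)))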

 

 

 

Machine learning algorithms: ridge regression, Lasso regression, and ElasticNet regression

https://www.biaodianfu.com/ridge-lasso-elasticnet.html

