Detailed explanation of DecisionTreeRegressor parameters

basic introduction

DecisionTreeRegressor is the decision tree model for regression in sklearn.tree. It makes piecewise-constant predictions by recursively splitting the feature space and predicting, in each leaf, the mean (or, for the absolute-error criterion, the median) of the training targets that fall into it. The parameters below control how the tree is grown and pruned.

parameters

  • criterion {“squared_error”, “friedman_mse”, “absolute_error”, “poisson”}, default=”squared_error”

    The function to measure the quality of a split. Supported criteria are:

    • "squared_error", the mean squared error, which is equal to variance reduction as a feature selection criterion and minimizes the L2 loss using the mean of each terminal node;
    • "friedman_mse", which uses mean squared error with Friedman's improvement score for potential splits;
    • "absolute_error", the mean absolute error, which minimizes the L1 loss using the median of each terminal node;
    • "poisson", which uses reduction in Poisson deviance to find splits.
  • splitter {“best”, “random”}, default=”best”

    The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.

  • max_depth int, default=None

    The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

  • min_samples_split int or float, default=2

    The minimum number of samples required to split an internal node.

    • If int, then min_samples_split is the minimum number.
    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
  • min_samples_leaf int or float, default=1

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then min_samples_leaf is the minimum number.
    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
      Changed in version 0.18: Added float values for fractions.
  • min_weight_fraction_leaf float, default=0.0

    The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features int, float or {“auto”, “sqrt”, “log2”}, default=None

    The number of features to consider when finding the best split.

    • If int, then max_features features are considered at each split.
    • If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
    • If "auto", then max_features=n_features.
    • If "sqrt", then max_features=sqrt(n_features).
    • If "log2", then max_features=log2(n_features).
    • If None, then max_features=n_features.
      Deprecated since version 1.1: The "auto" option was deprecated in version 1.1 and will be removed in version 1.3.
      Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
  • random_state int, RandomState instance or None, default=None

    Controls the randomness of the estimator. The features are always randomly permuted at each split, even when splitter is set to "best". When max_features < n_features, the algorithm randomly selects max_features features at each split and then finds the best split among them. Even with max_features=n_features, however, the best split found may differ between runs; this happens when the criterion improvement is identical for several splits and one of them has to be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.

  • max_leaf_nodes int, default=None

    Grow a tree with at most max_leaf_nodes leaf nodes, in a best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited.

  • min_impurity_decrease float, default=0.0

    If the impurity reduction caused by a node split is greater than or equal to this value, the node will be split.

    The weighted impurity decrease equation is as follows:

    N_t / N * (impurity - N_t_R / N_t * right_impurity
                        - N_t_L / N_t * left_impurity)
    

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

    N, N_t, N_t_R and N_t_L all refer to the weighted sums if sample_weight is passed.

  • ccp_alpha non-negative float, default=0.0

    Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. (A combined usage sketch for the parameters in this list follows below.)
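
The parameters above are all passed to the DecisionTreeRegressor constructor. The following is a minimal sketch on synthetic data; the dataset and the specific parameter values are illustrative assumptions, not recommended settings.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data: 200 samples, 5 features, noisy linear target
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.randn(200)

# Combine several of the parameters described above (values chosen only for demonstration)
tree = DecisionTreeRegressor(
    criterion="squared_error",    # L2 loss, equivalent to variance reduction
    splitter="best",              # evaluate candidate splits and keep the best one
    max_depth=4,                  # stop growing beyond this depth
    min_samples_split=10,         # an internal node needs >= 10 samples to be split
    min_samples_leaf=5,           # every leaf must keep >= 5 training samples
    max_features="sqrt",          # consider sqrt(n_features) features at each split
    random_state=0,               # fix the random feature selection for reproducibility
    min_impurity_decrease=0.0,    # split only if the weighted impurity decrease is >= this value
    ccp_alpha=0.0,                # 0.0 disables cost-complexity pruning
)
tree.fit(X, y)
print("Tree depth:", tree.get_depth())
print("Number of leaves:", tree.get_n_leaves())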

code example

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Load the training data, then separate the prediction target from the features
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

References

sklearn.tree.DecisionTreeRegressor

Origin blog.csdn.net/weixin_46421722/article/details/129520089