Basic introduction
Parameters of DecisionTreeRegressor
-
criterion {"squared_error", "friedman_mse", "absolute_error", "poisson"}, default="squared_error"
The function to measure the quality of a split. Supported criteria are:
- "squared_error": the mean squared error, which is equal to variance reduction as a feature selection criterion and minimizes the L2 loss using the mean of each terminal node;
- "friedman_mse": mean squared error with Friedman's improvement score for potential splits;
- "absolute_error": the mean absolute error, which minimizes the L1 loss using the median of each terminal node;
- "poisson": uses reduction in Poisson deviance to find splits.
-
splitter {"best", "random"}, default="best"
The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.
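A minimal sketch on synthetic data: both strategies are drop-in values for the splitter parameter, and "random" will generally produce a different tree structure than "best":

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.05, size=200)

# "best" evaluates candidate thresholds exhaustively per feature;
# "random" draws a random threshold per feature and keeps the best of those
best_tree = DecisionTreeRegressor(splitter="best", max_depth=4, random_state=0).fit(X, y)
random_tree = DecisionTreeRegressor(splitter="random", max_depth=4, random_state=0).fit(X, y)
```

The "random" strategy trades split quality for speed and extra randomization, which can help as a regularizer.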
-
max_depth int, default=None
The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
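A small sketch of the effect on synthetic data; the fitted estimator's get_depth() reports the depth actually reached:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = np.sin(4 * X.ravel()) + rng.normal(scale=0.1, size=100)

# Default max_depth=None: grow until every leaf is pure
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
# Capped depth: a much coarser (and more regularized) fit
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

print(deep.get_depth(), shallow.get_depth())
```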
-
min_samples_split int or float, default=2
The minimum number of samples required to split an internal node.
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
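For example, with 100 training samples a float of 0.05 maps to ceil(0.05 * 100) = 5, so the two spellings below impose the same constraint (a sketch on synthetic data):

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n_samples = 100
X = rng.rand(n_samples, 2)
y = rng.rand(n_samples)

# ceil(0.05 * 100) = 5, so these two trees are grown under the same rule
as_fraction = DecisionTreeRegressor(min_samples_split=0.05, random_state=0).fit(X, y)
as_count = DecisionTreeRegressor(min_samples_split=5, random_state=0).fit(X, y)

assert math.ceil(0.05 * n_samples) == 5
assert as_fraction.tree_.node_count == as_count.tree_.node_count
```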
-
min_samples_leaf int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
Changed in version 0.18: Added float values for fractions.
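The constraint can be verified through the fitted tree's internals; in the tree_ structure, children_left == -1 marks leaf nodes (a sketch on synthetic data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = rng.rand(200)

tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0).fit(X, y)

# Every leaf must hold at least 10 training samples
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]
assert leaf_sizes.min() >= 10
```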
-
min_weight_fraction_leaf float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
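A sketch with uneven sample_weight: each leaf must then carry at least the given fraction of the total weight, which can be checked via the fitted tree's weighted_n_node_samples:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(150, 2)
y = rng.rand(150)
w = rng.rand(150)  # uneven sample weights

# Each leaf must carry at least 10% of the total sample weight
tree = DecisionTreeRegressor(min_weight_fraction_leaf=0.1, random_state=0)
tree.fit(X, y, sample_weight=w)

is_leaf = tree.tree_.children_left == -1
leaf_weight = tree.tree_.weighted_n_node_samples[is_leaf]
assert leaf_weight.min() >= 0.1 * w.sum() - 1e-9
```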
-
max_features int, float or {"auto", "sqrt", "log2"}, default=None
The number of features to consider when looking for the best split:
- If int, then consider max_features features at each split.
- If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.
- If "auto", then max_features=n_features.
- If "sqrt", then max_features=sqrt(n_features).
- If "log2", then max_features=log2(n_features).
- If None, then max_features=n_features.
Deprecated since version 1.1: The "auto" option was deprecated in version 1.1 and will be removed in version 1.3.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
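The fitted attribute max_features_ exposes the resolved count. For example, with 9 input features a float of 0.5 resolves to max(1, int(0.5 * 9)) = 4 and "sqrt" resolves to int(sqrt(9)) = 3 (a sketch):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 9)
y = rng.rand(100)

# float: max(1, int(0.5 * 9)) = 4 features examined at each split
frac_tree = DecisionTreeRegressor(max_features=0.5, random_state=0).fit(X, y)
# "sqrt": int(sqrt(9)) = 3 features examined at each split
sqrt_tree = DecisionTreeRegressor(max_features="sqrt", random_state=0).fit(X, y)

print(frac_tree.max_features_, sqrt_tree.max_features_)
```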
-
random_state int, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm selects max_features features at random at each split before finding the best split among them. But the best found split may vary across different runs, even when max_features=n_features. That is the case when the criterion improvement is identical for several splits and one split has to be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
-
max_leaf_nodes int, default=None
Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, the number of leaf nodes is unlimited.
-
min_impurity_decrease float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to weighted sums if sample_weight is passed.
-
ccp_alpha non-negative float, default=0.0
Complexity parameter used for minimal cost-complexity pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed.
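A sketch on synthetic data: raising ccp_alpha prunes the fully grown tree down to fewer nodes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = rng.rand(100)

# ccp_alpha=0.0 (default): no pruning, the tree memorizes the training set
full = DecisionTreeRegressor(random_state=0).fit(X, y)
# A positive ccp_alpha prunes subtrees whose cost-complexity gain is too small
pruned = DecisionTreeRegressor(ccp_alpha=0.01, random_state=0).fit(X, y)

print(full.tree_.node_count, pruned.tree_.node_count)
```

In practice, candidate alpha values can be obtained from cost_complexity_pruning_path(X, y) and the best one chosen by cross-validation.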
Code example
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]
# Specify Model (fixing random_state makes the fit reproducible,
# as described under the random_state parameter above)
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(X, y)
print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())