A classification tree (decision tree) is a very commonly used classification method. Its core task is to assign each data point to its most likely category.
It is a form of supervised learning. In supervised learning we are given a set of samples, each with a set of attributes and a category label; the categories are determined in advance, and through learning we obtain a classifier that can assign the correct category to new objects.
Understanding decision trees
The concept of entropy is important for understanding decision trees.
A decision tree's judgments are not 100% correct; it simply makes the best judgment possible under uncertainty.
Entropy is what we use to quantify that uncertainty.
Case: identifying the recommenders among shared-bike users
Analysis: work out what kind of users are most likely to recommend the shared-bike service. In other words, find the relationship between being a recommender and the other variables.
Step 1
Measure the entropy of the node population.
The outcome is binary: whether or not a user is a recommender. If the proportion of recommenders is 0 or close to 1, the entropy is 0; if the proportion is close to 50%, the entropy approaches 1.
The analyst needs features that distinguish recommenders from other users. By repeatedly splitting nodes, the decision tree reduces the entropy of the node populations as much as possible.
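The relationship between the recommender proportion and entropy described above can be sketched numerically. The following Python snippet (illustrative only; the case itself is worked in R below) computes the binary Shannon entropy of a node:

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a binary node, where p is the
    proportion of one class, e.g. the share of recommenders."""
    if p == 0 or p == 1:
        return 0.0  # a pure node is perfectly certain
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# A pure node (all recommenders, or none) has zero entropy...
print(entropy(0.0))  # 0.0
print(entropy(1.0))  # 0.0
# ...while a 50/50 node is maximally uncertain.
print(entropy(0.5))  # 1.0
```

A good split moves child nodes away from the 50% region and toward the pure (zero-entropy) ends.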
Step 2
Split the nodes.
Different ways of splitting produce different gain values; the computer selects the split with the maximum gain, which is the best split.
For details, see the section on information gain below.
Step 3
Stop splitting under the appropriate conditions.
Note: too many branch nodes complicate the tree and hinder decision-making, so splitting needs to stop at the appropriate time.
The concept of information gain (IG)
Information gain expresses how much the decision tree reduces the entropy of the classified data as a whole.
IG is computed by taking the parent node's entropy and subtracting the weighted entropy of the child nodes; the result is the reduction in entropy achieved by the split.
Different ways of splitting produce different gain values; the computer selects the split with the maximum gain, which is the best split.
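The IG computation can be sketched numerically. In the Python snippet below (illustrative only; the node counts are made-up numbers, not taken from the bike data), the weighted child entropy is subtracted from the parent entropy:

```python
import math

def entropy(p):
    """Binary Shannon entropy (bits) for class proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(parent, children):
    """IG = parent entropy minus the weighted entropy of the children.

    parent:   (positives, total) counts for the parent node
    children: list of (positives, total) counts for each child node
    """
    pos, n = parent
    parent_h = entropy(pos / n)
    weighted_child_h = sum((t / n) * entropy(p / t) for p, t in children)
    return parent_h - weighted_child_h

# Hypothetical split: a parent node with 50 recommenders out of 100 users
# is split into a child with 40/50 recommenders and a child with 10/50.
gain = information_gain((50, 100), [(40, 50), (10, 50)])
print(round(gain, 3))  # 0.278
```

A split that separates the classes perfectly would yield a gain of 1.0 here (the full parent entropy), while a split that leaves both children at 50% recommenders would yield a gain of 0; the algorithm picks the candidate split with the largest gain.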
R language
> bike.data <- read.csv("Shared Bike Sample Data - ML.csv")
> library(rpart)
> library(rpart.plot)
> # 推荐者 (recommender) is TRUE when 分数 (score) is at least 9
> bike.data$推荐者 <- bike.data$分数 >= 9
> # fit a classification tree on 城区 (district), 年龄 (age), 组别 (group)
> rtree_fit <- rpart(推荐者 ~ 城区 + 年龄 + 组别, data = bike.data)
> rpart.plot(rtree_fit)