CART classification tree algorithm
Interpretation of CART classification tree algorithm
Input: data set $D$, feature set $A$, stopping threshold $\epsilon$
Output: CART classification decision tree
Steps:
- Starting from the root node, recursively perform the following operations on each node to build a binary decision tree.
- Compute the Gini index of the data set for each available feature and select the optimal feature:
  - For a feature $A_g$ and each value $g$ it can take, split $D$ into two parts $D_1$ and $D_2$ according to whether the test $A_g = g$ on a sample is "yes" or "no", and compute the Gini index for $A_g = g$.
  - Select the value with the smallest Gini index as the optimal split point for that feature.
  - Find the optimal split point for every feature, compare the Gini indices at those split points, and select the feature with the smallest Gini index as the optimal feature.
- Generate two child nodes from the optimal feature and its optimal split point, and assign the samples to the corresponding child node. Each node is therefore split in two at the optimal split point.
- Recursively apply the steps above to the two child nodes until the stopping condition is met. The result is the CART classification decision tree.
The stopping condition is usually a threshold: when the Gini index of a node falls below it, the samples at that node essentially belong to the same class. Recursion also stops when no features remain to split on. At that point the CART classification decision tree is complete.
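The steps above can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the function names (`gini`, `best_split`, `build_tree`), the default threshold `eps`, and the four-row toy data set are all assumptions made for this example, and "no more features" is interpreted here as "no test separates the remaining samples".

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return (gini, feature, value) of the best 'feature == value' test, or None."""
    best = None
    n = len(rows)
    for f in features:
        for v in sorted({r[f] for r in rows}):
            yes = [y for r, y in zip(rows, labels) if r[f] == v]
            no = [y for r, y in zip(rows, labels) if r[f] != v]
            if not yes or not no:
                continue  # this test does not separate the samples
            g = len(yes) / n * gini(yes) + len(no) / n * gini(no)
            if best is None or g < best[0]:
                best = (g, f, v)
    return best


def build_tree(rows, labels, features, eps=1e-3):
    """Recursively build a binary tree; leaves are majority class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if gini(labels) < eps:  # node is (almost) pure
        return majority
    split = best_split(rows, labels, features)
    if split is None:  # nothing left to split on
        return majority
    _, f, v = split
    yes_idx = [i for i, r in enumerate(rows) if r[f] == v]
    no_idx = [i for i, r in enumerate(rows) if r[f] != v]
    return {
        "test": (f, v),
        "yes": build_tree([rows[i] for i in yes_idx],
                          [labels[i] for i in yes_idx], features, eps),
        "no": build_tree([rows[i] for i in no_idx],
                         [labels[i] for i in no_idx], features, eps),
    }


# Hypothetical toy data, NOT the training set used in the worked example.
rows = [
    {"job": "yes", "house": "no"},
    {"job": "no", "house": "yes"},
    {"job": "no", "house": "no"},
    {"job": "yes", "house": "yes"},
]
labels = ["approve", "approve", "refuse", "approve"]
tree = build_tree(rows, labels, ["job", "house"])
print(tree)
```

Each internal node stores the test `(feature, value)` together with its "yes" and "no" subtrees, which mirrors the yes/no split into $D_1$ and $D_2$ described in the steps.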
**CART classification tree: worked example**
The training set is $D$. The feature set consists of age $A_1$, has a job $A_2$, owns a house $A_3$, and credit status $A_4$.
The classes are $Y_1 = \text{Yes}$ (loan approved) and $Y_2 = \text{No}$ (loan refused).
The optimal feature is selected by minimizing the Gini index.
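Every subset in this example is scored with the two-class form of the Gini index. As a brief reminder, and assuming $p$ denotes the fraction of samples in a subset $D_i$ that belong to the first class (a symbol introduced here only for illustration), the general definition reduces as follows:

```latex
\operatorname{Gini}(D_i) = 1 - \sum_{k=1}^{2} p_k^{2}
                         = 1 - p^{2} - (1-p)^{2}
                         = 2p(1-p)
```

This is why each subset term in the calculations has the shape $2 \times p \times (1-p)$, for instance $2 \times \frac{2}{5} \times \frac{3}{5}$ for a subset of five samples with two in one class and three in the other.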
For a feature $A$ that splits $D$ into $D_1$ and $D_2$, the Gini index of the sample set $D$ is:
$$\operatorname{Gini}(D, A)=\frac{\left|D_{1}\right|}{|D|} \operatorname{Gini}\left(D_{1}\right)+\frac{\left|D_{2}\right|}{|D|} \operatorname{Gini}\left(D_{2}\right)$$
The first feature: age
Three feature values: youth $A_{11}$, middle-aged $A_{12}$, and elderly $A_{13}$
- Split by youth versus non-youth
$\operatorname{Gini}\left(D_{1}\right)=2 \times \frac{2}{5} \times \frac{3}{5}=\frac{12}{25}$
Weight $W_1=\frac{|D_1|}{|D|}=\frac{5}{15}$
$\operatorname{Gini}\left(D_{2}\right)=2 \times \frac{3}{10} \times \frac{7}{10}=\frac{42}{100}$
Weight $W_2=\frac{|D_2|}{|D|}=\frac{10}{15}$
$\operatorname{Gini}\left(D, A_{11}\right)=\frac{5}{15} \times \frac{12}{25}+\frac{10}{15} \times \frac{42}{100}=0.44$
- Split by middle-aged versus non-middle-aged
$\operatorname{Gini}\left(D_{1}\right)=2 \times \frac{2}{5} \times \frac{3}{5}=\frac{12}{25}$
Weight $W_1=\frac{|D_1|}{|D|}=\frac{5}{15}$
$\operatorname{Gini}\left(D_{2}\right)=2 \times \frac{4}{10} \times \frac{6}{10}=\frac{48}{100}$
Weight $W_2=\frac{|D_2|}{|D|}=\frac{10}{15}$
$\operatorname{Gini}\left(D, A_{12}\right)=\frac{5}{15} \times \frac{12}{25}+\frac{10}{15} \times \frac{48}{100}=0.48$
- Split by elderly versus non-elderly
$\operatorname{Gini}\left(D_{1}\right)=2 \times \frac{1}{5} \times \frac{4}{5}=\frac{8}{25}$
Weight $W_1=\frac{|D_1|}{|D|}=\frac{5}{15}$
$\operatorname{Gini}\left(D_{2}\right)=2 \times \frac{5}{10} \times \frac{5}{10}=\frac{1}{2}$
Weight $W_2=\frac{|D_2|}{|D|}=\frac{10}{15}$
$\operatorname{Gini}\left(D, A_{13}\right)=\frac{5}{15} \times \frac{8}{25}+\frac{10}{15} \times \frac{1}{2}=0.44$
**This shows that youth and elderly share the smallest Gini index, 0.44, so either can serve as the optimal split point for age.**
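The three age calculations can be checked numerically. The sketch below is only an illustration: the helper names `gini_binary` and `gini_split` are made up for this example, and the count pairs are read off the numerators of the fractions in the calculations (the order inside a pair does not matter, because $2p(1-p)$ is symmetric).

```python
def gini_binary(a, b):
    """Gini index 2p(1-p) of a two-class subset with class counts a and b."""
    total = a + b
    if total == 0:
        return 0.0
    p = a / total
    return 2 * p * (1 - p)


def gini_split(d1, d2):
    """Weighted Gini index of a binary split; d1 and d2 are pairs of class counts."""
    n1, n2 = sum(d1), sum(d2)
    n = n1 + n2
    return n1 / n * gini_binary(*d1) + n2 / n * gini_binary(*d2)


# value -> (class counts in D1, class counts in D2)
age_splits = {
    "youth": ((2, 3), (3, 7)),
    "middle-aged": ((2, 3), (4, 6)),
    "elderly": ((1, 4), (5, 5)),
}

age_gini = {value: gini_split(d1, d2) for value, (d1, d2) in age_splits.items()}
for value, g in age_gini.items():
    print(f"{value}: {g:.2f}")
# youth: 0.44
# middle-aged: 0.48
# elderly: 0.44
```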
The second feature: job
Two feature values: has a job $A_{21}$, no job $A_{22}$
A feature with only two values produces a single binary split, so there is only one Gini index to compute.
$\operatorname{Gini}\left(D_{1}\right)=2 \times \frac{0}{5} \times \frac{5}{5}=0$
Weight $W_1=\frac{|D_1|}{|D|}=\frac{5}{15}$
$\operatorname{Gini}\left(D_{2}\right)=2 \times \frac{6}{10} \times \frac{4}{10}=\frac{48}{100}$
Weight $W_2=\frac{|D_2|}{|D|}=\frac{10}{15}$
$\operatorname{Gini}\left(D, A_{2}\right)=\frac{5}{15} \times 0+\frac{10}{15} \times \frac{48}{100}=0.32$
The third feature: house
Two feature values: has a house $A_{31}$, no house $A_{32}$
$\operatorname{Gini}\left(D_{1}\right)=2 \times \frac{0}{6} \times \frac{6}{6}=0$
Weight $W_1=\frac{|D_1|}{|D|}=\frac{6}{15}$
$\operatorname{Gini}\left(D_{2}\right)=2 \times \frac{3}{9} \times \frac{6}{9}=\frac{36}{81}$
Weight $W_2=\frac{|D_2|}{|D|}=\frac{9}{15}$
$\operatorname{Gini}\left(D, A_{3}\right)=\frac{6}{15} \times 0+\frac{9}{15} \times \frac{36}{81}=0.27$
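Because both of these features are binary, their Gini indices are easy to verify with exact arithmetic. The following sketch uses Python's `fractions.Fraction`. The helper names are again illustrative assumptions, and the count pairs come from the fractions in the two calculations.

```python
from fractions import Fraction


def gini_binary(a, b):
    """Exact Gini index 2p(1-p) of a two-class subset with class counts a and b."""
    p = Fraction(a, a + b)
    return 2 * p * (1 - p)


def gini_split(d1, d2):
    """Exact weighted Gini index of a binary split given two pairs of class counts."""
    n1, n2 = sum(d1), sum(d2)
    n = n1 + n2
    return Fraction(n1, n) * gini_binary(*d1) + Fraction(n2, n) * gini_binary(*d2)


job = gini_split((0, 5), (6, 4))    # has a job vs. no job
house = gini_split((0, 6), (3, 6))  # has a house vs. no house

print(job)    # 8/25, i.e. 0.32
print(house)  # 4/15, i.e. about 0.27
```

The exact value for the no-house subset is $\operatorname{Gini}(D_2) = \frac{4}{9} = \frac{36}{81}$, which is what yields the 0.27 reported for the house feature.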
The fourth feature: credit status
Three feature values: very good $A_{41}$, good $A_{42}$, and general $A_{43}$
- Split by very good versus not very good
$\operatorname{Gini}\left(D_{1}\right)=2 \times \frac{0}{4} \times \frac{4}{4}=0$
Weight $W_1=\frac{|D_1|}{|D|}=\frac{4}{15}$
$\operatorname{Gini}\left(D_{2}\right)=2 \times \frac{6}{11} \times \frac{5}{11}=\frac{60}{121}$
Weight $W_2=\frac{|D_2|}{|D|}=\frac{11}{15}$
$\operatorname{Gini}\left(D, A_{41}\right)=\frac{4}{15} \times 0+\frac{11}{15} \times \frac{60}{121}=0.36$
- Split by good versus not good
$\operatorname{Gini}\left(D_{1}\right)=2 \times \frac{2}{6} \times \frac{4}{6}=\frac{16}{36}$
Weight $W_1=\frac{|D_1|}{|D|}=\frac{6}{15}$
$\operatorname{Gini}\left(D_{2}\right)=2 \times \frac{4}{9} \times \frac{5}{9}=\frac{40}{81}$
Weight $W_2=\frac{|D_2|}{|D|}=\frac{9}{15}$
$\operatorname{Gini}\left(D, A_{42}\right)=\frac{6}{15} \times \frac{16}{36}+\frac{9}{15} \times \frac{40}{81}=0.47$
- Split by general versus not general
$\operatorname{Gini}\left(D_{1}\right)=2 \times \frac{4}{5} \times \frac{1}{5}=\frac{8}{25}$
Weight $W_1=\frac{|D_1|}{|D|}=\frac{5}{15}$
$\operatorname{Gini}\left(D_{2}\right)=2 \times \frac{2}{10} \times \frac{8}{10}=\frac{32}{100}$
Weight $W_2=\frac{|D_2|}{|D|}=\frac{10}{15}$
$\operatorname{Gini}\left(D, A_{43}\right)=\frac{5}{15} \times \frac{8}{25}+\frac{10}{15} \times \frac{32}{100}=0.32$
**This shows that the value "general" has the smallest Gini index, 0.32, so it is the optimal split point for credit status.**
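Choosing the optimal split point of a multi-valued feature amounts to taking the minimum over its candidate splits. The sketch below shows one way to do that. `best_split_value` is a hypothetical helper written for this example, and its tie-breaking rule (the first listed value wins) is an assumption, since the text allows either value when two are tied.

```python
def gini_binary(a, b):
    """Gini index 2p(1-p) of a two-class subset with class counts a and b."""
    total = a + b
    if total == 0:
        return 0.0
    p = a / total
    return 2 * p * (1 - p)


def gini_split(d1, d2):
    """Weighted Gini index of a binary split given two pairs of class counts."""
    n1, n2 = sum(d1), sum(d2)
    n = n1 + n2
    return n1 / n * gini_binary(*d1) + n2 / n * gini_binary(*d2)


def best_split_value(splits):
    """Return (value, gini) with the smallest weighted Gini index.

    Ties go to the value listed first, because min() keeps the first minimum.
    """
    scores = {value: gini_split(d1, d2) for value, (d1, d2) in splits.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# value -> (class counts in D1, class counts in D2), from the fractions in the text
credit_splits = {
    "very good": ((0, 4), (6, 5)),
    "good": ((2, 4), (4, 5)),
    "general": ((4, 1), (2, 8)),
}

value, score = best_split_value(credit_splits)
print(value, round(score, 2))  # general 0.32
```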
Comparison of the Gini indices of the four features, each at its optimal split point:
Feature | Gini index |
---|---|
Age | 0.44 |
Job | 0.32 |
House | 0.27 |
Credit status | 0.32 |
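The house feature has the smallest Gini index, so it is selected as the optimal feature and becomes the root split of the binary tree. The samples that have a house all belong to one class (their Gini index is 0), so that child node needs no further splitting. The no-house child node is split again in the same way, using the remaining features: age, job, and credit status.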
- In the no-house data set, split by the age feature
Age | Number | Loan refused | Loan approved |
---|---|---|---|
Youth | 4 | 3 | 1 |
Middle-aged | 2 | 2 | 0 |
Elderly | 3 | 1 | 2 |
- In the no-house data set, split by the job feature
Job | Number | Loan refused | Loan approved |
---|---|---|---|
Has a job | 3 | 0 | 3 |
No job | 6 | 6 | 0 |
- In the no-house data set, split by the credit status feature
Credit status | Number | Loan refused | Loan approved |
---|---|---|---|
Very good | 1 | 0 | 1 |
Good | 4 | 2 | 2 |
General | 4 | 4 | 0 |
The optimal feature can be selected from these counts in the same way as before. The Gini index of the job feature here is:
$$\operatorname{Gini}(D,A_2)=\frac{3}{9}\times2\times \frac{0}{3}\times \frac{3}{3}+\frac{6}{9}\times2\times \frac{6}{6}\times \frac{0}{6}=0$$
Zero is the smallest value the Gini index can take, so the job feature is selected to split this node. The remaining nodes are handled in the same way, which finally yields the complete classification tree.
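The selection at the no-house node can also be reproduced from the refused/approved counts listed for that subset. The sketch below is an illustration under stated assumptions: `split_gini` and the `no_house` dictionary layout are made up for this example, and each candidate split tests one feature value against all the others.

```python
def gini_binary(a, b):
    """Gini index 2p(1-p) of a two-class subset with class counts a and b."""
    total = a + b
    if total == 0:
        return 0.0
    p = a / total
    return 2 * p * (1 - p)


def split_gini(counts, value):
    """Weighted Gini index of the split 'feature == value' versus the rest.

    counts maps each feature value to its (refused, approved) counts.
    """
    d1 = counts[value]
    d2 = tuple(sum(c[i] for v, c in counts.items() if v != value) for i in range(2))
    n1, n2 = sum(d1), sum(d2)
    return (n1 * gini_binary(*d1) + n2 * gini_binary(*d2)) / (n1 + n2)


# (refused, approved) counts for the nine samples without a house
no_house = {
    "age": {"youth": (3, 1), "middle-aged": (2, 0), "elderly": (1, 2)},
    "job": {"has a job": (0, 3), "no job": (6, 0)},
    "credit status": {"very good": (0, 1), "good": (2, 2), "general": (4, 0)},
}

best = min(
    ((split_gini(counts, value), feature, value)
     for feature, counts in no_house.items()
     for value in counts),
    key=lambda t: t[0],
)
print(best)  # (0.0, 'job', 'has a job')
```

Both child nodes of the job split are pure, so under these counts the no-house branch needs no further splitting.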