Decision Tree-CART Algorithm (Part 2)

CART classification tree algorithm


Interpretation of CART classification tree algorithm

Input: data set $D$, feature set $A$, stopping threshold $\epsilon$

Output: CART classification decision tree

Steps:

  1. Starting from the root node, recursively perform the following operations on each node to build a binary tree.

  2. Calculate the Gini index of the data set for each available feature and select the optimal feature.

    - For a feature $A_g$ and each value $g$ it can take, split $D$ into two parts $D_1$ and $D_2$ according to whether the test $A_g = g$ on each sample is "yes" or "no", and calculate the Gini index for $A_g = g$.
    - Select the value with the smallest Gini index as the optimal split point for that feature.
    - Find the optimal split point for every feature, compare the Gini indices at those split points, and select the feature with the smallest Gini index as the optimal feature.

  3. According to the optimal feature and the optimal split point, generate two child nodes and assign the data set to the corresponding child nodes.

     That is, the node is split in two at the optimal split point, which is what makes the tree binary.
    
  4. Recursively apply the above steps to each of the two child nodes until the stopping condition is met. The result is the CART classification decision tree.

    The stopping condition is generally a threshold: recursion stops when the Gini index of a node falls below $\epsilon$ (the samples basically belong to the same class), or when there are no more features.
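The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `gini`, `best_split` and `build_tree`, the one-dict-per-sample data layout and the nested-dict tree are all invented for this example, and ties between equally good splits are broken by taking the first one found.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (gini, feature, value) for the test 'feature == value' with the
    smallest weighted Gini index, or None if no test splits the node."""
    best = None
    n = len(rows)
    for f in features:
        for v in sorted({r[f] for r in rows}):
            yes = [y for r, y in zip(rows, labels) if r[f] == v]
            no = [y for r, y in zip(rows, labels) if r[f] != v]
            if not yes or not no:
                continue  # this test does not actually divide the samples
            g = len(yes) / n * gini(yes) + len(no) / n * gini(no)
            if best is None or g < best[0]:
                best = (g, f, v)
    return best

def build_tree(rows, labels, features, eps=1e-9):
    """Recursively build a binary tree; a leaf is the majority class label."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is (almost) pure or no features are left.
    if gini(labels) < eps or not features:
        return majority
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    _, f, v = split
    yes_part = [(r, y) for r, y in zip(rows, labels) if r[f] == v]
    no_part = [(r, y) for r, y in zip(rows, labels) if r[f] != v]
    return {
        "test": (f, v),
        "yes": build_tree([r for r, _ in yes_part], [y for _, y in yes_part], features, eps),
        "no": build_tree([r for r, _ in no_part], [y for _, y in no_part], features, eps),
    }

# Tiny made-up data set, only to show the call shape.
rows = [{"job": "yes"}, {"job": "yes"}, {"job": "no"}]
labels = ["Yes", "Yes", "No"]
tree = build_tree(rows, labels, ["job"])
print(tree)
```

Each internal node stores the test it applies, and the `"yes"` and `"no"` keys hold the two child nodes from step 3.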


CART classification tree worked example

Training set $D$. The feature set is: age $A_1$, has a job $A_2$, owns a house $A_3$, credit status $A_4$.

The classes are: $Y_1$ = Yes, $Y_2$ = No

Select the optimal feature by minimizing the Gini index.
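The calculations in this example all use the two-class form of the Gini index. For reference, this is the standard definition, where $p_k$ is the proportion of samples in class $k$ (the symbols $K$, $p_k$ and $p$ are introduced here and do not appear in the original text):

$$\operatorname{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2, \qquad \text{which for } K = 2 \text{ reduces to } \operatorname{Gini}(D) = 2p(1-p)$$

A node whose samples all belong to one class has $p = 0$ or $p = 1$ and therefore a Gini index of 0.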

​For feature A, the Gini index of sample set D is:
$$\operatorname{Gini}(D, A) = \frac{|D_1|}{|D|} \operatorname{Gini}(D_1) + \frac{|D_2|}{|D|} \operatorname{Gini}(D_2)$$

The first feature: age

Three values: youth $A_{11}$, middle-aged $A_{12}$ and elderly $A_{13}$

CART only makes binary splits, so each value is tested against the other two:

  1. Split by youth and non-youth
    $\operatorname{Gini}(D_1) = 2 \times \frac{2}{5} \times \frac{3}{5} = \frac{12}{25}$
    Weight $W_1 = \frac{5}{15}$
    $\operatorname{Gini}(D_2) = 2 \times \frac{3}{10} \times \frac{7}{10} = \frac{42}{100}$
    Weight $W_2 = \frac{10}{15}$

    Substituting into $\operatorname{Gini}(D, A)$:

    $\operatorname{Gini}(D, A_{11}) = \frac{5}{15} \times \frac{12}{25} + \frac{10}{15} \times \frac{42}{100} = 0.44$

  2. Split by middle-aged and non-middle-aged
    $\operatorname{Gini}(D_1) = 2 \times \frac{2}{5} \times \frac{3}{5} = \frac{12}{25}$
    Weight $W_1 = \frac{5}{15}$
    $\operatorname{Gini}(D_2) = 2 \times \frac{4}{10} \times \frac{6}{10} = \frac{48}{100}$
    Weight $W_2 = \frac{10}{15}$

    Substituting into $\operatorname{Gini}(D, A)$:

    $\operatorname{Gini}(D, A_{12}) = \frac{5}{15} \times \frac{12}{25} + \frac{10}{15} \times \frac{48}{100} = 0.48$

  3. Split by elderly and non-elderly
    $\operatorname{Gini}(D_1) = 2 \times \frac{1}{5} \times \frac{4}{5} = \frac{8}{25}$
    Weight $W_1 = \frac{5}{15}$
    $\operatorname{Gini}(D_2) = 2 \times \frac{5}{10} \times \frac{5}{10} = \frac{1}{2}$
    Weight $W_2 = \frac{10}{15}$

    Substituting into $\operatorname{Gini}(D, A)$:

    $\operatorname{Gini}(D, A_{13}) = \frac{5}{15} \times \frac{8}{25} + \frac{10}{15} \times \frac{1}{2} = 0.44$

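**From this we can see that youth and elderly both have the smallest Gini index, 0.44, so either can serve as the optimal split point.**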
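The three age splits can be checked with exact fractions. The sketch below is only a verification aid: the helper names `gini2` and `split_gini` are made up for this example, and the class counts are read off the fractions in the calculations above. Which count belongs to which class does not matter here, because $2p(1-p)$ is symmetric.

```python
from fractions import Fraction

def gini2(a, b):
    """Two-class Gini index 2p(1-p) for a node with class counts a and b."""
    p = Fraction(a, a + b)
    return 2 * p * (1 - p)

def split_gini(part1, part2):
    """Weighted Gini index of a binary split; each part is a pair of class counts."""
    n1, n2 = sum(part1), sum(part2)
    return Fraction(n1, n1 + n2) * gini2(*part1) + Fraction(n2, n1 + n2) * gini2(*part2)

# Class counts of (D1, D2) for each candidate split of the age feature.
age_splits = {
    "youth": ((2, 3), (3, 7)),
    "middle-aged": ((2, 3), (4, 6)),
    "elderly": ((1, 4), (5, 5)),
}

age_gini = {value: split_gini(d1, d2) for value, (d1, d2) in age_splits.items()}
for value, g in age_gini.items():
    print(f"{value}: {g} = {float(g):.2f}")
```

Using `Fraction` instead of floats keeps the tie between youth and elderly exact (both come out as 11/25) rather than depending on rounding.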

The second feature: job

Two values: has a job $A_{21}$, no job $A_{22}$


$\operatorname{Gini}(D_1) = 2 \times \frac{0}{5} \times \frac{5}{5} = 0$
Weight $W_1 = \frac{5}{15}$
$\operatorname{Gini}(D_2) = 2 \times \frac{6}{10} \times \frac{4}{10} = \frac{48}{100}$
Weight $W_2 = \frac{10}{15}$

Substituting into $\operatorname{Gini}(D, A)$:

$\operatorname{Gini}(D, A_2) = \frac{5}{15} \times 0 + \frac{10}{15} \times \frac{48}{100} = 0.32$

The third feature: house

Two values: has a house $A_{31}$, no house $A_{32}$


$\operatorname{Gini}(D_1) = 2 \times \frac{0}{6} \times \frac{6}{6} = 0$
Weight $W_1 = \frac{6}{15}$
$\operatorname{Gini}(D_2) = 2 \times \frac{3}{9} \times \frac{6}{9} = \frac{36}{81}$
Weight $W_2 = \frac{9}{15}$

Substituting into $\operatorname{Gini}(D, A)$:

$\operatorname{Gini}(D, A_3) = \frac{6}{15} \times 0 + \frac{9}{15} \times \frac{36}{81} = 0.27$

The fourth feature: credit status

Three values: very good $A_{41}$, good $A_{42}$ and general $A_{43}$

  1. Split by very good and not very good
    $\operatorname{Gini}(D_1) = 2 \times \frac{0}{4} \times \frac{4}{4} = 0$
    Weight $W_1 = \frac{4}{15}$
    $\operatorname{Gini}(D_2) = 2 \times \frac{6}{11} \times \frac{5}{11} = \frac{60}{121}$
    Weight $W_2 = \frac{11}{15}$

    Substituting into $\operatorname{Gini}(D, A)$:

    $\operatorname{Gini}(D, A_{41}) = \frac{4}{15} \times 0 + \frac{11}{15} \times \frac{60}{121} = 0.36$

  2. Split by good and not good
    $\operatorname{Gini}(D_1) = 2 \times \frac{2}{6} \times \frac{4}{6} = \frac{16}{36}$
    Weight $W_1 = \frac{6}{15}$
    $\operatorname{Gini}(D_2) = 2 \times \frac{4}{9} \times \frac{5}{9} = \frac{40}{81}$
    Weight $W_2 = \frac{9}{15}$

    Substituting into $\operatorname{Gini}(D, A)$:

    $\operatorname{Gini}(D, A_{42}) = \frac{6}{15} \times \frac{16}{36} + \frac{9}{15} \times \frac{40}{81} = 0.47$

  3. Split by general and not general
    $\operatorname{Gini}(D_1) = 2 \times \frac{4}{5} \times \frac{1}{5} = \frac{8}{25}$
    Weight $W_1 = \frac{5}{15}$
    $\operatorname{Gini}(D_2) = 2 \times \frac{2}{10} \times \frac{8}{10} = \frac{32}{100}$
    Weight $W_2 = \frac{10}{15}$

    Substituting into $\operatorname{Gini}(D, A)$:

    $\operatorname{Gini}(D, A_{43}) = \frac{5}{15} \times \frac{8}{25} + \frac{10}{15} \times \frac{32}{100} = 0.32$

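**From this we can see that the value "general" has the smallest Gini index, 0.32, so it is the optimal split point for this feature.**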

Comparison of the Gini indices of the four features

| Feature | Smallest Gini index |
|---|---|
| Age | 0.44 |
| Job | 0.32 |
| House | 0.27 |
| Credit status | 0.32 |
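The comparison above can be reproduced by enumerating every "one value versus the rest" split for each feature. The sketch below is a hedged illustration: the `counts` table and the function names are assumptions made for this example, with the (Yes, No) class counts per value inferred from the fractions used in the hand calculations rather than taken from a data table in the text.

```python
from fractions import Fraction

# Assumed (Yes, No) class counts for each value of each feature.
counts = {
    "age": {"youth": (2, 3), "middle-aged": (3, 2), "elderly": (4, 1)},
    "job": {"yes": (5, 0), "no": (4, 6)},
    "house": {"yes": (6, 0), "no": (3, 6)},
    "credit": {"very good": (4, 0), "good": (4, 2), "general": (1, 4)},
}

def gini2(yes, no):
    """Two-class Gini index 2p(1-p); an empty node counts as pure."""
    n = yes + no
    if n == 0:
        return Fraction(0)
    p = Fraction(yes, n)
    return 2 * p * (1 - p)

def best_value(table):
    """Return (gini, value) for the best 'value vs. rest' split of one feature."""
    total_yes = sum(y for y, _ in table.values())
    total_no = sum(n for _, n in table.values())
    total = total_yes + total_no
    results = []
    for value, (y, n) in table.items():
        rest_y, rest_n = total_yes - y, total_no - n
        g = (Fraction(y + n, total) * gini2(y, n)
             + Fraction(rest_y + rest_n, total) * gini2(rest_y, rest_n))
        results.append((g, value))
    return min(results)  # ties are broken alphabetically by value name

summary = {feature: best_value(table) for feature, table in counts.items()}
for feature, (g, value) in summary.items():
    print(f"{feature:8s} best value = {value:12s} Gini = {float(g):.2f}")

best_feature = min(summary, key=lambda f: summary[f][0])
print("optimal feature:", best_feature)
```

For a two-valued feature such as job or house, both values describe the same split, so the value reported for them is arbitrary.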

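The Gini index of the house feature is the smallest, so house is selected as the optimal feature and used for the first split of the binary tree. The has-house subset already has a Gini index of 0, so only the no-house subset needs further splitting.

As above, the no-house subset is split by age, job and credit status: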

  1. In the no-house data set, split by the age feature

    | Age | Number | Do not agree to loan | Agree to loan |
    |---|---|---|---|
    | Youth | 4 | 3 | 1 |
    | Middle-aged | 2 | 2 | 0 |
    | Elderly | 3 | 1 | 2 |

  2. In the no-house data set, split by the job feature

    | Job | Number | Do not agree to loan | Agree to loan |
    |---|---|---|---|
    | Has a job | 3 | 0 | 3 |
    | No job | 6 | 6 | 0 |

  3. In the no-house data set, split by the credit status feature

    | Credit status | Number | Do not agree to loan | Agree to loan |
    |---|---|---|---|
    | Very good | 1 | 0 | 1 |
    | Good | 4 | 2 | 2 |
    | General | 4 | 4 | 0 |

    The Gini index of each feature can be calculated from these tables in the same way. For the job feature it is clearly:
    $$\operatorname{Gini}(D, A_2) = \frac{3}{9} \times 2 \times \frac{0}{3} \times \frac{3}{3} + \frac{6}{9} \times 2 \times \frac{6}{6} \times \frac{0}{6} = 0$$
    where $D$ here denotes the no-house subset. No split can do better than 0, so job is selected as the feature for this node.

    The remaining nodes follow by analogy, and finally the complete classification tree can be drawn.



Origin blog.csdn.net/qq_44795788/article/details/124675120