Artificial Intelligence Algorithms Popular Explanation Series (3): Decision Trees

Today, the machine learning algorithm we introduce is the decision tree.

As before, let's start with a case before introducing the algorithm, and then look at how the algorithm can be used to solve it.

The case here is similar to the one in the lecture on the K-nearest-neighbor method, with a few changes. Let me briefly describe it: a company developed a game and collected some user data, as follows:

[Figure: user data scatter plot — age on the horizontal axis, gender on the vertical axis; red = likes the game, blue = doesn't]

    Each shape in the figure represents a user; the horizontal axis is age, and the vertical axis is gender. Red means the user likes the game, blue means the user doesn't. For example, the blue square in the lower right corner represents a woman in her fifties or sixties; blue means she doesn't like the game. The red triangle in the upper left corner represents a teenage boy; red means he likes the game.

    There is now a new user, shown in green. The company wants to know: Will this new user enjoy the game?

    From the picture alone, it is hard to make a reliable judgment at a glance. So let's first organize the users' attributes and then classify them. Using the two dimensions of gender and age, we can make a table, putting each user into the corresponding grid according to these conditions, as shown in the figure:

    Users who are male and younger than 30 fall into the first grid; there are 3 of them, and all of them like the game. Users who are female and older than 30 fall into the lower-right grid; there are 4 of them, and none of them like the game. In the other two grids, some users like the game and some don't.

    We can then build a tree for judging these users' preferences. For example, we first use gender as the judgment condition and divide the users into two parts, "male" and "female". Then, within each part, we divide the users again by age.

    The ovals in the figure represent judgment conditions, such as what the gender is, or whether the age is less than 30. The arrows are the judgment results: males go to the left and females to the right; those younger than 30 go to the left and those older than 30 go to the right.

    At the bottom are the leaf nodes of the tree, and each leaf node is a judgment result. Users who are male and younger than 30 end up at the leftmost leaf. This leaf node corresponds to the first grid of the table, where 100% of the users are red, so it is drawn as a red triangle. The value 1.0 means that the proportion of red users is 100%, or that the probability of a user there being red is 100%.

    The second leaf node is a blue square, corresponding to the data in the second grid: 80% of the users who meet these conditions are blue. So this leaf is blue, with a value of 0.8.

    The rest follow the same pattern: red 0.75 means 75% of the users are red, and blue 1.0 means 100% of the users are blue.

    When a new user arrives, we match his attributes against the tree, find the leaf node he lands on, and judge his preference from that leaf's color and value. For example, take a 15-year-old boy and see where he belongs in this tree. Since he is male, he is in the left half; since he is 15, which is less than 30, he lands on the leftmost leaf. That leaf is red with a value of 1.0, that is, the probability of liking the game is 100%, so we judge that he very likely likes this game. For another example, here comes a 25-year-old lady. Because she is a woman, she is in the other half of the tree; since she is 25, which is less than 30, she lands on the third leaf. That leaf is red with a value of 0.75, so we judge that the probability she likes the game is 75%: she probably likes it. If we put a 45-year-old woman on the tree, she lands on the fourth leaf, whose color is blue and whose value is 1.0. Therefore, there is a 100% chance that she doesn't like the game, or a 0% chance that she likes it.

This tree is a decision tree.
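As a sketch of how such a lookup might work in code: the nested-dictionary encoding below is my own choice, but the colors, values, and the age threshold of 30 all come from the tree above.

# The decision tree above as a nested dictionary (encoding is my own).
tree = {
    "male":   {"under 30": ("red", 1.00), "30 or over": ("blue", 0.80)},
    "female": {"under 30": ("red", 0.75), "30 or over": ("blue", 1.00)},
}

def predict(gender, age):
    # Walk from the root to a leaf: first the gender test, then the age test.
    return tree[gender]["under 30" if age < 30 else "30 or over"]

print(predict("male", 15))    # ('red', 1.0)  -> likes the game, probability 100%
print(predict("female", 25))  # ('red', 0.75) -> probably likes it
print(predict("female", 45))  # ('blue', 1.0) -> doesn't like it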

We could also create another tree and reverse the order of the conditions, with age at the first level and gender at the second level.

Both trees have the same leaves, just in a different order, so making judgments with either tree gives the same result. However, the two trees are not equivalent!

The difference between them is that the two attributes "gender" and "age" are not of the same importance!

What does importance mean?

Think of it this way: if you were allowed to use only one attribute, which one would you choose? In other words, which single attribute would make your judgment more accurate? Let's experiment.

1. Suppose we can only use the "gender" attribute.

Judging by gender, we find that half of the male users are red and half are blue, so we cannot tell whether a male user is more likely to be red or blue. Among the female users, 3 are red and 5 are blue, so we can vaguely conclude: "Female users don't like this game very much." However, when you make this judgment, you lack confidence; after all, there are still three women who like to play.


2. Suppose we can only use the attribute "age".

We find that there are 7 users under the age of 30, of which 6 are red, or 85.7%. There are 9 users over 30, of which 8 are blue, or 88.9%. Therefore, we can boldly conclude: "Users younger than 30 like the game, and users older than 30 don't." When you make this judgment, you can be much more confident!

Intuitively, you will feel that the attribute "age" is more important!

Yes, your intuition was right!

    The reason is that splitting the data on "gender" leaves it still mixed up, while splitting on "age" makes it nearly deterministic.

    The degree of certainty or uncertainty is expressed in information theory by a quantity called "entropy". The word "entropy" was originally a concept in thermodynamics, used to indicate the degree of disorder in a thermodynamic system: the larger the entropy, the more disorder; the smaller the entropy, the more order. In 1948, Shannon introduced it into information theory and gave the formula for information entropy: the larger the information entropy, the greater the uncertainty; the smaller the entropy, the smaller the uncertainty. This is considered one of the most important contributions of the 20th century!

    In the figure above, if we split the data by the "gender" attribute, the users' colors remain uncertain, so the entropy is relatively large. If we split the data by the "age" attribute, the users' colors are basically determined, so the entropy is small.

    Therefore, the importance of an attribute can be judged by the entropy its split produces: the attribute that makes the entropy smaller is the more important one!

    Entropy has a precise formula, which I won't write out here; when I get the chance, I'll cover it in a future advanced course.

    Since entropy can be calculated, the importance of attributes can be calculated. We compute the entropy produced by each attribute and sort from smallest to largest; the attribute with the smallest entropy is the most important, and we put it at the root node of the decision tree.

    Then, on each branch, we compute the most important attribute among the remaining attributes and place it on the next-level node, and so on. (A small sketch of this entropy calculation follows.)
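    Although the exact formula is deferred to a later course, here is a minimal sketch of the standard Shannon calculation, applied to the red/blue counts from the two experiments above; the function names are my own.

from math import log2

def entropy(counts):
    # Shannon entropy, in bits, of a list of class counts such as (red, blue).
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_entropy(groups):
    # Uncertainty left after a split: the groups' entropies averaged,
    # weighted by group size.
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * entropy(g) for g in groups)

# (red, blue) counts taken from the two experiments above:
print(split_entropy([(4, 4), (3, 5)]))  # split by gender -> about 0.977 bits
print(split_entropy([(6, 1), (1, 8)]))  # split by age    -> about 0.542 bits

    Splitting by age leaves far less uncertainty than splitting by gender, which is exactly why "age" deserves the root position.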

    When we have many attributes, those placed near the bottom of the tree may have negligible influence, so we can drop them; this greatly simplifies the structure of the decision tree so that it contains only the important attributes. In some cases, this clipping of branches and leaves can effectively avoid overfitting. I'll talk about what "overfitting" is later; interested readers can also look it up themselves.

    For the two trees above, we can try a bit of pruning, as shown below. The tree rooted at age still predicts well after pruning: even after the right side is clipped, it retains an 88.9% prediction accuracy there, which is high enough. Pruning the tree rooted at gender is more problematic, because its right leaf predicts correctly only 62.5% of the time. It is clear that "age" makes the better root.

[Figure: the two trees after pruning]
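    As a quick back-of-the-envelope check (my own arithmetic, based on the grid counts above), here is the overall accuracy of each tree when cut back to its root split, with each leaf predicting its majority color:

# Gender as root: the male leaf gets 4 of its 8 users right,
# the female leaf 5 of its 8.
print((4 + 5) / 16)   # 0.5625 -> only about 56% overall
# Age as root: the under-30 leaf gets 6 of 7 right, the over-30 leaf 8 of 9.
print((6 + 8) / 16)   # 0.875  -> 87.5% overall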

The following pseudocode shows how a decision tree is built. The createBranch function creates one branch of the tree; note its recursive structure.

createBranch() {
    check whether every item in the dataset belongs to the same class
    if all items belong to the same class {
        return the class label
    } else {
        find the best feature for splitting the dataset
        split the dataset
        create a branch node
        for each subset of the split {
            call createBranch and add the result to the branch node
        }
        return the branch node
    }
}
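To see this recursion in action without writing all the helper functions ourselves, here is a minimal runnable sketch using scikit-learn; the article names no library, so this choice is my own, and the exact ages below are invented, reconstructed only from the grid counts given earlier.

from sklearn.tree import DecisionTreeClassifier

# The 16 users, rebuilt from the grid counts in the article
# (only the age bracket matters; the precise ages are made up).
# Features: [age, gender], with gender encoded as 0 = male, 1 = female.
X = [
    [15, 0], [22, 0], [28, 0],                    # males under 30: all like it
    [35, 0], [40, 0], [45, 0], [50, 0], [55, 0],  # males over 30: 1 of 5 likes it
    [18, 1], [21, 1], [25, 1], [29, 1],           # females under 30: 3 of 4 like it
    [38, 1], [44, 1], [52, 1], [60, 1],           # females over 30: none like it
]
y = [1, 1, 1,  1, 0, 0, 0, 0,  1, 1, 1, 0,  0, 0, 0, 0]  # 1 = likes the game

# criterion="entropy" makes the library pick the split that lowers entropy
# the most, mirroring the article's argument; max_depth=2 gives the same
# two-level tree, and a smaller max_depth acts as a crude form of pruning.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2)
clf.fit(X, y)

# The 15-year-old boy and the 45-year-old woman from the article:
print(clf.predict_proba([[15, 0], [45, 1]]))  # per user: P(dislike), P(like)

With these inputs, the learned root split should come out on age, matching the entropy comparison above.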


Related articles:

Artificial Intelligence Algorithms Popular Explanation Series (1): K-Nearest Neighbors

Artificial Intelligence Algorithms Popular Explanation Series (2): Logistic Regression

Artificial Intelligence Algorithms Popular Explanation Series (3): Decision Trees
