Artificial Intelligence Algorithms Popular Explanation Series (4): XGBoost

The machine learning algorithm we introduce today is called XGBoost.

Don't be intimidated by the name; its fundamentals aren't complicated. To understand XGBoost, you first need to understand decision trees. Readers who haven't encountered decision trees yet can take a look at "Artificial Intelligence Algorithms Popular Explanation Series (3): Decision Trees".

Suppose we now have the following two decision trees, tree1 on the left and tree2 on the right. These trees are used to determine whether a user likes to play games, similar to the case in our previous article.

Take a look at the simpler tree on the right. The ellipse holds the user feature: whether the person uses a computer every day. "Yes" branches to the left and "No" to the right. From the data, we find that among people who use a computer every day, the proportion who play games is high; among those who rarely use a computer, the proportion is low. So we assign a larger weight, say 0.9, to the high-proportion side and a lower weight to the low-proportion side. Don't worry about how 0.9 is calculated; for now, just remember that a higher proportion (probability) gets a higher weight.

Suppose we only had the tree on the right. When a new user arrives, we can use it to estimate their game-playing preference. For example, if the new user uses a computer every day, we directly conclude "they probably like playing games". This won't be particularly accurate, but it's better than guessing blindly.
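As a minimal sketch, the right-hand tree can be written as a tiny scoring function in Python. The function name is made up for illustration, and the -0.9 weight for the "No" branch is the lower weight used in the worked example later in this article.

```python
def tree2_score(uses_computer_daily: bool) -> float:
    """Right-hand tree: a single split on daily computer use.

    Weights follow the article: 0.9 for the "Yes" branch,
    -0.9 for the "No" branch (the lower weight used in the
    article's second worked example).
    """
    return 0.9 if uses_computer_daily else -0.9

# A new user who uses a computer every day gets a positive score,
# so we predict they probably like playing games.
print(tree2_score(True))   # 0.9
print(tree2_score(False))  # -0.9
```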

But we can still do better, because we don't have just one tree; there is another decision tree we can use.

Now, let's look at the decision tree on the left. Its first split is: "is the age less than 15?". From the existing data, we find that on the "No" side, that is, people aged 15 or older, the proportion who play games is relatively low, so a low weight is assigned, say -1. People under 15 are then split further into male and female. It turns out that the proportion of male users who play games is significantly higher than that of female users, so the men get a higher weight, say 2, and the women a lower weight, say 0.1. Note that even the women's weight of 0.1 is still higher than the -1 given to the 15-or-older branch.

Then, when a new data point comes in, we combine the two trees to make a joint judgment. For example, suppose a new user is younger than 15, male, and uses a computer every day. To predict whether he likes playing games, we find his weight in each tree and add them up. He lands in the lower-left leaf of the first tree, with a weight of 2; he also lands in the left leaf of the second tree, with a weight of 0.9. Adding his weights from both trees gives a final score of 2.9.
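Here is a minimal sketch of this two-tree scoring in Python. The `User` fields and function names are made up for illustration; the leaf weights (2, 0.1, -1, 0.9, -0.9) are the ones used in this article.

```python
from dataclasses import dataclass

@dataclass
class User:
    age: int
    is_male: bool
    uses_computer_daily: bool

def tree1_score(user: User) -> float:
    """Left-hand tree: split on age, then on gender."""
    if user.age < 15:
        return 2.0 if user.is_male else 0.1
    return -1.0

def tree2_score(user: User) -> float:
    """Right-hand tree: split on daily computer use."""
    return 0.9 if user.uses_computer_daily else -0.9

def ensemble_score(user: User) -> float:
    """The final score is the sum of the leaf weights from both trees."""
    return tree1_score(user) + tree2_score(user)

# The boy from the article: under 15, male, uses a computer every day.
boy = User(age=10, is_male=True, uses_computer_daily=True)
print(ensemble_score(boy))  # 2.0 + 0.9 = 2.9
```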

In this way, we take three features into account together: age, gender, and computer use, which is more accurate than a single decision tree.

In the same way, a person who is 15 or older and who rarely uses a computer will score very low after the combined calculation: the age branch contributes -1 and the computer branch contributes -0.9, so the final score is -1.9. Therefore, we predict that he is unlikely to like games.

In actual use, there may be far more than three features, perhaps dozens or even hundreds. What should we do then?

For example, suppose we have 100 features and 1 million rows of data. We can randomly select 10 features and 100,000 rows to build one tree. Then we randomly select another 10 features and another 100,000 rows to build a second tree, and repeat the same procedure for the third tree, the fourth tree, and so on.

In the end we can generate dozens of trees, say 50; the exact number is up to us. These trees form a forest, and because it is randomly generated, it is called a random forest.

When we want to judge a new user, we drop that user into each tree, obtaining 50 weights. These 50 weights are then added up to get the final score. By comparing the final scores of different users, we can determine which category each one belongs to.
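Here is a rough sketch of the procedure described above, assuming scikit-learn's DecisionTreeRegressor is available for the individual trees. The data is synthetic, the sizes are scaled down from the article's 100 features and 1 million rows so the example runs quickly, and all names are made up for illustration; this is the random-sampling scheme the article describes, not the internals of the xgboost library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy stand-in for the data: smaller than the article's 100 features
# and 1,000,000 rows so the sketch runs in a second or two.
n_rows, n_features = 10_000, 100
X = rng.normal(size=(n_rows, n_features))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # made-up "likes games" label

n_trees, feats_per_tree, rows_per_tree = 50, 10, 1_000
forest = []  # each entry: (fitted tree, the feature columns it was given)

for _ in range(n_trees):
    # Randomly pick a subset of features and a subset of rows for this tree.
    feat_idx = rng.choice(n_features, size=feats_per_tree, replace=False)
    row_idx = rng.choice(n_rows, size=rows_per_tree, replace=False)
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X[np.ix_(row_idx, feat_idx)], y[row_idx])
    forest.append((tree, feat_idx))

def forest_score(x_new: np.ndarray) -> float:
    """Drop the new user into every tree and add up the 50 leaf weights."""
    return sum(tree.predict(x_new[feat_idx].reshape(1, -1))[0]
               for tree, feat_idx in forest)

new_user = rng.normal(size=n_features)
print(forest_score(new_user))  # higher total => more likely to like games
```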

The strategy formed by combining all the trees in the forest synthesizes all the features and the logic of their various combinations. Using this strategy to judge each new data point, accuracy improves greatly. This is why XGBoost has performed well in so many competitions.
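In practice, you usually don't build these trees by hand; the xgboost library does it for you. Below is a minimal sketch, assuming the xgboost Python package and NumPy are installed; the data is synthetic and the parameter choices are only illustrative, echoing the 50 trees mentioned above.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 features and a binary "likes games" label.
X = rng.normal(size=(5_000, 100))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 50 trees of limited depth; XGBoost sums the leaf weights across trees
# to produce each user's final score.
model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X, y)

new_user = rng.normal(size=(1, 100))
print(model.predict_proba(new_user)[0, 1])  # probability of liking games
```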
