Artificial Intelligence Algorithms Popular Explanation Series (1): K-Nearest Neighbors

    Today I will introduce a very simple machine learning algorithm called k-nearest neighbors, or KNN for short.

    Before introducing the algorithm, let's look at a case. Afterwards, we will see how k-nearest neighbors can solve the problem it poses.

    A game company developed a game. Some time after the release, they had collected some user data. The data is shown in the picture below:

    Each shape in the figure represents a user. A red triangle means the user likes the game; a blue square means the user doesn't like it. The figure has two axes: the horizontal axis is the user's age, and the vertical axis is the average time the user spends on the phone each day, in minutes.

    Take the blue square in the lower right corner. This user is about 50 years old and spends about 30 minutes on the phone per day on average. The shape is blue, which means this user doesn't like the game. Another example is the red triangle in the upper left corner. That user is under 15 and spends 240 minutes, or 4 hours, on the phone every day. The shape is red, which means this user likes the game. The other users can be read the same way.

    These are the known users. Now the company gets a new batch of users through certain channels. One of them is shown as a green dot in the figure. This user is about 30 years old and spends about 200 minutes on the phone every day.
    The company wants to know: should they promote the game to this user?

    Note that promoting the game has a cost: the company has to pay a promotion fee. So it only makes money by promoting the game to users who like it.

    So the question becomes: predict whether the new green user will like the game.

    How do we predict that?

    Here we need an assumption: the closer two people's attributes are, the more similar their behavioral preferences are.

    For example, two people in their 20s may have similar behavioral preferences, while a person in their 20s and a person in their 50s will likely differ quite a bit. That is, the closer two people are in age, the more similar their preferences.

    As another example, two engineers may have similar behavioral preferences, while an engineer and an actor will likely differ quite a bit. That is, the closer two people are in profession, the more similar their preferences.

    To generalize, people with similar attributes tend to have similar behavioral preferences. This matches the common saying, "Birds of a feather flock together."

    Back to our case. To find out which type of user the green user resembles, we can compare attributes and see whom this user is most similar to. Put another way: who are the people close to this user?

    "Closeness" is a concept of distance. On a two-dimensional plane, it means which shapes lie a short distance from the green dot.
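    As a rough illustration of "distance" on this plane, here is a minimal sketch. The article does not name a specific distance formula, so the straight-line (Euclidean) distance used here is an assumption. The coordinates are read off the description above, and the age of 14 for the red triangle is a hypothetical stand-in for "under 15".

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two (age, minutes per day) points."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

green = (30, 200)         # the new user
blue_square = (50, 30)    # lower right corner, doesn't like the game
red_triangle = (14, 240)  # upper left corner, likes the game (age is assumed)

d_blue = euclidean_distance(green, blue_square)
d_red = euclidean_distance(green, red_triangle)

# The red triangle is much closer to the green dot than the blue square is.
print(round(d_blue, 1), round(d_red, 1))
```

    Note that the two axes use different units (years and minutes), so with raw values the minutes axis dominates the distance. The sketch ignores that for simplicity.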

    We need to judge whether the new user (the green dot) likes the game, that is, whether this dot should be classified as red or as blue. Following the assumption above, we first find the shapes whose attributes are close to the new user's, that is, the neighbors, and look at what kind of neighbors they are. If the neighbors are all red, then the new user is very likely red too, since similar users tend to behave alike. If the neighbors are all blue, the new user is very likely blue.

    How many neighbors should we choose? That number is up to us: we can choose any K neighbors, where K is a positive integer. This is what the "K" in k-nearest neighbors means.

    For example, suppose we choose the 4 nearest neighbors and find that 3 of them are red and only 1 is blue. Red is the majority among the neighbors, so we judge that the new user is probably red too. Like attracts like! Red means the user likes the game, so the company should promote it to this user.
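    The voting step can be sketched in a few lines of Python. The figure's actual data points are not available, so the `known_users` list below is hypothetical, chosen so that 3 of the 4 nearest neighbors are red. The function name `knn_predict` and the use of straight-line distance are also assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical (age, minutes per day, label) records; not the article's real data.
known_users = [
    (32, 195, "blue"),
    (26, 210, "red"),
    (28, 185, "red"),
    (35, 215, "red"),
    (22, 220, "red"),
    (50, 30, "blue"),
]

def knn_predict(users, query, k):
    """Return the majority label among the k users closest to `query`."""
    by_distance = sorted(
        users,
        key=lambda u: math.hypot(u[0] - query[0], u[1] - query[1]),
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# The green user: about 30 years old, about 200 minutes per day.
prediction = knn_predict(known_users, (30, 200), k=4)
print(prediction)  # red
```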

    We can also widen the neighborhood, for example to 8 neighbors. This time we find 5 red and 3 blue. Red is still the majority, so we still judge this user to be red.

    However, not every choice of K leads to the same conclusion. For example, if we choose just 1 neighbor, we find that this neighbor is blue, so the conclusion would be that the user is blue. That contradicts the previous conclusion!

    A larger K makes the vote less sensitive to a single unusual neighbor, so the conclusion is more stable. But more is not always better: if K is too large, users far from the new user, who have little in common with the new user, also get a vote, and the amount of calculation grows and slows things down. Usually we choose an appropriate K according to the actual situation, so that the conclusion is reasonable and the calculation is not too heavy.
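    To see how the choice of K changes the answer, the sketch below counts the votes for K = 1, 4, and 8. The data is hypothetical, arranged to reproduce the vote counts described in this article (1 blue; 3 red and 1 blue; 5 red and 3 blue). The helper name `neighbor_votes` and the straight-line distance are assumptions.

```python
import math
from collections import Counter

# Hypothetical (age, minutes per day, label) records, listed from nearest
# to farthest from the new user. Not the article's real data.
known_users = [
    (32, 195, "blue"), (26, 210, "red"), (28, 185, "red"), (35, 215, "red"),
    (22, 220, "red"), (38, 178, "blue"), (20, 225, "red"), (42, 172, "blue"),
    (14, 240, "red"), (55, 90, "blue"), (45, 60, "blue"), (50, 30, "blue"),
]

def neighbor_votes(users, query, k):
    """Count the labels of the k users closest to `query`."""
    nearest = sorted(
        users,
        key=lambda u: math.hypot(u[0] - query[0], u[1] - query[1]),
    )[:k]
    return Counter(label for _, _, label in nearest)

new_user = (30, 200)
results = {}
for k in (1, 4, 8):
    votes = neighbor_votes(known_users, new_user, k)
    results[k] = votes.most_common(1)[0][0]
    print(k, dict(votes), "->", results[k])
```

    With this data, K = 1 predicts blue while K = 4 and K = 8 both predict red, which is why K has to be chosen with some care.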

    So far, we have covered the main principles of k-nearest neighbors.

    However, attentive readers may have spotted a problem. Our figure has only two attributes: "age" and "time spent on the phone per day". What if there are three attributes? For example, suppose we also have an "education" attribute. How do we represent it?

    In fact, three attributes can be represented in three-dimensional space, with three coordinates x, y, and z, one per attribute. The neighbors of a point are the points closest to it in that space. The distance between two points in three-dimensional space is calculated much like the distance between two points on a two-dimensional plane.

    With 4 attributes, we use a 4-dimensional space; with N attributes, an N-dimensional space. We may not be able to picture a high-dimensional space, but the distance between two points in it is calculated much like the distance in two and three dimensions.

    Therefore, no matter how many attributes there are, k-nearest neighbors uses the same calculation.
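    A minimal sketch of a distance function that works for any number of attributes follows. It assumes straight-line (Euclidean) distance, which the article does not state explicitly, and it assumes "education" is encoded as a number of years of schooling, which is a hypothetical encoding chosen only for this example.

```python
import math

def distance(a, b):
    """Euclidean distance between two points with the same number of attributes."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of attributes")
    # Sum the squared differences over every attribute, then take the root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two attributes: (age, minutes per day)
d2 = distance((30, 200), (50, 30))

# Three attributes: (age, minutes per day, years of education; hypothetical)
d3 = distance((30, 200, 16), (50, 30, 12))

print(d2, d3)
```

    The same function handles 2, 3, or N attributes; only the length of the tuples changes.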

 

    Today we looked at the case of a game company that had to decide whether to promote a game to new users, and we solved the problem with k-nearest neighbors.

    What other problems do you think k-nearest neighbors can solve? If you have an idea, write it in the comments.



 
