Make music recommendations based on user similarity!

Today I would like to share an application exercise from my data structures and algorithms course.

  1. (Coefficient 1) Many people like to listen to music. In the past we used MP3 players, but now we can listen online directly through music apps. The functions of these apps are also becoming more and more powerful: you can not only choose songs to listen to, the app can also recommend music you may like based on your taste. Sometimes the recommended music suits your taste very well and may even surprise you! Do you know how such a smart function is implemented?

In fact, solving this problem requires no particularly advanced theory. The core idea is simple and can be summarized in two sentences: find users whose taste is similar to yours and recommend the songs they like; find songs with characteristics similar to your favorite songs and recommend those songs to you.

How do we find users whose taste is similar to yours, and how do we define "similar taste" in the first place? The idea is simple: we regard people who like the same songs as you as users with similar tastes. In the picture I drew for this example, "1" means "like" and "0" means "no opinion". From the picture we can see that you and Xiao Ming have the most liked songs in common, 5 of them. So we can say that Xiao Ming's tastes are very similar to yours.


We only need to traverse all users, count the number of songs that each user and you both like, and set a threshold. If the number of songs you and a user have in common exceeds this threshold, we regard this user as having tastes similar to yours and recommend the songs this user likes that you have not heard yet. But this solution has a problem: how do we know which songs a user likes? In other words, how do we define how much a user likes a certain song? We can define this degree of liking through user behavior. We assign a score to each behavior, and the higher the score, the higher the degree of liking.
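The counting approach just described can be sketched in C++ roughly as follows. This is a minimal sketch and not the course's reference solution: the function names `commonLikes` and `recommendByCommonLikes`, the 0/1 encoding stored in `std::vector<int>`, and the threshold values are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of songs both users marked with 1 ("like").
int commonLikes(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    int count = 0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        if (a[j] == 1 && b[j] == 1) ++count;
    }
    return count;
}

// Hypothetical helper: indices of songs liked by any user who shares more
// than `threshold` liked songs with `me`, and that `me` has not liked yet.
std::vector<int> recommendByCommonLikes(const std::vector<int>& me,
                                        const std::vector<std::vector<int>>& others,
                                        int threshold) {
    std::vector<bool> picked(me.size(), false);
    for (const auto& other : others) {
        if (commonLikes(me, other) <= threshold) continue;  // not similar enough
        for (std::size_t j = 0; j < me.size(); ++j) {
            if (other[j] == 1 && me[j] == 0) picked[j] = true;
        }
    }
    std::vector<int> result;
    for (std::size_t j = 0; j < picked.size(); ++j) {
        if (picked[j]) result.push_back(static_cast<int>(j));
    }
    return result;
}
```

For example, with `me = {1, 1, 1, 0, 0}`, another user `{1, 1, 1, 1, 0}` and a threshold of 2, the two users share 3 liked songs, so song index 3 is recommended.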

Still using the same example, if we express how much everyone likes each song, it looks like this. In the picture, we no longer use "1" or "0" to express whether a person likes a certain song; each entry is a specific score instead.

With such a table of how much each user likes each song, how do we judge whether two users have similar tastes? Obviously, we can no longer simply count common songs as before. Remember the edit distance we mentioned when we talked about string similarity measures? For the similarity measure here we can use another distance, the Euclidean distance. There are two keywords in this concept: vector and distance.

One-dimensional space is a line, and we use a single number such as 1, 2, 3... to represent a position in it. Two-dimensional space is a plane, and we use two numbers such as (1, 3), (4, 2), (2, 2)... to represent a position in it. Three-dimensional space is a solid space, and we use three numbers such as (1, 3, 5), (3, 1, 7), (2, 4, 3)... to represent a position in it. One, two, and three dimensions are not hard to picture, but how do we represent a position in a higher dimension? By analogy, a position in K-dimensional space can be written as (X1, X2, X3, ..., XK). This representation is a vector.

In two-dimensional and three-dimensional space there is the concept of distance between two positions. By analogy, high-dimensional space has it too, and this is what we call the distance between two vectors. How do we calculate it? Again by analogy with two and three dimensions: take the difference of each pair of corresponding coordinates, square the differences, add them up, and take the square root. This is the Euclidean distance.
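For reference, the standard Euclidean distance between two K-dimensional vectors can be written as follows. The name Y for the second vector, with components (Y1, Y2, ..., YK), is notation assumed here to match the (X1, X2, ..., XK) used in the text.

```latex
d(X, Y) = \sqrt{\sum_{i=1}^{K} (X_i - Y_i)^2}
```

With K = 2 this reduces to the familiar distance between two points in the plane, for example the distance between (0, 0) and (3, 4) is 5.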

We use a vector to represent each user's liking for all songs, and we compute the Euclidean distance between two vectors as a measure of how similar the tastes of the two users are. From the calculation in the figure, we can see that the Euclidean distance between Xiao Ming and you is the smallest. That is, the two of you are closest to each other in the high-dimensional space, so we conclude that Xiao Ming's tastes are the most similar to yours.
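A minimal C++ sketch of this distance-based comparison might look like the following. The function names `euclideanDistance` and `closestUser` and the use of `std::vector<double>` for score vectors are assumptions for illustration, not part of the original exercise.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two score vectors of equal length.
double euclideanDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Hypothetical helper: index of the user whose score vector is closest to
// `me`, or -1 if `others` is empty.
int closestUser(const std::vector<double>& me,
                const std::vector<std::vector<double>>& others) {
    int best = -1;
    double bestDist = 0.0;
    for (std::size_t u = 0; u < others.size(); ++u) {
        const double d = euclideanDistance(me, others[u]);
        if (best < 0 || d < bestDist) {
            best = static_cast<int>(u);
            bestDist = d;
        }
    }
    return best;
}
```

A smaller distance means more similar tastes, so `closestUser` plays the role of "finding Xiao Ming" in the example above.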

Please design the above algorithm to implement a simple recommendation system.

The above are the algorithm hints I was given at school.

But if you think about it carefully, this approach has a flaw.

The calculated values are not normalized: the vectors have different norms, so the distances are not on a common scale. In addition, if the user with the highest similarity, that is, the smallest Euclidean distance, cannot supply the recommendations the main user needs, we should fall back to the user with the second-highest similarity and let that user recommend songs to the main user. A normalized inner product gives a number between 0 and 1, so a threshold (ThreHold in the code) can be set: if the value is greater than this threshold, the user is judged to have similar tastes and that user's music can be pushed. So the following change can be made, as shown in Figure 3.2.

Figure 3.2

We now regard each user as a vector. The products of matching elements are accumulated in dotProduct, that is, dotProduct += UserMusic[i][j] * UserReference[j], which gives the inner product of the main user with each other user. The inner product of each vector with itself is accumulated in userNorm and preNorm, and then the division shown in Formula 1.1 is performed.

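$$\text{similarity} = \frac{\sum_{i} (\text{main user})_i \cdot (\text{other user})_i}{\sqrt{\sum_{i} (\text{main user})_i^2} \cdot \sqrt{\sum_{i} (\text{other user})_i^2}}$$

Formula 1.1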
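The normalized inner product described here is commonly called cosine similarity. A minimal C++ sketch, assuming integer scores in `std::vector<int>`, is shown below. The names `cosineSimilarity` and `isSimilar`, and the convention of returning 0 for an all-zero vector, are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity of two score vectors. Returns 0 when either vector is
// all zeros (an assumed convention, to avoid dividing by zero).
double cosineSimilarity(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    long long dot = 0, normA = 0, normB = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<long long>(a[i]) * b[i];
        normA += static_cast<long long>(a[i]) * a[i];
        normB += static_cast<long long>(b[i]) * b[i];
    }
    if (normA == 0 || normB == 0) return 0.0;
    return dot / (std::sqrt(static_cast<double>(normA)) *
                  std::sqrt(static_cast<double>(normB)));
}

// Hypothetical helper: true if the similarity exceeds the given threshold.
bool isSimilar(const std::vector<int>& a, const std::vector<int>& b, double threshold) {
    return cosineSimilarity(a, b) > threshold;
}
```

The normalization is the point of the change: the vectors (1, 1) and (5, 5) are far apart by Euclidean distance, yet their cosine similarity is 1, because both users rate the two songs in the same proportion.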

       In this way we obtain a number from 0 to 1 (provided all scores are non-negative), which is recorded as the user's similarity. Next, we rank the users by similarity in order to recommend songs to the main user.

So the theory is in place; now for the practice.
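One possible way to do the ranking step is to sort the user indices by similarity and then walk them in order. The sketch below uses `std::stable_sort` from the standard library; it is an alternative to the repeated maximum search, not the method used in this post, and the names `rankUsers` and `collectSongs` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: indices of users whose similarity exceeds `threshold`,
// most similar first (ties keep their original order).
std::vector<int> rankUsers(const std::vector<double>& similarities, double threshold) {
    std::vector<int> order;
    for (std::size_t i = 0; i < similarities.size(); ++i) {
        if (similarities[i] > threshold) order.push_back(static_cast<int>(i));
    }
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return similarities[a] > similarities[b]; });
    return order;
}

// Hypothetical helper: walk the ranked users and collect up to `wanted` songs
// that they scored above 0 and `me` has not scored (score 0), without repeats.
std::vector<int> collectSongs(const std::vector<int>& me,
                              const std::vector<std::vector<int>>& others,
                              const std::vector<int>& order,
                              std::size_t wanted) {
    std::vector<bool> taken(me.size(), false);
    std::vector<int> songs;
    for (int u : order) {
        assert(others[u].size() == me.size());
        for (std::size_t j = 0; j < me.size() && songs.size() < wanted; ++j) {
            if (others[u][j] > 0 && me[j] == 0 && !taken[j]) {
                taken[j] = true;
                songs.push_back(static_cast<int>(j));
            }
        }
    }
    return songs;
}
```

Sorting once costs O(U log U) for U users, and the `taken` flags keep the same song from being recommended twice by different users.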

void RecommondMu(int UserReference[], int UserMusic[NoUser][NoSongs], float ThreHold, int Reference[], int RecommendIndex[], int requiredSongs) {
    // Note: ThreHold is accepted as a parameter but is not applied in this version.
    // Compute the similarity between the main user and every other user
    float similarities[NoUser] = { 0.0 };
    for (int i = 0; i < NoUser; i++) {
        int dotProduct = 0;
        int userNorm = 0;
        int preNorm = 0;
        for (int j = 0; j < NoSongs; j++) {
            // Inner product: multiply matching elements of row i of UserMusic
            // and of UserReference, and accumulate the products in dotProduct
            dotProduct += UserMusic[i][j] * UserReference[j];
            // Inner product of each vector with itself (squared norm)
            userNorm += UserMusic[i][j] * UserMusic[i][j];
            preNorm += UserReference[j] * UserReference[j];
        }
        // Similarity of user i, stored in similarities[i];
        // an all-zero vector keeps similarity 0 instead of dividing by zero
        if (userNorm > 0 && preNorm > 0) {
            similarities[i] = dotProduct / (sqrt((double)userNorm) * sqrt((double)preNorm));
        }
    }

 /*   for (int i = 0; i < NoUser; i++) {
        printf("Similarity of user %d is %f\t",i+1, similarities[i]);
    }*/

    int recommendCount = 0;
    int userIndex = 0;

    while (recommendCount < requiredSongs && userIndex < NoUser) {
        float currentMaxSimilarity = -1.0;
        int currentMaxSimilarityIndex = -1;

        // Find the user with the highest similarity
        for (int i = 0; i < NoUser; i++) {
            if (similarities[i] > currentMaxSimilarity) {
                // Check that this user has a song to recommend that the main user has not rated yet
                for (int j = 0; j < NoSongs; j++) {
                    if (UserMusic[i][j] > 0 && UserReference[j] == 0) {
                        currentMaxSimilarityIndex = i;
                        currentMaxSimilarity = similarities[i];
                        break;
                    }
                }
            }
        }

        // If a most similar user was found
        if (currentMaxSimilarityIndex >= 0) {
            // Recommend this user's songs
            for (int j = 0; j < NoSongs; j++) {
                if (UserMusic[currentMaxSimilarityIndex][j] > 0 && UserReference[j] == 0) {
                    Reference[recommendCount] = j;
                    RecommendIndex[recommendCount] = currentMaxSimilarityIndex + 1;
                    recommendCount++;
                    similarities[currentMaxSimilarityIndex] = -1; // set this user's similarity to -1 so the user is not picked again
                    if (recommendCount == requiredSongs) {
                        break; // enough songs have been recommended
                    }
                }
            }
        }

        userIndex++;
    }

    if (recommendCount == 0) {
        printf("No songs could be recommended.\n");
    }
    else if (recommendCount < requiredSongs) {
        printf("Only %d songs could be recommended.\n", recommendCount);
    }
}

This is the code for the music recommendation part, with detailed comments.

The following is the complete code with test data.

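#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <ctime>
#include <iostream>
#define NoUser 6
#define NoSongs 1000

using namespace std;
/*
In this code we compute the inner product of user i's behavior score vector (row i of UserMusic) with itself (userNorm) and the inner product of the main user's preference vector UserReference with itself (preNorm). The square roots of these two values are the norms of the two vectors.

Then, dividing the inner product of user i and UserReference (dotProduct) by the product of the two norms gives a similarity value, which lies between 0 and 1 when all scores are non-negative.

The reason for the division is that it normalizes the measure across users. Different users' behavior scores and preference levels may have different ranges and may even differ by orders of magnitude. Normalizing removes these differences in scale and makes the similarity values easier to compare and interpret.

In addition, because the similarity value is bounded, it can be read as a measure of how alike two vectors are. A higher similarity value means the two vectors are more alike in their preferences.
*/

// Uses the inner product algorithm
// UserReference holds the main user's preferences, UserMusic is the matrix of the other users' preferences, ThreHold is the threshold for the inner product value, Reference receives the final recommendations

// Returns 1 if `song` is already among the first `count` entries of Reference
// (helper; RecommondMu does not call it in this version)
int IsSongRecommended(int song, int Reference[], int count) {
    for (int i = 0; i < count; i++) {
        if (Reference[i] == song) {
            return 1; // the song has already been recommended
        }
    }
    return 0;
}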

void RecommondMu(int UserReference[], int UserMusic[NoUser][NoSongs], float ThreHold, int Reference[], int RecommendIndex[], int requiredSongs) {
    // Note: ThreHold is accepted as a parameter but is not applied in this version.
    // Compute the similarity between the main user and every other user
    float similarities[NoUser] = { 0.0 };
    for (int i = 0; i < NoUser; i++) {
        int dotProduct = 0;
        int userNorm = 0;
        int preNorm = 0;
        for (int j = 0; j < NoSongs; j++) {
            // Inner product: multiply matching elements of row i of UserMusic
            // and of UserReference, and accumulate the products in dotProduct
            dotProduct += UserMusic[i][j] * UserReference[j];
            // Inner product of each vector with itself (squared norm)
            userNorm += UserMusic[i][j] * UserMusic[i][j];
            preNorm += UserReference[j] * UserReference[j];
        }
        // Similarity of user i, stored in similarities[i];
        // an all-zero vector keeps similarity 0 instead of dividing by zero
        if (userNorm > 0 && preNorm > 0) {
            similarities[i] = dotProduct / (sqrt((double)userNorm) * sqrt((double)preNorm));
        }
    }

 /*   for (int i = 0; i < NoUser; i++) {
        printf("Similarity of user %d is %f\t",i+1, similarities[i]);
    }*/

    int recommendCount = 0;
    int userIndex = 0;

    while (recommendCount < requiredSongs && userIndex < NoUser) {
        float currentMaxSimilarity = -1.0;
        int currentMaxSimilarityIndex = -1;

        // Find the user with the highest similarity
        for (int i = 0; i < NoUser; i++) {
            if (similarities[i] > currentMaxSimilarity) {
                // Check that this user has a song to recommend that the main user has not rated yet
                for (int j = 0; j < NoSongs; j++) {
                    if (UserMusic[i][j] > 0 && UserReference[j] == 0) {
                        currentMaxSimilarityIndex = i;
                        currentMaxSimilarity = similarities[i];
                        break;
                    }
                }
            }
        }

        // If a most similar user was found
        if (currentMaxSimilarityIndex >= 0) {
            // Recommend this user's songs
            for (int j = 0; j < NoSongs; j++) {
                if (UserMusic[currentMaxSimilarityIndex][j] > 0 && UserReference[j] == 0) {
                    Reference[recommendCount] = j;
                    RecommendIndex[recommendCount] = currentMaxSimilarityIndex + 1;
                    recommendCount++;
                    similarities[currentMaxSimilarityIndex] = -1; // set this user's similarity to -1 so the user is not picked again
                    if (recommendCount == requiredSongs) {
                        break; // enough songs have been recommended
                    }
                }
            }
        }

        userIndex++;
    }

    if (recommendCount == 0) {
        printf("No songs could be recommended.\n");
    }
    else if (recommendCount < requiredSongs) {
        printf("Only %d songs could be recommended.\n", recommendCount);
    }
}


// Test data
void generateRandomData(int UserReference[], int UserMusic[][NoSongs]) {
    srand((unsigned int)time(NULL));

    printf("Randomly generated user preferences:\n");
    printf("Main user: ");
    for (int j = 0; j < NoSongs; j++) {
        UserReference[j] = rand() % 7 - 1; // score in the range -1..5
        printf("%d ", UserReference[j]);
    }
    printf("\n");

    printf("Other users:\n");
    for (int i = 0; i < NoUser; i++) {
        printf("User %d: ", i + 1);
        for (int j = 0; j < NoSongs; j++) {
            UserMusic[i][j] = rand() % 7 - 1; // score in the range -1..5
            printf("%d ", UserMusic[i][j]);
        }
        printf("\n");
    }
}

// Count how many songs are needed (songs the main user scored 0)
int getRequired(int userReference[NoSongs]) {
    int count = 0;
    for (int i = 0; i < NoSongs; i++) {
        if (userReference[i] == 0) {
            count++;
        }
    }
    return count;
}

int main() {
    int userReference[NoSongs];          // the main user's score for each song
    int UserMusic[NoUser][NoSongs];
    generateRandomData(userReference, UserMusic);
    float threshold = 0.6f;
    // Result arrays: recommended song index and the user who recommended it
    int recommendMusic[NoSongs] = { 0 };
    int RecommendIndex[NoSongs] = { 0 }; // 1-based user number; 0 means the slot is unused
    int requireSongs = getRequired(userReference);
    clock_t start = clock(); // start timing
    RecommondMu(userReference, UserMusic, threshold, recommendMusic, RecommendIndex, requireSongs);
    clock_t end = clock(); // stop timing

    double elapsedTime = double(end - start) / CLOCKS_PER_SEC; // running time in seconds

    for (int i = 0; i < requireSongs; i++) {
        if (RecommendIndex[i] == 0) {
            break; // fewer songs were recommended than requested
        }
        printf("Song %d, recommended by: User%d\n", recommendMusic[i] + 1, RecommendIndex[i]);
    }

    cout << "Users: " << NoUser << ", songs: " << NoSongs << endl;
    cout << "Running time: " << elapsedTime << " seconds" << endl;
    return 0;
}

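The similarity computation costs O(NoUser * NoSongs). The selection loop runs at most NoUser rounds and scans all NoUser users in each round, so it costs O(NoUser * NoUser) when the inner availability check exits early, giving O(NoUser * NoSongs + NoUser * NoUser) overall. In the worst case, where the availability check has to scan every song, the selection loop costs O(NoUser * NoUser * NoSongs).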

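There is one problem: when I tested with 1000 songs, the program crashed with a memory overflow error. The likely cause is an out-of-bounds write on the stack rather than anything to do with the heap. Arrays that are indexed per song, such as userReference and RecommendIndex in main, must have NoSongs elements. If they are sized with NoUser (6), writing 1000 scores runs past the end of the array and corrupts the stack.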
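One way to make this kind of memory problem easier to avoid is to store the data in `std::vector`, which keeps its elements on the heap and knows its own size. The sketch below is only an illustration under that assumption; the struct `RatingData` and the functions `makeRatingData` and `setScore` are hypothetical names and not part of the program above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical container for the test data. Every array is sized explicitly
// per song or per user, and the element storage lives on the heap.
struct RatingData {
    std::vector<int> mainUser;             // one score per song
    std::vector<std::vector<int>> others;  // others[user][song]
};

// Allocate zero-initialised storage for `userCount` users and `songCount` songs.
RatingData makeRatingData(std::size_t userCount, std::size_t songCount) {
    assert(userCount > 0 && songCount > 0);  // empty data sets are not meaningful here
    RatingData data;
    data.mainUser.assign(songCount, 0);
    data.others.assign(userCount, std::vector<int>(songCount, 0));
    return data;
}

// Bounds-checked write: at() throws std::out_of_range instead of silently
// overwriting neighbouring memory when an index is wrong.
void setScore(RatingData& data, std::size_t user, std::size_t song, int score) {
    data.others.at(user).at(song) = score;
}
```

With this layout, `makeRatingData(6, 1000)` gives the main user 1000 slots regardless of the user count, and a wrong index passed to `setScore` raises an exception at the point of the mistake.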


Origin blog.csdn.net/weixin_73733267/article/details/135144512