Ranking Algorithms for User Points at Massive Scale

Question

On a website with a very large user base, each user has a points total that may be updated at any time. We must design an algorithm so that the site can display a user's current points ranking whenever they log in. There are at most 200 million users, and points are non-negative integers less than 1,000,000.

 

PS: This is said to be an interview question from Xunlei, but the problem itself is very realistic, so this article treats it as a real-world scenario rather than confining it to the idealized setting of an interview question.

 

Storage Structure

First, we use a user score table, user_score, to store each user's points.

 

Table Structure (reconstructed from the queries below; column types are illustrative):

    uid     int     user ID (primary key)
    score   int     the user's points

Sample data (illustrative):

    uid   score
    1     1000
    2     2000
    3     3000
    4     2000

The algorithms below are all based on this basic table structure.

 

Algorithm 1: Simple SQL Query

The most obvious approach is a simple SQL statement that counts the number of users whose points are greater than the given user's points:

select 1 + count(t2.uid) as rank
from user_score t1, user_score t2
where t1.uid = @uid and t2.score > t1.score
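The same computation can be sketched in application code. Here a plain dict stands in for the user_score table; the dict and its contents are illustrative, not from the article:

```python
def rank(user_scores, uid):
    """Rank of a user: 1 + number of users with strictly higher points.

    user_scores is a dict mapping uid -> score, standing in for the
    user_score table. Ties share the same rank, as in the SQL above.
    """
    s = user_scores[uid]
    return 1 + sum(1 for v in user_scores.values() if v > s)
```

Like the SQL, this touches every user on every query, which is exactly what makes Algorithm 1 unacceptable at scale.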

 

For example, querying with @uid = 4 returns user 4's current rank directly.

Algorithm Features

Advantages: Simple. It uses plain SQL, requires no complex query logic, and introduces no additional storage structures. It is a good solution for small-scale or low-traffic applications.

 

Disadvantages: Every query requires a full table scan of user_score, and concurrent score updates may lock the table while a query runs. For massive data at high concurrency, the performance is unacceptable.

 

Algorithm 2: Uniform Partition Design

In many applications, caching is an important way to solve performance problems, so we naturally ask: can we use Memcached to cache each user's ranking?

 

On reflection, however, caching does not help much here, because a user's ranking is a global statistic, not a private attribute of that user: a change in any other user's points may immediately change this user's ranking.

 

In real applications, however, points change according to certain patterns. A user's points usually do not jump up or down suddenly; typically a user spends a long time in the low score ranges before slowly climbing into the higher ones. In other words, the distribution of user points is roughly segmented. We further notice that small changes in a high-segment user's points have little effect on the rankings of low-segment users.

 

This suggests keeping statistics per score segment, so we introduce a partition table, score_range:

 

Table Structure (column names follow the description below; types are illustrative):

    from_score   int   lower bound of the score interval (inclusive)
    to_score     int   upper bound of the score interval (exclusive)
    count        int   number of users whose score falls in the interval

Data example (illustrative):

    from_score   to_score   count
    0            1000       3
    1000         2000       1

Each row indicates that there are count users in the interval [from_score, to_score). If we use intervals of 1000 points each, there are 1000 intervals: [0, 1000), [1000, 2000), …, [999000, 1000000). From now on, every points update must also update the count of the corresponding interval in score_range.

 

To query the rank of a user whose score is s with the aid of the partition table: first determine the interval s belongs to, sum the count values of all intervals above it, then query the user's rank within their own interval, and add the two together to get the user's overall rank.
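A minimal in-memory sketch of this two-step query. The 1000-point interval width follows the text; the list of scores standing in for the user_score table, and all names, are illustrative:

```python
RANGE_WIDTH = 1000
NUM_RANGES = 1_000_000 // RANGE_WIDTH

# score_range[i] = count of users with score in [i*1000, (i+1)*1000)
score_range = [0] * NUM_RANGES

def add_user(score):
    score_range[score // RANGE_WIDTH] += 1

def rank(user_scores, s):
    idx = s // RANGE_WIDTH
    # Step 1: every user in a strictly higher interval outranks score s.
    higher = sum(score_range[idx + 1:])
    # Step 2: rank within s's own interval -- a stand-in for the
    # range-limited SQL query: same interval, strictly higher score.
    lo, hi = idx * RANGE_WIDTH, (idx + 1) * RANGE_WIDTH
    within = sum(1 for v in user_scores if lo <= v < hi and v > s)
    return 1 + higher + within
```

Step 2 is exactly where the scheme breaks down: for a crowded low interval it still has to examine a large fraction of all users.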

 

At first glance, this method seems to reduce the query cost through interval aggregation, but it does not. The biggest problem is: how do we query the user's rank within their own interval?

 

If we add the score-range condition to the SQL from Algorithm 1:

select 1 + count(t2.uid) as rank
from user_score t1, user_score t2
where t1.uid = @uid and t2.score > t1.score and t2.score < @to_score

 

Ideally, since the range of t2.score is limited to a width of 1000, if the score field is indexed we would expect this SQL statement to scan far fewer rows of user_score via the index.

 

However, this is not the case. Restricting t2.score to a 1000-point range does not mean only 1000 users fall in that range, because many users can have the same score! The 80/20 rule tells us that the bottom 20% of score ranges tends to hold 80% of the users, which means in-range rank queries for the many low-segment users perform far worse than those for the few high-segment users. This partitioning scheme therefore brings no substantial performance gain.

 

Algorithm Features

Advantages: It recognizes that scores fall into intervals, and pre-aggregation spares the query part of the full table scan.

Disadvantages: Because the score distribution is non-uniform, the performance improvement is unsatisfactory.

 

Algorithm 3: Tree Partition Design

The uniform-partition algorithm fails because of the non-uniform score distribution, so we naturally ask: can we design the score_range table with non-uniform intervals, following the 80/20 rule?

 

For example, make the low ranges denser, with 10-point intervals, then gradually widen to 100 points, 1000 points, 10,000 points… This is indeed one option, but such a partitioning is somewhat arbitrary and hard to tune. Moreover, the score distribution of the whole system will gradually shift with use, so a partitioning that works well initially may become unsuitable later.

 

We would like a partitioning scheme that accommodates both the non-uniformity of scores and changes in the system's score distribution over time. That scheme is tree partitioning.

 

We can take [0, 1,000,000) as the single level-1 interval, split it into two level-2 intervals [0, 500,000) and [500,000, 1,000,000), then bisect those into four level-3 intervals [0, 250,000), [250,000, 500,000), [500,000, 750,000), [750,000, 1,000,000), and so on. In the end we get 1,000,000 level-21 intervals: [0, 1), [1, 2), …, [999,999, 1,000,000).

 

This effectively organizes the intervals into a balanced binary tree: the root represents the level-1 interval, and each non-leaf node has two children, the left child covering the lower half of its score range and the right child the higher half. The tree partition structure must maintain an invariant under updates: a non-leaf node's count always equals the sum of its two children's counts.

 

From then on, the number of intervals that must be updated when a user's points change depends on the size of the change: the smaller the change, the lower the tree levels that need to be touched.

 

Overall, the number of intervals that must be updated per change is O(log(n)) in the size of the score change; that is, even when a user's points change by an amount on the order of one million, only about twenty intervals need updating. With the help of this tree-partition score table, querying the rank of a user whose score is s becomes a top-down walk on the interval tree that pins down s's position step by step, from coarse to fine.

 

For example, take a score of 499,000 and a rank accumulator initialized to 0. First, the score belongs to the left subtree [0, 500,000) of the level-1 interval, so this user ranks behind the count users of the right subtree [500,000, 1,000,000); we add that count to the accumulator and move down a level. Next, the score belongs to the level-3 interval [250,000, 500,000), the right child of its level-2 parent, so there is nothing to add and we simply move down a level. Then it belongs to some level-4 interval, and so on, until we finally pin the score down to the level-21 interval [499,000, 499,001). The accumulation is now complete, and the rank follows: 1 plus the accumulated count, matching Algorithm 1's "1 + count" formula.
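The walk above can be sketched as a minimal in-memory interval tree. The implicit array layout is an implementation choice for illustration, not the article's table-based version, though it stores the same counts; since 2^20 = 1,048,576 ≥ 1,000,000, padding the score range to a power of two gives exactly the 21 levels described:

```python
SIZE = 1 << 20            # leaves: one level-21 interval [s, s+1) per score
tree = [0] * (2 * SIZE)   # tree[1] = root; tree[SIZE + s] = leaf for score s

def change(score, delta):
    # Walk leaf-to-root; every ancestor interval containing the score
    # changes by delta, preserving the parent = left + right invariant.
    i = SIZE + score
    while i >= 1:
        tree[i] += delta
        i //= 2

def update_score(old, new):
    change(old, -1)
    change(new, +1)

def rank(s):
    # Descend root-to-leaf; whenever s falls in the left (low) half,
    # everyone counted in the right sibling has a higher score.
    i, higher, lo, hi = 1, 0, 0, SIZE
    while i < SIZE:
        mid = (lo + hi) // 2
        if s < mid:
            higher += tree[2 * i + 1]   # right sibling's count
            i, hi = 2 * i, mid
        else:
            i, lo = 2 * i + 1, mid
    return 1 + higher
```

Each update touches at most 21 nodes per affected score and each query reads at most 20 sibling counts, independent of the number of users; initializing from user_score is one change(score, +1) per user.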

 

Although both updates and queries in this algorithm involve several operations, if we index the intervals' from_score and to_score columns, all of these operations become key-based lookups and updates that cause no table scans, and are therefore much more efficient.

 

Moreover, the algorithm does not depend on the relational data model or SQL operations, so it can easily be adapted to NoSQL or other storage, and key-based operations make it easy to add a caching layer to optimize performance further. Going a step further, we can estimate the number of tree intervals at roughly 2,000,000; considering the size of each node, the whole structure occupies only a few tens of MB.

 

We can therefore build the interval tree entirely in memory, initialize it from the user_score table in O(n) time, and then serve all rank queries and updates from memory. Generally speaking, moving the same algorithm from the database into memory often yields a performance gain of 10^5 or more, so this algorithm can reach very high performance.

 

Algorithm Features

Advantages: The structure is stable and unaffected by the score distribution; each query or update is O(log(n)) in the maximum score, independent of the number of users, so it can handle massive scale; it does not depend on SQL and is easy to port to NoSQL or an in-memory data structure.

Disadvantages: The algorithm is comparatively more complex.

 

Algorithm 4: Score-to-Rank Array

Although Algorithm 3 performs well, achieving O(log(n)) complexity in the score change, it is fairly complex to implement. Moreover, O(log(n)) only shows its advantage when n is very large; in practice score changes are usually modest, in which case it holds no clear advantage over an O(n) algorithm and may even be slower.

 

With this in mind, look closely at how a score change affects the rankings: when a user's points change from s to s+n, users whose points are less than s or at least s+n are not affected at all; only users whose points fall in [s, s+n) drop by exactly one place. We can therefore represent the score-to-rank mapping with an array of size 1,000,000, where rank[s] is the rank corresponding to score s.

 

At initialization, the rank array can be computed from the user_score table in O(n) time, and all rank queries and updates are then performed against this array. Querying the rank for score s simply returns rank[s], which is O(1); when a user's points change from s to s+n, we just add 1 to each of the n elements rank[s] through rank[s+n-1], which is O(n) in the score change.
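A sketch of the rank array (function names are illustrative):

```python
MAX_SCORE = 1_000_000

def build_rank(user_scores):
    # One pass over the users to histogram the scores, then a sweep from
    # the top so that rank[s] = 1 + number of users with score > s.
    hist = [0] * MAX_SCORE
    for v in user_scores:
        hist[v] += 1
    rank = [0] * MAX_SCORE
    higher = 0
    for s in range(MAX_SCORE - 1, -1, -1):
        rank[s] = 1 + higher
        higher += hist[s]
    return rank

def update(rank, s, n):
    # A user's points rise from s to s+n: exactly the users whose score
    # lies in [s, s+n) fall one place, so those n entries grow by 1.
    for t in range(s, s + n):
        rank[t] += 1
```

Querying is then just the array access rank[s].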

 

Algorithm Features

Advantages: The rank array is simpler than the interval tree and easy to implement; rank queries are O(1); rank updates are O(n) in the score change, which is very efficient when changes are small.

Disadvantages: When n is large, many elements must be updated, and it is less efficient than Algorithm 3.

 

Summary

This article presented several algorithms for ranking user points. Algorithm 1 is simple, easy to understand and implement, and suits small-scale, low-concurrency applications. Algorithm 3 introduces a more complex tree partition structure, but its O(log(n)) complexity delivers excellent performance that can serve massive scale and high concurrency. Algorithm 4 uses a simple rank array that is easy to implement, and when score changes are small its performance is no worse than Algorithm 3's.
