Repost: How the HyperLogLog algorithm in Redis works and how to use it

from: https://www.cnblogs.com/linguanh/p/10460421.html

Table of contents

  • Problem prototype
  • Selection criteria
  • HyperLogLog
  • Bernoulli trials
  • Optimizing the estimate
  • How this relates to the problem
    • Bit string
    • Bucketing
    • Correspondence
  • HyperLogLog in Redis
    • How Redis implements HyperLogLog
  • Bias correction
  • On the shoulders of giants

Problem prototype

Suppose you want to implement the following feature:

Count how many users visit a page of an app or website each day. Repeated visits by the same user count only once.

You might immediately think of using a HashMap, which also takes care of deduplication. That is indeed one solution, among others.

The problem itself is not hard, but once the number of items reaches a certain scale, even a simple problem becomes difficult. Suppose the app's daily active users reach the millions or tens of millions. A HashMap-based approach would then make the program consume a large amount of memory.

Let us estimate the memory footprint of the HashMap approach. Suppose the HashMap's key is a string and its value is a bool: the key is the user id and the value records whether the user has visited. With one million distinct visitors, the HashMap needs roughly 1,000,000 * (string + bool) of memory.
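As a rough back-of-the-envelope sketch of that estimate, the snippet below multiplies out the payload size. The 16-byte id length and the 1-byte bool are assumptions chosen only for illustration, and a real hash map adds per-entry overhead on top, so the true footprint would be larger.

```python
# Rough payload estimate for the HashMap approach.
# ASSUMPTIONS (not from the article): ids are 16-byte strings, a bool is 1 byte,
# and per-entry container overhead is ignored.
ID_BYTES = 16
BOOL_BYTES = 1

def hashmap_payload_bytes(num_users: int) -> int:
    """Bytes needed just for the keys and values of num_users entries."""
    return num_users * (ID_BYTES + BOOL_BYTES)

payload = hashmap_payload_bytes(1_000_000)
print(f"{payload / 1024 / 1024:.1f} MiB")  # prints "16.2 MiB"
```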

Selection criteria

Among the existing solutions to the problem above, the HashMap is arguably the one with the largest memory footprint. If the amount of data to count is small, this method is fine, and it is simple to implement.

Other options include a B+ tree, a bitmap, and the HyperLogLog algorithm introduced in this article.

If conditions permit and a small statistical error on a huge data set is acceptable, for example a final count for 10 million page views that is off by ten or twenty thousand, then the HyperLogLog algorithm can be used to solve counting problems like the one above.

HyperLogLog

HyperLogLog, hereafter HLL, is an upgraded version of the LogLog algorithm. It provides approximate deduplicated counting, meaning the count is not exact. It has the following characteristics:

  • The code is relatively hard to implement.
  • It can count a huge amount of data with very little memory. The HyperLogLog implementation in Redis needs only 12K of memory to count up to 2^64 items.
  • The count has some error, but the overall error rate is low: the standard error is 0.81%.
  • The error can be reduced by tuning an auxiliary calculation factor.

Readers who know a little about the memory footprint of basic data types should be surprised that 12K of memory is enough to count 2^64 items. To see why, consider the following example:

Take Java as an example. A long generally occupies 8 bytes, and a byte is 8 bits (1 byte = 8 bit), so the largest value a long can represent is 2^63-1. Relative to the 2^64 figure above, suppose we had 2^63-1 numbers, from 0 to 2^63-1, and stored each one as a long. With 1K = 1024 bytes, the total memory would be ((2^63-1) * 8 / 1024) K, a huge number that far exceeds 12K. HyperLogLog, by contrast, can complete the count within 12K.
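To make the size of that number concrete, here is a small sketch of the arithmetic. It only evaluates the expression from the paragraph above; the variable names are made up for this example.

```python
# Memory needed to store 2^63-1 values as 8-byte longs,
# compared with the 12K that Redis uses for one HyperLogLog.
LONG_BYTES = 8
count = 2**63 - 1

naive_k = count * LONG_BYTES // 1024   # ((2^63-1) * 8 / 1024) K, rounded down
hll_k = 12

print(naive_k)            # just under 2^56 K
print(naive_k // hll_k)   # how many times larger than 12K
```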

Bernoulli trials

Before explaining why HyperLogLog can count a huge amount of data with very little memory, we first need to understand the Bernoulli trial.

The Bernoulli trial is part of probability theory, and its classic illustration is the coin toss.

A coin has two sides, heads and tails, and each toss lands on either side with probability 50%. Suppose we keep tossing the coin until heads appears and record that as one complete trial. Heads may appear on the first toss, or only on the fourth. However many tosses it takes, as soon as heads appears we record one trial. This experiment is the Bernoulli trial.

Now suppose we run this trial many times, say n times, which means heads has appeared n times. Let k be the number of tosses in each trial: the first trial takes k1 tosses, and so on, so the n-th trial takes kn tosses.

Among these n trials there must be a maximum number of tosses. For example, if the longest trial needed 12 tosses before heads appeared, we call that number k_max, the largest number of tosses in any trial.

The following conclusions are easy to draw:

  1. In n trials, no trial takes more than k_max tosses.
  2. In n trials, at least one trial takes exactly k_max tosses.

Combined with the method of maximum likelihood estimation, this yields an estimated relationship between n and k_max: n = 2^(k_max). Estimating a property of the whole data stream from such local information may seem counterintuitive; deriving and verifying this relationship requires methods from probability and statistics.

For example:

Trial 1: heads appears after 3 tosses, so k=3, n=1
Trial 2: heads appears after 2 tosses, so k=2, n=2
Trial 3: heads appears after 6 tosses, so k=6, n=3
Trial n: heads appears after 12 tosses, so we estimate n = 2^12

Suppose the example above has only 3 trials in total. Then k_max = 6 and the true n = 3, but the estimation formula gives 2^6 = 64, and clearly 3 ≠ 2^6. In other words, when the number of trials is small, the error of this estimation method is very large.
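The experiment is easy to simulate. The following is a minimal sketch, assuming a fair coin modelled with Python's random module; the function names and the seed are invented for this illustration. It runs n trials, records k_max, and prints the estimate 2^(k_max) next to the true n, so the size of the error can be observed for small and large n.

```python
import random

def tosses_until_heads(rng: random.Random) -> int:
    """Run one trial: toss a fair coin until heads, return the number of tosses."""
    k = 1
    while rng.random() < 0.5:  # tails with probability 0.5, so toss again
        k += 1
    return k

def estimate_trials(n: int, rng: random.Random) -> int:
    """Run n trials and estimate n as 2^(k_max)."""
    k_max = max(tosses_until_heads(rng) for _ in range(n))
    return 2 ** k_max

rng = random.Random(42)  # fixed seed, chosen arbitrarily
for n in (3, 1_000, 100_000):
    print(n, estimate_trials(n, rng))
```

A single run like this is always a power of two and can be far from n, which is the weakness the next section addresses.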

Optimizing the estimate

We call the three trials in the example above one round of estimation. With only one round, the estimation error shrinks once n is large enough, but it is still not small enough.

Can we run multiple rounds? For example, run 100 rounds, take the k_max of each round, and average them: (k_max_1 + ... + k_max_100)/100. Then estimate n from that average. This gives the LogLog estimation formula:

DV_LL = constant * m * 2^(R_avg)

In the formula above, DV_LL corresponds to n. constant is a correction factor whose specific value is not fixed; it can be set by branching on the actual situation. m is the number of rounds. R_avg (written as R with a bar on top) is the average: (k_max_1 + ... + k_max_m)/m.
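A minimal sketch of the LogLog estimate, constant * m * 2^(average k_max), might look like this. The function name is hypothetical, and constant is left as a parameter with a placeholder value of 1.0 because the article does not fix it.

```python
def loglog_estimate(k_maxes: list[int], constant: float) -> float:
    """LogLog estimate: constant * m * 2^(average k_max)."""
    m = len(k_maxes)
    r_avg = sum(k_maxes) / m
    return constant * m * 2 ** r_avg

# Four rounds whose k_max values are all 3; constant=1.0 is a placeholder.
print(loglog_estimate([3, 3, 3, 3], constant=1.0))  # 32.0
```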

Running more rounds and averaging the k_max values is the optimization used by LogLog. The difference between HyperLogLog and LogLog is that HyperLogLog uses the harmonic mean instead of the arithmetic mean. The advantage of the harmonic mean is that it is less affected by large values. Here is an example:

Averaging salaries:

A earns 1,000 / month and B earns 30,000 / month. The arithmetic mean is: (1000 + 30000) / 2 = 15500

The harmonic mean is: 2 / (1/1000 + 1/30000) ≈ 1935.484

Clearly the harmonic mean is much less affected by the large value than the arithmetic mean. The harmonic mean is calculated as follows, where Σ is the summation sign:

H_n = n / (1/x_1 + 1/x_2 + ... + 1/x_n) = n / Σ(1/x_i), i = 1..n
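The salary example can be checked with a few lines of code. This is only an illustrative sketch; the helper names are made up here.

```python
def arithmetic_mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def harmonic_mean(xs: list[float]) -> float:
    """n divided by the sum of the reciprocals."""
    return len(xs) / sum(1 / x for x in xs)

salaries = [1000, 30000]
print(arithmetic_mean(salaries))          # 15500.0
print(round(harmonic_mean(salaries), 3))  # 1935.484
```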

How this relates to the problem

From the above we know that, in the coin-toss example, the k_max of a series of Bernoulli trials can be used to estimate n.

So how does this estimation method relate to the following problem?

Count how many users visit a page of an app or website each day. Repeated visits by the same user count only once.

This is what HyperLogLog does. For each input item, it performs the following steps:

1. Bit string

A hash function converts the data into a bit string. For example, the input 5 becomes 101. Why convert it?

To make it correspond to the coin toss. In the bit string, 0 stands for tails and 1 for heads. If an item is finally converted into 10010000, then reading from right to left, from the low bits to the high bits, we can treat the first occurrence of 1 as heads.

Based on the estimation result above, just as we can estimate the total number of trials from the largest number of tosses needed to get heads, we can estimate how many items have been stored from k_max, the largest first-1 position among the converted bit strings.
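Finding the position of the first 1 from the right is a small bit-manipulation task. Below is a sketch of one way to do it; the function name and the convention of returning 0 for an all-zero input are assumptions of this example, not something the article specifies.

```python
def first_one_position(bits: int) -> int:
    """1-based position of the lowest set bit, counting from the right.

    Returns 0 when bits has no 1 at all (a convention of this sketch).
    """
    if bits == 0:
        return 0
    # bits & -bits isolates the lowest set bit; bit_length gives its position.
    return (bits & -bits).bit_length()

print(first_one_position(0b10010000))  # 5
print(first_one_position(0b101))       # 1
```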

2. Bucketing

Bucketing means deciding how many rounds to split into. In terms of computer storage, picture a large bit array S of length L, measured in bits. S is divided evenly into m groups; m is the number of rounds, and each group occupies the same number of bits, denoted p. The following relationships are easy to derive:

  • L = S.length
  • L = m * p
  • Memory occupied by S, in K = L / 8 / 1024

In Redis, HyperLogLog sets m = 16384, p = 6, L = 16384 * 6. The memory occupied is 16384 * 6 / 8 / 1024 = 12K.

Visualized:

Group 0   Group 1                       .... Group 16383
[000 000] [000 000] [000 000] [000 000] .... [000 000]
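The arithmetic behind the 12K figure can be written out directly. This is just a worked check of the relationships listed above, with m = 16384 buckets of 6 bits each.

```python
M = 16384  # number of buckets (2^14)
P = 6      # bits per bucket

total_bits = M * P                 # L = m * p
total_k = total_bits / 8 / 1024    # bits -> bytes -> K

print(total_bits)  # 98304
print(total_k)     # 12.0
```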

3. Correspondence

Now let us return to the original problem of counting the users of an app page.

  • Let the key of the app home page be: main
  • User ids are: idn, n -> 0,1,2,3 ....

In this counting problem, each distinct user id identifies one user, so we can use the user id as the input to the hash function. That is:

hash(id) = bit string

Different user ids yield different bit strings (barring hash collisions), and each bit string contains a 1 in at least one position. We treat each bit string as one Bernoulli trial.

Now we split into rounds, that is, into buckets. We take the first few bits of each bit string, convert them to decimal, and use that value as the index of the bucket. Suppose the lowest two bits of the bit string are used to compute the bucket index, and a user's id hashes to the bit string 1001011000011. Its bucket index is 11(2) = 1*2^1 + 1*2^0 = 3, so it goes into bucket 3, that is, round 3.

With the bucket index computed, the remaining bit string is 10010110000. Reading from low to high, the first 1 appears at position 5. In other words, in bucket 3 (round 3), k_max = 5. The binary representation of 5 is 101, and since each bucket has p bits, 101 can be stored whenever p >= 3.
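The worked example above can be reproduced with a short sketch. The function name is hypothetical; it takes the lowest bucket_bits bits as the bucket index and then finds the position of the first 1 in the remaining bits.

```python
def split_hash(bits: int, bucket_bits: int) -> tuple[int, int]:
    """Split a bit string into (bucket index, position of first 1 in the rest)."""
    bucket = bits & ((1 << bucket_bits) - 1)  # lowest bucket_bits bits
    rest = bits >> bucket_bits                # remaining high bits
    # 1-based position of the lowest set bit; 0 if rest has no 1 (sketch convention)
    position = (rest & -rest).bit_length()
    return bucket, position

print(split_hash(0b1001011000011, bucket_bits=2))  # (3, 5)
```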

Following this process, many different user ids are spread across different buckets, and each bucket has its own k_max. When we want to count how many users visited the main page, we make an estimate: combine the k_max of all buckets and substitute them into the estimation formula to obtain the estimated value.

Below is the HyperLogLog estimation formula, which uses the harmonic mean. Its variables have the same meaning as in LogLog:

DV_HLL = constant * m * H
H = m / (2^(-R_1) + 2^(-R_2) + ... + 2^(-R_m))

where R_j is the k_max of bucket j.
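As a sketch, the HyperLogLog estimate can be computed from the per-bucket k_max values as constant * m * H, where H is the harmonic mean of 2^(k_max) over the m buckets. The function name is made up for this example, and constant=1.0 is only a placeholder since the proper value depends on m.

```python
def hll_estimate(k_maxes: list[int], constant: float) -> float:
    """HyperLogLog estimate: constant * m * harmonic mean of 2^(k_max)."""
    m = len(k_maxes)
    harmonic = m / sum(2.0 ** -r for r in k_maxes)
    return constant * m * harmonic

# constant=1.0 is a placeholder; the real value depends on m.
print(hll_estimate([3, 3, 3, 3], constant=1.0))   # 32.0
print(hll_estimate([3, 3, 3, 20], constant=1.0))  # one outlier bucket
```

With the outlier bucket the result only rises to about 42.7, whereas averaging the exponents in the LogLog style would give roughly 609 for the same four values, which illustrates why the harmonic mean is the more robust choice.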

HyperLogLog in Redis

In Redis, HyperLogLog is one of the advanced data structures. It provides commands including, but not limited to, the following two:

  • pfadd key value: store value under the given key
  • pfcount key: return the estimated number of distinct values stored under key

Recall the original problem of counting the users of an app page. If key is the page name and value is the user id, the problem maps onto these commands exactly.

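How Redis implements HyperLogLog

As we have seen, the implementation uses 16384 buckets, that is, 2^14 = 16384. Each bucket has 6 bits, so the largest number a bucket can hold is 2^5+2^4+...+1 = 63, which is 111 111 in binary.

For the command: pfadd key value

When a value is stored, it is hashed into a 64-bit bit string. The first 14 bits (counting from the right) select the bucket, and the position of the first 1, read from right to left in the rest of the bit string, is the number stored in that bucket. Let index be the position of that first 1. When index = 5, the bit string looks like: ....10000 [01 0000 0000 0000]

14 bits are used for the bucket number because there are 16384 buckets and 2^14 = 16384, so every bucket can be addressed and none is wasted. Suppose the first 14 bits of a bit string are 00 0000 0000 0010 (read from right to left). Their decimal value is 2, so index is converted and placed in bucket number 2.

Conversion rule for index:

The full bit string of a value is 64 bits. After removing the 14 bucket bits, 50 bits remain. In the extreme case, the first 1 appears at the 50th bit, so index = 50. Converted to binary, index is 110010.

Each of the 16384 buckets consists of 6 bits, so 110010 fits exactly and is written to bucket number 2. Note that 50 is already the worst case and it still fits, so every other value fits as well.

A key used with pfadd can hold multiple values, as in the following example:

pfadd lgh golang
pfadd lgh python
pfadd lgh java

Following the procedure above, different values are written to different buckets. If two values land in the same bucket, that is, their first 14 bits are equal but the position of the first 1 differs, compare the new index with the one already stored. If the new index is larger, replace the stored one; otherwise leave it unchanged.

Eventually, many values have been recorded across the 16384 buckets of a key, and each bucket holds a k_max. When pfcount is called, the estimation method described earlier computes how many distinct values have been set for the key, which is the count.

Each value is converted into a 64-bit bit string and recorded into the buckets as described above. 64 bits can represent 2^64 different values, and HyperLogLog needs only 16384 * 6 / 8 / 1024 = 12K of storage to count up to 2^64 items.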
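To tie the steps together, here is a toy sketch that mimics the layout described in this section: 16384 buckets chosen by the low 14 bits, the first-1 position taken from the remaining 50 bits, and the larger index kept per bucket so that each bucket ends up holding its k_max. It is not the Redis implementation. The pfadd and pfcount functions below are plain Python stand-ins named after the commands, SHA-1 truncated to 64 bits is an assumed substitute for Redis's own hash function, and the small-range correction (linear counting, taken from the original HyperLogLog paper) goes beyond what this article covers.

```python
import hashlib
import math

BUCKET_BITS = 14               # low 14 bits pick the bucket
M = 1 << BUCKET_BITS           # 16384 buckets
REST_BITS = 64 - BUCKET_BITS   # 50 bits left for the first-1 search

def new_registers() -> list[int]:
    return [0] * M             # one small integer per bucket (6 bits in Redis)

def pfadd(registers: list[int], value: str) -> None:
    # ASSUMPTION: SHA-1 truncated to 64 bits stands in for Redis's own hash.
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
    bucket = h & (M - 1)
    rest = h >> BUCKET_BITS
    # Position of the first 1 from the right; REST_BITS + 1 if all 50 bits
    # are 0 (a convention of this sketch).
    index = (rest & -rest).bit_length() if rest else REST_BITS + 1
    # Keep the larger value, so the bucket stores k_max.
    if index > registers[bucket]:
        registers[bucket] = index

def pfcount(registers: list[int]) -> int:
    constant = (0.7213 / (1 + 1.079 / M)) * M * M
    estimate = constant / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    # ASSUMPTION: small-range correction (linear counting) from the
    # HyperLogLog paper; the article does not cover it.
    if estimate <= 2.5 * M and zeros > 0:
        estimate = M * math.log(M / zeros)
    return round(estimate)

regs = new_registers()
for v in ("golang", "python", "java", "java"):  # "java" is added twice
    pfadd(regs, v)
print(pfcount(regs))
```

Adding the same value twice leaves the buckets unchanged, so the duplicate "java" does not raise the count, and the printed estimate should be close to 3.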

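Bias correction

In the estimation formula, constant is not a fixed value; it is set by branching on the actual situation, as in the following example. Note that in this code constant already includes the factor m * m, so the estimate is constant divided by the sum of 1/2^(k_max) over all buckets.

Let m be the number of buckets and p the base-2 logarithm of m.

// m is the number of buckets, p = log2(m)
switch (p) {
    case 4:  constant = 0.673 * m * m; break;
    case 5:  constant = 0.697 * m * m; break;
    case 6:  constant = 0.709 * m * m; break;
    default: constant = (0.7213 / (1 + 1.079 / m)) * m * m; break;
}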

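On the shoulders of giants

That such a striking algorithm can grow out of a simple coin-toss experiment shows how powerful mathematics is.

Thanks to the following two blog posts for their guidance:

All images in this article come from:

https://www.jianshu.com/p/55defda6dcd2

The content of this article draws on:

http://www.rainybowe.com/blog/2017/07/13/%E7%A5%9E%E5%A5%87%E7%9A%84HyperLogLog%E7%AE%97%E6%B3%95/index.html

A site for interactively observing how LogLog and HyperLogLog behave:

http://content.research.neustar.biz/blog/hll.html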

Origin www.cnblogs.com/liuqingsha3/p/11958801.html