from: https://www.cnblogs.com/linguanh/p/10460421.html
Table of contents
- Problem prototype
- Selection criteria
- HyperLogLog
- Bernoulli trials
- Optimizing the estimate
- Relating it to the problem
- Bit string
- Bucketing
- Correspondence
- HyperLogLog in Redis
- How HyperLogLog works in Redis
- Bias correction
- On the shoulders of giants
Problem prototype
Suppose you want to implement the following feature:
Count how many users visit a given page of an app or website each day. Repeated visits by the same user count only once.
You might immediately think of a HashMap: it does the job and also takes care of deduplication. It is indeed one solution among several.
The problem itself is not hard, but once the amount of data reaches a certain order of magnitude, even a simple problem becomes difficult. Suppose the app's daily active users number in the millions or tens of millions. A HashMap-based approach would then make the program use a great deal of memory.
Let us estimate the memory footprint of the HashMap approach. Suppose the HashMap's key is of type string and its value of type bool. The key is the user id and the value records whether that user has visited. With one million distinct users visiting, the HashMap needs roughly 1,000,000 * (string + bool) of memory.
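As a baseline, the HashMap approach can be sketched in a few lines of Python, with a dict playing the role of the HashMap. The function name count_unique_visitors and the sample ids are made up for illustration.

```python
def count_unique_visitors(user_ids):
    """Exact distinct count: one dict entry per user who has visited."""
    visited = {}  # key: user id (str), value: True once the user has visited
    for user_id in user_ids:
        visited[user_id] = True  # a repeated visit overwrites the same entry
    return len(visited)


clicks = ["id0", "id1", "id0", "id2", "id1"]
print(count_unique_visitors(clicks))  # 3
```

The dict grows by one entry per distinct user, which is exactly the memory cost estimated above.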
Selection criteria
Among the existing solutions to the problem above, the HashMap approach is the one with the largest memory footprint. If the amount of data is small, it is a reasonable choice and simple to implement.
Other options include the B+ tree, the bitmap, and the HyperLogLog algorithm introduced in this article.
If some statistical error is acceptable on very large data sets, for example a final count that falls 20,000 short on 10 million page views, then the HyperLogLog algorithm can be used for counting problems like the one above.
HyperLogLog
HyperLogLog, hereafter HLL, is an upgraded version of the LogLog algorithm. It provides approximate deduplicated counting. It has the following characteristics:
- The code is relatively hard to implement.
- It can count a huge amount of data with very little memory. The HyperLogLog implementation in Redis needs only 12K of memory to count 2^64 items.
- The count carries some error, but the overall error rate is low. The standard error is 0.81%.
- The error can be reduced by setting an auxiliary calculation factor.
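The 0.81% figure is consistent with the standard error usually quoted for HyperLogLog, roughly 1.04/√m, where m is the number of buckets (the rounds introduced later in this article). With the 16384 buckets that the Redis section further down describes:

```latex
\sigma \approx \frac{1.04}{\sqrt{m}} = \frac{1.04}{\sqrt{16384}} = \frac{1.04}{128} \approx 0.0081 = 0.81\%
```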
Readers who have some idea of how much memory the basic data types of a programming language occupy may be surprised that 12K of memory is enough to count 2^64 items. To see why, consider the following example:
Take Java. A long occupies 8 bytes, and a byte is 8 bits (1 byte = 8 bit), so the largest value a long can represent is 2^63-1. Relating this to the 2^64 figure above, suppose there are 2^63-1 numbers, taken from the range 0 ~ 2^63-1. Storing each one as a long, and using the rule 1K = 1024 bytes, the total memory is ((2^63-1) * 8/1024)K. That is an enormous number, far beyond 12K of storage. Yet HyperLogLog can do the counting within 12K.
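Working the arithmetic out, and rounding 2^63 - 1 up to 2^63 to keep the powers of two clean:

```latex
\frac{2^{63} \times 8}{1024}\,\text{K} = \frac{2^{66}}{2^{10}}\,\text{K} = 2^{56}\,\text{K} \approx 7.2 \times 10^{16}\,\text{K} \gg 12\,\text{K}
```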
Bernoulli trials
Before we can understand why HyperLogLog is able to count a huge amount of data with very little memory, we first need to understand Bernoulli trials.
Bernoulli trials belong to probability theory, and the classic illustration is tossing a coin.
A coin has two sides, heads and tails. On each toss, heads and tails each come up with probability 50%. Suppose we keep tossing the coin until heads appears, and record that as one complete trial. Heads may appear on the very first toss, or it may take four tosses. However many tosses it takes, as soon as heads appears we record one trial. This experiment is a Bernoulli trial.
Now consider many Bernoulli trials, say n of them, which means heads has appeared n times. Suppose each trial takes k tosses: the first trial takes k1 tosses, and so on, up to the n-th trial, which takes kn tosses.
Among these n trials there is necessarily a largest number of tosses. For example, if the longest trial needed 12 tosses before heads appeared, we call that number k_max, the maximum number of tosses.
For Bernoulli trials, the following conclusions are easy to draw:
- In n Bernoulli trials, no trial takes more than k_max tosses.
- In n Bernoulli trials, at least one trial takes exactly k_max tosses.
Combining these with maximum likelihood estimation, one finds an estimating relationship between n and k_max: n = 2^(k_max). Estimating a property of the whole data stream from a single local observation may seem counterintuitive, and the relationship has to be derived and verified with the methods of probability and statistics.
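One informal way to see where the relationship comes from is the following sketch. It assumes a fair coin and independent trials, and it is a plausibility argument rather than a full maximum likelihood derivation. A single trial needs at most k tosses with probability 1 - 2^{-k}, and at least k tosses with probability 2^{-(k-1)}, so for n trials:

```latex
P(\text{all } n \text{ trials need} \le k \text{ tosses}) = \left(1 - 2^{-k}\right)^{n}

P(\text{at least one trial needs} \ge k \text{ tosses}) = 1 - \left(1 - 2^{-(k-1)}\right)^{n}
```

If n is much smaller than 2^k, the second probability is close to 0, so a maximum as large as k is unlikely. If n is much larger than 2^k, the first probability is close to 0, so a maximum as small as k is unlikely. The observed k_max therefore tends to sit where 2^(k_max) is of the same order as n.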
For example:
Trial 1: heads appears after 3 tosses, so k=3, n=1
Trial 2: heads appears after 2 tosses, so k=2, n=2
Trial 3: heads appears after 6 tosses, so k=6, n=3
Trial n: heads appears after 12 tosses, so we estimate n = 2^12
Suppose the example above consists of only 3 trials. Then k_max = 6 and the true n = 3. Plugging these into the estimation formula clearly fails: 3 ≠ 2^6. In other words, when the number of trials is small, the error of this estimate is large.
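The coin-toss experiment is easy to simulate. The sketch below assumes a fair coin modelled with Python's random module; the function names are invented for illustration. Running it for a few values of n shows how rough a single estimate of 2^(k_max) can be.

```python
import random


def tosses_until_heads(rng):
    """Run one trial: toss a fair coin until heads, return the toss count k."""
    k = 1
    while rng.random() < 0.5:  # treat a value below 0.5 as tails and toss again
        k += 1
    return k


def estimate_trials(n, rng):
    """Run n trials and return (k_max, estimate), where estimate = 2 ** k_max."""
    k_max = max(tosses_until_heads(rng) for _ in range(n))
    return k_max, 2 ** k_max


rng = random.Random(42)  # fixed seed so that repeated runs give the same numbers
for n in (3, 1000, 100000):
    k_max, estimate = estimate_trials(n, rng)
    print(f"n={n} k_max={k_max} estimate={estimate}")
```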
Optimizing the estimate
The three trials in the example above make up one round of estimation. With only one round, the estimation error shrinks once n is large enough, but it is still not small enough.
Could we run several rounds instead? For example, run 100 or more rounds, take the k_max of each round, and average them: (k_max_1 + ... + k_max_100)/100. Then use that average to estimate n. This is the LogLog estimation formula:
DV_LL = constant * m * 2^(R̄)
In this formula, DV_LL corresponds to n. constant is a correction factor; its exact value is not fixed and can be set by branching on the actual situation. m is the number of rounds. R̄, the R with a bar on top, is the average: (k_max_1 + ... + k_max_m)/m.
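The multi-round idea can be tried out with a small simulation. This is a sketch under assumptions: the coin is simulated with Python's random module, the function names are invented, and the default constant 0.39701 is the asymptotic value commonly quoted for LogLog rather than something taken from this article.

```python
import random


def tosses_until_heads(rng):
    """One trial: toss a fair coin until heads and return the number of tosses."""
    k = 1
    while rng.random() < 0.5:  # a value below 0.5 counts as tails
        k += 1
    return k


def loglog_estimate(k_max_per_round, constant=0.39701):
    """Apply DV_LL = constant * m * 2 ** (average k_max) to m rounds."""
    m = len(k_max_per_round)
    average = sum(k_max_per_round) / m
    return constant * m * 2 ** average


rng = random.Random(7)
m, trials_per_round = 256, 200  # true n = m * trials_per_round = 51200
k_maxes = [
    max(tosses_until_heads(rng) for _ in range(trials_per_round))
    for _ in range(m)
]
print(round(loglog_estimate(k_maxes)))
```

Averaging over 256 rounds should land much closer to the true 51200 than a single 2^(k_max) would.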
Adding more rounds and then averaging the k_max values is the optimization used by the LogLog algorithm. HyperLogLog differs from LogLog in that it uses the harmonic mean instead of the arithmetic mean. The advantage of the harmonic mean over the arithmetic mean is that it is much less affected by large values. Here is an example:
Averaging wages:
A earns 1,000 / month and B earns 30,000 / month. The arithmetic mean is: (30000 + 1000) / 2 = 15500
The harmonic mean is: 2 / (1/1000 + 1/30000) ≈ 1935.484
Clearly, the harmonic mean gives the more representative result here. The harmonic mean is computed as follows, where ∑ is the summation sign:
H_n = n / (1/x_1 + 1/x_2 + ... + 1/x_n) = n / ∑(1/x_i), for i = 1 ... n
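The wage example can be checked in Python. The standard library's statistics module provides both means; harmonic_mean_by_hand is a hypothetical helper that spells out the calculation as n divided by the sum of the reciprocals.

```python
from statistics import harmonic_mean, mean


def harmonic_mean_by_hand(values):
    """n divided by the sum of the reciprocals of the values."""
    return len(values) / sum(1 / v for v in values)


wages = [1000, 30000]

print(mean(wages))                   # arithmetic mean, dominated by the large wage
print(harmonic_mean(wages))          # harmonic mean, stays close to the small wage
print(harmonic_mean_by_hand(wages))  # same value as harmonic_mean
```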
Relating it to the problem
From the above we know that, in the coin-toss example, n can be estimated from the k_max observed in the Bernoulli trials.
So how does this estimation method relate to the following problem?
Count how many users visit a given page of an app or website each day. Repeated visits by the same user count only once.
This is exactly what HyperLogLog does. Input data goes through the following steps:
1. Bit string
A hash function converts the data into a bit string; for example, the input 5 becomes 101. Why convert it?
The conversion lets us map the data onto coin tosses. In a bit string, 0 stands for tails and 1 for heads. If a piece of data ends up as 10010000, then reading from right to left, from the low bit to the high bit, we can treat the first 1 as the first time heads appears.
So, following the estimation result above: just as the total number of coin-toss trials can be estimated from the largest number of tosses needed to get heads, the number of stored items can be estimated from k_max, the largest position of the first 1 among the converted bit strings.
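The position of the first 1, read from right to left, can be computed with a little bit arithmetic. The helper name first_one_position is made up for this sketch, and positions are counted from 1 as in the text.

```python
def first_one_position(bits):
    """Position of the lowest set bit, counted from the right starting at 1.

    Returns 0 when no bit is set.
    """
    if bits == 0:
        return 0
    # bits & -bits isolates the lowest set bit; bit_length gives its position.
    return (bits & -bits).bit_length()


print(format(5, "b"))                  # 101
print(first_one_position(0b10010000))  # 5
```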
2. Bucketing
Bucketing means deciding how many rounds there are. In terms of computer storage, take a large bit array S of length L and divide S evenly into m groups. This m is the number of rounds, and every group occupies the same number of bits, p. The following relations are easy to see:
- L = S.length
- L = m * p
- Memory occupied by S, in K = L / 8 / 1024
In Redis, HyperLogLog uses m = 16384, p = 6, L = 16384 * 6. Memory occupied = 16384 * 6 / 8 / 1024 = 12K
Visualized:
Group 0 Group 1 .... Group 16383
[000 000] [000 000] [000 000] [000 000] .... [000 000]
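The memory arithmetic for this layout can be checked directly. The sketch uses 16384 = 2^14 buckets of 6 bits each, the values given in the Redis section later in this article.

```python
m = 16384                 # number of buckets (rounds), 2 ** 14
p = 6                     # bits per bucket
L = m * p                 # total length of the bit array S, in bits

memory_in_k = L / 8 / 1024        # bits -> bytes -> K
largest_bucket_value = 2 ** p - 1  # largest number one bucket can hold

print(memory_in_k)           # 12.0
print(largest_bucket_value)  # 63
```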
3. Correspondence
Now back to the original problem of counting the users of an app page.
- Let the key of the app's home page be: main
- Let the user ids be: idn, n -> 0,1,2,3 ....
In this counting problem, each distinct user id identifies one user, so we can use the user id as the input of the hash function, that is:
hash(id) = bit string
Different user ids yield different bit strings (ignoring hash collisions), and each bit string contains a 1 at some position. We treat each bit string as one Bernoulli trial.
Now we split the trials into rounds, that is, into buckets. We take the first few bits of each bit string, convert them to decimal, and use that value as the number of the bucket the bit string belongs to. Suppose the lowest two bits of the bit string are used to compute the bucket number, and a user id hashes to the bit string 1001011000011. Its bucket index is 11(2) = 1*2^1 + 1*2^0 = 3, so it falls into bucket 3, that is, round 3.
With the bucket number computed, the rest of the bit string is 10010110000. Reading from the low bit to the high bit, the first 1 appears at position 5. In other words, in bucket 3, that is, in round 3, k_max = 5. In binary, 5 is 101, and since each bucket has p bits, 101 can be stored as long as p >= 3.
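The worked example can be reproduced with two small helpers. Both function names are invented for this sketch; bucket_bits=2 matches the assumption that the lowest two bits select the bucket.

```python
def split_bit_string(bits, bucket_bits):
    """Split an integer into (bucket index, remaining high bits)."""
    bucket = bits & ((1 << bucket_bits) - 1)  # the low bucket_bits bits
    rest = bits >> bucket_bits                # everything above them
    return bucket, rest


def first_one_position(bits):
    """Position of the lowest set bit, counted from the right starting at 1."""
    return (bits & -bits).bit_length()


bucket, rest = split_bit_string(0b1001011000011, bucket_bits=2)
print(bucket)                    # 3
print(format(rest, "b"))         # 10010110000
print(first_one_position(rest))  # 5
```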
Following this process, many different user ids are spread over the different buckets, and each bucket has its own k_max. When we want to count how many users visited the main page, we produce one estimate: combine the k_max values of all buckets and substitute them into the estimation formula.
The following is HyperLogLog's estimation formula, which uses the harmonic mean. The variables have the same meaning as in LogLog, and R_j is the k_max of bucket j:
DV_HLL = constant * m * (m / ∑ 2^(-R_j)), for j = 1 ... m
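Putting the three steps together, a toy HyperLogLog fits in a short class. This is a minimal sketch under several assumptions, not Redis's implementation: the class name is invented, SHA-1 truncated to 64 bits stands in for the hash function, the constant uses the 0.7213 / (1 + 1.079 / m) expression that appears in the bias correction section of this article, and no correction for very small or very large counts is applied.

```python
import hashlib


class HyperLogLogSketch:
    """Toy HyperLogLog: raw estimate only, without range corrections."""

    def __init__(self, bucket_bits=10):
        self.bucket_bits = bucket_bits
        self.m = 1 << bucket_bits        # number of buckets
        self.registers = [0] * self.m    # one k_max per bucket

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        bits = int.from_bytes(digest[:8], "big")    # 64-bit bit string
        bucket = bits & (self.m - 1)                # low bits pick the bucket
        rest = bits >> self.bucket_bits             # remaining bits
        if rest == 0:
            position = 64 - self.bucket_bits + 1    # no 1 found in the rest
        else:
            position = (rest & -rest).bit_length()  # first 1 from the right
        self.registers[bucket] = max(self.registers[bucket], position)

    def count(self):
        constant = 0.7213 / (1 + 1.079 / self.m)    # assumes m >= 128
        reciprocal_sum = sum(2.0 ** -r for r in self.registers)
        return constant * self.m * self.m / reciprocal_sum


hll = HyperLogLogSketch()
for n in range(20000):
    hll.add(f"id{n}")
    hll.add(f"id{n}")  # a repeated visit does not change any register
print(round(hll.count()))
```

With 1024 buckets the expected standard error is about 1.04/√1024, a little over 3%, so the printed value should be near 20000 without matching it exactly.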
HyperLogLog in Redis
In Redis, HyperLogLog is one of the advanced data structures. It provides commands including, but not limited to, the following two:
- pfadd key value: store value under key
- pfcount key: count the number of distinct values stored under key
Recall the original problem of counting the users of an app page. If key is the page name and value is the user id, the problem maps onto these commands exactly.
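How HyperLogLog works in Redis
As we saw earlier, the Redis implementation has 16384 buckets, that is, 2^14 = 16384. Each bucket has 6 bits, so the largest number a bucket can hold is 2^5+2^4+...+1 = 63, which is 111 111 in binary.
For the command: pfadd key value
When a value is stored, it is hashed into a 64-bit bit string. The first 14 bits, counted from the right, are used for bucketing: they select the bucket that stores the position of the first 1, reading from right to left, in the remaining bits. Call that position index. When index = 5, the bit string looks like this: ....10000 [01 0000 0000 0000]
14 bits are used for the bucket number because there are 16384 buckets and 2^14 = 16384, so every bucket can be addressed and none is wasted. Suppose the first 14 bits of a bit string are 00 0000 0000 0010 (reading from right to left). Their decimal value is 2, so index is converted and stored in bucket number 2.
Conversion rule for index:
The full bit string of a value is 64 bits long, and taking away 14 leaves 50 bits. In the extreme case the first 1 appears at the 50th bit, so index = 50. Converted to binary, that is 110010.
Each of the 16384 buckets consists of 6 bits, so 110010 fits into bucket number 2. Note that 50 is the worst case and it still fits, so every other value fits as well.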
A key passed to pfadd can be given multiple values, as in the following example:
pfadd lgh golang
pfadd lgh python
pfadd lgh java
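Following the procedure above, different values are placed into different buckets. If two values land in the same bucket, that is, their first 14 bits are identical but the position of the first 1 in the remaining bits differs, the new index is compared with the one already stored. If the new index is larger, it replaces the stored one; otherwise the bucket is left unchanged.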
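The bucketing and update steps described in this section can be sketched in Python. This follows the article's description rather than the actual Redis source code; the function name update is invented, the hash values are hand-built instead of coming from a real hash function, and skipping an all-zero remainder is a simplification made for this sketch.

```python
BUCKET_BITS = 14
BUCKETS = 1 << BUCKET_BITS   # 16384
MAX_REGISTER = (1 << 6) - 1  # 63, the largest value 6 bits can hold


def update(registers, hash64):
    """Record one 64-bit hash and return (bucket number, stored index)."""
    bucket = hash64 & (BUCKETS - 1)      # low 14 bits choose the bucket
    rest = hash64 >> BUCKET_BITS         # remaining 50 bits
    if rest == 0:
        return bucket, registers[bucket]  # no 1 in the rest: this sketch skips it
    index = (rest & -rest).bit_length()  # position of the first 1, at most 50
    assert index <= MAX_REGISTER         # always fits in a 6-bit bucket
    registers[bucket] = max(registers[bucket], index)  # keep the larger index
    return bucket, registers[bucket]


registers = [0] * BUCKETS
# ....10000 [01 0000 0000 0000]: bucket bits 01000000000000, first 1 at position 5
print(update(registers, (0b10000 << 14) | 0b01000000000000))  # (4096, 5)
# Same bucket, first 1 at position 3: the stored 5 is kept
print(update(registers, (0b100 << 14) | 0b01000000000000))    # (4096, 5)
```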
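In the end, the 16384 buckets belonging to a key have absorbed many values, and each bucket holds one k_max. When pfcount is called, the estimation method described earlier computes how many distinct values have been added to the key, which is the statistic we want.
A value is converted into a 64-bit bit string and recorded in a bucket as described above. A 64-bit string can take 2^64 different values, and HyperLogLog uses only 16384 * 6 / 8 / 1024 = 12K of storage to count up to 2^64 items.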
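Bias correction
In the estimation formula, the constant variable is not a fixed value; it is set by branching on the actual situation, for example as shown below.
Assume m is the number of buckets and p is the base-2 logarithm of m.
// m is the number of buckets, p = log2(m)
// note: constant here already includes the factor m * m
switch (p) {
    case 4:
        constant = 0.673 * m * m;
        break;
    case 5:
        constant = 0.697 * m * m;
        break;
    case 6:
        constant = 0.709 * m * m;
        break;
    default:
        constant = (0.7213 / (1 + 1.079 / m)) * m * m;
}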
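On the shoulders of giants
That a simple coin-toss experiment can lead to such a striking algorithm shows how powerful mathematics is.
Thanks to the following two blog posts for their guidance:
All images in this article come from:
https://www.jianshu.com/p/55defda6dcd2
The content of this article draws on:
A website for interactively observing how LogLog and HyperLogLog change: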