The principle of HyperLogLog UV calculation

1. Starting from the Bernoulli experiment

What is a Bernoulli experiment?

A Bernoulli trial is a random experiment with exactly two possible outcomes: an event occurs or it does not. If we independently repeat such a trial n times under the same conditions, the sequence is called an n-fold Bernoulli experiment.

The most familiar Bernoulli trial in daily life is tossing a coin, which has only two outcomes: heads or tails. Suppose we independently run n rounds of coin tossing, where each round continues until heads appears for the first time, and let the number of tosses needed in each round be k1, k2, ..., kn respectively.

                                                Note: k_max = max(k1,k2,...,kn)

Then, by the method of maximum likelihood estimation (I have not worked through the derivation carefully yet; I will add it after further research), we obtain the estimated relationship between n and k_max:

                                                        n = 2^{k\_max}

An example borrowed from the Internet:

Experiment 1: heads on the 1st toss, so k=1;
Experiment 2: heads on the 3rd toss, so k=3;
Experiment 3: heads on the 6th toss, so k=6;
...
Experiment n: heads on the 10th toss, so k=10; the estimation relationship then gives n = 2^10.

But look at the small case above: after n=3 experiments we have k_max=6, and the estimate 2^6 = 64 ≠ 3, a large deviation. So a single application of this estimator has a large error.
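A minimal simulation makes this error easy to see. The following Python sketch (added here for illustration, not from the original post; the helper name `toss_until_heads` is my own) runs n rounds of "toss until the first heads" and compares 2^k_max with n:

```python
import random

def toss_until_heads():
    """Number of tosses until the first heads appears (heads/tails each 1/2)."""
    k = 1
    while random.random() < 0.5:  # tails, keep tossing
        k += 1
    return k

n = 3
k_max = max(toss_until_heads() for _ in range(n))
print(f"n = {n}, k_max = {k_max}, estimate 2^k_max = {2 ** k_max}")
# A single run can easily print k_max = 6, estimate = 64, far from n = 3.
```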

2. Optimizing the estimate

        To reduce the large estimation error described above, we increase the number of experiment rounds. The process is as follows:

        We conduct m rounds of experiments, each round consisting of n independent Bernoulli experiments, so that each round yields its own maximum, recorded as k_1_max, k_2_max, ..., k_m_max.

We then take the average of the per-round maxima k_i_max as the value of k_max, recorded as follows:

 k\_max = \frac{k\_1\_max + k\_2\_max + ... + k\_m\_max}{m}

then: n = 2^{k\_max}

Because we ran m rounds in total, the final estimation formula is as follows:

Estimate = m * n = m * 2^{k\_max}

Note: when I first saw this estimation formula, I didn't understand why it was multiplied by m. Now I do: k_max is the average over all rounds, so n = 2^k_max estimates the result of a single round. But we actually ran m rounds, a total of m*n trials, so the final estimate must be multiplied by m.

This estimation method is called LogLog.
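Sketched in code (reusing `toss_until_heads` from the earlier snippet), the averaging looks like this. Note this raw form still carries a roughly constant bias, which the real LogLog algorithm removes with a correction constant omitted here:

```python
def loglog_estimate(m, n):
    """m rounds of n experiments each; return m * 2^(average k_max)."""
    round_maxima = [max(toss_until_heads() for _ in range(n)) for _ in range(m)]
    k_max_avg = sum(round_maxima) / m
    return m * 2 ** k_max_avg

# The averaged estimate is far more stable than a single 2^k_max, though it
# still overestimates m * n by a roughly constant factor.
print(loglog_estimate(m=1024, n=100))
```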

3. The HyperLogLog process

        The only difference between HyperLogLog and LogLog is that LogLog takes the arithmetic mean of each round's maximum, while HyperLogLog takes the harmonic mean. The harmonic mean is used to suppress the huge disturbance a single abnormal value can cause. A vivid example illustrates the problem:

Ma Yun's monthly salary is 100 million, while the salaries of my nine friends and me are 10,000, 11,000, ..., 19,000.

Using the arithmetic mean, our average salary = 9,104,090.

Using the harmonic mean, our average salary = 15,303.

Clearly, the harmonic mean describes our typical salary far more accurately.

The formula for the harmonic mean is as follows:

H_{m} = \frac{m}{\sum_{i=1}^{m}\frac{1}{x_{i}}}

So the HyperLogLog estimation formula, with H_m taken as the harmonic mean of the per-round maxima k_i_max, is as follows:

Estimate = m * 2^{H_{m}}
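A quick sketch that verifies the salary numbers and plugs the harmonic mean into this article's simplified formula (the helper names are my own):

```python
def harmonic_mean(xs):
    """H_m = m / sum(1/x_i)."""
    return len(xs) / sum(1 / x for x in xs)

salaries = [100_000_000] + list(range(10_000, 20_000, 1_000))
print(sum(salaries) / len(salaries))  # arithmetic mean, about 9,104,090
print(harmonic_mean(salaries))        # harmonic mean, about 15,303

# This article's simplified HyperLogLog formula, Estimate = m * 2^H_m:
def hyperloglog_estimate(round_maxima):
    m = len(round_maxima)
    return m * 2 ** harmonic_mean(round_maxima)
```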

4. Why can Bernoulli trials estimate UV?

       1. How do we simulate a Bernoulli experiment in UV calculation?

             For each user id, we apply a hash function. Since user ids are essentially random, the hash values are uniformly random bit strings. We then count the number k of consecutive 0s in the low-order bits of the hash value. Each bit is 0 or 1 with probability 1/2, so this zero run plays the same role as the run of tails before the first heads in the coin-toss experiment, and k is again a random number. In this way, each hashed uid simulates one Bernoulli experiment.
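A sketch of this simulation (the choice of MD5 is my assumption; any uniformly distributed hash works):

```python
import hashlib

def low_order_zero_run(uid: str, bits: int = 32) -> int:
    """Count consecutive 0s in the low-order bits of hash(uid)."""
    h = int.from_bytes(hashlib.md5(uid.encode()).digest()[:4], "big")
    k = 0
    while k < bits and (h >> k) & 1 == 0:  # each bit acts like a fair coin toss
        k += 1
    return k

print(low_order_zero_run("uid1"))
```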

        2. Why can this process simulate UV deduplication?

        Suppose we have two sets of data:

        The first set: {uid1, uid2, uid3, uid4, uid5, uid6}, with no duplicate uids.

        The second set: {uid1, uid1, uid2, uid2, uid3, uid4, uid4, uid5, uid6}, with duplicate uids.

        Because the same uid always hashes to the same value, its run of low-order 0s is also always the same. Assume the low-order zero runs of uid1, uid2, uid3, uid4, uid5, and uid6 are k1, k2, k3, k4, k5, and k6 respectively.

        Then for the first set of data, k_max is:

        kmax_1 = max(k1, k2, k3, k4, k5, k6)

        For the second set of data, k_max is:

        kmax_2 = max(k1, k1, k2, k2, k3, k4, k4, k5, k6)

        Obviously: kmax_1 = kmax_2

        So the estimates for the two sets are also the same: duplicate uids have no effect on the result. This is why UV, the deduplicated user count, can be estimated with Bernoulli trials.
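A tiny check of this argument in code, reusing `low_order_zero_run` from the sketch above:

```python
group1 = ["uid1", "uid2", "uid3", "uid4", "uid5", "uid6"]
group2 = ["uid1", "uid1", "uid2", "uid2", "uid3",
          "uid4", "uid4", "uid5", "uid6"]

kmax_1 = max(low_order_zero_run(u) for u in group1)
kmax_2 = max(low_order_zero_run(u) for u in group2)
assert kmax_1 == kmax_2  # same uid => same hash => same zero run, so the
                         # maximum, and hence the estimate, is unchanged
```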

5. Bucketing and code implementation

        The principle of bucketing: partitioning the hashes into m buckets is equivalent to conducting m rounds of experiments, one round per bucket.
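As a rough sketch of that idea, following this article's simplified formulas rather than a production implementation (real implementations such as Redis add bias-correction constants and small/large-range corrections omitted here):

```python
import hashlib

def hll_estimate(uids, b=10):
    m = 1 << b                # m = 2^b buckets, i.e. m "rounds"
    maxima = [0] * m
    for uid in uids:
        h = int.from_bytes(hashlib.md5(uid.encode()).digest()[:8], "big")
        bucket = h & (m - 1)  # low b bits choose the bucket
        rest = h >> b         # remaining bits supply the coin tosses
        k = 1                 # position of the first 1 bit (zero run + 1)
        while rest & 1 == 0 and k < 64 - b:
            k += 1
            rest >>= 1
        maxima[bucket] = max(maxima[bucket], k)
    filled = [x for x in maxima if x > 0]           # skip empty buckets
    h_m = len(filled) / sum(1 / x for x in filled)  # harmonic mean, section 3
    return m * 2 ** h_m

print(hll_estimate(f"uid{i}" for i in range(100_000)))
```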

        There are still some parameter adjustments, which I won't go into here. The purpose of this article is to clarify the principle of HyperLogLog, and to think through and explain the points that other blogs leave unclear. For the bucketing process and a full code implementation, I recommend the following blog:

The use of HyperLogLog and its algorithm principle are explained in detail


Source: blog.csdn.net/chenzhiang1/article/details/126602472