A Detailed Look at the HyperLogLog Data Structure in Redis

Background

Suppose we need to collect UV (unique visitor) statistics for our pages: the number of distinct users who visit each page, where multiple requests from the same user count only once. How do we implement this?

Some readers will say: just use GrowingIO; building this ourselves costs too much, so use someone else's product. That is a fine choice too.

But today, let's look at how we could implement this requirement ourselves in a way that stands up to high-TPS traffic.

Design

  1. Since page visitors must be deduplicated, the natural data structure is a set. Store each user ID in the set; for users who are not logged in, generate a random ID (from a timestamp, for example) and store that instead. For speed we want the data in memory, but we cannot rely on our own service's memory, so we turn to a third-party in-memory store: Redis, which ships with a set data structure (a minimal sketch of this approach follows the list below).
  2. With the direction and the middleware chosen, we must consider volume. The pages being counted are hot pages; if one page draws tens of millions of UVs a day, do we really want to spend that much storage on it? We also do not need perfectly precise numbers, just the rough request volume. Is there a better data structure?
  3. Redis provides the HyperLogLog data structure to solve exactly this statistical problem.
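
A minimal sketch of the set-based approach from step 1, using Jedis (the key name page:uv:home and the user IDs are made up for illustration):

import redis.clients.jedis.Jedis;

public class SetUvExample {
    public static void main(String[] args) {
        Jedis jedis = new Jedis();
        // Each visit adds the visitor's ID; SADD ignores duplicates.
        jedis.sadd("page:uv:home", "user1");
        jedis.sadd("page:uv:home", "user2");
        jedis.sadd("page:uv:home", "user1"); // repeat visit, not counted again
        // UV is simply the set's cardinality.
        System.out.println(jedis.scard("page:uv:home")); // prints 2
        jedis.close();
    }
}

This works, but every distinct user ID is stored in full, which is exactly the storage cost that step 2 objects to.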

HyperLogLog

Introduction

HyperLogLog provides an approximate deduplicated counting solution. It is inexact, but not wildly so: the standard error is 0.81%, which already satisfies the UV requirement above. Better still, a HyperLogLog key occupies at most 12 KB of memory no matter how many elements it counts, which answers the storage concern raised in the design discussion. HyperLogLog is one of Redis's advanced data structures and is very useful, so today we will take a deeper look at it.

Usage

  • HyperLogLog provides three commands: pfadd, pfcount, and pfmerge, whose meanings follow from their names. pfadd adds to the count and pfcount reads the count back. pfadd is used just like sadd on a set: when a user ID arrives, add it in. pfcount is used like scard: it returns the count value directly. pfmerge merges the statistics of several keys into one key, e.g., the combined count when two pages are merged (a pfmerge example follows the session below).
  • Some will ask: why the prefix pf rather than something like hll? PF are the initials of Philippe Flajolet, the inventor of this data structure.
  • For the business requirement above, use the page as the key and add each user ID into the HyperLogLog for counting:
127.0.0.1:6379> pfadd codehole user1
(integer) 1
127.0.0.1:6379> pfcount codehole
(integer) 1
127.0.0.1:6379> pfadd codehole user2
(integer) 1
127.0.0.1:6379> pfcount codehole
(integer) 2
127.0.0.1:6379> pfadd codehole user3
(integer) 1
127.0.0.1:6379> pfcount codehole
(integer) 3
127.0.0.1:6379> pfadd codehole user4
(integer) 1
127.0.0.1:6379> pfcount codehole
(integer) 4
127.0.0.1:6379> pfadd codehole user5
(integer) 1
127.0.0.1:6379> pfcount codehole
(integer) 5
127.0.0.1:6379> pfadd codehole user6
(integer) 1
127.0.0.1:6379> pfcount codehole
(integer) 6
127.0.0.1:6379> pfadd codehole user7 user8 user9 user10
(integer) 1
127.0.0.1:6379> pfcount codehole
(integer) 10
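
The session above only exercises pfadd and pfcount. Here is a quick pfmerge illustration (the keys page1, page2, and dest are made up for the example):

127.0.0.1:6379> pfadd page1 user1 user2 user3
(integer) 1
127.0.0.1:6379> pfadd page2 user3 user4
(integer) 1
127.0.0.1:6379> pfmerge dest page1 page2
OK
127.0.0.1:6379> pfcount dest
(integer) 4

The merged count is 4, not 5: user3 visited both pages but is counted once, because the merge is a deduplicated union.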

Tested by hand it seems perfectly accurate, so the claimed error feels hard to believe.
Let's run a much larger test with a script:

import redis.clients.jedis.Jedis;

public class JedisTest {
    public static void main(String[] args) {
        Jedis jedis = new Jedis();
        // Insert 100,000 distinct user IDs into one HyperLogLog key.
        for (int i = 0; i < 100000; i++) {
            jedis.pfadd("codehole", "user" + i);
        }
        long total = jedis.pfcount("codehole");
        System.out.printf("%d %d\n", 100000, total);
        jedis.close();
    }
}

After running it, the output is: 100000 99723.
Run it again: still 100000 99723.
The count does not change, which shows the deduplication really is in effect.
The difference is 277, an error of about 0.277%, well within the advertised 0.81% standard error.

Implementation principle

Several important concepts:

  1. HyperLogLog does not actually store the value of each element. It uses a probabilistic algorithm: it records, for each element's hash value, the position of the lowest 1 bit (i.e., the length of the run of trailing zeros), and estimates the number of elements from that.

  2. Bernoulli experiment: roughly, you repeatedly run a trial that either succeeds or fails, each with probability 50% and with no other factors involved. A round lasts until the first success (for example, 100 failures followed by a success on attempt 101 still make up just one round). Now suppose N such rounds were performed and only the length of the longest round is reported; can you guess how many rounds were performed in total, i.e., guess N? Researchers ran into exactly this problem and, through extensive experimentation, derived a formula that estimates N. For details, see https://zhuanlan.zhihu.com/p/58519480 (a small simulation of this idea follows this list).

  3. The idea, then: given a series of random integers, record the maximum length k of the run of consecutive zeros in the low bits; the total number of random integers can be estimated from this k. (The random integer is analogous to our user ID after hashing, and each hash bucket keeps its own maximum k; personal understanding.) This maps directly onto the Bernoulli experiment above: a run of low zero bits plays the role of a round of failures.

  4. All in all, probability formulas let us obtain a result close to the true count without storing the values themselves.
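
A minimal simulation of the Bernoulli idea above (assumptions: trials are fair 50/50 flips from ThreadLocalRandom, and the estimate uses the rough rule of thumb N ≈ 2^maxLen with no bias correction):

import java.util.concurrent.ThreadLocalRandom;

public class BernoulliTest {
    public static void main(String[] args) {
        int n = 100000; // the true number of rounds, to be "guessed" back
        int maxLen = 0;
        for (int i = 0; i < n; i++) {
            // One round: flip until the first success; round lengths follow
            // a geometric distribution, P(len = k) = 2^-k.
            int len = 1;
            while (ThreadLocalRandom.current().nextBoolean()) {
                len++;
            }
            maxLen = Math.max(maxLen, len);
        }
        // The longest of N rounds is about log2(N) flips long, so 2^maxLen
        // estimates N. A single estimate is very noisy, which is why
        // HyperLogLog averages over many buckets (see PfTest below).
        System.out.printf("actual=%d estimate=%.0f\n", n, Math.pow(2, maxLen));
    }
}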

Simple code implementation

  1. A simple implementation of the estimation approach (to help understanding):
import java.util.concurrent.ThreadLocalRandom;

public class PfTest {
    static class BitKeeper {
        // Largest run of trailing zero bits observed so far (one bucket).
        private int maxbits;

        // Feed one value: update maxbits with its trailing-zero run length.
        public void random(long value) {
            int bits = lowZeros(value);
            if (bits > this.maxbits) {
                this.maxbits = bits;
            }
        }

        // Count the trailing zero bits of value (position of its lowest 1 bit).
        private int lowZeros(long value) {
            int i = 1;
            for (; i < 32; i++) {
                if (value >> i << i != value) {
                    break;
                }
            }
            return i - 1;
        }
    }

    static class Experiment {
        private int n;
        private int k;
        private BitKeeper[] keepers;

        public Experiment(int n) {
            this(n, 1024);
        }

        public Experiment(int n, int k) {
            this.n = n;
            this.k = k;
            this.keepers = new BitKeeper[k];
            for (int i = 0; i < k; i++) {
                this.keepers[i] = new BitKeeper();
            }
        }

        // Generate n random values; bits 16-27 of each value pick a bucket,
        // and the bucket records the value's trailing-zero run.
        public void work() {
            for (int i = 0; i < this.n; i++) {
                long m = ThreadLocalRandom.current().nextLong(1L << 32);
                BitKeeper keeper = keepers[(int) (((m & 0xfff0000) >> 16) % keepers.length)];
                keeper.random(m);
            }
        }

        // Take the harmonic mean of the per-bucket maxbits, then estimate n
        // as 2^mean * (number of buckets); no bias-correction constant here.
        public double estimate() {
            double sumbitsInverse = 0.0;
            for (BitKeeper keeper : keepers) {
                sumbitsInverse += 1.0 / (float) keeper.maxbits;
            }
            double avgBits = (float) keepers.length / sumbitsInverse;
            return Math.pow(2, avgBits) * this.k;
        }
    }

    public static void main(String[] args) {
        for (int i = 100000; i < 1000000; i += 100000) {
            Experiment exp = new Experiment(i);
            exp.work();
            double est = exp.estimate();
            System.out.printf("%d %.2f %.2f\n", i, est, Math.abs(est - i) / i);
        }
    }
}
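
A note on the design: estimate() uses a harmonic mean rather than an arithmetic mean of the per-bucket maxbits, because a harmonic mean is much less distorted by the occasional bucket that happens to see an unusually long run of zeros. Production HyperLogLog implementations refine this further with bias-correction constants, which this teaching sketch omits.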

Summary

  1. HyperLogLog enables high-performance UV statistics, at the cost of a small percentage error.
  2. HyperLogLog has three commands: pfadd, pfcount, and pfmerge.
  3. The rough principle: HyperLogLog obtains its result through probability statistics (the Bernoulli experiment) without recording the actual values.
  4. We used Java to build a simple probabilistic estimate of how many random values were seen.

Further reading

https://zhuanlan.zhihu.com/p/58519480
https://en.wikipedia.org/wiki/HyperLogLog

"The Deep Adventure of Redis"-Book

Originally published at blog.csdn.net/weixin_40413961/article/details/108090463