Understanding the Redis HyperLogLog type

Table of contents

basic introduction

basic command

pfadd

pfcount

pfmerge

Statistical Visitor Application Scenarios 

What is UV, PV, DAU, MAU

scene description

java code example


basic introduction

HyperLogLog is an algorithm for cardinality estimation. Its advantage is that the space needed to compute the cardinality stays fixed and small, no matter how many input elements there are or how large they are.

In Redis, each HyperLogLog key needs only 12 KB of memory to estimate the cardinality of nearly 2^64 distinct elements. This is in stark contrast to a set, which consumes more memory as more elements are added.
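The 12 KB figure can be checked with a little arithmetic. The sketch below assumes the dense Redis encoding uses 2^14 = 16384 registers of 6 bits each, and that the standard error of HyperLogLog is roughly 1.04 / sqrt(m) for m registers. Both figures come from the Redis documentation and the HyperLogLog literature, not from this article, and the class and method names are only illustrative.

```java
final class HllSizing {
    // Assumed parameters of the dense Redis encoding (not stated in this article).
    static final int REGISTERS = 1 << 14;    // 16384 registers
    static final int BITS_PER_REGISTER = 6;  // each register holds a small counter

    /** Bytes needed to store all registers, ignoring the small key header. */
    static int registerBytes() {
        return REGISTERS * BITS_PER_REGISTER / 8;
    }

    /** Approximate standard error of the estimate, as a fraction. */
    static double standardError() {
        return 1.04 / Math.sqrt(REGISTERS);
    }

    public static void main(String[] args) {
        System.out.println(registerBytes() + " bytes");        // 12288 bytes = 12 KB
        System.out.printf("%.2f%%%n", standardError() * 100);  // about 0.81%
    }
}
```

Under these assumptions the registers take 12288 bytes, which is the 12 KB quoted above, and the expected error is a little under one percent.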

However, because HyperLogLog computes the cardinality from the input elements without storing the elements themselves, it cannot return the individual elements the way a set can.

What is the cardinality?

For example, if the data set is {1, 3, 5, 7, 5, 7, 8}, then its set of distinct elements is {1, 3, 5, 7, 8}, and its cardinality (the number of distinct elements) is 5. Cardinality estimation computes an approximate cardinality quickly, within an acceptable margin of error.

HyperLogLog is the cardinality estimation algorithm behind de-duplicated counting in Redis: it counts the distinct elements of a set, that is, the number of elements left after duplicates are removed. Keep in mind that the result is approximate: accuracy is traded for space, with a standard error of about 0.81%.

Full data set = {1,2,3,4,5,6,7,8,8,9,9,5}
After removing duplicates:
distinct elements = {1,2,3,4,5,6,7,8,9}, so the cardinality is 9
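For comparison, an exact count can be sketched with an ordinary Java `HashSet`. This is not what HyperLogLog does: the set stores every distinct element, which is exactly the memory cost HyperLogLog avoids. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ExactCardinality {
    /** Counts distinct values exactly by storing every one of them. */
    static int cardinality(List<Integer> values) {
        Set<Integer> distinct = new HashSet<>(values);
        return distinct.size();
    }

    public static void main(String[] args) {
        List<Integer> full = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 5);
        System.out.println(cardinality(full)); // 9 distinct values: 1 through 9
    }
}
```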

basic command

No.  Command                                     Description
1    PFADD key element [element ...]             Adds the specified elements to a HyperLogLog.
2    PFCOUNT key [key ...]                       Returns the cardinality estimate for the given HyperLogLog.
3    PFMERGE destkey sourcekey [sourcekey ...]   Merges multiple HyperLogLogs into one HyperLogLog.

pfadd

The PFADD command adds all of the given elements to the HyperLogLog stored at the key.

redis> PFADD mykey a b c d e f g h i j
(integer) 1

redis> PFCOUNT mykey
(integer) 10

Return value: integer. Returns 1 if at least one internal register of the HyperLogLog was altered (that is, the estimate may have changed), otherwise returns 0.

pfcount

The PFCOUNT command returns the cardinality estimate for the given HyperLogLog.

Syntax: PFCOUNT key [key...] 

redis> PFADD hll foo bar zap
(integer) 1

redis> PFADD hll zap zap zap
(integer) 0

redis> PFADD hll foo bar
(integer) 0

redis> PFCOUNT hll
(integer) 3

redis> PFADD some-other-hll 1 2 3
(integer) 1

redis> PFCOUNT hll some-other-hll
(integer) 6

Return value: integer. Returns the estimated cardinality of the given HyperLogLog. If several keys are given, returns the estimated cardinality of their union, not the sum of the individual estimates.
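With several keys, PFCOUNT reports the estimated cardinality of the union of the HyperLogLogs, which matters as soon as the keys share elements. The sketch below uses exact Java sets as a stand-in for HyperLogLogs to show the difference. The overlapping data and the class name are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class UnionVersusSum {
    /** Exact size of the union of two sets, the quantity PFCOUNT approximates for two keys. */
    static int unionSize(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.size();
    }

    public static void main(String[] args) {
        // Hypothetical contents: "zap" appears in both sets.
        Set<String> first = new HashSet<>(Arrays.asList("foo", "bar", "zap"));
        Set<String> second = new HashSet<>(Arrays.asList("zap", "1", "2"));
        System.out.println("sum of sizes = " + (first.size() + second.size())); // 6
        System.out.println("union size   = " + unionSize(first, second));       // 5
    }
}
```

In the redis-cli session above the two keys share no elements, so the union and the sum happen to coincide at 6.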

pfmerge

The PFMERGE command merges multiple HyperLogLogs into one HyperLogLog. The cardinality estimate of the merged HyperLogLog is computed from the union of all the given HyperLogLogs.

redis> PFADD hll1 foo bar zap a
(integer) 1

redis> PFADD hll2 a b c foo
(integer) 1

redis> PFMERGE hll3 hll1 hll2
OK

redis> PFCOUNT hll3
(integer) 6

Return value: OK.
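A merge of this kind is possible because two HyperLogLogs with the same parameters can be combined register by register, keeping the larger value in each slot. The sketch below shows that idea on tiny, made-up register arrays. It is not Redis's actual code, and the class name is hypothetical.

```java
import java.util.Arrays;

final class RegisterMerge {
    /** Merges two register arrays of equal length by keeping the larger value in each slot. */
    static byte[] merge(byte[] left, byte[] right) {
        if (left.length != right.length) {
            throw new IllegalArgumentException("register arrays must have the same length");
        }
        byte[] merged = new byte[left.length];
        for (int i = 0; i < left.length; i++) {
            merged[i] = (byte) Math.max(left[i], right[i]);
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] a = {0, 3, 1, 0}; // hypothetical registers of one HyperLogLog
        byte[] b = {2, 1, 1, 0}; // hypothetical registers of another
        System.out.println(Arrays.toString(merge(a, b))); // [2, 3, 1, 0]
    }
}
```

Because taking the maximum ignores order and repetition, an element present in both sources affects the merged registers only once, which is why the merged estimate reflects the union.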

Statistical Visitor Application Scenarios 

What is UV, PV, DAU, MAU

①. UV: Unique Visitor. The number of distinct visitors, usually identified by client IP (requires de-duplication).

②. PV: Page View. The number of page views (no de-duplication needed).

③. DAU: Daily Active User. The number of users who logged in to or used a product on a given day, with repeat logins by the same user counted once.

④. MAU: Monthly Active User. The number of users who were active in a given month.
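The difference between PV and UV can be sketched in a few lines of Java. The access log below is hypothetical, and an exact `HashSet` stands in for the de-duplication step.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class PvUv {
    /** PV: every page view counts, including repeats. */
    static int pageViews(List<String> visitorIps) {
        return visitorIps.size();
    }

    /** UV: each visitor counts once, however many pages they viewed. */
    static int uniqueVisitors(List<String> visitorIps) {
        return new HashSet<>(visitorIps).size();
    }

    public static void main(String[] args) {
        // Hypothetical access log: one entry per page view, keyed by client IP.
        List<String> log = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1");
        System.out.println("PV = " + pageViews(log));      // 5
        System.out.println("UV = " + uniqueVisitors(log)); // 3
    }
}
```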

scene description

The homepages of Taobao and Tmall see roughly 100 to 150 million unique visitors (UV) per day on average.

A naive approach would store up to 150 million IPs every day: when a visitor arrives, check whether the IP is already recorded, and add it if it is not.

Multiple visits by the same user within a day are counted only once.
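To get a feel for the savings, here is a rough back-of-the-envelope comparison. It assumes each IPv4 address could be packed into 4 bytes with no container overhead, which is a best case for exact storage (a real set would need considerably more). The class name is hypothetical.

```java
final class UvMemoryEstimate {
    /** Lower bound for storing every address exactly: 4 bytes per packed IPv4 address. */
    static long exactBytes(long distinctIps) {
        return distinctIps * 4L;
    }

    public static void main(String[] args) {
        long dailyIps = 150_000_000L; // upper figure quoted above
        long hllBytes = 12L * 1024;   // one HyperLogLog key, about 12 KB
        System.out.println("exact storage: " + exactBytes(dailyIps) + " bytes"); // 600000000 bytes
        System.out.println("HyperLogLog:   " + hllBytes + " bytes");             // 12288 bytes
    }
}
```

Even under this optimistic assumption, exact storage needs about 600 MB per day, while a single HyperLogLog key stays at 12 KB at the cost of a small estimation error.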

java code example

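// Imports cover both classes, which would normally live in separate files.
// On Spring Boot 3.x, javax.annotation becomes jakarta.annotation.
import io.swagger.annotations.ApiOperation;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import lombok.extern.slf4j.Slf4j;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

@Service
@Slf4j
public class HyperLogLogService {
    @Resource
    private RedisTemplate redisTemplate;

    /**
     * Simulates users clicking the homepage. Each user has a different IP;
     * HyperLogLog counts each distinct IP once, and repeats do not raise the count.
     */
    @PostConstruct
    public void init() {
        log.info("------simulating homepage clicks in the background, one IP per user");
        // A background thread stands in for real traffic; production code would not do this.
        new Thread(() -> {
            Random random = new Random();
            for (int i = 1; i <= 200; i++) {
                // nextInt(256) yields 0..255, the full range of an IPv4 octet
                String ip = random.nextInt(256) + "." + random.nextInt(256) + "."
                        + random.nextInt(256) + "." + random.nextInt(256);

                // PFADD: returns 1 if the HyperLogLog changed, 0 if it did not
                Long changed = redisTemplate.opsForHyperLogLog().add("hll", ip);
                log.info("ip={}, pfadd result={}", ip, changed);
                // Pause for 3 seconds between simulated visits
                try {
                    TimeUnit.SECONDS.sleep(3);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag and stop simulating
                    return;
                }
            }
        }, "t1").start();
    }
}

@RestController
@Slf4j
public class HyperLogLogController {

    @Resource
    private RedisTemplate redisTemplate;

    @ApiOperation("Get the de-duplicated homepage visit count (UV) by IP")
    @RequestMapping(value = "/uv", method = RequestMethod.GET)
    public long uv() {
        // PFCOUNT
        return redisTemplate.opsForHyperLogLog().size("hll");
    }
}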

HyperLogLog trades accuracy for space, so it suits scenarios that do not require exact counts. A probabilistic algorithm does not store the data itself; it estimates the cardinality with a statistical method while keeping the error within a known range. Not storing the data saves a great deal of memory. HyperLogLog is one implementation of such a probabilistic algorithm.
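To make the idea concrete, here is a minimal, simplified HyperLogLog in plain Java. It is a teaching sketch under stated assumptions, not Redis's implementation: the hash (FNV-1a followed by the MurmurHash3 finalizer) is a stand-in chosen for brevity, the registers use one byte each instead of 6 bits, and only the standard small-range correction (linear counting) is applied. The class name `TinyHyperLogLog` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class TinyHyperLogLog {
    private static final int P = 14;      // index bits (assumed, to mirror 16384 registers)
    private static final int M = 1 << P;  // number of registers
    private final byte[] registers = new byte[M];

    /** Adds an element; returns true if a register changed, loosely like PFADD returning 1. */
    boolean add(String element) {
        long hash = hash64(element);
        int index = (int) (hash >>> (64 - P)); // top P bits select a register
        long rest = hash << P;                 // the remaining 64 - P bits
        // Rank = position of the first 1 bit in the remaining bits (1-based).
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - P) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
            return true;
        }
        return false;
    }

    /** Estimates the number of distinct elements added so far. */
    long count() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / M); // bias constant for large M
        double estimate = alpha * M * (double) M / sum;
        if (estimate <= 2.5 * M && zeros > 0) {
            // Small cardinalities: fall back to linear counting on empty registers.
            estimate = M * Math.log((double) M / zeros);
        }
        return Math.round(estimate);
    }

    /** Stand-in 64-bit hash: FNV-1a, then the MurmurHash3 finalizer to spread the bits. */
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    public static void main(String[] args) {
        TinyHyperLogLog hll = new TinyHyperLogLog();
        for (int i = 0; i < 100_000; i++) {
            hll.add("user-" + i);
            hll.add("user-" + i); // duplicates never change the registers
        }
        System.out.println("estimate for 100000 distinct values: " + hll.count());
    }
}
```

The registers occupy a fixed amount of memory however many elements are added, and the estimate should land close to the true count, typically within a percent or so, rather than exactly on it.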

Origin blog.csdn.net/m0_62436868/article/details/132432080