Redis HyperLogLog keys (data structures series)

HyperLogLog

Estimates the cardinality of a very large number of elements using a constant amount of space.

Problem

Record the number of unique IP addresses that visit the site each day.

Set-based implementation

Store each visitor's IP in a set. Because every element of a set is distinct, the set ends up holding only the unique IPs, and the SCARD command then returns how many unique IPs there are.
For example, the program can use the following code to record the IP of each visitor to the site on August 15, 2014:
ip = get_visitor_ip()
SADD '2014.8.15::unique::ip' ip
It can then use the following command to get the number of unique IPs for that day:
SCARD '2014.8.15::unique::ip'
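The same idea can be sketched in Python, assuming a redis-py style client object that exposes `sadd` and `scard`. The function names and the key layout are illustrative assumptions, not part of the original program.

```python
def unique_ip_key(day):
    """Build the per-day set key (assumed layout: '<day>::unique::ip')."""
    return f"{day}::unique::ip"


def record_visit(client, day, ip):
    """Add the visitor's IP to that day's set; SADD ignores duplicates."""
    client.sadd(unique_ip_key(day), ip)


def unique_ip_count(client, day):
    """Return the number of distinct IPs recorded for the day (SCARD)."""
    return client.scard(unique_ip_key(day))
```

With redis-py, `client` would typically be a connected `redis.Redis()` instance, and `record_visit(client, '2014.8.15', ip)` would be called once per request.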

Problems with the set-based implementation

Stored as a string, each IPv4 address takes up to 15 bytes (format 'XXX.XXX.XXX.XXX', for example '202.189.128.186').
The following table shows how much memory a set needs to record different numbers of unique IPs:
[Image: table of set memory usage by number of unique IPs]
As the set records more and more IPs, it consumes more and more memory.
Storing IPv6 addresses would need even more memory.
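The growth can be sketched with some rough arithmetic. This counts only the 15 bytes per address quoted above and ignores Redis's own per-element overhead, so real usage would be higher. The visitor counts and the 30-day month are assumptions chosen for illustration.

```python
BYTES_PER_IPV4_STRING = 15  # 'XXX.XXX.XXX.XXX', the figure quoted in the text


def set_payload_bytes(unique_ips_per_day, days=1):
    """Raw bytes needed to hold every IP string, with one set per day."""
    return BYTES_PER_IPV4_STRING * unique_ips_per_day * days


# Hypothetical traffic levels, purely for illustration.
for n in (1_000_000, 10_000_000, 100_000_000):
    per_day_mb = set_payload_bytes(n) / 1e6
    per_month_gb = set_payload_bytes(n, days=30) / 1e9
    print(f"{n:>11,} IPs/day: {per_day_mb:,.0f} MB per day, "
          f"{per_month_gb:,.2f} GB per 30 days")
```

The cost is linear in both the number of visitors and the number of days kept.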

HyperLogLog

To better handle problems such as counting unique IP addresses, Redis added the HyperLogLog structure in version 2.8.9.

Introduction to HyperLogLog

HyperLogLog accepts multiple elements as input and gives an estimate of the cardinality of those elements:
• Cardinality: the number of distinct elements in a set. For example, the cardinality of {'apple', 'banana', 'cherry', 'banana', 'apple'} is 3.
• Estimate: the cardinality given by the algorithm is not exact. It may be slightly higher or lower than the actual value, but the error stays within a reasonable range (Redis documents a standard error of 0.81%).
The advantage of HyperLogLog is that, even when the number or volume of input elements is very large, the space needed to compute the cardinality is always fixed, and very small.
In Redis, each HyperLogLog key takes only 12 KB of memory and can estimate the cardinality of close to 2^64 distinct elements. This contrasts sharply with a set, which consumes more memory the more elements it holds.
However, because HyperLogLog only computes the cardinality from the input elements and does not store the elements themselves, it cannot return the individual input elements the way a set can.
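To make the "fixed space, no stored elements" property concrete, here is a simplified teaching sketch of the general HyperLogLog idea. It is not Redis's implementation: the class name, the SHA-1 hash and the choice of 1024 registers are assumptions made for illustration (Redis uses 16384 registers, which is where the 12 KB figure comes from, plus its own hash function and corrections).

```python
import hashlib
import math


class TinyHyperLogLog:
    """Toy HyperLogLog: a fixed array of small registers, no elements stored."""

    def __init__(self, p=10):
        self.p = p                    # number of hash bits used as register index
        self.m = 1 << p               # number of registers (1024 for p=10)
        self.registers = [0] * self.m

    def add(self, element):
        # Derive a 64-bit hash from the element (SHA-1 is an arbitrary choice).
        digest = hashlib.sha1(element.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                 # top p bits pick a register
        rest_bits = 64 - self.p
        rest = h & ((1 << rest_bits) - 1)          # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = rest_bits - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank           # keep only the maximum rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting on empty registers).
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


hll = TinyHyperLogLog()
for i in range(10_000):
    hll.add(f"visitor-{i}")
print(hll.count(), "estimated from", len(hll.registers), "registers")
```

However many elements are added, the structure holds only the register array, which is why the original elements cannot be listed back.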

Adding elements to a HyperLogLog

PFADD key element [element ...]
Adds any number of elements to the specified HyperLogLog.
The command may modify the HyperLogLog to reflect a new cardinality estimate. If the HyperLogLog's cardinality estimate changed after the command ran, the command returns 1; otherwise it returns 0.
The command's complexity is O(N), where N is the number of elements added.

Returning the cardinality estimate of a given HyperLogLog

PFCOUNT key [key ...]
When given a single HyperLogLog, the command returns that HyperLogLog's cardinality estimate. When given multiple HyperLogLogs, the command first computes the union of the given HyperLogLogs to obtain a merged HyperLogLog, then returns the merged HyperLogLog's cardinality estimate as its result (the merged HyperLogLog is not stored; it is discarded after use).
When the command is applied to a single HyperLogLog, its complexity is O(1), with a very low average constant time.
When the command is applied to multiple HyperLogLogs, its complexity is O(N), with a much larger constant time than for a single HyperLogLog.

Example of using PFADD and PFCOUNT
redis> PFADD unique::ip::counter '192.168.0.1'
(integer) 1
redis> PFADD unique::ip::counter '127.0.0.1'
(integer) 1
redis> PFADD unique::ip::counter '255.255.255.255'
(integer) 1
redis> PFCOUNT unique::ip::counter
(integer) 3
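From Python, the same two commands might be wrapped as follows. This is a sketch that assumes a redis-py style client exposing `pfadd(name, *values)` and `pfcount(*sources)`; the helper names and the per-day key layout are hypothetical.

```python
def hll_key(day):
    """Per-day HyperLogLog key (hypothetical layout)."""
    return f"{day}::unique::ip::hll"


def record_visit(client, day, ip):
    """PFADD the IP; True means the cardinality estimate changed."""
    return client.pfadd(hll_key(day), ip) == 1


def unique_ip_estimate(client, *days):
    """PFCOUNT over one or more daily keys.

    With several days, Redis returns the estimate for the union of the
    daily HyperLogLogs without storing the merged result.
    """
    return client.pfcount(*(hll_key(d) for d in days))
```

Calling `unique_ip_estimate(client, '2014.8.15', '2014.8.16')` would count a visitor seen on both days only once, at the cost of the slower multi-key form of PFCOUNT.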

Merging multiple HyperLogLogs

PFMERGE destkey sourcekey [sourcekey ...]
Merges multiple HyperLogLogs into a single HyperLogLog. The cardinality estimate of the merged HyperLogLog is computed from the union of all the given HyperLogLogs.
The command's complexity is O(N), where N is the number of HyperLogLogs being merged, though its constant factor is relatively high.

Example of using PFMERGE
redis> PFADD str1 "apple" "banana" "cherry" 
(integer) 1
redis> PFCOUNT str1
(integer) 3
redis> PFADD str2 "apple" "cherry" "durian" "mango"
(integer) 1
redis> PFCOUNT str2
(integer) 4
redis> PFMERGE str1&2 str1 str2
OK
redis> PFCOUNT str1&2
(integer) 5
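A typical use of PFMERGE is rolling several smaller counters up into one stored counter. The sketch below assumes a redis-py style client exposing `pfmerge(dest, *sources)` and `pfcount(*sources)`; the function name and the example key names are hypothetical.

```python
def merge_counters(client, dest_key, source_keys):
    """PFMERGE the source HyperLogLogs into dest_key and return its estimate.

    Unlike PFCOUNT with several keys, the merged HyperLogLog is kept under
    dest_key, so later reads are single-key PFCOUNT calls.
    """
    client.pfmerge(dest_key, *source_keys)
    return client.pfcount(dest_key)
```

For example, `merge_counters(client, 'month::hll', ['day1::hll', 'day2::hll'])` would leave a combined counter under `month::hll` while the daily keys stay untouched.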

Unique counter API and its implementation

[Image: API of the unique counter]
The implementation of this unique counter can be found in unique_counter.py.

unique_counter.py
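# encoding: utf-8

class UniqueCounter:
    """Approximate counter of distinct elements, backed by a Redis HyperLogLog."""

    def __init__(self, client, key):
        # client: a Redis client object that provides pfadd() and pfcount()
        self.client = client
        # key: name of the HyperLogLog key that holds this counter
        self.key = key

    def include(self, element):
        """Count the element; elements already counted leave the estimate unchanged."""
        self.client.pfadd(self.key, element)

    def result(self):
        """Return the current cardinality estimate."""
        return self.client.pfcount(self.key)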

Example of using the unique counter
# Create a unique counter for IP addresses
>>> ip_counter = UniqueCounter(client, 'unique::ip::counter')
# Add some IPs
>>> ip_counter.include('192.168.0.1')
>>> ip_counter.include('8.8.8.8')
>>> ip_counter.include('255.255.255.255')
# Check the counter's current value
>>> ip_counter.result()
3

Counting unique IPs with HyperLogLog

The following table shows how much memory HyperLogLog needs to record different numbers of unique IPs:
[Image: table of HyperLogLog memory usage by number of unique IPs]
As the table shows, to count the same number of unique IPs, HyperLogLog needs far less memory than a set.
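The difference can be sketched numerically using the two figures quoted in this article: 12 KB per HyperLogLog key and 15 bytes per IPv4 string in a set. The visitor counts are hypothetical, and the set figure ignores Redis's per-element overhead, so the real gap would be larger.

```python
HLL_BYTES_PER_KEY = 12 * 1024   # 12 KB per HyperLogLog key, as quoted for Redis
BYTES_PER_IPV4_STRING = 15      # 'XXX.XXX.XXX.XXX'


def hll_bytes(days=1):
    """One HyperLogLog key per day; the size does not depend on traffic."""
    return HLL_BYTES_PER_KEY * days


def set_payload_bytes(unique_ips_per_day, days=1):
    """Raw IP string bytes for one set per day."""
    return BYTES_PER_IPV4_STRING * unique_ips_per_day * days


# Hypothetical traffic levels, purely for illustration.
for n in (1_000_000, 10_000_000, 100_000_000):
    ratio = set_payload_bytes(n) / hll_bytes()
    print(f"{n:>11,} IPs/day: set needs roughly {ratio:,.0f}x the memory")
```

The HyperLogLog cost grows only with the number of days kept, while the set cost also grows with the number of visitors.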

Review

HyperLogLog accepts multiple elements as input and estimates the cardinality of those elements. Because a HyperLogLog needs only a small amount of memory to count a very large number of elements, a program that only needs the cardinality of its input, and not the individual elements, can save a great deal of memory by using HyperLogLog instead of a set.
HyperLogLog provides three commands: PFADD, PFCOUNT and PFMERGE.

Origin blog.csdn.net/qq_39885372/article/details/104245363