Fourinone 4.17.10 released: a single machine completes routine statistics on hundreds of millions of records in milliseconds

Although AI is the hottest topic right now, big data and computing power remain essential foundations for machine learning and AI algorithms. In most of our business scenarios, mobile terminals and servers continuously generate log data, which is sent through message channels to the big data platform. The platform stores, processes, and aggregates the data, and algorithms then run on top of the statistics to mine user preferences, behaviors, and profiles. Our key task is therefore to compute indicators such as deduplicated users, new users, pv, uv, dau (daily active users), and mau (monthly active users); the less storage this takes and the faster it runs, the better. Fourinone (CoolHash) has its own original database engine design and intellectual property rights, so new capabilities can be added flexibly at the engine level. To provide the best solution for big data statistical computing, version 4.17 adds the following engine features:

 

1. Added two new atomic operations: increment (putPlus) and put-if-absent (putNx)

(1) Object putPlus(String key, T plusValue)

If the value stored under the key is numeric (int, long, double, float), plusValue (also numeric) is added to it. For example, plusValue=1 increments the value by 1 on each call; plusValue can also be a decimal. If the value stored under the key is a string, plusValue (a string) is appended to the end of the existing string, optionally separated by a delimiter. putPlus returns the value the key held before the operation.

 

(2) Object putNx(String key, T value)

If the key already exists, nothing is written; if it does not exist, the value is written. putNx returns the value the key held before the operation: null means the key did not exist, otherwise the existing value is returned.

 

Together, putPlus and putNx cover many atomic operations, such as counting statistics. In the guide shipped with the open source package, the countTest method in CountDemo.java demonstrates putPlus, and the putPlusTest and putNxTest methods in ThreadClient.java demonstrate their use from multiple threads.

The pvTest method demonstrates computing pv. If an id does not exist, it is written and the pv count is incremented by 1. When another thread finds that the id already exists, it does not update the pv count.

Object nx = chc.putNx("v0_" + i, i);
if (nx == null)
    chc.putPlus("pv_v0", 1);
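The return-value rules described above can be mimicked locally to see why the pvTest pattern counts each id only once. The following is a minimal sketch, not the CoolHash client: AtomicOpsSketch and countDistinct are hypothetical names, the store is a plain ConcurrentHashMap, and it assumes that putPlus on a missing key starts from the increment itself.

```java
import java.util.concurrent.ConcurrentHashMap;

// Local stand-in for the putNx / putPlus semantics described in the text.
// Hypothetical class: it only mirrors the documented return values.
final class AtomicOpsSketch {
    private final ConcurrentHashMap<String, Object> store = new ConcurrentHashMap<>();

    // Writes value only if the key is absent; returns the previous value (null if absent).
    Object putNx(String key, Object value) {
        return store.putIfAbsent(key, value);
    }

    // Atomically adds delta to the integer under key; returns the previous value.
    // Assumption: a missing key is treated as starting from delta.
    Object putPlus(String key, int delta) {
        Object[] previous = new Object[1];
        store.compute(key, (k, old) -> {
            previous[0] = old;
            return old == null ? delta : (Integer) old + delta;
        });
        return previous[0];
    }

    Object get(String key) {
        return store.get(key);
    }

    // Same shape as the pvTest snippet: only the first writer of an id bumps the counter.
    static int countDistinct(int[] ids) {
        AtomicOpsSketch chc = new AtomicOpsSketch();
        for (int id : ids) {
            if (chc.putNx("v0_" + id, id) == null) {
                chc.putPlus("pv_v0", 1);
            }
        }
        Object pv = chc.get("pv_v0");
        return pv == null ? 0 : (Integer) pv;
    }
}
```

Because both calls are atomic on their own, the check-then-increment pair stays correct even when several threads submit the same id: only one of them sees null from putNx.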

 

2. Added powerful bitmap support on both the client side and the storage engine side

The putPlus and putNx atomic operations above can compute pv, but they are not the most efficient solution. Bitmaps have two significant advantages: a very small storage footprint and very fast bitwise computation. Each ID to be counted is converted to a number that occupies a single bit. 2 billion user IDs need only 2 billion bits, about 238 MB, and even less after compression, as little as about 200 KB. A single bitmap completes deduplication, and bitmaps cover most statistics, such as daily, monthly, hourly, and per-minute active users, heavy users, new users, and user flow. These computations finish in milliseconds on a single machine, so results are available in real time. By contrast, offline computation on hadoop/hive runs SQL-like statements such as "select distinct count...from...groupby join...", which often need hundreds of machines and about 30 minutes to complete. That approach also tends to produce large numbers of scheduled SQL tasks and large table joins, which put heavy pressure on the cluster.
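The 238 MB figure follows directly from one bit per ID. A short sketch of the arithmetic (BitmapSizeSketch and bytesFor are illustrative names, not part of Fourinone):

```java
final class BitmapSizeSketch {
    // Bytes needed to hold one bit per id, rounded up to a whole byte.
    static long bytesFor(long ids) {
        return (ids + 7) / 8;
    }

    public static void main(String[] args) {
        long bytes = bytesFor(2_000_000_000L);      // 250,000,000 bytes
        double mib = bytes / (1024.0 * 1024.0);     // roughly 238.4 when 1 MB = 1024 * 1024 bytes
        System.out.println(bytes + " bytes, about " + mib + " MB");
    }
}
```

The compressed size depends on how the set bits are distributed, so it cannot be derived the same way.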

 

 

 

 

 

1. Deduplicated users: count the bits set to 1

2. Active users: OR

bitmap1 | bitmap2

3. Inactive users: NOT

~bitmap1

4. Heavy users: AND

bitmap1 & bitmap2

5. New users: OR, then XOR

(bitmap1 | bitmap2) ^ bitmap1

6. Combinations of indicators:

bitmap1 & bitmap2 & bitmap3 & …

and so on.
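The set operations listed above can be tried out with the JDK's own java.util.BitSet. This is only an illustration of the bit logic under that substitution; it does not use CoolBitSet, and BitmapStatsSketch with its method names is made up for the example.

```java
import java.util.BitSet;

// Illustrates the indicator formulas with java.util.BitSet as a stand-in bitmap.
final class BitmapStatsSketch {
    static BitSet of(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) b.set(id);
        return b;
    }

    // 1. Deduplicated users: number of bits set to 1.
    static int distinct(BitSet day) {
        return day.cardinality();
    }

    // 2. Active users: bitmap1 | bitmap2
    static BitSet active(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.or(b);
        return r;
    }

    // 3. Inactive users: ~bitmap1, limited to the first `universe` ids.
    static BitSet inactive(BitSet day, int universe) {
        BitSet r = (BitSet) day.clone();
        r.flip(0, universe);
        return r;
    }

    // 4. Heavy users: bitmap1 & bitmap2
    static BitSet heavy(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    // 5. New users: (bitmap1 | bitmap2) ^ bitmap1, i.e. ids in bitmap2 but not in bitmap1.
    static BitSet newUsers(BitSet previous, BitSet current) {
        BitSet r = active(previous, current);
        r.xor(previous);
        return r;
    }
}
```

For example, with yesterday = {1, 2, 3} and today = {2, 3, 4, 5}, the heavy users are {2, 3} and the new users are {4, 5}.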

 

This release also makes local bitmaps and engine-side bitmaps interoperable, which allows more flexible architecture designs. A bitmap can be compressed and stored in any database; a client can pull it back, complete the aggregation locally, and write the result back to the database. Multiple clients can also connect to the CoolHash storage engine at the same time and rely on the engine's own bitmap operations for deduplication, aggregation, decompression, and so on. Bitmaps combine with the storage engine as follows:



 

 

1. Local in-memory implementation. CoolBitSet implements the following bitmap functions:

CoolBitSet(int maxSize): the size limit can be specified and defaults to 10 million. There is no upper limit locally. Multiple partitioned bitmaps can represent data across the int or long range; each 10-million-bit bitmap compresses to within 2 MB, which makes it well suited to kv storage.

(1) Basic operations: CoolBitSet provides the basic get(int n), set(int n), and put(int n) operations, where put acts as get if the bit is already set and as set if it is not. It also provides a batch operation, int set(CoolBitSet cbs), which merges another bitmap object into the current bitmap and returns the number of newly added bits.
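One way to picture the partitioning idea is to split each ID into a partition number and an offset within a 10-million-bit bitmap. The sketch below assumes non-negative IDs and uses java.util.BitSet in place of CoolBitSet; PartitionedBitmapSketch and its methods are hypothetical names chosen for the example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Spreads long ids over fixed-size bitmaps, one bitmap per partition.
final class PartitionedBitmapSketch {
    static final int PARTITION_BITS = 10_000_000;   // matches the default size in the text
    private final Map<Long, BitSet> partitions = new HashMap<>();

    // Marks the id as seen; the partition bitmap is created on first use.
    void set(long id) {
        long partition = id / PARTITION_BITS;
        int offset = (int) (id % PARTITION_BITS);
        partitions.computeIfAbsent(partition, p -> new BitSet(PARTITION_BITS)).set(offset);
    }

    boolean get(long id) {
        BitSet p = partitions.get(id / PARTITION_BITS);
        return p != null && p.get((int) (id % PARTITION_BITS));
    }

    // Total number of distinct ids across all partitions.
    long total() {
        long sum = 0;
        for (BitSet b : partitions.values()) sum += b.cardinality();
        return sum;
    }

    int partitionCount() {
        return partitions.size();
    }
}
```

Each partition could then be stored under its own key, for instance a key that embeds the partition number, so that no single value grows past the storage limit.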

(2) Aggregation operations: AND, OR, XOR, NOT, and new

CoolBitSet and(CoolBitSet cbs): ANDs the two CoolBitSets, updates the current object, and returns a reference to it

CoolBitSet or(CoolBitSet cbs): ORs the two CoolBitSets, same as above

CoolBitSet xor(CoolBitSet cbs): XORs the two CoolBitSets, same as above

CoolBitSet andnot(): negates the CoolBitSet object, same as above

CoolBitSet setNew(CoolBitSet cbs): finds the new users of the current CoolBitSet and returns a reference to the resulting object

(3) Total count: int getTotal() returns the total number of users in the CoolBitSet, that is, the number of bits set to 1

(4) Capacity: int getSize() returns the capacity of the CoolBitSet

(5) Debugging view: String toString(int num) returns the binary string of the CoolBitSet. To keep the output short, the parameter num is the number of bytes to show. For example, num=5 shows the binary string of the first 5 bytes.

 

How this differs from Java's own bitmap: the BitSet class that ships with the JDK is implemented as a long array. It only accepts an initial size and cannot enforce a size limit, so a single bitset can consume hundreds of MB of memory, and multiple bitmaps can easily waste a lot of space. BitSet is also a purely local in-memory implementation, with no persistence support from a distributed storage engine.

 

2. Engine-side persistent implementation. CoolHashClient provides the following interfaces for operating on the storage engine:

(1) int putBitSet(String key, int index):

Single operation, similar to put on CoolBitSet. The first parameter is the key of the bitmap, and the second is the index position in the bitmap to set to 1.

(2) boolean getBitSet(String key, int index):

Single operation, similar to get on CoolBitSet. The first parameter is the key of the bitmap, and the second is the index position whose value is read.

(3) int putBitSet(String key, CoolBitSet cbs):

Batch operation, similar to the batch set on CoolBitSet. It merges another bitmap object into the bitmap stored under the specified key and returns the number of newly added bits. The CoolBitSet object itself is still retrieved with the ordinary get interface, Object get(String key).

(4) Object putBitSet(String key, CoolBitSet cbs, String logical):

Aggregation operation. The parameter logical can be one of "and", "or", "xor", "andnot", or "new". For "andnot", the parameter cbs has no effect, and any non-null CoolBitSet object can be passed in. The aggregation is applied to the bitmap stored under the key, and the return value is the aggregated CoolBitSet object.
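To make the dispatch on the logical parameter concrete, here is a minimal in-memory stand-in. It is not CoolHashClient and does not talk to a storage engine: EngineBitmapSketch is a hypothetical class, java.util.BitSet replaces CoolBitSet, and "new" is left out because its exact semantics are not spelled out in this text.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the engine-side bitmap operations described in the text.
final class EngineBitmapSketch {
    private final Map<String, BitSet> store = new HashMap<>();
    private final int size;   // capacity used when negating a bitmap

    EngineBitmapSketch(int size) {
        this.size = size;
    }

    static BitSet bits(int... indexes) {
        BitSet b = new BitSet();
        for (int i : indexes) b.set(i);
        return b;
    }

    // Batch merge: ORs `other` into the stored bitmap and returns how many bits were newly set.
    synchronized int putBitSet(String key, BitSet other) {
        BitSet current = store.computeIfAbsent(key, k -> new BitSet(size));
        int before = current.cardinality();
        current.or(other);
        return current.cardinality() - before;
    }

    // Aggregation: applies the named operation to the stored bitmap and returns a copy of the result.
    synchronized BitSet putBitSet(String key, BitSet other, String logical) {
        BitSet current = store.computeIfAbsent(key, k -> new BitSet(size));
        switch (logical) {
            case "and":    current.and(other); break;
            case "or":     current.or(other); break;
            case "xor":    current.xor(other); break;
            case "andnot": current.flip(0, size); break;   // negation; `other` is ignored
            default: throw new IllegalArgumentException("unsupported logical: " + logical);
        }
        return (BitSet) current.clone();
    }
}
```

Keeping the aggregation next to the stored bitmap means a client only sends the operand and receives the result, instead of pulling the whole bitmap, changing it, and writing it back.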

 

The operations above follow CoolHash's k/v storage constraints: k is a string, and v must not exceed 2 MB (the default, which can be changed in the configuration).

Note that a CoolBitSet object can be compressed and stored as k/v in three ways:

(1) Store in bitSet format, merge data: putBitSet(String key, CoolBitSet cbs)

(2) Store in bitSet format and directly overwrite: put(String key, CoolBitSet cbs)

(3) Ordinary kv storage format, non-bitSet format: put(String key, cbs.getBytes());

Because the value is stored as an object, all three put methods compress the value data, using gzip, which balances compression ratio against time.

The first two methods, which store in bitSet format, verify that the size of the CoolBitSet does not exceed 100 million; otherwise the write is rejected.

The third method, the ordinary kv storage format, has no 100 million limit. As long as the compressed size does not exceed 2 MB, the write succeeds. However, because the value is not in CoolBitSet format, the storage engine cannot recognize it or apply operations such as aggregation to it.
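The size check on the compressed value can be approximated locally with the JDK's gzip stream. This sketch only illustrates the idea: BitmapGzipSketch is a hypothetical helper, the 2 MB limit is taken from the default mentioned above, and the engine's own serialization and compression settings may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;
import java.util.zip.GZIPOutputStream;

// Checks whether a bitmap's gzip-compressed bytes fit under a value size limit.
final class BitmapGzipSketch {
    static final int VALUE_LIMIT_BYTES = 2 * 1024 * 1024;   // default limit from the text

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);   // not expected for an in-memory stream
        }
        return out.toByteArray();
    }

    static boolean fitsInValue(BitSet bitmap) {
        return gzip(bitmap.toByteArray()).length <= VALUE_LIMIT_BYTES;
    }
}
```

Sparse or regular bitmaps compress very well, while a bitmap with bits set at random positions compresses much less, so the check is worth doing per bitmap rather than assuming a fixed ratio.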

 

How this differs from the redis bitmap implementation: redis implements single and aggregation operations on bitmaps, but it has no batch operation and no compression. Because space is allocated according to the specified offset, it is easy to waste space.

 

The guide shipped with the open source package includes demos in CountDemo.java:

bitSetTest method: first demonstrates full storage, writing 1 billion data to 1 bitmap in less than 1 second; then demonstrates partitioned storage, splitting 100 million data into 10 bitmaps of 10 million each.

realtimeStatistics method: demonstrates real-time bitmap-based calculation of deduplicated users, active users, inactive users, heavy users, and new users.

retainLocal and retainServer methods: demonstrate how to calculate user retention using local memory and the storage engine, respectively.
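Retention on bitmaps reduces to an AND followed by two counts. The sketch below shows that calculation with java.util.BitSet; it is not the code of retainLocal or retainServer, and RetentionSketch is a name made up for the example.

```java
import java.util.BitSet;

// Retention: the share of first-day users who are also present on a later day.
final class RetentionSketch {
    static BitSet of(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) b.set(id);
        return b;
    }

    static double retentionRate(BitSet firstDay, BitSet laterDay) {
        if (firstDay.isEmpty()) {
            return 0.0;   // no first-day users, so nothing to retain
        }
        BitSet retained = (BitSet) firstDay.clone();
        retained.and(laterDay);   // users present on both days
        return (double) retained.cardinality() / firstDay.cardinality();
    }
}
```

With first-day users {1, 2, 3, 4} and later-day users {2, 4, 5}, two of the four first-day users return, so the rate is 0.5.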

 

3. Added String-type bitmap support:

StringBitMap implements a bitmap for String data. With an improved hash algorithm, it produces only a little over 200 collisions for 100 million strings and almost none for data sets under 50 million, which makes it a good fit for data sets of up to 100 million strings. It is still not suitable for more than 100 million strings, where the collision rate rises significantly. In the guide shipped with the open source package, the stringBitMapTest method in CountDemo.java simulates 10 million randomly generated 15-digit IMEI device numbers and returns the number of collisions.
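The general idea of a string bitmap is to hash each string to a bit position and treat an already-set bit as a duplicate or a collision. The sketch below uses CRC32 from the JDK purely as a stand-in hash; it is not the improved hash that StringBitMap uses, so its collision behavior will be different, and StringBitmapSketch is a hypothetical class.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.zip.CRC32;

// Maps strings to bit positions with a stand-in hash and counts slots that were already taken.
final class StringBitmapSketch {
    private final BitSet bits;
    private final int size;
    private int collisions;

    StringBitmapSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Returns true if the string's slot was free; false for a duplicate or a hash collision.
    boolean put(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        int index = (int) (crc.getValue() % size);   // getValue() is non-negative
        if (bits.get(index)) {
            collisions++;
            return false;
        }
        bits.set(index);
        return true;
    }

    int total() {
        return bits.cardinality();
    }

    int collisions() {
        return collisions;
    }
}
```

A bitmap alone cannot tell a true duplicate from two different strings that hash to the same position, which is why the collision rate limits how large a data set this approach can handle.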

 

Version 4.17.10 ships both the "fourinone.jar" package compiled with jdk1.8.0_151 and the "fourinone-jdk7.jar" package compiled with jdk1.7.0_80. The code on github and gitee has been updated to 4.17.10. All open source content in this version has been reported to the company. Thank you for supporting open source.

https://github.com/fourinone/fourinone 

https://gitee.com/fourinone/fourinone 

 

