Big data deduplication solutions

A table in the database is dedicated to storing users' dimension data; because a user's dimensions may change over time, a new record is saved on every view.
Now we need to analyze the data per user, but the table contains a great deal of duplicated data, and deduplicating with nothing but database equality comparisons is clearly not feasible.

Compute the MD5 value of the data content

    Characteristics of the MD5 value:
    1. Compression: for input data of arbitrary length, the computed MD5 value has a fixed length.
    2. Easy to compute: the MD5 value is easy to calculate from the original data.
    3. Tamper resistance: any change to the original data, even a single byte, yields a very different MD5 value.
    4. Strong collision resistance: given the original data and its MD5 value, it is very difficult to find other data with the same MD5 value (i.e., to forge the data).

Based on these characteristics, compute the MD5 value of each record's dimension content, then identify duplicate records by comparing the MD5 values.

Once the data is stored, duplicates can be detected directly with SQL, or the duplicated rows can be flagged and then removed.
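
As a minimal sketch of this approach, the snippet below fingerprints each record's dimension fields with MD5 and then finds duplicates with a plain SQL GROUP BY. The table name user_dim and its columns are illustrative assumptions, and sqlite3 stands in for the real database:

    import hashlib
    import sqlite3

    def md5_of_record(*fields) -> str:
        """Join the dimension fields and return their MD5 hex digest."""
        payload = "\x1f".join(str(f) for f in fields)  # separator avoids accidental joins
        return hashlib.md5(payload.encode("utf-8")).hexdigest()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_dim (user_id INT, dims TEXT, md5 TEXT)")
    rows = [(1, "a,b,c"), (1, "a,b,c"), (2, "x,y,z")]
    conn.executemany(
        "INSERT INTO user_dim VALUES (?, ?, ?)",
        [(uid, dims, md5_of_record(uid, dims)) for uid, dims in rows],
    )

    # Duplicate detection is now a plain GROUP BY on the stored MD5 value.
    for md5, cnt in conn.execute(
        "SELECT md5, COUNT(*) FROM user_dim GROUP BY md5 HAVING COUNT(*) > 1"
    ):
        print(md5, cnt)  # the fingerprint shared by the two duplicate rows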

At this scale, memory and CPU are limited within any fixed time budget, so checking and removing duplicates cannot hold all of the data in memory at once. Just as external sorting algorithms differ greatly from in-memory sorting algorithms, deliberate algorithm design is necessary when checking massive data for duplicates.

Bloom filter

A Bloom filter is a hash-based tool for duplicate checking. Each data item is hashed independently n times, each hash producing one integer, for n integers in total. A long bit array represents the possible integer values; on insert, the n array positions corresponding to those integers are flipped from 0 to 1 (positions already set stay 1). On a later lookup the same hashes are computed, and if all of those positions are 1, the item is judged to be present already.

The Bloom filter's advantages: it is easy to use; it is very space-efficient, because the keys themselves are never stored in memory; and since the hash functions are independent of one another they can be computed concurrently, so it is fast. The shortcoming is equally obvious: the algorithm can be wrong, which is captured by its false positive rate. Adding hash functions can reduce the false positive rate, but false positives can never be ruled out entirely.
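
Below is a minimal Bloom filter sketch along these lines; deriving the k independent hashes from seeded blake2b digests is an illustrative choice, not a recommendation:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, item: str):
            # k independent hashes, each mapped to a bit position.
            for seed in range(self.k):
                h = hashlib.blake2b(item.encode(), salt=seed.to_bytes(16, "big"))
                yield int.from_bytes(h.digest()[:8], "big") % self.size

        def add(self, item: str) -> None:
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item: str) -> bool:
            # Present only if all k positions are already set.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("record-42")
    print("record-42" in bf)  # True
    print("record-43" in bf)  # False (with high probability)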

BitMap

For example, suppose we must find the integers that do not repeat among 250 million integers, and memory cannot hold all 250 million of them at once.

A number can be in only three states: absent, present once, or present more than once. Two bits are therefore enough to store one number's state: say 00 means absent, 01 means present once, and 11 means present more than once. For a value range on the order of 250 million, 2 bits per value comes to about 250,000,000 × 2 bits ≈ 60 MB, i.e., tens of megabytes of storage. The task is then a single pass over the 250 million numbers: if a number's status bits are 00, set them to 01; if they are 01, set them to 11; if they are 11, leave them unchanged.

Finally, count the positions whose state is 01 to obtain the numbers that do not repeat. The time complexity is O(n).
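
A minimal sketch of the 2-bit bitmap follows, assuming the values fall in [0, 250,000,000); the roughly 60 MB bytearray packs four values per byte:

    N = 250_000_000
    bitmap = bytearray(N // 4 + 1)  # 2 bits per value -> 4 values per byte, ~60 MB

    def get_state(x: int) -> int:
        """Return the 2-bit state 0b00 / 0b01 / 0b11 stored for value x."""
        return (bitmap[x // 4] >> ((x % 4) * 2)) & 0b11

    def bump(x: int) -> None:
        """Advance x's state: absent -> seen once -> seen repeatedly (sticky)."""
        s = get_state(x)
        if s == 0b11:
            return  # already "more than once"; 11 stays 11
        new = 0b01 if s == 0b00 else 0b11
        shift = (x % 4) * 2
        bitmap[x // 4] = (bitmap[x // 4] & ~(0b11 << shift)) | (new << shift)

    for x in (5, 7, 7, 9, 9, 9):
        bump(x)

    print([x for x in (5, 7, 9) if get_state(x) == 0b01])  # -> [5]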

Hash grouping

Suppose there are two 50 GB data sets to check against each other for duplicates, and only 4 GB of memory. How do we proceed?

The idea is to hash each record of the 50 GB data and take hash % 1000, scattering the records into 1000 files; in theory, if the hash is good, these 1000 files come out nearly equal in size. If a record is duplicated, the two copies A and B must land in files with the same index, because they hash to the same value. Then load the 1000 pairs of files one by one and compare within each pair using a hash table. In short: first partition all the data by a key so that related records land in the same place, then compare within the resulting small files.
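
Here is a minimal sketch, assuming the two inputs are line-oriented text files (the names a.txt and b.txt are placeholders). Python's built-in hash is consistent within a single process run, which is all the partitioning needs; note that holding 1000 file handles open may exceed the default descriptor limit on some systems:

    import os

    NUM_BUCKETS = 1000

    def partition(path: str, out_dir: str) -> None:
        """Scatter each line into one of NUM_BUCKETS files by hash % NUM_BUCKETS."""
        os.makedirs(out_dir, exist_ok=True)
        outs = [open(os.path.join(out_dir, f"{i}.part"), "w")
                for i in range(NUM_BUCKETS)]
        with open(path) as f:
            for line in f:
                outs[hash(line) % NUM_BUCKETS].write(line)
        for o in outs:
            o.close()

    def duplicates(dir_a: str, dir_b: str):
        """Equal lines always fall into buckets with the same index, so only
        matching bucket pairs (each small enough for memory) are compared."""
        for i in range(NUM_BUCKETS):
            with open(os.path.join(dir_a, f"{i}.part")) as f:
                seen = set(f)
            with open(os.path.join(dir_b, f"{i}.part")) as f:
                yield from (line for line in f if line in seen)

    partition("a.txt", "buckets_a")
    partition("b.txt", "buckets_b")
    for dup in duplicates("buckets_a", "buckets_b"):
        print(dup, end="")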

Given 10 million text messages, how do we find the 10 that recur most often?

A hash table method works: process the 10 million messages in groups, building a hash table while scanning. On the first scan, take the first byte, the last byte, and any two middle bytes of each message as its hash code and insert it into the hash table, recording the message's address, length, and repeat count; for 10 million messages this bookkeeping easily fits in memory. Only messages whose hash code and length both match are suspected duplicates and compared in full. An identical message is added to the hash table only once; later occurrences merely increment its repeat count. After one scan, every message's repeat count has been recorded, and a second pass processes the hash table: linear-time selection then finds the top 10 at the O(n) level. If grouping is used, the hash guarantees that duplicates of a message never straddle groups, so the per-group top-10 candidates can be merged safely; one can also partition by ranges of the hash value.
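
As a compact sketch, Python's built-in hash table can play the role of the hand-rolled one described above, with the selection step done by a bounded heap:

    from collections import Counter
    import heapq

    messages = ["hi", "ok", "hi", "buy now", "hi", "ok"]  # stand-in for 10 million SMS

    counts = Counter(messages)  # one scan: message -> repeat count
    top10 = heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])
    print(top10)  # [('hi', 3), ('ok', 2), ('buy now', 1)]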

Build a deduplication index on one or more key database fields

Deduplicating by URL address:

Usage scenario: the data behind a URL address does not change, i.e., the URL uniquely identifies the data.

The idea:

  store every URL that has been requested in a Redis set

  when a URL address is obtained, check whether it already exists in the Redis set

    present: the URL has already been requested, so do not request it again

    absent: the URL has not been requested yet, so request it and then store the URL address in the Redis set (see the sketch below)
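
A minimal sketch with redis-py follows, assuming a local Redis server and a set key named seen_urls (both illustrative); SADD's return value makes the membership test and the insert a single atomic step:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def should_request(url: str) -> bool:
        # SADD returns 1 if the URL was newly added (never seen before),
        # 0 if it was already in the set, so check-and-mark is atomic.
        return r.sadd("seen_urls", url) == 1

    for url in ["http://example.com/a", "http://example.com/a"]:
        print(url, "->", "request" if should_request(url) else "skip")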

Bloom filter:

  hash the URL address with several independent hash functions, obtaining several values

  set the bit positions corresponding to those values to 1

  for a new URL address, generate values with the same hash functions

    if the positions corresponding to all the values are already 1, this URL address has been crawled

    otherwise it has not been crawled, so set the corresponding positions to 1 and crawl it
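
Reusing the BloomFilter sketch defined in the earlier section (it must be in scope), a crawl-frontier check might look like this:

    bf = BloomFilter(size_bits=1 << 24, num_hashes=5)

    def seen_before(url: str) -> bool:
        if url in bf:   # all hash positions already 1 -> probably crawled
            return True
        bf.add(url)     # otherwise set the positions and let it be crawled
        return False

    print(seen_before("http://example.com/page"))  # False: first time
    print(seen_before("http://example.com/page"))  # True: already recorded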

Deduplicating by the data itself:

  select one or more specific fields that can uniquely identify the data, hash them with a digest algorithm (MD5, SHA-1), and store the resulting string in a Redis set

  when new data arrives, hash it in the same way

    if the resulting string already exists in the Redis set, the data exists, so update it

    otherwise the data does not exist, so insert it (see the sketch below)
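
A minimal sketch, assuming the uniquely identifying field is order_id and the Redis set key is data_fingerprints (both illustrative names):

    import hashlib
    import redis

    r = redis.Redis()

    def fingerprint(record: dict) -> str:
        return hashlib.md5(str(record["order_id"]).encode("utf-8")).hexdigest()

    def upsert(record: dict) -> str:
        fp = fingerprint(record)
        if r.sadd("data_fingerprints", fp) == 0:
            return "update"  # fingerprint already present -> data exists
        return "insert"      # new fingerprint -> data did not exist yet

    print(upsert({"order_id": 1001, "amount": 30}))  # insert
    print(upsert({"order_id": 1001, "amount": 35}))  # update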


Source: www.cnblogs.com/jingsupo/p/11601271.html