[C++ Series] 86. Massive data processing

0. Introduction

This article collects a number of common massive-data-processing problems that come up often in interviews. It is recorded here in the hope of helping the reader.

1. Bitmap applications

1. Given 10 billion integers, design an algorithm to find the integers that appear only once.

  • Method one
  • For a massive-data problem, first analyze the data size. To mark every value in the int range, i.e., about 4G distinct values, one bit per value needs only 4G / 8 = 512MB of space.
  • By its nature, a single bitmap cannot tell us which integers appear exactly once. The natural idea is to use two bitmaps:
    • Bitmap 1 records whether a number has appeared: on its first occurrence, the corresponding bit is set.
    • Bitmap 2 records whether a number has appeared again: if its bit in bitmap 1 is already 1 and the number shows up once more, the corresponding bit in bitmap 2 is set.
  • The integers that appear exactly once are those whose bit is 1 in bitmap 1 and 0 in bitmap 2 (see the sketch below). This uses two 512MB bitmaps, i.e., 1GB of space in total.
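A minimal sketch of this dual-bitmap update rule (the helper functions here are illustrative, not the actual bitmap interface from article 84):

#include <stdint.h>

// test / set one bit in a plain bitmap stored as 32-bit words
static int  BitTest(uint32_t *bm, uint32_t x) { return (bm[x >> 5] >> (x & 31)) & 1; }
static void BitSet (uint32_t *bm, uint32_t x) { bm[x >> 5] |= 1u << (x & 31); }

// record one occurrence of x
void Record(uint32_t *bm1, uint32_t *bm2, uint32_t x)
{
    if (!BitTest(bm1, x))
        BitSet(bm1, x);   // first occurrence: mark in bitmap 1
    else
        BitSet(bm2, x);   // repeated occurrence: mark in bitmap 2
}

// x appeared exactly once iff it is in bitmap 1 but not in bitmap 2
int AppearedOnce(uint32_t *bm1, uint32_t *bm2, uint32_t x)
{
    return BitTest(bm1, x) && !BitTest(bm2, x);
}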

  • Method two
  • This problem can also be solved with a single larger bitmap. The 512MB bitmap above can represent every value of type int, but one bit per value only expresses the two states 0 and 1, i.e., "absent" or "present"; it cannot express "appeared more than once". If we instead use two bits per value, we get four states, representing 0, 1, 2, or 3+ occurrences. This is effectively two ordinary bitmaps merged together and costs 1GB of space. A small modification of the ordinary bitmap from [C++ Series] 84. The concept and application of the bitmap solves the problem.

See the code below:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <assert.h>

typedef struct TwoBitSet
{
    // array of 32-bit words, two bits per value
    uint32_t *bts;
    // range of the data
    size_t range;
} TwoBitSet;

void TBSInit(TwoBitSet *tbs, size_t range)
{
    assert(tbs);
    // two bits represent one value, so each 32-bit word holds 16 values;
    // (range >> 4) + 1 is the number of words needed
    tbs->bts = (uint32_t *)malloc(((range >> 4) + 1) * sizeof(uint32_t));
    assert(tbs->bts);
    memset(tbs->bts, 0, ((range >> 4) + 1) * sizeof(uint32_t));
    tbs->range = range;
}

// read the two-bit state of x
int TBSGetValue(TwoBitSet *tbs, size_t x)
{
    assert(tbs);
    // index of the word that contains x
    size_t index = x >> 4;
    // bit offset of x inside the word (two bits per value, so multiply by 2)
    size_t num = x % 16 * 2;
    // work on a copy so tbs->bts[index] is not modified
    uint32_t value = tbs->bts[index];
    // shift x's two bits down to the bottom
    value >>= num;
    // AND with 3 (binary 11) keeps only those two bits
    return (int)(value & 3);
}

// set the two-bit state of x
void TBSSetValue(TwoBitSet *tbs, size_t x, size_t value)
{
    assert(tbs);
    // index of the word that contains x
    size_t index = x >> 4;
    // bit offset of x inside the word (two bits per value, so multiply by 2)
    size_t num = x % 16 * 2;
    if (value == 0)       // set to 00
    {
        // shift 3 up to the target bits and invert, then AND:
        // the two bits become 0, the others stay unchanged
        tbs->bts[index] &= ~(3u << num);
    }
    else if (value == 1)  // set to 01
    {
        tbs->bts[index] |= (1u << num);        // set the low bit
        tbs->bts[index] &= ~(1u << (num + 1)); // clear the high bit
    }
    else if (value == 2)  // set to 10
    {
        tbs->bts[index] |= (1u << (num + 1));  // set the high bit
        tbs->bts[index] &= ~(1u << num);       // clear the low bit
    }
    else if (value == 3)  // set to 11
    {
        tbs->bts[index] |= (3u << num);
    }
}

// destroy
void TBSDestroy(TwoBitSet *tbs)
{
    assert(tbs);
    free(tbs->bts);
    tbs->bts = NULL;
    tbs->range = 0;
}
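A minimal usage sketch for problem 1 (the sample data and the small range are illustrative assumptions; the real problem would stream 10 billion integers and cover the full unsigned int range):

int main(void)
{
    size_t data[] = { 5, 7, 5, 9, 7, 5 };
    size_t n = sizeof(data) / sizeof(data[0]);

    TwoBitSet tbs;
    TBSInit(&tbs, 100);  // small demo range; the real range is 0xFFFFFFFF

    for (size_t i = 0; i < n; ++i)
    {
        int state = TBSGetValue(&tbs, data[i]);
        if (state < 3)  // saturate at 3 ("three or more occurrences")
            TBSSetValue(&tbs, data[i], (size_t)(state + 1));
    }

    // state 1 means "appeared exactly once"
    for (size_t x = 0; x <= tbs.range; ++x)
        if (TBSGetValue(&tbs, x) == 1)
            printf("%zu appears only once\n", x);

    TBSDestroy(&tbs);
    return 0;
}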

2. Given two files, each containing 10 billion integers, and only 1GB of memory, how do we find the intersection of the two files?

  • As in problem 1, use one 512MB bitmap per file: when a number from file A appears, set its bit in bitmap 1 (repeat occurrences change nothing).
  • Likewise, set the bit in bitmap 2 for every number that appears in file B (one 32-bit integer stores 32 values). Then AND bitmap 1 with bitmap 2: the positions where both bits are 1 form the intersection of the two files (see the sketch after this list).

  • If the given memory is even smaller, say only 512MB, then we need to split the files. The specific operation: choose a hash rule and cut each large file into small files, so that equal numbers are hashed into the same pair of small files. Then find the intersection of each pair of small files separately. This problem mainly tests file splitting.
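A minimal sketch of the bitmap-AND step (both bitmaps are assumed to be already filled and stored as 32-bit words, as above):

#include <stdio.h>
#include <stdint.h>

// AND the two filled bitmaps word by word; every bit that remains 1
// corresponds to a number present in both files
void PrintIntersection(const uint32_t *bm1, const uint32_t *bm2, size_t nwords)
{
    for (size_t i = 0; i < nwords; ++i)
    {
        uint32_t both = bm1[i] & bm2[i];
        for (int b = 0; b < 32; ++b)
            if ((both >> b) & 1)
                printf("%zu\n", i * 32 + (size_t)b);
    }
}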

3. A bitmap variation: a file contains 10 billion ints and we have 1GB of memory. Design an algorithm to find all integers that appear no more than twice.

  • This problem uses the same solution as problem 1. Two bits express four states: 00 means appeared 0 times, 01 once, 10 twice, and 11 more than twice. Scanning for the 01 and 10 states solves the problem (see the sketch below).
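With the TwoBitSet above, only the final scan changes (a sketch; tbs is assumed to be already filled by the counting loop shown earlier):

// states 01 and 10 mean "appeared no more than twice"
for (size_t x = 0; x <= tbs.range; ++x)
{
    int state = TBSGetValue(&tbs, x);
    if (state == 1 || state == 2)
        printf("%zu appeared %d time(s)\n", x, state);
}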

2. Bloom filter applications

1. Given two files, each containing 10 billion queries, and only 1GB of memory, how do we find the intersection of the two files? Give both an exact algorithm and an approximate algorithm.

  • Exact algorithm: use the hash-splitting approach from bitmap application problem 2.
  • Approximate algorithm: use a Bloom filter. Put the entire contents of the first file into the Bloom filter, then test whether each item of the second file is present in it. Because a Bloom filter can produce false positives, the intersection obtained this way is approximate (a sketch follows below).
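A minimal Bloom filter sketch (the two string hashes, FNV-1a and DJB2, and the filter size are illustrative choices, not from the original article):

#include <stdint.h>
#include <stdlib.h>

typedef struct BloomFilter
{
    uint32_t *bits;  // bit array
    size_t nbits;    // number of bits
} BloomFilter;

static size_t HashFNV1a(const char *s)
{
    size_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

static size_t HashDJB2(const char *s)
{
    size_t h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

void BFInit(BloomFilter *bf, size_t nbits)
{
    bf->nbits = nbits;
    bf->bits = (uint32_t *)calloc((nbits >> 5) + 1, sizeof(uint32_t));
}

void BFSet(BloomFilter *bf, const char *s)
{
    size_t a = HashFNV1a(s) % bf->nbits;
    size_t b = HashDJB2(s) % bf->nbits;
    bf->bits[a >> 5] |= 1u << (a & 31);
    bf->bits[b >> 5] |= 1u << (b & 31);
}

// returns 1 if s is possibly present (may be a false positive),
// 0 if s is definitely absent
int BFTest(BloomFilter *bf, const char *s)
{
    size_t a = HashFNV1a(s) % bf->nbits;
    size_t b = HashDJB2(s) % bf->nbits;
    return (bf->bits[a >> 5] & (1u << (a & 31))) != 0
        && (bf->bits[b >> 5] & (1u << (b & 31))) != 0;
}

To approximate the intersection: BFSet every query from file A, then BFTest every query from file B; the items that test positive form the approximate intersection.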

2. How to extend the Bloom filter so that it supports deleting elements?

  • The Bloom filter needs a simple transformation: replace each bit with a counter. Inserting data no longer sets a bit but increments the counters; deleting data no longer clears a bit but decrements them; when the counter at a position is 0, the data is not present. But how large should each counter be? That depends on the data size. If we use an int, 32 bits are spent just to indicate presence or absence at each position, and the space advantage of the Bloom filter is completely lost. If we use a char, the counter is very prone to wrapping around, because it can only represent 256 states. A sketch with 8-bit saturating counters follows below.
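A counting Bloom filter sketch (8-bit saturating counters are one illustrative trade-off; it reuses HashFNV1a and HashDJB2 from the previous sketch):

#include <stdint.h>

typedef struct CountingBloom
{
    uint8_t *counts;  // one counter per position instead of one bit
    size_t n;         // number of counters
} CountingBloom;

void CBAdd(CountingBloom *cb, const char *s)
{
    size_t a = HashFNV1a(s) % cb->n;
    size_t b = HashDJB2(s) % cb->n;
    // saturate at 255 instead of wrapping around
    if (cb->counts[a] < UINT8_MAX) cb->counts[a]++;
    if (cb->counts[b] < UINT8_MAX) cb->counts[b]++;
}

void CBRemove(CountingBloom *cb, const char *s)
{
    size_t a = HashFNV1a(s) % cb->n;
    size_t b = HashDJB2(s) % cb->n;
    if (cb->counts[a] > 0) cb->counts[a]--;
    if (cb->counts[b] > 0) cb->counts[b]--;
}

int CBTest(CountingBloom *cb, const char *s)
{
    return cb->counts[HashFNV1a(s) % cb->n] > 0
        && cb->counts[HashDJB2(s) % cb->n] > 0;
}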

3. Hash splitting

1. Given a log file larger than 100GB that records IP addresses, design an algorithm to find the IP address with the most occurrences. Under the same conditions, how do we find the top-K IPs? How can this be done directly with Linux commands?

  • An IP address is the dotted-decimal form of an unsigned int. The most brute-force approach is direct counting: keep one counter for each of the 2^32 possible IP addresses, which takes 4G * 4 bytes = 16GB of space. 16GB is not hard to get on a server, but the efficiency is very low.
  • Splitting by hash: choose a hash function and split the file; equal entries are guaranteed to be hashed into the same small file. An IP address is a 4-byte integer, and its first 2 bytes have 2^16 = 65,536 possibilities, so split the log into 65,536 buckets keyed by the first 2 bytes (see the sketch after this list). Then find the most frequent IP inside each bucket, and take the maximum over all the bucket winners to get first place. The top-K IPs are obtained similarly.

  • On Linux: first use the sort command to sort log_file; with the lines in ascending order, identical IPs become adjacent. Then use uniq for deduplication; here uniq -c merges duplicate adjacent lines and prefixes each with its count. Then use sort -nr, where -n sorts numerically, i.e., by the merged counts, and -r reverses the order, so the counts come out descending. Finally, head -n K takes the first K lines, i.e., the top-K repeated IPs. Combined: sort log_file | uniq -c | sort -nr | head -n K
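A minimal sketch of the splitting step (the text input format and the bucket file naming are illustrative assumptions; opening one file per line is slow but keeps the sketch short):

#include <stdio.h>
#include <inttypes.h>

// split a stream of 32-bit IPs into 65,536 bucket files by the top 16 bits
void SplitByHash(FILE *in)
{
    uint32_t ip;
    char name[32];
    while (fscanf(in, "%" SCNu32, &ip) == 1)
    {
        unsigned bucket = ip >> 16;  // the top 2 bytes pick one of 65,536 buckets
        snprintf(name, sizeof(name), "bucket_%05u.txt", bucket);
        FILE *out = fopen(name, "a");  // append the IP to its bucket file
        if (!out) continue;
        fprintf(out, "%" PRIu32 "\n", ip);
        fclose(out);
    }
}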

4. The inverted index

1. Given thousands of files, each 1KB to 100MB in size, and n words, design an algorithm to find, for every word, all the files that contain it, with only 100KB of memory.

  • A simple way to understand the inverted index: finding which words a given file contains is a forward index, so taking the n words and finding the files that contain each of them is an inverted index. Scan all the files; whenever a word appears in a file, append that file's number to the end of the word's list. Once every file has been scanned, the problem is solved (see the sketch below). As for fitting a 100MB file into 100KB of memory, it is enough to cut the file into small pieces.
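A minimal inverted-index sketch (the word list, file names, and line-by-line scanning are illustrative assumptions; a real solution would stream each large file in pieces that fit in 100KB):

#include <stdio.h>
#include <string.h>

#define MAX_WORDS 16
#define MAX_FILES 64

int main(void)
{
    const char *words[MAX_WORDS] = { "bitmap", "hash" };  // the n words
    const char *files[MAX_FILES] = { "a.txt", "b.txt" };  // the files to scan
    size_t nwords = 2, nfiles = 2;

    // contains[w][f] == 1 iff files[f] contains words[w]
    static char contains[MAX_WORDS][MAX_FILES];

    char line[4096];
    for (size_t f = 0; f < nfiles; ++f)
    {
        FILE *fp = fopen(files[f], "r");
        if (!fp) continue;
        while (fgets(line, sizeof(line), fp))
            for (size_t w = 0; w < nwords; ++w)
                if (strstr(line, words[w]))
                    contains[w][f] = 1;
        fclose(fp);
    }

    // print the inverted index: word -> list of files containing it
    for (size_t w = 0; w < nwords; ++w)
    {
        printf("%s:", words[w]);
        for (size_t f = 0; f < nfiles; ++f)
            if (contains[w][f])
                printf(" %s", files[f]);
        printf("\n");
    }
    return 0;
}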

5. To be added later

Trie (also called a prefix tree or dictionary tree): used for fast retrieval among a large number of strings.

Skip lists and other topics related to massive data.
