Bitmap for massive data processing

 Burning Cup Talking About Big Data 


I. Overview

This article describes the principles of the Bit-Map algorithm and some of its use scenarios, for example finding duplicates in massive data sets and testing whether a given element is present in them. It closes with a look at how usable BitMap is in each of these scenarios.

II. Bit-Map algorithm

Consider the following scenario first: an ordinary PC with 2 GB of memory has to process a file of 4 billion unsigned int values that are distinct and unsorted. Given an integer, can we quickly determine whether it is among the 4 billion values in the file?

Analysis:

4 billion ints occupy (4 billion * 4)/1024/1024/1024, which is about 14.9 GB. That clearly does not fit in 2 GB of memory, so the values cannot all be loaded as plain ints. The fastest lookups still come from keeping the data in memory, so the question becomes how to represent 4 billion integers within 2 GB. An int occupies 4 bytes in Java, that is, 32 bits. If a single bit is used to mark the presence of each integer instead, the storage drops sharply: 4 billion bits take 4 billion/8/1024/1024, which is about 476.84 MB (or 512 MB to cover all 2^32 possible unsigned int values), so the whole data set can be held in memory for processing.
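The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class name MemoryEstimate and its helper methods are made up for illustration and are not part of the original article.

```java
// Illustrative sketch: compare the raw storage cost with the bitmap cost.
// MemoryEstimate and its methods are hypothetical names.
class MemoryEstimate {

    // Bytes needed to store `count` values as 4-byte ints.
    static long rawBytes(long count) {
        return count * Integer.BYTES;
    }

    // Bytes needed for a bitmap with one bit per value in [0, rangeSize).
    static long bitmapBytes(long rangeSize) {
        return (rangeSize + 7) / 8;
    }

    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024 * 1024);
    }

    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024);
    }

    public static void main(String[] args) {
        long count = 4_000_000_000L;
        long fullRange = 1L << 32; // every possible 32-bit unsigned value

        System.out.printf("raw ints:          %.2f GiB%n", toGiB(rawBytes(count)));
        System.out.printf("bitmap, 4e9 bits:  %.2f MiB%n", toMiB(bitmapBytes(count)));
        System.out.printf("bitmap, 2^32 bits: %.2f MiB%n", toMiB(bitmapBytes(fullRange)));
    }
}
```

Because the input values are arbitrary unsigned ints, the bitmap has to cover the whole 32-bit range in practice, which is why the 512 MB figure is the safer budget.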

Approach:

One int occupies 4 bytes, that is, 4*8=32 bits, so an int array declared as int tmp[1+N/32] is enough to store the data, where N is the largest value that has to be stored. Each element of tmp holds 32 bits and can therefore represent 32 consecutive decimal numbers (0~31 for the first element), which gives the following BitMap table:

tmp[0]: can represent 0~31

tmp[1]: can represent 32~63

tmp[2]: can represent 64~95

.......
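The table above follows directly from the array length 1+N/32. A minimal sketch, assuming a hypothetical helper class BitmapLayout (not from the original article), prints the range covered by each element:

```java
// Illustrative sketch: array sizing and the value range of each element.
// BitmapLayout is a hypothetical name.
class BitmapLayout {
    static final int BITS_PER_WORD = 32;

    // Length of the int array needed to cover the values 0..maxValue.
    static int wordCount(long maxValue) {
        return (int) (1 + maxValue / BITS_PER_WORD);
    }

    // Smallest value represented by tmp[wordIndex].
    static long firstValue(int wordIndex) {
        return (long) wordIndex * BITS_PER_WORD;
    }

    // Largest value represented by tmp[wordIndex].
    static long lastValue(int wordIndex) {
        return firstValue(wordIndex) + BITS_PER_WORD - 1;
    }

    public static void main(String[] args) {
        int words = wordCount(95); // enough elements for 0..95
        for (int i = 0; i < words; i++) {
            System.out.println("tmp[" + i + "]: can represent "
                    + firstValue(i) + "~" + lastValue(i));
        }
    }
}
```

For the full unsigned 32-bit range, wordCount(4294967295L) gives 134,217,728 ints, which is the 512 MB array.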

Next, let's look at how a decimal number maps to its bit position.

Assuming that the 4 billion int values are 6, 3, 8, 32, 36, ..., the BitMap looks like this:

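[Figure: the bits for 3, 6 and 8 are set in tmp[0]; the bits for 32 and 36 (bit positions 0 and 4) are set in tmp[1].]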

How do we determine which element of the tmp array a given int belongs to? Divide it by 32 and keep the integer part. For example, 8 divided by 32 rounds down to 0, so 8 lives in tmp[0]. And how do we know which of the 32 bits of tmp[0] stands for 8? Take the value mod 32: 8 mod 32 is 8, so the integer 8 is at bit position 8 of tmp[0] (counting from the right, starting at 0).
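The division and modulo steps above can be sketched as follows. The class SimpleIntBitmap and its method names are hypothetical, and the sketch assumes non-negative int values only:

```java
// Illustrative sketch of locating, setting and testing a bit in an int array.
// SimpleIntBitmap is a hypothetical name; only non-negative values are handled.
class SimpleIntBitmap {
    private final int[] words;

    // Covers the values 0..maxValue, using the 1 + N/32 sizing rule.
    SimpleIntBitmap(int maxValue) {
        words = new int[1 + maxValue / 32];
    }

    // Which array element holds the bit for this value.
    static int wordIndex(int value) {
        return value / 32;
    }

    // Which bit inside that element (0 is the rightmost bit).
    static int bitOffset(int value) {
        return value % 32;
    }

    void set(int value) {
        words[wordIndex(value)] |= 1 << bitOffset(value);
    }

    boolean contains(int value) {
        return (words[wordIndex(value)] & (1 << bitOffset(value))) != 0;
    }

    // Raw element, exposed only so the bit pattern can be inspected.
    int word(int index) {
        return words[index];
    }

    public static void main(String[] args) {
        SimpleIntBitmap bitmap = new SimpleIntBitmap(63);
        for (int value : new int[] {6, 3, 8, 32, 36}) {
            bitmap.set(value);
        }
        System.out.println("tmp[0] = " + Integer.toBinaryString(bitmap.word(0)));
        System.out.println("tmp[1] = " + Integer.toBinaryString(bitmap.word(1)));
        System.out.println("contains 8? " + bitmap.contains(8));
        System.out.println("contains 7? " + bitmap.contains(7));
    }
}
```

Since 32 is a power of two, value / 32 and value % 32 can also be written as value >> 5 and value & 31 for non-negative values.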

Java ships an implementation, java.util.BitSet, which has been part of the standard library for a long time. The following source code shows how it is used:

import java.util.BitSet;

public class BitSetTest {

   public static void main(String[] args) {
       int[] array = new int[] {1, 2, 3, 22, 0, 3, 63};
       // size() reports the number of bits currently allocated
       BitSet bitSet = new BitSet(1);
       System.out.println(bitSet.size());   // 64
       bitSet = new BitSet(65);
       System.out.println(bitSet.size());   // 128
       bitSet = new BitSet(23);
       System.out.println(bitSet.size());   // 64

       // Build the bitmap from the array contents
       for (int i = 0; i < array.length; i++) {
           bitSet.set(array[i], true);
       }

       System.out.println(bitSet.get(22));  // true
       System.out.println(bitSet.get(60));  // false

       System.out.println("Now iterating over the BitSet:");
       for (int i = 0; i < bitSet.size(); i++) {
           System.out.println(bitSet.get(i));
       }
   }

}

Of course, that uses a ready-made class. Writing one yourself is also very simple; here is a basic implementation:

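public class BitMap {
    private final long length;     // number of bits tracked
    private final int[] bitsMap;   // 32 bits per element

    public BitMap(long length) {
        this.length = length;
        // one int per 32 bits, rounded up
        bitsMap = new int[(int) (length >> 5) + ((length & 31) > 0 ? 1 : 0)];
    }

    // Returns 1 if the bit for index is set, otherwise 0 (indices start at 1)
    public int getBit(long index) {
        int intData = bitsMap[(int) ((index - 1) >> 5)];
        int offset = (int) ((index - 1) & 31);
        return intData >> offset & 0x01;
    }

    // Sets the bit for index (indices start at 1)
    public void setBit(long index) {
        int belowIndex = (int) ((index - 1) >> 5);
        int offset = (int) ((index - 1) & 31);
        int inData = bitsMap[belowIndex];
        bitsMap[belowIndex] = inData | (0x01 << offset);
    }

    public static void main(String[] args) {
        // the size and indices used here are illustrative values
        BitMap bitMap = new BitMap(32);
        bitMap.setBit(32);
        System.out.println(bitMap.getBit(1));    // 0
        System.out.println(bitMap.getBit(32));   // 1
    }
}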
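The overview also mentions finding duplicates in massive data. A minimal sketch of that use case with java.util.BitSet is shown below. The class DuplicateFinder and its method are hypothetical names, and the sketch assumes all values are non-negative, since BitSet rejects negative indices:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch: report values that occur more than once.
// DuplicateFinder is a hypothetical name; values must be non-negative.
class DuplicateFinder {

    // Returns each duplicated value once, in the order its second occurrence appears.
    static List<Integer> findDuplicates(int[] values) {
        BitSet seen = new BitSet();      // bit v set: v has appeared at least once
        BitSet reported = new BitSet();  // bit v set: v is already in the result
        List<Integer> duplicates = new ArrayList<>();
        for (int value : values) {
            if (!seen.get(value)) {
                seen.set(value);
            } else if (!reported.get(value)) {
                reported.set(value);
                duplicates.add(value);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        int[] array = {1, 2, 3, 22, 0, 3, 63};
        System.out.println(findDuplicates(array));
    }
}
```

Two bitmaps are used so that a value repeated many times is reported only once. If only a yes/no answer is needed, the single seen bitmap is enough.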



Origin blog.51cto.com/15127544/2665518