BitMap: principle and usage

  

A bitmap (BitMap), also called a bit set, is a data structure for recording a large number of 0/1 states. It is used in many places, such as the Linux kernel (for example, to track inodes and disk blocks) and the Bloom Filter algorithm. Its advantage is that it stores a large number of 0/1 states with very high space efficiency.

 

How BitMap works

  The basic idea of a BitMap is to use a single bit to store the state of one item. It suits cases where the data set is very large but each item has only a few possible states. It is typically used to decide whether a given value is present or not.

  For example, an int in Java takes 4 bytes. What if you need to process 1 billion ints? 1,000,000,000 * 4 / 1024 / 1024 / 1024 ≈ 3.7 GB, so roughly 4 GB of memory is required.

  If we store one bit per value instead, 1,000,000,000 bits = 125,000,000 bytes ≈ 122,070 KB ≈ 119 MB, which is a substantial saving in storage space.
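The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class name MemoryEstimate and its two helper methods are hypothetical names chosen for illustration, and the figures ignore JVM object overhead.

```java
// Hypothetical helper that reproduces the memory estimate from the text.
final class MemoryEstimate {
    // Bytes needed to hold `count` values as 4-byte ints.
    static long intArrayBytes(long count) {
        return count * Integer.BYTES;
    }

    // Bytes needed to hold `count` values as one bit each (rounded up to whole bytes).
    static long bitmapBytes(long count) {
        return (count + 7) / 8;
    }

    public static void main(String[] args) {
        long n = 1_000_000_000L;
        double gib = intArrayBytes(n) / (1024.0 * 1024 * 1024);
        double mib = bitmapBytes(n) / (1024.0 * 1024);
        System.out.println("int[]  needs about " + gib + " GiB");
        System.out.println("bitmap needs about " + mib + " MiB");
    }
}
```

Running it gives roughly 3.7 GiB for the int array and roughly 119 MiB for the bitmap, matching the estimate above.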

  In Java, the BitMap data structure is implemented by the class java.util.BitSet, which internally stores its elements in an array of long.

  Let's look at how values are actually stored:

  Take the four numbers 1, 3, 5, 7. Their presence can be represented like this:

  

  1 means the number is present and 0 means it is absent. For example, 01010101 (read left to right as positions 0 through 7) means that 1, 3, 5, 7 are present and 0, 2, 4, 6 are absent. How would 8, 10, 14 be stored? They go into the second byte, which covers 8 through 15:

    

   And so on.
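To make the byte layout concrete, here is a small sketch that stores 1, 3, 5, 7 and 8, 10, 14 in a java.util.BitSet and prints each byte. The class BitLayoutDemo and its helpers are hypothetical names for illustration. The helper prints position 0 first (leftmost) so the output lines up with the 01010101 notation used above; a normal binary literal would show the same byte the other way round.

```java
import java.util.BitSet;

// Hypothetical demo of how values map to bits inside consecutive bytes.
final class BitLayoutDemo {
    // Set one bit per value and return the backing bytes (byte 0 covers values 0-7).
    static byte[] layout(int... values) {
        BitSet bits = new BitSet();
        for (int v : values) {
            bits.set(v);
        }
        return bits.toByteArray();
    }

    // Render a byte with bit 0 on the left and bit 7 on the right.
    static String toBinary(byte b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            sb.append((b >> i) & 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = layout(1, 3, 5, 7, 8, 10, 14);
        for (int i = 0; i < bytes.length; i++) {
            // byte 0 -> 01010101 (1,3,5,7), byte 1 -> 10100010 (8,10,14)
            System.out.println("byte " + i + ": " + toBinary(bytes[i]));
        }
    }
}
```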

Mapping table

    Suppose the total number of values to sort or search is N = 10,000,000. The memory we need to allocate is then int a[1 + N / 32], where a[0] covers the 32 decimal numbers 0-31, and so on:
    Bitmap table:
    a[0] ---------> 0-31
    a[1] ---------> 32-63
    a[2] ---------> 64-95
    a[3] ---------> 96-127
    ..........
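The mapping table translates directly into index arithmetic: value n lives in a[n / 32], at bit n % 32. The following is a minimal sketch of that idea, assuming non-negative values no larger than the chosen maximum; the class IntArrayBitmap and its method names are hypothetical, not part of any library.

```java
// Hypothetical hand-rolled bitmap backed by int[1 + N / 32], as in the mapping table.
final class IntArrayBitmap {
    private final int[] words;

    IntArrayBitmap(int maxValue) {
        words = new int[1 + maxValue / 32];
    }

    // Index of the array slot that holds value n (n / 32).
    static int wordIndex(int n) {
        return n >> 5;
    }

    // Mark n as present: set bit (n % 32) of its slot.
    void set(int n) {
        words[wordIndex(n)] |= 1 << (n & 31);
    }

    // Mark n as absent: clear bit (n % 32) of its slot.
    void clear(int n) {
        words[wordIndex(n)] &= ~(1 << (n & 31));
    }

    // True if n has been marked as present.
    boolean get(int n) {
        return (words[wordIndex(n)] & (1 << (n & 31))) != 0;
    }

    public static void main(String[] args) {
        IntArrayBitmap bitmap = new IntArrayBitmap(10_000_000);
        bitmap.set(96);
        System.out.println(wordIndex(96));  // 3, i.e. a[3] covers 96-127
        System.out.println(bitmap.get(96)); // true
        System.out.println(bitmap.get(97)); // false
    }
}
```

For N = 10,000,000 the array has 312,501 ints, which is a little over 1.2 MB.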

Scenarios where BitMap handles large-data problems:

 (1) Given 1,000,000,000 distinct positive ints in no particular order, and then one more number, how can you quickly determine whether that number is among the 1,000,000,000?

Solution: traverse the 1,000,000,000 numbers and map each one into the BitMap; then, for the given number, simply check whether its bit is set.
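A minimal sketch of this membership test, using java.util.BitSet and a tiny input in place of the real data set. The class name MembershipIndex and the method contains are hypothetical names for illustration, and the sketch assumes all inputs are non-negative ints.

```java
import java.util.BitSet;

// Hypothetical membership index: one bit per possible value.
final class MembershipIndex {
    private final BitSet present = new BitSet();

    // Single pass over the data: set the bit for every number seen.
    MembershipIndex(int[] numbers) {
        for (int n : numbers) {
            present.set(n);
        }
    }

    // Constant-time lookup: the number is in the data set iff its bit is set.
    boolean contains(int candidate) {
        return candidate >= 0 && present.get(candidate);
    }

    public static void main(String[] args) {
        MembershipIndex index = new MembershipIndex(new int[] {42, 7, 1_000_003, 19});
        System.out.println(index.contains(7));  // true
        System.out.println(index.contains(8));  // false
    }
}
```

Building the index is one pass over the input; after that, each query is a single bit lookup, independent of how many numbers were loaded.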

 (2) Use a bitmap to determine whether an array of positive integers contains duplicates

Solution: traverse the array once, setting each value's bit to 1. Before setting a bit, check whether it is already 1; if it is, the element is a duplicate.

 (3) Use a bitmap to sort an array of positive integers with no repeated elements

Solution: traverse the array once, setting each value's bit to 1; then traverse the bitmap in order and output every position whose bit is 1. This follows the same idea as counting sort.
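The two passes can be sketched as follows with java.util.BitSet. The class BitmapSort and the method sortDistinct are hypothetical names, and the sketch assumes non-negative, non-repeating input values (a repeated value would simply appear once in the output).

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical bitmap sort for distinct non-negative ints.
final class BitmapSort {
    static int[] sortDistinct(int[] values) {
        // Pass 1: set the bit for each value.
        BitSet seen = new BitSet();
        for (int v : values) {
            seen.set(v);
        }
        // Pass 2: walk the set bits in increasing order and emit their positions.
        int[] sorted = new int[seen.cardinality()];
        int k = 0;
        for (int v = seen.nextSetBit(0); v >= 0; v = seen.nextSetBit(v + 1)) {
            sorted[k++] = v;
            if (v == Integer.MAX_VALUE) {
                break; // avoid overflow of v + 1
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        int[] sorted = sortDistinct(new int[] {5, 1, 9, 3, 7});
        System.out.println(Arrays.toString(sorted)); // [1, 3, 5, 7, 9]
    }
}
```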

 (4) Find the integers that are not repeated among 250 million integers. Note that there is not enough memory to hold all 250 million integers.

Solution 1: use a 2-bit BitMap (allocate 2 bits per number: 00 means absent, 01 means seen once, 10 means seen multiple times, and 11 is unused).

Solution 2: use two BitMaps. The first records whether an integer has appeared. During the traversal, first check whether the integer is already in the first BitMap; if it is, set the corresponding bit in the second BitMap to 1. Finally, traverse the BitMaps: the integers that appear only in the first BitMap are the non-repeated ones.

Solution 3: partition with hash modulo. Split the data into multiple small files so that each file fits in memory when read, then use hash + count on each file to find the answer.
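As an illustration of Solution 2, here is a small sketch using two java.util.BitSet instances. The class UniqueFinder and the method uniqueValues are hypothetical names, the input is assumed to be non-negative, and a real run over 250 million integers would stream the data from disk rather than take an in-memory array.

```java
import java.util.BitSet;

// Hypothetical two-bitmap approach: `seen` marks values that appeared at least once,
// `repeated` marks values that appeared more than once.
final class UniqueFinder {
    static BitSet uniqueValues(int[] values) {
        BitSet seen = new BitSet();
        BitSet repeated = new BitSet();
        for (int v : values) {
            if (seen.get(v)) {
                repeated.set(v); // second or later occurrence
            } else {
                seen.set(v);     // first occurrence
            }
        }
        // Values present in `seen` but not in `repeated` occurred exactly once.
        BitSet unique = (BitSet) seen.clone();
        unique.andNot(repeated);
        return unique;
    }

    public static void main(String[] args) {
        BitSet unique = uniqueValues(new int[] {1, 2, 3, 2, 5, 1, 8});
        System.out.println(unique); // {3, 5, 8}
    }
}
```

Solution 1 encodes the same two flags side by side in a single 2-bit-per-value map; the total number of bits is the same in both variants.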

A variant of this kind of problem: a file contains some phone numbers, each 8 digits long, and we need to count how many distinct numbers there are. An 8-digit number is at most 99,999,999, so about 100 million bits are needed, which is a little over 10 MB of memory. (Think of the numbers 0-99,999,999 as each mapping to one bit: roughly 100M bits ≈ 12 MB, so about 12 MB of memory is enough to represent every 8-digit phone number.)
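A sketch of the phone-number variant, assuming each number arrives as an 8-digit string and using an in-memory array instead of a file. The class PhoneCounter and the method countDistinct are hypothetical names for illustration.

```java
import java.util.BitSet;

// Hypothetical counter for distinct 8-digit phone numbers.
final class PhoneCounter {
    // One bit for every value in 0..99,999,999 (about 12 MB).
    private static final int RANGE = 100_000_000;

    static int countDistinct(String[] phoneNumbers) {
        BitSet seen = new BitSet(RANGE);
        for (String number : phoneNumbers) {
            // Leading zeros are fine: "00000001" parses to 1.
            seen.set(Integer.parseInt(number));
        }
        // The number of set bits is the number of distinct phone numbers.
        return seen.cardinality();
    }

    public static void main(String[] args) {
        String[] numbers = {"12345678", "87654321", "12345678", "00000001"};
        System.out.println(countDistinct(numbers)); // 3
    }
}
```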

Some disadvantages of BitMap:

(1) Data collisions. For example, mapping strings to a BitMap can produce collisions. A Bloom Filter can be considered here: it uses multiple hash functions to reduce the probability of collisions.

(2) Sparse data. For example, to store just the three values (10, 8887983, 93452134) we need to build a BitMap of length 99,999,999 even though it holds only three values, which wastes a great deal of space. This problem can be addressed by introducing Roaring BitMap.
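To illustrate the Bloom Filter idea mentioned under disadvantage (1), here is a deliberately simple sketch. Everything in it is an assumption made for illustration: the class SimpleBloomFilter, the choice of String.hashCode plus an FNV-1a style hash as the two base hashes, and the way the two are combined to derive several bit positions. Production implementations choose their sizes and hash functions far more carefully.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter over strings.
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions per item

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a style hash over the string's chars, used as a second base hash.
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    // i-th bit position, derived from the two base hashes.
    private int position(String item, int i) {
        return Math.floorMod(item.hashCode() + i * fnv1a(item), size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(item, i));
        }
    }

    // false means definitely absent; true means possibly present.
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        filter.add("alice");
        filter.add("bob");
        System.out.println(filter.mightContain("alice")); // true: added items are always reported
        System.out.println(filter.mightContain("carol")); // usually false; a false positive is possible
    }
}
```

Using several positions per item lowers the chance that two different strings collide on all of them, at the cost of setting more bits per insertion.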

Example:

   Finding the repeated integers in an array of positive integers

  

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class TestBitMap {
    // Assume the data is handed to us as an array of non-negative ints
    public static Set<Integer> test(int[] arr) {
        // A Set keeps each repeated number from being returned more than once
        Set<Integer> output = new HashSet<>();
        // Sized for every non-negative int; this allocates about 256 MB of heap
        BitSet bitSet = new BitSet(Integer.MAX_VALUE);
        int i = 0;
        while (i < arr.length) {
            int value = arr[i];
            // Check whether this number is already in the BitSet
            if (bitSet.get(value)) {
                output.add(value);
            } else {
                bitSet.set(value, true);
            }
            i++;
        }
        return output;
    }

    // Test
    public static void main(String[] args) {
        int[] t = {1,2,3,4,5,6,7,8,3,4,9};
        Set<Integer> t2 = test(t);
        System.out.println(t2);
    }
}

 

Summary

     This article described the basic principle of the BitMap algorithm and some application cases. In essence, it uses a single bit to represent the state of an element, which can save a great deal of storage space in the right scenarios. It is well suited to problems on massive data sets such as lookup, duplicate detection, and deletion.

Other references:

https://www.cnblogs.com/hongdada/p/8267032.html

https://www.cnblogs.com/gczr/p/7358813.html



 


Origin www.cnblogs.com/dragonsuc/p/10993938.html