Data Structures and Algorithms: Hash & BitMap

One: Introduction

        1. What problems does hash table expansion (resizing) have in a multi-threaded environment?

        2. How do you determine whether a given number exists among 300 million integers (range 0~200 million)? The memory limit is 500M, on a single machine.

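        Divide and conquer

        Bloom Filter: a powerful tool for this kind of problem

        Redis Hash: allocate 300 million entries

        HashMap: put(key, value), e.g. put(1, true)

        Array: as in the age problem, allocate data[200 million] indexed from 0 and let data[1] = 1 mean that 1 exists. Is this feasible? Not with an int array: 200 million * 4 bytes is about 800 MB, which exceeds the 500M limit.

        Bit: bitMap (bitmap). The minimum unit is the bit: 1 Byte = 8 bits. How many bits is an int? int = 4 bytes = 4 * 8 bits = 32 bits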
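The memory arithmetic behind these options can be checked with a short sketch. The class and method names are illustrative and not part of the original lesson, and only the raw storage is counted (JVM object overhead is ignored).

```java
// Hypothetical helper for comparing raw storage cost; names are illustrative.
class MemoryEstimate {
    // Bytes needed for an int[] with one element per possible value.
    static long intArrayBytes(long values) {
        return values * Integer.BYTES;   // 4 bytes per int
    }

    // Bytes needed when each possible value gets a single bit.
    static long bitmapBytes(long values) {
        return values / 8 + 1;           // 8 flags per byte, plus one byte for the remainder
    }

    public static void main(String[] args) {
        long range = 200_000_000L;       // values 0 ~ 200 million, as in the question
        System.out.println("int[]  : " + intArrayBytes(range) / 1_000_000 + " MB");
        System.out.println("bitmap : " + bitmapBytes(range) / 1_000_000 + " MB");
    }
}
```

Under these assumptions the int array needs about 800 MB, while one bit per value needs about 25 MB, which fits comfortably inside the 500M limit.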

        

1.1 Problems with Hash expansion algorithm in multi-threading:

        1. Concurrent put operations can cause get to loop forever. This can be mitigated: for example, when expanding the capacity, build a new array instead of modifying the shared one.

        2. Concurrent put operations can cause get to return a wrong value.

        Why does an infinite loop occur?

        When we talked about hash conflicts in the last lesson, we used a chain structure (a linked list) to store the conflicting values. Traversing such a list looks like 1->2->3->null: after reaching 3, the next pointer should be null and the traversal ends. Suppose that at exactly this moment another thread, in the middle of an expansion, rewrites the pointers so that the list becomes 1->3->1. Node 3 was supposed to point to null and end the list, but now it points to 1, and 1 in turn points to 3. The traversal never reaches null, so it keeps looping.
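The corrupted chain can be illustrated with a small sketch. This is a single-threaded simulation of the end state only; it does not reproduce the race itself, and the Node class and method names are assumptions made for the example.

```java
// Single-threaded simulation of the corrupted chain; it does not reproduce the race itself.
class ChainLoopDemo {
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    // Walks the chain and reports whether null is reached within maxSteps hops.
    static boolean reachesNull(Node head, int maxSteps) {
        Node cur = head;
        for (int i = 0; i < maxSteps; i++) {
            if (cur == null) return true;
            cur = cur.next;
        }
        return cur == null;
    }

    // Healthy chain: 1 -> 2 -> 3 -> null
    static Node healthyChain() {
        Node n1 = new Node(1), n2 = new Node(2), n3 = new Node(3);
        n1.next = n2;
        n2.next = n3;
        return n1;
    }

    // Corrupted chain from the text: 1 -> 3 -> 1 -> 3 -> ...
    static Node corruptedChain() {
        Node n1 = new Node(1), n3 = new Node(3);
        n1.next = n3;
        n3.next = n1;
        return n1;
    }

    public static void main(String[] args) {
        System.out.println(reachesNull(healthyChain(), 100));   // true
        System.out.println(reachesNull(corruptedChain(), 100)); // false
    }
}
```

A real get has no step limit, so on the corrupted chain it would spin until the process is stopped.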

1.2 Since Hash expansion is thread-unsafe, how should we use it?

        1. Systems that use hashing should run single-threaded where possible.

        2. In a multi-threaded environment, pay attention to locking. For example, you can use ConcurrentHashMap from the JDK.

        Jdk1.7: ConcurrentHashMap uses a segmented locking algorithm. I won't analyze it in depth here; there is plenty of material online, and it is must-know content for Java. Students who are not studying Java don't need to read it. Just remember not to use hash structures carelessly in the multi-threaded situations described above.
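As a minimal sketch of the second recommendation, the example below lets several threads put into a java.util.concurrent.ConcurrentHashMap at the same time. The thread count, key ranges, and class name are arbitrary choices for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ConcurrentPutDemo {
    // Starts `threads` writers that each insert `perThread` distinct keys, then returns the map size.
    static int concurrentPuts(int threads, int perThread) {
        Map<Integer, Boolean> map = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;          // disjoint key range per thread
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.put(base + i, Boolean.TRUE);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();                            // wait for all writers to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(concurrentPuts(4, 10_000)); // 40000
    }
}
```

With a plain HashMap in place of ConcurrentHashMap, the same code gives no such guarantee: entries may be lost or the structure corrupted during expansion.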

Two: BitMap

        2.1 Prerequisite knowledge for learning bitmap

        Type basics: The smallest unit of memory in a computer is the bit, which can only represent 0 or 1

        1 Byte = 8 bits; int = 4 bytes = 32 bits; float = 4 bytes = 32 bits; long = 8 bytes = 64 bits; char = 2 bytes = 16 bits. Given int a = 1, how is this 1 stored in the computer?

        0000 0000 0000 0000 0000 0000 0000 0001 

        Operator basics:

        Left shift <<: shifting left by n multiplies by 2^n, e.g. 2 << 1 = 2 * 2, 2 << 2 = 2 * 4, 8 << 2 => 8 * 4 = 32

        8:     0000 0000 0000 0000 0000 0000 0000 1000

        <<2: 0000 0000 0000 0000 0000 0000 0010 0000    => 2^5= 2*2*2*2*2=32

        Right shift >>: 8 >> 2 => 8 / 4 = 2

        8:     0000 0000 0000 0000 0000 0000 0000 1000

        >>2: 0000 0000 0000 0000 0000 0000 0000 0010      => 2^1 = 2

        8 / 4 => 8 >> 2

        8*4 => 8 << 2

        Bitwise AND &: If both bits in the same position are 1, the result bit is 1, otherwise it is 0

        Bitwise OR |: If at least one of the two bits in the same position is 1, the result bit is 1, otherwise it is 0
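The shifts and bitwise operators above can be tried out directly. The following is a small sketch; toBits32 is a helper written for this example that prints an int in the same grouped 32-bit layout used in the text.

```java
class BitOpsDemo {
    // Formats x as 32 bits in groups of 4, matching the layout used above.
    static String toBits32(int x) {
        StringBuilder sb = new StringBuilder();
        for (int i = 31; i >= 0; i--) {
            sb.append((x >>> i) & 1);                 // take bit i, highest first
            if (i % 4 == 0 && i != 0) sb.append(' '); // space between groups of 4
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits32(8));        // 8
        System.out.println(toBits32(8 << 2));   // 8 * 4 = 32
        System.out.println(toBits32(8 >> 2));   // 8 / 4 = 2
        System.out.println(toBits32(12 & 10));  // 1100 & 1010 = 1000
        System.out.println(toBits32(12 | 10));  // 1100 | 1010 = 1110
    }
}
```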

        

        2.2 BitMap

        From the above, we know that an int occupies 32 bits. If we use each of these 32 bits to represent one number, a single int can represent 32 numbers. In other words, 32 numbers need only the space of one int, which immediately reduces the space by a factor of 32.

        For example, suppose we have the set N = {2, 3, 64} and the largest possible value is MAX. Then we only need to allocate an int array of length MAX/32 + 1 to store the data.

        Int a : 0000 0000 0000 0000 0000 0000 0000 0000

        Here are 32 positions. We use 0 or 1 in each position to indicate whether the number mapped to that position exists, which gives the following storage structure:

data[0]: 0~31 (32 bits)

data[1]: 32~63 (32 bits)

data[2]: 64~95 (32 bits)

...

data[MAX/32]: the last element (the array has MAX/32 + 1 elements in total)

Suppose we want to determine whether 65 is in the list. We can calculate it like this: 65 / 32 = 2, so locate data[2]; 65 % 32 = 1, so look at bit 1 of data[2], which is the 2nd position counting from the right. If that bit is 1, then 65 exists in the list; otherwise it does not.

This is the core idea of the bitMap algorithm.
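The int-based layout described above can be sketched as follows. This is an illustration under the stated assumptions (non-negative inputs no larger than max); IntBitMap and its method names are hypothetical and differ from the byte-based class in this lesson.

```java
// Sketch of the int[]-based layout described above; IntBitMap is a hypothetical name.
class IntBitMap {
    private final int[] data;

    IntBitMap(int max) {
        data = new int[max / 32 + 1];                 // one int covers 32 numbers
    }

    static int wordIndex(int n) { return n >> 5; }    // n / 32: which int holds n
    static int bitOffset(int n) { return n & 31; }    // n % 32 for non-negative n: which bit

    void add(int n) {
        data[wordIndex(n)] |= 1 << bitOffset(n);      // set the bit for n
    }

    boolean contains(int n) {
        return (data[wordIndex(n)] & (1 << bitOffset(n))) != 0;
    }

    public static void main(String[] args) {
        IntBitMap map = new IntBitMap(100);
        map.add(2);
        map.add(3);
        map.add(64);
        System.out.println(wordIndex(65) + ", " + bitOffset(65)); // 2, 1
        System.out.println(map.contains(64));                     // true
        System.out.println(map.contains(65));                     // false
    }
}
```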

package tree.bitmap;

public class BitMap {

	byte[] bits;		// each byte stores the flags for 8 numbers
	int max;			// the largest number the bitmap can hold

	public BitMap(int max) {
		this.max = max;
		bits = new byte[(max >> 3) + 1];		// max / 8 + 1 bytes
	}	
	public void add(int n) {		// add a number to the bitmap

		int bitsIndex = n >> 3;		// n / 8: index of the byte that holds n
		int loc = n % 8;		// bit position inside that byte; n & 7 would also work
		// set bit number loc of bits[bitsIndex] to 1
		bits[bitsIndex] |= 1 << loc;
	}
	public boolean find(int n) {
		int bitsIndex = n >> 3;		// n / 8: index of the byte that holds n
		int loc = n % 8;		// bit position inside that byte; n & 7 would also work

		int flag = bits[bitsIndex] & (1 << loc);	// nonzero only if that bit was set to 1
		return flag != 0;
	}
	public static void main(String[] args) {
		BitMap bitMap = new BitMap(200000001);	// covers 0 ~ 200 million
		bitMap.add(2);
		bitMap.add(3);
		bitMap.add(65);
		bitMap.add(66);
		
		System.out.println(bitMap.find(3));
		System.out.println(bitMap.find(64));
	}
	
}

Three: Application

        3.1 Advantages:

                1. Membership testing: quickly determine whether a number exists.

                2. Sorting data that has no duplicates. Typical uses include email filtering, crawler deduplication, HBase, etc. Note that a bitmap cannot record duplicates, and it cannot resolve hash conflicts either. Also, if we only have 10 numbers (in the range 0~1 billion), a bitmap still has to allocate 1 billion/32 ints; in that case a HashMap or an array of 10 elements is the better choice. We will solve the duplicate problem in next Tuesday's lesson.

                3. Many other applications can be built on 1 and 2, such as finding unique numbers and collecting statistics.
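Sorting without duplicates follows directly from the bitmap layout: mark every value, then read the set bits in ascending order. The sketch below uses the JDK's java.util.BitSet rather than the BitMap class from this lesson, and assumes the inputs are non-negative and well below Integer.MAX_VALUE.

```java
import java.util.BitSet;

class BitmapSortDemo {
    // Sorts non-negative ints by marking them in a bitmap and reading the set bits in order.
    // Duplicate inputs collapse into a single output value.
    static int[] sortDistinct(int[] values) {
        BitSet bits = new BitSet();
        for (int v : values) {
            bits.set(v);                              // mark v as present
        }
        int[] sorted = new int[bits.cardinality()];   // number of distinct values
        int k = 0;
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            sorted[k++] = i;                          // set bits come back in ascending order
        }
        return sorted;
    }

    public static void main(String[] args) {
        int[] result = sortDistinct(new int[] {66, 3, 65, 2});
        for (int v : result) {
            System.out.println(v);
        }
    }
}
```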

        3.2 Disadvantages:

                1. Duplicates cannot be represented: each bit is only 0 or 1, so the bitmap records whether a number is present, not how many times it occurs.

                2. When the amount of data is small, it has no advantage over ordinary hashing.

                3. Strings cannot be handled directly: mapping a string to a bit position requires hashing, and hash conflicts make the result unreliable.
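The string limitation can be seen in a small sketch. It maps a string to a bit through String.hashCode(), which is one possible mapping chosen for illustration; the class name and the bitmap size are assumptions. The strings "Aa" and "BB" have the same hashCode in Java, so they land on the same bit.

```java
import java.util.BitSet;

class StringBitmapDemo {
    private static final int SIZE = 1 << 16;          // number of bits; arbitrary for this sketch
    private final BitSet bits = new BitSet(SIZE);

    // Maps a string to a bit position via its hash code.
    private static int slot(String s) {
        return Math.floorMod(s.hashCode(), SIZE);
    }

    void add(String s) {
        bits.set(slot(s));
    }

    // May return true for a string that was never added, if it collides with one that was.
    boolean mightContain(String s) {
        return bits.get(slot(s));
    }

    public static void main(String[] args) {
        StringBitmapDemo demo = new StringBitmapDemo();
        demo.add("Aa");
        System.out.println(demo.mightContain("Aa")); // true
        System.out.println(demo.mightContain("BB")); // true, although "BB" was never added
    }
}
```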

        


Origin blog.csdn.net/qq_67801847/article/details/132982638