Data Structures and Algorithms: Hash Algorithm

One: Introduction 

        1. You are given N natural numbers (1 < N < 10), each in the range 1 to 100. Determine as quickly as possible whether a given number is among these N numbers, without using any built-in collection class. How would you implement it?

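N: 5

Numbers: 10 50 60 1 5. Is 7 present? The candidate approaches are sorting (not necessary), traversal (enumeration), and array subscripts. The subscript approach is the same trick used for age data, where the age itself serves as the subscript.

a[] = new int[101] => a[10] = 1, a[50] = 1, a[60] = 1, a[1] = 1, a[5] = 1. Every other entry, including a[7], is never marked. Do we still need to traverse? No: we check a[7] directly, so the time complexity of the search is O(1). The drawback is space: if the value range is too large, the array cannot be stored at all, and even here 5 numbers occupy 101 slots, which wastes more than 90% of the space.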
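The subscript idea above can be sketched in Java as follows. This is a minimal illustration, not the only way to write it; the class and method names (DirectAddressLookup, buildTable, contains) are made up for this example, and the upper bound 100 comes from the problem statement.

```java
// Sketch of direct-address lookup; class and method names are illustrative.
class DirectAddressLookup {
    static final int MAX_VALUE = 100; // upper bound taken from the problem statement

    // Use each number as a subscript and mark that slot with 1.
    static int[] buildTable(int[] numbers) {
        int[] a = new int[MAX_VALUE + 1]; // subscripts 0..100, all 0 by default
        for (int n : numbers) {
            a[n] = 1;
        }
        return a;
    }

    // One array access answers the question; no traversal is needed.
    static boolean contains(int[] a, int target) {
        return target >= 1 && target <= MAX_VALUE && a[target] == 1;
    }

    public static void main(String[] args) {
        int[] a = buildTable(new int[] {10, 50, 60, 1, 5});
        System.out.println(contains(a, 7));  // false
        System.out.println(contains(a, 50)); // true
    }
}
```

The array has 101 entries regardless of how many numbers are stored, which is exactly the space cost discussed above.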

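        2. You are given N natural numbers (1 < N < 10), each in the range 1 to 10000000000. Determine as quickly as possible whether a given number is among these N numbers, again without using any built-in collection class. How would you implement it? An array indexed directly by value is no longer feasible. Can we use A[] = new int[N+1] instead? For example, N: 5, with the numbers 11 52 63 4 5, 999999999. The idea is to take a remainder to obtain the subscript and then determine whether 7 is present.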

Two: Hash table

        2.1 A hash table (Hash Table in English) is something you have probably heard of often. In fact, the example above was solved with the idea behind a hash table. A hash table is backed by an array and relies on the array's support for random access by subscript, so a hash table is really an extension of the array: without arrays there would be no hash tables. In the example, each natural number maps one-to-one to an array subscript, so a lookup is a single subscript access. The time complexity of the search is O(1), which lets you quickly determine whether an element exists in the sequence.

        2.2 Hash function 

        What is a hash function? In the example above we take the key modulo N, and that operation is a hash function. It is the Hash(key) you often see: the function used to compute a hash value. Now let's look at another case of the example above:

N: 10

Numbers: 11 52 62 63 4 5, 999999999. We take the remainder modulo N: x % N is one kind of hash function.

11 % 10 = 1 => a[1] = 11

52 % 10 = 2 => a[2] = 52

62 % 10 = 2, but a[2] is already occupied by 52. This is called a hash conflict (collision).

To look up 52, compute 52 % 10 => 2 and check whether a[2] == 52. If it is equal, the number exists; if it is not equal, it does not exist. For 62, 62 % 10 => 2, but a[2] = 52 is not equal to 62, so without conflict resolution 62 cannot be found.
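The walk-through above can be reproduced with a small Java sketch. It deliberately does not resolve conflicts, so that the problem with 62 stays visible. The names ModuloHashDemo, put and contains are illustrative, and using 0 to mean "empty slot" is an assumption that works only because the keys are natural numbers starting at 1.

```java
// Sketch of x % N as a hash function, without any conflict resolution.
class ModuloHashDemo {
    static final int N = 10; // table size used in the example

    static int hash(long key) {
        return (int) (key % N);
    }

    // Store the key in its home slot; return false if another key already lives there.
    static boolean put(long[] a, long key) {
        int i = hash(key);
        if (a[i] != 0 && a[i] != key) {
            return false; // hash conflict: slot taken by a different key
        }
        a[i] = key;
        return true;
    }

    // A key exists only if its home slot holds exactly that key.
    static boolean contains(long[] a, long key) {
        return a[hash(key)] == key;
    }

    public static void main(String[] args) {
        long[] a = new long[N]; // 0 marks an empty slot
        long[] keys = {11, 52, 62, 63, 4, 5, 999999999L};
        for (long k : keys) {
            if (!put(a, k)) {
                System.out.println(k + " conflicts with " + a[hash(k)] + " in slot " + hash(k));
            }
        }
        System.out.println(contains(a, 52)); // true
        System.out.println(contains(a, 62)); // false: 62 was never stored
    }
}
```

Only seven keys fit into ten slots here, yet one of them is already lost, which is why a conflict resolution strategy is needed.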

        2.3 How to resolve hash conflicts

                2.3.1 Open addressing (as shown below)

                Open addressing: the core idea is that when a hash conflict occurs, we probe for another free slot and insert the element there.

                How to insert: when we insert data into the hash table and the slot chosen by the hash function is already occupied, we start from that slot and check the following slots one by one until a free one is found. In the illustration, green marks slots that already hold data.

                How to search: for example, to search for 20, first compute slot 0 with the hash function, then compare the stored values one by one starting from that slot. If a matching value is found, the element exists; if an empty slot is reached first, the element is not in the table.

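                Shortcomings:

                         1. Deletion requires special handling: simply emptying a slot would cut off the search for elements that were placed after it, so a deleted slot is usually marked as deleted rather than cleared.

                          2. If too much data is inserted, conflicts become frequent and a search may degenerate into a traversal of the whole table.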
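The insert and search steps of open addressing can be sketched as follows. This is one simple variant (probing the next slot each time, often called linear probing), written under several assumptions: keys are positive numbers, 0 marks an empty slot, the table never grows, and deletion is left out because, as noted above, it needs special treatment. The class name LinearProbingSet and its methods are illustrative.

```java
// Sketch of open addressing with linear probing; no deletion, no resizing.
class LinearProbingSet {
    private static final long EMPTY = 0; // keys are >= 1, so 0 can mark a free slot
    private final long[] slots;

    LinearProbingSet(int capacity) {
        slots = new long[capacity];
    }

    private int hash(long key) {
        return (int) (key % slots.length);
    }

    // Start at the home slot and move forward until the key or a free slot is found.
    boolean add(long key) {
        int i = hash(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == key) {
                return true; // already stored
            }
            if (slots[i] == EMPTY) {
                slots[i] = key;
                return true;
            }
            i = (i + 1) % slots.length; // wrap around at the end of the array
        }
        return false; // every slot is occupied
    }

    // Return the slot holding the key, or -1 if an empty slot is reached first.
    int indexOf(long key) {
        int i = hash(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == EMPTY) {
                return -1;
            }
            if (slots[i] == key) {
                return i;
            }
            i = (i + 1) % slots.length;
        }
        return -1; // scanned the whole table: the degenerate case
    }

    boolean contains(long key) {
        return indexOf(key) >= 0;
    }

    public static void main(String[] args) {
        LinearProbingSet set = new LinearProbingSet(10);
        set.add(11);
        set.add(52);
        set.add(62); // home slot 2 is taken by 52, so 62 lands in slot 3
        System.out.println(set.indexOf(62));  // 3
        System.out.println(set.contains(20)); // false: slot 0 is empty
    }
}
```

Note how 62 now occupies slot 3, the home slot of 63. A later insert of 63 is pushed to slot 4, which illustrates how conflicts pile up as the table fills.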

               2.3.2 Chaining (linked list method)

                This method uses linked lists. Chaining is the more commonly used way to resolve hash conflicts, and it is much simpler than open addressing. In the hash table, each slot holds a linked list, and all elements with the same hash value are placed in the list of the same slot. In the example above, 52 and 62 would both be stored in the list at slot 2.

Two illustrations of hash conflict resolution
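A minimal Java sketch of the linked list method is shown here. It uses a hand-written Node class instead of a library list, keeps x % N as the hash function, and assumes non-negative keys; the names ChainedHashSet and chainLength are illustrative.

```java
// Sketch of chaining: each array slot holds the head of a singly linked list.
class ChainedHashSet {
    private static final class Node {
        final long key;
        Node next;

        Node(long key, Node next) {
            this.key = key;
            this.next = next;
        }
    }

    private final Node[] buckets;

    ChainedHashSet(int capacity) {
        buckets = new Node[capacity];
    }

    private int hash(long key) {
        return (int) (key % buckets.length);
    }

    // Walk only the list of the key's own slot.
    boolean contains(long key) {
        for (Node n = buckets[hash(key)]; n != null; n = n.next) {
            if (n.key == key) {
                return true;
            }
        }
        return false;
    }

    // Conflicting keys are simply prepended to the same list.
    void add(long key) {
        if (!contains(key)) {
            int i = hash(key);
            buckets[i] = new Node(key, buckets[i]);
        }
    }

    // Deletion is an ordinary linked list unlink; no special marker is needed.
    boolean remove(long key) {
        int i = hash(key);
        Node prev = null;
        for (Node n = buckets[i]; n != null; prev = n, n = n.next) {
            if (n.key == key) {
                if (prev == null) {
                    buckets[i] = n.next;
                } else {
                    prev.next = n.next;
                }
                return true;
            }
        }
        return false;
    }

    int chainLength(int slot) {
        int length = 0;
        for (Node n = buckets[slot]; n != null; n = n.next) {
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        ChainedHashSet set = new ChainedHashSet(10);
        set.add(11);
        set.add(52);
        set.add(62);
        System.out.println(set.chainLength(2)); // 2: both 52 and 62 live in slot 2
        System.out.println(set.contains(62));   // true
    }
}
```

Compared with open addressing, a conflict never disturbs other slots, but a long chain makes the lookup in that slot linear in the chain length.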

Three: hash application

        3.1 HashMap

        Because a plain linked list has shortcomings (a long chain is slow to search), the JDK's HashMap optimizes it by introducing a more efficient data structure: the red-black tree.

        1. Initial size: the default initial size of HashMap is 16, and this value can be set. If you know the approximate amount of data in advance, you can choose a larger initial size to reduce the number of dynamic expansions, which improves the performance of HashMap.

        2. Dynamic expansion: the default load factor is 0.75. When the number of elements in the HashMap exceeds 0.75 * capacity (where capacity is the number of slots in the hash table), an expansion is triggered. Each expansion doubles the capacity.

        3. Hash conflict resolution: JDK 1.7 uses the linked list method. JDK 1.8 further optimizes HashMap by introducing red-black trees. When a linked list becomes too long (by default, longer than 8 nodes, and only once the table has at least 64 slots), it is converted into a red-black tree, whose fast insertion, deletion, update and lookup improve the performance of HashMap. When a red-black tree shrinks to 6 nodes or fewer, it is converted back into a linked list, because with little data the cost of keeping the tree balanced outweighs its advantage over a linked list.
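To make the effect of the initial size concrete, the sketch below counts how many doublings are needed before a given number of entries fits under capacity * load factor. The helper resizesNeeded is hypothetical and only models the rule described above; it is not part of HashMap. The constructor HashMap(int initialCapacity, float loadFactor) is the real JDK API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: model the "double when size exceeds capacity * loadFactor" rule.
class HashMapSizing {
    // Hypothetical helper: number of doublings until `entries` fits without expansion.
    static int resizesNeeded(int initialCapacity, double loadFactor, int entries) {
        int capacity = initialCapacity;
        int resizes = 0;
        while (entries > capacity * loadFactor) {
            capacity *= 2; // each expansion doubles the capacity
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        System.out.println(resizesNeeded(16, 0.75, 10_000));     // 10
        System.out.println(resizesNeeded(16_384, 0.75, 10_000)); // 0

        // Pre-sizing a real HashMap for about 10,000 entries.
        Map<Integer, Integer> map = new HashMap<>(16_384, 0.75f);
        for (int i = 0; i < 10_000; i++) {
            map.put(i, i);
        }
        System.out.println(map.size()); // 10000
    }
}
```

Starting from the default 16, ten expansions (16 to 16,384) would be needed to hold 10,000 entries, and each one rehashes the existing entries. Pre-sizing avoids that work.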

        3.2 How to design Hash

If an interviewer asks how to design an efficient, enterprise-grade hash table, how should you answer? You can borrow the design ideas of HashMap:

        1. It must be efficient: insertion, deletion and search must all be fast.

        2. Memory: do not take up too much memory. For very large data sets (for example, a HashMap with 1 billion entries), consider other structures such as a B+ tree stored on the hard disk, as MySQL does.

        3. Hash function: choose it based on the actual situation, for example the modulo operation (%).

        4. Capacity expansion: estimate the size of the data in advance. The default capacity of HashMap is 16, but if I know I want to store 10,000 numbers, I can choose an initial capacity of the form 2^n > 10,000 (or 2^n - 1) to reduce expansions.

        5. How to resolve hash conflicts: linked lists attached to the array.
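The capacity estimate in point 4 can be sketched as below. The helpers nextPowerOfTwo, capacityFor and indexFor are hypothetical names for this example. Rounding up to a power of two and using a bit mask for the slot index is one common design choice (HashMap works this way), shown here as an assumption rather than a requirement.

```java
// Sketch: choose a power-of-two capacity for an expected number of entries.
class TableSizeEstimate {
    // Smallest power of two that is >= n (valid for 1 <= n <= 2^30).
    static int nextPowerOfTwo(int n) {
        if (n < 1 || n > (1 << 30)) {
            throw new IllegalArgumentException("n out of range: " + n);
        }
        int capacity = 1;
        while (capacity < n) {
            capacity <<= 1;
        }
        return capacity;
    }

    // Capacity large enough that expectedEntries stays within capacity * loadFactor.
    static int capacityFor(int expectedEntries, double loadFactor) {
        int minSlots = (int) Math.ceil(expectedEntries / loadFactor);
        return nextPowerOfTwo(minSlots);
    }

    // With a power-of-two capacity, hash % capacity equals hash & (capacity - 1)
    // for non-negative hash values.
    static int indexFor(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        System.out.println(nextPowerOfTwo(10_000));    // 16384
        System.out.println(capacityFor(10_000, 0.75)); // 16384
        System.out.println(indexFor(62, 16));          // 14, same as 62 % 16
    }
}
```

For 10,000 entries both estimates give 16,384 slots, so the table would not need to expand while the data is loaded.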

        3.3 Application

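                1. Encryption (strictly speaking, one-way hashing): the MD5 hash algorithm. An MD5 digest is a 128-bit binary string, so it can take at most 2^128 distinct values, which means conflicts between different inputs still exist. MD5 is irreversible, but an attacker can build a library of precomputed hashes and crack a common password such as Md5(88888888) by exhaustive lookup. Hashing twice with a salt, in the form md5(md5(),"1231"), makes such lookups harder.

                2. How to determine whether a video is a duplicate? Compute Md5() over the file content and compare the 128-bit digests.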

                3. Similarity detection: paper detection with a fingerprint algorithm. A fingerprint is calculated for each paper, and fingerprints are compared by Hamming distance.

                4. Load balancing: nginx with 2 servers; you can calculate a hash based on the IP and then do a modulo 2 operation to choose the server.

                5. Distributed systems: the data sharding problem. A set of search terms with 1 billion records cannot be stored on a single machine, so it is divided into 10 files. With Hash(key) % 10 we know which file a given key is in. The same idea extends to splitting a database table into 10 sub-tables: id % 10 gives the table number.

                6. In distributed storage, a problem arises when a table is added. What should happen if there were originally 10 tables and now there are 11? Keys have to be recalculated and reallocated, and queries must use the new mapping.

                7. If the amount of data is huge, isn't that too much data to migrate?

                8. Search algorithms: HashMap lookup. How would you design your own hash-based search? The goals are speed, few hash conflicts and a small data footprint; the design choices are the initial size and the expansion strategy.
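The sharding rule from points 5 to 7 (Hash(key) % 10, then growing to 11) can be quantified with a short sketch. The class ShardingDemo and its methods are illustrative; Math.floorMod and String.hashCode are standard JDK calls, used here so that a negative hash code still yields a valid shard number.

```java
// Sketch: assign keys to shards with a modulo, then measure the cost of adding a shard.
class ShardingDemo {
    // Numeric ids: id % shards, as in the sub-table example.
    static int shardForId(long id, int shards) {
        return (int) (id % shards);
    }

    // String keys: floorMod keeps the result in 0..shards-1 even for negative hash codes.
    static int shardForKey(String key, int shards) {
        return Math.floorMod(key.hashCode(), shards);
    }

    // Count ids in 1..maxId whose shard changes when the shard count changes.
    static int countMoved(int maxId, int oldShards, int newShards) {
        int moved = 0;
        for (long id = 1; id <= maxId; id++) {
            if (shardForId(id, oldShards) != shardForId(id, newShards)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        System.out.println(shardForId(25, 10));        // 5
        System.out.println(countMoved(1100, 10, 11));  // 1000
    }
}
```

Going from 10 to 11 shards moves 1,000 of 1,100 ids, which is why point 7 worries about migration. Schemes such as consistent hashing are commonly used to reduce this movement, but they are outside the scope of this sketch.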


Origin blog.csdn.net/qq_67801847/article/details/132906550