Hash Table Data Structure

A hash table (also written "hashtable") is a linear data structure. In general, insertion, deletion, modification, and lookup can all be done in O(1) time. In Java, the HashMap class is implemented on top of a hash table.


1. What is a Hash Table

A hash table is a linear data structure whose underlying storage is generally an array. When performing any create, read, update, or delete operation, the hash table first runs the key through a hash function, which maps the key to an index into the array; the data at that index can then be operated on directly. In theory, the time complexity of hash table operations is O(1).

Because the underlying storage of a hash table is an array, it inherits one characteristic of arrays: the length must be specified at initialization. So when the hash table fills up and you want to keep putting data into it, you must create a new array with a larger capacity and copy the contents of the old array into it. This resizing is an expensive operation, so it is best to estimate the required capacity before using a hash table and avoid triggering an expansion.
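For example, in Java you can pass an expected capacity when constructing a HashMap so that it never has to resize while being filled. A minimal sketch (the sizing rule accounts for HashMap's default load factor of 0.75):

```java
import java.util.HashMap;
import java.util.Map;

public class CapacityDemo {
    public static void main(String[] args) {
        // We expect about 1000 entries. HashMap resizes once size exceeds
        // capacity * loadFactor (0.75 by default), so request 1000 / 0.75.
        Map<String, Integer> map = new HashMap<>((int) (1000 / 0.75f) + 1);
        map.put("apple", 1);
        System.out.println(map.get("apple")); // 1
    }
}
```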

2. Hash Functions

A hash function, also known as a hashing algorithm, takes an input of arbitrary length (also called the pre-image) and converts it into a fixed-length output called the hash value. This conversion is a compressing map: the space of hash values is typically much smaller than the space of inputs, different inputs may hash to the same output, and it is impossible to uniquely determine the input from the hash value. Assuming the output range is S, a hash function has the following properties:

  • A typical hash function accepts inputs of unbounded length;

  • The same input always produces the same output;

  • Different inputs may produce the same output, or different outputs;

  • The outputs obtained for different inputs are evenly distributed.

In addition, hash functions have the following two properties:

  • Collision resistance: there should be no pair of inputs x ≠ y with H(x) = H(y). Strictly speaking, this property cannot hold in an absolute sense. Take the SHA-256 algorithm currently used by Bitcoin: it has 2^256 possible outputs, so given 2^256 + 1 inputs there must be a collision. In fact, it can be shown that after about 2^130 inputs there is a 99% probability of a collision. Even so, if every computer ever built by mankind had been computing from the birth of the universe until today, the chance of having hit a collision would still be extremely small.

  • Hiding: given an output H(x), it is computationally infeasible to work backwards to the input x. There is no method better than exhaustive search for finding a likely pre-image of H(x).

Commonly used hash functions include SHA-1, MD5, SHA-2, and so on.
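As a quick illustration of the properties above, here is a minimal sketch using the JDK's built-in MessageDigest with SHA-256: the output length is fixed no matter the input length, and the same input always yields the same hash value.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class HashFunctionDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");

        // Arbitrary-length input, fixed 32-byte (256-bit) output.
        byte[] d1 = sha256.digest("hello".getBytes(StandardCharsets.UTF_8));
        byte[] d2 = sha256.digest("hello".getBytes(StandardCharsets.UTF_8));

        System.out.println(d1.length);             // 32, regardless of input size
        System.out.println(Arrays.equals(d1, d2)); // true: same input, same output

        // Render the hash value in the usual hexadecimal form.
        StringBuilder hex = new StringBuilder();
        for (byte b : d1) hex.append(String.format("%02x", b));
        System.out.println(hex);
    }
}
```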

3. Hash Collisions

When a hash function produces the same output for different input values, this is called a hash collision.

Hash collisions are unavoidable. The methods commonly used to resolve them are **open addressing** and **separate chaining** (the "zipper" method).

3.1 Separate Chaining (the Zipper Method)

The core idea of separate chaining is: if a collision occurs at some position in the hash table (that is, when an element is about to be placed at a position that another element already occupies), the colliding elements are stored at that position in the form of a linked list.

Linked lists are relatively inefficient to query, so if too many collisions pile up at one position in the hash table, the list there becomes very long and lookups become slow. In Java 8, HashMap added an optimization: when a chain reaches a length of 8, the linked list is automatically converted into a red-black tree (a self-balancing binary search tree), which is more efficient to query.
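A minimal sketch of the chaining idea, with hypothetical class names (a real HashMap does far more, e.g. resizing and the red-black-tree conversion just mentioned):

```java
import java.util.LinkedList;

// Toy chained hash table: each array slot ("bucket") holds a linked
// list of the entries whose keys hash to that slot.
public class ChainedHashTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] buckets = new LinkedList[16];

    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length; // non-negative index
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite existing key
        }
        buckets[i].add(new Entry<>(key, value)); // collision: append to the chain
    }

    public V get(K key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Entry<K, V> e : buckets[i]) {  // walk the chain at this bucket
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```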

3.2 Open Addressing

With open addressing, if the data cannot be stored directly at the array index computed by the hash function, another location must be found for it. There are three ways of finding another location in open addressing: linear probing, quadratic probing, and double hashing.

3.2.1 Linear Probing

Insertion with linear probing is relatively straightforward: first hash the element to a position; if no other element has been mapped there, insert the data directly at that position. If the position already contains data, check whether the next position holds data; if not, insert there, otherwise keep checking subsequent positions until an empty one is found.

Lookup with linear probing: first use the key to locate the array index, then compare the value at that position with the value of the data you are looking for. If they are equal, the data is found; if not, continue to the next element. If all elements have been traversed without a match, the data is not present.

Deletion with linear probing: first map the key to an array index, then compare the elements in the array with the element you want to delete until you find it. Then remove the element at that position and set a flag indicating that this position once held data (think about why this step is necessary).
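A minimal sketch of all three operations for non-negative int keys (hypothetical names; no resizing, and it assumes the table never fills up). The DELETED tombstone is the flag from the deletion step above: if deletion simply cleared the slot, a later lookup would stop at the hole and miss keys stored farther along the probe run.

```java
// Toy open-addressing table using linear probing.
public class LinearProbingTable {
    private static final Integer DELETED = -1;        // tombstone marker
    private final Integer[] slots = new Integer[16];  // null = never used

    private int indexFor(int key) { return key % slots.length; }

    public void insert(int key) {
        int i = indexFor(key);
        while (slots[i] != null) {        // probe until a truly empty slot
            if (slots[i] == key) return;  // already present
            i = (i + 1) % slots.length;
        }
        slots[i] = key;                   // (a production table would also reuse DELETED slots)
    }

    public boolean contains(int key) {
        int i = indexFor(key);
        // Keep probing past tombstones: a DELETED slot once held data,
        // so the key we want may still live farther along the run.
        while (slots[i] != null) {
            if (slots[i] == key) return true;
            i = (i + 1) % slots.length;
        }
        return false;
    }

    public void delete(int key) {
        int i = indexFor(key);
        while (slots[i] != null) {
            if (slots[i] == key) {
                slots[i] = DELETED; // mark, don't clear: clearing would
                return;             // break probe runs for later keys
            }
            i = (i + 1) % slots.length;
        }
    }
}
```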

3.2.2 Quadratic Probing

In a linear-probing hash table, data tends to cluster. Once a cluster forms it tends to get bigger and bigger: items that hash into the cluster's range have to be moved along step by step and inserted after it, which makes the cluster larger, so the bigger a cluster gets, the faster it grows. It is like a crowd gathering in one spot: the more people there are, the more new people come over just to see what is going on.

Quadratic probing is an attempt to prevent this clustering. The idea is to probe cells that are farther apart, rather than the cells adjacent to the original position. With linear probing, if the hash function produces the original index x, the probes go to x + 1, x + 2, x + 3, and so on. With quadratic probing, the probes go to x + 1, x + 4, x + 9, x + 16, x + 25, and so on: the distance from the original position is the square of the step number.
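The difference between the two probe sequences in code form (a sketch; x is the home index produced by the hash function and i is the probe number):

```java
public class ProbeSequences {
    // i-th linear probe from home index x: x+1, x+2, x+3, ...
    static int linear(int x, int i, int capacity) {
        return (x + i) % capacity;
    }

    // i-th quadratic probe from home index x: x+1, x+4, x+9, x+16, ...
    static int quadratic(int x, int i, int capacity) {
        return (x + i * i) % capacity;
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 5; i++) {
            System.out.println(linear(3, i, 32) + " vs " + quadratic(3, i, 32));
        }
        // Prints 4/4, 5/7, 6/12, 7/19, 8/28: quadratic probing spreads out fast.
    }
}
```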

3.2.3 Double Hashing

Double hashing eliminates both the primary and the secondary clustering problem. Whether with linear or quadratic probing, the probe step is fixed for every key. Double hashing adds a second hash function, used to generate the probe step from the key: even if the first hash function maps two keys to the same array index, their probe steps will differ, which solves the clustering problem.

The second hash function must have the following characteristics:

  • It must not be the same as the first hash function;
  • It must never output 0, because a step size of 0 would make every probe point at the same position and probing would fall into an infinite loop. Experience shows that a hash function of the form stepSize = constant - (key % constant) works very well, where constant is a prime number smaller than the capacity of the array.

The core idea of double hashing is to use a second hash function to generate a pseudo-random probe step.
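A sketch of the step-size rule described above, assuming non-negative keys; 5 is used here as the prime constant, and the table capacity is itself a prime so that every slot stays reachable:

```java
public class DoubleHashingDemo {
    static final int CONSTANT = 5;  // a prime smaller than the array capacity

    // Second hash function: yields a step in 1..CONSTANT, never 0.
    static int stepSize(int key) {
        return CONSTANT - (key % CONSTANT);
    }

    public static void main(String[] args) {
        int capacity = 23;           // prime table capacity
        int key = 42;
        int home = key % capacity;   // first hash: home position (19)
        int step = stepSize(key);    // second hash: probe step (3)
        // On a collision, probing visits home, home+step, home+2*step, ...
        for (int i = 0; i < 4; i++) {
            System.out.println((home + i * step) % capacity); // 19, 22, 2, 5
        }
    }
}
```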

4. Applications of Hash Tables

Given a computer with only 2 GB of memory, how do you find the most frequently occurring integer among 2 billion integers?

First we need to determine the range of the counts. Since all 2 billion numbers could be the same number, a single count can reach 2 billion, so we need at least an int to store the count (a Java int is 4 bytes).

We also have to determine the range of the 2 billion integers themselves. If they fall between 1 and 2 billion, we can use an int to store the key as well; if the values span a larger range, we need to consider storing the key as a long. Consider the worst case for this problem: all 2 billion numbers are distinct and their values exceed 2 billion. Then the key must be stored as a long and the value as an int, and 2 billion records take roughly 26 GB of memory. That is clearly more memory than we have, so counting all 2 billion numbers in one pass is risky.

Solution: use a hash function to split the large file of 2 billion numbers into 16 small files. Repeated occurrences of the same number can never end up in different files, and if the hash function is good, the number of distinct values in each of the 16 files will be roughly the same, about 125 million (2 billion / 16). We can then count each of the 16 files in turn with a hash table, and finally combine the results to get the number with the most repetitions. (For the final step, we only need to take the most frequent number and its count from each small file and compare those 16 candidates.)
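A sketch of the partition-then-count idea. In-memory lists stand in for the 16 small files to keep the example short and runnable; in the real problem each partition would be written to disk and counted one at a time:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopFrequency {
    public static void main(String[] args) {
        long[] data = {7, 7, 3, 9, 7, 3};  // stand-in for the 2 billion numbers
        int partitions = 16;

        // Step 1: hash-partition. Equal numbers always hash to the same
        // value, so all copies of a number land in the same "file".
        List<List<Long>> files = new ArrayList<>();
        for (int i = 0; i < partitions; i++) files.add(new ArrayList<>());
        for (long x : data) {
            files.get((Long.hashCode(x) & 0x7fffffff) % partitions).add(x);
        }

        // Step 2: count each partition independently with a hash table,
        // tracking the most frequent number seen across all partitions.
        long best = 0;
        int bestCount = -1;
        for (List<Long> file : files) {
            Map<Long, Integer> counts = new HashMap<>();
            for (long x : file) counts.merge(x, 1, Integer::sum);
            for (Map.Entry<Long, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    best = e.getKey();
                }
            }
        }
        System.out.println(best + " occurs " + bestCount + " times"); // 7 occurs 3 times
    }
}
```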

Question: what if all 2 billion numbers are the same number, how would you handle that?


Source: www.cnblogs.com/54chensongxia/p/11566973.html