Data structures and algorithms: hashing (hash tables)

Hash

Basic definition

1. Hashing: determine where a data item is stored from the value of the data item itself. Each storage location in the hash table is called a slot.

2. Hash function: the function that maps a data item to a storage slot.

3. Slot number: the storage location that the hash function returns for a data item.

Several commonly used hash functions:

Remainder method:

  1. Method:
    Divide the data item by the size of the hash table and use the remainder as the slot number.
    In fact, the remainder step appears in all hash functions in one form or another: because the slot number returned by a hash function must fall within the range of the hash table, the last step is generally to take the remainder modulo the table size.
  2. Data search:
    Apply the same hash function to the search item, then check whether the slot with the returned slot number holds that data item.
  3. Shortcoming:
    A "conflict" (collision) may occur, that is, two different data items get the same slot number after the remainder is taken.
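
The remainder method, the search step and a conflict can be sketched as follows. This is a minimal illustration, assuming integer data items and a table of 11 slots; the names `remainder_hash`, `slots` and `contains` are made up for this example.

```python
def remainder_hash(item, table_size):
    """Return the slot number for an integer item (remainder method)."""
    return item % table_size

table_size = 11                    # assumed table size for this sketch
slots = [None] * table_size

# Store a few items; each one lands in slot item % 11.
for item in [54, 26, 93, 17, 77, 31]:
    slots[remainder_hash(item, table_size)] = item

def contains(item):
    """Search: recompute the slot number and look only in that slot."""
    return slots[remainder_hash(item, table_size)] == item

print(contains(93))                # 93 was stored, so it is found
print(contains(44))                # 44 was never stored
# 44 and 77 are different items but get the same slot number: a conflict.
print(remainder_hash(44, table_size) == remainder_hash(77, table_size))
```

Note that this sketch does not handle conflicts at all: storing 44 would simply overwrite 77.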

Perfect hash function:

1. Definition:
Given a set of data items, a hash function that maps each data item to a different slot is called a perfect hash function. For a fixed set of data, we can always find a way to design a perfect hash function.

2. Shortcoming:
If the set of data changes frequently, it is difficult to keep the hash function perfect, so some conflicts will occur. Conflicts are not fatal, however; they can be handled properly.

3. Designing a perfect hash function:
① Make the hash table large enough (that is, expand its capacity) that every possible data item occupies its own slot. (Not practical.)
② Otherwise, settle for a good hash function, which should have these characteristics:
minimal conflicts (approximately perfect),
low computational difficulty (small additional overhead), and
well-dispersed data items (to save space).
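
Whether a given hash function is perfect for a fixed set of items can be checked directly. The helper below is a small sketch for illustration; `is_perfect` and `mod11` are names invented for this example.

```python
def is_perfect(hash_fn, items):
    """True if hash_fn maps every item in the fixed set to a distinct slot."""
    slot_numbers = [hash_fn(item) for item in items]
    return len(set(slot_numbers)) == len(slot_numbers)

def mod11(item):
    """Remainder method with an assumed table size of 11."""
    return item % 11

fixed_items = [54, 26, 93, 17, 77, 31]
print(is_perfect(mod11, fixed_items))          # no two items share a slot
print(is_perfect(mod11, fixed_items + [44]))   # 44 and 77 both map to slot 0
```

The second call shows why a changing data set is a problem: one new item is enough to make the function no longer perfect.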

One of the applications of hashing

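Blockchain
Introduction to blockchain: a blockchain is a kind of distributed database. Every node connected through the network stores all the data of the entire database, and data stored at any node is synchronized to the others.
Its essential feature is decentralization: there is no control center or coordinating node, all nodes are equal, and none can be controlled.

A blockchain is made up of blocks, and each block is divided into a head and a body.
The block head records some metadata and the information that links it to the previous block:
the generation time and the hash value of the previous block (head + body).
A blockchain is tamper-resistant:
Because hash values are sensitive to modification, any change to the data in a block necessarily changes its hash value; to keep that block from breaking off the chain, all subsequent blocks would have to be modified as well.
Because of the "proof of work" mechanism, such a large-scale modification is infeasible unless one controls 51% or more of the computing power of the entire network.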
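
The linking of blocks through hash values can be sketched with Python's standard `hashlib`. This is a toy model under several assumptions: it runs on a single machine, it has no proof of work, and the block layout (`head` with `time` and `prev_hash`, plus `body`) and the helper names are invented for illustration rather than taken from any real blockchain.

```python
import hashlib
import json

def block_hash(block):
    """SHA-256 over the whole block (head + body), serialized deterministically."""
    data = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def make_block(body, prev_block, timestamp):
    """Build a block whose head stores the time and the previous block's hash."""
    prev_hash = block_hash(prev_block) if prev_block is not None else "0" * 64
    return {"head": {"time": timestamp, "prev_hash": prev_hash}, "body": body}

def chain_is_valid(chain):
    """Check that every head still matches the hash of the block before it."""
    return all(
        chain[i]["head"]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = [make_block("genesis", None, 0)]
chain.append(make_block("A pays B 5", chain[-1], 1))
chain.append(make_block("B pays C 2", chain[-1], 2))
print(chain_is_valid(chain))        # the untouched chain is consistent

chain[1]["body"] = "A pays B 500"   # tamper with an earlier block
print(chain_is_valid(chain))        # the next block's prev_hash no longer matches
```

To hide the change, the stored hash in the following block would have to be rewritten, which changes that block's hash in turn, and so on to the end of the chain.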

Hash function design

1. Folding method:
Divide the data item into segments of a fixed number of digits (the last segment may be shorter), add the segments together, and take the remainder modulo the hash table size to obtain the hash value.
Sometimes the folding method also includes a step that reverses every other segment before adding.
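
A possible sketch of the folding method, assuming non-negative integer items, two-digit segments and a table size of 11. The function name `fold_hash` and its parameters are chosen for this example only.

```python
def fold_hash(number, table_size, segment_len=2, reverse_alternate=False):
    """Folding method: split the digits into segments, add them, take the remainder."""
    digits = str(number)
    segments = [digits[i:i + segment_len] for i in range(0, len(digits), segment_len)]
    if reverse_alternate:
        # Reverse every other segment (the 2nd, 4th, ...) before adding.
        segments = [s[::-1] if i % 2 == 1 else s for i, s in enumerate(segments)]
    return sum(int(s) for s in segments) % table_size

# 4365554601 -> 43 + 65 + 55 + 46 + 01 = 210, and 210 % 11 = 1
print(fold_hash(4365554601, 11))
# With reversal -> 43 + 56 + 55 + 64 + 01 = 219, and 219 % 11 = 10
print(fold_hash(4365554601, 11, reverse_alternate=True))
```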

2. Mid-square method (slightly more computation):
First square the data item, then take the middle two digits of the square, and finally take the remainder modulo the hash table size.
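
The mid-square method might look like this for non-negative integer items. How to pick the "middle two digits" when the square has an odd number of digits is not specified in the text, so the choice made here (the center digit and the one to its left) is an assumption, as is the name `mid_square_hash`.

```python
def mid_square_hash(item, table_size):
    """Mid-square method: square, take the middle two digits, take the remainder."""
    squared = str(item * item)
    mid = len(squared) // 2
    # Two digits around the center; a one-digit square yields just that digit.
    middle = squared[max(mid - 1, 0):mid + 1]
    return int(middle) % table_size

# 44 * 44 = 1936, middle two digits 93, and 93 % 11 = 5
print(mid_square_hash(44, 11))
# 54 * 54 = 2916, middle two digits 91, and 91 % 11 = 3
print(mid_square_hash(54, 11))
```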

3. Non-numeric items:
Non-numeric data items can also be hashed. Treat each character in the string as its ASCII code, add up these integers, and take the remainder modulo the hash table size.

Note:
Such a hash function returns the same hash value for all anagrams. To prevent this, use each character's position in the string as a weighting factor and multiply it by the character's ord value.
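
Both variants of the string hash can be sketched in a few lines. The function names are illustrative, and positions are assumed to be counted from 1 so that the first character still contributes to the weighted sum.

```python
def simple_string_hash(text, table_size):
    """Add up the character codes and take the remainder."""
    return sum(ord(ch) for ch in text) % table_size

def weighted_string_hash(text, table_size):
    """Weight each character code by its position (counted from 1)."""
    return sum(pos * ord(ch) for pos, ch in enumerate(text, start=1)) % table_size

# "cat" and "act" are anagrams: 99 + 97 + 116 = 312 either way, 312 % 11 = 4
print(simple_string_hash("cat", 11), simple_string_hash("act", 11))
# With weights: cat -> 641 % 11 = 3, act -> 643 % 11 = 5
print(weighted_string_hash("cat", 11), weighted_string_hash("act", 11))
```

Weighting reduces anagram conflicts, but it does not rule out conflicts between other strings.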

4. Digit analysis method:
For a given set of keys, analyze how often each digit occurs at each position across all keys, and select the few positions with the most even distribution as the value of the hash function.
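
One way to sketch digit analysis, assuming all keys are digit strings of equal length. The text does not say how to measure "better distribution"; the score used here (the count of the most frequent digit at a position, lower being more even) is an assumption, and the sample keys and function names are invented for this example.

```python
from collections import Counter

def best_positions(keys, how_many):
    """Pick the digit positions whose digits are spread most evenly over the keys."""
    length = len(keys[0])

    def worst_count(pos):
        # Count of the most frequent digit at this position; lower = more even.
        return max(Counter(key[pos] for key in keys).values())

    ranked = sorted(range(length), key=worst_count)
    return sorted(ranked[:how_many])

def digit_analysis_hash(key, positions):
    """Use the digits at the selected positions as the hash value."""
    return int("".join(key[p] for p in positions))

keys = ["81346532", "81372242", "81387422", "81301367",
        "81322817", "81338967", "81354157", "81368537"]
positions = best_positions(keys, 2)
print(positions)
print([digit_analysis_hash(key, positions) for key in keys])
```

In these sample keys the first three digits are identical in every key, so they carry no information and are never selected.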

Basic principles of hash function design

1. The hash function must not be too complicated; otherwise it becomes a computational burden on both the storage process and the search process.

2. Hash values should be distributed as evenly as possible.

Conflict resolution:

1. Conflict resolution:
a systematic method for placing the second of two conflicting data items into the hash table.

2. Solution:
Open addressing: find an open (empty) slot to store the item. The simplest way is to start from the conflicting slot and scan toward the end of the table until an empty slot is found; if the end of the hash table is reached without finding one, continue scanning from the beginning.
This technique of checking slots one by one is the "linear probing" form of open addressing.

3. Disadvantage:
Items tend to cluster.

4. Improvement:
Change one-by-one probing to skip probing (rehashing).
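
Linear probing and skip probing differ only in the step size, so one small class can illustrate both. This is a sketch, assuming integer items, the remainder method as the hash function and no deletion; `ProbingTable` and its methods are hypothetical names. The skip should share no common factor with the table size, otherwise some slots are never visited; a prime size such as 11 avoids the issue.

```python
class ProbingTable:
    """Open addressing: skip=1 is linear probing, skip>1 is skip probing."""

    def __init__(self, size=11, skip=1):
        self.size = size
        self.skip = skip
        self.slots = [None] * size

    def _probe(self, item):
        """Yield candidate slots, starting at the home slot and wrapping around."""
        start = item % self.size
        for step in range(self.size):
            yield (start + step * self.skip) % self.size

    def put(self, item):
        for slot in self._probe(item):
            if self.slots[slot] is None or self.slots[slot] == item:
                self.slots[slot] = item
                return slot
        raise RuntimeError("hash table is full")

    def contains(self, item):
        for slot in self._probe(item):
            if self.slots[slot] is None:
                return False        # reached an empty slot: item was never stored
            if self.slots[slot] == item:
                return True
        return False

items = [54, 26, 93, 17, 77, 31, 44, 55, 20]
linear = ProbingTable(size=11, skip=1)
skipping = ProbingTable(size=11, skip=3)
for item in items:
    linear.put(item)
    skipping.put(item)
print(linear.slots)     # 77, 44, 55, 20 end up side by side in slots 0 to 3
print(skipping.slots)   # the same conflicting items are spread further apart
```

Searching must follow the same probe sequence as storing, which is why `contains` reuses `_probe`.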

Reference materials:

Data structure and algorithm (python) online courseware by Chen Bin, Peking University

Origin blog.csdn.net/qq_48314528/article/details/108689714