1. Description

I have been reviewing data structures recently. In the previous posts, we covered sequential search, binary search, block search, and tree search (binary search tree, balanced binary tree, red-black tree, B-tree and B+ tree). This post introduces the last of the common lookup algorithms: the hash table (hash lookup).

In addition, the New Star Project I recently joined (the data structures and algorithms channel) ended this weekend, and this post is also this week's learning task.

2. Basic concept of hash table

1. Explanation of terms

A hash table is a data structure based on hashing: it stores and looks up data elements by mapping a given key to a specific index position. Hash tables usually use an array as the underlying storage and require a reliable hash function to compute the index position for each key.

A hash function is a function that maps input data of arbitrary length to an output of fixed length. In a hash table, it is used to compute the index position of an element: it converts each key into an integer that can be used as the element's index in the table. This integer is not guaranteed to be unique, because different keys can produce the same value. The design of the hash function is very important because it directly affects the performance of the hash table. A good hash function should keep hash collisions, that is, cases where different keys are mapped to the same index position, to a minimum.

A hash collision occurs when two or more keys are hashed to the same index position. Since the hash table is finite while the set of possible keys is usually much larger, collisions are generally unavoidable. When a collision occurs, we need to resolve it so that the hash table stays correct and efficient. Common methods for resolving hash collisions include the chained storage method and the open address method.

Synonyms, in ordinary language, are words with the same or similar meanings. In a hash table, different keys that are mapped to the same index position are called synonyms of one another. One way to handle synonyms is the zipper method (chained storage), which links all data elements with the same hash value into one linked list so that colliding keys can share a slot.

2. Baidu Supplement

A hash table is a common data structure that supports fast insertion, lookup, and deletion of elements. It implements these operations by mapping each element to an index position computed from the element's hash value (also called a hash code).

A hash table consists of a fixed-size array, usually initialized to be empty. To insert an element, you first compute its hash value and use it as the array index at which to store the element. If two elements have the same hash value, a hash collision occurs; methods for resolving collisions include chained storage and the open address method.

When querying for an element, we first calculate the hash value of the element and check if there is an element at that position in the array. If the position is empty, it means that the element does not exist in the hash table; otherwise, a further comparison is required to determine whether the element is the desired element.

A hash table is an efficient data structure, and its insertion, lookup, and deletion operations have a time complexity of O(1) in most cases. However, due to the existence of hash collisions, the time complexity in the worst case may degenerate to O(n), so a reasonable hash function design and a method of dealing with hash collisions are very important.

3. Personal Supplements

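哈希表 == 散列表 (both mean "hash table")

哈希函数 == 散列函数 (both mean "hash function")

哈希查找 == 散列查找 (both mean "hash lookup")

In Chinese, 哈希 is a transliteration of "hash" and 散列 is a translation of its meaning. They are just different names used in different contexts, but they refer to the same concept.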

3. Construction method of hash function

1. Design Considerations

When designing a hash function, you need to pay attention to the following points:

Uniformity: The hash function should distribute the input data evenly over the entire hash table, so that the number of elements in each bucket is as balanced as possible and no bucket becomes so crowded that performance drops sharply.
Collision rate: A collision is the situation where two different keys are mapped to the same hash table location. Hash functions should be designed to minimize collisions. Collisions can be resolved with open addressing or chaining, but too many of them still hurt performance, so the hash function itself should avoid them as far as possible.
Easy to calculate: The hash function should be fast to compute, otherwise it slows down every hash table operation. Different types of data may require different hash function implementations.
Anti-attack: The hash function should resist intentionally constructed input, such as a malicious attacker creating a large number of colliding keys to make the performance of the hash table drop sharply.
Randomness: The output of the hash function should look as random as possible, which makes it harder for an attacker to mount a hash collision attack. Randomness can be achieved by using random numbers or adding a salt in the hash function.
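To make the uniformity point concrete, here is a minimal sketch that measures how crowded the fullest bucket becomes. It assumes non-negative integer keys and a plain remainder hash `key % m`; the function name `max_bucket_load` and the limit `MAX_BUCKETS` are made up for this illustration.

```c
#include <assert.h>

#define MAX_BUCKETS 64  /* illustrative upper bound on the table size */

/* Returns the size of the fullest bucket when n non-negative keys
   are placed into m buckets using the hash key % m. */
int max_bucket_load(const int *keys, int n, int m) {
    assert(m > 0 && m <= MAX_BUCKETS);
    int counts[MAX_BUCKETS] = {0};
    int worst = 0;
    for (int i = 0; i < n; i++) {
        assert(keys[i] >= 0);
        int b = keys[i] % m;          /* bucket chosen by the hash */
        counts[b]++;
        if (counts[b] > worst) {
            worst = counts[b];
        }
    }
    return worst;
}
```

For the keys 10, 20, ..., 70, a table of size 10 puts all seven keys into bucket 0, while a table of size 7 gives each key its own bucket. The same data can therefore be distributed very unevenly or very evenly depending on how the hash function interacts with it.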

2. Remainder method

This is the simplest and most commonly used method. Assuming that the length of the hash table is m, take a prime number p not greater than m but closest to m, and use the following formula to convert the key into a hash address.

The hash function is H(key) = key % p

The key to the remainder method is the choice of p: after conversion by this function, each key should be mapped to any address in the hash space with equal probability, which keeps the chance of collisions as low as possible.

Applicable scenarios: very commonly used; it works whenever the key is an integer
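The remainder method can be sketched in a few lines of C. This is only an illustration under the assumption that keys are non-negative integers; the helper names `is_prime`, `largest_prime_not_above` and `hash_remainder` are not from any library.

```c
#include <assert.h>

/* Returns 1 if n is prime, 0 otherwise (trial division, fine for small n). */
static int is_prime(int n) {
    if (n < 2) return 0;
    for (int d = 2; d * d <= n; d++) {
        if (n % d == 0) return 0;
    }
    return 1;
}

/* Returns the largest prime p with p <= m. Requires m >= 2. */
int largest_prime_not_above(int m) {
    assert(m >= 2);
    int p = m;
    while (!is_prime(p)) {
        p--;
    }
    return p;
}

/* H(key) = key % p for a non-negative integer key. */
int hash_remainder(int key, int p) {
    assert(key >= 0 && p > 0);
    return key % p;
}
```

For a table of length m = 16, the chosen prime is 13, and the key 100 hashes to address 100 % 13 = 9.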

3. Direct addressing method

Take a linear function of the key directly as the hash address. The hash function is

H(key) = key or H(key) = a*key + b

In the formula, a and b are constants. This method is the simplest and does not create collisions (as long as a is not 0, different keys get different addresses). It is suitable when the keys are distributed more or less continuously. If the keys are sparse and there are many gaps between them, a lot of storage space is wasted.

Applicable scenario: Keyword distribution is basically continuous
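As a small sketch of direct addressing, assume the keys are years in a continuous range starting at 1990 (the range and the names `FIRST_YEAR`, `YEAR_COUNT` and `hash_direct` are invented for this example). This corresponds to H(key) = a*key + b with a = 1 and b = -1990.

```c
#include <assert.h>

#define FIRST_YEAR 1990  /* smallest key in the assumed range */
#define YEAR_COUNT 20    /* number of consecutive keys, i.e. the table length */

/* H(key) = a*key + b with a = 1 and b = -FIRST_YEAR.
   Every year in the range gets its own slot, so no collisions occur. */
int hash_direct(int year) {
    assert(year >= FIRST_YEAR && year < FIRST_YEAR + YEAR_COUNT);
    return year - FIRST_YEAR;
}
```

Here the table needs exactly 20 slots for 20 possible keys. If only a handful of years in a much wider range were actually used, most slots would stay empty, which is the wasted space mentioned above.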

4. Digital Analysis

The digital analysis method is one of the hash function construction methods, and its basic idea is to use the digital features in the keywords to be stored to construct the hash address.

Specifically, the digital analysis method treats the key as a number in base r (r is usually 10), extracts several digits from right to left to form a new number, and then takes this number modulo the hash table length m as the hash address of the record. If the key has fewer digits than required, pad it with zeros on the left.

It should be noted here that which digits are selected as the digits of the new number and the length of the new number will directly affect the performance of the hash. Generally speaking, you should choose digits with relatively uniform distribution of numbers, and select as many digits as possible to ensure a better hashing effect.

Although the digital analysis method is simple and easy to implement, it places fairly high demands on the data and is easily affected by regularities in it, so its hashing performance may not be ideal. In practice, the most appropriate hash function should be chosen for the specific situation, possibly in combination with other construction methods.
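The variant described above (take the rightmost digits, then reduce modulo the table length) can be sketched as follows. It assumes non-negative decimal keys; the function name `hash_digits` and its parameters are chosen only for this illustration, and in a real application the digit positions would be picked after inspecting the actual key distribution.

```c
#include <assert.h>

/* Keeps the k rightmost decimal digits of key, then reduces the result
   modulo the table length m. Leading zeros need no special handling,
   because a short key is simply a smaller number. */
int hash_digits(long key, int k, int m) {
    assert(key >= 0 && k >= 1 && k <= 9 && m > 0);
    long base = 1;
    for (int i = 0; i < k; i++) {
        base *= 10;                   /* base = 10^k */
    }
    return (int)((key % base) % m);
}
```

For the key 20230417 with k = 4 and m = 97, the selected digits form 0417 = 417, and 417 % 97 = 29.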

5. Take the middle of the square

The middle square method is one of the hash function construction methods. Its basic idea is to take the middle digits of the square value of the keyword as the hash address.

Specifically, assuming that the key is an n-digit number, first square it to obtain a number of up to 2n digits, and then take m digits (usually m = 3 or 4) from the middle as the hash address. If fewer than m digits are available, pad with zeros; if there are more than m digits, keep only the middle m.

The advantage of this method is that it is simple and easy to implement, and every digit of the key influences the middle of the square, which improves the hashing performance. However, in practice the square may overflow the integer type, so overflow must be handled to avoid distorting the hash.

In general, the middle square method is a fairly basic way of constructing a hash function and is suitable for simple application scenarios. In practice, other construction methods should also be considered, weighing factors such as time and space cost, to choose the best hash function for the situation.
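A possible sketch of the middle square method is shown below. It assumes a non-negative key of at most n decimal digits and uses a 64-bit intermediate value to sidestep the overflow issue; the names `pow10u` and `hash_mid_square`, and the parameter r for the number of extracted digits, are illustrative choices.

```c
#include <assert.h>

/* Returns 10^e as an unsigned 64-bit value (e must be at most 18). */
static unsigned long long pow10u(int e) {
    unsigned long long p = 1;
    for (int i = 0; i < e; i++) {
        p *= 10;
    }
    return p;
}

/* Squares an n-digit key, treats the square as a zero-padded 2n-digit
   number, and returns r digits taken from its middle. */
int hash_mid_square(unsigned int key, int n, int r) {
    assert(n >= 1 && n <= 9);
    assert(r >= 1 && r <= 2 * n && r <= 9);
    assert(key < pow10u(n));
    unsigned long long sq = (unsigned long long)key * key; /* no overflow: sq < 10^18 */
    int drop = (2 * n - r) / 2;       /* low-order digits to discard */
    return (int)((sq / pow10u(drop)) % pow10u(r));
}
```

For example, 1234 squared is 1522756, which is 01522756 as an 8-digit number, and its middle four digits are 5227.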

4. Methods of dealing with conflicts

1. Zipper method

The zipper method, also known as chaining or the chain address method, is a way of resolving collisions in a hash table. In this method, each location in the hash table stores a linked list, and if multiple keys map to the same location, they are all added to that location's linked list.

When you need to find a key, first calculate its hash value, then locate the corresponding linked list, and search sequentially in the linked list. Since the size of the hash table is limited and the amount of data may be large, it is necessary to consider the design of the hash function during design to reduce the occurrence of conflicts and improve query efficiency.

When adding a new key-value pair, you also need to calculate its hash value first, then locate the corresponding linked list, and add it to the end of the linked list.

Although the zipper method is better in dealing with conflicts, it will take up more memory space, because each location needs to maintain a linked list. In addition, when traversing the entire linked list, query efficiency may be affected, especially when the linked list is too long.

2. Open address method

Open address method is a common method to solve the hash collision problem.

When using a hash table, multiple keys may be mapped to the same hash slot, which is a hash collision. The main idea of the open address method is that when a collision occurs, we keep probing further slots until an empty slot is found or all slots have been probed.

There are several probing strategies for open addressing, including linear probing, quadratic probing, and double hashing. Among them, linear probing is the simplest and most commonly used. Its basic idea is: if the current hash slot is already occupied, check the following slots one by one until an empty slot is found.

For example, assume that a hash table stores string keys, the hash function converts the string to an integer and takes the remainder, and linear probing is used when a collision occurs. When inserting a key-value pair, if the slot at the computed hash value is already occupied, search slot by slot from that position until an empty slot is found. If the entire hash table has been searched and no empty slot is found, the table must be expanded and memory reallocated, with the existing elements inserted again.

Although the open address method is a simple and effective way of resolving hash collisions, it has some problems. For example, as the hash table fills up, probe sequences get longer and performance degrades. Therefore, in practice an appropriate hash function and collision resolution method must be chosen for the specific situation.
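Linear probing as described above can be sketched as follows. To keep the example short it makes several simplifying assumptions: keys are non-negative integers rather than strings, the hash is `key % SLOTS`, the table has a fixed size (a full table is reported with -1 instead of being expanded), and deletion is not supported. All names here (`ProbeTable`, `probe_init`, `probe_insert`, `probe_find`) are made up for this illustration.

```c
#include <assert.h>

#define SLOTS 7
#define EMPTY (-1)  /* assumption: keys are non-negative, so -1 marks a free slot */

typedef struct {
    int keys[SLOTS];
} ProbeTable;

/* Marks every slot as free. */
void probe_init(ProbeTable *t) {
    for (int i = 0; i < SLOTS; i++) {
        t->keys[i] = EMPTY;
    }
}

/* Stores key and returns its slot index, or -1 if the table is full.
   Inserting a key that is already present returns its existing slot. */
int probe_insert(ProbeTable *t, int key) {
    assert(key >= 0);
    int start = key % SLOTS;
    for (int step = 0; step < SLOTS; step++) {
        int i = (start + step) % SLOTS;   /* wrap around at the end */
        if (t->keys[i] == EMPTY || t->keys[i] == key) {
            t->keys[i] = key;
            return i;
        }
    }
    return -1;
}

/* Returns the slot index of key, or -1 if the key is not in the table. */
int probe_find(const ProbeTable *t, int key) {
    assert(key >= 0);
    int start = key % SLOTS;
    for (int step = 0; step < SLOTS; step++) {
        int i = (start + step) % SLOTS;
        if (t->keys[i] == EMPTY) {
            return -1;                    /* an empty slot ends the probe sequence */
        }
        if (t->keys[i] == key) {
            return i;
        }
    }
    return -1;
}
```

With 7 slots, the keys 10, 17 and 24 all hash to slot 3, so they end up in slots 3, 4 and 5. A lookup follows the same probe sequence and stops at the first empty slot, which is also why removing a key cannot simply reset its slot to empty in this scheme.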

5. Hash lookup and performance analysis

Hash lookup is a lookup algorithm based on a hash table. It speeds up lookups by mapping the key to a location in the hash table. The hash function plays the key role here: it maps the key being looked up to a position in the hash table and, ideally, maps different keys to different positions, although collisions cannot be ruled out.

The time complexity of hash lookup is usually O(1), because only one hash value needs to be calculated to access the corresponding data in constant time. However, in practical applications, due to the occurrence of hash collisions, the lookup efficiency may decrease. Therefore, how to resolve hash collisions is also the key to optimizing hash lookup performance.

Common methods for resolving hash collisions include the chain method and the open address method. With the chain method, each hash slot holds a pointer to the head node of a linked list, and when a collision occurs the new element is appended to the end of the list for that slot. With the open address method, when a collision occurs you probe subsequent slots until you find an empty one.

In practical applications, it is very important to choose an appropriate hash function and a method to resolve hash collisions. A good hash function should have a low hash collision rate and be able to evenly map keys to the hash table. The method for solving hash conflicts needs to be selected according to the specific situation, and different methods correspond to different performances.

In conclusion, hash lookup is an efficient lookup algorithm, which has been widely used in large-scale data processing. In practical applications, the performance of hash lookup can be further improved by optimizing the hash function and solving hash collisions.

6. Implementing hash lookup in C

The following is an example of a simple C language implementation of hash lookup, using the chain method to resolve hash collisions:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 10

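// Node stored in the hash table (one entry in a slot's linked list)
typedef struct node {
    char key[20];      // key (at most 19 characters plus the terminating '\0')
    int value;         // value
    struct node *next; // pointer to the next node in the chain
} Node;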

// Hash table structure
typedef struct hashtable {
    Node *table[TABLE_SIZE]; // array of slots, each the head of a linked list
} Hashtable;

// Initialize the hash table: every slot starts out empty
void initHashtable(Hashtable *ht) {
    int i;
    for (i = 0; i < TABLE_SIZE; i++) {
        ht->table[i] = NULL;
    }
}

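// Compute the hash value: sum the character codes and take the remainder
int hash(char *key) {
    unsigned int sum = 0;
    int i;
    for (i = 0; key[i] != '\0'; i++) {
        // cast so that bytes above 127 cannot make the sum negative
        sum += (unsigned char) key[i];
    }
    return (int) (sum % TABLE_SIZE);
}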

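// Insert an element into the hash table
void insertElement(Hashtable *ht, char *key, int value) {
    int h = hash(key);
    // Create the new node
    Node *newNode = (Node *) malloc(sizeof(Node));
    if (newNode == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    // Copy at most 19 characters so that a long key cannot overflow key[20]
    strncpy(newNode->key, key, sizeof(newNode->key) - 1);
    newNode->key[sizeof(newNode->key) - 1] = '\0';
    newNode->value = value;
    newNode->next = NULL;

    if (ht->table[h] == NULL) {
        // The slot is empty, so insert directly
        ht->table[h] = newNode;
    } else {
        // The slot is occupied, so resolve the collision by chaining:
        // walk to the end of the list and append the new node
        Node *p = ht->table[h];
        while (p->next != NULL) {
            p = p->next;
        }
        p->next = newNode;
    }
}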

// Look up an element in the hash table
int findElement(Hashtable *ht, char *key) {
    int h = hash(key);
    Node *p = ht->table[h];
    while (p != NULL) {
        if (strcmp(p->key, key) == 0) {
            return p->value;
        }
        p = p->next;
    }
    return -1; // not found
}

// Test program
int main(void) {
    Hashtable ht;
    initHashtable(&ht);

    insertElement(&ht, "apple", 10);
    insertElement(&ht, "banana", 20);
    insertElement(&ht, "orange", 30);

    printf("The value of apple is %d\n", findElement(&ht, "apple"));
    printf("The value of banana is %d\n", findElement(&ht, "banana"));
    printf("The value of orange is %d\n", findElement(&ht, "orange"));

    return 0;
}

In the above example, we defined a hash table structure Hashtable, which contains an array table for storing elements. Each array entry is a pointer to the head node of a linked list. If a hash collision occurs, the new element is appended to the end of the linked list for that slot.

When inserting, we first compute the hash value of the key with the hash function, and then use the hash value to find the slot's pointer. If the pointer is NULL, there is no element at this position and the new node is stored there directly; otherwise, we traverse the linked list and append the new node at its end.

When searching, we also compute the hash value first, and then traverse the linked list for that slot to find the element. If it is found, the corresponding value is returned; otherwise, -1 is returned to indicate that the key was not found. Note that this means -1 cannot be told apart from a stored value of -1.
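One property of a character-sum hash like the one above is worth illustrating: because addition ignores order, any two keys made of the same characters collide. The sketch below restates the hash as a standalone function so it can be tried in isolation; the name `sum_hash` and the `table_size` parameter are introduced only for this example, and the concrete values assume an ASCII character set.

```c
#include <assert.h>

/* Sums the character codes of key and reduces the sum modulo table_size.
   The order of the characters does not affect the result. */
int sum_hash(const char *key, int table_size) {
    assert(key != 0 && table_size > 0);
    unsigned int sum = 0;
    for (int i = 0; key[i] != '\0'; i++) {
        sum += (unsigned char) key[i];
    }
    return (int) (sum % (unsigned int) table_size);
}
```

With a table size of 10, "abc", "bca" and "cab" all land in slot 4, so they would share one linked list. The three test keys in the example program (apple, banana, orange) happen to fall into slots 0, 9 and 6, so the sample run never exercises the collision branch.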

The above is a simple example of hash lookup implemented in C. You can use it as a reference when implementing your own hash lookup program.

Note:

New Star Project: Data Structures and Algorithms, @西安第一感情, writing check-in 4!