An article to help you learn what hashing is


Hash concept

Hashing is widely used in C++. It is a data structure and algorithm used to quickly find and store data. The following are some common applications of hashing in C++:

  1. Hash Table: A hash table is an efficient data structure for storing key-value pairs. In C++, std::unordered_map and std::unordered_set are the hash table implementations provided by the standard library. These containers provide average constant-time lookup, insertion, and deletion, making them ideal for quickly finding and storing data.
#include <unordered_map>
#include <unordered_set>

std::unordered_map<std::string, int> hashMap;
hashMap["apple"] = 3;
int value = hashMap["apple"];

std::unordered_set<int> hashSet;
hashSet.insert(42);
bool exists = hashSet.count(42) > 0;
  1. Hash Function: A hash function maps input data to a fixed-size hash code (hash value), usually an integer. The C++ standard library provides multiple hash functions and also allows users to customize hash functions.
std::hash<int> intHash;
size_t hashCode = intHash(42);
  1. Third-party hash containers: In addition to the standard library's hash tables, C++ also has third-party libraries and implementations, such as Google's Abseil containers absl::flat_hash_map and absl::flat_hash_set, which provide similar functionality with lower memory overhead.
  2. Hash sets are used for deduplication: by inserting elements into a hash set, duplicates can be quickly removed.
std::unordered_set<int> uniqueValues;
uniqueValues.insert(1);
uniqueValues.insert(2);
uniqueValues.insert(1); // the duplicate element is automatically ignored
  1. Applications of hashing in cryptography: Hash functions are used in cryptography to store and verify passwords. Commonly used functions include SHA-256 and dedicated password-hashing functions such as bcrypt.
  2. Cache implementation: Hash tables can be used to implement caching. By storing key-value pairs in a hash table, data in the cache can be looked up in constant time.
  3. Data uniqueness check: Hash functions and hash tables can be used to check the uniqueness of data to ensure that duplicate data is not stored.

In a sequential structure or a balanced tree, there is no relationship between an element's key and its storage location, so searching for an element requires multiple key comparisons. Sequential search takes O(N); in a balanced tree, search is proportional to the height of the tree, O(log_2 N). In both cases, search efficiency depends on the number of comparisons made during the search.

The ideal search method retrieves the element directly from the table in a single step, without any comparisons.
If we construct a storage structure that uses a function (hashFunc) to establish a one-to-one mapping between each element's storage location and its key, then elements can be found quickly through this function during a search.

With this structure:

Insert an element: apply the hash function to the key of the element to be inserted to compute its storage location, and store the element there.
Search for an element: apply the same hash function to the key, use the resulting value as the storage location, and compare the element at that position. If the keys are equal, the search succeeds.

This method is the hashing method. The conversion function used in it is called the hash function, and the resulting structure is called a hash table (Hash Table).
For example, take the data set {1, 7, 6, 4, 5, 9};

The hash function is set to: hash(key) = key % capacity; capacity is the total size of the underlying space for storing elements

Searching with this method requires no repeated key comparisons, so it is fast. However, this raises a question: what happens when two different keys produce the same remainder?

Hash collision

In hash table storage, a hash collision occurs when different keys (or data), after being processed by the hash function, produce the same hash value and therefore map to the same bucket (hash table location). Since a hash function maps input data into a limited number of buckets, and the number of possible inputs may far exceed the number of buckets, hash collisions are a common problem in hash tables.

Hash collisions can cause the following problems :

  1. Data overwriting: If two different keys are mapped to the same bucket location, the data of one key will overwrite the data of the other key, resulting in data loss.
  2. Reduced lookup efficiency: When multiple keys map to the same bucket location, finding a specific key may be less efficient because one must search within the bucket to find the correct key.

To resolve hash collisions, hash tables typically use one of the following methods :

  1. Chaining: In this method, each bucket stores a linked list, array, or other data structure that stores multiple elements mapped to the same hash value. When a hash collision occurs, new elements are added to the linked list or array within the bucket without overwriting existing elements.
  2. Open Addressing: In this method, when a hash collision occurs, the algorithm probes for the next available bucket in the hash table until a free bucket is found, and inserts the data there. This usually involves a probing scheme, such as linear probing or quadratic probing.
  3. Good hash function: Choosing the right hash function can reduce the occurrence of hash collisions. A good hash function should map data into buckets as evenly as possible, thereby reducing the probability of collisions.

Handling hash conflicts is an important issue that needs to be considered when designing and implementing hash tables. Different application scenarios may require different conflict resolution strategies. Reasonable conflict handling methods can improve the performance and efficiency of hash tables.

Hash function

Hash function design principles :

The domain of the hash function must cover all keys that need to be stored, and if the hash table has m addresses, its range must be [0, m-1].
The addresses computed by the hash function should be distributed as evenly as possible across the whole space.
The hash function should be relatively simple.

1. Direct addressing method

Direct Addressing is a simple and effective hashing technique, often used for storing and retrieving key-value pairs. In the direct addressing method, each possible key corresponds to its own bucket (or slot); the buckets are allocated according to the key range and usually implemented as an array.

The core idea of this approach is that each key maps directly to a specific bucket, so ideally there are no hash collisions, since each key has a unique bucket index.

The following are the main features and limitations of the direct addressing method :

  1. Space complexity : This method requires allocating storage space large enough to accommodate all possible keys, so the space complexity depends on the range size of the keys. If the range of keys is large, it will result in high memory consumption.
  2. Conflict resolution : Direct addressing generally does not require a conflict resolution strategy because each key has a unique bucket index. This makes storage and retrieval operations both O(1) constant time complexity, very efficient.
  3. Applicability : Direct addressing usually works when the range of keys is relatively small and contiguous. If the range of keys is very large or non-contiguous, this approach may not be suitable as it would require allocating a large amount of storage space.
  4. Example : A common example is to use direct addressing to implement a data structure containing integer keys, such as a counter. If you have a counter that needs to track the occurrences of a large number of integer values, you can use an array where the index of the array is the integer value and the value of the array is the number of occurrences of that integer value.
// Implement a counter using direct addressing
const int MAX_RANGE = 1000; // assume the integers fall in the range 0 to 999
int counter[MAX_RANGE] = {0};

// increment the count for a given integer value
int key = 42;
counter[key]++;

In summary, direct addressing is a simple but effective hashing method, suitable when the range of keys is relatively small and contiguous. It provides storage and retrieval in constant time, but the range of the keys and the resulting memory consumption must be kept in mind.

2. Division with remainder method

The Division Method is a common hashing technique used to map keys to hash table buckets (slots) or array indexes. The basic idea is to compute a hash value by dividing the key by a suitable number and taking the remainder, then use that value as the bucket index. The divisor should be chosen carefully, usually as a prime number, to ensure a good distribution.

The steps of the division with remainder method are as follows:

  1. Choose a suitable divisor (usually a prime number), denoted by M.
  2. Hash calculation on key K: hash value = K % M.
  3. Use the hash value as the bucket index and store the data in the corresponding bucket.

The advantage of this method is that it is simple and easy to implement. However, it also has some limitations and caveats :

  1. Choose a suitable divisor : Choosing a suitable divisor M is crucial for the division with remainder method. A good choice ensures good hash distribution and avoids hash collisions. Often, choosing prime numbers can help reduce the likelihood of conflicts.
  2. Uniform distribution : To get a good hash distribution, keys should be evenly distributed throughout the key space. If the keys are unevenly distributed, it can result in some buckets being overcrowded while other buckets have little to no data.
  3. Handling negative numbers: The division method usually assumes the keys are non-negative integers. If negative numbers must be hashed, an adjustment is needed, such as mapping them to non-negative values first.
  4. Handling hash collisions: Although a well-chosen divisor reduces the likelihood of collisions, collisions can still occur. In practice, a conflict resolution strategy, such as Chaining or Open Addressing, is usually still required.

3. Mid-Square Method

The Mid-Square Method squares a number and takes the middle digits of the result. Historically it is also a simple pseudo-random number generation technique: an integer is squared, the middle digits are taken as the next integer, and the process repeats. Despite its simplicity, its quality is generally inferior to more sophisticated pseudo-random number generation algorithms.

The following are the basic steps of the mid-square method:

  1. Choose an initial seed (or starting value), usually a positive integer.
  2. Square the seed and get a larger integer.
  3. From the result of this square, take out the middle number of digits as the next pseudo-random number.
  4. Use this new pseudo-random number as the seed for the next round and repeat the above steps

Suppose the key is 1234; squaring it gives 1522756, and we extract the middle 3 digits, 227, as the hash address.
If the key is 4321, squaring it gives 18671041, and we extract the middle 3 digits, 671 (or 710), as the hash address.

The core idea of this method is to repeatedly square and take the middle digits. However, its quality and uniformity are generally inferior to more complex pseudo-random number generation algorithms, so it is mainly used for teaching or simple simulation problems.

The performance and uniformity of the mid-square method depend on the choice of initial seed and the number of digits extracted. A poor seed or an inappropriate digit count may produce periodic or uneven sequences, so in practice more advanced and reliable generators, such as the Linear Congruential Generator or the Mersenne Twister, are used to obtain higher-quality pseudo-random numbers. As a hash function, the mid-square method suits situations where the distribution of the keys is unknown and the number of digits is not very large.

4. Folding method

The Folding Method is a hashing technique often used to map large integer or long string keys to smaller hash values so they can be stored in a hash table. The basic idea is to split the input key into fixed-length parts and then add the parts (or apply another mathematical operation) to produce the hash value.

The following are the general steps for the folding method :

  1. Choose an appropriate split size, usually a positive integer. This size can be chosen based on the needs of the application, and is usually a value that allows the input keys to be evenly divided.
  2. Split an input key into fixed-size parts (split chunks), which can be consecutive characters, numbers, or other appropriate units. If the inputs are numbers, they are usually split into an equal number of parts.
  3. Perform a mathematical operation on these parts, usually a sum. Different mathematical operations such as addition, bit operations, etc. can be selected to generate hash values ​​depending on the situation.
  4. The end result is a hash value, which can be used as a bucket index in a hash table or hash table.

Here is a simple example of mapping an integer key to a hash value using folding:

Suppose the input key is 1234567890, we choose a split block size of 3, and then split this integer into 1, 234, 567, and 890. Next, we sum the parts and get the hash value: 1 + 234 + 567 + 890 = 1692. This hash value can be used to store and retrieve data.

The performance and uniformity of the folding method depend on the chunk size and the choice of mathematical operation. Chosen well, it can provide a good hash distribution, but the parameters must be picked carefully to avoid problems such as collisions or uneven distributions. The folding method suits keys whose distribution is not known in advance and whose number of digits is relatively large.

5. Random number method

Choose a random function and take the random function value of the key as its hash address, that is, H(key) = random(key), where random is a random number function.
This method is usually used when the keys have different lengths.

6. Numerical analysis method

Suppose there are n keys of d digits, where each digit can be one of r different symbols. These r symbols may not appear with the same frequency in every digit position: in some positions they are evenly distributed, each appearing with equal likelihood, while in other positions only certain symbols appear frequently. Based on the size of the hash table, we can select the digit positions in which the symbols are evenly distributed and use them as the hash address.

Suppose we want to store a company's employee registry using mobile phone numbers as keys. The first 7 digits are very likely to be the same, so we can choose the last four digits as the hash address. If this extraction still produces frequent collisions, the extracted digits can be further transformed: reversed (1234 becomes 4321), rotated right (1234 becomes 4123), rotated left, or superimposed by adding the first two digits to the last two (1234 becomes 12+34=46), among other methods.

The numerical analysis method is suitable when the keys have a relatively large number of digits, and when the distribution of the keys is known in advance with several digit positions distributed relatively evenly.

The more sophisticated the hash function is designed, the lower the possibility of hash collisions, but hash collisions cannot be avoided.

Hash conflict resolution

Two common ways to resolve hash collisions are: closed hashing and open hashing

1. Closed hashing

Closed hashing, also called the open addressing method: when a hash collision occurs and the hash table is not full, there must still be an empty position somewhere in the table, so the key can be stored in the next empty position after the conflicting one.

1.1 Linear probing

For example, in the scenario above, suppose we now need to insert element 44. The hash function gives hashAddr = 4, so 44 should in theory be placed at position 4, but that position is already occupied by element 4; that is, a hash collision occurs.

Linear probing: starting from the position where the collision occurs, probe forward until the next empty position is found.

Insert

Obtain the element's position in the hash table via the hash function.
If that position is empty, insert the new element directly. If the position is occupied (a hash collision), use linear probing to find the next empty position and insert the new element there.


Delete

When closed hashing is used to handle collisions, an existing element cannot simply be physically removed from the table, because doing so would break the search for other elements. For example, deleting element 4 directly would interrupt the probe sequence for 44. Therefore, linear probing deletes elements with a mark, a so-called pseudo-deletion (tombstone).

Each slot in the hash table is given a state mark:
EMPTY means the slot is empty; EXIST means the slot already holds an element; DELETE means the element in the slot has been deleted.

enum State { EMPTY, EXIST, DELETE };

Linear probing implementation

template<class K, class V>
struct HashData
{
	pair<K, V> _kv;
	State _state = EMPTY;
};

This defines a template struct HashData to represent data items in the hash table. It contains two main members:

  1. _kv: a key-value pair (pair<K, V>) that stores the data associated with the key.
  2. _state: an enumeration of type State that represents the status of the data item, one of the following three:
    • EMPTY: Indicates that the slot is empty, that is, there is no data.
    • EXIST: Indicates that the slot contains valid data.
    • DELETE: Indicates that the slot contains deleted data.

This structure stores key-value pairs in the hash table and tracks the state of each slot. The _state member allows the hash table to handle deletions, not just insertions and lookups.

template<class K>
struct HashFunc
{
	size_t operator()(const K& key)
	{
		return (size_t)key;
	}
};

This defines a generic hash function template HashFunc that can be used for any key type K. Its implementation is very simple: it casts the input key directly to size_t and returns it.

Specifically, the operation steps of this hash function are:

  1. Accept a key of type K as the input parameter.
  2. Cast the key to size_t, i.e., map keys of different types to an unsigned integer.
  3. Return the converted size_t value as the hash result.

It should be noted that it may not work for all types of keys, especially for custom data types, which may require more complex hash functions to ensure good hashing performance and uniformity.

template<>
struct HashFunc<string> // specialization for string: multiply by the fixed prime 131 to reduce collisions
{
	size_t operator()(const string& key)
	{
		size_t val = 0;
		for (auto ch : key)
		{
			val *= 131;
			val += ch;
		}

		return val;
	}
};

This defines a specialized version of HashFunc for string keys. It converts each character of the string into its integer value and combines them to produce a hash value.

Here's how this specialized version of the hash function works:

  1. Iterate over each character in a string.
  2. For each character, multiply the current hash value by a fixed prime number (131) and then add the character's integer value.
  3. Repeat steps 1 and 2 until the entire string has been traversed.
  4. Returns the final hash value as the string hash result.

This hash function is simple and effective. It incorporates every character of the string into the hash value and uses the prime 131 for mixing (following the approach in the STL source), which improves the uniformity of the hash. This method produces good hashing results in many situations.

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
private:
	vector<HashData<K, V>> _tables;
	size_t _size = 0; // number of valid elements stored
};

This defines a template class HashTable that implements the hash table data structure. It stores key-value pairs, where keys have type K and values have type V, and the hash function type can optionally be specified; by default, HashFunc<K> is used.

The following are the main members and characteristics of this hash table class:

  1. _tables: a vector of HashData<K, V> items representing the storage space of the hash table. Each element corresponds to a slot that can hold one key-value pair; the size of the hash table is the size of the vector.
  2. _size: a counter recording how many valid key-value pairs are currently stored; it is updated by the insert and delete operations.
  3. Default hash function: the template parameter Hash allows a custom hash function type to be supplied; if none is provided, HashFunc<K> is used. This allows different hash functions for different key types.

The following member functions are all public

Insert function

The insert function must consider expansion: when growing a hash table, we need to take the load factor into account.

The load factor of a hash table is defined as: α = (number of elements filled into the table) / (length of the hash table)

α indicates how full the hash table is. Since the table length is fixed, α is proportional to the number of elements in the table: the larger α is, the fuller the table and the greater the chance of a collision; the smaller α is, the emptier the table and the lower the chance of a collision. In fact, the average search length of a hash table is a function of the load factor α, though the function differs between collision-handling methods.

For open addressing, the load factor is particularly important and should be kept strictly below about 0.7-0.8. Above 0.8, cache misses during lookup rise sharply and performance degrades rapidly. For this reason, some hash libraries that use open addressing, such as Java's standard library, cap the load factor at 0.75 and resize the hash table when it is exceeded.

bool Insert(const pair<K, V>& kv)
{
	if (Find(kv.first))
		return false;

	// grow the table when the load factor is reached
	if (_tables.size() == 0 || 10 * _size / _tables.size() >= 7)
	{
		size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
		HashTable<K, V, Hash> newHT;
		newHT._tables.resize(newSize);
		// remap the old table's data into the new table
		for (auto e : _tables)
		{
			if (e._state == EXIST)
			{
				newHT.Insert(e._kv);
			}
		}

		_tables.swap(newHT._tables);
	}

	Hash hash;
	size_t hashi = hash(kv.first) % _tables.size();
	while (_tables[hashi]._state == EXIST)
	{
		hashi++;
		hashi %= _tables.size();
	}

	_tables[hashi]._kv = kv;
	_tables[hashi]._state = EXIST;
	++_size;

	return true;
}

The Insert method inserts a new key-value pair into the hash table. Its main steps:

  1. First, call Find to check whether an item with the same key already exists. If it does, the insertion fails and returns false.
  2. Next, check the load factor, the ratio of valid items _size to the number of slots _tables.size(). If it exceeds the threshold (7/10), the table is overloaded and must be expanded to maintain performance and uniformity.
  3. If expansion is needed, create a new hash table newHT with twice the current size (or initialized to 10 if the current table is empty). Then remap the data from the old _tables into the new table by calling newHT.Insert(e._kv) for each valid item.
  4. Once the remapping is complete, swap the old _tables with newHT's table so the new table becomes the current one.
  5. Next, compute the hash of the key to be inserted: apply the hash function to kv.first and take the result modulo the table size to get the slot index hashi.
  6. Use linear probing to find an available slot: while the state of _tables[hashi] is EXIST, advance to the next slot until an empty one is found.
  7. Once an empty slot is found, store the pair kv there, mark its state EXIST to indicate the slot holds valid data, and increment _size.
  8. Finally, return true to indicate a successful insertion.

Find function

HashData<K, V>* Find(const K& key)
{
    if (_tables.size() == 0)
    {
        return nullptr;
    }

    Hash hash;
    size_t start = hash(key) % _tables.size();
    size_t hashi = start;
    while (_tables[hashi]._state != EMPTY)
    {
        if (_tables[hashi]._state != DELETE && _tables[hashi]._kv.first == key)
        {
            return &_tables[hashi];
        }

        hashi++;
        hashi %= _tables.size();

        if (hashi == start)
        {
            break;
        }
    }

    return nullptr;
}

The Find method looks up the data item for a given key and returns a pointer to its HashData<K, V>. Its main steps:

  1. First, check the size of the table. If the table is empty (_tables.size() is 0), no search is possible and nullptr is returned.
  2. Create a hash function object hash and compute the hash value of the given key. Take it modulo _tables.size() to get the slot index start, the position where the search begins.
  3. Initialize the probe index hashi to start and enter the loop.
  4. In the loop, check the state of the current slot _tables[hashi]. If it is EMPTY, the probe sequence has ended without finding the key, and the loop terminates.
  5. If the state is DELETE, the data in the slot has been deleted; skip it and continue to the next slot.
  6. If the state is neither EMPTY nor DELETE, the slot holds valid data. Check whether its key _tables[hashi]._kv.first equals the target key; if so, the key has been found, and a pointer &_tables[hashi] is returned.
  7. Otherwise the slot does not match the target key; continue to the next slot by incrementing hashi and taking it modulo _tables.size().
  8. If the loop wraps all the way back to start, the entire hash table has been traversed without a match, so exit the loop.
  9. Finally, if the loop ends without finding a match, return nullptr.

Erase function

bool Erase(const K& key)
{
    HashData<K, V>* ret = Find(key);
    if (ret)
    {
        ret->_state = DELETE;
        --_size;
        return true;
    }
    else
    {
        return false;
    }
}

The Erase method deletes the data item for a given key. Its main steps:

  1. First, call Find to locate the item for the key. If a match is found, Find returns a pointer to it, which is stored in ret; otherwise Find returns nullptr.
  2. Next, check whether ret is non-null. If it is, a matching item was found and deletion can proceed.
  3. To delete, set the item's _state to DELETE, marking it as deleted.
  4. At the same time, decrement the count of valid items _size to reflect the deletion.
  5. Finally, return true to indicate a successful deletion.
  6. If ret is null (no matching item was found), return false to indicate the deletion failed.

This Erase method implements tombstone deletion: items are marked with the DELETE state rather than physically removed from the table. This lets lookups skip deleted data while preserving the probe sequences of the hash table.

All code

#pragma once
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

enum State
{
	EMPTY,
	EXIST,
	DELETE
};

template<class K, class V>
struct HashData
{
	pair<K, V> _kv;
	State _state = EMPTY;
};

template<class K>
struct HashFunc
{
	size_t operator()(const K& key)
	{
		return (size_t)key;
	}
};

template<>
struct HashFunc<string> // specialization for string: multiply by the fixed prime 131 to reduce collisions
{
	size_t operator()(const string& key)
	{
		size_t val = 0;
		for (auto ch : key)
		{
			val *= 131;
			val += ch;
		}

		return val;
	}
};

template<class K, class V, class Hash = HashFunc<K>>
class HashTable
{
public:
	bool Insert(const pair<K, V>& kv)
	{
		if (Find(kv.first))
			return false;

		// grow the table when the load factor is reached
		if (_tables.size() == 0 || 10 * _size / _tables.size() >= 7)
		{
			size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
			HashTable<K, V, Hash> newHT;
			newHT._tables.resize(newSize);
			// remap the old table's data into the new table
			for (auto e : _tables)
			{
				if (e._state == EXIST)
				{
					newHT.Insert(e._kv);
				}
			}

			_tables.swap(newHT._tables);
		}

		Hash hash;
		size_t hashi = hash(kv.first) % _tables.size();
		while (_tables[hashi]._state == EXIST)
		{
			hashi++;
			hashi %= _tables.size();
		}

		_tables[hashi]._kv = kv;
		_tables[hashi]._state = EXIST;
		++_size;

		return true;
	}

	HashData<K, V>* Find(const K& key)
	{
		if (_tables.size() == 0)
		{
			return nullptr;
		}

		Hash hash;
		size_t start = hash(key) % _tables.size();
		size_t hashi = start;
		while (_tables[hashi]._state != EMPTY)
		{
			if (_tables[hashi]._state != DELETE && _tables[hashi]._kv.first == key)
			{
				return &_tables[hashi];
			}

			hashi++;
			hashi %= _tables.size();

			if (hashi == start)
			{
				break;
			}
		}

		return nullptr;
	}

	bool Erase(const K& key)
	{
		HashData<K, V>* ret = Find(key);
		if (ret)
		{
			ret->_state = DELETE;
			--_size;
			return true;
		}
		else
		{
			return false;
		}
	}

	void Print()
	{
		for (size_t i = 0; i < _tables.size(); ++i)
		{
			if (_tables[i]._state == EXIST)
			{
				cout << "[" << i << ":" << _tables[i]._kv.first << "] ";
			}
			else
			{
				cout << "[" << i << ":*] ";
			}
		}
		cout << endl;
	}

private:
	vector<HashData<K, V>> _tables;
	size_t _size = 0; // number of valid elements stored
};

Disadvantage of linear probing: once a hash collision occurs, the colliding entries pile up in consecutive slots. Different keys occupy each other's fallback positions, so locating a key may require many comparisons, which reduces search efficiency.

1.2 Quadratic probing

The flaw of linear probing is that colliding data accumulates in consecutive slots, because the next empty position is searched one slot at a time. To avoid this, quadratic probing jumps when looking for the next empty position: H_i = (H_0 + i^2) % m, or H_i = (H_0 - i^2) % m, where i = 1, 2, 3, …, H_0 is the position computed from the element's key by the hash function Hash(x), and m is the table size.

Modify the insertion logic from linear probing:

Hash hash;
size_t start = hash(kv.first) % _tables.size();
size_t i = 0;
size_t hashi = start;
// quadratic probing
while (_tables[hashi]._state == EXIST)
{
    ++i;
    hashi = start + i*i;
    hashi %= _tables.size();
}

_tables[hashi]._kv = kv;
_tables[hashi]._state = EXIST;
++_size;

The main steps and logic of the code:

  1. Create a hash function object hash.
  2. Compute the hash value of the key kv.first and take it modulo _tables.size() to obtain the starting slot index start.
  3. Initialize a counter i to track the number of probes, and initialize hashi to start as the current slot to examine.
  4. Loop, probing quadratically for an available slot: on each iteration, increment i, compute the new index as start + i*i, then take it modulo _tables.size() so the index stays within the table.
  5. If the current slot _tables[hashi] has status EXIST, it is occupied; continue with the next iteration.
  6. When a slot whose _state is not EXIST is found (EMPTY or DELETE), it can store data: write the key-value pair kv into it and set its status to EXIST.
  7. Increment the valid-data count _size to record the successful insertion.

Quadratic probing spreads colliding items further apart, reducing the clustering effect seen with linear probing.

It can be shown that when the table length is a prime number and the load factor a does not exceed 0.5, a new entry can always be inserted and no position is probed twice. Therefore, as long as half of the table's positions are empty, the table can never "fill up": lookups need not worry about a full table, but insertions must keep the load factor at or below 0.5, expanding the table when it would be exceeded. The biggest drawback of closed hashing is therefore its relatively low space utilization.

2. Open hashing

The open hash method is also called the chained-address method (open chain method). A hash function is first applied to the key set to compute hash addresses; keys with the same address form one sub-set, called a bucket. The elements within a bucket are linked in a singly linked list, and the head pointer of each list is stored in the hash table.


Each bucket in the open hash table holds the elements that collided at that hash address.

template<class K, class V>
struct HashNode
{
	pair<K, V> _kv;
	HashNode<K, V>* _next;

	HashNode(const pair<K, V>& kv)
		:_kv(kv)
		, _next(nullptr)
	{}
};

Define a structure HashNode to represent the nodes of the hash table. A node contains the following members:

  1. _kv: the key-value pair (pair<K, V>) stored in the node.
  2. _next: a pointer to the next node, used to build the linked list that handles hash collisions. When several keys map to the same hash bucket (slot), their nodes are connected through these _next pointers.

In the chained-address method, each hash bucket (slot) maintains a linked list. When multiple keys map to the same slot, their nodes are added to that list and connected through the _next pointers. In this way multiple key-value pairs can live in the same bucket, resolving hash collisions; to find or delete a pair, the list is traversed to locate the specific node. This lets the hash table manage colliding data while keeping good performance.

template<class K, class V>
class HashTable
{
	typedef HashNode<K, V> Node;
private:
	vector<Node*> _tables;
	size_t _size = 0; // number of valid elements stored
};

This defines a hash table template class HashTable for storing key-value pairs. Its main members are:

  • typedef HashNode<K, V> Node;: a type alias that shortens HashNode<K, V> to Node for readability.
  • private:: an access specifier marking the following members as private, inaccessible from outside the class.
  • vector<Node*> _tables;: a vector holding the hash buckets (slots). Each element is a pointer to a HashNode<K, V>, the head node of that bucket's linked list; together the buckets hold all of the table's data.
  • size_t _size = 0;: a counter recording the number of valid elements. Insertions, deletions and other operations update it so the element count stays accurate.

The HashTable class implements a hash table that stores key-value pairs and supports the basic operations: insertion, lookup, and deletion. It resolves collisions with the chained-address method, using a vector of buckets where each bucket heads a linked list, while _size tracks the amount of valid data to help manage the load factor and automatic expansion.

insert function

As with closed hashing, insertion must consider expansion. The number of buckets is fixed, so as elements keep being inserted the chains grow; in the extreme case, one bucket's list may hold many nodes, hurting performance. The table therefore needs to expand under some condition. How to choose that condition? The ideal situation is exactly one node per bucket; beyond that point, every further insertion causes a collision. So when the number of elements equals the number of buckets (load factor 1), the table is expanded.

bool Insert(const pair<K, V>& kv)
{
	// dedupe: refuse existing keys
	if (Find(kv.first))
	{
		return false;
	}

	// expand when the load factor reaches 1
	if (_size == _tables.size())
	{
		size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
		vector<Node*> newTables;
		newTables.resize(newSize, nullptr);
		// move the old table's nodes and remap them into the new table
		for (size_t i = 0; i < _tables.size(); ++i)
		{
			Node* cur = _tables[i];
			while (cur)
			{
				Node* next = cur->_next;

				size_t hashi = cur->_kv.first % newTables.size();
				cur->_next = newTables[hashi];
				newTables[hashi] = cur;

				cur = next;
			}

			_tables[i] = nullptr;
		}

		_tables.swap(newTables);
	}

	size_t hashi = kv.first % _tables.size();
	// head insert
	Node* newnode = new Node(kv);
	newnode->_next = _tables[hashi];
	_tables[hashi] = newnode;
	++_size;

	return true;
}
  1. First check whether a node with the same key already exists by calling Find(kv.first). If it does, return false: duplicate keys are not allowed, so the insertion fails.
  2. Next check the load factor, the ratio of stored elements _size to bucket count _tables.size(). If it reaches 1 (on average one element per bucket), expand.
  3. To expand, compute the new table size newSize: 10 if the table is currently empty (_tables.size() is 0), otherwise double the current size. Then create a new vector newTables of that size with every element initialized to nullptr.
  4. Traverse every slot of the current table _tables (each slot heads a linked list) and remap its nodes into the new buckets: for each node, recompute the slot index hashi with the new table size, splice the node into the new bucket with head insertion, updating its _next pointer. Once a slot is done, set the old slot to nullptr.
  5. Finally, swap the old and new tables so the new table takes effect, completing the expansion.
  6. After any expansion, compute the key's slot index hashi, head-insert the new node into that bucket's list, increment the valid-data count _size, and return true for a successful insertion.

The core idea of this code is to maintain the hash table's load factor, triggering an expansion when it gets too high so that performance is preserved. Collisions are handled through linked lists, supporting multiple keys mapped to the same bucket. Note that an existing key is never inserted again, ensuring key uniqueness.

Find function

Node* Find(const K& key)
{
	if (_tables.size() == 0)
	{
		return nullptr;
	}

	size_t hashi = key % _tables.size();
	Node* cur = _tables[hashi];
	while (cur)
	{
		if (cur->_kv.first == key)
		{
			return cur;
		}

		cur = cur->_next;
	}

	return nullptr;
}
  1. First check whether the table is empty, i.e. _tables.size() == 0. If so, there is no data and nullptr is returned immediately.
  2. Compute the key's hash: key modulo _tables.size() gives the slot index hashi, determining which bucket to search.
  3. Initialize a pointer cur to the head node of that bucket, _tables[hashi], the start of the linked list.
  4. Loop over the list. On each iteration, check whether cur is null; if it is, the end of the list was reached without a match, and nullptr is returned.
  5. Otherwise, compare the node's key _kv.first with the target key. If they are equal, a match was found; return cur so the caller can access or modify the data.
  6. If the current node does not match, advance with cur = cur->_next to examine the next node.
  7. Repeat until a match is found or the whole list has been traversed.

delete function

bool Erase(const K& key)
{
    if (_tables.size() == 0)
    {
        return false; // empty table, nothing to erase
    }

    size_t hashi = key % _tables.size();
    Node* cur = _tables[hashi];
    Node* prev = nullptr; // tracks the node before cur

    // walk the chain
    while (cur)
    {
        if (cur->_kv.first == key)
        {
            // found the matching node; unlink it
            if (prev)
            {
                prev->_next = cur->_next; // remove cur from the list
            }
            else
            {
                // cur is the head node: update the bucket's head pointer
                _tables[hashi] = cur->_next;
            }

            delete cur; // release the node's memory
            --_size;    // one fewer valid element
            return true; // erased successfully
        }

        prev = cur;
        cur = cur->_next; // advance to the next node
    }

    return false; // no matching key; erase failed
}
  1. If the matching node is not the head of the list, point the previous node's _next at the node after the current one, unlinking the current node from the list.
  2. If the matching node is the head of the list, update the bucket's head pointer _tables[hashi] to the current node's successor so the list head stays correct.
  3. Release the node's memory and decrement the valid-data count _size.
  4. Return true to indicate a successful deletion.
  5. If no matching key is found, return false to indicate that the deletion failed.

destructor

~HashTable()
{
    for (size_t i = 0; i < _tables.size(); ++i)
    {
        Node* cur = _tables[i];
        while (cur)
        {
            Node* next = cur->_next;
            delete cur; // nodes were created with new, so release with delete
            cur = next;
        }
        _tables[i] = nullptr;
    }
}

Iterate over every slot of the hash table, release the nodes of its linked list, and then set the slot to nullptr, ensuring all allocated memory is freed. The main logic:

  1. Use a for loop to traverse _tables, which holds all the slots of the table.
  2. In each iteration, obtain the head node of the current slot's list: Node* cur = _tables[i].
  3. Enter the inner while loop over the list; each iteration first saves the following node in Node* next.
  4. Release the current node with delete cur, which matches the new used in Insert and runs the node's destructor (calling free here would skip the destructor and mismatch the allocator).
  5. Advance cur to next and continue traversing the list.
  6. Loop until the list is exhausted, i.e. cur becomes nullptr.
  7. After each slot's list is freed, set _tables[i] to nullptr so the slot no longer points at freed nodes.

All code

#pragma once
#include<iostream>
#include<vector>
using namespace std;

template<class K, class V>
struct HashNode
{
	pair<K, V> _kv;
	HashNode<K, V>* _next;

	HashNode(const pair<K, V>& kv)
		:_kv(kv)
		, _next(nullptr)
	{}
};

template<class K, class V>
class HashTable
{
	typedef HashNode<K, V> Node;
public:

	~HashTable()
	{
		for (size_t i = 0; i < _tables.size(); ++i)
		{
			Node* cur = _tables[i];
			while (cur)
			{
				Node* next = cur->_next;
				delete cur; // nodes were created with new, so release with delete
				cur = next;
			}
			_tables[i] = nullptr;
		}
	}

	bool Insert(const pair<K, V>& kv)
	{
		// dedupe: refuse existing keys
		if (Find(kv.first))
		{
			return false;
		}

		// expand when the load factor reaches 1
		if (_size == _tables.size())
		{
			size_t newSize = _tables.size() == 0 ? 10 : _tables.size() * 2;
			vector<Node*> newTables;
			newTables.resize(newSize, nullptr);
			// move the old table's nodes and remap them into the new table
			for (size_t i = 0; i < _tables.size(); ++i)
			{
				Node* cur = _tables[i];
				while (cur)
				{
					Node* next = cur->_next;

					size_t hashi = cur->_kv.first % newTables.size();
					cur->_next = newTables[hashi];
					newTables[hashi] = cur;

					cur = next;
				}

				_tables[i] = nullptr;
			}

			_tables.swap(newTables);
		}

		size_t hashi = kv.first % _tables.size();
		// head insert
		Node* newnode = new Node(kv);
		newnode->_next = _tables[hashi];
		_tables[hashi] = newnode;
		++_size;

		return true;
	}

	Node* Find(const K& key)
	{
		if (_tables.size() == 0)
		{
			return nullptr;
		}

		size_t hashi = key % _tables.size();
		Node* cur = _tables[hashi];
		while (cur)
		{
			if (cur->_kv.first == key)
			{
				return cur;
			}

			cur = cur->_next;
		}

		return nullptr;
	}

	bool Erase(const K& key)
	{
		if (_tables.size() == 0)
		{
			return false; // empty table, nothing to erase
		}

		size_t hashi = key % _tables.size();
		Node* cur = _tables[hashi];
		Node* prev = nullptr; // tracks the node before cur

		// walk the chain
		while (cur)
		{
			if (cur->_kv.first == key)
			{
				// found the matching node; unlink it
				if (prev)
				{
					prev->_next = cur->_next; // remove cur from the list
				}
				else
				{
					// cur is the head node: update the bucket's head pointer
					_tables[hashi] = cur->_next;
				}

				delete cur; // release the node's memory
				--_size;    // one fewer valid element
				return true; // erased successfully
			}

			prev = cur;
			cur = cur->_next; // advance to the next node
		}

		return false; // no matching key; erase failed
	}

private:
	vector<Node*> _tables;
	size_t _size = 0; // number of valid elements stored
};

Comparison of open hashing and closed hashing

The chained-address method appears to add storage overhead, since every node carries a link pointer. In fact, because open addressing must keep a large fraction of slots free to preserve search efficiency (for example, quadratic probing requires a load factor a <= 0.7), and a table entry usually occupies far more space than a pointer, the chained method does not use more storage than open addressing; it can actually save space.

Origin: blog.csdn.net/kingxzq/article/details/133207145