C# Dictionary

Author: Zeng Zhiwei
Link: https://zhuanlan.zhihu.com/p/96633352
Source: Zhihu

I had been using C# for two or three years when, one day, I was asked how C# Dictionary is implemented under the hood. That made me realize I had always just borrowed these classes and used them, without ever studying the underlying design. A scalp-tingling thought. So let's start learning the things I usually take for granted, beginning today with the Dictionary source code.

1. Dictionary source code learning

To study the Dictionary implementation, we read it side by side with the source code. The version used here is .NET Framework 4.8.
Source code address: dictionary.cs

Here we first introduce several key types and fields in Dictionary, then follow the code through the processes of insertion, deletion and expansion.

1. Entry structure

First, we introduce the Entry structure, whose definition is shown in the following code. This is the smallest unit for storing data in a Dictionary. Every element added by calling Add(key, value) is wrapped in such a structure.

        private struct Entry {
            public int hashCode;    // Lower 31 bits of hash code, -1 if unused
            public int next;        // Index of next entry, -1 if last
            public TKey key;        // Key of entry
            public TValue value;    // Value of entry
        }

2. Other key private variables

private int[] buckets; // Hash buckets
private Entry[] entries; // Entry array, stores the elements
private int count; // Current index position in entries
private int version; // Current version, guards against the collection being modified during iteration
private int freeList; // Index in entries of a removed Entry; this slot is free
private int freeCount; // Number of removed entries, i.e. number of free slots
private IEqualityComparer<TKey> comparer; // Comparer
private KeyCollection keys; // Collection holding the keys
private ValueCollection values; // Collection holding the values

3. Construction of Dictionary

        private void Initialize(int capacity)
        {
            int prime = HashHelpers.GetPrime(capacity);
            this.buckets = new int[prime];
            for (int i = 0; i < this.buckets.Length; i++)
            {
                this.buckets[i] = -1;
            }
            this.entries = new Entry[prime];
            this.freeList = -1;
        }

We see that Dictionary does the following when it is constructed:

  1. Initializes this.buckets = new int[prime], with every slot set to -1.
  2. Initializes this.entries = new Entry[prime].
  3. Sizes both buckets and entries to a prime number, returned by HashHelpers.GetPrime, that is no smaller than the requested capacity.

Here this.buckets maps a hash code to the index of the first entry in its collision chain, and this.entries stores the contents of the dictionary, with each entry recording the position of the next element in its chain.

4. Dictionary – Add operation

        public void Add(TKey key, TValue value) {
            Insert(key, value, true);
        }

        // Inside Insert, the target bucket is computed from the hash code:
        int targetBucket = hashCode % buckets.Length;

Let's take Dictionary<int,string> as an example to show how to add elements to Dictionary:

First, we construct a dictionary with capacity 6. The size of buckets and entries is then 7, the smallest prime number not less than that capacity:

Dictionary<int, string> test = new Dictionary<int, string>(6);

test.Add(4, "4");

According to the hash computation int targetBucket = hashCode % buckets.Length, with buckets.Length equal to 7, we get 4.GetHashCode() % 7 = 4, so the key maps to the slot with index 4 in buckets. Since count is 0 at this point, the element is stored at index 0 of entries, buckets[4] is set to 0, and count becomes 1.

test.Add(11, "11");

Here 11.GetHashCode() % 7 = 4, so the key lands on slot 4 of buckets again. The value in that slot is no longer -1, which means a collision. Since count is 1, the new element is stored at index 1 of entries; buckets[4] now points to entry 1, and entry 1's next field points to entry 0, so the new element becomes the head of the chain.

test.Add(18, "18");
test.Add(19, "19");

Key 18 also maps to slot 4 (18 % 7 = 4) and becomes the new head of that chain, while key 19 maps to slot 5 (19 % 7 = 5).
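The four Add calls above can be replayed with a minimal sketch of the buckets/entries layout. The class MiniDictionary and its members are hypothetical names chosen for illustration; the sketch handles only int keys, never resizes, and is a simplified model rather than the framework's code.

```csharp
using System;

// Hypothetical, simplified model of the buckets/entries layout (int keys, no resizing).
public class MiniDictionary
{
    public struct Entry
    {
        public int HashCode; // lower 31 bits of the hash code
        public int Next;     // index of the next entry in the same chain, -1 if last
        public int Key;
        public string Value;
    }

    public readonly int[] Buckets;   // Buckets[b] = index of the chain head in Entries, -1 if empty
    public readonly Entry[] Entries;
    public int Count;                // next unused index in Entries

    public MiniDictionary(int size)
    {
        Buckets = new int[size];
        for (int i = 0; i < size; i++) Buckets[i] = -1;
        Entries = new Entry[size];
    }

    public void Add(int key, string value)
    {
        if (Count == Entries.Length)
            throw new InvalidOperationException("Full (this sketch does not resize).");

        int hashCode = key.GetHashCode() & 0x7FFFFFFF;
        int target = hashCode % Buckets.Length;

        // Walk the chain to reject duplicate keys.
        for (int i = Buckets[target]; i >= 0; i = Entries[i].Next)
            if (Entries[i].HashCode == hashCode && Entries[i].Key == key)
                throw new ArgumentException("Duplicate key.");

        // Store at the next free index and make the new entry the head of the chain.
        int index = Count++;
        Entries[index] = new Entry { HashCode = hashCode, Next = Buckets[target], Key = key, Value = value };
        Buckets[target] = index;
    }
}
```

With a size of 7, adding the keys 4, 11, 18 and 19 in that order leaves Buckets[4] pointing at entry 2 (key 18), whose chain continues to entry 1 (key 11) and then entry 0 (key 4), while Buckets[5] points at entry 3 (key 19).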

5. Dictionary – Remove operation

test.Remove(4);

When deleting, we hash the key to find its bucket and then walk the linked list. Here it takes three steps along the chain (18, 11, 4) to find the element with key 4. The element is unlinked from the chain, freeList is set to the index of the deleted entry, and freeCount becomes 1.

Deleted entries form a freeList linked list. When adding data, a slot from the freeList is reused first; only if the freeList is empty is the element placed at the next position given by count.
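The interplay between Remove and the free list can be sketched as follows. FreeListDictionary is a hypothetical, cut-down model (int keys, fixed size, no duplicate check), written only to illustrate how a removed slot is chained into freeList and reused by the next Add.

```csharp
using System;

// Hypothetical sketch: int keys, fixed size, no resizing, no duplicate-key check.
public class FreeListDictionary
{
    private struct Entry { public int HashCode, Next, Key; public string Value; }

    private readonly int[] buckets;
    private readonly Entry[] entries;
    private int count;          // next never-used slot in entries
    public int FreeList = -1;   // head of the chain of removed slots, -1 if none
    public int FreeCount;       // number of removed slots waiting for reuse

    public FreeListDictionary(int size)
    {
        buckets = new int[size];
        for (int i = 0; i < size; i++) buckets[i] = -1;
        entries = new Entry[size];
    }

    // Returns the index in entries where the element was stored.
    public int Add(int key, string value)
    {
        int hashCode = key & 0x7FFFFFFF;
        int target = hashCode % buckets.Length;
        int index;
        if (FreeCount > 0)
        {
            // Reuse a removed slot first.
            index = FreeList;
            FreeList = entries[index].Next;
            FreeCount--;
        }
        else
        {
            if (count == entries.Length) throw new InvalidOperationException("Full.");
            index = count++;
        }
        entries[index].HashCode = hashCode;
        entries[index].Next = buckets[target];
        entries[index].Key = key;
        entries[index].Value = value;
        buckets[target] = index;
        return index;
    }

    public bool Remove(int key)
    {
        int hashCode = key & 0x7FFFFFFF;
        int target = hashCode % buckets.Length;
        int last = -1;
        for (int i = buckets[target]; i >= 0; last = i, i = entries[i].Next)
        {
            if (entries[i].HashCode == hashCode && entries[i].Key == key)
            {
                // Unlink the entry from its chain.
                if (last < 0) buckets[target] = entries[i].Next;
                else entries[last].Next = entries[i].Next;
                // Push the slot onto the free list.
                entries[i].HashCode = -1;
                entries[i].Next = FreeList;
                entries[i].Key = 0;
                entries[i].Value = null;
                FreeList = i;
                FreeCount++;
                return true;
            }
        }
        return false;
    }
}
```

After adding 4, 11, 18 and 19 to an instance of size 7 and then removing 4, FreeList is 0 and FreeCount is 1; the next Add stores its element back in slot 0 rather than in slot 4.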

6. Dictionary – Resize operation (capacity expansion)

After seeing the Add operation, careful readers may ask: buckets and entries are just two arrays, so what happens when they are full? That is where the Resize (capacity expansion) operation comes in, which enlarges buckets and entries.

6.1 Trigger conditions for expansion operations

First, we need to know under what circumstances expansion occurs.

The first situation is naturally that the array is full and there is no room to store new elements, as shown in the figure below.

The second is that too many collisions occur in the Dictionary, which seriously hurts performance and also triggers a Resize.

Hash operations inevitably produce conflicts. Dictionary resolves them by chaining (the "zipper" method), but look at the situation in the figure below: all elements fall exactly on buckets[3], so the time complexity of a lookup becomes O(n) and search performance drops.

6.2 How to perform capacity expansion operation

To demonstrate this clearly, the following data structure is simulated: a Dictionary of size 2, with an assumed collision threshold of 2, in which a collision-triggered expansion has just been triggered.

  • 1. Allocate new buckets and entries of twice the current size.
  • 2. Copy the existing elements to the new entries.
  • 3. If the expansion was triggered by hash collisions, recalculate the hash codes with the new hash function.
  • 4. For each element of entries, compute bucket = newEntries[i].hashCode % newSize to determine its position in the new buckets.
  • 5. Rebuild the hash chains: newEntries[i].next = buckets[bucket]; buckets[bucket] = i;
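Steps 1, 2, 4 and 5 above can be sketched as a small standalone function. ResizeSketch and its Entry type are hypothetical names for illustration; the caller passes the new size, and step 3 (recomputing hash codes with a new hash function) is deliberately left out of this sketch.

```csharp
using System;

// Hypothetical sketch of the rebuild performed during expansion (step 3 omitted).
public static class ResizeSketch
{
    public struct Entry
    {
        public int HashCode; // -1 marks a removed slot
        public int Next;
        public int Key;
        public string Value;
    }

    // Returns the new buckets array; newEntries receives the enlarged entries array.
    public static int[] Resize(Entry[] entries, int count, int newSize, out Entry[] newEntries)
    {
        // Step 1: allocate the larger arrays, with every bucket marked empty.
        int[] newBuckets = new int[newSize];
        for (int i = 0; i < newSize; i++) newBuckets[i] = -1;
        newEntries = new Entry[newSize];

        // Step 2: copy the existing elements.
        Array.Copy(entries, 0, newEntries, 0, count);

        for (int i = 0; i < count; i++)
        {
            if (newEntries[i].HashCode >= 0) // skip removed slots
            {
                // Step 4: find the bucket under the new size.
                int bucket = newEntries[i].HashCode % newSize;
                // Step 5: rebuild the chain with this entry as the new head.
                newEntries[i].Next = newBuckets[bucket];
                newBuckets[bucket] = i;
            }
        }
        return newBuckets;
    }
}
```

For example, two entries with hash codes 4 and 6 share bucket 0 when the size is 2; after resizing to 4 they land in buckets 0 and 2, and neither chain has a second element.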

Key points

The implementation of Dictionary rests on two key algorithms:

  • One is the hash algorithm.
  • The other is the algorithm for resolving hash collisions.

2. Hash algorithm

A hash algorithm is a digest algorithm that maps binary data of arbitrary length to a shorter binary value of fixed length.

The function that implements a hash algorithm is called a hash function. A hash function has the following characteristics.

Hashing the same data always yields the same result: HashFunc(key1) == HashFunc(key1).
Hashing different data may yield the same result (a hash collision): key1 != key2 does not rule out HashFunc(key1) == HashFunc(key2).
Hashing is generally irreversible; the original data cannot be recovered from the hash code: key1 => hashCode, but hashCode =/=> key1.
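The first two properties can be checked with long keys, whose GetHashCode folds the upper and lower 32 bits together with XOR. The helper class HashProperties below is a hypothetical name; the two keys in the usage note were picked by hand so that their halves cancel out.

```csharp
using System;

// Hypothetical helpers that check two properties of GetHashCode on long keys.
public static class HashProperties
{
    // Same input, same hash code.
    public static bool IsDeterministic(long key)
    {
        return key.GetHashCode() == key.GetHashCode();
    }

    // Different inputs that nevertheless share a hash code.
    public static bool Collide(long a, long b)
    {
        return a != b && a.GetHashCode() == b.GetHashCode();
    }
}
```

For instance, HashProperties.Collide(0L, 4294967297L) is true: 4294967297 is 0x1_0000_0001, so its upper and lower halves are both 1 and XOR to 0, which is also the hash code of 0L.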

The following figure explains a hash collision clearly. It shows that after the hash operation, Sandra Dee and John Smith both fall into position 02, resulting in a collision.

Common algorithms for constructing Hash functions include the following.

  • 1. Direct addressing method: take the key itself, or a linear function of the key, as the hash address. That is, H(key) = key or H(key) = a·key + b, where a and b are constants (such a hash function is called a self function).
An example application: for the mask of our world map, we directly use coordinate x * 1000 + coordinate y as the key.
  • 2. Digit analysis method: find the patterns in the digits of the keys, and use the digits that vary most to construct a hash address with a low probability of conflict.
Consider a set of data such as the birth dates of a group of employees. The leading digits (the year) are roughly the same, so using them gives a high chance of conflict, whereas the trailing digits, which represent the month and the day, vary widely. Using those trailing digits to form the hash address significantly reduces the chance of conflict.
  • 3. Mid-square method: square the key and take the middle digits of the result as the hash address.
  • 4. Folding method: cut the key into several parts with the same number of digits (the last part may have fewer), then take the sum of these parts, discarding the carry, as the hash address.
  • 5. Random number method: choose a random function and take its value for the key as the hash address. It is usually used when key lengths vary.
  • 6. Division (remainder) method: take the remainder of the key divided by a number p no larger than the hash table length m as the hash address.
That is, H(key) = key MOD p, p <= m. The modulo can be applied to the key directly, or after folding, squaring and similar operations. The choice of p is very important: generally a prime number or m itself is used, and a poorly chosen p easily causes collisions.
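Four of the constructions above are easy to write down as one-liners. The class HashConstructions is a hypothetical illustration that assumes non-negative keys; the digit positions used for the mid-square and folding variants (three decimal digits) are arbitrary choices made for this sketch.

```csharp
using System;

// Hypothetical sketches of four hash constructions; keys are assumed non-negative.
public static class HashConstructions
{
    // Direct addressing: a linear function of the coordinates, as in the map example.
    public static int Direct(int x, int y)
    {
        return x * 1000 + y;
    }

    // Mid-square: square the key, drop the last two digits, keep the next three.
    public static int MidSquare(int key)
    {
        long square = (long)key * key;
        return (int)(square / 100 % 1000);
    }

    // Folding: split into 3-digit groups from the right, sum them, discard the carry.
    public static int Fold(int key)
    {
        int sum = 0;
        while (key > 0)
        {
            sum += key % 1000;
            key /= 1000;
        }
        return sum % 1000;
    }

    // Division (remainder) method: H(key) = key MOD p.
    public static int Division(int key, int p)
    {
        return key % p;
    }
}
```

As a worked example, 1234 squared is 1522756, so MidSquare(1234) yields 227; Fold(123456789) sums 789 + 456 + 123 = 1368 and drops the carry to give 368.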

7. Hash bucket algorithm

When it comes to hash algorithms, everyone thinks of the hash table. A key quickly yields a hashCode through the hash function, and through a mapping from the hashCode the value can be obtained directly.
However, the range of hashCode values is generally very large, often 2^32 or more, so it is impossible to assign a separate mapping to each hashCode.

Because of this problem, people map the generated hashCode in segments, calling each segment a bucket. A common way to pick the bucket is to take the remainder of the hash code directly.

Assume the generated hashCode can take 2^32 values, and we map them onto 8 buckets. Then bucketIndex = HashFunc(key1) % 8 determines which specific bucket a hashCode is mapped to.

Dictionary uses the hash bucket algorithm

int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
int targetBucket = hashCode % buckets.Length;
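The mask with 0x7FFFFFFF matters because GetHashCode may return a negative number, and in C# the % operator keeps the sign of the dividend, so a negative hash code would produce a negative array index. The helper below is a hypothetical wrapper around the two lines above, using the default equality comparer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper wrapping the two-line bucket computation shown above.
public static class BucketIndex
{
    public static int For<TKey>(TKey key, int bucketCount)
    {
        IEqualityComparer<TKey> comparer = EqualityComparer<TKey>.Default;
        // Clear the sign bit so the remainder below is never negative.
        int hashCode = comparer.GetHashCode(key) & 0x7FFFFFFF;
        return hashCode % bucketCount;
    }
}
```

With 7 buckets, the keys 4 and 11 both map to bucket 4. The key -5 would give -5 % 7 = -5 without the mask; with the mask its hash code becomes 2147483643, which maps to bucket 4.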

3. Hash collision conflict resolution algorithm

For a hash algorithm, conflicts will inevitably occur, so how to handle them is a critical point. Common conflict resolution algorithms include the zipper method (separate chaining, used in the Dictionary implementation), open addressing, rehashing, and establishing a common overflow area.

1. Zipper method (open hashing): link the conflicting elements into a singly linked list and store its head pointer in the corresponding bucket of the hash table. After locating the bucket, the element can be found by traversing the singly linked list.
2. Open addressing method (closed hashing): when a hash conflict occurs and the hash table is not full, there must be an empty position somewhere in the table, so the key can be stored in the "next" empty position after the conflicting one.
3. Rehash method: as the name suggests, the key is hashed again with other hash functions until a non-conflicting position is found.

1. Zipper method

2. Open addressing method

Suppose there is a key set {1,4,5,6,7,9}, the capacity of the hash structure is 10, and the hash function is Hash(key) = key % 10. Insert all keys into the hash structure, as shown in the figure.

If key 24 is now inserted, the hash function gives address 4, but an element is already stored at that address, so a hash conflict occurs.

Linear probing: starting from the position where the conflict occurs, probe forward one slot at a time until the next empty position is found. In the example above, inserting key 24 with linear probing gives the result shown below.
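The example can be reproduced with a short sketch of linear probing. LinearProbing.Insert is a hypothetical helper that assumes non-negative integer keys and uses null to mark an empty slot; it covers insertion only, not lookup or deletion.

```csharp
using System;

// Hypothetical sketch of insertion with linear probing (non-negative int keys).
public static class LinearProbing
{
    // Returns the slot where the key was placed, or -1 if the table is full.
    public static int Insert(int?[] table, int key)
    {
        int start = key % table.Length; // Hash(key) = key % capacity
        for (int step = 0; step < table.Length; step++)
        {
            // Probe forward one slot at a time, wrapping around at the end.
            int slot = (start + step) % table.Length;
            if (table[slot] == null)
            {
                table[slot] = key;
                return slot;
            }
        }
        return -1;
    }
}
```

With a table of capacity 10, the keys 1, 4, 5, 6, 7 and 9 each land in the slot equal to their value. Inserting 24 then starts at slot 4, finds slots 4 through 7 occupied, and stores the key in slot 8.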

Limitations:

1. This method requires the key to be an integer before the modulo can be taken, so non-integer types must first be converted to integers.

2. The modulus is preferably a prime number, which requires us to build a prime number table.

3. The capacity expansion problem.


Origin blog.csdn.net/qq_40097668/article/details/124441090