Golang's map implementation (2)

How to expand

As a map grows, its demand for space grows with it. How to expand transparently, without hurting performance too much, is what we discuss here.

Considering only spatial growth, there are two ways a map expands: overflow and hashGrow.

  • overflow is an overflow chain: a bucket-level expansion, best understood as a linked list
  • hashGrow is a rehashing implementation. When the space grows, hashing is performed again, but the rehash is not done in one synchronous step; it is amortized over mapassign and mapdelete operations. Releasing the old space is also part of this process; how that happens is discussed under "Data migration" below.

overflow

A map organizes its stored data through the bucket structure: h.buckets holds a group of buckets, and each bucket together with its corresponding overflow area forms a linked list. The head of each list is the bucket inside h.buckets, and bmap is the data structure representing a bucket. Note: this head is not a dummy node; it stores data too.
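
To make that shape concrete, here is a minimal illustrative sketch (my own simplification, not the runtime's real definition: the real bmap declares only the tophash array, and the keys, values, and overflow pointer behind it are reached through unsafe pointer arithmetic):

package main

import "fmt"

const bucketCnt = 8 // each bucket holds up to 8 key/value pairs

// Illustrative only: a bucket as if all its fields were declared.
type bmap struct {
	tophash  [bucketCnt]uint8 // top byte of each slot's hash
	keys     [bucketCnt]uint64
	values   [bucketCnt]uint64
	overflow *bmap // next bucket in this chain
}

func main() {
	head := &bmap{}         // the head bucket in h.buckets; it stores data too
	head.overflow = &bmap{} // an overflow bucket chained behind it
	n := 0
	for b := head; b != nil; b = b.overflow {
		n++
	}
	fmt.Println("buckets in chain:", n) // 2
}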

Adding a new node to the list is implemented by newoverflow. The main work is to obtain space first (either pre-allocated, or newly requested from the system) and then append it to the end of the list. (As for the management of overflow[0], I am not yet sure of the exact conditions.) The corresponding code and commentary follow.

Note that the logic differs somewhat between pre-allocated space and newly requested space.

func (h *hmap) newoverflow(t *maptype, b *bmap) *bmap {
    var ovf *bmap

    if h.extra != nil && h.extra.nextOverflow != nil {
        // If an overflow area was pre-allocated, take a bucket from it first
        ovf = h.extra.nextOverflow
        if ovf.overflow(t) == nil {
            // overflow() reads the last pointer slot of this bucket to see whether it holds a value.
            // A non-nil value means the pre-allocated overflow area is used up.
            // Remember the setup at the end of makeBucketArray? The last bucket's
            // pointer slot stores the start address of h.buckets as a sentinel.
            // nextOverflow then tracks the offset within the pre-allocated area.
            h.extra.nextOverflow = (*bmap)(add(unsafe.Pointer(ovf), uintptr(t.bucketsize)))
        } else {
            // The pre-allocated area is used up. Here ovf is its last bucket:
            // clear its last pointer slot (the sentinel set above),
            // and mark the pre-allocated overflow area as exhausted.
            ovf.setoverflow(t, nil)
            h.extra.nextOverflow = nil
        }
    } else {
        // Nothing was pre-allocated, or it ran out earlier: allocate from the heap.
        ovf = (*bmap)(newobject(t.bucket))
    }

    h.incrnoverflow()

    if t.bucket.kind&kindNoPointers != 0 {
        // I didn't get the point of this check; an explanation from a Go
        // developer is attached below, which I hope to fully digest someday.
        if h.extra == nil {
            h.extra = new(mapextra)
        }
        if h.extra.overflow[0] == nil {
            h.extra.overflow[0] = new([]*bmap)
        }
        *h.extra.overflow[0] = append(*h.extra.overflow[0], ovf)
    }
    b.setoverflow(t, ovf)
    // setoverflow links a node in behind a given node,
    // normally the tail node (linked-list structure)
    return ovf
}

Regarding t.bucket.kind&kindNoPointers, I received the following reply from a Go developer (although I still haven't fully figured it out):

In that code t is *maptype, which is to say it is a pointer to the
type descriptor for the map, essentially the same value you would get
from calling reflect.TypeOf on the map value. t.bucket is a pointer
to the type descriptor for the type of the buckets that the map uses. This type is created by the compiler based on the key and value types
of the map. If the kindNoPointers bit is set in t.bucket.kind, then
the bucket type does not contain any pointers.

With the current implementation, this will be true if the key and value types do not themselves contain any pointers and both types are less than 128 bytes. Whether the bucket type contains any pointers is interesting because the garbage collector never has to look at buckets that contain no pointers. The current map implementation goes to some effort to preserve that property. See the comment in the mapextra type.
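
As a rough illustration of which map types fall on which side of this check (the examples are my own; the 128-byte threshold comes from the reply above):

package main

import "fmt"

func main() {
	// Bucket type contains no pointers: key and value are scalars well
	// under 128 bytes, so the GC can skip scanning these buckets, and
	// newoverflow must keep overflow buckets alive via h.extra instead.
	noPtr := map[int64]float64{1: 2.5}

	// Bucket type contains pointers: a string header embeds a pointer,
	// so the GC scans these buckets (and sees their overflow pointers).
	withPtr := map[string]int{"a": 1}

	// Keys or values larger than 128 bytes are stored indirectly,
	// which also puts pointers into the bucket.
	big := map[[200]byte]int{}

	fmt.Println(len(noPtr), len(withPtr), len(big))
}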

There are two situations in which the overflow area is accessed:
1. The bucket is full
2. A collision occurs

hashGrow

The hashGrow process is as follows:

  1. Request a new space (it may be the same size; that case is not considered here)
  2. Park the current buckets and overflow chains in oldbuckets and the corresponding old chain
  3. Handle the possibly pre-allocated overflow area
  4. Migrate the data (amortized, incrementally)

Note: there is a subtle point in step 2: at the moment of the switch, the map must not already be growing. That is, two growths can never run at the same time. The map implementation guarantees this never happens, as described later.
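
Here is a toy sketch of steps 1, 2, and 4 under that constraint (toyMap and its fields are stand-ins I made up, not the runtime's hmap; step 3, the pre-allocated overflow area, is omitted):

package main

import "fmt"

type toyMap struct {
	B          uint8 // log2 of the number of buckets
	buckets    []int // stand-in for the real bucket array
	oldbuckets []int // previous array, drained later by evacuate
	nevacuate  int   // all old buckets below this index are migrated
}

func (h *toyMap) hashGrow() {
	// Step 2 requires that no growth is already in progress.
	if h.oldbuckets != nil {
		panic("hashGrow while already growing")
	}
	// Step 1: allocate the new space (here always doubled).
	newbuckets := make([]int, len(h.buckets)*2)
	// Step 2: park the current array in oldbuckets.
	h.oldbuckets = h.buckets
	h.buckets = newbuckets
	h.B++
	h.nevacuate = 0
	// Step 4, the actual data movement, happens incrementally later,
	// amortized over mapassign/mapdelete via growWork.
}

func main() {
	h := &toyMap{B: 2, buckets: make([]int, 4)}
	h.hashGrow()
	fmt.Println(h.B, len(h.buckets), len(h.oldbuckets)) // 3 8 4
}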

data migration

hashGrow is a relatively long-lived state, lasting from the allocation of new space until data migration completes. How the data gets migrated in the meantime is the focus of this section. There are two issues to pay attention to:

  1. How data is migrated
  2. How completion of the migration is guaranteed (i.e. how to ensure two hashGrows are never executed at the same time)

The entry code for data migration is as follows; it can be found in mapassign and mapdelete.

    bucket := hash & (uintptr(1)<<h.B - 1)
    if h.growing() {
        growWork(t, h, bucket)
    }

Note: the unit of migration is not a single entry, but an entire bucket!

Below is the code for growWork.

func growWork(t *maptype, h *hmap, bucket uintptr) {
    // make sure we evacuate the oldbucket corresponding
    // to the bucket we're about to use
    evacuate(t, h, bucket&h.oldbucketmask())

    // evacuate one more oldbucket to make progress on growing
    if h.growing() {
        evacuate(t, h, h.nevacuate)
    }
}

Note: growWork calls evacuate twice. The first call evacuates the old bucket that the current key falls into (the normal logic); the second works on h.nevacuate, and it is this call that guarantees that by the time hashGrow runs again, the old data has already been fully migrated.
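
As a worked illustration of what bucket&h.oldbucketmask() selects (the values are my own example, assuming a grow from B=3 to B=4):

package main

import "fmt"

func main() {
	// During a grow from B=3 to B=4 the map has 16 new buckets and
	// 8 old ones. A key that now hashes to new bucket 13 must first
	// have its old home, bucket 13 & 7 = 5, evacuated.
	const B = 4                      // B after the grow
	newMask := uintptr(1)<<B - 1     // 15, the mask used by mapassign
	oldMask := uintptr(1)<<(B-1) - 1 // 7, what h.oldbucketmask() yields here
	hash := uintptr(189)             // 0b10111101
	fmt.Println(hash&newMask, hash&oldMask) // 13 5
}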

There is the following piece of code at the end of evacuate.

    // Advance evacuation mark
    if oldbucket == h.nevacuate {
        h.nevacuate = oldbucket + 1
        // Experiments suggest that 1024 is overkill by at least an order of magnitude.
        // Put it in there as a safeguard anyway, to ensure O(1) behavior.
        stop := h.nevacuate + 1024
        if stop > newbit {
            stop = newbit
        }
        for h.nevacuate != stop && bucketEvacuated(t, h, h.nevacuate) {
            h.nevacuate++
        }
        if h.nevacuate == newbit { // newbit == # of oldbuckets
            // Growing is all done. Free old main bucket array.
            h.oldbuckets = nil
            // Can discard old overflow buckets as well.
            // If they are still referenced by an iterator,
            // then the iterator holds a pointer to the slice.
            if h.extra != nil {
                h.extra.overflow[1] = nil
            }
            h.flags &^= sameSizeGrow
        }
    }

This code, together with the second evacuate call in growWork, ensures that up to two buckets are migrated per operation. The status marker h.nevacuate is not hard to understand: every bucket before it has already been migrated. And since each write operation during growth migrates up to two buckets, while far more write operations are needed to trigger the next growth than there are old buckets to drain, the old buckets are guaranteed to be empty by the time the next hashGrow executes; in other words, the previous expansion has already finished. This solves the second problem raised at the beginning.

Now let's talk about how the data is migrated (only the scenario where the capacity doubles is discussed).

func evacuate(t *maptype, h *hmap, oldbucket uintptr)

The point about the parameter that evacuate accepts is that oldbucket actually represents an offset: the index of a head bucket in the old array. The migration work is then to move that head bucket together with its entire overflow chain.

The traversal logic is as follows: walk the overflow chain, and within each bucket walk the tophash slots to reach all the k/v pairs.

for ; b != nil; b = b.overflow(t) {
    ...
    for i := 0; i < bucketCnt; i++ {
        ...
    }
}

At this point each pair can be rehashed to determine where it is stored in the new space. The interesting part is here.

Because the space is expanded by a factor of 2, the rehashed destination is either the original position, or the original position plus the original capacity.

Based on this logic, go divides the expanded space into two areas, X and Y: X in front, Y behind, each the same size as the space before expansion. How to quickly decide the placement is then the key; it is computed by the following expression.

    useX = hash&newbit == 0

Remember the formula for computing a remainder: mod = hash & (2^n - 1). In binary terms it masks out bit n and everything above it (bit n included). So to judge whether an entry is placed in Y, it is only necessary to check whether bit n is 1.
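
A worked example of this check (my own numbers, assuming a grow from 8 to 16 buckets):

package main

import "fmt"

func main() {
	// Growing from 8 to 16 buckets: n goes 3 -> 4, newbit = 2^3 = 8.
	const newbit = 1 << 3
	hash := uintptr(0b10111101) // 189, an arbitrary example hash

	oldIndex := hash & (newbit - 1)   // hash mod 8  = 5
	newIndex := hash & (2*newbit - 1) // hash mod 16 = 13

	// Only bit 3 of the hash decides the outcome: if it is 0 the entry
	// stays at oldIndex (area X); if it is 1 it moves to
	// oldIndex+newbit (area Y).
	useX := hash&newbit == 0
	fmt.Println(oldIndex, newIndex, useX) // 5 13 false (13 == 5+8, area Y)
}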

Thoughts

When reading this migration logic, I noticed a detail and went back to confirm the logic of mapassign.

It turns out that when writing a key, the code does not write immediately upon finding a writable slot. Instead, it first looks for an equal key; that is, update takes priority over insert. This differs from a book I read before. It guarantees that a key can appear at most once, so a delete only needs to remove one entry.
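
This behavior is easy to observe from the outside (a trivial example of my own):

package main

import "fmt"

func main() {
	m := map[string]int{}
	m["k"] = 1
	m["k"] = 2 // finds the equal key first: an update, not a second insert
	fmt.Println(len(m), m["k"]) // 1 2

	delete(m, "k") // one delete suffices, since the key appears only once
	fmt.Println(len(m)) // 0
}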

Leaving tophash aside, performance would be middling: lookups for inserts, deletes, and updates would all cost roughly the same. And since the data is stored without redundancy, there is no skew.

Having read it all, here are the parts of the design I find well done:

tophash

  1. Bounding comparison cost: keys may have variable length, so using the raw key for collision checks would make that cost uncontrollable.
  2. Caching: the hash is computed once and stored, which with high probability avoids reading the raw key again.
  3. Using the high 8 bits (split into two questions; see the sketch after this list):
    1. Using only 8 bits certainly reduces memory consumption, but why exactly 8 remains to be verified.
    2. The high 8 bits are used because they collide less than the low 8 bits: within one bucket, the low bits of every hash are equal, since they were already used to select the bucket.
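
A sketch of how such a tophash could be computed on a 64-bit platform (my reconstruction; minTopHash = 4 matches the runtime of this era, where smaller values are reserved as empty/evacuated markers, but treat it as an assumption):

package main

import "fmt"

const minTopHash = 4 // assumed: values below this mark empty/evacuated slots

func tophash(hash uint64) uint8 {
	top := uint8(hash >> 56) // keep only the top 8 bits as a cheap filter
	if top < minTopHash {
		top += minTopHash // shift real hashes above the reserved markers
	}
	return top
}

func main() {
	fmt.Printf("%#x\n", tophash(0xDEADBEEF12345678)) // 0xde
}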

Others

1. Keys and values are stored as references, which keeps the data structure compact; and because keys and values live in separate regions, locating them is much more convenient (plenty of other material covers this).
2. Amortizing the data migration over time is a deliberate design choice.
3. A very nice touch is the choice between X and Y during rehash, even though I didn't understand it at first.
4. There is another very obscure but wonderful point: the implementation implicitly guarantees that two hashGrows can never be in progress at the same time. Perhaps this is the so-called "unity of mind and hand."

References
https://www.gitbook.com/book/tiancaiamao/go-internals/details

The article was first published on my personal blog https://blog.i19.me
