Golang's map implementation (1)

Overview

A hash table is a data structure commonly used in engineering that provides fast lookup and update, generally with O(1) complexity.

This blog post has two parts: the first is a study of the source code; the second covers some internal implementation details, some interesting points, and personal thoughts.

theory

There are two problems that the hash table needs to solve

  • location index
  • data collision

Indexing is handled by a hash function (hash algorithm); a commonly used one is the modulo operation.

There are three main ways to resolve collisions:

  1. Separate chaining: use linked lists to store conflicting keys, then tell them apart by traversal (a change at the storage level)
  2. Open addressing
    1. Linear probing (both the storage level and the algorithm level are adjusted)
    2. Quadratic probing (same as above, with only a small change at the algorithm level)
    3. Double hashing
  3. Rehashing (expansion and data migration)

Extendible hashing is used for scenarios where the data is too large to fit into memory, and will not be discussed here.

The efficiency of a hash table is related to the load factor (filling factor), which is used to measure its average complexity. It is generally calculated as the amount of data already stored / the number of indexable addresses. In other words, it is the average number of entries at a single index address.
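As a small illustration of that ratio, here is a minimal sketch. The helper name loadFactor is made up for this example and is not taken from the Go runtime.

```go
package main

import "fmt"

// loadFactor is a hypothetical helper: the number of stored entries
// divided by the number of indexable slots.
func loadFactor(count, slots int) float64 {
	return float64(count) / float64(slots)
}

func main() {
	// 13 entries spread over 8 slots.
	fmt.Println(loadFactor(13, 8)) // prints 1.625
}
```

A value above 1 means that, on average, more than one entry shares each index address, so some form of collision handling is doing real work.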

collision resolution

Ideally, when there are no collisions, a hash structure can be implemented with just an array and a hash algorithm. However, collisions cannot be completely avoided, so there are several ways to resolve them.

separate chaining

The core of separate chaining is to handle collisions with linked lists. The array is used for indexing, and each array slot holds a linked list. The linked list stores the keys and values that collide on the same index. The key is stored so that, when a conflict occurs, the entry can still be located by comparison.
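The idea can be sketched in a few lines of Go. This is only an illustration under my own assumptions: the type chainedMap and its methods are hypothetical, the table has a fixed size, and FNV-1a from the standard library stands in for the hash function.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// node is one link in a chain. It keeps the key so that colliding
// entries can be told apart by comparison.
type node struct {
	key, value string
	next       *node
}

// chainedMap is a hypothetical fixed-size table using separate chaining.
type chainedMap struct {
	buckets []*node
}

func newChainedMap(n int) *chainedMap {
	return &chainedMap{buckets: make([]*node, n)}
}

// index hashes the key with FNV-1a and reduces it with a modulo.
func (m *chainedMap) index(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(len(m.buckets)))
}

// Put updates the key if it is already in the chain, otherwise it
// prepends a new node to the chain.
func (m *chainedMap) Put(key, value string) {
	i := m.index(key)
	for n := m.buckets[i]; n != nil; n = n.next {
		if n.key == key {
			n.value = value
			return
		}
	}
	m.buckets[i] = &node{key: key, value: value, next: m.buckets[i]}
}

// Get walks the chain and compares keys.
func (m *chainedMap) Get(key string) (string, bool) {
	for n := m.buckets[m.index(key)]; n != nil; n = n.next {
		if n.key == key {
			return n.value, true
		}
	}
	return "", false
}

func main() {
	m := newChainedMap(2) // only two slots, so three keys must collide
	m.Put("a", "1")
	m.Put("b", "2")
	m.Put("c", "3")
	m.Put("a", "10") // update, not insert
	v, ok := m.Get("a")
	fmt.Println(v, ok)
	_, ok = m.Get("z")
	fmt.Println(ok)
}
```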

open addressing

The problem with linked lists is node allocation, which causes frequent memory operations. If the amount of data is not particularly large, open addressing can be considered. It still uses a relatively large array, but when a collision occurs, the entry is stored at an offset in a fixed direction.
Linear probing and quadratic probing differ only in how the offset is chosen. Double hashing is omitted here.
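A minimal sketch of linear probing, assuming a fixed-capacity table with no deletion and no growth. The names probeMap and slot are hypothetical. Quadratic probing would differ only in the offset (i*i instead of i).

```go
package main

import "fmt"

// slot is one cell of the array; used marks whether it holds an entry.
type slot struct {
	key   uint
	value string
	used  bool
}

// probeMap is a hypothetical fixed-capacity table with linear probing.
type probeMap struct {
	slots []slot
}

func newProbeMap(n int) *probeMap {
	return &probeMap{slots: make([]slot, n)}
}

// Put starts at key % n and steps forward one slot at a time until it
// finds the key or a free slot. It reports false when the table is full.
func (m *probeMap) Put(key uint, value string) bool {
	n := uint(len(m.slots))
	for i := uint(0); i < n; i++ {
		s := &m.slots[(key%n+i)%n]
		if !s.used || s.key == key {
			*s = slot{key: key, value: value, used: true}
			return true
		}
	}
	return false
}

// Get follows the same probe sequence and stops at the first free slot.
func (m *probeMap) Get(key uint) (string, bool) {
	n := uint(len(m.slots))
	for i := uint(0); i < n; i++ {
		s := m.slots[(key%n+i)%n]
		if !s.used {
			return "", false
		}
		if s.key == key {
			return s.value, true
		}
	}
	return "", false
}

func main() {
	m := newProbeMap(4)
	m.Put(1, "one")
	m.Put(5, "five") // 5 % 4 == 1 collides with key 1 and lands in the next slot
	v, ok := m.Get(5)
	fmt.Println(v, ok)
}
```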

rehashing

To some extent, collisions are caused by too little storage space. The idea of rehashing is to allocate a larger space, then recompute the index of each entry and relocate the data.
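A minimal sketch of that idea, assuming the key itself serves as the hash and the table simply doubles. The function rehash is a hypothetical helper, not the runtime's growth logic.

```go
package main

import "fmt"

// rehash is a hypothetical helper that moves every key from old into a
// table twice as large, recomputing each index against the new size.
func rehash(old [][]uint) [][]uint {
	grown := make([][]uint, 2*len(old))
	for _, chain := range old {
		for _, k := range chain {
			i := k % uint(len(grown))
			grown[i] = append(grown[i], k)
		}
	}
	return grown
}

func main() {
	// Keys 1, 5 and 9 all collide in a table of size 4 (k % 4 == 1).
	old := [][]uint{nil, {1, 5, 9}, nil, nil}
	grown := rehash(old)
	// With 8 slots, 1 and 9 stay at index 1 while 5 moves to index 5.
	fmt.Println(grown[1], grown[5]) // prints [1 9] [5]
}
```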

Map implementation in Golang

The map in golang is a hash table, and its implementation uses linked lists and rehash.

Linked lists are used for collisions at a smaller scale, and rehash is used once the load factor grows larger.

Note: These notes are based on go version 1.9.2.

data structure

A Golang map does not directly store the key and value passed in; it uses their references together with the high-order bits of the key's hash value (more on that later).

The following is part of the map data structure, showing only the fields mainly related to storage.

type hmap struct {
    B          uint8
    buckets    unsafe.Pointer
    oldbuckets unsafe.Pointer
    extra      *mapextra
}

buckets and oldbuckets are pointers to a contiguous region of memory. That region mainly stores the references to keys and values; for now, think of it as the data part. oldbuckets is used only during expansion. Both play a role similar to the array in the "separate chaining" approach described earlier and are used for the initial indexing.

type mapextra struct {
    overflow [2]*[]*bmap
    nextOverflow *bmap
}

For now, just focus on nextOverflow. It points to a contiguous region similar to buckets, except that its last bucket holds an address (buckets and oldbuckets do not have this). As the name suggests, when space runs short, but not short enough to trigger the rehash logic, this temporary space requested from the system serves as a buffer.

type bmap struct {
    tophash [bucketCnt]uint8
}

bmap is part of the bucket data structure. Its job is to roughly locate an entry within this linked list. tophash is an array of 8 elements whose values are the high-order bits of the keys' hash values. When a key is passed in, the high-order bits of its hash are compared against these values to determine the array subscript, which in turn leads to where the key is stored.
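To make the split between "low bits pick the bucket" and "high bits go into tophash" concrete, here is a rough sketch. It assumes a 64-bit hash, and the constant minTopHash = 4 reflects my reading of the Go 1.9 source (small tophash values are reserved as state markers); treat both as assumptions rather than a copy of the runtime code.

```go
package main

import "fmt"

// minTopHash is assumed from the Go 1.9 runtime: tophash values below
// it are reserved as markers (empty, evacuated, ...).
const minTopHash = 4

// tophash keeps the top 8 bits of a 64-bit hash, bumped past the
// reserved marker values.
func tophash(hash uint64) uint8 {
	top := uint8(hash >> 56)
	if top < minTopHash {
		top += minTopHash
	}
	return top
}

// bucketIndex keeps the low B bits of the hash, which is the same as
// hash modulo 2^B.
func bucketIndex(hash uint64, B uint8) uint64 {
	return hash & (uint64(1)<<B - 1)
}

func main() {
	hash := uint64(0xAB00000000000005)
	fmt.Println(tophash(hash), bucketIndex(hash, 3)) // prints 171 5
}
```

Comparing one byte per slot is cheap, so most non-matching slots can be skipped without ever touching the key itself.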

memory structure

The following structures are drawn from my personal understanding of the source code, so there may be deviations. Strictly speaking they belong after the map operations, but they are placed first to help with understanding the operations that follow.

Leaving expansion aside, a map stores data in buckets first and uses the overflow area only when there is not enough space (taking this as given for now). So the structures of these two are shown below.

bucket

bmap
|    data alignment
|    |
|    |  |    key field  |  value field  | pointer to the next bucket
|____|__|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|

bmap is a contiguous space of 8 uint8s used to store tophash. The data alignment that follows is a memory-level detail. The key field and the value field store the key references and value references, 8 slots each. The storage location of each reference can be calculated as an offset from tophash, which is how the key and value are obtained.
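The offset arithmetic can be sketched as follows. The helper names are hypothetical, and the sizes passed in main (an 8-byte tophash with no padding, 8-byte keys and values) are assumptions chosen for the example rather than values read from the runtime.

```go
package main

import "fmt"

// bucketCnt is the number of slots per bucket.
const bucketCnt = 8

// keyOffset returns the byte offset of the i-th key within a bucket,
// where dataOffset is where the key field starts (after tophash and
// any alignment padding).
func keyOffset(dataOffset, keySize, i uintptr) uintptr {
	return dataOffset + i*keySize
}

// valueOffset returns the byte offset of the i-th value: all 8 keys
// come first, then the values.
func valueOffset(dataOffset, keySize, valueSize, i uintptr) uintptr {
	return dataOffset + bucketCnt*keySize + i*valueSize
}

func main() {
	// Slot 2 with assumed sizes: dataOffset 8, keys 8 bytes, values 8 bytes.
	fmt.Println(keyOffset(8, 8, 2), valueOffset(8, 8, 8, 2)) // prints 24 88
}
```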

nextOverFlow

|_|_|_|_|_|_|_-|

Between every two vertical bars is one bucket.

nextOverflow points to a group of contiguous bucket-sized spaces and serves the same purpose as the bucket above. However, the last bucket of nextOverflow, the |----| above, has a special use: it does not store data but acts as a terminator that signals the end of the buffer. This is implemented by storing the head address of buckets from the hmap structure in the last reference-sized space.

Note: When more than 8 buckets need to be created, golang pre-allocates the nextOverflow space to reduce memory operations (a detail I like). In that case buckets and nextOverflow are contiguous in memory, and the structure is as shown below.

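                                          |     | ----> the last bucketSize-sized space
head of buckets             nextOverFlow  |   stores the head address of buckets
|                           |             |  |  |
|                           |             |  |  |
|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|-| the final '-' is sys.PtrSize; not sure about ptrSize and bucketSize

This diagram was the first one I drew, and I couldn't bring myself to delete it.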

interface implementation

In addition to the memory structure above, the interface implementation of map also involves some fields not drawn here, such as those for data races, the load factor, and the data structures used during expansion.

create

Creation really just completes the memory allocation and sets some initial values. It is assumed here that the created space is large, so the initialization of the overflow area is also covered.

makemap is the function that builds the map structure. The following is "pseudo-pseudo-code" extracted from the native code and modified for easier reading.

// hint represents the capacity
func makemap(t *maptype, hint int64) *hmap {
    // Condition checks
    t.keysize = sys.PtrSize = t.key.size
    t.valuesize = sys.PtrSize = t.elem.size

    // Use hint to determine the smallest B that hmap should have.
    // B affects the memory allocation below as well as any future expansion. B is a base (1 << B buckets).
    // overLoadFactor takes the load factor into account. golang sets it to 6.5
    B := uint8(0)
    for ; overLoadFactor(hint, B); B++ {}

    // golang allocates memory lazily
    if B != 0 {
        var nextOverflow *bmap
        buckets, nextOverflow = makeBucketArray(t, B)
        if nextOverflow != nil {
            extra = new(mapextra)
            extra.nextOverflow = nextOverflow
        }
    }

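    // After this, the memory addresses are attached to the hmap structure and the instance is returned
    h.count = 0  // number of stored k/v pairs; used during expansion
    h.B = B  // record the base
    h.flags = 0 // state-related; covers concurrency control and expansion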

    ...
}

// makeBucketArray decides, depending on the situation, whether to allocate nextOverflow.
func makeBucketArray(t *maptype, b uint8) (buckets unsafe.Pointer, nextOverflow *bmap) {
    base := uintptr(1 << b)
    nbuckets := base
    if b >= 4 {
        // adjust nbuckets upward
    }

    // Note: memory is allocated according to nbuckets
    buckets = newarray(t.bucket, int(nbuckets))

    // Handle the overflow case
    if base != nbuckets {
        // Move to the end of the data segment
        nextOverflow = (*bmap)(add(buckets, base*uintptr(t.bucketsize)))

        // Set the end address; see the last pointer slot of nextOverflow in the memory diagram above. It is used for end detection
        last := (*bmap)(add(buckets, (nbuckets-1)*uintptr(t.bucketsize)))
        last.setoverflow(t, (*bmap)(buckets))
    }
    }
    return buckets, nextOverflow
}
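The loop that picks B can be tried in isolation. The sketch below reproduces overLoadFactor as I understand it from the Go 1.9 source (bucketCnt = 8, loadFactor = 6.5); pickB is a hypothetical wrapper around the loop in makemap, so treat the whole thing as an approximation rather than the runtime code.

```go
package main

import "fmt"

const (
	bucketCnt  = 8   // slots per bucket
	loadFactor = 6.5 // average entries per bucket that triggers growth (assumed from Go 1.9)
)

// overLoadFactor reports whether count entries would overload 1<<B buckets.
func overLoadFactor(count int64, B uint8) bool {
	return count >= bucketCnt && float32(count) >= loadFactor*float32(uint64(1)<<B)
}

// pickB is a hypothetical wrapper around the loop in makemap: it finds
// the smallest B whose 1<<B buckets are not overloaded by hint entries.
func pickB(hint int64) uint8 {
	B := uint8(0)
	for overLoadFactor(hint, B) {
		B++
	}
	return B
}

func main() {
	for _, hint := range []int64{0, 8, 13, 14, 100} {
		fmt.Println(hint, pickB(hint))
	}
}
```

With these constants, a hint of 100 ends up with B = 4, that is 16 buckets, since 100 entries would overload 8 buckets (8 * 6.5 = 52) but not 16 (16 * 6.5 = 104).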

Reading this together with the third memory diagram above works better and gives a convenient overall impression.

read

There are two read functions, mapaccess1 and mapaccess2. The former returns a pointer; the latter returns a pointer and a bool that indicates whether the key exists. Only mapaccess1 is covered here. The returned pointer is the address stored in the value field.

func mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
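// If the map is nil or its length is 0, return a zero value

// If the map is being written to, throw an error

// Compute the hash value of the key

// Determine which bucket the key lives in (it may be in buckets or in oldbuckets)
// Using a modulo calculation, first work out which bucket it would be if it were in buckets
// Check whether oldbuckets is nil; if not, use the same method to get the position in oldbuckets
// and check whether that bucket has been evacuated. If it has, use buckets; otherwise use the position in oldbuckets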
    m := uintptr(1)<<h.B - 1
    b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.bucketsize)))  // buckets 结构
    if c := h.oldbuckets; c != nil {
        if !h.sameSizeGrow() {  // whether the last grow was same-size or doubling matters here
            // There used to be half as many buckets; mask down one more power of two.
            m >>= 1
        }
        oldb := (*bmap)(add(c, (hash&m)*uintptr(t.bucketsize)))
        if !evacuated(oldb) {
            b = oldb
        }
    }

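// Once the bucket is found, tophash is used to locate the entry. If it cannot be located, follow the bucket's overflow field and keep searching in a loop.
// The steps above have two outcomes.
// 1 The traversal reaches the end without any tophash hit; a zero value is returned.
// 2 On a tophash hit, compare the keys. If they are equal, return the corresponding val location; if not, continue through overflow; otherwise return a zero value
}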
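At the language level, the two read functions correspond to the two forms of the index expression. As far as I understand, the compiler lowers the single-value form to mapaccess1 (or a type-specialized variant) and the two-value form to mapaccess2. The helper lookup below is hypothetical and exists only to make the example easy to call.

```go
package main

import "fmt"

// lookup is a hypothetical helper wrapping the two-value index form.
func lookup(m map[string]int, key string) (int, bool) {
	v, ok := m[key]
	return v, ok
}

func main() {
	m := map[string]int{"a": 1}

	// Single-value form: a missing key yields the zero value.
	fmt.Println(m["missing"]) // prints 0

	// Two-value form: the bool tells a stored zero from a missing key.
	v, ok := lookup(m, "missing")
	fmt.Println(v, ok) // prints 0 false
	v, ok = lookup(m, "a")
	fmt.Println(v, ok) // prints 1 true

	// Reading from a nil map is allowed and also yields the zero value.
	var empty map[string]int
	fmt.Println(empty["a"]) // prints 0
}
```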

write

The word "write" is used loosely here, because what is finally returned is a pointer address used to store the value. That is, the input key determines the address where the value reference is to be written (think of the value field in the bucket structure).

Before going further into the write operation, let's talk about expansion. As the number of writes increases, expansion is inevitable. Expansion involves allocating new space, migrating the data from the old space, and finally reclaiming the old space. The data migration could be done all at once, but that may make a single operation very slow, so golang migrates lazily. Only when an element in an old bucket needs to be changed does it quietly rehash that old bucket, write its contents into the new buckets, drop the reference to the old bucket, and leave the space to the GC to reclaim.
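A toy model of this lazy migration, under heavy simplification: keys are their own hash, each bucket is represented by a built-in Go map purely to keep the sketch short, and the bookkeeping that ends the growth once every old bucket has been moved is left out. The type lazyMap and its methods are hypothetical.

```go
package main

import "fmt"

// lazyMap is a hypothetical table that migrates one old bucket at a time.
type lazyMap struct {
	buckets    []map[uint]string // current table
	oldbuckets []map[uint]string // set only while growing; a nil entry means "evacuated"
}

func makeBuckets(n int) []map[uint]string {
	b := make([]map[uint]string, n)
	for i := range b {
		b[i] = map[uint]string{}
	}
	return b
}

func newLazyMap(n int) *lazyMap {
	return &lazyMap{buckets: makeBuckets(n)}
}

// grow allocates a table twice as large but moves nothing yet.
func (m *lazyMap) grow() {
	m.oldbuckets = m.buckets
	m.buckets = makeBuckets(2 * len(m.oldbuckets))
}

// evacuate rehashes one old bucket into the new table, then drops the
// reference so the garbage collector can reclaim it.
func (m *lazyMap) evacuate(i uint) {
	for k, v := range m.oldbuckets[i] {
		m.buckets[k%uint(len(m.buckets))][k] = v
	}
	m.oldbuckets[i] = nil
}

// Put migrates the old bucket the key belongs to (if any) before writing.
func (m *lazyMap) Put(k uint, v string) {
	if m.oldbuckets != nil {
		if i := k % uint(len(m.oldbuckets)); m.oldbuckets[i] != nil {
			m.evacuate(i)
		}
	}
	m.buckets[k%uint(len(m.buckets))][k] = v
}

// Get reads from the old bucket if it has not been evacuated yet.
func (m *lazyMap) Get(k uint) (string, bool) {
	if m.oldbuckets != nil {
		if old := m.oldbuckets[k%uint(len(m.oldbuckets))]; old != nil {
			v, ok := old[k]
			return v, ok
		}
	}
	v, ok := m.buckets[k%uint(len(m.buckets))][k]
	return v, ok
}

func main() {
	m := newLazyMap(2)
	m.Put(1, "a")
	m.Put(3, "b")
	m.grow()
	m.Put(5, "c") // touches old bucket 1, which is migrated first
	v, ok := m.Get(3)
	fmt.Println(v, ok, m.oldbuckets[1] == nil)
}
```

The point of the design is that the cost of moving data is spread across many writes instead of being paid in one slow operation.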

More grow-related operations will be detailed in the "internal" part; here the focus is on the overall process.

mapassign is the main function that carries out the operation. (At this point I suddenly don't want to write anymore... I'm so tired)

func mapassign(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
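    // Concurrency check
    // Compute the hash value
    // Update the state to "writing"

    // What follows is a three-level nested loop, described from the inside out
    // The third level locates the position of a key within one bucket
        // On success (an update), the val position can be computed directly; jump to the final stage
        // On failure, the second-level loop takes over
    // The second level walks the bucket's overflow area, repeatedly fetching the next bucket
        // On success, run the third-level loop
        // On failure (no more overflow area, i.e. an insert), jump back to the first level
    // The first level obtains space to perform the write (for an insert, h.count++ records the number of keys in the map)
        // Either hashGrow (covered later), then jump back into the three-level loop and continue
        // Or overflow, in which case this level performs the insert directly
        // Operation complete

    // Finally, return the val address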
}

delete

mapdelete is the main function responsible for deletion.

func mapdelete(t *maptype, h *hmap, key unsafe.Pointer) {
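    // Locate the position the same way as access; if nothing is found, do nothing
    // Once located, first clear the reference held in value; the gc reclaims the space later
    // Set the corresponding position in tophash to empty (this is the interesting part)
    // h.count--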
}
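Seen from the language level, these steps show up as follows: deleting a missing key does nothing, and deleting an existing key reduces len by one. The helper lenAfterDelete is hypothetical and only wraps the built-in delete.

```go
package main

import "fmt"

// lenAfterDelete is a hypothetical helper: it deletes key from m and
// reports the resulting number of entries.
func lenAfterDelete(m map[string]int, key string) int {
	delete(m, key)
	return len(m)
}

func main() {
	m := map[string]int{"a": 1, "b": 2}
	fmt.Println(lenAfterDelete(m, "missing")) // prints 2: nothing was located, nothing changes
	fmt.Println(lenAfterDelete(m, "a"))       // prints 1: the count drops by one
	_, ok := m["a"]
	fmt.Println(ok) // prints false
}
```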

References
https://www.gitbook.com/book/tiancaiamao/go-internals/details

The article was first published on my personal blog https://blog.i19.me
