A Simulated Implementation of the Python Dictionary Hash Table

This article is based on what I learned from "In-depth analysis of Python source code".

  • Thanks to the author for the teaching
  • If you are not yet familiar with how the dictionary works internally, that material covers the relevant background

Features

Main class (dictionary): PyDictObject

  • (1) add_entry(self, key, value): adds an element
  • (2) del_entry(self, key): deletes an element
  • (3) hash_list (attribute): lets you traverse the underlying hash array
  • (4) entry_list (attribute): lets you traverse the underlying key-value pair array
  • (5) Both attributes can be iterated with a simple for loop, which is very convenient
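pd = PyDictObject()
# Inspect how the raw hash table is stored
# Traverse with a for loop
for i in pd.ma_keys:
    print(i)

# Or use a generator expression and convert it to a list
a = (i for i in pd.hash_list)
print(list(a))

# Inspect the underlying key-value pair array
b = (i for i in pd.entry_list)
print(list(b))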

Internal functions:

  • Implements growth, pseudo-growth (rebuilding the table at the same size to reclaim deleted slots), and shrinking, plus hash-collision handling by probing (linear probing and square probing)
  • Probing alternates between the two methods in a cycle: a linear probe first, then a square probe, and so on
  • Updating and inserting share one function (the bottom layer checks whether the key already exists and decides whether to update the value or add a new entry)
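
The alternating probe order described above can be sketched on its own. This is only an illustration: the helper name probe_sequence and its max_probes parameter are made up for this example, while the two step formulas are the ones used by linear_detect and square_detect in the code further down.

```python
def probe_sequence(key_hash, table_size, max_probes=10):
    """Yield candidate slots: the home slot, then alternating linear/square steps."""
    mask = table_size - 1      # table_size is assumed to be a power of two
    slot = key_hash & mask     # same result as key_hash % table_size
    yield slot
    for step in range(1, max_probes):
        if step % 2 == 1:
            slot = (slot * 2 + 3) & mask   # the article's "linear" step
        else:
            slot = (slot ** 2) & mask      # the article's "square" step
        yield slot


slots = list(probe_sequence(hash(10), 8))
print(slots)  # [2, 7, 1, 5, 1, 5, 1, 5, 1, 5]
```

In this small table the sequence soon starts revisiting slots 1 and 5, which is one reason a cap on the number of probes is needed.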

PyDictKeysObject

  • Implements all of the core functionality

De_Entries

  • Array of key-value pairs

Dk_Indices

  • Hash table

Entry

  • One element (slot) of the hash table array
  • Has two attributes: a status, and the index of the key-value pair in the key-value pair array

PyDictKeyEntry

  • The key-value pair object
  • Defines a few methods for manipulating the key-value pair
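
How the hash table and the key-value pair array fit together may be easier to see in a stripped-down sketch. The names build, lookup, and EMPTY are hypothetical, collisions are resolved with a plain step of +1 rather than the probing used in this article, and duplicate keys and full tables are not handled.

```python
EMPTY = -1  # hypothetical marker for an unused hash slot


def build(pairs, size=8):
    """Store pairs in an insertion-ordered entry array indexed by a hash table."""
    indices = [EMPTY] * size        # hash table: slot -> position in entries
    entries = []                    # key-value pair array, in insertion order
    for key, value in pairs:
        slot = hash(key) & (size - 1)
        while indices[slot] != EMPTY:          # simplified probing: next slot
            slot = (slot + 1) & (size - 1)
        indices[slot] = len(entries)
        entries.append((hash(key), key, value))
    return indices, entries


def lookup(indices, entries, key):
    """Follow the hash table to the entry array and return the stored value."""
    size = len(indices)
    slot = hash(key) & (size - 1)
    while indices[slot] != EMPTY:
        entry_hash, entry_key, value = entries[indices[slot]]
        if entry_hash == hash(key) and entry_key == key:
            return value
        slot = (slot + 1) & (size - 1)
    raise KeyError(key)


indices, entries = build([(10, "a"), (3, "b"), (18, "c")])
print(indices)                       # [-1, -1, 0, 1, 2, -1, -1, -1]
print(lookup(indices, entries, 18))  # c
```

Keys 10 and 18 share home slot 2, so 18 moves on past the occupied slot 3 to slot 4, while the entry array keeps the three pairs in insertion order.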

The main function diagram:

Expansion and shrinkage (may be slightly different from the source code implementation)

[Figure: diagram of the expansion and shrinkage logic]
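
As a rough sketch, the resize decision can be written as one small function. The thresholds mirror the expansion method in the code further down; the name next_table_size and the MIN_SIZE floor of 8 are assumptions made for this example.

```python
MIN_SIZE = 8  # assumed minimum table size


def next_table_size(size, used):
    """Return the table size to rebuild with, given the number of live entries."""
    capacity = size * 2 // 3        # length of the key-value pair array
    if used > capacity * 3 // 4:    # mostly live entries: grow
        return size * 2
    if used >= capacity // 2:       # moderately full: rebuild at the same size
        return size
    new_size = MIN_SIZE             # mostly deleted entries: shrink
    while new_size < used * 3:
        new_size *= 2
    return new_size


print(next_table_size(8, 5))    # 16
print(next_table_size(16, 6))   # 16
print(next_table_size(64, 5))   # 16
```

The three calls show growth, a same-size rebuild, and a shrink, in that order.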

Delete logic (the handwriting in the diagram is rather sloppy)

[Figure: diagram of the delete logic]
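
The reason a deleted slot is marked DUMMY instead of being reset to EMPTY can be shown with a minimal sketch. It assumes a table that stores keys only, uses a plain step of +1 for probing, and never fills up; the function names are hypothetical.

```python
EMPTY = object()   # slot that has never been used
DUMMY = object()   # slot whose key was deleted (a tombstone)


def insert(table, key):
    slot = hash(key) % len(table)
    while table[slot] is not EMPTY and table[slot] is not DUMMY:
        slot = (slot + 1) % len(table)
    table[slot] = key


def contains(table, key):
    slot = hash(key) % len(table)
    for _ in range(len(table)):
        if table[slot] is EMPTY:        # end of the probe chain
            return False
        if table[slot] is not DUMMY and table[slot] == key:
            return True
        slot = (slot + 1) % len(table)  # skip tombstones and other keys
    return False


def delete(table, key):
    slot = hash(key) % len(table)
    for _ in range(len(table)):
        if table[slot] is EMPTY:
            raise KeyError(key)
        if table[slot] is not DUMMY and table[slot] == key:
            table[slot] = DUMMY         # keep the probe chain intact
            return
        slot = (slot + 1) % len(table)
    raise KeyError(key)


table = [EMPTY] * 8
insert(table, 2)
insert(table, 10)           # collides with 2 and lands in slot 3
delete(table, 2)
print(contains(table, 10))  # True
```

If slot 2 were reset to EMPTY, the search for 10 would stop there and wrongly report that the key is missing.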

Test

  • After 10,000 loop iterations, apart from performance problems, essentially all bugs had been eliminated: no errors come from the logic of the code itself, so it is reasonably robust.
  • Each insertion needs fewer than 10 rounds of probing (the abort limit is set to 10 rounds, and the actual number may be lower)

Problems

  • (1) The object header is not reproduced, so memory management and garbage collection are ignored when an object is deleted or modified: there is no refcnt field on which reference counting could be based.
  • (2) Some fields of the dictionary structure are never used here, so they are left out.
  • (3) Many functions look very simple, yet testing exposed plenty of logic holes and bugs, and the code had to be updated and improved over and over. Even a small thing needs careful thought if it is to be foolproof. It makes me admire the developers of the language, who implemented the whole language while avoiding such loopholes.
  • (4) I wanted to reproduce dictionary traversal such as for k, v in dict.items(), which walks keys and values but underneath traverses the key-value pair array. However, the iterator of the key-value pair array shares a single i variable, so traversing keys, values, etc. at the same time would clash on the value of i. This is not implemented.
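
One possible way around the shared-counter problem in (4) is to let every traversal create its own generator. The sketch below is not part of the implementation in this article: EntryArray is a hypothetical stand-in for the key-value pair array, and a key of None is assumed to mark a deleted entry.

```python
class EntryArray:
    """Hypothetical stand-in for the key-value pair array."""

    def __init__(self, pairs):
        self.entries = list(pairs)

    def items(self):
        # Each call creates a new generator with its own position.
        for key, value in self.entries:
            if key is not None:     # skip entries cleared by a delete
                yield key, value

    def keys(self):
        return (key for key, _ in self.items())

    def values(self):
        return (value for _, value in self.items())

    def __iter__(self):
        return self.keys()


arr = EntryArray([("a", 1), (None, None), ("b", 2)])
combos = [(k, v) for k in arr.keys() for v in arr.values()]
print(combos)  # [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```

Because no position is stored on the object itself, the nested traversal of keys and values does not interfere with itself.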

Reflections:

When I first designed this, I expected to need a lot of code for such a big feature; in the end it took only about 300 lines. Many things can only be understood through practice: even a single field or one piece of processing logic has to be verified bit by bit, again and again, before it works. Working through the design ideas and the various aspects of the program was very rewarding. Admittedly the quality of the current code is not high, and many important points were not considered while writing it because of the limits of my thinking at the time, but there is still progress.

Code display

class Me_Key():
    """Data descriptor that restricts me_key to hashable values."""

    def __init__(self, name):
        self.attrname = name

    def __get__(self, instance, owner):
        return getattr(instance, self.attrname)

    def __set__(self, instance, value):
        # Dicts and lists are treated as unhashable and rejected
        if isinstance(value, (dict, list)):
            raise TypeError("invalid key: unhashable type")
        setattr(instance, self.attrname, value)


class PyDictKeyEntry():
    """A stored key-value pair object (PyDictKeyEntry)."""

    me_key = Me_Key('_key')
    def __init__(self):
        """
        :param    me_hash: hash of the key object, cached to avoid calling __hash__ repeatedly;
        :param    me_key: pointer to the key object;
        :param    me_value: pointer to the value object;
        """
        self.me_key = None
        self.me_value = None
        self.me_hash = None

    def set_data(self, key, value):

        self.me_key = key
        self.me_value = value
        self.me_hash = hash(key)

    def del_data(self):   # Resetting every field to None marks the entry as deleted
        self.me_key = None
        self.me_hash = None
        self.me_value = None
    

class Entry():
    """One slot of the hash table."""

    # The three possible slot states
    EMPTY = 1
    DUMMY = 2
    USED = 3

    def __init__(self):
        self.status = self.EMPTY
        self.index = None

class Dk_Indices():
    """The dk_indices hash table."""
    entry = Entry

    def __init__(self, n = 8):
        """
            The hash table starts with a length of 8.
        """
        self.hash_table = [self.entry() for i in range(n)]


class De_Entries():
    """The array that stores the key-value pairs."""
    pke = PyDictKeyEntry

    def __init__(self, n = 5):
        self.i = -1
        self.length = n
        self.entries = [self.pke() for i in range(n)]
    
    def __iter__(self):
        self.i = -1
        return self
        
    def __next__(self):

        while self.i < self.length -1:
            self.i += 1
            return self.entries[self.i].me_key, self.entries[self.i].me_value
        raise StopIteration


class PyDictKeysObject():
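    """The hash table object structure."""

    indices = Dk_Indices
    entries = De_Entries
    def __init__(self):
        """
            :param    dk_size: size of the hash table
            :param    dk_usable: number of slots still available in the key-value pair array
            :param    dk_nentries: number of slots already used in the key-value pair array; deleting an entry does not free its slot, so this count stays the same and de_deldatas is incremented instead
            :param    dk_indices: the hash table; each slot stores the index of the matching element in dk_entries
            :param    dk_entries: the array that holds the key-value pairs
            :param    de_deldatas: number of deleted elements
        """
        # Initial state: hash table of length 8, key-value pair array of length 5 (two thirds of 8), nothing used yet
        self.dk_size = 8
        self.dk_usable = 5
        self.dk_nentries = 0
        self.de_deldatas = 0
        self.dk_indices = self.indices()
        self.dk_entries = self.entries()

        # Counter used by the iterator
        self.i = -1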
    
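    def find_location(self, me_key, me_value):
        """Find where the data belongs in the hash table and insert it."""
        # CPython derives a different probe sequence from each object's hash value. To keep things simple, this version alternates linear and square probing to reduce collisions.

        # A trick: the modulo is replaced by a bitwise AND. The table length dk_size is a power of two, so dk_size - 1 is all ones in the low bits and zeros above; ANDing with it gives the same result as taking the hash modulo dk_size, only faster. https://blog.csdn.net/u014266077/article/details/80672995
        a = self.add_entries_data(me_key, me_value)
        if not a:
            return
        self.insert_hash_index(a)
        # Insert succeeded: update the dk_usable and dk_nentries fields
        self.dk_usable -= 1
        self.dk_nentries += 1
        print("+++++", self.dk_usable)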
    
    def insert_hash_index(self, a):
        location = a.me_hash & (self.dk_size - 1)
        entry = self.dk_indices.hash_table[location]
        s = self.status_1_2(entry)

        if not s:
            sequence = entry.status
            n = 1
            while sequence == 3:
                # Guard against an endless loop if the collision strategy is poor: give up after ten rounds
                n += 1
                if n > 10:
                    print("Could not find a free slot after ten rounds of probing.")
                    raise TimeoutError
                location = self.linear_detect(location)  # linear probing
                temp =  self.dk_indices.hash_table[location]
                sequence = temp.status
                if sequence == 3:
                    location = self.square_detect(location)  # square probing
                    temp =  self.dk_indices.hash_table[location]
                    sequence = temp.status
            self.status_1_2(temp)
    
    def del_element(self, key):
        """Delete the element stored under key."""

        try:
            hash_key = hash(key)
        except TypeError:
            print("Hint: the given key is not hashable, please try again!")
            raise
        location = hash_key & (self.dk_size - 1)
        entry = self.dk_indices.hash_table[location]
        n = 0
        while True:
            if n != 0:
                # Insertion gives up after fewer than 20 probes, so a key not found by then is absent
                if n > 20:
                    raise KeyError(key)
                if n & 1 == 1:
                    location = self.linear_detect(location)  # linear probing
                else:
                    location = self.square_detect(location)  # square probing
                entry = self.dk_indices.hash_table[location]

            if entry.status == 1:
                raise KeyError(key)
            elif entry.status == 2:
                n += 1
                continue
            else:
                array_key_value = self.dk_entries.entries[entry.index]
                # Compare the key as well as the hash: different keys can share a hash
                if hash_key == array_key_value.me_hash and key == array_key_value.me_key:
                    print("Deleting")
                    self.de_deldatas += 1
                    entry.status = 2
                    entry.index = None
                    array_key_value.del_data()
                    break
                else:
                    n += 1
    
    def add_entries_data(self, key, value):
        """Insert data into the key-value pair array."""

        temp = self.dk_entries.pke()
        temp.set_data(key, value)
        for i in self.dk_entries.entries:
            # Compare the key as well as the hash: different keys can share a hash
            if i.me_hash == temp.me_hash and i.me_key == temp.me_key:
                print("Updating existing key")
                i.me_value = temp.me_value
                break
        else:
            # Check whether the array has to be resized
            if not self.cheack():
                # Resize
                self.expansion()
            a = self.dk_entries.entries[self.dk_nentries]
            a.set_data(key, value)
            return a

    def cheack(self):
        """Check on insert whether the key-value pair array is full, in which case the table must be grown or shrunk."""
        return self.dk_usable != 0

    def expansion(self):
        """Resize: grow, rebuild at the same size, or shrink."""

        # used: live entries, not counting deleted ones; entries: length of the key-value pair array
        used = self.dk_nentries - self.de_deldatas
        entries = (self.dk_size*2 // 3)
        if used > entries*3 // 4 :
            print("Growing: doubling the table")
            self.dk_size *= 2
        elif used >= entries // 2 or self.dk_size == 8:
            # The minimum size is 8, so a table that small is rebuilt instead of shrunk
            print("Pseudo-growing: rebuilding at the same size")
        else:
            print("Shrinking")
            # Smallest power of two, at least 8, that keeps the live entries within a third of the table
            n = 8
            while n < used * 3:
                n *= 2
            self.dk_size = n
        # Reset the bookkeeping, then reallocate the key-value pair array and refill the hash table
        self.dk_indices = self.indices(self.dk_size)
        self.dk_usable = self.dk_size*2 // 3
        self.dk_nentries = 0
        self.de_deldatas = 0  # deleted entries are dropped by the rebuild
        self.redistribute()

    def redistribute(self):
        """Rebuild the hash table and the key-value pair array."""

        i = 0
        temp = self.entries(self.dk_size*2 // 3)
        for data in self.dk_entries.entries:
            if data.me_hash is None:
                continue
            temp.entries[i] = data
            self.insert_hash_index(data)
            self.dk_usable -= 1
            self.dk_nentries += 1
            i += 1
        self.dk_entries = temp

    def status_1_2(self, b):
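        """State transition for EMPTY and DUMMY slots."""
        # Status 1 or 2 means there is no hash collision: claim the slot and return True. Otherwise return False.
        if b.status != 3:
            b.index = self.dk_nentries
            b.status = 3
            return True
        else:
            return False

    def linear_detect(self, index):
        """Resolve hash collisions by open addressing: linear probing."""
        # In experiments, with +3 here, 10,000 insertions never ended in an endless loop, while 0 or 1 did not work. The collision-resolution algorithm really matters.
        return (index * 2 + 3) & (self.dk_size - 1)

    def square_detect(self, index):
        """Square probing."""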
        return (index ** 2) & (self.dk_size - 1)

    # Defined as an iterator so the contents of the hash table can be inspected easily
    def __iter__(self):
        self.i = -1  # reset i on every new for loop so iteration starts from the beginning
        return self
    
    def __next__(self):
        while self.i < self.dk_size: 
            self.i += 1
            if self.i == self.dk_size:
                break
            return self.dk_indices.hash_table[self.i].index, self.dk_indices.hash_table[self.i].status
        raise StopIteration


class PyDictObject():

    pdk = PyDictKeysObject
    
    def __init__(self):
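        """Initialize the dictionary object
        :param    ma_used: number of key-value pairs the object currently holds
        :param    ma_version_tag: current version of the object, updated on every modification
        :param    ma_keys: points to the hash table structure of the mapping
        :param    ma_values: in split-table mode, points to an array of all the value objects
        """

        # Common header fields of fixed-size objects (PyObject_HEAD, which holds ob_refcnt) are not modeled here
        self.ma_used = 0
        self.ma_version_tag = 1
        self.ma_keys = self.pdk()
        self.ma_values = None

    @property  # the hash array
    def hash_list(self):
        return self.ma_keys

    @property  # the key-value pair array
    def entry_list(self):
        return self.ma_keys.dk_entries

    # Add a key-value pair
    def add_entry(self, key, value):
        self.hash_list.find_location(key, value)

    # Delete a key-value pair
    def del_entry(self, key):
        self.hash_list.del_element(key)

    # The entry array keeps insertion order, which is what lets Python 3.7+ iterate in insertion order
    def __iter__(self):
        # Yield the keys of live entries; deleted entries have me_hash set to None
        return (e.me_key for e in self.ma_keys.dk_entries.entries if e.me_hash is not None)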



# Functional test
pd = PyDictObject()
# pd.add_entry(123, 789)

for i in range(0, 42):
    pd.add_entry(i, i)


for i in range(37):
    pd.del_entry(i)
pd.add_entry(48, 34)

pd.add_entry(39, "hash")


# Inspect how the raw hash table is stored
# Traverse with a for loop
# for i in pd.ma_keys:
#     print(i)

# Or use a generator expression and convert it to a list
a = (i for i in pd.hash_list)
print(list(a))

b = (i for i in pd.entry_list)
print(list(b))

Origin blog.csdn.net/pythonstrat/article/details/115321715