"Fluent Python" study notes (4) - dictionary, a collection of hash

Abstract: The dictionary is one of the most commonly used data structures in Python programming, thanks to its efficient lookups and convenient interface. In NLP programming it is often used to build and store vocabularies and to implement mappings such as word2int and int2word. It is therefore worth taking an in-depth look at how the dictionary works.

1. Dictionary construction: dict comprehensions

Like a list comprehension, a dict comprehension has the following form:

reduced_dict = {key: value for key, value in ...}

This form is very useful, especially when the keys and values are already available as sequences. For example:

(1) Given a list of keys and a list of values that correspond one-to-one, build a dictionary

keys = ['a', 'b', 'c', 'd']
values = list(range(len(keys)))
reduced_dicts = {key: value for key, value in zip(keys, values)}

(2) Given a one-to-one mapping stored in a dictionary, build the inverse mapping. For example, in NLP, given word2int, how do you obtain int2word?

word2int = {...}
int2word = {word2int[key]: key for key in word2int}
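
As a concrete illustration of both cases, here is a minimal sketch. The vocab list is a made-up example, not data from the original notes, and the inversion only round-trips if the values are unique (as they are for a word-to-index mapping).

```python
# Hypothetical vocabulary, used only for illustration
vocab = ['the', 'cat', 'sat']

# Forward mapping: word -> integer id
word2int = {word: i for i, word in enumerate(vocab)}

# Inverse mapping: integer id -> word (requires unique values)
int2word = {i: word for word, i in word2int.items()}

print(word2int)
print(int2word)
```

Iterating over word2int.items() avoids a second lookup per key compared with indexing word2int[key] inside the comprehension.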

2. Optimizing dictionary CRUD operations

In everyday work, what we actually use is mostly CRUD (create, read, update, delete) functionality. We first look at some of the methods built into Python dictionaries, then use or adapt them to implement the CRUD operations we need.

2.1 Built-in methods of three dictionary types

[Figure: table of built-in dictionary methods, excerpted from "Fluent Python"]

The most commonly used methods are:

  • d.values() and d.keys(): both return view objects over the values and keys, which take up less memory than a copied list
  • d.items(): returns a view of the key-value pairs
  • d.setdefault()
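
To make the "view" behavior concrete, here is a small sketch with made-up data. It shows that a view stays in sync with the dictionary instead of being a snapshot.

```python
d = {'a': 1, 'b': 2}
keys = d.keys()    # a view over the keys, not a copied list
items = d.items()  # a view over the (key, value) pairs

d['c'] = 3         # mutate the dict after the views were created

print('c' in keys)        # True: the view reflects the new key
print(len(items))         # 3
print(keys & {'a', 'z'})  # {'a'}: keys views support set operations
```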

2.2 Reliable access and modification

Most reasonably experienced Python programmers know to replace d[key] with d.get(key, default_value), which cuts down on try/except statements and handles a missing key more simply and efficiently. But that is not enough: d.get() falls short when the value has to be modified. For example:

"""Build an index mapping each word to the list of its occurrences."""
import sys
import re
WORD_RE = re.compile(r'\w+')
index = {}
with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start() + 1
            location = (line_no, column_no)
            # This is a poor implementation, written this way only to make the point
            occurrences = index.get(word, [])  #1
            occurrences.append(location)       #2
            index[word] = occurrences          #3
# Print the results in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])

This approach handles the case where the key is not yet present, but it has two problems:

  • It takes three lines to perform a single update, which is verbose
  • Each update looks the key up twice: once at #1 and again at #3

There are several ways to address these problems:

(1) The setdefault method: fetch the value, setting a default first if the key is missing

"""Build an index mapping each word to the list of its occurrences."""
import sys
import re
WORD_RE = re.compile(r'\w+')
index = {}
with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start() + 1
            location = (line_no, column_no)
            index.setdefault(word, []).append(location)
# Print the results in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])

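Here the setdefault method solves the problem: when index has no key word, it sets index[word] = default and returns that value.
However, this approach departs from the usual index[key] interface for lookups and updates, and relies on setdefault, a method that is not widely used.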
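
The behavior of setdefault can be checked in isolation. The sketch below uses a made-up key and locations. It shows that the first call stores the default, and that later calls return the same stored list rather than a new one.

```python
index = {}

# Key is missing: the default list is stored in the dict and returned
first = index.setdefault('python', [])
first.append((1, 1))

# Key is present: the stored list is returned, the new default is ignored
again = index.setdefault('python', [])
again.append((2, 5))

print(again is first)  # True: same list object both times
print(index)           # {'python': [(1, 1), (2, 5)]}
```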

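(2) Flexible key lookup

Flexible key lookup means that when a key has no mapping, we still want to get a default value back for that key.
The goal is a uniform interface, so that index[k].append(value) can be used directly without raising an error. There are two ways to do this: use defaultdict, or override the __missing__ method.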

  • Using defaultdict:
"""Build an index mapping each word to the list of its occurrences."""
import sys
import re
import collections
WORD_RE = re.compile(r'\w+')
index = collections.defaultdict(list)
with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start() + 1
            location = (line_no, column_no)
            index[word].append(location)
# Print the results in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])

This meets the requirement, but note that the default factory of a defaultdict is only invoked from __getitem__. Other methods, such as d.get(), do not trigger it.
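
A short sketch of that caveat, using made-up keys: d.get() and the in operator leave the dictionary untouched, while d[key] creates the entry.

```python
import collections

index = collections.defaultdict(list)

# get() and `in` do not call the default factory
print(index.get('missing'))  # None
print('missing' in index)    # False
print(len(index))            # 0: nothing was inserted

# Subscript access goes through __getitem__, which creates the default
index['word'].append((1, 1))
print(dict(index))           # {'word': [(1, 1)]}
```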

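  • Overriding the __missing__ method

Under the hood, when a key is not found in a defaultdict, __getitem__ calls the __missing__ method. Any dict subclass can define __missing__ to customize what happens on a failed lookup: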

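class StrKeyDict0(dict):
    """A dict that also accepts non-string keys by converting them to str on lookup."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when key is not found
        if isinstance(key, str):
            # A missing str key is genuinely absent; this check also prevents infinite recursion
            raise KeyError(key)
        return self[str(key)]

    def get(self, key, default=None):
        # Delegate to __getitem__ so that __missing__ gets a chance to run
        try:
            return self[key]
        except KeyError:
            return default

    def __contains__(self, key):
        # Use self.keys() rather than `key in self`, which would call __contains__ recursively
        return key in self.keys() or str(key) in self.keys()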
This is a variant class that inherits from dict. A better approach is to inherit from UserDict, which makes for cleaner code.
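
For comparison, here is a sketch of what a UserDict-based version might look like. The class name StrKeyDict and the decision to store every key as a str are assumptions made for illustration. UserDict keeps its contents in a plain dict attribute called data, which lets the subclass avoid the recursion workarounds needed above.

```python
import collections


class StrKeyDict(collections.UserDict):
    """Mapping that stores all keys as str (illustrative sketch)."""

    def __missing__(self, key):
        # A missing str key is genuinely absent
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]

    def __contains__(self, key):
        # All stored keys are str, so one check against self.data is enough
        return str(key) in self.data

    def __setitem__(self, key, item):
        # Normalize every key to str on the way in
        self.data[str(key)] = item


d = StrKeyDict([('2', 'two'), (4, 'four')])
print(d[2], d['4'])     # two four
print(d.get(1, 'N/A'))  # N/A
print(1 in d, 2 in d)   # False True
```

Because get() is inherited from the mapping base classes and already goes through __getitem__, it does not need to be overridden here.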

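3. Hashing and efficiency

As we have seen, a dictionary is typically built once and queried many times, so lookup efficiency deserves the closest attention. To understand that efficiency, we need to understand the algorithm behind it. In Python, dictionaries are implemented with a hash table.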

The rough workflow is as follows:
[Figure: flow chart of how a dictionary lookup works]
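
Since the flow chart itself is not reproduced here, the general idea of a hash-table lookup can be sketched as a toy class. This is a deliberately simplified illustration using linear probing. The class name ToyHashTable, the fixed size, and the probing scheme are all assumptions for teaching purposes and do not reflect CPython's actual implementation, which resizes the table and resolves collisions differently.

```python
class ToyHashTable:
    """Simplified open-addressing hash table; NOT CPython's real algorithm."""

    def __init__(self, size=8):
        # Each slot is either None (empty) or a (key, value) pair
        self._slots = [None] * size

    def _probe(self, key):
        """Yield slot indexes to try, starting at the hash-derived position."""
        size = len(self._slots)
        start = hash(key) % size
        for offset in range(size):
            yield (start + offset) % size

    def put(self, key, value):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None or slot[0] == key:
                self._slots[i] = (key, value)
                return
        raise RuntimeError('table is full (a real table would resize)')

    def get(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None:
                raise KeyError(key)  # empty slot: the key is not present
            if slot[0] == key:
                return slot[1]       # same key: found it
            # different key in this slot: collision, try the next one
        raise KeyError(key)


table = ToyHashTable()
table.put('a', 1)
table.put('b', 2)
table.put('a', 3)  # overwrites the earlier value for 'a'
print(table.get('a'), table.get('b'))  # 3 2
```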
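Because a hash algorithm is used, dictionaries have the following characteristics:

  • Keys must be hashable, which in practice means immutable types such as strings and tuples (provided the tuple contains only immutable items)
  • Memory consumption is high: about 3/2 of the memory the data actually needs, because the hash table is kept sparse
  • Lookups are very fast: the time is essentially independent of the amount of data, i.e. O(1)
  • Adding or deleting entries may cause the table to be moved in memory, for example when it is resized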
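
The hashability requirement in the first point can be checked directly with the built-in hash() function. The helper name is_hashable is made up for this sketch.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable('word'))       # True: str is immutable
print(is_hashable((1, 2)))       # True: tuple of immutable items
print(is_hashable([1, 2]))       # False: list is mutable
print(is_hashable((1, [2, 3])))  # False: tuple containing a mutable list
```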

Origin blog.csdn.net/baidu_34912627/article/details/104087304