"Fluent Python" study notes (4): dictionaries and hash-based collections

Abstract: The dictionary is one of the most commonly used data structures in Python programming because of its efficient lookups and convenient interface. In NLP programming it is often used to build and store vocabularies and to implement mappings such as word2int and int2word. It is therefore worth taking an in-depth look at how dictionaries work.

1. Building dictionaries: dict comprehensions

Like list comprehensions, dict comprehensions have the following format:

reduced_dict = {key: value for key, value in ...}

This is very useful, especially when the keys and values are already available as sequences, for example:

(1) Given a list of keys and a list of values in one-to-one correspondence, build a dictionary

keys = ['a', 'b', 'c', 'd']
values = list(range(len(keys)))
reduced_dicts = {key: value for key, value in zip(keys, values)}
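When the pairs need no filtering or transformation, the built-in dict constructor gives the same result as the comprehension. A minimal sketch, reusing the keys and values from above:

```python
keys = ['a', 'b', 'c', 'd']
values = list(range(len(keys)))

# The comprehension form used in these notes
by_comprehension = {key: value for key, value in zip(keys, values)}

# dict() accepts any iterable of (key, value) pairs, so zip() can be passed directly
by_constructor = dict(zip(keys, values))

print(by_comprehension == by_constructor)
```

The comprehension remains the better choice when a condition or a transformation of the key or value is needed.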

(2) Given a one-to-one mapping in a dictionary, build the inverse mapping; for example, in NLP, given word2int, how do we obtain int2word?

word2int = {...}
int2word = {word2int[key]: key for key in word2int}
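As a concrete sketch of the inversion, assume a small vocabulary list (the words below are hypothetical and only serve as an illustration). Iterating over items() avoids a second lookup per key. Note that the inversion is only lossless when every value in word2int is unique.

```python
# Hypothetical vocabulary, used only for illustration
vocabulary = ["the", "cat", "sat", "on", "mat"]

# Forward mapping: word -> integer id
word2int = {word: i for i, word in enumerate(vocabulary)}

# Inverse mapping: integer id -> word
int2word = {i: word for word, i in word2int.items()}

# Round trip: encoding and then decoding gives back the original words
encoded = [word2int[w] for w in ["cat", "sat"]]
decoded = [int2word[i] for i in encoded]
print(encoded, decoded)
```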

2. Optimizing dictionary CRUD operations

In day-to-day work, what we mostly use is CRUD (create, read, update, delete) functionality. We first look at some of the built-in methods of Python's dictionaries, and then use or adapt these methods to implement the CRUD functionality we need.

2.1 Built-in methods of three dictionary types

[Images omitted: tables of dictionary methods, excerpted from "Fluent Python"]

Among these, the most commonly used methods are:

  • d.keys() and d.values(): both return view objects over the keys and values, which use less memory than building new lists
  • d.items(): returns a view of the key-value pairs
  • d.setdefault(k, default): returns d[k] if k is present; otherwise sets d[k] = default and returns it
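A short sketch of how the view objects behave (the dictionary contents are made up for illustration): a view does not copy the data, so it reflects later changes to the dictionary, and key views support set operations.

```python
d = {"a": 1, "b": 2}

keys_view = d.keys()       # a view, not a copy of the keys
d["c"] = 3                 # modify the dictionary after taking the view

# The view reflects the new key without being rebuilt
sees_new_key = "c" in keys_view

# Key views behave like sets: here, the intersection with another set
common = d.keys() & {"a", "z"}

# items() yields (key, value) pairs
pairs = sorted(d.items())

print(sees_new_key, common, pairs)
```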

2.2 Reliable lookup and modification

Most experienced Python programmers know to replace d[key] with d.get(key, default_value), which avoids a try/except block and handles missing keys more simply and efficiently. But d.get() is not enough when the value stored under a key has to be modified. For example:

"""Build a mapping from each word to the locations where it occurs."""
import sys
import re
WORD_RE = re.compile(r'\w+')
index = {}
with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start() + 1
            location = (line_no, column_no)
            # This is a deliberately poor implementation, written only to make a point
            occurrences = index.get(word, [])  #1
            occurrences.append(location)       #2
            index[word] = occurrences          #3
# Print the results in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])

This approach handles the case where the key is not yet in the dictionary, but it has two problems:

  • It takes three lines to perform a single update, which is too long
  • The key is looked up twice during each update: once at #1 and again at #3

There are several ways to address these issues:

(1) The setdefault method: fetch the value to modify, inserting a default first if the key is missing

"""Build a mapping from each word to the locations where it occurs."""
import sys
import re
WORD_RE = re.compile(r'\w+')
index = {}
with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start() + 1
            location = (line_no, column_no)
            # Fetch the list for word (creating it if missing) and append in one step
            index.setdefault(word, []).append(location)
# Print the results in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])

setdefault is used here to handle the case where word is not yet a key in index: it sets index[word] = default and returns that value, so the value can be modified in place. The drawback is that lookups and updates no longer share one interface, and the code has to rely on the relatively uncommon setdefault method.
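To make the behaviour of setdefault concrete, the sketch below compares it with the longer "check, then insert" form on a few made-up (word, location) pairs; both build the same index.

```python
# Hypothetical (word, location) pairs standing in for the file scan
occurrences = [("a", (1, 1)), ("b", (1, 3)), ("a", (2, 1))]

# One line per update with setdefault
index = {}
for word, location in occurrences:
    index.setdefault(word, []).append(location)

# The equivalent long form, which looks the key up more than once
index_long = {}
for word, location in occurrences:
    if word not in index_long:
        index_long[word] = []
    index_long[word].append(location)

print(index == index_long, index)
```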

(2) Flexible key lookup

Flexible key lookup means that when a key is not in the mapping, looking it up still returns a default value. The goal is a uniform interface, so that an expression such as index[k].append(value) can be used directly without raising an error. There are two ways to do this: use collections.defaultdict, or override the __missing__ method.

  • Using defaultdict
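"""Build a mapping from each word to the locations where it occurs."""
import sys
import re
import collections
WORD_RE = re.compile(r'\w+')
# Missing keys are created on first access with list() as the default value
index = collections.defaultdict(list)
with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start() + 1
            location = (line_no, column_no)
            index[word].append(location)
# Print the results in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])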

This meets the requirement, but note that the default_factory of a defaultdict is only called from __getitem__ (that is, by d[key]); other methods such as d.get() do not trigger it.
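The difference between d[key] and d.get(key) on a defaultdict can be checked with a short sketch (the keys are made up for illustration):

```python
import collections

index = collections.defaultdict(list)
index["a"].append(1)                 # __getitem__ creates the list for "a"

via_get = index.get("b")             # get() does not call default_factory
present_after_get = "b" in index     # the "in" test does not create the key either

via_getitem = index["b"]             # __getitem__ calls default_factory
present_after_getitem = "b" in index

print(via_get, present_after_get, via_getitem, present_after_getitem)
```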

  • Implementing the __missing__ method

In fact, defaultdict works the same way: when a key is not found, __getitem__ calls the __missing__ method. A dict subclass can define __missing__ itself:

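class StrKeyDict0(dict):
    def __missing__(self, key):
        # Called by dict.__getitem__ when key is not found
        if isinstance(key, str):
            # The key is already a str and is still missing: give up
            raise KeyError(key)
        # Retry the lookup with the key converted to str
        return self[str(key)]

    def get(self, key, default=None):
        try:
            # Delegate to __getitem__ so that __missing__ gets a chance to run
            return self[key]
        except KeyError:
            return default

    def __contains__(self, key):
        # Check both the key as given and its str form
        return key in self.keys() or str(key) in self.keys()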

This version subclasses dict directly; a better approach is to subclass UserDict, which makes for more elegant code.
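A sketch of what the UserDict variant might look like, along the lines of the StrKeyDict example in "Fluent Python". Because UserDict keeps its contents in the internal data attribute, keys can be normalized to str on the way in, and get() no longer needs to be overridden. The sample entries are made up for illustration.

```python
import collections


class StrKeyDict(collections.UserDict):
    def __missing__(self, key):
        # A str key that is still missing really is absent
        if isinstance(key, str):
            raise KeyError(key)
        # Otherwise retry with the str form of the key
        return self[str(key)]

    def __contains__(self, key):
        # All stored keys are str, so one check against self.data is enough
        return str(key) in self.data

    def __setitem__(self, key, item):
        # Normalize every key to str when storing
        self.data[str(key)] = item


d = StrKeyDict([("2", "two"), ("4", "four")])
d[6] = "six"                      # stored under the key "6"
print(d[2], d.get(4), 6 in d, d.get(1, "N/A"))
```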

3. Hashing and efficiency

As the examples show, a dictionary is typically built once and then queried many times, so query efficiency deserves the most attention. To understand how efficient a dictionary is, we must understand the algorithm behind it. Python dictionaries are implemented with hash tables.

The workflow is roughly as follows:
[Figure: dictionary workflow flow chart]
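The general idea of a hash-table lookup can be sketched with a toy class. This is an illustrative simplification and not CPython's actual implementation: it assumes linear probing, doubles the table when it is more than two thirds full, and does not support deletion. The name ToyHashTable is hypothetical.

```python
class ToyHashTable:
    """Toy open-addressing hash table with linear probing (illustration only)."""

    def __init__(self, size=8):
        self._slots = [None] * size   # each slot is None or a (key, value) pair
        self._used = 0

    def _probe(self, key):
        # Start at hash(key) % size and walk forward, wrapping around
        size = len(self._slots)
        start = hash(key) % size
        for step in range(size):
            yield (start + step) % size

    def __setitem__(self, key, value):
        # Keep at least one third of the slots empty
        if (self._used + 1) * 3 > len(self._slots) * 2:
            self._resize()
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None:              # empty slot: insert here
                self._slots[i] = (key, value)
                self._used += 1
                return
            if slot[0] == key:            # same key: overwrite the value
                self._slots[i] = (key, value)
                return

    def __getitem__(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is None:              # reached an empty slot: key is absent
                raise KeyError(key)
            if slot[0] == key:            # hash collision resolved by comparing keys
                return slot[1]
        raise KeyError(key)

    def _resize(self):
        # Growing the table moves every entry to a new position
        old = [slot for slot in self._slots if slot is not None]
        self._slots = [None] * (len(self._slots) * 2)
        self._used = 0
        for key, value in old:
            self[key] = value


table = ToyHashTable()
for i in range(100):
    table[f"w{i}"] = i
print(table["w42"])
```

Even in this toy version, a lookup touches only a few slots regardless of how many entries are stored, the table always holds spare empty slots, and inserting can force every entry to move.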
Because a hash table is used, dictionaries have the following characteristics:

  • Keys must be hashable, which mostly means immutable types such as strings and tuples whose elements are all themselves hashable
  • Memory overhead is high: the actual memory requirement is about 3/2 of what the stored entries alone would need
  • Lookups are fast: lookup time barely grows with the amount of data, i.e. it is O(1) on average
  • Adding or deleting entries may cause the table to be resized and its contents moved in memory
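The hashability requirement can be checked directly with the built-in hash(). The helper name is_hashable below is hypothetical; the sketch shows that a tuple is only usable as a key when everything inside it is hashable too.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


results = {
    "str": is_hashable("word"),
    "tuple of ints": is_hashable((1, 2)),
    "tuple containing a list": is_hashable((1, [2, 3])),
    "list": is_hashable([1, 2]),
    "frozenset": is_hashable(frozenset({1, 2})),
}
print(results)
```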

Origin blog.csdn.net/baidu_34912627/article/details/104087304