Implementation of LRU Cache in Python

LRU Cache: Least Recently Used cache, i.e. a cache that evicts the least recently used entry first.

Today I asked a colleague how to implement an LRU cache, and the answer was: use timestamps. The idea, presumably, is to store entries in a Python dictionary together with timestamps, so that LRU eviction can be implemented by comparing which timestamp is earlier.

First, consider the cache itself. For a function like task(arg1, arg2), the cache stores the mapping from arguments to return values, so it can be implemented directly with a dictionary. The key point is that the arguments must be hashable, that is, immutable. The arguments then serve as the dictionary key, and the function's return value is stored as the value.
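As a minimal sketch of this idea (the memoize and task names here are illustrative, not from any library):

```python
def memoize(func):
    """A minimal cache: hashable arguments -> return value."""
    cache = {}

    def wrapper(*args):
        if args not in cache:       # miss: compute and store
            cache[args] = func(*args)
        return cache[args]          # hit: return the stored value

    return wrapper

@memoize
def task(arg1, arg2):
    return arg1 + arg2

task(1, 2)   # computed
task(1, 2)   # served from the cache
```

Note that a mutable argument such as a list would raise TypeError when used as a dictionary key, which is exactly the hashability requirement described above.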

As for LRU: it is commonly used in page replacement for operating-system memory management. The most recently used pages stay in memory while the rest stay on disk, which improves the page hit rate (a successful lookup is a hit; a failed one is a miss). For the function above, this means keeping the mapping between the most recently used arguments and their return values.

Is a timestamp dictionary feasible?

In the dictionary my colleague described, the cache model would use the function arguments as keys, since the key is what determines whether there is a hit. But the timestamp would also need to act as a key, because the usage order matters: when the cache is full, the least recently used entry must be evicted, and that requires finding which arguments correspond to the earliest timestamp.

There is such a thing as a multi_key_dict, where multiple keys map to one value. Assuming timestamps never collide, we could use both the timestamp and the arguments as keys. But to find the earliest timestamp we would still need a sort, because dictionary keys have no order, never mind picking the timestamp keys out from among all the argument keys; and what if an argument happens to be a timestamp itself?
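To see why the timestamp approach is awkward, here is a hedged sketch: even with a single dict mapping arguments to (timestamp, value) pairs, eviction still has to scan every entry for the oldest timestamp, which is O(n). All names below are made up for illustration:

```python
import time

# Hypothetical sketch of the timestamp idea: one dict maps each
# argument tuple to (last-used timestamp, cached result).
cache = {}
MAXSIZE = 2

def get_or_compute(args, compute):
    if args in cache:
        _, value = cache[args]
        cache[args] = (time.monotonic(), value)   # refresh the timestamp
        return value
    if len(cache) >= MAXSIZE:
        # Eviction must scan every entry for the oldest timestamp: O(n)
        oldest = min(cache, key=lambda k: cache[k][0])
        del cache[oldest]
    cache[args] = (time.monotonic(), compute(*args))
    return cache[args][1]
```

The min() scan at eviction time is the sort-like cost described above, and it never goes away in this design.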

Use LIST

Since the timestamps would need sorting anyway, we can skip them and instead store the keys themselves, that is, the function arguments, in a Python list. The most recently used key sits at the head of the list and the least recently used at the tail, while the argument-to-return-value mapping lives in a dictionary. Together these implement an LRU cache.

On a hit (the key is in the dictionary), move the key from its current position to the head of the list; when the cache is full, delete the last element. This seems fine, but there is a performance problem: both popping from and inserting into the middle of a list are O(n) operations.
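A sketch of this list-plus-dict design, with the O(n) operations marked (ListLRU is a hypothetical name, not from any library):

```python
class ListLRU:
    """LRU built from a list of keys plus a dict of key -> value.

    list.remove() and list.insert(0, ...) are both O(n), which is the
    bottleneck described above.
    """
    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self.order = []   # most recently used key at the head
        self.data = {}

    def get(self, key):
        if key not in self.data:
            return None
        self.order.remove(key)      # O(n): search and shift elements
        self.order.insert(0, key)   # O(n): shift elements right
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.order.remove(key)
        elif len(self.data) >= self.maxsize:
            oldest = self.order.pop()   # evict from the tail
            del self.data[oldest]
        self.order.insert(0, key)
        self.data[key] = value
```

Functionally correct, but every hit pays the O(n) list surgery.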

Use LINKED LIST

Since inserting into a list is the problem, use a linked list instead. With a head pointer, insertion at the head is only O(1). However, removing an entry from the tail of a singly linked list requires traversing from the head, so use a doubly circular linked list instead: the tail element is then reachable directly from the head pointer by stepping backwards.

Everything looks fine, but think again: on a hit, the node has to be moved from somewhere in the middle to the head, and finding that node still requires traversing the doubly circular linked list, which is O(n) on average.

Combining dictionaries and linked lists

When reading the cache, the return value is obtained from the dictionary by key. So can the linked list and the dictionary be combined, letting the key look up its position in the linked list directly? Yes: store, as the dictionary value, the linked-list node corresponding to that key. The node also stores the return value, so the result can be retrieved at the same time.
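A minimal sketch of the combined structure (all names here are illustrative): the dict maps each key to its node, and a circular doubly linked list with a sentinel keeps the usage order, so lookup, promotion, and eviction are all O(1):

```python
class Node:
    __slots__ = ('prev', 'next', 'key', 'value')

class LinkedLRU:
    """Dict maps key -> node; a circular doubly linked list keeps order."""
    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self.map = {}
        self.root = Node()                     # sentinel node
        self.root.prev = self.root.next = self.root

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _link_front(self, node):
        # insert right after root (the most-recently-used side)
        node.prev, node.next = self.root, self.root.next
        self.root.next.prev = node
        self.root.next = node

    def get(self, key):
        node = self.map.get(key)
        if node is None:
            return None
        self._unlink(node)                     # O(1): the dict gave us the node
        self._link_front(node)
        return node.value

    def put(self, key, value):
        if key in self.map:
            node = self.map[key]
            node.value = value
            self._unlink(node)
        else:
            if len(self.map) >= self.maxsize:
                lru = self.root.prev           # tail = least recently used
                self._unlink(lru)
                del self.map[lru.key]
            node = Node()
            node.key, node.value = key, value
            self.map[key] = node
        self._link_front(node)
```

This is the same shape as the structure in functools.lru_cache shown below, just written as a class rather than with closures and list-based nodes.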

FUNCTOOLS.LRU_CACHE

functools.lru_cache in Python implements exactly this LRU cache; its core is a doubly circular linked list plus a dictionary.

After from functools import lru_cache, you can inspect the source of lru_cache:

_CacheInfo = namedtuple("CacheInfo", ["hits", "misses", "maxsize", "currsize"])

def lru_cache(maxsize=128, typed=False):
    '''Least-recently-used cache decorator.
    If maxsize is None, the LRU feature is disabled and the cache can grow
    without bound.
    If typed is True, arguments of different types are cached separately;
    for example f(3.0) and f(3) are treated as distinct calls with distinct
    results.
    The arguments of the cached function must be hashable.
    View cache statistics with f.cache_info(), which returns a named tuple
    (hits, misses, maxsize, currsize). Clear the cache with f.cache_clear().
    Access the underlying wrapped function via f.__wrapped__.
    '''
    if maxsize is not None and not isinstance(maxsize, int):
        raise TypeError('Expected maxsize to be an integer or None')

    def decorating_function(user_function):
        wrapper = _lru_cache_wrapper(user_function, maxsize, typed, _CacheInfo)
        return update_wrapper(wrapper, user_function)

    return decorating_function

First of all, this is a decorator that takes arguments: it is applied as @lru_cache(), which returns decorating_function, and decorating_function is what decorates the user's function. The wrapper around the user's function is produced by _lru_cache_wrapper:

def _lru_cache_wrapper(user_function, maxsize, typed, _CacheInfo):
    # Constants shared by all lru cache instances
    sentinel = object()          # unique object returned on a cache miss
    make_key = _make_key         # function that builds a key from the arguments
    PREV, NEXT, KEY, RESULT = 0, 1, 2, 3   # positions of the fields in a linked-list node

    cache = {}
    hits = misses = 0
    full = False
    cache_get = cache.get  # bound method for looking up a key
    lock = RLock()  # linked-list updates are not thread-safe, and the function may be called from multiple threads
    root = []   # root of the doubly circular linked list
    root[:] = [root, root, None, None]  # initialize the doubly circular linked list

    if maxsize == 0:
        # No caching: every call increments the miss count
        def wrapper(*args, **kwds):
            # rebind the free variable misses
            nonlocal misses
            result = user_function(*args, **kwds)
            misses += 1
            return result

    elif maxsize is None:
        # Simple caching with no ordering and no size limit,
        # i.e. caching directly in a dictionary
        def wrapper(*args, **kwds):
            nonlocal hits, misses
            key = make_key(args, kwds, typed)
            result = cache_get(key, sentinel)
            if result is not sentinel:
                hits += 1
                return result
            result = user_function(*args, **kwds)
            cache[key] = result
            misses += 1
            return result

    else:
        # This is the size-limited lru_cache. (I think a size-limited plain
        # cache would also be worth adding.)
        def wrapper(*args, **kwds):
            nonlocal root, hits, misses, full
            key = make_key(args, kwds, typed)
            with lock:
                link = cache_get(key)
                # On a hit, move the hit node to the front. Each node stores
                # PREV, NEXT, KEY, RESULT, so first splice the node's two
                # neighbours together
                if link is not None:
                    # take this node out of the list
                    link_prev, link_next, _key, result = link
                    link_prev[NEXT] = link_next
                    link_next[PREV] = link_prev
                    # put the node at the front, just before root
                    last = root[PREV]
                    last[NEXT] = root[PREV] = link
                    link[PREV] = last
                    link[NEXT] = root
                    hits += 1
                    return result
            result = user_function(*args, **kwds)
            # On a miss, put the result into the cache; updates must hold the lock
            with lock:
                if key in cache:
                    # The key is already in the cache: another thread
                    # inserted it while we were computing the result
                    pass
                elif full:
                    # The cache is full, so evict the least recently used
                    # node, which is the one right after root in the
                    # doubly circular linked list.
                    # Concretely: the root node becomes the new node, and
                    # the node after root becomes the new root
                    oldroot = root
                    oldroot[KEY] = key
                    oldroot[RESULT] = result
                    # reusing the root node as the new node only requires
                    # swapping in the new key and result
                    root = oldroot[NEXT]
                    # remember oldkey so it can be deleted from the cache dict
                    oldkey = root[KEY]
                    oldresult = root[RESULT]
                    root[KEY] = root[RESULT] = None
                    # with the new root in place, delete the old key from the dict
                    del cache[oldkey]
                    cache[key] = oldroot
                else:
                    # The cache is not full yet: just add the new node
                    last = root[PREV]
                    # the new node goes between root and the node before root
                    link = [last, root, key, result]
                    last[NEXT] = root[PREV] = cache[key] = link
                    full = (len(cache) >= maxsize)
                misses += 1
            return result

    def cache_info():
        """Report cache statistics"""
        with lock:
            # these counters may be mid-update, so take the lock before reading
            return _CacheInfo(hits, misses, maxsize, len(cache))

    def cache_clear():
        """Clear the cache and cache statistics"""
        nonlocal hits, misses, full
        with lock:
            cache.clear()
            root[:] = [root, root, None, None]
            hits = misses = 0
            full = False

    wrapper.cache_info = cache_info
    wrapper.cache_clear = cache_clear
    return wrapper

The cache-full case works fine, but if I were writing it I would delete the node after root and insert the new node in front of root, rather than reusing the root node itself. Also, is this author left-handed, preferring to go around the list counterclockwise?
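For completeness, a quick demonstration of the real lru_cache decorator and its cache_info()/cache_clear() interface:

```python
from functools import lru_cache

@lru_cache(maxsize=2)
def add(a, b):
    return a + b

add(1, 2)   # miss
add(1, 2)   # hit: served from the cache
add(3, 4)   # miss

print(add.cache_info())
# CacheInfo(hits=1, misses=2, maxsize=2, currsize=2)
```

cache_clear() resets both the cache contents and the statistics, exactly as in the cache_clear closure shown above.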

ORDEREDDICT related

Another use of a doubly circular linked list is OrderedDict, where the linked list records the order in which keys were added to the dictionary. In practice, though, the version actually used is the one implemented in C:

try:
    from _collections import OrderedDict
except ImportError:
    # Leave the pure Python version in place.
    pass
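The ordering behaviour can be seen directly, including move_to_end(), which is exactly the "move to the front/back" primitive an LRU needs:

```python
from collections import OrderedDict

d = OrderedDict()
d['a'] = 1
d['b'] = 2
d['c'] = 3
d.move_to_end('a')     # 'a' becomes the most recently added/used
print(list(d))         # ['b', 'c', 'a']
```

popitem(last=False) pops from the opposite end, i.e. the oldest entry, which is the eviction side of an LRU.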

Here is the source code of OrderedDict:

class OrderedDict(dict):
    '''Dictionary that remembers insertion order.
    The internal self.__map dictionary maps each key to its node in a doubly
    circular linked list. The list starts and ends with a sentinel element
    that is never deleted; it lives in self.__hardroot, and self.__root holds
    a weak reference proxy to self.__hardroot.
    The prev links are weak reference proxies, to prevent reference cycles.
    Hard links to the individual linked-list nodes are kept in self.__map;
    those hard links disappear when a key is deleted.
    '''
    def __init__(*args, **kwds):
        '''Initialize an ordered dictionary.  The signature is the same as
        regular dictionaries, but keyword arguments are not recommended because
        their insertion order is arbitrary.

        '''
        if not args:
            raise TypeError("descriptor '__init__' of 'OrderedDict' object "
                            "needs an argument")
        self, *args = args
        if len(args) > 1:
            raise TypeError('expected at most 1 arguments, got %d' % len(args))
        try:
            self.__root
        except AttributeError:
            # the linked-list node is a class with restricted attributes
            self.__hardroot = _Link()
            self.__root = root = _proxy(self.__hardroot)
            root.prev = root.next = root
            self.__map = {}
        self.__update(*args, **kwds)

Here _Link is a class with restricted attributes, defined as follows:

class _Link(object):
    __slots__ = 'prev', 'next', 'key', '__weakref__'

_proxy comes from the line from _weakref import proxy as _proxy, and it creates a weak reference to an object. Unlike an ordinary reference, an object that still has a regular reference will not be destroyed by the garbage collector, but an object that only has weak references left may be collected.

def proxy(p_object, callback=None):
    """
    Create a proxy object that weakly references p_object
    """
    pass
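A small demonstration of weak reference proxies, using the public weakref module rather than the internal _weakref: once the last hard reference is gone, accessing the proxy raises ReferenceError:

```python
import weakref

class Link:
    # same shape as OrderedDict's _Link; '__weakref__' in __slots__
    # is required for a slotted class to be weakly referenceable
    __slots__ = ('prev', 'next', 'key', '__weakref__')

hard = Link()
hard.key = 'a'
p = weakref.proxy(hard)
print(p.key)        # the proxy behaves like the object itself: 'a'

del hard            # only the weak proxy remains; CPython collects the object
try:
    p.key
except ReferenceError:
    print('referent is gone')
```

This is why OrderedDict keeps one hard reference (self.__hardroot) to the sentinel: the weak proxy in self.__root alone would not keep it alive.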

Continuing with OrderedDict:

def __setitem__(self, key, value,
                dict_setitem=dict.__setitem__, proxy=_proxy, Link=_Link):
    # When setting a value, first check whether the key is already in the
    # dictionary. If it is, there is no need to add it to the linked list
    # again: the original order is kept.
    # If it is not, add it to the linked list, again placing the new node
    # just before root
    if key not in self:
        self.__map[key] = link = Link()
        root = self.__root
        last = root.prev
        link.prev, link.next, link.key = last, root, key
        last.next = link
        # the prev link is a proxy
        root.prev = proxy(link)
    dict_setitem(self, key, value)

def __delitem__(self, key, dict_delitem=dict.__delitem__):
    dict_delitem(self, key)
    link = self.__map.pop(key)
    link_prev = link.prev
    link_next = link.next
    link_prev.next = link_next
    link_next.prev = link_prev
    link.prev = None
    link.next = None

def __iter__(self):
    root = self.__root
    curr = root.next
    while curr is not root:
        yield curr.key
        curr = curr.next

def __reversed__(self):
    root = self.__root
    curr = root.prev
    while curr is not root:
        yield curr.key
        curr = curr.prev

def clear(self):
    root = self.__root
    root.prev = root.next = root
    self.__map.clear()
    dict.clear(self)

I will not go through the other methods one by one, but from these it is clear that wherever element order matters, the doubly circular linked list sees wide use.
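As a closing sketch: since OrderedDict exposes move_to_end() and popitem(last=False), an LRU cache can be built directly on top of it in a few lines. This is a common recipe, not code from the sources shown above, and the class name is made up:

```python
from collections import OrderedDict

class OrderedDictLRU:
    """An LRU cache built on OrderedDict's O(1) reordering primitives."""
    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)  # drop the least recently used
```

Under the hood this relies on the same dict-plus-doubly-circular-linked-list combination that both lru_cache and OrderedDict use.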
