On Distributed Caching: Understanding What a Cache Is


PS: Before we dig into distributed caching, let's first look at what caching actually is.

1. Overview

1.1 Concept

A cache is a hardware or software component that stores data so that future requests for that data can be served faster. The data in a cache may be the result of an earlier computation or a copy of data stored elsewhere.

1.2 Function

Caching is an important component of distributed systems. It mainly addresses the performance of hot-data access under high concurrency and large data volumes, providing fast, high-performance access to data.

1.3 Principle

  • Store data on faster storage media
  • Cache data closest to the application
  • Cache data closest to the user

2. Classification of caches

Caches are mainly divided into:

  • CDN cache
  • Reverse proxy cache
  • Local application cache
  • Distributed cache

2.1 CDN cache

The basic principle of a CDN (Content Delivery Network) is to deploy cache servers widely, placing them in regions or networks where user access is concentrated. When a user visits, global load balancing directs the request to the nearest working cache server, and that cache server responds to the request directly.

Application scenarios

Mainly used to cache static resources such as images and videos.

Advantages

  • Mirroring service
    Eliminates the bottlenecks caused by interconnection between different carriers, achieving cross-carrier network acceleration and ensuring that users on different networks all get good access quality
  • Remote acceleration
    DNS load balancing automatically and intelligently selects the fastest cache server for remote users, speeding up remote access
  • Bandwidth optimization
    Automatically generates remote mirror cache servers of the origin server; remote users read data from the cache server, which reduces remote-access bandwidth, shares network traffic, and lightens the load on the origin WEB server
  • Cluster anti-attack
    Widely distributed CDN nodes plus an intelligent redundancy mechanism between nodes can effectively prevent hacker intrusion and reduce the impact of DoS attacks on the website, while maintaining good quality of service

2.2 Reverse proxy cache

The reverse proxy sits in the application's server room and handles all requests to the web server.
If the page a user requests is cached on the proxy server, the proxy sends the cached content directly to the user. If it is not cached, the proxy first sends a request to the WEB server, retrieves the data, caches it locally, and then sends it to the user. By reducing the number of requests to the WEB server, the load on the WEB server is reduced.

Application scenarios

Generally caches only small static resources, such as CSS, JS, and images.

Open-source implementations include Varnish, Nginx, and Squid.

2.3 Local application cache

It refers to cache components inside the application. Its biggest advantage is that the application and the cache live in the same process: cache requests are very fast, with no network overhead. Local caching fits scenarios where nodes do not need to notify one another;
its disadvantages are that the cache is coupled to the application, multiple applications cannot share the cache directly, and each application or each cluster node must maintain its own separate cache, which wastes memory.

Application scenarios

Caching commonly used data such as dictionary tables

Cache media

  • Hard disk cache
    Caches data on the hard disk and reads it from disk. The principle is to read local files directly, which avoids network transfer and is much faster than reading the database over the network.
    Applicable scenarios: speed requirements are not especially high, but a large amount of cache storage is needed
  • Memory cache
    Stores data directly in local memory, with the program maintaining the cache objects itself. This is the fastest way to access cached data

Implementations

1. Hand-rolled in code

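A local cache is usually nothing more than an in-process map with an eviction rule. Here is a minimal sketch of a hand-rolled LRU cache built on Java's LinkedHashMap; the class name and capacity are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A minimal LRU cache built on LinkedHashMap's access-order mode. */
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        // accessOrder = true: iteration order goes from least to most recently used
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once capacity is exceeded
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a", so "b" is now the eldest
        cache.put("c", "3"); // evicts "b"
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

For concurrent use you would wrap it with Collections.synchronizedMap or reach for a library cache instead.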

2. Ehcache

Application scenarios:

  • 1. A single application, or an application with demanding cache-access requirements
  • 2. Fine for simple cache sharing, but not suitable when cache recovery or large-scale data caching is involved
  • 3. Not suitable for large systems that need cache sharing, distributed deployment, or very large cache contents

Main features of Ehcache

Cache data expiration policy


  • 1. FIFO
    First in, first out
  • 2. LFU
    Least frequently used: cached elements carry a hit attribute, and the element with the smallest hit count is cleared from the cache
  • 3. LRU
    Least recently used: cached elements carry a timestamp. When the cache is full and room must be made for new elements, the element whose timestamp is furthest from the current time is cleared from the cache

Ehcache's expired-data elimination mechanism

Lazy elimination: every time data is put into the cache, a timestamp is stored with it. On reads, the timestamp is compared against the configured TTL to determine whether the entry has expired.
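A minimal sketch of this lazy, read-time expiration check (class and field names are illustrative, not Ehcache's internals):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of lazy (read-time) expiration: each entry stores its write time
    and is checked against the TTL only when it is read. */
public class LazyTtlCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long writtenAt;
        Entry(V value, long writtenAt) { this.value = value; this.writtenAt = writtenAt; }
    }

    private final Map<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public LazyTtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public void put(K key, V value) {
        store.put(key, new Entry<>(value, System.currentTimeMillis()));
    }

    public V get(K key) {
        Entry<V> e = store.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() - e.writtenAt > ttlMillis) {
            store.remove(key); // expired: evicted lazily, only on access
            return null;
        }
        return e.value;
    }
}
```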


With this background in place, let's talk about distributed caching.

3. Distributed cache

It refers to a cache component or service that is separate from the application. Its biggest advantage is that it is an independent application or service in its own right, isolated from the local application, so multiple applications can share the cache directly.

Main application scenarios

  • 1. Caching data that is expensive to compute or assemble
  • 2. Storing frequently accessed hot data to reduce pressure on the database

Main access methods

  • 1. A caching layer implemented in application code
  • 2. Independently deployed caching middleware

The following introduces two common open-source distributed caches: Memcached and Redis.

3.1 Memcached

Basic introduction

Memcached is a high-performance distributed memory object caching system. By maintaining one huge unified hash table in memory, it can store data in various formats, including images, videos, files, and database query results.
Simply put, data is loaded into memory and then read from memory, greatly improving read speed.

Features

  • 1. Uses physical memory as the cache area and can run standalone on a server.
    Each process can use at most 2 GB. To cache more data, you can start additional Memcached processes (on different ports) or use Memcached distributed across different physical or virtual machines.
  • 2. Stores data as key-value pairs. This single-index data organization makes lookups of data items O(1).
  • 3. Simple protocol: the text-line-based protocol means you can access data on a memcached server directly via telnet. It is simple and convenient, and many other caches model their protocols on it.
  • 4. High-performance communication based on libevent: libevent is a C library that wraps event mechanisms such as BSD's kqueue and Linux's epoll into a single interface, improving performance over traditional select.
  • 5. Built-in memory management: all data is stored in memory, so access is faster than disk. When memory is full, unused cache entries are automatically removed with the LRU algorithm. Durability is not considered: restart the service and all data is lost.
  • 6. Distributed: Memcached servers do not communicate with one another; each accesses its data independently and shares no information. The server side is not distributed; distributed deployment depends on the Memcached client.
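Working with Memcached from Java typically goes through a client library. A minimal sketch using the spymemcached client; the host, port, TTL, and key names are illustrative:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class MemcachedExample {
    public static void main(String[] args) throws Exception {
        // Connect to a memcached server (assumed to run on localhost:11211)
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        // Store a value with a 900-second expiration
        client.set("user:42", 900, "alice").get(); // .get() waits for the async op

        // Read it back; returns null once the entry expires or is evicted
        String name = (String) client.get("user:42");
        System.out.println(name);

        client.shutdown();
    }
}
```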

Basic Architecture


Cache data expiration policy

Memcached uses an LRU (Least Recently Used) expiration and invalidation policy. When storing an item you can specify its expiration time in the cache (permanent by default). When the Memcached server runs out of allocated memory, expired data is replaced first, followed by the least recently used data.

Internal implementation of data elimination

Lazy elimination: every time data is put into the cache, a timestamp is stored with it. On reads, the timestamp is compared against the configured TTL to determine whether the entry has expired (the same read-time check sketched above for Ehcache).

Distributed Cluster Implementation

The server side has no "distributed" functionality. Each server is a completely independent, isolated service; Memcached's distribution is implemented by the client program.

Data storage steps:

  • 1. The client selects a server according to a distribution algorithm
  • 2. The client communicates with that server to store or fetch the data

Distribution algorithms

Remainder (modulo) algorithm and hash algorithm (typically consistent hashing)
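A minimal sketch of client-side server selection with the remainder algorithm; the server addresses are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

/** Sketch of client-side server selection with the remainder (modulo) algorithm. */
public class RemainderSelector {
    private final List<String> servers; // e.g. ["10.0.0.1:11211", "10.0.0.2:11211"]

    public RemainderSelector(List<String> servers) { this.servers = servers; }

    public String pick(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // The same key always maps to the same server. The drawback: adding or
        // removing a server remaps most keys, which is what consistent hashing mitigates.
        return servers.get((int) (crc.getValue() % servers.size()));
    }

    public static void main(String[] args) {
        RemainderSelector sel = new RemainderSelector(
                List.of("10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"));
        System.out.println(sel.pick("user:42"));
    }
}
```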

3.2 Redis

Basic introduction

Redis is a remote in-memory database (a non-relational database) with strong performance, replication features, and a unique data model.
It stores mappings from keys to five different types of values, can persist the key-value data held in memory to disk, can use replication to scale read performance, and can use client-side sharding to scale write performance.
Built in: replication, Lua scripting, LRU eviction, transactions, and several levels of disk persistence, with high availability provided through Redis Sentinel and automatic partitioning through Redis Cluster.

Data model

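A minimal sketch of the five value types through the Jedis client; the key names are illustrative:

```java
import redis.clients.jedis.Jedis;

public class RedisDataModelExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("page:home:title", "Hello");        // String
            jedis.hset("user:42", "name", "alice");       // Hash
            jedis.lpush("queue:jobs", "job-1", "job-2");  // List
            jedis.sadd("tags:post:7", "redis", "cache");  // Set
            jedis.zadd("leaderboard", 1500, "alice");     // Sorted Set (score 1500)

            System.out.println(jedis.hget("user:42", "name")); // -> alice
        }
    }
}
```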

Data eviction policies

  • volatile-lru: evict the least recently used keys from the set of keys with an expiration time set (server.db[i].expires)
  • volatile-ttl: evict the keys with the shortest remaining time-to-live from the set of keys with an expiration time set (server.db[i].expires)
  • volatile-random: evict random keys from the set of keys with an expiration time set (server.db[i].expires)
  • allkeys-lru: evict the least recently used keys from the whole keyspace (server.db[i].dict)
  • allkeys-random: evict random keys from the whole keyspace (server.db[i].dict)
  • noeviction: never evict data
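The policy is controlled by the maxmemory-policy directive. A minimal sketch of setting it at runtime through Jedis (a local Redis instance and the 100 MB cap are assumptions for illustration):

```java
import redis.clients.jedis.Jedis;

public class EvictionPolicyExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Cap memory at 100 MB and evict the least recently used keys
            // across the whole keyspace once the cap is reached.
            jedis.configSet("maxmemory", "100mb");
            jedis.configSet("maxmemory-policy", "allkeys-lru");

            System.out.println(jedis.configGet("maxmemory-policy"));
        }
    }
}
```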

Internal implementation of data elimination

  • 1. Passive (lazy) way:
    • If a key is found to have expired when it is accessed, it is deleted
    • Trigger timing: invoked by all read commands, such as GET, MGET, HGET, and LRANGE
  • 2. Active (periodic) way:
    • Periodically, a portion of expired keys is selected from the keys with an expiration time set and deleted
    • Trigger timing: Redis's time events, i.e. the server is interrupted at intervals to carry out maintenance operations
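A minimal sketch of expiration as seen from the client side; key names and the TTL are illustrative:

```java
import redis.clients.jedis.Jedis;

public class ExpireExample {
    public static void main(String[] args) throws InterruptedException {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.setex("session:abc", 2, "payload");     // SET with a 2-second TTL
            System.out.println(jedis.ttl("session:abc")); // remaining TTL in seconds

            Thread.sleep(2500);
            // The next access finds the key expired (passive deletion) -> null
            System.out.println(jedis.get("session:abc"));
        }
    }
}
```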

Persistence

  • RDB (Redis DataBase): the default persistence scheme; snapshots of the database are periodically saved to disk in binary form
  • AOF (Append Only File): records every command and argument written to the database to the AOF file as protocol text, thereby recording the state of the database
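A minimal sketch of toggling both persistence modes at runtime via Jedis; in practice these settings usually live in redis.conf:

```java
import redis.clients.jedis.Jedis;

public class PersistenceExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // RDB: snapshot if at least 1 key changed within 900 seconds
            jedis.configSet("save", "900 1");
            jedis.bgsave(); // ask the server to fork and write an RDB snapshot now

            // AOF: append every write command to the append-only file
            jedis.configSet("appendonly", "yes");
            jedis.configSet("appendfsync", "everysec"); // fsync once per second
        }
    }
}
```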

Partial analysis of the underlying implementation

  • Partial illustration of the startup process

  • Partial illustration of server-side persistence

  • Underlying hash table implementation (progressive rehash)

    • Dictionary initialization
    • Dictionary element layout
    • Rehash execution process
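The diagrams are not reproduced here, but the idea behind progressive rehash is simple enough to sketch: the dictionary keeps two bucket tables and migrates one bucket per operation instead of rehashing everything at once. A toy sketch in Java, not Redis's actual C code:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Toy sketch of progressive rehashing in the spirit of Redis's dict:
    two bucket tables coexist, and each operation migrates one bucket. */
public class ProgressiveRehashMap {
    private record Pair(String key, String value) {}

    private List<LinkedList<Pair>> oldTable = newTable(4);
    private List<LinkedList<Pair>> newTableBuckets = null;
    private int rehashIndex = -1; // -1 means "not rehashing"

    private static List<LinkedList<Pair>> newTable(int size) {
        List<LinkedList<Pair>> t = new ArrayList<>(size);
        for (int i = 0; i < size; i++) t.add(new LinkedList<>());
        return t;
    }

    public void put(String key, String value) {
        stepRehash(); // amortize migration across normal operations
        List<LinkedList<Pair>> target = rehashIndex >= 0 ? newTableBuckets : oldTable;
        target.get(bucket(key, target.size())).add(new Pair(key, value));
    }

    public void startRehash() {
        newTableBuckets = newTable(oldTable.size() * 2);
        rehashIndex = 0;
    }

    private void stepRehash() {
        if (rehashIndex < 0) return;
        // Move every entry in one old bucket over to the new table
        for (Pair p : oldTable.get(rehashIndex)) {
            newTableBuckets.get(bucket(p.key(), newTableBuckets.size())).add(p);
        }
        oldTable.get(rehashIndex).clear();
        if (++rehashIndex == oldTable.size()) { // done: swap tables
            oldTable = newTableBuckets;
            newTableBuckets = null;
            rehashIndex = -1;
        }
    }

    private static int bucket(String key, int size) {
        return (key.hashCode() & 0x7fffffff) % size;
    }
}
```

During a rehash, lookups must consult both tables; Redis's real dict does exactly that, and also bounds the amount of rehash work done per cycle.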

Redis cache design principles

  • Put only hot data in the cache
  • Spread out cache expiration times to avoid many keys expiring at once
  • Cache keys should be readable
  • Avoid cache keys with the same name across different services
  • Keys can be abbreviated appropriately to save memory
  • Choose the appropriate data structure
  • Make sure the data written to the cache is complete and correct
  • Avoid time-consuming commands, such as KEYS *
  • The value stored under a single key should not be too large
  • Guard against cache penetration
  • Warm up the cache appropriately
  • Read order: cache first, then the database
  • Write order: database first, then the cache (see the sketch below)
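A minimal sketch of the read/write ordering above (the cache-aside pattern), with a hypothetical Database interface standing in for the real data layer:

```java
import redis.clients.jedis.Jedis;

public class CacheAside {
    /** Hypothetical data-access layer standing in for a real DAO. */
    interface Database {
        String load(String id);
        void save(String id, String value);
    }

    private final Jedis jedis;
    private final Database db;

    CacheAside(Jedis jedis, Database db) { this.jedis = jedis; this.db = db; }

    /** Read: cache first, then the database; backfill the cache on a miss. */
    String read(String id) {
        String cached = jedis.get("item:" + id);
        if (cached != null) return cached;
        String value = db.load(id);
        if (value != null) {
            jedis.setex("item:" + id, 300, value); // TTL keeps stale data from lingering
        }
        return value;
    }

    /** Write: database first, then the cache (here: invalidate the entry,
        a common choice so the next read repopulates it from the database). */
    void write(String id, String value) {
        db.save(id, value);
        jedis.del("item:" + id);
    }
}
```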

3.3 Comparison of Redis and Memcached

  • Supported data structures: Redis offers String, Hash, List, Set, and Sorted Set; Memcached is plain key-value
  • Persistence: Redis supports it; Memcached does not
  • High availability: Redis natively supports clustering, master-slave replication, and read-write separation, and the official Sentinel tool provides master-slave monitoring and automatic failover, all transparent to clients with no code changes or manual intervention; Memcached requires secondary development
  • Maximum value size: 512 MB for Redis; 1 MB for Memcached
  • Memory allocation: Redis requests space on demand, which can lead to fragmentation; Memcached manages memory with pre-allocated pools, saving allocation time
  • Virtual memory: Redis (in older versions) had its own VM mechanism that could theoretically hold more data than physical memory, swapping cold data to disk when exceeded; Memcached keeps all data in physical memory
  • Network model: both use a non-blocking I/O multiplexing model; Redis additionally provides sorting and aggregation features beyond plain KV access, and these CPU-heavy computations can block the entire event loop
  • Horizontal scaling: neither offers it out of the box on the server side (Redis has since added Redis Cluster; Memcached relies on client-side sharding)
  • Threading: Redis executes commands on a single thread; Memcached supports multithreading and makes better use of the CPU than Redis
  • Expiration policy: Redis uses a dedicated periodic task to clear expired data; Memcached uses lazy elimination, comparing the stored timestamp against the TTL on each read
  • Single-node QPS: roughly 100,000 for Redis; roughly 600,000 for Memcached
  • Source code readability: Redis code is clean and concise; Memcached's code, perhaps because of extensive scalability and cross-platform compatibility work, is less tidy
  • Applicable scenarios: Redis for complex data structures, persistence, high availability requirements, or large values; Memcached for pure KV workloads with very large data volumes and very high concurrency

Next article: Distributed Cache - Common Problems and Solutions in Cache Architecture Design

Origin blog.csdn.net/zhiyikeji/article/details/123467257