[Architecture] High-performance design methods for back-end service architecture

"N is high and N can be", high performance, high concurrency, high availability, high reliability, scalability, maintainability, availability, etc. are familiar words in background development, and some of them express similar meanings in most cases. This series of articles aims to discuss and summarize the commonly used technologies and methods in background architecture design, and summarize them into a set of methodology.

Foreword

This article discusses and summarizes high-performance techniques and methods in service architecture design. As shown in the mind map below, the left half leans toward programming techniques and the right half toward component usage; the article follows the structure of the figure.


1. Lock-free

In most cases multithreading improves concurrency, but if shared resources are handled poorly, severe lock contention can instead degrade performance. For such situations some scenarios adopt a lock-free design, especially in low-level frameworks. There are two main flavors of lock-free design: serial lock-free and lock-free data structures.

1.1. Serial lock-free

The simplest serial lock-free implementation is probably the single-threaded model; Redis and Nginx both take this approach. In the conventional network programming model, the main thread handles I/O events and pushes the data it reads into a queue, while worker threads take data out of the queue and process it. This semi-synchronous/semi-asynchronous model requires locking the queue, as shown in the figure below:


The model above can be changed into a lock-free serial form: when the MainReactor accepts a new connection, it picks one of the SubReactors to register it with and creates a Channel bound to that I/O thread. From then on, all reads, writes and processing for that connection execute on the same thread, with no synchronization required.
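
The sketch below illustrates the idea in simplified form: connections are assigned round-robin to worker threads, and each worker only ever touches the connections it owns, so no mutex is needed for per-connection state. It is a toy simulation, not a real Reactor; an actual implementation would run an epoll loop per SubReactor and hand new connections to the chosen loop through a wakeup fd.

#include <cstdio>
#include <thread>
#include <vector>

// Simulated connection state, owned by exactly one worker thread.
struct Connection {
    int fd;
    long bytesHandled = 0;
};

int main() {
    const int kWorkers = 4;
    const int kConnections = 16;

    // "MainReactor": assign each accepted connection to a SubReactor/worker
    // round-robin (done up front here for simplicity).
    std::vector<std::vector<Connection>> perWorker(kWorkers);
    for (int fd = 0; fd < kConnections; ++fd)
        perWorker[fd % kWorkers].push_back(Connection{fd});

    // Each "SubReactor" loops only over the connections it owns, so reads,
    // writes and business logic for a connection all stay on one thread.
    std::vector<std::thread> workers;
    for (int w = 0; w < kWorkers; ++w) {
        workers.emplace_back([&perWorker, w] {
            for (auto &conn : perWorker[w])
                conn.bytesHandled += 128;  // pretend we read and processed data
            std::printf("worker %d handled %zu connections\n",
                        w, perWorker[w].size());
        });
    }
    for (auto &t : workers) t.join();
    return 0;
}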


1.2. Lock-free data structures

Hardware-supported atomic operations can be used to implement lock-free data structures. Many languages provide CAS (compare-and-swap) atomic operations (such as the atomic package in Go and the atomic library in C++11), which can be used to implement lock-free queues. Let's look at the difference between lock-free programming and ordinary locking through a simple thread-safe singly linked list insertion.

#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
using namespace std;

template<typename T>
struct Node
{
    Node(const T &value) : data(value) {}
    T data;
    Node *next = nullptr;
};

There is a locked list WithLockList:

template<typename T>
class WithLockList
{
    mutex mtx;
    Node<T> *head = nullptr;
public:
    void pushFront(const T &value)
    {
        auto *node = new Node<T>(value);
        lock_guard<mutex> lock(mtx); // ① lock before touching the shared list
        node->next = head;
        head = node;
    }
};

Unlocked list LockFreeList:

template<typename T>
class LockFreeList
{
    atomic<Node<T> *> head{nullptr};
public:
    void pushFront(const T &value)
    {
        auto *node = new Node<T>(value);
        node->next = head.load();
        while (!head.compare_exchange_weak(node->next, node)); // ② CAS loop
    }
};

As the code shows, the locked version takes a lock at ①, while the lock-free version uses the atomic CAS operation compare_exchange_weak at ②. This function returns true if the store succeeds. Because compare_exchange_weak can fail spuriously (the store may fail even when the current value equals the expected value, mainly on hardware that lacks a single compare-and-swap instruction), the CAS is usually placed in a loop; on failure it also refreshes the expected value (node->next here), which is why the loop needs no extra reload of head.

Here's a simple performance comparison of the locked and unlocked versions, each performing 1,000,000 push operations. The test code is as follows:

int main()
{
    const int SIZE = 1000000;

    // locked test
    auto start = chrono::steady_clock::now();
    WithLockList<int> wlList;
    for(int i = 0; i < SIZE; ++i)
    {
        wlList.pushFront(i);
    }
    auto end = chrono::steady_clock::now();
    chrono::duration<double, std::micro> micro = end - start;
    cout << "with lock list costs micro:" << micro.count() << endl;

    // lock-free test
    start = chrono::steady_clock::now();
    LockFreeList<int> lfList;
    for(int i = 0; i < SIZE; ++i)
    {
        lfList.pushFront(i);
    }
    end = chrono::steady_clock::now();
    micro = end - start;
    cout << "free lock list costs micro:" << micro.count() << endl;

    return 0;
}

The output of three runs is shown below; the lock-free version performs slightly better than the locked version.

with lock list costs micro:548118

free lock list costs micro:491570

with lock list costs micro:556037

free lock list costs micro:476045

with lock list costs micro:557451

free lock list costs micro:481470

2. Zero copy

The copying discussed here is the transfer of data between kernel buffers and application buffers, not memory copies within a process's address space (zero copy can be pursued there too, for example by passing references or using C++ move semantics). Suppose we have a service that lets users download a file: when a request arrives, we send data from the server's disk out to the network. Pseudocode for this process:

filefd = open(...);       // open the file
sockfd = socket(...);     // open the socket
buffer = new buffer(...); // create a buffer
read(filefd, buffer);     // read the file contents into the buffer
write(sockfd, buffer);    // send the buffer contents to the network

The data copy process is as follows:


The green arrows in the figure indicate DMA copies. DMA (Direct Memory Access) is a mechanism for fast data transfer that lets peripheral devices exchange data with system memory directly, without going through the CPU. The red arrows indicate CPU copies. Even with DMA there are four copies in total: two DMA copies and two CPU copies.

2.1. Memory mapping

Memory mapping maps a region of user-space memory onto kernel space: modifications made by the user program are directly visible in kernel space, and modifications made in kernel space are directly visible to the user program, so both sides share the same kernel buffer.

The pseudocode after rewriting using memory mapping is as follows:

filefd = open(...);    // open the file
sockfd = socket(...);  // open the socket
buffer = mmap(filefd); // map the file into the process address space
write(sockfd, buffer); // send the buffer contents to the network

The data copy flow after using memory mapping is shown in the following figure:


As the figure shows, with memory mapping the number of copies drops to three, and the data in the kernel buffer is no longer copied through the application on its way to the socket buffer. RocketMQ uses memory mapping for high-performance message storage, splitting the store into multiple fixed-size files and writing them sequentially through the mapping.

2.2. Zero copy

Zero copy avoids having the CPU copy data from one storage area to another, which greatly improves data transfer efficiency. Since Linux kernel 2.4, transfers with DMA scatter-gather support can package the data in the kernel page cache and send it to the network directly. Pseudocode:

filefd = open(...);       // open the file
sockfd = socket(...);     // open the socket
sendfile(sockfd, filefd); // send the file contents to the network
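
On Linux this corresponds to the sendfile(2) system call. A minimal sketch is shown below, assuming the socket has already been connected elsewhere; retry handling for EINTR/EAGAIN is omitted:

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

// Send a whole file to a connected socket via sendfile(2): the data flows
// page cache -> socket buffer without ever entering user space.
bool sendFileZeroCopy(int sockfd, const char *path) {
    int filefd = open(path, O_RDONLY);
    if (filefd < 0) return false;

    struct stat st{};
    if (fstat(filefd, &st) < 0) { close(filefd); return false; }

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t sent = sendfile(sockfd, filefd, &offset, st.st_size - offset);
        if (sent <= 0) { close(filefd); return false; }  // real code should retry on EINTR/EAGAIN
    }
    close(filefd);
    return true;
}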

The process after using zero copy is as follows:

zero copy

The steps of zero copy are:

1) DMA copies the data to the kernel buffer of the DMA engine;

2) Add the descriptor of the position and length information of the data to the socket buffer;

3) The DMA engine directly transfers data from the kernel buffer to the protocol engine;

As you can see, zero copy does not mean no copying at all: there are still two DMA copies involving the kernel buffer, but the CPU copies between the kernel buffer and the user buffer are eliminated. The main zero-copy functions in Linux are sendfile, splice, tee and so on. The figure below, from IBM's website, compares normal transfer with zero-copy transfer; zero copy is roughly three times faster. Kafka also uses zero-copy technology.


3. Serialization

Writing data to a file, sending it over the network or persisting it to storage usually requires serialization, and reading it back requires deserialization; the pair is also called encoding (encode) and decoding (decode). As a representation of the data being transmitted, serialization is decoupled from the network framework and the communication protocol: the network framework taf, for example, supports jce, JSON and custom serialization, and HTTP supports XML, JSON, streaming media and more.

There are many serialization methods. As the basis of data transmission and storage, how to choose the appropriate serialization method is particularly important.

3.1. Classification

Generally speaking, serialization technologies can be roughly divided into the following three types:

  • Built-in type: refers to the type supported by the programming language, such as java.io.Serializable of java. This type is not universal due to language binding, and generally has poor performance, and is generally only used locally.

  • Text type: Generally, it is a standardized text format, such as XML and JSON. This type is more readable, supports cross-platform, and has a wide range of applications. The main disadvantage is that it is relatively bloated, and network transmission takes up a lot of bandwidth.

  • Binary type: binary encodings with a more compact data layout and multi-language, multi-platform support. Common examples are Protocol Buffer, Thrift, MessagePack and FlatBuffer. A small size-comparison sketch follows this list.
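
As a rough illustration of the size difference rather than a benchmark of any framework, the sketch below encodes the same record once as a hand-built JSON string and once as a naive packed binary layout; the Video struct and its fields are made up for this example:

#include <cstdint>
#include <cstdio>
#include <cstring>

struct Video {
    uint32_t id;
    uint32_t duration;   // seconds
    uint32_t likeCount;
};

int main() {
    Video v{10245, 360, 8888};

    // Text encoding: field names are repeated in every record.
    char json[128];
    int jsonLen = std::snprintf(json, sizeof(json),
        "{\"id\":%u,\"duration\":%u,\"likeCount\":%u}",
        static_cast<unsigned>(v.id), static_cast<unsigned>(v.duration),
        static_cast<unsigned>(v.likeCount));

    // Naive binary encoding: three fixed-width integers, no field names.
    unsigned char bin[12];
    std::memcpy(bin, &v.id, 4);
    std::memcpy(bin + 4, &v.duration, 4);
    std::memcpy(bin + 8, &v.likeCount, 4);

    std::printf("json bytes: %d, binary bytes: %zu\n", jsonLen, sizeof(bin));
    return 0;
}

Real binary frameworks such as Protocol Buffer go further with varint encoding, field tags and schema evolution, but the basic size advantage comes from dropping the repeated field names and textual digits.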

3.2. Performance indicators

There are three main indicators for measuring serialization/deserialization:

1) The byte size after serialization;

2) The speed of serialization/deserialization;

3) CPU and memory consumption;

The following figure shows the performance comparison of some common serialization frameworks:

Serialization/deserialization speed comparison; serialized byte size comparison

Protobuf arguably comes out on top for both serialization speed and encoded size. But there is always something better: FlatBuffer is said to beat even Protobuf. The chart below, from Google, compares FlatBuffer with other serialization libraries, and on the raw numbers FlatBuffer appears to crush Protobuf.


3.3. Selection considerations

When designing and selecting a serialization technology, there are many aspects to consider, mainly the following aspects:

1) Performance: CPU and byte size are the main overhead of serialization. High-performance and high-compression binary serialization should be selected for basic RPC communication, storage systems, and high-concurrency services. Some internal services and applications with fewer web requests can use textual JSON, and browsers directly support JSON built-in.

2) Ease of use: Rich data structures and auxiliary tools can improve ease of use and reduce the amount of business code development. Now many serialization frameworks support List, Map and other structures and readable printing.

  • Versatility: Modern services often involve multiple languages and platforms. Whether they can support cross-platform and cross-language intercommunication is the basic condition for serialization selection.

4) Compatibility: Modern services are fast iterated and upgraded. A good serialization framework should have good forward compatibility and support the addition, deletion and modification of fields.

5) Scalability: Whether the serialization framework can support custom formats with a low threshold is sometimes an important consideration.

4. Pooling

Pooling is probably the most commonly used technology. Its essence is to improve object reuse and reduce the overhead of repeated creation and destruction by creating pools. Commonly used pooling technologies include memory pools, thread pools, connection pools, and object pools.

4.1. Memory pool

We all know that in C/C++ memory is allocated with malloc/free and new/delete, which ultimately rely on system calls such as sbrk/brk. Making a system call for every allocation and release not only hurts performance but also easily causes memory fragmentation; memory pool techniques aim to solve both problems. For these reasons, C/C++ memory operations do not go straight to the system call but sit on top of their own memory management layer. There are three major malloc implementations:

1) ptmalloc: implementation of glibc.

2) tcmalloc: Google's implementation.

3) jemalloc: Facebook's implementation.

The chart below, from the Internet, compares the three mallocs. tcmalloc and jemalloc perform similarly, while ptmalloc lags behind both. We can choose the malloc that best fits our needs; Redis and MySQL, for example, let you specify which allocator to use. For the implementation details and differences among the three, see the material available online.


Although the standard library already adds a layer of memory management on top of the operating system's, applications usually implement their own special-purpose memory pools as well, for example for reference counting or small-object allocation. So there are generally three levels of memory management, as shown in the figure below.
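
As an illustration of such application-level pools, here is a minimal fixed-size small-object pool sketched under simplifying assumptions (single-threaded, ignores alignment beyond pointer size, never returns memory to the system):

#include <cstddef>
#include <vector>

// Fixed-size object pool: grab a big block up front, hand out equal-sized
// chunks from a free list, and avoid per-object malloc/free.
class FixedPool {
public:
    FixedPool(std::size_t objSize, std::size_t objsPerBlock)
        : objSize_(objSize < sizeof(void *) ? sizeof(void *) : objSize),
          objsPerBlock_(objsPerBlock) {}

    ~FixedPool() {
        for (char *block : blocks_) delete[] block;
    }

    void *allocate() {
        if (freeList_ == nullptr) grow();
        void *obj = freeList_;
        freeList_ = *static_cast<void **>(freeList_);  // pop the free list
        return obj;
    }

    void deallocate(void *obj) {
        *static_cast<void **>(obj) = freeList_;        // push back onto the free list
        freeList_ = obj;
    }

private:
    void grow() {
        char *block = new char[objSize_ * objsPerBlock_];
        blocks_.push_back(block);
        for (std::size_t i = 0; i < objsPerBlock_; ++i)  // thread chunks onto the free list
            deallocate(block + i * objSize_);
    }

    std::size_t objSize_;
    std::size_t objsPerBlock_;
    void *freeList_ = nullptr;
    std::vector<char *> blocks_;
};

Production allocators such as tcmalloc add per-thread caches, size classes and alignment handling on top of this basic free-list idea.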

4.2. Thread pool

Creating a thread requires allocating resources and therefore has a cost; creating a new thread for every task would inevitably hurt system performance. A thread pool limits the number of threads created and reuses them, which improves overall performance.

Thread pools can be classified or grouped: different kinds of tasks can use different thread groups, isolating them from one another. A common classification is core versus non-core: core threads always exist and are never reclaimed, while non-core threads may be reclaimed after being idle for a while to save system resources, and are created on demand and put back into the pool when needed.
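
A minimal general-purpose thread pool sketch is shown below (a fixed number of workers plus one locked task queue; real pools add core/non-core threads, priorities, bounded queues and shutdown policies):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx_);
                        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                        if (stop_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();  // run the task outside the lock
                }
            });
        }
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto &w : workers_) w.join();
    }

private:
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mtx_;
    std::condition_variable cv_;
    bool stop_ = false;
};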

4.3. Connection pool

Common connection pools include database connection pools, redis connection pools and TCP connection pools; their main purpose is to reduce the cost of creating and releasing connections by reusing them. A connection pool implementation usually has to consider the following issues (a small sketch follows this list):

1) Initialization: either eager initialization at startup or lazy initialization. Initializing at startup avoids some locking and the connections are ready when needed, but it may slow down startup or waste resources if no tasks arrive. Lazy initialization creates connections only when they are actually needed, which can reduce resource usage, but a sudden burst of requests that forces many connections to be created at once may make the system slow to respond or even fail to respond. In practice we usually initialize at startup.

2) Connection count: balance the number of connections. Too few and tasks are processed slowly; too many and tasks are not only still slow but system resources are consumed excessively.

3) Taking a connection: when the pool has no available connection, decide whether to wait for one to become free or to allocate a new temporary connection.

4) Returning a connection: when a connection is no longer needed and the pool is not full, put it back into the pool (including temporary connections created in 3); otherwise close it.

5) Connection health checks: long-idle and broken connections need to be closed and removed from the pool. Common approaches are checking on use and periodic checking.
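
A minimal generic pool sketch along the lines of points 1)–4) is shown below; the Conn type and its factory are placeholders for whatever client library is actually used, and the idle-connection checks from point 5) are omitted:

#include <functional>
#include <memory>
#include <mutex>
#include <queue>

template <typename Conn>
class ConnPool {
public:
    using Factory = std::function<std::shared_ptr<Conn>()>;

    // Point 1): eager initialization, create maxSize connections at startup.
    ConnPool(std::size_t maxSize, Factory factory)
        : maxSize_(maxSize), factory_(std::move(factory)) {
        for (std::size_t i = 0; i < maxSize_; ++i) idle_.push(factory_());
    }

    // Point 3): take an idle connection, or create a temporary one instead
    // of blocking when the pool is empty.
    std::shared_ptr<Conn> acquire() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (!idle_.empty()) {
            auto conn = idle_.front();
            idle_.pop();
            return conn;
        }
        return factory_();  // temporary connection beyond the pool size
    }

    // Point 4): put the connection back if the pool is not full, otherwise
    // drop it and let shared_ptr close it.
    void release(std::shared_ptr<Conn> conn) {
        std::lock_guard<std::mutex> lock(mtx_);
        if (idle_.size() < maxSize_) idle_.push(std::move(conn));
    }

private:
    std::size_t maxSize_;
    Factory factory_;
    std::queue<std::shared_ptr<Conn>> idle_;
    std::mutex mtx_;
};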

4.4. Object pool

Strictly speaking, every kind of pool is an application of the object pool pattern, including the three above. Like the others, an object pool caches objects to avoid creating large numbers of objects of the same type while limiting the number of instances. Redis, for example, shares the integer objects 0-9999 through an object pool. The pattern is also common in game development: monsters and NPCs appearing on the map are not recreated every time but taken from an object pool.

5. Concurrency

5.1. Request concurrency

If a task has to process multiple subtasks, the subtasks with no dependencies between them can be executed in parallel. This scenario is very common in back-end development. For example, a request needs to fetch three pieces of data, taking T1, T2 and T3 respectively: serial calls take T = T1 + T2 + T3, while executing the three in parallel takes T = max(T1, T2, T3). The same applies to write operations. Requests of the same type can also be batched and merged to reduce the number of RPC calls.
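
A sketch of the read case using std::async is shown below; the query functions and Response type are placeholders for what would really be RPC calls (the Future mechanism in section 6.1 achieves the same thing):

#include <future>
#include <string>

struct Response { std::string data; };

// Placeholders standing in for three independent back-end queries.
Response query1() { return {"user profile"}; }
Response query2() { return {"follow list"}; }
Response query3() { return {"recommendations"}; }

int main() {
    // Launch the three independent sub-queries concurrently.
    auto f1 = std::async(std::launch::async, query1);
    auto f2 = std::async(std::launch::async, query2);
    auto f3 = std::async(std::launch::async, query3);

    // Total wait is roughly max(T1, T2, T3) instead of T1 + T2 + T3.
    Response r1 = f1.get();
    Response r2 = f2.get();
    Response r3 = f3.get();
    (void)r1; (void)r2; (void)r3;
    return 0;
}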

5.2. Redundant requests

A redundant request means sending several identical requests to back-end services at the same time, using whichever responds first and discarding the rest. This strategy shortens the client's waiting time but multiplies the number of calls, so it generally suits initialization or low-volume request scenarios. The "racing" module of the company's WNS works this way: to establish a long-lived connection quickly, it sends requests to several back-end IP/port pairs simultaneously and uses whichever connects first. This is especially useful on mobile devices with weak networks, where a wait-timeout-retry scheme would greatly increase the user's waiting time.

6. Asynchronization

For processing time-consuming tasks, if the method of synchronous waiting is adopted, the throughput of the system will be seriously reduced, which can be solved by asynchronization. There are some differences in the concept of asynchrony at different levels, and we will not discuss asynchronous I/O here.

6.1. Asynchronous calls

When making a time-consuming RPC call or task processing, the commonly used asynchronous methods are as follows:

  • Callback: Asynchronous callback registers a callback function, and then initiates an asynchronous task. When the task is completed, the callback function registered by the user will be called back, thereby reducing the waiting time of the caller. This method will make the code scattered and difficult to maintain, and it is relatively difficult to locate the problem.

  • Future: when the user submits a task, a Future is returned immediately and the task executes asynchronously; the result can be retrieved from the Future later. The request concurrency described in section 5.1 can be implemented with Futures; pseudocode:

// asynchronous concurrent tasks
Future<Response> f1 = Executor.submit(query1);
Future<Response> f2 = Executor.submit(query2);
Future<Response> f3 = Executor.submit(query3);

// do other work
doSomething();

// collect the results
Response res1 = f1.getResult();
Response res2 = f2.getResult();
Response res3 = f3.getResult();
  • CPS (Continuation-passing style): multiple asynchronous steps can be chained into more complex asynchronous processing while reading like synchronous code. CPS passes the subsequent processing logic to Then as a parameter and can catch exceptions at the end, which solves the problems of scattered asynchronous callbacks and hard-to-trace exceptions. CompletableFuture in Java and PPL in C++ support this style. A typical call looks like this:

void handleRequest(const Request &req)
{
    req.Read().Then([](Buffer &inbuf) {
        return handleData(inbuf);
    }).Then([](Buffer &outbuf) {
        return handleWrite(outbuf);
    }).Finally([]() {
        return cleanUp();
    });
}

6.2. Process asynchronization

A business flow often has a long call chain and many downstream dependencies, which reduces both the availability and the concurrency of the system. Non-critical dependencies can be made asynchronous. Take the Penguin E-sports go-live service as an example: besides writing the program record to storage, going live also has to synchronize the program information to the SHIELD recommendation platform, the App home page, secondary pages and so on. Since this outward synchronization is not the critical path of going live and its consistency requirements are not strict, these downstream sync operations can be made asynchronous, and the response is returned to the App as soon as the storage write completes, as shown in the figure below:


7. Cache

Cache is everywhere, from single-core CPUs to distributed systems, from front end to back end. In ancient times Zhu Yuanzhang followed the advice to "proclaim kingship slowly" (缓称王) and eventually won the empire; today chip vendors and Internet companies alike follow the homophonous policy of "cache is king" (缓存为王) to secure their place. A cache is a copy of the original data; its essence is trading space for time, mainly to handle highly concurrent reads.

7.1. Cache usage scenarios

Caching is the art of trading space for time, and using it can improve system performance. But as the saying goes, "however good the wine, don't overindulge": the goal of caching is a better cost/performance ratio, not performance improvement at any price, so whether to cache depends on the scenario.

Scenarios that are suitable for using cache, take the project Penguin E-sports that I participated in as an example:

1) Data that basically never changes once generated: such as the Penguin E-sports game list; after a game is created in the admin backend it rarely changes, so the whole game list can be cached directly;

2) Read-intensive or hot-spot data: typically the home page of an app, such as the Penguin E-sports home-page live list;

3) Data that is expensive to compute: such as the Penguin E-sports Top video rankings; the 7-day list, for example, is computed from various metrics in the early hours of each day and the sorted result is cached;

4) The hot head of the data: again the Top videos; besides caching the whole sorted list, the fully assembled responses for the first N pages are cached page by page in process;

Scenarios that are not suitable for caching:

1) Write-heavy, read-light data that is updated frequently;

2) Strict requirements on data consistency;

7.2. Classification of caches

  • Process-level cache: The cached data is directly in the process address space, which may be the fastest and easiest way to access the cache. The main disadvantage is that limited by the size of the process space, the amount of data that can be cached is limited, and the cached data will be lost when the process is restarted. It is generally used in scenarios where the amount of cached data is not large.

  • Centralized cache: The cached data is centralized on one machine, such as shared memory. The capacity of this type of cache is mainly limited by the size of the machine memory, and the data will not be lost after the process is restarted. Commonly used centralized cache middleware include stand-alone version redis, memcache, etc.

  • Distributed cache: The cached data is distributed on multiple machines. It is usually necessary to use a specific algorithm (such as Hash) for data sharding, and evenly distribute the massive cached data on each machine node. Commonly used components are: Memcache (client fragmentation), Codis (proxy fragmentation), Redis Cluster (cluster fragmentation).

  • Multi-level cache: data is cached at different levels of the system to improve access efficiency and reduce the load on back-end storage. The figure below shows a multi-level cache used by Penguin E-sports; according to our production statistics the first-level cache hit rate has reached 94%, and very few requests penetrate through to Grocery.


The overall workflow is as follows:

1) When a request reaches the home page or live-room service, a local cache hit is returned directly; otherwise the next level, the core storage service, is queried and the local cache is updated;

2) If the front-end service's cache misses and the request reaches the core storage service, a hit there is returned straight to the front-end service; otherwise the storage layer Grocery is queried and the cache is updated;

3) If neither of the first two cache levels hits, the request falls back to the storage layer Grocery.

7.3. Cache patterns

Several patterns have been summarized for using caches, falling into two main categories: Cache-Aside and Cache-As-SoR. SoR (system of record) is the authoritative data source, and the Cache is a replica of the SoR.

Cache-Aside: the cache sits beside the data path; this is probably the most common caching pattern. For reads, read the cache first; on a miss, read from the SoR and update the cache. For writes, write the SoR first and then the cache. The structure of this pattern is shown below:

Logic code:

// read path
data = Cache.get(key);
if(data == NULL)
{
    data = SoR.load(key);
    Cache.set(key, data);
}

// write path
if(SoR.save(key, data))
{
    Cache.set(key, data);
}

This pattern is simple to use, but it is not transparent to the application: business code has to implement the read and write logic. Moreover, writing the data source and writing the cache is not an atomic operation, so the two can become inconsistent in the following situations:

1) During concurrent writing, data inconsistency may occur. As shown in the figure below, user1 and user2 read and write almost simultaneously. User1 writes to db at t1, user2 writes to db at t2, then user2 writes to cache at t3, and user1 writes to cache at t4. In this case, db is the data of user2, and the cache is the data of user1, and the two are inconsistent.

Cache-Aside concurrent read and write

2) The write to the data source succeeds but the subsequent write to the cache fails, leaving the two inconsistent. If the business cannot tolerate either case, a simple remedy is to delete the cache entry first and then write the db; the cost is a cache miss on the next read.

Cache-As-SoR: The cache is the data source. In this mode, the Cache is regarded as SoR, so the read and write operations are all for the Cache, and then the Cache entrusts the read and write operations to the SoR, that is, the Cache is a proxy. As shown below:

Cache-As-SoR Structure Diagram

There are three implementations of Cache-As-SoR:

1) Read-Through: on a read, the Cache is queried first; on a miss the Cache itself goes back to the SoR (that is, the cache layer implements the Cache-Aside read logic instead of the business code). A small sketch follows this list.

2) Write-Through: the write-through pattern; the business issues a write, and the Cache is responsible for writing both the cache and the SoR.

3) Write-Behind: the write-back pattern; on a write the business only updates the cache and returns immediately, and the SoR is written asynchronously afterwards, so combined and batched writes can improve performance.
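
A minimal Read-Through wrapper sketch is shown below; the in-process map and the loader callback are illustrative stand-ins for a real cache middleware and a real system of record:

#include <functional>
#include <string>
#include <unordered_map>

class ReadThroughCache {
public:
    using Loader = std::function<std::string(const std::string &)>;

    explicit ReadThroughCache(Loader loadFromSoR) : loader_(std::move(loadFromSoR)) {}

    // The caller only ever talks to the cache; on a miss the cache itself
    // goes back to the SoR and fills the entry (Read-Through).
    std::string get(const std::string &key) {
        auto it = map_.find(key);
        if (it != map_.end()) return it->second;
        std::string value = loader_(key);  // delegate to the system of record
        map_[key] = value;
        return value;
    }

private:
    Loader loader_;
    std::unordered_map<std::string, std::string> map_;
};

Usage would look like ReadThroughCache cache([](const std::string &key) { return loadFromDb(key); });, where loadFromDb is whatever accessor the storage layer actually provides.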

7.4. Cache recycling strategy

In the case of limited space, low-frequency hotspot access, or no active update notification, the cached data needs to be recycled. The commonly used recycling strategies are as follows:

1) Time-based: Time-based strategies can be divided into two main types:

  • Based on TTL (Time To Live): that is, the lifetime, from the creation of the cache data to the specified expiration period, regardless of whether the cache is accessed or not. Such as redis's EXPIRE.

  • Based on TTI (Time To Idle): the idle period, the cache will be recycled if it is not accessed within the specified time.

2) Space-based: The cache sets the upper limit of storage space, and when the upper limit is reached, the data is removed according to a certain strategy.

3) Capacity-based: The cache sets the upper limit of storage entries, and when the upper limit is reached, the data is removed according to a certain strategy.

4) Reference-based: Recycling based on some strategies of reference counting or strong and weak references.

Common recycling algorithms for caches are as follows:

  • FIFO (First In First Out): Based on the first-in-first-out principle, the data that enters the cache first is removed first.

  • LRU (Least Recently Used): based on the principle of locality: data that has been used recently is likely to be used again soon, while data that has not been used for a long time is unlikely to be needed. A minimal LRU sketch follows this list.

  • LFU (Least Frequently Used): the least frequently used data is evicted first; the number of uses of each object is tracked, and when eviction is needed the object with the lowest use count is removed.
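
A minimal LRU sketch (hash map plus doubly linked list, the textbook approach, with capacity-based eviction as in point 3 above):

#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns true and fills value if the key is present; refreshes recency.
    bool get(const std::string &key, std::string &value) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        items_.splice(items_.begin(), items_, it->second);  // move to front
        value = it->second->second;
        return true;
    }

    void put(const std::string &key, const std::string &value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = value;
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (items_.size() >= capacity_) {  // evict the least recently used entry
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(key, value);
        index_[key] = items_.begin();
    }

private:
    std::size_t capacity_;
    std::list<std::pair<std::string, std::string>> items_;  // front = most recent
    std::unordered_map<std::string,
                       std::list<std::pair<std::string, std::string>>::iterator> index_;
};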

7.5. Cache crash and repair

Insufficient design, request storms (not necessarily malicious attacks) and other factors can cause various cache problems; common problems and their solutions are listed below.

Cache penetration: When a large number of non-existing keys are used for queries, the cache will not hit, and these requests will penetrate to the back-end storage, which will eventually cause excessive pressure on the back-end storage or even be overwhelmed. The reason for this situation is generally that the data in the storage does not exist. There are two main solutions.

1) Cache an empty or default value: if the data does not exist in storage, cache an empty or default value so that the next request does not fall through to the back end. If an attacker keeps fabricating different keys to query, however, this does not help much, and security policies are needed to filter such requests.

2) Bloom filter: hash all possibly existing data into a sufficiently large bitmap with a Bloom filter; a query for data that definitely does not exist is intercepted by the bitmap, shielding the underlying storage from the query pressure. A small sketch follows.
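
A minimal Bloom filter sketch (a fixed bit array and k hash values derived from std::hash with different salts; real deployments size the bit count and hash count from the expected number of keys and the target false-positive rate):

#include <functional>
#include <string>
#include <vector>

class BloomFilter {
public:
    BloomFilter(std::size_t bits, std::size_t hashes)
        : bits_(bits), hashes_(hashes), bitmap_(bits, false) {}

    void add(const std::string &key) {
        for (std::size_t i = 0; i < hashes_; ++i)
            bitmap_[position(key, i)] = true;
    }

    // false -> the key definitely does not exist (reject before hitting storage)
    // true  -> the key may exist (false positives are possible)
    bool mayContain(const std::string &key) const {
        for (std::size_t i = 0; i < hashes_; ++i)
            if (!bitmap_[position(key, i)]) return false;
        return true;
    }

private:
    std::size_t position(const std::string &key, std::size_t salt) const {
        // Derive several hash values by salting the key.
        return std::hash<std::string>{}(key + '#' + std::to_string(salt)) % bits_;
    }

    std::size_t bits_;
    std::size_t hashes_;
    std::vector<bool> bitmap_;
};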

Cache avalanche: refers to the collective failure of a large number of caches within a certain period of time, causing the back-end storage load to instantly increase or even be overwhelmed. Usually it is caused by:

1) The cache expiration time is concentrated in a certain period of time. In this case, you can use different expiration times for different keys, and add different random times on the basis of the original basic expiration time;

2) A cache instance addressed by modulo hashing goes down, and removing the failed instance causes a large number of cache misses. Two remedies: ① use master-slave replication and promote the slave when the master fails; ② use consistent hashing instead of modulo, so that a crashed instance only invalidates a small portion of the cache.

Cache hotspots: even though the cache system itself has high performance, it may not withstand highly concurrent access to a few hot keys, and the cache node holding them gets overloaded. Suppose Weibo shards by user ID as the hash key, and one day Zhiling announces her marriage: her content is cached on one node according to her user ID, and when her huge number of fans view it, that node is inevitably overwhelmed because the key is too hot. In this case the same content can be replicated into caches on several nodes to spread the load off the single node.

7.6. Some good practices of caching

1) Separate static from dynamic: a cached object may have many attributes, some static and some dynamic, and it is best to cache them separately. For example, a Penguin E-sports video has a title, duration, resolution, cover URL, like count, comment count and so on; the title and duration are essentially static, while likes and comments change constantly. Caching the two parts separately avoids pulling out and rewriting the whole video object every time a dynamic attribute changes, which would be very costly.

2) Use large objects with caution: If the cache object is too large, the overhead of each read and write is very high and other requests may be stuck, especially in a single-threaded architecture such as redis. A typical situation is to hang a bunch of lists on a value field or store a list without boundaries. In this case, it is necessary to redesign the data structure or split the value and aggregate it by the client.

3) Expiration setting: Try to set the expiration time to reduce dirty data and storage usage, but pay attention to the expiration time cannot be concentrated in a certain period of time.

4) Timeout setting: As a means of accelerating data access, caching usually needs to set a timeout period and the timeout period should not be too long (such as about 100ms), otherwise the entire request will timeout and there will be no chance to return to the source.

5) Cache isolation: First, different services use different keys to prevent conflicts or mutual overwriting. Second, core and non-core services are physically isolated through different cache instances.

6) Failure downgrade: Using cache requires a certain downgrade plan. Caching is usually not the key logic, especially for core services. If part of the cache fails or fails, you should continue to return to the source instead of directly interrupting the return.

7) Capacity control: Capacity control is required when using caches, especially local caches. When there are too many caches and memory is tight, frequent swap storage space or GC operations will occur, thereby reducing response speed.

8) Business orientation: Be business-oriented, don't cache for the sake of caching. When the performance requirements are not high or the request volume is not large, and the distributed cache or even the database is sufficient, there is no need to increase the local cache. Otherwise, the introduction of data node replication and idempotent processing logic may not be worth the candle.

9) Monitoring and alarms: you can never go wrong with monitoring. Monitor large objects, slow queries, memory usage and so on.

8. Sharding

Sharding divides a larger whole into multiple smaller parts; here we distinguish data sharding and task sharding. For data sharding, this article uses "shard" as the umbrella term for the splitting concepts of different systems (region, shard, vnode, partition and so on). Sharding kills three birds with one stone: it spreads a large data set across more nodes, spreads the read/write load of a single point across multiple nodes, and improves scalability and availability at the same time.

Data sharding is everywhere, from the collections in language standard libraries to distributed middleware. For example, when I once wrote a thread-safe container for storing objects, I segmented the container to reduce lock contention: each segment had its own lock, and an object was placed in a segment by hash or modulo. ConcurrentHashMap in Java uses a similar segmentation mechanism. In the distributed message middleware Kafka, a topic is divided into multiple partitions, which are independent of one another and can be read and written concurrently.

8.1. Sharding strategies

When sharding, try to distribute the data evenly across all nodes so they share the load; an uneven distribution causes skew and lowers the performance of the whole system. Common sharding strategies are as follows:

  • Range sharding

    Sharding on a contiguous key preserves ordering and suits range lookups, reducing scattered reads and writes across shards. The drawback is that it easily produces uneven data distribution and hot spots. For example, if a live-streaming platform shards by ID ranges, the short IDs usually belong to big streamers, and IDs in the range 100-1000 are certainly accessed far more often than ten-digit ones. Sharding by time range is also common, and reads and writes in the most recent period are usually far more frequent than those for long ago.


  • Random sharding

    Shard by some function such as hashing and taking the modulus. Data distribution is fairly uniform, and hot spots and concurrency bottlenecks are unlikely. The drawback is that ordered adjacency is lost; a range query, for example, has to fan out to multiple nodes.


  • Combined sharding
    A compromise between range sharding and random sharding that combines the two methods: a composite key is made up of several keys, the first used for random hashing and the rest for range ordering. If a live-streaming platform uses streamer id + stream start time (anchor_id, live_time) as the composite key, it can efficiently query a given streamer's streaming records within a time range. Social scenarios such as WeChat Moments, QQ Talk and Weibo likewise use user id + publication time (user_id, pub_time) to find a user's posts within a period.

8.2. Secondary index

Secondary indexes are usually used to speed up the lookup of specific values, and cannot uniquely identify a record. Using secondary indexes requires secondary search. Relational databases and some KV databases support secondary indexes, such as auxiliary indexes (non-clustered indexes) in mysql, and ES inverted indexes find documents by term.

  • local index

The index is stored in the same partition as the key, that is, index and record live in the same partition, so all write operations happen within one partition and need no cross-partition work; reads, however, have to aggregate data from the other partitions. Take Glory of Kings short videos as an example: the video vid is the primary key and the video tags (penta kill, triple kill, Li Bai, Ake, etc.) are the secondary index. The local index looks like the figure below:

local index

  • global index

Partition by the index value itself, independently of the primary key. Reading all the data for one index value then happens within a single partition, while a write may have to span several partitions. For the same example, the global index looks like the figure below:

global index

8.3. Routing strategy

A routing strategy determines how a data request reaches the node that holds the data, including after shards have been rebalanced. There are generally three approaches: client-side routing, proxy-layer routing and cluster routing.

  • client routing

The client implements the sharding logic itself: it knows the mapping between shards and nodes and connects directly to the target node. Memcache is typically distributed this way, as shown in the figure below.

Memcache client routing

  • proxy layer routing

The client sends its request to a proxy layer, which forwards it to the corresponding data node. Many distributed systems take this approach, such as the redis-based distributed store codis (its codis-proxy layer) and in-house systems such as CMEM (the Access layer) and DCache (Proxy + Router). In the CMEM architecture diagram below, the Access layer in the red box is the routing proxy layer.

CMEM Access Layer Routing

  • cluster routing

The cluster itself implements shard routing, and the client can connect to any node. If that node holds the requested shard it handles the request; otherwise it forwards the request to the right node or redirects the client to it. redis cluster and the company's CKV+ work this way; the figure below shows CKV+ cluster routing and forwarding.

CKV + cluster routing

The three routing approaches each have pros and cons. Client-side routing is relatively simple to implement but intrusive to the business. Proxy-layer routing is transparent to the business but adds a network hop, which costs some performance and makes deployment and maintenance more complex. Cluster routing is transparent to the business and has one layer fewer than proxy routing, saving cost, but the implementation is more complex, and a poor strategy can add several extra network hops.

8.4. Dynamic rebalancing

When we studied balanced binary trees and red-black trees, we saw that inserts and deletes break their balance, so after each insert or delete the tree is rebalanced through left and right rotations. Distributed data storage also needs rebalancing, but there are more causes of imbalance, mainly the following:

1) The read and write load increases, requiring more CPUs;

2) The data scale increases, requiring more disks and memory;

3) The data node fails and needs to be replaced by other nodes;

Many products in the industry and companies also support dynamic balance adjustment, such as resharding of redis cluster and rebalance of HDFS/kafka. Common ways are as follows:

  • fixed partition

Create far more partitions than there are nodes and assign several partitions to each node. When a node is added, a few partitions can be moved off the existing nodes onto it to restore balance, and the reverse when a node is removed, as shown below. A typical example is consistent hashing, which hashes keys onto a ring of 2^32 positions and assigns each physical node many virtual nodes (vnodes) on the ring. This scheme is fairly simple, but the number of partitions has to be decided at creation time: too few, and rebalancing becomes very expensive if the data grows quickly; too many, and there is some management overhead. A consistent hashing sketch follows the figure caption below.

Fixed partition rebalancing
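
A minimal consistent hashing sketch (a sorted ring of virtual nodes; std::hash stands in for a real hash function and the replica count is illustrative):

#include <functional>
#include <map>
#include <string>

class ConsistentHash {
public:
    // Map each physical node to `replicas` virtual nodes on the ring.
    void addNode(const std::string &node, int replicas = 100) {
        for (int i = 0; i < replicas; ++i)
            ring_[hash_(node + "#" + std::to_string(i))] = node;
    }

    void removeNode(const std::string &node, int replicas = 100) {
        for (int i = 0; i < replicas; ++i)
            ring_.erase(hash_(node + "#" + std::to_string(i)));
    }

    // A key is served by the first virtual node clockwise from its hash.
    std::string nodeFor(const std::string &key) const {
        if (ring_.empty()) return "";
        auto it = ring_.lower_bound(hash_(key));
        if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
        return it->second;
    }

private:
    std::hash<std::string> hash_;
    std::map<std::size_t, std::string> ring_;  // ring position -> physical node
};

When a node is added or removed, only the keys that fall between it and its neighbours on the ring have to move, which gives exactly the fixed-partition behaviour described above.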

  • dynamic partition

The number of partitions grows and shrinks automatically: a partition is split when its data grows past a threshold and merged with a neighbour when it shrinks below one, much like the split and merge operations of a B+ tree. Many storage components work this way, for example HBase region splitting and merging and TDSQL Set Shard. The advantage is that the partitioning adapts automatically to the data volume and scales well. One caveat: if there is only one partition at initialization, a burst of traffic right after launch puts all the load on a single node, so several partitions are usually pre-created, as in HBase pre-splitting.

8.5. Sub-database and sub-table

When a single table or a single machine holds too much data, the database becomes a performance bottleneck. To spread the load and improve read/write performance, a divide-and-conquer strategy of splitting databases and tables is used. This is usually needed in the following situations:

1) When the amount of data in a single table reaches a certain level (such as mysql is generally tens of millions), the performance of reading and writing will decrease. At this time, the index will be too large and the performance will be poor, so a single table needs to be decomposed.

2) The throughput of the database reaches the bottleneck, and more database instances need to be added to share the pressure of data reading and writing.

Sub-databases and sub-tables disperse data into multiple databases and tables according to specific conditions, which can be divided into two modes: vertical splitting and horizontal splitting.

  • Vertical splitting: distribute the tables of one database across different databases according to some rule, such as business domain or module. Taking a live-streaming platform as an example, live program data, video-on-demand data and user follow data are stored in different databases, as shown below:


Advantages:

1) Segmentation rules are clear and business division is clear;

2) Cost management can be carried out according to the type and importance of the business, and the expansion is also convenient;

3) Data maintenance is simple;

Disadvantages:

1) Tables that have been split into different databases can no longer be joined directly. In practice, though, such designs rarely use join anyway: a mapping table is maintained and the data is fetched with two queries, or it is pre-assembled and written into a higher-performance storage system.

2) Transaction handling becomes more complicated: a transaction can no longer span what used to be different tables of the same database. For example, after splitting, updating the live program and generating the VOD replay at the end of a stream cannot be done in one transaction; flexible transactions or other distributed transaction solutions are needed.

  • Horizontal splitting: split the rows of one table across multiple databases according to some rule, such as hashing or modulo. It can be understood simply as splitting by row; the table structure stays the same after the split. For example, the streaming records of a live platform accumulate over time and the table keeps growing, so it can be split horizontally by streamer id or by stream date and stored in different database instances.

Advantages:

1) The table structure is unchanged after the split, so business code does not need to change;

2) The data volume of each table is kept under control, which helps performance;

Disadvantages:

1) Joins, counts, record merging, sorting, paging and so on require cross-node handling;

2) It is more complex and requires a routing strategy.

In summary, vertical splitting and horizontal splitting each have their own pros and cons, and in practice the two are usually combined. A small routing sketch for horizontal splitting follows.
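
A sketch of a common routing rule for horizontal splitting (the database and table counts are made-up parameters; real systems usually also keep the mapping in a config service so it can be changed):

#include <cstdio>
#include <functional>
#include <string>

struct ShardLocation {
    int db;     // which database instance
    int table;  // which table within that instance
};

// Route a record to (db, table) by hashing its sharding key, e.g. anchor_id.
ShardLocation route(const std::string &shardKey, int dbCount, int tablesPerDb) {
    std::size_t h = std::hash<std::string>{}(shardKey);
    std::size_t slot = h % static_cast<std::size_t>(dbCount * tablesPerDb);
    return ShardLocation{static_cast<int>(slot / tablesPerDb),
                         static_cast<int>(slot % tablesPerDb)};
}

int main() {
    ShardLocation loc = route("anchor_10086", 4, 8);  // 4 databases x 8 tables each
    std::printf("write to db_%d.live_record_%d\n", loc.db, loc.table);
    return 0;
}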

8.6. Task sharding

I remember handing out new textbooks as a child: the teacher brought piles of new books to the classroom and asked a few classmates to hand them out together, one handing out Chinese, one mathematics, one science. A workshop assembly line, where the steps run in parallel and the final product is assembled at the end, is also a kind of task sharding.

Task sharding divides a task into multiple subtasks for parallel processing to speed up task execution. It usually involves data sharding. For example, merge sort first divides data into multiple subsequences, first sorts each subsequence, and finally synthesizes an ordered sequence. In big data processing, Map/Reduce is a classic combination of data sharding and task sharding.

9. Storage

For any system, from single-core CPU to distributed, from front end to back end, implementing any function or logic ultimately comes down to two operations: read and write. Business characteristics differ: some systems are read-heavy, some write-heavy, and some are both. This section discusses some storage read/write methodology for different business scenarios.

9.1. Read/write separation

Most businesses read far more than they write. To increase processing capacity, the master node can handle writes while the slave nodes handle reads, as shown in the figure below.


A read/write separation architecture has the following characteristics: 1) the database is a master-slave setup, either one master with one slave or one master with several slaves; 2) the master handles write operations and the slaves handle reads; 3) the master replicates data to the slaves. On this basic architecture several variants can be built, such as master-master-slave and master-slave-slave, and the master and slaves can even be different storage systems, for example mysql + redis.

The master-slave architecture with read-write separation generally adopts asynchronous replication, and there will be a problem of data replication delay, which is suitable for businesses that do not require high data consistency. There are several ways to minimize the problems caused by replication lag.

1) Read-after-write consistency: that is, read your own writes, which is suitable for users who require real-time updates after writing operations. A typical scenario is that a user logs in after registering an account or changing the account password. At this time, if a read request is sent to the slave node, since the data may not be synchronized yet, the user fails to log in, which is unacceptable. In this case, you can send your own read request to the master node, and the request to view other user information is still sent to the slave node.

2) Second read: read a slave first, and fall back to reading the master if the read fails or if the time since the relevant update is below a certain threshold.

3) The key business reads and writes the master node, and the non-key business reads and writes separately.

4) Monotonic read: Ensure that the user's read requests are all sent to the same slave node to avoid rollback. For example, after the user updates the information on the M master node, the data is quickly synchronized to the slave node S1. When the user queries, the request is sent to S1, and the updated information is seen. Then the user queries again. At this time, the request is sent to the slave node S2 whose data synchronization has not been completed. The phenomenon that the user sees is that the updated information just disappeared, that is, the data is rolled back.

9.2. Dynamic and static separation

Dynamic/static separation separates frequently updated data from rarely updated data. The most common example is a CDN: a web page is divided into static resources (images/js/css, etc.) and dynamic resources (JSP, PHP, etc.); with dynamic/static separation the static resources are cached on CDN edge nodes, and only the dynamic content has to be requested from the origin, reducing network transfer and service load.

Dynamic separation can also be adopted on the database and KV storage, such as the dynamic and static separation of video-on-demand cache mentioned in 7.6. In the database, dynamic and static separation is more like a vertical segmentation, storing dynamic and static fields in different database tables, reducing the granularity of database locks, and at the same time, different database resources can be allocated to reasonably improve utilization.

9.3. Separation of hot and cold

Hot and cold separation can be said to be an essential function for every storage product and massive business. Mysql, ElasticSearch, CMEM, Grocery, etc. all directly or indirectly support hot and cold separation. Put hot data on storage devices with better performance, and sink cold data to cheap disks, thus saving costs. In order to save costs on Tencent Cloud, Penguin E-sports also adopts hot and cold separation for live playback according to the number of anchor fans and time. The following figure is an implementation architecture diagram of ES cold and hot separation.

ES cold and hot separation architecture diagram

9.4. Heavy write, light read

My understanding of "heavy write, light read" has two senses: 1) make the write the critical operation and relax the read, as in asynchronous replication, where the write on the master must succeed while reads from the slaves tolerate replication delay; 2) do more work at write time and less at read time, moving computation from the read path to the write path. The second suits scenarios where reads would otherwise have to compute something; a typical example is a leaderboard that is built when data is written rather than when it is read.

Social products such as Weibo and Moments have follow or friend feeds. Take a simplified Moments as an example (I don't know how Moments actually implements it): if the friend feed a user sees on opening Moments were assembled at request time, by traversing all of the user's friends' new posts and sorting them by time, it clearly could not sustain traffic at Moments scale. With the heavy-write, light-read approach, the feed is built when a post is published and simply read back afterwards.

Following the Actor model, each user gets a mailbox. After posting to Moments, the user writes the post to their own mailbox and returns; the post is then pushed asynchronously into each friend's mailbox, so when a friend reads their own mailbox it already contains their Moments feed, as shown below:

Heavy-write, light-read flow

The figure above only illustrates the idea; real applications face further issues. 1) Write fan-out: this is a fan-out-on-write design, and for accounts with huge numbers of friends or followers the fan-out cost is very high, while some recipients may rarely open Moments or may have blocked the poster. Additional strategies are needed, such as fanning out only when the friend count is within a certain range, switching to a push-pull hybrid beyond that, and taking activity metrics into account. 2) Mailbox capacity: users generally do not scroll back very far, so the mailbox should store a bounded number of entries, with older entries fetched from other storage.

9.5. Data heterogeneity

Data heterogeneity is mainly to establish index relationships according to different dimensions to speed up queries. For example, online shopping malls such as JD.com and Tmall generally divide databases and tables according to order numbers. Since the order numbers are not in the same table, to query the order list of a buyer or merchant, it is necessary to query all sub-databases and perform data aggregation. A heterogeneous index can be built, and an index table from buyers and merchants to orders can be created at the same time when an order is generated. This table can be divided into databases and tables according to user id.

10. Queue

In real systems, not every task or request has to be processed in real time; in many cases data only needs eventual consistency rather than strong consistency, and sometimes modules should not even need to know about each other's dependencies. In these scenarios, queueing technology has a lot to offer.

10.1. Application scenarios

Queues have a wide range of application scenarios, which can be summarized as follows:

  • Asynchronous processing: A business request usually involves many processing steps, and some of them do not need to be completed within the request itself; they can be handled asynchronously. For example, on a live-streaming platform, when an anchor goes live, a go-live notification must be sent to the fans. The go-live event can be written to a message queue, and a dedicated daemon then processes it and sends the notification, which improves the response time of going live.

  • Peak shaving: The performance bottleneck of a high-concurrency system is usually I/O, such as database reads and writes. In the face of sudden traffic, a message queue can be used for queuing and buffering. Take Penguin E-sports as an example: every once in a while a big anchor such as Menglei appears, and a large number of users subscribe to that anchor at once, with each subscription requiring multiple write operations. In that case, only the record of which anchor the user follows is written to storage immediately; the event then goes into a message queue for temporary buffering, and the record of who follows the anchor and the remaining writes are completed afterwards.

  • System decoupling: Some basic services are relied upon by many other services. For example, Penguin E-sports search, recommendation and other systems all need go-live events. The go-live service itself does not care who needs its data; it only handles going live. Dependent services (including the daemon that sends go-live notifications mentioned in the first point) can subscribe to the go-live event message queue, which decouples them.

  • Data synchronization: A message queue can act as a data bus, especially for synchronizing data across systems. Taking a distributed cache system I previously helped develop as an example, writes to MySQL are synchronized to Redis through RabbitMQ, yielding an eventually consistent distributed cache.

  • Flexible transactions: Traditional distributed transactions are implemented with the two-phase commit protocol or its optimized variants. Executing them requires competing for locks and waiting, which in high-concurrency scenarios severely reduces system performance and throughput and can even cause deadlocks. Since the Internet is all about high concurrency and high availability, traditional transaction problems are generally converted into flexible transactions. The figure below shows a message-queue-based distributed transaction implementation used at Alibaba (for details see "The Way of Enterprise IT Architecture Transformation: Alibaba's Middle-Platform Strategy and Architecture in Practice"; an e-book is available on WeChat Reading):

insert image description here

Its core principles and processes are:

1) Before executing the first local transaction, the distributed-transaction initiator sends a transaction message to MQ, which stores it on its server; at this point MQ consumers cannot see or consume the message ①②.

2) After the transaction message has been sent successfully, the local transaction is executed ③:

a) If the local transaction succeeds, the transaction message on the MQ server is updated to the normal (consumable) state ④;

b) If, because of a crash or a network problem during execution, the result of the local transaction is not reported back to the MQ server in time, the previously stored transaction message remains in MQ. The MQ server periodically scans transaction messages, and when it finds one stored longer than a certain threshold, it sends a request to the MQ producer to check the transaction's execution status ⑤;

c) The producer checks the local transaction result ⑥; if the transaction succeeded, the previously stored transaction message is updated to the normal state, otherwise the MQ server is told to discard it;

3) Once the transaction message has been set to the normal state, the consumer obtains it and executes the second local transaction ⑧. If that execution fails, the MQ sender is notified to roll back or forward-compensate the first local transaction.
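
Below is a minimal sketch of the producer side of this pattern. The MqClient interface is hypothetical and only mirrors the numbered steps above; it is not any concrete product's API.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical MQ client: a "half" transaction message is stored on the server
// but stays invisible to consumers until committed (steps ①② and ④ above).
struct MqClient {
    uint64_t sendHalfMessage(const std::string &body) {
        std::cout << "half message stored: " << body << "\n";
        return 1;  // pretend message id
    }
    void commit(uint64_t id)   { std::cout << "message " << id << " is now consumable\n"; }
    void rollback(uint64_t id) { std::cout << "message " << id << " discarded\n"; }
};

// ③ the first local transaction, e.g. debit the sender's account (stubbed here).
bool executeLocalTransaction(const std::string &bizData) { return !bizData.empty(); }

void initiateTransfer(MqClient &mq, const std::string &bizData) {
    uint64_t msgId = mq.sendHalfMessage(bizData);  // ①② send before the local transaction
    bool ok = false;
    try {
        ok = executeLocalTransaction(bizData);
    } catch (...) {
        ok = false;
    }
    // If the process died before reaching this line, the MQ server's periodic scan
    // would call back to query the transaction state (⑤⑥) and then commit or discard.
    if (ok) mq.commit(msgId); else mq.rollback(msgId);  // ④
}

int main() {
    MqClient mq;
    initiateTransfer(mq, "transfer 100 from A to B");
}
```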

10.2. Application classification

  • Buffer queue: The most basic function of a queue is buffering, such as TCP's send buffer; network frameworks usually add an application-layer buffer as well. Using a buffer queue to absorb burst traffic makes processing smoother and protects the system; anyone who has bought train tickets on 12306 will understand.

insert image description here

In a big-data logging system, a log buffer queue is usually added between the log collection system and the log parsing system, to prevent logs from backing up or even being dropped when the parsing system is under heavy load, and to make it easier to upgrade and maintain each side independently. In the Tianji Pavilion data collection system shown below, Kafka serves as the log buffer queue.
insert image description here

  • Request queue: Queue incoming user requests. Network frameworks generally have request queues; for example, spp has a shared-memory queue between the proxy process and the worker process, and taf has a queue between the network thread and the Servant thread. These queues are mainly used for flow control, overload protection, timeout discarding, and so on.

insert image description here

  • Task queue: Submit tasks to a queue for asynchronous execution; the most common example is the task queue of a thread pool, sketched below.
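
A minimal C++ sketch of a thread pool driven by a bounded task queue (the capacity and the reject-when-full policy are arbitrary choices for illustration); the same bounded-queue idea underlies the buffer and request queues described above:

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
    std::queue<std::function<void()>> tasks_;  // the task queue
    std::mutex mtx_;
    std::condition_variable notEmpty_;
    std::vector<std::thread> workers_;
    std::size_t capacity_;
    bool stopping_ = false;

public:
    ThreadPool(std::size_t threads, std::size_t capacity) : capacity_(capacity) {
        for (std::size_t i = 0; i < threads; ++i) {
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx_);
                        notEmpty_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                        if (stopping_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();  // run the task outside the lock
                }
            });
        }
    }

    // Rejects instead of blocking when the queue is full: a crude form of the
    // overload protection mentioned for request queues.
    bool submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            if (stopping_ || tasks_.size() >= capacity_) return false;
            tasks_.push(std::move(task));
        }
        notEmpty_.notify_one();
        return true;
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            stopping_ = true;
        }
        notEmpty_.notify_all();
        for (auto &w : workers_) w.join();
    }
};
```

A real framework might instead block with a timeout, or record an enqueue timestamp and discard requests that have already waited too long, as described for spp and taf above.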

  • Message queue

For message delivery there are two main modes: point-to-point and publish-subscribe; common implementations include RabbitMQ, RocketMQ, Kafka, etc. The table below compares commonly used message queues:

| Comparison item | ActiveMQ | RabbitMQ | RocketMQ | Kafka |
| --- | --- | --- | --- | --- |
| Community/Company | Apache | Pivotal (VMware) | Alibaba | Apache |
| Maturity / license | Mature / open source | Mature / open source | Relatively mature / open source | Mature / open source |
| Development language | Java | Erlang | Java | Scala & Java |
| Client language support | Java, C/C++, Python, PHP, Perl, .NET, etc. | Officially Erlang, Java, Ruby, etc.; community APIs cover almost all common languages | Java, C++ | Officially Java; community versions for PHP, Python, Go, C/C++, Ruby, etc. |
| Protocol support | OpenWire, STOMP, REST, XMPP, AMQP | Multi-protocol: AMQP, XMPP, SMTP, STOMP | Its own protocol (community-provided JMS, immature) | Its own protocol; the community wraps HTTP support |
| HA | Master-slave based on ZooKeeper + LevelDB | Master/slave mode; the master serves requests, the slave is only a cold standby | Multi-master mode, multi-master multi-slave mode, asynchronous replication, synchronous double-write | Replica mechanism; when the leader fails a replica takes over and a new leader is elected (based on ZooKeeper) |
| Data reliability | Master/slave; relatively low probability of data loss | Can guarantee no data loss; a slave serves as backup | Supports asynchronous and synchronous flush, synchronous replication and asynchronous replication | Reliable; the replica mechanism provides fault tolerance and disaster recovery |
| Single-machine throughput | 10,000-level | 10,000-level | 100,000-level, high throughput | 100,000-level, high throughput; often paired with big-data systems for real-time computation, log collection, etc. |
| Message latency | Milliseconds | Microseconds | Milliseconds | Within milliseconds |
| Flow control | | Credit-based algorithm: an internal, passively triggered protection mechanism acting at the producer level | | Supports client and user levels; flow control can be actively configured for producers or consumers |
| Persistence | Memory by default; on a normal shutdown unprocessed in-memory messages are persisted to file, or to a database with the JDBC strategy | Memory and files; supports message accumulation, but accumulation in turn reduces throughput | Disk files | Disk files; as long as disk capacity suffices, messages can accumulate without limit |
| Load balancing | Supported | Supported | Supported | Supported |
| Management interface | Average | Better | Official command-line tool only | Official command-line tool only; Yahoo open-sourced its own web management interface |
| Deployment method and difficulty | Standalone / easy | Standalone / easy | Standalone / easy | Standalone / easy |
| Features | Relatively complete MQ feature set | Built on Erlang: strong concurrency, excellent performance, low latency | Relatively complete MQ feature set, distributed, good scalability | Relatively simple, mainly basic MQ features; widely used for real-time computation and log collection in big data |

Summary

This article has discussed and summarized common methods and technologies for designing high-performance back-end services, organized into a methodology via the mind map. Of course, this is far from everything about high performance; it may be only the tip of the iceberg. Each specific field has its own path to high performance, such as the I/O models and the C10K problem in network programming, data structure and algorithm design in business logic, and parameter tuning of various middleware. The article also describes practices from some real projects; if anything is unreasonable or there is a better solution, please let me know.
