How to Design a Cache Architecture for Tens of Billions of Visits

Foreword

In the reader community (50+) of Nien, a 40-year-old architect, some members have recently earned interview opportunities at first-tier Internet companies such as Alibaba, NetEase, Youzan, Xiyin, Baidu, and Didi. There they met several very important interview questions:

  • How to structure a distributed cache system?
  • How to build a cache architecture for tens of billions of accesses?

Recently, a friend on Weibo encountered this problem again: How to build a cache architecture for tens of billions of visits?

Next, Nien uses the design practice case of Weibo Cache architecture to reveal the answer to this question.

This article is very important. Bookmark it, and digest and master it slowly.

Why? In modern Internet applications, distributed caching has become standard for application performance optimization. A good cache system needs high availability and high performance, and must ensure data consistency and fault tolerance.

This article will introduce how to design a highly available distributed cache architecture, including architecture basics, overall architecture design, implementation principles, consistency model, and fault handling.

This question and its reference answer are also included in the V101 edition of our "Nin Java Interview Collection PDF", for the reference of later readers and to improve everyone's architecture, design, and development skills for the "three highs" (high concurrency, high performance, high availability).

For the PDFs of "Nin Architecture Notes", "Nin High Concurrency Trilogy" and "Nin Java Interview Collection", please go to the official account [Technical Freedom Circle] to obtain them.

Basic knowledge of distributed cache system architecture

In a distributed system, the construction of a cache system is a key link, and its infrastructure mainly includes the following important parts:

  • Data sharding

In order to prevent the system from crashing due to excessive pressure on a single node, we need to shard the cached data. In this way, different cache nodes can manage different data slices respectively, and are responsible for processing related read and write requests. This sharding method can not only effectively disperse the data pressure, but also improve the response speed of the system.

  • Load balancing

When a client initiates a request, we need to select an appropriate cache node to handle it. This requires the intervention of a load balancing algorithm, which can select the most suitable node for processing the request based on factors such as the current system load and network topology. In this way, it can not only ensure timely response to client requests, but also avoid excessive pressure on a certain node and affect the overall operation of the system.

  • Fault tolerance

In a distributed environment, failures between cache nodes may occur due to problems such as network communication. In order to improve the availability of the system, we need to design some fault-tolerant mechanisms. For example, data backups prevent data loss, and failover ensures that even if a node fails, the system continues to operate. These fault-tolerant mechanisms can effectively improve the stability and reliability of the system.

  • Data consistency

Since data is distributed and stored on multiple nodes, data consistency needs to be ensured. In order to achieve this goal, we need to adopt some protocols and algorithms to handle concurrent read and write operations. For example, we can use a distributed transaction protocol to ensure the atomicity and consistency of data, or use an optimistic locking algorithm to avoid conflicts with concurrent write operations. These means can effectively ensure data consistency and ensure the correct operation of the system.

Overall architecture design

When building a distributed cache system with high availability and high concurrency capabilities, the following aspects must be considered:

  • Cache service node

A cache service node is responsible for processing corresponding data fragments and providing interfaces for various cache operations. By using multiple cache service nodes, we can build a cache cluster to improve the scalability and availability of the system.

  • Shared storage device

Shared storage devices are used to store cache data and metadata, which can be shared by multiple cache service nodes. Common shared storage devices include distributed file systems, distributed block storage, and distributed object storage.

  • Load balancing device

The role of the load balancing device is to distribute the client's request to the appropriate cache service node, and it can be dynamically adjusted according to the load status of the cache service node.

  • Configuration service

The configuration service is responsible for maintaining and managing the metadata information of the cache system, including node information, fragmentation information, load balancing rules, and failover settings.

Implementation principle

The implementation of a distributed cache system requires many technologies, such as data sharding algorithms, load balancing algorithms, fault-tolerant processing algorithms, consistency algorithms, etc.

Data Sharding Algorithm

In distributed cache systems, data sharding algorithm is a key technology. Common data sharding algorithms include hash algorithms, range algorithms, sequential algorithms, and random algorithms. As one of the most commonly used sharding algorithms, the hash algorithm can map the key value of the data to a number through a hash function, and then shard it based on this number.
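As a minimal sketch (not any particular system's actual code), hash sharding can be implemented by hashing the key and taking the result modulo the number of shards; consistent hashing is the common refinement that limits how many keys move when nodes are added or removed:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a cache key to a shard: hash the key, then take the modulo.

    A minimal sketch of hash sharding. Production systems often use
    consistent hashing instead, so that adding or removing a node
    only remaps a small fraction of keys.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same key always lands on the same shard,
# and keys spread across all shards.
assert shard_for("user:42", 4) == shard_for("user:42", 4)
assert {shard_for(f"user:{i}", 4) for i in range(1000)} == {0, 1, 2, 3}
```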

  • Load balancing algorithm

Load balancing strategy plays a vital role in the caching system. It needs to comprehensively consider multiple factors such as the load status of the cache node, network topology, and client requests. Common load balancing strategies include round robin strategy, minimum number of connections strategy, IP hash strategy, weighted round robin strategy, etc.
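One concrete example is smooth weighted round-robin, the variant popularized by Nginx; the sketch below is illustrative, with node names and weights invented for the example:

```python
class WeightedRoundRobin:
    """Smooth weighted round-robin (the scheme used by Nginx upstreams).

    On each pick, every node's current weight grows by its configured
    weight; the node with the largest current weight wins and is then
    decremented by the total weight. Heavier nodes are chosen
    proportionally more often, without long runs on a single node.
    """

    def __init__(self, weights):
        self.weights = weights                      # node -> configured weight
        self.current = {node: 0 for node in weights}

    def pick(self):
        total = sum(self.weights.values())
        for node, w in self.weights.items():
            self.current[node] += w
        best = max(self.current, key=self.current.get)
        self.current[best] -= total
        return best

# "cache-a" (weight 3) receives three times the traffic of "cache-b".
lb = WeightedRoundRobin({"cache-a": 3, "cache-b": 1})
picks = [lb.pick() for _ in range(8)]
assert picks.count("cache-a") == 6 and picks.count("cache-b") == 2
```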

  • Consensus algorithm

In a distributed environment, data consistency is a core issue. When multiple clients read and write the same key value at the same time, data consistency needs to be ensured.

Common consensus algorithms include Paxos algorithm, Raft algorithm, Zab algorithm, etc.

  • The difference between strong consistency and eventual consistency

Strong consistency means that all clients reading the same key get the same result. Eventual consistency means that after a client writes data to a cache node, all cache nodes will eventually converge to the same data at some point in the future.

Strong consistency can be achieved by replicating data, but this may have an impact on the performance and availability of the system. Eventual consistency can be achieved through asynchronous replication and protocol algorithms, thereby improving system performance and availability to a certain extent.

  • How to handle failures and exceptions

In a distributed cache system, faults and exceptions are inevitable. In order to ensure the availability of the system, we need to take a series of measures to deal with failures and exceptions, such as data backup, failover, data recovery, and monitoring alarms. At the same time, the system also needs to be maintained and optimized regularly to ensure high availability and high performance of the system.

Summary of distributed cache system architecture

This article introduces in detail how to build a highly available distributed cache system, including basic knowledge of architecture, overall architecture design, implementation principle, consistency model, and fault handling. An excellent cache system needs to have high availability, high performance, and ensure data consistency and fault tolerance.

With tens of billions of visits, how does Weibo implement a caching architecture?

Weibo is a large-scale social platform with 160 million+ daily active users and tens of billions of daily visits. Its efficient and continuously optimized cache system plays a crucial role in supporting the massive access of a huge user base.

We first look at Weibo's data challenges in operation, then at the Feed system architecture.

Next, we focus on analyzing the Cache architecture and its evolution, and finally summarize and look ahead.

Weibo’s access traffic challenges

Weibo Feed Platform System Architecture

The entire system can be divided into five levels:

  • The top level is the terminal layer: the Web client, native clients (iOS and Android devices), the open platform, and third-party access interfaces.
  • Next is the platform access layer, which mainly allocates high-quality resources to key core interfaces so that they have better elastic service capacity and remain stable during traffic bursts.
  • Further down is the platform service layer, which mainly includes the Feed algorithm, relationship services, and so on.
  • Then comes the middle layer, which provides services through various intermediate media.
  • The bottom layer is the storage layer.

The overall platform architecture is roughly like this.

1. Feed timeline

  • Build process

When we use Weibo in daily life, for example when refreshing the homepage or the client, we see the latest ten to fifteen Weibo posts. How is this process achieved?

The refresh operation first obtains the user's follow relationships. For example, if the user follows 1,000 people, those 1,000 UIDs are fetched, and the Weibo published by each of those users are obtained from them. At the same time, the user's Inbox is fetched, which holds the special messages she has received.

For example, grouped Weibo and group-owner Weibo. Together with the Weibo lists of the people she follows, this series of Weibo lists is collected and sorted to obtain the required Weibo IDs; those IDs are then used to fetch the content of each Weibo.

If any of these Weibo are reposts, the original Weibo are also fetched, and user information is obtained from them. The posts are then filtered against the user's filter words to remove Weibo the user does not want to see. For the remaining Weibo, we check whether the user has favorited or liked them and set flags accordingly, assemble the various counts for each Weibo (reposts, comments, likes), and finally return this dozen or so Weibo to the user's terminal.

Seen this way, a single user request returns a dozen or so records, but the back-end servers must process hundreds or even thousands of pieces of data in real time to produce them. The entire process depends on the strength of the Cache system, so the quality of the Cache architecture design directly affects the performance of the Weibo system.
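The pull-model steps above can be sketched roughly as follows; all names and data shapes here are illustrative, not Weibo's actual interfaces:

```python
def build_timeline(uid, following, outboxes, inbox, contents, filters, page_size=15):
    """Sketch of the feed build process: gather followees' outbox IDs plus
    the user's inbox, sort newest-first, hydrate content, then drop posts
    matching the user's filter words. All structures are illustrative."""
    candidate_ids = set(inbox.get(uid, []))
    for followee in following.get(uid, []):
        candidate_ids.update(outboxes.get(followee, []))
    # Weibo IDs grow over time, so sorting descending yields newest-first.
    ordered = sorted(candidate_ids, reverse=True)[:page_size]
    posts = [contents[i] for i in ordered if i in contents]
    banned = filters.get(uid, set())
    return [p for p in posts if not (banned & set(p.split()))]

# Tiny illustrative data set: u1 follows u2 and u3, and filters "spoiler".
following = {"u1": ["u2", "u3"]}
outboxes = {"u2": [101, 103], "u3": [102]}
inbox = {"u1": [104]}
contents = {101: "hello world", 102: "spoiler alert",
            103: "cache talk", 104: "group post"}
filters = {"u1": {"spoiler"}}
timeline = build_timeline("u1", following, outboxes, inbox, contents, filters)
assert timeline == ["group post", "cache talk", "hello world"]
```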

2. Feed Cache architecture

Then we take a look at the Cache architecture, which is mainly divided into six layers:

  • The first layer, Inbox: mainly group Weibo and group-owner Weibo. The Inbox volume is relatively small, so the push model is mainly used.
  • The second layer, Outbox: each user's ordinary published Weibo are stored here. Depending on the number of stored IDs, it is actually split into multiple caches, usually around 200 entries, or about 2,000 for the long version.
  • The third layer, Social Graph: relationship data, including follows, fans, and users.
  • The fourth layer, Content: the content of each Weibo post is stored here.
  • The fifth layer, Existence: existence judgments, for example whether a given Weibo has been liked. Some celebrities have claimed they never liked a certain Weibo when in fact they did at some point and simply forgot, which has produced news stories.
  • The sixth layer, Counter: counts, including Weibo comment and repost counts, as well as user follow and fan counts.

Weibo Cache architecture and evolution

1. Simple KV data type

Next, we focus on the evolution of the Weibo Cache architecture. When Weibo first launched, we stored data as simple KV key-value pairs, mainly using hash sharding into an MC (Memcached) pool. However, a few months after going live we discovered problems: when a node machine went down or otherwise failed, large numbers of requests would penetrate the Cache layer straight to the DB, slowing the whole system or even freezing the DB.

To solve this problem, we quickly retrofitted the system with a High Availability (HA) layer. Even if some nodes in the main layer go down or cannot operate normally, requests penetrate only to the HA layer rather than to the DB layer. This keeps the system's hit rate stable under any circumstances and significantly improves service stability.

This approach is now widely used in the industry. However, some teams apply plain hashing directly, which carries risks. If a node (say node 3) goes down, the main layer removes it and redistributes node 3's requests to other nodes; if business volume is modest, the database can absorb that pressure. But if node 3 recovers and rejoins, its traffic comes back, and if it then fails again for network or other reasons, its requests are redistributed once more. Now there is a problem: stale entries written to the other nodes during the earlier failover have not been updated, and if they are not deleted in time the data becomes inconsistent.

There is a big difference between WeChat and Weibo: Weibo is more like an open public square. When an unexpected incident happens or a celebrity's relationship is exposed, traffic may surge by 30% in an instant. In that case, a huge number of requests concentrate on certain nodes and make them extremely busy; even MC cannot satisfy such a request volume, the entire MC layer becomes a bottleneck, and the whole system slows down.

To solve this, we introduced the L1 layer. It is essentially a smaller Main pool: each L1 group is roughly one-sixth, one-eighth, or one-tenth the size of the Main layer, and depending on request volume we deploy 4 to 8 L1 groups. When a request arrives, it accesses L1 first; on a hit it returns directly, and on a miss it continues to the Main-HA layers. Under burst traffic, the L1 layer absorbs most of the hot requests and relieves Weibo's memory pressure. Since on Weibo newer data is hotter, only a small amount of extra memory is needed to handle a much larger request volume.

  • Key Points
    • Mainly Memcached
    • HASH nodes within a layer do not drift; misses penetrate to the next layer
    • Multiple L1 groups improve read performance and reduce peak-traffic cost
    • Read/write strategy:
      • Write: multi-write
      • Read: penetrate layer by layer; on a miss, write back
    • Json/xml --> Protocol Buffer
    • QuickLZ compression

To sum up: we store simple KV data mainly in MC; HASH nodes within a layer do not drift, and a miss penetrates to the next layer for reading. Multiple L1 groups improve read performance and handle peak and burst traffic while reducing cost. For writes we multi-write; for reads we penetrate layer by layer and write back on a miss. For serialization we initially used Json/xml and, after 2012, the Protocol Buffer format, with QuickLZ compression for larger payloads.
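The layered read/write strategy summarized above can be sketched like this; plain dicts stand in for the MC pools, and the structure is an assumption for illustration, not Weibo's actual code:

```python
class LayeredCache:
    """Sketch of the strategy above: reads penetrate L1, then Main, then
    HA, writing back into the layers that missed; writes go to every
    layer (multi-write). Layer order and names are illustrative."""

    def __init__(self, layers):
        # layers: ordered list of dict-like caches, e.g. [L1, Main, HA]
        self.layers = layers

    def get(self, key, db):
        for depth, layer in enumerate(self.layers):
            if key in layer:
                value = layer[key]
                # Write back into the layers we missed on the way down.
                for missed in self.layers[:depth]:
                    missed[key] = value
                return value
        value = db.get(key)  # last resort: penetrate to the database
        if value is not None:
            for layer in self.layers:
                layer[key] = value
        return value

    def set(self, key, value):
        for layer in self.layers:  # multi-write keeps layers in step
            layer[key] = value

# Usage: a value present only in HA is written back to L1 and Main on read.
l1, main, ha = {}, {}, {"k": "v"}
cache = LayeredCache([l1, main, ha])
assert cache.get("k", {}) == "v"
assert l1["k"] == "v" and main["k"] == "v"
```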

2. Collection data

  • Business characteristics
    • Partial modification
    • Paged reads
    • Linked computation over resources
    • Types: follows, fans, groups, mutual follows, "XX also follows"
  • Solution: Redis
    • Hash distribution, master-slave (MS), cache/storage
    • 30+T memory, 2-3 trillion reads/writes per day

We have seen how to handle simple KV data, but collection data is more complex. For example, if a user follows 2,000 people and adds one more, one approach is to fetch all 2,000 IDs, modify the list, and write it back, which costs significant bandwidth and machine load. There are also paged reads: to display only the second page, say entries ten to twenty, must we fetch all the data back? And there are linked resource computations, such as "users A, B, and C all follow user D", which involve modifying, reading, and computing over the data. MC is not good at any of this, so all follow relationships are stored in Redis, sharded by Hash, with multiple replica sets separating reads from writes. Redis now holds about 30 T of memory and serves 2-3 trillion requests per day.

  • Redis Extension (Longset)
    • Long-type open array, double-hash addressing
    • The client builds the data structure; elements are written in one pass
    • Lsput: if the fill rate is too high, the client rebuilds the array
    • Lsgetall --> Lsdump
    • Small amounts of hot data: fronted by MC to absorb reads

In the process of using Redis, we ran into other problems. Take follow relationships: if I follow 2,000 UIDs, one approach is to store all of them persistently. But Weibo has a huge number of users, some rarely log in while others are extremely active, and keeping everything in memory is costly. So we changed Redis's role to a Cache: only active users are kept; a user inactive for a while is evicted from Redis and reloaded on the next access. The problem is that Redis works in single-threaded mode: if reloading one user means inserting a follow list that may have expanded to 20,000 UIDs one element at a time, Redis is essentially stuck and cannot serve anything else. So we extended Redis with a new data structure: an open array of long values addressed by double hashing. Space for the 20,000 UIDs is opened up front, the client builds the structure, and the elements are written in sequentially in one pass, making overall read and write efficiency very high.
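The idea of a double-hashed open array of longs can be sketched as follows; the probing scheme and sizes are illustrative assumptions, not the actual longset implementation:

```python
class LongSet:
    """Sketch of an open-addressed array of longs with double hashing,
    in the spirit of the longset extension above (details illustrative,
    not Weibo's implementation). 0 marks an empty slot."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.slots = [0] * capacity  # 0 = empty; stored UIDs must be > 0

    def _probe(self, uid):
        h1 = uid % self.capacity
        h2 = 1 + (uid % (self.capacity - 1))  # second hash gives the step
        for i in range(self.capacity):
            yield (h1 + i * h2) % self.capacity

    def add(self, uid):
        for idx in self._probe(uid):
            if self.slots[idx] in (0, uid):
                self.slots[idx] = uid
                return True
        return False  # full: the client would rebuild a larger array

    def contains(self, uid):
        for idx in self._probe(uid):
            if self.slots[idx] == uid:
                return True
            if self.slots[idx] == 0:
                return False
        return False

# UIDs 11, 19, 27 all collide at slot 3 (mod 8); double hashing resolves it.
s = LongSet(capacity=8)
for uid in (11, 19, 27):
    s.add(uid)
assert s.contains(19) and not s.contains(35)
```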

  • Other Redis extensions
    • Hot upgrade: from 10+ minutes down to milliseconds
    • AOF: rolling
    • RDB: records the position in the AOF
    • Full + incremental replication
    • Persistence / replication rate control

We also made other extensions to Redis. For example, as mentioned in some earlier public shares, we moved the data into shared variables so that an upgraded process can reattach to it: a hot upgrade that used to take about 10 minutes of loading for 1G of data (ten minutes or more for 10G) in our tests is now millisecond-level. For AOF, we use rolling AOF: each AOF file has an ID, and after reaching a certain size we roll over to the next one. When building an RDB, we record the current AOF file and the position within it, so the new RDB plus AOF extension modes implement full plus incremental replication.

3. Other data types-count

  • Business characteristics
    • A single key has multiple counts (multiple counts per Weibo/user)
    • Values are small (2-8 bytes)
    • Nearly a billion new records per day; hundreds of billions in total
    • Many KVs requested at a time

Next, we discuss other data types, such as counts. Counting is essential in every corner of an Internet company, and for small and medium businesses MC and Redis suffice. But Weibo's counts have some special characteristics: one Weibo may have several counts, including reposts, comments, and likes, and one user has fan counts, follow counts, and more. Count values are small, usually 2-8 bytes, most commonly 4. Roughly a billion new count records are added per day, the total runs to hundreds of billions, and a single request may need to return hundreds of counts.

  • Option 1: Memcached
    • MC eviction; data lost on restart
    • A large number of counts are 0: how to store them?
  • Option 2: Redis
    • Low memory efficiency
    • Access performance
  • Final plan: self-developed CounterService
    • Schema supports multiple columns, allocated by bits
    • Tables pre-allocated, double-hash addressing
    • Memory reduced to 1/5-1/15
    • Hot/cold separation: SSD stores old data; old-but-hot data cached via LRU
    • Persistence via RDB + AOF, full + incremental replication
    • Single machine: hot data at the 10-billion level, cold data at the 100-billion level

4. Counter-Counter Service

Initially, we chose Memcached, but it had problems: when capacity was exceeded some counts were evicted, and counts were lost after a crash or restart. In addition, many counts are zero: do they need to be stored at all, and how do we avoid them consuming lots of memory? Weibo adds billions of counts per day, and storing the zero values alone would take a lot of memory; not storing them causes penetration to the database layer, which hurts service performance.

Starting in 2010, we switched to Redis. However, as the data volume grew, we found Redis's memory efficiency for counts to be low: a single KV costs about 65 bytes, while a count really needs only an 8-byte key plus a 4-byte value, 12 effective bytes, so more than 50 bytes are wasted. That is just one KV; when one key carries multiple counts, even more is wasted: with four counts, the key takes 8 bytes and the counts 16 bytes, so about 24 bytes would actually suffice.

Yet storing those counts in Redis takes about 200 bytes. By independently developing CounterService, we reduced memory usage to between one-fifth and one-fifteenth of Redis's. We also separate hot and cold data: hot data stays in memory, cold data is evicted under LRU, and when cold data becomes hot again it is loaded back; persistence and replication use RDB plus AOF in full-plus-incremental mode. A single machine can thus hold hot data at the 10-billion level and cold data at the 100-billion level.

An overview of the storage architecture: memory on top, SSD below. Memory is pre-divided into N Tables, each covering a range of IDs. When a new ID arrives, it is routed to its Table, where the increment or decrement is applied. When memory runs low, a small Table is dumped to SSD, and the freed space is reserved for new IDs.

Some may wonder: within a given range, an ID's count was allocated 4 bytes, but a viral Weibo's count may overflow that width. How is this handled? Counts that exceed their width are moved to an Aux dict. Tables dumped to SSD are accessed through a dedicated IndAux and replicated via RDB.
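A bit-allocated, multi-column counter table in the spirit of the scheme above might look like this sketch; the column widths and the collision-free slot assignment are simplifications for illustration, not CounterService's actual layout:

```python
class CounterTable:
    """Sketch of a pre-allocated counter schema with several small
    columns packed into one 64-bit slot per ID, as described above.
    Column widths (repost/comment/like) are illustrative, and slot
    assignment by modulo ignores collision handling for brevity."""

    # column name -> (bit offset, bit width) within the 64-bit slot
    COLUMNS = {"repost": (0, 24), "comment": (24, 24), "like": (48, 16)}

    def __init__(self, capacity):
        self.values = [0] * capacity  # one packed 64-bit slot per ID
        self.capacity = capacity

    def incr(self, weibo_id, column, delta=1):
        offset, width = self.COLUMNS[column]
        slot = weibo_id % self.capacity
        mask = (1 << width) - 1
        current = (self.values[slot] >> offset) & mask
        updated = (current + delta) & mask  # wraps on overflow in this sketch
        self.values[slot] = (self.values[slot] & ~(mask << offset)) | (updated << offset)
        return updated

# Three counts of one Weibo share a single slot instead of three Redis KVs.
table = CounterTable(capacity=1024)
table.incr(9001, "repost")
table.incr(9001, "repost")
table.incr(9001, "like", 5)
assert table.incr(9001, "repost") == 3
assert table.incr(9001, "like", 0) == 5
```

In the real system, an overflow would move the count into the Aux dict instead of wrapping.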

5. Other data types-existence judgment

  • Business requirements
    • Check whether a record exists (e.g. read, liked)
    • Single records are tiny: the value is 1 bit (0/1)
    • The total volume is huge, and a large share of values are 0
    • Hundreds of billions of new records per day

In addition to counting, Weibo has existence-judgment businesses: has a Weibo been liked, read, or recommended? If a user has already read a Weibo, it is not shown to them again. The characteristic of this data is that each record is tiny (the value needs only 1 bit) but the total volume is huge: Weibo gains roughly 100 million new posts per day, and read records can reach tens or hundreds of billions. How to store this is a big problem, especially since most values are 0. The earlier question returns: do we store the zeros? Store them, and we write hundreds of billions of records a day; skip them, and massive traffic penetrates the Cache layer to the DB, which no DB can withstand.

  • Option 1: Redis
    • Single kv: 65 bytes
    • 6T of new memory added every day (excluding HA)
  • Option 2: CounterService
    • Single kv: 9 bytes
    • 900G of new memory added every day (excluding HA)

We did some selection here too. First we considered Redis directly: a single KV costs 65 bytes even though conceptually an 8-byte key with a 1-bit value would do, so the memory efficiency of each day's new data is very low. The second option was our own CounterService: the 1-bit value is stored in 1 byte, so with the 8-byte key a single KV needs 9 bytes, about 900G of new memory per day. At that rate we could keep only the most recent few days, roughly 3T for three days, which is still stressful, but much better than Redis.
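A quick back-of-the-envelope check of these figures, assuming roughly 100 billion new existence records per day (an approximation of the "hundreds of billions" above):

```python
# Verify the daily new-memory estimates quoted above.
records_per_day = 100e9

redis_tb = records_per_day * 65 / 1e12    # ~65 bytes per Redis KV
cs_gb = records_per_day * 9 / 1e9         # 8-byte key + 1-byte value
phantom_gb = records_per_day * 1.2 / 1e9  # ~1.2 bytes per key in Phantom

assert round(redis_tb, 1) == 6.5   # roughly the "6T" per day figure
assert cs_gb == 900.0              # the "900G" per day figure
assert round(phantom_gb) == 120    # the "120G" per day figure
```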

  • Final plan: self-developed Phantom
    • Table segments pre-allocated; Bloom filter within each segment
    • Each KV: 1.2 bytes (1% false-positive rate)
    • Daily new memory: 120G < 800G < 6T

Our final plan was to develop Phantom ourselves. Shared memory is pre-allocated in segments, and the memory finally used is only 120G. The algorithm is very simple: each key is hashed N times, say three, and the bit at each hash position is set to 1. To check later whether a key exists, hash it the same three times: if all three bits are 1, it is considered to exist (with a small false-positive rate); if any bit comes out 0, it is 100% certain that it does not exist.
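Phantom's existence check is essentially a Bloom filter. The following is a generic sketch, with the hash count and sizing chosen to approximate the ~1.2 bytes per key and ~1% false-positive figures above; it is not Weibo's implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter matching the description above: k hash
    probes per key; all bits 1 means "probably exists", any bit 0
    means definitely absent. About 10 bits (~1.2 bytes) per key with
    7 hashes gives roughly a 1% false-positive rate."""

    def __init__(self, num_bits, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k positions by salting the key with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# 1,000 keys in a 10,000-bit filter: every added key is always found.
bf = BloomFilter(num_bits=10_000, num_hashes=7)
for i in range(1000):
    bf.add(f"read:{i}")
assert all(bf.might_contain(f"read:{i}") for i in range(1000))
```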

  • Phantom system architecture
    • Data is stored in shared memory, and data will not be lost after restarting.
    • Landing RDB+AOF
    • Compatible with Redis protocol

Its implementation architecture is relatively simple: shared memory is pre-split into different Tables, within which the bit calculations are performed, and reads and writes go against them. Persistence uses AOF+RDB. Because everything lives in shared memory, data is not lost when the process is upgraded and restarted. Externally it is compatible with the Redis protocol, which can be directly extended with new commands to access the service.

6. Summary

  • Focus points
    • High availability within the cluster
    • Scalability within the cluster
    • High-performance components
    • Storage cost

To sum up, so far we have focused on the Cache cluster's high availability, its scalability, its performance, and, equally important, its storage cost. There are things we have not yet addressed, such as operations and maintenance: Weibo now runs thousands or tens of thousands of servers.

7. Further optimization

  • Resource/component management oriented: how to simplify operations and maintenance?
  • Local configuration mode: how to make changes quickly?
  • Regular peaks and burst traffic: how to respond quickly and at low cost?
  • Many categories of business data: how to control each SLA independently?
  • Too many business-related resources: how to simplify development?

8. Servitization

  • Local Confs --> Configuration as a service
    • configServer manages configuration/services to avoid frequent restarts
    • Resource/service management API
    • Change method: script modification, smart client asynchronous update

The first measure is to manage the entire Cache as a service: configuration becomes a service managed by the configServer, avoiding frequent restarts, and changes are applied directly by script, with smart clients updating asynchronously.

  • Cache access
    • Proxyization
  • IDC Data Consistency
    • Collecting/replication

  • ClusterManager
    • Scripting --> Web interface
    • Service verification business SLA
    • Service-oriented management and control of resources
  • Service governance
    • Expansion, shrinkage
    • SLA Guarantee
    • Monitoring alarm
    • Troubleshooting
  • Simplify development
    • Shield Cache resource details
    • Single line configuration access

Servitization also introduces a Cluster Manager for external management: resources are managed through a Web interface, with service-side verification of business SLAs. Service governance covers expansion and shrinkage, SLA guarantees, monitoring and alerting, and troubleshooting. For development, Cache resource details are now shielded, and access requires only a single line of configuration.

Summary and Outlook

Finally, to briefly summarize: the Weibo Cache architecture needs continuous optimization and enhancement across data architecture, performance, storage cost, and servitization.





Origin blog.csdn.net/crazymakercircle/article/details/132628780