Exploring Facebook's technology for scaling memcache

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This work (a blog post by Li Zhaolong) was created by Li Zhaolong; please credit the author when reproducing it.

Introduction

By rights there is no time to read papers in such a busy stretch, but I happened to run across this one again, and after a quick scan of its contents I decided to read it closely. I originally picked it up because the title made me think it would have something to do with consistent hashing: just two days earlier I had read about how Dynamo optimizes consistent hashing to improve the efficiency of anti-entropy, and being able to compare the two schemes would have been excellent. After reading it I found there is actually little connection: the paper describes how Facebook uses memcache. Although that defeated my original purpose, the content turned out to be very interesting, and since this paper had been sitting in my reading queue for a while anyway, I made up my mind to read it through.

Problem Description

Below is a summary diagram drawn by another blogger after reading this paper. I find it very representative; it lays out the structure of the paper clearly, so I include it here.
[Figure: summary diagram of the paper's main points]

This article discusses the following issues:

  1. Preconditions and requirements of the system design
  2. Reducing cache latency and simplifying the application layer: introducing mcrouter
  3. The incast problem
  4. Reducing storage server load (leases, memcache pools)
  5. Fault handling
  6. Key invalidation within a region, and regional pools
  7. Cold cluster warmup
  8. Consistency of cross-region replication

In fact, the list above can be read as a simplified table of contents for the paper. Because the paper is essentially a distillation of Facebook's operational experience, every one of these questions is interesting and none can really be dropped, so let's go through them one by one!

Finally, one last question before the main text: why is the paper titled Scaling Memcache at Facebook? The reason is mentioned in the first two sections of the paper, though somewhat obscurely: the open-source memcached that Facebook started from is actually a single-machine cache. So why "Scaling"? Because the paper describes how Facebook extended the open-source, single-machine memcached into a distributed system. Many people miss this on a first read.

Preconditions and requirements of the system design

First, let's look at the background this technology was born in; the second section of the paper makes it very clear. In Facebook's workload, user reads vastly outnumber writes, and that is precisely the kind of workload where a cache yields the greatest benefit. Second, the cache must keep data synchronized across several different systems, because Facebook needs a flexible caching strategy: the cache must be able to pull data from different sources and keep it in sync.

Two roles follow from this. The first is as a query cache: a read first tries memcache with a string key; on a miss, the web server retrieves the data from the database or a backend service and stores the result back into memcache under the same key. For a write, the web server sends the SQL statement to the database and then sends a delete to memcache to invalidate the stale cached entry.
[Figure: memcache as a demand-filled look-aside cache, showing the read path (miss, fetch from DB, set) and the write path (update DB, then delete the key)]
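The look-aside pattern above is easy to pin down in code. Here is a minimal sketch, assuming hypothetical `cache` and `db` clients; the names and APIs are illustrative, only the get/set/delete pattern reflects the paper:

```python
# Minimal look-aside caching sketch. `cache` and `db` are hypothetical
# clients; only the access pattern itself comes from the paper.
def read(key, cache, db):
    value = cache.get(key)
    if value is None:                  # cache miss
        value = db.query(key)          # fetch from the authoritative store
        cache.set(key, value)          # fill the cache for later readers
    return value

def write(key, value, cache, db):
    db.update(key, value)              # the database stays authoritative
    cache.delete(key)                  # invalidate, don't update, the cache
```

Note that writes delete rather than update the cached value: a delete is idempotent, so replaying or reordering invalidations is safe.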
Of course, memcache can also be treated as a more general, undifferentiated cache: the memcached instances on many machines can be viewed as a single pool, so very little work is needed to keep multiple caches consistent with each other. For example, suppose server A generates a huge amount of intermediate data needed by some computation jobs; those jobs can then easily fetch the data they need from memcache.

But one thing must be clear here: although it can serve as a general-purpose cache, the memcached described in the paper is itself a single-machine, in-memory hash table with no distributed-storage capability. What it actually needs in order to act as a general-purpose cache is a powerful (and easily extensible) client, and that client must support data routing. Keeping cluster membership in the client is also one way of distributing events in distributed storage; I may write a separate article about that later.

Reducing cache fetch latency

A large page usually needs to fetch a great many resources, and those resources are almost never in the same database or cache, so fetching all of them quickly becomes a serious problem.

According to an unverified account, there is a front-end optimization in which ready resources are rendered first: text and images load first, and a video slot can be filled with a placeholder image and rendered once enough data has arrived (I seem to have heard this from someone before). This can greatly improve the perceived responsiveness of a request, moving the page's response time from the latency of the slowest request toward that of the fastest.

Resources are spread across different databases, and to relieve the databases some resources are cached across multiple cache servers: the resources a page needs are hashed onto different memcached servers. A web server therefore has to query many memcached servers to satisfy a single user request. The Facebook team applied three optimizations here:

  1. Maximizing parallel requests
  2. Using UDP to reduce latency
  3. Handling the incast problem

I'd like to think I could have come up with the first two myself, but the third is hard to arrive at without large-scale quantitative analysis, which is a hallmark of industrial papers like this one.

Maximizing parallel requests really answers the question: how do we reduce the total round-trip latency for all resources on a page? That depends not only on the latency of any single request but also on the dependencies between the page's resources. We can build a directed acyclic graph (DAG) of all the resources on a page; with the page itself as the root, the depth of this DAG is the minimum number of RTTs required.
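To make the claim concrete, here is a small sketch that computes the minimum number of batched round trips from a dependency DAG. The graph and resource names are made up for illustration:

```python
from functools import lru_cache

# Hypothetical page-resource dependency DAG: deps[x] lists the resources
# that must arrive before x can even be requested.
deps = {
    "page": [],                      # root HTML, no prerequisites
    "profile": ["page"],
    "friends": ["page"],
    "photo_grid": ["profile", "friends"],
}

def min_round_trips(deps):
    """Depth of the DAG = minimum number of batched RTTs for the page."""
    @lru_cache(maxsize=None)
    def depth(node):
        parents = deps[node]
        return 1 if not parents else 1 + max(depth(p) for p in parents)
    return max(depth(node) for node in deps)

print(min_round_trips(deps))  # 3: {page} -> {profile, friends} -> {photo_grid}
```

Everything at the same depth can be fetched in one parallel batch, which is exactly what "maximizing parallel requests" means.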

Using UDP to reduce request latency is easy to understand, but this section of the paper raises a point worth dwelling on: web applications do not talk to the memcached servers directly. They communicate through a client, which is stateless and encapsulates the complexity of the system. It provides serialization (building the DAG), compression, request routing, error handling (falling back to the database), and request batching, exposing only a simple interface to the web application. The client logic can be deployed in two ways: as a library embedded in the application, or as a standalone proxy called mcrouter.

We know that TCP is a connection-oriented transport protocol, which means communication between two nodes needs three packets just to establish a connection (TCP_DEFER_ACCEPT can hide the third leg of the handshake), may need keepalive packets to maintain the connection during transfer, and needs three to four more packets for a graceful teardown. That overhead is substantial. Long-lived connections fare somewhat better, because the overhead is amortized across many operations; on short-lived connections, though, the cost of maintaining a connection is very high, since there may be only a single data packet.

Using UDP instead of TCP for read operations reduces latency. According to Facebook's statistics, only 0.25% of get requests are discarded, of which about 80% are due to delay or packet loss and the rest to out-of-order delivery. We could maintain a reliable protocol stack in user space to avoid these problems, but it isn't necessary: the whole point is efficiency, reliability inevitably costs some of it, and the cost of a failure is simply treated as a cache miss, i.e. one extra database query. Using UDP for reads brought roughly a 20% latency reduction compared to TCP:
[Figure: get latency over UDP versus over TCP via mcrouter]
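As a rough illustration of "UDP get, and any failure is just a miss", here is a sketch. The 8-byte frame header follows the memcached UDP protocol description; the address, timeout, and key are assumptions:

```python
import socket

def udp_get(key, addr=("127.0.0.1", 11211), timeout=0.05):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    # memcached UDP frame header: request id, sequence number,
    # total datagrams, reserved (2 bytes each)
    header = b"\x12\x34" + b"\x00\x00" + b"\x00\x01" + b"\x00\x00"
    try:
        sock.sendto(header + ("get %s\r\n" % key).encode(), addr)
        data, _ = sock.recvfrom(65536)
        return data[8:]            # strip the frame header from the reply
    except socket.timeout:
        return None                # a drop or delay is treated as a miss
    finally:
        sock.close()

if udp_get("user:42") is None:
    pass  # fall back to the database, then repopulate the cache
```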

The incast problem is described as follows [6]:

A large number of responses arrive at a front-end server at the same time, or data center servers send large amounts of data simultaneously, filling the buffers of a switch or router and causing massive packet loss. TCP's retransmission mechanism then has to wait for retransmission timeouts, which increases the latency of the whole system and drastically reduces throughput.

The Facebook team cleverly used a sliding-window mechanism to solve this. Unlike TCP's, this window is not created per connection: it applies to all outstanding memcache requests from a web server, regardless of destination. The following shows the relationship between window size and request latency:
[Figure: web request latency as a function of the memcache sliding-window size]
When the window is small, the application has to dispatch more groups of memcache requests serially, which lengthens web requests. When the window is too large, the number of simultaneously outstanding memcache requests causes incast congestion; the result is memcache errors, and the application degrades to fetching data from persistent storage, which makes web requests slower. In the paper's measurements, 300 is a very good window size under light load.
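A minimal sketch of such a window, assuming a threaded client; the window size, worker count, and fetch function are illustrative:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

WINDOW = 300                       # outstanding-request cap from the figure
window = threading.Semaphore(WINDOW)

def fetch_from_memcache(key):
    ...                            # issue the actual (UDP) get here

def windowed_get(key):
    # The semaphore spans ALL memcache requests from this web server,
    # independent of which memcached server the key hashes to.
    with window:
        return fetch_from_memcache(key)

def fetch_page_resources(keys):
    # As parallel as possible, but never more than WINDOW in flight,
    # which keeps switch buffers from overflowing (incast).
    with ThreadPoolExecutor(max_workers=64) as pool:
        return list(pool.map(windowed_get, keys))
```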

Reducing storage server load

The problem here is equivalent to reducing accesses to the database, i.e. raising the cache hit rate. The paper offers three ways to optimize this:

  1. The leases mechanism
  2. memcache pools
  3. The replication mechanism

Leases are the main mechanism for solving stale sets [9] and thundering herds (the latter presumably because leases let concurrent misses for the same key be coalesced into a single database request; to be honest, the paper's description here is not very clear). A sketch of the idea follows.
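Here is a server-side sketch of leases under my reading of the paper: on a miss the server hands out a lease token at most once every ten seconds per key, a set is accepted only with a still-valid token, and a delete invalidates outstanding tokens. The class and field names are my own:

```python
import time, uuid

LEASE_INTERVAL = 10.0              # at most one lease per key per 10 s

class LeasedCache:
    def __init__(self):
        self.data = {}             # key -> value
        self.leases = {}           # key -> (token, issue_time)

    def get(self, key):
        if key in self.data:
            return self.data[key], None
        token, issued = self.leases.get(key, (None, 0.0))
        if token is None or time.time() - issued >= LEASE_INTERVAL:
            token = uuid.uuid4().hex       # this caller may fill the cache
            self.leases[key] = (token, time.time())
            return None, token
        return None, None          # another client holds the lease: retry soon

    def set(self, key, value, token):
        lease = self.leases.get(key)
        if lease is None or lease[0] != token:
            return False           # stale set: a delete arrived in between
        self.data[key] = value
        del self.leases[key]
        return True

    def delete(self, key):
        self.data.pop(key, None)
        self.leases.pop(key, None) # invalidates any outstanding lease
```

Rate-limiting the tokens is what tames thundering herds: clients that receive no token simply wait and retry the get, and by then the token holder has usually refilled the cache.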

memcache pools address the fact that memcache, as a shared general-purpose load, sits under applications with different access patterns, memory footprints, and quality-of-service requirements. Mixing them can cause low-churn keys that are still valuable to be evicted before high-churn keys that are no longer accessed (it is, after all, a memory-based store and must evict under memory pressure). Placing items with different churn characteristics in separate pools improves the cache hit rate; a routing sketch follows.
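A sketch of pool-aware key routing; the pool names, key prefixes, and server lists are all made up for illustration:

```python
import hashlib

POOLS = {
    "wildcard":   ["mc1:11211", "mc2:11211"],  # default pool
    "low-churn":  ["mc3:11211"],               # valuable, rarely invalidated
    "high-churn": ["mc4:11211", "mc5:11211"],  # cheap, frequently invalidated
}

def pool_for(key):
    if key.startswith("cfg:"):
        return "low-churn"
    if key.startswith("feed:"):
        return "high-churn"
    return "wildcard"

def server_for(key):
    servers = POOLS[pool_for(key)]
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]      # hash routing within a pool

print(server_for("cfg:site-flags"))            # always lands in low-churn
```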

Replication, by contrast, provides excellent performance when the request volume for a set of keys exceeds what a single machine can handle. There are two approaches: partitioning by primary key, or replicating the full data set. [7] §3.2.3 explains why replication is the better choice in most of these situations.

The paper's description here is simple and easy to follow, so I won't expand on it.

Fault handling

The Facebook team's handling of faults is quite distinctive. Generally speaking, in distributed storage, failure handling breaks down into the following steps:

  1. Fault detection (partition, node crash, or network fluctuation; typically detected via heartbeat timeouts)
  2. Failure recovery (many approaches, depending mainly on how data redundancy is done, including but not limited to promoting a replica or re-designating a master (leases))
  3. Data resynchronization (not covered in detail here)

In the Facebook team's view, failures in this system fall into two categories:

  1. A small number of hosts becoming unreachable due to network or server failure
  2. Widespread outages affecting a significant percentage of the servers in a cluster

For the latter, the cluster can be declared offline and user web requests diverted to another cluster. The former can be treated as a minor problem: a small set of spare machines, called Gutter, takes over the role of the failed nodes. Within a cluster, Gutter accounts for roughly 1% of the memcached servers.

When a memcached client gets no response to a get request, it assumes the server has failed and resends the request to the dedicated Gutter pool. If the second request also misses, the client queries the database and inserts the appropriate key-value pair into the Gutter machine.

This way, when some memcache nodes go down, the database load is greatly reduced, because Gutter soon starts serving a large number of cache hits. For memcached servers unreachable due to failures or small-scale network incidents, Gutter shields the backend store from a surge of traffic. A client-side sketch follows.
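A client-side sketch of the Gutter fallback, assuming hypothetical `main_pool`, `gutter_pool`, and `db_lookup` handles. The short TTL reflects the paper's note that Gutter entries expire quickly instead of being explicitly invalidated:

```python
class CacheUnreachable(Exception):
    """Raised when a get receives no response (server assumed failed)."""

GUTTER_TTL = 10  # seconds; Gutter entries expire fast rather than get deleted

def get_with_gutter(key, main_pool, gutter_pool, db_lookup):
    try:
        return main_pool.get(key)      # may legitimately return None (a miss)
    except CacheUnreachable:
        value = gutter_pool.get(key)   # resend the request to the Gutter pool
        if value is None:              # second miss: fall back to the database
            value = db_lookup(key)
            gutter_pool.set(key, value, expire=GUTTER_TTL)
        return value
```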

Key invalidation within a region, and regional pools

First we need to outline how the terms in this system relate to one another; the architecture diagram is as follows:
[Figure: overall architecture of a region: several front-end clusters (web servers plus memcached) in front of a shared storage cluster]
Clearly, web servers and memcached servers together make up a front-end cluster; mcrouter does not appear in the diagram, but I consider it part of the memcache layer. The front-end clusters together with a storage cluster make up a region.

Obviously, the authoritative data lives in the storage cluster, and a write operation must invalidate the corresponding data in every memcache. So how do we update the data in multiple caches? First, the consistency choice: strong consistency would obviously slash throughput, so the update process has to be asynchronous. Should the web server broadcast the invalidations itself, or is there a better way?

The Facebook team chose to introduce mcsqueal:
[Figure: the invalidation pipeline: mcsqueal daemons on the database servers batch deletes to mcrouters in each front-end cluster, which fan them out to memcached servers]

mcsqueal is essentially a daemon. One runs alongside each database, inspects the SQL statements the database commits, and extracts any delete commands. The deletes are batched into a small number of packets sent to the designated mcrouter servers in each front-end cluster; each mcrouter then unpacks the batch into individual delete operations and routes each invalidation to the correct memcached server in its cluster. The batching significantly reduces the packet load on the database machines.
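A sketch of that pipeline under stated assumptions: the invalidation log format, the `send_deletes` method, and the batch size are all illustrative, not the real interfaces:

```python
import re

# Assumed log convention: committed SQL carries the memcache keys to
# invalidate, e.g. "... /* mc-delete: user:42 */". Purely illustrative.
DELETE_RE = re.compile(r"/\* mc-delete: (\S+) \*/")

def extract_invalidations(committed_sql_lines):
    for line in committed_sql_lines:
        match = DELETE_RE.search(line)
        if match:
            yield match.group(1)

def ship_batches(keys, mcrouters_by_cluster, batch_size=64):
    keys = list(keys)
    # Every front-end cluster must see every invalidation; batching the
    # deletes is what keeps the database machines' packet rate low.
    for cluster, mcrouter in mcrouters_by_cluster.items():
        for i in range(0, len(keys), batch_size):
            mcrouter.send_deletes(keys[i:i + batch_size])
```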

The biggest reason for not having the web server perform the key invalidations itself: if a delete command is misrouted because of a configuration error, the web server has no durable record of the operation (why can it not be made durable? I have my doubts), so the command is simply lost. The database log, by contrast, is reliable, so lost deletes can be retried. There are still some problems across multiple clusters, of course; we will come back to those later.

A set of memcached servers shared by multiple front-end clusters is called a regional pool.

Because the data in memcache is modified in response to web requests, and user requests are routed essentially at random to all available front-end clusters, the cached data ends up roughly the same in every cluster. For data sets that are rarely modified and see little traffic, multiple front-end clusters can share one copy in a regional pool, which can significantly reduce the number of machines needed in some cases.

Cold cluster warmup

When a new cluster is brought online, its cache hit rate is initially very low, which weakens its ability to insulate the backend services. So we can let clients in this new cluster fetch their misses from a cluster that has been running for a long time; the time for the cold cluster to come up to full capacity is then greatly shortened. Of course this can cause consistency problems: specifically it does not satisfy read-your-writes consistency [8], since after a write the warm cluster's cache may not yet have been invalidated and stale data may be read from it (the paper bounds this race by giving deletes sent to the cold cluster a two-second hold-off). In Facebook's applications this turned out not to matter much; see [7] §4.3 for details. A sketch follows.
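A sketch of the warmup read path, with hypothetical `cold`, `warm`, and `db_lookup` handles. The use of add() rather than set() is from the paper: add() is a no-op when the key already exists, so a racing invalidation-and-refill cannot be overwritten with stale data:

```python
def warmup_get(key, cold, warm, db_lookup):
    value = cold.get(key)
    if value is not None:
        return value
    value = warm.get(key)          # the warm cluster absorbs most misses
    if value is None:
        value = db_lookup(key)     # only truly cold keys reach the database
    cold.add(key, value)           # add(), not set(): never clobber a newer fill
    return value
```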

Consistency of cross-region replication

First, the advantages of geographically distributed data centers:

  1. Placing web servers close to end users can greatly reduce latency.
  2. Geographic diversification can mitigate the impact of natural disasters and large-scale power failures.
  3. New locations can offer cheaper electricity and other economic incentives.

The 3-2-1 backup principle likewise tells us it is best to keep a copy off-site.

The consistency in question here is mainly cache consistency, that is, consistency between the master and replica regions.

Writes from the master region can cause a subtle update problem, and this is another reason invalidations are driven by mcsqueal rather than by the web servers: changes to the master database may not yet have propagated to a replica database when the cache invalidation has already arrived there, so the next query in the replica region would refill memcache with stale data. Routing the deletes through mcsqueal (i.e. through the database replication stream) avoids this race.

This section also shows that writes are not confined to the master region: a replica region's web servers may write as well, although the paper's description is so brief that it has little reference value. More interesting is the remote marker mechanism for reducing the probability of reading stale data: it trades extra latency on a cache miss (an inter-region round trip) for a lower probability of reading outdated data. See [7] §5 for the details; a sketch follows.
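A sketch of remote markers for a write issued in a non-master region, following the three steps described in the paper; the handles (`regional_pool`, `local_cache`, `master_db`, `replica_db`) and the marker naming are my own:

```python
def write_from_replica_region(key, value, regional_pool, local_cache, master_db):
    marker = "remote:" + key
    regional_pool.set(marker, 1)               # 1. set the remote marker
    master_db.write(key, value,                # 2. write to the master region,
                    invalidate=[key, marker])  #    embedding both keys to delete
    local_cache.delete(key)                    # 3. invalidate the local copy

def read_from_replica_region(key, regional_pool, local_cache,
                             replica_db, master_db):
    value = local_cache.get(key)
    if value is not None:
        return value
    if regional_pool.get("remote:" + key):
        # Marker present: the local replica may still be stale, so pay the
        # inter-region latency and read fresh data from the master region.
        return master_db.read(key)
    return replica_db.read(key)                # no marker: the replica is safe
```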

Summary

Many of the problems in this paper stick in the mind: the incast problem, cache consistency, fault handling, and so on. It really is a good guide for applying caches in practice; many of these problems are hard to even notice until you have fallen into the pits yourself. Read more, write more code. That's all for now.

References:

  1. "Scaling Memcache at Facebook Reading Notes"
  2. Scaling Memcache at Facebook
  3. "Facebook's Enhancements to Memcache Scalability"
  4. "A Basic Introduction to Memcache and How It Works"
  5. "TCP Incast Solutions"
  6. "PICC: The Incast Problem in Data Center Web Applications"
  7. Scaling Memcache at Facebook
  8. "Some Thoughts on Distributed Consistency"
  9. "Analysis of Consistency Schemes for Double-Writing a Distributed Database and Cache"
