The design of a distributed Redis architecture and the pitfalls we hit

Abstract: This article is divided into five parts:
  Redis, Redis Cluster, and Codis;
  why we prefer consistency;
  experience and pitfalls from running Codis in production;
  some views on distributed databases and distributed architecture;
  Q & A.
  Codis is a distributed Redis solution. Unlike the official, purely P2P model, Codis takes a proxy-based approach. Today I will introduce the design of Codis and of RebornDB, the next major version, along with some tips for using Codis in real-world scenarios. Finally, I will share some of my views on distributed storage; I hope readers will be generous with corrections.

  1. Redis, Redis Cluster, and Codis
  Redis: Redis is presumably already an indispensable component in everyone's architecture. Rich data structures, very high performance, and a simple protocol make Redis a good caching layer in front of the database. However, we worry about Redis as a single point: the capacity of a single Redis instance is always limited by memory. When the business has high performance requirements, ideally we want all data to live in memory and never hit the database, so it is natural to look for other solutions, for example trading memory for SSDs to gain larger capacity. A more natural idea is to turn Redis into a horizontally scalable distributed cache service. Before Codis, the only option in the industry was Twemproxy, but Twemproxy is a static distributed Redis solution: scaling out or in demands a great deal of operations work, and smooth resizing is hard to achieve. The goal of Codis was to be as compatible with Twemproxy as possible, add data migration so the cluster can grow and shrink, and eventually replace Twemproxy. Judging by the final rollout at Wandoujia, Twemproxy was completely replaced, with a memory cluster of about 2 TB.
  Redis Cluster: the official cluster solution, released around the same time as Codis. I think it has both advantages and disadvantages, and as an architect I would not use it in production, for two reasons:
  First, the cluster's data storage module and its distributed logic module are coupled together. The advantage is that deployment is extremely simple, all in the box, without as many concepts, components, and dependencies as Codis. The downside is that it is hard to upgrade painlessly: if one day there is a serious bug in Redis Cluster's distributed logic, how do you upgrade? There is no good way except a rolling restart of the entire cluster, which is painful for operations.
  Second, the protocol has been heavily modified, which is not friendly to clients. Many existing clients have become de facto standards with a great deal of code written against them; it is unrealistic to ask business teams to replace their Redis client, and it is hard to say which Redis Cluster client has been validated in a large-scale production environment. The open-source Redis Cluster proxy from HunanTV shows how significant this impact is; otherwise they would simply have told users to use a cluster-aware client.
  Codis: Unlike Redis Cluster, Codis uses a stateless proxy layer and puts the distributed logic in the proxy. The underlying storage engine is Redis itself (albeit with some small patches on top of Redis 2.8.13), the distribution state is stored in ZooKeeper (or etcd), and the underlying storage becomes a pluggable component. The benefits hardly need stating: each component can be scaled horizontally and dynamically. The stateless proxy in particular matters greatly for dynamic load balancing, and it also enables some interesting things. For example, if the data in some slots turns out to be relatively cold, a server group backed by persistent storage can take over those slots to save memory; when that data becomes hot again, it can be migrated back to an in-memory server group, all transparently to the business. Interestingly, after Twitter deprecated Twemproxy, they developed a new distributed Redis solution that still takes the proxy-based route, though it is not open source. A pluggable storage engine is also what RebornDB, the next-generation product after Codis, is working on. By the way, RebornDB and its persistence engine are fully open source; see https://github.com/reborndb/reborn and https://github.com/reborndb/qdb. The downside of this design, of course, is that going through a proxy adds one network hop, which would seem to degrade performance. But remember that our proxies can be scaled dynamically, and the QPS of the whole service is not bounded by the performance of a single proxy (which is why I recommend LVS/HAProxy or Jodis in production); every proxy is interchangeable.
  2. We prefer consistency
  Many friends have asked why we do not support read-write splitting. The reason is simple: our business scenarios at the time could not tolerate inconsistent data. Since Redis replication is asynchronous master-slave replication, a successful write on the master does not guarantee the data can be read on the slave, and it is quite troublesome for the business side to handle consistency itself. Moreover, the single-instance performance of Redis is already very high; unlike a real database such as MySQL, there is no need to confuse the business side just to squeeze out a little extra read QPS. For the same reason, you may notice that Codis HA does not guarantee zero data loss: because replication is asynchronous, if the master dies before some writes have reached the slave, promoting the slave to master loses the writes that had not yet been replicated. In RebornDB, however, we will try to support synchronous replication (sync replication) for the persistent storage engine (qdb), so that businesses with stronger requirements for data consistency and safety can use it.
  Speaking of consistency, this is also why the MGET/MSET supported by Codis cannot guarantee the atomic semantics of a single instance. The keys involved in an MSET may live on different machines, so guaranteeing the original semantics, that the whole operation either succeeds together or fails together, is a distributed transaction problem. Redis has no WAL and no rollback, so even the simplest two-phase commit strategy is hard to implement, and even if it were implemented, there would be no performance guarantee. Using MSET/MGET in Codis is therefore effectively the same as running multi-threaded SET/GET from the client, except the server batches the replies. We added these commands mainly to better support businesses that previously used Twemproxy.
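As a minimal sketch of why MGET loses atomicity behind a proxy, the fan-out can be modeled in a few lines of Python. The `slot_to_backend` table and backend ids here are hypothetical illustrations, not Codis's actual data structures; only the crc32(key) % 1024 slot mapping is taken from Codis itself.

```python
import zlib

NUM_SLOTS = 1024  # Codis pre-shards the key space into 1024 slots


def slot_of(key: str) -> int:
    # Codis maps a key to a slot via crc32(key) % 1024.
    return zlib.crc32(key.encode()) % NUM_SLOTS


def fan_out_mget(keys, slot_to_backend):
    """Group an MGET's keys by the backend owning each key's slot.

    Each group becomes an independent MGET against one server group,
    so the combined operation is not atomic: one group can succeed
    while another fails, unlike MGET on a single Redis instance.
    """
    groups = {}
    for key in keys:
        backend = slot_to_backend[slot_of(key)]
        groups.setdefault(backend, []).append(key)
    return groups
```

A proxy then issues one MGET per group and stitches the replies back into the caller's original key order.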
  In real scenarios, many friends use Lua scripts to extend the functionality of Redis. Codis does support this, but remember that in this scenario Codis only forwards the script; it does not verify that the data your script manipulates lives on the correct node. For example, if your script touches multiple keys, all Codis can do is dispatch the script to the machine that owns the first key in the parameter list. So in this scenario you must ensure that all the keys your script uses land on the same machine, which you can do with the hashtag method.
  For example, suppose a script manipulates several attributes of a user, such as uid1age, uid1sex, and uid1name. Without hashtags these keys may be scattered across different machines, so you wrap the part to be hashed in curly braces (the braces enclose the region over which the hash is computed): {uid1}age, {uid1}sex, {uid1}name, which guarantees these keys land on the same machine. This syntax was introduced by Twemproxy, and we support it as well.
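A sketch of the hashtag rule in Python, assuming twemproxy/Redis-Cluster-style semantics (only the text between the first `{` and the following `}` is hashed, falling back to the whole key when the tag is absent or empty), combined with Codis's crc32(key) % 1024 slot mapping:

```python
import zlib


def hashtag(key: str) -> str:
    # Hash only the substring inside the first non-empty {...}, if any.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # ignore an empty tag like "{}"
            return key[start + 1:end]
    return key


def slot_of(key: str) -> int:
    # Codis pre-sharding: crc32 over the hashtag region, mod 1024 slots.
    return zlib.crc32(hashtag(key).encode()) % 1024
```

With this rule, {uid1}age, {uid1}sex, and {uid1}name all hash on "uid1" and therefore land in the same slot, so a Lua script touching all three is safe to forward.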
  After open-sourcing Codis, we received a lot of community feedback, mostly focused on the dependency on ZooKeeper, the modifications to Redis, and the need for a proxy. We have also been asking ourselves whether these pieces are necessary. The benefits they bring are beyond doubt, as explained above, but is there a way to make the design more elegant? So in the next stage we will go a step further and implement the following designs:
  Replace the external ZooKeeper with Raft built into the proxy. To us, zk is really just strongly consistent storage, and Raft can serve the same purpose. Embedding Raft in the proxy to synchronize routing information reduces the external dependencies.
  Abstract the storage-engine layer, with the proxy or a third-party agent responsible for starting and managing the storage engine's life cycle. Concretely, Codis today still requires you to deploy the underlying Redis or qdb by hand and configure the master-slave relationships yourself; in the future we will hand this over to an automated agent, or even integrate a storage engine inside the proxy. The benefit is that we minimize proxy forwarding overhead (for example, the proxy can start a Redis instance locally) and reduce manual misoperation, improving the automation of the whole system.
  There is also replication-based migration. As is well known, Codis currently migrates data by patching the underlying Redis to add a single-key atomic migration command. The advantage is that the implementation is simple and the migration is transparent to the business. The disadvantages are also obvious: it is relatively slow, it is intrusive to Redis, and maintaining slot information adds memory overhead to Redis. For workloads dominated by small key-values, memory usage versus native Redis is roughly 1.5:1, so it does cost noticeably more memory.
  In RebornDB we will try to provide a replication-based migration method: when migration starts, record the operations on a given slot while a slave synchronizes in the background; once the slave has caught up, replay the recorded operations onto it; when the replay has nearly leveled off, briefly stop writes on the master, finish leveling, update the routing table, and switch the migrating slot over to the new master. This relies on the master-slave (semi-)synchronous replication mentioned earlier.
  3. Experience and pitfalls from running Codis in production
  Let me share some tips. As a development engineer I certainly have less front-line operations experience than the ops folks, so feel free to dig into these together later.
  On multi-product-line deployment: many friends asked how best to deploy Codis when there are multiple projects. At Wandoujia, each product line deployed its own complete Codis cluster, but they shared one zk; different Codis clusters are distinguished by different product names. Codis itself has no namespace concept: one Codis cluster corresponds to exactly one product name, and Codis clusters with different product names do not interfere with each other on the same zk.
  On zk: Codis depends strongly on zk, and when the connection between a proxy and zk jitters and the session expires, that proxy can no longer serve requests, so try to deploy the proxies and zk in the same data center. In production, zk must run on an odd number of machines, at least 3; 5 physical machines are recommended.
  On HA: HA here has two parts, HA of the proxy layer and HA of the underlying Redis. For the proxy layer, as mentioned before, the proxy itself is stateless, so its HA is straightforward: connecting to any live proxy is equivalent. In production we use Jodis, a Jedis connection pool we developed. It is very simple: it watches the list of surviving proxies on zk and hands out Jedis objects in turn, achieving load balancing and HA. Some friends use LVS or HAProxy for load balancing in production, which also works. As for HA of Redis itself, here Redis means the master of each server group underneath Codis. Initially Codis did not include this, because if a master dies and you directly promote a slave, you may end up with inconsistent data, since recent writes on the master may not have reached the slave yet; in that case an administrator needs to repair the data manually. Later, many friends asked for it, so we built a simple HA tool, codis-ha, which monitors the liveness of each server group's master and, if a master dies, promotes one of that group's slaves to be the new master. The project lives at https://github.com/ngaut/codis-ha.
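A minimal sketch of the client-side idea behind Jodis (the real implementation is Java and gets the live proxy list from a ZooKeeper watch; here the list is simply passed in, and all names are illustrative):

```python
import itertools


class ProxyPool:
    """Round-robin over the currently live Codis proxies.

    In Jodis the live list comes from watching a zk node, so dead
    proxies drop out automatically; any surviving proxy is equivalent
    because the proxy layer is stateless.
    """

    def __init__(self, live_proxies):
        self._proxies = list(live_proxies)
        self._rr = itertools.cycle(self._proxies)

    def get(self):
        # Hand out the next proxy address; the caller opens a normal
        # Jedis/redis-py connection to it.
        return next(self._rr)
```

LVS or HAProxy in front of the proxies achieves the same effect at the network layer instead of in the client library.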
  On the dashboard: the dashboard plays a very important role in Codis. All cluster-changing operations are initiated through the dashboard (this design is a bit like Docker's). The dashboard exposes a set of RESTful API endpoints, and both the web management tool and the command-line tools operate by calling these HTTP APIs, so make sure the dashboard has network connectivity to the other components. For example, users often find that the cluster OPS shown in their dashboard is 0 simply because the dashboard cannot reach the proxy machines.
  On the Go environment: in production, try to use a Go 1.3.x release. Go 1.4's performance is rather poor; it feels more like an intermediate release, shipped before reaching a production-ready state. Many friends criticize Go's GC; we will not get into philosophical debates here. Choosing Go was the result of weighing many factors, and since Codis is middleware, it does not keep many small objects resident in memory, so there is basically no GC pressure and no need to worry about GC.
  On queue design: in short, the principle is "don't put all your eggs in one basket". Try not to put all the data into one key, because Codis is a distributed cluster, and if you only ever operate on one key you have effectively degenerated back to a single Redis instance. Many friends use Redis for queues, but Codis does not provide the BLPOP/BLPUSH interfaces. That is fine: you can logically split a list into multiple LIST keys and poll them periodically on the business side (unless your queue has strict ordering requirements), so that different Redis instances share the access load of the same logical list. Also, if a single key grows too large it may cause blocking during migration: since Redis is single-threaded, normal access is blocked while that key is being migrated.
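A sketch of the split-list idea, using a hypothetical `ShardedQueue` on top of any redis-py-like client (only `rpush`/`lpop` are assumed). Ordering across the sub-lists is only approximate, which is the trade-off noted above:

```python
import itertools


class ShardedQueue:
    """Spread one logical queue across n list keys so the load does
    not all land on a single Redis instance behind Codis."""

    def __init__(self, client, name: str, n: int = 8):
        self.client = client
        self.keys = [f"{name}:{i}" for i in range(n)]
        self._rr = itertools.cycle(range(n))

    def push(self, value):
        # Round-robin across sub-lists so writes spread over slots.
        self.client.rpush(self.keys[next(self._rr)], value)

    def pop(self):
        # Poll each sub-list once per call; returns None when all are
        # empty. Strict FIFO across sub-lists is not guaranteed.
        for key in self.keys:
            value = self.client.lpop(key)
            if value is not None:
                return value
        return None
```

The business side calls `pop` in a periodic polling loop instead of relying on the blocking BLPOP.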
  On master-slave and bgsave: Codis itself is not responsible for maintaining Redis master-slave relationships. "Master" and "slave" in Codis are only conceptual: the proxy routes requests to the "master", and if the master dies, codis-ha promotes one of the "slaves" to master. Real master-slave replication must be configured manually when starting the underlying Redis. In production I suggest not enabling bgsave on the master and not casually running the save command; perform data backups on the slaves whenever possible.
  On cross-data-center / multi-active deployments: don't even think about it. Codis has no concept of multiple replicas, and Codis is mostly used for cache workloads where the business load lands directly on the cache. With a cross-data-center architecture at this layer, it is hard to guarantee both performance and consistency.
  On proxy placement: the proxy can be deployed close to the client, for example on the same physical machine, which helps reduce latency. Note, however, that Jodis does not currently prefer the nearest proxy instance based on locality; that would require modification.
  4. Some views on distributed databases and distributed architecture (one more thing)
  The Codis-related content ends here. Next I want to share some of my views on distributed databases and distributed architectures. Architects are greedy: every single point must become distributed, and as transparently as possible :P. Take MySQL: from the earliest single instance, to master-slave read-write splitting, to Alibaba's Cobar and TDDL, distribution and scalability were achieved but at the cost of transaction support, which is why OceanBase came later. Redis went from a single instance to Twemproxy, to Codis, to Reborn. In the end the storage underneath is completely different from the original, but the protocols and interfaces are forever, such as SQL and the Redis protocol.
  NoSQL systems came one after another, from HBase to Cassandra to MongoDB, solving data scalability and balancing CAP by trimming the storage and query models offered to the business. But almost all of them gave up cross-row transactions (as an aside, Xiaomi added cross-row transactions on top of HBase, which is a nice piece of work).
  In my opinion, setting aside the details of the underlying storage, for a business KV, SQL query (as provided by relational databases), and transactions can be said to be the storage primitives from which business systems are built. Why is the memcached/Redis + MySQL combination so popular? Precisely because it offers all of these primitives: with them the business can easily implement all kinds of storage needs and easily write "correct" programs. The problem is that once data grows past a certain point, in the evolution from single machine to distributed, the hardest thing to preserve is transactions. The SQL query part can still be handled by various MySQL proxies, and KV is naturally friendly to distribution.
  So we have effectively entered a world with no (cross-row) transaction support by default, and in many business scenarios we can only sacrifice business correctness to keep implementation complexity in check. Take a very simple requirement: updating follower counts on Weibo. The most straightforward, natural way is to put the modification of the follower's "following" count and the followee's "followers" count in one transaction and commit them together, so they either both succeed or both fail. But now, for the sake of performance and implementation complexity, the usual approach is queue-assisted asynchronous modification, or writing the cache first to bypass the transaction, and so on.
  However, some scenarios that require strong transaction support are not so easy to work around (we are only discussing open-source architecture solutions here), such as payment or points-balance businesses. The common approach is to shard the critical path onto single-point MySQL instances by user characteristics, or to use MySQL XA, but performance drops too much.
  Later, Google hit this problem in its advertising business, which required high performance and distributed transactions, and also had to guarantee consistency :). Google had been struggling along with a huge sharded MySQL cluster whose operability and scalability were really poor. An ordinary company would probably just have put up with it, but Google is not an ordinary company: it used atomic clocks to build Spanner, and then built the SQL query layer F1 on top of Spanner. When I first saw this system I was amazed; it should be the first publicly described system that truly deserves the name NewSQL. So BigTable (KV) + F1 (SQL) + Spanner (high-performance distributed transaction support), plus Spanner's other crucial feature of cross-data-center replication and consistency guarantees (implemented with Paxos), completes the database stack of Google's entire infrastructure, making it convenient for Google to develop almost any type of business system. I believe this is the future direction: a scalable KV database (as a cache and simple object store), plus a high-performance distributed relational database that supports distributed transactions and a SQL query interface and provides table semantics.
  5. Q & A
  Q1: I haven't used Codis. You said Codis has no concept of multiple replicas; what do you mean?
  A1: Codis is a distributed Redis solution. It conceptually divides the data into 1024 slots through pre-sharding, and the proxy forwards each key's request to the machine that owns it. Replication of the data itself is still handled by Redis.
  Q2: Codis information is stored in zk. Does zk play other roles in Codis? Why not use Sentinel for master-slave switching?
  A2: Codis's defining feature is dynamic resizing that is transparent to the business. zk not only stores routing information but also serves as the medium for event synchronization, for example a master change or data migration, which every proxy must learn about through specific zk events. You could say we use zk as a reliable RPC channel: only the admin performing a cluster change sends an event to zk; each proxy acknowledges on zk after acting on it, and the admin continues once it has received replies from every proxy. The cluster itself does not change often, so the data volume is small. Master-slave switching of Redis is done by codis-ha, which walks the master of each server group listed on zk, judges its liveness, and decides whether to issue the command to promote a new master.
  Q3: Is data sharding done with consistent hashing? Please elaborate, thanks.
  A3: No. It uses pre-sharding, and the hash algorithm is crc32(key) % 1024.
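The mapping is a one-liner; this sketch assumes Python's standard `zlib.crc32` matches the CRC-32 variant Codis uses:

```python
import zlib


def codis_slot(key: bytes) -> int:
    # Pre-sharding: every key deterministically maps to one of 1024
    # fixed slots; slots (not keys) are what get assigned to server
    # groups and moved during migration.
    return zlib.crc32(key) % 1024
```

Because the slot count is fixed at 1024, resizing the cluster only reassigns slots to server groups; the key-to-slot mapping never changes, unlike consistent hashing where the ring itself shifts.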
  Q4: How is permission management handled?
  A4: Codis has no authentication-related commands; the auth command was added in RebornDB.
  Q5: How do you prevent ordinary users from connecting directly to Redis and destroying data?
  A5: Same as above; Codis currently has no auth, and it will be added in the next version.
  Q6: What is the solution for Redis across data centers?
  A6: Currently there is no good solution. We position Codis as a cache service within a single data center. For a service like Redis running across data centers, first the latency is large, and second consistency is hard to guarantee. For cache services with high performance requirements, I don't think cross-data-center is a good choice.
  Q7: How do you do cluster-level master-slave (for example, cluster S is the slave of cluster M, the node counts of S and M may differ, and S and M may not be in the same data center)?
  A7: Codis is only proxy-based middleware and is not responsible for any data replication. In other words, there is only one copy of the data, inside Redis.
  Q8: From what you have introduced, can I conclude that you have no concept of multi-tenancy and do not provide high availability yourselves? You are positioning Redis more as a cache, right?
  A8: Yes. Internally, multi-tenancy is handled by running multiple Codis clusters; Codis is more a project to replace Twemproxy, and high availability is achieved through third-party tools. Redis here is a cache, and Codis mainly solves the problems of Redis being a single point and of horizontal scaling. To quote the Codis introduction: auto rebalance; extremely simple to use; supports both Redis and RocksDB transparently; GUI dashboard & admin tools; supports most Redis commands and is fully compatible with twemproxy (https://github.com/twitter/twemproxy); native Redis clients are supported; safe and transparent data migration, easily adding or removing nodes on demand. These are the problems it solves: how to dynamically scale the cache layer without stopping the business is what Codis cares about.
  Q9: Do you have any experience migrating cold-standby Redis databases? For hot Redis data you can use the migrate command to move data between two Redis processes, though if the peer has a password, migrate breaks (I have submitted a patch for this to Redis upstream).
  A9: For cold data, we implemented the complete Redis sync protocol plus a disk storage engine based on RocksDB. The cold data of the standby machine all lives on disk, attached directly to the master as a slave. That RocksDB storage engine is https://github.com/reborndb/qdb; once started it is effectively a redis-server that supports the PSYNC protocol, so it can serve directly as a Redis slave, which is a good way to save memory.
  Follow-up: in actual use, 3 groups have the same number of keys, but one of them has twice the OPS of the other two. What might be the reason?
  Answer: the same number of keys does not mean requests are evenly distributed; some keys may be particularly hot, and their requests must land on the machine that actually stores them.
  Q10: A Redis instance is using more than 50% of memory. If bgsave is executed at that point, will it block if virtual memory support is enabled, and return an error directly if it is not?
  A10: Not necessarily; it depends on the write load (how frequently data is modified after bgsave starts). Internally, Redis implements bgsave via the operating system's copy-on-write mechanism: if you modify almost all of the data during the save, the operating system has no choice but to copy nearly all of it, and memory usage blows up.
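A rough back-of-the-envelope model (my own simplification, not a formula from Redis itself): under copy-on-write, peak resident memory during bgsave is roughly the dataset plus the fraction of it dirtied while the forked child is writing the RDB file:

```python
def bgsave_peak_gb(dataset_gb: float, dirty_fraction: float) -> float:
    """Worst-case resident memory during bgsave under copy-on-write.

    dirty_fraction is the share of pages modified while the child is
    saving; near 1.0 almost every page gets duplicated, which is the
    "blows up" case described above. This ignores allocator and page
    granularity overheads.
    """
    assert 0.0 <= dirty_fraction <= 1.0
    return dataset_gb * (1.0 + dirty_fraction)
```

So an instance already above 50% of physical memory risks exhausting it during bgsave if the write rate is high, which is one more reason to take backups on the slaves.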
  Q11: I just finished reading, and I like it. Could you introduce Codis's auto-rebalance implementation?
  A11: The algorithm is fairly simple; see https://github.com/wandoulabs/codis/blob/master/cmd/cconfig/rebalancer.go#L104. The code is quite clear, code talks :). Essentially, slots are allocated in proportion to each instance's memory.
  Q12: I mainly want to know how to reduce the impact of data migration on online services. Any experience to share?
  A12: Codis's data migration is already quite gentle: it migrates keys atomically, one at a time, and if you are worried about jitter you can even add a delay between keys. The advantage is that the business barely notices; the disadvantage is that it is slow.
  Those are the main points. If anything is unclear, the most important thing is to communicate more: raise what you don't understand so we can learn from and improve each other.

