ChubbyGo's security demonstration and outlook

Introduction

This article discusses some of the problems in ChubbyGo's current functionality, as well as some areas that I think can still be optimized.

In fact, after reviewing the implementation of ChubbyGo again, I was surprised to find that, apart from a few bugs that I know how to fix and the inconvenience caused by Go's RPC framework, all of the security problems originate from time. This was unexpected and hard to accept. Under the influence of Lamport's work I already knew that time is an extremely difficult problem in a distributed environment, but I had never confronted it at the code level before, so my understanding of the problem stayed on the surface. Now that ChubbyGo is basically complete, I can see, at a deeper level than before, just how much time affects a distributed service.

This article first demonstrates safety through two real problems encountered in ChubbyGo, and then draws several lessons from ZooKeeper.

Safety

Acquire timeout parameter

ChubbyGo implements a distributed lock service, and different clients come from different machines, so the clients' behavior is obviously beyond the server's control. This means that, to be robust, the lock operation needs a timeout parameter, so that a lock held by a client that has gone down can eventually be obtained by other clients. That much is easy to understand; the crux of the problem is time. Why? Let's walk through the sequence of events:

  1. Suppose the reference time at which the client sends its request is 0s.
  2. The client sends a lock request with a requested lock timeout of 2s.
  3. The request takes x seconds to reach the leader. Assuming the leader processes it immediately on receipt, the lock is released two seconds later, i.e. the server-side unlock time is 2 + x.
  4. Let y be the time between the leader receiving the request and actually executing the lock, plus the delay of the reply packet from the leader back to the client; obviously y > 0.
  5. When the client receives the reply and takes the lock, we must also account for the clock skew between the client and the leader, which we call degree, plus the time the client spends processing the reply before it holds the lock, which we call z.
  6. So the client's unlock time is 2 + x + y + degree + z.

We can clearly see that the client's unlock time lags behind the server's unlock time by (y + degree + z). The most serious problem is that the sign of degree is uncertain: the leader's clock may be faster, or the client's clock may be faster, so we cannot tell whether (y + degree + z) is positive or negative. That means two clients may both believe they hold the lock at the same moment. The current solution is for the leader to actually hold the lock for twice the requested TimeOut before releasing it, which avoids this problem to the greatest extent possible. It is still wrong in theory, but from an engineering point of view it avoids most of the trouble in practice.
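
To make this mitigation concrete, here is a minimal sketch of what the leader-side handling could look like, assuming a simple in-memory lock table; the type and field names (AcquireArgs, LeaderState, and so on) are invented for illustration and are not ChubbyGo's actual definitions.

```go
package locksketch

import (
	"sync"
	"time"
)

// AcquireArgs is a hypothetical request structure; the real ChubbyGo
// interface may differ.
type AcquireArgs struct {
	Path    string
	Timeout time.Duration // lock timeout requested by the client
}

type lockEntry struct {
	holder   string
	expireAt time.Time
}

type LeaderState struct {
	mu    sync.Mutex
	locks map[string]lockEntry
}

// Acquire grants the lock and records an expiry of 2x the requested timeout,
// so the leader's notion of "expired" almost always lags the client's notion,
// which additionally includes network delay, clock skew and local processing
// time (the y + degree + z of the analysis above).
func (s *LeaderState) Acquire(clientID string, args AcquireArgs) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	if e, held := s.locks[args.Path]; held && time.Now().Before(e.expireAt) {
		return false // still held and not yet expired on the leader
	}
	s.locks[args.Path] = lockEntry{
		holder:   clientID,
		expireAt: time.Now().Add(2 * args.Timeout),
	}
	return true
}
```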

This problem boils down to a client still acting as if it holds the lock after its timeout has expired. Another situation can produce the same effect: the client process may have gone through a long GC pause, or have such a low priority that the OS scheduler does not run it in time; this case is described in many blog posts. In the end I used the token mechanism to solve this problem, instead of Keepalives as Chubby does. Interested readers can find a brief description of the token mechanism at the end of section 2.2 of "Distributed Locks: The Choice between Security and Performance".
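
Below is a minimal sketch of the token idea (commonly known as a fencing token) from the perspective of the resource server: it remembers the largest token it has accepted so far and rejects anything older. The names here are illustrative and not taken from ChubbyGo.

```go
package fencing

import "sync"

// ResourceServer is an illustrative stand-in for any server that guards a
// shared resource with fencing tokens; it is not part of ChubbyGo itself.
type ResourceServer struct {
	mu        sync.Mutex
	highToken uint64 // highest token accepted so far
}

// Write accepts a request only if its token is at least as new as the newest
// token already seen. A client that was paused by GC or the OS scheduler, and
// whose lock has since been granted to someone else, carries a stale token and
// is rejected here even though it still believes it holds the lock.
func (r *ResourceServer) Write(token uint64, data []byte) bool {
	r.mu.Lock()
	defer r.mu.Unlock()

	if token < r.highToken {
		return false // stale holder: the lock was re-granted with a larger token
	}
	r.highToken = token
	// ... apply data to the protected resource here ...
	return true
}
```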

CheckSeq interface

CheckSeq is an interface introduced so that other servers can verify a Token after the Token mechanism was added. This section describes why ChubbyGo remains safe as long as those other servers process requests correctly. Let's walk through the interaction step by step:

Assume the Leader is the ChubbyGo server, OtherServer is a resource server that needs to check the Token, and the remaining two parties are different clients. Step by step:

  1. ClientA requests the lock and is granted Token 2.
  2. ClientA uses this Token to request a resource from OtherServer.
  3. After receiving the request, OtherServer sends a CheckSeq request to the ChubbyGo Leader; the parameter is the Token received from ClientA, whose value is 2.
  4. When the Leader receives the CheckSeq, the Token is still the latest one, so it returns OK to OtherServer. However, while the reply is in flight, the lock held by ClientA times out. We assume the client's lock expires at the same time as it expires on ChubbyGo (or at least not before the server-side expiry); this is exactly the problem described in the previous section. So by the time OtherServer receives the OK, ClientB's request carrying Token 3 may already have arrived. But as discussed earlier, ClientA already knows it no longer holds the lock, so even though OtherServer receives OK and tries to allocate the resource to ClientA, ClientA will refuse it.
  5. ClientB's request arrives, and OtherServer repeats the process above. If the extreme situation above does not occur, the resource request simply succeeds.

Of course, if the resource lives inside ChubbyGo itself, none of this is necessary; it is enough to check the token of each request directly.
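
As a rough illustration of the round trip described above, the following sketch shows how OtherServer could validate a client's token against the leader using Go's net/rpc. The argument and reply types and the service method name are assumptions for illustration, not ChubbyGo's real RPC definitions.

```go
package checkseq

import "net/rpc"

// CheckSeqArgs / CheckSeqReply are illustrative RPC types; ChubbyGo's actual
// definitions may differ.
type CheckSeqArgs struct {
	Path  string
	Token uint64 // token presented by the client (ClientA's 2, ClientB's 3, ...)
}

type CheckSeqReply struct {
	OK bool // true if the token is still the latest one for Path
}

// validateToken is what OtherServer could do before touching the resource:
// forward the client's token to the ChubbyGo leader and only proceed on OK.
// Even if the answer arrives after the lock has already timed out, the old
// holder knows its lease has expired and refuses the allocation, so safety
// is preserved (the scenario in step 4 above).
func validateToken(leaderAddr, path string, token uint64) (bool, error) {
	client, err := rpc.Dial("tcp", leaderAddr)
	if err != nil {
		return false, err
	}
	defer client.Close()

	args := CheckSeqArgs{Path: path, Token: token}
	var reply CheckSeqReply
	if err := client.Call("ChubbyGoLeader.CheckSeq", &args, &reply); err != nil {
		return false, err
	}
	return reply.OK, nil
}
```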

Outlook

FIFO client order and Sequential Consistency

Like Chubby, ChubbyGo provides linearizability, which means that every client's Get sees all operations that happened before it; global events can be understood as forming a single linear order. A consequence is that every server other than the Leader only provides data redundancy. This is a fairly large waste of resources: at least N/2 Follower servers keep the same log as the Leader (the underlying consensus protocol is Raft), yet they otherwise sit idle.

At this point we really have to ask: do clients need such strong consistency? Take the virtual file tree provided by ChubbyGo's lock service as an example: most of the time a client does not care about files other than the ones related to its own locks. In that case, FIFO client order is a very suitable consistency model, because each client only needs a consistent view from its own perspective. ZooKeeper implements FIFO ordering with a zxid. Of course, this raises a problem: it seems ZooKeeper then cannot offer globally consistent reads. In fact this is not hard to achieve within ZooKeeper's architecture, and it is exactly what Sync is for; personally, though, I think treating such a Get as a special write operation might be more elegant.

So ChubbyGo can try to introduce FIFO client order consistency at a later stage. This is not hard to achieve: we only need to attach a zxid to every operation, at least as far as I can see right now. Perhaps in the future we can also support reading data from Followers.
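
As a sketch of what that could look like, assume every reply carries the id of the last operation applied by the responding server (the role zxid plays in ZooKeeper); a client session then only needs to remember the largest id it has seen and refuse reads from replicas that are behind it. All names below are illustrative, not ChubbyGo's or ZooKeeper's actual API.

```go
package fifoorder

import "errors"

// Reply is an illustrative response carrying the id of the last operation the
// responding server has applied (ZooKeeper's zxid plays this role).
type Reply struct {
	Zxid  uint64
	Value []byte
}

// Session tracks the newest zxid this client has ever observed, which is all
// that FIFO client order needs: a read is acceptable only if the replica has
// applied at least everything the client has already seen.
type Session struct {
	lastZxid uint64
}

var errStale = errors.New("replica is behind this client's view; retry or sync")

// observe updates the client's view after any successful operation.
func (s *Session) observe(r Reply) {
	if r.Zxid > s.lastZxid {
		s.lastZxid = r.Zxid
	}
}

// checkRead rejects a reply from a follower whose state is older than what
// this client has already seen, preserving per-client ordering even when
// reads are served by followers instead of the leader.
func (s *Session) checkRead(r Reply) error {
	if r.Zxid < s.lastZxid {
		return errStale
	}
	s.observe(r)
	return nil
}
```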

More details about this improvement are in my Chubby write-up; see section 3.3 there.

Summary

The conclusion after all this thinking is actually a bit frustrating, because ChubbyGo does not stand out architecturally; on the contrary, ZooKeeper's architecture is more concise and efficient, and it gives users more possibilities. Compared with Redis, though, ChubbyGo still has some advantages: for example, ChubbyGo does not suffer from the data loss that can cause a Redis-based distributed lock to be held by multiple clients at once, and it offers more functionality.

Origin blog.csdn.net/weixin_43705457/article/details/109458119