Why do I need to replace the RPC?

Many of you have probably run into the need to replace the RPC framework that a distributed system uses for its calls. Why does this happen? In the early stage of building a system, the requirements are simple, the architecture is simple, and, most importantly, the request volume is small. Many systems are therefore built in rapid-prototyping mode, with few demands placed on the RPC layer: you pick a convenient or familiar RPC framework and drop it into the system. As the business grows more complex and the system carries more traffic, the framework chosen at the start may reveal serious problems, such as poor handling of large fan-out. Take Thrift as an example, and suppose that growing business complexity leaves us with the following requirement.

As shown in the figure, for each request the upstream service must fetch results from 26 downstream services, A through Z, assemble them, and return the combined result to the front-end service. Some will say that 26 services is an exaggeration and that they have never seen this in their own systems. It is not an exaggeration. When a system with complex business logic is split into independent services with high cohesion and low coupling, it easily reaches this many service types, and 26 is far from a lot. So how does a traditional synchronous RPC framework handle this?

Taking Thrift as an example: to access 26 services and still keep request processing fast, you must call the downstream services in parallel. The calls cannot be made one after another, because the response time would then be at least timeA + timeB + ... + timeZ. With a synchronous client, the only way to get that concurrency is multithreading.

With multithreaded concurrent requests, the time to process one request drops to roughly max(timeA, timeB, ..., timeZ), in practice slightly more. That means we need a pool of request threads, but how large should it be? If the front end is handling P requests at the same time, then to keep every request as fast as possible we need a pool of 26 * P threads. At first glance this may seem manageable: after a request thread sends its network request it blocks on I/O and gives up the CPU, so computing threads can run and little CPU is wasted. When P grows large, however, this breaks down. With P at 100 or 1000, the pool needs 2,600 or 26,000 threads, and that many threads adds significant scheduling overhead because the CPU spends more of its time on thread switching.
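To make the thread-per-call approach concrete, here is a minimal Python sketch. It is not Thrift code: call_service is a hypothetical stand-in that simply sleeps to imitate a blocking synchronous RPC, and the service names and the 10 ms latency are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical downstream services A..Z.
SERVICES = [chr(ord("A") + i) for i in range(26)]


def call_service(name, latency=0.01):
    """Stand-in for a blocking, synchronous RPC call."""
    time.sleep(latency)  # the calling thread blocks here, as it would on network I/O
    return f"result-from-{name}"


def fan_out(pool):
    """Call all 26 services in parallel; each call occupies one pool thread."""
    futures = {name: pool.submit(call_service, name) for name in SERVICES}
    return {name: future.result() for name, future in futures.items()}


def threads_needed(fan_out_width, concurrent_requests):
    """Pool size so that no downstream call waits for a free thread."""
    return fan_out_width * concurrent_requests


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=threads_needed(26, 1)) as pool:
        start = time.perf_counter()
        results = fan_out(pool)
        elapsed = time.perf_counter() - start
    print(f"{len(results)} results in {elapsed:.3f}s")
    print("threads for P = 1000:", threads_needed(26, 1000))
```

One front-end request already ties up 26 threads for the duration of the slowest call, which is why the pool size scales with 26 * P.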

Therefore, replace the RPC framework if and only if it has become a burden on the system. For a system with low traffic, there is no need to replace it. You can still replace it for the sake of technical improvement, but the benefit may not be large.

What kind of RPC is needed?

Since synchronous Thrift calls are not well suited to large fan-out, we may need an RPC framework that works like the following.

This reactor model (just a simple example) reduces the number of request threads. The RPC framework uses the system's epoll to send requests to the back-end services and receive their responses, so no matter how many requests are in flight, a single thread handles all of them. Epoll notifies the user process when data has arrived or when a socket is ready for sending, and the received data is finally handed back to the computing threads. For large fan-out, this model works better than the thread-per-call Thrift approach. I also implemented a simple RPC framework in my spare time: http://www.cnblogs.com/haolujun/p/7527313.html It is rough, but small.
There are also many open-source RPC frameworks, such as fbthrift and gRPC, that can handle large fan-out. Pick the one that suits you and requires the fewest code changes and the lowest long-term maintenance cost.
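The single-threaded reactor idea described above can be sketched with Python's asyncio, whose default event loop on Linux is built on the selectors module and so typically uses epoll. This is only an illustration of the model, not the framework linked above: call_service is a hypothetical stand-in that sleeps instead of doing real network I/O, and the service names are assumptions.

```python
import asyncio
import threading

# Hypothetical downstream services A..Z.
SERVICES = [chr(ord("A") + i) for i in range(26)]


async def call_service(name, latency=0.01):
    """Stand-in for a non-blocking RPC; yields to the event loop while waiting."""
    await asyncio.sleep(latency)
    # Report which OS thread handled the reply.
    return name, threading.get_ident()


async def fan_out():
    """Issue all 26 calls concurrently from one thread and collect the replies."""
    return await asyncio.gather(*(call_service(name) for name in SERVICES))


def run_fan_out():
    return asyncio.run(fan_out())


if __name__ == "__main__":
    replies = run_fan_out()
    thread_ids = {thread_id for _, thread_id in replies}
    print(len(replies), "replies handled on", len(thread_ids), "thread(s)")
```

All 26 calls are in flight at once, yet every reply is handled on the same thread, so the thread count no longer grows with the fan-out width.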

How to migrate to the new RPC?

Migrating a system to a new RPC framework takes more than code changes: you also have to stay compatible. During the migration the system may need to run on both RPC frameworks at once, and the cutover must be smooth. A typical distributed system might look like the following.

Services B1~B4 write their own addresses into etcd. Because RPC migration was not considered at the beginning, the value holds only the service address and carries no other information, such as the RPC type the service uses.

Option 1 Add a new key

You do not have to migrate A1~A2 and B1~B4 all at once; pick a subset first for a smooth transition. For example, we choose A1 and B1~B2 to migrate first.

The rollout steps are as follows:

Take A1, B1, and B2 offline.
Update the A1 configuration so that it reads the list of backend services from the new key: service_new_rpc.
Update the B1 and B2 configuration so that they register themselves under the new key: service_new_rpc.
Start B1 and B2.
Start A1.
Repeat the above steps for A2, B3, and B4.

This way we can migrate the services smoothly. The shortcomings are obvious, though: a new key is required, and the services need to be moved back to the old key at a later stage.
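The rollout steps of this option can be simulated with a small in-memory model. This sketch does not talk to a real etcd: the Cluster class is hypothetical, and the old key name "service" is an assumption, since only the new key service_new_rpc is named here.

```python
OLD_KEY = "service"            # hypothetical name of the existing etcd key
NEW_KEY = "service_new_rpc"    # the new key used in the rollout steps


class Cluster:
    """In-memory stand-in for etcd plus the callers' configuration."""

    def __init__(self, callers, servers):
        # Before migration every server is registered under the old key
        # and every caller reads the old key.
        self.registry = {OLD_KEY: set(servers), NEW_KEY: set()}
        self.caller_key = {caller: OLD_KEY for caller in callers}

    def migrate_batch(self, callers, servers):
        # Take the batch offline: the servers leave the old key.
        self.registry[OLD_KEY] -= set(servers)
        # Reconfigure the callers to read the new key.
        for caller in callers:
            self.caller_key[caller] = NEW_KEY
        # Reconfigure and start the servers: they register under the new key.
        self.registry[NEW_KEY] |= set(servers)
        # The callers then start and discover backends via visible_backends().

    def visible_backends(self, caller):
        """Backends a caller would discover with its current configuration."""
        return sorted(self.registry[self.caller_key[caller]])


if __name__ == "__main__":
    cluster = Cluster(["A1", "A2"], ["B1", "B2", "B3", "B4"])
    cluster.migrate_batch(["A1"], ["B1", "B2"])
    print(cluster.visible_backends("A1"))  # ['B1', 'B2']
    print(cluster.visible_backends("A2"))  # ['B3', 'B4']
```

After the first batch, A1 only ever sees backends that speak the new RPC and A2 only sees backends that speak the old one, which is what keeps the transition smooth.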

Option 2 Make the code compatible

This option requires changing some parsing code so that it is compatible with a new etcd value format, as shown below.

First, modify the A code so that it can parse the new address format. The new format appends an RPC type identifier to each address: T (Thrift) or G (gRPC). Staying compatible with the old format is easy: look for the delimiter while parsing and check whether the part after it is T, G, or nothing. If there is nothing, default to T.
Modify the A code so that it calls each backend with the RPC framework that matches the backend's RPC type in etcd.
Modify the configuration of B1~B4 so that they include their RPC type when registering themselves in etcd.
Modify B1~B2 to use the new RPC framework as the server, and set the RPC type to G when registering.
Modify B3~B4 to use the new RPC framework as the server, and set the RPC type to G when registering.
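The parsing and dispatch changes on the A side might look like the following sketch. The delimiter is not specified here, so "#" is an assumption, as are the example addresses; the client callables are stubs standing in for real Thrift and gRPC client code.

```python
DELIMITER = "#"            # hypothetical; the actual delimiter is not specified
KNOWN_TYPES = {"T", "G"}   # T = Thrift, G = gRPC
DEFAULT_TYPE = "T"         # old-format values are treated as Thrift


def parse_backend(value):
    """Split an etcd value into (address, rpc_type), accepting old and new formats."""
    address, sep, suffix = value.rpartition(DELIMITER)
    if not sep:
        # Old format: no delimiter at all, the whole value is the address.
        return value, DEFAULT_TYPE
    if suffix == "":
        # Delimiter present but nothing after it.
        return address, DEFAULT_TYPE
    if suffix not in KNOWN_TYPES:
        raise ValueError(f"unknown RPC type {suffix!r} in {value!r}")
    return address, suffix


def call_backend(value, request, clients):
    """Dispatch to the client callable that matches the backend's RPC type."""
    address, rpc_type = parse_backend(value)
    return clients[rpc_type](address, request)


if __name__ == "__main__":
    # Stub clients; real code would wrap the Thrift and gRPC client libraries.
    clients = {
        "T": lambda addr, req: f"thrift call to {addr}: {req}",
        "G": lambda addr, req: f"grpc call to {addr}: {req}",
    }
    for value in ["10.0.0.1:9090", "10.0.0.2:9090#T", "10.0.0.3:9090#G"]:
        print(call_backend(value, "ping", clients))
```

Because old-format values fall back to T, A can be deployed with this code before any backend changes how it registers.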

With these steps we can migrate the RPC framework smoothly. This method has a disadvantage too: two RPC frameworks must be maintained side by side until one of them is fully retired. The advantage is that no new key is added.

Summary

Replacing an RPC framework is not as difficult as you might think. As long as you sort out the logic before and after and migrate a little at a time, all of your services will eventually be moved over. The most important question is whether your system has really reached the point where it has to switch RPC frameworks.

How to elegantly replace RPCs for distributed systems