Chief Architect of JD.com: Architectural Practices Behind the 618 Promotion Gateway Carrying One Billion Calls

During the 618 promotion, our gateways carry billions of calls. The gateway system must guarantee the stability and high availability of the whole platform while delivering the performance and reliability the business needs. This is a very complex problem, and the focus of this article is how we improve the gateway's performance and stability, and how we combine a range of technologies to keep the overall gateway highly available.

1. Gateway Covered Technologies

1.1 Gateway System

There are two main types of gateway systems:

  • The first is the client gateway, which mainly receives requests from clients; in other words, it is the server side of the app;

  • The second is the open gateway, through which a company (such as JD.com) exposes interfaces to third-party partners.

The techniques used by these two different gateways are very similar.

The difficulties faced by gateways with relatively large traffic include:

First, the gateway system needs to carry billions of calls, so the smooth operation of every interface and the performance overhead the gateway adds in front of each back-end service matter a great deal. For example, we built a Redis cluster in each of two data centers, which gives us solid high availability. To absorb sudden bursts of traffic we rely on caching, and in more demanding cases on Nginx + Lua + Redis, so that these high-traffic paths no longer depend on the JVM at all. In addition, we go through every interface and, via a degradation strategy, downgrade the weakly dependent ones to protect the availability of core applications.
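As a minimal sketch (not JD's actual code) of this "cache first, degrade weak dependencies" idea, the snippet below reads from a cache before touching any back end and skips a weakly dependent call entirely when its degradation switch is on. The `CacheClient` facade and the switch map are assumptions standing in for a real Redis cluster client and config-center push.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class CacheFirstReader {

    /** Hypothetical cache facade; in practice this would be a Redis cluster client. */
    interface CacheClient {
        String get(String key);
        void setex(String key, int ttlSeconds, String value);
    }

    private final CacheClient cache;
    // Degradation switches pushed from a config center (e.g. ZooKeeper).
    private final ConcurrentHashMap<String, Boolean> degradeSwitches = new ConcurrentHashMap<>();

    public CacheFirstReader(CacheClient cache) {
        this.cache = cache;
    }

    /** Cache-first read: fall back to the loader only on a miss, then repopulate the cache. */
    public String read(String key, int ttlSeconds, Supplier<String> loader) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        String value = loader.get();          // e.g. call the back-end service
        if (value != null) {
            cache.setex(key, ttlSeconds, value);
        }
        return value;
    }

    /** Weakly dependent call: return a default immediately when the switch says "degraded". */
    public String readWeak(String key, String defaultValue, Supplier<String> loader) {
        if (Boolean.TRUE.equals(degradeSwitches.get(key))) {
            return defaultValue;              // degraded: skip the dependency entirely
        }
        return Optional.ofNullable(read(key, 60, loader)).orElse(defaultValue);
    }
}
```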

Second, the gateway is essentially the process of forwarding HTTP requests to back-end services. Our gateway fronts more than 1,000 back-end service interfaces. Given that, how do we make sure these services do not affect each other? How does the architecture prevent the butterfly effect and avoid avalanches, so that a problem in one interface does not disturb the healthy operation of the others? This sounds simple, but it is not.

With more than a thousand interfaces, each with different performance characteristics and different dependencies on external resources, databases, and caches, problems of one kind or another occur almost every day. How do we apply isolation and governance techniques so that a problem in one of these interfaces does not drag down the whole gateway?

Third, we expose a thousand service interfaces to the outside world, which means dozens or even hundreds of teams are developing behind them, with new requirements potentially going live every day. In such a complex situation we cannot modify or redeploy the gateway every time a back-end service changes; otherwise the gateway becomes very fragile and its stability extremely low.

We therefore adopted dynamic access technology: back-end services connect to the gateway seamlessly through an access protocol, and through dynamic proxying, whatever modification or release a back-end interface goes through can be picked up directly. Via the back-end management platform, interfaces are published through the gateway transparently, which removes the gateway's dependence on back-end interface releases.
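The sketch below illustrates the shape of this dynamic access, under stated assumptions: the management platform publishes an API descriptor, the gateway keeps an in-memory route table, and requests are dispatched by reflection, so a back-end release never requires a gateway redeploy. All class and field names here are illustrative, not JD's real registry API.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DynamicRouteTable {

    /** Descriptor as it might be published by the management platform (hypothetical shape). */
    public static class ApiDescriptor {
        public final String apiName;     // e.g. "order.getOrderById"
        public final Object targetBean;  // proxy/stub for the back-end service
        public final Method targetMethod;

        public ApiDescriptor(String apiName, Object targetBean, Method targetMethod) {
            this.apiName = apiName;
            this.targetBean = targetBean;
            this.targetMethod = targetMethod;
        }
    }

    private final Map<String, ApiDescriptor> routes = new ConcurrentHashMap<>();

    /** Called when the registry pushes a new or updated API; no gateway restart needed. */
    public void publish(ApiDescriptor descriptor) {
        routes.put(descriptor.apiName, descriptor);
    }

    /** Dispatch a gateway request to whatever the registry currently points at. */
    public Object invoke(String apiName, Object... args) throws Exception {
        ApiDescriptor api = routes.get(apiName);
        if (api == null) {
            throw new IllegalArgumentException("Unknown API: " + apiName);
        }
        return api.targetMethod.invoke(api.targetBean, args);
    }
}
```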

1.2 Gateway Covered Technologies

Four technical directions of the gateway:

First, unified access. Traffic from the front end (apps and other sources) enters through a unified network layer. The challenges at this layer are high-performance pass-through, high-concurrency access, high availability, and load-balancing incoming traffic onto the back end.

Second, traffic control, which mainly means traffic governance. Facing massive traffic, how do we use anti-abuse techniques to keep the gateway from being overwhelmed, and how do we use rate limiting, degradation, and circuit breaking to protect the gateway end to end?

Third, protocol adaptation. As mentioned above, the gateway passes through thousands of back-end services, and not every one of them needs dedicated development or configuration inside the gateway. Through protocol adaptation and conversion, back-end services of all kinds can be exposed from the gateway over HTTP using the protocol we specify. Of course the gateway speaks not only HTTP but also some TCP. JD.com's internal protocols are relatively uniform: HTTP RESTful and JSF. JSF is JD.com's homegrown RPC framework, similar to Dubbo, with registry-based service discovery.
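A minimal sketch of such protocol adaptation follows: an HTTP request carrying the target interface, method, and parameters is converted into a generic RPC invocation. JSF is JD-internal, so the `GenericService` and `RpcStubFactory` interfaces below are hypothetical stand-ins modeled on the generic-invocation style of Dubbo-like frameworks.

```java
import java.util.Map;

public class ProtocolAdapter {

    /** Hypothetical generic-invocation facade of the internal RPC framework. */
    interface GenericService {
        Object $invoke(String methodName, String[] parameterTypes, Object[] args);
    }

    /** Hypothetical lookup of a generic stub by interface name and alias/group. */
    interface RpcStubFactory {
        GenericService lookup(String interfaceName, String alias);
    }

    private final RpcStubFactory stubFactory;

    public ProtocolAdapter(RpcStubFactory stubFactory) {
        this.stubFactory = stubFactory;
    }

    /**
     * httpParams would come from the RESTful request, e.g. interface=com.xx.OrderService,
     * method=getOrder, with parameter types and arguments decoded from the JSON body.
     */
    public Object adapt(Map<String, String> httpParams,
                        String[] parameterTypes, Object[] args) {
        String interfaceName = httpParams.get("interface");
        String method = httpParams.get("method");
        String alias = httpParams.getOrDefault("alias", "default");

        GenericService stub = stubFactory.lookup(interfaceName, alias);
        // The RESTful call is now an RPC call; no per-interface code lives in the gateway.
        return stub.$invoke(method, parameterTypes, args);
    }
}
```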

Fourth, security protection. This part is critical, because the gateway is the company's front door. At this layer we do anti-abuse work, such as scrubbing malicious traffic and maintaining blacklists; when malicious traffic appears, we reject it at the gateway with measures such as IP blocking, preventing it from overwhelming the gateway.

2. Self-developed gateway architecture

2.1 Self-developed gateway architecture

Our self-developed gateway architecture is mainly divided into three layers.

The first layer: the access layer. It is mainly responsible for accepting long and short connections, rate limiting, black/white lists, routing, load balancing, failover, and so on. This layer is implemented with Nginx + Lua.

The second layer: the distribution layer (or the gateway's business layer). It relies mostly on NIO + Servlet 3 asynchronous technology, and is further divided into several parts.

  • The top part is data validation, which handles signature verification, time verification, version, method, and so on.

  • The next part is the generalized-call layer, which converts the RESTful requests exposed by the gateway into JD.com's internal protocol and performs dynamic adaptation of the call. Here we rely heavily on caching; thread isolation and circuit breaking are also implemented at this layer. Because there is a large amount of data and protocol conversion, this layer makes extensive use of caches: none of the data at the gateway layer goes straight through to the DB; it is served from caches populated with heterogeneous data.

Alongside the generalization layer are two components: active notification and sandbox testing. Active notification is easy to understand: through a TCP downlink channel we notify the client in a timely manner, pushing coupons or reminders such as JD account messages. Sandbox testing means that an interface is tested in an isolated environment before it is released externally.

As shown in the figure, the rightmost parts are service degradation, logging, and monitoring/alerting. These three are the support systems of the entire gateway. Service degradation means that when certain services have problems they are downgraded as quickly as possible; logs are what we use to troubleshoot; monitoring and alerting will be highlighted below, because a gateway's availability is improved largely through its monitoring system. Without monitoring and alerting we would be working blind, with no way of knowing anything.

The third layer: the various back-end business APIs (business interfaces), which are exposed externally through the gateway.

The entire gateway is roughly divided into these three layers: the access layer at the top, the gateway's distribution layer with its business validation and business logic in the middle, and then the requests passed transparently through the gateway to the back-end services.

Beyond these three layers, there are two systems on either side that are core, important supports of the entire gateway.

  • Gateway registry. Back-end interfaces of all kinds can be published through the gateway registry. The system has a management interface: as long as a back-end API service is written to the agreed protocol and the format checks out, it can be uploaded to the management console and published online with one click. Of course, an interface is tested before it is released.

  • OA authentication center. This is mainly used for authentication. Security checks, such as the signature verification in the data-validation layer, are unified here.

2.2 Technology stack

The technology stack of our gateway system: first, Nginx + Lua at the access layer; second, NIO + Servlet 3 asynchronous processing; third, separation techniques; fourth, degradation and rate limiting; fifth, circuit breaking; sixth, caching, deciding what should be cached and what can be read directly from the database; seventh, heterogeneous data; eighth, fail-fast; and finally, monitoring and statistics, a very important part of a highly available gateway system.

Below is a deeper discussion of the scenarios in which these technologies apply, including the problems we use them to solve.

3. Basic ideas and process improvement points

Practice 1: Unified access at the Nginx layer

First look at the gateway's online deployment architecture. Traffic enters the JD gateway through LVS soft load balancing. The first layer is the core Nginx; behind it is the business Nginx, and the business Nginx passes our requests through to the back-end servers.

The core Nginx mainly distributes front-end traffic; rate limiting and anti-abuse, for instance, are done at this layer. Below it is the business Nginx, where the main Nginx + Lua logic lives. This layer also relieves the core Nginx of load and CPU pressure, and Lua application logic such as rate limiting, anti-abuse, authentication, and degradation is all implemented here.

Why add the Nginx + Lua layer? Compared with Tomcat and the like, Nginx is a server that can handle extremely high concurrent traffic. We ran into problems precisely because of this: under very heavy concurrency, once a back-end machine went wrong, even if you downgraded that interface, the real traffic still reached the JVM at the Tomcat layer, and under heavy load the JVM could not digest it. The consequence was that when Tomcat had a problem, restarting rarely fixed it, because the traffic was still there; a restart releases everything for a moment, but like a virus the problem spreads back and forth, and the batch you just restarted is immediately infected again. Nginx is natively NIO and asynchronous and handles high-concurrency workloads very well, so we moved core functions such as degradation and flow control to this layer and let it absorb the traffic for us at the very front.

Practice 2: Introducing NIO and using Servlet3 for asynchrony

The second practice is to introduce NIO at the Tomcat layer, using a JDK 7 + Tomcat 7 + Servlet 3 configuration to turn synchronous requests into asynchronous ones, and then using NIO multiplexing so that we can handle far more concurrent requests at the same time.

Servlet 3 asynchronous processing improves throughput; the response time of a single request becomes slightly longer, but the loss is tolerable, because it brings higher throughput and more flexibility to the whole application and is well worth using.

The specific strategy is: the business method opens the asynchronous context (AsyncContext); the current Tomcat processing thread is released and reused for the next request, which raises throughput; the business logic completes inside the AsyncContext, and calling its complete method writes the response back to the response stream. This increases Tomcat's capacity for business processing, letting us serve many more requests with a very small number of threads at this layer, without being overwhelmed under heavy traffic.
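A minimal Servlet 3 sketch of this strategy is shown below: startAsync() frees the Tomcat request thread, the business work runs on a separate pool, and complete() writes the response back. The business pool size and the back-end call are illustrative, and the code is written in Java 8 style for brevity even though the original setup used JDK 7.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.servlet.AsyncContext;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/api/*", asyncSupported = true)
public class AsyncGatewayServlet extends HttpServlet {

    // Business pool, isolated from Tomcat's request-parsing threads.
    private final ExecutorService businessPool = Executors.newFixedThreadPool(200);

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        final AsyncContext ctx = req.startAsync();   // open the async context
        ctx.setTimeout(1000);                        // fail fast if business work hangs

        businessPool.submit(() -> {
            try {
                String result = callBackendService();              // business logic
                ctx.getResponse().getWriter().write(result);
            } catch (Exception e) {
                ((HttpServletResponse) ctx.getResponse())
                        .setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
            } finally {
                ctx.complete();                      // write the response back
            }
        });
        // The Tomcat thread returns here immediately and can serve the next request.
    }

    private String callBackendService() {
        return "ok";                                 // placeholder for the real pass-through call
    }
}
```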

Practice 3: The Art of Separation

This section will pick two of the most important separation techniques to share.

Separation of request parsing and business processing

The first is to use NIO to separate the request-parsing threads from the threads that later handle business processing.

Request parsing is handled by Tomcat threads; in NIO mode a very small number of threads can handle a very large number of connections. Business logic processing and response generation are handled by a separate Tomcat thread pool, isolated from the request-parsing threads. These business thread pools can be isolated further, with different pools set up for different kinds of business.

Business thread pool separation

The second is business thread pool separation: different interfaces, or different categories of interface, are isolated with thread-isolation techniques. For example, order-related interfaces are handled by 20 dedicated threads and commodity-related interfaces by 10 dedicated threads, so the interfaces stay independent of one another; if one has a problem, it exhausts at most its own threads and never takes threads away from other interfaces.

Concretely, thread isolation assigns a fixed-size group of threads to an interface according to its business needs. When that interface has a problem it uses up its own threads only and does not occupy the threads of other interfaces, so a failure in a single API does not affect the others.
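A minimal sketch of this per-business isolation follows; the group names and pool sizes mirror the 20/10 example above and are illustrative. Each group gets its own bounded pool, so a saturated group rejects new work instead of stealing threads from other groups.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class IsolatedBusinessPools {

    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();

    public IsolatedBusinessPools() {
        pools.put("order", newBoundedPool(20));      // order-related interfaces
        pools.put("commodity", newBoundedPool(10));  // commodity-related interfaces
    }

    private static ExecutorService newBoundedPool(int size) {
        return new ThreadPoolExecutor(
                size, size,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),               // bounded queue
                new ThreadPoolExecutor.AbortPolicy());        // reject instead of spilling over
    }

    /** Submit the work to the pool owned by this business group only. */
    public <T> Future<T> submit(String group, Callable<T> task) {
        ExecutorService pool = pools.get(group);
        if (pool == null) {
            throw new IllegalArgumentException("No isolated pool for group: " + group);
        }
        return pool.submit(task);   // a RejectedExecutionException means this group is saturated
    }
}
```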

Practice 4: Downgrade

Downgrading mainly means that when an interface has a problem, we degrade it directly and let the call return immediately without involving other applications. Likewise, when a weaker piece of business logic has a problem, we degrade that logic directly so it does not affect the golden path.

How to downgrade?

First of all, degradation switches should be managed centrally, for example pushed to the various application services through ZooKeeper, so that as soon as a problem occurs the right switch can be found and flipped.

The switch configuration system itself must be highly available and support multi-level caching. For example, in our ZooKeeper-based implementation there is database storage behind the configuration, then a local cache, and then a snapshot: if ZooKeeper cannot be read, the underlying data is loaded from the snapshot, so that once a switch is flipped it takes effect as quickly as possible. And the switch system must never become a problem for other systems; it is a very thin, very lightweight layer.
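The sketch below illustrates that layered fallback, under stated assumptions: switches are normally pushed from a config center (for example a ZooKeeper watch) into a local in-memory cache, and a snapshot file on disk is the last-resort fallback when the config center cannot be read. The file path and the push hook are assumptions, not JD's implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DegradeSwitchStore {

    private final Map<String, Boolean> localCache = new ConcurrentHashMap<>();
    private final Path snapshotFile = Paths.get("/tmp/degrade-switch.snapshot");

    /** Called by the config-center listener (e.g. a ZooKeeper watch) on every push. */
    public void onConfigPush(String switchName, boolean on) {
        localCache.put(switchName, on);
        persistSnapshot();
    }

    /** Reads hit the local cache; the config center is never on the request path. */
    public boolean isDegraded(String switchName) {
        Boolean value = localCache.get(switchName);
        if (value == null) {
            value = loadFromSnapshot(switchName);   // config center unreachable / cold start
        }
        return Boolean.TRUE.equals(value);
    }

    private void persistSnapshot() {
        StringBuilder sb = new StringBuilder();
        localCache.forEach((k, v) -> sb.append(k).append('=').append(v).append('\n'));
        try {
            Files.write(snapshotFile, sb.toString().getBytes(StandardCharsets.UTF_8));
        } catch (IOException ignored) {
            // snapshot is best-effort; the in-memory cache is still authoritative
        }
    }

    private Boolean loadFromSnapshot(String switchName) {
        try {
            for (String line : Files.readAllLines(snapshotFile, StandardCharsets.UTF_8)) {
                String[] kv = line.split("=", 2);
                if (kv.length == 2 && kv[0].equals(switchName)) {
                    return Boolean.parseBoolean(kv[1]);
                }
            }
        } catch (IOException ignored) {
            // no snapshot yet: treat as "not degraded"
        }
        return null;
    }
}
```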


Fine-grained flow control

Having covered switches, traffic control, and degradation, let's look at multi-dimensional traffic control and degradation strategies: control by a single API, or by API plus region, carrier, and other dimensions. Once a problem occurs we can degrade along various combinations of these dimensions, and we can also control traffic at per-second or per-minute granularity, achieving fine-grained traffic management.

Graceful degradation

On degradation, what I said above is mostly at the technical level; at the business level we also need graceful degradation. We cannot simply return a 502 to the front end the moment a branch of logic is degraded; that is clearly unfriendly. We coordinate with the front end so that, after degradation, a meaningful error code is returned, or the user gets a prompt with instructions, which makes the experience much better.

Practice 5: Current limiting

For malicious requests and malicious attacks, malicious traffic is allowed to hit only the cache, and malicious IPs can be blocked with deny rules at the Nginx layer.

Rate limiting also keeps traffic from exceeding the system's carrying capacity: even with capacity planning there are always surprises, and without rate limiting the entire system collapses once the peak load is exceeded.
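Below is a minimal sketch of such a limiter, which also covers the multi-dimensional control from the fine-grained flow control section above: a fixed per-second window counter keyed by an arbitrary dimension string such as "api" or "api:region". The limits, key format, and the degradedResponse() placeholder in the usage comment are illustrative assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DimensionRateLimiter {

    private static class Window {
        long windowStartSec;
        long count;
    }

    private final Map<String, Window> windows = new ConcurrentHashMap<>();
    private final Map<String, Long> limitsPerSecond = new ConcurrentHashMap<>();

    public void setLimit(String dimensionKey, long permitsPerSecond) {
        limitsPerSecond.put(dimensionKey, permitsPerSecond);
    }

    /** Returns true if the request may pass; false means "limit and degrade". */
    public boolean tryAcquire(String dimensionKey) {
        Long limit = limitsPerSecond.get(dimensionKey);
        if (limit == null) {
            return true;                         // no limit configured for this dimension
        }
        long nowSec = System.currentTimeMillis() / 1000;
        Window w = windows.computeIfAbsent(dimensionKey, k -> new Window());
        synchronized (w) {
            if (w.windowStartSec != nowSec) {    // new second: reset the window
                w.windowStartSec = nowSec;
                w.count = 0;
            }
            w.count++;
            return w.count <= limit;
        }
    }
}

// Usage in the gateway's dispatch path (key and limit are hypothetical):
//   limiter.setLimit("getOrder:south-china", 5000);
//   if (!limiter.tryAcquire("getOrder:south-china")) { /* return a degraded response */ }
```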

Practice 6: Circuit breaking (fusing)

When a back-end service has a problem and a certain threshold is reached, the system can automatically cut it off and degrade it; that is the general idea of circuit breaking. We make the configuration flexible: for example, if an interface times out or returns errors three times in a row, it is automatically broken; a latency threshold can also be configured, say, if three consecutive calls to a method each take more than 50 milliseconds, the method is automatically broken. After the break, which amounts to degradation, subsequent calls return failure immediately; that is, they are rejected outright.

There can also be behavior after the break: for example, after 5 seconds or a minute the breaker enters a half-open state and probes whether the service is healthy again; if there is no problem, the previously broken API is reopened and can serve traffic normally. There are open-source implementations that do circuit breaking well, and following the idea here you can also implement it yourself; it is not a particularly complicated thing.
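A minimal sketch of that breaker logic follows; the thresholds are the ones from the text (3 consecutive failures or slow calls over 50 ms trip it, half-open after 5 seconds). It is an illustration under those assumptions, not JD's implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private static final int FAILURE_THRESHOLD = 3;      // 3 consecutive failures trip it
    private static final long SLOW_CALL_MILLIS = 50;     // calls slower than this count as failures
    private static final long OPEN_WAIT_MILLIS = 5_000;  // half-open after 5 seconds

    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final AtomicLong openedAt = new AtomicLong();

    /** Ask before calling the back end; false means "reject immediately" (degraded). */
    public boolean allowRequest() {
        if (state.get() == State.OPEN) {
            if (System.currentTimeMillis() - openedAt.get() >= OPEN_WAIT_MILLIS) {
                state.compareAndSet(State.OPEN, State.HALF_OPEN);  // let one probe through
                return true;
            }
            return false;
        }
        return true;
    }

    /** Report the outcome of the call so the breaker can update its state. */
    public void record(boolean success, long elapsedMillis) {
        boolean failure = !success || elapsedMillis > SLOW_CALL_MILLIS;
        if (!failure) {
            consecutiveFailures.set(0);
            state.set(State.CLOSED);                               // probe succeeded: close again
            return;
        }
        if (state.get() == State.HALF_OPEN
                || consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
            openedAt.set(System.currentTimeMillis());
            state.set(State.OPEN);                                 // trip the breaker
        }
    }
}
```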

Practice 7: Fail fast - timeouts along the call chain

Failing fast is a very important practice, not only for gateway systems but for any system, especially one with a huge call volume; pay particular attention to the timeout settings along the entire call chain. This is one of the things we review most carefully when preparing for Double 11 and 618 every year, and something we watch whenever we develop or bring a new module online. We go through all of the system's external dependencies: the gateway depends on some of our own business caches and databases, and even more on thousands of different back-end services.

Anything that goes over the network must have a timeout set. For a system with as many calls as a gateway, if no timeout is set the default may be several minutes; the whole gateway can then collapse in an instant, with no interface usable externally. With that much traffic, you may be washed away before you even get the chance to degrade.
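As a minimal sketch of "every network call gets an explicit timeout", the snippet below uses the JDK's HttpURLConnection; the concrete millisecond values are illustrative. Without setConnectTimeout/setReadTimeout the call can block far too long and pile up threads exactly as described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FailFastHttpCall {

    public static String call(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(200);   // ms allowed to establish the connection
        conn.setReadTimeout(500);      // ms allowed to wait for the response
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString(StandardCharsets.UTF_8.name());
        } finally {
            conn.disconnect();
        }
    }
}
```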

Practice 8: Monitoring and statistics - the application layer

Monitoring and statistics are a core part of the gateway system. Only with monitoring and alerting can we know, in real time, everything that happens and every API call that is made.

Monitoring goals

First: guard the system around the clock, 24/7;

Second: monitor the system's running state in real time, for example which API calls are taking too long and which APIs are broken;

Third: statistics and indicator analysis, for example whether any API call timed out over the past day, and whether access performance has degraded;

Fourth: real-time alerting. Monitoring is only part of it; being notified the moment a problem is found, so we can deal with it immediately, also keeps the system healthier.

Monitoring range

Dimensions of monitoring

  • The first layer: hardware monitoring, such as system CPU, memory, network cards, and so on.

  • The second layer: custom monitoring, such as raising alarms directly.

  • The third layer: performance monitoring, such as each interface's TP latency indicators (TP999, TP99, TP90, TP50), which serve as SLA reference standards, plus availability; these are very important for a gateway.

  • The fourth layer: heartbeat monitoring. There are many gateway machines online; what is the current state of each one, and is it still alive?

  • The fifth layer: business-layer monitoring, such as JVM monitoring and monitoring of the number of Nginx connections.

JD.com has a very complete monitoring system called UMP, which helps us monitor at every level. It mostly works through configuration files: once we add the configuration, the system is monitored. In practice we monitor all methods through AOP-style proxies. Because the gateway passes a large number of back-end calls through and generates these interfaces dynamically, it does not know in advance which interfaces exist; so when an interface is generated dynamically, monitoring is injected into it automatically via AOP, one monitor per interface.
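The sketch below shows the shape of such a per-interface monitoring wrapper around a dynamically generated pass-through call. UMP is JD-internal, so the MetricsRecorder interface is a hypothetical stand-in; in a real system the TP50/TP99/TP999 percentiles mentioned above would be computed from the recorded durations.

```java
import java.util.concurrent.Callable;

public class MonitoredCall {

    /** Hypothetical sink for call durations and outcomes, one key per generated interface. */
    interface MetricsRecorder {
        void record(String monitorKey, long elapsedMillis, boolean success);
    }

    private final MetricsRecorder recorder;

    public MonitoredCall(MetricsRecorder recorder) {
        this.recorder = recorder;
    }

    /** Wrap any pass-through call; the monitor key is created when the interface is generated. */
    public <T> T invoke(String monitorKey, Callable<T> call) throws Exception {
        long start = System.currentTimeMillis();
        boolean success = false;
        try {
            T result = call.call();
            success = true;
            return result;
        } finally {
            // Always record, so timeouts and failures show up in the TP and availability stats.
            recorder.record(monitorKey, System.currentTimeMillis() - start, success);
        }
    }
}
```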

On the subject of monitoring, I have to add that the gateway is a pass-through system with all kinds of interfaces and business logic behind it. The performance of each piece of business logic and each interface must be monitored, and the owning team must be informed so they can fix issues. So after adding all this monitoring, we have to be able to notify the responsible people, including ourselves, when there is a problem. We therefore send out daily and weekly report emails so that every system owner knows the state of their services, for example whether there are performance problems that need to be addressed.

As the saying goes, tall buildings rise from the ground. In the process of architectural evolution, everything from core module code to the core architecture keeps evolving, and this process is worth studying and thinking about in depth. If this article helped you, please follow!
