WebSocket cluster solution

1. The cause of the problem

While working on a recent project, I ran into a requirement for communication between multiple users, which raised two problems: handling the WebSocket handshake request, and sharing the WebSocket session across a cluster.

After a few days of research, I worked out several ways to implement a distributed WebSocket cluster, moving from Zuul to Spring Cloud Gateway along the way, and summarized them in this article. I hope it helps someone, and I would be glad to exchange thoughts and research on this topic.

Here is a description of my scenario:

  • Resources: 4 servers. One server with an SSL-certified domain name, one Redis + MySQL server, and two application servers (the cluster).

  • Deployment constraints: The application has to be published under an SSL-certified domain name, so the server holding that domain acts as the API gateway, responsible for HTTPS requests and WSS (WebSocket over TLS) connections. This is commonly known as HTTPS offloading: the user requests the HTTPS domain (e.g. https://oiscircle.com/xxx), but the real backend traffic is plain HTTP to internal IP addresses. As long as the gateway machine is powerful enough, it can front multiple applications.

  • Requirements: When a user logs in to the application, a WSS connection is established with the server. Different roles can send one-to-one or group messages.

  • Service type of each cluster instance: each instance serves both stateless HTTP requests and long-lived WS connections.

2. System architecture diagram

(figure: system architecture diagram)

In my implementation, each application server handles both HTTP and WS requests. The chat functionality established over WS could also be split out as a separate module; from a distributed point of view the two designs are similar, but in terms of convenience it is easier for one application to serve both HTTP and WS requests, as explained below.

The technology stack covered in this article:

  • Eureka service discovery and registration

  • Redis Session sharing

  • Redis message subscription

  • Spring Boot

  • Zuul gateway

  • Spring Cloud Gateway gateway

  • Spring WebSocket handles long connections

  • Ribbon Load Balancing

  • Netty multi-protocol NIO network communication framework

  • Consistent hashing algorithm

I assume that anyone who has read this far is familiar with the technologies listed above; if not, there are plenty of introductory tutorials online. The rest of the article builds on them, and I will take that understanding for granted.

3. Technical feasibility analysis

Below I describe the characteristics of the two kinds of session, and then, based on those characteristics, list several cluster solutions for handling WS requests in a distributed architecture.

WebSocketSession vs. HttpSession

In Spring's WebSocket integration, every WS connection has a corresponding session: WebSocketSession. Once a WS connection is established, we can communicate with the client like this:

protected void handleTextMessage(WebSocketSession session, TextMessage message) throws Exception {
   System.out.println("Message received by the server: " + message.getPayload());
   // send a message back to the client
   session.sendMessage(new TextMessage("message"));
}

Here is the problem: a WebSocketSession cannot be serialized to Redis, so in a cluster we cannot cache all WebSocketSessions in Redis for session sharing; each server only holds its own sessions. HttpSession is different: Redis-backed HttpSession sharing is well supported, but there is no equivalent for WebSocket sessions, so the "share WebSocket sessions through Redis" route is not feasible.

Some people may think: can I cache the session's key information in Redis, have each server in the cluster fetch it, and rebuild the WebSocketSession from it? If anyone manages to make that approach work, please tell me...

That is the difference between sharing WebSocket sessions and HTTP sessions. HTTP session sharing already has a mature and very simple solution: just add the spring-session-data-redis and spring-boot-starter-redis dependencies, and any demo found online will show you the rest. For WebSocket sessions, because of how WebSocket is implemented at the bottom layer, true session sharing cannot be achieved.

4. Evolution of the solution

4.1. Netty vs. Spring WebSocket

At first I tried building the WebSocket server with Netty. Netty has no concept of a WebSocket session; instead, each client connection is represented by a channel. The front-end WS request arrives at the port Netty listens on, and after the WebSocket protocol handshake, messages are processed by a series of handlers (the chain-of-responsibility pattern). Just as with a WebSocketSession, once the connection is established the server holds a channel through which it can communicate with the client:

   /**
    * TODO: assign clients to different groups according to the id passed in
    */
   private static final ChannelGroup GROUP = new DefaultChannelGroup(ImmediateEventExecutor.INSTANCE);

   @Override
   protected void channelRead0(ChannelHandlerContext ctx, TextWebSocketFrame msg) throws Exception {
       System.out.println("Message received by the server from " + ctx.channel().id() + ": " + msg.text());
       // retain() bumps the reference count so the frame stays valid for the next call;
       // write the message to every channel in the group, i.e. to all clients
       GROUP.writeAndFlush(msg.retain());
   }

So, should the server use Netty or Spring WebSocket? Below I compare the advantages and disadvantages of the two implementations from several angles.

4.2. Implementing WebSocket with Netty

Anyone who has used Netty knows that its thread model is NIO-based and supports very high concurrency. Before Spring 5, Spring's network model was built on the Servlet stack; starting with Spring 5, the reactive stack (WebFlux) runs on Netty by default. If we build the WebSocket server with plain Netty, raw speed is guaranteed, but we may run into the following problems:

  1. It is inconvenient to integrate with the rest of the system; for RPC you cannot enjoy the convenience of Feign service calls in Spring Cloud.

  2. Business logic may have to be implemented twice.

  3. Using Netty may mean reinventing the wheel.

  4. Connecting to the service registry is also a hassle.

  5. RESTful services and the WS service have to be implemented separately. Imagine how troublesome it would be to implement RESTful services on Netty; most of us are used to Spring's one-stop RESTful development.

4.3. Implementing the WS service with Spring WebSocket

Spring WebSocket is well integrated into Spring Boot, so developing a WS service on Spring Boot is very convenient, and the practice is very simple.

4.3.1. Step 1: add the dependency

<dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-websocket</artifactId>
</dependency>

4.3.2. Step 2: Add configuration class

@Configuration
public class WebSocketConfig implements WebSocketConfigurer {

    @Override
    public void registerWebSocketHandlers(WebSocketHandlerRegistry registry) {
        registry.addHandler(myHandler(), "/")
                .setAllowedOrigins("*");
    }

    @Bean
    public WebSocketHandler myHandler() {
        return new MessageHandler();
    }
}

4.3.3. Step 3: Implement the message listener class

@Component
@SuppressWarnings("unchecked")
public class MessageHandler extends TextWebSocketHandler {
   // CopyOnWriteArrayList: sessions are added and removed from multiple threads
   private final List<WebSocketSession> clients = new CopyOnWriteArrayList<>();

   @Override
   public void afterConnectionEstablished(WebSocketSession session) {
       clients.add(session);
       System.out.println("uri: " + session.getUri());
       System.out.println("connection established: " + session.getId());
       System.out.println("current sessions: " + clients.size());
   }

   @Override
   public void afterConnectionClosed(WebSocketSession session, CloseStatus status) {
       clients.remove(session);
       System.out.println("connection closed: " + session.getId());
   }

   @Override
   protected void handleTextMessage(WebSocketSession session, TextMessage message) {
       String payload = message.getPayload();
       Map<String, String> map = JSONObject.parseObject(payload, HashMap.class);
       System.out.println("received data: " + map);
       clients.forEach(s -> {
           try {
               System.out.println("sending message to: " + s.getId());
               s.sendMessage(new TextMessage("server echoing the received message: " + payload));
           } catch (Exception e) {
               e.printStackTrace();
           }
       });
   }
}

This demo shows how convenient it is to implement a WS service with Spring WebSocket. To stay aligned with the rest of the Spring Cloud family, I ultimately chose Spring WebSocket.

So my application architecture is: one application is responsible for both the RESTful services and the WS service. I did not split the WS module out, first out of laziness, and second because the only difference splitting makes is one extra layer of IO through Feign calls between services.

5. Migrating from Zuul to Spring Cloud Gateway

To implement a WebSocket cluster, we inevitably have to move from Zuul to Spring Cloud Gateway, for the following reason:

Zuul 1.0 does not support WebSocket forwarding. Support was added in Zuul 2.0, which was open-sourced a few months ago, but 2.0 is not integrated into Spring Boot and its documentation is incomplete. The migration is therefore necessary, and it is also easy to do.

In the gateway, to achieve SSL termination and dynamic load-balanced routing, the following parts of the yml configuration are necessary; they can save you from some pitfalls in advance.

server:
  port: 443
  ssl:
    enabled: true
    key-store: classpath:xxx.jks
    key-store-password: xxxx
    key-store-type: JKS
    key-alias: alias
spring:
  application:
    name: api-gateway
  cloud:
    gateway:
      httpclient:
        ssl:
          handshake-timeout-millis: 10000
          close-notify-flush-timeout-millis: 3000
          close-notify-read-timeout-millis: 0
          useInsecureTrustManager: true
      discovery:
        locator:
          enabled: true
          lower-case-service-id: true
      routes:
      - id: dc
        uri: lb://dc
        predicates:
        - Path=/dc/**
      - id: wecheck
        uri: lb://wecheck
        predicates:
        - Path=/wecheck/**

To make HTTPS offloading work smoothly, we also need to configure a filter; otherwise requests to the gateway fail with the error "not an SSL/TLS record".

@Component
public class HttpsToHttpFilter implements GlobalFilter, Ordered {
  private static final int HTTPS_TO_HTTP_FILTER_ORDER = 10099;
  @Override
  public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
      URI originalUri = exchange.getRequest().getURI();
      ServerHttpRequest request = exchange.getRequest();
      ServerHttpRequest.Builder mutate = request.mutate();
      String forwardedUri = request.getURI().toString();
      if (forwardedUri != null && forwardedUri.startsWith("https")) {
          try {
              URI mutatedUri = new URI("http",
                      originalUri.getUserInfo(),
                      originalUri.getHost(),
                      originalUri.getPort(),
                      originalUri.getPath(),
                      originalUri.getQuery(),
                      originalUri.getFragment());
              mutate.uri(mutatedUri);
          } catch (Exception e) {
              throw new IllegalStateException(e.getMessage(), e);
          }
      }
      ServerHttpRequest build = mutate.build();
      ServerWebExchange webExchange = exchange.mutate().request(build).build();
      return chain.filter(webExchange);
  }
 
  @Override
  public int getOrder() {
      return HTTPS_TO_HTTP_FILTER_ORDER;
  }
}

With this, the gateway can offload HTTPS requests, and our basic framework is in place: the gateway can forward both HTTPS and WSS requests. What remains is the communication scheme for many-to-many messaging between users. I will present the solutions in order of increasing elegance, starting with the least elegant.

6. Session broadcast

This is the simplest WebSocket cluster communication solution. The scenario is as follows:

Teacher A wants to send a group message to her students:

  • The teacher's message request reaches the gateway; the content says, in effect, {I am teacher A, I want to send message xxx to my students}.

  • The gateway receives the message, obtains the IP addresses of all cluster instances, and forwards the teacher's request to each of them.

  • Each server in the cluster receives the request and checks, based on teacher A's information, whether it holds any sessions for the target students locally; if so, it calls sendMessage on them, otherwise it ignores the request.

(figure: session broadcast flow)

Session broadcasting is very simple to implement, but it has a fatal flaw: wasted computation. When a server holds none of the target sessions, the traversal it performs is pure overhead. Still, when the concurrency requirements are low, this scheme is worth considering first, because it is easy to implement.

In Spring Cloud, the information of every server in a service cluster can be obtained as follows:

@Resource
private EurekaClient eurekaClient;

Application app = eurekaClient.getApplication("service-name");
// InstanceInfo contains a server's ip, port, and other details
InstanceInfo instanceInfo = app.getInstances().get(0);
System.out.println("ip address: " + instanceInfo.getIPAddr());
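Combining the instance lookup above with a per-instance call, the gateway-side broadcast can be sketched as follows. This is my own minimal illustration, not code from the project: the instance addresses would come from EurekaClient as shown above, and `sender` stands in for the real HTTP call to each instance.

```java
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical sketch: fan one user's message out to every cluster instance.
public class SessionBroadcaster {
    private final BiConsumer<String, String> sender; // (address, payload) -> io call

    public SessionBroadcaster(BiConsumer<String, String> sender) {
        this.sender = sender;
    }

    // Returns how many instances were called. Every instance gets the request,
    // even those holding no matching session: this is the wasted work described above.
    public int broadcast(List<String> instanceAddresses, String payload) {
        for (String address : instanceAddresses) {
            sender.accept(address, payload);
        }
        return instanceAddresses.size();
    }
}
```

With n instances, every message costs n calls regardless of where the receiver actually is, which is exactly the inefficiency this section describes.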

Each server needs to maintain a mapping table from user id to session: when a session is established, the mapping is added; when the session is closed, the entry must be removed.
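A minimal sketch of such a mapping table might look like the following. Note this is an illustration with names of my own choosing; `WsSession` is a stand-in interface for Spring's WebSocketSession, which (as discussed earlier) cannot be serialized to Redis.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Per-server userId -> session mapping table.
public class SessionRegistry {
    // Stand-in for org.springframework.web.socket.WebSocketSession
    public interface WsSession {
        void sendText(String text);
    }

    private final Map<String, WsSession> sessions = new ConcurrentHashMap<>();

    // called from afterConnectionEstablished
    public void register(String userId, WsSession session) {
        sessions.put(userId, session);
    }

    // called from afterConnectionClosed
    public void unregister(String userId) {
        sessions.remove(userId);
    }

    // returns true only if this instance holds the receiver's session
    public boolean sendToLocalUser(String userId, String text) {
        WsSession session = sessions.get(userId);
        if (session == null) {
            return false; // receiver is connected to another instance, or offline
        }
        session.sendText(text);
        return true;
    }
}
```

The boolean return is what lets a broadcast receiver silently ignore requests for users it does not hold.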

7. Implementing the consistent hashing algorithm (the main point of this article)

This is, in my opinion, the most elegant implementation. It takes some time to understand, but if you read patiently I believe you will gain something. Once again: if you are not familiar with the consistent hashing algorithm, please learn about it first. From here on, assume the hash ring is searched clockwise.

First, to apply the idea of consistent hashing to our WebSocket cluster, we need to solve the following new problems:

  • When a cluster node goes DOWN, keys on the hash ring may still map to the DOWN node.

  • When a cluster node comes UP, old keys may no longer map to the node that actually holds their sessions.

  • The hash ring must be shared for reading and writing.

In a cluster, services going UP or DOWN is a fact of life.

Analysis of the node-DOWN case:

When a server goes DOWN, the WebSocket sessions it holds are closed automatically and the front end is notified, but the hash ring would still map keys to the dead node. We only need to delete the real node and its virtual nodes from the hash ring when the server is detected as DOWN, so that the gateway never forwards to a DOWN server.

Implementation: listen for the cluster's DOWN events in the Eureka registry and update the hash ring promptly.
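To make this concrete, here is a minimal, self-contained sketch of such a ring: a sorted map keyed by hash, with virtual nodes, where addNode/removeNode would be driven by Eureka UP/DOWN events and lookup is the clockwise search. The class and method names are my own, not from the original project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hash ring with virtual nodes; clockwise lookup = ceiling entry.
public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public HashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addNode(String node) {              // e.g. on a eureka UP event
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#VN" + i), node);
        }
    }

    public void removeNode(String node) {           // e.g. on a eureka DOWN event
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#VN" + i));
        }
    }

    // clockwise: first virtual node at or after the key's hash, wrapping around
    public String locate(String key) {
        if (ring.isEmpty()) return null;
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF); // first 8 digest bytes as ring position
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The gateway would call locate(userId) to pick the instance, and the Eureka listener would call removeNode/addNode as cluster membership changes.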

Analysis of the node-UP case:

Now suppose a new service, CacheB, comes online in the cluster, and its ip address happens to map between key1 and CacheA. From then on, every message from the user corresponding to key1 is routed to CacheB, and obviously it cannot be delivered, because CacheB does not hold the session for key1.

(figure: node UP breaks the mapping for key1)

At this point we have two options.

Option A is simple but heavy-handed:

When Eureka detects a node-UP event, update the hash ring from the current cluster information, then disconnect all sessions and let the clients reconnect. The clients then connect according to the updated hash ring, so no message can go astray.

Option B is more complicated but touches far fewer sessions:

First consider the case without virtual nodes. Suppose server CacheB comes online between CacheC and CacheA. All keys that hash between CacheC and CacheB now look for their sessions on CacheB. In other words, once CacheB is online, only users whose keys fall between CacheC and CacheB are affected. So we only need to disconnect, on CacheA, the sessions whose keys fall between CacheC and CacheB, and let those clients reconnect.

(figure: hash ring without virtual nodes)

Next, the case with virtual nodes; assume the light-colored nodes are virtual. The long brackets indicate which Cache a region of the ring maps to. First, before node C comes online: every virtual node of B points to the real node B, so the counterclockwise segment before each B node maps to B (because we agreed the ring is searched clockwise).

(figure: hash ring with virtual nodes, before C comes online)

Next, after node C comes online: some regions of the ring are now taken over by C.

(figure: hash ring with virtual nodes, after C comes online)

From this we can see that when a node comes online, many of its virtual nodes come online at the same time, so we have to disconnect the sessions whose keys fall into several separate ranges (the red parts in the figure). The exact algorithm is a bit involved and the implementation varies from person to person; you can try implementing it yourself.
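One simple way to sidestep the multi-range bookkeeping is to compare each connected user's ring owner before and after the ring update: exactly the users whose owner changed must reconnect. This is my own sketch, not the article's algorithm; the two functions stand in for hash-ring lookups (key to server ip) on the old and new ring.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.Function;

// Option B as an owner-diff: only sessions whose ring owner changed reconnect.
public class RingRebalancer {
    public static List<String> usersToReconnect(Collection<String> connectedUserIds,
                                                Function<String, String> ownerBefore,
                                                Function<String, String> ownerAfter) {
        List<String> affected = new ArrayList<>();
        for (String userId : connectedUserIds) {
            if (!ownerBefore.apply(userId).equals(ownerAfter.apply(userId))) {
                affected.add(userId); // key now falls in a range the new node owns
            }
        }
        return affected;
    }
}
```

This costs one ring lookup per connected user on a node-UP event, but avoids having to enumerate the red ranges explicitly.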

Where should the hash ring live?

  • The gateway creates and maintains the hash ring locally. When a WS request arrives, the gateway looks up the ring locally, finds the mapped server, and forwards the request. This looks good but is actually undesirable: a server going DOWN can only be detected through Eureka, so after Eureka sees the DOWN event it would have to notify the gateway over IO to delete the node. That is too cumbersome, and offloading Eureka's responsibilities onto the gateway is not recommended.

  • Eureka creates the ring and puts it into Redis for shared reads and writes. This is feasible: when Eureka detects a service DOWN, it modifies the hash ring and pushes it to Redis. To keep response times low, the gateway must not fetch the ring from Redis on every WS request; since the ring changes rarely, the gateway can simply use Redis message subscription and subscribe to ring-modification events.

At this point our Spring WebSocket cluster is almost complete, the core being the consistent hashing algorithm. One last technical hurdle remains: how does the gateway forward a WS request to the specific cluster server?

The answer is load balancing. Both Spring Cloud Gateway and Zuul integrate Ribbon as the default load balancer. We only need to rewrite Ribbon's load-balancing rule: when the WS connection is established, hash the user id sent by the client, look up the ip on the hash ring, and forward the WS request to that ip. The process is shown below:

(figure: WS request forwarding via the rewritten load balancer)

For subsequent messages, we simply hash the user's id and look up the corresponding ip on the ring; that tells us which server holds the session established with that user.

8. Ribbon's imperfections in Spring Cloud Finchley.RELEASE

During actual development I found two imperfections in Ribbon...

  • Following the approach found online, extending AbstractLoadBalancerRule and overriding the load-balancing strategy, requests to different applications became mixed up. With two services A and B registered in Eureka, after overriding the strategy, requests for either A or B would eventually be mapped to only one of them. Very strange! Perhaps the Spring Cloud Gateway documentation needs an official demo of correctly overriding the load-balancing strategy.

  • Consistent hashing needs a key, such as the user id: hash the key, search the ring, return the ip. But Ribbon's choose function does not expose the key parameter properly; it is hard-coded to the default!

(figure: Ribbon's choose called with the default key)

Is there nothing we can do? In fact, there is a feasible, if temporary, workaround.

As shown below, the client first sends an ordinary HTTP request (carrying the id) to the gateway; the gateway hashes the id, finds the ip on the hash ring, and returns it to the client; the client then opens the WS connection directly against that ip.

(figure: two-step flow, HTTP lookup first, then direct WS connection)

Because of Ribbon's incomplete key handling, we cannot implement consistent hashing inside Ribbon for now; it can only be achieved indirectly by having the client make two requests (one HTTP, one WS). Hopefully Ribbon fixes this soon, so we can implement our WebSocket cluster a little more elegantly.
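The client side of this two-step workaround is trivial; here is a sketch under my own naming (the `/ws?uid=` path is a made-up example, and `resolveIp` stands for the ordinary HTTP lookup against the gateway):

```java
import java.util.function.Function;

// Two-step connect: resolve the owning instance over HTTP, then open WS to it.
public class TwoStepConnector {
    public static String wsUrlFor(String userId, Function<String, String> resolveIp) {
        String ip = resolveIp.apply(userId);       // step 1: HTTP lookup on the gateway
        return "ws://" + ip + "/ws?uid=" + userId; // step 2: WS url aimed directly at that instance
    }
}
```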

9. Postscript 

These are the results of my research over the past few days. Many problems came up along the way and were solved one by one, yielding two WebSocket cluster solutions: session broadcasting and consistent hashing.

Each scheme has its own advantages and disadvantages for different scenarios. This article deliberately does not use ActiveMQ, Kafka, or other message queues for message pushing; I simply wanted to implement long-connection communication between multiple users with my own ideas, without relying on a message queue, and hopefully offer you a different way of thinking.

Origin: blog.csdn.net/qq_34272760/article/details/121259313