5. RedisResponseTimeoutException from Redisson caused by a null value

Project scenario:

Redis is used in the project to store some hot data.


Problem Description

The morning after an upgrade, the production monitoring group was flooded with Redis-related exceptions. The exception stack is as follows:

org.springframework.dao.QueryTimeoutException: Redis server response timeout (3000 ms) occured after 0 retry attempts. 
Increase nettyThreads and/or timeout settings. Try to define pingConnectionInterval setting. Command: (INCRBY), 
params: [[-84, -19, 0, 5, 116, 0, 18, 101, 110, 113, ...], 9], 
channel: [id: 0xd***e, L:/17*.*.*.*:3***6 - R:r-2zeijkfagqgv2g42ye.redis.rds.aliyuncs.com/17*.*.*.*:6379]; 
nested exception is org.redisson.client.RedisResponseTimeoutException: 
Redis server response timeout (3000 ms) occured after 0 retry attempts. 
Increase nettyThreads and/or timeout settings. Try to define pingConnectionInterval setting. Command: (INCRBY),
params: [[-84, -19, 0, 5, 116, 0, 18, 101, 110, 113, ...], 9], 
channel: [id: 0xdcb***ce, L:/17*.*.*.*:3***6 - R:r-2zeijkfagqgv2g42ye.redis.rds.aliyuncs.com/10.1*****:6379]
	at org.redisson.spring.data.connection.RedissonExceptionConverter.convert(RedissonExceptionConverter.java:48)
	at org.redisson.spring.data.connection.RedissonExceptionConverter.convert(RedissonExceptionConverter.java:35)
	at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
	at org.redisson.spring.data.connection.RedissonConnection.transform(RedissonConnection.java:197)
	at org.redisson.spring.data.connection.RedissonConnection.syncFuture(RedissonConnection.java:192)
	at org.redisson.spring.data.connection.RedissonConnection.sync(RedissonConnection.java:360)
	at org.redisson.spring.data.connection.RedissonConnection.write(RedissonConnection.java:726)
	at org.redisson.spring.data.connection.RedissonConnection.incrBy(RedissonConnection.java:574)
	at org.springframework.data.redis.core.DefaultValueOperations.lambda$increment$1(DefaultValueOperations.java:167)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:223)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:190)
	at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:97)
	at org.springframework.data.redis.core.DefaultValueOperations.increment(DefaultValueOperations.java:167)

Cause Analysis:

First, I checked the stack and found a bug in our code that allowed a null value to be written to Redis. Since the value was null only in rare cases, I planned to fix it that night and moved on to investigating why the timeout occurred.
1. Contacted operations to check the Redis server's health and its parameter configuration; nothing had been changed recently and no problems were found.
2. Checked the network; there was no jitter.
3. Started reading the Redisson source code. At this point a colleague reported that a RedisResponseTimeoutException occurs whenever the value stored in Redis is null. Combining the source with this observation exposed the problem; the analysis continues below.
4. On the surface the log really does describe a timeout. The first sentence, "Redis server response timeout (3000 ms) occured after 0 retry attempts. Increase nettyThreads and/or timeout settings. Try to define pingConnectionInterval setting.", means no response arrived within 3000 ms, and 3000 ms is Redisson's response timeout. So where is that timeout enforced? In the source code, the important methods are write, writeAsync, async, execute, checkWriteFuture, and scheduleResponseTimeout:

<T> T write(byte[] key, Codec codec, RedisCommand<?> command, Object... params) {
    RFuture<T> f = executorService.writeAsync(key, codec, command, params);
    indexCommand(command);
    return sync(f);
}

public <T, R> RFuture<R> writeAsync(byte[] key, Codec codec, RedisCommand<T> command, Object... params) {
    NodeSource source = getNodeSource(key);
    return async(false, source, codec, command, params, false, false);
}

public <V, R> RFuture<R> async(boolean readOnlyMode, NodeSource source, Codec codec,
        RedisCommand<V> command, Object[] params, boolean ignoreRedirect, boolean noRetry) {
    CompletableFuture<R> mainPromise = createPromise();
    RedisExecutor<V, R> executor = new RedisExecutor<>(readOnlyMode, source, codec, command, params, mainPromise,
                                                ignoreRedirect, connectionManager, objectBuilder, referenceType, noRetry);
    executor.execute();
    return new CompletableFutureWrapper<>(mainPromise);
}

public void execute() {
    if (mainPromise.isCancelled()) {
        free();
        return;
    }

    if (!connectionManager.getShutdownLatch().acquire()) {
        free();
        mainPromise.completeExceptionally(new RedissonShutdownException("Redisson is shutdown"));
        return;
    }

    codec = getCodec(codec);

    CompletableFuture<RedisConnection> connectionFuture = getConnection().toCompletableFuture();

    CompletableFuture<R> attemptPromise = new CompletableFuture<>();
    mainPromiseListener = (r, e) -> {
        if (mainPromise.isCancelled() && connectionFuture.cancel(false)) {
            log.debug("Connection obtaining canceled for {}", command);
            timeout.cancel();
            if (attemptPromise.cancel(false)) {
                free();
            }
        }
    };

    if (attempt == 0) {
        mainPromise.whenComplete((r, e) -> {
            if (this.mainPromiseListener != null) {
                this.mainPromiseListener.accept(r, e);
            }
        });
    }

    scheduleRetryTimeout(connectionFuture, attemptPromise);

    connectionFuture.whenComplete((connection, e) -> {
        if (connectionFuture.isCancelled()) {
            connectionManager.getShutdownLatch().release();
            return;
        }

        if (connectionFuture.isDone() && connectionFuture.isCompletedExceptionally()) {
            connectionManager.getShutdownLatch().release();
            exception = convertException(connectionFuture);
            return;
        }

        sendCommand(attemptPromise, connection);

        writeFuture.addListener(new ChannelFutureListener() {
            @Override
            public void operationComplete(ChannelFuture future) throws Exception {
                checkWriteFuture(writeFuture, attemptPromise, connection);
            }
        });
    });

    attemptPromise.whenComplete((r, e) -> {
        releaseConnection(attemptPromise, connectionFuture);
        checkAttemptPromise(attemptPromise, connectionFuture);
    });
}

private void checkWriteFuture(ChannelFuture future, CompletableFuture<R> attemptPromise, RedisConnection connection) {
    if (future.isCancelled() || attemptPromise.isDone()) {
        return;
    }

    if (!future.isSuccess()) {
        exception = new WriteRedisConnectionException(
                "Unable to write command into connection! Increase connection pool size. Node source: " + source + ", connection: " + connection +
                ", command: " + LogHelper.toString(command, params)
                + " after " + attempt + " retry attempts", future.cause());
        if (attempt == attempts) {
            attemptPromise.completeExceptionally(exception);
        }
        return;
    }

    timeout.cancel();

    scheduleResponseTimeout(attemptPromise, connection);
}

A Redisson write is an asynchronous operation wrapped to look synchronous: the sync method blocks waiting for the future's result. Since the actual work is asynchronous, how is the timeout enforced? The answer is org.redisson.command.RedisExecutor#scheduleResponseTimeout. This method schedules a task to run after 3000 ms (configurable); the task completes the operation's promise with a timeout exception. In other words, if the task is not cancelled within 3000 ms, a timeout exception is thrown. The core logic is this piece of code:

private void scheduleResponseTimeout(CompletableFuture<R> attemptPromise, RedisConnection connection) {
    long timeoutTime = responseTimeout;
    if (command != null && command.isBlockingCommand()) {
        Long popTimeout = null;
        if (RedisCommands.BLOCKING_COMMANDS.contains(command)) {
            for (int i = 0; i < params.length - 1; i++) {
                if ("BLOCK".equals(params[i])) {
                    popTimeout = Long.valueOf(params[i + 1].toString()) / 1000;
                    break;
                }
            }
        } else {
            popTimeout = Long.valueOf(params[params.length - 1].toString());
        }

        handleBlockingOperations(attemptPromise, connection, popTimeout);
        if (popTimeout == 0) {
            return;
        }
        timeoutTime += popTimeout * 1000;
        // add 1 second due to issue https://github.com/antirez/redis/issues/874
        timeoutTime += 1000;
    }

    long timeoutAmount = timeoutTime;
    TimerTask timeoutResponseTask = timeout -> {
        if (isResendAllowed(attempt, attempts)) {
            if (!attemptPromise.cancel(false)) {
                return;
            }

            connectionManager.newTimeout(t -> {
                attempt++;
                if (log.isDebugEnabled()) {
                    log.debug("attempt {} for command {} and params {}",
                            attempt, command, LogHelper.toString(params));
                }

                mainPromiseListener = null;
                execute();
            }, retryInterval, TimeUnit.MILLISECONDS);
            return;
        }

        attemptPromise.completeExceptionally(
                new RedisResponseTimeoutException("Redis server response timeout (" + timeoutAmount + " ms) occured"
                        + " after " + attempt + " retry attempts. Increase nettyThreads and/or timeout settings. Try to define pingConnectionInterval setting. Command: "
                        + LogHelper.toString(command, params) + ", channel: " + connection.getChannel()));
    };

    timeout = connectionManager.newTimeout(timeoutResponseTask, timeoutTime, TimeUnit.MILLISECONDS);
}
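To make the mechanism concrete, here is a minimal sketch of the same pattern, assuming Netty's HashedWheelTimer (the class and method names in the sketch are illustrative, not Redisson's API): arm a timer that fails the promise, unless the real response completes the promise first and cancels the timer.

import io.netty.util.HashedWheelTimer;
import io.netty.util.Timeout;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class ResponseTimeoutSketch {
    private static final HashedWheelTimer TIMER = new HashedWheelTimer();

    static <T> CompletableFuture<T> withResponseTimeout(CompletableFuture<T> promise, long millis) {
        // Arm a timer: if the promise is still pending when it fires, fail it.
        Timeout timeout = TIMER.newTimeout(
                t -> promise.completeExceptionally(
                        new RuntimeException("response timeout (" + millis + " ms)")),
                millis, TimeUnit.MILLISECONDS);
        // If the real response arrives first, the completion cancels the timer.
        promise.whenComplete((r, e) -> timeout.cancel());
        return promise;
    }
}

Redisson's real version additionally resends the command while retries remain, as the isResendAllowed branch above shows.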

Back in the execute method, writeFuture is given a listener that calls checkWriteFuture once the write operation completes; in other words, the request has actually been sent before scheduleResponseTimeout runs. So did the Redis server receive it? Watching the server with redis-cli's monitor command, we could indeed see the write requests arrive, and simulating the Redis server showed the SET operation being processed very quickly. Does the problem then lie on the response-handling side? Next, let's look at what Redisson does when handling requests and responses; see the org.redisson.client.handler.RedisChannelInitializer#initChannel method:

@Override
protected void initChannel(Channel ch) throws Exception {
    initSsl(config, ch);
    
    if (type == Type.PLAIN) {
        ch.pipeline().addLast(new RedisConnectionHandler(redisClient));
    } else {
        ch.pipeline().addLast(new RedisPubSubConnectionHandler(redisClient));
    }

    ch.pipeline().addLast(
        connectionWatchdog,
        CommandEncoder.INSTANCE,
        CommandBatchEncoder.INSTANCE);

    if (type == Type.PLAIN) {
        ch.pipeline().addLast(new CommandsQueue());
    } else {
        ch.pipeline().addLast(new CommandsQueuePubSub());
    }

    if (pingConnectionHandler != null) {
        ch.pipeline().addLast(pingConnectionHandler);
    }
    
    if (type == Type.PLAIN) {
        ch.pipeline().addLast(new CommandDecoder(config.getAddress().getScheme()));
    } else {
        ch.pipeline().addLast(new CommandPubSubDecoder(config));
    }

    ch.pipeline().addLast(new ErrorsLoggingHandler());

    config.getNettyHook().afterChannelInitialization(ch);
}

Many familiar names appear in this code, such as pingConnectionHandler, CommandEncoder, and connectionWatchdog. From it we can sketch the outbound and inbound pipelines. The outbound pipeline (requests to Redis) is as follows:

ErrorsLoggingHandler -> CommandsQueue -> CommandBatchEncoder -> CommandEncoder

The inbound pipeline (responses from Redis, including connection establishment and the like) is as follows:

RedisConnectionHandler -> ConnectionWatchdog -> PingConnectionHandler -> CommandDecoder -> ErrorsLoggingHandler
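Both orderings fall out of the single addLast(...) sequence in initChannel: Netty delivers inbound events head-to-tail and outbound operations tail-to-head. Here is a small self-contained sketch (plain Netty with illustrative handler names, not Redisson code) that demonstrates the reversal:

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.channel.embedded.EmbeddedChannel;

public class PipelineOrderSketch {

    // A handler that logs its name in both directions and passes events along.
    static ChannelDuplexHandler named(String name) {
        return new ChannelDuplexHandler() {
            @Override
            public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
                System.out.println("inbound  -> " + name);
                ctx.fireChannelRead(msg);
            }

            @Override
            public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
                System.out.println("outbound -> " + name);
                ctx.write(msg, promise);
            }
        };
    }

    public static void main(String[] args) {
        // Added head-to-tail, mirroring initChannel's addLast order.
        EmbeddedChannel ch = new EmbeddedChannel(
                named("CommandEncoder"), named("CommandsQueue"), named("ErrorsLoggingHandler"));
        ch.writeOutbound("SET k v");  // prints ErrorsLoggingHandler, CommandsQueue, CommandEncoder
        ch.writeInbound("+OK\r\n");   // prints CommandEncoder, CommandsQueue, ErrorsLoggingHandler
    }
}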

On the outbound (request) path, our analysis says an exception is thrown inside CommandEncoder when the value being set is null, interrupting the request; with a normal value the path behaves correctly. On the inbound (response) path, the handler that matters most is CommandDecoder; the others handle logging or connection events. So let's focus on the decoder. The first step of decoding is critical: retrieving the command that the incoming response belongs to. This is how Redisson does it:

protected QueueCommand getCommand(ChannelHandlerContext ctx) {
    Queue<QueueCommandHolder> queue = ctx.channel().attr(CommandsQueue.COMMANDS_QUEUE).get();
    QueueCommandHolder holder = queue.peek();
    if (holder != null) {
        return holder.getCommand();
    }
    return null;
}

It takes a queue from the channel and peeks a QueueCommandHolder off it as the command whose response is currently being processed. If commands can be peeked from this queue, when are they inserted? In the outbound pipeline there is a CommandsQueue handler, and that is where the command gets enqueued:

@Override
public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
    if (msg instanceof QueueCommand) {
        QueueCommand data = (QueueCommand) msg;
        QueueCommandHolder holder = new QueueCommandHolder(data, promise);

        Queue<QueueCommandHolder> queue = ctx.channel().attr(COMMANDS_QUEUE).get();

        while (true) {
            if (lock.compareAndSet(false, true)) {
                try {
                    queue.add(holder);
                    ctx.writeAndFlush(data, holder.getChannelPromise());
                } finally {
                    lock.set(false);
                }
                break;
            }
        }
    } else {
        super.write(ctx, msg, promise);
    }
}

At this point we roughly understand how the queue is written and read. Since peek does not remove an element, something must eventually poll the data off the queue. Redisson does this once decoding completes, by calling the following method:

protected void sendNext(Channel channel) {
    Queue<QueueCommandHolder> queue = channel.attr(CommandsQueue.COMMANDS_QUEUE).get();
    queue.poll();
    state(null);
}

With this analysis in hand, the problem is essentially solved. In our case, when stringRedisTemplate set a null value, a NullPointerException was thrown at the CommandEncoder stage, so the request was never sent, and consequently there was no CommandDecoder stage for it. But in the pipeline order, CommandsQueue runs before CommandEncoder, which means the set-null command (call it instruction 1) had already been pushed into the queue, and without a decode stage nothing ever polled it back out. When the next, normal set (instruction 2) came along, its request reached Redis and Redis responded; but in the decode stage, the command peeked from the queue was still instruction 1, so decoding failed against the wrong command (promise.tryFailure), while instruction 2 remained in the queue waiting for its own turn to be polled and processed. Instruction 2 therefore timed out. From then on, every subsequent request failed, because the QueueCommand peeked during decoding always belonged to an earlier request, until PingConnectionHandler hit an error and closed the connection, and ConnectionWatchdog rebuilt the channel and reset its state.
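To recap the cascade in code, here is a hypothetical reproduction (assuming a StringRedisTemplate backed by redisson-spring-data, as in this project; a sketch of the observed behavior, not a verbatim excerpt):

import org.springframework.data.redis.core.StringRedisTemplate;

public class NullValueCascadeSketch {

    static void reproduce(StringRedisTemplate stringRedisTemplate) {
        try {
            // Instruction 1: the null value makes CommandEncoder throw, but
            // CommandsQueue sits before the encoder on the outbound path, so the
            // command holder is already enqueued and nothing ever polls it off.
            stringRedisTemplate.opsForValue().set("k1", null);
        } catch (Exception encodeFailure) {
            // Fails on the client side; nothing was sent to Redis.
        }

        // Instruction 2: sent to Redis and answered, but CommandDecoder peeks
        // instruction 1 from the queue and mispairs the response. Instruction 2
        // eventually fails with RedisResponseTimeoutException, as does every
        // later command on this channel, until PingConnectionHandler errors out
        // and ConnectionWatchdog rebuilds the connection.
        stringRedisTemplate.opsForValue().set("k2", "v");
    }
}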
For a mature open-source framework, carrying this bug is rather unfortunate, but the application-side fix is clear: add an up-front check on the value before writing.


Solution:

Before putting data into Redis, add non-null checks on both the key and the value.
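A minimal sketch of such a guard (the wrapper name and shape are illustrative, not from the original code):

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.util.StringUtils;

public class SafeRedisWriter {

    private final StringRedisTemplate redisTemplate;

    public SafeRedisWriter(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    public void set(String key, String value) {
        // Reject null/empty keys and null values up front, instead of letting
        // CommandEncoder fail after the command has already been queued.
        if (!StringUtils.hasText(key) || value == null) {
            throw new IllegalArgumentException("Redis key and value must not be null; key=" + key);
        }
        redisTemplate.opsForValue().set(key, value);
    }
}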

Insights:

The actual investigation into how the Redisson failure spread was far messier than this article makes it look. For one thing, Redisson's error log is quite misleading: the first reaction on seeing it is that either the network or the Redis server is at fault. Only after a long analysis ruled both of those out did we turn to Redisson's own implementation. During that analysis, my initial theory was that the connection was being closed during the request phase without its handle or reference being released, which also consumed a lot of time; printing plenty of TRACE logs along the way helped eliminate many wrong directions. All in all, analyzing a problem like this is time-consuming, but it is also a process of continuous learning. Along the way I became much more familiar with Redisson, which will be a useful reference for my own application code in the future.

Origin blog.csdn.net/Java__EE/article/details/131326432