Elasticsearch Bulk Update Deadlock Troubleshooting | JD Cloud Technical Team

1. Introduction to the affected system

  1. The system listens to product-change MQ messages, queries the latest product information, and calls BulkProcessor to update the product fields in the ES cluster in batches;

  2. Because the volume of product data is large, it is stored in an ES cluster whose index is split into 256 shards, and shard routing is based on the product's third-level category ID.

For example, if the name of a SKU changes, the system receives the change MQ message for that SKU, queries the product interface for the latest name, routes to the corresponding ES shard according to the SKU's third-level category ID, and then updates the product-name field on that shard.
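To make the routing step concrete, the sketch below (an illustration under default index settings, not the production code) shows roughly how a routing value such as the third-level category ID maps to one of the 256 shards: Elasticsearch hashes the _routing value with Murmur3 and takes it modulo the number of primary shards, so a changed category ID can send the same SKU to a different shard.

    import org.elasticsearch.cluster.routing.Murmur3HashFunction;

    public class RoutingSketch {

        private static final int NUMBER_OF_PRIMARY_SHARDS = 256;

        // Roughly how ES picks a shard from a routing value under default settings:
        // shard = floorMod(murmur3(_routing), number_of_primary_shards)
        static int shardFor(String routing) {
            return Math.floorMod(Murmur3HashFunction.hash(routing), NUMBER_OF_PRIMARY_SHARDS);
        }

        public static void main(String[] args) {
            // Changing the third-level category ID changes the routing value,
            // so the same SKU document may now be routed to a different shard.
            System.out.println(shardFor("9987"));   // old category ID (hypothetical value)
            System.out.println(shardFor("12345"));  // new category ID (hypothetical value)
        }
    }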

Because the volume of product-change MQ messages is huge, the system uses BulkProcessor for batched asynchronous updates in order to improve ES update throughput and prevent an MQ backlog.

The ES client version is as follows:

        <dependency>
            <artifactId>elasticsearch-rest-client</artifactId>
            <groupId>org.elasticsearch.client</groupId>
            <version>6.5.3</version>
        </dependency>

The pseudocode of BulkProcessor configuration is as follows:

        // build() is called here to construct the bulkProcessor; under the hood it uses bulk's asynchronous API
        this.fullDataBulkProcessor = BulkProcessor.builder((request, bulkListener) ->
                fullDataEsClient.getClient().bulkAsync(request, RequestOptions.DEFAULT, bulkListener), listener)
                // execute a bulk once 1000 requests have accumulated
                .setBulkActions(1000)
                // flush a bulk once 5 MB of data has accumulated
                .setBulkSize(new ByteSizeValue(5L, ByteSizeUnit.MB))
                // number of concurrent requests: 0 means no concurrency (synchronous execution), 1 allows one concurrent bulk to execute
                .setConcurrentRequests(1)
                // flush at a fixed interval of 1 s
                .setFlushInterval(TimeValue.timeValueSeconds(1L))
                // retry 5 times, at 1 s intervals
                .setBackoffPolicy(BackoffPolicy.constantBackoff(TimeValue.timeValueSeconds(1L), 5))
                .build();
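The listener variable passed into the builder above is not shown in the pseudocode. For completeness, here is a minimal sketch of what such a BulkProcessor.Listener could look like (the callback signatures are the client's; the comments describe typical usage, not the team's actual implementation):

    // Types are from org.elasticsearch.action.bulk
    BulkProcessor.Listener listener = new BulkProcessor.Listener() {
        @Override
        public void beforeBulk(long executionId, BulkRequest request) {
            // called before each bulk is sent, e.g. to log the batch size
        }

        @Override
        public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
            // called after a bulk returns; response.hasFailures() can be checked here
        }

        @Override
        public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
            // called when the whole bulk request fails with an exception
        }
    };

On shutdown, bulkProcessor.awaitClose(30, TimeUnit.SECONDS) flushes any remaining requests and waits for in-flight bulks to complete.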

2. How the problem was discovered

  1. After the 618 promotion started, product changes became frequent: the MQ message volume reached several times the normal daily level, and many products also had their third-level category IDs changed;

  2. When the system updated the SKU information for these products, the update was routed to a different shard based on the changed third-level category ID; the update on that shard failed with an error and, after 5 retries, still failed;

  3. Because the shard at the new routing destination holds no index document for the product, these update requests can never succeed, and a large number of retry exceptions were written to the system log files.

  4. A backlog of product-change MQ messages began to build up; the consumption rate clearly could not keep up with the production rate.

  5. The UMP monitoring data for the MQ consumers showed that consumption performance was stable with no obvious fluctuation, but after the system had been consuming for a while the call count fell off a cliff, from tens of thousands of calls per minute to single digits.

  6. After restarting the application, consumption resumed and the UMP call count returned to normal. However, after running for a while the system again stopped consuming, as if all consumer threads had been suspended.

3. Detailed troubleshooting process

First, pick a container whose MQ consumption has stalled, find the application's process ID, and use the jstack command (for example, jstack <pid> > jstack.log) to dump the full thread stack of the process. Package and upload the exported thread dump to https://fastthread.io/ for thread-state analysis. The analysis report is as follows:

The analysis report shows 124 threads in the BLOCKED state; clicking through gives the detailed stack of each thread, as follows:

Examining the detailed stacks of several of these threads shows that the MQ consumer threads are all waiting to lock <0x00000005eb781b10> (a org.elasticsearch.action.bulk.BulkProcessor). Searching the dump for 0x00000005eb781b10 shows that this object lock is held by another thread, whose stack is as follows:

That thread is in the WAITING state, and its name indicates that it is an internal thread of the ES client. It holds the lock that the business threads need while itself waiting for some other condition to be satisfied, so all the MQ consumer business threads are unable to acquire the lock inside BulkProcessor, which causes the consumption stall.

But why can this elasticsearch[scheduler][T#1] thread not make progress? When was it started? What is it for?

To answer that, we need a closer look at BulkProcessor. Since BulkProcessor is created through its builder, we start from the builder source code to understand how a BulkProcessor is constructed.

public static Builder builder(BiConsumer<BulkRequest, ActionListener<BulkResponse>> consumer, Listener listener) {
        Objects.requireNonNull(consumer, "consumer");
        Objects.requireNonNull(listener, "listener");
        final ScheduledThreadPoolExecutor scheduledThreadPoolExecutor = Scheduler.initScheduler(Settings.EMPTY);
        return new Builder(consumer, listener,
                (delay, executor, command) -> scheduledThreadPoolExecutor.schedule(command, delay.millis(), TimeUnit.MILLISECONDS),
                () -> Scheduler.terminate(scheduledThreadPoolExecutor, 10, TimeUnit.SECONDS));
    }

The builder creates a scheduled thread pool internally, and its thread naming matches the lock-holding thread seen above. The relevant code is as follows:

static ScheduledThreadPoolExecutor initScheduler(Settings settings) {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1,
                EsExecutors.daemonThreadFactory(settings, "scheduler"), new EsAbortPolicy());
        scheduler.setExecuteExistingDelayedTasksAfterShutdownPolicy(false);
        scheduler.setContinueExistingPeriodicTasksAfterShutdownPolicy(false);
        scheduler.setRemoveOnCancelPolicy(true);
        return scheduler;
    }
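Note the core pool size of 1 above: everything scheduled on this executor runs on a single worker thread, one task at a time. The JDK-only sketch below (an illustration, not the ES code) shows the consequence that matters later: if the task currently running on that single thread blocks, any other task scheduled on the same executor can never start.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ScheduledThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class SingleThreadSchedulerDemo {
        public static void main(String[] args) throws InterruptedException {
            ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
            CountDownLatch neverCounted = new CountDownLatch(1);

            // Task 1 blocks the only worker thread indefinitely.
            scheduler.schedule(() -> {
                try {
                    neverCounted.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, 0, TimeUnit.MILLISECONDS);

            // Task 2 is due after 100 ms but never starts, because the single
            // worker thread is still stuck inside task 1.
            scheduler.schedule(() -> System.out.println("never printed"), 100, TimeUnit.MILLISECONDS);

            Thread.sleep(1000);
            System.out.println("tasks still queued: " + scheduler.getQueue().size());
            scheduler.shutdownNow();
        }
    }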

Finally, the build method invokes BulkProcessor's constructor, and the constructor starts a periodic flush task. The code is as follows:

 BulkProcessor(BiConsumer<BulkRequest, ActionListener<BulkResponse>> consumer, BackoffPolicy backoffPolicy, Listener listener,
                  int concurrentRequests, int bulkActions, ByteSizeValue bulkSize, @Nullable TimeValue flushInterval,
                  Scheduler scheduler, Runnable onClose) {
        this.bulkActions = bulkActions;
        this.bulkSize = bulkSize.getBytes();
        this.bulkRequest = new BulkRequest();
        this.scheduler = scheduler;
        this.bulkRequestHandler = new BulkRequestHandler(consumer, backoffPolicy, listener, scheduler, concurrentRequests);
        // Start period flushing task after everything is setup
        this.cancellableFlushTask = startFlushTask(flushInterval, scheduler);
        this.onClose = onClose;
    }
private Scheduler.Cancellable startFlushTask(TimeValue flushInterval, Scheduler scheduler) {
        if (flushInterval == null) {
            return new Scheduler.Cancellable() {
                @Override
                public void cancel() {}

                @Override
                public boolean isCancelled() {
                    return true;
                }
            };
        }
        final Runnable flushRunnable = scheduler.preserveContext(new Flush());
        return scheduler.scheduleWithFixedDelay(flushRunnable, flushInterval, ThreadPool.Names.GENERIC);
    }
class Flush implements Runnable {

        @Override
        public void run() {
            synchronized (BulkProcessor.this) {
                if (closed) {
                    return;
                }
                if (bulkRequest.numberOfActions() == 0) {
                    return;
                }
                execute();
            }
        }
    }

From the source we can see that the flush task implements the fixed-interval flush configured when the BulkProcessor was created: once the setFlushInterval parameter is set, a background flush task is scheduled at that interval. When the flush task runs, it must first acquire the BulkProcessor object lock, and only then does it call the execute method. The related call chain is as follows:

/**
     * Adds the data from the bytes to be processed by the bulk processor
     */
    public synchronized BulkProcessor add(BytesReference data, @Nullable String defaultIndex, @Nullable String defaultType,
                                          @Nullable String defaultPipeline, @Nullable Object payload, XContentType xContentType) throws Exception {
        bulkRequest.add(data, defaultIndex, defaultType, null, null, null, defaultPipeline, payload, true, xContentType);
        executeIfNeeded();
        return this;
    }

    private void executeIfNeeded() {
        ensureOpen();
        if (!isOverTheLimit()) {
            return;
        }
        execute();
    }

    // (currently) needs to be executed under a lock
    private void execute() {
        final BulkRequest bulkRequest = this.bulkRequest;
        final long executionId = executionIdGen.incrementAndGet();

        this.bulkRequest = new BulkRequest();
        this.bulkRequestHandler.execute(bulkRequest, executionId);
    }

The add method above is the one called by the MQ consumer business threads, and it is also synchronized, so the MQ consumer threads compete directly with the flush task thread for the same BulkProcessor lock. The MQ consumer business-thread pseudocode is as follows:

    @Override
    public void upsertCommonSku(CommonSkuEntity commonSkuEntity) {
        String source = JsonUtil.toString(commonSkuEntity);
        // partial update of the SKU document...
        UpdateRequest updateRequest = new UpdateRequest(Constants.INDEX_NAME_SPU, Constants.INDEX_TYPE, commonSkuEntity.getSkuId().toString());
        updateRequest.doc(source, XContentType.JSON);
        // ...with an upsert in case the document does not exist yet
        IndexRequest indexRequest = new IndexRequest(Constants.INDEX_NAME_SPU, Constants.INDEX_TYPE, commonSkuEntity.getSkuId().toString());
        indexRequest.source(source, XContentType.JSON);
        updateRequest.upsert(indexRequest);
        // route by the third-level category ID, then hand the request to BulkProcessor
        updateRequest.routing(commonSkuEntity.getCat3().toString());
        fullbulkProcessor.add(updateRequest);
    }

The thread-stack analysis above shows that all business threads are waiting for the elasticsearch[scheduler][T#1] thread to release the BulkProcessor object lock, but that thread never releases it, so the business threads hang as if deadlocked.

Combined with the large number of retry exception logs in the application log files, this looked related to BulkProcessor's retry-on-failure strategy, so the next step was to examine BulkProcessor's retry logic. The BulkRequest assembled by the business threads is submitted to the execute method of the BulkRequestHandler object, whose code is as follows:

public final class BulkRequestHandler {
    private final Logger logger;
    private final BiConsumer<BulkRequest, ActionListener<BulkResponse>> consumer;
    private final BulkProcessor.Listener listener;
    private final Semaphore semaphore;
    private final Retry retry;
    private final int concurrentRequests;

    BulkRequestHandler(BiConsumer<BulkRequest, ActionListener<BulkResponse>> consumer, BackoffPolicy backoffPolicy,
                       BulkProcessor.Listener listener, Scheduler scheduler, int concurrentRequests) {
        assert concurrentRequests >= 0;
        this.logger = Loggers.getLogger(getClass());
        this.consumer = consumer;
        this.listener = listener;
        this.concurrentRequests = concurrentRequests;
        this.retry = new Retry(backoffPolicy, scheduler);
        this.semaphore = new Semaphore(concurrentRequests > 0 ? concurrentRequests : 1);
    }

    public void execute(BulkRequest bulkRequest, long executionId) {
        Runnable toRelease = () -> {};
        boolean bulkRequestSetupSuccessful = false;
        try {
            listener.beforeBulk(executionId, bulkRequest);
            semaphore.acquire();
            toRelease = semaphore::release;
            CountDownLatch latch = new CountDownLatch(1);
            retry.withBackoff(consumer, bulkRequest, new ActionListener<BulkResponse>() {
                @Override
                public void onResponse(BulkResponse response) {
                    try {
                        listener.afterBulk(executionId, bulkRequest, response);
                    } finally {
                        semaphore.release();
                        latch.countDown();
                    }
                }

                @Override
                public void onFailure(Exception e) {
                    try {
                        listener.afterBulk(executionId, bulkRequest, e);
                    } finally {
                        semaphore.release();
                        latch.countDown();
                    }
                }
            });
            bulkRequestSetupSuccessful = true;
            if (concurrentRequests == 0) {
                latch.await();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            logger.info(() -> new ParameterizedMessage("Bulk request {} has been cancelled.", executionId), e);
            listener.afterBulk(executionId, bulkRequest, e);
        } catch (Exception e) {
            logger.warn(() -> new ParameterizedMessage("Failed to execute bulk request {}.", executionId), e);
            listener.afterBulk(executionId, bulkRequest, e);
        } finally {
            if (bulkRequestSetupSuccessful == false) {  // if we fail on client.bulk() release the semaphore
                toRelease.run();
            }
        }
    }

    boolean awaitClose(long timeout, TimeUnit unit) throws InterruptedException {
        if (semaphore.tryAcquire(this.concurrentRequests, timeout, unit)) {
            semaphore.release(this.concurrentRequests);
            return true;
        }
        return false;
    }
}

The BulkRequestHandler constructor creates a Retry object and hands it a Scheduler, which is the same single-thread scheduler used by the flush task; that pool maintains only one fixed thread. The execute method first limits concurrency with a Semaphore whose permit count is the concurrentRequests value specified when building the BulkProcessor; in the configuration above it is 1, so only one bulk execution is allowed at a time, meaning the MQ consumer business threads and the flush task thread effectively take turns. Next, let's look at how the retry task is executed; see the following code:

 public void withBackoff(BiConsumer<BulkRequest, ActionListener<BulkResponse>> consumer, BulkRequest bulkRequest,
                            ActionListener<BulkResponse> listener) {
        RetryHandler r = new RetryHandler(backoffPolicy, consumer, listener, scheduler);
        r.execute(bulkRequest);
    }

RetryHandler executes the bulkRequest and also watches for execution failures, in which case it runs the retry logic. The retry code is as follows:

private void retry(BulkRequest bulkRequestForRetry) {
            assert backoff.hasNext();
            TimeValue next = backoff.next();
            logger.trace("Retry of bulk request scheduled in {} ms.", next.millis());
            Runnable command = scheduler.preserveContext(() -> this.execute(bulkRequestForRetry));
            scheduledRequestFuture = scheduler.schedule(next, ThreadPool.Names.SAME, command);
        }

RetryHandler re-submits the failed bulk request to the internal scheduler thread pool. As shown above, that pool contains only a single thread, and that thread may currently be occupied by the flush task. So if a retry is due while the only scheduler thread is executing the flush task, the retry cannot run; if the retry cannot run, the Semaphore permit it holds is never released; and because the configured concurrency is 1, the flush task is blocked inside execute waiting for exactly that permit. The result is a circular wait: the Semaphore permit and the BulkProcessor object lock can never be released, and every MQ consumer business thread blocks trying to acquire the BulkProcessor lock.
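To make the circular wait easier to see, the sketch below reproduces only its shape with plain JDK classes (it is not the Elasticsearch code): a single-thread scheduler, a Semaphore with one permit standing in for concurrentRequests = 1, a "flush" task that takes the processor lock and then waits for the permit, and a queued "retry" task that is the only thing that could release the permit.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    public class BulkDeadlockShape {

        static final Object processorLock = new Object(); // stands in for the BulkProcessor monitor
        static final Semaphore permits = new Semaphore(1); // stands in for concurrentRequests = 1

        public static void main(String[] args) throws Exception {
            ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

            // 1. A bulk request is in flight: it holds the only permit, and its retry
            //    has been scheduled on the single scheduler thread with a delay.
            permits.acquire();
            scheduler.schedule(() -> {
                System.out.println("retry runs and releases the permit"); // never happens
                permits.release();
            }, 2, TimeUnit.SECONDS);

            // 2. The periodic flush fires first on the same single thread: it takes the
            //    processor lock, then blocks waiting for a permit. The retry queued behind
            //    it can now never run, so the permit is never released.
            scheduler.schedule(() -> {
                synchronized (processorLock) {
                    System.out.println("flush holds the lock and waits for a permit...");
                    permits.acquireUninterruptibly(); // blocks forever
                }
            }, 0, TimeUnit.SECONDS);

            Thread.sleep(500);

            // 3. An MQ consumer thread calls add(): it blocks on the processor lock held by
            //    the stuck flush task, exactly like the BLOCKED threads in the jstack dump.
            new Thread(() -> {
                synchronized (processorLock) {
                    System.out.println("consumer got the lock"); // never printed
                }
            }, "mq-consumer").start();

            // The program now hangs; a thread dump would show "mq-consumer" BLOCKED on
            // processorLock and the scheduler thread WAITING on the Semaphore.
        }
    }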

A similar issue can also be found in the ES client repository on GitHub, for example https://github.com/elastic/elasticsearch/issues/47599 , which further confirms the conjecture: continuous bulk retries caused a deadlock inside BulkProcessor.

4. How to solve the problem

Now that the cause of the problem is understood, there are several possible solutions:

1. Upgrade the ES client to the official 7.6 release or later. These versions isolate the retry task thread pool from the flush task thread pool, so the two no longer compete for the same thread; however, version compatibility needs to be evaluated.

2. Since the deadlock is triggered by the large volume of failing retries, the client-side retry logic can be disabled, provided this does not affect business correctness. This avoids upgrading the client, but the business impact must be evaluated, and requests that fail need to be retried through some other channel, for example at the business level.
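For option 2, one possible way to remove the client-side retry with the same 6.5.3 client (a sketch based on the configuration shown earlier, not necessarily the team's actual change) is to configure a no-retry backoff policy and let failed bulks surface in the afterBulk callback, where they can be re-driven at the business level, for example by re-queueing the MQ message:

    this.fullDataBulkProcessor = BulkProcessor.builder((request, bulkListener) ->
            fullDataEsClient.getClient().bulkAsync(request, RequestOptions.DEFAULT, bulkListener), listener)
            .setBulkActions(1000)
            .setBulkSize(new ByteSizeValue(5L, ByteSizeUnit.MB))
            .setConcurrentRequests(1)
            .setFlushInterval(TimeValue.timeValueSeconds(1L))
            // No client-side retries: failures are reported to afterBulk and are
            // retried by the business layer instead of by the scheduler thread.
            .setBackoffPolicy(BackoffPolicy.noBackoff())
            .build();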

If anything above is missing or inaccurate, corrections are welcome!

Author: Jingdong Retail Cao Zhifei

Source: JD Cloud Developer Community
