Foreword

This bug is neither hard nor easy to describe: it is the kind where you simply cannot tell where the problem is. Admittedly, I may have failed to recognize it because of my own limited skill and experience. So, can you see the true nature of this bug at a glance?

(The problem surfaced suddenly, with no warning signs. Nobody had changed the code, and it had been running in production for a long time, which made it very strange. This "spy" was evidently very good at hiding.)

The hidden "spy"

Let's look at the code first (pseudocode).

Code


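/**
 * Two time-consuming tasks that query the database
 * @param countDownLatch latch counted down once per finished task
 * @param all shared list that collects the results of both tasks
 */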
public static void testCount(CountDownLatch countDownLatch, List<String> all) {
    for (int i = 0; i < 2; i++) {
        int finalI = i;
        ThreadPoolFactory.getGeneral().execute(() -> {
            try {
                List<String> countList = new ArrayList<>();
                // A for loop is used because the business query has to run once for state 0 and once for state 1
                if (finalI == 0) {
                    // In production this is a database query through a mapper; hard-coded here for the demo
                    countList.add("1");
                    countList.add("2");
                    countList.add("3");
                } else {
                    // In production this is a database query through a mapper; hard-coded here for the demo
                    countList.add("5");
                    countList.add("6");
                    countList.add("7");
                    countList.add("8");
                }
                if (countList != null) {
                    all.addAll(countList);
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            } finally {
                countDownLatch.countDown();
            }
        });
    }
}



// Thread pool class
public class ThreadPoolFactory {

    private static final Logger logger = LoggerFactory.getLogger(ThreadPoolFactory.class);

    private static final ThreadFactory GENERAL_THREAD_FACTORY = new ThreadFactoryBuilder().setNameFormat("general-pool-%d").build();

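    /**
     * corePoolSize: core pool size
     * maximumPoolSize: maximum pool size
     * keepAliveTime: maximum time an idle thread is kept alive
     * unit: time unit of keepAliveTime
     * workQueue: queue that holds waiting tasks; four common choices: 1. ArrayBlockingQueue: bounded queue, 2. SynchronousQueue: direct hand-off queue, 3. LinkedBlockingQueue: unbounded by default (bounded here with a capacity of 4096), 4. DelayQueue: delayed blocking queue
     * threadFactory: factory that creates the threads
     * handler: rejection policy; four built-in policies: 1. ThreadPoolExecutor.AbortPolicy(): throws RejectedExecutionException, 2. ThreadPoolExecutor.CallerRunsPolicy(): runs the task in the calling thread, 3. ThreadPoolExecutor.DiscardOldestPolicy(): drops the oldest queued task and retries, 4. ThreadPoolExecutor.DiscardPolicy(): silently drops the task
     */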
    private static final ExecutorService GENERAL = new ThreadPoolExecutor(5, 10,
            30L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(4096), GENERAL_THREAD_FACTORY, new ThreadPoolExecutor.AbortPolicy());

    public static ExecutorService getGeneral() {
        return GENERAL;
    }

}


// Test via the main method
public static void main(String[] args) throws Exception {
    List<String> all = new ArrayList<>();
    CountDownLatch countDownLatch = new CountDownLatch(2);
    testCount(countDownLatch,all);
    countDownLatch.await(10, TimeUnit.SECONDS);
    System.out.println(all);
}


If you are not familiar with the CountDownLatch used above, you can read my earlier article: "Practical tips! Usage scenarios for CountDownLatch".

By now you may or may not have spotted a clue. Let me state the symptom first: the final all list comes back empty, and the production endpoint had the same problem. The code above is a 1:1 pseudocode copy of the production code.
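When chasing an empty result like this, one cheap check is worth knowing about: the main method above ignores the return value of countDownLatch.await(10, TimeUnit.SECONDS). That overload returns false when the timeout elapses before the count reaches zero, so checking it tells "the tasks finished but produced nothing" apart from "we stopped waiting before they ran". A minimal sketch; the class and the helper finishedInTime are hypothetical names for illustration, not part of the production code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AwaitResultSketch {

    // Hypothetical helper: true only if the latch reached zero within the timeout.
    static boolean finishedInTime(CountDownLatch latch, long timeout, TimeUnit unit) {
        try {
            return latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        CountDownLatch done = new CountDownLatch(1);
        done.countDown(); // simulates a task that completed
        System.out.println(finishedInTime(done, 100, TimeUnit.MILLISECONDS)); // true

        CountDownLatch stuck = new CountDownLatch(1); // simulates a task that never ran
        System.out.println(finishedInTime(stuck, 100, TimeUnit.MILLISECONDS)); // false
    }
}
```

Logging a false result in production would show at once whether the list was printed before the workers had a chance to fill it.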

Let me first walk through my line of investigation:

1. A thread pool problem. I suspected that threads were not being recycled in time: with long-running tasks and high concurrency, the pool runs out of threads. The first fix that came to mind was to increase the number of threads.

2. Too much data in the database, making queries an order of magnitude slower than before, so that the queue eventually backs up and ties up the threads. (This is unlikely, because the database queries return quickly and there is no slow SQL that needs optimizing.)

3. The for loop itself. I suspected some mechanism was making the loop run fewer times or not at all, but removing the for loop still did not solve the problem.
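Idea 1 can be measured rather than guessed at, because ThreadPoolExecutor exposes its own state through getPoolSize() and getQueue(). The sketch below builds a pool with the same sizes as GENERAL (the default thread factory is used instead of the ThreadFactoryBuilder one, and the class and method names are made up for illustration), parks a few tasks on a latch, and reads how many threads exist and how many tasks are queued. It also shows a documented property of ThreadPoolExecutor that matters for this hypothesis: threads beyond corePoolSize are only created once the work queue is full.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolStateSketch {

    // Hypothetical helper: submits `tasks` blocked tasks and returns {pool size, queue size}.
    static int[] poolAndQueueSize(int tasks) {
        // Same sizes as GENERAL in the article; default thread factory is an assumption.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(5, 10,
                30L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(4096), new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until we have measured
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return new int[] { pool.getPoolSize(), pool.getQueue().size() };
        } finally {
            release.countDown(); // let all tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] state = poolAndQueueSize(8);
        // 5 core threads are busy, the other 3 tasks wait in the queue
        System.out.println("threads=" + state[0] + " queued=" + state[1]);
    }
}
```

With 8 blocked tasks this reports 5 threads and 3 queued tasks: the pool does not grow toward its maximum of 10 while the 4096-slot queue still has room. Logging the same two numbers from the real pool would show whether tasks are actually piling up in the queue.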

Verifying the first "spy"

First, I increased the core pool size and the maximum pool size, raising these two parameters to 10 and 20:

private static final ExecutorService GENERAL = new ThreadPoolExecutor(10, 20,
            30L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(4096), GENERAL_THREAD_FACTORY, new ThreadPoolExecutor.AbortPolicy());

After deploying this change, the data came back again, and I felt the big problem had been solved.

But as the saying goes, the truth is rarely that easy to find, and the first ones you catch are only the small fry. Sure enough, after running for about a week, the same problem appeared again. It felt like a water tank: make the tank bigger and it will still fill up one day. And we all know that a bigger thread pool is not necessarily a better one.

So what is the truth? If you already have the answer, post it in the comment section first, before reading the answer below.

Enter GPT, our "Detective Conan"

I won't say much about ChatGPT here. If you haven't heard of it yet, well... I can only beg you to go and look it up.

I pasted in the complete production code, and it replied with a list of possible causes.

I have to say, in a single reply that took only 5 seconds, it covered everything we had thought of and things we had not.

Clearly, we have already more or less ruled out its second and third points.

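That leaves the first point. We really should have thought of it long ago: in a multithreaded environment, thread safety comes first!!!

Finding the "real culprit"

Use the synchronized keyword to fix the thread safety problem

Use the synchronized keyword to synchronize access to the all list. That is, when multiple threads access the all list, they use the same lock to guarantee thread safety and avoid inconsistent data. This solves the problem of multiple threads accessing and modifying the data at the same time, which can cause data to be lost or corrupted.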
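As a rough illustration of that advice, here is a minimal sketch of testCount with the addAll call guarded by a lock on the shared list. The class name, the collect helper, and the Executors.newFixedThreadPool(4) stand-in for ThreadPoolFactory.getGeneral() are assumptions made so the example runs on its own; the production code may well look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SynchronizedAddAllSketch {

    // Hypothetical helper: runs `tasks` workers that each append three items to one shared list.
    static List<String> collect(int tasks) {
        List<String> all = new ArrayList<>();
        CountDownLatch latch = new CountDownLatch(tasks);
        // Assumption: a plain fixed pool stands in for ThreadPoolFactory.getGeneral().
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (int i = 0; i < tasks; i++) {
                final int id = i;
                pool.execute(() -> {
                    try {
                        // Stand-in for the mapper query; each task fills its own private list.
                        List<String> countList = new ArrayList<>();
                        countList.add(id + "-a");
                        countList.add(id + "-b");
                        countList.add(id + "-c");
                        // Only one thread at a time may mutate the shared list.
                        synchronized (all) {
                            all.addAll(countList);
                        }
                    } finally {
                        latch.countDown();
                    }
                });
            }
            // Fail loudly on a timeout instead of silently returning a partial list.
            if (!latch.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish within 10 seconds");
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(collect(100).size()); // 100 tasks x 3 items each
    }
}
```

Every call to collect(100) yields a list of 300 elements, because no two addAll calls can interleave. Wrapping the list with Collections.synchronizedList(new ArrayList<>()) would be another way to get the same guarantee for individual calls such as addAll.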

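Clever reader, did you find the "real culprit"???

Remember how we increased the number of threads to fix the problem? I asked one more question about that.

Enlarging the thread pool's parameters may improve the program's capacity for concurrent work, but it does not solve the problem at its root. If the thread pool fails to return data because of a data synchronization problem, then enlarging the thread pool only postpones the problem. In addition, increasing the core pool size also consumes more system resources.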

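AI is here, the future is here

One last thing: I won't belabor how powerful AI is. Going forward, I will keep using GPT to produce plenty of practical content and other material on the AI ecosystem, all collected in the AI column below. Let's learn and grow together. You are welcome to follow the AI column below and leave a like. Thank you all.

I spent two days failing to solve the problem; ChatGPT got it done in 5 seconds!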