New to multithreading and short on practice? It's actually simple, and you'll get it after reading this

Preface

Some time ago, a third-party platform showed that I had written more than 100,000 words. Hard to imagine, given that in high school I padded my 800-word essays with line breaks (those who know have surely done it too).

Since getting into this line of work, I have developed a habit: verify for yourself everything that can be verified.

So last Friday, in the gaps between overtime work, I wrote a tool:

https://github.com/crossoverJie/NOWS

It is built with SpringBoot and counts how many words you have written with a single command.

java -jar nows-0.0.1-SNAPSHOT.jar /xx/Hexo/source/_posts

Pass in the article directory to scan and the result is printed (currently only Markdown files ending in .md are supported).

Of course, the result is just for fun (more than 400,000 words). My early posts pasted in large blocks of code, and English words are not filtered out, so the number differs quite a bit from the platform's.

If only Chinese characters were counted, the result would certainly be accurate. The tool also has a flexible built-in extension mechanism that lets users customize the counting strategy; see below for details.

The tool itself is quite simple, with very little code, and there is not much worth discussing. But thinking back, whether in interviews or when chatting with readers online, I have noticed a common pattern:

Most novice developers have read about multithreading but have almost no hands-on practice with it. Some do not even know what multithreading is good for in real development.

So, based on this simple tool, I want to give those readers a practical, easy-to-understand multithreading case.

At the very least, you should come away knowing:

Why do you need multiple threads?
How do you implement a multithreaded program?
What problems does multithreading cause, and how are they solved?

Single-threaded counting

Before talking about multithreading, let's see how the single-threaded version is implemented.

The requirement this time is simple: scan a directory and read every file under it.

So the implementation has the following steps:

Read all files in a directory.
Keep all the file paths in memory.
Go through the files one by one, read the text, and record the word count.

Let's first look at how the first two steps are implemented. When the scan reaches a subdirectory, it has to keep reading the files inside that directory.

A scenario like this is a natural fit for recursion:

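import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Class name assumed from the scannerFile variable used later in the article
public class ScannerFile {

    // Paths of every .md file found so far, shared across recursive calls
    private final List<String> allFile = new ArrayList<>();

    public List<String> getAllFile(String path) {
        File[] files = new File(path).listFiles();
        if (files == null) {
            // path is not a directory, or it could not be read
            return allFile;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                // Recurse into the subdirectory
                getAllFile(file.getPath());
            } else if (file.getPath().endsWith(".md")) {
                allFile.add(file.getPath());
            }
        }
        return allFile;
    }
}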

Each file path is kept in a collection once it has been read.

Note that the recursion depth needs to be kept under control to avoid a stack overflow (StackOverflowError).
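One way to sidestep the recursion-depth concern is to replace the recursion with an explicit stack. The following is only a minimal sketch, not the tool's actual code; the class name IterativeScanner and the method name listMarkdownFiles are made up for illustration.

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of NOWS: walks a directory tree without recursion.
final class IterativeScanner {

    static List<String> listMarkdownFiles(String root) {
        List<String> result = new ArrayList<>();
        // Directories still waiting to be scanned; this replaces the call stack
        Deque<File> pending = new ArrayDeque<>();
        pending.push(new File(root));

        while (!pending.isEmpty()) {
            File current = pending.pop();
            File[] children = current.listFiles();
            if (children == null) {
                // Not a directory, or it could not be read
                continue;
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    pending.push(child);
                } else if (child.getPath().endsWith(".md")) {
                    result.add(child.getPath());
                }
            }
        }
        return result;
    }
}
```

The pending directories live on the heap rather than on the thread's call stack, so a deeply nested tree costs memory but cannot trigger a StackOverflowError.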

Finally, the file content is read with the Java 8 Stream API, which keeps the code concise:

Stream<String> stringStream = Files.lines(Paths.get(path), StandardCharsets.UTF_8);
List<String> collect = stringStream.collect(Collectors.toList());

The next step is to count the words while filtering out some special text (for example, I want to filter out all spaces, line breaks, hyperlinks, and so on).

Extensibility

For simple processing, you can iterate over collect from the code above and replace the content to be filtered with an empty string.
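As a rough illustration of that simple approach, the sketch below strips whitespace from each line and adds up what is left. The class and method names are hypothetical, and treating String.length() as the "word count" is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example, not the tool's code: the filtering is hard-coded in the loop.
final class SimpleCounter {

    static long count(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            // Replace everything that should be filtered with an empty string
            String cleaned = line.replaceAll("\\s+", "");
            total += cleaned.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "  a b  ", "");
        System.out.println(count(lines)); // prints 12
    }
}
```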

But everyone's needs may differ. For example, I only want to filter out spaces, line breaks, and hyperlinks, while others may want to remove all English words as well, or even keep the line breaks (just like when writing a school essay).

So a more flexible approach is needed.

If you have read my earlier post "Using the Chain of Responsibility Pattern to Design an Interceptor", it should be easy to see that the chain of responsibility pattern fits this scenario well.

I will not go into the details of the pattern here; interested readers can check that post.

Let's go straight to the implementation.

Define the abstract interface and the processing method of the chain:

public interface FilterProcess {
    /**
     * Process the text
     * @param msg the text to process
     * @return the processed text
     */
    String process(String msg) ;
}

Implementation of handling spaces and line breaks:

public class WrapFilterProcess implements FilterProcess{
    @Override
    public String process(String msg) {
        // Strip all whitespace, including spaces and line breaks
        msg = msg.replaceAll("\\s+", "");
        return msg ;
    }
}

Implementation of handling hyperlinks:

public class HttpFilterProcess implements FilterProcess{
    @Override
    public String process(String msg) {
        msg = msg.replaceAll("^((https|http|ftp|rtsp|mms)?:\\/\\/)[^\\s]+","");
        return msg ;
    }
}

With this in place, you only need to add these handlers to the chain of responsibility during initialization and expose an API for the client to call.
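The article does not show how the chain itself is assembled, so the following is only a sketch of one possible shape. The class SimpleFilterChain and its addProcess method are assumptions, and process simply returns the filtered text because how the real FilterProcessManager records the count is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Same shape as the FilterProcess interface above, repeated so the sketch compiles on its own.
interface FilterProcess {
    String process(String msg);
}

// Hypothetical chain holder; the names are assumptions, not the tool's API.
final class SimpleFilterChain {
    private final List<FilterProcess> processes = new ArrayList<>();

    // Register a handler; returns this so that calls can be chained
    SimpleFilterChain addProcess(FilterProcess process) {
        processes.add(process);
        return this;
    }

    // Run the text through every handler in registration order
    String process(String msg) {
        for (FilterProcess p : processes) {
            msg = p.process(msg);
        }
        return msg;
    }

    public static void main(String[] args) {
        SimpleFilterChain chain = new SimpleFilterChain()
                // Link filter first, using the same pattern as HttpFilterProcess
                .addProcess(m -> m.replaceAll("^((https|http|ftp|rtsp|mms)?:\\/\\/)[^\\s]+", ""))
                // Whitespace filter second
                .addProcess(m -> m.replaceAll("\\s+", ""));
        System.out.println(chain.process("https://example.com some text")); // prints sometext
    }
}
```

Registration order matters in this sketch: if the whitespace were stripped first, nothing would terminate the [^\\s]+ part of the link pattern, and a line starting with a link would be removed entirely.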

And with that, a simple word-counting tool is complete.

Multithreaded mode

With only a few dozen posts on my machine, a single run is fast. But what if there were tens of thousands, hundreds of thousands, or even millions of files?

The tool would still work, but the running time would clearly grow many times over, roughly in proportion to the number of files.

This is where multithreading pays off: several threads can read files separately, and the results are then added up.

The process becomes:

Read all files in a directory.
Hand the file paths to different threads for processing.
Add up the final results.

Problems caused by multithreading

Using multiple threads does not mean the job is done. Let's look at the first problem: shared resources.

Put simply, how do we make sure the total word count is the same in multithreaded and single-threaded mode?

On my local environment, let's first look at the result of a single-threaded run:

The total is: 414,142 words.

Next, switch to a multi-threaded method:

List<String> allFile = scannerFile.getAllFile(strings[0]);
logger.info("allFile size=[{}]",allFile.size());
for (String msg : allFile) {
	executorService.execute(new ScanNumTask(msg,filterProcessManager));
}

public class ScanNumTask implements Runnable {

    private static Logger logger = LoggerFactory.getLogger(ScanNumTask.class);

    private String path;

    private FilterProcessManager filterProcessManager;

    public ScanNumTask(String path, FilterProcessManager filterProcessManager) {
        this.path = path;
        this.filterProcessManager = filterProcessManager;
    }

    @Override
    public void run() {
        // try-with-resources closes the file handle held by Files.lines,
        // and a failed read no longer leads to a NullPointerException
        try (Stream<String> stringStream = Files.lines(Paths.get(path), StandardCharsets.UTF_8)) {
            List<String> collect = stringStream.collect(Collectors.toList());
            for (String msg : collect) {
                filterProcessManager.process(msg);
            }
        } catch (Exception e) {
            logger.error("IOException", e);
        }
    }
}

A thread pool manages the threads. For more on thread pools, see: "How to Use and Understand Thread Pool Elegantly"

Run it, and no matter how many times it executes, the value comes out lower than expected.

Let's see how the counting is implemented:

@Component
public class TotalWords {
    private long sum = 0 ;

    public void sum(int count){
        sum += count;
    }

    public long total(){
        return sum;
    }
}

As you can see, it only accumulates a primitive value. So what made the total smaller than expected?

I think most people would say: with multiple threads, some threads overwrite the values computed by others.

But that is only the symptom; it still does not explain the root cause.

Memory visibility

The core reason lies in the rules of the Java Memory Model (JMM).

Here is the explanation from my earlier post "The volatile keyword you should know":

Under the Java Memory Model (JMM), all variables are stored in main memory, and each thread has its own working memory (cache).

When a thread runs, it first copies the data it needs from main memory into its working memory. Every operation on the data is then performed on the working memory (which is more efficient); a thread cannot directly operate on main memory or on another thread's working memory. Afterwards, the updated data is flushed back to main memory.

As a rough analogy, the main memory mentioned here can be thought of as heap memory, and the working memory as stack memory. This is only a simplification, not an exact mapping.

Therefore, under concurrent execution, thread B may read a value from before thread A's update.
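The stale read described here can be replayed deterministically in a single thread. The sketch below is only a simulation under that assumption: local variables stand in for each thread's working copy, no real threads are involved, and all names are hypothetical.

```java
// Hypothetical single-threaded replay of one unlucky interleaving of "sum += count".
final class LostUpdateReplay {

    static long replay(int countA, int countB) {
        long sum = 0;                 // the shared value in "main memory"

        long workingA = sum;          // thread A reads sum (0)
        long workingB = sum;          // thread B reads sum (still 0, A has not written yet)

        sum = workingA + countA;      // thread A writes its result back
        sum = workingB + countB;      // thread B writes back and overwrites A's update

        return sum;
    }

    public static void main(String[] args) {
        // The total would be 8 if both updates were counted
        System.out.println(replay(5, 3)); // prints 3
    }
}
```

Because the read, the addition, and the write are separate steps, making the field visible to all threads is not enough on its own; the three steps have to happen as one atomic unit.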

I will not go further into this here; interested readers can look through my earlier posts.

Let's go straight to the fix. The JDK has already thought about these problems for us.

The java.util.concurrent package contains many concurrency utilities you may find useful.

AtomicLong fits this case well: it updates the value atomically, so the read, add, and write of sum += count happen as one indivisible step.

Here is the modified implementation:

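import java.util.concurrent.atomic.AtomicLong;

@Component
public class TotalWords {
    private final AtomicLong sum = new AtomicLong();

    public void sum(int count){
        // Atomic read-add-write, safe to call from several threads
        sum.addAndGet(count);
    }

    public long total(){
        return sum.get();
    }
}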

Only two of its API methods are used. Run the program again, though, and the result is still wrong. It is even 0.

Inter-thread communication

So now there is a new problem. Let's look at how the total is obtained:

List<String> allFile = scannerFile.getAllFile(strings[0]);
logger.info("allFile size=[{}]",allFile.size());
for (String msg : allFile) {
	executorService.execute(new ScanNumTask(msg,filterProcessManager));
}

executorService.shutdown();
long total = totalWords.total();
long end = System.currentTimeMillis();
logger.info("total sum=[{}],[{}] ms",total,end-start);

Can you spot the problem? When the total is finally printed, there is no way to know whether the other threads have finished.

Because executorService.execute() returns immediately and shutdown() does not wait for running tasks, the worker threads have not necessarily finished when the total is read and printed, which leads to this result.

I have written about inter-thread communication before: "In-depth understanding of thread communication"

There are several ways to do it. Here we use the thread pool approach, adding a check after shutting down the thread pool:

executorService.shutdown();
while (!executorService.awaitTermination(100, TimeUnit.MILLISECONDS)) {
	logger.info("worker running");
}
long total = totalWords.total();
long end = System.currentTimeMillis();
logger.info("total sum=[{}],[{}] ms",total,end-start);

Trying again, the result is correct no matter how many times we run it.
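To see both outcomes side by side, here is a small self-contained sketch. It is not the tool's code: the class AwaitDemo, the pool size of 4, and the CountDownLatch that holds the workers back are all assumptions, added so that the "read too early" case happens on every run instead of only sometimes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo: compares the total read right after submitting with the total after waiting.
final class AwaitDemo {

    // Returns {total read too early, total read after awaitTermination}
    static long[] run(int tasks, int countPerTask) throws InterruptedException {
        AtomicLong sum = new AtomicLong();
        CountDownLatch gate = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    gate.await();                 // workers block until the gate opens
                    sum.addAndGet(countPerTask);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        long early = sum.get();                   // execute() has returned, but nothing was added yet
        gate.countDown();                         // let the workers proceed

        pool.shutdown();                          // no new tasks; queued ones still run
        while (!pool.awaitTermination(100, TimeUnit.MILLISECONDS)) {
            // workers still running, keep waiting
        }
        long late = sum.get();
        return new long[] {early, late};
    }

    public static void main(String[] args) throws InterruptedException {
        long[] totals = run(10, 7);
        System.out.println("early=" + totals[0] + ", late=" + totals[1]); // prints early=0, late=70
    }
}
```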

Efficiency improvement

Some readers may ask: this approach does not seem to improve efficiency much.

That is because I have few local files and each one takes very little time to process.

Opening too many threads can even cause frequent context switching and make execution slower.

To simulate the improvement, I make the current thread sleep for 100 milliseconds per file to stand in for the processing time.

First, the time for a single-threaded run:

Total time: [8404] ms

And here is the time with a thread pool of size 4:

Total time: [2350] ms

As you can see, the improvement is quite noticeable.
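The same experiment can be reproduced in isolation. The sketch below is an assumption-laden stand-in rather than the tool's code: the class SleepBenchmark is hypothetical and each "file" is nothing more than a sleep, so the measured times depend on the machine and will not match the figures above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical benchmark: compares sequential and pooled processing of sleeping tasks.
final class SleepBenchmark {

    // Stand-in for processing one file: it only sleeps
    static void fakeProcess(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static long sequentialMillis(int files, long perFileMillis) {
        long start = System.nanoTime();
        for (int i = 0; i < files; i++) {
            fakeProcess(perFileMillis);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    static long pooledMillis(int files, long perFileMillis, int threads) {
        long start = System.nanoTime();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < files; i++) {
            pool.execute(() -> fakeProcess(perFileMillis));
        }
        pool.shutdown();
        try {
            while (!pool.awaitTermination(100, TimeUnit.MILLISECONDS)) {
                // workers still running, keep waiting
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.println("single thread: " + sequentialMillis(20, 100) + " ms");
        System.out.println("pool of 4:     " + pooledMillis(20, 100, 4) + " ms");
    }
}
```

With tasks that only wait, the pooled time approaches the sequential time divided by the pool size. Real file processing that is limited by the CPU or the disk will usually gain less.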

Further thinking

This is just one use of multithreading. I believe readers who have made it this far now understand it a little better.

Here is an exercise to think about after reading, in a similar scenario:

Tens of millions of mobile phone numbers are stored in Redis or some other storage medium. Each number is unique, and all of them need to be traversed as fast as possible.

If you have ideas, feel free to leave a comment at the end of the article and join the discussion.

Summary

I hope readers who have finished the article now have their own answers to the questions from the beginning:

Why do you need multiple threads?
How do you implement a multithreaded program?
What problems does multithreading cause, and how are they solved?

The code from this article is available here:

https://github.com/crossoverJie/NOWS