Spring Boot API - document processing and executing a Python script on documents in parallel

Anurag:

Scenario:

  1. In my application, there are 3 processes that copy documents to their respective folders on a shared drive.
  2. As soon as a document is copied to the shared drive (by any process), the directory watcher (Java) code picks up the document and calls the Python script using "Process" to do some processing on the document. The code snippet is as follows:

    Process pr = Runtime.getRuntime().exec(pythonCommand);
    // retrieve output from python script
    BufferedReader bfr = new BufferedReader(new InputStreamReader(pr.getInputStream()));
    String line = "";
    while ((line = bfr.readLine()) != null) {
        // display each output line from python script
        logger.info(line);
    }
    pr.waitFor();
    
  3. Currently my code waits until the Python script finishes processing the document; only after that does it pick up the next document. The Python script takes about 30 seconds to complete.

  4. After processing, the document is moved from the current folder to an archive or error folder (see the sketch just after this list).
  5. A screenshot of the scenario was attached to the original question (not reproduced here).
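
For illustration, the move step mentioned in item 4 might look like the minimal sketch below; the "archive"/"error" folder names and the use of the script's exit code to pick the target folder are assumptions, not details from the original setup:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    // Hypothetical illustration of the move step: the document goes to an
    // "archive" or "error" folder (names assumed) depending on the exit code
    // returned by the Python script.
    final class DocumentMover {

        static void moveAfterProcessing(Path doc, int exitCode) throws IOException {
            Path targetFolder = (exitCode == 0) ? Paths.get("archive") : Paths.get("error");
            Files.move(doc, targetFolder.resolve(doc.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        }
    }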

What is the problem?

  1. My code processes documents sequentially, and I need to process them in parallel.
  2. As the Python code takes around 30 seconds, some of the events created by the directory watcher are also getting lost (see the sketch just after this list).
  3. If around 400 documents arrive within a short span of time, document processing stops.
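
One way to compensate for dropped watcher events is to re-scan the watched folder on startup and whenever the WatchService reports an OVERFLOW event. A minimal sketch, using a hypothetical PendingDocuments helper, could look like this:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: lists documents still sitting in a watched folder.
    // Calling this on startup and whenever the WatchService reports
    // StandardWatchEventKinds.OVERFLOW lets documents whose create events were
    // dropped be re-submitted for processing.
    final class PendingDocuments {

        static List<Path> listPending(Path folder) throws IOException {
            List<Path> pending = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder)) {
                for (Path doc : stream) {
                    if (Files.isRegularFile(doc)) {
                        pending.add(doc);
                    }
                }
            }
            return pending;
        }
    }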

What I am looking for?

  1. A design solution for processing documents in parallel.
  2. In case of any failure during document processing, pending documents must be processed automatically.
  3. I tried the Spring Boot scheduler as well, but documents are still processed sequentially.
  4. Is it possible to call the Python code in parallel as a background process?

Sorry for the long question, but I have been stuck on this for many days and have already looked at many similar questions. Thank you!

Anar Sultanov:

One option would be to use an ExecutorService provided by the JDK, which can execute Runnable and Callable tasks. You will need to create a class that implements Runnable and executes your Python script; after receiving a new document, you create a new instance of this class and submit it to the ExecutorService.

To show how this works, we will use a simple Python script that takes a thread name as an argument, prints the start time of its execution, sleeps 10 seconds and prints the end time:

import time
import sys

print "%s start : %s" % (sys.argv[1], time.ctime())
time.sleep(10)
print "%s end : %s" % (sys.argv[1], time.ctime())

First, we implement the class that runs the script and passes it the name obtained in the constructor:

class ScriptRunner implements Runnable {

    private final String thread;

    ScriptRunner(String thread) {
        this.thread = thread;
    }

    @Override
    public void run() {
        try {
            // "py" is the Windows Python launcher; use "python" or "python3" on other setups
            ProcessBuilder ps = new ProcessBuilder("py", "test.py", thread);
            ps.redirectErrorStream(true); // merge stderr into stdout so the reader below sees all output
            Process pr = ps.start();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(pr.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            pr.waitFor();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Then we create a main method that builds an ExecutorService with a fixed pool of 5 threads and submits 10 instances of ScriptRunner to it at 1-second intervals:

public static void main(String[] args) throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(5);
    for (int i = 1; i <= 10; i++) {
        executor.submit(new ScriptRunner("Thread_" + i));
        Thread.sleep(1000);
    }
    executor.shutdown();
}

If we run this method, we will see that, due to the specified limit, the service runs at most 5 tasks in parallel, while the rest are queued and start as threads are freed:

Thread_1 start : Sat Nov 23 11:40:14 2019
Thread_1 end : Sat Nov 23 11:40:24 2019    // the first task is completed..
Thread_2 start : Sat Nov 23 11:40:15 2019
...
Thread_5 end : Sat Nov 23 11:40:28 2019
Thread_6 start : Sat Nov 23 11:40:24 2019  // ..and the sixth is started
...
Thread_10 end : Sat Nov 23 11:40:38 2019
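
To tie this back to the directory watcher in the question, a minimal sketch of the wiring could look like the code below. It assumes a hypothetical watched folder path and that ScriptRunner is adapted so that its constructor argument is the document path handed to the real Python script rather than the demo thread name. Each create event only submits a task and returns immediately, so the watcher thread never blocks for the ~30 seconds the script needs:

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DocumentWatcher {

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(5);  // tune to how many scripts may run at once
        Path folder = Paths.get("/shared-drive/incoming");           // hypothetical watched folder

        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            folder.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take();                       // blocks until events arrive
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue;                                    // events were dropped; re-scan the folder here
                    }
                    Path doc = folder.resolve((Path) event.context());
                    // assumes ScriptRunner passes this path to the real Python script
                    executor.submit(new ScriptRunner(doc.toString()));
                }
                if (!key.reset()) {
                    break;                                           // folder is no longer accessible
                }
            }
        } finally {
            executor.shutdown();
        }
    }
}

Once pr.waitFor() returns inside ScriptRunner, the document can then be moved to the archive or error folder depending on the exit code, as described in the question.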
