Java theory and practice: The fork-join framework (from IBM)

Reprinted from: Java theory and practice: The fork-join framework

Hardware trends drive programming languages

Languages, libraries, and frameworks shape the way we write programs. Even though Alonzo Church showed as early as 1934 that all of the known computational frameworks were equivalent in the set of programs they could represent, the set of programs that real programmers actually write is shaped by the particular language they use, and by the idioms that the programming model (driven by languages, libraries, and frameworks) makes easy to express.

On the other hand, the mainstream hardware platforms of an era shape the way we design languages, libraries, and frameworks. The Java language has supported threads and concurrency from the outset; the language includes synchronization primitives such as synchronized and volatile, and the class library includes classes such as Thread. However, the concurrency primitives that were sensible in 1995 reflected the hardware reality of the time: most commercially available systems provided no parallelism at all, and even the most expensive systems provided only limited parallelism. In those days, threads were used primarily to express asynchrony rather than concurrency, and these mechanisms were adequate to the demands of the time.

As multiprocessor systems became cheaper, more applications needed to exploit the hardware parallelism they provided, and programmers found that writing concurrent programs with the low-level primitives offered by the Java language and class library was difficult and error-prone. In Java 5, the java.util.concurrent package was added to the Java platform, providing a set of components for building concurrent applications: concurrent collections, queues, semaphores, latches, thread pools, and so on. These mechanisms are well suited to programs with reasonably coarse task granularity; an application need only partition its work so that the number of concurrent tasks is not persistently smaller than the number of available processors. Using the processing of a single request as the unit of work in a web server, mail server, or database server, applications generally meet this requirement, and these mechanisms are enough to keep the parallel hardware busy.
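
For instance, a coarse-grained server built on those Java 5 components might dispatch one task per request, as in the following minimal sketch (the port number and the handleRequest() helper are illustrative placeholders, not part of the original article):

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RequestServer {
    public static void main(String[] args) throws IOException {
        // One coarse-grained task per request; the pool size bounds concurrency.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        ServerSocket server = new ServerSocket(8080);
        while (true) {
            final Socket socket = server.accept();
            pool.execute(new Runnable() {
                public void run() { handleRequest(socket); }
            });
        }
    }

    // Hypothetical handler: parse the request, do the work, send the response.
    private static void handleRequest(Socket socket) { /* elided */ }
}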

As technology continues to evolve, the hardware trend is clear: Moore's Law will no longer deliver higher clock rates, but will instead deliver more cores per chip. It is easy to imagine keeping a dozen processors busy with coarse-grained tasks such as user requests, but this technique will not scale to thousands of processors; traffic may grow exponentially for short periods, but eventually the hardware trend wins out. As we enter the many-core era, we need to find finer-grained parallelism, or we risk leaving processors idle even when there is plenty of work to do. If the software platform is to keep pace with technological development, it must track the changing mainstream hardware platform. To that end, Java 7 will include a framework for representing a certain class of finer-grained parallel algorithms: the fork-join framework.

Exposing finer-grained parallelism

Today, most server applications use the processing of a single request-response as their unit of work, and typically run many more concurrent threads, or requests, than there are processors available. The reason is that in most server applications, processing a request involves a good deal of I/O that does not occupy the processor very much (every web server application handles a lot of socket I/O, since requests arrive over sockets; many also handle a fair amount of disk or database I/O). If each task spends 90 percent of its time waiting for I/O to complete, you need ten times as many concurrent tasks as processors to keep all the processors utilized. As processor counts grow, there may not be enough concurrent requests to keep all the processors busy. However, parallelism can still be used to improve another measure of performance: the time the user has to wait for a response.
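
The sizing rule implied by that example can be made concrete. A minimal sketch, assuming the wait-to-compute ratio of a task is known:

public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        double waitTime = 90, computeTime = 10; // 90% of each task is I/O wait
        // Tasks needed to keep every core busy: cores * (1 + wait/compute).
        int tasks = (int) (cores * (1 + waitTime / computeTime));
        System.out.println(tasks + " concurrent tasks for " + cores + " cores");
    }
}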

As an example of a typical network server application, consider a database server. When a request arrives, it goes through a series of processing steps. First, the SQL statement is parsed and validated. Then a query plan must be selected; for complex queries, the database server may evaluate many different candidate plans in order to minimize the expected number of I/O operations. Searching for a query plan is a CPU-intensive task; at some point, considering more candidate plans reaches diminishing returns, but evaluating too few candidate plans will almost certainly require more I/O operations than necessary. After the data has been retrieved from disk, more processing may be required on the resulting data set: the query may contain aggregate operations such as SUM or AVERAGE, or may require the data set to be sorted. Finally, the result must be encoded and returned to the requester.

Like most server requests, processing an SQL query involves a mixture of computation and I/O. Additional CPU power cannot reduce the time an I/O operation takes to complete (although additional memory could reduce the number of I/O operations by caching the results of earlier ones), but the CPU-intensive parts of request processing, such as plan evaluation and sorting, can be sped up through parallelization. In evaluating candidate query plans, different plans could be evaluated in parallel; in sorting the data set, a large data set could be broken into smaller ones, sorted separately, and then merged. Doing so improves performance as the user perceives it, because the result arrives sooner (even if servicing the request may take more total work).

Divide and conquer

Merge sort is an example of a divide-and-conquer algorithm, in which a problem is recursively broken down into subproblems and the solutions to the subproblems are combined into the final result. Divide-and-conquer algorithms are also useful in sequential environments, but they become even more effective in parallel environments, because the subproblems can be solved concurrently.

A typical parallel divide-and-conquer algorithm takes the form shown in Listing 1:

Listing 1. Common parallel divide-and-conquer algorithm pseudocode.
// PSEUDOCODE
Result solve(Problem problem) { 
    if (problem.size < SEQUENTIAL_THRESHOLD)
        return solveSequentially(problem);
    else {
        Result left, right;
        INVOKE-IN-PARALLEL { 
            left = solve(extractLeftHalf(problem));
            right = solve(extractRightHalf(problem));
        }
        return combine(left, right);
    }
}

A parallel divide-and-conquer algorithm first evaluates the problem to determine whether its size makes it better suited to a sequential solution; typically this is done by comparing the problem size against some threshold. If the problem is large enough to merit parallel decomposition, the algorithm breaks it into two or more subproblems, invokes itself recursively on the subproblems in parallel, waits for the subproblem results, and finally combines them. The ideal threshold for choosing between sequential and parallel execution is a function of the cost of coordinating the parallel tasks. If coordination cost were zero, ever finer-grained tasks would offer better parallelism; the lower the coordination cost, the finer-grained we can make the tasks before we must switch to the sequential approach.

Fork-join

The INVOKE-IN-PARALLEL operation used in Listing 1 does not actually exist; its behavior would be to suspend the current task, execute the two subtasks in parallel, and have the current task wait until both subtasks complete. The results of the two subtasks can then be merged. This kind of parallel decomposition is often called fork-join, because executing a task first forks (splits off) multiple subtasks and then joins with them (waits for their completion).

Listing 2 shows an example of a problem that lends itself to a fork-join solution: searching a large array for its maximal element. Although this example is trivially simple, the fork-join technique is suitable for a wide range of searching, sorting, and data analysis problems.

Listing 2. Select the largest element from a large array
public class SelectMaxProblem {
    private final int[] numbers;
    private final int start;
    private final int end;
    public final int size;

    // constructors elided 
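    // A plausible reconstruction of the elided constructor (not from the
    // original article): store the half-open range and compute its size.
    public SelectMaxProblem(int[] numbers, int start, int end) {
        this.numbers = numbers;
        this.start = start;
        this.end = end;
        this.size = end - start;
    }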

    public int solveSequentially() {
        int max = Integer.MIN_VALUE;
        for (int i=start; i<end; i++) {
            int n = numbers[i];
            if (n > max)
                max = n;
        }
        return max;
    }

    public SelectMaxProblem subproblem(int subStart, int subEnd) {
        return new SelectMaxProblem(numbers, start + subStart, 
                                    start + subEnd);
    }
}

Note that the subproblem() method does not copy the elements; it merely copies the array reference and offsets into an existing data structure. This is typical of fork-join problem implementations, because recursively decomposing the problem creates a large number of new Problem objects. In this case, the search task does not modify the data structure being searched, so there is no need to maintain a private copy of the underlying data set for each task, and no extra overhead is incurred by copying.

Listing 3 illustrates a solution to SelectMaxProblem using the fork-join package that is planned for inclusion in Java 7. The package was developed in the open by the JSR 166 Expert Group under the code name jsr166y; you can download it separately and use it with Java 6 or later (it will eventually live in the package java.util.concurrent.forkjoin). The invoke-in-parallel operation is implemented by the coInvoke() method, which invokes multiple actions simultaneously and waits for them all to complete. A ForkJoinExecutor is like an Executor in that it is designed for running tasks, but it is intended specifically for compute-intensive tasks that never block, except to wait for another task being processed by the same ForkJoinExecutor.

The fork-join framework supports several styles of ForkJoinTasks, including those that require explicit completion and those that execute cyclically. The RecursiveAction class used here directly supports the style of parallel recursive decomposition for tasks that do not return results; the RecursiveTask class addresses the same problem for result-bearing tasks. (Other fork-join task classes include CyclicAction, AsyncAction, and LinkedAsyncAction; see the Javadoc for details on how to use them.)

Listing 3. Using the fork-join framework to solve the select-max problem
public class MaxWithFJ extends RecursiveAction {
    private final int threshold;
    private final SelectMaxProblem problem;
    public int result;

    public MaxWithFJ(SelectMaxProblem problem, int threshold) {
        this.problem = problem;
        this.threshold = threshold;
    }

    protected void compute() {
        if (problem.size < threshold)
            result = problem.solveSequentially();
        else {
            int midpoint = problem.size / 2;
            MaxWithFJ left = new MaxWithFJ(problem.subproblem(0, midpoint), threshold);
            // right half starts at midpoint: subproblem ranges are half-open
            MaxWithFJ right = new MaxWithFJ(problem.subproblem(midpoint, problem.size), threshold);
            coInvoke(left, right);
            result = Math.max(left.result, right.result);
        }
    }

    public static void main(String[] args) {
        SelectMaxProblem problem = ...
        int threshold = ...
        int nThreads = ...
        MaxWithFJ mfj = new MaxWithFJ(problem, threshold);
        ForkJoinExecutor fjPool = new ForkJoinPool(nThreads);

        fjPool.invoke(mfj);
        int result = mfj.result;
    }
}
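
For readers on JDK 7 or later: the published jsr166y names differ from the final API. A roughly equivalent version using the java.util.concurrent classes that ultimately shipped might look like the following sketch (a RecursiveTask carries the result instead of a public field; the threshold value and the random test data are illustrative choices, not from the article):

import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class MaxTask extends RecursiveTask<Integer> {
    private static final int THRESHOLD = 5000;
    private final int[] numbers;
    private final int start, end; // half-open range [start, end)

    public MaxTask(int[] numbers, int start, int end) {
        this.numbers = numbers;
        this.start = start;
        this.end = end;
    }

    protected Integer compute() {
        if (end - start < THRESHOLD) {
            int max = Integer.MIN_VALUE;
            for (int i = start; i < end; i++)
                max = Math.max(max, numbers[i]);
            return max;
        }
        int mid = (start + end) >>> 1;
        MaxTask left = new MaxTask(numbers, start, mid);
        MaxTask right = new MaxTask(numbers, mid, end);
        left.fork();                        // schedule the left half asynchronously
        int rightResult = right.compute();  // work on the right half in this thread
        return Math.max(left.join(), rightResult);
    }

    public static void main(String[] args) {
        int[] data = new int[500000];
        Random random = new Random();
        for (int i = 0; i < data.length; i++)
            data[i] = random.nextInt();
        int max = new ForkJoinPool().invoke(new MaxTask(data, 0, data.length));
        System.out.println("max = " + max);
    }
}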

Table 1 shows some results of selecting the maximal element of a 500,000-element array on various systems, varying the threshold below which the sequential method is preferred to the parallel method. For most runs, the number of threads in the fork-join pool was equal to the number of hardware threads available (cores times threads per core). The numbers are presented as a speedup relative to the sequential method on that system.

Table 1. Results of running select-max on a 500k-element array on various systems
                              Threshold=500k  Threshold=50k  Threshold=5k  Threshold=500  Threshold=50
Pentium-4 HT (2 threads)      1.0             1.07           1.02          .82            .2
Dual-Xeon HT (4 threads)      .88             3.02           3.2           2.22           .43
8-way Opteron (8 threads)     1.0             5.29           5.73          4.53           2.03
8-core Niagara (32 threads)   .98             10.46          17.21         15.34          6.49

The results are encouraging in that they show good speedups across a wide range of parameters, so as long as you avoid choosing utterly unreasonable parameters for the problem and the underlying hardware, you are likely to get good results. With chip multithreading (CMT), the ideal speedup is noticeably less than the number of hardware threads; CMT approaches such as Hyperthreading deliver less performance than the same number of actual cores would, and the shortfall depends on many factors, including the cache-miss rate of the code being executed.

The sequential thresholds chosen here range from 500K (the size of the array, meaning effectively no parallelism) down to 50. A threshold of 50 is, in this case, unrealistically small, and the results show that when the sequential threshold is too low, the overhead of fork-join task management dominates. But Table 1 also shows that as long as you stay away from the unreasonably high and low extremes, you get good results. Choosing Runtime.availableProcessors() as the number of worker threads generally yields results close to ideal, because the tasks executed in a fork-join pool are assumed to be CPU-bound, but again, the results are fairly insensitive to this parameter as long as the pool is not made far too large or far too small.
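
In terms of Listing 3's main(), that default choice amounts to the following (a two-line sketch; note that the actual library call is Runtime.getRuntime().availableProcessors()):

int nThreads = Runtime.getRuntime().availableProcessors();
ForkJoinExecutor fjPool = new ForkJoinPool(nThreads);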

No explicit synchronization is required in the MaxWithFJ class. The data it operates on is constant for the lifetime of the problem, and there is sufficient internal synchronization within the ForkJoinExecutor to guarantee visibility of the problem data to the subtasks, as well as visibility of the subtask results to the task that joins with them.

Anatomy of the fork-join framework

There are several ways the fork-join framework demonstrated in Listing 3 could be implemented. You could use raw threads; Thread.start() and Thread.join() provide all the necessary functionality. However, that approach may require more threads than the VM can support. For a problem of size N (assuming a small sequential threshold), O(N) threads would be required (the problem tree has depth log2(N), and a binary tree of depth k has 2^k nodes). Of these threads, nearly half would spend their entire lifetime waiting for subtasks to complete. Threads consume a lot of memory, which makes this approach self-limiting (it would work, but the code would be complicated and would require careful tuning for the problem size and hardware parameters).
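
To make that concrete, a raw-thread version of the select-max decomposition might look like the following sketch (illustrative only; it deliberately exhibits the thread-per-split explosion described above):

// Illustrative sketch: every split forks two new threads, so with a small
// sequential threshold a problem of size N needs O(N) threads, half of them
// spending their lives blocked in join().
public class MaxWithRawThreads {
    static int solve(final int[] a, final int start, final int end,
                     final int threshold) throws InterruptedException {
        if (end - start < threshold) {
            int max = Integer.MIN_VALUE;
            for (int i = start; i < end; i++)
                max = Math.max(max, a[i]);
            return max;
        }
        final int mid = (start + end) >>> 1;
        final int[] results = new int[2];
        Thread left = new Thread() {
            public void run() {
                try { results[0] = solve(a, start, mid, threshold); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
        };
        Thread right = new Thread() {
            public void run() {
                try { results[1] = solve(a, mid, end, threshold); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
        };
        left.start();
        right.start();
        left.join();   // the parent spends its time blocked here...
        right.join();  // ...while the subtask threads do the actual work
        return Math.max(results[0], results[1]);
    }
}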

Implementing fork-join with a conventional thread pool is also challenging, because fork-join tasks spend much of their lives waiting for other tasks. This behavior is a recipe for thread starvation deadlock, unless the parameters are chosen carefully to bound the number of tasks created, or the pool itself is unbounded. Conventional thread pools are designed for tasks that are independent of one another, and they are also designed with potentially blocking, coarse-grained tasks in mind; fork-join solutions produce neither. Fine-grained tasks in a conventional thread pool can also generate excessive contention for the shared work queue.
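
A sketch of that failure mode, assuming a bounded conventional pool (illustrative only; with a two-thread pool, both threads end up blocked in get() waiting for subtasks that can never be scheduled, so the program hangs):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Thread starvation deadlock: every pooled thread blocks in get() waiting
// for subtasks that can never run, because all the pool's threads are
// already occupied by the parent tasks. This program never terminates.
public class StarvationDeadlock {
    static final ExecutorService pool = Executors.newFixedThreadPool(2);

    static int solve(final int lo, final int hi) throws Exception {
        if (hi - lo <= 1)
            return lo;
        final int mid = (lo + hi) / 2;
        Future<Integer> left = pool.submit(new Callable<Integer>() {
            public Integer call() throws Exception { return solve(lo, mid); }
        });
        Future<Integer> right = pool.submit(new Callable<Integer>() {
            public Integer call() throws Exception { return solve(mid, hi); }
        });
        return Math.max(left.get(), right.get()); // blocks a pool thread
    }

    public static void main(String[] args) throws Exception {
        Future<Integer> root = pool.submit(new Callable<Integer>() {
            public Integer call() throws Exception { return solve(0, 100); }
        });
        System.out.println(root.get()); // never reached: the pool deadlocks
    }
}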

Work stealing

The fork-join framework reduces contention for the work queue by using a technique known as work stealing. Each worker thread has its own work queue, implemented as a double-ended queue, or deque. (Java 6 added several deque implementations to the class library, including ArrayDeque and LinkedBlockingDeque.) When a task forks a new subtask, the subtask is pushed onto the head of the worker's own deque. When a task executes a join with another task that has not yet completed, rather than sleeping until the target task finishes (as Thread.join() would), the worker pops another task off the head of its deque and executes it. When a thread's own task queue is empty, it tries to steal a task from the tail of another thread's deque.

Work stealing could be implemented with standard queues, but a deque has two advantages over a standard queue: reduced contention and reduced stealing. Because only the owning worker thread accesses the head of its own deque, there is never contention for the head; and because the tail of a deque is accessed only when a thread runs out of work, there is rarely contention for the tail either (the deque implementation in the fork-join framework exploits these access patterns to further reduce coordination costs). Compared with a traditional thread-pool approach, this reduction in contention greatly reduces synchronization costs. In addition, the implicitly LIFO ordering of task queuing means that the largest tasks sit at the tail of the queue, so when another thread needs to steal work, it gets a large task that it can decompose into multiple smaller tasks, avoiding the need to steal again soon. Work stealing therefore achieves reasonable load balancing with no central coordination and minimal synchronization costs.
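
A drastically simplified worker loop conveys the discipline (a toy sketch using a synchronized ArrayDeque; the real framework uses carefully tuned lock-free deques and parks idle workers instead of spinning):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// Toy illustration of the work-stealing discipline described above.
class StealingWorker {
    final Deque<Runnable> deque = new ArrayDeque<Runnable>();
    final StealingWorker[] pool;  // all workers in the pool
    final Random random = new Random();

    StealingWorker(StealingWorker[] pool) { this.pool = pool; }

    void push(Runnable task) {
        synchronized (deque) { deque.addFirst(task); } // forked tasks go on the head
    }

    void runLoop() {
        while (true) {
            Runnable task;
            synchronized (deque) { task = deque.pollFirst(); } // LIFO from own head
            if (task == null)
                task = steal(); // own queue empty: take from a victim's tail
            if (task != null)
                task.run();
            // (a real pool would park this worker here rather than spin)
        }
    }

    Runnable steal() {
        StealingWorker victim = pool[random.nextInt(pool.length)];
        if (victim == this) return null;
        synchronized (victim.deque) { return victim.deque.pollLast(); } // FIFO from tail
    }
}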

Conclusion

The fork-join approach offers a portable way to express parallelizable algorithms without knowing in advance how much parallelism the target system will provide. All sorts of sorting, searching, and numerical algorithms are amenable to parallel decomposition. (In the future, standard library mechanisms such as Arrays.sort() may use the fork-join framework, allowing applications to reap the benefits of parallel decomposition for free.) As processor counts grow, we will need to expose more parallelism within our programs to use those processors effectively; the parallel decomposition of compute-intensive operations such as sorting makes it easier for programs to take advantage of tomorrow's hardware.


Origin: blog.csdn.net/x_i_y_u_e/article/details/52487100