Implementing a parallel merge sort algorithm with C++/SYCL in oneAPI

Note: This task is a collaboration between the C language course and Intel. By writing your own parallel merge sort algorithm, you can understand how data splitting and merging, as well as cooperation between threads, affect operating efficiency.
Please credit the source when reposting.

1 Problem description

1.1 Description

Implement an efficient parallel merge sort using C++/SYCL on oneAPI. The splitting and merging of data, as well as the cooperation between threads, must be considered.

1.2 Analysis & Examples

Merge sort is a divide-and-conquer algorithm. Its basic principle is to split the array to be sorted into two halves, sort each half separately, and then merge the sorted subarrays into one ordered array. We can take advantage of heterogeneous parallel computing by assigning the sorting and merging operations to multiple threads that execute simultaneously, improving sorting efficiency. The implementation proceeds as follows:

  1. Split the array to be sorted into multiple smaller sub-arrays and assign these sub-arrays to different thread blocks for processing.

  2. Threads within each thread block cooperate to complete the local sorting of the subarray.

  3. Through multiple iterations, adjacent ordered subarrays are continuously merged until the entire array is ordered.

In actual implementation, merge sort can use shared memory to speed up the sorting process. Specifically, shared memory can be used to store temporary data, reducing the number of accesses to global memory, thereby improving sorting efficiency. In addition, during the merge operation, a synchronization mechanism needs to be considered to ensure data consistency between multiple threads.

Note that in practice, factors such as array size, thread-block size, and data-access patterns must be considered when designing the algorithm and choosing parameters, so as to fully utilize the parallel computing capability of the target GPU and improve sorting efficiency and performance.
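The three-step scheme above can be sketched in plain standard C++ (no SYCL yet) as a bottom-up merge sort: every pass merges adjacent sorted runs of doubled width, and in a parallel version all merges within one pass are independent and could be assigned to separate thread blocks. `bottomUpMergeSort` is an illustrative name, not part of the assignment code.

```cpp
#include <algorithm>
#include <vector>

// Bottom-up merge sort: each pass merges adjacent sorted runs of length
// `width` into runs of length 2*width. All merges within one pass touch
// disjoint ranges, so a parallel version can run them simultaneously.
void bottomUpMergeSort(std::vector<float>& a) {
    size_t n = a.size();
    for (size_t width = 1; width < n; width *= 2) {
        for (size_t left = 0; left + width < n; left += 2 * width) {
            size_t middle = left + width;
            size_t right = std::min(left + 2 * width, n);
            // Merge the adjacent sorted runs [left, middle) and [middle, right)
            std::inplace_merge(a.begin() + left, a.begin() + middle,
                               a.begin() + right);
        }
    }
}
```

Each outer iteration corresponds to one of the "multiple iterations" in step 3; only the passes themselves must run in order.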

2 Code implementation

2.1 Device selection

Here we can choose among several devices on which to run the sorting algorithm.

import ipywidgets as widgets
device = widgets.RadioButtons(
    options=['GPU Gen9', 'GPU Iris XE Max', 'CPU Xeon 6128', 'CPU Xeon 8153'],
    value='CPU Xeon 6128',    
    description='Device:',
    disabled=False
)
display(device)

2.2 Code implementation

2.2.1 Description of the merge sort algorithm

Merge sort is a sorting method built on the idea of merging. The algorithm uses the classic divide-and-conquer strategy: decompose the problem into several smaller subproblems, solve the subproblems recursively, and finally combine the solutions of the subproblems to obtain the solution of the original problem.

The basic idea of merge sort is to repeatedly split the sequence to be sorted into subsequences until each subsequence has only one element. Then these ordered subsequences are merged pairwise to obtain larger ordered subsequences. This process is repeated until the entire sequence is ordered.

The specific implementation steps of merge sort are as follows:

  1. Split the sequence to be sorted into two subsequences until each subsequence has only one element.
  2. Merge two ordered subsequences into one ordered subsequence.
  3. Repeat step 2 until the entire sequence is ordered.
2.2.2 Implementation of the basic merge sort algorithm
  1. Merge sort principle:

    • Merge sort is a divide-and-conquer algorithm that splits an array into two halves, recursively applies merge sort to each half, and then merges the two sorted halves into a complete sorted array.
  2. Merge function (merge):

    • Input parameters: the array arr, the left boundary left, the midpoint middle, and the right boundary right.
    • Function: merge two sorted subarrays (arr[left..middle] and arr[middle+1..right]) into one ordered array.
    • Implementation: create two temporary arrays L and R, copy the left and right halves into them, and then merge them back in order into the original array arr.
  3. Merge sort function (mergeSort):

    • Input parameters: the SYCL queue q, the array arr, the left boundary left, and the right boundary right.
    • Function: recursively split the array, then sort and merge each subarray.
    • Implementation: calls itself recursively to sort the left and right halves of the array, then calls the merge function to combine them into one ordered array.
  4. Main function (main):

    • Function: read the data, initialize the SYCL device and queue, perform the sort, and print the sorted array.
    • Implementation: read floating-point data from the file problem-2.txt into arr, select a SYCL device (GPU or CPU), create a queue, call mergeSort to sort, and finally output the sorted result.

Without considering a multi-thread synchronization mechanism, a basic merge sort can be written as follows:

%%writefile lab/my_sort.cpp

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <sstream>

using namespace sycl;

// Merge function to merge two sorted arrays
void merge(std::vector<float>& arr, size_t left, size_t middle, size_t right) {
    size_t i, j, k;
    size_t n1 = middle - left + 1;
    size_t n2 = right - middle;

    // Create temporary arrays
    std::vector<float> L(n1), R(n2);

    // Copy data to temporary arrays L[] and R[]
    for (i = 0; i < n1; i++)
        L[i] = arr[left + i];
    for (j = 0; j < n2; j++)
        R[j] = arr[middle + 1 + j];

    // Merge the temporary arrays back into arr[left..right]
    i = 0;    // Initial index of first subarray
    j = 0;    // Initial index of second subarray
    k = left; // Initial index of merged subarray
    while (i < n1 && j < n2) {
        if (L[i] <= R[j]) {
            arr[k] = L[i];
            i++;
        } else {
            arr[k] = R[j];
            j++;
        }
        k++;
    }

    // Copy the remaining elements of L[], if there are any
    while (i < n1) {
        arr[k] = L[i];
        i++;
        k++;
    }

    // Copy the remaining elements of R[], if there are any
    while (j < n2) {
        arr[k] = R[j];
        j++;
        k++;
    }
}

// Recursive function for merge sort
void mergeSort(queue& q, std::vector<float>& arr, size_t left, size_t right) {
    if (left < right) {
        size_t middle = left + (right - left) / 2;

        // Recursively sort first and second halves
        mergeSort(q, arr, left, middle);
        mergeSort(q, arr, middle + 1, right);

        // Merge the sorted halves
        merge(arr, left, middle, right);
    }
}

int main() {
    std::vector<float> arr;
    std::ifstream file("problem-2.txt");
    std::string line;

    if (file.is_open()) {
        getline(file, line);
        file.close();
    } else {
        std::cerr << "Unable to open file" << std::endl;
        return 1;
    }

    std::istringstream iss(line);
    float number;
    while (iss >> number) {
        arr.push_back(number);
    }

    // Choose the device
    device selected_device;

    try {
        // Try to select a GPU device
        selected_device = gpu_selector{}.select_device();
        std::cout << "Using GPU." << std::endl;
    } catch (const sycl::exception& e) {
        // If GPU selection fails, fall back to the CPU
        std::cerr << "GPU not available. Using CPU instead." << std::endl;
        selected_device = cpu_selector{}.select_device();
        std::cout << "Using CPU." << std::endl;
    }

    // Create a SYCL queue
    queue q(selected_device);

    // Print the unsorted array
    std::cout << "Unsorted array:" << std::endl;
    for (size_t i = 0; i < arr.size(); ++i) {
        std::cout << arr[i] << " ";
    }
    std::cout << std::endl;

    // Call merge sort
    mergeSort(q, arr, 0, arr.size() - 1);

    // Print the sorted array
    std::cout << "Sorted array:" << std::endl;
    for (size_t i = 0; i < arr.size(); ++i) {
        std::cout << arr[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
2.2.3 Implementation of the parallel merge sort algorithm

Although the above code implements the basic merge sort algorithm, it does not exploit parallelism to use resources effectively. Next, I optimize the code. The following code implements a parallel merge sort in C++/SYCL that takes advantage of heterogeneous parallel computing, especially parallel processing on GPUs.

  1. Split array and assign to thread block

In this code, the division of the array into smaller subarrays and their assignment to different thread blocks is not directly visible, because parallelism is applied at a finer granularity. In the mergeSort_parallel function, the array is split implicitly through recursion: each recursive call processes half of the array until each subarray is small enough to be handled by a single thread or thread block. This recursive splitting is the classic structure of merge sort.

  2. Cooperation of threads within a thread block

The key parallel part of this code is the merge_parallel function. It uses SYCL's parallel_for to create parallel execution units, which together merge two sorted subarrays (the left and right halves).

  • L_buf and R_buf hold the left and right subarrays respectively; they are loaded into SYCL buffers so the data is available on the device.
  • parallel_for creates enough execution units via range<1>(right - left + 1); each unit is responsible for a part of the merge operation.
  • The parallelism here lies in handling multiple elements at the same time, rather than traditional element-by-element processing.
  3. Merge ordered subarrays

The key step in merge sort is merging the ordered subarrays. In the merge_parallel function, this step is parallelized with parallel_for: each execution unit independently reads elements from the two subarrays, compares them, and writes them into their positions in the final array. Different execution units may process adjacent data segments, but due to the structure of merge sort their work does not conflict.

  4. Shared memory usage and inter-thread synchronization

This code does not directly use CUDA-style shared memory. SYCL's buffer and accessor abstractions may use similar mechanisms under the hood to optimize memory access, but this is transparent to the programmer. Here, the main performance consideration is keeping memory access as efficient as possible, which is usually handled automatically by the SYCL runtime and hardware drivers.

As for synchronization, since each merge operation is independent, the need for direct synchronization between threads is minimized. At the end of the merge_parallel function, q.wait() ensures that the program does not continue until all parallel operations have completed; this is the key to guaranteeing data consistency.
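The role of q.wait() as a synchronization point can be illustrated with plain standard C++ threads, independent of SYCL: the two halves are sorted concurrently, and the merge must wait for both tasks to finish before it may start. `twoWayParallelSort` is an illustrative sketch, not the assignment's SYCL code.

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Sort the two halves of `a` concurrently, then join before merging.
// The join plays the same role as q.wait() in the SYCL version: no merge
// may begin until the work it depends on has completed.
void twoWayParallelSort(std::vector<float>& a) {
    size_t mid = a.size() / 2;
    auto leftDone = std::async(std::launch::async, [&] {
        std::sort(a.begin(), a.begin() + mid);   // left half, worker thread
    });
    std::sort(a.begin() + mid, a.end());         // right half, this thread
    leftDone.wait();                             // synchronization point
    std::inplace_merge(a.begin(), a.begin() + mid, a.end());
}
```

The two std::sort calls touch disjoint ranges, so they need no locking; only the merge requires the explicit wait, mirroring the data-consistency argument above.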

Next, the final code is shown in sections:

  1. Import libraries
%%writefile lab/my_sort.cpp

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <sstream>
#include <algorithm>

using namespace sycl;
  2. Parallel merge function
void merge_parallel(queue& q, std::vector<float>& arr, size_t left, size_t middle, size_t right) {
    size_t n1 = middle - left + 1;
    size_t n2 = right - middle;

    // Copy the two sorted halves into separate host arrays so that the
    // input buffers do not alias the output buffer
    std::vector<float> L_host(arr.begin() + left, arr.begin() + middle + 1);
    std::vector<float> R_host(arr.begin() + middle + 1, arr.begin() + right + 1);

    {
        buffer<float, 1> L_buf(L_host.data(), range<1>(n1));
        buffer<float, 1> R_buf(R_host.data(), range<1>(n2));
        buffer<float, 1> arr_buf(arr.data(), range<1>(arr.size()));

        // Perform the parallel merge: each work-item places exactly one
        // element at its final position, found by binary search in the
        // opposite subarray (a rank-based merge), so no two work-items
        // ever write to the same location
        q.submit([&](handler& h) {
            auto L = L_buf.get_access<access::mode::read>(h);
            auto R = R_buf.get_access<access::mode::read>(h);
            auto A = arr_buf.get_access<access::mode::write>(h);

            h.parallel_for(range<1>(right - left + 1), [=](id<1> idx) {
                size_t k = idx[0];
                if (k < n1) {
                    // Element from the left half: count R elements strictly
                    // smaller than it (keeps the sort stable)
                    float v = L[k];
                    size_t lo = 0, hi = n2;
                    while (lo < hi) {
                        size_t mid = lo + (hi - lo) / 2;
                        if (R[mid] < v) lo = mid + 1; else hi = mid;
                    }
                    A[left + k + lo] = v;
                } else {
                    // Element from the right half: count L elements less
                    // than or equal to it
                    float v = R[k - n1];
                    size_t lo = 0, hi = n1;
                    while (lo < hi) {
                        size_t mid = lo + (hi - lo) / 2;
                        if (L[mid] <= v) lo = mid + 1; else hi = mid;
                    }
                    A[left + (k - n1) + lo] = v;
                }
            });
        });
        q.wait();
    } // buffers go out of scope here, writing the results back to arr
}
  3. Parallel merge sort function
void mergeSort_parallel(queue& q, std::vector<float>& arr, size_t left, size_t right) {
    if (left < right) {
        size_t middle = left + (right - left) / 2;

        // Recursively sort the two halves
        mergeSort_parallel(q, arr, left, middle);
        mergeSort_parallel(q, arr, middle + 1, right);

        // Merge the sorted halves on the device
        merge_parallel(q, arr, left, middle, right);
    }
}
  4. Main function
int main() {
    std::vector<float> arr;
    std::ifstream file("problem-2.txt");
    std::string line;

    if (file.is_open()) {
        getline(file, line);
        file.close();
    } else {
        std::cerr << "Unable to open file" << std::endl;
        return 1;
    }

    std::istringstream iss(line);
    float number;
    while (iss >> number) {
        arr.push_back(number);
    }

    // Choose the device
    device selected_device;

    try {
        selected_device = gpu_selector{}.select_device();
        std::cout << "Using GPU." << std::endl;
    } catch (const sycl::exception& e) {
        std::cerr << "GPU not available. Using CPU instead." << std::endl;
        selected_device = cpu_selector{}.select_device();
        std::cout << "Using CPU." << std::endl;
    }

    // Create a SYCL queue
    queue q(selected_device);

    // Print the unsorted array
    std::cout << "Unsorted array:" << std::endl;
    for (const auto& e : arr) {
        std::cout << e << " ";
    }
    std::cout << std::endl;

    // Call parallel merge sort
    mergeSort_parallel(q, arr, 0, arr.size() - 1);

    // Print the sorted array
    std::cout << "Sorted array:" << std::endl;
    for (const auto& e : arr) {
        std::cout << e << " ";
    }
    std::cout << std::endl;

    return 0;
}

3 Running results

Write a script to run the above code:

#!/bin/bash
source /opt/intel/oneapi/setvars.sh > /dev/null 2>&1

# Command Line Arguments
src="lab/"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/oneapi/compiler/latest/linux/lib
echo ====================
echo my_merge_sort
dpcpp ${src}my_sort.cpp -o ${src}my_sort -w -O3 -lsycl
./${src}my_sort

The result is as follows (the unsorted array followed by the sorted array):
[Screenshots of the unsorted and sorted output]
The complete running time of the entire task:
[Screenshot of the timing output]
The results show that this parallel merge sort code sorts the array correctly. In terms of running efficiency, it exploits heterogeneous parallel computing to assign sorting and merging operations to multiple threads for simultaneous execution, which improves sorting performance.

4 Summary and reflections

This task is a collaboration between the C language course and Intel. For the earlier parallel matrix computation task I learned from Intel's examples, while for this parallel merge sort task I wrote the implementation myself. At first I could only implement the basic merge sort algorithm on this architecture. Later, through Intel's course materials, I learned how to write parallel computation code and began modifying the basic version. In the end, a more effective parallel merge sort was achieved through data splitting and merging and cooperation between threads. Unlike my previous programming experience, this course was my first attempt at heterogeneous parallel computing; I felt its power in speeding up computation and gained a lot.

Origin: blog.csdn.net/kikiLQQ/article/details/134761165