Pthread parallel programming summary

1. pthread_create

int pthread_create(pthread_t *,
                   const pthread_attr_t *,
                   void *(*)(void *),
                   void *);

Calling example:

errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);
  • thread_id: thread ID or handle (used for stopping the thread, etc.)
  • thread_attribute: various attributes; a null pointer means the standard default attribute values
  • thread_fun: the function to run (both its parameter and its return type are void*)
  • fun_arg: the argument passed to thread_fun
  • errcode: non-zero if creation fails

Effects of pthread_create

  • The main thread creates a new thread with the help of the operating system
  • The new thread executes the specified function thread_fun
  • When all created threads execute the same function, this represents a decomposition of the computation among threads
  • When different threads in the program perform different tasks, the argument passed at creation can be used to give each thread its "id" and other thread-specific characteristics

A simple thread example

#include <pthread.h>

void *ParFun(void *arg);  /* thread function; its body is not shown here */

int main()
{
    pthread_t threads[16];
    int tn;
    for(tn=0;tn<16;tn++)
    {
        pthread_create(&threads[tn],NULL,ParFun,NULL);
    }
    for(tn=0;tn<16;tn++)
    {
        pthread_join(threads[tn],NULL);
    }
    return 0;
}

This code creates 16 threads to execute the function "ParFun".

NOTE: Thread creation is expensive, so ParFun should do enough work to be worth the cost of creating the thread

2. Thread data sharing

  • global variables are shared
  • Objects allocated in the heap may be shared (pointer sharing)
  • Variables on the stack are private: passing their pointers to other threads can cause problems
  • Common sharing method: create a "thread data" structure and pass it to all threads, for example (a fuller sketch follows this list):

    char *message = "Hello World!\n";
    pthread_create(&thread1,
                   NULL,
                   print_fun,
                   (void*) message);
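
The snippet above shares only a single string. Below is a minimal sketch of the "thread data" structure idea, in which each thread receives its own id together with a pointer to shared data; the names thread_data_t, worker, and NUM_THREADS are illustrative, not part of the original example:

#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

typedef struct {
    int id;              /* per-thread rank */
    double *shared_data; /* pointer to data shared by all threads */
} thread_data_t;

void *worker(void *arg)
{
    thread_data_t *td = (thread_data_t *) arg;
    printf("thread %d sees shared_data[0] = %f\n", td->id, td->shared_data[0]);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    thread_data_t td[NUM_THREADS];  /* one struct per thread; outlives the threads */
    double data[8] = {3.14};

    for (int i = 0; i < NUM_THREADS; i++) {
        td[i].id = i;
        td[i].shared_data = data;
        pthread_create(&threads[i], NULL, worker, &td[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Because every thread gets its own element of td, and td stays alive until all threads are joined, this avoids the stack-pointer problem mentioned in the list above.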

3. Pthread “Hello world”

3.1 Some preparations

  • The number of threads, thread_count, is set at run time and read from the command line
  • Each thread prints "Hello from thread <X> of <thread_count>"

3.2 The pthread_join function

int pthread_join(pthread_t thread, void **value_ptr);

Notes:

  • Effect: "Suspend the calling thread until the target thread ends, unless the target thread has already ended."
  • The second parameter lets the target thread pass return information back to the calling thread when it exits (often just NULL)
  • Returns a non-zero value if an error occurs

3.3 “Hello World”

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Global variable: accessible to all threads */
int thread_count;

// Thread function executed by each thread
void* Hello(void* rank); /* Thread function */

int main(int argc, char* argv[]) {
  long thread; /* Use long in case of a 64-bit system */
  // Thread handles
  pthread_t* thread_handles;

  /* Get number of threads from command line */
  thread_count = strtol(argv[1], NULL, 10);

  thread_handles = malloc(thread_count*sizeof(pthread_t));

  // Create the threads
  for (thread = 0; thread < thread_count; thread++)
   pthread_create(&thread_handles[thread], NULL, Hello, (void*) thread);

  printf("Hello from the main thread\n");

  // Wait for the threads to finish
  for (thread = 0; thread < thread_count; thread++)
    pthread_join(thread_handles[thread], NULL);

  free(thread_handles);
  return 0;
} /* main */

void* Hello(void* rank) {
  long my_rank = (long) rank;  /* Use long in case of 64-bit system */

  printf("Hello from thread %ld of %d\n", my_rank, thread_count);

  return NULL;
} /* Hello */

Possible outputs:

Hello from thread 1 of 4
Hello from thread 3 of 4
Hello from thread 0 of 4
Hello from the main thread
Hello from thread 2 of 4

4. Other basic APIs of Pthread

4.1 pthread_exit()

void pthread_exit(void *value_ptr);

Returns a result to the caller through value_ptr.
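
A minimal sketch of a thread returning a result through pthread_exit, which the caller then picks up via the second argument of pthread_join; the function square_task and the heap-allocated result are illustrative assumptions, not part of the original:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *square_task(void *arg)
{
    long n = (long) arg;
    long *result = (long *) malloc(sizeof(long)); /* must outlive the thread, so put it on the heap */
    *result = n * n;
    pthread_exit(result);                         /* equivalent to: return result; */
}

int main(void)
{
    pthread_t t;
    void *value_ptr;

    pthread_create(&t, NULL, square_task, (void *) 7L);
    pthread_join(t, &value_ptr);  /* receives the pointer that square_task passed to pthread_exit */
    printf("result = %ld\n", *(long *) value_ptr);
    free(value_ptr);
    return 0;
}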

4.2 pthread_cancel()

int pthread_cancel(pthread_t thread);

Cancels execution of the thread thread.

An example of canceling thread execution:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

void *threadFunc(void *parm)
{
    while(1)
    {
        fprintf(stdout, "I am the child thread.\n");
        // Test whether a cancellation request is pending; if so, exit the thread here
        pthread_testcancel();
        sleep(1);
    }
}

int main(int argc, char *argv[])
{
    void *status;
    pthread_t   thread;
    pthread_create(&thread, NULL, threadFunc, NULL);
    sleep(3);
    // Send a cancellation request to the thread
    pthread_cancel(thread);
    // Wait for the thread to actually exit
    pthread_join(thread, &status);
    if (status == PTHREAD_CANCELED)
        fprintf(stdout, "The child thread has been canceled.\n");
    else
        fprintf(stderr, "Unexpected thread status!\n");
    return 0;
}

Output:

I am the child thread.
I am the child thread.
I am the child thread.
I am the child thread.
The child thread has been canceled.

5. Comprehensive example: sorting multiple arrays

  • Multiple one-dimensional arrays can also be viewed as a matrix; the task is to sort each row (each one-dimensional array)
  • How does this differ from matrix-vector multiplication?

Matrix-vector multiplication is parallelized by partitioning the data and assigning it to different processes/threads; the rows here can be block-partitioned among threads in the same way.

#include <iostream>
#include <algorithm>
#include <vector>
#include <time.h>
#include <immintrin.h>
#include <windows.h>
#include <pthread.h>

using namespace std;

typedef struct{
    int threadId;
} threadParm_t;

const int ARR_NUM = 10000;
const int ARR_LEN = 10000;
const int THREAD_NUM = 4;
const int seg = ARR_NUM / THREAD_NUM; // number of arrays / number of threads = each thread's share of the work

vector<int> arr[ARR_NUM];
pthread_mutex_t mutex;
long long head, freq;        // timers

void init(void)
{
  srand(unsigned(time(nullptr)));
  for (int i = 0; i < ARR_NUM; i++) {
    arr[i].resize(ARR_LEN);
    for (int j = 0; j < ARR_LEN; j++)
      arr[i][j] = rand();
  }
}

void *arr_sort(void *parm)
{
  threadParm_t *p = (threadParm_t *) parm;
  int r = p->threadId;
  long long tail;
  // Each thread's share of the work:
  // thread r sorts the contiguous block of seg arrays starting at r * seg
  for (int i = r * seg; i < (r + 1) * seg; i++)
    sort(arr[i].begin(), arr[i].end());

  pthread_mutex_lock(&mutex);
  QueryPerformanceCounter((LARGE_INTEGER *)&tail);
  printf("Thread %d: %lfms.\n", r, (tail - head) * 1000.0 / freq);
  pthread_mutex_unlock(&mutex);

  pthread_exit(nullptr);
}

int main(int argc, char *argv[])
{
  QueryPerformanceFrequency((LARGE_INTEGER *)&freq);

  init();
  pthread_mutex_init(&mutex, nullptr);
  pthread_t thread[THREAD_NUM];
  threadParm_t threadParm[THREAD_NUM];

  QueryPerformanceCounter((LARGE_INTEGER *)&head);

  for (int i = 0; i < THREAD_NUM; i++)
  {
    threadParm[i].threadId = i;
    pthread_create(&thread[i], nullptr, arr_sort, (void *)&threadParm[i]);
  }

   for (int i = 0; i < THREAD_NUM; i++)
  {
    pthread_join(thread[i], nullptr);
  }

  pthread_mutex_destroy(&mutex);
  return 0;
}

Result:

// single thread
Thread 0: 7581.931894ms.
// 4 threads
Thread 3: 1942.302817ms.
Thread 2: 1948.374916ms.
Thread 0: 1955.479851ms.
Thread 1: 1969.761978ms.

Although the data is completely random, every thread sees the same data distribution, so the load is naturally balanced.

If the generated data does not all come from the same distribution, the results are not as good. For example:

void init_2(void)
{
  int ratio;
  srand(unsigned(time(nullptr)));
  for (int i = 0; i < ARR_NUM; i++) {
    arr[i].resize(ARR_LEN);
    // ratio/128 is the probability that row i is generated in descending order
    if (i < seg) ratio = 0;
    else if (i < seg * 2) ratio = 32;
    else if (i < seg * 3) ratio = 64;
    else ratio = 128;
    if ((rand() & 127) < ratio)
      // descending row: the expensive case for sorting
      for (int j = 0; j < ARR_LEN; j++)
        arr[i][j] = ARR_LEN - j;
    else
      // ascending row: already sorted
      for (int j = 0; j < ARR_LEN; j++)
        arr[i][j] = j;
  }
}

First quarter of the rows: all ascending

Second quarter: about 1/4 of the rows descending, 3/4 ascending

Third quarter: about 1/2 descending, 1/2 ascending

Fourth quarter: all descending

With block partitioning the load is uneven!

Running times:

// single thread
Thread 0: 1643.106837ms.
// 4 threads
Thread 0: 428.869616ms.
Thread 1: 486.402280ms.
Thread 2: 530.073299ms.
Thread 3: 643.510582ms.

The parallel cost is determined by the slowest thread: 643.5 ms × 4 threads ≈ 2574 ms, well above the 1643 ms serial time!

Dynamic task assignment

int next_arr = 0;
pthread_mutex_t  mutex_task;
void *arr_sort_fine(void *parm)
{
  threadParm_t *p = (threadParm_t *) parm;
  int r = p->threadId;
  int task = 0;
  long long tail;
  while (1) {
    // Fetch the next task (this step is serialized by the lock)
    pthread_mutex_lock(&mutex_task);
    task = next_arr++;
    // dynamic task assignment: one row per fetch
    pthread_mutex_unlock(&mutex_task);
    // Stop when the task pool is exhausted
    if (task >= ARR_NUM) break;
    stable_sort(arr[task].begin(), arr[task].end());
  }
  pthread_mutex_lock(&mutex);
  QueryPerformanceCounter((LARGE_INTEGER *)&tail);
  printf("Thread %d: %lfms.\n", r, (tail - head) * 1000.0 / freq);
  pthread_mutex_unlock(&mutex);
  pthread_exit(nullptr);
}
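
To run this version, main only needs small changes: initialize the extra mutex and create the threads with arr_sort_fine instead of arr_sort. A sketch of just the lines that change, assuming the same main as in the block-partitioned version above:

  // in main, after init():
  pthread_mutex_init(&mutex_task, nullptr);  // the task-pool lock must be initialized
  next_arr = 0;                              // shared task counter starts at the first row

  for (int i = 0; i < THREAD_NUM; i++)
  {
    threadParm[i].threadId = i;
    pthread_create(&thread[i], nullptr, arr_sort_fine, (void *)&threadParm[i]);
  }

  // ... join the threads as before, then:
  pthread_mutex_destroy(&mutex_task);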

Result:

Thread 0: 549.246907ms.
Thread 3: 552.934092ms.
Thread 2: 556.541263ms.
Thread 1: 559.427082ms.

Coarse-grained dynamic partitioning - allocating 50 rows at a time:

Thread 0: 520.849620ms.
Thread 1: 524.470671ms.
Thread 3: 527.458957ms.
Thread 2: 530.890995ms.
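
The coarse-grained code is not listed in the original; below is a sketch of how the task-fetching loop could grab 50 rows per lock acquisition. The constant CHUNK and the function name arr_sort_chunk are assumptions; the globals (next_arr, mutex_task, arr, and the timers) are the ones defined above:

const int CHUNK = 50;  // rows fetched per lock acquisition

void *arr_sort_chunk(void *parm)
{
  threadParm_t *p = (threadParm_t *) parm;
  int r = p->threadId;
  long long tail;
  while (1) {
    // Grab a whole block of CHUNK rows under one lock instead of one row at a time
    pthread_mutex_lock(&mutex_task);
    int first = next_arr;
    next_arr += CHUNK;
    pthread_mutex_unlock(&mutex_task);
    if (first >= ARR_NUM) break;
    int last = min(first + CHUNK, ARR_NUM);
    for (int task = first; task < last; task++)
      stable_sort(arr[task].begin(), arr[task].end());
  }
  pthread_mutex_lock(&mutex);
  QueryPerformanceCounter((LARGE_INTEGER *)&tail);
  printf("Thread %d: %lfms.\n", r, (tail - head) * 1000.0 / freq);
  pthread_mutex_unlock(&mutex);
  pthread_exit(nullptr);
}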

Fine-grained task division balances the load, but the synchronization overhead is also significant. Choosing the right granularity requires experimentation.
