Pthread Parallel Programming Summary

1. pthread_create

int pthread_create(pthread_t *thread,
                const pthread_attr_t *attr,
                void *(*start_routine)(void *),
                void *arg);

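Example call:

errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);
  • thread_id: the thread ID or handle (used later to join or cancel the thread)
  • thread_attribute: thread attributes; a null pointer selects the default attributes
  • thread_fun: the function to run (both its parameter and its return value are of type void*)
  • fun_arg: the argument passed to thread_fun
  • errcode: 0 on success, a nonzero error number if creation fails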
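The return value is easy to overlook. A minimal sketch of checking it follows; the names `noop` and `create_and_join_checked` are hypothetical and exist only for this illustration. Pthread functions return the error number directly instead of setting errno, so the code passes it to strerror.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical thread function used only in this sketch. */
static void *noop(void *arg) {
    (void)arg;
    return NULL;
}

/* Creates one thread with default attributes, joins it, and reports
 * any failure. Returns 0 on success, otherwise the pthread error number. */
int create_and_join_checked(void) {
    pthread_t tid;
    int errcode = pthread_create(&tid, NULL, noop, NULL);
    if (errcode != 0) {
        /* The error number is the return value; errno is not set. */
        fprintf(stderr, "pthread_create: %s\n", strerror(errcode));
        return errcode;
    }
    errcode = pthread_join(tid, NULL);
    if (errcode != 0)
        fprintf(stderr, "pthread_join: %s\n", strerror(errcode));
    return errcode;
}
```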

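What pthread_create does

  • The main thread asks the operating system to create a new thread
  • The new thread executes the given function thread_fun
  • Typically every created thread runs the same function, which expresses how the computation is decomposed across threads
  • When different threads must do different work, the argument passed at creation can carry the thread's "id" and any other thread-specific properties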
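The last point can be sketched as follows. All threads run the same function, and a per-thread argument structure tells each one which "id" it has. The names `thread_arg_t`, `square_own_id`, `run_squares`, and the thread count of 4 are assumptions made for this example.

```c
#include <pthread.h>

#define N_THREADS 4  /* hypothetical thread count for this sketch */

typedef struct {
    int id;      /* logical thread id, 0..N_THREADS-1 */
    int result;  /* slot that only this thread writes */
} thread_arg_t;

/* Same function for every thread; behaviour depends on the id it receives. */
static void *square_own_id(void *p) {
    thread_arg_t *a = (thread_arg_t *)p;
    a->result = a->id * a->id;
    return NULL;
}

/* Starts N_THREADS threads and returns the sum of their results. */
int run_squares(void) {
    pthread_t th[N_THREADS];
    thread_arg_t args[N_THREADS];  /* stays alive until all joins finish */
    int created = 0, sum = 0;

    for (int i = 0; i < N_THREADS; i++) {
        args[i].id = i;
        args[i].result = 0;
        if (pthread_create(&th[i], NULL, square_own_id, &args[i]) != 0)
            break;  /* join only the threads that were actually created */
        created++;
    }
    for (int i = 0; i < created; i++) {
        pthread_join(th[i], NULL);
        sum += args[i].result;
    }
    return sum;
}
```

Each thread gets its own element of `args`, so no two threads write the same memory and no locking is needed here.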

A simple thread example

int main()
{
    pthread_t threads[16];
    int tn;
    for(tn=0;tn<16;tn++)
    {
        pthread_create(&threads[tn],NULL,ParFun,NULL);
    }
    for(tn=0;tn<16;tn++)
    {
        pthread_join(threads[tn],NULL);
    }
    return 0;
}

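This code creates 16 threads, each of which executes the function "ParFun".

Note: creating a thread is expensive, so ParFun should do a substantial amount of work to justify that cost.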

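2. Sharing data between threads

  • Global variables are shared by all threads
  • Objects allocated on the heap may be shared (by sharing the pointer)
  • Variables on the stack are private: passing a pointer to one of them to another thread can cause problems
  • A common way to share data: create a "thread data" structure and pass it to every thread. The same mechanism passes a string here:
char *message = "Hello World!\n";
/* print_fun must have the signature void *print_fun(void *) */
pthread_create(&thread1,
               NULL,
               print_fun,
               (void *)message);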
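A "thread data" structure shared by all threads might look like the sketch below. The structure `shared_data_t`, the functions `add_to_counter` and `run_shared_counter`, and the limit of 16 threads are assumptions for illustration. The structure lives on the stack of the creating function, which is safe here only because that function joins every thread before it returns.

```c
#include <pthread.h>

#define MAX_THREADS 16  /* hypothetical upper bound for this sketch */

typedef struct {
    pthread_mutex_t lock;  /* protects counter */
    long counter;          /* shared by every thread */
    int increments;        /* read-only after setup */
} shared_data_t;

/* Every thread receives a pointer to the same shared_data_t. */
static void *add_to_counter(void *p) {
    shared_data_t *s = (shared_data_t *)p;
    for (int i = 0; i < s->increments; i++) {
        pthread_mutex_lock(&s->lock);
        s->counter++;
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/* Returns the final counter value after all threads have finished. */
long run_shared_counter(int nthreads, int increments) {
    pthread_t th[MAX_THREADS];
    shared_data_t shared;  /* outlives the threads: joined below */
    int created = 0;

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    pthread_mutex_init(&shared.lock, NULL);
    shared.counter = 0;
    shared.increments = increments;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&th[i], NULL, add_to_counter, &shared) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(th[i], NULL);
    pthread_mutex_destroy(&shared.lock);
    return shared.counter;
}
```

Without the mutex the increments from different threads could interleave and the final count would be unpredictable.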

3. Pthread “Hello world”

3.1 Setup

  • The number of threads (threadcount) is set at run time and read from the command line
  • Each thread prints "Hello from thread <X> of <threadcount>"

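3.2 The pthread_join function

int pthread_join(pthread_t thread, void **value_ptr);

Notes:

  • Purpose: "suspends the calling thread until the target thread terminates, unless the target thread has already terminated."
  • The second parameter lets the target thread hand a value back to the caller when it exits (usually NULL when no value is needed)
  • Returns a nonzero value if an error occurs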
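The second parameter can be illustrated with a short sketch in which the thread returns a heap-allocated result and the joining thread reads and frees it. The functions `sum_to_n` and `join_with_result` are hypothetical names; the result is placed on the heap because a pointer to the thread's own stack variable would be invalid once the thread has finished.

```c
#include <pthread.h>
#include <stdlib.h>

/* Computes 1 + 2 + ... + n and returns it through a heap allocation. */
static void *sum_to_n(void *arg) {
    long n = (long)arg;
    long *out = malloc(sizeof *out);
    if (out == NULL)
        return NULL;
    *out = n * (n + 1) / 2;
    return out;  /* handed to whoever joins this thread */
}

/* Returns the thread's result, or -1 if anything failed. */
long join_with_result(long n) {
    pthread_t tid;
    void *ret = NULL;
    long value;

    if (pthread_create(&tid, NULL, sum_to_n, (void *)n) != 0)
        return -1;
    if (pthread_join(tid, &ret) != 0 || ret == NULL)
        return -1;
    value = *(long *)ret;
    free(ret);  /* the joiner owns the allocation */
    return value;
}
```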

3.3 “Hello World”

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Global variable: accessible to all threads */
int thread_count;

// Function executed by each thread
void* Hello(void* rank); /* Thread function */

int main(int argc, char* argv[]) {
  long thread; /* Use long in case of a 64-bit system */
  // Thread handles
  pthread_t* thread_handles;

  /* Get number of threads from command line */
  if (argc < 2) {
    fprintf(stderr, "usage: %s <number of threads>\n", argv[0]);
    return 1;
  }
  thread_count = strtol(argv[1], NULL, 10);

  thread_handles = malloc(thread_count*sizeof(pthread_t));

  // Create the threads
  for (thread = 0; thread < thread_count; thread++)
   pthread_create(&thread_handles[thread], NULL, Hello, (void*) thread);

  printf("Hello from the main thread\n");

  // Wait for every thread to finish
  for (thread = 0; thread < thread_count; thread++)
    pthread_join(thread_handles[thread], NULL);

  free(thread_handles);
  return 0;
} /* main */

void* Hello(void* rank) {
  long my_rank = (long) rank;  /* Use long in case of 64-bit system */

  printf("Hello from thread %ld of %d\n", my_rank, thread_count);

  return NULL;
} /* Hello */

Possible output (the order varies from run to run):

Hello from thread 1 of 4
Hello from thread 3 of 4
Hello from thread 0 of 4
Hello from the main thread
Hello from thread 2 of 4

4. Other basic Pthread APIs

4.1 pthread_exit()

void pthread_exit(void *value_ptr);

Terminates the calling thread and makes value_ptr available to a thread that calls pthread_join on it.
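Unlike a return statement, pthread_exit can end the thread from anywhere in its call stack. The sketch below shows both paths; the status codes and the names `check_input`, `worker`, and `run_worker` are hypothetical, and small integers are carried in the void* through intptr_t.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch. */
enum { FINISHED_NORMALLY = 0, EXITED_EARLY = 1 };

/* Called from inside the thread function; ends the whole thread on bad input. */
static void check_input(long x) {
    if (x < 0)
        pthread_exit((void *)(intptr_t)EXITED_EARLY);
}

static void *worker(void *arg) {
    long x = (long)(intptr_t)arg;
    check_input(x);  /* does not return if x is negative */
    /* Returning from the thread function has the same effect as
     * calling pthread_exit with the returned value. */
    return (void *)(intptr_t)FINISHED_NORMALLY;
}

/* Returns the status code delivered through pthread_join, or -1 on error. */
int run_worker(long x) {
    pthread_t tid;
    void *status = NULL;

    if (pthread_create(&tid, NULL, worker, (void *)(intptr_t)x) != 0)
        return -1;
    if (pthread_join(tid, &status) != 0)
        return -1;
    return (int)(intptr_t)status;
}
```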

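4.2 pthread_cancel()

int pthread_cancel(pthread_t thread);

Requests cancellation of the thread thread. With the default (deferred) cancellation type, the target thread acts on the request only when it reaches a cancellation point.

An example of cancelling a thread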
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

void *threadFunc(void *parm)
{
    while(1)
    {
        fprintf(stdout, "I am the child thread.\n");
        // If a cancellation request is pending, the thread exits here
        pthread_testcancel();
        sleep(1);
    }
}

int main(int argc, char *argv[])
{
    void *status;
    pthread_t   thread;
    pthread_create(&thread, NULL, threadFunc, NULL);
    sleep(3);
    // Send a cancellation request to the thread
    pthread_cancel(thread);
    // Wait until the thread has actually exited
    pthread_join(thread, &status);
    if (status == PTHREAD_CANCELED)
        fprintf(stdout, "The child thread has been canceled.\n");
    else
        fprintf(stderr, "Unexpected thread status!\n");
    return 0;
}

Output:

I am the child thread.
I am the child thread.
I am the child thread.
I am the child thread.
The child thread has been canceled.
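A cancelled thread may hold resources that still need to be released. One common approach, not used in the example above, is a cleanup handler registered with pthread_cleanup_push. The following is a minimal sketch under that assumption; `on_cancel`, `loop_until_cancelled`, and `cancel_and_check` are hypothetical names, and the handler only sets a flag where real code would free memory or unlock a mutex.

```c
#include <pthread.h>
#include <sched.h>

static int cleaned_up = 0;  /* set by the cleanup handler */

/* Runs in the cancelled thread when the cancellation is acted upon. */
static void on_cancel(void *arg) {
    (void)arg;
    cleaned_up = 1;
}

static void *loop_until_cancelled(void *arg) {
    (void)arg;
    pthread_cleanup_push(on_cancel, NULL);
    for (;;) {
        pthread_testcancel();  /* explicit cancellation point */
        sched_yield();         /* let other threads run while spinning */
    }
    /* Never reached, but push and pop must be paired in the same scope. */
    pthread_cleanup_pop(0);
    return NULL;
}

/* Returns 1 if the thread was cancelled and its handler ran. */
int cancel_and_check(void) {
    pthread_t tid;
    void *status = NULL;

    cleaned_up = 0;
    if (pthread_create(&tid, NULL, loop_until_cancelled, NULL) != 0)
        return -1;
    pthread_cancel(tid);
    pthread_join(tid, &status);  /* after this, cleaned_up is safe to read */
    return status == PTHREAD_CANCELED && cleaned_up == 1;
}
```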

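5. Combined example: sorting multiple arrays

  • Multiple one-dimensional arrays can be viewed as a matrix in which each row (one array) is sorted
  • How does this differ from matrix-vector multiplication?

Parallel multiplication uses data partitioning: the data is divided among different processes/threads.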

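#include <iostream>
#include <algorithm>
#include <vector>
#include <cstdio>     // printf
#include <cstdlib>    // rand, srand
#include <time.h>
#include <immintrin.h>
#include <windows.h>  // QueryPerformanceCounter, QueryPerformanceFrequency
#include <pthread.h>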

using namespace std;

typedef struct{
    int threadId;
} threadParm_t;

const int ARR_NUM = 10000;
const int ARR_LEN = 10000;
const int THREAD_NUM = 4;
const int seg = ARR_NUM / THREAD_NUM; // arrays per thread = number of arrays / number of threads

vector<int> arr[ARR_NUM];
pthread_mutex_t mutex;
long long head, freq;        // timers

void init(void)
{
  srand(unsigned(time(nullptr)));
  for (int i = 0; i < ARR_NUM; i++) {
    arr[i].resize(ARR_LEN);
    for (int j = 0; j < ARR_LEN; j++)
      arr[i][j] = rand();
  }
}

void *arr_sort(void *parm)
{
  threadParm_t *p = (threadParm_t *) parm;
  int r = p->threadId;
  long long tail;
  // Work done by each thread:
  // each thread sorts a contiguous block of ARR_NUM / THREAD_NUM arrays
  for (int i = r * seg; i < (r + 1) * seg; i++)
    sort(arr[i].begin(), arr[i].end());

  pthread_mutex_lock(&mutex);
  QueryPerformanceCounter((LARGE_INTEGER *)&tail);
  printf("Thread %d: %lfms.\n", r, (tail - head) * 1000.0 / freq);
  pthread_mutex_unlock(&mutex);

  pthread_exit(nullptr);
}

int main(int argc, char *argv[])
{
  QueryPerformanceFrequency((LARGE_INTEGER *)&freq);

  init();
  pthread_mutex_init(&mutex, nullptr);  // PTHREAD_MUTEX_INITIALIZER is only valid in a declaration
  pthread_t thread[THREAD_NUM];
  threadParm_t threadParm[THREAD_NUM];

  QueryPerformanceCounter((LARGE_INTEGER *)&head);

  for (int i = 0; i < THREAD_NUM; i++)
  {
    threadParm[i].threadId = i;
    pthread_create(&thread[i], nullptr, arr_sort, (void *)&threadParm[i]);
  }

   for (int i = 0; i < THREAD_NUM; i++)
  {
    pthread_join(thread[i], nullptr);
  }

  pthread_mutex_destroy(&mutex);
}

Results:

// single thread
Thread 0: 7581.931894ms.
// 4 threads
Thread 3: 1942.302817ms.
Thread 2: 1948.374916ms.
Thread 0: 1955.479851ms.
Thread 1: 1969.761978ms.

Although the data is completely random, every thread's data follows the same distribution, so the load is balanced.
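The program above computes seg = ARR_NUM / THREAD_NUM, which covers every array only when the division is exact. A sketch of a block partition that also handles a remainder is shown here; `range_t` and `block_range` are hypothetical names and not part of the original program.

```c
/* Half-open index range [begin, end) owned by one thread. */
typedef struct {
    int begin;
    int end;
} range_t;

/* Splits n items among nthreads threads in contiguous blocks.
 * The first (n % nthreads) threads receive one extra item each. */
range_t block_range(int n, int nthreads, int rank) {
    int base = n / nthreads;
    int extra = n % nthreads;
    range_t r;

    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}
```

With 10000 arrays and 4 threads this reproduces the blocks used above (thread 1 gets indices 2500 to 4999). With 10 items and 4 threads the blocks have sizes 3, 3, 2, 2.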

If the arrays do not all follow the same distribution, the result is not as good. For example:

void init_2(void)
{
  int ratio;
  srand(unsigned(time(nullptr)));
  for (int i = 0; i < ARR_NUM; i++) {
    arr[i].resize(ARR_LEN);
    if (i < seg) ratio = 0;
    else if (i < seg * 2) ratio = 32;
    else if (i < seg * 3) ratio = 64;
    else ratio = 128;
    if ((rand() & 127) < ratio)
      for (int j = 0; j < ARR_LEN; j++)
        arr[i][j] = ARR_LEN - j;
    else
      for (int j = 0; j < ARR_LEN; j++)
        arr[i][j] = j;
  }
}

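First quarter: every array is in ascending order

Second quarter: about 1/4 of the arrays are in descending order, 3/4 ascending

Third quarter: about 1/2 descending, 1/2 ascending

Fourth quarter: every array is in descending order

With block partitioning the load is unbalanced!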

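Run time:

// single thread
Thread 0: 1643.106837ms.
// 4 threads
Thread 0: 428.869616ms.
Thread 1: 486.402280ms.
Thread 2: 530.073299ms.
Thread 3: 643.510582ms.

The parallel cost (slowest thread time × number of threads) is 643.5 ms × 4 ≈ 2574 ms, well above the 1643 ms single-thread time.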
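The arithmetic behind this comparison can be written out as a small helper. This is a sketch using the usual definitions (cost = parallel time × thread count, speedup = serial time / parallel time, efficiency = speedup / thread count); `parallel_stats_t` and `parallel_stats` are names invented for the example.

```c
typedef struct {
    double cost_ms;     /* parallel time multiplied by thread count */
    double speedup;     /* serial time divided by parallel time */
    double efficiency;  /* speedup divided by thread count */
} parallel_stats_t;

/* t_parallel_ms is the time of the slowest thread, since the run
 * is not finished until every thread is done. */
parallel_stats_t parallel_stats(double t_serial_ms, double t_parallel_ms,
                                int nthreads) {
    parallel_stats_t s;
    s.cost_ms = t_parallel_ms * nthreads;
    s.speedup = t_serial_ms / t_parallel_ms;
    s.efficiency = s.speedup / nthreads;
    return s;
}
```

Calling `parallel_stats(1643.106837, 643.510582, 4)` with the timings measured above gives a speedup of roughly 2.55 and an efficiency of roughly 0.64, so more than a third of the four threads' capacity is lost to the imbalance.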

Dynamic task assignment

int next_arr = 0;  // index of the next unsorted array
pthread_mutex_t  mutex_task = PTHREAD_MUTEX_INITIALIZER;  // protects next_arr
void *arr_sort_fine(void *parm)
{
  threadParm_t *p = (threadParm_t *) parm;
  int r = p->threadId;
  int task = 0;
  long long tail;
  while (1) {
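    // Fetch the next task (serialized by the mutex)
    pthread_mutex_lock(&mutex_task);
    task = next_arr++;
    // Dynamic assignment: each thread takes whichever array is next
    pthread_mutex_unlock(&mutex_task);
    // Stop when the task pool is empty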
    if (task >= ARR_NUM) break;
    stable_sort(arr[task].begin(), arr[task].end());
  }
  pthread_mutex_lock(&mutex);
  QueryPerformanceCounter((LARGE_INTEGER *)&tail);
  printf("Thread %d: %lfms.\n", r, (tail - head) * 1000.0 / freq);
  pthread_mutex_unlock(&mutex);
  pthread_exit(nullptr);
}

Results:

Thread 0: 549.246907ms.
Thread 3: 552.934092ms.
Thread 2: 556.541263ms.
Thread 1: 559.427082ms.

Coarse-grained dynamic partitioning, handing out 50 rows at a time:

Thread 0: 520.849620ms.
Thread 1: 524.470671ms.
Thread 3: 527.458957ms.
Thread 2: 530.890995ms.

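Fine-grained task partitioning balances the load, but its synchronization overhead is also high; the right granularity has to be found by experiment.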
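The code for the 50-row variant is not shown in the original. One way it could look is sketched below: each lock acquisition claims a whole chunk of tasks instead of one. The names `claim_chunk`, `chunk_worker`, and `run_chunked` are hypothetical, and incrementing a per-task counter stands in for sorting a row.

```c
#include <pthread.h>

#define TOTAL_TASKS 10000  /* matches ARR_NUM in the text */
#define CHUNK 50           /* rows handed out per lock acquisition */
#define MAX_THREADS 16     /* hypothetical upper bound for this sketch */

static int next_task = 0;
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
static int done[TOTAL_TASKS];  /* how often each task was processed */

/* Claims up to CHUNK tasks as the range [*begin, *end).
 * Returns 0 when the pool is empty, 1 otherwise. */
static int claim_chunk(int *begin, int *end) {
    int b, e;
    pthread_mutex_lock(&task_lock);
    b = next_task;
    e = (b + CHUNK < TOTAL_TASKS) ? b + CHUNK : TOTAL_TASKS;
    if (b < TOTAL_TASKS)
        next_task = e;
    pthread_mutex_unlock(&task_lock);
    if (b >= TOTAL_TASKS)
        return 0;
    *begin = b;
    *end = e;
    return 1;
}

static void *chunk_worker(void *arg) {
    int b, e;
    (void)arg;
    while (claim_chunk(&b, &e))
        for (int i = b; i < e; i++)
            done[i]++;  /* stand-in for sorting row i; no other thread owns i */
    return NULL;
}

/* Runs the pool and returns how many tasks were processed exactly once. */
int run_chunked(int nthreads) {
    pthread_t th[MAX_THREADS];
    int created = 0, once = 0;

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    next_task = 0;
    for (int i = 0; i < TOTAL_TASKS; i++)
        done[i] = 0;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&th[i], NULL, chunk_worker, NULL) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(th[i], NULL);
    for (int i = 0; i < TOTAL_TASKS; i++)
        if (done[i] == 1)
            once++;
    return once;
}
```

With CHUNK set to 50 the lock is taken about 200 times for real work instead of 10000 times, at the price of coarser balancing near the end of the run.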

Reposted from blog.csdn.net/turing365/article/details/80216553