Pthread Parallel Programming Summary
1. pthread_create
int pthread_create(pthread_t *,
                   const pthread_attr_t *,
                   void *(*)(void *),
                   void *);
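Example call:
errcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);
thread_id: thread ID or handle (used to stop the thread, etc.)
thread_attribute: various attributes; a null pointer selects the standard default attribute values
thread_fun: the function to run (its parameter and return value are both of type void*)
fun_arg: the argument passed to thread_fun
errcode: nonzero if creation fails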
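Effects of pthread_create
- The main thread creates a new thread with the help of the operating system
- The new thread executes a specific function, thread_fun
- All created threads execute the same function, which expresses how the computation is decomposed among the threads
- When different threads of a program perform different tasks, the argument passed at creation can distinguish each thread's "id" and any other thread-specific properties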
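The last point can be sketched as follows. This is a minimal illustration, not code from the original notes: the names thread_arg_t, worker, run_workers and NUM_THREADS are hypothetical. Every thread runs the same function and uses the id it receives to pick its own work. The argument array lives on the creator's stack, which is safe here only because the creator joins all threads before returning.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4

/* Hypothetical per-thread argument: an id plus a slot for the result. */
typedef struct {
    int id;      /* logical thread id, 0..NUM_THREADS-1 */
    int result;  /* written by the thread, read by the creator after join */
} thread_arg_t;

/* All threads run the same function; the id selects each thread's share. */
static void *worker(void *arg)
{
    thread_arg_t *a = (thread_arg_t *)arg;
    a->result = a->id * 10;
    return NULL;
}

/* Creates NUM_THREADS threads, waits for them, and sums their results. */
int run_workers(void)
{
    pthread_t threads[NUM_THREADS];
    thread_arg_t args[NUM_THREADS];
    int sum = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        args[i].id = i;
        args[i].result = 0;
        int errcode = pthread_create(&threads[i], NULL, worker, &args[i]);
        assert(errcode == 0);  /* nonzero means the thread was not created */
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        sum += args[i].result;
    }
    return sum;
}
```

Calling run_workers() from main returns the sum of the four per-thread results.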
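A simple thread example
#include <pthread.h>

void *ParFun(void *arg);  /* thread function, defined elsewhere */

int main()
{
    pthread_t threads[16];
    int tn;
    /* Create 16 threads that all run ParFun */
    for (tn = 0; tn < 16; tn++)
    {
        pthread_create(&threads[tn], NULL, ParFun, NULL);
    }
    /* Wait for every thread to finish */
    for (tn = 0; tn < 16; tn++)
    {
        pthread_join(threads[tn], NULL);
    }
    return 0;
}
This code creates 16 threads, each of which executes the function ParFun.
Note: creating a thread is expensive, so ParFun should do a substantial amount of work to make that cost worthwhile.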
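2. Sharing data between threads
- Global variables are all shared
- Objects allocated on the heap may be shared (by sharing pointers)
- Variables on the stack are private: passing a pointer to one to another thread can cause problems
- A common way to share: create a "thread data" structure and pass it to all threads, for example: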
char *message = "Hello World!\n";
pthread_create(&thread1,
               NULL,
               print_fun,        /* must have type void *(*)(void *) */
               (void *) message);
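A "thread data" structure shared by all threads might look like the sketch below. It is only an illustration under stated assumptions: shared_data_t, add_to_counter and run_shared_counter are hypothetical names. The structure is heap-allocated, and every thread receives the same pointer, so all of them update the same counter. Because the counter is shared, a mutex protects each update.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NUM_THREADS 4
#define INCREMENTS 1000

/* Hypothetical "thread data" structure shared by every thread. */
typedef struct {
    pthread_mutex_t lock;  /* protects counter */
    long counter;
} shared_data_t;

/* Each thread adds INCREMENTS to the shared counter, one step at a time. */
static void *add_to_counter(void *arg)
{
    shared_data_t *shared = (shared_data_t *)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&shared->lock);
        shared->counter++;
        pthread_mutex_unlock(&shared->lock);
    }
    return NULL;
}

/* Allocates the shared structure on the heap and passes the same pointer
   to all threads; returns the final counter value. */
long run_shared_counter(void)
{
    pthread_t threads[NUM_THREADS];
    shared_data_t *shared = malloc(sizeof *shared);
    assert(shared != NULL);
    pthread_mutex_init(&shared->lock, NULL);
    shared->counter = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        int errcode = pthread_create(&threads[i], NULL, add_to_counter, shared);
        assert(errcode == 0);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    long total = shared->counter;
    pthread_mutex_destroy(&shared->lock);
    free(shared);
    return total;
}
```

The structure is freed only after every thread has been joined, so no thread can touch it after it is released.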
3. Pthread “Hello world”
3.1 Setup
- The number of threads (threadcount) is set at run time and read from the command line
- Each thread prints
“Hello from thread <X> of <threadcount>”
3.2 pthread_join
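Function
int pthread_join(pthread_t thread, void **value_ptr);
Notes:
- Purpose: “suspends the calling thread until the target thread terminates, unless the target thread has already terminated.”
- The second parameter lets the target thread pass information back to the calling thread when it exits (usually NULL)
- Returns a nonzero value if an error occurs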
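The second parameter can be illustrated with a short sketch. The functions square and square_in_thread are hypothetical names, not part of the original notes. The thread returns a heap-allocated result, pthread_join stores that pointer in ret, and the caller is responsible for freeing it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Thread function: returns a heap-allocated result to whoever joins it. */
static void *square(void *arg)
{
    long n = *(long *)arg;
    long *result = malloc(sizeof *result);
    if (result != NULL)
        *result = n * n;
    return result;  /* pthread_join stores this pointer in *value_ptr */
}

/* Runs square(n) in a new thread; returns -1 if the allocation failed. */
long square_in_thread(long n)
{
    pthread_t thread;
    void *ret = NULL;
    long value = -1;

    /* &n stays valid because this function joins before returning. */
    int errcode = pthread_create(&thread, NULL, square, &n);
    assert(errcode == 0);
    errcode = pthread_join(thread, &ret);
    assert(errcode == 0);

    if (ret != NULL) {
        value = *(long *)ret;
        free(ret);  /* the joiner owns the returned memory */
    }
    return value;
}
```

Returning a pointer to a local variable of the thread function would not work here, since the thread's stack is gone by the time the joiner reads the value.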
3.3 “Hello World”
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Global variable: accessible to all threads */
int thread_count;

void* Hello(void* rank);  /* Thread function */

int main(int argc, char* argv[]) {
    long thread;  /* Use long in case of a 64-bit system */
    pthread_t* thread_handles;  /* Thread handles */

    /* Get number of threads from command line */
    if (argc < 2) {
        fprintf(stderr, "usage: %s <number of threads>\n", argv[0]);
        return 1;
    }
    thread_count = strtol(argv[1], NULL, 10);
    thread_handles = malloc(thread_count * sizeof(pthread_t));

    /* Create the threads; each receives its rank as the argument */
    for (thread = 0; thread < thread_count; thread++)
        pthread_create(&thread_handles[thread], NULL, Hello, (void*) thread);

    printf("Hello from the main thread\n");

    /* Wait for the threads to finish */
    for (thread = 0; thread < thread_count; thread++)
        pthread_join(thread_handles[thread], NULL);

    free(thread_handles);
    return 0;
}  /* main */

void* Hello(void* rank) {
    long my_rank = (long) rank;  /* Use long in case of 64-bit system */
    printf("Hello from thread %ld of %d\n", my_rank, thread_count);
    return NULL;
}  /* Hello */
Possible output with 4 threads (the order can differ between runs):
Hello from thread 1 of 4
Hello from thread 3 of 4
Hello from thread 0 of 4
Hello from the main thread
Hello from thread 2 of 4
4. Other basic Pthread APIs
4.1 pthread_exit()
void pthread_exit(void *value_ptr);
Terminates the calling thread and returns a result through value_ptr to the thread that joins it.
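Unlike a return statement, pthread_exit ends the thread from anywhere in its call chain. The sketch below shows this under stated assumptions: finish_with, task and run_task are hypothetical names, and the small integer result is carried inside the pointer value through intptr_t instead of being allocated on the heap.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Helper called from the thread function; it ends the whole thread. */
static void finish_with(long code)
{
    pthread_exit((void *)(intptr_t)code);
}

static void *task(void *arg)
{
    (void)arg;
    finish_with(42);  /* the thread terminates inside the helper */
    return NULL;      /* never reached */
}

/* Starts the thread and returns the value it passed to pthread_exit. */
long run_task(void)
{
    pthread_t thread;
    void *status = NULL;
    int errcode = pthread_create(&thread, NULL, task, NULL);
    assert(errcode == 0);
    pthread_join(thread, &status);
    return (long)(intptr_t)status;
}
```

Packing the value into the pointer avoids an allocation, but it only suits results that fit in an intptr_t.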
4.2 pthread_cancel()
int pthread_cancel(pthread_t thread);
Requests cancellation of the execution of the thread identified by thread.
An example of cancelling a thread
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

void *threadFunc(void *parm)
{
    while (1)
    {
        fprintf(stdout, "I am the child thread.\n");
        // Check whether a cancellation request is pending; if so, the thread exits here
        pthread_testcancel();
        sleep(1);
    }
}

int main(int argc, char *argv[])
{
    void *status;
    pthread_t thread;
    pthread_create(&thread, NULL, threadFunc, NULL);
    sleep(3);
    // Send a cancellation request to the thread
    pthread_cancel(thread);
    // Wait until the thread has actually exited
    pthread_join(thread, &status);
    if (status == PTHREAD_CANCELED)
        fprintf(stdout, "The child thread has been canceled.\n");
    else
        fprintf(stderr, "Unexpected thread status!\n");
    return 0;
}
Output:
I am the child thread.
I am the child thread.
I am the child thread.
I am the child thread.
The child thread has been canceled.
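A cancelled thread never reaches the end of its function, so any resource it holds has to be released some other way. One option, not covered in the example above, is the standard POSIX pair pthread_cleanup_push / pthread_cleanup_pop. The sketch below assumes the hypothetical names mark_cleaned, cancellable_loop and cancel_and_check. The handler only sets a flag here, standing in for real cleanup such as unlocking a mutex or freeing memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Cleanup handler: runs when the thread is cancelled. */
static void mark_cleaned(void *arg)
{
    *(int *)arg = 1;
}

static void *cancellable_loop(void *arg)
{
    struct timespec pause = {0, 1000000};  /* 1 ms */

    pthread_cleanup_push(mark_cleaned, arg);
    for (;;) {
        pthread_testcancel();     /* explicit cancellation point */
        nanosleep(&pause, NULL);  /* stand-in for real work */
    }
    pthread_cleanup_pop(0);  /* unreachable; must pair with the push above */
    return NULL;
}

/* Returns 1 if the thread was cancelled and its cleanup handler ran. */
int cancel_and_check(void)
{
    pthread_t thread;
    void *status = NULL;
    int cleaned = 0;

    int errcode = pthread_create(&thread, NULL, cancellable_loop, &cleaned);
    assert(errcode == 0);
    pthread_cancel(thread);
    pthread_join(thread, &status);  /* after this, cleaned is safe to read */
    return status == PTHREAD_CANCELED && cleaned == 1;
}
```

The handler is registered before the first cancellation point, so it runs even if the cancellation request arrives before the loop starts.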
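5. Comprehensive example: sorting multiple arrays
- Multiple one-dimensional arrays can also be viewed as a matrix; each row (a one-dimensional array) is sorted
- How does this differ from matrix-vector multiplication?
Parallel multiplication uses data partitioning, distributing the data among different processes/threads.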
#include <iostream>
#include <algorithm>
#include <vector>
#include <cstdio>
#include <cstdlib>
#include <time.h>
#include <immintrin.h>
#include <windows.h>
#include <pthread.h>
using namespace std;

typedef struct {
    int threadId;
} threadParm_t;

const int ARR_NUM = 10000;
const int ARR_LEN = 10000;
const int THREAD_NUM = 4;
const int seg = ARR_NUM / THREAD_NUM;  // arrays per thread = number of arrays / number of threads
vector<int> arr[ARR_NUM];
pthread_mutex_t mutex;
long long head, freq;  // timers

// Fill every array with random values
void init(void)
{
    srand(unsigned(time(nullptr)));
    for (int i = 0; i < ARR_NUM; i++) {
        arr[i].resize(ARR_LEN);
        for (int j = 0; j < ARR_LEN; j++)
            arr[i][j] = rand();
    }
}

void *arr_sort(void *parm)
{
    threadParm_t *p = (threadParm_t *) parm;
    int r = p->threadId;
    long long tail;
    // Each thread's share of the work:
    // thread r sorts seg consecutive arrays (ARR_NUM / THREAD_NUM of them)
    for (int i = r * seg; i < (r + 1) * seg; i++)
        sort(arr[i].begin(), arr[i].end());
    // The mutex serializes the timing output
    pthread_mutex_lock(&mutex);
    QueryPerformanceCounter((LARGE_INTEGER *)&tail);
    printf("Thread %d: %lfms.\n", r, (tail - head) * 1000.0 / freq);
    pthread_mutex_unlock(&mutex);
    pthread_exit(nullptr);
}

int main(int argc, char *argv[])
{
    QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
    init();
    pthread_mutex_init(&mutex, nullptr);
    pthread_t thread[THREAD_NUM];
    threadParm_t threadParm[THREAD_NUM];
    QueryPerformanceCounter((LARGE_INTEGER *)&head);
    for (int i = 0; i < THREAD_NUM; i++)
    {
        threadParm[i].threadId = i;
        pthread_create(&thread[i], nullptr, arr_sort, (void *)&threadParm[i]);
    }
    for (int i = 0; i < THREAD_NUM; i++)
    {
        pthread_join(thread[i], nullptr);
    }
    pthread_mutex_destroy(&mutex);
    return 0;
}
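Results:
// Single thread
Thread 0: 7581.931894ms.
// 4 threads
Thread 3: 1942.302817ms.
Thread 2: 1948.374916ms.
Thread 0: 1955.479851ms.
Thread 1: 1969.761978ms.
Although the data is completely random, every thread's data follows the same distribution, so the load is balanced.
If the arrays do not all follow the same distribution, the result is not as good. For example: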
void init_2(void)
{
    int ratio;
    srand(unsigned(time(nullptr)));
    for (int i = 0; i < ARR_NUM; i++) {
        arr[i].resize(ARR_LEN);
        // ratio / 128 is the probability that array i is generated in descending order
        if (i < seg) ratio = 0;
        else if (i < seg * 2) ratio = 32;
        else if (i < seg * 3) ratio = 64;
        else ratio = 128;
        if ((rand() & 127) < ratio)
            for (int j = 0; j < ARR_LEN; j++)
                arr[i][j] = ARR_LEN - j;  // descending order
        else
            for (int j = 0; j < ARR_LEN; j++)
                arr[i][j] = j;            // ascending order
    }
}
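First quarter: fully ascending
Second quarter: about 1/4 of the arrays descending, 3/4 ascending
Third quarter: about 1/2 descending, 1/2 ascending
Fourth quarter: fully descending
Block partitioning leaves the load unbalanced!
Run times:
// Single thread
Thread 0: 1643.106837ms.
// 4 threads
Thread 0: 428.869616ms.
Thread 1: 486.402280ms.
Thread 2: 530.073299ms.
Thread 3: 643.510582ms.
The parallel cost is 643.5 ms * 4, about 2574 ms, well above the 1643 ms single-thread time!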
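The arithmetic behind the parallel cost can be written out as a small helper. This is a sketch using the standard definitions, which the original notes do not spell out: cost = number of threads * time of the slowest thread, speedup = serial time / slowest thread time, efficiency = serial time / cost. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Wall-clock time of a parallel run is the time of its slowest thread. */
double slowest(const double *times_ms, size_t n)
{
    assert(n > 0);
    double max = times_ms[0];
    for (size_t i = 1; i < n; i++)
        if (times_ms[i] > max)
            max = times_ms[i];
    return max;
}

/* Parallel cost: every thread is occupied until the slowest one ends. */
double parallel_cost(const double *times_ms, size_t n)
{
    return slowest(times_ms, n) * (double)n;
}

/* Speedup relative to the single-thread run. */
double speedup(double serial_ms, const double *times_ms, size_t n)
{
    return serial_ms / slowest(times_ms, n);
}

/* Efficiency: fraction of the parallel cost that is useful work. */
double efficiency(double serial_ms, const double *times_ms, size_t n)
{
    return serial_ms / parallel_cost(times_ms, n);
}
```

With the four measured times and the 1643.106837 ms single-thread time, this gives a speedup of roughly 2.55 on 4 threads and an efficiency of roughly 0.64, which quantifies the load imbalance.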
Dynamic task assignment
int next_arr = 0;
pthread_mutex_t mutex_task = PTHREAD_MUTEX_INITIALIZER;

void *arr_sort_fine(void *parm)
{
    threadParm_t *p = (threadParm_t *) parm;
    int r = p->threadId;
    int task = 0;
    long long tail;
    while (1) {
        // Fetch the next task (serialized by the mutex);
        // this is the dynamic task assignment
        pthread_mutex_lock(&mutex_task);
        task = next_arr++;
        pthread_mutex_unlock(&mutex_task);
        // Stop once the task pool is empty
        if (task >= ARR_NUM) break;
        stable_sort(arr[task].begin(), arr[task].end());
    }
    pthread_mutex_lock(&mutex);
    QueryPerformanceCounter((LARGE_INTEGER *)&tail);
    printf("Thread %d: %lfms.\n", r, (tail - head) * 1000.0 / freq);
    pthread_mutex_unlock(&mutex);
    pthread_exit(nullptr);
}
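Results:
Thread 0: 549.246907ms.
Thread 3: 552.934092ms.
Thread 2: 556.541263ms.
Thread 1: 559.427082ms.
Coarse-grained dynamic partitioning, assigning 50 rows at a time:
Thread 0: 520.849620ms.
Thread 1: 524.470671ms.
Thread 3: 527.458957ms.
Thread 2: 530.890995ms.
Fine-grained task partitioning balances the load, but its synchronization overhead is also high; finding a suitable granularity requires experiments.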
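The notes give timings for the coarse-grained variant but no code for it. One way it could be written is sketched below, in C and with hypothetical names (chunk_worker, run_chunked, CHUNK). This is an assumption about the approach, not the author's implementation. Each lock acquisition hands out CHUNK consecutive tasks instead of one, so the mutex is taken far less often. A counter per task stands in for the actual sort, which makes it easy to check that every task is processed exactly once.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_TASKS 1000
#define NUM_THREADS 4
#define CHUNK 50  /* tasks handed out per lock acquisition */

static int next_task = 0;
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
static int done[NUM_TASKS];  /* how many times each task was processed */

static void *chunk_worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Claim the next chunk of tasks (serialized by the mutex). */
        pthread_mutex_lock(&task_lock);
        int begin = next_task;
        next_task += CHUNK;
        pthread_mutex_unlock(&task_lock);

        if (begin >= NUM_TASKS)
            break;  /* task pool is empty */
        int end = begin + CHUNK < NUM_TASKS ? begin + CHUNK : NUM_TASKS;
        for (int t = begin; t < end; t++)
            done[t]++;  /* stand-in for sorting array t */
    }
    return NULL;
}

/* Returns 1 if every task was processed exactly once. */
int run_chunked(void)
{
    pthread_t threads[NUM_THREADS];
    next_task = 0;
    for (int t = 0; t < NUM_TASKS; t++)
        done[t] = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        int errcode = pthread_create(&threads[i], NULL, chunk_worker, NULL);
        assert(errcode == 0);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    int all_once = 1;
    for (int t = 0; t < NUM_TASKS; t++)
        if (done[t] != 1)
            all_once = 0;
    return all_once;
}
```

Setting CHUNK to 1 gives the fine-grained scheme, and setting it to NUM_TASKS / NUM_THREADS approximates block partitioning, so the same sketch can be used to experiment with granularity.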