Measuring multi-threaded benchmark performance in .NET

Multi-threaded benchmark performance is a metric that measures how well a computer system or application executes work in a multi-threaded environment. It is often used to evaluate how efficiently a system processes tasks in parallel. A measurement typically creates multiple threads and runs concurrent work on them to simulate the parallel processing needs of real applications.

Here, we use multiple threads to run a simple counting task and use it to measure the multi-threaded benchmark performance of the system. Each of the five measurement programs below (Code 1, Code 4, Code 5, Code 6, Code 7) maintains a counter, and the number of counts per second reflects the performance of the system. By comparing these measurements, you can get an intuitive feel for multithreading, for how to fully utilize system performance through multithreading, and for the bottlenecks that can arise when running multi-threaded code.

Measurement methods

First, we measure multi-threaded benchmark performance with an example in which multiple threads increment a shared variable:

// Code 1: A simple multi-threaded measurement of multi-threaded benchmark performance
long totalCount = 0;
int threadCount = Environment.ProcessorCount;

Task[] tasks = new Task[threadCount];
for (int i = 0; i < threadCount; ++i)
{
    tasks[i] = Task.Run(DoWork);
}
while (true)
{
    long t = totalCount;
    Thread.Sleep(1000);
    Console.WriteLine($"{totalCount - t:N0}");
}
void DoWork()
{
    while (true)
    {
        totalCount++;
    }
}

// Results
48,493,031
48,572,321
47,788,843
48,128,734
50,461,679
……

In a multi-threaded environment, switching between threads has a cost, such as the time needed to save and restore thread context. If context switches happen frequently, they can distort the performance results. For that reason, the code above sets the number of test threads equal to the number of logical CPU cores in the system (Environment.ProcessorCount). Each of these threads repeatedly increments a shared variable.

Anyone with multi-threaded programming experience will notice that the code above does not protect the shared resource, so a race condition occurs. This can lead to inconsistent data, a nondeterministic order of operations, and performance results that cannot be reproduced consistently. The following code demonstrates the problem.

// Code 2: Demonstrating the race condition
long totalCount = 0;
int threadCount = Environment.ProcessorCount;

Task[] tasks = new Task[threadCount];
for (int i = 0; i < threadCount; ++i)
{
    tasks[i] = Task.Run(DoWork);
}
void DoWork()
{
    while (true)
    {
        totalCount++;
        Console.Write($"{totalCount}"+",");
    }
}
// Results
1,9,10,3,12,13,4,14,15,16……270035,269913,270037,270038,270036,270040,269987,270042,270043……

From the output of Code 2 we can see that, because multiple threads access and modify totalCount at the same time, the printed values of totalCount do not increase sequentially.

It can be seen that, without a thread synchronization mechanism, Code 1 cannot accurately measure multi-threaded benchmark performance.
C# provides several thread synchronization mechanisms. The traditional locking mechanisms (the lock statement, the Monitor class, the Mutex class, the Semaphore class, and so on) use mutual exclusion to protect shared resources: only one thread may access the resource at a time, which avoids race conditions. While a code block is locked, these mechanisms block other threads, so only one thread can execute the locked code at any moment.
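To make the relationship between these mechanisms concrete, the following is a minimal sketch (not part of the original measurements) of roughly what a lock statement expands to when the Monitor class is used directly; the names gate, counter and IncrementWithMonitor are illustrative:

// Roughly equivalent to: lock (gate) { counter++; }
object gate = new object();
long counter = 0;

void IncrementWithMonitor()
{
    bool lockTaken = false;
    try
    {
        Monitor.Enter(gate, ref lockTaken); // block until this thread owns the lock
        counter++;                          // only one thread executes this at a time
    }
    finally
    {
        if (lockTaken) Monitor.Exit(gate);  // always release, even if an exception was thrown
    }
}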
Here, the lock statement is used as the synchronization mechanism to modify the code above so that the shared variable can no longer be modified by multiple threads at the same time.

// Code 3: Using the lock statement
long totalCount = 0;
int threadCount = Environment.ProcessorCount;
object totalCountLock = new object();

Task[] tasks = new Task[threadCount];
for (int i = 0; i < threadCount; ++i)
{
    tasks[i] = Task.Run(DoWork);
}

void DoWork()
{
    while (true)
    {
        lock (totalCountLock)
        {
            totalCount++;
            Console.Write($"{totalCount}"+",");
        }
    }
}

// Results
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30……

This time the output increases sequentially.

Now we measure multi-threaded benchmark performance with the lock-based code:

// Code 4: Measuring multi-threaded benchmark performance with lock
long totalCount = 0;
int threadCount = Environment.ProcessorCount;
object totalCountLock = new object();

Task[] tasks = new Task[threadCount];
for (int i = 0; i < threadCount; ++i)
{
    tasks[i] = Task.Run(DoWork);
}
while (true)
{
    long t = totalCount;
    Thread.Sleep(1000);
    Console.WriteLine($"{totalCount - t:N0}");
}
void DoWork()
{
    while (true)
    {
        lock (totalCountLock)
        {
            totalCount++;
        }
    }
}

// Results
16,593,517
16,694,824
16,514,421
16,517,431
16,652,867
……

Another way to ensure thread safety in a multi-threaded environment is to use the atomic operations provided by the Interlocked class. Unlike traditional locking mechanisms (such as the lock statement), Interlocked offers special atomic operations, such as Increment, Decrement, Exchange, and CompareExchange, for operating on shared variables. These operations are carried out directly at the CPU instruction level without the blocking and mutual exclusion of traditional locks: the hardware guarantees that each operation on the shared variable is atomic, avoiding race conditions and inconsistent data. Interlocked is best suited to specific atomic operations rather than serving as a general-purpose thread synchronization mechanism.
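Before the benchmark itself, here is a minimal sketch (not part of the original measurements) showing how Interlocked.CompareExchange can build a custom atomic update, in this case tracking a maximum value without a lock; maxValue and UpdateMax are illustrative names:

long maxValue = 0;

void UpdateMax(long candidate)
{
    long current = Interlocked.Read(ref maxValue);
    while (candidate > current)
    {
        // Try to install the new maximum; if another thread changed the value first,
        // CompareExchange returns that newer value and we retry with it.
        long witnessed = Interlocked.CompareExchange(ref maxValue, candidate, current);
        if (witnessed == current) break; // our value was stored
        current = witnessed;
    }
}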

// Code 5: Measuring multi-threaded benchmark performance with atomic operations
long totalCount = 0;
int threadCount = Environment.ProcessorCount;

Task[] tasks = new Task[threadCount];
for (int i = 0; i < threadCount; ++i)
{
    tasks[i] = Task.Run(DoWork);
}

while (true)
{
    long t = totalCount;
    Thread.Sleep(1000);
    Console.WriteLine($"{totalCount - t:N0}");
}

void DoWork()
{
    while (true)
    {
        Interlocked.Increment(ref totalCount);
    }
}
// Results
37,230,208
43,163,444
43,147,585
43,051,419
42,532,695
……

Besides mutual-exclusion locks and atomic operations, we can also try to isolate the data of each thread. The ThreadLocal<T> class provides thread-local storage for data isolation in a multi-threaded environment: each thread gets its own independent copy of the data stored in a ThreadLocal<T> instance, and by default only that thread can access its copy.
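Before the benchmark, here is a minimal sketch (not part of the original measurements) of the isolation itself: the valueFactory passed to ThreadLocal<T> runs once per thread, so every thread that reads Value gets its own copy. The name local and the worker lambda are illustrative:

ThreadLocal<int> local = new ThreadLocal<int>(() => Environment.CurrentManagedThreadId);

Parallel.For(0, 4, _ =>
{
    // Each worker reads its own independent copy of the value.
    Console.WriteLine($"Thread {Environment.CurrentManagedThreadId}: local.Value = {local.Value}");
});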

// Code 6: Measuring multi-threaded benchmark performance with ThreadLocal
int threadCount = Environment.ProcessorCount;

Task[] tasks = new Task[threadCount];
ThreadLocal<long> count = new ThreadLocal<long>(trackAllValues: true);

for (int i = 0; i < threadCount; ++i)
{
    int threadId = i;
    tasks[i] = Task.Run(() => DoWork(threadId));
}

while (true)
{
    long old = count.Values.Sum();
    Thread.Sleep(1000);
    Console.WriteLine($"{count.Values.Sum() - old:N0}");
}

void DoWork(int threadId)
{
    while (true)
    {
        count.Value++;
    }
}

// Results
177,851,600
280,076,173
296,359,986
296,140,821
295,956,535
……

The code above uses the ThreadLocal<T> class. Alternatively, we can define our own class and give each thread its own object as a per-thread context. The code is as follows:

// Code 7: Measuring multi-threaded benchmark performance with a custom per-thread context
int threadCount = Environment.ProcessorCount;

Task[] tasks = new Task[threadCount];
Context[] ctxs = new Context[threadCount];

for (int i = 0; i < threadCount; ++i)
{
    int threadId = i;
    ctxs[i] = new Context();
    tasks[i] = Task.Run(() => DoWork(threadId));
}

while (true)
{
    long old = ctxs.Sum(v => v.TotalCount);
    Thread.Sleep(1000);
    Console.WriteLine($"{ctxs.Sum(v => v.TotalCount) - old:N0}");
}

void DoWork(int threadId)
{
    while (true)
    {
        ctxs[threadId].TotalCount++;
    }
}

class Context
{
    public long TotalCount = 0;
}

// Results:
1,067,502,570
1,100,966,648
1,145,726,019
1,110,163,963
1,069,322,606
……

System Configuration

Component          Specification
CPU                11th Gen Intel(R) Core(TM) i5-11300H
Memory             16 GB DDR4
Operating system   Microsoft Windows 10 Home Chinese Edition
Power options      High performance
Software           LINQPad 7.8.5 Beta
Runtime            .NET 7.0.10

Measurement results

Measurement method                Counts in 1 second   Performance percentage
No thread synchronization         50,461,679           118.6%
lock                              16,652,867           39.2%
Atomic operations (Interlocked)   42,532,695           100%
ThreadLocal                       295,956,535          695.8%
Custom context (Context)          1,069,322,606        2514.1%

Result analysis

The result measured without thread synchronization is inaccurate and cannot be used as a valid baseline.

From the measured results we can see that the traditional lock mechanism is not efficient. Using the atomic operations of Interlocked is roughly 2.5 times as fast as using a traditional lock.
The two approaches that isolate data between threads are both far more efficient than the previous methods, and the program that uses a custom per-thread context is the most efficient of all.

The main difference between the two thread-isolation programs lies in how the isolation is implemented. Code 6 uses the ThreadLocal<T> class, while Code 7 uses a custom context, with an array that provides a dedicated context object for each thread. ThreadLocal<T> is built on thread-local storage (TLS): the variable is visible to every running thread, but the value each thread sees is private to that thread. This property makes ThreadLocal<T> very useful in multi-threaded programming, but to provide it the class implements a fairly complex mechanism internally, for example maintaining per-thread slot tables and tracking structures to store each thread's data. These internal implementation details add corresponding computation and access overhead.

Code 7, on the other hand, creates an array of Context objects; each thread has its own Context object and modifies only that object while it runs. Since every thread manages its own Context, there is no contention between threads, which removes a great deal of extra overhead.

So although Code 6 and Code 7 both isolate data per thread, Code 7 avoids the additional overhead of ThreadLocal<T> and therefore performs better.

Conclusion

If data can be isolated between threads, the efficiency of multi-threaded code improves dramatically, and the measurement comes much closer to the maximum performance the system can deliver.
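As a closing illustration (a sketch of the same idea, not one of the measurements above), the base class library builds this thread-isolation pattern directly into Parallel.For: the overload with localInit and localFinally gives each worker its own subtotal, and the shared total is touched only once per worker when the subtotals are merged.

long total = 0;

Parallel.For(0, 10_000_000,
    () => 0L,                                          // localInit: each worker starts with its own subtotal
    (i, state, subtotal) => subtotal + 1,              // body: no shared state is touched here
    subtotal => Interlocked.Add(ref total, subtotal)); // localFinally: merge once per worker

Console.WriteLine($"{total:N0}"); // prints 10,000,000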

Origin blog.csdn.net/qq_41221596/article/details/132920480