High Throughput Stream Processing in Rust

This article introduces the concepts, methods, and optimizations of stream processing in Rust. The author covers the basic ideas of stream processing and the commonly used stream-processing libraries in Rust, and then builds a stream-processing program with them.

Finally, the author shows how to optimize the performance of a stream-processing program by measuring its idle and blocked times.


The author also provides some other optimization suggestions, such as:

  • In a real system, consideration should be given to pinning threads to CPU cores or using a version of green threads to reduce context switching.
  • When processing a stream, it is often necessary to allocate memory for the results. Memory allocation is expensive, so in future articles the author will introduce ways to optimize it.

First, let's look at the stream traits in synchronous and asynchronous Rust.

1. Stream traits in synchronous and asynchronous Rust

In synchronous Rust, the core stream abstraction is Iterator. It provides a method for producing the items of a sequence, blocking between them, and iterators compose by passing one iterator into another's constructor. This lets us connect pieces together effortlessly.
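As a minimal sketch of that composition, each adapter wraps the iterator before it, and nothing runs until the chain is consumed:

```rust
// Composing iterators in synchronous Rust.
fn main() {
    let numbers = 0..10;                        // an iterator over 0..=9
    let evens = numbers.filter(|n| n % 2 == 0); // wraps the previous iterator
    let doubled = evens.map(|n| n * 2);         // wraps it again
    // Iteration is lazy: nothing runs until the chain is consumed.
    let result: Vec<i32> = doubled.collect();
    println!("{:?}", result); // [0, 4, 8, 12, 16]
}
```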

In asynchronous Rust, the core stream abstraction is Stream. It behaves very similarly to an Iterator; however, instead of blocking between each item, it allows other tasks to run while it waits.
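As a minimal sketch (assuming the `futures` crate and its bundled executor), consuming a Stream looks much like consuming an Iterator, except each item is awaited:

```rust
// Consuming a Stream in asynchronous Rust, assuming the `futures` crate.
use futures::executor::block_on;
use futures::stream::{self, StreamExt};

fn main() {
    block_on(async {
        let mut items = stream::iter(0..5).map(|n| n * 2);
        // `.next().await` lets other tasks run while we wait for an item,
        // instead of blocking the whole thread.
        while let Some(n) = items.next().await {
            println!("{n}");
        }
    });
}
```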

Synchronous Rust's Read and Write traits correspond to AsyncRead and AsyncWrite in asynchronous Rust. These traits represent unparsed bytes, which often come straight from the I/O layer (e.g. from a socket or a file).


Rust streams incorporate some of the best features of other languages; for example, they sidestep legacy problems seen in Node.js's Duplex streams by leveraging Rust's trait system, and they implement backpressure thanks to lazy iteration, which greatly improves efficiency. Most importantly, Rust streams make asynchronous iteration feel the same as synchronous iteration.

There's a lot to love about Rust streams, although some kinks remain to be ironed out.

2. General overview: What is stream processing?

Now that we've seen the stream traits in synchronous and asynchronous Rust, let's look at what "stream processing" actually is.

"Stream processing" is an important big data processing method, and its main feature is that the processed data is continuous and arrives in real time.

In technology companies of all sizes, stream processing is used to analyze and process specific events, often in distributed systems.

Some fields use stream processing heavily, including video processing and high-frequency trading. Newer blockchains are also a source of architectural inspiration here, since a blockchain has to handle streams of transactions, metadata, and so on.

Today, you can rent an AWS instance with 100+ CPU cores, 100GB of memory, multiple GPUs, and 100Gbps of bandwidth, without operating a distributed system of many nodes.

Now, let's see what stream processing looks like in Rust:

3. An example: a program that hashes 1 billion numbers

Now, let's write a program that computes SHA512 and BLAKE3 hashes of 1 billion numbers! You can imagine that the numbers represent trades, analytics events, or price signals, and that the hashing stands in for arbitrary transformations on those inputs.

Here is a single-threaded solution:

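(A sketch assuming the `sha2` and `blake3` crates; details may differ from the original code.)

```rust
// Single-threaded version: hash every number with both functions.
use sha2::{Digest, Sha512};

const N: u64 = 1_000_000_000;

fn main() {
    for i in 0..N {
        let bytes = i.to_le_bytes();
        // black_box keeps the compiler from optimizing the unused
        // results away.
        std::hint::black_box(Sha512::digest(bytes));
        std::hint::black_box(blake3::hash(&bytes));
    }
    println!("hashed {N} numbers");
}
```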

When I run this in release mode on a Digital Ocean droplet with 16 dedicated CPU cores, it takes just over 6 minutes.

[Output: total runtime just over 6 minutes]

1. Channel

Now, let's rewrite this program using "stream processing". Instead of doing the hashing in a single loop, we'll set up a pipeline of threads to do the hashing in parallel and then collect the results.

A local stream that sends data between two threads is called a channel. Our new program will spawn four threads. A generator thread will generate numbers and send each one to two different hashing threads. Each hashing thread will read the numbers, hash them, and send its output to a result thread. The diagram below shows the architecture:

[Diagram: a generator thread feeds a SHA512 thread and a BLAKE3 thread, which both feed a result thread]

We'll send and receive data using the mpsc channel from the standard library. mpsc stands for "multiple producer, single consumer": you can send data into the channel from multiple threads, but only one thread can receive from it. We won't use the multi-producer capability here, but it's important to know about.

It's still a fairly simple program:

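(A hedged sketch of the four-thread pipeline; channel names are illustrative, and the `sha2` and `blake3` crates are assumed.)

```rust
// One generator, two hashers, one result collector, wired with mpsc.
use std::sync::mpsc;
use std::thread;

use sha2::{Digest, Sha512};

const N: u64 = 1_000_000_000;

fn main() {
    let (gen_to_sha, sha_in) = mpsc::channel::<u64>();
    let (gen_to_blake, blake_in) = mpsc::channel::<u64>();
    let (sha_out, sha_results) = mpsc::channel::<[u8; 64]>();
    let (blake_out, blake_results) = mpsc::channel::<[u8; 32]>();

    // Generator: send every number to both hashing threads.
    let generator = thread::spawn(move || {
        for i in 0..N {
            gen_to_sha.send(i).unwrap();
            gen_to_blake.send(i).unwrap();
        }
    });

    // SHA512 hashing thread.
    let sha = thread::spawn(move || {
        for i in sha_in {
            let digest = Sha512::digest(i.to_le_bytes());
            let mut out = [0u8; 64];
            out.copy_from_slice(&digest);
            sha_out.send(out).unwrap();
        }
    });

    // BLAKE3 hashing thread.
    let blake = thread::spawn(move || {
        for i in blake_in {
            blake_out.send(*blake3::hash(&i.to_le_bytes()).as_bytes()).unwrap();
        }
    });

    // Result thread: read one output from each hasher per input.
    let result = thread::spawn(move || {
        for _ in 0..N {
            let _sha = sha_results.recv().unwrap();
            let _blake = blake_results.recv().unwrap();
        }
    });

    for handle in [generator, sha, blake, result] {
        handle.join().unwrap();
    }
}
```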

The output is as follows:

[Output: roughly twice the single-threaded runtime]

Oh no! The new version with channels takes twice as long. What went wrong?

2. Ring buffer

You could investigate with flame graphs, but I'll save you the time!

Every channel library, no matter how lightweight, incurs some overhead, and the gains from parallelization must outweigh that cost. The bottleneck in this case is the channel send() and recv() calls. Rust's standard library mpsc channel is relatively slow, and there are faster alternatives such as crossbeam-channel.

To compare, I benchmarked 4 different channel libraries; the results are as follows:

[Benchmark results across the 4 channel libraries: ringbuf and rtrb come out fastest]

ringbuf and rtrb are clearly the fastest. Their ring buffers are lock-free and act as "single producer, single consumer" (SPSC) queues: exactly one thread puts data into the queue and exactly one thread takes data out, which carries less overhead than a multi-producer queue.

These libraries are also non-blocking: if you try to push when the queue is full, you get an error back instead of blocking, and the same goes for popping from an empty queue.
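A quick illustration of that non-blocking behavior, as a minimal sketch assuming the rtrb crate:

```rust
// Full and empty queues return errors immediately instead of blocking.
use rtrb::RingBuffer;

fn main() {
    let (mut producer, mut consumer) = RingBuffer::<u64>::new(2);
    producer.push(1).unwrap();
    producer.push(2).unwrap();
    assert!(producer.push(3).is_err()); // full: errors, doesn't block
    assert_eq!(consumer.pop().unwrap(), 1);
    producer.push(3).unwrap(); // room again
    consumer.pop().unwrap();
    consumer.pop().unwrap();
    assert!(consumer.pop().is_err()); // empty: errors, doesn't block
}
```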

To use these ring-buffer libraries, I added spin loops that keep retrying while the queue is full or empty. It turns out this is also the approach used in high-frequency trading architectures.

I also found that adding very short sleeps while waiting improved overall performance. This may be because the CPU throttles itself when core usage hits 100% or when it exceeds certain temperatures.

Here are the new pop() and push(value) helpers:

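(A sketch of their likely shape, assuming the rtrb crate; the 100 ns sleep is an illustrative value.)

```rust
// Spin-with-sleep wrappers around rtrb's non-blocking push/pop.
use std::thread;
use std::time::Duration;

use rtrb::{Consumer, Producer, PushError};

/// Pop a value, retrying (with a tiny sleep) while the queue is empty.
fn pop<T>(consumer: &mut Consumer<T>) -> T {
    loop {
        match consumer.pop() {
            Ok(value) => return value,
            // Queue is empty: back off briefly, then retry.
            Err(_) => thread::sleep(Duration::from_nanos(100)),
        }
    }
}

/// Push a value, retrying (with a tiny sleep) while the queue is full.
fn push<T>(producer: &mut Producer<T>, mut value: T) {
    loop {
        match producer.push(value) {
            Ok(()) => return,
            // Queue is full: take the value back, back off, retry.
            Err(PushError::Full(v)) => {
                value = v;
                thread::sleep(Duration::from_nanos(100));
            }
        }
    }
}
```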

Here is the pipeline rewritten with these new helpers:

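(A hedged sketch reusing the pop()/push() helpers above; it assumes rtrb 0.2, where RingBuffer::new returns a producer/consumer pair, plus the `sha2` and `blake3` crates. The queue capacity is illustrative.)

```rust
// The same four-thread pipeline, rebuilt on SPSC ring buffers.
use std::thread;

use rtrb::RingBuffer;
use sha2::{Digest, Sha512};

const N: u64 = 1_000_000_000;
const CAPACITY: usize = 1024; // illustrative queue size

fn main() {
    let (mut to_sha, mut sha_in) = RingBuffer::<u64>::new(CAPACITY);
    let (mut to_blake, mut blake_in) = RingBuffer::<u64>::new(CAPACITY);
    let (mut sha_out, mut sha_results) = RingBuffer::<[u8; 64]>::new(CAPACITY);
    let (mut blake_out, mut blake_results) = RingBuffer::<[u8; 32]>::new(CAPACITY);

    let generator = thread::spawn(move || {
        for i in 0..N {
            push(&mut to_sha, i);
            push(&mut to_blake, i);
        }
    });

    let sha = thread::spawn(move || {
        for _ in 0..N {
            let i = pop(&mut sha_in);
            let digest = Sha512::digest(i.to_le_bytes());
            let mut out = [0u8; 64];
            out.copy_from_slice(&digest);
            push(&mut sha_out, out);
        }
    });

    let blake = thread::spawn(move || {
        for _ in 0..N {
            let i = pop(&mut blake_in);
            push(&mut blake_out, *blake3::hash(&i.to_le_bytes()).as_bytes());
        }
    });

    let result = thread::spawn(move || {
        for _ in 0..N {
            let _sha = pop(&mut sha_results);
            let _blake = pop(&mut blake_results);
        }
    });

    for handle in [generator, sha, blake, result] {
        handle.join().unwrap();
    }
}
```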

It's faster than before, but not by much, so let's take parallelism to another level.

3. More parallelization

Currently, we create two threads for hashing: one for SHA512 and one for BLAKE3. The slower of the two is the bottleneck of the pipeline. To demonstrate this, I re-ran the original single-threaded example using only SHA512 hashes; here are the results:

[Output: single-threaded SHA512-only runtime]

This is very close to the performance of the parallel hashing example, which means that most of the overall hashing time is spent in SHA512.

So what if we create more threads per hash function and hash multiple numbers at once? Let's try it, starting with 2 SHA512 hashing threads and 2 BLAKE3 hashing threads.

4. Visualization

Each thread gets its own input and output queues. We'll send the generated numbers to each thread in round-robin order and read the results back in the same order.

[Diagram: the generator deals numbers round-robin across the SHA512 and BLAKE3 worker queues; the result thread reads them back in the same order]

This ensures the stream's ordering is preserved at the result thread; if ordering doesn't matter, or if message processing times vary widely, other scheduling mechanisms may work better.

Here is the round-robin scheduling code:

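(A sketch of the dispatch and collection sides, reusing the pop()/push() helpers; the function names are illustrative.)

```rust
use rtrb::{Consumer, Producer};

/// Generator side: deal the numbers round-robin across each pool of
/// worker input queues.
fn generate(n: u64, sha_inputs: &mut [Producer<u64>], blake_inputs: &mut [Producer<u64>]) {
    for i in 0..n {
        push(&mut sha_inputs[i as usize % sha_inputs.len()], i);
        push(&mut blake_inputs[i as usize % blake_inputs.len()], i);
    }
}

/// Result side: read outputs in the same round-robin order, so the
/// stream's ordering is preserved.
fn collect(n: u64, sha_outputs: &mut [Consumer<[u8; 64]>], blake_outputs: &mut [Consumer<[u8; 32]>]) {
    for i in 0..n {
        let _sha = pop(&mut sha_outputs[i as usize % sha_outputs.len()]);
        let _blake = pop(&mut blake_outputs[i as usize % blake_outputs.len()]);
    }
}
```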

The new code is more complex; here is part of it:

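(One piece of the added complexity is spawning a pool of workers, each with its own queues. A sketch, again reusing the helpers above; names and capacity are illustrative.)

```rust
use std::thread::{self, JoinHandle};

use rtrb::{Consumer, Producer, RingBuffer};
use sha2::{Digest, Sha512};

const CAPACITY: usize = 1024; // illustrative queue size

/// Spawn `count` SHA512 workers; returns their input producers,
/// output consumers, and join handles.
fn spawn_sha_workers(
    count: usize,
    items_per_worker: u64,
) -> (Vec<Producer<u64>>, Vec<Consumer<[u8; 64]>>, Vec<JoinHandle<()>>) {
    let mut inputs = Vec::new();
    let mut outputs = Vec::new();
    let mut handles = Vec::new();
    for _ in 0..count {
        let (input, mut rx) = RingBuffer::<u64>::new(CAPACITY);
        let (mut tx, output) = RingBuffer::<[u8; 64]>::new(CAPACITY);
        handles.push(thread::spawn(move || {
            for _ in 0..items_per_worker {
                let i = pop(&mut rx);
                let digest = Sha512::digest(i.to_le_bytes());
                let mut out = [0u8; 64];
                out.copy_from_slice(&digest);
                push(&mut tx, out);
            }
        }));
        inputs.push(input);
        outputs.push(output);
    }
    (inputs, outputs, handles)
}
```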

Let's see how it does now. The output is as follows:

[Output: runtime with 2 SHA512 and 2 BLAKE3 threads]

Much better indeed!

5. Measure "idle" and "blocked" times

How many threads should there be per hash function? In more complex systems this is difficult to determine and may even be dynamic.

In fact, there is a technique that helps a lot in stream processing: measuring the idle and blocked times over some time window.

  • Idle time

Time spent waiting for an empty queue to receive a message

  • Blocked time

Time spent waiting for a full queue to send output

Idle time is the time spent spinning in pop(), and blocked time is the time spent spinning in push(). I modified these two functions to track elapsed time; the instrumentation adds very little overhead:

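(A sketch of one way to do it, using global atomic counters for brevity; per-thread counters would let the stats thread report each thread separately, as the output below does.)

```rust
// Instrumented helpers: accumulate time spent waiting in pop()/push().
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

use rtrb::{Consumer, Producer, PushError};

static IDLE_NANOS: AtomicU64 = AtomicU64::new(0);    // spinning in pop()
static BLOCKED_NANOS: AtomicU64 = AtomicU64::new(0); // spinning in push()

fn pop<T>(consumer: &mut Consumer<T>) -> T {
    let mut waited: Option<Instant> = None;
    loop {
        if let Ok(value) = consumer.pop() {
            if let Some(start) = waited {
                IDLE_NANOS.fetch_add(start.elapsed().as_nanos() as u64, Ordering::Relaxed);
            }
            return value;
        }
        // Start the clock on the first failed attempt only.
        waited.get_or_insert_with(Instant::now);
        thread::sleep(Duration::from_nanos(100));
    }
}

fn push<T>(producer: &mut Producer<T>, mut value: T) {
    let mut waited: Option<Instant> = None;
    loop {
        match producer.push(value) {
            Ok(()) => {
                if let Some(start) = waited {
                    BLOCKED_NANOS.fetch_add(start.elapsed().as_nanos() as u64, Ordering::Relaxed);
                }
                return;
            }
            Err(PushError::Full(v)) => {
                value = v;
                waited.get_or_insert_with(Instant::now);
                thread::sleep(Duration::from_nanos(100));
            }
        }
    }
}
```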

I also created a new thread to tally these times; the output is as follows:

[Output: per-thread idle and blocked time percentages]

We can see that the sha512 thread is neither idle nor blocked; it is 100% active. This tells us we can speed the system up by increasing the number of sha512 threads.

NOTE: A "Heisenberg uncertainty principle" problem can arise here: measuring the system's behavior changes its performance. If this happens, look into coarse timing libraries; an approximate timing measurement is usually sufficient.

Through trial and error on the Digital Ocean instance, we found the optimum to be 8 SHA512 threads and 4 BLAKE3 threads.

[Output: runtime with 8 SHA512 and 4 BLAKE3 threads]

Result: less than 1/6 of the initial time.

4. Next steps: allocating memory for stream processing results

In this article, we introduced the concepts, methods, and optimizations of stream processing in Rust through concrete examples, but many details remain to be discussed. In a real system, we should also consider pinning threads to CPU cores to reduce context switching.

Also, when processing a stream, you often need to allocate memory for the results. Allocation is expensive, so in future articles we will discuss some strategies for it.

