Background
The Ceph community recently froze the upcoming Ceph Reef release, and today we look at Reef's RGW performance on a 10 node, 60 NVMe drive cluster.
We deployed 20 RGW instances and 200 hsbench S3 clients to run highly parallel workloads against 512 buckets.
In many tests, we found that Reef is typically 1-5% faster than Quincy.
At 3X replication, 4MB GETs reached about 53GB/s and 4MB PUTs about 18.6GB/s.
In both cases we were bottlenecked by the network. Reef also achieved about 312K 4KB GET/s and 178K 4KB PUT/s at 3X replication.
In the large object tests, CPU usage was equal across the two releases. In the 4KB PUT and GET tests, however, Reef used up to 23% more CPU than Quincy. We'll keep an eye on this as the Reef release approaches.
Finally, we noticed that the split of total CPU usage between the OSD and RGW processes varies significantly with the type of workload being run, which has real implications for hardware selection.
Introduction
In Part 1 of this series (https://ceph.io/en/news/blog/2023/reef-freeze-rbd-performance/), we looked at how to use the CBT (https://github.com/ceph/cbt) tool to see how RBD performance varies across Ceph versions.
In Part 1, CBT used Jens Axboe's fio (https://github.com/axboe/fio) tool to run performance tests against RBD images. CBT can also drive S3 benchmark tools to test RGW.
We have used this capability for performance testing in previous Ceph releases. By combining CBT, the hsbench (https://github.com/markhpc/hsbench.git) S3 benchmark, and git bisect, we found two commits that hurt RGW performance and efficiency in the Ceph Pacific release.
Once these code issues were identified, Ceph developers were able to quickly implement fixes, bringing performance and CPU usage back to the levels we were seeing in Nautilus.
In this article, we will use the same tools to test the performance of Reef now that it has been frozen.
Cluster Setup
| Nodes | 10 x Dell PowerEdge R6515 |
| --- | --- |
| CPU | 1 x AMD EPYC 7742 64C/128T |
| Memory | 128GiB DDR4 |
| Network | 1 x 100GbE Mellanox ConnectX-6 |
| NVMe | 6 x 3.84TB Samsung PM983 |
| OS Version | CentOS Stream release 8 |
| Ceph Version 1 | Quincy v17.2.5 (built from source) |
| Ceph Version 2 | Reef 9d5a260e (built from source) |
All nodes are connected to the same Juniper QFX5200 switch via a single 100GbE QSFP28 link. We deployed Ceph and launched the tests using CBT (https://github.com/ceph/cbt/).
Unless otherwise specified, each node hosts 6 OSDs.
On Intel-based systems, the "latency-performance" or "network-latency" tuned profiles can be applied to avoid latency spikes caused by CPU C/P state transitions.
Tuning for AMD Rome based systems hasn't changed much in this regard, and we haven't confirmed that tuned actually limits C/P state transitions there, but for these tests the tuned profile was nevertheless set to "network-latency".
Test Setup
CBT requires several parameters to be tuned for the deployed Ceph cluster. Each OSD was given an 8GB memory target, and msgr V1 was used with cephx disabled. No special tuning was applied to RGW. hsbench (https://github.com/markhpc/hsbench.git) was used for S3 testing.
Each node starts 20 hsbench processes, each connected to a different RGW instance.
As a result, every RGW process has one hsbench process connecting to it from each node (200 hsbench processes in total). Each hsbench process was configured to use 64 threads and a common pool of 512 buckets.
Ceph background processes such as scrub, deep scrub, PG autoscaling, and PG balancing were also disabled. 3X replication was used for all RGW pools.
The data and index pools each have 8192 PGs, while the root, control, meta, and log pools each have 64 PGs. Two separate test cases were run for quincy and reef using 2 million 4MiB objects and 200 million 4KiB objects respectively.
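As a rough sketch, the daemon-level settings described above (8GB OSD memory target, msgr V1 only, cephx disabled) might look like the following in ceph.conf. This is an illustration using standard Ceph option names, not the exact configuration CBT generated for these tests:

```ini
[global]
# Use only the v1 messenger (msgr V1)
ms_bind_msgr2 = false
# Disable cephx authentication
auth_cluster_required = none
auth_service_required = none
auth_client_required = none

[osd]
# 8GiB memory target per OSD
osd_memory_target = 8589934592
```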
Each test case runs in the following order:
| Test | Description |
| --- | --- |
| cxi | Clear all existing objects from buckets, delete buckets, initialize buckets |
| p | Put objects into buckets |
| l | List objects in buckets |
| g | Get objects from buckets |
| d | Delete objects from buckets |
4MB PUT
Reef was slightly faster than Quincy, maintaining a roughly 2% lead across all three test iterations. RGW and OSD CPU usage was similar for both releases.
Surprisingly, both Quincy and Reef showed RGW CPU usage of around 5 cores at the start of each test, doubling to about 10 cores after roughly 2-3 minutes.
In Part 1 of this series (https://ceph.io/en/news/blog/2023/reef-freeze-rbd-performance/), we looked at 4MB sequential write throughput and found that with librbd we could sustain around 25GB/s (75GB/s counting replication).
Why do we see only about 18.5GB/s for 4MB PUTs here? A large part of the answer is that the data is transferred twice: once from the client to RGW, and again from RGW to the OSDs. Some of these transfers land on RGW instances or OSDs located on the same node.
Replication, however, cannot be avoided: there are always at least 2 additional transfers from the primary OSD to the secondary OSDs, which must land on other nodes.
Probability of 2 transfers = 1/10 * 1/10 = 1/100
Probability of 4 transfers = 9/10 * 9/10 = 81/100
Probability of 3 transfers = 1 - 1/100 - 81/100 = 18/100
The general best-case performance we should expect, relative to maximum network throughput, is:
100 / (2 * (1/100) + 4 * (81/100) + 3 * (18/100)) = 26.31%
In the librbd case, 3X replication always results in 3 transfers (1 from the client to the primary OSD, then 1 each to the two secondary OSDs):
100 / 3 = 33.3%
Multiply the above result by our expected performance:
25GB/s * 26.31/33.3 = 19.75GB/s
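The arithmetic above can be checked with a few lines of Python (a sketch of the reasoning, not part of the benchmark tooling):

```python
# Expected network transfers per 4MB PUT at 3X replication, per the
# reasoning above: the client->RGW and RGW->primary-OSD hops each stay
# node-local with probability 1/10, while the two replica transfers
# always cross the network.
p_local = 1 / 10

p2 = p_local * p_local       # both hops local  -> 2 transfers
p4 = (1 - p_local) ** 2      # both hops remote -> 4 transfers
p3 = 1 - p2 - p4             # one hop remote   -> 3 transfers

expected = 2 * p2 + 4 * p4 + 3 * p3
print(expected)              # ~3.8 transfers on average
print(1 / expected)          # ~26.3% of aggregate line rate

# librbd always does 3 transfers, so scale its 25GB/s result:
print(25 * 3 / expected)     # ~19.7GB/s expected best case
```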
It looks like we're doing a little worse than RBD, but once you factor in the extra network overhead, it's not too far off.
4MB LIST
The completion time of the LIST test is determined entirely by the number of objects. Since we only have 2M objects in total spread across 512 buckets, the per-bucket LIST time is very short. Reef is again as fast as or slightly faster than Quincy.
4MB GET
In this test, the CPU usage of the RGW and OSD processes was very consistent. As with the PUT test, we must consider how many times the data has to cross the network.
With replication, data is read once from the primary OSD to RGW and then sent once from RGW to the client.
Probability of 0 transfers = 1/10 * 1/10 = 1/100
Probability of 2 transfers = 9/10 * 9/10 = 81/100
Probability of 1 transfer = 1 - 1/100 - 81/100 = 18/100
100 / (0 * (1/100) + 2 * (81/100) + 1 * (18/100)) = 55.6%
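This arithmetic can be verified directly (a sketch of the reasoning above, not part of the benchmark tooling):

```python
# Expected network transfers per 4MB GET at 3X replication: reads come
# only from the primary OSD, so the OSD->RGW and RGW->client hops each
# stay node-local with probability 1/10.
p_local = 1 / 10

p0 = p_local * p_local       # both hops local  -> 0 transfers
p2 = (1 - p_local) ** 2      # both hops remote -> 2 transfers
p1 = 1 - p0 - p2             # one hop remote   -> 1 transfer

expected = 0 * p0 + 2 * p2 + 1 * p1
print(1 / expected)          # ~0.556, i.e. ~55.6% of aggregate line rate
```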
Here we maintain ~51GB/s for Quincy and ~53GB/s for Reef. That's pretty close to the maximum these 100GbE NICs should be able to deliver.
4MB DELETE
The DELETE test "throughput" numbers here seem surprisingly high, but DELETE performance depends primarily on the number of objects being deleted rather than object size, and the actual deletion process is asynchronous. Having said that, Reef again seems to be slightly faster than Quincy on average.
In the 3rd test iteration, both appeared to have faster deletes. The reason for this is unclear.
4KB PUT
As in the previous test, Reef was slightly faster than Quincy. OSD CPU usage was about the same for both, but RGW CPU usage appears to be about 15% higher in Reef.
On the bright side, Quincy's RGW CPU consumption spiked at the end of each test iteration, and this did not happen in Reef.
An important point here is that both the OSDs and RGW are working hard to sustain roughly 180K PUT/s. Each PUT requires 3 round trips (2 synchronous, 1 asynchronous) from RGW to the primary OSD to properly write the data and keep the bucket index in sync.
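To put that amplification in perspective, a back-of-envelope calculation (assuming the round trips spread evenly across the 60 OSDs; these numbers are illustrative, not measured):

```python
# Rough OSD-side load implied by ~180K S3 PUT/s when each PUT costs
# 3 round trips to the primary OSD (2 synchronous, 1 asynchronous).
puts_per_sec = 180_000
trips_per_put = 3
osds = 60

primary_ops = puts_per_sec * trips_per_put
print(primary_ops)           # 540000 primary ops/s cluster-wide
print(primary_ops // osds)   # 9000 primary ops/s per OSD, if spread evenly
```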
4KB LIST
Compared with the 4MB test, the far larger object count makes bucket LISTs take much longer. Reef is only slightly faster on the first iteration; otherwise Quincy and Reef are on par.
4KB GET
First, GET performance in Reef is faster and more stable than in Quincy. That said, there is some very interesting behavior here.
In Quincy, the first ~2 minutes of the test were dominated by extremely high RGW CPU usage. Once that subsided, steady-state RGW CPU usage was much lower than Reef's, and OSD CPU usage was lower as well.
In this test, Reef used roughly 10-20% more CPU on the RGW side and about 17-23% more on the OSD side. Reef's behavior, however, was more consistent:
while its RGW CPU usage was elevated at the start of the test, it was nowhere near the swing seen in the Quincy runs.
4KB DELETE
Finally, Reef was on average 5% faster at deleting 4K objects. Again, this is an asynchronous process, so the results may not match the speed at which RGW removes content in the background.
Conclusion
These tests demonstrate only a small subset of RGW's performance characteristics. We didn't look at single-operation latency in this article, or try different clients to test it. Nonetheless, in our testing, we saw two trends:
1. Reef outperforms Quincy by roughly 1-5%; the gain is fairly consistent across tests and shows up in most test iterations.
2. In some tests, especially the 4KB GET test, Reef's CPU usage is higher than Quincy's. This is most evident in RGW, but we also saw it in the OSDs. We hope to follow up on this before Reef is released.
There are a few additional things to note about these results.
In previous tests, we used 10 RGW instances for a 60 NVMe cluster.
In these tests, we used 20 RGW instances and saw significantly higher small-object PUT and GET performance than before. Increasing the RGW count further, perhaps to 1 RGW instance per 2 OSDs or even 1 per OSD, might yield even better small-object performance.
The second interesting thing is that the CPU consumption varies a lot across these tests, and the ratio of CPU consumption between the RGW and OSD processes also changes.
If we look at the Reef results and calculate the approximate number of cores used by RGW and OSD, we get:
4MB
| Test | Result | Total RGW Cores | Total OSD Cores | Total Cores | RGW/OSD Core Ratio |
| --- | --- | --- | --- | --- | --- |
| 4MB PUT | 18.6GB/s | 184 Cores | 152 Cores | 336 Cores | 6/5 |
| 4MB GET | 53GB/s | 88 Cores | 55 Cores | 143 Cores | 8/5 |
4KB
| Test | Result | Total RGW Cores | Total OSD Cores | Total Cores | RGW/OSD Core Ratio |
| --- | --- | --- | --- | --- | --- |
| 4KB PUT | 178K IOPS | 269 Cores | 475 Cores | 744 Cores | 11/20 |
| 4KB GET | 312K IOPS | 302 Cores | 102 Cores | 404 Cores | 3/1 |
Beyond the raw core counts, there is another takeaway from these results, and it concerns running Ceph in containers.
Recent Ceph releases have a large number of external dependencies, which makes package management troublesome. Beyond that, many users already manage their applications with containers and want to deploy Ceph the same way.
One motivation for this is giving each Ceph daemon a static resource allocation at the container level: a fixed amount of memory and a fixed number of cores. This is where things get tricky.
We'll discuss memory quotas another time, but in short: guaranteeing memory usage in applications not designed for it from the ground up is extremely difficult.
What Ceph can do by default is have its processes monitor their mapped memory and automatically adjust non-essential memory usage; Ceph includes low-level memory autotuning and cache management code for exactly this purpose.
Limiting CPU cores usually does not cause the same problems as limiting memory. In these tests, however, the RGW/OSD core ratio differed by nearly 6X between the PUT and GET workloads.
If we sized containers for only one scenario, we would either hurt the other scenario or leave a lot of cores sitting idle at different points in time.
If more 4K GET throughput were needed, perhaps some future version of Ceph could start and stop RGW containers on demand.
For now, flexibility is what lets Ceph RGW deliver both high performance and high efficiency.