RGW Performance in the New Ceph Reef Release


Background


The Ceph community recently froze the upcoming Ceph Reef release, and today we look at Reef's RGW performance on a 10-node, 60-NVMe-drive cluster.

We deployed 20 RGW instances and 200 hsbench S3 clients to execute highly parallel workloads in 512 buckets.

In many tests, we found that Reef is usually 1-5% faster than Quincy.

With 3X replication, 4MB GETs ran at about 53GB/s and 4MB PUTs at about 18.6GB/s.

In both cases we were limited by the network. Reef also achieved about 312K 4KB GET/s and 178K 4KB PUT/s at 3X replication.

In the large-object tests, CPU usage was roughly equal between the two releases. In the 4KB PUT and GET tests, however, Reef used up to 23% more CPU than Quincy. We'll keep an eye on this as the Reef release approaches.

Finally, we noticed that the split of total CPU usage between the OSD and RGW processes varies significantly depending on the workload, which has real implications for hardware selection.

Introduction


In Part 1 of this series (https://ceph.io/en/news/blog/2023/reef-freeze-rbd-performance/), we used the CBT (https://github.com/ceph/cbt) tooling to see how RBD performance varies across Ceph releases.

There, CBT drove Jens Axboe's fio (https://github.com/axboe/fio) against RBD images. CBT can also drive S3 benchmark tools to test RGW.

We have used this capability for performance testing of previous Ceph releases. By combining CBT, the hsbench (https://github.com/markhpc/hsbench.git) S3 benchmark, and git bisect, we found two commits that hurt RGW performance and efficiency during the Ceph Pacific release cycle.


Once these code issues were identified, Ceph developers were able to quickly implement fixes, bringing performance and CPU usage back to the levels we were seeing in Nautilus.

In this article, we will use the same tools to test the performance of Reef now that it has been frozen.

Cluster Setup


Nodes:           10 x Dell PowerEdge R6515
CPU:             1 x AMD EPYC 7742 64C/128T
Memory:          128GiB DDR4
Network:         1 x 100GbE Mellanox ConnectX-6
NVMe:            6 x 3.84TB Samsung PM983
OS Version:      CentOS Stream release 8
Ceph Version 1:  Quincy v17.2.5 (built from source)
Ceph Version 2:  Reef 9d5a260e (built from source)

All nodes are connected to the same Juniper QFX5200 switch with a single 100GbE QSFP28 link. Ceph was deployed and the benchmarks were launched using CBT (https://github.com/ceph/cbt/).

Unless otherwise specified, each node runs 6 OSDs (in the Part 1 RBD tests, fio ran 6 librbd-type processes per node; here the clients are hsbench, as described below).

On Intel-based systems, the "latency-performance" and "network-latency" tuned profiles are typically applied to avoid latency caused by CPU C/P-state transitions.

Tuning for AMD Rome based systems hasn't changed much in this regard, and we haven't confirmed that tuned actually limits C/P-state transitions, but for these tests the tuned profile was still set to "network-latency".

Test Setup


CBT was configured to deploy Ceph with a few modified settings. Each OSD was given an 8GB memory target, and msgr V1 was used with cephx disabled. No special tuning was applied to RGW. hsbench (https://github.com/markhpc/hsbench.git) was used for the S3 tests.

Each node ran 20 hsbench processes, with each process connecting to a different RGW instance.

As a result, every RGW instance had an hsbench process connecting to it from each node (200 hsbench processes in total). Each hsbench process was configured with 64 threads and used the same set of 512 buckets.

Background processes such as scrub, deep scrub, PG autoscaling, and PG balancing were disabled. All RGW pools used 3X replication.

The data and index pools each had 8192 PGs, while the root, control, meta, and log pools each had 64 PGs. Two separate test cases were run for Quincy and Reef: one with 2 million 4MiB objects and one with 200 million 4KiB objects.
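To put the scale of that client workload in perspective, here is some simple arithmetic derived from the numbers above (nothing here comes from the CBT configuration itself):

```python
# Rough scale of the hsbench client workload described above.
nodes = 10
hsbench_per_node = 20
threads_per_process = 64
buckets = 512

processes = nodes * hsbench_per_node               # 200 hsbench processes
client_threads = processes * threads_per_process   # 12,800 concurrent S3 client threads

for label, total_objects in [("4MiB case", 2_000_000), ("4KiB case", 200_000_000)]:
    per_bucket = total_objects / buckets
    per_process = total_objects / processes
    print(f"{label}: ~{per_bucket:,.0f} objects per bucket, "
          f"~{per_process:,.0f} objects per hsbench process")
```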

Each test case runs in the following order:

Test   Description
cxi    Clear all existing objects from buckets, delete buckets, initialize buckets
p      Put objects into buckets
l      List objects in buckets
g      Get objects from buckets
d      Delete objects from buckets
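These phases correspond to hsbench run modes. As a rough illustration of what a single client process's invocation could look like, here is a sketch; the endpoint, credentials, and object count are placeholders, and the flag names are our recollection of the hsbench command line rather than anything taken from the CBT configuration, so double-check them against the hsbench README before use.

```python
# Hypothetical hsbench invocation for one client process (flag names approximate).
import subprocess

cmd = [
    "hsbench",
    "-a", "ACCESS_KEY",            # placeholder S3 access key
    "-s", "SECRET_KEY",            # placeholder S3 secret key
    "-u", "http://rgw-host:8080",  # placeholder RGW endpoint for this process
    "-z", "4K",                    # object size (4K or 4M in these tests)
    "-t", "64",                    # 64 threads per hsbench process
    "-b", "512",                   # spread objects across 512 buckets
    "-n", "1000000",               # objects per process (assumed flag name)
    "-m", "cxiplgd",               # clear/init, put, list, get, delete phases in order
]
subprocess.run(cmd, check=True)
```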

4MB PUT


Reef was slightly faster than Quincy, maintaining a roughly 2% lead across all three test iterations. RGW and OSD CPU usage was similar for both releases.

Surprisingly, for both Quincy and Reef, RGW CPU usage started at around 5 cores but roughly doubled to 10 cores about 2-3 minutes into the test.

In Part 1 of this series (https://ceph.io/en/news/blog/2023/reef-freeze-rbd-performance/), we looked at 4MB sequential write throughput and found that with librbd we could sustain around 25GB/s (75GB/s counting replication).

Why are we only seeing about 18.5GB/s for 4MB PUTs here? A large part of the answer is that the data is transferred twice: once from the client to RGW, and again from RGW to the OSDs. Some of those hops land on an RGW instance or OSD on the same node and never touch the network.

We also cannot ignore replication: there are always 2 additional transfers from the primary OSD to the secondary OSDs, and those must land on other nodes.

Probability of 2 transfers = 1/10 * 1/10 = 1/100
Probability of 4 transfers = 9/10 * 9/10 = 81/100
Probability of 3 transfers = 1 - 1/100 - 81/100 = 18/100

The general best-case performance we should expect, relative to maximum network throughput, is:

100 / (2 * (1/100) + 4 * (81/100) + 3 * (18/100)) = 26.31%

In the librbd case, with 3X replication we always perform 3 transfers (1 to the primary OSD, then 1 to each of the two secondary OSDs):

100 / 3 = 33.3%

Scaling the librbd result from Part 1 by the ratio of these two expectations gives the throughput we should expect here:

25GB/s * 26.31 / 33.3 = 19.75GB/s

It looks like we're doing a little worse than RBD, but once you factor in the extra network overhead, it's not too far off.
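The same arithmetic can be written out as a short script. This only reproduces the expected-transfer calculation above; the 25GB/s librbd baseline and the 10-node topology come from the article, nothing new is measured.

```python
# Expected network transfers per 4MB PUT (3X replication, 10 nodes).
nodes = 10
p_local = 1 / nodes                 # chance a given hop stays on the same node

# The client->RGW and RGW->primary-OSD hops may each be local; the two
# primary->secondary replication hops always cross the network.
p_both_local = p_local * p_local                    # -> 2 network transfers
p_neither_local = (1 - p_local) ** 2                # -> 4 network transfers
p_one_local = 1 - p_both_local - p_neither_local    # -> 3 network transfers

expected_put_transfers = (2 * p_both_local
                          + 4 * p_neither_local
                          + 3 * p_one_local)         # ~3.8
put_fraction = 1 / expected_put_transfers            # ~26.3% of network max

rbd_fraction = 1 / 3                                  # librbd always does 3 transfers

librbd_baseline_gbs = 25                              # from Part 1
print(round(librbd_baseline_gbs * put_fraction / rbd_fraction, 2))  # ~19.74 GB/s
```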

4MB LIST


The completion time of the LIST test is determined entirely by the number of objects. Since there are only 2M objects in total spread across 512 buckets, listing each bucket takes very little time. Reef is again as fast as, or slightly faster than, Quincy.

4MB GET


In this test, the CPU usage of the RGW and OSD processes was very consistent. As with the PUT test, we must consider the number of times the data must be transferred.

Because we are using replication, the data is read from a single OSD: one transfer from the OSD to RGW, and then one from RGW to the client.

Probability of 0 transfers = 1/10 * 1/10 = 1/100
Probability of 2 transfers = 9/10 * 9/10 = 81/100
Probability of 1 transfer = 1 - 1/100 - 81/100 = 18/100
100 / (0 * (1/100) + 2 * (81/100) + 1 * (18/100)) = 55.6%

Here we maintain ~51GB/s for Quincy and ~53GB/s for Reef. That's pretty close to the maximum these 100GbE NICs should be able to deliver.
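As a rough sanity check (our own back-of-the-envelope estimate, not a figure from the article), we can translate that aggregate GET throughput into per-node wire traffic and compare it with the 100GbE line rate. This ignores protocol overhead and assumes traffic is spread evenly across the nodes.

```python
# Rough per-node network load during the 4MB GET test (estimate only).
nodes = 10
get_throughput_gbs = 53                                   # aggregate client GB/s (Reef)
expected_get_transfers = 0 * 0.01 + 2 * 0.81 + 1 * 0.18   # ~1.8 network hops per object

wire_traffic_gbs = get_throughput_gbs * expected_get_transfers  # total GB/s on the wire
per_node_send_gbs = wire_traffic_gbs / nodes          # each hop is sent by exactly one node
line_rate_gbs = 100 / 8                               # 100GbE ~= 12.5 GB/s per direction

print(f"~{per_node_send_gbs:.1f} GB/s per node vs ~{line_rate_gbs:.1f} GB/s line rate "
      f"({per_node_send_gbs / line_rate_gbs:.0%})")
```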

4MB DELETE


The DELETE test "throughput" numbers here seem surprisingly high, but DELETE performance depends primarily on the number of objects being deleted rather than object size, and the actual deletion process is asynchronous. Having said that, Reef again seems to be slightly faster than Quincy on average.

In the 3rd test iteration, both appeared to have faster deletes. The reason for this is unclear.

4KB PUT


As in the previous test, Reef was slightly faster than Quincy. OSD CPU usage was about the same for both, but RGW CPU usage appears to be about 15% higher in Reef.

On the bright side, Quincy's RGW showed a spike in CPU consumption at the end of each test iteration that did not appear in Reef.

An important point here is that while the cluster sustains roughly 180K client PUT/s, RGW needs 3 round trips to the primary OSD for every PUT (2 synchronous, 1 asynchronous) to write the data and keep the bucket index in sync.
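To get a feel for what that means on the OSD side, here is some rough arithmetic of our own (it ignores replication fan-out and bucket-index sharding details, so treat the result as a lower bound):

```python
# Rough estimate of RGW->primary-OSD traffic during the 4KB PUT test.
client_puts_per_sec = 178_000      # aggregate client PUT/s observed for Reef
roundtrips_per_put = 3             # 2 synchronous + 1 asynchronous, per the text

primary_ops_per_sec = client_puts_per_sec * roundtrips_per_put
print(f"~{primary_ops_per_sec:,} primary-OSD round trips per second")   # ~534,000/s

# Each data write and bucket-index update is then replicated 3X behind the
# scenes, which helps explain why the OSDs burn far more CPU than RGW here.
```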

4KB LIST


Unlike the earlier LIST test, there are now far more objects per bucket, so the bucket LIST times are much longer. Reef is only slightly faster on the first iteration; otherwise Quincy and Reef are on par.

4KB GET


First, GET performance in Reef is faster and more stable than in Quincy. Having said that, there is some very interesting behavior here.

In Quincy, the first 2 minutes of the test were dominated by extremely high RGW CPU usage. Once that subsided, steady-state CPU usage was much lower than Reef's, on the OSD side as well as the RGW side.

In this test, Reef used roughly 10-20% more CPU on the RGW side and roughly 17-23% more on the OSD side than Quincy, but its behavior was more consistent.

While Reef's RGW CPU usage was also elevated at the beginning of the test, the swing was nowhere near as large as in the Quincy run.

4KB DELETE


Finally, Reef was on average about 5% faster at deleting 4KB objects. Again, deletion is asynchronous, so these numbers may not reflect how quickly RGW actually removes the objects in the background.

Conclusion


These tests demonstrate only a small subset of RGW's performance characteristics. We didn't look at single-operation latency in this article, nor did we try alternative client benchmarks. Nonetheless, our testing showed two trends:

1. Reef outperforms Quincy by about 1-5%. The advantage is fairly consistent across the tests and shows up in most test iterations.

2. In some tests, especially the 4KB GET test, Reef's CPU usage is higher than Quincy's. This is most evident in RGW, but we saw it in the OSDs as well. We hope to follow up on this before Reef is released.

There are a few additional things to note about these results.

In previous tests, we used 10 RGW instances for a 60 NVMe cluster.

In these tests, we used 20 RGW instances and saw significantly higher small-object PUT and GET performance than we had seen before. Increasing the ratio further, perhaps to 1 RGW instance per 2 OSDs or even per OSD, might yield even better small-object performance.

The second interesting thing is that the CPU consumption varies a lot across these tests, and the ratio of CPU consumption between the RGW and OSD processes also changes.

If we look at the Reef results and calculate the approximate number of cores used by RGW and OSD, we get:

4MB tests:

Test      Result      Total RGW Cores   Total OSD Cores   Total Cores   RGW/OSD Core Ratio
4MB PUT   18.6GB/s    184 Cores         152 Cores         336 Cores     6/5
4MB GET   53GB/s      88 Cores          55 Cores          143 Cores     8/5

4KB tests:

Test      Result      Total RGW Cores   Total OSD Cores   Total Cores   RGW/OSD Core Ratio
4KB PUT   178K IOPS   269 Cores         475 Cores         744 Cores     11/20
4KB GET   312K IOPS   302 Cores         102 Cores         404 Cores     3/1
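Dividing the observed throughput by the total core counts in the tables above gives a rough sense of per-core efficiency for each workload (simple division of the numbers reported here, nothing more):

```python
# Per-core efficiency derived from the Reef core-count tables above.
results = {
    "4MB PUT": (18.6, "GB/s", 336),
    "4MB GET": (53.0, "GB/s", 143),
    "4KB PUT": (178_000, "IOPS", 744),
    "4KB GET": (312_000, "IOPS", 404),
}

for test, (rate, unit, cores) in results.items():
    print(f"{test}: ~{rate / cores:.3g} {unit} per total core")
```

The gap between the 4KB PUT and 4KB GET core ratios (11/20 versus 3/1) is what drives the resource-allocation discussion below.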

Beyond the raw core counts, there is another takeaway from these results, and it matters for anyone running Ceph in containers.

The latest Ceph releases carry a large number of external dependencies, which makes package management troublesome. Beyond that, many users already manage their applications in containers and want to deploy Ceph the same way.

Part of the appeal is giving each Ceph daemon a static resource allocation at the container level: a fixed amount of memory and a fixed number of cores to run on. This is where things get tricky.

We'll discuss memory quotas another time, but in short: guaranteeing memory usage in applications that weren't designed for it is extremely difficult.

What Ceph does instead is let its processes monitor their mapped memory and automatically adjust non-essential memory usage, using the low-level memory auto-tuning and cache-management code built into the daemons.

Limiting the number of CPU cores usually doesn't cause the same problems as limiting memory, but it raises a sizing question: in these tests, the RGW-to-OSD core ratio differed by nearly a factor of 6 between the 4KB PUT and 4KB GET workloads.

If we sized static allocations for only one of those scenarios, we would either hurt the other or leave a lot of cores sitting idle at different points in time.

Perhaps in some future version of Ceph we could spin RGW containers up and down on demand when more 4KB GET throughput is needed.

For now, though, achieving both high performance and high efficiency with Ceph RGW storage requires some flexibility in how resources are allocated.


Origin: blog.csdn.net/NewTyun/article/details/130757998