Several misconceptions about hard disk performance, from the perspective of consensus algorithms

Three weeks ago, I open-sourced Dragonboat, a consensus library I wrote myself. In the feedback, I noticed that some users held a number of basic misconceptions about hard disk performance, pitfalls I had carefully thought through and stumbled into myself. From a software engineer's perspective, this article shares a few of these misconceptions about hard disk performance, so that everyone can steer around them.

From SATA to NVMe

The story starts with a local NVMe disk provided by Google Cloud. A "local NVMe disk", as the name suggests, should be high-performance, right? Its IOPS numbers are beautiful, and with the halo of Google's brand behind it, surely they can't be inflated. Then I ran Dragonboat's benchmarking mode, and the score was unbearable: the NVMe disk performed worse than a SATA SSD from 7 years ago.

Consensus algorithms, databases of all kinds, and any other software that relies on a WAL need to ensure that data has truly been persisted to the hard disk, so that, for example, after a power failure and restart the data is still intact. fsync() plays this role: it ensures that write data sitting in the operating system cache and in the disk's own cache is actually persisted and can survive a power failure and restart. A transaction writing data in a database, or a proposal in a consensus algorithm, must make sure the data has reached the disk, and the consensus algorithm additionally needs it on the disks of a majority of machines. The latency of fsync() therefore has the most direct impact on the throughput of these systems. Is the snail's pace of the Google Cloud local NVMe disk caused by a slow fsync()?
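To make fsync()'s role concrete before measuring it, here is a minimal sketch in Go of a WAL-style append. This is hypothetical illustration code, not Dragonboat's actual implementation: the point is simply that the entry only counts as durable once Sync() returns, and only then may the write be acknowledged.

```go
package main

import (
	"log"
	"os"
)

// appendAndSync is a minimal sketch (not Dragonboat's actual code) of a
// WAL-style append: the entry only counts as durable after fsync() returns.
func appendAndSync(f *os.File, entry []byte) error {
	if _, err := f.Write(entry); err != nil {
		return err // at this point the data may only sit in the OS page cache
	}
	// fsync(): ask the OS and the drive to persist the data so that it
	// survives a power failure; only after this may we acknowledge.
	return f.Sync()
}

func main() {
	f, err := os.OpenFile("wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := appendAndSync(f, []byte("proposal payload\n")); err != nil {
		log.Fatal(err)
	}
}
```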

Time to bring out the heirloom tool, pg_test_fsync

To correctly test the various aspects of fsync() performance, I went through a large pile of tools and wrote some of my own, and found that the pg_test_fsync tool bundled with the PostgreSQL database is the most intuitive and easiest to use. The figure below shows the pg_test_fsync result on the Google Cloud local NVMe disk. Behind its beautiful IOPS numbers, each fsync() takes nearly 4.4 milliseconds, which is on the order of a fast mechanical disk. Other users have also run into this bizarre problem.
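For a quick sanity check without installing PostgreSQL, a rough approximation of what pg_test_fsync measures can be put together in a few lines of Go. This is only a sketch under simple assumptions (8 kB writes to a scratch file on the disk under test), not a replacement for pg_test_fsync:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	// Create a scratch file on the disk under test (hypothetical path).
	f, err := os.OpenFile("fsync_test.dat", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	defer os.Remove("fsync_test.dat")

	buf := make([]byte, 8192) // one 8 kB write per round, pg_test_fsync style
	const rounds = 1000
	start := time.Now()
	for i := 0; i < rounds; i++ {
		if _, err := f.WriteAt(buf, 0); err != nil {
			log.Fatal(err)
		}
		if err := f.Sync(); err != nil { // the fsync() being measured
			log.Fatal(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("avg write+fsync latency: %v\n", elapsed/rounds)
}
```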

For comparison, test results for common SATA solid-state drives such as the Intel S3700/S3710, Intel 320 and Micron 500DC show fsync() latencies of around 0.15-0.2 milliseconds, dozens of times lower than the Google Cloud local NVMe drive. The pg_test_fsync result for the Intel S3700 looks like this:

The results for the NVMe Intel P3700 are as follows. The difference does exist, but it is nowhere near the dozens-of-times gap described above:

For a consensus algorithm, the theoretical latency floor is one fsync() plus one network RTT. A simple calculation shows that the fsync() latency of the Google Cloud NVMe disk above caps the consensus throughput of a single client at roughly 230 operations per second, while switching to a SATA S3700 immediately raises that theoretical ceiling to about 5000 operations per second. The SATA S3700 flattens Google Cloud's wonderful NVMe disk.
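The arithmetic behind those two ceilings, treating the network RTT as negligible next to fsync(), is a back-of-the-envelope sketch using the 4.4 ms and 0.2 ms figures quoted above (real deployments also pipeline proposals, so aggregate multi-client throughput can be much higher):

```go
package main

import "fmt"

func main() {
	// Upper bound for a single client that must wait for one fsync()
	// (plus a network RTT, ignored here as negligible) per proposal.
	fsyncGoogleNVMe := 0.0044 // seconds, ~4.4 ms measured above
	fsyncS3700 := 0.0002      // seconds, ~0.2 ms measured above

	fmt.Printf("Google Cloud local NVMe: ~%.0f proposals/s\n", 1/fsyncGoogleNVMe) // ~227
	fmt.Printf("Intel S3700 (SATA):      ~%.0f proposals/s\n", 1/fsyncS3700)      // ~5000
}
```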

Capacity, throughput, IOPS and even endurance can all be scaled up by stacking more disks, but there is no shortcut around fsync() latency. The NVMe-versus-SATA comparison above shows that NVMe by itself is not the decisive factor. The interface-level difference between SATA and NVMe is on the order of tens of microseconds, and articles explaining the underlying reasons are everywhere online, so I will not repeat them here. An example where an NVMe disk is dozens of times slower than a SATA disk demonstrates plainly that the real performance gap does not lie in SATA versus NVMe.

From consumer-grade to enterprise-grade SSDs

Another common pitfall is using consumer-grade SSDs in development and test environments. Samsung's NVMe M.2 solid-state drives, for example, are cheap and plentiful, and their IOPS numbers rival enterprise-grade products, which seems like reason enough to use them in non-production environments. At the beginning of Dragonboat's development, I foolishly ran tests on such a consumer NVMe disk, with all kinds of tragic results. The clearest way to debunk this misconception is the data posted by a FreeBSD developer: the same write-then-fsync() workload, compared between the antique Intel 710 enterprise-grade SATA disk and the high-end consumer-grade Samsung 950 PRO NVMe disk. Consumer-grade disks should never be used for this, even in development and test environments:

This third-party data again shows that SATA versus NVMe is not the core issue. The write latency of the consumer NVMe disk is 11 times that of the antique Intel 710 SATA disk, which makes it completely unsuitable for consensus algorithms, databases and similar workloads. If the single-machine throughput of the development and test environment drops to 1/10 of production just to save a few hundred RMB on solid-state disks, it is obviously not worth it.

Cache with Power Loss Protection

Traditional enterprise-grade hard disks have power-loss protection. At first it sounds like a feature designed purely for data integrity: its purpose is to prevent the disk from losing data in its cache that has not yet been written to flash when power is lost. In fact, whether the cache has power-loss protection is precisely the reason for the huge difference in fsync() performance.

A disassembled Intel P3700: the two protruding power-loss-protection capacitors at the upper left of the front of the card are clearly visible

On an enterprise disk with power-loss protection, when fsync() is called, the drive can report completion to the host as soon as the data has been written into the DRAM cache on the SSD, because even if the system suddenly loses power, the charge in the capacitors is enough to keep the drive powered until the cached data has been safely written to NAND. On those wonderful "enterprise" disks without power-loss protection, such as the Google Cloud local NVMe disk above, and on consumer disks such as the NVMe Samsung 950 PRO, the data must be written all the way to the NAND flash chips every time. The physical latency of writing to NAND is on the order of a millisecond, and this has nothing to do with SATA or NVMe.

The figure below is AnandTech's comparison of several common NAND chips. Take the Intel P3700, a typical MLC NAND solid-state disk, as an example: the write latency of the NAND it uses is about 1 ms. The reason a disk write can complete within 100 microseconds is that, with the power-loss protection mechanism in place, the data is reliably written to the cache, not to the MLC NAND.

The big pitfall here is a one-sided obsession with the performance differences between NAND types such as SLC/MLC/TLC, as in "servers should use SLC/MLC flash". First, that is not where products are heading. Second, the analysis above has already made clear that the factor most directly tied to throughput is the power-loss protection mechanism: it is precisely by sidestepping the NAND write latency that disk write performance becomes good. There is no need to be picky about the NAND type. What matters is choosing enterprise disks from major vendors such as Intel, making sure the power-loss protection integrity self-check passes, and picking drives whose write endurance fits the workload.

Intel Optane

Optane, by design, removes the need for a DRAM cache; without that cache, power-loss protection is naturally unnecessary. It offers lower read and write latency, and with no cache and no power-loss protection, writes reach the medium in 20-30 microseconds. Apart from its steep price, it has no obvious weak points in any metric, endurance included. This latest development is worth noting here, but I will not expand on it.

Consensus algorithms do not need large amounts of fast, low-fsync()-latency storage

Mature consensus libraries and database systems generally support specifying a separate WAL storage location; pointing it at an Optane drive or a power-loss-protected solid-state disk with low fsync() latency helps system performance a great deal. This WAL data is generally not large: in many tested scenarios around 100 GB is enough, which is why solid-state disks such as the Intel P4801X come in a 100 GB capacity. Do not misread this as "with a consensus algorithm, all data must sit on an SSD with low disk write latency". A sketch of what such a configuration might look like in Dragonboat is shown below.
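As an illustration, here is a sketch of placing only the Raft WAL on the fast device using Dragonboat's NodeHostConfig. The field names follow the v3 API as I recall it and the paths are hypothetical; treat this as a sketch rather than a copy-paste recipe, and check the current documentation for your version:

```go
package main

import (
	"log"

	"github.com/lni/dragonboat/v3"
	"github.com/lni/dragonboat/v3/config"
)

func main() {
	// Sketch only: field names per the Dragonboat v3 API, paths hypothetical.
	nhc := config.NodeHostConfig{
		// Put the Raft WAL on the low-fsync()-latency device,
		// e.g. an Optane or power-loss-protected enterprise SSD.
		WALDir: "/mnt/optane/dragonboat-wal",
		// Everything else can live on ordinary, larger storage.
		NodeHostDir:    "/mnt/bigdisk/dragonboat",
		RTTMillisecond: 200,
		RaftAddress:    "node1.example.com:5012",
	}
	nh, err := dragonboat.NewNodeHost(nhc)
	if err != nil {
		log.Fatal(err)
	}
	defer nh.Stop()
	// ... start Raft clusters / state machines here ...
}
```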

Conclusion

  • Disk write latency is the core hard disk metric for applications such as consensus algorithms and databases.
  • The difference in disk write latency between SATA and NVMe is far smaller than the difference caused by the presence or absence of power-loss protection.
  • The most fundamental difference between consumer-grade and enterprise-grade disks is whether they have power-loss protection, and the difference in disk write latency that it brings.
  • The pg_test_fsync tool bundled with PostgreSQL makes it easy to check a disk's write latency. SSDs whose fsync() latency exceeds 200 microseconds should go straight to the scrap process, or be repurposed for areas unrelated to consensus algorithms and databases.

Finally, have you tried Dragonboat, the open-source consensus library? You are welcome to give it a try, and please click Star to show your support!
