POSIX compatibility comparison of shared file systems in the cloud

"Everything is a file" is a basic design philosophy of UNIX. Files are organized hierarchically into a directory tree, which forms the basic shape of a file system. When users store data in a file system, they access it through an agreed interface specification without caring how the data is stored underneath.

Concepts

The most widely used interface specification for file systems is POSIX, which derives from a family of standards written by an IEEE committee; part of it covers file and directory operations. The standard itself is long and hard to read, so we will not discuss it in depth here. For a more accessible overview, see the Quora question "What does POSIX conformance/compliance mean in the distributed systems world?".

POSIX compliance requires a file system to have the following characteristics:

  • A hierarchical directory structure of arbitrary depth
  • Files are created via open(O_CREAT), directories via mkdir, etc.
  • Directories can be traversed via opendir / readdir
  • Paths and namespaces can be modified via rename, link / unlink, symlink / readlink, etc.
  • Data is written via write or writev, persisted when required via fsync, and read via read or readv
  • Some other interfaces such as stat, chmod / chown, etc.
  • Contrary to some popular myths, extended attributes do not appear to be part of POSIX; see the list of functions in The Open Group Base Specifications Issue 7, 2018 edition
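
As a rough illustration of the interfaces listed above, the following Python sketch exercises several of them through the standard os module, which wraps the corresponding system calls. The helper name and file names are made up for this example, and it is in no way a conformance test.

```python
import os
import stat
import tempfile


def exercise_basic_posix_calls(root):
    """Run a handful of the listed file operations inside `root`."""
    directory = os.path.join(root, "dir")
    os.mkdir(directory)                                   # mkdir

    path = os.path.join(directory, "file")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)   # open(O_CREAT)
    os.write(fd, b"hello")                                # write
    os.fsync(fd)                                          # persist on request
    os.close(fd)

    renamed = os.path.join(directory, "renamed")
    os.rename(path, renamed)                              # rename
    hard = os.path.join(directory, "hard")
    os.link(renamed, hard)                                # link (hard link)
    soft = os.path.join(directory, "soft")
    os.symlink("renamed", soft)                           # symlink

    os.chmod(renamed, 0o600)                              # chmod
    st = os.stat(renamed)                                 # stat
    with open(hard, "rb") as f:                           # read via hard link
        data = f.read()

    return {
        "entries": sorted(os.listdir(directory)),         # opendir/readdir
        "nlink": st.st_nlink,
        "mode": stat.S_IMODE(st.st_mode),
        "link_target": os.readlink(soft),                 # readlink
        "data": data,
    }


with tempfile.TemporaryDirectory() as tmp:
    result = exercise_basic_posix_calls(tmp)
    print(result)
```

On a file system that implements these calls faithfully, the hard link raises the link count to 2 and the data written through one name is readable through the other.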

Testing

Whether a file system is really POSIX compatible can be checked with testing tools. A popular test suite is pjdfstest, which originated in FreeBSD and also works on systems such as Linux. Its test cases must be run as root and require Perl and TAP::Harness (a Perl package) to be installed. The test procedure is as follows:

cd /path/to/filesystem/under/test
sudo prove --recurse --verbose /path/to/pjdfstest/tests

We selected several shared file systems offered in cloud environments for testing. The number of failed test cases for each is as follows:

Because Amazon EFS fails orders of magnitude more test cases than the other products, the horizontal axis of the figure above uses a logarithmic scale to make comparison easier.

We also tested S3FS and Goofys; each failed hundreds or even thousands of test cases. The root cause is that neither project was designed strictly as a file system:

  • Goofys can mount S3 as a file system, but it is only a "Filey" system with a "POSIX-ish" interface (both terms come from the project's own introduction). Goofys is designed to sacrifice POSIX compatibility for performance, and the file operations it supports are heavily constrained by the underlying object storage such as S3. The test results confirm this. Review your application's data access patterns thoroughly before production use to avoid falling into this trap.

  • S3FS, despite being named a file system, is really closer to a way of managing the objects in an S3 bucket through a file system view. Although S3FS supports a larger subset of POSIX, it simply maps system calls one by one onto object storage requests and does not provide the semantics and consistency of a conventional file system (for example, directory renames are not atomic, there is no mutual exclusion when a file is opened in exclusive mode, appending to a file rewrites the entire file, and hard links are not supported). These flaws make S3FS unsuitable as a replacement for a regular file system, even leaving performance aside, because applications expect POSIX-compliant behavior when they access a file system, and S3FS falls far short of that.

Analysis

Next, we classify and count the failed test cases, and pick several representative categories to analyze what limitations they impose on applications.

Overall, JuiceFS has fewer failed test cases and better compatibility than the others, in both total count and number of categories. Amazon EFS fails far more test cases, across far more categories, than any other file system, so it cannot be shown on the same chart and is analyzed separately later.
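
The classification step can be sketched roughly as follows. This is a minimal illustration, assuming the verbose prove output has been saved as text; the sample lines are copied from the logs quoted in this article, while the function name and regular expression are our own and not part of pjdfstest.

```python
import re
from collections import Counter

# Sample lines copied from the failure logs quoted in this article.
SAMPLE_LOG = """\
/root/pjdfstest/tests/utimensat/08.t ........
not ok 5 - tried 'lstat pjdfstest_bfaee1fc7f2c1f80768e30f203f41627 atime_ns', expected 100000000, got 0
not ok 6 - tried 'lstat pjdfstest_bfaee1fc7f2c1f80768e30f203f41627 mtime_ns', expected 200000000, got 0
Failed 2/9 subtests
/root/pjdfstest/tests/utimensat/09.t ........
not ok 5 - tried 'lstat pjdfstest_7911595d91adcf915009f551ac48e1f2 mtime', expected 4294967296, got 0
/root/pjdfstest/tests/unlink/14.t ...........
not ok 4 - tried 'open pjdfstest_b03f52249a0c653a3f382dfe1237caa1 O_RDONLY : unlink pjdfstest_b03f52249a0c653a3f382dfe1237caa1 : fstat 0 nlink', expected 0, got 1
"""

# A test-file header looks like ".../tests/<category>/<number>.t ....".
TEST_FILE = re.compile(r"^\S*/tests/(?P<category>[^/\s]+)/\S+\.t\s")


def count_failures_by_category(log_text):
    """Count 'not ok' lines, grouped by the test directory they belong to."""
    counts = Counter()
    category = None
    for line in log_text.splitlines():
        header = TEST_FILE.match(line)
        if header:
            category = header.group("category")
        elif line.startswith("not ok ") and category is not None:
            counts[category] += 1
    return counts


failures = count_failures_by_category(SAMPLE_LOG)
for name, count in sorted(failures.items()):
    print(name, count)
```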

JuiceFS

JuiceFS passed the vast majority of the 8811 test cases in this run, failing only 3 in the utimensat test set. The corresponding log is as follows:

…
/root/pjdfstest/tests/utimensat/08.t ........
not ok 5 - tried 'lstat pjdfstest_bfaee1fc7f2c1f80768e30f203f41627 atime_ns', expected 100000000, got 0
not ok 6 - tried 'lstat pjdfstest_bfaee1fc7f2c1f80768e30f203f41627 mtime_ns', expected 200000000, got 0
Failed 2/9 subtests
/root/pjdfstest/tests/utimensat/09.t ........
not ok 5 - tried 'lstat pjdfstest_7911595d91adcf915009f551ac48e1f2 mtime', expected 4294967296, got 0

These test cases come from utimensat/08.t and utimensat/09.t. 08.t tests sub-second precision of file access and modification times, and 09.t requires support for 64-bit timestamps.

JuiceFS currently stores timestamps with one-second precision as 32-bit integers, so these three tests cannot pass (in fact, none of the file systems in this comparison passes the utimensat test set completely). If your application needs sub-second precision or a wider time range, please contact us to discuss solutions.
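
The arithmetic behind these three failures can be sketched as follows. This is only a model, assuming a seconds-only timestamp held in an unsigned 32-bit integer (which matches the 1970-2106 range mentioned in the summary); the helper names and the sample second count are hypothetical, and the code does not describe the actual JuiceFS implementation.

```python
from datetime import datetime, timezone

NS_PER_SECOND = 1_000_000_000
UINT32_MAX = 2**32 - 1


def store_seconds_only(timestamp_ns):
    """Model a file system that keeps whole seconds and drops the rest."""
    seconds = timestamp_ns // NS_PER_SECOND
    return seconds, 0  # (seconds, nanoseconds) as stat would later report


def fits_in_uint32(seconds):
    """True if `seconds` can be stored in an unsigned 32-bit integer."""
    return 0 <= seconds <= UINT32_MAX


# Hypothetical second count plus the 100000000 ns that 08.t expects back.
requested_ns = 1_600_000_000 * NS_PER_SECOND + 100_000_000
print(store_seconds_only(requested_ns))   # the nanosecond part comes back as 0

# 09.t expects mtime 4294967296, which is exactly 2**32.
print(fits_in_uint32(4294967296))
print(4294967296 & UINT32_MAX)            # value left after 32-bit truncation

# Last second representable by an unsigned 32-bit counter starting at 1970.
last_second = datetime.fromtimestamp(UINT32_MAX, tz=timezone.utc)
print(last_second.isoformat())
```

Truncating 4294967296 to 32 bits leaves 0, which is consistent with the "got 0" in the 09.t log line.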

GCP Filestore

In addition to failing several cases in the utimensat test set, as JuiceFS does, GCP Filestore also fails one case in the unlink test set. This case also fails on all of the other file systems covered below.

/root/pjdfstest/tests/unlink/14.t ...........
not ok 4 - tried 'open pjdfstest_b03f52249a0c653a3f382dfe1237caa1 O_RDONLY : unlink pjdfstest_b03f52249a0c653a3f382dfe1237caa1 : fstat 0 nlink', expected 0, got 1

This test (unlink/14.t) verifies the behavior when a file is deleted while it is still open:

desc="An open file will not be immediately freed by unlink"

Deleting a file corresponds to unlink at the system-call level: it removes the link from the file name to the corresponding inode and decrements that inode's nlink by 1. This test case verifies exactly that.

# A deleted file's link count should be 0
expect 0 open ${n0} O_RDONLY : unlink ${n0} : fstat 0 nlink

The file contents are actually deleted only when the link count (nlink) drops to 0 and no open file descriptors (fd) refer to the file. If nlink is not updated properly, files that should have been deleted may remain on the system.
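
The same check can be reproduced outside pjdfstest with a few lines of Python. This is a minimal sketch with a made-up helper and file name; on a file system that behaves as the test expects, the link count read through the still-open descriptor drops to 0 after the unlink, while the data remains readable.

```python
import os
import tempfile


def unlink_while_open(directory):
    """Unlink a file that is still open and inspect it through the fd."""
    path = os.path.join(directory, "victim")
    with open(path, "wb") as f:
        f.write(b"still here")

    fd = os.open(path, os.O_RDONLY)
    try:
        nlink_before = os.fstat(fd).st_nlink
        os.unlink(path)                        # remove the only name
        nlink_after = os.fstat(fd).st_nlink    # pjdfstest expects 0 here
        data = os.read(fd, 100)                # contents are not freed yet
    finally:
        os.close(fd)                           # now the inode can be released

    return {
        "nlink_before": nlink_before,
        "nlink_after": nlink_after,
        "data": data,
        "path_exists": os.path.exists(path),
    }


with tempfile.TemporaryDirectory() as tmp:
    info = unlink_while_open(tmp)
    print(info)
```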

Tencent Cloud CFS

Compared with GCP Filestore, Tencent Cloud CFS additionally fails several tests in the open and symlink sets.

open failure case

Some of the failure logs are as follows:

/root/pjdfstest/tests/open/07.t .............
not ok 5 - tried '-u 65534 -g 65534 open pjdfstest_f24a42815d59c16a4bde54e6559d0390 O_RDONLY,O_TRUNC', expected EACCES, got 0
not ok 7 - tried '-u 65533 -g 65534 open pjdfstest_f24a42815d59c16a4bde54e6559d0390 O_RDONLY,O_TRUNC', expected EACCES, got 0
not ok 9 - tried '-u 65533 -g 65533 open pjdfstest_f24a42815d59c16a4bde54e6559d0390 O_RDONLY,O_TRUNC', expected EACCES, got 0
Failed 3/23 subtests

This test, open/07.t, verifies that open returns an EACCES error when O_TRUNC is specified and the caller has no write permission.

desc="open returns EACCES when O_TRUNC is specified and write permission is denied"

The three failure logs above have to be read together with the test code; they correspond to the owner, group and other cases respectively. Without loss of generality, we analyze only the owner case:

expect 0 -u 65534 -g 65534 chmod ${n1} 0477
expect EACCES -u 65534 -g 65534 open ${n1} O_RDONLY,O_TRUNC

The test first sets the owner permission of the file to 4, that is, r-- (read-only), and then tries to open the file with O_RDONLY,O_TRUNC. EACCES is expected, but 0 (success) is actually returned.

According to the description of O_TRUNC in The Single UNIX ® Specification, Version 2:

O_TRUNC If the file exists and is a regular file, and the file is successfully opened O_RDWR or O_WRONLY, its length is truncated to 0 and the mode and owner are unchanged. It will have no effect on FIFO special files or terminal device files. Its effect on other file types is implementation-dependent. The result of using O_TRUNC with O_RDONLY is undefined.

In other words, the result of combining O_TRUNC with O_RDONLY is undefined, and because the file under test in this case is empty to begin with, O_TRUNC has no effect anyway.
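
To make the mode in this test concrete, the sketch below decodes 0477 with Python's stat module. The two helper functions are ours, written only for illustration.

```python
import stat


def describe_mode(mode):
    """Render permission bits the way `ls -l` does, without the type char."""
    return stat.filemode(stat.S_IFREG | mode)[1:]


def owner_can_write(mode):
    """True if the owner write bit is set."""
    return bool(mode & stat.S_IWUSR)


mode = 0o477  # the mode set by chmod in open/07.t
print(describe_mode(mode))
print(owner_can_write(mode))
```

The owner triplet is r--, so the owner has no write permission even though group and other have rwx; that is the situation in which the test expects EACCES.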

symlink failure case

The corresponding test log is as follows:

/root/pjdfstest/tests/symlink/03.t ..........
not ok 1 - tried 'symlink 7ea12171c487d234bef89d9d77ac8dc2929ea8ce264150140f02a77fc6dcad7c3b2b36b5ed19666f8b57ad861861c69cb63a7b23bcc58ad68e132a94c0939d5/.../... pjdfstest_57517a47d0388e0c84fa1915bf11fe4a', expected 0, got EINVAL
not ok 2 - tried 'unlink pjdfstest_57517a47d0388e0c84fa1915bf11fe4a', expected 0, got ENOENT
Failed 2/6 subtests

This test (symlink/03.t) checks the behavior of symlink when a path exceeds PATH_MAX in length:

desc="symlink returns ENAMETOOLONG if an entire length of either path name exceeded {PATH_MAX} characters"

The corresponding code for the failed use case is as follows:

n0=`namegen`
nx=`dirgen_max`
nxx="${nx}x"

mkdir -p "${nx%/*}"
expect 0 symlink ${nx} ${n0}
expect 0 unlink ${n0}

The failing case creates a symbolic link whose target is exactly PATH_MAX long (including the trailing NUL), which should succeed. The result shows that Tencent Cloud CFS cannot create a symbolic link of that length.
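
A rough way to probe the same boundary from Python is sketched below. It is a simplification: the target is a single run of characters rather than the nested directory path that dirgen_max builds, and the helper name is hypothetical. The outcome depends on the file system under test, so the sketch only reports what happened.

```python
import errno
import os
import tempfile


def try_longest_symlink(directory, path_max):
    """Try to create a symlink whose target is PATH_MAX - 1 characters long.

    Together with the trailing NUL this is exactly PATH_MAX bytes, the
    longest target the test expects to be accepted.
    """
    target = "x" * (path_max - 1)
    link = os.path.join(directory, "link")
    try:
        os.symlink(target, link)
        stored = os.readlink(link)
        os.unlink(link)
    except OSError as exc:
        # Report the symbolic errno name, e.g. EINVAL or ENAMETOOLONG.
        return len(target), errno.errorcode.get(exc.errno, str(exc.errno))
    return len(target), "ok" if stored == target else "mismatch"


with tempfile.TemporaryDirectory() as tmp:
    path_max = os.pathconf(tmp, "PC_PATH_MAX")
    target_length, outcome = try_longest_symlink(tmp, path_max)
    print(target_length, outcome)
```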

Alibaba Cloud NAS

Compared with Tencent Cloud CFS, Alibaba Cloud NAS behaves correctly on symlink, but fails several test cases in the chmod and rename sets.

chmod failure case

In this test set, Alibaba Cloud NAS failed the following cases:

/root/pjdfstest/tests/chmod/12.t ............
not ok 3 - tried '-u 65534 -g 65534 open pjdfstest_db85e6a66130518db172a8b6ce6d53da O_WRONLY : write 0 x : fstat 0 mode', expected 0777, got 04777
not ok 4 - tried 'stat pjdfstest_db85e6a66130518db172a8b6ce6d53da mode', expected 0777, got 04777
not ok 7 - tried '-u 65534 -g 65534 open pjdfstest_db85e6a66130518db172a8b6ce6d53da O_RDWR : write 0 x : fstat 0 mode', expected 0777, got 02777
not ok 8 - tried 'stat pjdfstest_db85e6a66130518db172a8b6ce6d53da mode', expected 0777, got 02777
not ok 11 - tried '-u 65534 -g 65534 open pjdfstest_db85e6a66130518db172a8b6ce6d53da O_RDWR : write 0 x : fstat 0 mode', expected 0777, got 06777
not ok 12 - tried 'stat pjdfstest_db85e6a66130518db172a8b6ce6d53da mode', expected 0777, got 06777
Failed 6/14 subtests

This test (chmod/12.t) checks the behavior of the SUID/SGID bits:

desc="verify SUID/SGID bit behaviour"

We explain the 11th and 12th test cases in detail, since they cover both permission bits at once:

# Check whether writing to the file by non-owner clears the SUID+SGID.
expect 0 create ${n0} 06777
expect 0777 -u 65534 -g 65534 open ${n0} O_RDWR : write 0 x : fstat 0 mode
expect 0777 stat ${n0} mode
expect 0 unlink ${n0}

Here the test first creates the target file with mode 06777, then modifies the file contents and checks whether SUID and SGID are cleared properly. The 777 in the mode will be familiar to everyone: it gives owner, group and other read, write and execute permission (rwx). The leading 0 marks the number as octal.

The second digit, 6, needs some explanation. This octal digit holds the special permission bits, the first two of which are setuid and setgid (also written SUID/SGID); they can be applied to executable files and public directories. When one of these bits is set, any user who runs the file runs it as the file's owner (or group). This special attribute gives users access to files and directories that are normally open only to the owner. For example, the file that stores passwords can be accessed only by root, so ordinary users cannot modify it directly; the passwd command has setuid set, which is what allows ordinary users to change their passwords.

setuid/setgid were designed to give users a restricted way (a specific executable) to access restricted files (files not owned by the current user). Therefore, when such a file is modified by a non-owner, these bits should be cleared automatically, so that users cannot obtain additional privileges this way.

The test results show that on Alibaba Cloud NAS setuid/setgid are not cleared when a file is modified by a non-owner. In practice, a user could modify the file's contents and then have arbitrary operations run as the owner, which is a security hazard.

Further reading: Special File Permissions (setuid, setgid and Sticky Bit) (System Administration Guide: Security Services)
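
The modes reported in the failure log can be decoded with Python's stat constants to see exactly which special bits were left behind. The helper below is our own illustration; the mode values are the ones from the chmod/12.t log above.

```python
import stat


def leftover_special_bits(mode):
    """Name the special permission bits that are still set in `mode`."""
    names = []
    if mode & stat.S_ISUID:
        names.append("setuid")
    if mode & stat.S_ISGID:
        names.append("setgid")
    if mode & stat.S_ISVTX:
        names.append("sticky")
    return names


# Expected mode after a non-owner write, then the three modes observed.
for mode in (0o777, 0o4777, 0o2777, 0o6777):
    print(oct(mode), leftover_special_bits(mode) or "clean")
```

The expected value 0777 has no special bits, whereas each observed value still carries the setuid bit, the setgid bit, or both.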

rename failure case

Alibaba Cloud NAS has a large number of failures in this test set, 24 in total, all of them in rename/09.t:

desc="rename returns EACCES or EPERM if the directory containing 'from' is marked sticky, and neither the containing directory nor 'from' are owned by the effective user ID"

This test examines how rename behaves when the sticky bit is set: if the directory containing the source object has the sticky bit set, and neither the source object nor the containing directory is owned by the effective user ID, rename should return EACCES or EPERM. (Logic this convoluted is reminiscent of the hero skill descriptions in the card game Three Kingdoms Kill...)

A typical use of the sticky bit is the /tmp directory, which lets everyone create content but lets only the owner delete files. Public upload directories on FTP servers are usually set up the same way.

The failed test cases show that Alibaba Cloud NAS does not fully support the sticky bit: a rename by a non-owner is not rejected and actually takes effect, so the source file is renamed. This behavior bypasses the file system's access control and threatens the security of users' files.
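
The rule quoted from rename/09.t can be written down as a small predicate. This sketch models only the sticky-bit check; it ignores the ordinary write and search permissions on the directory, and treating UID 0 as exempt is our assumption rather than something stated in the test description. The function name and the sample UIDs are illustrative.

```python
import stat


def sticky_rename_allowed(dir_mode, dir_uid, file_uid, euid):
    """Return True if the sticky-bit rule permits renaming the entry."""
    if euid == 0:
        return True                      # assumption: root is exempt
    if not dir_mode & stat.S_ISVTX:
        return True                      # no sticky bit, rule does not apply
    # Sticky directory: caller must own the directory or the entry itself.
    return euid in (dir_uid, file_uid)


# Directory owned by root with mode 1777, like /tmp; file owned by 65534.
print(sticky_rename_allowed(0o1777, dir_uid=0, file_uid=65534, euid=65533))
print(sticky_rename_allowed(0o1777, dir_uid=0, file_uid=65534, euid=65534))
print(sticky_rename_allowed(0o777, dir_uid=0, file_uid=65534, euid=65533))
```

The first call models the situation the test expects to be rejected with EACCES or EPERM; the other two are permitted because the caller owns the file or the directory is not sticky.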

Failed Use Cases in Amazon EFS

Amazon Elastic File System (EFS) not only has a very high failure rate in pjdfstest (1533 of 8811 test cases failed), but its failures also span almost every category, which is surprising.

EFS can be mounted over NFS, but its support for NFS features is incomplete. For example, EFS does not support block devices or character devices, which directly causes a large number of pjdfstest cases to fail. Even after excluding these two file types, hundreds of failures across different categories remain, so EFS should be used with caution in complex scenarios.

Summary

From the comparative analysis above, JuiceFS performs best in terms of compatibility; like most network file systems, it sacrifices sub-second time precision and timestamp range (1970-2106) for performance. Google Filestore and Tencent Cloud CFS come next, with failures in several categories. Alibaba Cloud NAS and Amazon EFS have the worst compatibility: they fail a large number of compatibility tests, including several cases that carry serious security risks. A security assessment is recommended before using them.

JuiceFS has always attached great importance to a high degree of compatibility with the POSIX standard. We use compatibility test suites such as pjdfstest, together with other random and concurrent testing tools (such as fsracer and fstool), as integration tests. While continuously improving functionality and performance, we do our best to maintain maximum POSIX compatibility, so that users avoid falling into various traps and can focus on developing their own business.

If you found this helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)

Origin my.oschina.net/u/5389802/blog/5473039