[Expert Q&A Summary] Distributed file systems and JuiceFS

The file system is one of the most important components of a computer: it provides a consistent way to access and manage storage devices. File systems differ somewhat between operating systems, but a few common features have not changed for decades:

  1. Files are organized in a tree-like directory structure;
  2. Data exists in the form of files, accessed through APIs such as Open, Read, Write, Seek, and Close;
  3. An atomic rename (Rename) operation is provided to change the location of a file or directory.
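
As a rough illustration of the three features above, the following Python sketch exercises them through the standard os module. The directory and file names are made up for the example, and a temporary directory stands in for any mounted file system.

```python
import os
import tempfile


def demo_file_api(root: str) -> bytes:
    """Exercise open/write/seek/read/close plus a rename under `root`."""
    # Feature 1: files live in a tree of directories.
    os.makedirs(os.path.join(root, "logs"), exist_ok=True)
    tmp_path = os.path.join(root, "logs", "report.tmp")      # hypothetical names
    final_path = os.path.join(root, "logs", "report.txt")

    # Feature 2: Open, Write, Seek, Read, Close.
    fd = os.open(tmp_path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"hello, file system")
        os.lseek(fd, 7, os.SEEK_SET)   # jump to byte offset 7
        data = os.read(fd, 4)          # read the 4 bytes starting there
    finally:
        os.close(fd)

    # Feature 3: rename moves the file to its final location. On POSIX
    # systems readers see either the old name or the new one, never a
    # half-moved file.
    os.rename(tmp_path, final_path)
    return data


with tempfile.TemporaryDirectory() as root:
    print(demo_file_api(root))
```

Writing to a temporary name and renaming at the end is a common way to publish a file only once it is complete, which is why the atomicity of rename matters to applications.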

The access and management model provided by the file system underpins most computer applications, and the Unix concept of "everything is a file" highlights its importance. JuiceFS is an open source distributed file system that uses object storage as its underlying storage medium, which lets storage capacity expand practically without limit. Any file stored in JuiceFS is divided into fixed-size data blocks according to specific rules and stored in object storage, while the metadata for those blocks is stored in a database such as Redis or MySQL.

OSCHINA invited Su Rui, a partner at Juicedata, to discuss distributed file systems with the community.

Guest profile

Su Rui is a partner at Juicedata and took part in building the distributed file system JuiceFS as its first team member. JuiceFS first won dozens of commercial customers in China and abroad as a SaaS product on global public clouds, and was then open-sourced in January 2021.


Q: Which versions of Redis and MySQL does JuiceFS depend on? How are system security issues taken into account?

A: For Redis, you can refer to the best-practices documentation; for the other engines, popular versions are supported. JuiceFS will try to stay compatible with more engines and versions.

Q: What advantages does JuiceFS, built on object storage, have over traditional distributed file systems built on block storage? What business scenarios is JuiceFS currently used in?

A: JuiceFS is a distributed file system built on object storage, which gives it better elasticity than block-based designs. Object storage can be considered fully elastic, whereas block storage mostly cannot. At present JuiceFS is mainly used in big data, AI, DevOps, and other scenarios that call for a NAS. Many case studies and customer practices are available on juicefs.com.

Q: If a file is split into fixed-size data blocks, how is the order of the blocks guaranteed, and is the block size fixed? Will this cause a lot of memory fragmentation? Does merging the blocks on read take up a lot of memory?

A: Take reading a 1 GB file as an example: in storage it is a sequential read. If the whole file were read and only then returned to the requester, it would occupy more memory and the caller would wait longer. In practice, data is read sequentially in units such as 4K, 64K, 256K, or 1M and returned to the caller continuously as a stream. So storing files as blocks does not hurt performance in JuiceFS; on the contrary, it improves it, because it is easier to read many blocks concurrently and return them to the caller. Of course, that higher throughput is paid for with a little extra memory.

The file is divided into many data blocks stored in object storage, and the key of each block is recorded in the metadata engine. Storing every key directly would take too much space, so 16 blocks are grouped into a chunk and the chunk ID plus offset is stored instead. The block size defaults to 4 MiB, which is a maximum; in practice there will be blocks smaller than this. Fragments are merged. There is no need to worry about reads: storage systems in general also read in pages or blocks of a certain size.
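
The arithmetic behind that layout can be sketched in a few lines of Python. This is only an illustration built from the numbers in the answer (4 MiB maximum block size, 16 blocks per chunk); the function names and the tuple layout are assumptions made for this example, not the actual JuiceFS key format.

```python
from typing import List, Tuple

BLOCK_SIZE = 4 * 1024 * 1024                 # 4 MiB default maximum block size
BLOCKS_PER_CHUNK = 16                        # 16 blocks are grouped into one chunk
CHUNK_SIZE = BLOCK_SIZE * BLOCKS_PER_CHUNK   # therefore 64 MiB per chunk


def split_into_blocks(file_size: int) -> List[Tuple[int, int, int]]:
    """Return (chunk_index, block_index_in_chunk, block_length) per block.

    The list is in file order, so concatenating the blocks in this order
    reproduces the file. Only the final block can be shorter than BLOCK_SIZE.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(BLOCK_SIZE, file_size - offset)
        chunk_index = offset // CHUNK_SIZE
        block_index = (offset % CHUNK_SIZE) // BLOCK_SIZE
        blocks.append((chunk_index, block_index, length))
        offset += length
    return blocks


def locate(offset: int) -> Tuple[int, int, int]:
    """Map a byte offset to (chunk_index, block_index, offset_in_block)."""
    chunk_index, within_chunk = divmod(offset, CHUNK_SIZE)
    block_index, offset_in_block = divmod(within_chunk, BLOCK_SIZE)
    return chunk_index, block_index, offset_in_block


# A 10 MiB file needs two full blocks and one 2 MiB block, all in chunk 0.
print(split_into_blocks(10 * 1024 * 1024))
# A 1 GiB file spans 256 blocks.
print(len(split_into_blocks(1024 ** 3)))
```

Because a block's position follows from the chunk index and the block's index within the chunk, ordering does not depend on when each block was uploaded, and a reader can fetch several blocks in parallel and still reassemble them correctly.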

Q: Considering access speed and the difficulty of repair and migration, do you have any recommendations for choosing the underlying file system?

A: Access speed is easy to judge: test and evaluate it against your own business scenario. What I want to stress is that in distributed storage, scores from benchmark tools differ from performance in real business scenarios, so you must test with your actual workload. As for migration, it is not difficult when common, popular access protocols are used. For example, JuiceFS is fully compatible with POSIX, HDFS, and S3, and it also provides juicesync, a data migration tool compatible with dozens of object storage APIs. What needs careful consideration is the impact of the migration on the business.

Q: Could you briefly talk about JuiceFS's replica placement strategy across nodes, racks, data centers, and regions, and the level of availability it achieves? Specifically: the implementation of the consistency protocol and the metadata management involved; the timeliness and correctness of primary switchover in a cluster; the primary-replica synchronization strategy in various failure scenarios; and comprehensive, efficient load balancing (space/throughput/replicas).

A: JuiceFS Community Edition uses popular database engines, including Redis, MySQL, PostgreSQL, and TiKV, each of which has a large body of operational experience behind it. For deployment and operations, follow the solutions recommended by each database engine's community; JuiceFS provides best practices where necessary, such as for using Redis as the metadata engine.

Q: Does JuiceFS expose an API for key-value storage, and which programming languages are supported? Also, how efficient is writing large numbers of small files?

A: You can read and write JuiceFS data through the S3 API; refer to the documentation. If you access it in POSIX mode, every programming language is supported natively. Language-specific SDKs currently include the Java SDK (which is compatible with the HDFS API).

Q: How does splitting perform for large files?

A: Please refer to the documentation on the advantages of JuiceFS.

Q: 1. What is the difference between JuiceFS and NAS/NFS storage, and what are the advantages of JuiceFS? 2. What design pattern does JuiceFS use?

A: 1. Compared with NAS, JuiceFS is designed for cloud environments; has elastic capacity; offers richer access methods, including POSIX, HDFS, S3, and K8s CSI; suits big data, AI, DevOps, and many other scenarios that call for a NAS; and is more cost-effective. 2. The architecture separates metadata from data.

Q: Is it the same kind of product as Ceph? If so, what are its advantages?

A: JuiceFS and CephFS belong to the same class of products; please refer to the documentation for details.

Q: I read an article about using JuiceFS to speed up MySQL data backup by 10 times. The performance analysis tool that ships with JuiceFS, used in that article, is really good. I hope Mr. Su can share its design and architecture.

A: You can refer to the documentation, and then look at the corresponding code.

Q: What are the main application scenarios of JuiceFS?

A: JuiceFS can be applied to scenarios such as big data, AI, container platforms, and DevOps. For details, please refer to the documentation or follow the official account Juicedata.

Q: I saw comparisons between JuiceFS and Ceph, Alluxio, and other products in the official documentation, but I feel the actual competitors may be MinIO or Ozone. Have you compared JuiceFS with these two? I have recently been doing technical research on object storage, mainly for storing photos, audio and video, and text files. I would appreciate your advice, thank you!

A: MinIO and Ozone are both object stores, while JuiceFS is a file system; you can use them as the backend persistence layer for JuiceFS. If your requirements are only storage and access, with no computing, analysis, or training workloads, using object storage directly is the better fit.

Q: Is JuiceFS suitable for storing massive numbers of small files? How does it perform? Our customers have historical data amounting to hundreds of millions of folders and files.

A: Redis is recommended as the metadata engine for fewer than 200 million files, and TiKV is suitable for larger scales. Performance depends on the scenario. For read-heavy scenarios such as training, the caching mechanism speeds up access and performance is good.

For 100 million files, Redis, MySQL, and TiKV are all fine. The choice depends on which system you are more familiar with and have more experience maintaining. Performance has to be judged by the business scenario and access pattern.

Q: How does the underlying implementation ensure file integrity in the distributed file system?

A: A few simple points: 1. Data is written first and metadata second, which ensures that the data is complete by the time the metadata write succeeds; 2. Metadata consistency is guaranteed by the transactional capability of the engine; 3. After the write, the reliability of the data and the metadata is guaranteed by the object storage and the metadata engine respectively; 4. The checksum capability provided by the object storage is used.
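
The ordering in points 1, 2, and 4 can be sketched as follows. This is a toy model, not JuiceFS code: an in-memory dictionary stands in for the object storage, SQLite stands in for the metadata engine, a locally computed SHA-256 stands in for the checksum capability of the object storage, and the class, function, table, and key names are all hypothetical.

```python
import hashlib
import sqlite3
from typing import List


class ObjectStore:
    """Hypothetical in-memory stand-in for an object storage bucket."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> str:
        self._objects[key] = data
        return hashlib.sha256(data).hexdigest()   # checksum of what was stored

    def get(self, key: str) -> bytes:
        return self._objects[key]


def new_meta_db() -> sqlite3.Connection:
    """Create the stand-in metadata engine with one table of block records."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE blocks (path TEXT, idx INTEGER, key TEXT, checksum TEXT,"
        " PRIMARY KEY (path, idx))"
    )
    return db


def write_file(store: ObjectStore, meta: sqlite3.Connection,
               path: str, blocks: List[bytes]) -> None:
    # Step 1: upload every data block. If this fails partway, no metadata
    # exists yet, so readers never see a half-written file.
    entries = []
    for idx, data in enumerate(blocks):
        key = f"{path}#{idx}"                     # hypothetical key format
        entries.append((path, idx, key, store.put(key, data)))
    # Step 2: record all blocks in one transaction. The connection context
    # manager commits on success and rolls back if an exception is raised.
    with meta:
        meta.executemany("INSERT INTO blocks VALUES (?, ?, ?, ?)", entries)


def read_file(store: ObjectStore, meta: sqlite3.Connection, path: str) -> bytes:
    rows = meta.execute(
        "SELECT key, checksum FROM blocks WHERE path = ? ORDER BY idx", (path,)
    ).fetchall()
    out = bytearray()
    for key, checksum in rows:
        data = store.get(key)
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError(f"checksum mismatch for block {key}")
        out += data
    return bytes(out)


store, meta = ObjectStore(), new_meta_db()
write_file(store, meta, "/data/a.bin", [b"first block ", b"second block"])
print(read_file(store, meta, "/data/a.bin"))
```

The point of the ordering is that the metadata commit acts as the moment the file becomes visible; anything uploaded before a failed commit is at worst orphaned data that can be cleaned up, never a visible file with missing blocks.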

Q: Is this similar to FastDFS for file storage? What is the difference? Which one is easier to use in an enterprise today?

A: JuiceFS is designed for the cloud, so it is easy to use in cloud environments and scales elastically by nature. It supports POSIX, HDFS, and S3 access, which makes it more convenient to develop applications on top of it, especially when building various pipelines.

Q: If the database is corrupted and a backup made a day earlier is restored, can the file data be restored to its state of the day before, and can the system keep working?

A: Partial recovery is possible. Databases also offer finer-grained disaster recovery mechanisms, such as Redis AOF and MySQL binlog, which minimize the risk of data loss. In addition, a primary-standby setup is also used in production environments. Note that when a file is deleted, both its metadata and its data in the object storage are deleted. So if the metadata is rolled back to the previous day's state, the data of files deleted since then cannot be recovered, and those files cannot be read.

Q: Would it be possible to store a binlog-like log in the object storage, so that when the database data is lost and there is no backup, the whole database can be rebuilt from that log?

A: The new version adds automatic backup of metadata to object storage, which provides one more backup.

Q: I think having a distributed file system depend on a relational database is a design mistake. It should be the other way around: the relational database should depend on the distributed file system. That way a single database has effectively unlimited disk space behind it. However much data the application has, the relational database can cope by adding disks and adjusting the distributed file system's configuration, without splitting databases or tables. What does Mr. Su think?

A: I understand your point, and I partly agree. At a high level, the architecture of JuiceFS is consistent with the Google File System paper: metadata and data are managed separately, an idea that many other distributed file systems also adopt. The difference lies in the low-level design, where most DFSs seen today are designed for raw disks.

But in the cloud era, we stepped back and looked at the problem afresh, drawing on our own ten-plus years of experience using DFSs in production. A DFS carries a very rich variety of workloads, with differing requirements along dimensions such as scale, performance, durability, availability, and cost. We tried to answer one question: how can a single product serve more, and more varied, business scenarios better and more flexibly?

So in the architecture of JuiceFS, object storage was chosen as the data persistence layer (the counterpart of a chunk server or data node), because object storage offers mature scalability, availability, high throughput, and low cost, which are exactly the core capabilities a data persistence layer needs.

The metadata engine is designed as a plug-in. Redis, MySQL, TiKV, and FoundationDB each have their own strengths and can meet the needs of different business scenarios. Another advantage of choosing these mature, popular systems is that developers have already accumulated a lot of experience using and operating them, so no additional learning burden is imposed. This is the philosophy behind JuiceFS as a community product: a friendly user experience, a low learning threshold, and a design that fits well without going to extremes (we have another edition designed for the extreme).
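
To make the plug-in idea concrete, here is a minimal Python sketch of a file system layer that talks to an abstract metadata interface and picks the implementation from a URL scheme. Everything in it is an assumption for illustration: the MetaEngine interface, the two toy engines, and the memory:// and sqlite:// schemes are invented for this example and do not describe JuiceFS's real interface.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type
from urllib.parse import urlparse


class MetaEngine(ABC):
    """Hypothetical minimal interface a metadata plug-in must implement."""

    @abstractmethod
    def set(self, path: str, value: str) -> None: ...

    @abstractmethod
    def get(self, path: str) -> Optional[str]: ...


class MemoryEngine(MetaEngine):
    """Toy engine backed by a dict; metadata is lost on exit."""

    def __init__(self, location: str) -> None:
        self._data: Dict[str, str] = {}

    def set(self, path: str, value: str) -> None:
        self._data[path] = value

    def get(self, path: str) -> Optional[str]:
        return self._data.get(path)


class SQLiteEngine(MetaEngine):
    """Toy engine backed by SQLite, standing in for a SQL database."""

    def __init__(self, location: str) -> None:
        self._db = sqlite3.connect(location or ":memory:")
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS meta (path TEXT PRIMARY KEY, value TEXT)"
        )

    def set(self, path: str, value: str) -> None:
        with self._db:   # commit on success, roll back on error
            self._db.execute(
                "INSERT OR REPLACE INTO meta VALUES (?, ?)", (path, value)
            )

    def get(self, path: str) -> Optional[str]:
        row = self._db.execute(
            "SELECT value FROM meta WHERE path = ?", (path,)
        ).fetchone()
        return row[0] if row else None


# Registry of available plug-ins, keyed by URL scheme (schemes are made up).
ENGINES: Dict[str, Type[MetaEngine]] = {
    "memory": MemoryEngine,
    "sqlite": SQLiteEngine,
}


def open_meta(url: str) -> MetaEngine:
    """Pick an engine from the URL scheme; callers only see MetaEngine."""
    parsed = urlparse(url)
    if parsed.scheme not in ENGINES:
        raise ValueError(f"unsupported metadata engine: {parsed.scheme!r}")
    return ENGINES[parsed.scheme](parsed.path.lstrip("/"))


for url in ("memory://", "sqlite://"):
    engine = open_meta(url)
    engine.set("/data/a.bin", "chunk 0, 3 blocks")
    print(url, engine.get("/data/a.bin"))
```

With this shape, switching engines is a configuration change rather than a code change, which is what lets users pick the database they already know how to operate.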



Origin my.oschina.net/u/4855753/blog/5403929