GFS (GlusterFS) distributed file system

1. Overview of GlusterFS

GFS (Google File System), BigTable, and MapReduce are known as Google's troika, the cornerstone of many of Google's basic services. (Note that Google's GFS and GlusterFS are different systems that share the abbreviation "GFS"; this article touches on the design of both.)

GFS was proposed in 2003. It is a distributed file system whose design premises differ markedly from those of many earlier distributed systems. It is suited to the following scenarios:

  • It treats component failure as the normal state rather than the exception, and provides fault tolerance and automatic load balancing, so the distributed file system can run on cheap commodity machines.
  • It targets large-file storage: the system's main workload is large-scale streaming reads, and writes are mostly appends, with random writes being rare.
  • Files are written once and read many times, as with web pages stored from the Internet.

1. GlusterFS features

  • Scalability and high performance
  • High availability
  • Globally unified namespace (centralized management; conceptually comparable to a single API)
  • Flexible volume management
  • Based on standard protocols

2. GlusterFS composition

  • An open-source distributed file system
  • Composed of storage servers, clients (not necessarily local), and an NFS/Samba storage gateway
  • No metadata server

3. Composition of a file system

  • A file system interface (API: Application Programming Interface)
  • A collection of software for managing objects
  • Objects and their attributes

Functions of a file system:

  • From the system's point of view, a file system organizes and allocates the space of file storage devices, and is responsible for storing files and for protecting and retrieving the stored files.
  • Concretely, it creates files for users, stores, reads, modifies, and dumps files, and controls access to files.

Mounting a file system:

  • Apart from the root file system, a file system must be mounted at a mount point after it is created before it can be accessed. The mount point is a directory associated with the partition's device file.
  • Analogy: NFS, itself a distributed file system, is accessed by mounting in the same way.

4. GlusterFS terminology

| Term | Explanation |
| --- | --- |
| Brick | The storage unit that actually holds user data: a directory exported by a server in the storage pool |
| Volume | A logical collection of bricks; to users it looks like a "partition" of a local file system |
| FUSE | Filesystem in Userspace. A local kernel file system (e.g. EXT4) writes data to disk itself; for remote GlusterFS, the kernel hands the client's request to FUSE, which passes it on to the GlusterFS client so the data can be stored across the network |
| VFS | The kernel's Virtual File System layer. A user request first goes to the VFS, then to FUSE, then to the GlusterFS client, and finally from the client to the remote storage |
| Glusterd | The service process that runs on every storage node (the client side runs the gluster client); interaction across the whole GlusterFS system is carried out between the gluster client and glusterd |

Summary: using GlusterFS involves all of the virtual file system layers above.

5. GFS (Google File System) architecture

A GFS cluster consists of a single master server (with backups), multiple chunkservers, and clients; the architecture is relatively simple.

  • Chunkserver: a data storage node. Files are divided into fixed-size chunks, each chunk is uniquely identified, and by default each chunk is stored as 3 replicas.
  • Chunk: each file is stored as one or more chunks, and each chunk is stored as an ordinary Linux file. The chunk size is a key design parameter; the default is 64 MB, and each chunk has a globally unique 64-bit identifier.
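To make the chunk layout concrete, here is a minimal sketch in Python (the 64 MB default, 64-bit handle, and 3 replicas come from the description above; the function name and record layout are hypothetical):

```python
import secrets

CHUNK_SIZE = 64 * 1024 * 1024  # default chunk size: 64 MB

def split_into_chunks(file_size: int) -> list[dict]:
    """Plan how a file of the given size maps onto fixed-size chunks."""
    num_chunks = max(1, -(-file_size // CHUNK_SIZE))  # ceiling division; at least one chunk
    return [
        {
            "index": i,                      # position of the chunk within the file
            "handle": secrets.randbits(64),  # globally unique 64-bit chunk identifier
            "replicas": 3,                   # each chunk is stored as 3 copies by default
        }
        for i in range(num_chunks)
    ]

# A 200 MB file occupies 4 chunks: three full 64 MB chunks plus one partial chunk.
print(len(split_into_chunks(200 * 1024 * 1024)))  # -> 4
```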

What are the advantages and disadvantages of a larger chunk size?

  • Advantages: it reduces the amount of metadata the master must hold, letting it all fit in memory, and it reduces interaction between the client and the master. (Per the GFS paper, the master keeps less than 64 bytes of metadata per 64 MB chunk, so even 1 PB of data needs only about 1 GB of master memory.)

  • Disadvantage: a small file is stored as a single chunk, so when many clients operate on the same small file at the same time, the chunkserver holding that chunk becomes a hot spot.

  • GFS master: manages all metadata; handles lease management, chunk migration across the cluster, and reclamation of orphaned chunks.

  • Metadata falls into three main categories: the namespace, the file-to-chunk mapping (which chunks a file contains), and the location of each chunk replica. All metadata is kept in memory to speed up processing. The first two categories are also persisted to an operation log on the master's local disk and replicated to remote backup masters, which is what disaster recovery relies on. Chunk replica locations are not persisted: the master obtains them by polling each chunkserver through heartbeats, which avoids the data-synchronization problem between the master and chunkservers when chunkservers change.

  • GFS client: communicates with the master only to obtain metadata; all data operations interact directly with the chunkservers. GFS provides a set of API interfaces similar to those of a traditional file system.
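A minimal sketch of this division of labor, with hypothetical names throughout: the master holds the three metadata categories in memory, and the client contacts it only for lookups, fetching bulk data directly from a chunkserver:

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024

@dataclass
class ChunkInfo:
    handle: int                                          # globally unique 64-bit identifier
    locations: list[str] = field(default_factory=list)   # chunkserver addresses; rebuilt from
                                                         # heartbeats, never persisted

@dataclass
class Master:
    namespace: set[str] = field(default_factory=set)                 # 1. namespace (logged to disk)
    file_chunks: dict[str, list[int]] = field(default_factory=dict)  # 2. file -> chunk handles (logged)
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)       # 3. handle -> replica locations

    def on_heartbeat(self, server: str, handles: list[int]) -> None:
        """Refresh replica locations reported by a chunkserver; no persistence needed."""
        for h in handles:
            info = self.chunks.setdefault(h, ChunkInfo(handle=h))
            if server not in info.locations:
                info.locations.append(server)

    def lookup(self, path: str, chunk_index: int) -> ChunkInfo:
        """Metadata-only RPC: which chunk holds this part of the file, and where is it?"""
        return self.chunks[self.file_chunks[path][chunk_index]]

def gfs_read(master: Master, chunkservers: dict, path: str, offset: int, size: int) -> bytes:
    """Client read path: metadata from the master, then data straight from a chunkserver."""
    info = master.lookup(path, offset // CHUNK_SIZE)
    replica = chunkservers[info.locations[0]]  # pick any replica, e.g. the closest
    # Bulk data flows directly between client and chunkserver, bypassing the master.
    return replica.read(info.handle, offset % CHUNK_SIZE, size)
```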

6. Working principle of GlusterFS

  • This is a typical C/S (Client/Server) architecture.

  • 1. A client or application accesses the data through the GlusterFS mount point.

  • 2. The Linux kernel receives the request and processes it through the VFS API.

  • 3. The VFS hands the request to the FUSE kernel module, and FUSE passes it to the GlusterFS client process through the /dev/fuse device file.

  • 4. After receiving the request, the GlusterFS client processes the data according to its configuration file.

  • 5. The data is passed over the network to the remote GlusterFS server and written to the server's storage device.
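The role FUSE plays in steps 3 and 4 can be demonstrated in miniature with the third-party fusepy package (`pip install fusepy`): a user-space process answers requests the kernel forwards through /dev/fuse. This read-only passthrough is only a sketch of the mechanism, not the actual GlusterFS client:

```python
import os
from fuse import FUSE, Operations  # third-party fusepy package

class Passthrough(Operations):
    """Serve an existing directory at a new mount point via /dev/fuse."""

    def __init__(self, root):
        self.root = root

    def _full(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def getattr(self, path, fh=None):
        st = os.lstat(self._full(path))
        return {k: getattr(st, k) for k in ("st_mode", "st_size", "st_uid", "st_gid",
                                            "st_atime", "st_mtime", "st_ctime", "st_nlink")}

    def readdir(self, path, fh):
        return [".", ".."] + os.listdir(self._full(path))

    def read(self, path, size, offset, fh):
        with open(self._full(path), "rb") as f:
            f.seek(offset)
            return f.read(size)

# Every access under /mnt/demo then travels: application -> VFS -> FUSE -> this process.
# FUSE(Passthrough("/srv/data"), "/mnt/demo", foreground=True)
```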

2. Stacked architecture


  • Complex functions are realized by combining modules with different functions.
  • By loading the first three modules below and then combining multiple GFS client ends, the required distributed volumes, striped volumes, and so on are formed.
| Module | Explanation |
| --- | --- |
| VFS | The kernel's virtual file system; receives and processes requests through the VFS API and, based on the request, loads the modules below |
| I/O cache | I/O caching |
| read ahead | Kernel file read-ahead |
| distribute/stripe | Distributed and striped volume logic |
| gige | Gigabit Ethernet / gigabit interface |
| TCP/IP | Network protocol |
| InfiniBand | Unlike TCP/IP, which retransmits lost packets and may therefore slow communication, IB uses a credit-based flow-control mechanism to guarantee connection integrity, so packets are rarely lost |
| RDMA | Responsible for data transmission; a transfer protocol intended to eliminate the latency of client- and server-side data processing during transmission |
| POSIX | Portable Operating System Interface, which addresses portability across operating systems |
  • The request is finally converted to operations on logical storage (EXT4 + brick).
  • This architectural model improves the working efficiency of GlusterFS.
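The stacking idea itself can be sketched as plain function composition, each module wrapping the one below it (module behavior is drastically simplified here, and all names are illustrative):

```python
from typing import Callable

Read = Callable[[str, int, int], bytes]  # (path, offset, size) -> data

def io_cache(next_read: Read) -> Read:
    """Module: cache previously read blocks."""
    cache: dict[tuple, bytes] = {}
    def read(path, offset, size):
        key = (path, offset, size)
        if key not in cache:
            cache[key] = next_read(path, offset, size)
        return cache[key]
    return read

def read_ahead(next_read: Read, window: int = 4096) -> Read:
    """Module: fetch a little extra data, anticipating sequential reads."""
    def read(path, offset, size):
        return next_read(path, offset, size + window)[:size]
    return read

def brick(path: str, offset: int, size: int) -> bytes:
    """Bottom of the stack: the brick's local file system (EXT4 + brick)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

# Stacked like the table above: request -> io-cache -> read-ahead -> brick.
client_read = io_cache(read_ahead(brick))
```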

1. Benefits of even distribution

  • As the amount of data grows, each storage node holds a (probabilistically) equal share of it. To guard against single points of failure, GlusterFS applies a backup mechanism when data is stored on a node, with a default of 3 copies, so GlusterFS's own mechanism adds redundancy to the data to cope with single-point failures.

2. Elastic HASH algorithm

  • The HASH algorithm maps each input to a fixed-length value (here, a 32-bit integer).
  • Under normal circumstances, different inputs yield different results.
  • The HASH algorithm is used to reduce the complexity of indexing and locating data in a distributed file system.

Advantages:

  • It ensures that data is evenly distributed across the bricks;
  • It removes the dependence on a metadata server, thereby eliminating that single point of failure and access bottleneck.
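A minimal sketch of range-based brick selection (GlusterFS's DHT actually uses the Davies-Meyer hash and stores each brick's hash range in directory extended attributes; here MD5 truncated to 32 bits stands in, and the brick names are made up):

```python
import hashlib

def elastic_hash(filename: str) -> int:
    """Map a file name to a fixed-length 32-bit integer."""
    return int.from_bytes(hashlib.md5(filename.encode()).digest()[:4], "big")

def pick_brick(filename: str, bricks: list[str]) -> str:
    """Divide the 32-bit hash space into equal ranges, one range per brick."""
    width = 2**32 // len(bricks)
    index = min(elastic_hash(filename) // width, len(bricks) - 1)
    return bricks[index]

bricks = ["server1:/brick", "server2:/brick", "server3:/brick", "server4:/brick"]
for name in ("a.log", "b.log", "c.log"):
    print(name, "->", pick_brick(name, bricks))
# No metadata server is consulted: any client can recompute the placement by itself.
```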

3. The seven types of GlusterFS volumes

  • GlusterFS supports seven types of volumes: distributed volumes, striped volumes, replicated volumes, distributed striped volumes, distributed replicated volumes, striped replicated volumes, and distributed striped replicated volumes.

1. Distribute volume

  • Also known as a hash volume, it is similar to RAID 0. Files are not fragmented; each file is written whole to one node's disk according to the hash algorithm. The advantage is large capacity; the disadvantage is no redundancy.
  • Files are distributed across the brick servers by the HASH algorithm. This volume type is the foundation of GlusterFS. Because hashing to bricks is done per file, it only expands disk space: if one disk is damaged, the data on it is lost. It amounts to file-level RAID 0 and has no fault tolerance.
  • The distributed volume is GlusterFS's default: creating a volume with no extra options produces a distributed volume. In this mode files are not split into blocks; each file is stored whole on one server node, directly in its local file system, so most Linux commands and tools continue to work normally. HASH values are saved in extended file attributes. The underlying file systems currently supported include ext3, ext4, ZFS, and XFS.
  • Because the local file system is used, access efficiency is not improved, and is in fact reduced by the network communication involved. Very large files are also hard to support, since distributed volumes do not split files into blocks: although ext4 already supports single files up to 16 TB, the capacity of a local storage device is limited.

Distributed volumes have the following characteristics:

  • Files are distributed across different servers, with no redundancy;
  • The volume can be expanded easily and cheaply;
  • A single point of failure causes data loss;
  • Data protection relies on the underlying file system;

2. Stripe volume

  • Similar to RAID 0: files are divided into data blocks and distributed to multiple brick servers in round-robin fashion. Files are stored as data blocks, large files are supported, and the larger the file, the higher the read efficiency;
  • Stripe mode is equivalent to RAID 0. In this mode a file is divided into N blocks (for N stripe nodes) according to offset and stored round-robin across the brick server nodes. Each node stores its blocks as ordinary files in its local file system and records the total number of blocks and each block's sequence number in extended attributes. The stripe count specified at creation time must equal the number of brick servers in the volume. Performance is especially good when storing large files, but there is no redundancy.

Striped volumes have the following characteristics:

  • Data is divided into smaller blocks and distributed across the bricks in the server group;
  • Distribution reduces load, and the smaller blocks speed up access;
  • There is no data redundancy;
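A sketch of the round-robin block placement described above (the block size and brick names are illustrative; a real striped volume records the block count and sequence numbers in extended attributes):

```python
BLOCK_SIZE = 128 * 1024  # illustrative stripe block size

def stripe_plan(data: bytes, bricks: list[str]) -> list[tuple[str, int, bytes]]:
    """Split a file into blocks and deal them out to the bricks in turn."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # Each entry records the target brick, the block's sequence number, and its data.
    return [(bricks[n % len(bricks)], n, block) for n, block in enumerate(blocks)]

plan = stripe_plan(b"x" * 500_000,
                   ["server1:/brick", "server2:/brick", "server3:/brick", "server4:/brick"])
print([(brick, n) for brick, n, _ in plan])  # blocks 0..3 land on bricks 1..4, then wrap
```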

3. Replica volume

  • Files are synchronized to multiple bricks, so each file has multiple copies; this is file-level RAID 1 and is fault tolerant. Because every brick holds a copy, reads can be spread across the bricks and read performance improves greatly, but write performance drops since every write must go to all copies;
  • Replication mode, also known as AFR (Automatic File Replication), is equivalent to RAID 1: one or more copies of the same file are kept, and every node stores the same content and directory structure. In replication mode disk utilization is low because copies must be stored, and if the storage space of the nodes differs, then by the barrel effect the capacity of the smallest node becomes the volume's total capacity. When configuring a replicated volume, the replica count must equal the number of brick servers in the volume. Replicated volumes are redundant: even if one node fails, the data remains usable.

Replicated volumes have the following characteristics:

  • Every server in the volume keeps a complete copy of the data;
  • The number of replicas can be chosen when the volume is created;
  • At least two brick servers are required;
  • It provides redundancy;
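A sketch of the two behaviors described above: every write fans out to all bricks, and the barrel effect caps capacity at the smallest brick (the sizes are made up):

```python
def replica_write(data: bytes, bricks: list[str]) -> dict[str, bytes]:
    """RAID 1 style: every brick receives a full copy of the file."""
    return {brick: data for brick in bricks}

def replica_capacity(brick_sizes_gb: list[int]) -> int:
    """Barrel effect: the smallest brick determines the volume's capacity."""
    return min(brick_sizes_gb)

print(replica_capacity([500, 500, 250]))  # -> 250 GB, despite 1250 GB of raw disk
```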

4. Distribute Stripe volume

  • The number of brick servers is a multiple of the stripe count (the number of bricks a file's blocks are distributed across), so it has the characteristics of both distributed volumes and striped volumes;
  • Distributed striped volumes combine the functions of distributed volumes and striped volumes and are mainly used for processing access to large files. Creating a distributed striped volume requires at least 4 servers.
  • When creating a volume, if the number of storage servers equals the stripe or replica count, a plain striped or replicated volume is created; if the number of storage servers is two or more times the stripe or replica count, a distributed striped volume or a distributed replicated volume is created.

5. Distribute Replica volume

  • The number of brick servers is a multiple of the replica count (the number of data copies), so it has the characteristics of both distributed volumes and replicated volumes;
  • Distributed replicated volumes combine the functions of distributed volumes and replicated volumes and are mainly used when redundancy is required.
  • For example, with 8 servers and replica 2, in server-list order servers 1 and 2 hold one copy set, servers 3 and 4 another, servers 5 and 6 another, and servers 7 and 8 another; with replica 4, servers 1/2/3/4 form one copy set and servers 5/6/7/8 the other.
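The 8-server example can be sketched directly: consecutive servers form replica sets, and an elastic-hash-style range lookup picks one set per file (MD5 truncated to 32 bits again stands in for the real hash, and the server names are illustrative):

```python
import hashlib

def replica_sets(servers: list[str], replica: int) -> list[list[str]]:
    """Group servers into replica sets in list order: replica 2 -> (1,2), (3,4), ..."""
    return [servers[i:i + replica] for i in range(0, len(servers), replica)]

def place_file(filename: str, servers: list[str], replica: int) -> list[str]:
    """Hash the file to one replica set; every brick in that set gets a full copy."""
    sets = replica_sets(servers, replica)
    h = int.from_bytes(hashlib.md5(filename.encode()).digest()[:4], "big")
    width = 2**32 // len(sets)
    return sets[min(h // width, len(sets) - 1)]

servers = [f"server{i}" for i in range(1, 9)]
print(replica_sets(servers, 2))         # [['server1', 'server2'], ['server3', 'server4'], ...]
print(place_file("a.log", servers, 4))  # either servers 1-4 or servers 5-8, all holding a copy
```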

6. Stripe Replica volume

  • Similar to RAID 10, combining the characteristics of striped volumes and replicated volumes;

7. Distribute Stripe Replica volume

  • A composite of the three basic volume types, usually used in MapReduce-like applications;

Origin: blog.csdn.net/LI_MINGXUAN/article/details/114276854