GlusterFS architecture [translation of the official docs]

GlusterFS in detail

If the formatting here looks off, you can read this post on [my Jianshu page](https://www.jianshu.com/p/0340e429431b).
Original doc: https://docs.gluster.org/en/latest/Quick-Start-Guide/Architecture/
⚠️ This article is a translation of the official documentation, kept here for easier reference. Please point out anything translated incorrectly; articles walking through the architecture and source code will follow later.

FUSE

GlusterFS is a userspace filesystem. This was a decision made by the GlusterFS developers initially as getting the modules into linux kernel is a very long and difficult process.

Being a userspace filesystem, to interact with kernel VFS, GlusterFS makes use of FUSE (File System in Userspace). For a long time, implementation of a userspace filesystem was considered impossible. FUSE was developed as a solution for this. FUSE is a kernel module that support interaction between kernel VFS and non-privileged user applications and it has an API that can be accessed from userspace. Using this API, any type of filesystem can be written using almost any language you prefer as there are many bindings between FUSE and other languages.

image

This shows a filesystem "hello world" that is compiled to create a binary "hello". It is executed with a filesystem mount point /tmp/fuse. Then the user issues a command ls -l on the mount point /tmp/fuse. This command reaches VFS via glibc and since the mount /tmp/fuse corresponds to a FUSE based filesystem, VFS passes it over to FUSE module. The FUSE kernel module contacts the actual filesystem binary "hello" after passing through glibc and FUSE library in userspace(libfuse). The result is returned by the "hello" through the same path and reaches the ls -l command.

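To make this flow concrete, here is a minimal sketch of such a "hello" filesystem against the classic libfuse 2.x API. It is essentially libfuse's well-known hello example, shown only to illustrate the FUSE side of the picture; it is not GlusterFS code.

```c
/*
 * hello.c - a tiny FUSE filesystem exposing a single read-only file /hello.
 * Build: gcc -Wall hello.c -o hello $(pkg-config fuse --cflags --libs)
 * Run:   ./hello /tmp/fuse        (then try: ls -l /tmp/fuse; cat /tmp/fuse/hello)
 */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *hello_str  = "Hello World!\n";
static const char *hello_path = "/hello";

/* called for stat()-like requests, e.g. the ls -l in the walkthrough above */
static int hello_getattr(const char *path, struct stat *stbuf)
{
    memset(stbuf, 0, sizeof(*stbuf));
    if (strcmp(path, "/") == 0) {
        stbuf->st_mode  = S_IFDIR | 0755;
        stbuf->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        stbuf->st_mode  = S_IFREG | 0444;
        stbuf->st_nlink = 1;
        stbuf->st_size  = strlen(hello_str);
    } else {
        return -ENOENT;
    }
    return 0;
}

/* lists the single file in the root directory */
static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    (void)offset; (void)fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0);
    return 0;
}

/* serves the file's contents */
static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    size_t len = strlen(hello_str);
    (void)fi;
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return size;
}

static struct fuse_operations hello_oper = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* libfuse parses the mount point from argv and talks to /dev/fuse for us */
    return fuse_main(argc, argv, &hello_oper, NULL);
}
```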

The communication between FUSE kernel module and the FUSE library(libfuse) is via a special file descriptor which is obtained by opening /dev/fuse. This file can be opened multiple times, and the obtained file descriptor is passed to the mount syscall, to match up the descriptor with the mounted filesystem.

More about userspace filesystems

FUSE reference

Translators

A translator converts requests from users into requests for storage.

  • One to one, one to many, one to zero (e.g. caching)
image

  • Translators can modify requests on the way through:
    • convert one request type to another (during the request's transfer amongst the translators)
    • modify paths, flags, even data (e.g. encryption)
  • Translators can intercept or block the requests (e.g. access control).
  • Or spawn new requests (e.g. pre-fetch).

How Do Translators Work?

  • Shared objects
  • Dynamically loaded according to a 'volfile'
    • dlopen / dlsym
    • setup pointers to parents / children
    • call init (constructor)
    • call IO functions through the fops (a sketch of this loading pattern follows this list)
  • Conventions for validating / passing options, etc.
  • The configuration of translators (since GlusterFS 3.1) is managed through the gluster command line interface (cli), so you don't need to know in what order to graph the translators together.

Types of Translators

List of known translators with their current status.

| Translator Type | Functional Purpose |
| --- | --- |
| Storage | Lowest level translator, stores and accesses data from local file system. |
| Debug | Provide interface and statistics for errors and debugging. |
| Cluster | Handle distribution and replication of data as it relates to writing to and reading from bricks & nodes. |
| Encryption | Extension translators for on-the-fly encryption/decryption of stored data. |
| Protocol | Extension translators for client/server communication protocols. |
| Performance | Tuning translators to adjust for workload and I/O profiles. |
| Bindings | Add extensibility, e.g. the Python interface written by Jeff Darcy to extend API interaction with GlusterFS. |
| System | System access translators, e.g. interfacing with file system access control. |
| Scheduler | I/O schedulers that determine how to distribute new write operations across clustered systems. |
| Features | Add additional features such as Quotas, Filters, Locks, etc. |

The default / general hierarchy of translators in vol files:
image

All the translators hooked together to perform a function is called a graph. The left-set of translators comprises of Client-stack. The right-set of translators comprises of Server-stack.

GlusterFS translators can be sub-divided into many categories, but two important categories are Cluster and Performance translators:

One of the most important translators, and the first one the data/request has to go through, is the FUSE translator, which falls under the category of Mount translators.

  1. Cluster Translators:
  • DHT (Distributed Hash Table)
  • AFR (Automatic File Replication)
  2. Performance Translators:
  • io-cache
  • io-threads
  • md-cache
  • O-B (open behind)
  • QR (quick read)
  • r-a (read-ahead)
  • w-b (write-behind)

For example, viewing the output of gluster volume info:

[root@node4 /]# gluster volume info
 
Volume Name: heketidbstorage
Type: Replicate
Volume ID: d141e423-cc06-4fa5-a7ef-5edbc1b405ce
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.8.4.184:/var/lib/heketi/mounts/vg_cbdb3ac545853d57be6c39db60b7f647/brick_661b5cdd36ee50aaeeeb3490737f2851/brick
Brick2: 10.8.4.182:/var/lib/heketi/mounts/vg_ddfdf2348bf510c356d5234e0ed0a0ec/brick_5ca55df9bbb54e531d6fc4205b297503/brick
Brick3: 10.8.4.183:/var/lib/heketi/mounts/vg_2ceb6870ad884b6507a767308980cf8a/brick_eaa40efdc196a5a18158c79e0b1b0459/brick
Options Reconfigured:
user.heketi.id: 4550c84383c151de59bd6679e73b9117
user.heketi.dbstoragelevel: 1
performance.readdir-ahead: off
performance.io-cache: off
performance.read-ahead: off
performance.strict-o-direct: on
performance.quick-read: off
performance.open-behind: off
performance.write-behind: off
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Other Feature Translators include:

  • changelog
  • locks - GlusterFS has a locks translator which provides internal locking operations called inodelk and entrylk, which are used by AFR to achieve synchronization of operations on files or directories that conflict with each other.
  • marker
  • quota

Debug Translators

  • trace - To trace the error logs generated during the communication amongst the translators.
  • io-stats

DHT (Distributed Hash Table) Translator

What is DHT?

DHT is the real core of how GlusterFS aggregates capacity and performance across multiple servers. Its responsibility is to place each file on exactly one of its subvolumes – unlike either replication (which places copies on all of its subvolumes) or striping (which places pieces onto all of its subvolumes). It’s a routing function, not splitting or copying.

How DHT works?

The basic method used in DHT is consistent hashing. Each subvolume (brick) is assigned a range within a 32-bit hash space, covering the entire range with no holes or overlaps. Then each file is also assigned a value in that same space, by hashing its name. Exactly one brick will have an assigned range including the file’s hash value, and so the file “should” be on that brick. However, there are many cases where that won’t be the case, such as when the set of bricks (and therefore the range assignment of ranges) has changed since the file was created, or when a brick is nearly full. Much of the complexity in DHT involves these special cases, which we’ll discuss in a moment.

When you open() a file, the distribute translator is giving one piece of information to find your file, the file-name. To determine where that file is, the translator runs the file-name through a hashing algorithm in order to turn that file-name into a number.

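A toy version of this routing step might look like the following. The hash function here (FNV-1a) is only a stand-in for GlusterFS's own, and real layouts are stored as extended attributes on directories rather than hard-coded:

```c
/* Sketch of DHT-style routing: each brick owns a slice of the 32-bit hash
 * space, and a file is placed on whichever brick's range contains the hash
 * of its name. Illustrative only. */
#include <stdint.h>
#include <stdio.h>

struct brick_range {
    const char *brick;
    uint32_t    start;   /* inclusive */
    uint32_t    end;     /* inclusive */
};

/* FNV-1a, used here only as a simple, well-known 32-bit hash */
static uint32_t fnv1a_32(const char *name)
{
    uint32_t h = 2166136261u;
    for (; *name; name++) {
        h ^= (unsigned char)*name;
        h *= 16777619u;
    }
    return h;
}

static const char *route(const struct brick_range *layout, int n, const char *filename)
{
    uint32_t h = fnv1a_32(filename);
    for (int i = 0; i < n; i++)
        if (h >= layout[i].start && h <= layout[i].end)
            return layout[i].brick;
    return NULL;  /* a hole in the layout, e.g. a missing brick */
}

int main(void)
{
    /* three bricks covering 0x00000000..0xffffffff with no holes or overlaps */
    struct brick_range layout[] = {
        { "brick-0", 0x00000000u, 0x55555554u },
        { "brick-1", 0x55555555u, 0xaaaaaaa9u },
        { "brick-2", 0xaaaaaaaau, 0xffffffffu },
    };
    printf("file.txt -> %s\n", route(layout, 3, "file.txt"));
    return 0;
}
```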

A few Observations of DHT hash-values assignment:

  • The assignment of hash ranges to bricks is determined by extended attributes stored on directories, hence distribution is directory-specific.
  • Consistent hashing is usually thought of as hashing around a circle, but in GlusterFS it’s more linear. There’s no need to “wrap around” at zero, because there’s always a break (between one brick’s range and another’s) at zero.
  • If a brick is missing, there will be a hole in the hash space. Even worse, if hash ranges are reassigned while a brick is offline, some of the new ranges might overlap with the (now out of date) range stored on that brick, creating a bit of confusion about where files should be.

AFR (Automatic File Replication) Translator

The Automatic File Replication (AFR) translator in GlusterFS makes use of the extended attributes to keep track of the file operations. It is responsible for replicating the data across the bricks.

RESPONSIBILITIES OF AFR

Its responsibilities include the following:

  1. Maintain replication consistency (i.e. data on both the bricks should be the same, even in the cases where there are operations happening on the same file/directory in parallel from multiple applications/mount points, as long as all the bricks in the replica set are up).

  2. Provide a way of recovering data in case of failures as long as there is at least one brick which has the correct data (see the sketch after this list).

  3. Serve fresh data for read/stat/readdir etc.
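
As a toy sketch of the first two responsibilities, the fragment below fans a write out to every brick in a replica set and remembers which copies missed it. The "pending" counter stands in for the extended attributes AFR actually uses; none of these names come from GlusterFS.

```c
/* toy fan-out write across a replica set; illustrative only */
#include <stdbool.h>

#define REPLICAS 3

struct replica {
    const char *brick;
    int         pending;   /* >0 means this copy missed operations and needs healing */
};

/* stand-in for the network call to one brick; always succeeds in this sketch */
static bool brick_write(const struct replica *r, const char *path, const char *data)
{
    (void)r; (void)path; (void)data;
    return true;
}

static int afr_write(struct replica set[REPLICAS], const char *path, const char *data)
{
    int ok = 0;
    for (int i = 0; i < REPLICAS; i++) {
        if (brick_write(&set[i], path, data))
            ok++;
        else
            set[i].pending++;      /* remember this copy needs self-heal later */
    }
    /* recovery stays possible as long as at least one brick has correct data */
    return ok > 0 ? 0 : -1;
}

int main(void)
{
    struct replica set[REPLICAS] = {
        { "brick-0", 0 }, { "brick-1", 0 }, { "brick-2", 0 },
    };
    return afr_write(set, "/file.txt", "hello") == 0 ? 0 : 1;
}
```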

Geo-Replication

Geo-replication provides asynchronous replication of data across geographically distinct locations and was introduced in Glusterfs 3.2. It mainly works across WAN and is used to replicate the entire volume unlike AFR which is intra-cluster replication. This is mainly useful for backup of entire data for disaster recovery.

Geo-replication uses a master-slave model, whereby replication occurs between Master - a GlusterFS volume and Slave - which can be a local directory or a GlusterFS volume. The slave (local directory or volume is accessed using SSH tunnel).

Geo-replication provides an incremental replication service over Local Area Networks (LANs), Wide Area Network (WANs), and across the Internet.

Geo-replication over LAN

You can configure Geo-replication to mirror data over a Local Area Network.

image

Geo-replication over WAN

You can configure Geo-replication to replicate data over a Wide Area Network.

image

Geo-replication over Internet

You can configure Geo-replication to mirror data over the Internet.

image

Multi-site cascading Geo-replication

You can configure Geo-replication to mirror data in a cascading fashion across multiple sites.

image

There are mainly two aspects while asynchronously replicating data:

1. Change detection - These include file-operation necessary details. There are two methods to sync the detected changes:

i. Changelogs - Changelog is a translator which records necessary details for the fops that occur. The changes can be written in binary format or ASCII. There are three categories, with each category represented by a specific changelog format. All three types of categories are recorded in a single changelog file.

Entry - create(), mkdir(), mknod(), symlink(), link(), rename(), unlink(), rmdir()

Data - write(), writev(), truncate(), ftruncate()

Meta - setattr(), fsetattr(), setxattr(), fsetxattr(), removexattr(), fremovexattr()

In order to record the type of operation and entity underwent, a type identifier is used. Normally, the entity on which the operation is performed would be identified by the pathname, but we choose to use GlusterFS internal file identifier (GFID) instead (as GlusterFS supports GFID based backend and the pathname field may not always be valid and other reasons which are out of scope of this document). Therefore, the format of the record for the three types of operation can be summarized as follows:

Entry - GFID + FOP + MODE + UID + GID + PARGFID/BNAME [PARGFID/BNAME]

Meta - GFID of the file

Data - GFID of the file

GFID's are analogous to inodes. Data and Meta fops record the GFID of the entity on which the operation was performed, thereby recording that there was a data/metadata change on the inode. Entry fops record at the minimum a set of six or seven records (depending on the type of operation), that is sufficient to identify what type of operation the entity underwent. Normally this record includes the GFID of the entity, the type of file operation (which is an integer [an enumerated value which is used in Glusterfs]) and the parent GFID and the basename (analogous to parent inode and basename).

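Purely as an illustration of the three record shapes summarized above, they could be modelled like this; the field and type names are mine, not the changelog's on-disk format:

```c
/* illustrative-only models of the three changelog record categories */
#include <stdint.h>

typedef char gfid_t[16];          /* GFID: a 128-bit id, analogous to an inode   */

struct entry_record {             /* ENTRY: namespace operations                 */
    gfid_t   gfid;                /* entity the fop was performed on             */
    int      fop;                 /* enumerated file-operation type              */
    uint32_t mode, uid, gid;      /* meaningful for create()/mkdir()/mknod()     */
    gfid_t   pargfid;             /* parent GFID (analogous to the parent inode) */
    char     bname[256];          /* basename within that parent                 */
    /* rename() records a second PARGFID/BNAME pair; only one is modelled here */
};

struct data_record { gfid_t gfid; };   /* DATA: just the changed file's GFID     */
struct meta_record { gfid_t gfid; };   /* METADATA: likewise only the GFID       */
```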

Changelog file is rolled over after a specific time interval. We then perform processing operations on the file like converting it to understandable/human readable format, keeping private copy of the changelog etc. The library then consumes these logs and serves application requests.

ii. Xsync - Marker translator maintains an extended attribute "xtime" for each file and directory. Whenever any update happens it would update the xtime attribute of that file and all its ancestors. So the change is propagated from the node (where the change has occurred) all the way up to the root.

image

Consider the above directory tree structure. At time T1 the master and slave were in sync with each other.

image

At time T2 a new file File2 was created. This will trigger the xtime marking (where xtime is the current timestamp) from File2 up to the root, i.e., the xtime of File2, Dir3, Dir1 and finally Dir0 all will be updated.

Geo-replication daemon crawls the file system based on the condition that xtime(master) > xtime(slave). Hence in our example it would crawl only the left part of the directory structure since the right part of the directory structure still has equal timestamp. Although the crawling algorithm is fast we still need to crawl a good part of the directory structure.

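A compact sketch of that idea: bump an xtime up to the root on every change, then let the crawler skip any subtree whose master xtime is not newer than the slave's. Types and helpers are illustrative, not Gluster's marker or geo-replication code.

```c
/* toy model of xtime marking and the geo-rep crawl condition; illustrative only */
#include <stddef.h>
#include <time.h>

struct node {
    const char  *name;
    time_t       xtime;            /* kept as an extended attribute in reality */
    struct node *parent;
    struct node *children[8];
    int          nchildren;
};

/* on any update, propagate the new timestamp from the node up to the root */
static void mark_xtime(struct node *n, time_t now)
{
    for (; n != NULL; n = n->parent)
        if (n->xtime < now)
            n->xtime = now;
}

/* crawl only subtrees where xtime(master) > xtime(slave); this sketch assumes
 * the slave tree mirrors the master's shape, which the real crawler does not */
static void crawl(const struct node *master, const struct node *slave)
{
    if (master->xtime <= slave->xtime)
        return;                    /* unchanged since the last sync: skip it */
    /* ...sync this entry to the slave here... */
    for (int i = 0; i < master->nchildren; i++)
        crawl(master->children[i], slave->children[i]);
}

int main(void)
{
    struct node dir0  = { "dir0",  0, NULL,  {0}, 0 };
    struct node file2 = { "file2", 0, &dir0, {0}, 0 };
    mark_xtime(&file2, time(NULL));             /* File2 created at time T2       */
    crawl(&dir0, &dir0);                        /* identical trees: nothing to do */
    return dir0.xtime == file2.xtime ? 0 : 1;   /* xtime propagated to the root   */
}
```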

2. Replication - We use rsync for data replication. Rsync is an external utility which will calculate the diff of the two files and sends this difference from source to sync.

Overall working of GlusterFS

As soon as GlusterFS is installed in a server node, a gluster management daemon(glusterd) binary will be created. This daemon should be running in all participating nodes in the cluster. After starting glusterd, a trusted server pool(TSP) can be created consisting of all storage server nodes (TSP can contain even a single node). Now bricks which are the basic units of storage can be created as export directories in these servers. Any number of bricks from this TSP can be clubbed together to form a volume.

Once a volume is created, a glusterfsd process starts running in each of the participating brick. Along with this, configuration files known as vol files will be generated inside /var/lib/glusterd/vols/. There will be configuration files corresponding to each brick in the volume. This will contain all the details about that particular brick. Configuration file required by a client process will also be created. Now our filesystem is ready to use. We can mount this volume on a client machine very easily as follows and use it like we use a local storage:

mount.glusterfs <IP or hostname>:<volume_name> <mount_point>
IP or hostname can be that of any node in the trusted server pool in which the required volume is created.

When we mount the volume in the client, the client glusterfs process communicates with the servers’ glusterd process. Server glusterd process sends a configuration file (vol file) containing the list of client translators and another containing the information of each brick in the volume with the help of which the client glusterfs process can now directly communicate with each brick’s glusterfsd process. The setup is now complete and the volume is now ready for client's service.

image

When a client issues a system call (a file operation, or fop) in the mounted filesystem, the VFS (identifying the type of filesystem to be GlusterFS) sends the request to the FUSE kernel module. The FUSE kernel module in turn sends it via /dev/fuse to the GlusterFS process in the userspace of the client node (this has already been described in the FUSE section). The GlusterFS process on the client consists of a stack of translators called the client translators, which are defined in the configuration file (vol file) sent by the storage server's glusterd process. The first of these translators is the FUSE translator, which consists of the FUSE library (libfuse). Each translator has a function corresponding to each file operation, or fop, supported by glusterfs. The request will hit the corresponding function in each of the translators. The main client translators include:

  • FUSE translator
  • DHT translator- DHT translator maps the request to the correct brick that contains the file or directory required.
  • AFR translator- It receives the request from the previous translator and if the volume type is replicate, it duplicates the request and passes it on to the Protocol client translators of the replicas.
  • Protocol Client translator- Protocol Client translator is the last in the client translator stack. This translator is divided into multiple threads, one for each brick in the volume. This will directly communicate with the glusterfsd of each brick.
    In the storage server node that contains the brick in need, the request again goes through a series of translators known as server translators, main ones being:

  • Protocol server translator
  • POSIX translator
    The request will finally reach VFS and then will communicate with the underlying native filesystem. The response will retrace the same path.

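To tie the client-side flow together, here is a toy translator chain for a single fop. It mimics the shape of the graph described above (fuse → dht → afr → protocol/client) but uses invented types, not GlusterFS's actual STACK_WIND machinery or xlator structures.

```c
/* toy client-side translator chain for a single write fop; illustrative only */
#include <stddef.h>
#include <stdio.h>

struct xl;
typedef int (*write_fop_t)(struct xl *self, const char *path,
                           const char *buf, size_t len);

struct xl {
    const char  *name;
    write_fop_t  write_fop;   /* the function this translator runs for write() */
    struct xl   *child;       /* next translator down the stack                */
};

static int pass_down(struct xl *self, const char *path, const char *buf, size_t len)
{
    printf("%s: write(%s, %zu bytes)\n", self->name, path, len);
    /* a real translator may modify, duplicate, cache or redirect the request here */
    return self->child ? self->child->write_fop(self->child, path, buf, len) : 0;
}

int main(void)
{
    struct xl client = { "protocol/client",          pass_down, NULL    };
    struct xl afr    = { "cluster/replicate (afr)",  pass_down, &client };
    struct xl dht    = { "cluster/distribute (dht)", pass_down, &afr    };
    struct xl fuse   = { "mount/fuse",               pass_down, &dht    };

    const char data[] = "hello";
    return fuse.write_fop(&fuse, "/file.txt", data, sizeof data - 1);
}
```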

Origin www.cnblogs.com/keep-live/p/11951670.html