Distributed Storage Comparison

Common open-source distributed storage systems

Overall system comparison

(The original post shows an overall comparison chart of the systems here.)

Open-source license overview

GPL: code derived from GPL code cannot be modified and then distributed or sold as closed-source commercial software; the modified product must itself be released under the GPL;
GPLv2: the modified work as a whole must be distributed under the GPL; not only must the source code of the modification be made public, but the distributor may not impose additional restrictions of their own on how the modified work circulates;
GPLv3: requires users to publish modified source code and also to disclose the related hardware information;
LGPL: a more permissive variant of the GPL.

TFS

TFS (Taobao File System) is a distributed file system developed by Taobao. It is specially optimized internally for mass storage of small files and has since been open-sourced.
TFS uses its own on-disk storage format and must be accessed through a dedicated API; the official clients currently provided are C++ / Java / PHP.

Characteristics

1) In TFS, the NameServer manages file metadata and achieves high availability through a hot-standby switching mechanism. Since all metadata is kept in memory, processing is very efficient, the system architecture is simple, and management is convenient;
2) The DataServer is deployed as a child node for data storage and also provides load balancing and redundant backup. Thanks to TFS's own file system format, small files are merged to reduce data fragmentation and improve IO performance;
3) TFS maps the metadata (BlockID, FileID) directly into the file name, a design that greatly reduces the memory needed to store metadata (a minimal sketch of this idea follows this list);
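To make the idea in item 3 concrete, here is a minimal, purely hypothetical sketch of packing a block ID and file ID into an opaque file name and decoding them again. It does not reproduce TFS's real naming algorithm (that is implemented inside the official C++/Java/PHP clients); the prefix and field widths below are assumptions for illustration only.

```python
# Hypothetical illustration of the "metadata in the file name" idea used by TFS.
# The real TFS encoding is implemented by its official clients; the "T" prefix
# and hex field layout below are assumptions, not the actual format.

def encode_name(block_id: int, file_id: int) -> str:
    """Pack a block id and a file id into a single opaque file name."""
    return "T%08x%016x" % (block_id, file_id)

def decode_name(name: str) -> tuple[int, int]:
    """Recover (block_id, file_id) from the encoded file name."""
    block_id = int(name[1:9], 16)
    file_id = int(name[9:25], 16)
    return block_id, file_id

if __name__ == "__main__":
    name = encode_name(block_id=1024, file_id=42)
    print(name)               # opaque name carrying the placement metadata
    print(decode_name(name))  # (1024, 42)
```

Because the placement information travels inside the name itself, the NameServer does not need a per-file lookup table in memory, which is exactly the saving item 3 describes.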

Advantages

1) Tailored for small files; random IO performance is relatively high;
2) Supports online expansion, improving system scalability;
3) Implements soft RAID, improving concurrent processing capability, fault tolerance and data recovery;
4) Supports hot-standby switchover, improving system availability;
5) Supports master/slave cluster deployment, where the slave cluster provides read/standby capability;

Disadvantages

1) TFS is optimized only for small files and is not suitable for storing large files;
2) Does not support the common POSIX interface, so generality is low;
3) Does not support custom directory structures or file access control;
4) Downloads go through the API, creating a single-point performance bottleneck;
5) Official documentation is sparse, so the learning cost is high;

Application scenarios

1) Applications deployed across multiple clusters;
2) Stored data that basically does not change;
3) Massive numbers of small files;
According to the material currently provided by the official project, a single cluster works well with fewer than 1,000 storage nodes; beyond that the NameServer may become a performance bottleneck. Taobao's online deployment had reached a capacity of 1,800 TB (2009 data).

FastDFS

FastDFS is a distributed file system developed in China by an individual developer, and its community is currently fairly active. The system has three kinds of nodes: Client, Tracker and Storage. A logical group concept is layered on top of the underlying storage, so that several Storage servers placed in the same group form a soft RAID 10, giving simple load balancing, data redundancy and better concurrent IO performance; new groups can be added to the logical storage linearly, so storage capacity expands smoothly.
For downloading files, besides the API, apache and nginx plug-ins are provided; alternatively, without any plug-in, files can be served directly as static web resources (a small sketch of this follows below).
The current FastDFS (V4.x) code base is roughly 60,000 lines. Internally it uses the mature third-party libevent network library, so it has good concurrent processing capability.
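As a rough illustration of the "static web resource" download path mentioned above, the sketch below fetches a stored file through an nginx front end. The host name and the file ID are placeholders; the exact URL layout depends on how nginx (for example with the fastdfs-nginx-module) is configured in your own deployment.

```python
# Minimal sketch: downloading a FastDFS-stored file as a static web resource
# through an nginx front end. STORAGE_HOST and FILE_ID are placeholders;
# the real values come from your deployment and from the upload response.
import urllib.request

STORAGE_HOST = "http://storage.example.com"   # assumed nginx endpoint
FILE_ID = "group1/M00/00/00/example.jpg"      # assumed FastDFS file ID

def download(file_id: str, dest: str) -> None:
    url = f"{STORAGE_HOST}/{file_id}"
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())

if __name__ == "__main__":
    download(FILE_ID, "example.jpg")
```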

Characteristics

1) The Tracker Server is the core hub of the whole system: it handles access scheduling (load balancing) and the monitoring and management of Storage servers. The Tracker is therefore crucial, but it also adds a single point of failure, so FastDFS supports multiple standby Trackers; actual testing found the standby Tracker does not run perfectly, yet it can still keep the system available.
2) For file synchronization, replication only happens between Storage servers within the same group: the Storage server holding the source file pushes it to the other Storage servers, currently using binlog-based replication. Because the underlying file synchronization has no correctness verification, this is only suitable for a single cluster on an internal network; over the public Internet, corrupted files will certainly occur, so you need to add your own file verification mechanism (a minimal verification sketch follows this list).
3) Supports master/slave (linked) files, which is well suited to associated content such as pictures; for storage, FastDFS encodes the association into the slave file's ID, which is derived from the master file's ID, so the relationship is fixed at storage time.
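Item 2 advises adding your own verification when files are synchronized across untrusted networks. Here is a minimal application-level sketch of that idea: record an MD5 digest when the file is uploaded and compare it against the downloaded copy. The function names are illustrative, not part of any FastDFS API.

```python
# Minimal application-level verification sketch for FastDFS replicas:
# record a digest at upload time, re-compute it after download, and
# retry or alert on mismatch. Function names are illustrative only.
import hashlib

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a local file, streaming in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_download(local_path: str, expected_md5: str) -> bool:
    """Return True if the downloaded copy matches the digest stored at upload time."""
    return file_md5(local_path) == expected_md5
```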

Advantages

1) The system does not support POSIX (Portable Operating System Interface), which reduces system complexity and makes processing more efficient;
2) Supports online expansion, improving system scalability;
3) Implements soft RAID, improving concurrent processing capability, fault tolerance and data recovery;
4) Supports master/slave files and custom file extensions;
5) Standby Tracker servers improve system availability;

Disadvantages

1) Does not support resumable (breakpoint) transfers, so large files are a nightmare (FastDFS is not suitable for storing large files);
2) Does not support the generic POSIX interface, so generality is low;
3) Synchronizing files across the public network involves large delays, so the application needs an appropriate fault-tolerance strategy;
4) The file synchronization mechanism has no correctness verification, which reduces system availability;
5) Downloads through the API have a single-point performance bottleneck;

Application scenarios

1) Applications deployed on a single cluster;
2) Stored data that basically does not change;
3) Small and medium-sized files;
According to the material currently provided by the official project, a conventional FastDFS deployment has reached 900 TB of storage on 100 physical machines (50 groups).
FastDFS installation instructions
Source: https://github.com/happyfish100/fastdfs

MooseFS

MooseFS is a highly available, fault-tolerant distributed file system. It supports file operations through a FUSE mount and provides a web management interface that makes it very easy to see the current state of stored files.

Characteristics

1) As shown in the original figure, MooseFS consists of four parts: Managing Server, Data Server, Metadata Backup Server and Client;
2) All metadata is managed by the Managing Server; to improve overall availability, the Metadata Backup Server records the metadata operation log so that metadata can be recovered promptly;
3) Data Servers can be deployed in a distributed way, with data stored in blocks spread across the storage nodes, which improves overall performance; the redundancy provided by the Data Servers also improves reliability;
4) The Client mounts the file system through FUSE and gets POSIX-like access, which lowers the difficulty of client development and increases the system's generality (see the sketch after this list);
Metadata server (master): manages the individual data storage servers, schedules file reads and writes, and handles reclamation and recovery of file space;
Metadata logging server (metalogger): backs up the master server's changelog files so that it can take over the master's work when the master runs into problems;
Data storage servers (chunkserver): the physical servers where data is actually stored; they connect to the management server, follow its scheduling, provide storage space and data transfer; data is kept in multiple copies, and the actual files cannot be seen directly in the data storage directory.
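Because the client mounts MooseFS through FUSE (item 4), applications just use ordinary POSIX file APIs; no MooseFS-specific library is involved. A minimal sketch, assuming the volume has already been mounted at /mnt/mfs (the mount point is an assumption, and mounting itself is done with the MooseFS client tools outside this script):

```python
# Plain POSIX-style access to a MooseFS volume mounted via FUSE.
# /mnt/mfs is an assumed mount point.
import os

MFS_ROOT = "/mnt/mfs"

def write_and_list() -> None:
    path = os.path.join(MFS_ROOT, "demo", "hello.txt")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:          # ordinary open/write, nothing MFS-specific
        f.write("hello moosefs\n")
    for name in os.listdir(os.path.dirname(path)):
        print(name, os.path.getsize(os.path.join(MFS_ROOT, "demo", name)))

if __name__ == "__main__":
    write_and_list()
```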

Advantages

1) Simple to install and deploy, easy to manage;
2) Supports online expansion, improving system scalability;
3) Implements soft RAID, improving concurrent processing capability, fault tolerance and data recovery;
4) Data recovery is relatively easy, improving system availability;
5) Has a recycle-bin feature, convenient for business-specific customization;

Disadvantages

1) Has a single-point performance bottleneck and a single point of failure (the master);
2) The MFS master node consumes a lot of memory;
3) Storage utilization is low for files smaller than 64 KB;

Application scenarios

1) Applications deployed on a single cluster;
2) Medium and large files;

GlusterFS

GlusterFS is Red Hat's open-source distributed file system. It has high scalability, high availability and high performance; because of its no-metadata-server design, it scales truly linearly, total storage capacity can easily reach the PB level, and it supports thousands of concurrent client accesses. Across clusters, its powerful Geo-replication can mirror data between clusters, and it supports chained replication, which makes it very suitable for cross-cluster application scenarios.

Characteristics

1) GlusterFS currently supports mounting via FUSE, and the same file system can also be accessed through standard file protocols such as NFS/SMB/CIFS; it additionally supports HTTP/FTP/GlusterFS-native access, and the latest version supports access from Amazon AWS systems;
2) GlusterFS is managed through an SSH-based command-line interface: storage nodes can be added and removed remotely, and the current usage state of the storage nodes can be monitored (a small sketch follows this list);
3) GlusterFS supports dynamically expanding virtual storage volumes as cluster nodes are added; in distributed-replicated (redundant) mode it has self-healing management functions, and in Geo-replication mode file transfer supports HTTP with asynchronous and incremental transfer characteristics;
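To make the command-line management in item 2 concrete, here is a small sketch that drives the standard gluster CLI from Python via subprocess. The peer host names, volume name and brick paths are placeholders; such commands only work on hosts where the gluster CLI is installed and you have the required privileges.

```python
# Sketch of driving GlusterFS management through its command-line interface.
# Host names, the volume name and brick paths are placeholders.
import subprocess

def gluster(*args: str) -> str:
    """Run a gluster CLI command and return its standard output."""
    result = subprocess.run(["gluster", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # add a storage node to the trusted pool (placeholder host name)
    print(gluster("peer", "probe", "node2.example.com"))
    # create and start a 2-way replicated volume (placeholder bricks)
    print(gluster("volume", "create", "vol0", "replica", "2",
                  "node1.example.com:/data/brick1",
                  "node2.example.com:/data/brick1"))
    print(gluster("volume", "start", "vol0"))
    # inspect the current state of the volume
    print(gluster("volume", "info", "vol0"))
```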

Advantages

1) The system supports POSIX (Portable Operating System Interface) and FUSE mounting, and can be accessed through multiple protocols, so generality is high;
2) Supports online expansion, improving system scalability;
3) Implements soft RAID, improving concurrent processing capability, fault tolerance and data recovery;
4) Powerful command-line management lowers learning and deployment costs;
5) Supports mirror replication of the whole cluster, making it convenient to add cluster nodes as business load grows;
6) The official documentation is professional, and the file system is maintained by Red Hat at enterprise level, so release quality is guaranteed;

Disadvantages

1) The greater the generality, the more layers a request passes through, which hurts IO processing efficiency;
2) Frequent reads and writes produce junk files that take up disk space;

Application scenarios

1) Applications deployed across multiple clusters;
2) Medium and large files;
According to the material currently provided by the official project, existing GlusterFS deployments can easily reach PB-level storage capacity.

Terminology:

brick: a storage block allocated to a file system volume;
client: the machine that mounts the volume and provides service to the outside;
server: where files are actually stored;
subvolume: a block after it has passed through one or more translators;
volume: the final file system volume after all translation.

Ceph

Ceph is an open-source distributed file system that can store objects, blocks and files. From the very beginning of its design, eliminating single points of failure was the first problem to solve, so the system has high availability, high performance and high scalability. Its support for a high-performance file system on BTRFS (B-Tree file system) is still at the experimental stage, but storing data through OSDs performs very well. Because the system has only just entered the commercial stage, introduce it into production environments with caution.

Characteristics

1) Ceph's underlying storage is based on RADOS (a reliable, autonomous distributed object store), which exposes the underlying storage system through LIBRADOS / RADOSGW / RBD / CEPHFS, as shown in the original figure (a small LIBRADOS example follows this list);
2) Through FUSE, Ceph supports a POSIX-like access method; the most critical node of the distributed system, the MDS, can be deployed in multiple instances, so there is no single-point-of-failure problem and processing performance is greatly improved;
3) Ceph uses the CRUSH algorithm to dynamically convert a file's inode number into an object number, avoiding the need to store extra file placement metadata and increasing the system's flexibility;
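As an example of the LIBRADOS access path mentioned in item 1, the following sketch uses Ceph's Python rados bindings to write and read one object. It assumes a reachable cluster, a readable /etc/ceph/ceph.conf, and an existing pool named "data" (the pool name is a placeholder); error handling is kept minimal.

```python
# Minimal LIBRADOS example using Ceph's Python bindings (the `rados` module).
# Assumes a running cluster, /etc/ceph/ceph.conf, and an existing pool
# called "data" (placeholder name).
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("data")        # placeholder pool name
    try:
        ioctx.write_full("hello-object", b"hello ceph")  # store a whole object
        print(ioctx.read("hello-object"))                # read it back
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```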

Advantages

1) Supports object storage (OSD) clusters; files are located dynamically through the CRUSH algorithm, so processing efficiency is high;
2) Supports mounting via FUSE, lowering client development cost and improving generality;
3) Supports distributed MDS/MON with no single point of failure;
4) Strong fault tolerance and self-healing capability;
5) Supports online expansion and redundant backup, improving system reliability;

Disadvantages

1) Currently still at the testing stage, so system stability remains to be verified;

Application scenarios

1) Applications deployed in a distributed way across the whole network;
2) Scenarios with high real-time and reliability requirements;
According to official publicity, storage capacity can easily reach the PB level.
Source: https://github.com/ceph/ceph

MogileFS

Development language: Perl
Open-source license: GPL
Depends on a database
Trackers (control center): responsible for reading and writing the database and, acting as a broker, keeping the replicated data on the Storage nodes in sync
Database: stores the metadata (MySQL by default)
Storage: file storage
Besides the API, it can be integrated with nginx to provide download service externally
Source: https://github.com/mogilefs

Original article: blog.csdn.net/weixin_44691065/article/details/91893281