Storage: Pangu, an In-depth Analysis of the Design of Alibaba Cloud's Feitian Distributed Storage System

Abstract: This article is compiled from the video "Pangu: Feitian Distributed Storage System Practice" shared by Wu Yang of the Pangu team. The talk covers three questions: What is Pangu? What problems does Pangu solve? How does Pangu solve them? It focuses on Pangu's distributed system architecture and design philosophy.

The figure in the original talk lists the current mainstream cloud computing vendors, and it shows something interesting: all of them are "rich second generation" companies, that is, backed by a wealthy parent, and all of them built their distributed storage on self-developed technology instead of the familiar open source distributed systems.

Feitian's dream
The dream of the first generation of the Feitian team was to provide various computing and storage services to the outside world on a large number of cheap PC servers. Feitian consists of the following components: Kuafu, mainly responsible for the network; Nuwa, mainly responsible for coordination; Fuxi, mainly responsible for scheduling; Pangu, mainly responsible for storage; Shennong, mainly responsible for monitoring.


The figure in the original talk shows Pangu as the underlying storage platform, which connects the layers above and below it. As a distributed storage system, Pangu mainly provides two types of interfaces: an Append Only interface and a Random Access interface.

What problem is Pangu used to solve?
The hardware and software of a single machine are never perfect, and there is always a small probability of error. At the same time, the system must scale horizontally to manage a very large number of machines. Put these two facts together and failures become the norm.

At large scale, small-probability events are the norm
4% annual disk failure rate, 1%% daily machine downtime rate
RAID card crashes; capacitor charging and discharging causes write-back mode to change to write-through
Network partitions, switch packet loss, upgrade restarts, fiber damage that reduces bandwidth by 90%, routing errors between data centers at two sites
Rack power loss, power loss of the entire equipment room
TCP checksum errors on network cards, checksum errors on data read from disk
NTP time drift, kernel I/O threads stuck in the D state, dirty page cache that cannot be written back
System hotspots are always present and shift instantly
Program defects cause resource leaks, creation of huge numbers of files, and access to dirty data
Misoperation: deleting data by mistake, pulling out the wrong disk, bringing a test machine online without cleaning up its environment...
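
To make the first figure in the list concrete, the sketch below estimates how many disks fail per day at the 4% annual rate. The cluster size and the number of disks per machine are illustrative assumptions, not numbers from the talk.

```python
# Back-of-the-envelope estimate of expected disk failures per day.
# The 4% annual rate comes from the list above. The cluster shape
# (machines, disks per machine) is an assumption for illustration.

ANNUAL_DISK_FAILURE_RATE = 0.04


def expected_disk_failures_per_day(machines: int, disks_per_machine: int) -> float:
    """Expected failed disks per day, assuming failures are spread
    evenly over a 365-day year."""
    total_disks = machines * disks_per_machine
    return total_disks * ANNUAL_DISK_FAILURE_RATE / 365


per_day = expected_disk_failures_per_day(machines=10_000, disks_per_machine=12)
print(f"about {per_day:.1f} disk failures per day")
```

Under these assumptions a 10,000-machine cluster replaces more than a dozen disks every day, which is why disk failure has to be treated as routine rather than exceptional.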
Problems and challenges faced by Pangu

Block storage, object storage, table storage, file storage, offline big data processing, big data analysis and many other businesses in the Internet of Things bring enormous challenges, and some of their requirements even contradict each other.
How did Pangu solve the problem?

Pangu made some trade-offs in its system design. First, Pangu enables the cloud products built on it and lets those products face the users, so that the Pangu team can concentrate on building a stable and reliable distributed storage platform. High reliability and high availability cannot be compromised: strong consistency, correctness, reliability, and availability of data must be guaranteed under all circumstances. The pursuit of low cost sometimes threatens high availability, so the goal is high performance at a reasonable cost, that is, cost-effective online storage. Finally, Pangu aims to be easy to use and service-oriented: lightweight access for users, operation and maintenance that users do not notice, and complete, easy-to-use monitoring, tools, and documentation.

Pangu overall structure

Pangu is divided into three parts: Client, Master, and ChunkServer. To start a write, the Client asks the Master to create and open a file. The Master selects the locations of the three replicas and returns them to the Client. The Client then finds the ChunkServers at those locations and writes the data to them. In other words, the Client handles overall control, the Master stores the metadata, and the ChunkServers store the data. A single point in the system is very fragile, so how is its high availability ensured? Pangu's first step is to add Paxos: multiple Masters form a group to achieve high availability. Even so, only one server in the group serves requests at any given time. When the in-memory metadata grows large enough, horizontal expansion is required. MountTable divides the directory tree into volumes, and the Master is scaled horizontally by serving different volumes from different Master groups.
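
The write path described above can be sketched as a toy model. All class and method names here are hypothetical, and the round-robin placement is an assumption made to keep the example short; it is not Pangu's actual placement policy.

```python
# Toy model of the write path: the client asks the master where to put
# the data, then writes to the chosen chunk servers itself.
# Names and the round-robin placement are hypothetical.

class ChunkServer:
    def __init__(self, name: str) -> None:
        self.name = name
        self.chunks = {}  # path -> bytes

    def write(self, path: str, data: bytes) -> None:
        self.chunks[path] = data


class Master:
    """Holds metadata only: which chunk servers store each file."""

    def __init__(self, servers: list, replicas: int = 3) -> None:
        if len(servers) < replicas:
            raise ValueError("need at least as many servers as replicas")
        self.servers = servers
        self.replicas = replicas
        self.locations = {}  # path -> list of server names
        self._next = 0

    def create_and_open(self, path: str) -> list:
        # Pick `replicas` consecutive servers, starting one further
        # along on every call. A real master would also weigh load
        # and fault domains.
        chosen = [self.servers[(self._next + i) % len(self.servers)]
                  for i in range(self.replicas)]
        self._next = (self._next + 1) % len(self.servers)
        self.locations[path] = [s.name for s in chosen]
        return chosen


class Client:
    def __init__(self, master: Master) -> None:
        self.master = master

    def write_file(self, path: str, data: bytes) -> list:
        targets = self.master.create_and_open(path)  # metadata step
        for server in targets:                       # data step
            server.write(path, data)
        return [s.name for s in targets]


servers = [ChunkServer(f"cs{i}") for i in range(5)]
client = Client(Master(servers))
written_to = client.write_file("/demo/a.log", b"hello")
```

Note that the data never passes through the Master in this model: it only answers the placement question, which is what keeps the metadata service small enough to hold in memory.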
High data reliability
Pangu's three replicas are strongly consistent. They are placed in different fault domains, and data is re-replicated automatically when a failure occurs. In the example from the talk, a data center stores 3 copies of each piece of data across 4 racks. Suppose RACK-1 suddenly loses power or has a network problem, and the diamond-shaped data has replicas on RACK-1, RACK-3 and RACK-4. Once the replica on RACK-1 is lost, Pangu uses an efficient algorithm to copy the data from RACK-3 to RACK-2, which keeps the data safe and reliable.
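
A minimal sketch of this repair step is shown below, using the same rack names as the example. The function names and the "least loaded healthy rack" rule are assumptions; the talk only says that an efficient algorithm is used.

```python
# Sketch: keep 3 replicas in 3 different racks and re-replicate when a
# rack is lost. The "least loaded rack" rule is an assumption.

REPLICAS = 3


def place(racks: dict) -> list:
    """Pick REPLICAS distinct racks, preferring the least loaded ones.
    `racks` maps rack name -> number of chunks already stored there."""
    chosen = sorted(racks, key=lambda r: (racks[r], r))[:REPLICAS]
    for rack in chosen:
        racks[rack] += 1
    return chosen


def repair(replica_racks: list, failed: str, racks: dict) -> list:
    """After rack `failed` goes down, copy the chunk from a surviving
    replica into a healthy rack that does not already hold it."""
    survivors = [r for r in replica_racks if r != failed]
    if len(survivors) == len(replica_racks):
        return list(replica_racks)  # this chunk was not affected
    candidates = [r for r in racks if r != failed and r not in survivors]
    if not candidates:
        raise RuntimeError("no healthy rack available for re-replication")
    target = min(candidates, key=lambda r: (racks[r], r))
    racks[target] += 1
    return survivors + [target]


racks = {"RACK-1": 0, "RACK-2": 0, "RACK-3": 0, "RACK-4": 0}
diamond = ["RACK-1", "RACK-3", "RACK-4"]
repaired = repair(diamond, failed="RACK-1", racks=racks)
```

With four racks and one of them down, RACK-2 is the only rack that is healthy and does not yet hold the chunk, so the repaired replica set is RACK-3, RACK-4 and RACK-2, matching the example.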

Data integrity assurance
Pangu mainly does two things: end-to-end data verification and silent error checking. With small probability, data held in memory can change, and so can data stored on disk. Every piece of data is therefore followed by a CRC, so that the data and its CRC can be checked against each other once written to disk. A background task scans the data periodically. When data does not match its CRC, Pangu concludes that a bit flip has occurred and overwrites the bad copy with a good replica.
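
The idea can be illustrated with a small sketch. It assumes a 4-byte CRC32 appended to each record and uses Python's standard `zlib.crc32`; the talk does not say which CRC variant or record layout Pangu uses, and the function names are hypothetical.

```python
import zlib

# Sketch of "data followed by a CRC" plus a background scrub.
# CRC32 and the 4-byte big-endian trailer are assumptions.


def seal(data: bytes) -> bytes:
    """Append a 4-byte CRC32 to the payload before it is stored."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def is_intact(record: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    data, stored = record[:-4], record[-4:]
    return zlib.crc32(data).to_bytes(4, "big") == stored


def scrub(replicas: list) -> int:
    """Background scan: overwrite every corrupted replica with a good
    copy. Returns the number of replicas repaired."""
    good = next((bytes(r) for r in replicas if is_intact(bytes(r))), None)
    if good is None:
        raise RuntimeError("no intact replica left")
    repaired = 0
    for replica in replicas:
        if not is_intact(bytes(replica)):
            replica[:] = good
            repaired += 1
    return repaired


replicas = [bytearray(seal(b"pangu")) for _ in range(3)]
replicas[1][0] ^= 0x01            # simulate a silent single-bit flip
repaired_count = scrub(replicas)
```

Because the CRC travels with the data, the same check can be run at every hop (client, network, disk), which is what makes the verification end to end.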

Reasonable cost
Pangu has made many optimizations to keep costs reasonable. A single offline cluster runs tens of thousands of machines and holds hundreds of petabytes of data. A single Master group has been optimized to reach 150,000 QPS for reads and 50,000 QPS for writes. Each data node is optimized toward the limits of the software stack so that software overhead is very low, and storage is tiered. Finally, to keep costs low, Pangu uses ordinary PC servers and erasure coding.
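
A short calculation shows why erasure coding lowers cost compared with three replicas. The talk does not give Pangu's erasure code parameters, so the layout of 8 data blocks plus 3 parity blocks below is purely an assumption.

```python
# Raw bytes stored per byte of user data, for replication and for an
# erasure code with k data blocks and m parity blocks.
# The (8, 3) layout used below is an assumed example.


def replication_overhead(copies: int) -> float:
    return float(copies)


def erasure_code_overhead(data_blocks: int, parity_blocks: int) -> float:
    return (data_blocks + parity_blocks) / data_blocks


three_copies = replication_overhead(3)      # 3.0x raw storage
ec_8_3 = erasure_code_overhead(8, 3)        # 1.375x raw storage
saving = 1 - ec_8_3 / three_copies          # fraction of raw storage saved
```

In this assumed layout the erasure code stores 1.375 bytes per user byte instead of 3, a saving of about 54%, while still tolerating the loss of any 3 blocks. The price is extra computation and network traffic when data has to be reconstructed.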

Self-service operation and maintenance
Self-service operation and maintenance is very important. Pangu supports hot upgrades that applications do not notice, and operation and maintenance actions are carried out automatically according to configuration, without manual intervention. Environment standardization corrects deviations promptly, and problem diagnosis lets the system resolve problems by itself. In the architecture presented in the talk, there is a centrally managed configuration management database. The Pangu control center pushes its contents to each Pangu component, applies configuration changes automatically, and realigns any configuration found to be incorrect. This is very important for distributed systems at scale.
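
The automatic alignment described above resembles a reconcile loop: compare each node's actual configuration with the desired one and fix any drift. The sketch below is an assumed minimal form of that idea; the function names and configuration keys are hypothetical.

```python
# Sketch of "push desired configuration and realign on drift".
# Keys such as "replicas" and "scrub_interval" are made-up examples.


def diff(desired: dict, actual: dict) -> dict:
    """Entries whose actual value is missing or differs from desired."""
    return {k: v for k, v in desired.items() if actual.get(k) != v}


def reconcile(desired: dict, nodes: dict) -> dict:
    """Align every node with the desired configuration.
    Returns, per node, the sorted list of keys that had to be corrected."""
    corrected = {}
    for name, actual in nodes.items():
        drift = diff(desired, actual)
        if drift:
            actual.update(drift)
            corrected[name] = sorted(drift)
    return corrected


desired = {"replicas": "3", "scrub_interval": "24h"}
nodes = {
    "cs1": {"replicas": "3", "scrub_interval": "24h"},
    "cs2": {"replicas": "2"},                      # drifted and incomplete
}
first_pass = reconcile(desired, nodes)
second_pass = reconcile(desired, nodes)
```

The first pass corrects only the drifted node, and the second pass finds nothing to do. Running the same loop repeatedly is safe, which is what allows it to run without manual intervention.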
Fault-tolerant design
The core of distributed systems is fault-tolerant design:

Data security is a belief: E2E checksums; background scanning for silent errors; tolerance of system bugs, hardware failures, and operation and maintenance mistakes. In large-scale systems there are always all kinds of problems, and when they occur together they become very hard to handle.
Environmental inspection eliminates hidden dangers: disk partitions; rack distribution; configuration errors; software errors; hardware errors.
Single-machine failures go unnoticed: data replication keeps data safe; retrying on a different machine makes reads and writes succeed; faulty machines are remembered and avoided.
Monitoring + self-healing: the Master checks its own health and switches over; the Chunkserver finds faulty disks or machines and isolates them; the Client detects service status and switches Masters; the Client checks its own health and reports its status.
This design greatly reduces the pressure on operation and maintenance.
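
The "retry on a different machine, then remember and avoid the faulty one" behaviour can be sketched as follows. The class, the exception, and the `fetch` callback are hypothetical names introduced for illustration, not Pangu's client API.

```python
from typing import Callable, List, Set

# Sketch of a reader that retries on another replica and remembers
# which servers have failed. All names are hypothetical.


class ReplicaUnavailable(Exception):
    pass


class RetryingReader:
    def __init__(self, fetch: Callable[[str, str], bytes]) -> None:
        self.fetch = fetch          # fetch(server, key) -> bytes, may raise
        self.bad: Set[str] = set()  # servers that failed recently

    def read(self, key: str, servers: List[str]) -> bytes:
        # Try servers with no known problem first, known-bad ones last.
        ordered = ([s for s in servers if s not in self.bad] +
                   [s for s in servers if s in self.bad])
        for server in ordered:
            try:
                data = self.fetch(server, key)
            except ReplicaUnavailable:
                self.bad.add(server)
                continue
            self.bad.discard(server)  # it works again, forgive it
            return data
        raise ReplicaUnavailable(f"all replicas failed for {key}")


calls = []


def flaky_fetch(server: str, key: str) -> bytes:
    calls.append(server)
    if server == "cs1":
        raise ReplicaUnavailable(server)
    return b"data from " + server.encode()


reader = RetryingReader(flaky_fetch)
first = reader.read("chunk-7", ["cs1", "cs2", "cs3"])
second = reader.read("chunk-7", ["cs1", "cs2", "cs3"])
```

The first read pays for one failed attempt on `cs1`; the second read goes straight to `cs2`, so a single faulty machine is barely visible to the caller.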

Master
The Master has to solve three main problems: large capacity, high efficiency, and stability. Large capacity means horizontal expansion through federation, plus a compact in-memory layout that supports 800 million files in a single group with 100K read and write operations per second. Efficiency means optimal algorithms: fast replication triggered by hardware errors keeps data safe, dynamic planning of data traffic maximizes throughput, and dynamic adjustment of security domains keeps data highly available. Stability means that Paxos keeps data consistent and prevents a single point of failure, monitoring from multiple angles triggers switchover automatically, and isolation between users prevents one user from overwhelming the others. Pangu is a multi-tenant system: a 10,000-machine cluster, for example, runs many applications that know nothing about each other, yet they all share the same Master. If one user floods the Master with requests, the entire cluster can no longer serve anyone. How can this be prevented? Pangu applies isolation at multiple levels to solve this problem.
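
The talk does not describe how this isolation is implemented. One common building block for this kind of protection is a per-tenant token bucket, sketched below as an assumption rather than as Pangu's actual mechanism. Time is passed in explicitly so the example stays deterministic.

```python
# Per-tenant rate limiting with token buckets. This is a generic
# technique shown for illustration, not Pangu's documented design.


class TokenBucket:
    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum tokens that can accumulate
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so a noisy tenant only exhausts its own."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, tenant: str, now: float) -> bool:
        if tenant not in self.buckets:
            self.buckets[tenant] = TokenBucket(self.rate, self.burst)
        return self.buckets[tenant].allow(now)


limiter = TenantLimiter(rate=1.0, burst=2.0)
noisy = [limiter.allow("noisy", now=0.0) for _ in range(5)]
quiet = limiter.allow("quiet", now=0.0)
noisy_later = limiter.allow("noisy", now=1.0)
```

The noisy tenant gets its burst of two requests and is then rejected, while the quiet tenant's request at the same instant still succeeds.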
Chunkserver
The Chunkserver faces the following trade-offs: flash memory offers high IOPS but is expensive; mechanical hard disks are cheap but offer low IOPS; and a solution that writes only to memory loses data when power is lost. If the entire cluster loses power, data in memory that has not yet been written out is lost, and losing all three replicas is unacceptable for cloud computing. How can flash memory and mechanical hard disks be combined to solve these problems at the lowest cost? Some solutions rely on a UPS, but a UPS can also fail, so data can still be lost. The final solution therefore pairs a small amount of cache with a large number of mechanical hard disks: data is written to the cache in the foreground and dumped to the mechanical hard disks in the background.
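
A toy model of this two-tier write path is shown below. It assumes the cache is a small persistent fast device such as flash, represented here by an in-memory list; the class and method names are hypothetical.

```python
from typing import Optional

# Toy model: acknowledge writes once they reach a small fast cache,
# then move them to the large slow tier in the background.
# The list and dict stand in for flash and mechanical disks.


class TieredStore:
    def __init__(self) -> None:
        self.cache = []   # (key, data) in arrival order; assumed persistent
        self.hdd = {}     # key -> data on the mechanical disks

    def write(self, key: str, data: bytes) -> None:
        """Foreground path: done as soon as the cache holds the data."""
        self.cache.append((key, data))

    def flush(self, max_items: Optional[int] = None) -> int:
        """Background path: dump the oldest cached writes to the HDD tier.
        Returns the number of writes moved."""
        n = len(self.cache) if max_items is None else min(max_items, len(self.cache))
        for key, data in self.cache[:n]:
            self.hdd[key] = data
        del self.cache[:n]
        return n

    def read(self, key: str) -> bytes:
        # The newest cached version wins over whatever is on the HDD.
        for cached_key, data in reversed(self.cache):
            if cached_key == key:
                return data
        return self.hdd[key]


store = TieredStore()
store.write("a", b"v1")
store.write("a", b"v2")
before_flush = store.read("a")
moved_first = store.flush(max_items=1)
during_flush = store.read("a")
moved_rest = store.flush()
after_flush = store.read("a")
```

Readers always see the latest write regardless of how far the background flush has progressed, so the foreground latency is that of the fast tier while the capacity cost is that of the slow tier.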
Client
The Client faces many problems of its own. Coroutines are popular in many current programming languages. In traditional multi-threaded programming, running many threads on a multi-core system makes context switching very expensive, which high-performance programs cannot tolerate. One solution is asynchronous programming, which uses a small number of threads and avoids thread switching. How can one get both the convenience of synchronous programming and the performance of asynchronous programming? Coroutines are the answer. Many programming languages already provided coroutines, but C++ did not at the time, so Pangu implemented its own coroutines to achieve high performance. The problem the Client faces is that some users need extreme performance, some need easy programming, and the large body of existing programs must be supported seamlessly. The solution is to use thread synchronization primitives that support both coroutine and non-coroutine users. Inside coroutines there is no thread switching, which means all tasks execute in one thread. If any task performs a blocking operation, it reduces the throughput of the entire thread.
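
The last point can be demonstrated with Python's standard `asyncio`, used here only as a stand-in for Pangu's own C++ coroutine library. Two coroutines share one thread: when they yield at a wait, they interleave; when one makes a blocking call, the other cannot even start until it finishes.

```python
import asyncio
import time

# Two coroutines on one thread. With a cooperative wait they overlap,
# with a blocking wait they run strictly one after the other.
# asyncio stands in for a coroutine runtime; it is not Pangu's library.


async def worker(name: str, log: list, blocking: bool) -> None:
    log.append(f"{name} start")
    if blocking:
        time.sleep(0.05)            # blocks the whole thread
    else:
        await asyncio.sleep(0.05)   # yields to the other coroutine
    log.append(f"{name} end")


async def run(blocking: bool) -> list:
    log = []
    await asyncio.gather(worker("A", log, blocking),
                         worker("B", log, blocking))
    return log


cooperative = asyncio.run(run(blocking=False))
stalled = asyncio.run(run(blocking=True))
```

In the cooperative run, B starts before A has finished. In the stalled run, the order is strictly A start, A end, B start, B end, which is the throughput loss described above.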