[Technology thinking] The avalanche effect in distributed storage systems

mythmgn is the author of this post. It was previously published on other platforms; starting this month the author is serializing his own thinking and technical articles on cnblogs (the markdown support is convenient!). Please credit the source when reposting. Thank you.

1. Background: distributed storage systems

Replicas are a common concept in distributed storage systems: data of a certain size is stored redundantly according to some strategy so that the system remains available when part of it fails.

There are several ways to replicate data among replicas; two are the most common (a small sketch follows the list):

  • Pipeline: like a pipe, a -> b -> c; data is replicated along the chain. Throughput is higher, but slow nodes are a problem: if one node is congested, the whole pipeline is affected.
  • Fan-out (distribution): client -> a, client -> b, client -> c. Overall system throughput is lower, but there is no slow-node problem.
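To make the contrast concrete, here is a minimal Python sketch under simplified assumptions: a hypothetical `Node` class with a blocking `receive` method; real systems stream data in chunks and asynchronously.

```python
# Sketch only: contrasts pipeline replication with fan-out replication.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = []

    def receive(self, payload, forward_to):
        self.store.append(payload)
        if forward_to:                       # pipeline: pass the data down the chain
            nxt, *rest = forward_to
            nxt.receive(payload, forward_to=rest)

def replicate_pipeline(payload, chain):
    """a -> b -> c: the client sends once, each node forwards to the next.
    The client uplink carries the data once, but one slow node stalls the chain."""
    head, *rest = chain
    head.receive(payload, forward_to=rest)

def replicate_fanout(payload, replicas):
    """client -> a, b, c: the client sends to every replica directly.
    No chained stall, but the client uplink carries the data N times."""
    for node in replicas:
        node.receive(payload, forward_to=[])

replicas = [Node("a"), Node("b"), Node("c")]
replicate_pipeline(b"block-1", replicas)
replicate_fanout(b"block-2", replicas)
```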

For the replica count, this article assumes the common choice of three replicas.

Distributed storage systems generally repair replicas automatically: when a storage node fails, other nodes (the master node that owns the data, or client nodes, depending on the replication protocol) initiate replica repair and restore the replicas that lived on the failed node onto healthy nodes. When only a small number of machines go down, the cluster's auto-repair strategy works fine. However, experience operating large-scale storage services shows that a monthly disk failure rate of X percent and a monthly switch failure rate of X per thousand make it quite likely that a larger batch of machines goes down several times a year. In addition, there are batch upgrades: if an upgrade has a bug, the machines taken down during the rollout also trigger replica repair, stretching out an upgrade that would normally finish on time, and such events likewise tend to take a larger number of machines down at once.

2. How the avalanche effect arises

A larger number of machines going down within a short period tends to trigger large-scale replica repair. Two characteristics of today's distributed storage systems make large-scale replica repair prone to producing an avalanche effect:

a. The free space of the whole cluster is small: usually <= 30% overall, and <= 20% or even 10% on individual machines.

b. Mixed deployment of applications: to maximize hardware utilization, different applications are deployed on the same physical/virtual machines.

The cloud drive services that have taken off recently are a typical cloud storage business. When major companies compete to offer 1 TB of personal storage, they are really competing on operating and maintenance costs. Most existing cloud storage only grows, or at best classifies data by how cold it is (similar to Facebook's data classification project). The total volume is huge but the increments are relatively small, so to avoid wasting storage and bandwidth, a newly created file whose md5 or sha1 signature matches an already-stored file is recorded as an internal link rather than written as a new file. Even so, the overall data volume is still very large.
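As a rough illustration of that deduplication idea, here is a minimal sketch; the in-memory `dedup_index` and the `blob/<sha1>` layout are hypothetical stand-ins for the real meta service and storage layout:

```python
import hashlib

# signature -> location of the already-stored content (hypothetical layout)
dedup_index = {}

def put_file(content: bytes, path: str, meta_store: dict):
    """Store a file; if identical content already exists, create an internal
    link to it instead of writing the bytes again."""
    sig = hashlib.sha1(content).hexdigest()
    if sig in dedup_index:
        meta_store[path] = {"link_to": dedup_index[sig]}   # internal link, no new data
        return "linked"
    location = f"blob/{sig}"          # pretend the bytes were written here
    dedup_index[sig] = location
    meta_store[path] = {"link_to": location}
    return "stored"

meta = {}
print(put_file(b"hello", "/u1/a.txt", meta))   # stored
print(put_file(b"hello", "/u2/b.txt", meta))   # linked
```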

At present, cloud storage businesses have no significant source of revenue, while each server costs tens of thousands per year, so for operating-cost reasons the back-end distributed storage systems are run with little idle capacity. The moment a batch of machines goes down, a large amount of replica repair kicks in; the repaired replicas are likely to land on machines that are already close to their storage quota, pushing those machines into a down or read-only state in turn. If this continues, the whole cluster can avalanche and the system is crippled.

3. Preventing avalanches

This section discusses how, within the system's own logic, to keep the whole system from avalanching. Prevention matters more than after-the-fact handling: predicting what state the cluster will be in after an incident and optimizing for it in advance is the direction avalanche prevention takes.

Several practical scenarios that actually happened are selected below to share with you.

1.  Cross-rack replica placement, plus isolation of machine resources and user logic

What happened:

One day the operations engineers found that dozens of machines in a cluster lost contact at the same instant, and the master node responsible for triggering repairs started replica repair frantically. Many users reported that the cluster had become slow and that reads and writes were stuck.

On-site handling:

First priority: excessive replica repair was affecting the whole cluster.

a. The engineers acted decisively: using gdb they attached to the live process and changed the repair condition from replica count < 3 (replicas_num) to replica count < 2, so that for the time being the master would only repair files with fewer than 2 replicas. This guarantees at least one redundant replica for every file that has not already lost data, and prevents files left with a single surviving replica from being lost if another machine goes down.

b. Emergency diagnosis found that the lost machines were a switch problem: the machines whose IPs share the same class-C segment (a.b.c.*) failed in a batch. The network team was urged to fix it as soon as possible.

c. After all files were repaired back to >= 2 replicas, gdb was used again to change the under-replication detection period from tens of seconds to one day, while waiting for the network team to resolve the switch problem.

d. Once the network was restored, the original machines rejoined the cluster. The large number of files that had dropped to 2 replicas went back to 3, and files whose replicas had been lost recovered their full 3 copies.

e. The master node's parameters were restored to their normal values, and the system resumed normal repair.

Improvements:

Before making improvements, first analyze the weaknesses in the system this incident exposed:

1) The master's parameters did not support hot modification; using gdb on a live process is too risky.

2) A localized batch of machine failures affected the whole cluster (dozens of machines in a large cluster is still a relatively local failure). As mentioned above, with switches failing at a rate of a few per thousand per month, the storage cluster will inevitably experience the impact a switch failure brings.

Improvements rolled out after the post-mortem:

1) Move up the schedule for the master to support hot modification, so that the core parameters can be changed at runtime as soon as possible.

After hot modification went live it proved very effective and avoided several subsequent online problems.
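A minimal sketch of what hot-modifiable core parameters might look like; the parameter names and the admin entry point are assumptions for illustration, not the project's actual interface:

```python
import threading

class HotConfig:
    """Thread-safe parameter table that the master consults on every decision,
    so an admin command can change values without a restart or gdb."""

    def __init__(self, **defaults):
        self._lock = threading.Lock()
        self._values = dict(defaults)

    def get(self, key):
        with self._lock:
            return self._values[key]

    def set(self, key, value):        # called from an admin RPC / signal handler
        with self._lock:
            if key not in self._values:
                raise KeyError(key)
            self._values[key] = value

config = HotConfig(replicas_num=3, under_replication_check_secs=30)

# The same effect as the gdb hack during the incident, but done safely:
config.set("replicas_num", 2)
config.set("under_replication_check_secs", 86400)
```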

2) Add a cross-switch (rack position) policy to the pickup algorithm that selects host machines for storing replicas, so that replica placement is forced, or at least attempted, to span racks. Under this algorithm, at least one replica sits under a different switch from the others (a different class-C IP segment a.b.c.*). The measure applies both to replica selection for newly stored data and to the repair of missing replicas, ensuring that the failure of a single switch cannot take down or lose every replica of a piece of data, and thus that the master is not weighed down by an ever-growing repair queue or a long list of lost replicas. A minimal sketch follows.
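The sketch below shows a rack-aware pickup policy under stated assumptions: a hypothetical `rack_of` mapping from host to switch/rack, where "rack" can be read as the switch or class-C segment.

```python
import random
from collections import defaultdict

def pick_replica_hosts(hosts, rack_of, replicas_num=3):
    """Pick `replicas_num` hosts spanning at least two racks (switches),
    so that one switch failure cannot take every replica."""
    by_rack = defaultdict(list)
    for h in hosts:
        by_rack[rack_of[h]].append(h)
    racks = list(by_rack)
    if len(racks) < 2:
        raise RuntimeError("cannot satisfy cross-rack placement")
    # take one host from each of two different racks first ...
    first_racks = random.sample(racks, 2)
    chosen = [random.choice(by_rack[r]) for r in first_racks]
    # ... then fill the remaining slots from any other hosts
    remaining = [h for h in hosts if h not in chosen]
    chosen += random.sample(remaining, replicas_num - len(chosen))
    return chosen

hosts = ["h1", "h2", "h3", "h4"]
rack_of = {"h1": "sw-A", "h2": "sw-A", "h3": "sw-B", "h4": "sw-B"}
print(pick_replica_hosts(hosts, rack_of))
```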

3) Put on the agenda: dividing machines into regions for isolation; logically dividing user storage locations by region; and adding cross-region placement to the pickup algorithm.

a) Divide machines into regions by physical location, and logically divide user storage locations by region, so that a partial failure in the cluster affects only the users whose logical division maps onto that part of the machines.

As a result, the worst case is that a single region becomes unavailable, affecting only the reads and writes of the users assigned to that region. The cross-region pickup algorithm further ensures that users in an unavailable region do not lose data, because other replicas exist in other regions. Thus even a core-switch failure that takes down hundreds of machines in one region will not have too large an impact on the cluster as a whole.

b) Introduce the concept of region confidence, adding a stability factor to the replica placement algorithm.

When a cluster reaches a certain size, machines differ in stability (generally, machines brought online in the same batch have similar stability). By marking the stability of each region, the placement algorithm can be forced to put at least one replica of every piece of data in a stable region, reducing the probability of losing all replicas.

c) When dividing users into regions, operations needs to consider SLA response time, the physical stability of the machines, their location, and other information.

A reasonable region division improves system stability and response times and helps prevent system collapse. Fine-grained division rules improve overall stability further, but also increase system complexity. How to make this trade-off is left for the reader to ponder.

2.  Give the cluster flow control

Flow control follows a general principle, adapted to the characteristics of distributed storage systems: no single class of operation should consume too much processing capacity. "Operation" here includes what the system does when traffic surges or when a certain number of machines go down. Only if these operations proceed smoothly and under control can we guarantee that a single anomaly does not drag down the system as a whole, let alone trigger an avalanche.

What happened:

1) Scenario 1: one day the operations engineers found that write operations in a cluster surged at a certain time. Observing a storage node, they found not only more writes, but random writes! The overall throughput of some product lines dropped.

2) Scenario 2: a business with large amounts of data in the cluster needed to restructure its storage, modifying existing data so that large amounts of old data had to be deleted.

The operations engineers found that: a. the whole cluster was frantically running gc (garbage collection); b. the cluster's response had become noticeably slow, especially for operations that update meta information.

3) Scenario 3: one day the operations engineers suddenly found concurrency in a cluster surging, with a single user xyz issuing a large number of concurrent operations. According to the original user research, this user should not have had usage at such a scale.

There are many more scenarios like these, in which some operation surges beyond what the cluster expects; they will not all be listed here.

On-site handling:

1) Immediately contact the relevant users to understand the reason for the surge; unreasonable surges must be dealt with at once.

The unreasonable surges discovered were as follows:

a. Scenario-1 class: a code review found that a large number of operations had been changed to random reads and writes. The recommendation was to convert them to read + modify + write a new file + delete the old file, turning random reads and writes into sequential ones.

b. Scenario-3 class: a product line was running a performance test against the online cluster. The operations engineers immediately told the product line to stop. An e-mail to all cluster users re-emphasized that online clusters are not for performance testing; if a performance test is really needed, contact the relevant people for a dedicated cluster.

2) Push for flow-control mechanisms to be designed, implemented, and brought online for every aspect of the cluster.

Improvements:

1) Flow control on user operations

a. Rate-limit user operations

This can be implemented inside the system, by limiting at the external network layer, and so on; place some flow-control restrictions on each single user to prevent one user from consuming the resources of the whole cluster.
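A minimal per-user rate-limiting sketch using a token bucket; the rate and burst numbers are illustrative, not tuned values:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Classic token bucket: refill at `rate` tokens/sec up to `burst`."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# one bucket per user, e.g. 200 ops/sec with bursts up to 400 (illustrative)
user_buckets = defaultdict(lambda: TokenBucket(rate=200, burst=400))

def handle_request(user, op):
    if not user_buckets[user].allow():
        return "rejected: per-user limit exceeded"
    return f"ok: {op}"
```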

b. Flow control on storage-node operations

Operations are divided into High / Medium / Low tiers by how much cluster resource they consume, and a token-grabbing design is implemented per tier; the number of tokens in each tier is adjusted to suitable values through practice on the cluster. This prevents any one class of operation from consuming an excessive share of cluster capacity. If one class did consume too much, requests of other classes would see larger delays, triggering timeouts and retries, and a small-scale collapse would then have a chance of spreading across the whole cluster into a crash.
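A minimal sketch of the High / Medium / Low token idea, using one fixed-size token pool (semaphore) per tier; the tier sizes are illustrative and would be tuned per cluster:

```python
import threading
from contextlib import contextmanager

# Token pool per tier; sizes are illustrative and tuned per cluster in practice.
TIER_TOKENS = {
    "high":   threading.BoundedSemaphore(4),    # heavy ops: repair, gc batches
    "medium": threading.BoundedSemaphore(16),   # meta updates
    "low":    threading.BoundedSemaphore(64),   # ordinary reads/writes
}

@contextmanager
def with_token(tier, timeout=0.5):
    """Grab a token for the tier or fail fast, so one class of operation
    cannot starve the others and trigger cascading timeouts and retries."""
    sem = TIER_TOKENS[tier]
    if not sem.acquire(timeout=timeout):
        raise RuntimeError(f"{tier} tier saturated, shed load")
    try:
        yield
    finally:
        sem.release()

def do_repair(task):
    with with_token("high"):
        pass  # perform the repair work here
```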

c. Apply separate flow control to the gc (garbage collection) process. A common design for deletion in distributed storage systems: when a user delete arrives, mark the content as deleted in the meta information and return immediately; the actual gc deletion happens later under a rate-limiting policy, so that it does not consume too much of a single storage node's disk processing capacity. The specific limiting policy and token values need to be set according to the characteristics of the cluster and tuned over time.
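A minimal sketch of the mark-then-gc pattern described above; the queue, the `disk_delete` callback, and the rate are illustrative assumptions:

```python
import queue
import threading
import time

delete_queue = queue.Queue()      # keys marked deleted, awaiting physical gc

def delete_file(meta_store, key):
    """User-facing delete: mark in meta and return immediately."""
    meta_store.setdefault(key, {})["deleted"] = True
    delete_queue.put(key)
    return "ok"

def gc_worker(disk_delete, max_deletes_per_sec=50):
    """Background gc, throttled so it cannot eat a node's disk capacity."""
    interval = 1.0 / max_deletes_per_sec
    while True:
        key = delete_queue.get()
        disk_delete(key)          # the expensive physical removal
        time.sleep(interval)      # crude rate limit; a token bucket also works

threading.Thread(target=gc_worker, args=(lambda k: None,), daemon=True).start()
```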

2) Flow-control blacklist

Scenarios like users running tests against the online system can be constrained by human processes, but that cannot prevent a bug on the user's side from producing the same effect as a scale test against the online cluster. Such scenarios generally show up as several kinds of operations seriously exceeding the rate limits within a short time.

Such scenarios can be handled with a flow-control blacklist: when a user seriously exceeds the configured limits within a short time (e.g. 1 hour), blacklist the user and temporarily block their operations. Perimeter monitoring then notifies the operations team for emergency handling.
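A minimal blacklist sketch layered on top of the per-user limiter; the window, violation threshold, and ban duration are illustrative:

```python
import time
from collections import defaultdict

violations = defaultdict(list)    # user -> timestamps of rejected requests
blacklist = {}                    # user -> unblock time

WINDOW_SECS = 3600                # e.g. 1 hour, as in the text
MAX_VIOLATIONS = 1000             # "seriously exceeds the limit" (illustrative)
BAN_SECS = 1800

def record_violation(user):       # called when the rate limiter rejects a request
    violations[user].append(time.time())

def check_user(user):
    now = time.time()
    if user in blacklist:
        if now < blacklist[user]:
            return "blocked"      # perimeter monitoring alerts the ops team
        del blacklist[user]
    recent = [t for t in violations[user] if now - t < WINDOW_SECS]
    violations[user] = recent
    if len(recent) > MAX_VIOLATIONS:
        blacklist[user] = now + BAN_SECS
        return "blocked"
    return "allowed"
```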

3) Flow control on concurrent replica repair and replica creation at storage nodes

If large volumes of replica repair or replica creation are not rate-limited, they eat the storage node's bandwidth, CPU, memory and other resources, affecting normal read/write service and causing large delays. Those delays can trigger a flood of retries, making a busy cluster even busier.

On a single host, the concurrency of replica repair and replica creation needs to be limited, so that the ingress bandwidth they occupy is not too large and the process does not, through the delay such work introduces, add latency to other operations. This is especially important for systems that use the fan-out replication protocol: fan-out protocols generally have no slow-node check mechanism, and replica flow control keeps the system's latency from growing further and nodes from becoming slow. If the chance of slow nodes increases, newly created files may, for lack of a slow-node check at replica-creation time, land on slow nodes, which makes the cluster's situation even worse.
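A minimal sketch of capping concurrent repair and replica-creation work on one storage node; the slot counts are illustrative, and a real node would also cap bandwidth:

```python
import threading

class RepairThrottle:
    """Cap the number of in-flight repair / replica-creation tasks on one node,
    leaving ingress bandwidth and CPU for normal reads and writes."""

    def __init__(self, max_concurrent_repairs=2, max_concurrent_creates=8):
        self.repair_slots = threading.BoundedSemaphore(max_concurrent_repairs)
        self.create_slots = threading.BoundedSemaphore(max_concurrent_creates)

    def try_start_repair(self):
        return self.repair_slots.acquire(blocking=False)   # refuse instead of queueing

    def finish_repair(self):
        self.repair_slots.release()

throttle = RepairThrottle()
if throttle.try_start_repair():
    try:
        pass  # copy the replica data here, ideally with a bandwidth cap as well
    finally:
        throttle.finish_repair()
```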

3.  Predict in advance, act early

1) Predict disk failures; tolerate single-disk errors.

What happened:

A batch of SSDs from one vendor had problems: after running online for a while, a concentrated, fairly large number of bad blocks appeared on some disks, though not all disks were damaged. At that time there was no single-disk fault-tolerance mechanism, so when one disk went bad the whole machine was put into an unusable state. As a result, the machines with bad disks all became unavailable, the cluster spent a long stretch of time in replica-repair mode, and throughput took a significant hit.

Improvements:

a)  Predict hard-disk health, and automatically migrate replicas off disks that have a high probability of going bad soon.

In recent years, the technology for predicting disk health in advance has matured. It is now technically feasible to predict a disk's health and, before a disk with a high probability of failure actually breaks, automatically migrate its data to other disks, reducing the impact of disk failures on system stability.
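A minimal sketch of the idea only: flag disks whose SMART counters look worrying and queue their replicas for migration. The attribute names and thresholds are simplified illustrations, not a validated failure model; real predictors train models over many SMART attributes.

```python
# Illustrative thresholds, not tuned or validated values.
SUSPECT_THRESHOLDS = {
    "reallocated_sector_count": 50,
    "pending_sector_count": 10,
}

def is_suspect(smart_attrs: dict) -> bool:
    return any(smart_attrs.get(k, 0) >= v for k, v in SUSPECT_THRESHOLDS.items())

def plan_migrations(disks):
    """disks: list of dicts like {"id": "sda", "smart": {...}, "replicas": [...]}."""
    plan = []
    for d in disks:
        if is_suspect(d["smart"]):
            plan.extend((replica, d["id"]) for replica in d["replicas"])
    return plan      # (replica, source_disk) pairs to migrate before the disk dies
```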

b)  Tolerate single-disk errors

The storage node should handle a bad disk as an exception. When a single disk dies, automatically migrate/repair only that disk's data to other disks, rather than shutting the whole process down: once the whole machine is taken down, the data on its other disks is also treated as missing replicas, and in a storage-tight distributed cluster a single such incident turns into a lengthy replica-repair process. Among existing distributed storage systems, Taobao's TFS does something similar by starting one process per disk, so a machine starts as many management processes as it has disks.

2) Predict how the storage distribution will evolve and perform load-balancing operations in advance.

Such strategies are an increasingly common design. Because the repair that follows machines going down always gives some machines a chance to become hot spots, we can predict which machines those will be and migrate part of their data in advance to machines with relatively low load.
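A minimal sketch of the proactive-balancing idea: treat machines already well above average usage as likely hot spots and plan moves to the least-loaded machines; the headroom and batch size are illustrative:

```python
def plan_rebalance(usage: dict, headroom=0.10, batch_gb=100):
    """usage: machine -> fraction of capacity used (0.0 - 1.0).
    Move data off machines more than `headroom` above the mean, onto the
    least-loaded machines, before a repair wave pushes them to full."""
    mean = sum(usage.values()) / len(usage)
    hot = [m for m, u in usage.items() if u > mean + headroom]
    cold = sorted((m for m, u in usage.items() if u < mean), key=usage.get)
    moves = []
    for src in hot:
        if not cold:
            break
        dst = cold.pop(0)                   # least-loaded target first
        moves.append({"from": src, "to": dst, "amount_gb": batch_gb})
    return moves

print(plan_rebalance({"m1": 0.92, "m2": 0.55, "m3": 0.60, "m4": 0.88}))
```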

Like the replica placement policy, a load-balancing strategy requires trading off complexity against the degree of optimization. A complex strategy balances the cluster well, but the complexity it introduces brings a higher bug rate. How to choose is a problem that still troubles designers of distributed storage systems.

4. Safe mode

Safe mode is a big anti-avalanche weapon for distributed storage systems that came out of project practice, so I am introducing it separately. The basic idea: when the number of machines that go down within a certain time exceeds expectations, put the cluster into safe mode, and depending on the configured policy and the severity of the situation, stop replica repair, stop reads and writes, or even stop all operations (the usual range of policies).
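A minimal sketch of a safe-mode trigger: count node-down events in a sliding window and, past a threshold, apply a configured policy. The window, threshold, and policy names are assumptions for illustration:

```python
import time
from collections import deque

class SafeMode:
    """Enter safe mode when too many nodes go down within a time window;
    the policy decides what to stop (repair, writes, everything)."""

    def __init__(self, window_secs=600, max_downs=20, policy="stop_repair"):
        self.window_secs = window_secs
        self.max_downs = max_downs
        self.policy = policy
        self.down_events = deque()
        self.active = False

    def node_down(self, node_id):
        now = time.time()
        self.down_events.append(now)
        while self.down_events and now - self.down_events[0] > self.window_secs:
            self.down_events.popleft()
        if len(self.down_events) > self.max_downs:
            self.active = True        # alert ops; leaving safe mode may be manual

    def repair_allowed(self):
        return not (self.active and self.policy in ("stop_repair", "stop_all"))

    def writes_allowed(self):
        return not (self.active and self.policy in ("stop_writes", "stop_all"))
```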

In a system without the notion of machine regions, safe mode provides very good protection. A project I once worked on experienced a large-scale outage: because there was no safe mode, the system carried on with replica repair as usual, hammered the remaining healthy storage nodes until they too gave out, and then avalanched, with the whole cluster stuck in a frenzy of replica repair. Fixing the cluster in that state had become difficult, because repair had already altered the meta information and the actual data. The final outcome of the incident was recovering the data from cold backups, losing the data written between the last backup and the failure.

Of course, safe mode is not a silver bullet. How long is "a certain time", how high is the "limit", when to stop replica repair, when to stop reads, when to stop writes, whether to recover automatically or return to normal only with human intervention, whether the granularity of safe mode should go down to the region level: all of these have to be considered when designing safe mode, and such designs are usually tightly coupled to the target users and the cluster. For example, a latency-sensitive business may prefer that small-scale failures never affect reads and writes, while a high-latency, high-throughput cluster can accept stopping reads and writes.

5. Closing thoughts

Because of the complexity of distributed storage systems and the limits of this article's length, only a limited number of typical scenarios were selected for analysis and discussion; a real distributed storage system is far more complex, with many more details. How to balance automated handling of cluster anomalies against the complexity it introduces, how to implement good flow control without hurting users' low-latency response times, how to steer cluster load balancing while keeping the balancing itself from adding excessive resource overhead to the cluster: issues like these keep emerging in real distributed storage system design. If you were the designer, how would you choose?


Source: www.cnblogs.com/mythmgn/p/10948310.html