Today I read teacher Li Yunhua's "Ali Game High Availability Architecture Design Practice" and would like to share some thoughts.
The sentence that impressed me most was his opening line: "Let R&D take the blame back!" In other words, a highly available system is designed; it is not something operations can guarantee after the fact.

He described, in order, how people tend to think when problems occur. The first thought is that operations is too low-end and the hardware is poor: why did a cabinet fail again this month, why did a switch fail again, did someone buy second-hand parts at the computer mall and put them in the machine room? The second thought is bad luck: a month or two ago this happened only once, yet this month it happened four times, so did nobody burn incense in the machine room? The third thought is insufficient testing: why were these bugs not found during the testing phase, only after going online? Then there is insufficient operations experience: when a switch fails, for example, some people think the fix is simple, just switch over. Some colleagues also said the process is imperfect and that there is a lot to improve across the whole workflow. For example, the response mechanism after a failure is not smooth enough: a crowd of people from R&D, testing, and operations rushes in, so should there not be a handling plan for the whole process with a designated person in charge? The main problem, however, lies in the system design.

There are several ways to address this.

High availability goals, the traditional method: once the direction is settled, we need to set a target. High availability is usually expressed as a number of nines. Five nines is roughly carrier-grade or financial-grade, and most Internet services sit at three to four nines. This has a drawback: apart from technical staff, other colleagues do not understand it well, because they cannot turn four nines or five nines into anything intuitive. So when we ran the project, we did not set the target this way.
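To make the nines more intuitive, they can be converted into allowed downtime per year. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function name is my own and does not come from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    Three nines means 99.9% availability, i.e. an unavailable fraction
    of 1 / 10**3 of the year.
    """
    return MINUTES_PER_YEAR / 10 ** nines


for n in (3, 4, 5):
    minutes = downtime_minutes_per_year(n)
    print(f"{n} nines: about {minutes:.2f} minutes of downtime per year")
```

Four nines works out to roughly 52.6 minutes of downtime per year, and five nines to a little over 5 minutes.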
Availability target, business-oriented: the target we finally settled on is quite different from a count of nines. A nines target looks mainly from the system's point of view: how many nines of reliability does this system have. The business-oriented target has these advantages:
1. It focuses on the business.
2. It is easy to decompose. The target itself gives us our direction of work: first, locate the problem (and how do we locate it? we can work out ways to do that); second, restore the business; third, keep the frequency of failures from getting too high.
3. It is easy to measure. Later, when we designed solutions, we could hold most proposals up against this standard and basically judge whether they were feasible.
Converted back, the whole target corresponds to roughly four nines, slightly higher than four nines.

Overall high availability architecture: the overall architecture has four layers: user layer, network layer, service layer, and operations layer. The architecture follows the same thinking as the target. We look at the entire business. We did not say which system should reach how many nines; instead, from the business point of view and across the whole process, we asked what each layer has to do if the target is to be met. Each layer needs its own solutions in order to reach the target. Next I will go through the basic ideas and practices of each solution in detail.
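Returning to the business-oriented target for a moment: a target stated as time to locate, time to restore, and failure frequency can be converted back into an availability figure. The sketch below shows the conversion under stated assumptions. The three numbers (3 minutes to locate, 5 minutes to restore, 6 failures a year) are illustrative values I picked, not figures given in the text above, and the class name is hypothetical.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


@dataclass(frozen=True)
class BusinessTarget:
    """A business-oriented availability target (all values are assumptions)."""
    locate_minutes: float      # time allowed to locate the problem
    recover_minutes: float     # time allowed to restore the business
    failures_per_year: float   # how often failures may occur

    def downtime_minutes_per_year(self) -> float:
        # Each failure costs at most locate + recover minutes.
        per_failure = self.locate_minutes + self.recover_minutes
        return per_failure * self.failures_per_year

    def availability(self) -> float:
        return 1 - self.downtime_minutes_per_year() / MINUTES_PER_YEAR


# Illustrative numbers only.
target = BusinessTarget(locate_minutes=3, recover_minutes=5, failures_per_year=6)
print(f"downtime per year: {target.downtime_minutes_per_year():.0f} minutes")
print(f"availability: {target.availability():.6f}")
```

With these assumed numbers the yearly downtime is 48 minutes, which is a little better than the roughly 52.6 minutes that four nines allows.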
The next step is architecture decoupling.

Business separation: in the original architecture, a single system contained every feature: login, registration, parameter delivery, messages, logs, and updates. For a player, though, only login, registration, and parameter delivery are strongly tied to actually playing the game. Messages, logs, and updates are not something a player must have in order to play. So business separation means splitting core and non-core business into different systems that call each other through interfaces. The benefit is that if the non-core system fails, the core system is not affected, because the two only call each other through interfaces and do not share the same resources.
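The separation described above can be sketched as follows. This is only an illustration under my own assumptions: `MessageClient` stands in for the interface to a non-core system, `CoreService` for the core system, and in a real deployment the call would go over the network with a timeout rather than being an in-process method call.

```python
class NonCoreUnavailable(Exception):
    """Raised when the non-core system cannot serve a request."""


class MessageClient:
    """Hypothetical client for a non-core system (messages)."""

    def __init__(self, healthy: bool = True) -> None:
        self.healthy = healthy

    def fetch_messages(self, player_id: str) -> list:
        if not self.healthy:
            raise NonCoreUnavailable("message system is down")
        return [f"welcome back, {player_id}"]


class CoreService:
    """Hypothetical core system: login must work even without messages."""

    def __init__(self, messages: MessageClient) -> None:
        self.messages = messages

    def login(self, player_id: str) -> dict:
        session = {"player_id": player_id, "logged_in": True}
        try:
            # Non-core data is optional; it is reached only via the interface.
            session["messages"] = self.messages.fetch_messages(player_id)
        except NonCoreUnavailable:
            # The non-core failure stays contained; login still succeeds.
            session["messages"] = []
        return session


core = CoreService(MessageClient(healthy=False))
print(core.login("player-1"))
```

The point of the design is that the core path never depends on the non-core call succeeding.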
Service center: the service center works much like DNS. It provides the scheduling for calls between services inside the system; in effect it is the system's naming service.

Business degradation: after the system is split into core and non-core business systems, some emergencies can still hurt the core. For example, the non-core system cannot be fixed even by restarting it, or its database has gone down, and this in turn affects the core system. In that situation the interface is still reachable, but its responses are extremely slow, so the core system slows down too. In this extreme case we can manually send a degrade instruction that stops the non-core system's function. This does not mean stopping the program; it means disabling one of its interfaces or URLs, so that the core system gets a 500 or 503 error when it calls that interface.
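A manual degrade switch of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not the implementation from the talk: the class name, the `/messages` path, and the in-memory flag are all hypothetical, and a real system would deliver the degrade instruction through some control channel.

```python
from http import HTTPStatus
from typing import Tuple


class NonCoreEndpoint:
    """Hypothetical non-core interface with a manual degrade switch."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.degraded = False

    def degrade(self) -> None:
        # The process keeps running; only this interface is switched off.
        self.degraded = True

    def restore(self) -> None:
        self.degraded = False

    def handle(self) -> Tuple[int, str]:
        if self.degraded:
            # Fail fast instead of answering slowly.
            return int(HTTPStatus.SERVICE_UNAVAILABLE), "degraded"
        return int(HTTPStatus.OK), "message list"


def core_fetch(endpoint: NonCoreEndpoint) -> str:
    """Core-side call: treat 500/503 as 'feature unavailable' and move on."""
    status, body = endpoint.handle()
    if status in (500, 503):
        return ""  # core flow continues without the non-core data
    return body


endpoint = NonCoreEndpoint("/messages")
endpoint.degrade()
print(endpoint.handle())
```

Returning an error immediately trades the non-core feature for a fast, predictable response, which is what keeps the core system from being dragged down.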
Summary: R&D, testing, and operations should all design for high availability together.