[Flash sale system] - key architecture and design principles

I recently read a digest on flash sale ("seckill") system design and found that many of its design principles, drawn from marketing-related systems, also apply to other system architectures, so I am keeping some simple notes here.

Let's begin with what makes a flash sale system special. June and November each year are the two big marketing seasons, when commodity-related traffic on the major Internet commerce platforms peaks and users are most active. Flash sale traffic is the most extreme kind of spike, so many companies split out a dedicated system to carry this part of the business.

So how do we understand a flash sale system? I think a programmer should first look at it from a high level and think about the whole. In my view, a flash sale mainly solves two problems: concurrent reads and concurrent writes. The core idea for optimizing concurrent reads is to make users read data from the server as rarely as possible, or read as little data as possible; for concurrent writes the principle is the same, and it requires us to split out a dedicated database and handle it specially. In addition, we must protect the flash sale system itself: design fallback plans for the unexpected, to keep the worst case from happening.

From an architect's point of view, to build and sustain a system with massive concurrent reads and writes, high performance, and high availability, the entire path a user's request travels from the browser to the server should follow a few principles: request as little data as possible, make as few requests as possible, keep the path as short as possible, depend on as little as possible, and have no single point of failure. Later sections explain each of these key points.

In fact, the whole flash sale architecture can be summarized with three keywords: "stable, accurate, fast."

"Stable" means the architecture must deliver high availability: it should hold up not only under the expected traffic, but also when traffic exceeds expectations, without dropping the ball. You have to guarantee the event completes smoothly; this is essential.

"Accurate" is about consistency: if the flash sale offers 10 iPhones, exactly 10 may be sold, not one more and not one fewer. Once the inventory goes wrong, the platform bears the loss, so "accurate" means guaranteeing data consistency.

"Fast" means performance must be high enough; how else could the system support such enormous traffic? It is not enough to push server-side optimization to the extreme; every point along the entire call chain must be optimized together before the whole system can be considered complete.

So from a technical point of view, "stable, accurate, fast" corresponds to our architecture's requirements for high availability, consistency, and high performance, and this series will mainly revolve around these areas, as follows.

  • High performance.  A flash sale involves massive concurrent reads and writes, so supporting high-concurrency access is critical. This series covers four aspects: dynamic/static data separation, hotspot discovery and isolation, peak clipping and layered request filtering, and extreme server-side optimization.
  • Consistency.  Correct inventory decrement is the other key to implementing a flash sale. Imagine a limited number of items receiving many times that number of simultaneous decrement requests. The decrement can happen at order time ("decrement on order"), at payment time ("decrement on payment"), or via pre-deduction (withholding); keeping the data accurate under heavily concurrent updates is hard to say the least. I will therefore devote a whole article to designing the inventory-decrement scheme for a flash sale.
  • High availability.  However many extreme optimizations we introduce, reality always brings situations we failed to consider. To keep the system available and accurate, we design a Plan B as a safety net, so that even when the worst happens we can respond calmly. In the last article of the series I will walk through which links in the chain need such fallback designs.
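To make the consistency point concrete, here is a minimal sketch of "decrement on order" as a single atomic conditional update, so stock can never go negative under concurrent requests. The schema, names, and in-memory SQLite store are illustrative stand-ins, not the original system's design:

```python
# Atomic conditional decrement: the check and the write are one statement,
# so no two buyers can both take the last item. SQLite stands in for the DB.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, count INTEGER)")
db.execute("INSERT INTO stock VALUES ('iphone', 10)")

def try_buy(sku: str) -> bool:
    # WHERE count > 0 makes check-and-decrement a single atomic statement
    cur = db.execute(
        "UPDATE stock SET count = count - 1 WHERE sku = ? AND count > 0", (sku,))
    db.commit()
    return cur.rowcount == 1  # exactly 1 row changed -> purchase succeeded

results = [try_buy("iphone") for _ in range(12)]  # 12 attempts, 10 in stock
print(results.count(True))  # 10
```

A real flash sale would layer pre-deduction and payment-time confirmation on top of this primitive; the key property is that the check and the decrement happen in one statement.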

In summary, a flash sale system is essentially a distributed system that must satisfy massive concurrency, high performance, and high availability. Next, let's talk about how to build such a distributed architecture and, on that foundation, push performance to the extreme for the flash sale business.

From the architect's perspective: first sketch the outline, think about how to build a system with massive concurrent reads and writes, high performance, and high availability, and identify the elements that need to be considered. I summarize these elements as "four as-little-as-possibles and one don't":

1. Request as little data as possible ("as little as possible" does not mean none at all; there must be a balance, viewed as a whole)

"As little data as possible" first means the data carried by a user's request should be minimal, in both directions: the data uploaded to the system and the data the system returns to the user (usually a web page).

Why does "as little data as possible" matter? First, transferring data over the network takes time. Second, both request and response data must be processed by the server, which typically performs compression and character encoding when writing to the network; these are CPU-intensive, so reducing the amount of data transferred significantly reduces CPU usage. For example, we can simplify the flash sale page and remove unnecessary decorative effects.

Second, "as little data as possible" also asks the system itself to depend on as little data as possible, including the data it must read and write to complete its business logic, which usually means dealing with the database and back-end services. Calling other services involves serializing and deserializing data, which is a major CPU killer and also adds latency. Moreover, the database itself easily becomes a bottleneck, so interact with it as little as possible; the simpler and smaller the data, the better.

2. Make as few requests as possible

After the requested page is returned, the browser makes additional requests while rendering it, for example for the CSS, JavaScript, images, and Ajax calls the page depends on. These "additional requests" should be as few as possible, because every request the browser sends has a cost: a three-way handshake to establish the connection, per-domain connection limits on some pages, requests (such as JavaScript) that must load serially, and so on. In addition, if the requests go to different domains, each domain must be resolved via DNS, which can take some time. So remember: reducing the number of requests significantly cuts the resource consumption from all these factors.

For example, the most common way to reduce the number of requests is to merge CSS and JavaScript files: combine multiple JavaScript files into a single request, with the file URLs separated by commas ( https://g.xxx.com/tm/xx-b/4.0.94/mods/??Module1-preview/index.xtpl.js,JHS-Module1/index.xtpl.js,Module1-Focus/index.xtpl.js ). On the server each file is still stored separately, but a server-side component parses the URL, merges the files on the fly, and returns them together.
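As a sketch of how such a combo URL might be served, the following assumes the "??a.js,b.js" convention shown above; the resolver function, file names, and in-memory file store are all hypothetical:

```python
# Toy combo-URL resolver: split the path at '??', then concatenate the
# named files into one response body. A dict stands in for files on disk.
FILES = {
    "mods/a.js": "console.log('a');",
    "mods/b.js": "console.log('b');",
}

def resolve_combo(path: str, files: dict[str, str]) -> str:
    """Given '/mods/??a.js,b.js', merge files['mods/a.js'], files['mods/b.js']."""
    base, sep, names = path.partition("??")
    if not sep:
        raise ValueError("not a combo URL")
    base = base.lstrip("/")  # 'mods/' becomes the common prefix
    return "\n".join(files[base + name] for name in names.split(","))

body = resolve_combo("/mods/??a.js,b.js", FILES)
```

One request now delivers both files, at the cost of a small parsing component on the server.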

3. Keep the path as short as possible

The "path" is the sequence of intermediate nodes a user's request passes through on its way to the data and back. Typically each node is a system or a new socket connection (a proxy server, for example, creates a new socket connection to forward the request). Each additional node, and therefore each additional connection, adds new uncertainty. Probabilistically, if a request passes through 5 nodes and each node's availability is 99.9%, then the availability of the whole request is 99.9% to the 5th power, roughly 99.5%. Shortening the path not only improves the request's availability; it also improves performance (fewer intermediate nodes means less serialization and deserialization) and reduces latency (less network traffic to process).
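The availability figure can be checked in a couple of lines: nodes in series multiply, so five 99.9% nodes give roughly 99.5% end to end.

```python
# End-to-end availability of a chain of nodes: each node must succeed,
# so the probabilities multiply.
def chain_availability(node_availability: float, nodes: int) -> float:
    return node_availability ** nodes

a = chain_availability(0.999, 5)
print(f"{a:.4f}")  # 0.9950, i.e. about 99.5%
```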

One way to shorten the access path is to deploy multiple strongly interdependent applications together, turning remote procedure calls (RPC) between them into method calls inside one JVM. In my book "Technical Architecture Evolution and Performance Optimization of Large Websites" there is a chapter on the detailed implementation of this technique.

4. Depend on as little as possible

"Dependency" here means another system or service that must be available to complete a user request; what matters is the strong dependencies. For example, to show the flash sale page you strongly depend on product information and user information, while other data such as coupons and the order list is not strictly required for the sale to proceed (weak dependencies). These weak dependencies can be cut off in an emergency.

To reduce dependencies, we can classify systems into levels, say Level 0, Level 1, Level 2, and Level 3. If Level 0 systems are the most important, then the systems they strongly depend on are also the most critical, and so on down the chain.

Note that Level 0 systems should minimize strong dependencies on Level 1 systems, so that an unimportant system cannot drag down an important one. For example, if the payment system is Level 0 and coupons are Level 1, then in extreme cases the coupon service can be degraded, preventing the Level 1 coupon system from crashing the payment flow.
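A minimal sketch of this kind of degradation, with a hypothetical coupon lookup standing in for the Level 1 service:

```python
# Degrading a weak (Level 1) dependency: the coupon lookup may fail or
# time out, but checkout (Level 0) must still complete.
def fetch_coupons(user_id: str) -> list[str]:
    raise TimeoutError("coupon service overloaded")  # simulate an outage

def checkout(user_id: str, amount: float) -> dict:
    try:
        coupons = fetch_coupons(user_id)
    except Exception:
        coupons = []  # degrade: proceed without coupons rather than fail
    return {"user": user_id, "amount": amount, "coupons": coupons, "ok": True}

print(checkout("u1", 99.0)["ok"])  # True even though coupons failed
```

In production this try/except would usually be a circuit breaker with a timeout rather than a bare exception handler, but the principle is the same.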

5. Do not have a single point

A single point is a taboo in system architecture, because a single point means no backup and uncontrollable risk. The most important principle when we design distributed systems is "no single point."

How do we avoid single points? I think the key is to avoid binding services to specific machines, that is, to make services stateless so they can move freely between machines.

And how do we decouple services from machines? There are many implementations. For example, make machine-dependent parameters dynamically configurable: services pull them from a configuration center at startup, or the center pushes them down, and we set rules in the configuration center so these mappings can be changed easily.

Making applications stateless is one way to avoid single points, but storage services themselves are hard to make stateless, because data has to live on disk and is thus bound to a machine. For that scenario we generally solve the single-point problem through redundancy, keeping multiple backups of the data.
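A toy illustration of the decoupling idea: the service looks up its machine-specific settings from a (mocked) configuration center at startup instead of hard-coding them, so it can be started on any machine. The config-center structure and all names here are invented for the example:

```python
# A dict stands in for a real configuration center (ZooKeeper, etcd, ...);
# the machine-to-shard mapping lives there, not in the service binary.
CONFIG_CENTER = {
    "shard_map": {"host-a": [0, 1], "host-b": [2, 3]},
}

class Worker:
    def __init__(self, hostname: str):
        # no hard-coded machine identity: everything is looked up at startup,
        # so moving the service to another host needs only a config change
        self.shards = CONFIG_CENTER["shard_map"].get(hostname, [])

w = Worker("host-b")
print(w.shards)  # [2, 3]
```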

 

Note: in all the design principles above, have you noticed that I always say "as ... as possible" rather than "absolutely"?

You may want to ask: are the fewest requests necessarily the best? My answer is "not necessarily." We once inlined some CSS directly into the page; that removes a request dependency and speeds up rendering of the above-the-fold content, but it also increases the page size, which violates "as little data as possible." In that case, to improve above-the-fold rendering speed, we inlined only the CSS the first screen depends on into the HTML and kept loading the remaining CSS as separate dependent files, striking a balance between above-the-fold rendering speed and total page load performance.

So architecture is an art of balance, and even the best architecture, once removed from the scenario it was adapted to, is nothing but empty talk.

Different architectures for different scenarios

With the flash sale architecture principles above in mind, let's combine them with the early evolution of Taobao's flash sale system and work out what the optimal flash sale architecture looks like at different request volumes.

If you just want to stand up a simple flash sale system quickly, you only need to add a "scheduled listing" feature to the product purchase page, so users see the buy button only when the sale begins, and the sale ends when stock sells out. This is how the first version of the flash sale system was implemented.

But as the request volume grows (say from 1w/s to 10w/s, i.e. from 10,000 to 100,000 requests per second), this simple architecture quickly hits a bottleneck, so the architecture needs changes to improve performance. These architectural changes include:

  1. Split the flash sale into a standalone system that can be optimized specifically, for example by dropping shop-decoration features from it and reducing page complexity;
  2. Deploy that standalone system on its own machine cluster, so heavy flash sale traffic does not affect the load on the cluster serving normal product purchases;
  3. Put hotspot data (such as inventory data) into a separate cache system to improve "read performance";
  4. Add a question-answering step to the flash sale, to stop purchase bots from snapping up orders.
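Point 3 can be sketched as a read-through cache in a few lines; the dict-backed "database" and "cache cluster" below are stand-ins for the real stores:

```python
# Read-through cache for hot inventory data: the first read loads from the
# database, and all subsequent hot reads are served from the cache.
DB = {"sku-1": 10}           # stand-in for the authoritative database
CACHE: dict[str, int] = {}   # stand-in for a dedicated cache cluster

def read_stock(sku: str) -> int:
    if sku not in CACHE:     # cache miss: load once from the database
        CACHE[sku] = DB[sku]
    return CACHE[sku]        # hot reads never touch the database again

print(read_stock("sku-1"), read_stock("sku-1"))  # second read hits the cache
```

Note that a cached inventory value is only safe for reads; writes must still go through the atomic decrement path, with the cache invalidated or updated afterwards.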

At this point the system architecture becomes like the next diagram. The most important change is that the flash sale has become a standalone new system; besides moving some core data into a cache (Cache), the other related systems are also deployed as a separate cluster.

However, this architecture still cannot support request volumes above 100w/s (1,000,000 requests per second), so to further improve the flash sale system's performance we upgraded the architecture again, for example:

  1. Fully separate the page's static and dynamic content, so that during the sale users do not refresh the whole page but only click the buy button, cutting the data refreshed per page to a minimum;
  2. Cache the flash sale products locally on the server side, so serving a request needs no calls to back-office services for data and not even a query to the shared cache cluster; this both reduces system calls and avoids crushing the shared cache cluster;
  3. Add rate limiting (throttling) to the system, to prevent the worst case from happening.
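Point 3 (rate limiting) is often implemented as a token bucket; the sketch below uses a manual clock and invented parameter names purely for illustration:

```python
# Tiny token-bucket rate limiter: tokens refill at a steady rate up to a
# burst capacity; a request is admitted only if a token is available.
class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, 0.0

    def allow(self, now: float) -> bool:
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject (or queue) the request

tb = TokenBucket(rate=1, capacity=2)                 # ~1 request/s sustained
print([tb.allow(t) for t in (0.0, 0.1, 0.2, 1.5)])   # [True, True, False, True]
```

A real deployment would use a monotonic clock and enforce the limit at the gateway, but the admit/reject logic is the same.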

After these optimizations, the system architecture becomes like the following figure. Here we have staticized the pages further: the flash sale process no longer needs to refresh the whole page, and the server only needs to serve a small amount of dynamic data. Moreover, the most critical flash sale and trading systems have added local caches to hold the sale's product information in advance, hotspot databases have been deployed standalone, and so on.

Looking back over these upgrades: the further you go, the more customization is needed, which means things become less and less "general-purpose." For example, caching flash sale products in each machine's memory is clearly unsuitable when too many products are on sale at once, since a single machine's memory is always limited. So to get the ultimate performance, sacrifices must be made elsewhere (in generality, ease of use, cost, and so on).

 

To sum up

Looking back at what we covered: I first introduced the three key considerations of flash sale system design, "stable, accurate, fast," which correspond to the general optimization ideas for large-concurrency, high-performance, high-availability systems, and abstracted them into the "four as-little-as-possibles and one don't" principles: as little data as possible, as few requests as possible, as short a path as possible, as few dependencies as possible, and no single point. Of course, these points are directions to work toward; actual operation must stay close to the real scenario and its specific conditions. In short, this is only a main line of thinking; concrete implementation must be combined with the concrete scenario.


Origin blog.csdn.net/a1290123825/article/details/86689351