Reconstruction and evolution of nice server architecture

Editor's note: The High-Availability Architecture community shares and circulates representative articles from the architecture field. This article was shared by Lei Guoguo in the High-Availability Architecture group. When reposting, please indicate that it comes from the High-Availability Architecture public account "ArchNotes".

Lei Guoguo joined nice in November 2014 and is responsible for the server-side online business. He specializes in PHP and, on his own initiative, has translated the book "Extending and Embedding PHP" and parts of the official PHP manual. He likes to build his own tool chain from what he has learned and to think about system and architecture design.

nice is a photo-based social app whose goal is to help people discover the beauty of life. The core experience of the product is socializing around lifestyle.

We want users to express their lifestyles through pictures, live streams, tags, and emerging fashion brands, and we use this content as the basis for social scenarios. On the product side, we are still actively exploring how to deliver this value better.

At this stage, the nice server mainly faces the following challenges:

  • System design: it must embrace change and support the diverse requirements of the "product exploration stage".
  • Stability: stability problems must not hurt existing users, and the system must also cope with sudden business growth.
  • Collaboration: the server team is small but covers everything, and it sits between the client, strategy and recommendation, big data, QA, operations, product, and other teams. The question is how to bridge all of these parties through technical or non-technical means.

"Tear it down and start over": nice's road to reconstruction

As soon as I joined nice, I was given a very challenging task: rebuild the overall business code and framework of the server. Like many startup teams, we had accumulated a lot of technical debt as we grew.

  • The old system was written with the CI framework. There was no module division and no clear layering, business logic was handled directly in the entry points, and almost nothing was reused.
  • API versions were managed by copying directories. As the business grew, more than ten versions of the interface code had to be maintained at the same time, which was painful.
  • The code was full of compatibility logic such as if ($isAndroid && $appVersion >= 3).

Things got to the point where client/server joint debugging basically relied on shouting. I believe many people in startups have experienced these problems.

Old system architecture

The old system was a typical monolithic application: the admin backend, the HTML5 pages, and the API were all kneaded together.

Given the situation at the time, we first analyzed what needed to be solved and judged the following three problems to be the most critical.

  1. Structure. The code structure was messy and nothing could be reused.
  2. Client difference management. Copying interface versions produced a huge amount of duplicated code, most of which existed only to handle the special needs of specific clients, gray releases, small-traffic experiments, and similar cases.
  3. Collaboration between client and server RDs.

Layering and modularity

First, at the top level, a simple two-tier architecture solves the first problem.

At the code level, the system is divided into an application layer and a service layer:

  • The application layer handles entry concerns: the interaction protocol, access to common services such as authentication and anti-spam, and the customization each client needs.
  • The service layer handles business logic and is divided into vertical modules by business.

With layers and modules in place, the code became easier to manage and logic became far more reusable. The business division also laid the groundwork for later business isolation and tiered management.
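A minimal sketch of this split, written in Python for brevity rather than in nice's PHP. The class names (UserService, ApiApp), the token check, and the response shape are assumptions made for illustration and do not come from the original code base.

```python
# Service layer: business logic grouped into vertical modules (names are illustrative).
class UserService:
    def __init__(self, users):
        self._users = users

    def get_profile(self, user_id):
        user = self._users.get(user_id)
        if user is None:
            raise KeyError(f"unknown user {user_id}")
        return {"id": user_id, "name": user["name"]}


# Application layer: entry concerns only (authentication, protocol), no business rules.
class ApiApp:
    def __init__(self, user_service, valid_tokens):
        self._user_service = user_service
        self._valid_tokens = valid_tokens

    def handle_get_profile(self, request):
        if request.get("token") not in self._valid_tokens:
            return {"code": 401, "data": None}
        try:
            profile = self._user_service.get_profile(int(request["user_id"]))
        except (KeyError, ValueError):
            return {"code": 404, "data": None}
        return {"code": 200, "data": profile}


service = UserService({1: {"name": "alice"}})
app = ApiApp(service, valid_tokens={"t-1"})
response = app.handle_get_profile({"token": "t-1", "user_id": "1"})
```

Because the service module knows nothing about tokens or request formats, a second entry point (for example an admin backend) could reuse UserService unchanged.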

Two basic components

The client difference management and client/server collaboration problems described above are solved by two basic components of the framework:

  • ClientAdapter: the client adapter, which handles all logic caused by client differences.
  • CKCR: Check and Correct, which checks and corrects data. It enforces the input and output protocol and solves the technical side of client/server RD collaboration.

ClientAdapter component

Let's first look at an application scenario of ClientAdapter.

With a ClientAdapter configuration, commonly needed rules like the following can be implemented.

  • All clients at version 3.1.0 or above support the "hello" feature.
  • All clients at version 3.1.0 or above, together with version 3.1.0 from the gray-release channel abc36032, support the "emoji" feature.

In the business code, nice therefore deals with "Features" rather than with specific client environments. This is one application scenario of ClientAdapter.

The overall structure of the ClientAdapter

The foundation of ClientAdapter is the abstraction of a "client runtime environment", which describes the client that initiates each request: operating system, app version, IP, network type, network operator, geographic location, and so on. It also provides a simple rule syntax for describing a constrained client environment.

To the application layer it exposes only the checkEnv() interface, which checks whether the current client matches a given description rule. nice builds many upper-layer applications on this infrastructure.

For example, client differences used to force nice into complex client adaptation. With NiceFeature and its Feature mechanism, RDs work directly with the product's iterated features. Another example is NiceUrl, which implements unified CDN scheduling in nice: through ClientAdapter, we can flexibly apply different scheduling strategies to different regions.

This mechanism is also flexible enough for user traffic splitting, where nice needs to select experimental users along multiple dimensions.
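The idea can be sketched as follows. This is a Python illustration, not the published ClientAdapter code: the rule keys (os, min_version, version, channel), the FEATURES table, and the 3.2.0 threshold for "emoji" are assumptions chosen for the example, and check_env only mirrors the role of checkEnv().

```python
from dataclasses import dataclass


def parse_version(text):
    """Turn '3.1.0' into (3, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class ClientEnv:
    """The client runtime environment of one request (reduced to three fields)."""
    os: str
    app_version: str
    channel: str = ""


def check_env(env, rule):
    """Return True when env satisfies every condition present in rule."""
    if "os" in rule and env.os not in rule["os"]:
        return False
    if "min_version" in rule and parse_version(env.app_version) < parse_version(rule["min_version"]):
        return False
    if "version" in rule and parse_version(env.app_version) != parse_version(rule["version"]):
        return False
    if "channel" in rule and env.channel != rule["channel"]:
        return False
    return True


# A feature is enabled when any one of its rules matches (hypothetical config).
FEATURES = {
    "hello": [{"min_version": "3.1.0"}],
    "emoji": [
        {"min_version": "3.2.0"},                     # assumed general threshold
        {"version": "3.1.0", "channel": "abc36032"},  # gray-release channel
    ],
}


def supports(env, feature):
    return any(check_env(env, rule) for rule in FEATURES.get(feature, []))
```

Business code then asks supports(env, "emoji") instead of comparing platforms and version numbers inline, so a change in rollout policy only touches the configuration.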

ClientAdapter open source address

The implementation of ClientAdapter is very simple, only about 200 lines. The code has been extracted and published on GitHub for reference: http://t.cn/RGnqnpj .

One more note: the design of ClientAdapter borrows a common practice from C projects, where the autoconf stage turns information about the system environment into HAVE_XXX definitions. This decouples the complexity of the environment from the actual business code.

CKCR introduction

Problem 3 above, client/server RD collaboration, has two parts.

  • Technical level: the constraints of the protocol layer.
  • Process level: how to work together.

At the technical level, the problem is how to guarantee that the interface protocol is actually followed. On the input side, we must not trust the client: once a protocol is agreed, it must be enforced, so that a client controlled by "bad guys" cannot damage the service. On the output side, the data returned by the business layer must reach the client in the agreed form, to avoid client-side failures such as the common crash caused by a type mismatch.

In short, both input data and output data must be checked and corrected. nice therefore introduced a component layer called CKCR, short for ChecK && CorRect.

CKCR implementation

CKCR implements a small descriptive grammar for declaring how data is checked and corrected. The check and correction behaviors can be freely extended. The following is an example of its use.

In this example, $data is the data to be processed and $ckcrDesc is the description string written in this grammar. It describes the following rules.

  • The data as a whole is a KV array (mapping); only the two subkeys user and shows are kept.
  • user is also a KV array; its id is of type int and its name is of type str.
  • shows is an array whose elements are KV arrays; for each element, keep its id and url subkeys and apply the custom imgCdnUrl processing (which performs CDN scheduling).

CKCR has both similarities to and differences from protobuf and Thrift. The main difference is that CKCR provides a general, extensible way to verify and correct data. Thanks to this extensibility, it can do more with common data at a general level. For example, imgCdnUrl in the example above is the unified hook point for nice's CDN scheduling.
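The three rules above can be illustrated with a small check-and-correct routine. This Python sketch does not reproduce the CKCR grammar: the description is a nested dict/list instead of a description string, the handler table is an assumption, and cdn.example.com is a made-up host standing in for real CDN scheduling.

```python
from urllib.parse import urlsplit


def to_int(value):
    """Coerce to int; fall back to 0 when the value cannot be converted."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def to_str(value):
    return "" if value is None else str(value)


def img_cdn_url(value):
    """Stand-in for a CDN scheduling hook: rewrite the host, keep the path."""
    path = urlsplit(to_str(value)).path.lstrip("/")
    return "https://cdn.example.com/" + path


HANDLERS = {"int": to_int, "str": to_str, "imgCdnUrl": img_cdn_url}


def ckcr(data, desc):
    """Check data against desc and correct it; keys not in desc are dropped."""
    if isinstance(desc, str):        # leaf: a named handler
        return HANDLERS[desc](data)
    if isinstance(desc, list):       # list: one description applied to every element
        items = data if isinstance(data, list) else []
        return [ckcr(item, desc[0]) for item in items]
    source = data if isinstance(data, dict) else {}
    return {key: ckcr(source.get(key), sub) for key, sub in desc.items()}


DESC = {
    "user": {"id": "int", "name": "str"},
    "shows": [{"id": "int", "url": "imgCdnUrl"}],
}

raw = {
    "user": {"id": "42", "name": 7, "password": "secret"},
    "shows": [{"id": 1, "url": "http://img.host/a/b.jpg", "debug": True}],
    "extra": 1,
}
clean = ckcr(raw, DESC)
```

Here clean keeps only user and shows, turns the id "42" into the integer 42 and the name 7 into the string "7", and rewrites the picture URL through the hook. Adding a new behavior means registering one more entry in HANDLERS.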

CKCR internal structure and grammar rules
[Figures: CKCR internal structure and grammar rules]


That covers the basic functions of CKCR. In practice we later found that most of the system's core output data structures are the same across scenarios, so reusing CKCR description strings became a problem.

To solve this, nice introduced a preprocessing step before CKCR compilation, in which a fixed data structure description can be referenced through a special syntax. This mechanism brought an additional benefit: the core data structures of the system gradually settled into a shared set of definitions.
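A rough Python sketch of such a preprocessing step. The "@name" reference syntax, the STRUCTS registry, and the nested dict/list description format are all invented for this illustration; the real CKCR uses its own description strings and reference syntax.

```python
import copy

# Named core structures; the "@name" reference syntax is invented for this sketch.
STRUCTS = {
    "user_brief": {"id": "int", "name": "str"},
    "show": {"id": "int", "url": "imgCdnUrl", "author": "@user_brief"},
}


def expand(desc, structs, _stack=()):
    """Replace every "@name" reference with a copy of the named structure."""
    if isinstance(desc, str):
        if not desc.startswith("@"):
            return desc
        name = desc[1:]
        if name in _stack:
            raise ValueError("circular reference: " + " -> ".join(_stack + (name,)))
        return expand(copy.deepcopy(structs[name]), structs, _stack + (name,))
    if isinstance(desc, list):
        return [expand(item, structs, _stack) for item in desc]
    return {key: expand(sub, structs, _stack) for key, sub in desc.items()}


# An interface description only names the shared structures it needs.
feed_desc = expand({"user": "@user_brief", "shows": ["@show"]}, STRUCTS)
```

After expansion, feed_desc is an ordinary description with no references left, so the checking code does not need to know that preprocessing exists.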

Through this component, the client/server collaboration problem is solved at the technical level. So how do we solve the human side of collaboration?

First, CKCR description rules are concise, so they can be published directly as interface documentation.

Second, building on the interface documentation, the nice server and client RDs follow a clear collaboration process.

  • Agreement: the RDs on both sides discuss and agree on the interface protocol and produce the document; both sides then enter the design stage.
  • Mock data: the server RD quickly provides a mock interface that returns fake data, so that the client RD can self-test basic functions.
  • Real interface: the server RD delivers the real interface, and the two sides debug jointly.

This step-by-step approach largely solved the "joint debugging by shouting" problem. The work of the two sides is essentially decoupled, and neither side's progress blocks the other.
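One way the mock-data step could be supported is to derive fake responses from the agreed interface description. The source does not say that nice generated its mock data this way; the Python sketch below is only an assumption, and the description format and placeholder values are hypothetical.

```python
# Placeholder value for each leaf type in the (hypothetical) description format.
MOCK_VALUES = {
    "int": 1,
    "str": "mock",
    "imgCdnUrl": "https://cdn.example.com/mock.jpg",
}


def mock_from_desc(desc, list_size=2):
    """Build a fake response that has exactly the agreed shape."""
    if isinstance(desc, str):
        return MOCK_VALUES[desc]
    if isinstance(desc, list):
        return [mock_from_desc(desc[0], list_size) for _ in range(list_size)]
    return {key: mock_from_desc(sub, list_size) for key, sub in desc.items()}


AGREED_DESC = {
    "user": {"id": "int", "name": "str"},
    "shows": [{"id": "int", "url": "imgCdnUrl"}],
}
mock_response = mock_from_desc(AGREED_DESC)
```

The client RD can develop against mock_response while the real interface is still being written, since both are bound to the same description.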

CKCR open source address

The source code of the CKCR component has been extracted and published on GitHub at http://t.cn/RGn57Xk .

Summary of the first stage

These are nice's solutions to the three problems in the first stage:

  • Layering and modularization: a two-layer structure solves the structural problems.
  • Client adapter: solves the problem of client differences.
  • CKCR: together with the collaboration process, solves the client/server RD collaboration problem.

In this stage, the overall reconstruction mainly addressed development problems and paved the way for later architectural adjustments.

At that time, the approach was to "tear it down and start over." Looking back, the choice was right for the circumstances. Had we not refactored then, the system's legacy problems would only have become harder to fix as it grew.

That said, "tear it down and start over" is still a risky road. Before making such a decision, resources and risks must be assessed thoroughly.

The pitfalls we fixed for stability

After the overall refactoring, nice entered a period of rapid business growth. In March 2015 we also ran a month-long "SpecialForce" sprint: almost everyone in the R&D team lived near the company and worked more than 7*14 hours. That sprint lifted key product metrics such as daily active users, and interface PV peaked at 500 million per day.

Through August 2015, service stability was put to a severe test. To be honest, I was close to breaking down at the time. The main types of problems were as follows.

1. MySQL could not hold up

nice's MySQL cluster was originally a single instance with a single database, one master and four slaves, on mechanical hard disks. In April 2015, like many teams with fast business growth, OP replaced all DB disks with SSDs to solve the problem quickly, and the payoff was large.

In addition, because everything lived in a single database, services could not be isolated, and writes to the master had become a bottleneck. Around March 2015, nice began to consider splitting databases and tables. The databases were split mainly along vertical business lines.

Technical debt really is expensive to owe!

Splitting the databases took two colleagues almost half a year. Splitting the system's large core tables then took about another quarter.
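A small routing sketch shows what the two kinds of split mean for the code that picks a database and table. It is a Python illustration under assumptions: the module names, database names, shard count, and modulo scheme are hypothetical, since the source does not describe how nice sharded its core tables.

```python
# Vertical split: each business module owns its own database (names are made up).
DB_BY_MODULE = {"user": "db_user", "show": "db_show", "comment": "db_comment"}

# Table split: only the large core tables are sharded (count is an assumption).
SHARDED_TABLES = {"show": 16}


def route(module, table, shard_key=None):
    """Return (database, physical_table) for a logical table access."""
    database = DB_BY_MODULE[module]
    shards = SHARDED_TABLES.get(table)
    if shards is None:
        return database, table
    if shard_key is None:
        raise ValueError(f"table {table} is sharded; a shard key is required")
    return database, f"{table}_{shard_key % shards:02d}"


unsharded = route("user", "user")
sharded = route("show", "show", shard_key=35)   # 35 % 16 == 3
```

Every query on a sharded table now needs a shard key. Retrofitting that into existing business code is a large part of why splitting late is so much more expensive than planning for it early.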

Our lessons from MySQL are as follows.

  • Solving problems with hardware is very cost-effective.
  • Dividing databases by business line and estimating table sizes must not be done carelessly. Doing it up front may cost a few extra person-days; doing it afterwards may cost, as it did for us, a person-year or more of cleanup.

On the other hand, nice relies heavily on Redis. Part of the data is typical cache usage: when the online business misses the cache, it automatically falls back to the DB and refills the cache. Another part is quasi-persistent data, for which the online business does not fall back to the DB.
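The cache usage described here is the common cache-aside pattern. A minimal Python sketch, with a dict standing in for Redis and a callable standing in for the DB query; the class name and the db_reads counter are assumptions added for illustration.

```python
class CacheAside:
    """Read path for cache-type data: try the cache, fall back to the DB, refill."""

    def __init__(self, cache, load_from_db):
        self._cache = cache                # dict standing in for Redis
        self._load_from_db = load_from_db  # callable standing in for a DB query
        self.db_reads = 0                  # counts how often the DB was hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.db_reads += 1
        value = self._load_from_db(key)
        if value is not None:
            self._cache[key] = value       # refill so the next read is a hit
        return value


db = {"user:1": "alice"}
reader = CacheAside({}, db.get)
first = reader.get("user:1")    # miss: goes to the DB and refills the cache
second = reader.get("user:1")   # hit: served from the cache
```

Quasi-persistent data has no such fallback: if Redis loses it, there is nothing behind the cache to reload it from, which is why it needs its own recovery plan.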

2. Redis could not hold up

nice's Redis was also a single cluster. Around April and May 2015, with rapid iteration and many new features going live, Redis load rose quickly and Redis failures began to occur from time to time. nice then started splitting Redis by business. Because the business model was relatively simple, the Redis split went fairly quickly.

At the same time, because data had been lost during the failures, we decided to do some in-house development on Redis high availability, mainly smooth scaling and automatic failover.

Because we lacked experience in this area, many problems occurred. In the most serious case, nothing showed up during the online trial run, but after all traffic was switched over, multiple cluster nodes failed one after another. After about two days we could not hold on and had to switch back. Machine resources were tight, so we could not switch back all at once and had to go cluster by cluster. Meanwhile, several RDs were writing recovery scripts for almost all of the quasi-persistent data.

The Redis lesson was very painful. From the pitfalls we hit in 2015, we took away the following.

Lessons on load and capacity assessment

  • Load-related problems are still best solved by isolation and splitting.
  • Basic service monitoring is essential. Monitoring basic resources such as CPU, memory, disk, and bandwidth is cheap to implement, yet it often reveals problems early.
  • Service capacity evaluation requires careful thought. For cache data that the online business falls back to the DB for, beware of requests penetrating to the DB after a cache failure.
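The penetration risk in the last point can be made concrete with simple arithmetic: the DB normally sees only the cache misses, but after a cache failure it sees every read. The Python sketch below uses illustrative numbers only; they are not nice's real traffic or capacity figures.

```python
def db_qps(total_qps, hit_ratio):
    """Read QPS that reaches the DB when a fraction hit_ratio is served by the cache."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return total_qps * (1.0 - hit_ratio)


def survives_cache_outage(total_qps, db_capacity_qps):
    """With the cache gone the hit ratio is 0, so the DB sees every read."""
    return db_qps(total_qps, 0.0) <= db_capacity_qps


# Illustrative numbers only.
normal_load = db_qps(10_000, 0.95)                   # roughly 500 reads/s reach the DB
safe = survives_cache_outage(10_000, db_capacity_qps=2_000)
```

In this example the DB is comfortable at about 500 reads per second but would face 10,000 after a cache failure, five times its assumed capacity, so safe is False. The capacity plan has to cover the failure case, not just the steady state.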

Lessons on data

  • For quasi-persistent data, if Redis has no disaster recovery plan, it is best to keep records from which the data can be fully rebuilt. Otherwise, when something goes wrong, you will be writing recovery code on the spot, which is miserable.
  • Before migrating online data, prepare a rollback plan; otherwise the result may be a disaster.

Finally, a word on in-house development. Until the technical team reaches sufficient scale, it is better to use existing mature solutions. Even when the conditions are met, every step must be taken very carefully.

3. The front-end machines could not hold up

The last problem is load on the front-end machines.

First, while a startup is small, the access layer generally has few problems, but it is still advisable to prepare basic precautions or contingency plans against "bad guys".

As for the front-end machine cluster, the problems we experienced fall into two categories.

  • The client's polling logic had a bug, which caused users' devices to send a large number of requests to the server.
  • A back-end service failed, so the front-end machines processed requests slowly or could not connect at all, and their process pools filled up easily.

Because service resources had already been split and the business had been divided into modules, this was relatively easy to handle. The general approach is as follows.

Front-end machines are grouped by business: the core business usually changes less, so failures caused by changes in other businesses no longer affect it.

Degradation: we provide a set of degradation plans that can be switched on immediately when problems occur. There are two main types of strategy.
a. Request side: handle problems with a particular user group or a particular business;
b. Back-end service: handle the failure of a dependent back-end service.
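The two strategy types can be sketched as switches that request handlers consult before doing expensive work. This is a Python illustration under assumptions: the Degrader class, the switch names, and the feed handler are hypothetical and not taken from nice's code.

```python
class Degrader:
    """Holds the switches that operators turn on during an incident."""

    def __init__(self):
        self._blocked = set()   # (kind, name) pairs that are currently degraded

    def degrade(self, kind, name):
        self._blocked.add((kind, name))

    def restore(self, kind, name):
        self._blocked.discard((kind, name))

    def is_degraded(self, kind, name):
        return (kind, name) in self._blocked


def load_feed(degrader, user_group, fetch_recommendations):
    # a. Request side: a misbehaving user group can be cut off cheaply.
    if degrader.is_degraded("user_group", user_group):
        return {"items": [], "degraded": True}
    # b. Back-end side: skip a failing dependency instead of waiting on it.
    if degrader.is_degraded("backend", "recommend"):
        return {"items": [], "degraded": True}
    return {"items": fetch_recommendations(), "degraded": False}


degrader = Degrader()
normal = load_feed(degrader, "default", lambda: ["show-1"])
degrader.degrade("backend", "recommend")
reduced = load_feed(degrader, "default", lambda: ["show-1"])
```

With the switch on, the handler returns a reduced response immediately and never calls the failing service, so front-end worker processes are not tied up waiting.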

Back-end service disaster recovery: nice has added an LVS layer for failover of back-end services (internal services such as MySQL and Redis). In addition, the application on the front-end machines scores back-end services, so that problems detectable only on the application side are not missed.

After these two stages, our server architecture reached its current state.
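An application-side scoring mechanism could look roughly like the Python sketch below. The source gives no details of nice's scoring rules, so the score range, the penalty, and the usable threshold are all assumptions.

```python
class BackendScores:
    """Application-side health scores for back-end nodes."""

    def __init__(self, nodes, max_score=10, min_usable=5, penalty=3):
        self._scores = {node: max_score for node in nodes}
        self._max = max_score
        self._min_usable = min_usable
        self._penalty = penalty

    def report(self, node, ok):
        """Record the outcome of one call: failures cost more than successes earn."""
        score = self._scores[node]
        if ok:
            score = min(self._max, score + 1)
        else:
            score = max(0, score - self._penalty)
        self._scores[node] = score

    def usable(self):
        """Nodes scoring at or above the threshold; never return an empty list."""
        healthy = [n for n, s in self._scores.items() if s >= self._min_usable]
        return healthy or list(self._scores)


scores = BackendScores(["redis-a", "redis-b"])
scores.report("redis-a", ok=False)
scores.report("redis-a", ok=False)   # 10 -> 7 -> 4, below the threshold of 5
candidates = scores.usable()
```

Because the scores come from real request outcomes such as timeouts or slow responses, they can catch a node that still passes a simple liveness probe.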

nice's current pain points and next steps

After August 2015, through the joint efforts of RD and OP, service stability was brought under control. Meanwhile, the technical team grew quickly and gained full data and strategy teams. As more teams integrate with each other, servitization becomes increasingly important.

The server team itself is also growing. All online business code lives together, and a single release sometimes contains changes from more than 10 people. Coupling in the code base has become an obvious problem.

The most important problems facing nice now are servitization and code splitting.

Our current thinking on these problems is as follows.

  • Servitization, but not one-size-fits-all. Services support remote invocation, and at this stage they can also be deployed together with their callers. This avoids having too many services and taking on service governance problems too early.
  • Code splitting and an upgrade of the application development framework. Separate libraries, framework, and business code, maintain them independently, and introduce a library dependency management tool.
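The first point can be sketched as a service call that goes through a transport, so the same business call works whether the service is deployed with the caller or remotely. This Python illustration rests on assumptions: it reads "deployed together" as an in-process call, the class names and JSON message format are hypothetical, and the network hop is replaced by a loopback function.

```python
import json


class FeedService:
    """Hypothetical business service used only for this sketch."""

    def latest(self, user_id, limit):
        return [f"show-{user_id}-{i}" for i in range(limit)]


class LocalTransport:
    """Calls the service object in the same process."""

    def __init__(self, services):
        self._services = services

    def call(self, service, method, **params):
        return getattr(self._services[service], method)(**params)


class RemoteTransport:
    """Serializes the call; send stands in for the real network hop."""

    def __init__(self, send):
        self._send = send

    def call(self, service, method, **params):
        request = json.dumps({"service": service, "method": method, "params": params})
        return json.loads(self._send(request))["result"]


def make_loopback_server(services):
    """Fake remote end: decodes a request and dispatches it locally."""
    local = LocalTransport(services)

    def handle(request):
        message = json.loads(request)
        result = local.call(message["service"], message["method"], **message["params"])
        return json.dumps({"result": result})

    return handle


services = {"feed": FeedService()}
local = LocalTransport(services)
remote = RemoteTransport(make_loopback_server(services))

in_process = local.call("feed", "latest", user_id=7, limit=2)
over_wire = remote.call("feed", "latest", user_id=7, limit=2)
```

Callers depend only on call(), so moving a module out into its own service later becomes a deployment decision rather than a rewrite of business code.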

In the more than one year I have spent in a startup, I have learned a lot.

Entrepreneurship is a way of life. You face all kinds of problems at any time, some you are good at and some you are not. Either way, you have to step up. You made the choice, so you carry it.

This talk is based mainly on practical problems. I hope it helps others in startups avoid some detours.

Q & A

Question 1: What disaster recovery solution does nice use to make Redis highly reliable, and how is Redis master-slave synchronization across machines implemented?
Lei Guoguo: nice currently does this through a proxy, and it is currently deployed in a single data center.

Question 2: Does the access layer (network and layer-7 protection) in the architecture diagram use components such as nginx, keepalived, LVS, and haproxy?
Lei Guoguo: Yes, nice uses haproxy and nginx, with a WAF on nginx for basic rule-based protection.

Question 3: Even with the client adaptation layer in front, how do the back-end services behave differently for different adaptation conditions? Is multi-version maintenance still required?
Lei Guoguo: The key to the adaptation layer is that it turns the complex variety of environment information into Features. Business code is developed only against Features.

Question 4: When PHP is used as the service layer, now or after servitization, how do you solve the connection pool problem (short connections have many drawbacks when there are many machines)?
Lei Guoguo: Choosing swoole, or developing a connection pool extension yourself, can solve this. In the long run, though, I think PHP was not designed for this scenario, so this can only serve as a transitional solution.

Question 5: Why did you not consider building on cloud services in the early stage?
Lei Guoguo: nice does use some third-party services; we work with vendors such as Qiniu and Wangsu.

Question 6: How large is each nice Redis shard? Does full synchronization occur when the network jitters?
Lei Guoguo: Each shard is kept within 10 GB as far as possible. When failures were frequent in 2015, we also had problems caused by bgsave.

Question 7: How is API access implemented? What scheme is used with the LVS layer in the figure?
Lei Guoguo: For API access, nice initially used DNS directly. Because of carrier problems, it later moved to HTTPDNS, and the app now embeds its own IP list and accesses the API by IP. LVS is mainly used for automatic failover of services such as DB and Redis.

Question 8: How many people worked on the adapter and CKCR, and how long did the reconstruction take?
Lei Guoguo: The overall framework took two weeks, and I wrote it alone. The business reconstruction took one month, with 4 RDs plus friendly support from 1 OP.

Question 9: Since the back end is mainly PHP, have you considered asynchronous non-blocking technologies such as nginx_lua or node.js to improve concurrency? Is there room for optimization when handling 200 million requests with PHP alone?
Lei Guoguo: I think optimization depends first on whether there is a high-concurrency scenario. For example, for the second-hand trading feature that nice prepared some time ago, we expected scenarios similar to flash sales, and I did consider nginx_lua and other solutions. The final choice depends on the needs of the actual scenario.
How many requests PHP can handle depends on the complexity of the business. For example, at my previous company the request volume was very large. I reduced the online business to a single Redis get, dropped the web server, and served requests directly with PHP; a single machine reached 5000+ QPS.

Question 10: Why use LVS rather than haproxy for failover of back-end services?
Lei Guoguo: Here we use LVS only to forward requests, not to carry the traffic.

This article was planned by Tim Yang and Liu Yun, edited by Liu Yun and Hao Yaqi, and relayed in the group by Liu Shijie. To discuss the evolution of Internet architecture, follow the official account for a chance to join the group. When reposting, please indicate that it comes from the High-Availability Architecture "ArchNotes" WeChat official account and include the following QR code.


Origin blog.51cto.com/14977574/2547910