From PHP 5.6 to Go 1.19: the Baidu Wenku App's road to better performance

Author | Baidu Wenku App

Guide

This article shares, in an accessible way, the practical experience of migrating the Baidu Wenku App server technology stack from PHP to Go. It covers the concrete approach to technology selection, infrastructure construction, and traffic migration, as well as case studies of refactoring core projects.

The full text is 6209 words, and the expected reading time is 16 minutes.

01 Motivation

For a long time, the Baidu Wenku App server used PHP as its main development language, which efficiently supported iterative business development. As platform traffic kept growing, the load on the server side increased and approached the system bottleneck. To raise the system's load capacity we applied several optimizations, the fastest and most effective of which was to increase the number of instances in the online cluster. In addition, projects written in Lua took over some interfaces with simple logic and high request volume to share the load. Because of limitations in Lua itself, however, it is not suitable for complex business logic.

Following the general trend in the industry, and responding to the company's call to reduce costs and increase efficiency, we decided in mid-2022 to migrate and refactor the server-side technology stack. The aim was to upgrade the technical architecture and increase the system's load capacity.

The timing of a technology stack migration and project refactoring is hard to get right, especially when a mature team needs to make major system changes. Without a real pain point, even if the engineers believe the technical implementation contains many unreasonable and risky designs, they are rarely allowed to spend much time on purely technical projects. Things change once business staff (product managers, sales, operations) also feel that the system needs to be upgraded. For example, the App's document search interface had a relatively long delay, and the product managers believed that significantly faster first-screen rendering would greatly improve the click-through rate and the payment rate, yet this was hard for engineering to optimize on the old technology stack. A moment like this is well suited to migration and refactoring.

At the right time, the product managers proposed several ideas for improving the user experience that fit the needs of the refactoring, such as faster first-screen rendering of the search results page, a new App home page, and AIGC intelligent creation. Large requirements like these are a good starting point for migration and refactoring, because they minimize the additional manpower that the work requires. Migrating and refactoring existing functions brings faster and more stable interface responses and a smoother user experience, which helps the whole team achieve its OKR goals.

Looking back, the technical debt of the PHP 5.6 based server at that time was as follows:

1. Underlying technology : the language version was old and its features outdated, with defects such as low execution efficiency, security risks, and wasted resources;

2. Development quality and efficiency : cross-coupled business logic, together with a large number of abandoned interfaces and business logic that had already gone offline, reduced code readability and maintainability and steadily increased the difficulty of iterating on the project.

picture

△Technical debt

02 State before the migration

In terms of service deployment, the Wenku App server was deployed as nginx+hhvm (HipHop VM 3.0.1; baidu version:1.1.6 (rel)). HHVM is a high-performance PHP virtual machine developed by Facebook, a performance-optimized alternative to the traditional nginx+php-fpm setup. In recent years this version has no longer been maintained or iterated by the original HHVM team. The syntax features and execution efficiency it supports are relatively backward, and it carries certain security risks.

We also checked server instance usage at the start of the migration, which relies on day-to-day operation and maintenance. The first step was to confirm that the CPU, memory, and disk usage of the online service instances were within reasonable thresholds, ruling out both the resource waste caused by low utilization and the disaster-recovery risk caused by high utilization. At the application layer, we were running thousands of PHP 5 instances.

03 Vision

The investment and return of refactoring are not linear.

-- "Domain-Driven Design: Tackling Complexity in the Heart of Software"

Intuitively, we hoped the server-side upgrade would bring less code, a more stable system, higher quality and efficiency, and a better user experience. This is reflected in the following points:

1. Technology upgrade : adopt a modern language and framework that provide a powerful underlying engine, security, and a mature ecosystem to support efficient iteration of the project;

2. Improved design : sort out code logic, eliminate redundancy, remove bad smells from the code, and build a highly reusable, loosely coupled, and scalable business architecture;

3. Cost reduction and efficiency increase : on the one hand, the upgraded foundation improves code execution efficiency, reduces average response time, and improves service availability and observability; on the other hand, in operations practice, allocating CPU, memory, and disk quotas to container instances sensibly optimizes resource usage.

The server-side upgrade can be achieved through two kinds of work: migrating the technology stack and refactoring the design of the existing code.

04 Technology selection

We did not intend to use a relatively niche language with an isolated ecosystem as the technology stack of the Wenku App server. Taking into account the upgrade direction chosen by sibling teams as well, two in-house frameworks reached the final round of selection: the PHP 7 based ODP3 framework (Online Develop Platform) and the Go based GDP2 framework (Go Develop Platform).

picture

Option One: PHP7 Framework and Phaster

The PHP7 framework is an online business development platform released by the company. It provides a standard web server environment, a standard PHP environment, the AP framework, base libraries, a resource access layer, general services, and other components, and it unifies business logic and deployment structure. The highlight of the framework is Phaster, which lets you use PHP to develop high-performance HTTP, FastCGI, and Nshead services, make high-performance RPC calls, and parallelize business code at very low cost.

The comparison between Phaster and other industry frameworks is as follows.

picture

Phaster can run as an HTTP server or a FastCGI server. Compared with the traditional nginx+cgi approach, Phaster achieves a severalfold performance improvement. Its highlights are:

1. With traditional hhvm or php-fpm, every request first initializes the PHP context and then cleans it up at the end of the request to reclaim resources. When Phaster enables context reuse, it saves the time spent on class loading, file loading, initialization, and similar steps. For example, if your interface needs to read a large configuration file on every request, you can move the read into the initialization file; within 100 requests, the read then only needs to run once;

2. hhvm and php-fpm do not support the HTTP protocol directly. Usually nginx is placed in front as the HTTP server, and the two communicate through FastCGI. Phaster can be started directly as an HTTP server, which removes the nginx layer of processing and forwarding;

3. Coroutine support provides a foundation for high concurrency in IO-intensive business scenarios. Blocking IO can be placed in a coroutine to make it non-blocking, so that code written in a synchronous style still enjoys the IO performance gains of asynchronous execution.

It is worth mentioning that Go also supports all of these capabilities.
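As a rough illustration of how these three capabilities look in Go, the following minimal sketch uses only the standard library. The configuration content, the `loadConfig` and `serveOnce` helpers, and the response body are assumptions made up for the example; they are not part of the Wenku project or the GDP2 framework. A long-lived process keeps its initialized state across requests (here via `sync.Once`), `net/http` handlers speak HTTP without a FastCGI hop, and each request can run in its own goroutine.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

var (
	configOnce sync.Once
	config     map[string]string
	loadCount  int // how many times the expensive load actually ran
)

// loadConfig stands in for reading a large configuration file (hypothetical).
// sync.Once guarantees the body runs a single time for the life of the process.
func loadConfig() map[string]string {
	configOnce.Do(func() {
		loadCount++
		config = map[string]string{"greeting": "hello"}
	})
	return config
}

// handler serves every request from the already-initialized configuration.
// In a real service it would be registered with http.ListenAndServe.
func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, loadConfig()["greeting"])
}

// serveOnce pushes one in-memory request through the handler and returns the body.
func serveOnce() string {
	rec := httptest.NewRecorder()
	handler(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Body.String()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent requests, one goroutine each
		wg.Add(1)
		go func() {
			defer wg.Done()
			serveOnce()
		}()
	}
	wg.Wait()
	fmt.Println("requests: 100, config loads:", loadCount)
}
```

However many requests are served, the load counter stays at 1, which mirrors the "read once, reuse across requests" behavior described for Phaster's context reuse.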

Option Two: Go Framework

The Go language was released by Google in 2009. In recent years it has risen rapidly alongside cloud computing, microservices, and distributed systems, and has become one of the mainstream programming languages. Like Java, it is a statically and strongly typed, compiled language. It was designed with concurrency in mind, so it is naturally suited to concurrent (network) programming.

The GDP2 (Go Develop Platform) framework is a Go development framework with good support for in-house infrastructure, good extensibility, and easy configuration, assembly, and testing. It has complete RPC client and RPC server capabilities and a supporting set of common base libraries, and it can be used to develop applications such as APIs, web services, and back-end services. Its highlights are:

1. Good support for in-house infrastructure;

2. Good extensibility, easy to configure and assemble;

3. Easy to use and test-friendly (easy to mock, with multiple testServer and testClient helpers);

4. The internal state of components is easy to observe;

5. Full-link timeout and process control mechanisms, good stability;

6. Applied at large scale inside the company, stable and reliable (used by essentially all Go projects, thousands in total).

picture

Weighing the characteristics of the two frameworks, the feasibility of adoption, and other factors, we preferred to migrate to the GDP framework.

05 The critical path for refactoring

With the technology selected, we moved on to the next step. Like most web projects, the Wenku App business iterates quickly and carries a heavy workload, so it is hard to guarantee enough manpower for technical projects over a long period. The precondition for upgrading and refactoring the technology stack is therefore that business requirements keep being delivered. This calls for a mindset of continuous refactoring, usually carried out through "agile iteration".

5.1 Agile iteration

The first step is workload estimation. Aggregating and analyzing logs yields the App interface routes that currently receive traffic (many interfaces in the old project had no traffic because the associated requirements had gone offline). In practice, we found that the top 50 interfaces, sorted by QPS in descending order, accounted for more than 99% of total traffic, and this ranking also determined the order in which interfaces were migrated.
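The ranking step can be sketched as follows. This is only an illustration under assumptions: the route names and QPS figures are invented, and `topNShare` is a hypothetical helper, not the team's actual log analysis tooling. It sorts routes by QPS in descending order and reports what fraction of total traffic the first n routes carry.

```go
package main

import (
	"fmt"
	"sort"
)

// topNShare returns the n busiest routes (by QPS, descending) and the
// fraction of total QPS they carry. Ties are broken by route name so the
// result is deterministic.
func topNShare(qps map[string]float64, n int) ([]string, float64) {
	routes := make([]string, 0, len(qps))
	total := 0.0
	for route, v := range qps {
		routes = append(routes, route)
		total += v
	}
	sort.Slice(routes, func(i, j int) bool {
		if qps[routes[i]] != qps[routes[j]] {
			return qps[routes[i]] > qps[routes[j]]
		}
		return routes[i] < routes[j]
	})
	if n > len(routes) {
		n = len(routes)
	}
	top := routes[:n]
	if total == 0 {
		return top, 0
	}
	sum := 0.0
	for _, route := range top {
		sum += qps[route]
	}
	return top, sum / total
}

func main() {
	// Hypothetical per-route QPS figures, as they might come out of log aggregation.
	qps := map[string]float64{"/search": 900, "/home": 90, "/legacy/export": 10}
	top, share := topNShare(qps, 2)
	fmt.Printf("top routes %v carry %.0f%% of traffic\n", top, share*100)
}
```

With real data, the same calculation with n = 50 would give the share that the article reports as more than 99%.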

The second step is to formulate a strategy. In essence, moving the server-side technology stack to Go means transferring the traffic served by PHP to Go. When all traffic runs on Go instances and nothing depends on underlying calls into the PHP project, the upgrade can be considered complete. Continuously increasing the share of total App server traffic served by the Go instance cluster was therefore the goal of our migration. The approach can be roughly summarized in two points:

1. Migrate existing interfaces in a priority order determined by business needs, interface importance, and traffic share;

2. Where the implementation of a new requirement has no strong dependency on the PHP project, develop it directly in the Go project.

For quite a long time, PHP and Go coexisted in a state of hybrid programming. Because of the App's C/S architecture, a refactored interface has to be forwarded through the access layer gateway, which switches the application layer cluster that receives the traffic (php->go). The client keeps the same path, which keeps old versions of the App fully available. In the early stage of refactoring under this hybrid approach, special requirements may arise, for example the same piece of business logic having to be written in both Go and PHP, which certainly adds some workload. This is unavoidable.

One pitfall to avoid when refactoring is the waterfall model, which refactors the entire project in one go. In terms of time, labor cost, and stability, that approach is relatively risky and not recommended. Refactoring in batches at interface granularity makes both the internal technical iteration and the external business impact clearly visible, and it is better suited to traffic migration.

5.2 Infrastructure construction supporting golang

The workflow differs from that of a PHP project in the following points.

1. Scaffolding : besides defining framework properties such as routing, logic layering, and generated configuration, the Go project also needs to wrap goroutines, so that engineers get scaffolding that works out of the box.

2. Release : wrap the build logic to handle compilation, packaging, and environment variable management.

3. Deployment : the whole binary is replaced and the service must be restarted. A hot restart module makes releases lossless and faster.

4. Traffic : migrating the interfaces of a C/S App requires the access layer to rewrite routes, in coordination with gateway changes.

5. Monitoring : logs are written at different levels, trace information is passed through between microservices, and all logs are written to disk in the format expected by the collection agent.
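To make the trace pass-through in the monitoring point more concrete, here is a minimal sketch built on the standard library only. The header name `X-Trace-Id`, the downstream URL, and all function names are assumptions for illustration; the article does not describe how GDP2 or the in-house tracing system actually does this. The middleware reuses an incoming trace ID or generates one, stores it in the request context, and `outbound` copies it onto calls to the next service.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const traceHeader = "X-Trace-Id" // hypothetical header name

type traceKey struct{}

// newTraceID returns 8 random bytes as a 16-character hex string.
func newTraceID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// withTrace reuses the caller's trace ID, or creates one, and stores it in the context.
func withTrace(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(traceHeader)
		if id == "" {
			id = newTraceID()
		}
		w.Header().Set(traceHeader, id)
		ctx := context.WithValue(r.Context(), traceKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// traceFrom reads the trace ID back out of a context ("" if absent).
func traceFrom(ctx context.Context) string {
	id, _ := ctx.Value(traceKey{}).(string)
	return id
}

// outbound builds a downstream request that carries the same trace ID.
func outbound(ctx context.Context, url string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set(traceHeader, traceFrom(ctx))
	return req, nil
}

// demo simulates one inbound request and returns the trace ID that would be
// attached to the downstream call. No network traffic is sent.
func demo(incoming string) string {
	var downstream string
	h := withTrace(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		req, err := outbound(r.Context(), "http://downstream.invalid/api")
		if err == nil {
			downstream = req.Header.Get(traceHeader)
		}
	}))
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if incoming != "" {
		req.Header.Set(traceHeader, incoming)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return downstream
}

func main() {
	fmt.Println("propagated:", demo("abc123"))
	fmt.Println("generated: ", demo(""))
}
```

Writing the same ID into every log line on each service is what lets a log collection agent stitch one request's path back together.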

5.3 Write the first interface

When starting the technology stack migration, one thing to understand is that the Go language has built-in support for concurrency, so asynchronous programs are easy to write in a strongly typed language. Go is a statically and strongly typed language: types are determined at compile time. It is less flexible than PHP, but more rigorous and safer, and it can catch many hidden problems at compile time.

picture

△Type conversion
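The content of the figure is not reproduced here, but the difference can be illustrated with a small sketch. The `page` parameter and the helper names are assumptions chosen for the example. In PHP a string such as "abc" can be cast to an integer silently; in Go the conversion goes through `strconv` and returns an error that the code has to handle.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePage converts a raw "page" query parameter into a positive int.
// In PHP, (int)"abc" silently becomes 0; in Go the failure is explicit,
// so the fallback to a default value has to be written down.
func parsePage(raw string, def int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return def
	}
	return n
}

// label shows the opposite direction: an int must be converted before it
// can be concatenated with a string.
func label(page int) string {
	return "page " + strconv.Itoa(page)
}

func main() {
	for _, raw := range []string{"3", "abc", "", "-2"} {
		fmt.Printf("%q -> %s\n", raw, label(parsePage(raw, 1)))
	}
}
```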

The other question is how a refactoring project should deal with stale code. In short, you can refer to the catalog of bad code smells in the book "Refactoring: Improving the Design of Existing Code", refactor the code in a targeted manner, and tame it into clean, readable code.

picture

Once the preliminary research was done and the migration strategy decided, the actual development became straightforward. To migrate traffic from an old interface, we re-implement it in Go as a new interface. The new interface is called in exactly the same way as the old one, including path, method, signature verification, header rules, parameter structure, response structure, and error codes. Only the virtual domain name at the application layer differs.
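One way to check that a re-implemented interface keeps the same response structure is to compare the JSON returned by the old and new implementations for the same request. The sketch below is an assumption about how such a check could look, not a description of the team's tooling; `sameJSON` and the sample bodies are invented for the example. Key order is ignored, but a value that changes type (for instance a number that becomes a string) counts as a difference, which is exactly the kind of drift that can appear when moving from loosely typed PHP to Go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// sameJSON reports whether two response bodies decode to the same JSON value.
// Key order does not matter; value types do.
func sameJSON(oldBody, newBody string) (bool, error) {
	var a, b interface{}
	if err := json.Unmarshal([]byte(oldBody), &a); err != nil {
		return false, fmt.Errorf("old response: %w", err)
	}
	if err := json.Unmarshal([]byte(newBody), &b); err != nil {
		return false, fmt.Errorf("new response: %w", err)
	}
	return reflect.DeepEqual(a, b), nil
}

func main() {
	// Hypothetical response bodies from the PHP and Go implementations.
	phpBody := `{"errno":0,"data":{"title":"doc","pages":3}}`
	goBody := `{"data":{"pages":3,"title":"doc"},"errno":0}`
	same, err := sameJSON(phpBody, goBody)
	fmt.Println("same structure:", same, "error:", err)

	drifted := `{"errno":0,"data":{"title":"doc","pages":"3"}}`
	same, err = sameJSON(phpBody, drifted)
	fmt.Println("same structure:", same, "error:", err)
}
```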

5.4 Quality Assurance

With the code ready, testing differs from the regular process for PHP projects: Go cannot skip performance testing. When writing PHP we hardly need to pay attention to GC and memory leaks, but with Go we do. Manual and black-box tests may pass, yet problems surface online under certain concurrent business scenarios, typically as instance CPU or memory utilization that keeps rising until the instance crashes.

To deal with memory leaks in Go, stress testing must be added to the test process on the one hand. On the other hand, more attention must be paid to whether instance resource utilization, interface average response time, and the stability indicators on the monitoring dashboard meet expectations, because some hidden bugs cannot be caught even by stress testing. This is why the observability of Go services needs to be improved, so that risks are detected in time.
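A common source of steadily rising memory in Go services is goroutines that never exit. The sketch below shows one such pattern and a usual fix; it is an illustration under assumptions, and the function names and durations are invented. If the result channel were unbuffered, every call that timed out would leave its worker goroutine blocked on the send forever. With a buffer of one, the late worker can still deliver its result and exit. `runtime.NumGoroutine` gives a cheap way to watch the count.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

var (
	shortTimeout = time.Millisecond
	longTimeout  = time.Second
)

// fastWork and slowWork stand in for downstream calls (hypothetical).
func fastWork() string { return "ok" }
func slowWork() string { time.Sleep(100 * time.Millisecond); return "late" }

// callWithTimeout runs work in a goroutine and waits at most d for it.
// The channel is buffered so that a worker finishing after the timeout can
// still send its result and exit instead of blocking forever.
func callWithTimeout(work func() string, d time.Duration) (string, bool) {
	result := make(chan string, 1)
	go func() { result <- work() }()
	select {
	case v := <-result:
		return v, true
	case <-time.After(d):
		return "", false
	}
}

func main() {
	before := runtime.NumGoroutine()
	for i := 0; i < 50; i++ {
		callWithTimeout(slowWork, shortTimeout) // every call times out
	}
	time.Sleep(200 * time.Millisecond) // give the late workers time to finish
	fmt.Println("goroutines before:", before, "after:", runtime.NumGoroutine())
}
```

Changing the channel to `make(chan string)` and rerunning shows the goroutine count growing with every timed-out call, which is the kind of signal a stress test or a resource dashboard is meant to surface.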

The panoramic matrix of Go quality assurance capabilities is as follows:

picture

Build offline quality assurance capabilities:

picture

Build online quality assurance capabilities:

picture

5.5 Traffic Migration

picture

As shown in the figure above, after the Go project goes online, the actual traffic is still served by the old project, and traffic migration begins. User traffic first reaches the access layer, where we route it to different application layer load balancers according to the access domain name and route. To stay compatible with old versions of the App, traffic migration must be completed while keeping the domain name and route unchanged. The access layer gateway distributes the traffic: once the rules that used to route to PHP point to the load balancer of the Go application layer, the traffic migration is complete. Note that a very core interface needs a grayscale release, which can be implemented with nginx+lua or with well-known open source gateways such as Apache APISIX and BFE, both of which support grayscale release.
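The idea behind a grayscale rule can be sketched in a few lines. This is not the configuration of nginx+lua, APISIX, or BFE, and the function names and the choice of a device ID as the routing key are assumptions for illustration. Each caller is hashed into one of 100 buckets, and buckets below the current percentage go to the Go cluster. Because the hash is deterministic, a given client keeps hitting the same cluster while the percentage is raised step by step, and lowering it to 0 acts as a rollback.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps an identifier to a stable bucket in the range [0, 100).
func bucket(id string) int {
	h := fnv.New32a()
	h.Write([]byte(id))
	return int(h.Sum32() % 100)
}

// pickUpstream returns "go" for the share of callers covered by goPercent
// and "php" for everyone else.
func pickUpstream(id string, goPercent int) string {
	if bucket(id) < goPercent {
		return "go"
	}
	return "php"
}

func main() {
	const users = 10000
	toGo := 0
	for i := 0; i < users; i++ {
		// Synthetic device IDs; a real gateway would use a header or cookie value.
		if pickUpstream(fmt.Sprintf("device-%d", i), 10) == "go" {
			toGo++
		}
	}
	fmt.Printf("%d of %d requests routed to Go at 10%%\n", toGo, users)
}
```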

5.6 Core Function Refactoring Practice

The most prominent results of this refactoring are the new homepage of the Baidu Wenku App and the optimization of its search results page.

(1) Customized new homepage

The personalized needs of Wenku users are relatively scattered. The hope was that by abstracting what users' vertical content needs have in common, extracting persistent features and frequently used content, and recommending content in a centralized way, we could better satisfy users' needs for content in their vertical and thereby improve user retention and willingness to renew. The layout and content display of the App homepage were refactored, adding personalized "My Database", "Teaching Progress", and "Recommended Channels" sections as well as customized display of the document list and folder list.

The technical solution for the new home page was brand new, and the motivation for the refactoring was "business-driven" rather than "quality-driven". The implementation does not depend on the old PHP project at the lower layers, so the interfaces were developed and launched directly in the Go project. Once online, Go naturally took over the homepage traffic that had been served by PHP.

picture

△The new homepage of the library

(2) Search result page optimization

The main refactoring target on the server side is the search interface. During development we confirmed with product managers whether unnecessary tab lists and built-in recommendation logic could be taken offline; cleaned up the business logic of the many AB experiments that had already ended, and removed the code branches for experiments that had been rolled out to all users; optimized the document sorting algorithm; aligned the currently required fields with product and front-end colleagues and removed redundant ones; and made good use of goroutines to parallelize logic that used to run serially.

Combined with front-end measures such as removing lazy-loading code, localizing images, caching interface data on the client, and building offline packages, the optimization of the search results page achieved good results. The load time of the search results page dropped significantly: the average delay fell by 41% on Android and by 43% on iOS, and the click-through rate of search results and the number of orders also increased to some extent.
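Turning serial logic into concurrent logic with goroutines can be sketched as follows. The three downstream calls, their names, and the 30 ms stub latency are assumptions for illustration and do not describe the real search interface. When the calls are independent, running them in separate goroutines and waiting with `sync.WaitGroup` brings the total latency close to that of the slowest call instead of the sum of all calls.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const stubLatency = 30 * time.Millisecond

// The three fetch functions stand in for independent downstream calls (hypothetical).
func fetchDocs(q string) string      { time.Sleep(stubLatency); return "docs for " + q }
func fetchRecommend(q string) string { time.Sleep(stubLatency); return "recommendations for " + q }
func fetchUserState() string         { time.Sleep(stubLatency); return "vip" }

type searchPage struct {
	Docs, Recommend, UserState string
}

// buildSerial calls the dependencies one after another.
func buildSerial(q string) searchPage {
	return searchPage{Docs: fetchDocs(q), Recommend: fetchRecommend(q), UserState: fetchUserState()}
}

// buildConcurrent runs the same calls in parallel. Each goroutine writes to
// its own field, and wg.Wait ensures all writes are done before the read.
func buildConcurrent(q string) searchPage {
	var page searchPage
	var wg sync.WaitGroup
	wg.Add(3)
	go func() { defer wg.Done(); page.Docs = fetchDocs(q) }()
	go func() { defer wg.Done(); page.Recommend = fetchRecommend(q) }()
	go func() { defer wg.Done(); page.UserState = fetchUserState() }()
	wg.Wait()
	return page
}

func main() {
	start := time.Now()
	buildSerial("go")
	serial := time.Since(start)

	start = time.Now()
	buildConcurrent("go")
	concurrent := time.Since(start)

	fmt.Println("serial:", serial.Round(time.Millisecond), "concurrent:", concurrent.Round(time.Millisecond))
}
```

This only helps when the calls do not depend on each other's results; calls with real data dependencies still have to run in sequence.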

picture

△【New and old search results page】Statistics of white screen duration during AB experiment

06 Goal Achieved

Since the Go migration started in August 2022, nearly all traffic on the App server has been migrated.

1. Technical iteration : thanks to the Go language's modern features, memory management, and rich ecosystem, code execution efficiency, security, and observability have improved; by sorting out business logic, eliminating redundancy, cleaning up bad smells in the code, and encapsulating common classes, we improved quality and efficiency as well as code readability and maintainability;

2. Performance improvement : on the one hand, goroutines and channels replace the synchronous, blocking style of code execution; on the other hand, the compiled binary executes much more efficiently than the nginx+php-cgi model. On average, interface latency dropped by about 30%, and TP90 latency dropped by 35%;

3. Cost reduction and efficiency increase : thanks to the high performance of the Go language, the load capacity of application layer instances improved. After the traffic migration, the online cluster of the Wenku App server used about 50% fewer instances.

07 Thinking and Summary

1. A mobile app is a C/S application. During the migration, old versions of the client must remain able to use the service;

2. When facing a long-term project, breaking the goal down is very important. Fast trial and error and immediate feedback are also part of Internet thinking;

3. In theory the migration is lossless, but the risks still need to be synchronized with the product managers, business indicators watched closely, and a contingency plan prepared so that rollback remains flexible;

4. Right after an interface is launched or an AB experiment finishes, traffic on the migrated interface increases. Make a habit of checking the availability dashboard frequently and handling abnormal HTTP statuses promptly, so that risks do not grow into failures.

08 Epilogue

Knowing but not doing is ignorance; doing but not knowing can lead to knowledge.

Looking back on the whole migration and refactoring project, the most interesting parts were the initial technology selection and the discussion of the concrete plan for traffic migration. At that time it was unclear how to migrate the huge, bloated PHP monolith. Our understanding of the project deepened step by step through practical exploration, and the insights gained shaped each next action, forming a positive cycle. I hope this article is helpful to your own work.

——END——



Origin my.oschina.net/u/4939618/blog/10086661