Best Technical Practices for End-to-End Full-Link Governance of the Autohome App

Background of end-to-end quality improvement

In the era of the mobile Internet and intelligent devices, mobile apps have become an important tool for work and daily life. To improve the user experience and strengthen its position as a premium car-consumption platform, Autohome carried out a comprehensive, dedicated program to improve the service quality of its App.

End-to-end full-link service quality issues

In production, the link from the App user to the server is long and complex. It passes through many network segments and nodes, including mobile devices, wireless networks, core networks, and service networks. In such a large-scale distributed environment, system quality and performance are critical to user experience and service stability. A low-quality request or exception at any of these points affects the performance and stability of the whole system and surfaces in the App as slow responses, freezes, and errors. During peak traffic periods in particular, it can easily lead to system crashes and service unavailability, which seriously harm the user experience. Managing low-quality requests across the end-to-end full link, and thereby improving system quality and performance, has therefore become an urgent need.

Systematic Solution for End-to-End Link Low-Quality Request Governance

Governing low-quality requests across the end-to-end full link is a global and complex problem. It requires a combination of technologies and management methods, full consideration of different business scenarios and user needs, and work on four fronts: scientific governance methods, effective governance practices, tooling, and systematic architecture upgrades. In addition, management measures such as closer teamwork, real-time monitoring, and timely early warning keep the system running safely and stably while the quality and performance of the full-link system improve.

Establish an end-to-end full-link low-quality request standard

Low-quality request governance first has to define how a low-quality request is identified along the path "user access from the App client to the server". Agreed indicators that quantify service problems at specific link nodes help departments and teams reach a shared understanding of what a low-quality request is, provide a basis and a common language for governance, and clarify the responsibilities and authority of the departments and people involved. This keeps governance work efficient and makes it easier to monitor and evaluate whether that work follows the established standards.
The "low-quality request" (LQR) standard used to judge back-end service quality is derived from the main App's overall sub-second opening rate (1000ms), which serves as the benchmark. That budget is split into client time (rendering <150ms) and back-end request time (network response time <600ms, service response time <250ms). The back-end governance target is therefore: a request is a low-quality request (LQR) if its time consumption is >850ms OR its status code is not 2XX/3XX.
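
The LQR rule above can be written as a small predicate. This is a minimal sketch: the millisecond budgets come from the text, while the constant and function names are illustrative assumptions, not names from Autohome's systems.

```python
# Budgets taken from the text; all names here are hypothetical.
RENDER_BUDGET_MS = 150     # client rendering
NETWORK_BUDGET_MS = 600    # network response time
SERVICE_BUDGET_MS = 250    # back-end service response time
BACKEND_BUDGET_MS = NETWORK_BUDGET_MS + SERVICE_BUDGET_MS  # 850 ms


def is_low_quality(duration_ms: float, status_code: int) -> bool:
    """Return True when a request counts as a low-quality request (LQR)."""
    too_slow = duration_ms > BACKEND_BUDGET_MS
    bad_status = not (200 <= status_code < 400)  # anything other than 2XX/3XX
    return too_slow or bad_status
```

Note that the rendering budget plus the back-end budget adds up to the 1000ms sub-second benchmark, which is why 850ms is the back-end threshold.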

Establish a low-quality request indicator dashboard for overall analysis

Logs are collected from the client network library, CDNs, load balancers, and source sites and joined into end-to-end full-link logs. On top of these, a low-quality request management dashboard is built with big data analysis, so that service quality can be tracked more accurately and low-quality request problems can be discovered, responded to, and handled promptly. The dashboard also displays, analyzes, and monitors the data of each link segment systematically, which helps teams locate problems quickly and improves governance efficiency. Every indicator on the dashboard is mapped to a department and a person in charge, so responsibility is clearly defined. The dashboard data is broken down by domain name, interface, network type, protocol, version, module, region, operator, load balancer, source site, department, and person in charge. Analysis of the dashboard indicators showed that the main App's low-quality request (LQR) rate was relatively high, at more than 7% overall, affecting millions of users every day. The dashboard makes operations data-driven and shares information across teams, providing strong data support for all teams to govern low-quality problems together.
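
The per-dimension LQR breakdown described above can be sketched as a simple aggregation. This is an illustration only: the record fields (`duration_ms`, `status`, and the dimension keys) are assumed names, and the real pipeline runs on big data infrastructure rather than in-memory Python.

```python
from collections import defaultdict


def lqr_rate_by(records, dimension):
    """Group request records by one dimension and return {value: LQR rate}."""
    totals = defaultdict(int)
    low_quality = defaultdict(int)
    for rec in records:
        key = rec[dimension]
        totals[key] += 1
        # LQR rule from the text: slower than 850 ms or status not 2XX/3XX.
        if rec["duration_ms"] > 850 or not (200 <= rec["status"] < 400):
            low_quality[key] += 1
    return {key: low_quality[key] / totals[key] for key in totals}
```

Calling `lqr_rate_by(records, "operator")` or `lqr_rate_by(records, "domain")` gives one slice each; attaching the owning department to each key is what turns such a slice into an accountable indicator.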

The overall analysis of the low-quality request dashboard identified several main problems across the App's end-to-end full link, along with general directions for governance:

On the client side: analysis of the network library logs showed that the client established and released a large number of TCP connections, which increased request latency and resource consumption and reduced access speed and efficiency. The likely cause is that the client lacked mechanisms for connection multiplexing and long-lived connection management. The direction for the fix is to improve long-lived connections so that fewer TCP connections are established and released, which reduces the burden on servers and improves request processing efficiency.

On the network link: CDN nodes become unstable or abnormal because of poor CDN network quality, overly concentrated nodes, or excessive load, which causes request delays and failures. Where the local network environment is poor, resolving domain names through the local DNS takes a long time, access is slow, and domain names can be maliciously hijacked. The HTTP protocol version in use is old: HTTP/2 is not enabled and a more efficient encryption protocol is not used. Without multiplexing, requests can only be sent serially, which limits concurrent processing and hurts transmission speed, stability, and reliability.

On the server side: the link between the load balancer (LB) and the source site is long and complex, and the interface logic of the application on the source site is complex, so the back end as a whole has a long average response time and a high error rate. Possible causes are an unsuitable load balancing algorithm, high load on the source site, or bottlenecks in source-site interfaces. Solutions include optimizing the server architecture, adopting more advanced load balancing algorithms, optimizing interface logic, and strengthening monitoring and diagnosis to improve server performance and stability.

Back-end system construction: launch of the "Smart Network" end-to-end link selection management platform

The existing setup had several problems: many business domain names, a high number of connections, lost access performance, and performance loss when switching CDNs. "Smart Network" addresses them in three ways. First, the client and the back-end load balancers were reworked around public domain name convergence, which reduces the number of client connections. Second, the quality of the nodes provided by multiple CDN vendors is analyzed and evaluated, and traffic is scheduled intelligently to the best one to improve performance and reliability. Third, the end-to-end link as a whole is given an optimal configuration. "Smart Network" is thus a low-quality request management system that comprehensively improves performance and stability.

The client was upgraded to support a unified domain name convergence architecture. The core idea is to converge hundreds of domain names into a few, which reduces DNS queries, shortens TCP connection establishment time, improves network bandwidth utilization, and raises the connection reuse rate. These gains improve link performance in the following ways:

Reduce DNS queries: with unified convergence of client domain names, different services share the same domain name. This reduces the number of DNS queries and shortens domain name resolution time, so pages load faster.

Reduce TCP connection establishment time: TCP is connection-oriented, so every connection requires a handshake to open and a teardown to close, which costs time and resources. With unified convergence of client domain names, multiple requests can share the same TCP connection, avoiding repeated setup and teardown and reducing the time and overhead of connection establishment.

Improve network bandwidth utilization: with unified convergence of client domain names, multiple requests can be carried over the same TCP connection, which reduces the network delay between requests and responses and improves bandwidth utilization. This matters most for high-traffic enterprise systems, where it can effectively reduce bandwidth costs.

Domain Name Unified Convergence Architecture

Both the server side and the client side need configuration and development work. The solution rests on three key technologies (a unified domain name server, a reverse proxy, and connection pooling), which together achieve unified convergence of client domain names and improve system quality and performance:

Establish a unified domain name server: the IP addresses and corresponding domain names of all services that need to be accessed are configured on this server. When a client sends a request, the unified domain name server forwards it to the corresponding service.

Use a reverse proxy: a reverse proxy configured on the server side maps multiple different services to the same domain name, so clients need only one domain name to access multiple services.

Use connection pooling: a certain number of connections are created in advance when the program starts and kept in a pool. When the client needs to access an HTTP service, it takes an available connection from the pool. This avoids frequently establishing and closing connections and raises the connection reuse rate.
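
The connection pooling idea can be sketched generically as follows. This is an assumption-laden illustration, not the App's network library: `ConnectionPool` and the `factory` callable are hypothetical names, and a real pool would also handle health checks and idle timeouts.

```python
import queue


class ConnectionPool:
    """Fixed-size pool that hands out pre-created connections for reuse."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # pre-create connections at startup

    def acquire(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        # Return the connection so the next request can reuse it.
        self._idle.put(conn)
```

Because every request borrows from the same fixed set, the number of connections ever created stays at `size` no matter how many requests are served.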

Concretely, the number of convergence domain names matches the number of CDNs: Baidu a.xxxxxx.cn, Jinshan b.xxxxxx.cn, Ping An c.xxxxxx.cn, and Huawei d.xxxxxx.cn. The convergence rule is to add the convergence domain name in front of the requested URL: the original domain name becomes the first-level path under the convergence domain name, and the request sub-path and parameters remain unchanged.

For example, xx.app.autohome.com.cn/v1/args?key=xxx is replaced with a.xxxxxx.cn/xx.app.autohome.com.cn/v1/args?key=xxx.
Performing the convergence domain name replacement on the client gives better control over quality and at the same time effectively achieves the final convergence goal.
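
The convergence rule (original domain becomes the first path segment, sub-path and parameters unchanged) can be sketched with the standard library. The function name is hypothetical, and the `https` scheme in the usage below is an assumption, since the example in the text omits the scheme.

```python
from urllib.parse import urlsplit, urlunsplit


def to_convergence_url(url: str, convergence_host: str) -> str:
    """Move the original host into the first path segment of the convergence domain."""
    parts = urlsplit(url)
    new_path = "/" + parts.netloc + parts.path  # sub-path is kept as-is
    return urlunsplit(
        (parts.scheme, convergence_host, new_path, parts.query, parts.fragment)
    )
```

For instance, `to_convergence_url("https://xx.app.autohome.com.cn/v1/args?key=xxx", "a.xxxxxx.cn")` keeps `/v1/args` and `key=xxx` intact and only changes the host and the leading path segment.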

Architecture call process:

1) When the App cold-starts, it calls the CDN selection interface to obtain the preferred convergence domain name. If that call fails, the client requests the real business domain names in the non-convergent way. The client caches the result and uses the cache first at startup; once the interface returns, the preferred domain name takes effect immediately.
2) The client queries D+ to resolve the preferred convergence domain name and obtains the IP of the CDN node.
3) The client connects to the CDN node IP over HTTP/2 or HTTP/3 (whether each is enabled depends on the client switch and on CDN support) and keeps the connection alive.
4) After App startup completes, the client immediately establishes connections for the remaining convergence domain names, which correspond to the other CDN vendors, and keeps refreshing these connections every 45 seconds.
5) When switching, the client reuses these connections so that it can connect quickly.
6) Each URL request initiated by the client is converted at the bottom layer: the convergence domain name is added in front to form a convergence URL, and the request is sent.
7) Proxy-NG receives the convergence URL, restores it, and forwards the request to the final source site, completing the request.
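
The restore step performed on the proxy side is the inverse of the client-side rewrite. The sketch below shows only that inverse mapping; it is not Proxy-NG's actual implementation, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def restore_origin_url(convergence_url: str) -> str:
    """Recover the original business URL from a convergence URL."""
    parts = urlsplit(convergence_url)
    # The first path segment carries the original business domain.
    origin_host, _, rest = parts.path.lstrip("/").partition("/")
    if not origin_host:
        raise ValueError("convergence URL has no origin host segment")
    return urlunsplit(
        (parts.scheme, origin_host, "/" + rest, parts.query, parts.fragment)
    )
```

A production proxy would additionally validate the recovered host against an allowlist of business domains before forwarding, so that the convergence domain cannot be used as an open proxy.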

Intelligent analysis of CDN resources

This function takes into account the user's geographic location, device type, and network environment. It determines the best CDN vendor and edge node by comparing network quality indicators, such as delay and bandwidth, of the CDN vendors in a specific area, and the result is used to schedule users onto the best edge nodes. The implementation works as follows. When the App starts, it requests a set of tracking domain names, each of which resolves to a different CDN vendor, and reports the results; the request information for these domain names is written to the network library log along with other access logs. The system collects all of the tracking data, derives the operator and region from the client IP, and extracts time consumption and error rate information. Flink processes the network library log data grouped by time, region, operator, and CDN vendor (source site) and writes the per-province and per-city data into the vm cluster. Every ten minutes, an algorithm computes the average response time and error rate of each region over the preceding half hour and determines the best CDN vendor for that region from those two values.
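
The region-level selection can be sketched as a windowed aggregation. The source does not describe the actual algorithm, so the ranking used here (lowest error rate first, then lowest average response time) is an assumption, as are the field names; the real computation runs on Flink output rather than in-memory lists.

```python
from collections import defaultdict


def best_cdn_per_region(samples, now, window_s=1800):
    """Pick the best CDN per (region, operator) from the last half hour of samples."""
    stats = defaultdict(lambda: [0, 0.0, 0])  # request count, total ms, errors
    for s in samples:
        if now - window_s <= s["ts"] <= now:
            entry = stats[(s["region"], s["operator"], s["cdn"])]
            entry[0] += 1
            entry[1] += s["duration_ms"]
            entry[2] += 1 if s["is_error"] else 0

    best = {}
    for (region, operator, cdn), (count, total_ms, errors) in stats.items():
        # Assumed ranking: error rate first, average response time second.
        score = (errors / count, total_ms / count)
        key = (region, operator)
        if key not in best or score < best[key][1]:
            best[key] = (cdn, score)
    return {key: cdn for key, (cdn, _) in best.items()}
```

Running this every ten minutes with `now` set to the current timestamp reproduces the cadence described above: a half-hour window re-evaluated on a ten-minute schedule.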

CDN selection/disaster recovery scheduling

The CDN selection function is an efficient, intelligent mechanism for choosing CDN nodes. Based on the user's geographic location and the intelligently analyzed quality of CDN nodes, it selects the optimal CDN node to improve network performance and availability. In addition, when the CDN used by a domain name has a problem, the client can quickly switch its connections to another CDN vendor so that business is not interrupted, which improves service access performance and reliability. The core server-side process is:

1) The server estimates the user's approximate geographic location from the user's IP.
2) From the domain name resolution records in the user's network logs, it matches each IP to the corresponding CDN node.
3) From the user's network logs, it calculates the request error rate for the current domain name/IP node.
4) From the error rates by region and CDN, it determines the best CDN vendor for the current region.
5) It screens out IP nodes with abnormal error rates.
6) It switches the CDN traffic in the current region and issues the optimal CDN list and the abnormal IP list to the client.
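
Steps 4 to 6 above can be sketched as building the payload that is issued to clients. The threshold value, the function name, and the payload shape are all hypothetical; the source does not state how an "abnormal" error rate is defined.

```python
def build_schedule(ip_error_rates, cdn_error_rates, max_error_rate=0.05):
    """Return the CDN order and abnormal IP list to issue to clients in one region.

    max_error_rate is an assumed cutoff for flagging an IP node as abnormal.
    """
    abnormal_ips = sorted(
        ip for ip, rate in ip_error_rates.items() if rate > max_error_rate
    )
    # Lowest error rate first, so clients try the best CDN before the others.
    preferred_cdns = sorted(cdn_error_rates, key=cdn_error_rates.get)
    return {"cdns": preferred_cdns, "abnormal_ips": abnormal_ips}
```

Issuing an ordered list rather than a single winner is what lets the client fail over to the next CDN without another round trip to the server.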

End-to-end full-link best practice strategy

The strategy combines network link tuning, CDN configuration, and several other technical measures to improve the efficiency and reliability of data transmission.

1) Resolve domain names to high-quality CDN vendors: higher-quality links improve the request success rate and reliability.
2) Enable HTTPDNS for domain names on the client: this speeds up DNS resolution, avoids unnecessary resolution delay, effectively prevents DNS hijacking, and improves the speed, success rate, and reliability of network requests.
3) Set the TTL of client domain names to 60s: DNS caches refresh sooner, so new IP addresses are picked up faster and delays caused by stale DNS records are reduced.
4) Enable HTTP/2 on the CDN: a binary protocol replaces the text protocol; header compression, flow control, and prioritization improve performance and efficiency; and multiplexing lets multiple requests and responses be processed concurrently over a single connection, which improves transmission efficiency.
5) Enable GZIP and BR compression on the CDN: this significantly reduces the size of transferred data, improves transmission efficiency, and shortens user waiting time.
6) Enable TLSv1.3 on the CDN: the protocol is more secure and reliable and helps prevent network attacks and data leakage. Compared with the TLSv1.2 handshake, it saves 1 RTT.
7) Enable HTTP back-to-source on the CDN: resources are requested from the source site and caches are updated quickly, without the time spent establishing an SSL connection between the CDN and the source site, which improves service response speed.
8) Keep long-lived connections from the source-site load balancer to the application: this avoids the overhead of frequently opening and closing connections, reduces transmission delay, and improves transmission efficiency.
9) Streamline source-site links: fewer intermediate load-balancing hops mean less unnecessary network transmission, more efficient and stable data transfer, and smooth, fast delivery of information.
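
The effect of the compression item in the list above is easy to check locally. The sketch below uses Python's standard `gzip` module on a made-up, repetitive JSON payload; the payload is hypothetical, and Brotli is not shown because it is not part of the standard library.

```python
import gzip
import json

# Hypothetical, repetitive API payload of the kind that compresses well.
payload = json.dumps(
    [{"id": i, "series": "example", "price": 100000} for i in range(200)]
).encode("utf-8")

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)
print(f"raw={len(payload)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.2f}")

# Compression is lossless: the client recovers the exact payload.
assert gzip.decompress(compressed) == payload
```

Actual savings depend on the content; already-compressed media such as images and video gain little, which is why this setting mainly benefits text responses like JSON, HTML, and JavaScript.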

End-to-end full link quality management results

By implementing App domain name convergence and the intelligent selection system, the project achieved seamless failover between CDN vendors when one has faulty service, intelligent selection of CDN nodes, selection of the best user link, and unified domain name convergence on the client side. More than 50% of server-side traffic was migrated transparently, with no impact perceived by users. The 4 main convergence domain names greatly improve connection reuse and reduce network overhead. Through collaborative governance by multiple technical teams, the main App's low-quality request rate dropped sharply, from 7.x% to 2.x%, improving the end-to-end experience of millions of users.

Future planning for end-to-end full-link quality governance

The plan is to keep improving the end-to-end quality management system and its tooling, and to combine them with more intelligent techniques such as machine learning, so that key indicators in every segment of the system are monitored automatically, problems are found and eliminated quickly, and the reliability and robustness of the entire system improve.
Building on the App's end-to-end full-link governance of low-quality interface requests, the next step is to start quality governance for all resources: establish a system to identify and manage low-quality pictures and videos, and effectively compress, accelerate, and protect multimedia resources to improve access speed and security for users. Continued governance of low-quality requests will speed up access in the main App, reduce the failure rate, strengthen data security and stability, and improve the interaction experience, thereby raising user satisfaction and market competitiveness and further increasing user stickiness and conversion rate.

Origin blog.csdn.net/autohometech/article/details/131912607