Zhang Dong, Chief Scientist of Inspur Yunhai: A system design method for one cloud, multiple cores

In recent years, enormous market demand has accelerated the development of cloud computing hardware and software in China. An innovation chain and industrial chain for cloud computing, spanning chips, complete systems, cloud operating systems, middleware and application software, has initially taken shape. As industry's move toward "cloud adoption, data utilization and intelligence enablement" continues to accelerate and deepen, application scenarios are diversifying. More and more data centers are choosing to build diversified computing power, which brings new challenges to converged pooled management and flexible elastic scheduling.

The central processing unit (CPU) is the most widely used computing device, and heterogeneity arising from the combination of multiple vendors and different architectures is especially prominent. x86 processors from Intel and AMD still dominate data centers, but their share is gradually shrinking; the ARM architecture has strong momentum thanks to its many cores and low power consumption; the open-source RISC-V architecture is also emerging. Meanwhile, against the backdrop of global supply-chain restructuring, the research, development and production of core components in China have entered a stage of vigorous development. However, owing to a late start, divergent technical routes and uneven levels of maturity, multiple heterogeneous processors will coexist and evolve over the long term.

Key scientific problems of one cloud, multiple cores

As a computing power supply model that pursues cost-effectiveness, cloud computing is shifting from a single architecture to upgrade, replacement and expansion with multiple heterogeneous processors. Given the differences in functionality, performance and reliability among heterogeneous processors, meeting the requirements of efficiency and stability, enabling low-cost or cost-free switching of applications across processors, avoiding supply risks, and ensuring long-term stable operation of key business make "one cloud, multiple cores" an inevitable trend in the development of cloud computing.

The Internet industry's one-cloud multi-core work for public clouds started early, drawing on technical and financial reserves to develop cost-effective processors; for example, Amazon launched the ARM-based Graviton processor, breaking dependence on the x86 architecture. To address the contradiction between the diversity of southbound resources and the complexity of northbound applications in industry private clouds, domestic finance, telecommunications, energy and other industries have also begun to research and build one cloud, multiple cores. In the early stage, a cloud management layer was used to manage multiple heterogeneous resource pools. Although this forms a unified entrance, resource supply efficiency is low because the pools are fragmented and applications cannot be orchestrated across architectures.

InCloud OS, Apsara Stack, EasyStack and others achieve unified scheduling and interconnection of heterogeneous resources through a single resource pool. At the current stage, however, they mainly solve the "multi-core" co-location problem; there is still a large gap to application-centered cross-architecture deployment, operation and low-cost switching. To achieve stable operation, smooth switching and elastic scaling of services while multiple cores coexist, the following scientific and technical problems urgently need to be solved.

1. Cross-architecture application portability and runtime-environment equivalence. When applications run on nodes with different processor architectures in a multi-core system, the program itself must first be portable across architectures. Furthermore, when layered, modular, complex applications are dynamically migrated, remotely invoked or horizontally scaled across heterogeneous nodes, how to guarantee cross-architecture equivalent executability of the runtime environment (operating system, runtime, dependent libraries, etc.) becomes a challenge (see Figure 1).

[Figure 1]

2. Quantitative analysis of heterogeneous computing power and load-aware scheduling. Performance differences among heterogeneous CPUs range from 2x to 10x, and for nodes equipped with additional heterogeneous acceleration units the gap can reach orders of magnitude. When applications migrate, switch or scale between heterogeneous nodes, user experience must remain consistent and the business's service level agreement (SLA) must be honored. How to model, evaluate and quantitatively analyze the equivalence relationships among heterogeneous computing power, so as to achieve load-aware balanced scheduling and adaptive elastic scaling, has become a key scientific problem.

3. State consistency of distributed applications on non-peer architectures. In contrast to the equivalence of traditional distributed nodes, the non-equivalence of the heterogeneous nodes across which one-cloud multi-core applications are distributed cannot be ignored. For distributed cloud-native applications on non-peer nodes, achieving efficient consensus negotiation and data synchronization for stateful tasks, as well as non-intrusive dynamic traffic control and smooth splitting for stateless tasks, is a key technical difficulty in cross-architecture cloud-native application orchestration.

One-cloud multi-core system design and key technologies

ACM Turing Award winner Niklaus Wirth proposed the famous formula "Algorithms + Data Structures = Programs", revealing the temporal and spatial nature of programs. As a software-defined extension, a one-cloud multi-core system comprises not only the two spatio-temporal elements of the data plane, instruction logic and data state, but also the control plane's management of multiple heterogeneous resources. A one-cloud multi-core system can therefore be abstracted as "resource management + running programs + data state".

Resource management abstracts hardware resources such as computing, storage, network and security through software definition, and provides resource encapsulation and runtime environments for applications at the granularity of virtual machines, containers, bare metal, etc. Running programs are decoupled into the resource layer, the platform layer and the application layer, for example applications that carry user services and resource management programs. Data state refers to the transient in-memory data, persistent database data and traffic state produced as programs run.

Following this definition, a one-cloud multi-core system should be designed from three aspects: program runnability, resource manageability and state migratability.

1. Program runnability. For programs running across architectures in a one-cloud multi-core system, the primary design goal is runnability, that is, the ability to be ported to and run in environments with different processor architectures. Technical routes include cross-platform languages, cross-platform compilation and instruction translation (see Table 1).

[Table 1]

Cross-platform languages, represented by Java and Python, allow the architecture-independent parts of a program to run across architectures. Some architecture-related problems remain: (1) runtime dependence, for example, running Java programs in a multi-core system requires Java Virtual Machine (JVM) runtimes for each architecture; (2) native library dependence, for example, libraries accessed through the Java Native Interface (JNI) must be ported across platforms.

Cross-platform compilation, i.e. cross-compilation, uses the compilation toolchain of one processor architecture to generate executable programs for other architectures. Cross-compilation achieves cross-platform binary generation from architecture-independent source code, but each executable must still be matched to its target processor architecture.

Binary translation, or instruction set translation, is a research hotspot for cross-architecture application migration. Implementations include software-level and chip-level binary translation, both limited by the translation system. Software-level binary translation requires modifying the application's runtime environment, increasing its complexity, while chip-level binary translation suffers serious performance loss. For example, current translators reach only 60% to 70% of directly compiled performance for pure computing programs, dropping to 30% to 40% when system calls, locks and similar operations are involved; instruction set compatibility issues also remain, such as the Advanced Vector Extensions (AVX) instructions.

Runtime equivalent encapsulation

Cross-platform languages solve the application's cross-architecture problem but require a cross-architecture runtime; cross-compilation solves cross-architecture compilation but leaves runtime dynamic library dependencies. Therefore, when a program runs in a multi-core system, not only its own runnability but also, for modern complex applications, the runtime it depends on must be considered holistically. A feasible route is to use standardized containers to encapsulate the application together with its runtime dependencies as the basic unit of resource encapsulation for cross-architecture deployment and switching.

That is, from the same source code, different container images are built for different architectures. If the program is written in a cross-platform language, the script or intermediate code is packaged into a container together with the runtime; if it is not, binaries for each architecture are produced by cross-compilation and then packaged into containers together with their dependent libraries. This process can be automated by a single build pipeline that pushes the images to the image registry.
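As an illustration of such a pipeline step (a sketch, not InCloud OS's actual tooling), the Go program below drives the Go toolchain's built-in cross-compilation to emit one binary per target architecture; the ./cmd/app path and output naming are assumptions:

```go
// Hypothetical pipeline step: cross-compile one source tree per architecture.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Target platforms for the heterogeneous pool.
	targets := [][2]string{{"linux", "amd64"}, {"linux", "arm64"}}
	for _, t := range targets {
		out := fmt.Sprintf("bin/app-%s-%s", t[0], t[1])
		// The Go toolchain cross-compiles by setting GOOS/GOARCH; CGO is
		// disabled so no architecture-specific C toolchain is required.
		cmd := exec.Command("go", "build", "-o", out, "./cmd/app") // path assumed
		cmd.Env = append(os.Environ(), "GOOS="+t[0], "GOARCH="+t[1], "CGO_ENABLED=0")
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Fprintln(os.Stderr, "build failed:", err)
			os.Exit(1)
		}
		// A later stage would package each binary into an image tagged with
		// its architecture and push it to the image registry.
	}
}
```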

In summary, the design of program runnability for one cloud, multiple cores covers three aspects: first, cross-architecture compilation and execution of applications; second, standardized containerized packaging; and finally, lightweight deployment through cloud resource orchestration and management (see Figure 2).

[Figure 2]

2. Resource manageability. Resource manageability includes architecture awareness and quantitative analysis of computing power, as well as system-oriented balanced resource scheduling and business-oriented elastic scaling.

Architecture awareness technology. Architecture awareness is the key to node scheduling and adaptive display of interface functions in one cloud, multiple cores; it underpins program runnability and life-cycle management of resource encapsulation. It can be realized with collectors, schedulers and interceptors (see Figure 3). (1) The collector gathers and reports each node's CPU architecture, hardware characteristics and other information, and builds a host list annotated with architectural features. (2) The scheduler selects matching host nodes for resource encapsulations of various granularities. It uses a cascade filter mechanism that loads multiple independent filters and matches creation requests against hosts in sequence. In the one-cloud multi-core scenario, a cascaded architecture-awareness filter identifies the image architecture tag in the creation request and filters host nodes by matching CPU architecture features. (3) The interceptor maintains a dynamically extensible "architecture-function" mapping matrix, parses the actions and architectural features of resource management requests, intercepts requests and returns results for display, thereby automatically identifying the differentiated functions of different architectures, shielding underlying implementation differences, and presenting a unified resource management view.

[Figure 3]
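A minimal Go sketch of the cascade-filter mechanism, with hypothetical Host and CreateRequest types: the architecture-awareness stage keeps only hosts whose collector-reported CPU architecture matches the image's architecture tag, and further filters could be chained behind it:

```go
// Sketch of a cascaded, architecture-aware scheduling filter.
package main

import "fmt"

type Host struct {
	Name string
	Arch string // CPU architecture reported by the collector, e.g. "amd64", "arm64"
}

type CreateRequest struct {
	ImageArch string // architecture tag parsed from the container/VM image
}

// Filter is one stage of the cascade; stages run in sequence over the host list.
type Filter func(req CreateRequest, hosts []Host) []Host

// ArchFilter keeps only hosts whose CPU architecture matches the image tag.
func ArchFilter(req CreateRequest, hosts []Host) []Host {
	var out []Host
	for _, h := range hosts {
		if h.Arch == req.ImageArch {
			out = append(out, h)
		}
	}
	return out
}

func runCascade(req CreateRequest, hosts []Host, filters ...Filter) []Host {
	for _, f := range filters {
		hosts = f(req, hosts) // each stage narrows the candidate set
	}
	return hosts
}

func main() {
	hosts := []Host{{"node-1", "amd64"}, {"node-2", "arm64"}, {"node-3", "arm64"}}
	req := CreateRequest{ImageArch: "arm64"}
	fmt.Println(runCascade(req, hosts, ArchFilter)) // [{node-2 arm64} {node-3 arm64}]
}
```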

Computing power quantization technology

Because processors of different architectures have different computing capabilities, the same application with the same resource encapsulation (same number of CPU cores, same memory, etc.) will perform differently in a heterogeneous environment. By application scenario, computing power divides into general-purpose CPU computing power and heterogeneous XPU computing power. The main problem currently facing a one-cloud multi-core system is CPU heterogeneity: ARM and x86 processors from multiple vendors differ in instruction set, core count, fabrication process and more, and therefore in performance. This difference can be characterized by computing power equivalence relationships, which divide by level into specification computing power, effective computing power and business computing power (see Table 2).

Specification computing power is the most general; effective computing power is more specific to particular load types; business computing power is closest to real application scenarios. However, given the diversity of loads and applications, calculating effective and business computing power requires joint work with the upstream and downstream ecosystem.

[Table 2]

Balanced scheduling technology

At the resource level, when selecting nodes for resource encapsulation, load is scheduled with a balancing strategy based on node computing capabilities. This is a constrained optimization problem whose goal is to maximize resource utilization. The balanced scheduling algorithm acts after the cascade filters, selecting from the filtered host nodes the one with the smallest load as the final target. For a one-cloud multi-core system, the key step is quantitative analysis of node computing power: evaluate the specification coefficients of each resource type from the specification computing power, then combine numerical methods such as normalization and dominant resource fairness to compute each node's available computing power. The normalization-based algorithm is as follows:

$$\mathrm{Score}_j=\sum_{i=1}^{r}\mathrm{WeightedScore}_{ji}\qquad(1)$$

$$\mathrm{WeightedScore}_{ji}=\mathrm{ResourceNormalized}_{ji}\times\mathrm{WeighterMultiplier}_i\times\mathrm{coefficient}_{ji}\qquad(2)$$

$$\mathrm{ResourceNormalized}_{ji}=\frac{R_{ji}-\min_{k}R_{ki}}{\max_{k}R_{ki}-\min_{k}R_{ki}}\qquad(3)$$

where $R_{ji}$ denotes the allocatable amount of resource $i$ on node $j$.

The score of node j, Score_j, is the sum of the weighted scores of the r resource types (CPU, memory, disk, etc.), as in Equation (1). The weighted score of each resource type is given by Equation (2), where ResourceNormalized_ji is the min-max forward normalization of the allocatable amount of resource i on node j, as in Equation (3); WeighterMultiplier_i is the weight of resource i, which can be adjusted for CPU-, memory- or IO-intensive loads to reflect each resource's importance; and coefficient_ji is the specification computing power coefficient of each resource. For example, if the quantified specification computing power of an ARM-type and an x86-type CPU relates as 1:2, their specification coefficients are 1 and 2 respectively; with the same number of allocatable CPU cores, the x86 node is scheduled with higher priority, achieving balanced scheduling based on computing power quantization in the one-cloud multi-core scenario.
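The scoring algorithm can be made concrete with a short Go sketch of Equations (1)-(3) under assumed inputs; note how the specification coefficient lets an x86 node outrank an ARM node with the same number of free cores:

```go
// A runnable sketch of Equations (1)-(3) with assumed inputs.
package main

import "fmt"

// scoreNodes computes Score_j for each candidate node.
// alloc[j][i] is the allocatable amount of resource i on node j (R_ji);
// weight[i] is WeighterMultiplier_i; coeff[j][i] is the specification
// computing-power coefficient (e.g. ARM CPU 1, x86 CPU 2).
func scoreNodes(alloc [][]float64, weight []float64, coeff [][]float64) []float64 {
	scores := make([]float64, len(alloc))
	for i := range weight {
		// Min-max bounds of resource i across candidate nodes (Equation 3).
		min, max := alloc[0][i], alloc[0][i]
		for _, row := range alloc {
			if row[i] < min {
				min = row[i]
			}
			if row[i] > max {
				max = row[i]
			}
		}
		for j := range alloc {
			norm := 0.0
			if max > min {
				norm = (alloc[j][i] - min) / (max - min)
			}
			// Equation (2), accumulated into Equation (1).
			scores[j] += norm * weight[i] * coeff[j][i]
		}
	}
	return scores
}

func main() {
	// Three nodes, two resources (free vCPU cores, free memory GiB):
	// node 0 is ARM with 32 cores, node 1 is x86 with 32 cores, node 2 is
	// x86 with 8 cores.
	alloc := [][]float64{{32, 128}, {32, 128}, {8, 128}}
	weight := []float64{0.6, 0.4} // favor CPU for a CPU-intensive load
	coeff := [][]float64{{1, 1}, {2, 1}, {2, 1}}
	// Highest score = most available equivalent computing power; with equal
	// free cores the x86 node outranks the ARM node via its coefficient.
	fmt.Println(scoreNodes(alloc, weight, coeff)) // [0.6 1.2 0]
}
```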

Elastic scaling technology

To support elastic scaling across business peaks and troughs, precise planning, rapid scheduling and computing power equivalence of resource encapsulation are needed so that application services scale elastically, quickly and accurately (see Figure 4). (1) Resource planning: based on the probability distribution of application load within specific periods, a load trend model is built from historical time series to describe the load portrait and capacity portrait relating application load, service quality and resources; scaling requirements for resource encapsulation are planned through load trend prediction and anomaly feedback. (2) Fast scheduling: building on architecture awareness and balanced scheduling, expansion requests are quickly scheduled to the best node and application services are launched to ensure timely response. (3) When elastic scaling triggers cross-architecture switching of resource encapsulation, the computing power of different architectures is characterized by quantization, and the equivalence relationship of resource encapsulations is calculated from effective and business computing power, so that service quality scales linearly as resources grow or shrink.

[Figure 4]
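To make the resource-planning step concrete, the following Go sketch stands in for the load-trend model with a simple least-squares line over recent load samples and sizes the replica count from an assumed per-replica capacity (all numbers illustrative):

```go
// Illustrative planning step: linear trend prediction plus replica sizing.
package main

import "fmt"

// predictNext fits a least-squares line y = a*x + b to the samples and
// extrapolates one interval ahead (a stand-in for the load-trend model).
func predictNext(samples []float64) float64 {
	n := float64(len(samples))
	var sx, sy, sxx, sxy float64
	for i, y := range samples {
		x := float64(i)
		sx += x
		sy += y
		sxx += x * x
		sxy += x * y
	}
	a := (n*sxy - sx*sy) / (n*sxx - sx*sx)
	b := (sy - a*sx) / n
	return a*n + b
}

func main() {
	qps := []float64{800, 900, 1050, 1150, 1300} // recent load samples
	const perReplicaQPS = 250.0                  // assumed capacity-portrait value
	predicted := predictNext(qps)
	replicas := int(predicted/perReplicaQPS) + 1 // round up for headroom
	fmt.Printf("predicted load %.0f qps -> plan %d replicas\n", predicted, replicas)
}
```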

3. State migratability. At the resource layer, application state migration moves persistent data, transient memory state, peripheral configuration and network traffic to the target node as a whole, involving all data states within the resource encapsulation. Besides the application itself, it also involves the operating system, middleware and so on, which makes migration difficult. To address this, one can follow the idea of decoupling the resource, platform and application layers and adopt state synchronization and traffic splitting methods based on cloud-native microservice governance.

Resource encapsulation migration

Online live migration of virtual machines is relatively mature. It usually uses a pre-copy algorithm that iteratively transfers the source VM's incremental memory state to the destination host; optimizations such as post-copy and hybrid copy, as well as hardware compression acceleration, speed up memory copy convergence, reduce downtime and improve migration efficiency. Nevertheless, VM migration still faces limitations: generation gaps between CPUs from the same vendor, compatibility across vendors within one architecture, and the impossibility of live migration across architectures. Research on online container migration started later; it is essentially process group migration. Current work is mainly based on Checkpoint/Restore In Userspace (CRIU) to migrate the container's runtime state, with a series of derived optimizations to shorten migration time and unavailability.
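The pre-copy idea itself is compact; the toy Go model below (where a halving dirty set stands in for copy bandwidth outpacing the write rate) shows why iteration shrinks the final stop-and-copy window:

```go
// Toy model of the pre-copy loop used in live VM migration.
package main

import "fmt"

func main() {
	dirtyPages := 1 << 20  // pages dirty when migration starts
	const threshold = 1024 // small enough for a short stop-and-copy window
	rounds := 0
	// Iterative pre-copy: each round transfers the current dirty set while
	// the VM keeps running; new writes re-dirty pages, but here copying
	// outpaces the write rate, halving the set per round.
	for dirtyPages > threshold && rounds < 30 {
		dirtyPages /= 2
		rounds++
	}
	// Stop-and-copy: pause the VM, send the residual pages, resume on the
	// destination host. Downtime is proportional to the residual set.
	fmt.Printf("converged after %d rounds; stop-and-copy %d pages\n", rounds, dirtyPages)
}
```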

In addition, adaptive container online migration dynamically adjusts the compression algorithm's acceleration factor to match CPU and network bandwidth, reducing container snapshot transmission time. Although whole-unit migration at VM and container granularity has seen research and application, problems remain: large migration data volumes, long downtime and total migration time, and difficulty achieving smooth cross-architecture application switching. With the development of cloud-native technology, combining service governance methods has become a feasible route; key technologies include data synchronization for stateful services and traffic switching for stateless services.

Data state synchronization

State synchronization among multiple replicas relies on distributed consistency algorithms. ACM Turing Award winner Leslie Lamport proposed Paxos, a message-passing, highly fault-tolerant consensus algorithm; ZooKeeper's ZAB, MySQL's wsrep, and the Raft protocol used by etcd and Redis all achieve data state consistency based on its core ideas. On this basis, data state synchronization at the one-cloud multi-core platform layer must further consider the asymmetry of nodes. The Raft protocol is used below as an example.

Leader election: the leader periodically sends heartbeats to all followers to assert its status. When a follower receives no heartbeat within the timeout, it becomes a candidate and stands for election. In a one-cloud multi-core system, nodes' differing processing capabilities and network conditions make the impact of the timeout uneven. An adaptive method based on maximum likelihood estimation can prevent nodes with long heartbeat delays and weak processing power from repeatedly triggering elections, while ensuring that strong nodes can initiate elections quickly. For the voting strategy, node priorities or a narrowed random timeout range can make it easier for strong nodes to win a majority of votes.
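One possible form of this adaptation is sketched below in Go: heartbeat delays are modeled as Gaussian, the sample mean and standard deviation (the maximum likelihood estimates under that model) set a per-node base timeout, and a higher node priority narrows the random window; all constants are illustrative:

```go
// Sketch: adaptive election timeout from heartbeat-delay statistics.
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// electionTimeout returns a follower's timeout in milliseconds. delays are
// observed heartbeat delays; priority is in (0,1], larger for stronger nodes.
func electionTimeout(delays []float64, priority float64) float64 {
	// Sample mean and standard deviation: the maximum likelihood estimates
	// under a Gaussian model of heartbeat delay.
	var sum float64
	for _, d := range delays {
		sum += d
	}
	mean := sum / float64(len(delays))
	var ss float64
	for _, d := range delays {
		ss += (d - mean) * (d - mean)
	}
	sigma := math.Sqrt(ss / float64(len(delays)))

	base := mean + 3*sigma     // jittery, slow links get a larger base timeout
	window := 150.0 / priority // strong nodes draw from a narrower random range
	return base + rand.Float64()*window
}

func main() {
	weak := []float64{90, 110, 130, 95, 120} // high delay, high jitter (ms)
	strong := []float64{10, 12, 9, 11, 10}
	fmt.Printf("weak node: %.0f ms, strong node: %.0f ms\n",
		electionTimeout(weak, 0.5), electionTimeout(strong, 1.0))
}
```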

Log replication: with the quorum write mechanism, the leader receives client requests, issues write proposals to the followers and collects their votes; a proposal commits only after receiving more than half of the votes. In a one-cloud multi-core setting, heterogeneous nodes are organized into disaster-recovery availability zones (AZs), and it must be ensured that every AZ is written.
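A small Go sketch of this commit rule, with assumed types: an entry commits only when it has a majority of acknowledgments overall and at least one acknowledgment from every availability zone:

```go
// Sketch of quorum commit with a per-AZ write requirement.
package main

import "fmt"

type vote struct {
	node string
	az   string // availability zone of the acknowledging node
}

// canCommit reports whether a proposal may commit: a strict majority of the
// cluster has acknowledged it, and every AZ has persisted the entry.
func canCommit(votes []vote, clusterSize int, azs []string) bool {
	if len(votes) <= clusterSize/2 {
		return false // no majority yet
	}
	acked := map[string]bool{}
	for _, v := range votes {
		acked[v.az] = true
	}
	for _, az := range azs {
		if !acked[az] {
			return false // a disaster-recovery zone has not been written
		}
	}
	return true
}

func main() {
	votes := []vote{{"n1", "az-x86"}, {"n2", "az-x86"}, {"n3", "az-arm"}}
	fmt.Println(canCommit(votes, 5, []string{"az-x86", "az-arm"})) // true
}
```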

Business traffic splitting

Cloud-native applications distribute traffic to stateless replica instances through a gateway or load balancer; the traffic is the state of a stateless workload. In a multi-core system, when an application migrates or elastically scales across heterogeneous nodes, traffic must be split and directed to the replicas on the corresponding nodes. To keep service quality from degrading, the specifications and number of equivalent target replicas are determined by quantitative analysis of effective and business computing power, and the share of traffic each replica carries is apportioned accordingly. Traffic switching should be fully decoupled from business logic, which can be realized with a service mesh.

The control plane senses replica changes, generates traffic splitting policies, and delivers them to network proxies and gateways. For east-west traffic, the network proxy hijacks traffic and forwards it proportionally to different replicas according to the splitting policy; for north-south traffic, the gateway does the forwarding (see Figure 5). In the instant of traffic splitting, factors such as target replicas not yet started and TCP connection delays can cause failures to respond, packet loss and other degradations of application service quality. Warm-up, probes, retries and drainage techniques can ensure smooth cross-architecture application switching.

[Figure 5]
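The proportional split can be sketched as weighted random selection in Go; replica names and computing power values are hypothetical, and a service mesh would realize the same ratios through routing rules pushed to proxies rather than application code:

```go
// Sketch: traffic weights proportional to quantified computing power.
package main

import (
	"fmt"
	"math/rand"
)

type replica struct {
	name  string
	power float64 // effective/business computing power of the hosting node
}

// pick selects a replica with probability proportional to its computing power,
// so a replica on a node with half the effective power receives half the load.
func pick(replicas []replica) string {
	var total float64
	for _, r := range replicas {
		total += r.power
	}
	x := rand.Float64() * total
	for _, r := range replicas {
		if x < r.power {
			return r.name
		}
		x -= r.power
	}
	return replicas[len(replicas)-1].name
}

func main() {
	replicas := []replica{{"svc-x86-0", 2.0}, {"svc-arm-0", 1.0}, {"svc-arm-1", 1.0}}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(replicas)]++
	}
	fmt.Println(counts) // roughly 50% / 25% / 25%
}
```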

The development path of one cloud, multiple cores

Following the system design of resource manageability, program runnability and state migratability, one cloud, multiple cores can evolve gradually through three stages (see Figure 6).

[Figure 6]

Phase 1: Hybrid deployment, unified management, unified view

The first phase aims at manageability: unified pooled management, a unified service catalog, and unified monitoring and operations for heterogeneous processor nodes. For runnability and portability, it adopts same-source heterogeneous builds, offline migration, manual switching and business splitting to enable cross-architecture deployment and collaboration of applications. Most one-cloud multi-core construction at home and abroad is currently at this stage. Following the system design method, in the R&D practice of InCloud OS the author's team proposed continuous integration based on same-source heterogeneous builds, continuous delivery based on immutable infrastructure, and architecture-aware scheduling, supporting compilation of the same mainline cloud operating system source code into executables for heterogeneous nodes, and achieving minute-level builds of tens of millions of lines of C/C++, Java, Python and Go code on eight mainstream processors, providing reference guidance for various applications (see Figure 7).

[Figure 7]

In the cloud platform built on InCloud OS, a single resource pool supports all mainstream processor architectures and cascades to 1,000 nodes per controller, realizing unified management and interconnection of one cloud, multiple cores across data centers in three places more than 1,000 kilometers apart, supporting diverse cloud and digital-intelligence business needs; technical specifications and reference architectures have been formulated (see Figure 8).

[Figure 8]

Phase 2: Business traction, hierarchical decoupling, and architecture upgrade

Building on the first phase, to further enable low-cost cross-architecture switching of applications, the second phase realizes cross-architecture application migration, multi-architecture hybrid deployment and traffic splitting through hierarchical decoupling and architecture upgrades. The author's team has made preliminary explorations at the resource, platform and application layers.

1. At the resource layer, combined with a GuestOS sensing and response mechanism, migration applicability across multiple CPUs is further improved, and an online migration method based on consistent snapshots is proposed. Through changed-block tracking and multi-threaded asynchronous optimization, rapid and complete migration of 10 TB class virtual machines is achieved. After migration, the system performs an initial hardware check; if required CPU features are unsupported, it falls back to alternative measures to keep the system running. In particular, CPU and firmware adaptation is implemented for Windows virtual machines, compatible with desktop editions from Windows XP onward and server editions from Windows Server 2000 onward, and it has been used in production environments. However, VM migration is application-unaware: migration may risk database and application anomalies, and application developers need to help verify VM availability after migration.


2. At the platform layer, the solution currently adopted in production realizes cross-architecture operation of stateful applications through data synchronization and business splitting (see Figure 9). Based on InCloud OS, x86 and ARM database cluster services and a data synchronization service are provided. The data synchronization service captures data changes from the source database's write-ahead log (WAL); during transmission it applies encryption and compression algorithms, transaction merging and network packet encapsulation to reduce protocol overhead and latency, and on the target side it improves replay efficiency through grouped multi-task parallelism and native loading, achieving sub-second data synchronization. The application is designed with a read-write separation architecture, reading and writing the x86 database while reading only from the ARM database, enabling cross-architecture database operation in the one-cloud multi-core scenario.

3. At the application layer, InCloud OS completed the first SPEC Cloud benchmark test in a one-cloud multi-core scenario in January 2023, verifying, on a single resource pool spanning multiple x86 and ARM processor architectures, resource manageability, cross-architecture program runnability of the compute-intensive clustering algorithm K-means, and state migration of the IO-intensive distributed database Cassandra. Combined with the balanced scheduling algorithm, it achieved scalability above 90%, performance exceeding the SLA baseline by 20%, and average online time exceeding the world record by 25%.

Phase 3: Software definition, computing power standards, full-stack multi-core

One cloud, multiple cores means integrating cores with the cloud and coordinating platform with ecosystem. In the third phase, through cooperation across the industry chain, processors, complete systems, cloud operating systems, databases, middleware and applications, applications are completely decoupled from the processor architecture to ensure long-term stable business operation.

1. At the computing resource layer, while improving processor performance and reliability, standardize processor design and compatibility through system design, and promote continuous optimization of binary translation technology in practice. On the basis of supporting multi-core processors, extend unified abstraction to heterogeneous computing power such as GPUs and DPUs to achieve heterogeneous acceleration collaboration.

2. At the platform layer, break through application-feature-aware, variable-granularity resource scheduling and allocation to solve adaptive configuration and orchestration of application types and resource encapsulation; research function topology orchestration, efficient scheduling and fast startup to solve flexible construction and elastic expansion of large-scale cloud-native applications.

3. At the application layer, promote application support for multi-core same-source heterogeneous builds, improve best practices for cloud-native transformation and upgrading, and integrate with the resource and platform layers to achieve application-aware, architecture-agnostic smooth switching and elastic scaling.

4. For computing power assessment, standards and evaluation, research quantitative methods for heterogeneous effective computing power, and work with professional evaluation institutions and the upstream and downstream of the industry chain to establish one-cloud multi-core industry standards.

Conclusion: One cloud, multiple cores is the inevitable way to address the coexistence of multiple cores in data centers. To solve cross-architecture application runnability, quantitative analysis of computing power, load-aware scheduling, and distributed state consistency on non-peer architectures, the author's team proposed the core design concept and system design method of the one-cloud multi-core system.

1. Adhere to systems thinking, scenario-driven development and system design. Shift from a CPU-centered design model to a system-centered one, and establish an application-oriented technical route of heterogeneous fusion, software definition and software-hardware collaboration to continuously improve computing efficiency and energy efficiency.

2. Strengthen ecosystem collaboration, layered decoupling and open standards. Decouple processors, complete systems, cloud operating systems, middleware and applications layer by layer; through ecosystem collaboration, eliminate the vertical lock-in and ecological fragmentation caused by single technical routes, achieving standardization and normalization of one cloud, multiple cores.

3. Formulate a development roadmap and evolve through iterative innovation: from hybrid deployment, offline migration and manual switching, to smooth switching and elastic scaling based on architecture upgrades, and on to computing power standards and full-stack multi-core.

Current research and practice are in the transition from the first phase to the second, with exploration and layout centered on program runnability, resource manageability and state migratability. The next step is to strengthen collaboration between the industry chain and the innovation chain, iterating toward application awareness and architecture unawareness, so that the theoretical foundation of one-cloud multi-core computing becomes more solid and complete, software-hardware collaboration and software-definition mechanisms more mature and effective, application-aware scenario paradigms clearer and more feasible, and the industrial ecosystem more standardized.
