Alibaba Cloud's Liu Weiguang: A 20,000-Word Interpretation of Financial-Grade Cloud Native

Author: Liu Weiguang, President of Alibaba Cloud Intelligence's New Finance & Internet Industry division, Executive Director of the China Finance 40 Forum, and a graduate of the Department of Electronic Engineering at Tsinghua University

01 Preface

When the cloud-native concept was proposed in 2015, global finance already had a century of development behind its informatization and digitalization, and financial-grade technical service levels had, after long refinement, become an industry-consensus standard. The classic cloud-native concept of eight years ago was a new paradigm of software development centered on containerization, DevOps, continuous delivery and continuous integration, and microservice architecture. Financial-grade requirements such as high availability, high performance, business continuity, and system security and stability seemed to belong to a category far removed from cloud-native architecture. As the technology evolved, financial institutions gradually introduced cloud-native deployment practices such as containerization into new application systems, but kept finding that cloud-native capabilities focused on the development stage could not reach every layer of financial system construction. The rapid change of cloud computing technology has, in turn, pushed cloud native from a narrow sense to a broad one. Today's cloud has become a universal standard infrastructure and a platform for new technology and new business innovation; technologies such as cloud-native big data, cloud-native storage, and cloud-native networking extend the cloud's native capabilities from software development to data platforms and down to the underlying physical deployment architecture. Today's cloud computing, whether public or private, is indeed reshaping the industry's future-oriented planning through the advancement of its technical system and its embrace of and support for open source.

After a long period of exploration and practice, we propose a new concept: financial-grade cloud native. Its core idea is to move from cloud native in the narrow sense to cloud native in the broad sense: expanding cloud-native thinking from application development alone to the complete technical chain down to the physical deployment architecture, and from the development stage alone to design, development, operation, operations and maintenance, and disaster recovery, while weaving financial-grade characteristics such as high availability, high performance, and business continuity into each of these stages. Summarized and defined, this becomes a financial-grade full-stack cloud-native architecture paradigm. Such a paradigm tightly combines the most advanced architectural concepts with the most stringent financial-grade SLAs, aiming to describe a technical system that upgrades full-stack cloud-native capabilities, completely replaces the traditional architecture, and, in today's era of rapidly developing digital finance and AI on the cloud, provides the strongest possible support.

02 Development of financial IT architecture

If the bank is Iron Man, then the IT system is his suit.

In the past 40 years, alongside the business development and transformation of the financial industry represented by banks, the overall IT architecture has undergone multiple rounds of iterative evolution. The informatization of banks can be summarized into four main stages: the stand-alone era, the networking era, the data-centralization era, and the distributed cloud-native era.

1) Stand-alone era: Computers replaced manual work, but there was no information interconnection; each branch kept its own separate "electronic ledger" and became an information island.

2) The networking era: Relying on a complete network infrastructure, banks centered on provincial and municipal hosts in regional hub cities, linking the business of their branch outlets to achieve provincial and municipal interconnection.

3) The data-centralization era: Banks, according to their own development, centralized data and business to varying degrees, consolidating system infrastructure, physical servers, data, and applications.

The era of large-scale data centralization was also the period in which banks' IT informatization developed fastest and did the most to advance the business. The most important part of the entire IT build-out was the "core system". Here, CORE stands for Centralized Online Real-time Exchange, i.e. centralized online real-time transactions. Taking transfer payments as an example, settlement was shortened from the original half a month to "real-time arrival in seconds". It was through large-scale data centralization and the real-time online transaction capability of the core system that China's financial services greatly improved their service capability and transaction efficiency. Banks' business richness, transaction volumes, and data volumes kept hitting new highs. At the same time, the core system, as the cornerstone of the bank, placed extremely high demands on the processing performance, stability, and security of IT systems. At that time, domestic IT companies could not yet meet such extreme requirements, and the only choice for banks' IT architecture was the centralized architecture.

4) Distributed cloud-native era: As financial business forms kept expanding, defects of the centralized architecture kept surfacing: insufficient scalability, insufficient internet-style high-concurrency capability, high costs, and the demand for independent, controllable R&D. Meanwhile, distributed cloud-native technology has gradually moved from banks' internet service platforms into the technical architecture of the core system, becoming the new mainstream bank-wide technical architecture.


Characteristics of the centralized architecture: the centralized architecture refers to the system architecture paradigm dominated by IBM, Oracle, and EMC. IBM mainframes/minicomputers, Oracle databases, and EMC storage have long been weak points of domestic supply, and the centralized architecture depends heavily on them as its core system. The biggest feature of the centralized architecture is its simple deployment structure. The underlying hardware is generally expensive mainframes, minicomputers, and all-in-one appliances purchased from IBM, HP, Oracle, and other vendors; there is no need to consider how to deploy services across multiple nodes, nor the "distributed coordination problem" between nodes. Scaling is generally vertical: processing capacity is raised by increasing the resource configuration of a single machine, and availability is raised by clustering the hardware devices and basic software.

Characteristics of the distributed architecture: the system is composed of multiple modules deployed on different networked computers that communicate and coordinate through message passing over the network. A distributed system scales horizontally: operating capacity is raised by adding servers and can, in theory, be expanded without limit. Distributed systems are deployed as clusters; each node in the cluster is an independent operating unit, the number of nodes can be increased or decreased at any time according to the workload, and the failure of a single node does not affect overall availability.
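As a toy illustration of this contrast, the sketch below routes requests across a pool of nodes: adding a node raises capacity, and losing one only removes a routing target rather than taking the service down. All names here are hypothetical, not from any specific product.

```python
import hashlib

class NodePool:
    """Toy model of horizontal scaling: capacity grows with node count,
    and the failure of a single node does not make the service unavailable."""

    def __init__(self, nodes):
        self.nodes = set(nodes)

    def add_node(self, node):
        # "horizontal expansion": just add another server to the pool
        self.nodes.add(node)

    def remove_node(self, node):
        # single-node failure: drop it from routing, the rest keep serving
        self.nodes.discard(node)

    def route(self, request_key):
        if not self.nodes:
            raise RuntimeError("no nodes available")
        # deterministic hash-based routing across the surviving nodes
        ranked = sorted(
            self.nodes,
            key=lambda n: hashlib.md5(f"{n}:{request_key}".encode()).hexdigest(),
        )
        return ranked[0]

pool = NodePool(["node-1", "node-2", "node-3"])
target = pool.route("txn-42")
pool.remove_node(target)        # simulate the chosen node failing
fallback = pool.route("txn-42") # the request is still served by another node
```

The point of the sketch is only the shape of the availability argument: routing is a pure function of the surviving node set, so membership changes are the whole failure-handling story.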

03 Financial enterprises embrace the problems and conflicts of cloud native

"Design is not about making things pretty, it's about making things work better". Similarly, cloud native is not for fashion, but to solve problems.

Alibaba proposed moving off the centralized architecture in 2009 and had basically completed the transition by 2013.

In hardware, standardized x86 servers replaced IBM minicomputers and EMC storage devices, relieving the pressure of performance scaling.

In software, OceanBase and open-source MySQL replaced the Oracle database.

In system architecture, new systems were built on the ideas of distributed, cloud-native architecture.

In the process of moving off the centralized architecture, Alibaba not only solved massive-scale computing problems with cheap, relatively controllable PC servers, but also drove the maturation and wide application of cloud-native technologies. As business and technology in the financial industry keep iterating, distributed cloud-native technology must not only meet requirements for high performance, high reliability, high flexibility, and high standards, but must also weigh security, risk, performance, and capacity cost in company-wide architecture design. The following eight major questions must be faced.

Question 1: What is cloud native? What is financial-grade cloud native?

CNCF's initial definition of cloud native was a narrow concept, focused on a new paradigm at the software development level. This "narrow cloud native" was defined by four characteristics: containerized deployment, microservice architecture, continuous delivery and continuous integration, and DevOps; its core audience was application developers. With the continuous evolution of cloud computing, however, cloud-native storage, cloud-native networking, cloud-native databases, cloud-native big data, cloud-native AI, cloud-native business middle platforms, and more have all moved into the unified category of cloud native, and the concept has gradually expanded. "Cloud native in the narrow sense" still focuses on the development level and cannot by itself solve customers' overall architecture upgrade problem, so "cloud native in the broad sense" took shape.

Facing the more stringent requirements of the financial industry, it is necessary to solve not only agile development but also the advancement of the architecture itself: to deeply integrate financial requirements such as security compliance, strong transaction consistency, unitized scaling, disaster recovery and multi-active deployment, and full-link risk management with cloud-native technology, upgrading the traditional centralized architecture as a whole into a system that both meets the standards and requirements of the financial industry and retains the advantages of the cloud-native technical architecture — forming a "financial-grade cloud-native architecture".


Question 2: How does cloud native change IT operations and maintenance management?

"Carriages on the same axle width, writing in the same script, conduct under the same rules."

From the perspective of IT architecture evolution, the traditional centralized architecture, though easy to deploy, suffers from vertical chimney-style silos and horizontally dispersed management: every layer and every technical product is managed and maintained independently. Once virtualization technology matured, centralized, unified management of the underlying servers, storage, networks, and virtual machines was achieved, greatly extending the management radius of operations staff. The core concept of cloud native is that all resources and technologies are provided as pooled services, no longer the traditional chimney-style supply of resources. The cloud-native architecture further standardizes and unifies the management of IaaS resources, PaaS resources, distributed databases, distributed middleware, containers, R&D processes, and other technical services, truly realizing "carriages on the same axle width, writing in the same script". This greatly reduces operations complexity and increases the number of objects each person can manage.


Question 3: How does the cloud native system implement open source governance?

In the past, if financial enterprises wanted to use cloud-native technologies or products, they had to spend a great deal of energy researching open-source projects, doing operations and management themselves, and also handling integration and stability guarantees in order to build a cloud-native platform. Financial institutions have begun to realize that open-source software only solves the explicit, functional requirements above the waterline; the large body of implicit, non-functional requirements below the waterline is exactly what open-source software does not provide — yet it is what financial institutions really need to consider when building cloud-native applications.

To make it easier for developers and operations staff to use cloud-native technology products, more and more financial institutions have established enterprise-level cloud-native technology platforms and technical standards, governing products and architecture across dimensions such as product integration, operation, monitoring, and maintenance, so that cloud-native technology lands with SLA guarantees, mature cases, technical specifications, and grayscale release capability.

Question 4: How can cloud native be combined with information technology application innovation to achieve 1+1>2?

A complete top-down cloud-native technology stack represents today's most advanced technical system. Therefore, when selecting technical solutions for "information technology application innovation", one should not think purely in hardware terms or in simple point-to-point replacement, but should instead adopt the most advanced cloud-native technical architecture and take the "information technology application innovation" transformation as an opportunity for a comprehensive capability upgrade.

"Information technology application innovation" has become a factor that cannot be ignored in the construction of financial institutions' IT systems. When building cloud-native systems, the challenges it brings must be considered, such as the stability of the domestic hardware and software supply chain and the reliability of domestic chips.

"Information technology application innovation" inevitably confronts financial institutions with the "fragmentation problem" of servers built on different chips, which raises management complexity and cost. If a separate cloud is built and managed for each type of chip cluster, the fragmentation and differentiation of these multi-cloud resource pools make it difficult for cloud-native applications to schedule and use resources uniformly, and the peaks and troughs of different businesses cannot be fully exploited for elasticity. Multiple clouds also complicate operations — deployment, upgrade, and capacity expansion must each be managed separately — resulting in high operations costs and a poor operating experience.

Therefore, "one cloud with multiple cores + cloud native" has become the optimal solution to the fragmentation problem. "One cloud with multiple cores" fundamentally solves the multi-cloud management problem caused by the coexistence of different chip types, transforming the differences of "multiple cores" into the standardized services of "one cloud", while cloud native solves the resource integration problem, combining fragmented resources large and small. Together they maximize the use of the cloud resource pool's computing power, integrate the computing power of multiple chip clusters, and truly form one cloud where 1 + 1 > 2.
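A minimal sketch of the "one cloud, multiple cores" idea might look like the following. The scheduler, pool names, and capacity units are all illustrative assumptions, not an actual Alibaba Cloud API: heterogeneous chip pools sit behind a single scheduling interface, so a multi-architecture workload can borrow slack from whichever pool has spare capacity.

```python
class MultiArchScheduler:
    """Sketch of "one cloud, multiple cores": heterogeneous chip pools
    (x86, ARM, ...) are managed behind one scheduling interface."""

    def __init__(self):
        self.pools = {}  # architecture -> free capacity units (toy metric)

    def register_pool(self, arch, capacity):
        self.pools[arch] = self.pools.get(arch, 0) + capacity

    def schedule(self, workload, supported_archs):
        # Place the workload on the supported pool with the most free
        # capacity, so peaks on one chip family can use slack from another.
        candidates = [(cap, arch) for arch, cap in self.pools.items()
                      if arch in supported_archs and cap > 0]
        if not candidates:
            raise RuntimeError(f"no capacity for {workload}")
        cap, arch = max(candidates)
        self.pools[arch] -= 1
        return arch

sched = MultiArchScheduler()
sched.register_pool("x86_64", 2)
sched.register_pool("arm64", 5)
# A multi-arch container image can land on either pool; the emptier pool wins.
placement = sched.schedule("payment-svc", {"x86_64", "arm64"})
```

The design choice the sketch highlights is that fragmentation disappears at the scheduling layer: applications declare which architectures they support, and "one cloud" decides placement.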


Question 5: How does the cloud-native architecture safeguard safe production of the business?

Murphy's Law — "Anything that can go wrong will go wrong" — tells us to doubt everything: any node failure will happen. The design principle of cloud-native application architecture is to treat potential "black swan" risks to safe production as the norm.

The cloud-native architecture's recommendation is to allow failure to happen: ensure that every server and every component can fail without affecting the system, and is self-healing and replaceable. "Fail fast, fail small" is an important design principle of cloud-native systems. The philosophy behind it is that since failures cannot be avoided, the earlier a problem is exposed, the easier it is for the application to recover and the fewer problems reach the production environment. The essence of failing small is to control the scope of a failure — its blast radius. The focus thus shifts from exhaustively enumerating the problems in a system to quickly discovering failures and handling them gracefully.
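The fail-fast/fail-small principle can be sketched with a simple bulkhead that caps concurrent calls into a dependency and rejects excess work immediately instead of letting it queue up. The class name and limit are hypothetical illustrations, not from any specific framework:

```python
import threading

class Bulkhead:
    """Fail fast and fail small: cap concurrent calls into a dependency so an
    overloaded component rejects extra work immediately rather than queueing,
    keeping the blast radius limited to that one dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, *args):
        # Non-blocking acquire: if the dependency is saturated, fail *now*.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: failing fast")
        try:
            return fn(*args)
        finally:
            self._slots.release()

guard = Bulkhead(max_concurrent=2)
result = guard.call(lambda x: x * 2, 21)  # a normal call passes straight through
```

Rejected calls surface instantly as errors the caller can handle, rather than as slowly growing queues that eventually drag down neighboring services.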

Technical risk is also a top priority for financial-grade cloud-native architectures: any error in transaction processing may lead to unpredictable financial losses. A professional technical risk system (SRE, here Site Risk Engineering) must be established to ensure risk and quality control over the whole life cycle — from the system architecture platform to the risk culture and mechanisms, covering architecture design, product development, change rollout, stability assessment, and fault location and recovery — providing a comprehensive guarantee for any system change.

Question 6: How does the cloud-native architecture guarantee business continuity? 

For financial institutions, once the business is live, the least acceptable outcome is for the business to become unavailable.

Cloud-native resilience is the system's ability to withstand abnormalities in the software and hardware components it depends on. These abnormalities typically include hardware failures, hardware resource bottlenecks (such as CPU or NIC bandwidth exhaustion), traffic exceeding the software's designed capacity, failures and disasters affecting the data center, software bugs, and hacker attacks — all factors that can fatally impact business availability. Resilience describes, along multiple dimensions, the system's ability to keep providing business services; the core is to improve the system's overall business continuity through cloud-native architecture design. Financial-grade cloud-native resilience capabilities include: service asynchronization; retry, rate limiting, degradation, circuit breaking ("fusing"), and backpressure; master-standby mode; cluster mode; high availability within an AZ; unitization; cross-region disaster recovery; and multi-site active-active disaster recovery.
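Of the resilience capabilities listed above, circuit breaking plus degradation can be sketched as follows. The threshold, the flaky downstream service, and the stale fallback are illustrative assumptions, not a prescription:

```python
class CircuitBreaker:
    """Sketch of fusing + degradation: after `threshold` consecutive failures
    the breaker opens and calls are served by a degraded fallback instead of
    hammering the failing dependency."""

    def __init__(self, threshold, fallback):
        self.threshold = threshold
        self.fallback = fallback
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, fn, *args):
        if self.open:                 # fused: degrade immediately
            return self.fallback(*args)
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.open:             # this failure tripped the breaker
                return self.fallback(*args)
            raise
        self.failures = 0             # a healthy call resets the counter
        return result

def flaky_quote_service(symbol):      # hypothetical failing dependency
    raise TimeoutError("downstream timeout")

breaker = CircuitBreaker(threshold=3,
                         fallback=lambda symbol: {"price": None, "stale": True})
for _ in range(3):
    try:
        breaker.call(flaky_quote_service, "AAPL")
    except TimeoutError:
        pass
quote = breaker.call(flaky_quote_service, "AAPL")  # now served by the fallback
```

A production breaker would also add a half-open state that periodically probes the dependency for recovery; the sketch omits that to keep the fusing/degradation mechanics visible.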

Question 7: How does the cloud native architecture deal with transaction consistency? 

People want to use a distributed system the way they use a single machine, so the problem of "distributed consistency" is unavoidable.

The "micro" in cloud-native microservices means that service granularity becomes smaller, while the complexity of financial transactions remains high, so data consistency in cloud-native systems is a genuinely complex problem. Independent data stores in different microservices make consistency hard to maintain, and because network errors in a distributed microservice system are inevitable, the CAP theorem dictates that when a network partition occurs, the cloud-native architecture must trade off between consistency and availability.

Therefore, when planning a financial-grade cloud-native architecture, one also meets the consistency challenges posed by financial business. This consistency is reflected not only in business logic (TCC, SAGA, XA transactions, message queues, etc.) but also requires consistency guarantees at the data level (multi-node consistency, multi-center consistency).
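As one example, the Try-Confirm-Cancel (TCC) pattern mentioned above can be sketched like this. The account model and coordinator are deliberately simplified toys, not production transaction code:

```python
class Account:
    """Toy participant in a TCC (Try-Confirm-Cancel) transaction."""

    def __init__(self, balance):
        self.balance = balance
        self.frozen = 0

    def try_debit(self, amt):
        # Try: reserve funds without touching the visible balance.
        if self.balance - self.frozen < amt:
            raise RuntimeError("insufficient funds")
        self.frozen += amt

    def confirm_debit(self, amt):
        # Confirm: turn the reservation into a real debit.
        self.frozen -= amt
        self.balance -= amt

    def cancel_debit(self, amt):
        # Cancel: release the reservation, leaving the balance untouched.
        self.frozen -= amt


def tcc_transfer(src, dst, amt):
    """Coordinator: every participant Tries first; Confirm only if all
    Tries succeed; any failure triggers Cancel on the successful Tries."""
    try:
        src.try_debit(amt)          # phase 1: Try
    except RuntimeError:
        return False                # nothing reserved yet, nothing to cancel
    try:
        dst.balance += amt          # phase 2: Confirm (credit side)
        src.confirm_debit(amt)
        return True
    except Exception:
        src.cancel_debit(amt)       # compensate the successful Try
        return False

a, b = Account(100), Account(0)
ok = tcc_transfer(a, b, 30)
```

The key property is that the Try phase only reserves resources, so a failure anywhere before Confirm can always be compensated by Cancel — no participant's visible state is ever left half-committed.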

Question 8: What are the challenges of cloud native architecture and application design and development?

What makes people tired is not the distant mountains, but a grain of sand in the shoe.

Although cloud-native technology has many benefits, financial institutions usually have a large number of existing systems whose technical stacks differ from cloud-native technologies. How should existing systems and new cloud-native applications be integrated and managed? How should the microservice splitting strategy be formulated — by what dimensions, standards, and granularity? How can a cloud-native observability system be established to implement effective monitoring, log management, and alerting, watch application performance and resource usage in real time, and quickly locate and resolve problems when they occur?

These problems demand deep solutions. Many financial institutions have realized that cloud-native technology needs unified technical specifications across the five states of design, development, operation, operations and maintenance, and disaster recovery: back-end capabilities and requirements such as O&M, disaster recovery, and security should be considered and designed up front in the design and development stages, using cloud-native technology to reduce downstream manual workload and management complexity.

04 "New standard and new blueprint" for financial-level cloud native

The development process of financial-grade cloud native

The accuracy of Kevin Kelly's predictions about modern technology in Out of Control: The New Biology of Machines, Social Systems, and the Human World made the author a prophet in the hearts of many technology practitioners, and the book a kind of bible. Two key points stand out in it:

1. A complex system is composed of a large number of independent and autonomous simple systems.

2. Complex movements are assembled from simple movements, not modified.

The entire system is composed of many single-responsibility "microsystems" at different levels (microservices); the system itself tolerates faults and iterates freely, achieving dynamic fault tolerance as a whole. Most importantly, there is no centralized "hand of God" in the system. This coincides with the system architecture design advocated by cloud native — the birth of cloud native may even have been inspired by it.

As the saying goes, "when one whale falls, ten thousand things grow." With the decline and ebb of the traditional centralized architecture, cloud-native technologies are emerging and growing across the board.

Cloud native is essentially the software, hardware, and architecture born of the cloud, and it is a process of continuous development and evolution. The concept of cloud native (Cloud Native) was proposed in 2015 and then further developed and refined by the CNCF into the "narrow cloud native" of containers, continuous delivery, continuous integration, service mesh, microservices, immutable infrastructure, and declarative APIs.

Today, when we discuss "digitalization", there are actually two concepts: one is digital-native, the other is digital transformation. Cloud-native technologies in the narrow sense mainly meet the agile innovation requirements of internet-based "digital native" enterprises — mostly stateless internet applications that require only eventual consistency for their data. For traditional financial "digital transformation" enterprises, however, existing technical standards and technical assets (and burdens) often present much greater obstacles.

As cloud computing technology deepens and spreads, more and more new technologies are "born of the cloud". These products, technologies, software, hardware, and architectures that are "born in the cloud and raised on the cloud" have gradually matured and formed the concept of "cloud native in the broad sense". In the future, more such "cloud-native" products will keep emerging: new-generation databases, artificial intelligence, storage, chips, networks, even health-code systems. The extreme elasticity, service autonomy, and large-scale replicability of cloud native make it easier to standardize heterogeneous resources, accelerate the release of digital productivity, speed up business application iteration, and promote business innovation. It is the "greatest certainty" among the many uncertainties of the digital age, and its strong inclusiveness points to the overall technical architecture direction of future digital enterprises. Beyond the agile innovation requirements of "digital-native enterprises", cloud native in the broad sense also accommodates the technical standards and architecture compatibility requirements of traditional "digital transformation enterprises", so it has broader architectural applicability and better enterprise-level service capability.


Today, as cloud native gradually moves from the community into financial institutions and gains popularity, financial institutions have begun to study how to combine the requirements of financial scenarios with cloud-native implementation: deeply integrating industry requirements such as financial security compliance, strong transaction consistency, unitized scaling, disaster recovery and multi-active deployment, full-link business risk management, and operations management with cloud-native technology, developing a "financial-grade cloud-native architecture". It can better meet the stringent challenges and requirements of financial-grade IT environments and provide unified technical architecture support for financial institutions' traditional "stable-state" applications (digital transformation) and "agile-state" applications (digital native).

If we take the unified control of the past centralized financial architecture (a central brain) as the "left", and fully open-source distributed cloud native as the "right", then under a financial cloud-native architecture, the technical architecture financial institutions need seeks a balance between the two: achieving financial-grade security, strong consistency, and reliability as well as fault tolerance, scalability, and rapid response. We propose an architecture of "strong local autonomy, weak central control" to shield application complexity (for example, a GRC architecture: G for the Global system, R for Region systems, C for City systems). Only complex logic that must be judged from comprehensive factors is completed by the global system (the central brain), reducing its burden, while the large volume of daily, simple judgments and actions is completed in a closed loop within the local systems, improving fault tolerance and the robustness of the overall system.


10 new elements that define financial cloud native

A cloud-native architecture is a set of architectural principles and design patterns based on cloud-native technology, aiming to maximize the separation of non-business code from cloud applications so that the cloud infrastructure can take over the large volume of non-functional features originally inside applications (elasticity, resilience, security, observability, grayscale release, etc.). Freed from non-functional concerns and the business interruptions they cause, the business becomes lightweight, agile, and highly automated. In the traditional architecture, the application layer carries much non-business code; in the cloud-native architecture, the ideal is that no non-functional code appears in the application logic at all — it sinks into the infrastructure — and business operations staff need only focus on what relates to the business code. We summarize the core of financial-grade cloud native into the following 10 architectural elements.


Element 1: Platform Engineering & Immutable Infrastructure

Facing large-scale use of cloud-native technology, the complexity of R&D and operations is a major obstacle restricting financial institutions' adoption. At present, from the perspectives of R&D management and operations management, "platform engineering" and "immutable infrastructure" are two key cloud-native capabilities that can greatly reduce that complexity.

The DevOps philosophy is "you build it, you run it": developers should be able to develop, deploy, and run their applications end to end. For most financial institutions, however, this is not easy to achieve. The proven division of labor between Ops and Dev places relatively lower demands on talent, whereas the DevOps paradigm requires R&D staff to know everything well, greatly increasing their "cognitive load". This places high demands on financial institutions' R&D teams, is not conducive to cultivating generalist talent, and can seriously hinder the comprehensive introduction of cloud-native applications. One of the most promising directions for improvement is platform engineering: a bridge between DevOps and business programmers — a self-service platform that lets developers deliver business software faster and better. Through simple page-based operations, the configuration of the whole delivery chain can be completed, so that R&D staff need not attend to the details of the many operations tools and can focus on developing application functionality. Gartner describes platform engineering as follows: "The tools, capabilities, and processes brought together by the platform are carefully selected by domain experts and packaged for end-user convenience. The ultimate goal is to create a frictionless self-service experience that provides users with the right capabilities to help them get important work done at the least cost, increasing end-user productivity and reducing their cognitive load."

Traditional mutable infrastructure means deploying application services on physical machines or virtual servers whose runtime environment depends on many variables — server configurations, basic software, and so on — that can be reconfigured dynamically, drift between environments, or be updated in real time by external services. The infrastructure the application depends on is constantly changing, so when an emergency rollback is needed, the operations process is often complicated and error-prone.

Cloud-native immutable infrastructure means that, based on cloud-native images, everything the application depends on (operating system, security scripts, operations agents, development framework, runtime environment, etc.) is packaged into an immutable image, and a container is simply pulled up from that image. This greatly reduces deployment and operations costs, makes deployment and operations easier and more predictable, and gives the application runtime environment higher consistency and reliability. In addition, operations functions such as automatic rolling replacement and automatic rollback can be built on images, greatly raising the automation level of application operations. Image layering improves image management on the one hand and, given how containers load images, improves image loading efficiency on the other, thereby speeding up application startup.
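The replace-not-patch idea can be sketched as a toy deployment manager whose only operations are deploying a new immutable image reference and rolling back to the previous one. Class name and image tags are illustrative assumptions:

```python
class ImmutableDeployer:
    """Sketch of image-based immutable deployment: every release is a new
    immutable image reference, "change" means replacing rather than patching,
    and rollback is just pointing back at a previous image."""

    def __init__(self):
        self.history = []  # ordered, append-only list of image references

    def deploy(self, image_ref):
        # A release never mutates the running environment in place;
        # it only records a new image to pull containers from.
        self.history.append(image_ref)
        return image_ref

    @property
    def running(self):
        return self.history[-1] if self.history else None

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("no previous image to roll back to")
        self.history.pop()   # discard the bad release
        return self.running  # containers are re-pulled from the old image

d = ImmutableDeployer()
d.deploy("core-app:2024-01")   # hypothetical image tags
d.deploy("core-app:2024-02")
previous = d.rollback()        # emergency rollback is a single pointer move
```

Because the environment is entirely derived from the image reference, rollback carries no risk of half-applied configuration — the very failure mode the mutable-infrastructure paragraph above describes.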


Element 2: Elastic Hybrid Cloud

As cloud architecture becomes the mainstream platform and infrastructure of financial institutions, it must be able to scale elastically on demand at the granularity of business units: rapidly expanding to raise resource and application processing capacity when facing traffic peaks, and rapidly releasing resources after the peak to maximize resource utilization. It is therefore necessary to build an elastic architecture that is flexible and can be replicated at low cost. The elastic architecture is in essence an extension of the unitized architecture: it provides elastic scaling with the business unit as the smallest granularity, mainly through "pop-up" and "bounce-back". Pop-up is a comprehensive scale-out of computing resources, network, applications, and data organized around a business unit, an end-to-end elastic measure from bottom-layer resources up to traffic. Units created by pop-up are called elastic business units. Unlike ordinary business units, elastic business units have the following characteristics:

Locality: each business unit expanded in the regular mode must contain all applications and all data, whereas an elastic business unit popped up under the elastic architecture only needs to contain part of the applications and part of the data, usually the applications involved in high-traffic links.

Temporary: unlike the long life cycle of ordinary business units, the life cycle of an elastic business unit is relatively short. After supporting a payment peak such as "Double Eleven", the business requests of the elastic business unit bounce back to regular business units, and the elastic business unit is then released to save costs.

Cross-cloud: elastic business units are usually located on one or several other clouds. The traffic peaks faced by scenarios that use the elastic architecture are several times higher than daily levels, and the everyday cloud computing base can hardly provide sufficient resources on its own; other cloud computing bases must supply large amounts of resources.

The elastic architecture gives full play to the advantages of the hybrid cloud: massive cloud resources allow applications to expand almost without limit to absorb extreme traffic peaks, and once the peak passes, resources can be quickly released, achieving on-demand elastic scaling.
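The pop-up/bounce-back lifecycle can be illustrated with a small capacity-planning sketch. The capacity figures and unit sizes below are invented for the example, not real product numbers:

```python
BASE_CAPACITY = 100_000   # TPS the daily (regular) units can absorb
UNIT_CAPACITY = 50_000    # TPS one elastic business unit adds

def plan_elastic_units(expected_tps):
    """How many temporary cross-cloud units to pop up for an expected peak."""
    overflow = max(0, expected_tps - BASE_CAPACITY)
    # ceil-divide: each elastic unit carries only the high-traffic link
    return -(-overflow // UNIT_CAPACITY)

# Before the promotion peak: pop up units on another cloud.
assert plan_elastic_units(260_000) == 4
# After the peak, traffic bounces back and the elastic units are released.
assert plan_elastic_units(80_000) == 0
```

The point of the sketch is the asymmetry the text describes: regular units are long-lived and sized for daily load, while elastic units exist only for the duration of the peak.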

Element 3: Mixed deployment of resources

In daily production, to guarantee a high quality of service, online service applications typically run for long periods and monopolize CPU resources while actual CPU utilization is very low; offline computing tasks are just the opposite, usually short-lived and undemanding about resource service quality, but with high CPU utilization while running. As business scale grows, the resource pools of online business clusters and offline clusters both become larger, and during business troughs resource utilization suffers: an obvious symptom is that cluster resource allocation rates are high while actual utilization is low.

In building a cloud-native architecture, financial institutions deploy online and offline clusters together ("colocation", 混部). On top of core capabilities such as elastic CPU sharing and priority preemption, staggered scheduling of offline and online applications, application QoS classification, and tiered memory management, colocation combines resource isolation with dynamic adjustment to precisely mix online services of different types with offline computing services, solving the problem of efficient resource utilization. Given financial-grade complexity, the following colocation capability standards need to be established:

Large-scale, multi-scenario colocation: build colocation technology into the infrastructure and environment in which the business runs, improve the output of colocation capabilities, and make them easy to extend to other resource environments;

Consistent colocation control and O&M: unify the resource-access process to guarantee globally consistent maintenance and management of basic software and configuration;

Flexible, efficient, fine-grained resource scheduling: fast resource switching between online and offline services, and integrated resource scheduling;

Colocation stability on par with non-colocated deployments: rely on fine-grained service-level metrics, together with improved resource isolation and workload adaptability;

A colocation monitoring system: improve runtime monitoring, anomaly discovery, and diagnosis capabilities;

A colocation emergency-response mechanism: identify stability-risk scenarios in advance and formulate a process-driven emergency-response mechanism to build the capability of fast recovery from anomalies.
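The colocation idea above (offline work borrowing CPU left idle by online services, with priority preemption) can be sketched as a simple admission rule. The node size, headroom ratio, and function names are illustrative assumptions:

```python
NODE_CPU = 32  # cores on one node (illustrative)

def schedulable_offline_cpu(online_allocated, online_actual_usage,
                            headroom=0.25):
    """CPU an offline batch task may borrow on this node: the allocated-but-
    idle online capacity, minus a safety headroom reserved for online bursts.
    When online usage rises, offline borrowing shrinks first (preemption)."""
    idle = online_allocated - online_actual_usage
    return max(0.0, idle - NODE_CPU * headroom)

# Online service has 24 cores allocated but uses only 6: offline work can
# borrow the slack while an 8-core burst buffer stays reserved.
assert schedulable_offline_cpu(24, 6) == 10.0
# When online usage spikes, offline borrowing collapses to zero.
assert schedulable_offline_cpu(24, 22) == 0.0
```

This is exactly the "high allocation rate, low actual utilization" gap the text describes: the offline workload monetizes the gap without touching the online service's guaranteed allocation.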

Element 4: Heterogeneous integration of multiple technology stacks

A service mesh can be thought of as an infrastructure layer that handles communication between services. Modern cloud-native applications have complex service topologies, and the service mesh is responsible for reliably delivering requests across them. In practice, a service mesh is usually a set of lightweight network proxies deployed alongside applications. It can be compared to TCP/IP between applications or microservices, responsible for network calls between services, rate limiting, circuit breaking, and monitoring.

Before service-mesh technology was applied, the microservice system was typically provided to business applications by a middleware team through an SDK that bundled the various service-governance capabilities: service discovery, load balancing, circuit breaking and rate limiting, service routing, and so on. At runtime the SDK and the business application code run mixed together in one process, and this tight coupling brings a series of problems:

First, upgrade costs are high. Every SDK upgrade requires the business application to bump the SDK version and re-release. When the business is developing rapidly, such upgrades hurt R&D efficiency.

Second, version fragmentation is serious. Because SDK upgrades are costly while middleware keeps evolving, over time SDK versions become inconsistent and capabilities uneven, creating a huge workload for unified management.

Third, middleware evolution is difficult. Because of serious SDK version fragmentation, middleware must remain compatible in code with all the old version logic as it moves forward; it is like walking forward in shackles and cannot iterate rapidly.

The service mesh of financial institutions sinks into the sidecar the network-communication capabilities that used to be integrated through the SDK: basic RPC, messaging, and DB access, as well as service discovery, circuit breaking, rate limiting, traffic control, and database sharding (splitting databases and tables). This gives business systems a more transparent communication infrastructure, decouples the iterative evolution of the infrastructure from the business system, lets business developers focus on business logic, reduces the burden on business systems, and improves the iteration efficiency of both business systems and infrastructure.
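The sidecar idea can be shown with a toy proxy object: governance logic (service discovery, rate limiting) lives beside the application, so upgrading it does not require rebuilding or re-releasing the business code. All names, endpoints, and the token-count rate limiter are hypothetical simplifications:

```python
class Sidecar:
    """Toy sidecar: discovery + rate limiting outside the business process."""

    def __init__(self, registry, rate_limit):
        self.registry = registry        # service name -> endpoint
        self.rate_limit = rate_limit    # max calls allowed (toy limiter)
        self.used = 0

    def call(self, service, payload):
        if self.used >= self.rate_limit:
            return {"status": 429, "error": "rate limited"}
        self.used += 1
        endpoint = self.registry[service]          # service discovery
        return {"status": 200, "endpoint": endpoint, "echo": payload}

# Business code only says "call the accounts service" -- no governance SDK
# is baked into the application itself.
mesh = Sidecar({"accounts": "10.0.3.7:8080"}, rate_limit=2)
assert mesh.call("accounts", "balance?")["status"] == 200
mesh.call("accounts", "balance?")
assert mesh.call("accounts", "balance?")["status"] == 429
```

In a real mesh the proxy is a separate process (e.g. an Envoy-style sidecar) intercepting network traffic; the decoupling benefit is the same: the middleware team upgrades the sidecar fleet without touching application releases.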

Element 5: Continuity of infrastructure (integration of public and private)

As more and more core systems move toward full cloud-nativization, scheduling and orchestration of large-scale resources has become an indispensable capability for the continuity of financial infrastructure. To serve thousands of applications across the business departments of a financial institution, help different applications use the cloud well, satisfy their differing resource demands, and fully exploit the cloud's ability to support business growth, infrastructure continuity requires unified resource-management capabilities like those of the public cloud. These cover not only traditional transaction and data scenarios, but also the growing adoption in large-scale computing of new heterogeneous hardware represented by GPUs: distributed deep-learning training, online inference, streaming-media encoding and decoding, and other tasks demand ever richer resource-computing scenarios.

Unified infrastructure continuity, with unified operation and management of underlying resources, can optimize cost and efficiency through rich cloud-native techniques across multiple dimensions such as supply chain, capacity forecasting, capacity planning, and resource-pool elasticity. It provides real-time, accurate management and control with zero leakage of underlying resources, and supports all scenarios in a flat, easy-to-manage, flexibly configurable way.

Element 6: Full-link technology risk prevention and control

Many production failures of financial business systems are caused by changes, so change control is crucial to technical-risk prevention and control. Under a microservice distributed architecture in particular, the service scale is huge and changes come from many sources. Without strong control and tracking of changes, once a problem occurs online it is difficult to quickly trace it back to the corresponding change by hand, and it is also difficult to effectively control the quality of the change itself. This calls for a "technical risk prevention and control system" on the cloud-native architecture that manages risks and changes across the entire link.

The core guiding principle of technical-risk prevention and control is the "three axes of change": observable, grayscale, emergency. Any change must have observability in place before implementation, to evaluate expected effects, identify unexpected problems, and guide both the further expansion of the change's scope and decisions on emergency actions. "Grayscale" emphasizes that a change must expand its scope gradually, with the grayscale process designed along multiple dimensions such as region, data center, environment, server, user, and time. "Emergency" emphasizes that the change plan should give priority to guaranteeing rollback capability; in special circumstances some changes cannot be rolled back, or the rollback cost is unacceptable, and these must be handled with additional changes such as data correction or releasing a new version. The "three axes of change" are also the core change-risk-control capabilities of the financial-grade cloud-native architecture, which integrates them and builds circuit-breaking and self-healing abilities into the change process.

The core responsibility of the "full-link risk prevention and control system" is to make changes visible and traceable by aggregating all change information. It also provides capabilities such as change orchestration, change grayscale inspection, change pre-checks, and monitoring and early warning of change results; when a problem occurs, it correlates the problem with changes to speed up online troubleshooting.
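The "three axes" applied to a single change can be sketched as a loop: expand the grayscale scope step by step, observe a health signal after each step, and roll back the moment the signal degrades. The stage names and error-budget threshold are illustrative assumptions:

```python
GRAY_STAGES = ["one server", "one data center", "one region", "all regions"]

def gray_release(health_check, error_budget=0.01):
    """health_check(stage) -> observed error rate after rolling the change
    out to that stage. Returns ('done', stages) or ('rolled_back', stages)."""
    completed = []
    for stage in GRAY_STAGES:
        completed.append(stage)
        if health_check(stage) > error_budget:   # observable: watch metrics
            return "rolled_back", completed      # emergency: revert change
    return "done", completed

# A regression that only shows up at data-center scale is caught there,
# before the change ever reaches a whole region.
status, stages = gray_release(
    lambda stage: 0.05 if stage == "one data center" else 0.001)
assert status == "rolled_back" and stages[-1] == "one data center"
```

The value of the pattern is that the blast radius of a bad change is bounded by the last grayscale stage reached, which is why the text insists observability must exist before the change starts, not after.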

In addition, the full-link risk prevention and control system must be able to analyze capital-loss risk points, formulate prevention and control measures, and spell out plan details; in the quality-testing and analysis stage, capital verification must be tested and analyzed. Before release, risk must be assessed again to check whether the capital-loss prevention measures are in place, including real-time verification, T+M minute-level verification, T+H hour-level verification, T+1 next-day verification, and so on, with early-warning subscriptions on the checks; at the same time, the business side must fully accept the capital flows. Capital-flow operations are verified through reconciliation modes such as voucher-to-voucher, voucher-to-account, and account-to-account checks.

Element 7: Cloud Native Security and Credibility

At present, external threats in the Internet environment are becoming diversified and novel. Traditional defenses respond well to known vulnerability exploits and attack methods, but cannot handle new threat types such as APT attacks and 0-day exploits. These known and new threats, however, share a common characteristic: they are all behaviors not expected by the business. Based on this characteristic, cloud-native technology needs to perform trusted measurement of all service requests and resource-loading behaviors, and build a security defense-in-depth system on trusted behavior, ensuring that only expected behaviors can be accessed and executed while everything else is blocked and intercepted, thereby resisting both known and unknown threats.

At the same time, to guarantee security isolation between business entities in the financial industry, technical services such as infrastructure should also be built in an environment isolated from the business entities, with an independent isolated network and a higher security level. Cloud-native platform technology services are upgraded to trusted native services through transformations such as multi-tenant isolation, unified management and control, and trusted-channel convergence, in accordance with trusted-native-service standards. For the application runtime environment, the cloud-native secure and trusted architecture builds security and trust capabilities such as identity, authentication, authorization, full-link access control, and full-link encryption into the infrastructure, decouples security from applications as much as possible, reduces service interruptions in a trusted-native way, and provides a trusted application runtime environment.

Element 8: Financial-level consistency


Cloud-native applications are mainly distributed systems, with an application divided into multiple distributed microservice systems. Splitting is generally horizontal or vertical; this does not refer only to databases or caches, but chiefly expresses a divide-and-conquer idea and logic.

At bottom, a distributed system cannot escape the CAP "impossible triangle" (C: Consistency; A: Availability; P: Partition tolerance). The CAP theorem shows that any distributed system can satisfy at most two of the three properties at once. Since a distributed service system must tolerate partitions, a trade-off has to be made between consistency and availability. If the network misbehaves, latency between some nodes keeps growing and the system may partition, delaying replication. If the user waits for replication to complete before the call returns, the system may fail to respond within the time limit and lose availability; if the user does not wait, and the call returns right after the write to the primary shard, availability is preserved but consistency is lost.

For financial institutions, high availability at the architecture level and strong consistency at the business level are almost equally important. Financial-grade cloud native must therefore balance the CAP "impossible triangle" well, reconciling strong business consistency with high system availability as far as possible.

But "consistency challenge" is not just a database problem in a distributed system, but a big topic covering all levels of a distributed system: transaction consistency, node consistency, inter-system business consistency, message power Equal consistency, cache consistency, cross-IDC consistency, etc. Therefore, it is also necessary for the cloud-native architecture to have a series of technologies that can cope with the stringent challenges of financial-level consistency.

Transaction level: an appropriate distributed-transaction model must be chosen for each financial scenario. Balancing cost and performance, SAGA and TCC are the two models financial institutions use most. The SAGA mode is less intrusive to the application, but it guarantees consistency through compensating transactions, and transaction isolation between successive steps is not guaranteed; the TCC mode achieves better transaction isolation but requires the application layer to absorb more complexity. For nodes in the transaction flow that need not return results synchronously, an asynchronous message queue can raise execution efficiency; for long transaction flows it can markedly reduce implementation complexity and smooth load peaks. A typical scenario, a customer purchasing a wealth-management product, simplifies into two steps: debiting the deposit account and crediting the wealth-management account. In the SAGA mode, if the crediting step fails, the system must reverse the deposit-account debit to preserve transaction consistency. If the TCC mode is chosen, the debit and credit logic is executed in sequence, the deposit system and the wealth-management system each record the state of their step, and a unified commit is initiated once both succeed.
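The SAGA compensation pattern for the wealth-management example can be sketched minimally: each step carries a compensating action, and a failure triggers compensation of all completed steps in reverse order. Step names and the in-memory ledger are illustrative:

```python
def run_saga(steps):
    """steps: list of (action, compensation); each action returns True on
    success. Returns 'committed' or 'compensated'."""
    done = []
    for action, compensation in steps:
        if action():
            done.append(compensation)
        else:
            for comp in reversed(done):   # undo completed steps in reverse
                comp()
            return "compensated"
    return "committed"

ledger = {"deposit": 100, "wealth": 0}

def deduct():      ledger["deposit"] -= 30; return True
def refund():      ledger["deposit"] += 30; return True
def credit_fail(): return False               # wealth system rejects

# The credit fails, so the debit is compensated and balances are restored.
assert run_saga([(deduct, refund), (credit_fail, None)]) == "compensated"
assert ledger == {"deposit": 100, "wealth": 0}
```

Note the trade-off the text names: between `deduct` and the compensating `refund`, other transactions can observe the intermediate balance, which is exactly the isolation SAGA gives up and TCC's try/confirm/cancel phases recover at the cost of application-layer complexity.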

Database level: financial scenarios demand, in the extreme, that data never be lost. On the one hand, multiple copies must be kept across computer rooms in the same city and in remote locations, with in-city RPO of zero and remote RPO close to zero. On the other hand, the Paxos algorithm, a message-passing algorithm for achieving data consistency in distributed systems, provides the core guarantee of consistency among the copies.

Computer room level: cross-computer-room routing, and cross-computer-room recovery of abnormal transactions, are required. When a computer room fails, the database must be able to switch to an in-city or remote copy with an RPO of zero, and cooperate with transaction-routing switchover at the application layer to complete computer-room-level disaster-recovery switching and restore the business. For transaction flows interrupted by the computer-room failure, the distributed-transaction component must recover automatically, restarting interrupted flows to complete them forward or backward according to pre-set business rules.

Element 9: Unitization, multiple locations and multiple activities


With the rapid development of digital financial services, the traditional centralized production environment can no longer meet demand. The current direction of evolution is the unitized, "multi-site multi-active" architecture, based on unitized computer rooms (hereafter referred to as LDC) to satisfy high timeliness and financial-grade security requirements.

The "three centers in two places" architecture generally adopted by financial institutions has several typical deficiencies. First, this architecture requires two centers in the same city to have similar computer room capacity to meet full switchover. Second, under this architecture model, remote disaster recovery systems are usually "Cold" ones do not really carry business traffic, and it is difficult to take over full business when a disaster occurs. As new data centers are generally concentrated in areas far away from traditional data centers, such as Inner Mongolia and Guizhou, and the capacity ratio of new and old data centers is very unbalanced, financial institutions are required to break through the "three centers in two places" in terms of operating structure. The traditional model evolves to an N+1 "multi-active" disaster recovery solution to further improve the systemic capabilities of failure recovery.

"Remote multi-active architecture" refers to the expansion capability based on the LDC unit architecture. LDC units are deployed in IDCs in different regions, and each LDC unit is "live", which truly undertakes real business traffic on the line. In the event of a failure, fast switching between LDC units is possible. The remote multi-active unit architecture solves the following four key problems:

Off-site deployment becomes possible thanks to minimized cross-unit interaction and the use of asynchrony; the horizontal scalability of the whole system improves greatly and no longer depends on in-city IDCs;

It can realize the N+1 remote disaster recovery strategy, greatly reducing the cost of disaster recovery, and at the same time ensuring that the disaster recovery facilities are truly available;

There is no single point in the system, greatly improving overall availability; units deployed in the same city and in remote sites back each other up as disaster-recovery facilities and can be switched quickly through the operations management-and-control platform, offering the chance of 100% continuous availability;

Under this architecture, service-level traffic entry and exit form a unified, controllable, routable control point, and overall system controllability improves greatly. Operations and control modes that used to be hard to implement, such as online load testing, traffic control, and grayscale release, become easy to implement on this foundation.
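The routable control point can be illustrated with a toy unit router: each user shard maps to an LDC unit, and when a unit fails its shards are re-pointed at a healthy unit in another city. Unit names, the shard count, and the modulo placement rule are invented for the sketch:

```python
SHARDS = 100

class UnitRouter:
    """Toy shard-to-LDC-unit routing with failover overrides."""

    def __init__(self, units):
        self.units = list(units)     # e.g. ["LDC-A", "LDC-B"]
        self.override = {}           # shard -> failover unit

    def route(self, user_id):
        shard = user_id % SHARDS
        if shard in self.override:
            return self.override[shard]
        return self.units[shard % len(self.units)]

    def fail_over(self, bad_unit, to_unit):
        """Re-point every shard of a failed unit at a healthy one."""
        for shard in range(SHARDS):
            if self.units[shard % len(self.units)] == bad_unit:
                self.override[shard] = to_unit

router = UnitRouter(["LDC-A", "LDC-B"])
assert router.route(2) == "LDC-A"     # even shard -> LDC-A
router.fail_over("LDC-A", "LDC-B")
assert router.route(2) == "LDC-B"     # traffic switched, no app change
```

Because every request passes through this one routing layer, the same mechanism that implements disaster-recovery switching also implements grayscale release and load testing: they are all just different override rules at the control point.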

Element 10: Business continuity and digital intelligence operation and maintenance


In a cloud-native environment, information from multiple containers, virtual machines, hosts, availability zones, and even regions must be correlated to answer why a service is down, why the defined SLO is not met, and which users and businesses a fault affects; efficient digital-intelligent operations management can then be achieved on the basis of operations data and AI.

Cloud-native digital intelligent operation and maintenance mainly includes seven aspects of capabilities:

Monitoring and discovery capabilities: Omnidirectional observability of indicators, logs, and links, comprehensive coverage of services, middleware, and infrastructure, and drill-down capabilities.

Fault emergency response capability: Abnormal comprehensive discovery, rapid location and recovery capabilities, ensuring business SLA.

Change risk prevention and control capability: all-round business change management and control, strictly adhering to the "three axes" of grayscale, observability, and rollback.

Capacity management capability: From business to infrastructure, it provides accurate assessment of full-link capacity and early identification of risks to achieve a balance between stability and cost.

Disaster recovery management capability: platform-based, orchestratable disaster recovery, supporting computer-room and unitized disaster-recovery scenarios, covering drill, switchover, and dashboard capabilities.

Drill and evaluation capability: Through chaos engineering, red and blue attack and defense, etc., the business risk assurance capability is detected and tested.

Capital security assurance capability: Based on the capital security check rules, the capital flow of the business system is monitored through offline, real-time, file and other methods.

Cloud-native digital intelligent operation and maintenance mainly has three characteristics:

Efficient: Improve the efficiency of operation and maintenance through the platformization of operation and maintenance work. Such as system monitoring platform, change management and control platform, dynamic resource management and control platform, scheduling center, registration center, etc.

Security: Based on the automatic business verification platform and big data operation rules, the stability and correctness of system operation are guaranteed. Such as data verification center, dependency management and control platform, capacity detection management and control platform, etc.

Intelligence: Intelligent operation and maintenance management and control based on big data analysis and rule calculation. Such as automatic fault analysis and processing system, automatic capacity detection and expansion system, etc.

Build a new blueprint for financial cloud native

Financial-grade cloud-native application architecture 

The book "Architecture is the Future" puts forward fourteen basic principles of distributed application design, which are the core elements of the most important cloud-native application architecture.

N+1 Design: Make sure that any system you develop has at least one redundant instance in the event of a failure.

Rollback Design: Ensure that the system can be rolled back to any previously released version.

Switch Disable Design: Be able to turn off any released functionality.

Monitoring Design: Monitoring must be considered during the design phase, not added after the implementation is complete.

Design a multi-active data center : Consider multi-active deployment when designing, and don't be limited by a data center solution.

Asynchronous design : Asynchronous is suitable for concurrency, only make synchronous calls when absolutely necessary.

Stateless system : A stateless system is more conducive to expansion and load balancing. Use state only when the business really requires it.

Horizontal scaling, not vertical upgrades: Never depend on bigger, faster systems. The core idea of microservices is to scale horizontally rather than concentrate all functions in one system. When necessary, split the requirements across multiple systems instead of upgrading the original system.

Forward-looking design : consider in advance the solutions that affect the next-stage system scalability issues, and continuously refine public shared services to reduce the number of refactorings.

Buy if it is not core : If it is not what you are best at and does not provide a differentiated competitive advantage, then buy it directly. Databases, cloud services, etc. can be purchased.

Small builds, small releases, fast trial and error : All R&D requires small builds and continuous iteration to allow the system to grow continuously. Small releases have a lower failure rate because the failure rate is directly related to the number of changes in the solution.

Isolate faults: Design for fault isolation, and avoid fault propagation and cross-effects through circuit breaking. Avoiding mutual influence between multiple systems is very important.

Automation : "Automation is the source of wisdom". In the cloud-native architecture, rapid deployment and automated management are the core. Design begins with the process of automating as much as possible through architecture and design. Don't depend on humans if machines can do it.

Use proven technology : If a technology has a high failure rate, it should never be used.
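The "isolate faults" principle above is most often realized with a circuit breaker. A minimal sketch, with an illustrative failure threshold and no half-open probing (real breakers add a recovery probe after a cool-down):

```python
class CircuitBreaker:
    """Toy breaker: after `threshold` consecutive failures, callers fail
    fast instead of piling onto a sick dependency."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            return "fast-fail"            # open: protect the caller
        try:
            result = fn()
            self.failures = 0             # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            return "error"

breaker = CircuitBreaker(threshold=2)
flaky = lambda: (_ for _ in ()).throw(RuntimeError("down"))
assert breaker.call(flaky) == "error"
assert breaker.call(flaky) == "error"
assert breaker.call(flaky) == "fast-fail"   # breaker is now open
```

Failing fast bounds the damage to the one unhealthy dependency: upstream threads are not tied up waiting on it, which is precisely how fault propagation between systems is cut off.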

Financial-grade cloud-native platform architecture

The overall financial cloud-native platform architecture can be divided into five major states: the design state, R&D state, running state, O&M state, and disaster-recovery state.

Design state: adopt domain-driven design and other design methods naturally compatible with the microservice architecture, pay attention during design to issues such as data consistency and service granularity, and implement the design principles and specifications of distributed architecture.

R&D state: provide R&D personnel with one-stop productivity tools that shield the complexity of distributed technology and improve developer experience and productivity; establish broadly agreed engineering templates to reduce organizational cognitive cost.

Running state: application-oriented infrastructure for running distributed applications, covering the entire application life cycle (creation, deployment, monitoring, configuration changes), supporting various forms of application interaction and data storage, with the bottom layer supporting various computing and scheduling methods.

Operation and maintenance state: For operation and maintenance personnel, it solves the inherent complexity of distributed architecture, and widely uses engineering methods to ensure the overall availability of the system.

Disaster recovery state: oriented to disasters, it provides the ability to tolerate disasters at the node level, computer room level, and city level.


Financial-grade cloud-native data architecture 

The cloud-native framework has inherent advantages such as fast delivery, elastic scaling, standardization, automation, and isolation. It is continuously integrated with the new generation of data technology to form a cloud-native data architecture system with the following characteristics.

1. Scalable fusion of multiple computing modes

The cloud-native data architecture can uniformly support and fuse different computing modes such as batch, stream, interactive, multi-model, and graph computation, for example lake-warehouse integration, stream-batch unification, and streaming machine learning, enabling deep integration of computing systems whose functions and ecosystems complement each other. Users can complete more types of computation in one system, improving platform operating efficiency and reducing cost of use.

2. Multi-layer intelligent distributed storage layer

The separation of storage and computing will become standard within two or three years, and data platforms will develop toward hosted, cloud-native services. Refined tiering within storage has become a key means of balancing performance and cost: combining multi-tier storage (hot/standard/cold, etc.) on a distributed storage system with storage-utilization data can reduce storage costs, and AI will play a greater role in tiering algorithms. With limited room left to optimize encoding and compression on general-purpose processors, future breakthroughs and technical upgrades will depend on the development and application of integrated software-hardware technology.
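A tiering policy of the kind described can be sketched as a simple recency rule that trades cost against latency. The day thresholds and per-GB prices below are invented for the example:

```python
TIERS = [            # (name, max days since last access, $ per GB-month)
    ("hot", 7, 0.10),
    ("standard", 30, 0.04),
    ("cold", float("inf"), 0.01),
]

def pick_tier(days_since_access):
    """Place an object on the cheapest tier its access pattern allows."""
    for name, max_age, _cost in TIERS:
        if days_since_access <= max_age:
            return name

def monthly_cost(objects):
    """objects: list of (size_gb, days_since_last_access)."""
    price = {name: cost for name, _age, cost in TIERS}
    return sum(size * price[pick_tier(age)] for size, age in objects)

assert pick_tier(2) == "hot"
assert pick_tier(90) == "cold"
# 100 GB hot + 1000 GB cold: 100*0.10 + 1000*0.01 = 20.0 per month
assert monthly_cost([(100, 1), (1000, 365)]) == 20.0
```

The AI role mentioned in the text amounts to learning these thresholds (and predicting future access) per dataset instead of fixing them by hand.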

3. Unified scheduling and elastic scaling resource pool management

As the separation of data-lake storage and computing deepens, a unified containerized resource-scheduling system on the cloud-native architecture has become a necessary component, providing unified resource pooling and online-offline colocation support for the integrated big-data-and-AI architecture. A unified compute-resource pool enables overall resource planning and scheduling and optimizes fine-grained resource management; combining offline computing with online computing tasks achieves peak-valley complementarity and helps raise server resource utilization. Meanwhile, computing-task resources can be allocated by business priority to avoid contention during scheduling, so that at business peaks compute resources can be summoned in an elastic scale-out mode, fully exploiting available computing power and improving responsiveness.

4. Big data SRE intelligent operation and maintenance capabilities

The diversity of big data technologies and the complexity of data-platform architectures make big data platforms challenging to operate. A new-generation big data platform should support online rolling upgrades to shorten upgrade windows; provide unified operation of heterogeneous workloads, unified job lifecycle management, and unified scheduling of task workflows to guarantee task scale and performance; and apply machine learning to job logs, performance metrics, resource utilization, and other data, combined with historical records and real-time load, to analyze, detect, and optimize. Continuous optimization in query planning, data modeling, adaptive resource management, and system anomaly detection and self-healing forms the intelligent operations capability of a large-scale data platform.
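As a minimal sketch of the anomaly-detection side of such intelligent operations, the snippet below compares the latest metric sample against a baseline built from history using a simple z-score. A production platform would use far richer machine-learning models and real telemetry; the metric values here are synthetic:

```python
# Minimal statistical anomaly detection on a metric stream: flag a sample
# that deviates from the historical baseline by more than z_threshold sigmas.
from statistics import mean, stdev

def is_anomalous(history: list, latest: float, z_threshold: float = 3.0) -> bool:
    """True if `latest` deviates > z_threshold standard deviations from history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Synthetic CPU-utilization history (%), hovering around 42%.
cpu_history = [41.0, 43.5, 40.2, 42.8, 44.1, 39.9, 42.0, 43.3]
```

A spike to 95% would be flagged while 42.5% would not; in practice the detector would also feed self-healing actions (restart, reschedule, scale out) and dynamic alert thresholds.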

Financial-grade cloud-native infrastructure 

Financial-grade cloud-native infrastructure needs to meet five general requirements and 13 management capability requirements.

(1) The five general requirements are:

First, adopt mature cloud platform products to build an integrated IaaS and PaaS cloud computing platform, with a complete service catalog on both the tenant side and the operations side that connects seamlessly with the software development and production operations systems;

Second, provide elastic supply of basic resources company-wide, and support company-wide business systems in building high-availability disaster-recovery architectures on a distributed technology framework to meet safe-production requirements;

Third, fully meet the requirements of information technology application innovation: from the cloud platform base up to software services, support full-link innovation while guaranteeing the high-performance, stable operation of distributed applications;

Fourth, provide the foundation for moving large-scale applications onto the cloud, with a complete application framework and stable, continuous, high-performance support for application systems;

Fifth, the cloud platform products should have a mature ecosystem that keeps pace with public-cloud technology in the industry and adapts to the evolution of the latest open-source technology.

(2) The 13 management ability requirements are:

Unified resource management: use unified physical resource types and architectures to manage basic hardware resources such as servers, switches, and operating systems; the cloud management platform manages compute, storage, network, and other cloud resources across the two-site, three-data-center topology through unified channels (console, API, etc.), reducing development and operations complexity.

Unified data management: for intra-city active-active and cross-region multi-active architectures, guarantee data consistency across distributed cloud nodes through data storage, migration, and synchronization, and provide integrated disaster-recovery and coordinated switchover capabilities to meet business-continuity requirements to the greatest extent, for example a unified image solution, object-storage disaster recovery, and cross-region database backup and synchronization.

Unified service management: support managing cloud services across the two-site, three-data-center nodes through unified APIs, SDKs, consoles, and so on, such as a unified control plane for service deployment and updates, which greatly reduces the complexity of cloud-service management and improves the efficiency of cloud use.

Unified operations management: through the cloud management platform, a single operations system manages the different nodes across two sites and three data centers, providing consistent operation, monitoring, and reliability SLAs, reducing the workload of operations staff, improving operational efficiency, greatly reducing system failures, and shortening downtime.

Unified security management: on one hand, secure the platform itself through physical infrastructure, network security, and data-plane/control-plane isolation; on the other, deliver security services through host security, access control, firewalls, situational awareness, and so on, to ensure integrated security.

Unified resource scheduling: the cloud management platform schedules compute resources across the two sites and three data centers and supports multiple scheduling strategies. Location-based scheduling serves latency- and bandwidth-sensitive services (such as mobile-banking audio and video); compute-based scheduling serves AI, big data, and other large-scale computing services (such as tidal scheduling and colocation scenarios); workload-based scheduling serves multi-dimensional, heterogeneous scenarios (such as flash sales, points redemption, and Double 11 applications).
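The three scheduling strategies above can be sketched as a simple policy dispatch. The workload attributes and classification rules here are illustrative assumptions, not a real scheduler API:

```python
# Hypothetical dispatch of a workload to one of the three scheduling
# strategies named above, based on a coarse workload profile.
def pick_strategy(workload: dict) -> str:
    if workload.get("latency_sensitive"):
        return "location-based"   # place near users, e.g. audio/video banking
    if workload.get("batch_compute"):
        return "compute-based"    # AI / big data, tidal scheduling, colocation
    return "workload-based"       # bursty, heterogeneous traffic (flash sales)

video = {"latency_sensitive": True}
training = {"batch_compute": True}
flash_sale = {"bursty": True}
```

In a real unified scheduler these policies would coexist, with each placement decision weighing latency, compute shape, and load profile together rather than picking a single label.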

Unified monitoring management: ingest and uniformly display all types of monitoring metrics both on and off the cloud; provide distributed tracing on and off the cloud, with layer-by-layer drill-down and multi-dimensional analysis from business monitoring through application-service monitoring to resource monitoring, improving fault location and analysis; and, by integrating a unified alert center with dynamic thresholds, improve overall business-event awareness, rapid fault location, and intelligent analysis and decision-making.

Support for multiple compute types: the cloud resource pool is compatible with CPU, GPU, and other compute types, providing efficient cloud compute services for new fintech applications in fields such as artificial intelligence, deep learning, and scientific computing.

Support for full-stack information technology application innovation: through a system compatible with multiple product and service capabilities, support "one cloud, multiple chip architectures" and full-stack XC cloud platform services, advancing the implementation of the IT application innovation strategy.

Support for fine-grained management: through the platform's metering and billing capabilities and integration with the institution's various systems, meter and bill compute, storage, network, security, and other resources; gradually achieve fine-grained management of IT costs, measure and evaluate business IT investment against business output, balance cost and efficiency, and make efficient use of IT resources.
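As an illustrative sketch of such metering, the snippet below aggregates per-resource usage into a chargeback figure for one business line. The meter names and unit rates are hypothetical:

```python
# Illustrative metering/chargeback sketch: sum (usage * unit rate) per meter.
# Rates are invented; an unknown meter raises KeyError so billing gaps surface.
RATES = {"cpu_core_hour": 0.04, "gb_storage_month": 0.02, "gb_egress": 0.05}

def chargeback(usage: dict) -> float:
    """Total cost for one business line's metered usage, rounded to cents."""
    return round(sum(RATES[k] * v for k, v in usage.items()), 2)

mobile_banking = chargeback({
    "cpu_core_hour": 1000,    # 40.00
    "gb_storage_month": 500,  # 10.00
    "gb_egress": 200,         # 10.00
})
```

With per-business-line figures like this, IT investment can be compared against business output, which is exactly the cost-versus-efficiency balance the requirement describes.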

Support for bare-metal management: automate and batch the bare-metal delivery process from server racking through automatic installation, system configuration, and software orchestration, improving delivery efficiency and reducing manual work; and meet unified bare-metal management requirements, with unified monitoring and alerting for bare-metal machines.

Support for service quality: by improving self-service capabilities, the infrastructure management platform provides efficient, stable operation and fine-grained management; the platform's data collection and analysis then feed back into management priorities and effectively improve service quality.

Support for architecture evolution: adopt an industry-leading proprietary cloud architecture built from the same codebase as the public cloud, meet the financial industry's disaster-recovery requirements, support all products on a single system, support an integrated online and offline operations system across the whole institution, and, through an organic, unified architecture design, lay the foundation for a future full-stack cloud platform.

05 Financial-level cloud-native implementation path

Financial-level cloud-native capability assessment

"The best way to invest in the future is to improve the present".

Financial-grade cloud native has greatly released the dividends of the digital age. Cloud native fully inherits the design philosophy of the cloud, and in the future more applications will be developed directly on it: cloud-native applications fit the cloud architecture better, and cloud computing in turn provides them with better foundational support, such as resource-isolation mechanisms, distributed deployment, and high-availability architecture. Through these new architectures and technologies, application systems become more robust; it can be said that cloud-native applications maximize the advantages of the cloud.

Based on an integrated IaaS/PaaS cloud platform, one bank used a distributed microservice framework, cloud middleware, containers, DevOps, and other cloud-native technologies to build a PaaS-level cloud platform offering horizontal scaling, second-level elasticity, and intelligent operations, suited to rapid development and continuous delivery, driving the bank's evolution from a traditional architecture to an Internet architecture. The platform deploys, runs, and schedules resources on containers; the lightweight nature of containers saves deployment and runtime resources when service counts surge, and easily absorbs fluctuating business traffic. Meanwhile, delivering applications as images realizes "build once, deploy many times," avoiding the operational complexity and risk of traditional deployment processes. With this platform, the application delivery cycle was shortened by 80% and responsiveness to business needs improved by 50%.

However, as financial institutions began adopting cloud-native technology at scale, problems emerged: the cloud-native product landscape is overly complex, the open-source ecosystem lacks governance, and compatibility and adaptation between products are difficult. Piecemeal technical features often interfere with institutions' technology selection and incur high trial-and-error costs.

"Discussing the parts while ignoring the whole is meaningless."

The more platform-like a technology is, the more it must be considered from a holistic perspective. There is therefore an urgent need for a unified standard, grounded in industry characteristics, that gives financial institutions a capability reference model: one that lets them locate their stage in the cloud-native transformation, analyze the gaps in their cloud-native capacity building, and set the direction of future technology and capability construction. Drawing on financial-industry practice, we provide a complete technical capability framework and a nine-dimension maturity assessment model for adopting cloud-native technology, which can be developed with reference to the following indicators:

Microservice architecture level, application cloudification level, observability, high-availability management, configuration automation, DevOps, cloud platform capability, cloud-native security, and container/Kubernetes capability.
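A maturity assessment of this kind can be sketched as a simple scoring exercise: rate each of the nine dimensions and surface the weakest ones as capacity-building priorities. The 1-5 scale and the example ratings below are hypothetical, not part of the model itself:

```python
# Hypothetical scoring sketch for the nine-dimension maturity model.
# Dimension names follow the list above; ratings (1-5) are examples.
DIMENSIONS = [
    "microservice architecture", "application cloudification", "observability",
    "high availability", "configuration automation", "DevOps",
    "cloud platform capability", "cloud-native security", "container & K8s",
]

def weakest(scores: dict, n: int = 3) -> list:
    """Return the n lowest-scoring dimensions as capacity-building priorities."""
    return [d for d, _ in sorted(scores.items(), key=lambda kv: kv[1])[:n]]

# One institution's self-assessment: strong platform, weak security posture.
example = dict(zip(DIMENSIONS, [4, 3, 2, 4, 2, 3, 4, 1, 3]))
```

The value of the exercise is less the scores themselves than the comparison over time: re-assessing each year shows whether capacity building actually moved the weak dimensions.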


Financial-level cloud-native evolution path 

A good architecture comes from evolution. A complete architecture plan is needed to ensure integrity and consistent construction standards, but the architecture must also keep evolving to remain stable and controllable overall. We therefore summarize two cloud-native architecture evolution paths for reference.

Reference path 1: start from the global, macro view (top-down), identifying technical gaps and an evolution path based on a cloud-native capability assessment. The following example is a three-stage cloud-native architecture evolution path that helps financial institutions gradually transform their application architecture from monolithic to microservice to unitized deployment, and transition from intra-city active-active to cross-region multi-active, seeking the most balanced path that serves business growth and withstands harsh failure scenarios.


Reference path 2: start from problems (bottom-up). Architecture evolution should always aim to solve a specific class of problem, so it is worth designing the overall cloud-native architecture evolution from the perspective of "problems." The following example evolves the cloud-native architecture step by step by solving one technical problem at a time.


Step 1: In order to make the entire application architecture have "better underlying support", run the application architecture on the cloud platform

Step 2: In order to solve the "complexity problem" of monolithic architecture, use microservice architecture

Step 3: In order to solve the "communication exception problem" between microservices, use governance framework + monitoring 

Step 4: In order to solve the "deployment problem" of a large number of applications under the microservice architecture, use containers

Step 5: To solve the "orchestration and scheduling problem" of containers, use Kubernetes

Step 6: In order to solve the "intrusive problem" of the microservice framework, use Service Mesh

06 Epilogue

This article maps the broad cloud-native technical concept onto financial-grade technical standards, defining a blueprint and ten elements of financial-grade cloud native. It aims to extend advanced cloud-native ideas across an organization's full technology stack and proposes a brand-new reference architecture for the financial industry's architecture planning for information technology application innovation. Let us keep exploring and practicing together to accelerate financial-grade architecture innovation.

About the Author:

Liu Weiguang, President of Alibaba Cloud Intelligent New Finance & Internet Industry, Executive Director of China Finance 40 Forum, graduated from the Department of Electronic Engineering of Tsinghua University. Before joining Alibaba Cloud, he was responsible for fintech business promotion and ecosystem building, and for Ant Blockchain business development, at Ant Financial. He has worked in the enterprise software market for many years: he founded Pivotal Software's Greater China branch, pioneering the local market for enterprise big data and enterprise-grade cloud computing PaaS platforms. Before founding Pivotal's China operation, he served as general manager of the Data Computing Division of EMC Greater China, and earlier worked at Oracle China for many years, where he created the Exadata Product Division for Greater China and served as its director.
