Progressive Release in GitOps Practice



About the author: Chen Juntong, Senior Solution Architect at Tencent Cloud CODING DevOps, has worked in technical evangelism for many years and has extensive experience in enterprise digital transformation, IT and DevOps adoption, and value stream construction in the cloud-native era. He provides consulting, solution design, and in-house training services, paying attention both to cloud-native development and best-practice implementation from the engineer's perspective and to process management and R&D efficiency improvement from the manager's perspective.

This article is based on Chen Juntong's talk at the Wuhan stop of the 2023 China DevOps Community Summit, where he was named one of the summit's "Top Ten Lecturers" and recognized by the community as an outstanding knowledge sharer.


Introduction

With the spread of cloud computing and microservice architecture, the environments in which software is developed and operated have become increasingly complex. In the pursuit of better development and delivery practices, the concept of GitOps emerged and quickly gained wide recognition. Within many effective GitOps practices, progressive release plays an important role as a risk management strategy. It allows us to roll out new service versions incrementally, bringing new features or fixes to production as quickly as possible while keeping the system stable. There are various implementation strategies, including rolling upgrades, blue-green deployment, grayscale deployment, canary release, and A/B testing.

In this post, we explore how GitOps and progressive release can be combined to achieve efficient, reliable, and secure service deployment in cloud-native environments. We compare different progressive release strategies, introduce mainstream cloud-native tools, and show how CODING's self-developed cloud-native application management platform, Orbit, implements them elegantly. We hope this article provides valuable information and inspiration for readers who want to introduce GitOps and progressive release into their own software development process.


GitOps was first proposed by Alexis Richardson in 2017. This model advocates using Git as the "Single Source of Truth" to manage and synchronize development and production environments, rather than relying on traditional infrastructure management and deployment methods. Once proposed, the idea triggered broad discussion and reflection on cloud-native deployment and operations models.

In 2018, the concept of GitOps began to be accepted and applied by more organizations and cloud service providers, who started to explore how to integrate GitOps into their products and services. For example, cloud providers such as Amazon, Google, and Microsoft have integrated GitOps tools into their Kubernetes services.

In 2020, to further promote the concept and practice of GitOps, the Cloud Native Computing Foundation (CNCF) announced the establishment of the GitOps Working Group. The working group is made up of several cloud service providers and software companies, which together developed a standardized set of GitOps principles and best practices.

In 2021, the GitOps Maturity Model was born out of a joint effort of the GitOps Working Group and the community. The model provides a way to quantify and evaluate a GitOps implementation, helping organizations better understand and adopt GitOps.

GitOps can be succinctly summarized in the following three principles:

  • The code repository is the single source of truth for the entire system.
  • All change activities, including creation, modification, and destruction, are carried out with the code repository at the core.
  • All changes must be observable and verifiable.

Of course, beyond these three basic principles, GitOps in practice usually also follows a number of prerequisites and best practices, including but not limited to:

  1. **Separation of code and configuration:** In traditional application deployment, code and configuration are often mixed together, which makes switching between environments difficult. GitOps recommends separating them: the application's code and its runtime configuration (environment variables, database connections, and so on) should live in different repositories, or at least in different branches. This makes changes to code and configuration clearer and also helps avoid errors caused by misconfiguration.
  2. **Continuous delivery:** Continuous delivery is an important part of GitOps. Every commit goes through an automated build, test, and deploy process so that new changes reach production faster and more frequently.
  3. **Clear audit trail:** Since all changes go through the code repository, GitOps gives us a clear audit trail. This helps us trace the source of changes and understand the reason behind each one.
  4. **Automation and self-healing:** In GitOps, we automate operations as much as possible and correct the system as soon as it deviates from the desired state. This is usually achieved through declarative infrastructure management and Kubernetes' self-healing capabilities.
  5. **Privilege management and separation of duties:** To avoid single points of failure and improve security, GitOps advocates the principle of least privilege and separation of duties. Each role should have only the minimum permissions needed to complete its tasks, and different responsibilities should be assigned to different roles or teams.
  6. **Observability and feedback:** GitOps not only requires that changes be clearly observable, but also needs a feedback mechanism so that developers and operators know whether their changes were applied successfully and how the system is actually running. This is usually achieved through logging, monitoring, and alerting systems.

We can also describe GitOps with a concise formula: Infrastructure as Code (IaC) + Merge Requests + Continuous Integration/Continuous Deployment (CI/CD).
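To make the formula concrete, the hedged sketch below shows one common way a GitOps controller can be told to keep a cluster in sync with a Git repository. It uses Argo CD's Application resource purely as an illustration; the repository URL, path, and namespaces are hypothetical, and the CODING workflow described next implements the same idea with its own continuous deployment module.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/demo-service-config.git  # hypothetical config repo
    targetRevision: main        # only changes merged into main are deployed
    path: deploy/production     # directory holding the Kubernetes manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: demo-production
  syncPolicy:
    automated:
      prune: true               # remove resources that were deleted from Git
      selfHeal: true            # revert manual drift back to the declared state
```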


Taking the CODING platform as an example: developers first push their code to the CODING code repository. Code reviewers then review the changes, and only after the review passes can the submitted code be merged into the main branch. Once new code is merged into the main branch, the corresponding CODING continuous integration pipeline is triggered to automatically build a Docker image and push it to the CODING artifact repository. When CODING's continuous deployment module detects that a new Docker image has been pushed to the artifact repository, it automatically triggers the corresponding continuous deployment pipeline to release the new image to the Kubernetes cluster.

For operations personnel, this means they can no longer operate the Kubernetes cluster directly with kubectl apply. Instead, they modify configuration by submitting updated configuration files to the CODING code repository. When these configuration files pass review and are merged into the main branch, CODING's continuous deployment module detects the changes, triggers the corresponding continuous deployment pipeline, and releases the new configuration to the Kubernetes cluster.

The advantage of this model is that all access control can be managed centrally through the code repository; there is no longer any need to grant Kubernetes cluster access to every operations engineer, which greatly improves cluster security. And since every operation revolves around the code repository, each one leaves a commit record, making it easy to trace the source of a problem.

In general, GitOps is a key practice of modern DevOps. It makes the code repository the center of all operations and ensures that every change is observable and verifiable, improving the transparency, manageability, and security of the system.

In this talk we focus on the last mile of GitOps delivery: how application services are released and deployed to the target environments to meet the needs of business scenarios. The community has accumulated many excellent practices around this topic, so let us first review the common deployment strategies.

Rolling release

✅ There is no need to stop the service, and the user experience is good.
❌ Version upgrade and rollback take a long time.


"Rolling release" is a common deployment strategy, which generally involves the following steps: gradually take one or more servers out of service, perform the update, and then bring it back into service after the update is completed. This process is repeated until all instances in the cluster have been updated to the new version.

The advantage of rolling update is that the whole update process will not interrupt the service, so it can provide a good user experience. At the same time, the rolling update can also gradually replace the old version instance while maintaining service continuity, which can effectively reduce the risk caused by the update.

However, rolling updates also have their disadvantages. Since it upgrades instances one by one, the entire update process can take a long time, especially if the cluster size is large. Also, if something goes wrong with the new version, performing a rollback can take a long time to complete, which can leave the system in an unstable state during that time.

Blue-green release

✅ Upgrade switching and rollback are very fast.
❌ If there is a problem with the new cluster, the impact will be large, and two sets of machine resources will be required, which will cost a lot.


"Blue-green release" is a deployment strategy that relies on two identical cluster environments online . These two sets of environments are marked blue and green, usually in the initial state, all traffic will be routed to the green cluster. When we are ready to release a new version, we deploy the new application version on the blue cluster. Once the new version is deployed and tested on the blue cluster, we will switch all traffic to the blue cluster at one time to complete the system upgrade switch.

The advantage of the blue-green release is that since we have two independent clusters and the traffic can be switched at one time, the system upgrade and rollback operations can be completed quickly. This strategy also allows us sufficient time to verify the stability and performance of the new version on the blue cluster, reducing the risk of releasing a new version.

However, blue-green releases also have their drawbacks. First of all, because the traffic switching is completed at one time, if there is a problem with the new cluster, all users will be affected, and the impact will be large. Second, because blue-green releases need to maintain two identical environments, this will bring higher costs, including costs in terms of hardware resources, maintenance work, and complexity.

Canary release

✅ Smooth upgrade, low risk, no user perception.
❌ Traffic is indiscriminately directed to the new version, which may affect the experience of important users.


"Canary release" is an optimization strategy for "blue-green release" , which draws on the method that miners use canary as coal mine safety monitoring. This release strategy starts by switching a small portion of traffic from the current version of the cluster (which we call the green cluster, or V1) to the new version of the cluster (which we call the blue cluster, or V2). This way we can test the new version in the production environment.

Canary tests can be of varying complexity. Simple canary tests may only involve manual verification, while complex canary tests may require a well-equipped monitoring infrastructure. Through monitoring indicator feedback, observe the health of the canary environment as a basis for subsequent releases or rollbacks .

If the canary test is successful, we will switch all remaining traffic to the blue cluster (V2). If the canary test fails, we stop traffic switching, declare the new release as a failed release, and troubleshoot and fix accordingly.

Compared with the blue-green release, the advantage of the canary release is that it adds a verification process in the production environment, which can reduce the risk and impact of problems in the new version. Since there are two sets of clusters, we can upgrade smoothly without stopping the service, making the entire release process imperceptible to users.

However, canary releases also have disadvantages. During the release of the new version, traffic will be routed to the new version indiscriminately. If there is a problem with the new version, it may affect the experience of those important customers who are routed to the new version.

A/B testing

✅ A new version of the service can be provided to a specific user group, and if a failure occurs, the scope of impact is small.
❌ The release cycle is relatively long.


"A/B testing" is an online testing strategy in which multiple versions of a program are run concurrently , dynamically routing traffic to different versions based on meta-information about user requests. In other words, we can decide which version to route traffic to based on the content of the request. To illustrate with a specific example, we can set rules to route requests with the User-Agent value of Android (that is, mobile phone requests from the Android system) to the new version, while routing requests from other systems (such as iOS) or other devices ( For example, on the computer side) requests continue to be routed to the old version.

The benefit is that we can compare the performance of the two versions to determine which one users prefer and which receives better feedback.

The main advantage of A/B testing is that we can perform dynamic routing based on request information and provide the new version only to specific user groups. This way, if the new version fails, the scope of impact is relatively small.

However, A/B testing also has disadvantages: since multiple versions need to be deployed and maintained at the same time, and it takes a while to observe how the different versions behave and collect user feedback, the overall release cycle can be relatively long.


Let us now do a brief comparative analysis of these common deployment strategies.

  1. **Recreate (downtime redeployment):** The old and new versions cannot run in parallel, real-traffic verification is impossible, and release efficiency is low, but it may be the more dependable choice in some specific cases.
  2. **Rolling update:** Deployment requires no downtime, but the rollout speed must be controlled to avoid resource overload. The old and new versions run in parallel for a period of time, traffic-control granularity is low, real-traffic verification is possible, and both release efficiency and reliability are medium.
  3. **Blue-green deployment:** The old and new versions run in parallel, traffic-control granularity is high, real-traffic verification is possible, and both release efficiency and reliability are high.
  4. **Canary deployment:** The old and new versions run in parallel, traffic-control granularity is high, real-traffic verification is possible, and both release efficiency and reliability are high.
  5. **A/B testing:** Performed while the old and new versions run in parallel, with high traffic-control granularity and real-traffic verification; release efficiency and reliability depend on how the test is implemented.
  6. **Automatic progressive release:** This strategy combines the advantages of several deployment strategies, gradually rolling out new versions through automated tools and processes to reduce risk. The old and new versions run in parallel, traffic-control granularity is high, real-traffic verification is possible, and both release efficiency and reliability are high.
In a cloud-native environment, updating and deploying applications are common and critical operations, so Kubernetes, as a container orchestration system, has several basic release strategies built in to cover most common requirements, implemented mainly through the Deployment and StatefulSet resources:

  1. **Recreate:** With this strategy, Kubernetes first stops all old Pods and then starts the new Pods, so the application has a period of downtime during the deployment.

  2. **Rolling update (RollingUpdate):** This is the default deployment strategy. Kubernetes gradually replaces old Pods with new ones, keeping the application running without downtime throughout the deployment. We can control the speed of the rolling update by setting maxUnavailable and maxSurge.

A rolling update consists of a well-defined series of checks and actions for upgrading any number of replicas managed by a Deployment. The Deployment resource orchestrates ReplicaSets for rolling updates, and the following figure shows how this mechanism works:

img

Whenever a Deployment resource is updated, the rolling update mechanism kicks in by default. It creates a new ReplicaSet for Pods built from the updated configuration defined in spec.template. The new ReplicaSet does not start all replicas (say, three) at once, since that would cause a spike in resource consumption. Instead, it brings up one replica at a time, verifies its readiness, connects it to the service pool, and then terminates a replica running the old configuration.
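As a minimal sketch of this mechanism (names, image, and probe path are hypothetical), the pace of a Deployment's rolling update is tuned with maxUnavailable and maxSurge:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one old Pod may be unavailable at a time
      maxSurge: 1         # at most one extra Pod may be created above the desired count
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: registry.example.com/demo-app:v2   # changing this field triggers the rollout
          readinessProbe:                           # readiness gates each replacement step
            httpGet:
              path: /healthz
              port: 8080
```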

  3. **Blue-green deployment:** Kubernetes itself does not directly support blue-green deployment, but we can achieve a similar effect with a Service and two Deployments: create a second Deployment running the new version of the application, switch the Service over to it, and then delete the old Deployment (a minimal sketch follows).

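```yaml
# Hedged sketch (all names hypothetical): one Service in front of two Deployments.
# The "version" label decides which Deployment receives traffic; editing the
# selector switches all traffic from green (current) to blue (new) in one step.
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo-app
    version: green        # change to "blue" to cut over to the new version
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app-blue     # new version; demo-app-green (v1) is defined analogously
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
      version: blue
  template:
    metadata:
      labels:
        app: demo-app
        version: blue
    spec:
      containers:
        - name: demo-app
          image: registry.example.com/demo-app:v2
```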

  4. **Canary deployment:** Kubernetes does not directly support canary deployment either, but we can achieve it by manually managing multiple Deployments and Services: create a new Deployment for the new version, route a portion of user traffic to it, gradually increase that proportion, and finally route all traffic to the new Deployment (a rough sketch follows).

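One hedged way to approximate this with plain Kubernetes objects (names are hypothetical) is to run a small canary Deployment behind the same Service as the stable Deployment, so traffic is split roughly in proportion to replica counts:

```yaml
# The Service selects only the shared "app" label, so it load-balances across
# both Deployments. With 9 stable replicas and 1 canary replica, roughly 10%
# of requests hit the new version; scaling the canary up shifts more traffic to it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app-canary
spec:
  replicas: 1                  # ~10% of total capacity
  selector:
    matchLabels:
      app: demo-app
      track: canary
  template:
    metadata:
      labels:
        app: demo-app          # shared label matched by the Service
        track: canary
    spec:
      containers:
        - name: demo-app
          image: registry.example.com/demo-app:v2
# The stable Deployment (demo-app-stable, replicas: 9, track: stable, image v1)
# is defined the same way, and the Service selector is simply "app: demo-app".
```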

Although Kubernetes provides these basic release strategies, in a real production environment we often need more advanced and flexible ones, with features such as automatic rollback, metric analysis, and complex traffic routing rules, so that the release process can be controlled and managed more precisely. Many tools have therefore emerged in the community, among which Argo Rollouts and Flux's Flagger are the most popular.

Argo Rollouts is developed and maintained by the Argo project, an open source project focused on Kubernetes workflows, CI/CD, and application deployment. The Argo project is hosted by the Cloud Native Computing Foundation (CNCF), and the main company behind it is Intuit. The product concept of Argo Rollouts is to use native Kubernetes capabilities to provide a flexible, configurable progressive deployment method. By introducing new custom resource definitions (CRDs), Argo Rollouts extends Kubernetes' deployment capabilities so that users can implement more complex deployment strategies in a Kubernetes environment. In terms of mechanism, Argo Rollouts introduces a new resource type, Rollout, with which users define their deployment strategy and configuration; when a Rollout resource changes, the Argo Rollouts controller automatically executes the deployment according to the defined policy.
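A minimal, hedged sketch of such a Rollout resource (names, image, and step values are hypothetical) might look like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: registry.example.com/demo-app:v2
  strategy:
    canary:
      steps:
        - setWeight: 20          # send 20% of traffic to the new version
        - pause: {duration: 10m} # observe metrics before continuing
        - setWeight: 50
        - pause: {}              # wait for manual promotion
```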

Flagger comes from Flux, an open source GitOps tool focused on continuous delivery on Kubernetes. The Flux project is also a CNCF project, and the main company behind it is Weaveworks. Flagger's product philosophy is progressive delivery through automation with minimal human intervention: its goal is to provide an automated, safe, and observable way to roll out new application versions without impacting the user experience. In terms of mechanism, Flagger automatically adjusts the traffic weights of Kubernetes service objects and gradually shifts traffic to the new version of the service, thereby achieving progressive delivery. Throughout this process, Flagger continuously monitors the application's performance metrics, and if any anomaly is found it automatically rolls back to the previous version.
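For comparison, a hedged sketch of a Flagger Canary resource (target name, port, and thresholds are hypothetical) could look like this:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: demo-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app            # Flagger derives -primary/-canary workloads from this target
  service:
    port: 80
  analysis:
    interval: 1m              # how often traffic is shifted and metrics are checked
    threshold: 5              # failed checks tolerated before automatic rollback
    maxWeight: 50             # maximum traffic share sent to the canary
    stepWeight: 10            # traffic increase per interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99             # roll back if the success rate drops below 99%
        interval: 1m
```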

img

The table shown in the figure is a brief comparison of Argo Rollouts and Flux Flagger. For example, in terms of workload intrusiveness, Rollouts replaces the workload entirely, whereas in Flagger the original object is treated as the grayscale verification target and the actually exposed service is a copy of it. Or take whether the new version can be verified repeatedly during deployment: Argo supports this well, and although Flux also supports it, each verification is essentially a new deployment. The two tools also differ at other levels.

To expand: both support workload management and complex deployments, both provide deployment strategies such as canary and blue-green, and both support traffic splitting and switching at runtime. Both can integrate with open source service meshes such as Istio, Linkerd, and AWS App Mesh, and with ingress controllers such as Envoy-based API gateways, NGINX, and Traefik. They also support analysis, via CRDs, of metrics collected from monitoring or APM tools such as Prometheus, Datadog, Stackdriver, Graphite, and New Relic.

In terms of security and policy integration, both Argo CD and Flux CD provide authentication, authorization, and secret management to strengthen platform security. Flux relies heavily on Kubernetes' underlying RBAC, whereas Argo supports finer-grained application permissions and a read-only mode for production systems that should not have direct access to Kubernetes resources. In terms of governance, Flux does not provide security and compliance checks as part of the software delivery strategy, while Argo CD provides a minimal capability to ensure that applications are checked before being deployed to the target cluster. In terms of ease of use, Flux does not provide an out-of-the-box UI, while Argo ships with an excellent one. As for complexity, Argo Rollouts offers a large number of options and configurations to cover various deployment needs, including traffic routing, experiments, analysis, anti-affinity configuration, and delay settings in grayscale deployments, but these options can also increase the complexity of using it.

Overall, Argo Rollouts and Flux Flagger are both powerful GitOps tools, but their key strengths and usage scenarios differ. Argo may be a better fit for environments that require more advanced security and policy integration as well as more comprehensive observability, while Flux may suit teams that want to simplify deployment and management. When choosing a tool, weigh your specific needs and usage scenarios.


Tencent Cloud CODING is a one-stop R&D management solution for enterprises, and many customers rely on our platform to manage their R&D processes. Since the service is in constant use, we must ensure that any release activity involves almost no downtime and is barely noticeable to users.

Our release strategy is to ship one major release each month, and these releases usually include new features. Before the official release we invite some users to try and test them; during this process, the experience of existing users must not be affected in any way, so we need fine-grained, per-user traffic control. Once the new version is verified, we release it to paying users in batches. And if anything goes wrong during the release, we need to be able to roll back quickly.

In addition to the monthly major releases, we ship lightweight patches (Livepatch) several times a day. These patches usually do not go through a grayscale release, but they must still be quick to roll back if a release fails. For some services we sometimes choose to skip grayscale release altogether, so our system also needs to support ignoring the grayscale stage for specified services.

In addition, the features in each monthly release need to be distributed to different channels, so our release system must be able to automatically collect these features and assemble them into release notes.

These requirements ensure not only that our services remain highly available, but also that we can minimize risk while keeping releases efficient.


Given the business demands above, automatic progressive release is clearly the most suitable deployment strategy. It opens the way to further optimizing the release process, achieving truly automatic, imperceptible releases, and maximizing the user experience.

Although Argo Rollouts and Flux Flagger already provide a large number of features for progressive release, given our requirements there are still several areas where neither tool is sufficient.

  1. **Workload intrusiveness:** With Argo and Flux, grayscale releases or A/B tests require adding corresponding logic at the service level to support traffic splitting, which increases the complexity of the application and the development burden to some extent.

  2. **Traffic smoothness:** During version upgrades with Argo and Flux, traffic switching is not completely smooth; jitter during the switch can affect service stability.

  3. **Configuration complexity:** The configuration of Argo and Flux is relatively complicated and takes developers and operations engineers more time to understand and master, raising the barrier to adoption.

  4. **Lack of full-link grayscale:** Argo and Flux do not fully support full-link grayscale, so it is difficult to meet complex grayscale release requirements such as fine-grained traffic routing based on user characteristics.

  5. **Version management:** When releasing with Argo and Flux, new versions have to be verified repeatedly, and the version management mechanism is relatively weak, which can be troublesome for frequent releases.

  6. **Delivery complexity:** Argo and Flux are designed for delivering service applications in the software engineering sense, and they mainly deal with technical issues. But delivery is not only about shipping services; it is also about delivering user value at the business level. This involves accurately and effectively delivering the features of a new version to target users and, while safeguarding the user experience, supporting flexible version control and release strategy decisions driven by product and project management needs. It requires deep business understanding and attention to the realization and optimization of business value beyond the technical implementation. This kind of complexity and nuance is not covered by the current design and implementation of either tool.


For the reasons above, common open source tools such as Argo and Flux in their current form do not meet our business needs, so we built our own automatic progressive release solution. Its process can be briefly outlined as follows:

The whole process forms a continuous reconcile loop. It starts with a preparation stage (pre canary workload) that configures and deploys the canary workload to be released. The proportion of the grayscale workload is then gradually increased, and traffic is progressively shifted toward the canary version through the pre canary traffic and post canary traffic stages.

After this step, the system verifies whether the grayscale release succeeded. If a problem or failure is detected, the system falls back to the pre canary workload stage; otherwise, the process continues.

Next comes the pre promote stage, which prepares to promote the new version to the official environment; during the promote phase the new version is deployed to production. This is followed by the pre route traffic stage, in which everything is prepared for switching all traffic to the new version, and once the route traffic phase begins, all traffic points to the new version.

In the pre finalize phase we prepare to finish all processes and clean up. After the whole reconcile loop ends, if a rollback is required, the pre rollback phase starts, followed by post rollback, restoring the system to its initial state.

In general, we have deeply customized and optimized the grayscale release process to make it more in line with our business scenarios.


The above process can also be analyzed from the resource perspective. The entire deployment can be split into the following steps:

  1. **Pre-deployment stage:** Create a new "canary" namespace. At this point the proportion of Pod workload inside it is 0%.

  2. **First grayscale stage:** Gradually increase the workload proportion in the canary namespace, for example raising the number of Pods to 15%, and adjust the traffic weights so that part of the traffic is routed to the canary namespace, in order to verify the performance and stability of the new version in this environment.

  3. **Second grayscale stage:** If the first stage passes verification, we continue to increase the workload proportion and traffic weight of the canary namespace for a wider range of verification.

  4. **Official deployment:** After the two rounds of grayscale testing pass, we deploy the new version in the original "production" namespace; this step is called promotion. After promotion, all traffic is switched to the production namespace, which means users start using the new version of the service.

  5. **Cleanup stage:** After the new version is fully deployed, we clean up the canary namespace.

Unlike traditional blue-green deployment, this process does not switch all traffic at once; instead, the original production environment is upgraded after two rounds of grayscale verification succeed. The idea behind this design is that in traditional blue-green deployment, after several releases, production has been switched to a new namespace each time, which can cause confusion about which namespace is currently serving production. By always upgrading the "production" namespace in place, we ensure that this namespace is always the one serving production, while the other namespaces are dedicated to grayscale testing, avoiding the potential risks caused by confusion.


Based on the solution designed above, and drawing on excellent open source projects, we built our own application release engine, Orbit. Orbit packages a series of advanced and flexible release strategies as product features, such as blue-green deployment and batched grayscale deployment. It can also skip the grayscale stage ("ignore grayscale") for specified services. These capabilities give Orbit a significant advantage in meeting complex release requirements.


Orbit's design philosophy focuses on practice and innovation , and it has a sound change review mechanism to ensure the accuracy and reliability of releases. While satisfying automation, the release process retains sufficient controllability, and users can flexibly manage and adjust the release process according to actual needs.

During the release process, Orbit supports pre-execution of changes to ensure the accuracy and reliability of the release process . Throughout the deployment process, every change is well documented and audited when required.

In the automated process, Orbit has specially added a manual confirmation link to reduce the risk of release through manual review and avoid problems caused by unexpected events in the automated process. This not only retains the efficiency brought by automation, but also takes into account the advantages of manual review in risk control.

Orbit also has the function of fast rollback. Once a problem is found during the release process or after the new version is launched, it can be quickly rolled back to the old version to minimize the impact of the problem.

Additionally, Orbit automatically manages and cleans up grayscale environments . After the new version is successfully released and running stably, Orbit can automatically clean up the grayscale environment that is no longer needed. This means teams don't have to do manual cleanup, reducing the maintenance burden while ensuring resource utilization and a clean environment.

In general, Orbit, with its comprehensive and flexible design, provides a one-stop solution for application release. From change execution to environment cleanup, each link reflects its high efficiency, stability and reliability .


In addition, Orbit has further deepened the understanding of business value delivery. It is not just a tool, but more like a bridge that closely combines software version management at the technical level with value creation at the business level. Under the management of Orbit, each release is no longer just code changes and image updates, but directly reflects the delivery of specific user stories and business value. This approach integrates technology and business and provides a new perspective on application release management.

In this way, Orbit can help enterprises more accurately understand the business value behind each version change, making the communication and understanding between technical teams and business teams smoother . In addition, by mapping business value to specific version changes, enterprises can more clearly understand and evaluate the actual effects of each release, thereby achieving more effective product iteration and continuous optimization. This innovative design concept not only enables Orbit to have significant advantages in the comprehensiveness and depth of application release management, but also achieves a significant improvement in the effect of value delivery.


Different tools have their own advantages and features. Here, we will conduct an in-depth comparison and analysis of the three tools Orbit, Argo Rollouts and Flux Flagger, especially focusing on our self-developed Orbit:

  1. **Supported workload types:** Orbit supports any workload type, not only Rollout- or Flagger-managed objects; by comparison, the workload types supported by Argo and Flux are more limited.

  2. **Intrusiveness:** Orbit is designed to be completely non-intrusive and integrates seamlessly with existing environments. Argo Rollouts, by contrast, is fully intrusive and may require significant changes to the existing setup.

  3. **Grayscale replica designation:** Orbit allows users to designate a grayscale replica to deploy into a specific namespace, providing greater flexibility.

  4. **Traffic smoothness:** Orbit's traffic-shifting smoothness can be tuned through configuration to meet different business needs, whereas Flux Flagger may not provide a completely smooth traffic switch during promotion.

  5. **Full-link grayscale support:** Orbit supports full-link grayscale release and can coordinate multiple workloads. Argo Rollouts and Flux Flagger mainly operate at the granularity of a single workload, so coordinating an entire call chain can be difficult.

  6. **Configuration complexity:** Orbit's basic functions can be operated through the UI, while advanced functions can be declared via YAML files, making the experience friendlier. Argo Rollouts' configuration can be more complicated, and Flux Flagger requires configuring the grayscale Service, TrafficRef, and so on.

  7. **Product observability:** Orbit offers strong observability; users can observe the whole environment globally through the UI. Argo Rollouts also provides detailed observation information but requires the command line, and Flux Flagger does not provide comparable functionality.

  8. **Version management:** Orbit's version management is based on the Open Application Model (OAM) combined with the business model. Argo Rollouts mainly manages versions by revision, and Flux Flagger does not support version management.

  9. **Integration of business value delivery:** Orbit has a clear advantage in value delivery at the business level. It integrates version change management and can automatically identify and associate changes, so that a released version is not just an image tag but also carries the delivery of user stories and business value. This goes beyond what Argo Rollouts and Flux Flagger offer and ties the release process more closely to business value.


In GitOps progressive delivery as practiced with Orbit, developers submit the code, configuration, and SQL script changes they are responsible for to the code repository. A business code commit triggers the continuous integration (CI) process, which builds a new application image and stores it in the image registry.

At this point, Orbit's GitOps automatic picking engine comes into play. It continuously watches the application artifacts and code repositories and automatically picks up image, configuration, and SQL changes. These changes are not only grouped into the same version, but also displayed visually within it, so that developers or release managers can review all changes before the version goes out.

The changes are then published through Orbit's All-in-One release engine. The engine aggregates all of an application's changes and releases them atomically and in versioned form to multiple environments. This ensures consistent, reliable releases and reduces problems caused by environment drift. Orbit also supports version-based atomic rollback, further improving the controllability and safety of the release process.

Overall, Orbit achieves an efficient, accurate, and flexible release process through its GitOps automatic picking engine and All-in-One versioned releases. Developers only need to focus on the code, configuration, and SQL scripts they are responsible for, while the consistency, reliability, and traceability of the release are guaranteed by Orbit, greatly simplifying the process and improving efficiency.


The practice of GitOps is not limited to the delivery phase; its concepts and methods also apply to other lifecycle phases such as application modeling and operations. As a cloud-native application lifecycle management tool, Orbit can implement the entire GitOps process end to end.


Orbit proposes a three-layer model for applications under a cloud-native architecture. The business layer declares the basic elements of the application, including services, service configuration, and database scripts; the delivery layer declares the application's delivery process; and the resource layer declares the application's runtime environment and infrastructure. Based on the Application as Code idea, the application model is stored in the code repository, enabling unified evolution, auditability, and traceability.

In addition, we provide a separation-of-concerns solution based on the OAM specification. A small number of cloud-native experts in the enterprise encapsulate Kubernetes specifications as service templates, and encapsulate Kubernetes ecosystem capabilities as operations plug-ins, covering native capabilities such as resource limits and probes as well as ecosystem capabilities such as monitoring and distributed tracing. These experts can, according to the actual situation, require that the production environment enable plug-ins such as resource limits, probes, and monitoring, thereby enforcing cloud-native standards. Developers then complete service creation by filling in business parameters in a visual form based on the service templates and operations plug-ins, which greatly reduces the impact on R&D of cloud-native complexity shifting left.
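Orbit's own application model is productized and not shown here, but as a rough, hedged illustration of this OAM-style separation of concerns, an open source OAM (KubeVela-flavoured) application declaration looks roughly like the sketch below. All names and values are hypothetical, and the exact schemas of component types and traits are defined by the platform's own ComponentDefinitions and TraitDefinitions: the component carries the business parameters a developer fills in, while traits carry the operations capabilities the platform team standardizes.

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: demo-service
spec:
  components:
    - name: demo-service
      type: webservice                   # service template defined by the platform team
      properties:                        # business parameters filled in by the developer
        image: registry.example.com/demo-service:1.4.0
        ports:
          - port: 8080
            expose: true
      traits:                            # operations plug-ins required for production
        - type: scaler
          properties:
            replicas: 3
```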


Facing the challenges of release efficiency and reliability, Orbit's answer is the GitOps automatic picking engine plus All-in-One versioned releases. Developers only need to focus on the code, configuration, and SQL scripts they own; Orbit's picking engine continuously watches the application artifact library and code repository and automatically picks up image, configuration, and SQL changes.

These application changes are then aggregated by the All-in-One release engine and displayed visually within the version. Developers or release managers can review all changes before the version is released and trigger the release once everything is confirmed. The versioned process releases the application to multiple environments atomically, ensuring consistency and reliability across environments, and Orbit can also perform atomic, version-based rollbacks.


With release efficiency and reliability addressed, the next problems are tool silos, separated perspectives, and data gaps in application operations. For these, Orbit provides an application-centric unified observation plane for hybrid clouds. Orbit's self-developed adapter services uniformly process monitoring alerts, logs, and trace data, and define a data standard for each type, making the observation tooling extensible: users can choose the observation tools officially supported by Orbit or integrate other tools themselves.

The adapter model hides the complexity of the observation tools. On this basis we redesigned the observation interface and built it into the application environment; combined with the application model, it presents users with an application-centric, developer-oriented observation product. Whether in a scenario mixing open source and cloud or a multi-cloud, multi-environment scenario, developers browse everything within the same application without switching back and forth between tools, which solves the tool-silo problem. At the same time, the data presented is application-centric: within an application, you normally see only the observation data belonging to that application, which solves the problem of separated perspectives.

To address data gaps, we need to establish correlations between observation tools, and the rise of OpenTelemetry gives us a good path. Under a microservice architecture, distributed tracing records the nodes a request passes through and their contextual information, giving system data business properties. By recording the trace's unique identifier, the TraceID, into the logging and monitoring systems, we can not only bridge the data gaps between observation tools but also give all observation data business attributes, building a unified, business-aware observation plane.


At present, Orbit is compatible with the open source ecosystem and supports Prometheus, Loki, EFK, SkyWalking, Jaeger, and other mainstream open source observation products, and it can also integrate cloud monitoring and log service products to safeguard the production environment.


Although GitOps is one of the important practices of DevOps, we understand that the field of DevOps is much more than that. Therefore, Orbit is only a manifestation of some product capabilities in the CODING platform.

Gartner, an authoritative organization, believes that in a cloud-native environment, we need to establish a service-oriented DevOps concept and provide DevOps support for service teams through platform-level capabilities. This model is considered to be the most suitable DevOps service model for microservices, and its core value proposition is " DevOps as a Service ".

In this context, the task of the R&D support team is to provide the development team with a stable and powerful engineering platform to help them shield the complex underlying infrastructure. This platform should include all kinds of tools required by developers, a self-service developer service interface, and reusable templates to meet the daily needs of the development team.

Here, Team Topologies theory provides a framework for understanding and designing development organizations. It emphasizes structuring development teams according to the organization's business goals and communication flow for optimal collaboration, and it helps us understand how different types of teams (for example, platform teams and development teams) can work together effectively through well-defined interaction modes (such as collaboration or X-as-a-Service).

The emergence of platform engineering is, in fact, the natural result of this evolution in team topology. In modern software development, because of growing technical complexity, accelerating change, and the widespread use of microservice architectures, development teams can no longer effectively attend to every technical detail. Platform engineering abstracts this general, low-level work and hands it to a dedicated platform team, so that service teams can focus on their main goals.

Therefore, the goal of the CODING platform is to serve as the core support force for platform engineering, to help each team better deliver value and realize an industrialized software production line by realizing an excellent team topology and providing strong platform engineering capabilities.


Origin blog.csdn.net/CODING_devops/article/details/131092817