Best architecture practices based on Docker and Kubernetes

[Editor's Note] Is it possible to build the coolest architecture based on Docker and Kubernetes? This article takes you on a journey toward the best architecture and explains the problems you will encounter along the way. Now, let's go!

How has the field of software development changed in the era of Docker and Kubernetes? Is it possible to use these technologies to build a once-and-for-all architecture? Is it possible to unify the development and integration processes when everything is "packaged" into containers? What are the requirements for such decisions? What restrictions do they bring? Do they make life easier for developers or, on the contrary, add unnecessary complexity?

It's time to clarify these and other issues in text and original illustrations!

This article will take you on the journey from real life to the development process to the architecture and finally back to real life, and will answer the most important questions you encounter at these stops along the way. We will try to identify some components and principles that should be part of the architecture and demonstrate some examples, but will not enter the realm of its implementation.

The conclusion of the article may make you upset or extremely happy. It all depends on your experience, your opinion of the three chapters, and even your mood when reading this article. You can comment or ask questions below and let me know what you think!

From real life to development workflow

In most cases, the development processes I have seen, or have had the honor of building, served a single goal: to shorten the time between the birth of an idea and its delivery to production, while maintaining a certain level of code quality.

It doesn't matter whether the idea is good or bad. Bad ideas come and go quickly: you just try one and then toss it onto the scrap heap. What is worth mentioning here is that rolling back from a bad idea can fall on the shoulders of the automated facilities that automate your workflow.

Continuous integration and delivery look like a lifesaver in the world of software development. What could be simpler? You have an idea, you have the code, so go for it! It would be flawless if not for one small problem: the process of integration and delivery is rather difficult to make independent of the technologies and business processes that are unique to your company.

However, despite the seeming complexity of the task, life throws up excellent ideas from time to time that bring us (well, me, for sure) a little closer to building a flawless mechanism that can be useful on almost any occasion. For me, the closest steps toward such a mechanism are Docker and Kubernetes, whose level of abstraction and way of thinking suggest that 80% of the problems can now be solved in pretty much the same way.

The remaining 20% of the problems obviously stay where they are, but thanks to this you can focus your creative talent on interesting tasks instead of dealing with repetitive routines. Take care of the "architectural framework" just once, and you can forget about the 80% of problems that have already been solved.

What does all this mean? And how does Docker solve the problem of development workflow? Let's look at a simple process, which is sufficient for most work environments:

[Sequence diagram: the development workflow from idea to production]
With the right approach, you can automate and integrate everything in the sequence diagram above and forget about it for the next few months.

Set up the development environment

A project should contain a docker-compose.yml file, which saves you from having to think about what you need to do and how to run the application/service on your local machine. A simple docker-compose up command should start your application and all its dependencies, populate the database with fixtures, mount the local code inside the container, enable code watching for on-the-fly compilation, and finally start responding to requests on the desired port. Even when setting up a new service, you don't have to worry about how to start it, where to push changes, or which framework to use. All of this should be described in advance in the standard instructions and provided by service templates for the different setups: frontend, backend, and worker.
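
To make this concrete, here is a minimal docker-compose.yml sketch. The service names, images, ports, and the fixture command are hypothetical placeholders, not a prescription:

    version: "3.8"
    services:
      backend:
        build: .                      # build the service image from the local Dockerfile
        ports:
          - "8080:8080"               # respond to requests on the desired port
        volumes:
          - ./src:/app/src            # mount local code into the container for on-the-fly recompilation
        environment:
          DATABASE_URL: postgres://app:app@db:5432/app
        depends_on:
          - db
      db:
        image: postgres:13            # a dependency started together with the application
        environment:
          POSTGRES_USER: app
          POSTGRES_PASSWORD: app
          POSTGRES_DB: app
      fixtures:
        build: .
        command: ["./scripts/load_fixtures.sh"]   # hypothetical script that populates the database with fixtures
        depends_on:
          - db

A single docker-compose up then brings up the whole stack described in the paragraph above.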

Automated testing

All you want to know about the "black box" (more on why I call the container that later in the article) is that everything inside it is intact: yes or no, 1 or 0. You can execute a limited number of commands inside the container, and docker-compose.yml describes all of its dependencies. You can easily automate and integrate these tests without paying too much attention to the implementation details.

For example, like this!

Here, testing means not only unit testing, but also functional testing, integration testing, testing of code style and duplication, checks for outdated dependencies, license compliance of the packages used, and so on. The point is that all of this should be encapsulated inside a Docker image.
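
As a sketch of what "everything encapsulated in a Docker image" can look like in practice (the service name and the commands are hypothetical choices), the same compose file can carry a dedicated service that runs the whole suite inside the container:

    services:
      tests:
        build: .                                  # the same image as the application itself
        command: sh -c "flake8 . && pytest"       # hypothetical: style checks plus unit/functional tests
        environment:
          DATABASE_URL: postgres://app:app@db:5432/app
        depends_on:
          - db

Running docker-compose run --rm tests then gives the CI system the same yes/no answer as a local run.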

System delivery

It doesn't matter when or where you want to install your project. The result, just like the installation process, should always be the same. It also makes no difference which part of the whole ecosystem you are installing or which Git repository you pull the code from. The most important property here is idempotence. The only things you should have to specify are the variables that control the installation process.

The following is my very effective algorithm in solving this problem:

  1. Build images from all the Dockerfiles (like this, for example)
  2. Using a meta-project, deliver these images to Kubernetes through the Kube API. Several input parameters are usually required to initiate a delivery:
    • The Kube API endpoint
    • A "confidential" object, different for each environment (local/test/pre-release/production)
    • The names of the systems to be deployed and the tags of the Docker images for these systems (obtained in the previous step)


As an example of a meta-project that covers all systems and services (in other words, a project that describes how the ecosystem is orchestrated and how updates are delivered to it), I prefer to use Ansible playbooks that integrate with the Kube API through this module. However, for sophisticated automation there may be other options, which I will discuss in detail later. Either way, you should consider a centralized/unified way of managing the architecture. Such an approach lets you conveniently and uniformly manage all services/systems and neutralizes any complications that the upcoming jungle of technologies and systems performing similar functions may bring.
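
A minimal sketch of one delivery step in such a meta-project, assuming the kubernetes.core.k8s Ansible module; the service name, registry, and variables are hypothetical and would be passed in per environment:

    # deliver.yml - apply one Deployment through the Kube API.
    # kube_api_endpoint, kube_api_token, target_namespace and image_tag
    # are supplied from the outside (e.g. with --extra-vars).
    - hosts: localhost
      gather_facts: false
      tasks:
        - name: Deploy some-service
          kubernetes.core.k8s:
            host: "{{ kube_api_endpoint }}"        # the Kube API endpoint
            api_key: "{{ kube_api_token }}"        # part of the per-environment "confidential" object
            state: present
            definition:
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: some-service
                namespace: "{{ target_namespace }}"
              spec:
                replicas: 2
                selector:
                  matchLabels: { app: some-service }
                template:
                  metadata:
                    labels: { app: some-service }
                  spec:
                    containers:
                      - name: some-service
                        image: "registry.example.com/some-service:{{ image_tag }}"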

Generally, the following installation environments are required:

  • "Test"-used to manually check or debug the system
  • "Pre-release"-for near real-time environment and integration with external systems (usually located in the DMZ instead of 测试环境)
  • "Production"-the actual environment of the end user

 

Continuous integration and delivery

If you have a unified way to test Docker images - the "black boxes" - you can assume that the results of these tests allow you to seamlessly (and with a clear conscience) integrate the feature branch into the upstream or master branch of your Git repository.

Perhaps the only deal breaker here is the sequence of integration and delivery. When there are no releases, how do you prevent a "race condition" on one system with a set of parallel feature branches?

Therefore, this process should only be started when there is no competition; otherwise, the "race condition" will keep haunting you:

  1. Update the feature branch to upstream (git rebase/merge)
  2. Build images from the Dockerfiles
  3. Test all the built images
  4. Start delivery and wait until the system has delivered the images built in step 2
  5. If the previous step fails, roll back the ecosystem to its previous state
  6. Merge the feature branch into upstream and push it to the repository


Any failure in any step should terminate the delivery process and return the task to the developer to resolve the error, whether it is a failed test or a merge conflict.

You can use this procedure to work with multiple repositories. You only need to perform each step once for all repositories (step 1 for code bases A and B, step 2 for code bases A and B, and so on) instead of repeating the whole process for each individual repository (steps 1-6 for code base A, steps 1-6 for code base B, and so on).
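
Purely as an illustration, here is a hedged GitLab-CI-style sketch of steps 1-6; the job names, scripts, and the deliver.yml playbook are hypothetical and will look different in your setup:

    stages: [update, build, test, deliver, merge]

    update_branch:                # step 1: update the feature branch to upstream
      stage: update
      script:
        - git fetch origin
        - git rebase origin/master

    build_images:                 # step 2: build images from the Dockerfiles
      stage: build
      script:
        - docker build -t registry.example.com/some-service:$CI_COMMIT_SHORT_SHA .
        - docker push registry.example.com/some-service:$CI_COMMIT_SHORT_SHA

    test_images:                  # step 3: test all built images
      stage: test
      script:
        - docker run --rm registry.example.com/some-service:$CI_COMMIT_SHORT_SHA ./run-tests.sh

    deliver:                      # steps 4-5: deliver, and let the playbook roll back on failure
      stage: deliver
      script:
        - ansible-playbook deliver.yml -e image_tag=$CI_COMMIT_SHORT_SHA

    merge_upstream:               # step 6: merge the feature branch into upstream
      stage: merge
      script:
        - git push origin HEAD:master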

In addition, Kubernetes allows you to roll out updates in batches for various AB tests and risk analysis. Kubernetes does this internally by separating services (access points) from applications. You can always balance the old and new versions of a component in the required proportions, which makes problem analysis easier and provides a path for a potential rollback.
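
A hedged sketch of what this separation looks like (all names and replica counts are hypothetical): one Service selects only the app label, so traffic is split roughly in proportion to the replica counts of the two Deployments behind it:

    apiVersion: v1
    kind: Service
    metadata:
      name: some-service
    spec:
      selector:
        app: some-service              # matches both versions below
      ports:
        - port: 80
          targetPort: 8080
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: some-service-v1
    spec:
      replicas: 9                      # ~90% of the pods run the old version
      selector:
        matchLabels: { app: some-service, version: v1 }
      template:
        metadata:
          labels: { app: some-service, version: v1 }
        spec:
          containers:
            - name: app
              image: registry.example.com/some-service:v1
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: some-service-v2
    spec:
      replicas: 1                      # ~10% of the pods run the new version
      selector:
        matchLabels: { app: some-service, version: v2 }
      template:
        metadata:
          labels: { app: some-service, version: v2 }
        spec:
          containers:
            - name: app
              image: registry.example.com/some-service:v2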

System rollback

One of the mandatory requirements of the architectural framework is the ability to roll back any deployment. In turn, this requires some explicit and implicit nuances. Here are some of the most important things:

  • The service should be able to set up its environment as well as roll back changes, e.g., database migrations, RabbitMQ schemas, and so on
  • If the environment cannot be rolled back, the service should be polymorphic and support both old and new versions of the code. For example: a database migration should not break the old versions of the service (usually the 2 or 3 previous versions)
  • Backward compatibility for any service update. Usually this means API compatibility, message formats, and so on


Rolling back the state in a Kubernetes cluster is fairly simple (run kubectl rollout undo deployment/some-deployment and Kubernetes will restore the previous "snapshot"), but for this feature to work, your meta-project should contain information about this snapshot. More complex delivery rollback algorithms can be daunting, although they are sometimes necessary.
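
As a hedged sketch (the deployment name is the same placeholder used above), keeping enough revision history in the Deployment spec is what makes this rollback possible, and the standard kubectl commands below inspect and restore those "snapshots":

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: some-deployment
    spec:
      revisionHistoryLimit: 10    # how many old ReplicaSets ("snapshots") Kubernetes keeps for rollbacks
      # selector and pod template omitted for brevity

    # Inspect and restore snapshots:
    #   kubectl rollout history deployment/some-deployment
    #   kubectl rollout undo deployment/some-deployment
    #   kubectl rollout undo deployment/some-deployment --to-revision=3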

Here is what can trigger the rollback mechanism:

  • High percentage of application errors after launch
  • Signals from key monitoring points
  • Failed smoke test
  • Manual mode-human factors

 

Ensure information security and audit

No workflow can miraculously "build in" invulnerable security and protect your ecosystem from external and internal threats, so you need to make sure that, at every level and in all subsystems, your architectural framework enforces the company's standards and security policies.

I will discuss all three levels of the solution in the following chapters on monitoring and alerting, which themselves are also the key to system integrity.

Kubernetes has a good set of built-in mechanisms for access control, network policies, event auditing, and other powerful tools related to information security, which can be used to build a good protection boundary to resist and prevent attacks and data leakage.
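
For instance, here is a minimal NetworkPolicy sketch (namespace, labels, and port are hypothetical) that lets only the API gateway open connections to a service, as one small building block of such a protection boundary:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-gateway-only
      namespace: ecosystem
    spec:
      podSelector:
        matchLabels:
          app: some-service            # the pods being protected
      policyTypes: [Ingress]
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app: api-gateway     # only the gateway may reach them
          ports:
            - protocol: TCP
              port: 8080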

From development process to architecture

The idea of tightly integrating the development process with the ecosystem deserves serious consideration. Adding this requirement to the traditional set of architectural requirements (resilience, scalability, availability, reliability, resistance to threats, etc.) can greatly increase the value of the architectural framework. This is a crucial aspect, and it led to the emergence of the concept called "DevOps" (Development and Operations), a reasonable step toward full automation and optimization of the infrastructure. However, with a well-designed architecture and reliable subsystems, DevOps tasks can be minimized.

Microservice architecture

There is no need to discuss in detail the benefits of a service-oriented architecture (SOA), including why services should be "micro". I will only say that if you have decided to use Docker and Kubernetes, then you most likely understand (and accept) that a monolithic architecture is difficult or even fundamentally wrong here. Designed to run a single process and work with persistence, Docker makes us think within the DDD (Domain-Driven Design) framework. In Docker, packaged code is treated as a black box with some exposed ports.

Key components and solutions of the ecosystem

Based on my experience designing systems with increased availability and reliability, there are several components that are critical to the operation and maintenance of microservices. I will list and discuss these components below. I will refer to them in the context of a Kubernetes environment, but you can also use my list as a checklist for any other platform.

If you (like me) come to the conclusion that it is worth managing these components as regular Kubernetes services, I suggest running them in a separate cluster other than "production", for example a "pre-release" cluster. It can save you time when the production environment is unstable and you desperately need the source of its images, code, or monitoring tools. This, so to speak, solves the chicken-and-egg problem.

Authentication

As usual, it all starts with access: servers, virtual machines, applications, office mail, and so on. If you are, or want to be, a client of one of the major enterprise platforms (IBM, Google, Microsoft), the access problem will be handled by one of the vendor's services. But what if you want a solution of your own, managed only by you and within your budget?

This list can help you determine the appropriate solution and estimate the amount of work required for setup and maintenance. Of course, your choice must comply with the company's security policy and be approved by the information security department.

Automated server provisioning

Although Kubernetes requires only a few components on physical machines/cloud virtual machines (Docker, kubelet, kube proxy, etcd clusters), the addition of new machines and cluster management still need to be automated. Here are some simple methods:

  • KOPS - this tool allows you to install a cluster on one of the two cloud providers (AWS or GCE)
  • Terraform - lets you manage the infrastructure of any environment and follows the IaC (Infrastructure as Code) philosophy
  • Ansible - a general-purpose tool for automation of any kind


Personally, I prefer the third option (with the integrated Kubernetes module), because it allows me to work both with servers and with Kubernetes objects and to perform any kind of automation. However, nothing prevents you from using Terraform and its Kubernetes module. KOPS does not work well with "bare metal", but it is still a great tool to use with AWS/GCE!
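
As a very rough Ansible sketch of "adding a new machine" (the package names, join command, and variables are hypothetical and depend on how your cluster was bootstrapped):

    # add-node.yml - prepare a host and join it to an existing cluster
    - hosts: new_nodes
      become: true
      tasks:
        - name: Install container runtime and Kubernetes node components
          ansible.builtin.package:
            name: [docker.io, kubelet, kubeadm]
            state: present

        - name: Join the node to the cluster
          ansible.builtin.command: >
            kubeadm join {{ kube_api_endpoint }}
            --token {{ join_token }}
            --discovery-token-ca-cert-hash {{ ca_cert_hash }}
          args:
            creates: /etc/kubernetes/kubelet.conf   # makes the task idempotent on re-runs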

Logging system

For any Docker container, the only way to make its logs accessible is to write them to STDOUT or STDERR of the root process running in the container. The service developer does not really care what happens to the log data next; the main thing is that the logs should be available when necessary, and preferably contain a history up to some point in the past. All responsibility for meeting these expectations lies with Kubernetes and the engineers who support the ecosystem.

In the official documentation, you can find instructions on basic (and good) strategies for handling logs, which will help you choose a service for aggregating and storing large amounts of text data.

Among the recommended services for a logging system, the same documentation mentions fluentd for collecting data (launched as an agent on each node of the cluster) and Elasticsearch for storing and indexing it. Even if you don't agree with the efficiency of this solution, I think it is at least a good start, given its reliability and ease of use.

Elasticsearch is a resource-intensive solution, but it can scale well and has ready-made Docker images that can run on a single node and a cluster of the required size.
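
A hedged sketch of the "agent on each node" part; the image tag and the Elasticsearch address are typical values that should be checked against the official fluentd manifests:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluentd
      namespace: logging
    spec:
      selector:
        matchLabels: { app: fluentd }
      template:
        metadata:
          labels: { app: fluentd }
        spec:
          containers:
            - name: fluentd
              image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
              env:
                - name: FLUENT_ELASTICSEARCH_HOST
                  value: elasticsearch.logging.svc    # where collected logs are shipped
              volumeMounts:
                - name: varlog
                  mountPath: /var/log                 # container logs written by the runtime on the node
          volumes:
            - name: varlog
              hostPath:
                path: /var/log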

Tracing system

Even if the code is perfect, failures still happen. Then you want to study them very carefully in production and try to understand: "what went wrong in production if everything worked on my local machine?". Slow database queries, improper caching, slow disks or connections to external resources, transactions in the ecosystem, bottlenecks, and under-scaled computing services are some of the reasons why you will have to track and estimate code execution time under real load.

OpenTracing and Zipkin are up to this task for most modern programming languages and do not add much extra burden after instrumenting the code. Of course, all the collected data should be stored in an appropriate place and used as one component.

The complexities that arise when instrumenting the code and forwarding the "Trace ID" through all the services, message queues, databases, and so on are solved by the development standards and service templates mentioned above. The latter also take care of the uniformity of the approach.
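
As a small illustration of the "appropriate place" for collected traces, a hedged sketch of running Zipkin inside the cluster (the namespace and replica count are arbitrary; 9411 is Zipkin's default port):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: zipkin
      namespace: tracing
    spec:
      replicas: 1
      selector:
        matchLabels: { app: zipkin }
      template:
        metadata:
          labels: { app: zipkin }
        spec:
          containers:
            - name: zipkin
              image: openzipkin/zipkin        # in-memory storage by default; a real setup needs a backend
              ports:
                - containerPort: 9411
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: zipkin
      namespace: tracing
    spec:
      selector:
        app: zipkin
      ports:
        - port: 9411
          targetPort: 9411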

Monitoring and alerting

Prometheus has become the de facto standard in modern monitoring systems, and more importantly, it has gained out-of-the-box support in Kubernetes. You can refer to the official Kubernetes documentation to learn more about monitoring and alerting.

Monitoring is one of the few auxiliary systems that must be installed inside the cluster, since the cluster is the entity being monitored. But monitoring of the monitoring system (pardon the tautology) can only be done from the outside (for example, from the same "pre-release" environment). In this case, cross-checking is a convenient solution for any distributed environment, and it does not complicate the architecture of a highly unified ecosystem.

The entire monitoring range can be divided into three completely logically isolated levels. The following are examples of what I think are the most important tracking points at each level:

  • Physical layer: network resources and their availability; disks (I/O, free space); basic resources of individual nodes (CPU, RAM, LA)
  • Cluster layer: availability of the main cluster systems on each node (kubelet, kube API, DNS, etcd, etc.); the amount of available resources and their even distribution; monitoring of permitted resources versus the resources actually consumed by services; pod reloading
  • Service layer: any kind of application monitoring, from database contents to API call frequency; the number of HTTP errors on the API gateway; queue sizes and worker utilization; multiple metrics for the database (replication lag, time and number of transactions, slow queries, etc.); error analysis for non-HTTP processes; monitoring of requests sent to the logging system (any request can be turned into a metric)


As for alert notifications at each level, I would recommend using one of the countless external services that can send notifications by email, SMS, or phone call. I will also mention another system, OpsGenie, which integrates tightly with Prometheus' Alertmanager.

OpsGenie is a flexible alerting tool that helps handle escalations, round-the-clock duty, notification channel selection, and much more. It also makes it easy to distribute alerts among teams. For example, different monitoring levels should send notifications to different teams/departments: physical - Infra + DevOps, cluster - DevOps, application - each of the relevant teams.
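
As an illustration of a "service layer" signal routed to the right team, here is a hedged Prometheus alerting rule sketch; the metric name, threshold, and labels are hypothetical:

    groups:
      - name: service-level
        rules:
          - alert: HighHttpErrorRate
            expr: >
              sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
              team: backend               # used by Alertmanager/OpsGenie routing
            annotations:
              summary: "More than 5% of requests on the API gateway are failing"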

API Gateway and single sign-on

To handle tasks such as authorization, authentication, user registration (external users - the company's customers), and other types of access control, you need a highly reliable service that maintains flexible integration with the API Gateway. There is no harm in using the same solution as for the "identity service", but you may want to separate these two resources to achieve a different level of availability and reliability.

The integration of internal services should not be complicated, and your services should not have to worry about authorization and authentication of users or of each other. Instead, the architecture and the ecosystem should have a proxy service that handles all communication and HTTP traffic.

Let us consider tokens as the most suitable way to integrate with the API Gateway, and hence with the whole ecosystem. This method works for all three access scenarios: from the UI, from service to service, and from external systems. The task of obtaining a token (based on a login and password) is then handled by the user interface itself or by the service developer. It also makes sense to distinguish between the lifetime of tokens used in the UI (a shorter TTL) and other cases (longer and custom TTLs).

The following are some of the problems solved by API Gateway:

  • Access to ecosystem services from outside and inside (services do not communicate with each other directly)
  • Integration with the single sign-on service: conversion of tokens and appending HTTPS requests with headers that carry the user identification data (ID, roles, and other details) for the requested service; enabling/disabling access control to the requested service based on the roles received from the single sign-on service
  • A single point of monitoring for HTTP traffic
  • Combining API documentation from different services (for example, merging Swagger json/yml files)
  • Managing routing for the entire ecosystem based on domains and the requested URIs
  • A single access point for external traffic, and integration with access providers
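
To make the routing item concrete (and only that item; token handling and monitoring live in the gateway itself), here is a minimal Kubernetes Ingress sketch with hypothetical hosts, paths, and service names:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: ecosystem-gateway
    spec:
      rules:
        - host: api.example.com              # routing by domain...
          http:
            paths:
              - path: /orders                # ...and by requested URI
                pathType: Prefix
                backend:
                  service:
                    name: orders-service
                    port: { number: 80 }
              - path: /users
                pathType: Prefix
                backend:
                  service:
                    name: users-service
                    port: { number: 80 }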

 

Event Bus and Enterprise Integration/Service Bus

If your ecosystem contains hundreds of services that work within one macro-domain, you will have to deal with thousands of possible ways in which services can communicate. To simplify the data flows, you should be able to distribute information to a large number of recipients when a specific event occurs, regardless of the context of the event. In other words, you need an event bus to publish events over standard protocols and to subscribe to them.

As an event bus, you can use any system that operates as a so-called broker: RabbitMQ, Kafka, ActiveMQ, and so on. Generally speaking, high availability and data consistency are essential for microservices, but due to the CAP theorem you will still have to sacrifice something to achieve a correct distribution and clustering of the bus.

Naturally, the event bus should be able to solve all kinds of communication problems between services, but as the number of services grows from a few hundred to thousands or even tens of thousands, even the best architecture based on an event bus will start to fall short, and you will need to look for another solution. A good example is the integration bus approach, which can extend the capabilities of the "dumb pipe - smart consumer" strategy described above.

There are dozens of reasons to use the " enterprise integration/service bus " approach, the purpose of which is to reduce the complexity of service-oriented architecture. Here are a few reasons:

  • Aggregate multiple messages
  • Split an event into several events
  • Synchronization/transaction analysis of system response to events
  • Coordination of interfaces, which is particularly important for integration with external systems
  • Advanced logic of event routing
  • Multiple integrations with the same service (from outside and inside)
  • Unscalable centralization of data bus


As open-source software for an enterprise integration bus, you may want to consider Apache ServiceMix, which includes several components that are critical to the design and development of this kind of SOA.

Database and other stateful services

Docker, and then Kubernetes, have changed all the rules of the game time and time again for services that require data persistence and work closely with disks. Some people say that such services should "live" the old way on physical servers or virtual machines. I respect this point of view and will not argue about its advantages and disadvantages, but I am fairly certain that this claim exists only because of a temporary lack of knowledge, solutions, and experience in managing stateful services in a Docker environment.

I should also mention that databases often occupy the center of the storage world, so the solution you choose should be completely ready to work in a Kubernetes environment.

Based on my experience and market conditions, I can distinguish the following groups of stateful services and examples of the most suitable Docker solution for each service:

  • Database management systems - PostDock is a simple and reliable solution for PostgreSQL in any Docker environment
  • Queue/message broker - RabbitMQ is a classic piece of software for building a message queuing system and routing messages. The cluster_formation parameters in the RabbitMQ configuration are essential for a cluster setup
  • Cache service- Redis is considered to be one of the most reliable and flexible data caching solutions
  • Full-text search-The Elasticsearch technology stack I have mentioned above was originally used for full-text search, but it is also good at storing logs and any work with large amounts of text data
  • File storage service-a general service group for any type of file storage and delivery (ftp, sftp, etc.)
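
To show that "stateful in Kubernetes" is workable at all, here is a minimal single-node StatefulSet sketch for PostgreSQL with a persistent volume (sizes and versions are hypothetical; a clustered setup such as PostDock is considerably more involved):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: postgres
    spec:
      serviceName: postgres
      replicas: 1
      selector:
        matchLabels: { app: postgres }
      template:
        metadata:
          labels: { app: postgres }
        spec:
          containers:
            - name: postgres
              image: postgres:13
              ports:
                - containerPort: 5432
              volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data   # data survives pod restarts
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi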

 

Dependency mirrors

If you have not yet encountered a situation where a package or dependency you need has been deleted or become temporarily unavailable, do not assume that it will never happen. To avoid unwanted unavailability and to provide security for the internal systems, make sure that building and delivering services requires no Internet connection. Mirror and copy all dependencies to the internal network: Docker images, rpm packages, source repositories, python/go/js/php modules.

These and any other types of dependencies have their own solutions. The most common one can be Googled by querying " private dependency mirror for ... ".

From architecture to real life

Whether you like it or not, your entire architecture is doomed to become obsolete sooner or later. It always happens: technologies become obsolete quickly (in 1-5 years), methods and approaches a bit more slowly (in 5-10 years), and design principles and fundamentals only occasionally (in 10-20 years), but it is inevitable all the same.

Taking the obsolescence of technology into account, you should always try to keep your ecosystem at the peak of technological innovation, plan and roll out new services to meet the needs of developers, the business, and end users, promote new utilities and processes to your stakeholders, and deliver knowledge to move your team and the company forward.

Keep yourself at the top of the ecological chain by integrating into the professional community, reading relevant literature and communicating with colleagues. Pay attention to the new opportunities in the project and the correct use of new trends. Experiment and apply scientific methods to analyze research results, or rely on the conclusions of others you trust and respect.

Unless you are an expert in the field, it is difficult to prepare for fundamental changes. All of us will witness only a few major technological changes throughout our careers, but it is not the amount of knowledge in our heads that makes us professionals and takes us to the top; it is the openness of our thinking and the ability to accept transformation.

Back to the question in the title: "Is it possible to build a better architecture?". The answer is obvious: no, not "once and for all", but you should actively strive for it, and at some point, for a "very short time", you will definitely succeed!

PS: The original text was written in Russian, so I would like to thank my colleague at Lazada, Sergey Rodin, for his excellent help with the translation!

Original link: The best architecture with Docker and Kubernetes — myth or reality? (Translation: Hu Zhen)
