Implementing an Algorithm Training System on Kubernetes (Architecture and Implementation Plan)

Project Background

To meet xx University's needs for project management and control, personnel coordination, progress tracking, task allocation, resource allocation, data analysis, and achievement management in scientific research, teaching, and practical training, and to provide commonly used workbenches such as integrated project coordination, big data collection for practical teaching, data crowdsourcing services, data cleaning and governance, and a data analysis platform, it is necessary to build an integrated teaching, research, and training platform covering system management, project management, course management, algorithm development, data crowdsourcing, literature search, approval management, and an academic circle.

Algorithm training system

Introduction

Algorithm training is one of the core functions of this system. It provides users with a deep learning training platform with a variety of common datasets built in, and supports both code-based and graphical programming. Code can be written in three languages: Python, Scala, and R. Graphical programming lets users complete modeling by dragging and dropping components and selecting parameters, then start training without writing any code. This greatly lowers the barrier for developers and improves development efficiency.

Functional Requirements

  1. Create algorithm training tasks
  2. Write code online (supports R, Scala, and Python)
  3. Build algorithm models visually
  4. Validate that an algorithm model is well-formed
  5. Convert the graphical view of an algorithm model into a code view
  6. Run algorithm training tasks online
  7. Monitor code running status and logs
  8. Elastic computing power scheduling
    • Algorithm tasks can be orchestrated: execution order, maximum number of runs, etc.
    • Server resources can be scheduled automatically to provide a running environment for algorithm tasks
    • The maximum resources and maximum running time of each algorithm task are controllable
    • The maximum resources consumed by task execution can be capped
    • Computing resources are automatically reclaimed after a task finishes

Non-Functional Requirements

  1. Security: the resources and system permissions available to user programs must be limited; that is, a user program's running time, memory, CPU, and thread count are all capped, so it cannot damage the system.

Responsibilities

The figure below is a simple execution logic diagram. After the user submits code, an independent machine is created in the server cluster and the code is executed on that machine.

In short, my responsibility is to elastically schedule server resources to provide an environment for task execution, and to orchestrate task execution.

[Figure: execution flow after a user submits code]

Design ideas

This system is similar to an OJ (online judge) system. The execution flow is: the user submits code, the client sends it to the server over HTTP, and the server runs the code on a judging machine and returns the result. User-submitted code is not necessarily safe: it may spawn processes or create files without bound to exhaust the judging machine's resources, or connect to a remote service and open a backdoor for an attacker. To keep the server safe, we need to limit both the resources user programs use and the system calls they make.

Resource Constraints

Limiting the resources a program may use while running means capping memory, running time, and the number of processes and threads. This is generally done with setrlimit().

setrlimit() is a C library function; for usage details, see https://blog.csdn.net/u012206617/article/details/89286635

System Call Restrictions

All of a program's system-level operations, such as input and output, creating processes, and obtaining system information, go through system calls (System Call). Restricting system calls therefore restricts dangerous behaviors, such as reading the system directory structure or obtaining system privileges. If a program with unrestricted system calls runs alongside the server or other programs, it may compromise the system's security and disturb the other programs' operation.

Currently the two common schemes for restricting system calls are ptrace() and seccomp. The former works by notifying the main program every time the target program attempts a system call; if a dangerous call is detected, the program is killed in time. ptrace() generates two interrupts per system call (one on entry and one on return), which hurts efficiency. By comparison, seccomp may be the better choice.

seccomp (secure computing mode) is a security mechanism supported by the Linux kernel. Linux exposes a large number of system calls directly to user-mode programs, but not all of them are needed, and unsafe code that abuses system calls poses a security threat to the system. Through seccomp we can restrict a program to certain system calls, reducing the system's exposed surface while putting the program into a "safe" state.

Judging systems that use ptrace() include HustOJ and UOJ; systems that use seccomp include QDUOJ and TJudger.
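(Looking ahead to the Kubernetes-based design described later: in k8s, seccomp can be applied declaratively through a pod's securityContext instead of hand-written filter code. A minimal sketch, with illustrative names and image:)

apiVersion: v1
kind: Pod
metadata:
  name: sandbox-demo        # illustrative name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault  # apply the container runtime's default seccomp filter
  containers:
  - name: runner
    image: python:3.10-slim # assumed runtime image
    command: ["python", "run.py"]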

Docker-based sandbox

Using hand-written code to limit a program's resources and system calls requires complex logic, and running user programs alongside the web server still carries large security risks. We can consider another line of thought for keeping the system safe: isolate the target program from the system environment in a sandbox (SandBox).

What is a sandbox?

A sandbox is a virtual system program whose environment is independent for each running program and does not affect the existing system. The Java virtual machine (JVM), in fact, uses a sandbox mechanism.

Docker is an open source application container engine that lets developers package an application and its dependencies into a lightweight, portable container. Containers use a full sandbox mechanism and have no interfaces to one another.

Approach

We only need to create a Docker container each time a user program runs: pass in the input data and the user program, run the program in the container, start a monitoring thread to watch its running status, and write the output to a specific directory that is mounted to the outside world.

Docker's image mechanism can provide the appropriate runtime environment for each language.

For languages that need a runtime environment (scripting languages such as Python), we can map the system's library directories into the container. Security is generally not a concern here: user programs run in the container as an ordinary user with no write permission on those directories, and the library directories contain no sensitive information such as configuration files.

Docker's networking features can also be used to control the target program's network access inside the container.

For resource limits, we can use the resource limits Docker itself provides.

Advantages and disadvantages

Advantages:

  • No need to write resource-limiting code yourself; Docker's built-in resource limits can be used directly

  • Containers are isolated from each other, giving higher security

  • No need to restrict system calls

Disadvantages:

  • Container creation has a certain overhead; testing showed each run costs roughly one extra second.

For this system, since deep learning training itself takes a long time, the overhead of container creation is basically negligible.

[Figure: Docker-based sandbox execution flow]

Issues to Consider

  1. The servers form a cluster; how should the created Docker containers be managed?

  2. How are Docker resources reclaimed after the user program finishes running?

  3. In a cluster environment, how does a container mount the resources it needs (the machine running the container may not be the machine holding the resources)?

  4. How do we monitor program execution status in real time (failed, succeeded, running)?

Orchestrating Docker containers with k8s

In essence, the problems above all come down to managing containers across machines, that is, container orchestration, which brings us to k8s (Kubernetes), a very popular technology in the container orchestration world. It is the open source descendant of Borg, Google's large-scale container management system, and provides mechanisms for application deployment, maintenance, and scaling.

Introduction to Kubernetes

Kubernetes is a complete platform for supporting distributed systems. It provides multi-layer security protection and admission mechanisms, multi-tenant application support, transparent service registration and service discovery, built-in load balancing, strong fault detection and self-healing, rolling upgrades and online scaling, an extensible automatic resource scheduler, and multi-granularity resource quota management. It also ships with comprehensive management tools covering development, testing, deployment, and operations monitoring, making it a one-stop platform for building distributed systems.

k8s uses the pod as its smallest scheduling unit for orchestrating containers; containers are encapsulated inside pods. A pod consists of one or more containers that share the same life cycle and are scheduled onto a Node together as a whole, sharing their environment, storage volumes, and IP space.
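As a minimal illustration of containers in one pod sharing a storage volume (all names here are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: shared-demo
spec:
  volumes:
  - name: shared-data
    emptyDir: {}            # scratch volume shared by both containers
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo hello > /data/msg; sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: reader
    image: busybox
    command: ["sh", "-c", "sleep 5; cat /data/msg; sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /data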

We generally don't create pods directly; instead we let a k8s controller create and manage them. In a controller you can define how the Pod is deployed: how many replicas, which Nodes it should run on, and so on. Different controllers have different characteristics and suit different business environments. Common controllers include Deployment, DaemonSet, Job, CronJob, and StatefulSet. Their typical scenarios are:

  1. Deployment: suitable for stateless service deployment

  2. StatefulSet: suitable for stateful service deployment

  3. DaemonSet: runs a copy on every Node once deployed; typical scenarios include:

    running a cluster storage daemon on each Node

    running a log collection daemon on each Node

  4. Job: runs a task to completion, once or several times

  5. CronJob: runs tasks periodically or on a schedule

k8s also provides resource management and scheduling: you only need to set parameters in the template file to limit a pod's compute resources, externally referenced resources, and resource objects.

To learn more about k8s, you can read the official documentation, which is very complete.

Implementation ideas

Based on the introduction above, a general implementation idea takes shape: each time a user program runs, create a pod, map the needed resources into it, and execute the user's program code inside it. Container orchestration itself is taken care of for us.

As mentioned above, pod creation is generally left to a controller. Given the characteristics of the five common controllers, the Job controller is clearly the one that best matches our business needs. Let's look at it in detail.

A Job controller can run three types of tasks:

  • One-off task: usually starts only one Pod (unless the Pod fails; a failed creation is retried). Once the Pod terminates successfully, the Job is complete.
  • Serial tasks: run a task several times in a row; when one run completes, the next starts, until all runs are done.
  • Parallel tasks: run multiple instances concurrently.

Note: serial and parallel execution here mean running the same task multiple times.

For our business needs, the one-off task fits best. Each time a user runs a program, create a Job that runs a one-off task and let its pod execute the user program; when the program completes, the pod terminates and the Job is done. You can also set the spec.ttlSecondsAfterFinished parameter so the Job is deleted a while after the task completes (generally you want to reserve some time, for example to inspect the execution log).

[Figure: a one-off Job running a user program in a pod]
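A minimal sketch of such a one-off Job (the name, labels, image, and TTL value are illustrative, not this system's actual configuration):

apiVersion: batch/v1
kind: Job
metadata:
  name: algo-task-42            # hypothetical per-run name
  labels:
    task-id: "42"               # label used later to watch this job's pod
spec:
  backoffLimit: 0               # do not retry a failed user program
  ttlSecondsAfterFinished: 600  # delete the job 10 minutes after it finishes
  template:
    metadata:
      labels:
        task-id: "42"
    spec:
      restartPolicy: Never      # the pod must not restart the user program
      containers:
      - name: runner
        image: python:3.10-slim
        command: ["sh", "/mnt/code/run.sh"]   # wrapper script, shown later

In practice the web server would create this object through the k8s API rather than by hand, but kubectl apply -f job.yaml is enough to try out the shape.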

Resource Constraints

Regarding resource limits: just as Docker provides them, k8s also provides resource management and scheduling. You only need to set parameters in the template file to limit a pod's compute resources, externally referenced resources, and resource objects. The template looks like this:

resources:              # resource limits and requests
  limits:               # resource limits
    cpu: String         # CPU limit in cores, used for the docker run --cpu-quota parameter;
                        # decimals are allowed, e.g. 0.1, which is equivalent to 100m (100 millicores)
    memory: String      # memory limit in MiB/GiB/MB/GB, used for the docker run --memory parameter
  requests:             # resource requests
    cpu: String         # CPU request, the CPU available to the container at startup, used for docker run --cpu-shares
    memory: String      # memory request, the memory available to the container at startup
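For example, a concrete setting that caps a container at one CPU core and 512 MiB of memory (the values are illustrative):

resources:
  limits:
    cpu: "1"        # hard cap: one CPU core
    memory: 512Mi   # hard cap: 512 MiB
  requests:
    cpu: 500m       # half a core reserved at scheduling time
    memory: 256Mi

The requests values influence where the scheduler places the pod; the limits values are what actually stop a runaway user program.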

But at this point there are still two unresolved issues:

  1. The resource server and the k8s cluster are separate machines; how does a pod access the resources a program needs to run (such as the user's code)?
  2. How do we monitor code execution status?

Data storage and sharing between containers

The first question first. In Docker we use mounts to map host files into containers, while k8s defines its own storage volume (volume) abstraction with very powerful data storage features: data can be injected into pods through configuration, shared between containers within a pod, and, for pods on different machines, shared by defining storage volumes.

Storage volumes in k8s fall into four main categories:

  • Local storage volumes: mainly for data sharing between containers in a Pod, or data storage and sharing between a Pod and its Node
  • Network storage volumes: mainly for data storage and sharing across multiple Pods or multiple Nodes
  • Persistent storage volumes: built on network storage volumes; users need not care which storage system backs the volume, only declare how many resources it should consume
  • Configuration storage volumes: mainly for injecting configuration information into Pods

This system chose the NFS network storage volume, for the following reasons:

  1. The provider runs on multiple machines, i.e. multiple Nodes need to share data storage.
  2. StorageClass configuration is overly complicated, and a plain network storage volume meets the requirements.
  3. Server operations and maintenance and the k8s cluster setup and use are all handled by one person; persistent volumes only decouple storage users from storage providers, so they are unnecessary here.

Network storage volumes

Here the concept of a network storage volume deserves a mention: network storage volumes solve data storage and sharing across multiple Pods or multiple Nodes. k8s supports many cloud providers' products and network storage solutions, such as NFS, iSCSI, GlusterFS, RBD, Flocker, etc.

This system uses NFS (Network File System), which allows computers on a network to share resources over TCP/IP. Through NFS, a local NFS client application can read and write files on the NFS server directly, as if they were local files.

We only need to make the resource server the NFS server and the k8s cluster machines the NFS clients; network storage volumes then let Pods in k8s share the resource server's directories and files. NFS can secure the exposed directories through its configuration file, for example by restricting which hosts may access them and with what read/write permissions.

[Figure: sharing the resource server's directories with the k8s cluster via NFS]
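A pod-spec sketch of such an NFS mount (the server address and export path are illustrative):

volumes:
- name: task-data
  nfs:
    server: 10.0.0.10        # hypothetical resource server address
    path: /exports/tasks     # directory exported by the NFS server
containers:
- name: runner
  image: python:3.10-slim
  volumeMounts:
  - name: task-data
    mountPath: /mnt/tasks    # pods on any Node see the same files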

Monitoring program execution status

First, how do we judge the user program's state? A run can roughly be in one of four states: queued, running, failed, or completed. We mentioned earlier that a monitoring thread is started to watch the program's state in the container.

Each stage of a pod's life corresponds to a phase value, and from these phase values we can judge the user program's running status.

  • Pending: the Pod has been accepted by the k8s system, but one or more container images have not been created yet; for example, time spent on scheduling, or on downloading images over the network, keeps the containers from being created
  • Running: the Pod has been bound to a Node and all containers have been created; at least one container is still running, or is starting or restarting
  • Succeeded: all containers in the Pod have terminated successfully and will not be restarted
  • Failed: all containers in the Pod have terminated, and at least one container terminated in failure; that is, the container either exited with a non-zero status or was terminated by the system
  • Unknown: the Pod's status cannot be obtained for some reason, usually due to a communication error with the Pod's host

There is a catch here: these are the pod's states. How do we know that when the pod reports Succeeded, the code running in its container also succeeded? That is, how do the pod's Succeeded and Failed states map onto the user program's state? The key is the phrase "the container either exited with a non-zero status or was terminated by the system": apart from the system killing the container outright, Failed appears only when the container exits with a non-zero status. So we only need to watch the user program inside the container and, if it fails, make the process exit with a non-zero status. For example, the following shell commands:

python run.py
# check whether the command succeeded
if [ $? -ne 0 ]; then
    echo "================ execution failed ================"
    exit 1   # exit non-zero so the pod ends up in the Failed phase
fi

Because a one-off Job starts one pod with one container, we only need to run this script when the container starts. In k8s, a pod's command and args parameters set the container's startup command and argument list, which gives exactly the effect we want (see the sketch after the rules below).

Note: command and args override the ENTRYPOINT and CMD defined in the original Docker image. The rules are:

  • If neither command nor args is provided in the template, the defaults defined in the Docker image are used.

  • If command is provided but args is not, only the provided command is used; both the default ENTRYPOINT and the default CMD defined in the Docker image are ignored.

  • If only args is provided, the default ENTRYPOINT defined in the Docker image runs with the provided args.

  • If both command and args are provided, the image's defaults are ignored and the provided command and args run together.
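A sketch of wiring the wrapper logic in through command and args (the image and paths are illustrative):

containers:
- name: runner
  image: python:3.10-slim
  command: ["sh", "-c"]          # overrides the image's ENTRYPOINT
  args:                          # overrides the image's CMD
  - |
    python /mnt/code/run.py > /mnt/logs/run.log 2>&1
    if [ $? -ne 0 ]; then
      echo "================ execution failed ================" >> /mnt/logs/run.log
      exit 1
    fi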

Although the program's status can be judged from the pod's status, we could not yet monitor the pod's status in real time. Three implementation plans were researched at the time:

  1. Periodically poll the pod status
  2. Use k8s lifecycle callbacks
  3. Use a k8s watch

Periodically poll pod status

The idea is very simple: check the corresponding Job's status at fixed intervals. The drawbacks are large, though: there is a delay, and polling is expensive for both the k8s server and the web server. With many jobs it greatly increases server load, so this option is out.

Use k8s lifecycle callbacks

k8s has two lifecycle events, PostStart and PreStop, which run callbacks right after the container is created and just before the container ends, respectively.

  • PostStart: triggered immediately after the container is created. If the callback fails, the container is terminated, and whether it restarts depends on the container's restart policy.

  • PreStop: triggered before the container ends. The container is terminated regardless of the callback's result.

There are two ways to implement a callback: Exec and HttpGet.

  • The Exec callback executes a specific command or operation inside the container; if the command succeeds, the callback is considered successful, otherwise it is treated as abnormal and the kubelet handles the container according to its restart policy.
  • The HttpGet callback issues a specific HTTP GET request and judges success by the returned HTTP status code.

We can use this callback mechanism to send an HTTP request right after the container is created, notifying the server to change the task status to running.

Before the container ends, another HTTP request passes the program's result (failure or success) to the server, which then updates the status.

The following figure shows the pod's lifecycle events.

[Figure: pod lifecycle events]
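A sketch of the two hooks as just described (the callback host, port, and paths are illustrative):

containers:
- name: runner
  image: python:3.10-slim
  lifecycle:
    postStart:
      httpGet:                    # right after creation: mark the task as running
        host: web-server.internal # hypothetical web server address
        port: 8080
        path: /tasks/42/running
    preStop:
      httpGet:                    # just before the container ends: notify the server
        host: web-server.internal
        port: 8080
        path: /tasks/42/finished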

Problem

Although this solution seemed feasible, during implementation we found that the program's execution result cannot be carried back directly in the GET request; the callback can only tell the server to go query the pod's status. But at that moment the pod has not stopped and is still running, so we cannot tell whether the program succeeded. Later we thought of asynchronously starting a watch as the pod is stopping, so that the pod's subsequent states can still be observed.

Use watch to monitor pod status

In k8s, the status changes of pods and other components can be observed continuously through the watch interface. For example, when watching the pod list, any pod status change produces output:

kubectl get pod -w   # or --watch

The general idea is to use the Java client provided by k8s to watch the pod's status (when creating a Job, a label is attached that uniquely identifies the Job and its pod). Since watching is a continuous process, the Java client calls the k8s watch API through OkHttpClient, which, per the official example, disconnects after 10 seconds. We can configure OkHttpClient's timeout: it should not be too long, or connections that are not closed in time pile up and put pressure on the k8s apiserver; but if it is too short, the connection may close before the program finishes executing.

So another question arises: how do we set OkHttpClient's timeout reasonably while ensuring the connection is not closed before the program ends?

Problem

The initial solution was to set OkHttpClient's timeout slightly longer than the user program's time limit: a program that exceeds its time limit destroys the pod anyway, so a closed connection no longer matters. However, this is an algorithm training system: execution times are generally long and the user programs' time limits are generous. Holding a connection open that long really is a waste of resources.

Lifecycle callback + watch monitoring

But if we combine the watch with the previous strategy and only start watching the pod shortly before it ends (by which time the program has finished executing), the connection time can be cut drastically. So in the end we adopted the combined strategy of lifecycle callbacks plus watch monitoring.

[Figure: lifecycle callback combined with watch monitoring]

So far the design and implementation of the sandbox environment for user program execution have been explained, but a few details remain:

  1. How to limit the number of concurrently running jobs?
  2. How to show the user the program's execution results?

Limiting the number of concurrent jobs

Besides limiting the resources of a single user program, we also need to cap the number of programs running overall; otherwise many users running programs at once can still exhaust the server's resources.

Approach

The simplest way to limit the number of running jobs is to query how many jobs are currently running at submission time, and create the new job only if the count is below the maximum.

[Figure: checking the running job count before creating a job]

What should happen once the job count exceeds the limit? Simply telling the user the run failed is clearly not acceptable. It is better to have somewhere to store the user's request, put the run request into a queued state, and let it continue once other programs finish. This logic naturally suggests a queue, and message queue middleware solves the problem well.

The specific process is:

  • The user sends a run request to the server, which puts the request straight into the message queue

  • When a consumer picks up a request, it first queries the k8s server for the current job count

  • If the count is below the limit, the message is consumed and a job-creation request is sent to the k8s server

  • If the count is at or above the limit, the message is put back at the head of the queue and consumed again after a delay

[Figure: message-queue-based job scheduling flow]

The steps above meet the requirement, but there is room for optimization. As described, every check queries the k8s server for the job count; when the limit is exceeded, the message queue keeps retrying, querying k8s each time, which needlessly increases the load on the k8s server. Instead, we can maintain the running-job count ourselves rather than querying k8s every time: cache the job quota in a basic Redis string, decrement it by one whenever a job starts, and increment it by one when a job is destroyed.

Note, however, that the steps from decrementing the counter to sending the job-creation request must be atomic, so a distributed lock is needed here. If job creation fails, rolling the count back must also be considered. And since, as noted earlier, jobs are destroyed automatically, we must watch the pod and restore the count when it fails or completes.
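(As an extra server-side safety net, distinct from the queue-plus-Redis scheme actually used here, k8s itself can cap the number of Job objects in a namespace with a ResourceQuota. A sketch with illustrative values:)

apiVersion: v1
kind: ResourceQuota
metadata:
  name: job-quota
  namespace: algo-training       # hypothetical namespace for user runs
spec:
  hard:
    count/jobs.batch: "20"       # at most 20 Job objects may exist at once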

How to inform the user of the execution result

We mentioned above that the user program's output is saved to a specific directory, but how does the user get to see it?

There are two options for this:

  1. Use websocket to hold a long connection, read the log file at regular intervals, and push logs in real time.
  2. Let the client request the log content directly; each request returns the latest log.

Because programs in this system generally run for a long time, a long-lived connection costs a lot of resources to maintain; past development experience shows such connections drop easily due to network fluctuations or user actions, and the development steps are cumbersome. Since the program logs have no real-time requirement, the second option was adopted.

MinIO object storage

Some design also went into the storage system for logs and code files.

This system has a dedicated resource server running a distributed object store built with MinIO, an open source object storage project. MinIO is designed as a standard solution for private cloud object storage and is well suited to large volumes of unstructured data such as pictures, videos, documents, log files, backup data, and container/virtual machine images.

For setup details, see the blog https://blog.csdn.net/R1011/article/details/124399434

As mentioned above, NFS joins the k8s cluster and the resource server, sharing the resource server's directories with the cluster. We upload the user's code files to MinIO's resource directory and share it with the k8s cluster, so the user's code directory can be mapped in through a data volume at run time. At startup we also mount a second data volume for the user's specific log directory; the program writes its logs there, so the logs are synchronized to the resource server. When the client queries the program's logs, it simply fetches the log file through MinIO.
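Putting this together, the runner pod's volume layout might look like the following sketch (the server address and per-task paths are illustrative):

volumes:
- name: user-code
  nfs:
    server: 10.0.0.10              # hypothetical resource server
    path: /data/minio/code/42      # hypothetical per-task code directory
- name: user-logs
  nfs:
    server: 10.0.0.10
    path: /data/minio/logs/42      # hypothetical per-task log directory
containers:
- name: runner
  image: python:3.10-slim
  command: ["sh", "/mnt/code/run.sh"]
  volumeMounts:
  - name: user-code
    mountPath: /mnt/code
    readOnly: true                 # user code is read-only in the container
  - name: user-logs
    mountPath: /mnt/logs           # run.log written here is visible via MinIO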

Summary

Finally, a picture to summarize the architecture of this algorithm training system.

[Figure: overall architecture of the algorithm training system]
