Scheduling System Design Elements

 

 

Scheduling system time granularity: minutes, hours, days, weeks, months, quarters, years

Upstream/downstream dependency types by time granularity: upstream coarser than downstream, upstream and downstream equal, and upstream finer than downstream
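As one illustration of a dependency where the upstream task has a finer time granularity than the downstream task, the sketch below expands a daily instance into the hourly upstream instances it would wait on. The function name and the choice of hourly/daily granularities are assumptions made for the example, not part of the original design.

```python
from datetime import date, datetime, timedelta


def upstream_hours_for_day(day: date) -> list[datetime]:
    """Return the 24 hourly upstream instances a daily task for `day` waits on.

    Hypothetical helper: assumes an hourly upstream and a daily downstream.
    """
    start = datetime(day.year, day.month, day.day)
    return [start + timedelta(hours=h) for h in range(24)]


hours = upstream_hours_for_day(date(2015, 12, 7))
print(hours[0], hours[-1], len(hours))
```

The reverse case (a coarse upstream feeding a fine downstream) would map many downstream instances onto one upstream instance instead.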

Dependency classification: common dependencies (upstream and downstream); deadline dependencies (for cumulative tasks that accumulate data up to a given day; for example, a weekly deadline task depends on the data of its upstream tasks from Monday up to the current day); self-dependencies (a task depends on the result of its own previous run)
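The weekly deadline dependency can be sketched as a date calculation: given the day a cumulative task runs for, list every upstream day from that week's Monday through the run day. The function name is hypothetical, and the sketch assumes the week starts on Monday, as the text describes.

```python
from datetime import date, timedelta


def deadline_dependency_dates(run_day: date) -> list[date]:
    """List the upstream days a weekly deadline task depends on.

    Assumption: weeks start on Monday (date.weekday() == 0).
    """
    monday = run_day - timedelta(days=run_day.weekday())
    span = (run_day - monday).days
    return [monday + timedelta(days=i) for i in range(span + 1)]


# 2015-12-09 is a Wednesday, so the task waits on Monday through Wednesday.
print(deadline_dependency_dates(date(2015, 12, 9)))
```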

Properties of dependencies: offset, step size, left and right compensation.

 

Timing: Execution order is critical, especially for cumulative tasks and self-dependent tasks. If instances run out of order, the data they produce is also out of order and has no value.
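A minimal sketch of how a scheduler might enforce this ordering for a self-dependent daily task: an instance is allowed to run only if the previous day's instance has already succeeded. The function name, the set of succeeded days, and the daily granularity are all assumptions for illustration.

```python
from datetime import date, timedelta


def runnable_in_order(instance_day: date, succeeded: set[date], first_day: date) -> bool:
    """Return True if a self-dependent daily instance may run now.

    Hypothetical check: the first instance has no predecessor; every later
    instance requires the previous day's instance to have succeeded.
    """
    if instance_day == first_day:
        return True
    return (instance_day - timedelta(days=1)) in succeeded


done = {date(2015, 12, 7)}
print(runnable_in_order(date(2015, 12, 8), done, first_day=date(2015, 12, 7)))
print(runnable_in_order(date(2015, 12, 9), done, first_day=date(2015, 12, 7)))
```

In the usage above, the instance for 2015-12-09 is held back because the 2015-12-08 instance has not yet succeeded.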

 

Terms: task, task data version, task run instance

 

---------------------------

Resource Scheduler:

1. When managing the load information of each machine in the cluster, such as CPU and memory usage, set separate thresholds for each machine. The machines are heterogeneous in CPU and memory, so per-machine customization is required.

For example, a dedicated machine should reserve 20% of its CPU and 30% of its memory for the operating system and other processes.

A machine shared with other services should reserve more CPU and memory for those services.

-- Never configure a single uniform load threshold for the resource scheduler; otherwise some machines will have their CPU and memory saturated
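The per-machine thresholds from item 1 can be sketched as a small lookup table consulted before a task is placed. The host names, the class and function names, and the shared-machine percentages are assumptions for illustration; only the 20% CPU and 30% memory figures for a dedicated machine come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineLimits:
    cpu_reserved: float  # fraction of CPU kept free for the OS and other processes
    mem_reserved: float  # fraction of memory kept free

    def accepts(self, cpu_used: float, mem_used: float) -> bool:
        """True if current usage is still below this machine's own ceiling."""
        return (cpu_used < 1.0 - self.cpu_reserved
                and mem_used < 1.0 - self.mem_reserved)


# Hypothetical hosts; the shared-machine values are illustrative only.
LIMITS = {
    "dedicated-01": MachineLimits(cpu_reserved=0.20, mem_reserved=0.30),
    "shared-01": MachineLimits(cpu_reserved=0.40, mem_reserved=0.50),
}


def can_schedule(host: str, cpu_used: float, mem_used: float) -> bool:
    return LIMITS[host].accepts(cpu_used, mem_used)


print(can_schedule("dedicated-01", cpu_used=0.75, mem_used=0.65))
print(can_schedule("shared-01", cpu_used=0.75, mem_used=0.65))
```

The same usage numbers are accepted on the dedicated machine and rejected on the shared one, which is the behavior a single cluster-wide threshold cannot express.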

2. Resource isolation problem

The resources of each team must be isolated. For example, the resource scheduler uses ZooKeeper (zk) to store run records, and these records are usually organized by team. If one team's zk node has a problem, it can affect the use of the entire zk, causing the system to go down or behave abnormally.

To solve this, create a separate zk connection for each team, so that even if one zk becomes abnormal, the business of the other teams is not affected.

-- The per-team isolation of zk described above is just a typical example. Any other team-related resource also needs to be isolated to prevent teams from affecting each other. On 2015/12/07, the resource scheduler of our online system was using a zk shared by multiple teams; one team's zk node had a problem, and it affected the whole system.

(The cause was that the data written under a single zk node exceeded 4M, which made reads fail, so zk sessions timed out and the system became unavailable.)
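One way to guard against the oversized-node failure described above is to check the payload size before writing it to zk. The sketch below is plain Python with no zk client; the function name, the exception type, the example path, and the reading of "4M" as 4 MiB are assumptions, and the real limit should match the cluster's own configuration.

```python
# Assumption: "4M" in the incident is read as 4 MiB; adjust to the cluster's limit.
MAX_NODE_BYTES = 4 * 1024 * 1024


class NodeTooLargeError(ValueError):
    """Raised when a payload would exceed the per-node size limit."""


def check_node_payload(path: str, payload: bytes, limit: int = MAX_NODE_BYTES) -> bytes:
    """Return the payload unchanged, or raise before it is ever written."""
    if len(payload) > limit:
        raise NodeTooLargeError(
            f"{path}: payload is {len(payload)} bytes, limit is {limit}"
        )
    return payload


# Hypothetical per-team path; a caller would pass the result to its zk client.
data = check_node_payload("/scheduler/team-a/run-records", b'{"task": "t1"}')
print(len(data))
```

Rejecting the write at the caller keeps one team's oversized record from degrading reads for everyone else.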

 

3. Functional design of web pages

On the operations and maintenance page, every per-task operation must have a corresponding batch version, for example: kill and batch kill; repair task and batch repair tasks.

When thousands of tasks need to be processed, batch operations are especially important, because they avoid the exceptions that can result from running SQL directly against the backend. (Note: modifying backend data with hand-written SQL is very dangerous, and such operations should be prohibited.)
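A batch operation can be sketched as a thin wrapper that applies the existing single-task operation to each task and collects failures instead of stopping at the first one. The names `run_batch` and `kill` and the task IDs are hypothetical; a real page would call the scheduler's own kill or repair function.

```python
from typing import Callable, Iterable


def run_batch(task_ids: Iterable[str], operation: Callable[[str], None]) -> dict[str, str]:
    """Apply `operation` to every task; return {task_id: error message} for failures."""
    failures: dict[str, str] = {}
    for task_id in task_ids:
        try:
            operation(task_id)
        except Exception as exc:  # keep going so one bad task does not block the rest
            failures[task_id] = str(exc)
    return failures


def kill(task_id: str) -> None:
    """Stand-in for the real single-task kill; fails for one task to show reporting."""
    if task_id == "t2":
        raise RuntimeError("not running")


print(run_batch(["t1", "t2", "t3"], kill))
```

Because the batch path reuses the same validated single-task operation, operators have no reason to fall back to editing backend tables by hand.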

 

 

 

 

 

 
