To implement a task scheduling system, this article is all you need

A netizen who read the article "Selecting a Scheduled Task Framework" sent me this message:

I have seen so many so-called tutorials; most of them teach "how to use tools", few teach "how to build tools", and those who can even teach "how to imitate tools" are already rare. What China's software industry lacks most is programmers who can truly "build tools"; there is absolutely no shortage of programmers who merely "use tools"! ... What this industry needs least is "engineers who can use XX tools", and what it needs most is "creative software engineers"! All the jobs in this industry are essentially created by "creative software engineers"!

In this article I want to walk through task scheduling with you from start to finish. I hope that after reading it, you will understand the core logic of implementing a task scheduling system.

1 Quartz

Quartz is an open-source Java task scheduling framework, and it is also where many Java engineers first come into contact with task scheduling.

The following figure shows the overall process of task scheduling:

At the heart of Quartz are three components.

  • Task: a Job represents a scheduled task;
  • Trigger: a Trigger defines the time elements of scheduling, that is, the time rules by which a task is executed. A Job can be associated with multiple Triggers, but a Trigger can only be associated with one Job;
  • Scheduler: created by a factory class, the Scheduler schedules tasks according to the time rules defined by the triggers.
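To make these three components concrete, here is a minimal sketch of scheduling a job with Quartz (the job name, group, and cron expression are just examples):

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class QuartzQuickStart {

    // Job: represents the scheduled task itself
    public static class MyJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("my job is running");
        }
    }

    public static void main(String[] args) throws SchedulerException {
        // Scheduler: created by the factory, drives the whole scheduling process
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(MyJob.class)
                .withIdentity("myJob", "demoGroup")
                .build();

        // Trigger: defines the time rule, here a cron expression firing every 10 seconds
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("myTrigger", "demoGroup")
                .withSchedule(CronScheduleBuilder.cronSchedule("0/10 * * * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}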

In the sketch above, Quartz's JobStore is the default RAMJobStore, so Triggers and Jobs are stored in memory.

The core class that performs task scheduling is QuartzSchedulerThread .

  1. The scheduling thread obtains the list of triggers to be fired from the JobStore and updates the trigger status;
  2. It fires the triggers, updates their information (the next fire time and the trigger status), and stores it back;
  3. Finally, it creates the concrete task execution objects and executes the tasks through the worker thread pool.

Next, let's talk about Quartz's cluster deployment solution.

Quartz's cluster deployment requires creating the Quartz tables in the database (SQL scripts are provided for different database types such as MySQL and Oracle), and the JobStore used is JobStoreSupport.

This solution is distributed, and there is no node responsible for centralized management. Instead, it uses database row-level locks to achieve concurrency control in a cluster environment.

In cluster mode, a scheduler instance first acquires a row lock on the {0}LOCKS table; on MySQL this is done with a SELECT ... FOR UPDATE statement.

{0} is replaced with the table prefix configured in the properties file (QRTZ_ by default), sched_name is the scheduler name shared by the application cluster, and lock_name is the name of the row-level lock. Quartz mainly uses two row-level locks: the trigger access lock (TRIGGER_ACCESS) and the state access lock (STATE_ACCESS).
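The snippet below is only a simplified JDBC-level illustration of taking that row lock inside a transaction, not Quartz's actual code; the table prefix QRTZ_ and the column names follow the description above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class QuartzRowLockDemo {

    // Acquire the TRIGGER_ACCESS row lock; it is held until commit/rollback,
    // so only one cluster node works on the triggers at a time.
    public void runWithLock(Connection conn) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM QRTZ_LOCKS WHERE SCHED_NAME = ? AND LOCK_NAME = ? FOR UPDATE")) {
            ps.setString(1, "myClusteredScheduler"); // scheduler name shared by the cluster
            ps.setString(2, "TRIGGER_ACCESS");       // name of the row-level lock
            try (ResultSet rs = ps.executeQuery()) {
                // while the row lock is held: read and update triggers safely
            }
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}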

This architecture solves the problem of distributed task scheduling: only one node runs a given task, and the other nodes do not execute it. But when there are many short tasks, the nodes frequently compete for the database locks, and the more nodes there are, the worse the performance gets.

2 Distributed lock mode

Quartz's cluster mode can scale horizontally in a distributed way, but it requires the business side to add the corresponding tables to its database, which is quite intrusive.

To avoid this intrusiveness, many R&D teams have also explored the distributed lock mode.

Business scenario: in an e-commerce project, if the user does not pay within a certain period after placing an order, the system closes the order when the timeout expires.

Usually we run a scheduled task every two minutes that checks the orders from the previous half hour, queries the list of orders that have not been paid, restores the inventory of the goods in those orders, and then marks the orders as invalid.

We can implement this scheduled task with Spring Schedule.

@Scheduled(cron = "0 */2 * * * ?")
public void doTask() {
   log.info("scheduled task started");
   // close the expired unpaid orders
   orderService.closeExpireUnpayOrders();
   log.info("scheduled task finished");
}

A single server runs this fine, but considering high availability and surges in business volume, the architecture will evolve into a cluster. Multiple instances then execute the same scheduled task at the same time, which may cause business disorder.

The solution is to use a Redis distributed lock around task execution to solve this kind of problem.

@Scheduled(cron = "0 */2 * * * ?")
public void doTask() {
    log.info("scheduled task started");
    String lockName = "closeExpireUnpayOrdersLock";
    RedisLock redisLock = redisClient.getLock(lockName);
    // try to acquire the lock: wait at most 3 seconds, auto-release after 300 seconds (5 minutes)
    boolean locked = redisLock.tryLock(3, 300, TimeUnit.SECONDS);
    if (!locked) {
        log.info("failed to acquire the distributed lock: {}", lockName);
        return;
    }
    try {
        // close the expired unpaid orders
        orderService.closeExpireUnpayOrders();
    } finally {
        redisLock.unlock();
    }
    log.info("scheduled task finished");
}

Redis has excellent read/write performance, and its distributed locks are lighter than Quartz's database row-level locks. Of course, the Redis lock can also be replaced by a Zookeeper lock; the mechanism is the same.

In small projects, the combination of a scheduled task framework (Quartz / Spring Schedule) and a distributed lock (Redis / Zookeeper) works quite well.

However, we can see two problems with this combination:

  1. In a distributed scenario, the scheduled task may run idly on the nodes that fail to get the lock, and tasks cannot be sharded;
  2. Triggering a task manually requires writing extra code.

3 ElasticJob-Lite Framework

ElasticJob-Lite is positioned as a lightweight, decentralized solution that provides coordination services for distributed tasks in the form of a jar.
Official website structure diagram

Inside the application, define a task class that implements the SimpleJob interface and write the actual business logic of your task.

public class MyElasticJob implements SimpleJob {
    @Override
    public void execute(ShardingContext context) {
        switch (context.getShardingItem()) {
            case 0:
                // do something by sharding item 0
                break;
            case 1:
                // do something by sharding item 1
                break;
            case 2:
                // do something by sharding item 2
                break;
            // case n: ...
        }
    }
}
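For completeness, here is a minimal registration sketch assuming ElasticJob-Lite 3.x; the Zookeeper address, namespace, job name, and cron expression are placeholders:

import org.apache.shardingsphere.elasticjob.api.JobConfiguration;
import org.apache.shardingsphere.elasticjob.lite.api.bootstrap.impl.ScheduleJobBootstrap;
import org.apache.shardingsphere.elasticjob.reg.base.CoordinatorRegistryCenter;
import org.apache.shardingsphere.elasticjob.reg.zookeeper.ZookeeperConfiguration;
import org.apache.shardingsphere.elasticjob.reg.zookeeper.ZookeeperRegistryCenter;

public class MyElasticJobBootstrap {

    public static void main(String[] args) {
        // Zookeeper acts as the registry and coordination center
        CoordinatorRegistryCenter regCenter = new ZookeeperRegistryCenter(
                new ZookeeperConfiguration("localhost:2181", "elasticjob-demo"));
        regCenter.init();

        // the job "myElasticJob" is split into 3 sharding items (0, 1, 2)
        JobConfiguration jobConfig = JobConfiguration.newBuilder("myElasticJob", 3)
                .cron("0 0/5 * * * ?")
                .build();

        // every application instance runs this; Zookeeper distributes the shards among them
        new ScheduleJobBootstrap(regCenter, new MyElasticJob(), jobConfig).schedule();
    }
}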

Example: Application A has five tasks to be executed, namely A, B, C, D, and E. Task E needs to be divided into four subtasks, and the application is deployed on two machines.

After application A is started, five tasks are coordinated by Zookeeper and distributed to two machines, and different tasks are executed separately by Quartz Scheduler.

In essence, the underlying task scheduling of ElasticJob is still done by Quartz. Compared with Redis distributed locks or Quartz's own distributed deployment, its advantage is that it can rely on the heavyweight coordinator Zookeeper to assign tasks, through a load-balancing algorithm, to the Quartz Scheduler containers inside the applications.

From the user's point of view it is very simple and easy to use. But from an architectural point of view, the scheduler and the executor still live in the same application JVM, and after the container starts it still has to elect a master and balance the shards. If applications restart frequently, the constant master election and shard rebalancing are relatively heavy operations.

In addition, the ElasticJob console is relatively rough: it displays job status by reading registry data, and modifies global task configuration by updating registry data.

4 The centralized school

The principle of centralization is to separate scheduling and task execution into two parts: the scheduling center and the executor. The scheduling center is only responsible for managing task scheduling attributes and triggering scheduling commands; the executor receives the scheduling commands and executes the concrete business logic. Both parts can be scaled out independently.

4.1 MQ mode

Let me talk about the first centralized architecture I came into contact with, in the promotion team at eLong.

The scheduling center relies on Quartz's cluster mode; when a task is due, it sends a message to RabbitMQ. The business application receives the task message and consumes the task information.

This model makes full use of MQ's decoupling: the scheduling center sends tasks, and the application side acts as the executor that receives and runs them.
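A rough sketch of this pattern with Spring AMQP follows; the exchange, routing key, queue name, and TaskMessage type are hypothetical, not from the original project:

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Component;

// hypothetical task payload; a real project would define its own message format
class TaskMessage implements java.io.Serializable {
    String jobName;
    String params;
}

// scheduling center side: when Quartz fires, it only publishes a task message
@Component
class TaskPublisher {

    private final RabbitTemplate rabbitTemplate;

    TaskPublisher(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    void publish(TaskMessage task) {
        rabbitTemplate.convertAndSend("task.exchange", "task.close-order", task);
    }
}

// business application side: acts as the executor, receives and runs the task
@Component
class TaskWorker {

    @RabbitListener(queues = "task.close-order.queue")
    public void onTask(TaskMessage task) {
        // execute the concrete business logic for this task
    }
}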

But this design depends heavily on the message queue: its scalability, functionality, and system load are all tied to the message queue, so this architecture requires architects who know message queues very well.

4.2 XXL-JOB

XXL-JOB is a distributed task scheduling platform. Its core design goals are rapid development, easy learning, light weight, and easy extension. Its source code is open and it has been connected to the online product lines of many companies; it works out of the box.

xxl-job 2.3.0 architecture diagram

We focus on analyzing the architecture diagram:

▍ Network communication: the server-worker model

Communication between the scheduling center and the executor follows a server-worker model. The scheduling center itself is a Spring Boot project that listens on port 8080 when it starts.

After the executor starts, it starts a built-in server (EmbedServer) listening on port 9999 (the default). In this way, both sides can send commands to each other.

How does the scheduling center know the executors' address information? As shown in the figure above, the executor periodically sends registration commands, so the scheduling center can obtain the list of online executors.

Using this executor list, the scheduling center selects nodes to execute a task according to the routing strategy configured for that task. Common routing strategies are:

  • Random node execution: Select an available execution node in the cluster to execute the scheduled task. Applicable scenario: offline order settlement.

  • Broadcast execution: Distribute scheduling tasks and execute them on all execution nodes in the cluster. Applicable scenario: update the local cache of the application in batches.

  • Sharded execution: split the task according to user-defined sharding logic and distribute the shards to different nodes in the cluster for parallel execution, improving resource utilization. Applicable scenario: massive log statistics. (See the sketch below.)
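As a concrete example of the sharding strategy, an XXL-JOB handler (2.3.x style) can read its own shard index and the shard total, so that each node only processes its slice of the data; the handler name and business logic here are placeholders:

import com.xxl.job.core.context.XxlJobHelper;
import com.xxl.job.core.handler.annotation.XxlJob;
import org.springframework.stereotype.Component;

@Component
public class LogStatJobHandler {

    @XxlJob("logStatJobHandler")
    public void execute() {
        int shardIndex = XxlJobHelper.getShardIndex(); // index of the current executor node
        int shardTotal = XxlJobHelper.getShardTotal(); // total number of shards

        XxlJobHelper.log("shard {} of {} starts processing", shardIndex, shardTotal);

        // e.g. only process records whose id % shardTotal == shardIndex
        // logService.statByShard(shardIndex, shardTotal);
    }
}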

▍ Scheduler

The scheduler is a very core component in the task scheduling system. Early versions of XXL-JOB relied on Quartz.

However, since v2.1.0 the Quartz dependency has been completely removed and the Quartz tables have been replaced with self-developed tables.

The core scheduling class is JobScheduleHelper. After its start method is called, two threads are started: scheduleThread and ringThread.

First, the scheduleThread periodically loads the tasks that need to be scheduled from the database; essentially it relies on a database row lock to ensure that only one scheduling center node triggers task scheduling at a time.

Connection conn = XxlJobAdminConfig.getAdminConfig()
                  .getDataSource().getConnection();
connAutoCommit = conn.getAutoCommit();
conn.setAutoCommit(false);
preparedStatement = conn.prepareStatement(
    "select * from xxl_job_lock where lock_name = 'schedule_lock' for update");
preparedStatement.execute();

// trigger task scheduling (pseudocode)
for (XxlJobInfo jobInfo : scheduleList) {
    // code omitted
}

// commit the transaction
conn.commit();

The scheduling thread will take different actions according to the "next trigger time" of the task:

Expired tasks that must run immediately are handed directly to the trigger thread pool, while tasks due within the next five seconds are placed into the ringData object.

After the ringThread starts, it periodically takes the list of due tasks out of the ringData object and puts them into the trigger thread pool for execution.
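The time ring itself is just a map from "second within the minute" to the list of job ids due at that second. The following is a simplified sketch of the idea, not XXL-JOB's actual code (per-bucket locking and the catch-up check of the previous second are omitted):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SecondRing {

    // key: second within the minute (0-59), value: ids of jobs due at that second
    private final Map<Integer, List<Integer>> ringData = new ConcurrentHashMap<>();

    // scheduleThread side: put a job into the bucket of its fire time
    public void push(long fireTimeMillis, int jobId) {
        int second = (int) ((fireTimeMillis / 1000) % 60);
        ringData.computeIfAbsent(second, k -> new ArrayList<>()).add(jobId);
    }

    // ringThread side: once per second, take the current bucket and trigger its jobs
    public List<Integer> pull(long nowMillis) {
        int second = (int) ((nowMillis / 1000) % 60);
        List<Integer> due = ringData.remove(second);
        return due == null ? new ArrayList<>() : due;
    }
}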


5 Self-developed on the shoulders of giants

In 2018, I had the experience of building a self-developed task scheduling system.

The background: it had to be compatible with the RPC framework developed in-house by the technical team, so that teams could host their RPC-annotated methods in the task scheduling system and run them directly as tasks without modifying any code.

During the development, I studied the source code of XXL-JOB and also absorbed a lot of nourishment from SchedulerX, Alibaba Cloud's distributed task scheduling product.

SchedulerX 1.0 architecture diagram

  • Schedulerx-console is the task scheduling console, used to create and manage scheduled tasks. It is responsible for data creation, modification and queries, and interacts with schedulerx-server inside the product.
  • Schedulerx-server is the task scheduling server and the core component of SchedulerX. It is responsible for triggering client tasks and monitoring task execution status.
  • Schedulerx-client is the task scheduling client. Each application process connected through the client is a Worker. The Worker establishes communication with schedulerx-server so that the server can discover the client machine, and registers the group that the current application belongs to, so that schedulerx-server can trigger tasks to the client on schedule.

We imitated the modules of SchedulerX, and the architecture design is as follows:

I chose remoting, the communication module from the RocketMQ source code, as the communication framework of the self-developed scheduling system, based on the following two points:

  1. I was not familiar with Dubbo, which is well known in the industry, whereas I had already built several wheels on top of remoting and believed I could handle it;

  2. While reading the source code of the SchedulerX 1.0 client, I found that its communication framework resembles RocketMQ remoting in many places. Its source code contains ready-made engineering implementations; it is a real treasure.

I removed the name service code from the RocketMQ remoting module and made a certain degree of customization.

In RocketMQ's remoting, the server adopts the Processor mode.

The scheduling center needs to register two processors: the callback result processor CallBackProcessor and the heartbeat processor HeartBeatProcessor. The executor needs to register the trigger task processor TriggerTaskProcessor.

public void registerProcessor(
             int requestCode,
             NettyRequestProcessor processor,
             ExecutorService executor);

Processor interface:

public interface NettyRequestProcessor {

    RemotingCommand processRequest(
                    ChannelHandlerContext ctx,
                    RemotingCommand request) throws Exception;

    boolean rejectRequest();
}

With this communication framework I do not need to care about the communication details; I only need to implement the processor interface.

Take TriggerTaskProcessor as an example:
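(The original post showed a code screenshot here. The following is only a rough sketch of what such a processor might look like on top of RocketMQ remoting; the response code, body format, and the hand-off to a worker pool are assumptions, and the package names depend on the RocketMQ version.)

import io.netty.channel.ChannelHandlerContext;
import org.apache.rocketmq.remoting.netty.NettyRequestProcessor;
import org.apache.rocketmq.remoting.protocol.RemotingCommand;

import java.nio.charset.StandardCharsets;

public class TriggerTaskProcessor implements NettyRequestProcessor {

    private static final int SUCCESS_CODE = 0;

    @Override
    public RemotingCommand processRequest(ChannelHandlerContext ctx, RemotingCommand request) throws Exception {
        // the scheduling center serializes the trigger parameters into the request body
        String triggerParam = new String(request.getBody(), StandardCharsets.UTF_8);

        // hand the task over to the local worker thread pool / business code
        // taskExecutor.submit(() -> handle(triggerParam));

        // reply so that the scheduling center can record the trigger result
        return RemotingCommand.createResponseCommand(SUCCESS_CODE, "trigger accepted");
    }

    @Override
    public boolean rejectRequest() {
        return false;
    }
}

The executor then binds a request code to this handler through the registerProcessor(requestCode, processor, executorService) method shown above.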

With network communication settled, how should the scheduler be designed? In the end I chose the Quartz cluster mode, mainly for the following reasons:

  1. When the scheduling volume is not large, the Quartz cluster mode is stable enough and compatible with the original XXL-JOB tasks;
  2. As for the time wheel, I did not have enough hands-on experience with it and worried about running into problems. In addition, coordinating how the different scheduling services (schedule-server) trigger tasks would require a coordinator, which made me think of Zookeeper; but that would introduce yet another component;
  3. The development cycle could not be too long, and I wanted to produce results as soon as possible.

The self-developed scheduling service took a month and a half to go online. The system has been running very stably, the R&D teams onboarded smoothly, and the scheduling volume is not large: the total over four months was close to 40 to 50 million.

Frankly speaking, I could already picture the bottleneck of this self-developed version: a large data volume can be handled by sharding databases and tables, but since the Quartz cluster is based on row-level locks, its ceiling is destined not to be very high.

To relieve this nagging doubt, I wrote a demo of a new wheel to see whether it would work:

  1. Remove the external registry and let the scheduling service (schedule-server) manage the sessions itself;
  2. Introduce Zookeeper to coordinate the scheduling services. The HA mechanism is very rough though: effectively one scheduling service is active while another stands by;
  3. Replace Quartz with a time wheel (referring to the time-wheel source code in Dubbo); see the sketch below.
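For point 3, here is a minimal illustration of the time-wheel idea using Netty's HashedWheelTimer, from which Dubbo's time wheel is adapted; the task body is a placeholder:

import io.netty.util.HashedWheelTimer;
import io.netty.util.Timeout;
import io.netty.util.TimerTask;

import java.util.concurrent.TimeUnit;

public class TimeWheelDemo {

    public static void main(String[] args) {
        // 512 slots, each tick covers 100 ms: one full turn of the wheel is about 51.2 s
        HashedWheelTimer timer = new HashedWheelTimer(100, TimeUnit.MILLISECONDS, 512);

        // schedule a task to fire 5 seconds later, instead of going through Quartz
        timer.newTimeout(new TimerTask() {
            @Override
            public void run(Timeout timeout) {
                System.out.println("trigger task: close expired orders");
            }
        }, 5, TimeUnit.SECONDS);

        // in a real scheduling service, a single shared timer instance would be reused
    }
}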

This demo version runs in the development environment, but many details still need to be polished. It is just a toy and never had the chance to run in production.

I recently read an Alibaba Cloud article, "How to Implement Millions of Rule Alerts Through Task Scheduling". The high-availability architecture of SchedulerX 2.0 is shown in the figure below:

The article mentions:

Each application has three replicas; through ZK lock grabbing, one is the master and two are backups. If a server goes down, failover is performed and the other servers take over its scheduling tasks.

Architecturally, the self-developed task scheduling system is not complicated. It implements the core functions of XXL-JOB and is compatible with the technical team's RPC framework, but it does not implement workflow or MapReduce-style sharding.

After being upgraded to 2.0, SchedulerX is based on a new Akka architecture, which claims to deliver a high-performance workflow engine, implement inter-process communication, and reduce the amount of network communication code.

Among the open-source task scheduling systems I investigated, PowerJob is also based on the Akka architecture and also implements workflow and MapReduce execution modes.

I am very interested in PowerJob, and after studying and practicing with it I will write related articles, so stay tuned.

6 Technology selection

First, let's put the open-source task scheduling products and the commercial product SchedulerX together into a comparison table:

In essence, Quartz and ElasticJob still sit at the framework level.

Centralized products have a clearer architecture, are more flexible at the scheduling layer, and can support more complex scheduling (MapReduce dynamic sharding, workflow).

XXL-JOB is simplified at the product level and works out of the box; its scheduling model meets the needs of most R&D teams. It is easy to use and fun to work with, so it is very popular.

In fact, each technical team has different technical reserves and faces different scenarios, so technical selection cannot be generalized.

No matter which technology you use, you still need to pay attention to two points when writing task business code:

  • Idempotency. When a task is executed repeatedly, or when the distributed lock fails, the program must still produce the correct result (see the sketch below);
  • When a task "disappears", don't panic. Check the scheduling logs, use the jstack command to inspect the JVM thread stacks, and add timeouts to network calls; that generally solves most problems.
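Taking the order-closing task above as an example, here is a hedged sketch of what idempotent task code might look like; OrderRepository, Order, and the status values are hypothetical:

import java.util.List;

// hypothetical types, for illustration only
interface OrderRepository {
    List<Order> findUnpaidOrdersOlderThanMinutes(int minutes);

    // atomically: UPDATE orders SET status = 'CLOSED' WHERE id = ? AND status = 'UNPAID'
    boolean casStatus(long orderId, String expectedStatus, String newStatus);
}

class Order {
    long id;

    long getId() {
        return id;
    }
}

public class CloseOrderTask {

    private final OrderRepository orderRepository;

    public CloseOrderTask(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    // Idempotent: running this twice, or running it after the distributed lock fails,
    // still produces the correct result because the status change is a compare-and-set.
    public void closeExpireUnpayOrders() {
        List<Order> expired = orderRepository.findUnpaidOrdersOlderThanMinutes(30);
        for (Order order : expired) {
            boolean closed = orderRepository.casStatus(order.getId(), "UNPAID", "CLOSED");
            if (closed) {
                // only the execution that actually closed the order restores its inventory
                // restoreInventory(order);
            }
        }
    }
}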

7 Written at the end

2015 was actually a very interesting year: ElasticJob and XXL-JOB, two task scheduling projects from different schools, were both open-sourced.

In the XXL-JOB source code, there is still a screenshot of a status post by its author Xu Xueli on OSChina (Open Source China):

Just finished writing a task scheduling framework: tasks can be managed dynamically on the web and take effect in real time, which feels quite satisfying. If nothing unexpected happens, it will be pushed to git.osc tomorrow at noon. Haha, time to go downstairs, stir-fry some noodles and add a poached egg to celebrate.

Seeing this screenshot, I feel a deep resonance, and the corners of my mouth can't help but rise.

I also remember that in 2016, Zhang Liang, the author of ElasticJob, open-sourced sharding-jdbc. I created a private project on GitHub, referred to the sharding-jdbc source code, and implemented database and table sharding myself. The first class was named ShardingDataSource, and its timestamp is fixed at 2016/3/29.

I don't know how to define a "creative software engineer", but I believe that an engineer who stays curious, works diligently, and is willing to share and help others will never have too bad luck.


If you found this article helpful, please give the author a "Like" and a "Looking"; see you in the next issue.

Past recommendations:

  • Taste the Beauty of Spring Cache Design
  • The ordinary road of chasing the source code
  • My Eight Years of Love with Message Queue

Origin blog.csdn.net/makemyownlife/article/details/122671652