Elastic-Job - Distributed Scheduled Task Framework

Elastic-Job is a distributed elastic job framework extracted from the job module of dd-job in ddframe, with the monitoring and ddframe access-specification parts of dd-job removed. The project is a secondary development based on the mature open source products Quartz and Zookeeper, together with its client Curator.

Project open source address: https://github.com/dangdangdotcom/elastic-job

Other modules of ddframe also have parts that can be open sourced independently; Dangdang previously open sourced DubboX, the cornerstone module of dd-soa.

The relationship between elastic-job and ddframe is shown in the figure below.

Elastic-Job main functions

  • Scheduled tasks: executes scheduled tasks based on Quartz cron expressions, building on Quartz as a mature job scheduling framework.

  • Job registry center: a global job registration and control center based on Zookeeper and its client Curator, used to register, control and coordinate distributed job execution.

  • Job sharding: divides a task into multiple small task items that are executed simultaneously on multiple servers.

  • Elastic scaling: if a running job server crashes, or n new job servers come online, the framework reshards before the next job execution without affecting the execution currently in progress.

  • Multiple job execution modes: supports the three job modes OneOff, Perpetual and SequencePerpetual.

  • Failover: the crash of a running job server does not trigger resharding; resharding happens only before the next job execution. With failover enabled, idle job servers are detected during the current execution and can fetch unfinished orphan shard items to execute.

  • Runtime status collection: monitors the job runtime status, counts the successfully and unsuccessfully processed data over the recent period, and records the start time, end time and next fire time of the job's latest run.

  • Job pause, resume and disable: used to start and stop jobs, and to prohibit a job from running (commonly used during releases).

  • Misfired job re-triggering: automatically records jobs that missed their fire time and triggers them after the previous execution completes; compare Quartz's misfire handling.

  • Multi-threaded data processing: processes fetched data with multiple threads to improve throughput.

  • Idempotency: detects duplicate job task items so that task items already running are not executed again. Since enabling idempotency requires monitoring the job running status, it has a significant performance impact on jobs that fire in rapid succession.

  • Fault tolerance: if a job server loses contact with the Zookeeper server, the job stops immediately, to prevent the registry from assigning the shard items it considers failed to other servers while the current server is still executing them, which would cause duplicate execution.

  • Spring support: supports the Spring container and a custom namespace, including placeholders.

  • Operation and maintenance platform: provides a web console for managing jobs and registry centers.

Directory Structure Description

 

  • elastic-job-core

    The core module of elastic-job. Distributed jobs can be executed with this module alone, relying only on Quartz and Curator.

  • elastic-job-spring

    The Spring support module of elastic-job, providing the custom namespace, dependency injection, placeholders, etc.

  • elastic-job-console

    The elastic-job web console. Deploy the compiled war into a servlet container such as Tomcat to use it.

  • elastic-job-example

    Usage examples.

  • elastic-job-test

    Shared test utility classes used by elastic-job itself; users do not need to pay attention to this module.

Introducing Maven dependencies

 

  • elastic-job has been released to the Maven central repository; the coordinates can be added directly to the pom.xml file.
    <!-- the elastic-job core module -->
    <dependency>
        <groupId>com.dangdang</groupId>
        <artifactId>elastic-job-core</artifactId>
        <version>1.0.1</version>
    </dependency>
    <!-- required when using the Spring custom namespace -->
    <dependency>
        <groupId>com.dangdang</groupId>
        <artifactId>elastic-job-spring</artifactId>
        <version>1.0.1</version>
    </dependency>

Code development

There are three job types available: OneOff, Perpetual and SequencePerpetual. A job needs to extend the corresponding abstract class.

The method parameter shardingContext contains the job configuration, sharding and runtime information. The total number of shards, the shard items running on the current job server, and so on can be obtained through methods such as getShardingTotalCount() and getShardingItems().
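For example, the shard items assigned to the current server can decide which slice of data the job handles. A minimal sketch, assuming getShardingItems() returns the list of shard item numbers for this server as described above; the modulo mapping of shard items to rows is an illustrative assumption, not part of the framework:

public class ShardingAwareJob extends AbstractOneOffElasticJob {

    @Override
    protected void process(JobExecutionMultipleShardingContext context) {
        int total = context.getShardingTotalCount(); // total number of shards
        for (int item : context.getShardingItems()) { // shard items assigned to this server
            // e.g. handle only rows where id % total == item
            System.out.println("Handling shard " + item + " of " + total);
        }
    }
}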

  • OneOff Type Jobs

    The OneOff job type is the simplest; it needs to extend AbstractOneOffElasticJob. This class provides a single method to override, which is executed on schedule. It is used for ordinary timed tasks, similar to Quartz's native interface, but with elastic scaling and sharding added.

public class MyElasticJob extends AbstractOneOffElasticJob {

    @Override
    protected void process(JobExecutionMultipleShardingContext context) {
        // do something by sharding items
    }
}
  • Perpetual type jobs

    The Perpetual job type is slightly more complicated. It needs to extend AbstractPerpetualElasticJob, and the data type can be specified as a generic parameter. The class provides two methods to override, one for fetching and one for processing data, and auxiliary monitoring information such as the number of successfully and unsuccessfully processed items can be collected. Note that the job stops its current execution only when fetchData returns null or an empty collection; otherwise the job keeps running continuously. This follows the design of TbSchedule. The Perpetual job type is well suited to uninterrupted, stream-style data processing.

    When the job executes, the data returned by fetchData is passed to processData, which processes it with multiple threads (the thread pool size is configurable). It is recommended that processData update each record's state after processing so that fetchData does not fetch it again; otherwise the job will never stop (a sketch of this pattern follows the example below). The return value of processData indicates whether the record was processed successfully: throwing an exception or returning false counts towards the failure statistics, while returning true counts towards the success statistics.

public class MyElasticJob extends AbstractPerpetualElasticJob<Foo> {

    @Override
    protected List<Foo> fetchData(JobExecutionMultipleShardingContext context) {
        List<Foo> result = new ArrayList<Foo>(); // TODO: get data from database by sharding items
        return result;
    }
    
    @Override
    protected boolean processData(JobExecutionMultipleShardingContext context, Foo data) {
        // process data
        return true;
    }
}
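As noted above, processData should update each record's state so that fetchData does not pick it up again. A minimal sketch of that pattern; FooDao and its methods findUnprocessed and markProcessed are hypothetical stand-ins for your own data access layer:

public class StatefulElasticJob extends AbstractPerpetualElasticJob<Foo> {

    private FooDao fooDao; // hypothetical DAO, e.g. injected by Spring

    @Override
    protected List<Foo> fetchData(JobExecutionMultipleShardingContext context) {
        // fetch only unprocessed rows belonging to this server's shard items
        return fooDao.findUnprocessed(context.getShardingItems());
    }

    @Override
    protected boolean processData(JobExecutionMultipleShardingContext context, Foo data) {
        fooDao.markProcessed(data); // the state change prevents re-fetching, so the job can stop
        return true; // counts towards the success statistics
    }
}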
  • SequencePerpetual type job

    The SequencePerpetual job type is very similar to Perpetual; the difference is how fetched data is assigned to threads. Perpetual processes fetched data with multiple threads without guaranteeing any ordering. For example, if 2 shard items yield 100 records in total, 40 from the first shard and 60 from the second, and two processing threads are configured, the first thread may process the first 50 records and the second thread the remaining 50, ignoring shard boundaries. SequencePerpetual instead starts one thread per shard item allocated to the current server, and all data of one shard item is processed by the same thread, which prevents ordering problems within a shard: with the same 100 records, the system allocates two threads, the first processing the 40 records of the first shard and the second processing the 60 records of the second shard. Because Perpetual can use an arbitrary number of threads regardless of shard items, its throughput can usually be tuned higher than SequencePerpetual's.

public class MyElasticJob extends AbstractSequencePerpetualElasticJob<Foo> {

    @Override
    protected List<Foo> fetchData(JobExecutionSingleShardingContext context) {
        List<Foo> result = new ArrayList<Foo>(); // TODO: get data from database by sharding items
        return result;
    }
    
    @Override
    protected boolean processData(JobExecutionSingleShardingContext context, Foo data) {
        // process data
        return true;
    }
}

Job configuration

When jobs are used with the Spring container, the job can be configured as a Spring bean, so that objects managed by the container, such as data sources, can be used in the job through dependency injection. Placeholders can be used to read values from properties files.

  • Spring namespace configuration

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:reg="http://www.dangdang.com/schema/ddframe/reg" 
    xmlns:job="http://www.dangdang.com/schema/ddframe/job" 
    xsi:schemaLocation="http://www.springframework.org/schema/beans
                        http://www.springframework.org/schema/beans/spring-beans.xsd
                        http://www.dangdang.com/schema/ddframe/reg
                        http://www.dangdang.com/schema/ddframe/reg/reg.xsd
                        http://www.dangdang.com/schema/ddframe/job
                        http://www.dangdang.com/schema/ddframe/job/job.xsd
                        ">
    <!-- configure the job registry center -->
    <reg:zookeeper id="regCenter" serverLists="yourhost:2181" namespace="dd-job" baseSleepTimeMilliseconds="1000" maxSleepTimeMilliseconds="3000" maxRetries="3" />
    <!-- configure job A -->
    <job:bean id="oneOffElasticJob" class="xxx.MyOneOffElasticJob" regCenter="regCenter" cron="0/10 * * * * ?" shardingTotalCount="3" shardingItemParameters="0=A,1=B,2=C" />
    <!-- configure job B -->
    <job:bean id="perpetualElasticJob" class="xxx.MyPerpetualElasticJob" regCenter="regCenter" cron="0/10 * * * * ?" shardingTotalCount="3" shardingItemParameters="0=A,1=B,2=C" processCountIntervalSeconds="10" concurrentDataProcessThreadCount="10" />
</beans>

 

<job:bean /> namespace attribute details

<reg:zookeeper /> namespace attribute details

 

  • Based on Spring but not using namespaces

    <!-- configure the job registry center -->
    <bean id="regCenter" class="com.dangdang.ddframe.reg.zookeeper.ZookeeperRegistryCenter" init-method="init">
        <constructor-arg>
            <bean class="com.dangdang.ddframe.reg.zookeeper.ZookeeperConfiguration">
                <property name="serverLists" value="${xxx}" />
                <property name="namespace" value="${xxx}" />
                <property name="baseSleepTimeMilliseconds" value="${xxx}" />
                <property name="maxSleepTimeMilliseconds" value="${xxx}" />
                <property name="maxRetries" value="${xxx}" />
            </bean>
        </constructor-arg>
    </bean>

    <!-- configure the job -->
    <bean id="xxxJob" class="com.dangdang.ddframe.job.spring.schedule.SpringJobController" init-method="init">
        <constructor-arg ref="regCenter" />
        <constructor-arg>
            <bean class="com.dangdang.ddframe.job.api.JobConfiguration">
                <constructor-arg name="jobName" value="xxxJob" />
                <constructor-arg name="jobClass" value="xxxDemoJob" />
                <constructor-arg name="shardingTotalCount" value="10" />
                <constructor-arg name="cron" value="0/10 * * * * ?" />
                <property name="shardingItemParameters" value="${xxx}" />
            </bean>
        </constructor-arg>
    </bean>
  • Not using Spring configuration

If you do not use the Spring framework, you can start the job as follows.

import com.dangdang.ddframe.job.api.JobConfiguration;
import com.dangdang.ddframe.job.schedule.JobController;
import com.dangdang.ddframe.reg.base.CoordinatorRegistryCenter;
import com.dangdang.ddframe.reg.zookeeper.ZookeeperConfiguration;
import com.dangdang.ddframe.reg.zookeeper.ZookeeperRegistryCenter;
import com.dangdang.example.elasticjob.core.job.OneOffElasticDemoJob;
import com.dangdang.example.elasticjob.core.job.PerpetualElasticDemoJob;
import com.dangdang.example.elasticjob.core.job.SequencePerpetualElasticDemoJob;

public class JobDemo {

    // the Zookeeper registry center configuration
    private ZookeeperConfiguration zkConfig = new ZookeeperConfiguration("localhost:2181", "elastic-job-example", 1000, 3000, 3);
    
    // the Zookeeper registry center
    private CoordinatorRegistryCenter regCenter = new ZookeeperRegistryCenter(zkConfig);
    
    // configuration for job 1
    private JobConfiguration jobConfig1 = new JobConfiguration("oneOffElasticDemoJob", OneOffElasticDemoJob.class, 10, "0/5 * * * * ?");
    
    // configuration for job 2
    private JobConfiguration jobConfig2 = new JobConfiguration("perpetualElasticDemoJob", PerpetualElasticDemoJob.class, 10, "0/5 * * * * ?");
    
    // configuration for job 3
    private JobConfiguration jobConfig3 = new JobConfiguration("sequencePerpetualElasticDemoJob", SequencePerpetualElasticDemoJob.class, 10, "0/5 * * * * ?");
    
    public static void main(final String[] args) {
        new JobDemo().init();
    }
    
    private void init() {
        // connect to the registry center
        regCenter.init();
        // start job 1
        new JobController(regCenter, jobConfig1).init();
        // start job 2
        new JobController(regCenter, jobConfig2).init();
        // start job 3
        new JobController(regCenter, jobConfig3).init();
    }
}

Usage restrictions

  • Once a job has been started successfully, its job name cannot be modified. If the name is modified, it is treated as a new job.

  • The same job server can run only one instance of the same job, because job runtime instances are registered and managed by IP address.

  • The job obtains its IP address from the /etc/hosts file. If the resolved address is 127.0.0.1 instead of the real IP address, this file must be configured correctly (see the check after this list).

  • Server fluctuation or modification of shard items triggers resharding. When resharding is triggered, running Perpetual and SequencePerpetual jobs finish their current pass and then stop executing until the resharding ends, after which they return to normal.

  • Idempotency of distributed jobs (that is, the same shard item will not run on multiple job servers at once) is only guaranteed when monitorExecution is enabled. However, monitorExecution has a significant performance impact on jobs triggered at short intervals (for example every 5 seconds); for such jobs it is recommended to turn it off and implement idempotency yourself.

  • elastic-job has no function for automatically removing job servers, because it cannot distinguish a crashed server from one taken offline deliberately. To take a server offline, the related server nodes must be deleted manually in Zookeeper. Since deleting server nodes directly is risky, we will not consider adding this function to the operation and maintenance platform for the time being.
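A quick way to check which address a server resolves for itself is shown below. elastic-job's exact resolution logic may differ, but if this prints 127.0.0.1, the hostname mapping in /etc/hosts usually needs fixing:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class IpCheck {

    public static void main(final String[] args) throws UnknownHostException {
        // prints the address the JVM resolves for the local hostname;
        // 127.0.0.1 here usually means /etc/hosts maps the hostname to loopback
        System.out.println(InetAddress.getLocalHost().getHostAddress());
    }
}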

Implementation principle

  • Resilient distributed implementation

    1. When the first server goes online, a master server election is triggered. Once the master server goes offline, the election is triggered again. The election process blocks; other tasks are performed only after the master election completes (a minimal election sketch follows this list).

    2. When a job server goes online, the server information is automatically registered in the registration center, and the server status is automatically updated when it goes offline.

    3. The resharding flag is set when a master node is elected, when a server goes offline, and when the total number of shards changes.

    4. When a scheduled task fires and resharding is needed, the sharding is performed by the master server; the process blocks, and the task executes once sharding ends. If the master server goes offline during sharding, a new master is elected first and sharding then continues.

    5. As point 4 shows, to keep the job runtime stable, only the resharding flag is set while jobs are running; the actual resharding happens only before the next task trigger.

    6. Sharding is ordered by server IP to ensure that the sharding result does not fluctuate greatly between runs.

    7. The failover function lets a server that has finished its own execution actively grab unallocated shard items, and lets available servers actively take over task items from servers that have gone offline.
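The master election in step 1 is built on Zookeeper through Curator. elastic-job's actual election code is internal to the framework, but the idea can be illustrated with Curator's LeaderLatch recipe; this is a sketch only, and the connection string and latch path are placeholders:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class MasterElectionSketch {

    public static void main(final String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient("localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        LeaderLatch latch = new LeaderLatch(client, "/election-sketch/leader");
        latch.start();
        latch.await(); // blocks until this instance is elected master
        if (latch.hasLeadership()) {
            // only the master would perform sharding here
        }
        latch.close();
        client.close();
    }
}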

  • Flow charts

    Job start

    Job execution

 

Operation and maintenance platform

    The elastic-job operation and maintenance platform is provided as a war package and can be deployed to any web container that supports servlets, such as Tomcat or Jetty. elastic-job-console.war can be obtained by compiling the source code or from the Maven central repository.

  • Log in

    The default username and password are root/root. You can change the default login credentials by modifying the conf\auth.properties file.

  • Main functions

    Login security control

    Registry center management

    Job dimension status view

    Server dimension status view

    Quick modification of job settings

    Job pause and resume control

  • Design concept

    The operation and maintenance platform has no direct dependency on elastic-job. It works by reading the job registry center data to display job status, and by updating registry center data to modify global configuration.

    The console can only control whether the job itself runs; it cannot start or stop the job process, because the console and the job servers are completely distributed and the console cannot control the job servers.
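Because the platform only reads and writes registry center data, the same information is reachable with plain Curator. A minimal sketch that lists the servers registered under a job; the /<jobName>/servers znode layout is an assumption for illustration and may differ between versions:

import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RegistryReadSketch {

    public static void main(final String[] args) throws Exception {
        // connect with the same namespace the jobs use ("dd-job" in the example above)
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("yourhost:2181")
                .namespace("dd-job")
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                .build();
        client.start();
        // assumed layout: /<namespace>/<jobName>/servers/<ip>
        List<String> servers = client.getChildren().forPath("/oneOffElasticJob/servers");
        System.out.println("Registered job servers: " + servers);
        client.close();
    }
}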

  • Items not supported

    Adding a job. Jobs are added automatically the first time they run, so there is no need to add them through the operation and maintenance platform.

    Stopping a job. Even deleting the Zookeeper information does not really stop the job from running, and doing so can cause problems for running jobs.

    Deleting a job server. Since deleting server nodes directly is risky, we will not consider adding this function to the operation and maintenance platform for the time being.

  • Main interface

  • Overview page

  • Registry center management page

  • Job details page

  • Server details page
