1. Background
Scheduled tasks are something almost every developer runs into. An e-commerce system may send birthday coupons to users on a schedule, and a reconciliation system may reconcile accounts on a schedule. A long time ago, each service might have run on a single machine, and a Timer schedule on that machine was basically enough to meet our business needs. Times have changed, and a single machine can no longer keep up: we may now need 10, 20, or even more machines to run the business and absorb the traffic, which is what we call horizontal scaling. This raises a problem: what happens if all of those machines still run the same Timer schedule? In the e-commerce example, a single user could be issued many birthday coupons, causing real losses for the company. So we need some other way to make sure a scheduled task executes only once across multiple machines.
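To make the duplication problem concrete, here is a minimal sketch in plain Java. It assumes each "machine" is just a java.util.Timer inside one JVM, and CouponService is a hypothetical stand-in that only counts calls; neither comes from a real system.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the real coupon service; it only counts calls.
class CouponService {
    private final AtomicInteger issued = new AtomicInteger();

    void issueBirthdayCoupon(String userId) {
        issued.incrementAndGet();
    }

    int issuedCount() {
        return issued.get();
    }
}

class DuplicateExecutionDemo {
    // Each Timer plays the role of one machine running the same local schedule.
    static int run(int machines) {
        CouponService coupons = new CouponService();
        CountDownLatch done = new CountDownLatch(machines);
        Timer[] timers = new Timer[machines];
        for (int i = 0; i < machines; i++) {
            timers[i] = new Timer(true); // daemon thread, so the JVM can exit
            timers[i].schedule(new TimerTask() {
                @Override
                public void run() {
                    coupons.issueBirthdayCoupon("user-42");
                    done.countDown();
                }
            }, 10L); // fire once, 10 ms from now
        }
        try {
            done.await(); // wait until every simulated machine has fired
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            for (Timer timer : timers) {
                if (timer != null) {
                    timer.cancel();
                }
            }
        }
        return coupons.issuedCount();
    }

    public static void main(String[] args) {
        System.out.println("coupons issued to one user: " + run(3));
    }
}
```

With three simulated machines the same user is issued three coupons, one per machine, because nothing coordinates the timers.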
Let me ask first: how did you handle scheduled tasks before you knew about or used a distributed task scheduling framework? In a Spring project, everyone surely knows Spring-Scheduler. We only need to add the @Scheduled annotation to the relevant bean method to get a scheduled task. But this annotation alone is far from guaranteeing that the task will not be executed multiple times across machines, so we need some other safeguard. In general, the options come down to the following (all based on Spring projects):
- A single machine: we can dedicate one service instance to the less important scheduled tasks and run them on that one machine. Even if it goes down, as long as we restore it within an acceptable time, the business will not be affected.
- Multiple machines plus a distributed lock: each machine first tries to acquire a distributed lock before executing the task. If the acquisition fails, another instance is already running the task; if it succeeds, no other instance is running it, so the task can be executed.
- Multiple machines with ZooKeeper, running scheduled tasks only on the leader: many businesses already use ZK, so when a scheduled task fires, each machine checks whether it is the leader. If not, it skips the run; if so, it executes the business logic. This also achieves our goal.
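The distributed-lock approach can be sketched roughly as follows. This is only an illustration under stated assumptions: LockStore, InMemoryLockStore, and LockedJob are hypothetical names, and the in-memory map stands in for a real shared store such as Redis, ZooKeeper, or a database. The sketch keys the lock on the job name plus the scheduled fire time and never releases it, on the assumption that a real store would expire the key with a TTL.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical lock store; a real one would live outside the JVM
// (Redis, ZooKeeper, a database row) so that all machines share it.
interface LockStore {
    // Returns true only for the first caller that claims the key.
    boolean tryAcquire(String key, String owner);
}

// In-memory stand-in so the sketch runs in a single JVM.
class InMemoryLockStore implements LockStore {
    private final ConcurrentMap<String, String> holders = new ConcurrentHashMap<>();

    @Override
    public boolean tryAcquire(String key, String owner) {
        return holders.putIfAbsent(key, owner) == null;
    }
}

class LockedJob {
    private final LockStore store;
    private final String nodeId;

    LockedJob(LockStore store, String nodeId) {
        this.store = store;
        this.nodeId = nodeId;
    }

    // Runs the task only if this node is the first to claim this firing.
    // The key is deliberately not released: assume the real store expires it.
    boolean runIfFirst(String jobName, long fireTimeMillis, Runnable task) {
        String key = jobName + ":" + fireTimeMillis;
        if (!store.tryAcquire(key, nodeId)) {
            return false; // another node already took this firing
        }
        task.run();
        return true;
    }
}

class LockedJobDemo {
    // Simulates several nodes reacting to the same scheduled firing.
    static int simulate(int nodes) {
        LockStore store = new InMemoryLockStore();
        AtomicInteger runs = new AtomicInteger();
        long fireTime = 1_700_000_000_000L; // arbitrary example timestamp
        for (int i = 0; i < nodes; i++) {
            new LockedJob(store, "node-" + i)
                    .runIfFirst("birthday-coupons", fireTime, runs::incrementAndGet);
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("task executions: " + simulate(3));
    }
}
```

One trade-off of this design is that a node that claims the firing and then crashes leaves the task unexecuted for that round, so a real implementation would need some retry or compensation logic.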
Our company currently uses all three of these approaches for scheduled tasks. In the early stage of the business they were basically sufficient, but over time we ran into more and more problems, which I share here:
- The first is the single-machine problem. Deciding which business counts as "not very important" is complicated in itself, since everyone is likely to say their own business is important. And if the single machine goes down, whether from a crash or some other cause, guaranteeing recovery within an acceptable time is also difficult.
- When we want a scheduled task to run immediately, we may have to write an extra REST endpoint or a separate Job.
- Changing a task's execution time is costly. For example, to change a task from running every 12 hours to every 6 hours, we have to modify the code, submit a PR, and then package and release it. Changing a single time value takes a lot of effort.
- We cannot pause a scheduled task. When a task misbehaves, for example a scheduled alarm that suddenly fires too often, we need to pause it for a while. Today we might do this with a switch in a distributed configuration system and check in the task logic whether the switch is on. This is relatively simple, but it adds logic unrelated to the task itself.
- Scheduled tasks lack monitoring, so developers have no way of knowing when a task fails. Some will say: isn't there an Error log? But if every single Error log raised an alarm, could your service bear it? Generally an alarm fires only after several consecutive Errors, and because scheduled tasks run periodically, they rarely produce consecutive Errors.
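The monitoring point can be illustrated with a small sketch. It assumes a simplified alarm rule that fires when at least three errors fall within a five-minute window; ErrorBurstAlarm and these thresholds are hypothetical and not taken from any particular alerting system.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical alarm rule: fire when `threshold` errors fall inside `window`.
class ErrorBurstAlarm {
    private final int threshold;
    private final long windowMillis;
    private final Deque<Long> recent = new ArrayDeque<>();

    ErrorBurstAlarm(int threshold, Duration window) {
        this.threshold = threshold;
        this.windowMillis = window.toMillis();
    }

    // Records one error and reports whether the alarm would fire now.
    boolean onError(long timestampMillis) {
        recent.addLast(timestampMillis);
        // Drop errors that are older than the window.
        while (timestampMillis - recent.peekFirst() > windowMillis) {
            recent.removeFirst();
        }
        return recent.size() >= threshold;
    }
}

class AlarmDemo {
    // Feeds `failures` errors spaced `intervalMillis` apart into the rule.
    static boolean firesFor(long intervalMillis, int failures) {
        ErrorBurstAlarm alarm = new ErrorBurstAlarm(3, Duration.ofMinutes(5));
        boolean fired = false;
        for (int i = 0; i < failures; i++) {
            fired |= alarm.onError(i * intervalMillis);
        }
        return fired;
    }

    public static void main(String[] args) {
        long everySecond = Duration.ofSeconds(1).toMillis();
        long everyTwelveHours = Duration.ofHours(12).toMillis();
        System.out.println("request errors 1s apart:  " + firesFor(everySecond, 3));
        System.out.println("job failures 12h apart:   " + firesFor(everyTwelveHours, 10));
    }
}
```

Under this rule, three request errors one second apart trigger the alarm, while a job that fails on every run twelve hours apart never does, even after ten failed runs in a row.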
Of course, there are other smaller problems that are not listed here.
2. Basic principles of research
The first chapter explained why we need a framework. Whatever you want to introduce or improve, you need a reason, because everything has a cost. I often see small projects introducing message queues, distributed transactions, and so on, which is putting the cart before the horse. For example, a blog system might add a message queue for peak shaving and flow control, when a plain synchronous call might well be faster.
Once we have the reasons, we can start the research or design a technical solution. Here are some basic principles I follow when researching a framework. If you need to evaluate a framework in the future, you can apply them too.
- Simple: easy for developers to integrate and easy for users to use.
- Rich documentation: many open source projects have very little documentation, and some only have English documentation. If you are not proficient in English, you may want to favor projects whose documentation is mostly in Chinese.
- A management interface, which makes operations and statistics convenient.
- Support for mainstream frameworks such as Spring and Spring Boot. At a minimum, it must support the frameworks used in your business.
- Lightweight, and easy to customize to your needs.
- High performance, high reliability, and high availability: the framework must not become a bottleneck in the business.
- Code update frequency and community adoption: the more companies use it, the more people have vetted it, and the more frequently the code is updated, the fewer problems tend to remain. Ideally it is open sourced and maintained by a large company.
- Multi-language requirements: if your company uses many development languages and they all need the scheduling framework, you need multi-language support. In the RPC world, for example, the representative multi-language framework is Thrift.
- Whether it solves your current pain points: this is the most important one. If it cannot solve your problem, what is the point of using it?
With these principles in hand, we can begin the research.
3. Research framework
3.1 TBSchedule
When researching Java frameworks, a good first step is to check whether Alibaba has open sourced something, since Alibaba has done very good open source work in recent years. Searching online, I found that in 2012 Alibaba open sourced a scheduling framework called TBSchedule. Looking for the code now, however, I found that the repository has been emptied. There is also a personal fork that is still being maintained, but it has very few users, so I will not cover it here. GitHub address: https://github.com/taobao/TBSchedule
3.2 elastic-job
Elastic-Job is a distributed scheduling solution open sourced by Dangdang. It consists of two independent sub-projects, Elastic-Job-Lite and Elastic-Job-Cloud. Elastic-Job-Lite is positioned as a lightweight decentralized solution that provides coordination services for distributed tasks in the form of jar packages. It supports distributed scheduling coordination, elastic scaling, failover, re-triggering of missed jobs, parallel scheduling, self-diagnosis and repair, and more.
This framework was very popular about two years ago. Many companies used it, and many people will have heard of it. Unfortunately, it is no longer maintained: the code has not been updated for two years, which violates the update-frequency principle. If something goes wrong, there may be no one to help, so we do not really recommend it. GitHub address: https://github.com/elasticjob/elastic-job-lite
3.3 Some relatively niche frameworks
There are also some frameworks with relatively few GitHub stars and infrequent updates, such as Uncode-Schedule, LTS, and openCron. They do not meet our principles and will not be considered.
3.4 XXL-JOB
Unlike message queues, distributed scheduled-task frameworks have no projects hosted by foundations such as CNCF or Apache, so the choice may not be that hard. For message queues, Apache alone has several, such as Kafka, RocketMQ, and Pulsar, each with a huge community, which can make choosing difficult. That leaves us with basically two options. One is to build our own: a task scheduling framework is far less difficult to develop than a message queue, and many companies have in fact built their own, such as Meituan's Crane. For more complex middleware such as message queues, companies tend to choose secondary development instead: Meituan's Mafka is built on Kafka, and Didi's DDMQ is built on RocketMQ. For us, building our own is clearly not feasible in terms of resources, so we again go with secondary development on top of an existing framework.
The other option is XXL-Job (http://www.xuxueli.com/xxl-job), which basically conforms to our principles: the code is still being updated, the author actively replies to issues, and more than 200 companies use it. It also meets the other principles reviewed earlier. In general, when you decide on a framework, you need to list its advantages in detail so that others can be convinced.
XXL-Job has the following features:
- Simple: supports CRUD operations on tasks through a web page; it is simple to operate, and you can get started in one minute;
- Dynamic: supports dynamically modifying task status, starting and stopping tasks, and terminating running tasks, all with immediate effect;
- Scheduling center HA (centralized): scheduling uses a centralized design. The "scheduling center" is built on a self-developed scheduling component and supports cluster deployment, which ensures HA of the scheduling center;
- Executor HA (distributed): tasks are executed in a distributed manner. The task "executor" supports cluster deployment, which ensures HA of task execution;
- Registration center: the executor automatically registers tasks periodically, and the scheduling center automatically discovers the registered tasks and triggers execution. Manual entry of executor addresses is also supported;
- Elastic scaling: once an executor machine goes online or offline, tasks are reassigned the next time they are scheduled;
- Routing strategy: the executor cluster provides rich routing strategies, including first, last, round robin, random, consistent hash, least frequently used, least recently used, failover, and busy transfer;
- Failover: when the task routing strategy is set to "Failover" and a machine in the executor cluster fails, scheduling requests are automatically switched to a healthy executor;
- Blocking handling strategy: the strategy applied when scheduling is too dense for the executor to keep up. The options are single-machine serial (default), discard later scheduling, and cover earlier scheduling;
- Event triggering: in addition to triggering tasks in "Cron mode" and "task dependency mode", event-based triggering is supported. The scheduling center provides an API that triggers a single execution of a task, so tasks can be triggered flexibly by business events;
- Task progress monitoring: supports real-time monitoring of task progress;
- Rolling real-time log: supports viewing scheduling results online, and viewing the complete execution log output by the executor in real time in rolling mode.
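A few of the routing strategies named in the list can be sketched in plain Java to show what they mean. The interface and class names below (RouteStrategy, FirstRoute, and so on) are hypothetical and are not XXL-Job's actual classes; the liveness check is passed in as an assumed predicate rather than a real heartbeat call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical interface: pick one executor address for the next trigger.
interface RouteStrategy {
    String route(List<String> executors);
}

// "First": always the first registered executor.
class FirstRoute implements RouteStrategy {
    @Override
    public String route(List<String> executors) {
        return executors.get(0);
    }
}

// "Last": always the last registered executor.
class LastRoute implements RouteStrategy {
    @Override
    public String route(List<String> executors) {
        return executors.get(executors.size() - 1);
    }
}

// "Round robin": cycle through the executors in order.
class RoundRobinRoute implements RouteStrategy {
    private final AtomicInteger counter = new AtomicInteger();

    @Override
    public String route(List<String> executors) {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), executors.size());
        return executors.get(index);
    }
}

// "Failover": the first executor that passes the (assumed) liveness check.
class FailoverRoute implements RouteStrategy {
    private final Predicate<String> isAlive;

    FailoverRoute(Predicate<String> isAlive) {
        this.isAlive = isAlive;
    }

    @Override
    public String route(List<String> executors) {
        for (String address : executors) {
            if (isAlive.test(address)) {
                return address;
            }
        }
        throw new IllegalStateException("no live executor available");
    }
}

class RouteDemo {
    public static void main(String[] args) {
        // Example addresses only.
        List<String> executors = Arrays.asList("10.0.0.1:9999", "10.0.0.2:9999", "10.0.0.3:9999");
        RouteStrategy roundRobin = new RoundRobinRoute();
        for (int i = 0; i < 4; i++) {
            System.out.println("round robin -> " + roundRobin.route(executors));
        }
        RouteStrategy failover = new FailoverRoute(address -> !address.startsWith("10.0.0.1"));
        System.out.println("failover    -> " + failover.route(executors));
    }
}
```

Keeping each strategy behind one small interface is what lets the choice be a per-task setting in a management page instead of a code change.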
Our business needs most of the features above, so XXL-Job is our final choice.
4. Summary
As the saying goes, it is better to teach a man to fish than to give him a fish. My previous articles each introduced some particular framework; this time I preferred to show how I chose one, so that you can follow the same approach in your own research. If you have good or different ideas about how to research a framework, you are welcome to leave a comment or join the group to discuss. Of course, once the general research is done, a researcher who does not understand the framework's source code and implementation principles is not a qualified one, so in the next article I will explain the implementation principles of XXL-Job in detail.
If you found this article helpful, following and sharing it are the greatest support you can give me. O(∩_∩)O