Real-time and offline batch runs of financial task instances: three application scenarios and five optimizations of Apache DolphinScheduler at Xinwang Bank


At Xinwang Bank, a large number of task instances are generated every day, most of them real-time tasks. After comprehensive evaluation, Xinwang Bank chose Apache DolphinScheduler to handle this challenge. Today, many of the bank's projects have completed real-time and quasi-real-time batch runs, along with offline batch runs for the indicator management system. These applications fall into three scenario types: offline data development and task scheduling, quasi-real-time data development and task scheduling, and other non-ETL user-defined data batch runs.

To better adapt to business needs, how did Xinwang Bank transform Apache DolphinScheduler? At the April Meetup of Apache DolphinScheduler, Chen Wei, a senior big data engineer from Xinwang Bank's Big Data Center, shared "The Practical Application of Apache DolphinScheduler at Xinwang Bank".

 This sharing is divided into four parts:

  1. Background of Xinwang Bank's adoption of Apache DolphinScheduler

  2. Application scenarios of Apache DolphinScheduler

  3. Optimization and Transformation of Xinwang Bank

  4. Subsequent plans for Xinwang Bank to use Apache DolphinScheduler



Chen Wei 

Senior Big Data Engineer, Xinwang Bank Big Data Center

With 11 years of work experience, Chen Wei built data warehouses early in his career before moving to big data platform and scheduling system construction. He has experience in the traditional financial industry and in Internet data warehouse and data mart construction, plus many years of scheduling system work, including the design of Migu Culture's analysis cloud scheduling system and a reporting platform. He is currently responsible for building Xinwang Bank's DataOps-related systems (offline development, indicator system, label system).

01

Background Introduction

We chose to use Apache DolphinScheduler mainly based on three requirements: unification of R&D scenarios, optimization of test scenarios, and optimization of production deployment scenarios.

01

R&D Scenarios

In the past, we had no unified tool for data development, so Xinwang Bank's developers had to switch back and forth between multiple tools, which made development too costly;

On the other hand, our parameter-replacement needs during development could not be met, ad-hoc debugging was impossible, and no ready-made tool supported offline tasks across both development and production states.

02

Testing Scenarios

In test deployment, when developers handed scripts over to testers, the accompanying documentation was quite unfriendly. Especially with multi-version, multi-scenario deployments, testers' workload rose sharply, and visual deployment support was weak, making friendly automated testing impossible.

03

Production Deployment

  • The current scheduling system configuration is complicated and the visualization effect is poor;

  • The development and production environment networks are physically isolated, so the process of deploying code from the development environment to the production environment is long and error-prone. The test environment cannot fully reflect the configuration of the production environment, and manual configuration files are prone to errors and missing configurations;

  • Operation and maintenance monitoring capability is insufficient, visualization is poor, and logs cannot be viewed online. To troubleshoot, staff must enter the monitoring room and log in to the physical machine, which is a complicated process.

02

Application Scenarios

We apply Apache DolphinScheduler in three main scenario categories: offline data development and task scheduling, quasi-real-time data development and task scheduling, and other non-ETL user-defined data batch runs.

01

Offline data development and task scheduling

In offline data development and task scheduling, DolphinScheduler is mainly used in our banking data warehouse and data marts. The data includes offline data processed in daily and monthly batches.

02

Quasi-real-time data development and task scheduling

At Xinwang Bank, quasi-real-time data comes from the logs of upstream message queues and databases, fused and computed through Flink. After the relevant dimension information is filled in, the data is pushed to ClickHouse for processing. Batch calculations run at the minute level, which brings some special requirements compared with daily batch scheduling.

03

Other non-ETL user-defined data running batches

This part is implemented through internal low-code platforms. We open the application to business personnel so they can analyze the data themselves without involving developers; once business users define the logic, they can run batches on this data on their own.

1. Offline data development and task scheduling


We apply Apache DolphinScheduler to offline data development and task scheduling across five areas: task development mode, historical task integration, separation of workflows and tasks, project environment variables, and data source lookup.

1. Task development modes (SQL, SHELL, PYTHON, XSQL, etc.) and an online development mode (viewing logs and SQL query results online). The WEB IDE automatically replaces pop-up variables, substituting them dynamically according to user settings or default handling.
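As a rough sketch of the pop-up variable replacement described above (the function name and fallback logic are our illustration, not DolphinScheduler's actual implementation), a `${name}` placeholder can be filled from user input with a fallback to project defaults:

```python
import re

def substitute_params(script: str, user_values: dict, defaults: dict) -> str:
    """Replace ${name} placeholders with user-supplied values, falling
    back to project defaults; unknown placeholders are left untouched."""
    def repl(match):
        name = match.group(1)
        if name in user_values:
            return str(user_values[name])
        if name in defaults:
            return str(defaults[name])
        return match.group(0)  # no value known: keep the placeholder
    return re.sub(r"\$\{(\w+)\}", repl, script)

sql = "SELECT * FROM trades WHERE dt = '${biz_date}' AND org = '${org_id}'"
rendered = substitute_params(sql, {"biz_date": "2022-04-30"}, {"org_id": "001"})
# rendered == "SELECT * FROM trades WHERE dt = '2022-04-30' AND org = '001'"
```

In the real tool the user-supplied values would come from the pop-up dialog, and the defaults from the project configuration.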

2. Historical task integration

Most data warehouses in the banking industry have been running for four or five years and carry many historical tasks. We did not want users to have to modify their code when our new system launched, because that would impose a high migration cost on them.

3. Separation of workflow and tasks

Developers develop, debug, and test tasks directly, and workflows simply reference the developed tasks. This decouples task development from task orchestration.

4. Project environment variables

We added project environment variables. They apply by default to all jobs in a project, so there is no need to configure them in each workflow; every workflow in the project can reference them directly.

5. Data source

We look up data sources by name and support sources such as Phoenix. Going forward, we want tasks to be importable and exportable without altering their parameter definitions or data source references, so that deployments can go straight from test to production, keeping production rollout simple.

2. Quasi-real-time tasks


  1. Task development and debugging (SQL), online development and debugging (viewing logs and SQL query results online), and pop-up replacement of script variables in the WEB IDE.

  2. ClickHouse data source HA configuration support. There is still a small problem in offline batch runs: if the current port is unavailable, an error may be thrown directly, so extra handling is needed here.

  3. Quasi-real-time workflows run as a single instance: if an initialized or running workflow instance already exists, the next trigger will not start another run.
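For the ClickHouse HA port issue in point 2, one possible shape for the "additional processing" (a sketch only; the replica list and probing approach are our assumptions, not the production fix) is to probe the configured replicas and connect to the first reachable one:

```python
import socket

def pick_available_endpoint(endpoints, timeout=2.0):
    """Return the first (host, port) pair that accepts a TCP connection.
    Failover sketch: instead of erroring out when the current ClickHouse
    port is down, try the next configured replica."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue  # this replica is unreachable; try the next one
    raise ConnectionError("no ClickHouse replica reachable")

# Hypothetical replica list; in practice this would come from the
# HA configuration of the data source.
replicas = [("ch-node1.example", 8123), ("ch-node2.example", 8123)]
```

The actual connection would then be opened against the endpoint returned, rather than a single hard-coded host and port.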

3. Other non-ETL user-defined data running batches


1. We currently have model data calculation tasks pushed from the indicator management platform. For these user-defined simple reports, the platform will dynamically generate SQL and then push them directly to offline scheduling. In the future this process will not involve developers.

2. In the label management system, we mainly adapt by generating special plug-in tasks.

03

Optimization and Transformation

1. Current status of Xinwang Bank


At Xinwang Bank, about 9,000+ task instances are generated every day, most of them real-time tasks. Today we have used Apache DolphinScheduler to complete real-time and quasi-real-time batch runs in many projects and offline batch runs for the indicator management system, including batch runs for an integrated internal SQL tool that supports XSQL.


As the screenshot shows, we have completed task independence and two-pass parameter replacement. In addition, for task lineage, especially on SQL-type tasks, we can parse it automatically or add it manually. This mainly serves automatic workflow orchestration and features such as the company's internal task maps.

To meet these business needs, we made the following five optimizations to Apache DolphinScheduler, and we also list the corresponding modifications to watch out for during the transformation:

  1. Projects handle the different scenario types (development, testing) through environments;

  2. Environment variables are isolated by project and environment, but variable names stay consistent across environments;

  3. Data sources are isolated by project and environment, but data source names stay consistent across environments;

  4. Add non-JDBC data sources such as ES and Livy, because internally transparent applications need Livy as a data service framework to connect to Spark jobs for data desensitization.
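The isolation rules in points 2 and 3 can be pictured as a lookup keyed by project, environment, and data source name. Everything below (names, URLs, the registry shape) is a hypothetical illustration rather than DolphinScheduler's actual schema:

```python
# Hypothetical registry: data sources are keyed by
# (project, environment, name), and the *name* stays the same across
# environments, so a promoted workflow needs no changes.
DATASOURCES = {
    ("risk_mart", "dev",  "ch_main"): "clickhouse://dev-ch:8123/risk",
    ("risk_mart", "prod", "ch_main"): "clickhouse://prod-ch:8123/risk",
}

def resolve_datasource(project: str, env: str, name: str) -> str:
    """Look up a connection string; only the environment tag changes
    between deployments, never the data source name."""
    try:
        return DATASOURCES[(project, env, name)]
    except KeyError:
        raise LookupError(f"datasource {name!r} not defined for {project}/{env}")
```

A workflow always refers to `"ch_main"`; which physical cluster that resolves to is decided by the environment it runs in.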

2. Independent tasks


  • Independent task development, debugging, and configuration pages that support project environment variables

  • JDBC and XSQL tasks can reference data sources by name

  • Interactive WEB IDE debugging and development

  • Parameter handling: support user-defined ${parameter} variables and references to built-in time functions

  • Independent automatic lineage parsing for SQL and XSQL

  • Automatic SQL parameter parsing

3. Workflow startup logic optimization


  • Quasi-real-time workflows run as a single instance; if a workflow instance is already running, a new trigger is ignored

  • Added an environment control strategy: a workflow references different environment variables and data source connections depending on the environment. For example, with the disaster recovery and production environments configured in advance, one click switches to disaster recovery if production fails.

  • Fixed scheduling problems caused by separating workflows from tasks, mainly anomaly detection
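The single-instance rule in the first bullet can be sketched as a simple guard; the state names and shape are illustrative, not DolphinScheduler's internal model:

```python
from enum import Enum

class State(Enum):
    INITIALIZED = "initialized"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

# Instances in these states block a new quasi-real-time trigger.
ACTIVE_STATES = {State.INITIALIZED, State.RUNNING}

def should_trigger(existing_instance_states) -> bool:
    """Single-instance rule: skip the new batch trigger if any workflow
    instance is still initializing or running."""
    return not any(s in ACTIVE_STATES for s in existing_instance_states)
```

With this guard, a minute-level schedule that falls behind simply skips ticks instead of piling up overlapping instances.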

4. Import and export optimization


  • Added import and export of tasks, task configurations, resource files, etc.

  • Because development and test networks in banking are often isolated from the production network, multi-environment deployment requires exporting workflow and resource file information as relatively friendly resource scripts.

  • Added workflow import/export logic to handle data conflicts caused by the auto-increment IDs of different database instances

  • Guided import and export with version management, mainly for handling emergencies and rolling back code
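The auto-increment ID conflict in the third bullet can be handled by remapping IDs on import, along the lines of this sketch (the record shape and allocator are assumptions for illustration):

```python
import itertools

def remap_ids(tasks, id_allocator):
    """On import, assign fresh IDs from the target instance and rewrite
    upstream references, since auto-increment IDs differ across database
    instances. `tasks` is a list of dicts; the shape is illustrative."""
    mapping = {t["id"]: next(id_allocator) for t in tasks}
    return [
        {
            "id": mapping[t["id"]],
            "name": t["name"],
            "upstream_ids": [mapping[u] for u in t["upstream_ids"]],
        }
        for t in tasks
    ]

exported = [
    {"id": 7, "name": "extract", "upstream_ids": []},
    {"id": 9, "name": "load", "upstream_ids": [7]},
]
imported = remap_ids(exported, itertools.count(100))
# 'load' now references the freshly assigned ID of 'extract'
```

The same remapping would apply to workflow, schedule, and resource references so that an export from one environment imports cleanly into another.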

5. Improvement and optimization of the alarm system


  • Integrated with Xinwang Bank's internal alerting system; by default, alerts go to the task creator and to users subscribed to the alert group

  • Added policy alarms (start delay, completion delay) for key tasks
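The policy alarms above can be sketched as a check like the following; the field names and rules are illustrative, not the bank's actual implementation:

```python
from datetime import datetime, timedelta

def check_delays(now, expected_start, started_at, expected_finish, finished_at):
    """Raise a start-delay alarm if a key task has not started by its
    expected start time, and a completion-delay alarm if it has not
    finished by its expected finish time."""
    alarms = []
    if started_at is None and now > expected_start:
        alarms.append("start_delay")
    if finished_at is None and now > expected_finish:
        alarms.append("completion_delay")
    return alarms

t0 = datetime(2022, 5, 1, 2, 0)   # expected start 02:00
t1 = t0 + timedelta(hours=1)      # expected finish 03:00
# At 02:10 the task has not started yet -> start delay only
print(check_delays(datetime(2022, 5, 1, 2, 10), t0, None, t1, None))
```

A scheduler-side monitor would run such a check periodically for key tasks and route the resulting alarm names into the internal alerting system.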

6. Connect with internal system


  • Model task running and monitoring

  • Report-push task running and monitoring

  • Integration with the internal IAM SSO unified login authentication system

  • Restricting specific functions (code editing, workflow running, task running, etc.) depending on the network

The financial industry has a special constraint: production work must be done in a specific machine room. We must restrict certain operations to that room, but we also want to reduce the cost of each fix. Our goal is for developers to view logs and make fixes directly on the office network, then go to the machine room only to deploy to production once the fix is complete.


As shown in the figure above, we automatically build reports based on dimensional modeling theory. After configuration, we consolidate and compute across multiple tables according to the configured report logic, then push the aggregated results to the report server. Business users can then aggregate data directly with the basic functions we provide, without writing SQL, which also saves them the discomfort of raising ad-hoc requirements with us.

04

Follow-up Plans

  1. Promote the offline data R&D platform to more project teams

  2. Gradually replace the bank's existing scheduling system, smoothly migrating all offline tasks

  3. Push the scheduling system down the stack and integrate it with the bank's data R&D management system

Technical Goals


  1. A more intelligent, automated task scheduling and orchestration system that lowers the barrier to using the scheduler

  2. Operation monitoring and forecasting: provide development and operations personnel with friendlier monitoring and task completion-time prediction

  3. A global view of offline tasks for development and operations personnel, with data lineage and impact analysis

  4. Further integrate the bank's customized configuration modules to reduce development costs

  5. Integration with the data quality management platform

  6. User-defined plug-in task support

Thank you everyone, that's all for my sharing today.

Participate in contribution

With the rapid rise of open source in China, the Apache DolphinScheduler community has developed vigorously. To make scheduling better and easier to use, we sincerely welcome everyone who loves open source to join the community, contribute to the rise of Chinese open source, and help local open source go global.


There are many ways to participate in the DolphinScheduler community, including:


Contributing your first PR (documentation or code) can and should be simple; its purpose is to familiarize you with the submission process and community collaboration, and to let you feel the friendliness of the community.

The community has put together the following list of issues for beginners: https://github.com/apache/dolphinscheduler/issues/5689

List of non-novice issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22volunteer+wanted%22

How to participate in the contribution link: https://dolphinscheduler.apache.org/zh-cn/docs/development/contribute.html

Come join us: the DolphinScheduler open source community needs your participation in the rise of Chinese open source. Even a single small tile matters; combined, the power is huge.

Participating in open source lets you work closely with experts from all walks of life and quickly improve your skills. If you want to contribute, we have a seed incubation group for contributors. Add the community assistant on WeChat (Leonard-ds) for guidance; whatever your level, your questions will be answered, and the key is a heart willing to contribute.

Please indicate that you want to participate in the contribution when adding the WeChat Assistant.

Come on, the open source community is looking forward to your participation.

Activity recommendation

Summer of Open Source is a summer open source activity initiated and long supported by the "Open Source Software Supply Chain Lighting Project", jointly organized by the Institute of Software at the Chinese Academy of Sciences and the openEuler community. It aims to encourage students to actively participate in developing and maintaining open source software, promote the vigorous development of excellent open source communities, and cultivate and discover more outstanding developers.

Students can independently choose the projects they are interested in to apply for, and get personal guidance from community mentors after they are selected. Depending on the difficulty and completion of the project, participants will also receive Open Source Summer bonuses and project completion certificates.

Official website of Open Source Summer: https://summer.iscas.ac.cn/

Welcome to join the exchange group.




Origin blog.csdn.net/DolphinScheduler/article/details/124811919