Xinwang Bank generates a large number of task instances every day, the majority of them real-time. After comprehensive evaluation, the bank chose Apache DolphinScheduler to handle this challenge. Today, many of Xinwang Bank's projects run their real-time and quasi-real-time batches, as well as the offline batches of the indicator management system, on it. The platform is applied in three types of scenarios: offline data development and task scheduling, quasi-real-time data development and task scheduling, and other non-ETL user-defined batch data runs.
How did Xinwang Bank adapt Apache DolphinScheduler to its business needs? At the April Apache DolphinScheduler Meetup, Chen Wei, a senior big data engineer from the Big Data Center of Xinwang Bank, shared "Practical Application of Apache DolphinScheduler in Xinwang Bank".
This sharing is divided into four parts:
Background: why Xinwang Bank introduced Apache DolphinScheduler
Application scenarios of Apache DolphinScheduler
Optimization and transformation at Xinwang Bank
Xinwang Bank's follow-up plans for Apache DolphinScheduler
Chen Wei
Senior Big Data Engineer, Xinwang Bank Big Data Center
11 years of work experience. He started out building data warehouses, then moved to big data platform infrastructure and scheduling systems. His background spans data warehouse and data mart construction in both traditional finance and Internet companies, with many years of scheduling system experience, including the scheduling system design for Migu Culture's Analysis Cloud and report platform design. He is currently responsible for building the systems around Xinwang Bank's DataOps stack (offline development, indicator system, label system).
01
Background Introduction
We chose to use Apache DolphinScheduler mainly based on three requirements: unification of R&D scenarios, optimization of test scenarios, and optimization of production deployment scenarios.
01
R&D scenario
In the past we had no unified development tool for data development, so Xinwang Bank's developers had to switch back and forth between multiple tools, which drove up development costs.
On the other hand, our need for parameter replacement during development was not met, ad-hoc debugging was impossible, and no ready-made tool supported offline tasks in both the development state and the production state.
02
Testing scenario
In the testing scenario, the scripts our developers hand over to testers come with rather unfriendly documentation. Especially when multiple versions must be deployed across multiple scenarios, the testers' workload increases sharply; visual deployment support is weak, so friendly automated testing is impossible.
03
Production deployment
The incumbent scheduling system's configuration is complicated and its visualization is poor.
The development and production networks are physically isolated, so deploying code from the development environment to production is a long, error-prone process. The test environment cannot fully reflect the production configuration, and hand-maintained configuration files are prone to errors and omissions.
Operations monitoring is insufficient, visualization is poor, and logs cannot be viewed online. To troubleshoot, you must enter the monitoring room and log in to the physical machine, a cumbersome process.
02
Application Scenario
The scenarios where we apply Apache DolphinScheduler fall into three main categories: offline data development and task scheduling, quasi-real-time data development and task scheduling, and other non-ETL user-defined batch data runs.
01
Offline data development and task scheduling
In offline data development and task scheduling, Apache DolphinScheduler is mainly used in our bank's data warehouse and data marts, processing offline data such as daily and monthly batches.
02
Quasi real-time data development and task scheduling
At Xinwang Bank, quasi-real-time data is produced by Flink, which merges and computes over database change logs from the upstream message queue, enriches them with the relevant dimension information, and pushes the results to ClickHouse for processing. Batch calculations run at the minute level, which brings some special requirements compared with daily batch scheduling.
03
Other non-ETL user-defined batch data runs
We implement this category through some internal low-code platforms. The application system is opened to business personnel, who can analyze application data by themselves without involving developers: once they have defined the logic, they can run batches over that data on their own.
1. Offline data development and task scheduling
In offline data development and task scheduling, our use of Apache DolphinScheduler mainly involves five aspects: task development mode, historical task integration, separation of workflows and tasks, project environment variables, and data source lookup.
1. Task development modes (SQL, SHELL, PYTHON, XSQL, etc.) and an online development mode (view logs and SQL query results online). The WebIDE pops up variables for replacement automatically, substituting them dynamically according to user settings, with defaults applied otherwise.
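As a rough illustration of the pop-up variable replacement described above, the sketch below substitutes `${name}` placeholders with user-entered values and falls back to project defaults. The function and field names are hypothetical, not DolphinScheduler's actual implementation.

```python
import re

def render_script(script: str, user_params: dict, defaults: dict) -> str:
    """Replace ${name} placeholders with user-supplied values,
    falling back to project defaults when the user left one unset."""
    def substitute(match):
        name = match.group(1)
        if name in user_params:
            return str(user_params[name])
        if name in defaults:
            return str(defaults[name])
        raise KeyError(f"undefined parameter: {name}")
    return re.sub(r"\$\{(\w+)\}", substitute, script)

sql = "SELECT * FROM trades WHERE dt = '${biz_date}' LIMIT ${row_limit}"
print(render_script(sql, {"biz_date": "2022-04-01"}, {"row_limit": 100}))
# SELECT * FROM trades WHERE dt = '2022-04-01' LIMIT 100
```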
2. Historical task integration
Most banking data warehouses have been running for four or five years and carry many historical tasks. We did not want users to have to modify their code when our new system went live, because that would impose a high migration cost on them.
3. Separation of workflow and tasks
Developers build, debug, and test tasks directly, and workflows simply reference the developed tasks; this decouples task development from task orchestration.
4. Project environment variables
We added project environment variables. They apply to all jobs in the project by default, so there is no need to configure them in every workflow; each project can reference them directly.
5. Data source
We look up data sources by name, and support sources such as Phoenix. Going forward, we want tasks to be importable and exportable without changing the parameter definitions and data source references inside them, so that what was tested can go straight to production, keeping production deployment simple.
2. Quasi-real-time tasks
Task development and debugging (SQL), online development and debugging (view logs and SQL query results online), and pop-up replacement of script variables in the WebIDE.
Integrated support for ClickHouse data source HA configuration. There is a small catch in offline batch runs: if the current port is unavailable, an error may be thrown directly, so extra handling is needed here.
Quasi-real-time workflows run as a single instance: if there is an initializing or running workflow instance, the workflow will not be triggered again even when the next batch is due.
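The single-instance rule can be sketched as a simple guard before triggering. The state names below are illustrative, not DolphinScheduler's actual enum values.

```python
# Hypothetical instance states; real DolphinScheduler state names may differ.
ACTIVE_STATES = {"INITIALIZING", "RUNNING"}

def should_trigger(instances: list) -> bool:
    """Trigger a new workflow instance only when no instance of the
    same workflow is still initializing or running."""
    return not any(i["state"] in ACTIVE_STATES for i in instances)

print(should_trigger([{"state": "SUCCESS"}]))                        # True
print(should_trigger([{"state": "SUCCESS"}, {"state": "RUNNING"}]))  # False
```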
3. Other non-ETL user-defined batch data runs
1. We currently receive model data calculation tasks pushed from the indicator management platform. For these user-defined simple reports, the platform dynamically generates SQL and pushes it directly to offline scheduling; in the future this process will not involve developers at all.
2. In the label management system, we mainly adapt by generating special plug-in tasks.
03
Optimization and transformation
1. Current status of Xinwang Bank
At Xinwang Bank, about 9,000+ task instances are generated every day, the majority of them real-time. Today we have used Apache DolphinScheduler to complete real-time and quasi-real-time batch runs across many projects and the offline batch runs of the indicator management system, including batch runs for our integrated internal SQL tool that supports XSQL.
As the screenshot on the right shows, we have made tasks independent and applied two rounds of parameter replacement. For task lineage, especially for SQL-type tasks, we can parse it automatically or add it manually; this mainly powers the automatic orchestration of our workflows, such as company-internal task maps.
To meet the business needs above, we made the following five optimizations to Apache DolphinScheduler, and list the corresponding changes that deserve attention during the transformation.
Projects distinguish different scenario types (development, testing) through environments;
Environment variables are isolated by project and environment, but their names stay consistent across environments;
Data sources are isolated by project and environment, but their names stay consistent across environments;
Added non-JDBC data sources such as ES and Livy, because our internally transparent applications need Livy as a data service framework to connect to Spark jobs for data desensitization.
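Keeping data source names consistent across environments means a workflow can reference a logical name and never change when promoted. A minimal sketch of such name-based resolution, with a hypothetical registry and URLs:

```python
# Hypothetical registry: the same logical name resolves to a different
# physical connection in each environment, so workflows reference data
# sources by name only and stay unchanged from dev through prod.
DATASOURCES = {
    "dev":  {"ods_mysql": "jdbc:mysql://dev-db:3306/ods"},
    "test": {"ods_mysql": "jdbc:mysql://test-db:3306/ods"},
    "prod": {"ods_mysql": "jdbc:mysql://prod-db:3306/ods"},
}

def resolve(env: str, name: str) -> str:
    """Look up the physical connection for a logical data source name."""
    return DATASOURCES[env][name]

print(resolve("test", "ods_mysql"))
# jdbc:mysql://test-db:3306/ods
```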
2. Independent tasks
Developed independent task development, debugging, and configuration pages, with support for project environment variables
JDBC and XSQL tasks can reference data sources by data source name
Developed interactive WebIDE debugging
Completed parameter optimization, supporting user ${parameter} placeholders and references to built-in system time functions
Completed independent automatic lineage parsing for SQL and XSQL
Completed automatic SQL parameter parsing
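The built-in time functions mentioned above let a task reference dates relative to the schedule time. A simplified sketch in the spirit of DolphinScheduler's `$[...]` time parameters, supporting only a `yyyyMMdd` pattern with an optional day offset (the real engine supports far more formats):

```python
from datetime import datetime, timedelta
import re

def render_time_params(text: str, biz_date: datetime) -> str:
    """Render simplified $[yyyyMMdd(+/-N)] expressions against a business
    date. A sketch only, not DolphinScheduler's actual parser."""
    def substitute(match):
        offset = int(match.group(1) or 0)  # optional +N / -N day offset
        return (biz_date + timedelta(days=offset)).strftime("%Y%m%d")
    return re.sub(r"\$\[yyyyMMdd([+-]\d+)?\]", substitute, text)

print(render_time_params("dt=$[yyyyMMdd-1]", datetime(2022, 4, 15)))
# dt=20220414
```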
3. Workflow startup logic optimization
Quasi-real-time workflows run as a single instance: if a workflow instance is already running, the new trigger is ignored
Added an environment control strategy: a workflow references different environment variables and data source connections depending on the environment. For example, with the disaster recovery and production environments configured in advance, we can switch to disaster recovery with one click if production has a problem
Optimized the scheduling problems caused by separating workflows from tasks, mainly anomaly detection
4. Import and export optimization
Added import and export of tasks, task configurations, resource files, etc.
Since development and test networks in banking and finance are often separated from the production network, handling data across multiple environments requires exporting a reasonably friendly bundle of scripts, workflows, and resource file information.
Added workflow import and export logic to handle data conflicts caused by the auto-increment IDs of different database instances
Versioned import and export with version management, mainly to handle emergencies and roll back code
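The auto-increment ID conflict is the classic pitfall of moving rows between database instances: the source's IDs may already be taken in the target. One common approach, sketched below with hypothetical field names, is to assign fresh IDs on import and rewrite all internal references through a mapping:

```python
def remap_ids(tasks: list, next_id: int) -> list:
    """On import, assign fresh IDs from the target database and rewrite
    dependency references, so auto-increment IDs from the source instance
    never collide with existing rows. Field names are illustrative."""
    id_map = {}
    imported = []
    for task in tasks:                       # first pass: new IDs
        id_map[task["id"]] = next_id
        imported.append({**task, "id": next_id})
        next_id += 1
    for task in imported:                    # second pass: fix references
        task["depends_on"] = [id_map[d] for d in task.get("depends_on", [])]
    return imported

tasks = [{"id": 7, "depends_on": []}, {"id": 9, "depends_on": [7]}]
print(remap_ids(tasks, next_id=100))
# [{'id': 100, 'depends_on': []}, {'id': 101, 'depends_on': [100]}]
```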
5. Improvement and optimization of the alarm system
Integrated with Xinwang Bank's internal alarm system; by default, alerts go to the task creator and to users subscribed to the alarm group
Added policy alarms (start delay, completion delay), i.e. start and completion delay alerts for key tasks
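A start-delay or completion-delay alarm amounts to comparing the current time against the scheduled time plus a configured tolerance. A hedged sketch, with hypothetical field names rather than DolphinScheduler's actual schema:

```python
from datetime import datetime, timedelta

def delay_alarms(task: dict, now: datetime) -> list:
    """Flag a task that has not started (or finished) within its
    configured delay tolerance after the scheduled time."""
    alarms = []
    if task["started_at"] is None and now > task["scheduled_at"] + task["start_sla"]:
        alarms.append("start delay")
    if task["finished_at"] is None and now > task["scheduled_at"] + task["finish_sla"]:
        alarms.append("completion delay")
    return alarms

task = {
    "scheduled_at": datetime(2022, 4, 15, 1, 0),
    "start_sla": timedelta(minutes=10),
    "finish_sla": timedelta(hours=1),
    "started_at": None,
    "finished_at": None,
}
print(delay_alarms(task, datetime(2022, 4, 15, 1, 30)))
# ['start delay']
```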
6. Connect with internal system
Model task running and monitoring
Report push task operation and monitoring
Connect to the internal IAM SSO unified login authentication system
Depending on the network, limit specific functions (code editing, workflow running, task running, etc.)
There is a special constraint in the financial industry: production operations must be done from a specific computer room. We must restrict certain operations to the computer room, yet we also want to reduce the cost of making a fix. Our hope is that developers can view logs and repair code directly from the office network, and only go to the computer room for the production release after the fix is done.
As shown in the figure above, we automatically build reports based on dimensional model theory. After configuration, we consolidate and compute over multiple tables according to the configured report logic, then push the aggregated results to the report server. Business users can then aggregate data directly with the basic functions we provide, without writing SQL, which also spares them the discomfort of raising ad-hoc requirements to us.
04
Follow-up Plans
Promote the offline data R&D platform to more project teams
Gradually replace the bank's existing scheduling system to achieve a smooth migration of all offline tasks
Sink the scheduling system down as a shared layer and integrate it with the bank's data R&D management system
Technical goals
A more intelligent and automated task scheduling and orchestration system that lowers the barrier to using the scheduler on the user side
Operations monitoring and prediction: give operations and maintenance personnel friendlier monitoring, task completion time prediction, and similar functions
A global view of offline tasks for development and operations personnel, with data lineage and impact analysis
Further integrate the bank's customized configuration modules to reduce development costs for developers
Integration with the data quality management platform
Support for user-defined plug-in tasks
Thank you everyone, that's all for my sharing today.
Participate in contribution
With the rapid rise of open source in China, the Apache DolphinScheduler community has developed vigorously. To build a better, easier-to-use scheduler, we sincerely welcome partners who love open source to join the community, contribute to the rise of open source in China, and help local open source go global.
There are many ways to participate in the DolphinScheduler community, including:
Contributing your first PR (documentation or code). We deliberately keep it simple: the first PR is for familiarizing yourself with the submission process and community collaboration, and feeling how friendly the community is.
The community has put together the following list of issues for beginners: https://github.com/apache/dolphinscheduler/issues/5689
List of non-novice issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22volunteer+wanted%22
How to participate in the contribution link: https://dolphinscheduler.apache.org/zh-cn/docs/development/contribute.html
Come on, the DolphinScheduler open source community needs your participation. Every contribution, even a single small tile, adds up to huge combined power.
Participating in open source lets you work closely with experts from all walks of life and improve your skills quickly. If you want to contribute, we have a seed incubation group for contributors. Add the community assistant on WeChat (Leonard-ds) for hands-on guidance; whatever your level, your questions will be answered, and the key is a heart willing to contribute.
Please indicate that you want to participate in the contribution when adding the WeChat Assistant.
Come on, the open source community is looking forward to your participation.
Activity recommendation
Summer of Open Source is a summer open source activity initiated and long supported by the "Open Source Software Supply Chain Lighting Project", jointly organized by the Institute of Software, Chinese Academy of Sciences and the openEuler community. It aims to encourage students to actively participate in developing and maintaining open source software, promote the vigorous growth of excellent open source communities, and cultivate and discover more outstanding developers.
Students can independently choose projects they are interested in and apply; once selected, they receive personal guidance from community mentors. Depending on the project's difficulty and completion, participants also receive Open Source Summer bonuses and project completion certificates.
Official website of Open Source Summer: https://summer.iscas.ac.cn/
Welcome to join the exchange group.