A free and easy-to-use global data integration platform


Hello everyone, I am Mr. Foot (o^^o)

Everyone knows that in our earlier data middle-platform development we relied on DataX for data aggregation.

And in the data lake project, Kettle served as the core tool for offline development.

Both of these open source projects are excellent, but each targets only a single scenario; as soon as real-time or other requirements come up, they start to fall short.

So we had long suffered from the lack of a visual, unified streaming and batch data integration platform. We then ran a survey across the web~~~ and finally settled on the global data integration platform RestCloud.

Since adopting RestCloud, our big data project delivery has sped up noticeably, which is very nice.

Global data integration platform RestCloud

ETLCloud is the latest generation of data integration platform. It aims to be a DataOps platform that unifies offline data integration (ETL, ELT), real-time CDC data integration, orchestration and scheduling, and data service APIs, covering an enterprise's most complex data integration scenarios in one stop. It offers both private deployment and a cloud-native architecture to match the business needs of enterprises at different stages of development. It also provides an open component marketplace, so an enterprise can quickly build its big data infrastructure on the platform and connect ERP, MES, OA, SaaS, API, MQ, IoT and other data sources to build a data warehouse.

1. Product architecture

The RestCloud data integration platform is built on a Spring Cloud microservice architecture. The back end is written in pure Java, the front end is developed in React with a fully separated front-end/back-end design, and the whole platform is structured around a data flow + workflow engine architecture.

At its core is a workflow engine designed specifically for data processing task flows, supporting arbitrarily complex data flow handling: serial execution, synchronous and asynchronous parallelism, synchronous and asynchronous sub-processes, transaction control, loop tasks, multi-stream merging, data splitting, data stream replication, and more.

Rather than modelling data flow logic as a simple directed acyclic graph (DAG), and thanks to our accumulated strengths in workflow, the engine handles not only DAG-style dependent tasks but also complex multi-layer task scheduling. Enterprises can divide data processing into an atomic layer, a logical composition layer, a scheduling layer, and so on to express complex scheduling needs, and can split a complex data integration process into multiple reusable sub-tasks that are scheduled independently.
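For instance, here is a minimal Python sketch (purely conceptual, not ETLCloud's actual API) of splitting a flow into a reusable extract sub-flow that runs two tasks in parallel, followed serially by a load step:

```python
# Conceptual sketch only -- not ETLCloud's API. It shows a complex flow split
# into a reusable sub-flow that a "scheduling layer" runs before a load step.
from concurrent.futures import ThreadPoolExecutor

def extract_orders():
    print("extract orders")          # atomic-layer task

def extract_customers():
    print("extract customers")       # atomic-layer task

def merge_and_load():
    print("merge streams and load")  # atomic-layer task

def extract_sub_flow():
    # synchronous parallel: both extracts run at the same time
    with ThreadPoolExecutor() as pool:
        pool.submit(extract_orders)
        pool.submit(extract_customers)

def main_flow():
    # serial: the reusable sub-flow completes before the load step starts
    extract_sub_flow()
    merge_and_load()

if __name__ == "__main__":
    main_flow()
```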

Next, I will walk my friends through each module of the data integration platform; I hope it gets everyone excited.

1. Data source management

As we all know, data source management is a core function of any data platform. The traditional approach is built around a single data source, but as the business grows this is no longer enough, so integrating heterogeneous data sources has become an urgent business requirement.

  • 1. Unified management of data sources: With Kettle, the connection and authentication information of each data source has to be maintained separately in every task, which adds management complexity. In contrast, the platform provides unified multi-data-source management, which simplifies data source maintenance and reduces errors and repetitive work (see the minimal sketch after this list).
  • 2. Multiple data source support: Mainstream and domestic relational databases, NoSQL databases, file systems, cloud storage and more are supported, covering the need to connect to many different data sources.
  • 3. Reduce management complexity: A single unified interface manages multiple data sources, so developers can manage, transform, and load data within one process without switching between tools, lowering maintenance costs and error rates.
  • 4. Enhance data security: Unified data source connections, data encryption, security authentication and other features help protect data. Developers configure and manage these security features in one tool, improving data security and reliability.
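To make the "define once, reference by name" idea concrete, here is a minimal sketch; the registry structure, field names, and values are assumptions made for illustration, not ETLCloud configuration syntax:

```python
# Illustrative only: a tiny registry standing in for unified data source
# management. Names, fields, and values are assumed, not platform syntax.
DATASOURCES = {
    "mysql_test": {
        "type": "mysql",
        "host": "127.0.0.1",
        "port": 3306,
        "database": "test",
        "user": "etl_user",
        "password": "***",   # a real platform would store this encrypted
    },
}

def get_datasource(name: str) -> dict:
    """Tasks look up a connection by name instead of embedding credentials."""
    return DATASOURCES[name]

print(get_datasource("mysql_test")["database"])   # -> test
```

Every task that references "mysql_test" picks up credential changes automatically, which is the maintenance win over per-task configuration.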

2. Offline data integration

Integration tasks between heterogeneous data sources are created through visual drag and drop, and the data can be cleaned, converted, and transmitted along the way. This feature is well ahead of other open source offline data integration tools; the biggest differences are the rich variety of components and the support for big data components.

  • 1. The platform provides dual ETL and ELT engine modules, and users can choose ETL or ELT components according to the business scenario (a toy comparison follows this list).

  • 2. ETL helps users implement complex data integration scenarios, including reverse ETL processes that feed data from the warehouse back into business systems.

  • 3. ELT lets users quickly extract business data into data warehouses and data lakes.

  • 4. Drawing on experience of stably scheduling tens of thousands of data pipelines within a single project, the team can provide users with complex data pipeline architecture solutions and global, compliant data exchange.
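The ETL/ELT split can be illustrated with a toy sketch: ETL transforms rows inside the integration engine before loading, while ELT loads raw rows first and pushes the transformation down to the warehouse as SQL. The helper functions below are hypothetical placeholders, not platform components.

```python
# Toy illustration of ETL vs. ELT; `load` and `run_sql_in_warehouse` are
# hypothetical placeholders, not ETLCloud components.

def etl(rows, load):
    # ETL: clean/transform in the integration engine, then load the result
    cleaned = [{**r, "name": r["name"].strip().upper()} for r in rows]
    load(cleaned)

def elt(rows, load, run_sql_in_warehouse):
    # ELT: load raw rows first, then transform inside the warehouse via SQL
    load(rows)
    run_sql_in_warehouse(
        "INSERT INTO dw.info_clean SELECT id, UPPER(TRIM(name)) FROM ods.info_raw"
    )

etl([{"id": 1, "name": " alice "}], load=print)
elt([{"id": 1, "name": " alice "}], load=print, run_sql_in_warehouse=print)
```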

3. Real-time data integration

Real-time data integration is generally used in scenarios that require high timeliness, and can also be used for incremental data collection in offline data integration.

The ETLCloud data integration platform supports monitoring and reading data changes from heterogeneous data sources in real time. After cleaning and conversion, the data can be written into the data warehouse in real time and immediately published as API services.

  • 1. Change logs can be captured automatically according to the database type, enabling millisecond-level real-time synchronization of data tables; real-time data can be distributed in parallel to multiple target databases or applications at the same time.
  • 2. Real-time transmission to Hive, MongoDB, Doris, and MQ is supported, as is real-time transmission from MongoDB, MQ, and files into SQL databases. One-to-many transmission, multi-stream merged transmission, and data quality checks during transmission are all supported; dirty data can be routed to designated tables in real time with alarm notifications (see the sketch after this list).
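Conceptually, the real-time path looks like the sketch below: consume a change event, apply a quality rule, write good rows to the target, and quarantine dirty rows with an alert. The event shape and sink objects are assumptions made for illustration only, not the platform's internals.

```python
# Conceptual sketch of CDC-style real-time sync with dirty-data routing.
# The event format and sink objects are illustrative assumptions.

def handle_change_event(event, target_sink, dirty_sink, alert):
    row = event["after"]                      # row image after the change
    if row.get("id") is None:                 # simple data quality rule
        dirty_sink.append(row)                # route dirty data to a side table
        alert(f"dirty row from {event['table']}: {row}")
        return
    target_sink.append(row)                   # apply to the target immediately

target_rows, dirty_rows = [], []
handle_change_event(
    {"table": "test.info", "after": {"id": 1, "name": "alice"}},
    target_rows, dirty_rows, alert=print,
)
print(target_rows)   # [{'id': 1, 'name': 'alice'}]
```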

4. Monitoring Center

Running automated processes are monitored and alerted on in a unified way using intelligent algorithms, rather than relying only on the traditional approach of plug-ins that merely capture task exceptions.

Today, most task monitoring scrapes metrics with Prometheus and visualizes them in Grafana. That approach is often imprecise, and missed failures can translate into serious economic losses. The ETLCloud data integration platform instead applies intelligent algorithms to task monitoring to achieve more precise results.
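The article does not describe the algorithm itself, but as a rough idea of statistics-based task monitoring (my own toy example, not ETLCloud's implementation), a run can be flagged when its duration deviates sharply from its own history:

```python
# Toy anomaly rule for task-run monitoring -- not ETLCloud's actual algorithm.
from statistics import mean, stdev

def is_anomalous(history_secs, latest_secs, z=3.0):
    if len(history_secs) < 5:
        return False                              # not enough history to judge
    mu, sigma = mean(history_secs), stdev(history_secs)
    return sigma > 0 and abs(latest_secs - mu) > z * sigma

print(is_anomalous([60, 62, 58, 61, 59, 60], 300))   # True -> likely stuck task
```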

5. Comparison and summary

There are many open source platforms for integrating heterogeneous data sources today, the best known of which is Kettle, so we put together a summary and comparison:

What are the advantages and disadvantages of RestCloud ETL and Kettle?

We compare them on platform architecture, platform management, monitoring and analysis, data components, data transmission, and platform performance as follows:

6. Offline data integration practice

Next, we will walk through a case of offline data integration so friends can get a feel for it:

Using a MySQL data source, we will integrate the data from the info table in the test database into the info_target table of the test_target database.

(Here, friends can integrate data from different data sources)

Have fun playing with the data integration function!

1. Create a new MySQL data source

In the data platform, go to data source management and create a new data source.

Here, I will use a MySQL data source as an example.

Configure the MySQL parameters and test the connection. That completes the setup of our data source.
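If you want to sanity-check the same connection parameters outside the platform, a quick script like this works (host, port, user, and password here are example values; requires `pip install pymysql`):

```python
# Standalone connection test with example credentials -- adjust to your own.
import pymysql

conn = pymysql.connect(
    host="127.0.0.1", port=3306,
    user="etl_user", password="***",
    database="test", charset="utf8mb4",
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")   # the equivalent of the platform's "test connection"
    print("connection OK:", cur.fetchone())
conn.close()
```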

2. Offline data integration

Enter the offline data integration module and create your own project application.

After that, enter the application to reach our core functional area: the data process designer.

Here, friends can create and design their own data processes.

**My data flow design:** synchronize the data in the info table of the test database into the info_target table of the test_target database.
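For reference, what this flow accomplishes can also be expressed as a single SQL statement over the same MySQL instance, since both databases live on the one source we configured. This sketch assumes the source and target tables have matching structures:

```python
# Equivalent of the flow's job, assuming identical table structures.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="etl_user", password="***", charset="utf8mb4")
with conn.cursor() as cur:
    cur.execute("INSERT INTO test_target.info_target SELECT * FROM test.info")
conn.commit()
conn.close()
```

In practice the visual flow is still preferable, since cleaning, conversion, and scheduling steps can be dropped in between extraction and loading.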

3. Execute synchronized data

After the data process is designed, we can execute the task (either manually or on a schedule).

After executing the data process task, let’s take a look at the results~~~

Data in the info table of the test database:

Data in the info_target table of the test_target database:

At this point, very nice: with a single MySQL source, we have synchronized data between tables in different databases. And of course, the data integration features go far beyond this.
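As an optional double check, you can compare the row counts of source and target after the run (same example connection details as before):

```python
# Quick post-run verification: source vs. target row counts.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="etl_user", password="***", charset="utf8mb4")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM test.info")
    src = cur.fetchone()[0]
    cur.execute("SELECT COUNT(*) FROM test_target.info_target")
    tgt = cur.fetchone()[0]
conn.close()
print(f"source rows: {src}, target rows: {tgt}, in sync: {src == tgt}")
```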

Key: During data integration, friends can explore many more features through data process design, finish their own work quickly, and then happily slack off, which is what we all yearn for!

Finished, scatter flowers.


I wish you all success and a good harvest!


Origin blog.csdn.net/shujuelin/article/details/132534784