The first Chinese-led open source data integration tool! Uncover the story behind Apache's top-level project SeaTunnel

"In the next ten years, the world's open source depends on China."

Dai Lidong, Apache Incubator mentor and Apache SeaTunnel PMC Member & Mentor, said this in an interview for CSDN's "Open Source Interviews". Among the projects he has seen in the Apache Incubator, a growing share are led by developers from China.

Dai Lidong and Guo Wei, a well-known figure in the community, helped SeaTunnel grow in the open source way. It has since become an Apache top-level project, and it is the first Chinese-led project in the field of data integration.

Five years have passed quietly: nearly 250,000 lines of code, more than 200 contributors, and global collaboration. What little-known stories and setbacks lie behind this? Why did contributors declare "I made this wheel"? In this article, Guo Wei and Dai Lidong explain how SeaTunnel started from scratch, became open source, and went global.

Guo Wei, Dai Lidong, and Liu Tiandong also joined us on CSDN to share their experience with Apache top-level projects and to discuss the future of open source. Scan the QR code to watch the replay of the live stream.


Author | Guo Wei, Dai Lidong

Editor in charge | Tang Xiaoyin

Produced by | CSDN (ID: CSDNnews)

On June 1, 2023, Children's Day, Apache SeaTunnel, the first Chinese-led open source data integration tool, officially announced its graduation from the Apache Software Foundation Incubator as a top-level project. After 18 months of incubation, the project had finally come to fruition. But like a newborn baby, Apache SeaTunnel is only at the start of a new journey.

From the earliest Waterdrop to today's Apache SeaTunnel;

From a real-time data processing system to a new-generation, one-stop, high-performance, distributed tool for massive data integration;

From the first line of code in January 2018 to 245,000 lines of code today;

From fewer than 10 contributors to 200+ contributors;

From looking for the first user to running in production at thousands of enterprises;

From finding a Mentor to successfully becoming an Apache top-level project.

……

Core members of the Apache SeaTunnel community recount the ups and downs here, following the timeline to show you the story behind its road to open source.

Behind the Birth of Apache SeaTunnel

We have always faced many challenges in data processing, one of which is the need for seamless integration and high-speed synchronization among many data sources. At that time, I investigated the existing data integration tools on the market and found that most supported a very limited set of data sources: they often supported the upstream source but had no connector for the downstream one. With large data volumes, performance was often too low, and operation was complicated and inflexible. So we came up with the idea of making an open source data integration tool!

After some polishing by the core team, Apache SeaTunnel was born. It not only supports hundreds of data sources (Database/Cloud/SaaS), but also supports real-time CDC and batch synchronization of massive data, which can synchronize trillions of data stably and efficiently.


In addition to basic data reading and writing, Apache SeaTunnel differs from general data integration tools in the following ways:

  • It is decoupled from Spark and Flink and has its own Zeta engine, designed specifically for data integration scenarios, which is faster, more stable, and more resource-efficient. Apache SeaTunnel therefore supports three execution engines: Spark, Flink, and the self-developed Zeta engine;
  • It has a web interface that is more intuitive and easier to operate;
  • It supports 100+ connectors, with rich data processing types to meet production needs;
  • It has a unique checkpoint design, enhanced data storage capacity, and so on.

This enables Apache SeaTunnel to:

  • Support hundreds of data sources with faster transmission speed and high accuracy;
  • Reduce complexity: connectors developed against the API are compatible with offline synchronization, real-time synchronization, full synchronization, incremental synchronization, CDC real-time synchronization, and other scenarios;
  • Provide a drag-and-drop and SQL-like language interface that saves developers time, along with job visualization management, scheduling, operation, and monitoring capabilities, accelerating integration with low-code and no-code tools;
  • Stay simple and easy to maintain, with support for stand-alone and cluster deployment. If you deploy with the Zeta engine, you do not need big data components such as Spark and Flink.

SeaTunnel has many capabilities now, but two years ago, when it was still called Waterdrop, it was positioned as a way to make Flink and Spark easier to use, so the entire architecture was built on Spark and Flink. That led to the first big discussion in the community: the connector must be independent of any specific engine.

Why make the connector independent of the engine?

First, let's look at the role of the connector. Connectors connect to specific upstream and downstream data sources and are a key component of data integration. Waterdrop's architecture at the time essentially wrapped Spark and Flink connectors using the native Spark and Flink APIs, so each engine required its own set of code. In the early days, batch and stream were also different APIs, which meant that batch synchronization and stream synchronization for the same data source required two sets of code. Given the major version-compatibility issues in Spark and Flink, the development and maintenance costs were too high.

So at the beginning of 2022, the community started a discussion on refactoring the connectors. The goal was to define SeaTunnel's own connector API, decoupled from any specific engine and its API, and to truly unify batch and stream: each data source needs only one set of code, which can run on both the Spark and Flink engines.
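The idea of an engine-independent connector can be sketched in a few lines. This is only an illustration under stated assumptions, not SeaTunnel's actual connector API (which is written in Java): the names `Source`, `Sink`, `ListSource`, `MemorySink`, `run_row_at_a_time`, and `run_in_batches` are all hypothetical, and the two runner functions merely stand in for a streaming-style and a batch-style engine.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


class Source(ABC):
    """Hypothetical engine-independent source connector."""

    @abstractmethod
    def read(self) -> Iterator[Row]:
        """Yield rows from the upstream system."""


class Sink(ABC):
    """Hypothetical engine-independent sink connector."""

    @abstractmethod
    def write(self, rows: Iterable[Row]) -> None:
        """Write rows to the downstream system."""


class ListSource(Source):
    """Toy source backed by an in-memory list."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = list(rows)

    def read(self) -> Iterator[Row]:
        yield from self._rows


class MemorySink(Sink):
    """Toy sink that collects rows in memory."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def write(self, rows: Iterable[Row]) -> None:
        self.rows.extend(rows)


def run_row_at_a_time(source: Source, sink: Sink) -> None:
    """Stand-in for a streaming-style engine: forwards one row per call."""
    for row in source.read():
        sink.write([row])


def run_in_batches(source: Source, sink: Sink, batch_size: int = 2) -> None:
    """Stand-in for a batch-style engine: forwards fixed-size batches."""
    batch: List[Row] = []
    for row in source.read():
        batch.append(row)
        if len(batch) == batch_size:
            sink.write(batch)
            batch = []
    if batch:  # flush the final partial batch
        sink.write(batch)
```

The connector classes never import or mention an engine, so the same `ListSource` and `MemorySink` work unchanged with either runner. That is the property the refactoring was after: one set of connector code per data source, with the engine-specific part confined to a thin adapter.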

At the beginning of the discussion, many people objected, arguing that engines such as Flink and Spark were very mature and that there was no problem in relying heavily on them. Some contributors thought we should abandon Spark, rely fully on Flink, and build features on top of it. Refactoring the connector API also meant that the work on the existing 50+ connectors would have to start from zero.

But after discussions with many experts in the industry, the community soon decided that engine-independent connectors had to be built: we "cannot dance in shackles". The new API would make connector development easier, and the old connectors could be ported quickly to the new architecture.

The decision proved right. Once the goal was set, the community spent a month designing the new connector API, many enthusiastic contributors took part, and we reached support for more than 100 connectors in just 4 months, an almost unimaginable pace. The new API truly delivered support for multiple engines.

Once the connectors were independent of the engine, an offhand remark by SeaTunnel PMC Chair Gao Jun, "just implement a new engine that focuses on data integration, once and for all!", sparked irrepressible enthusiasm among community contributors.

Why self-develop a new engine?

"What, a self-developed engine?" When news of a self-developed integration engine spread, the community erupted in an unprecedented, heated debate over whether it needed to build an engine of its own.

There were several main points in the debate:

  • From the perspective of ease of use, both Spark and Flink require enterprises to have a big data platform, which is a heavy technical burden for small and medium-sized enterprises. Everyone needs a simpler, lower-cost engine to lower the threshold for using SeaTunnel.

  • From a performance point of view, Spark and Flink were born for computing. They mainly address the T in the ETL architecture, whereas data integration mainly addresses the E and L in ELT. Features such as Join, Aggregation, and window calculation are not the focus of data integration. A data integration engine should focus on integration rather than computation, and all code optimization and architecture design should aim at improving job performance and stability. We therefore need an engine designed specifically for integration scenarios, with excellent performance, high stability, and low resource usage. In particular, when many tables need to be synchronized, can they all be synchronized in real time with few resources (such as a single CPU core)?

  • From the perspective of business scenarios, Flink and Spark themselves cannot support features such as CDC multi-table synchronization, whole-database synchronization, and synchronization of DDL changes during CDC. Supporting them would require modifying the Flink/Spark source code, and we cannot be sure the Spark/Flink communities would accept such changes, because they do not match the main problem those projects solve (the T in ELT, focused on computation in the data warehouse). If the changes were rejected, we would have to maintain our own fork of Spark/Flink, which is almost impossible. From this perspective, SeaTunnel had to build an integration engine itself.
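The resource question raised in the second point (many tables, very few cores) can be illustrated with a toy sketch. It is not how the Zeta engine schedules work; it only shows, under simplifying assumptions, how several table readers can share one worker by taking turns instead of each holding a thread. The names `table_reader` and `sync_tables_on_one_worker` are hypothetical.

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, object]


def table_reader(table: str, rows: List[Row]) -> Iterator[Tuple[str, Row]]:
    """Hypothetical per-table reader that yields one row at a time."""
    for row in rows:
        yield table, row


def sync_tables_on_one_worker(tables: Dict[str, List[Row]]) -> Dict[str, List[Row]]:
    """Round-robin over all table readers on a single worker.

    Each reader gets one turn, emits one row, and goes to the back of the
    queue, so no table needs a dedicated thread.
    """
    queue = deque(table_reader(name, rows) for name, rows in tables.items())
    target: Dict[str, List[Row]] = {name: [] for name in tables}
    while queue:
        reader = queue.popleft()
        try:
            table, row = next(reader)
        except StopIteration:
            continue  # this table is fully synchronized; drop its reader
        target[table].append(row)
        queue.append(reader)
    return target
```

A general-purpose compute engine typically plans one job or task per table, which is where the resource cost comes from. An engine built only for integration is free to multiplex many lightweight readers like this.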

At that time, many contributors in the community took part in the discussion, and some felt that this was reinventing the wheel. In the end the community reached a consensus and decided to design and develop a dedicated integration engine. I remember some contributors declaring, "I made this wheel".

So the community gritted its teeth and built Zeta, an engine focused on solving problems in data synchronization. In October last year, we released its first official version. At that time it was called SeaTunnel Engine, and everyone felt it deserved a catchier name that fit the engine's positioning.

So the community started brainstorming. After about two weeks of discussion, we chose the name Zeta from among many candidates. Zeta is currently the fastest planet in the observable universe, and many users affectionately call the engine "Ultraman Zeta", the strongest Ultraman in the universe: let's protect the faith of light together! We hope the "Ultraman Zeta" engine will make integration easier, more efficient, more stable, and more resource-efficient.

Start Incubating: Why Join The Apache Software Foundation?

In fact, even before it was renamed from Waterdrop, Apache SeaTunnel planned to join the world's largest open source organization, the Apache Software Foundation. Guo Wei (SeaTunnel Mentor) said when SeaTunnel joined the Apache Incubator:

Now that Apache Sqoop has been retired, there is no particularly good open source project for connecting data between data sources. Yet there are many types of data sources today. If a single company only solves the connections between the few data sources it uses, that does not solve the problem for the many others who use different ones, and whenever a new data source appears, the connector has to be written again from scratch. Open source is a model of "gathering sand into a tower and welcoming all rivers": it lets every enterprise and every person use open source connectors conveniently and quickly, and if they have their own data sources, they can contribute them back. In this way, an open source project that connects all kinds of data sources can grow like a snowball, making it easier for more users to connect to more data sources and creating a "flywheel effect" in data integration.

Another important point is that some core contributors and mentors of Apache SeaTunnel had already successfully incubated the open source project DolphinScheduler, so everyone was full of confidence and expectation about SeaTunnel entering the incubator. The process of entering the Apache Incubator was not entirely smooth, but thanks to that earlier experience the team was never at a loss and proceeded in an orderly manner.

Specifically, the core contributors of Waterdrop, the predecessor of SeaTunnel, established close ties with the DolphinScheduler community in 2018, and the DolphinScheduler team had been following Waterdrop closely ever since. In terms of both the project's code quality and the future potential of the data integration field, it was a "promising stock". So when the Waterdrop team asked whether we could continue the work together, we did not hesitate long: we invested people and energy in the research and development of Apache SeaTunnel and helped it enter the Apache Incubator soon after. With an open mind, we hope that through the power of open source, SeaTunnel can synchronize and transform data across data sources efficiently, accurately, and quickly, so that everyone can reach their goals quickly and easily in multi-source scenarios. We believe that under the guidance of the Apache Way, Apache SeaTunnel will gain more support and grow faster.

When entering the Apache Software Foundation, finding a Mentor is often the first and most critical step. But unlike projects that have to cross the river by feeling for the stones, Apache SeaTunnel attracted the attention of Apache Incubator mentors around the world during the incubator discussion stage and had no shortage of mentors. Some mentors even lamented on the global Apache Incubator discussion mailing list that incubator projects either "die of drought or drown in flood".

Soon, with the help and guidance of 7 mentors, Jean-Baptiste Onofré, Kevin Ratnasekera, Willem Ning Jiang (Jiang Ning), Ted Liu (Liu Tiandong), Lidong Dai (Dai Lidong), William Guo (Guo Wei), and Zhenxu Ke, Apache SeaTunnel quickly got onto the right track in the Apache Incubator.

Jiang Ning, an open source "veteran", eventually became our Champion. He is one of the most senior Apache Members in China. He was re-elected as a director of the Apache Software Foundation in 2023, becoming the first Chinese person to be re-elected to the board.

Dai Lidong is the Chair of the Apache DolphinScheduler project and has extensive experience in open source. He also has close ties to Apache SeaTunnel: he helped organize and design many of its functions and helped build its community. During more than a year of working on Apache SeaTunnel, he also served as Mentor for several other Apache incubating projects and was elected an ASF Member in 2022.

Apache Members Jean-Baptiste Onofré and Kevin Ratnasekera are familiar faces from the incubation of Apache DolphinScheduler, and both have rich experience in project incubation.

Later, Guo Wei, Ted Liu, and Ke Zhenxu also joined the ranks of Mentor, making the incubation of Apache SeaTunnel smoother.

To formally enter the Apache Incubator, we also referred to mature projects and applied a project-wide code standard to Apache SeaTunnel. To meet international standards, we did a great deal of English translation and proofreading of the project documents, and the project website was fully converted to English. This cleanup made the Apache SeaTunnel project more standardized and "international".

In addition, after joining the incubator, we made relatively large changes to the project's functionality, the most important of which was the development and release of the data synchronization engine Zeta mentioned above. The engine gives synchronization jobs high throughput, low latency, and strong consistency guarantees, and was officially released in version 2.3.0 as the default engine of Apache SeaTunnel. It is decoupled from the Flink and Spark engines, so users can run jobs without relying on either. Distinctive features such as autonomous clusters (no central coordinator), data caching, rate control, a shared connection pool, resumable transfer, fine-grained fault tolerance, and dynamic thread sharing make the Zeta engine easier to use, more resource-efficient, more stable, and faster, and allow it to support data synchronization in all scenarios.
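As a rough illustration of what resumable transfer means in practice, the sketch below saves a read offset after each batch and resumes from it after a failure. It is a minimal model under stated assumptions, not the Zeta engine's checkpoint mechanism: `CheckpointStore`, `run_job`, and the `fail_at` parameter (used only to simulate a crash) are hypothetical, and a real engine must also make the sink write and the checkpoint atomic.

```python
from typing import List, Optional


class CheckpointStore:
    """Hypothetical store holding the last committed read offset."""

    def __init__(self) -> None:
        self.offset = 0

    def save(self, offset: int) -> None:
        self.offset = offset

    def load(self) -> int:
        return self.offset


def run_job(
    source: List[int],
    sink: List[int],
    store: CheckpointStore,
    batch_size: int = 2,
    fail_at: Optional[int] = None,
) -> None:
    """Copy source to sink in batches, checkpointing after each batch.

    If `fail_at` falls inside a batch, a failure is simulated before that
    batch is written, so a restart neither loses nor duplicates rows.
    """
    offset = store.load()  # resume from the last checkpoint
    while offset < len(source):
        batch = source[offset:offset + batch_size]
        if fail_at is not None and offset <= fail_at < offset + len(batch):
            raise RuntimeError("simulated failure")
        sink.extend(batch)
        offset += len(batch)
        store.save(offset)  # in a real engine, atomic with the sink write


# Usage: the first run crashes mid-way, the second run resumes.
source_rows = list(range(6))
sink_rows: List[int] = []
store = CheckpointStore()
try:
    run_job(source_rows, sink_rows, store, fail_at=3)
except RuntimeError:
    pass
run_job(source_rows, sink_rows, store)
```

After the second call the sink holds every source row exactly once, because the restarted job begins at the saved offset rather than at zero.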


Exploring the Apache Way

Just as we need to understand a company's culture when we join it, before participating in an Apache open source project we need to understand the culture of the ASF, namely the Apache Way.

If you go deep into open source, you will find that it is not simply a matter of publishing source code. It also involves community management, activity, communication, culture, and more, which requires a deeper understanding of the Apache Way.

Drawing on previous experience, Apache SeaTunnel understood the importance of the Apache Way from the early stage of entering the Apache Incubator. For example, an open source community should take the concept of Community Over Code to heart. That requires the community to prepare and work to lower the threshold for everyone interested in the project as far as possible, ideally to zero: formulating a community incentive plan, writing a beginner's guide, selecting Good First Issues, tracking the progress of important features, gathering feedback and optimization suggestions through regular user interviews, regularly answering questions about the project and the community, and so on.

Community contributions are not limited to code. Non-code contributions can sometimes be even more valuable than code: using your own influence to draw attention to the project, writing technical and non-technical articles about it, taking part in community activities, "endorsing" Apache SeaTunnel on various occasions, recommending it to more users, and so on are all ways of participating in the community.

Community Over Code also emphasizes openness, communication, and cooperation. Apache SeaTunnel upholds these ideas: it keeps in touch with communities at home and abroad so they can learn from each other, and it maintains communication with the wider Apache community. All discussions take place on the mailing lists and in Issues, and major progress and plans for the project and the community are announced through the community's own media channels, so that the community remains open and transparent.

Since entering incubation, Apache SeaTunnel has held more than 20 online and offline meetups with open source projects at home and abroad. These include Apache Shenyu, Apache InLong, Apache Linkis, Apache Doris, and IoTDB, which graduated from the ASF Incubator before Apache SeaTunnel; mature open source projects such as StarRocks and TDengine; and meetups held jointly with Trino, APISIX, Shopee, and ALC Indore in the United States, India, and other overseas regions.

Cooperation and communication between communities promote the development and adoption of open source technologies. Apache SeaTunnel works with other open source projects to solve technical problems, which helps raise the overall level of the open source ecosystem and expand its boundaries.

Over time, the community has changed qualitatively. The community's mailing-list discussions and GitHub data show that the Apache SeaTunnel community has become truly active and diverse. The table below shows how the community data changed during more than a year in the Apache Incubator.

The community and its contributors coexist like "fish" and "water". More and more contributors join, bringing fresh, life-giving water to the "fish" that is the community, so the Apache SeaTunnel community thrives; and the water flows faster and farther because this big fish keeps leaping. Only when fish and water coexist can life continue.


Graduating from the Incubator

After 18 months of incubation, the community carefully evaluated seven aspects (code, license and copyright, releases, quality, community, consensus, and independence) against the Apache maturity assessment model, concluded that the time was ripe for Apache SeaTunnel to graduate, and began preparing for graduation from the ASF Incubator.


Apache Project Maturity Assessment Model

For example, in terms of code maturity, the project has gone through multiple releases and added new features, which improved the performance and functionality of Apache SeaTunnel and further improved efficient synchronization and transformation between data sources. In terms of community building, as mentioned above, a number of online and offline meetups at home and abroad gave the Apache SeaTunnel community a platform for communication and sharing, promoted communication and cooperation among developers, and expanded the project's influence. In addition, Apache SeaTunnel strengthened its integration with upstream and downstream ecosystem projects such as Flink, Spark, TiDB, OceanBase, and IoTDB, which promotes collaborative development between projects and improves the interoperability and overall performance of the open source ecosystem.

Under the guidance of Apache Members, Apache SeaTunnel opened a graduation discussion in the community in April and kept making improvements and corrections based on the ASF Incubator's guidance. In the end, Apache SeaTunnel passed the graduation vote, the ASF board of directors approved the resolution on May 17, 2023, and the project joined the ranks of TLPs as hoped!

The Road to the Future: How China's Open Source Goes Global

The goal of Apache SeaTunnel is to "connect all sources and synchronize like flying". It strives to become a world-class data integration tool and will integrate with more upstream and downstream ecosystem projects in the future. It will also continue its mission of promoting open source culture: encouraging communication and cooperation among developers, providing more platforms for the growth of open source communities, and inspiring more people to participate in and contribute to open source projects.

At this important time, we call on more people to become Apache SeaTunnel contributors!

Finally, for Apache SeaTunnel, the road to graduating from the ASF Incubator was not easy. Based on our experience in the open source world, we would like to offer some opinions and suggestions on the development of China's open source ecosystem:

  • Strengthen the construction of open source culture

In China, the dissemination and popularization of open source culture still need to be strengthened. More developers and businesses need to be encouraged to participate in open source projects, promoting knowledge sharing and collaboration. It is also necessary to improve awareness and understanding of open source and to promote its wide application in education, business, and government.

  • Improve the quality and impact of open source projects

The number of open source projects in China has accumulated to a certain extent, but there is still room for improvement in terms of quality and influence. It is necessary to pay attention to the technological innovation and practicality of the project, and encourage more high-quality projects to emerge. At the same time, it is necessary to actively participate in the international open source community, cooperate and communicate with international projects, and increase the popularity and influence of the project.

  • Strengthen open source community building and governance

The open source community is the key to the success of an open source project. It is necessary to establish a sound community governance mechanism to promote the participation and contribution of community members. At the same time, it is necessary to provide a good communication and collaboration platform to encourage exchanges and cooperation among developers. In addition, there is a need to strengthen training and support for community members to improve their technical and managerial capabilities.

  • Strengthen the combination of open source and industry

Open source technology plays an important role in promoting industrial innovation and development. It is necessary to strengthen the combination of open source technology and various industries, and promote the application of open source technology in the fields of enterprises and public services. At the same time, it is necessary to actively cultivate an open source technology ecosystem and promote the coordinated development of open source projects and industrial chains.

All in all, China's open source has made real achievements, and many domestic open source projects are widely recognized and used internationally, but there is still a lot of work to be done. By strengthening open source culture, improving project quality and influence, strengthening community building and governance, and strengthening the combination of open source and industry, we can further develop China's open source ecosystem and promote technological innovation and industrial upgrading.

About the Author:

Guo Wei, Apache Software Foundation Member, Apache DolphinScheduler PMC Member, Apache SeaTunnel Mentor.

Dai Lidong, Co-Founder of Beluga Open Source, Apache DolphinScheduler PMC Chair, Apache SeaTunnel PMC Member & Mentor, Apache Incubator Mentor, Apache Local Community Beijing Member.

This article is supported by Beluga Open Source Technology!


Origin blog.csdn.net/weixin_54625990/article/details/131384481