Stream computing is undergoing a generational change: the streaming lakehouse combination of Flink + Paimon is accelerating its adoption, and Flink CDC has received a major upgrade.

On December 9, 2023, Flink Forward Asia 2023 (hereinafter FFA) concluded successfully in Beijing. More than 70 talks, more than 30 technology and practice sessions from leading companies, and a packed venue all demonstrated the industry appeal of FFA's return to an in-person format.

To borrow the words of Wang Feng, founder of the Apache Flink Chinese community, Apache Paimon PPMC member, and head of Alibaba Cloud's open source big data platform: "After nearly ten years of development, Flink has become the de facto standard for stream computing."

Naturally, for community developers the main focus of the conference was the new trends, new practices, and new developments in stream computing that it showcased.

1. Two major Flink releases: going deeper into scenarios, striving for excellence

Over the past decade, with the growing demands of big data, the Internet of Things, and real-time analytics, traditional batch processing and static analysis could no longer meet the efficiency and real-time requirements of new data processing scenarios, and the concept of stream computing emerged. Stream computing offers strong real-time performance and the ability to process live data streams, letting systems respond to and analyze data generated by large fleets of devices, sensors, and other sources in a more timely manner. Apache Flink rose to prominence in this context and has won the favor of enterprises and developers around the world with its excellent performance and flexibility.

At the conference, Wang Feng said: "Flink, as an evergreen in the open source big data field, has maintained rapid development. The fundamental reason is our continuous evolution of the core technology. We made further progress in 2023, keeping our cadence of two major releases a year with Flink 1.17 and 1.18, and attracting many new contributors."


"In the core stream processing field, the overall theme is striving for excellence," Wang Feng noted. From a technical perspective, Flink has become the benchmark in global stream computing, so its 2023 releases focused on going deeper into specific scenarios and on continuous polishing. For example, the community kept improving Flink's performance and functional completeness in batch mode, so that Flink can become a computing engine that handles both bounded and unbounded data sets uniformly.

Specifically, the Flink team made hundreds of adjustments and optimizations to Flink SQL, the development interface users care about and use the most. For example, this year the community launched a new feature, Plan Advice, which intelligently checks streaming SQL: after a user writes a streaming query, Plan Advice automatically checks it for potential problems or risks and surfaces prompts and actionable suggestions as early as possible. The feature has been warmly welcomed by users.
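As a hedged sketch of how this is used (the `orders` table below is made up for illustration; the syntax follows the open source Flink 1.17+ SQL client, where Plan Advice is exposed through the `EXPLAIN` statement):

```sql
-- Hypothetical source table for illustration.
CREATE TABLE orders (
  order_id  BIGINT,
  user_id   BIGINT,
  amount    DECIMAL(10, 2)
) WITH ('connector' = 'datagen');

-- EXPLAIN PLAN_ADVICE prints the optimized plan together with advice
-- entries, e.g. warnings about risky state usage or hints to enable
-- applicable aggregation optimizations.
EXPLAIN PLAN_ADVICE
SELECT user_id, SUM(amount) FROM orders GROUP BY user_id;
```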

The Streaming Runtime, Flink's core streaming architecture, also received major upgrades this year. "Flink's defining feature is that it is state-oriented, with its own state storage and state access capabilities. So state management, checkpointing, and snapshot management are core parts of Flink, and parts that users have strong demands on. Although Flink takes global consistency snapshots periodically, users want snapshots to be as frequent and as cheap as possible. A feasible approach is to let the system replay as little data as possible during fault recovery, for example by achieving second-level checkpoints," Wang Feng said. After a year of effort, the generic incremental checkpoint capability landed in Flink 1.17 and 1.18 and reached a fully production-ready state.
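A minimal sketch of how generic incremental checkpoints can be enabled in the open source release (via the changelog state backend); the keys and paths below are assumptions drawn from the Flink documentation, not from the talk:

```yaml
# flink-conf.yaml: with the changelog state backend enabled, each
# checkpoint ships only the state changes since the previous one,
# making frequent, cheap checkpoints feasible.
state.backend: rocksdb
state.backend.changelog.enabled: true
state.backend.changelog.storage: filesystem
dstl.dfs.base-path: hdfs:///flink/changelog   # durable changelog storage (path is a placeholder)
execution.checkpointing.interval: 10s         # aggressive interval the feature is meant to support
```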

In addition, Flink made many performance optimizations at the Batch engine level. Wang Feng said: "Flink is a unified stream-batch engine. In addition to its powerful stream computing capabilities, we hope it will also excel at batch processing and bring users a one-stop data computing and development experience."

This year, Flink not only optimized execution efficiency in batch scenarios on the strength of its core streaming engine, but also brought traditional batch processing optimizations into Flink. In testing, the batch execution mode of Flink 1.18 performed 54% better than Flink 1.16 on the TPC-DS 10 TB data set, basically reaching an industry-leading level.

"These optimizations ensure that Flink not only stays at the strongest level in the industry for stream computing but also achieves first-class engine execution capability in batch processing. In fact, this year we have seen many companies share end-to-end implementations built on Flink," Wang Feng said.

In terms of deployment architecture, community developers have also done a lot of work to make Flink run better on the cloud. Cloud native is not only a new trend in big data but also a foundation for making technologies, including AI, broadly accessible. To help more projects and software run well on the cloud and to improve the user experience, the community has, for example, added API support for scaling a job in and out online, in real time, without restarting the entire Flink instance. To achieve unattended elastic scaling on the cloud, the community also launched Kubernetes-based Autoscaling, which dynamically monitors the load and latency of the whole job to deliver a smooth elastic scaling experience.
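A hedged sketch of what enabling the autoscaler looks like in a `FlinkDeployment` resource for the open source Flink Kubernetes Operator; the option keys and values are assumptions based on the operator's documentation (they have varied across operator versions) and should be checked against the version in use:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: autoscaling-example
spec:
  image: flink:1.18
  flinkVersion: v1_18
  flinkConfiguration:
    # Autoscaler keys (assumed; verify against your operator version):
    job.autoscaler.enabled: "true"
    job.autoscaler.metrics.window: "5m"       # window for load/latency metrics
    job.autoscaler.target.utilization: "0.7"  # target task utilization
  jobManager:
    resource: { memory: "2048m", cpu: 1 }
  taskManager:
    resource: { memory: "2048m", cpu: 1 }
  job:
    jarURI: local:///opt/flink/examples/streaming/TopSpeedWindowing.jar
    parallelism: 2
```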

In terms of scenarios, Wang Feng said that to keep more data flowing, the community has made many attempts, such as having Flink collaborate with the Lakehouse architecture, in the hope of using Flink's powerful real-time computing capabilities to accelerate data flow and analysis in the Lakehouse. It is worth mentioning that this year's two Flink releases added many new APIs to support the Lakehouse, as well as a JDBC driver, so users can integrate traditional BI tools seamlessly with Flink.

"In fact, the evolution of Flink has always followed one pattern: the development trend of the big data industry. Big data scenarios are currently accelerating their shift from offline to real-time, and under this great wave every piece of Flink's work is being continuously validated," Wang Feng concluded.

At the conference, Shi Hefu, head of the basic R&D department at Caocao Travel, shared the company's Flink-based real-time data warehouse practice. He described three main types of real-time data needs in Caocao Travel's operations: management needs real-time data to understand the company's operating status at any time; the operations team needs a focused tool to visually grasp operational details; and the algorithms have ever-higher requirements for real-time data, since fresher data makes them perform better.

"In the past year or so, through a Flink-based streaming data warehouse, Caocao Travel has been able to generate diversified metrics and real-time features and feed them to the algorithm engine, which ultimately increased passenger subsidy efficiency by 60% and driver-side efficiency by 20%. What is even more exciting is that our gross profit increased tenfold," Shi Hefu said.


Regarding Flink's follow-up planning, Song Xintong, head of distributed execution for Alibaba Cloud's Flink team, Apache Flink PMC member, and Flink 2.0 release manager, said: "We just launched version 1.18 in October. In February and June next year we will release versions 1.19 and 1.20 respectively, to satisfy the API migration cycle. In addition, we will launch Flink 2.0 in October next year, focusing on three core directions: ultimate optimization and technical evolution of stream processing, evolution of the unified stream-batch architecture, and user experience improvements. We will vigorously push forward state management with storage-compute separation, dynamic execution optimization for batch, unified stream-batch SQL syntax, a unified stream-batch computing model, and upgrades to the API and configuration systems. We look forward to more community partners joining us to build it together!"

2. Apache Paimon: leading new changes in streaming lakehouses

In addition to continuing to accelerate the evolution of Flink itself, the community has also incubated a brand-new project over the past year and successfully donated it to the Apache Software Foundation: Apache Paimon.

Apache Paimon, formerly known as Flink Table Store, is a streaming data lake storage technology that provides users with high-throughput, low-latency data ingestion, streaming subscription, and real-time query capabilities.

In fact, as the Lakehouse has become a new architectural trend in data analytics, more and more users are migrating traditional Hive- and Hadoop-based data warehouses to the Lakehouse architecture. The Lakehouse has five main advantages: separation of compute and storage, hot/cold data tiering, more flexible operations, pluggable query engines, and minute-level timeliness.

"Timeliness is the core driving force of business migration, and Flink is the core computing engine for reducing latency. Could Flink be integrated into the Lakehouse to unlock a new Streaming Lakehouse architecture? That was the original design intention of Paimon," said Li Jinsong, head of open source table storage at Alibaba Cloud, founder of Paimon, and Flink PMC member.

Of course, realizing this idea brought many challenges, the biggest being the lake format itself: streaming, and the large number of updates a stream generates, pose huge challenges to a lake storage format.

"First of all, Iceberg is an excellent lake storage format with a simple architecture and an open ecosystem. As early as 2020 we tried integrating Flink with Iceberg to give it streaming read and write capabilities. But we gradually found that Iceberg as a whole is designed for offline workloads, and that it must keep its architecture simple to accommodate a variety of compute engines, which greatly obstructed the kernel improvements we needed," Li Jinsong explained.

After the Flink + Iceberg exploration hit a wall, the team turned to Flink + Hudi. With Flink connected, Hudi's update latency dropped from the hour level (with Spark) to around ten minutes. But going further, new obstacles appeared: Hudi itself is designed for Spark and batch computing, and its architecture does not fit the needs of streaming computation and updates.

"Drawing on the lessons of Flink + Iceberg and Flink + Hudi, we redesigned a new streaming data lake architecture: Flink + Paimon," Li Jinsong said.

Flink + Paimon couples lake storage with an LSM-native design and is built for streaming updates. Paimon integrates well with both Flink and Spark and supports powerful streaming read and write, genuinely reducing latency to 1-5 minutes.


"Apache Paimon is a unified stream-batch lake storage format. It is just a format that stores data on OSS or HDFS. On top of this lake format, you can use Flink CDC for one-click ingestion into the lake, and use Flink and Spark for streaming and batch writes to Paimon. Going forward, Paimon will also support batch reads from the mainstream open source engines and streaming reads from Flink and Spark," Li Jinsong added.
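As a hedged sketch of what this looks like in practice (the warehouse path, table, and source names below are made up; the syntax follows the open source Paimon documentation for the Flink SQL client):

```sql
-- Create a Paimon catalog backed by a warehouse path on HDFS or OSS.
CREATE CATALOG paimon WITH (
  'type' = 'paimon',
  'warehouse' = 'hdfs:///paimon/warehouse'
);
USE CATALOG paimon;

-- A Paimon table: primary keys enable LSM-style streaming upserts.
CREATE TABLE user_actions (
  user_id BIGINT,
  action  STRING,
  ts      TIMESTAMP(3),
  PRIMARY KEY (user_id, ts) NOT ENFORCED
);

-- The same table can be written in streaming or batch mode by Flink,
-- and read in streaming or batch mode by Flink or Spark.
-- (source_table is a hypothetical upstream table.)
INSERT INTO user_actions SELECT user_id, action, ts FROM source_table;
```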

"With each Paimon release you can see huge progress. Currently the Paimon community has 120 contributors from all walks of life, and its star count has passed 1,500. More and more companies are beginning to adopt Paimon and share their Paimon practices."

At the conference, Tongcheng Travel shared its Paimon-based data lake practice. Reportedly, Tongcheng Travel has switched 80% of its Hudi lakehouse to Paimon, covering more than 500 jobs and more than ten Paimon-based real-time pipeline scenarios, processing roughly 100 TB of data, with the total record count reaching roughly 100 billion.

Wu Xiangping, big data expert at Tongcheng Travel and Apache Hudi & Paimon contributor, said that after the Paimon-based architecture upgrade, ODS-layer synchronization efficiency rose by about 30%, write speed by about 3x, and some query speeds by as much as 7x; using Paimon's Tag capability, export scenarios save about 40% of storage space; and by reusing intermediate data, metric developers' productivity increased by about 50%.

In addition, in the automotive services field, Autohome has successfully applied Paimon to scenarios such as operational analysis and reporting, which has not only brought great convenience but also helped the business achieve significant benefits.

Di Xingxing, head of Autohome's big data computing platform, said that through deep integration with Flink CDC, Autohome streams change data into Paimon and then uses Flink to consume Paimon's incremental data again, thereby building the entire pipeline into a real-time intelligent system.

Likewise, thanks to the combined use of Flink + Paimon, the architecture has been simplified while data consistency is guaranteed; host resources have been cut by about 50%; and data correction, combined real-time/batch analysis, and real-time querying of intermediate results are all well supported.

From the above cases, it is not hard to see that Paimon plays an important role in unifying stream and batch. It successfully brings the two computing modes together, and combined with Flink's unified stream-batch compute and storage capabilities, it forms a truly unified architecture. Paimon's most common application scenario is real-time ingestion into the lake; here, the combination of Flink CDC and Paimon enables a minimalist ingestion link that unifies full and incremental loads in a single job. This combination provides strong support for building real-time data warehouses, real-time analysis systems, and more.

3. Flink CDC 3.0, a real-time data integration framework, is released, with its donation to the Apache Foundation announced

Flink CDC is a set of database connectors built on Flink, mainly used to help Flink read the incremental change logs hidden inside business databases. Currently Flink CDC supports more than ten mainstream data sources and links seamlessly with Flink SQL, allowing users to build rich applications with SQL alone.
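A hedged sketch of how such a connector is used from Flink SQL (the connection details and table names are placeholders; the option names follow the open source `mysql-cdc` connector):

```sql
-- Declare a table backed by MySQL's binlog via the mysql-cdc connector.
-- Flink first reads a full snapshot, then streams incremental changes.
CREATE TABLE orders_cdc (
  order_id BIGINT,
  status   STRING,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flink_user',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders'
);

-- Downstream SQL sees a changelog stream: inserts, updates, deletes.
SELECT status, COUNT(*) FROM orders_cdc GROUP BY status;
```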

Reportedly, by capturing data changes, Flink CDC can deliver them to target systems in real time, achieving near-real-time data processing. This approach greatly shortens processing latency, reducing data freshness from days to minutes, and thereby significantly increases business value.

At the conference, Wu Chong, head of Flink SQL and Flink CDC at Alibaba Cloud, said that Flink CDC developed very rapidly in 2023. On the ecosystem side, it added two new connectors, IBM Db2 and Vitess, and extended incremental snapshot reading to more data sources. On the engine side, it delivered many advanced features, such as dynamically adding tables, automatic scale-in, asynchronous splitting, and starting from a specified position; it also added an At-Least-Once reading mode and maintains compatibility across the five major Flink versions from 1.14 to 1.18.

More than three years have passed since Flink CDC was born. Initially it was positioned as a set of connectors for databases and other data sources, but as Flink CDC became widely used in the industry, the team gradually found that this initial positioning could not cover enough business scenarios.


"If it is just a database connector, users who want to build a data integration solution still need to do a lot of assembly work, or they run into limitations. So we hope Flink CDC will be more than a connector for databases: it should also connect to message queues, data lakes, file formats, SaaS services, and more. Going further, we hope to build it into an end-to-end real-time data integration solution and tool that connects data sources, data pipelines, and targets. That is the Flink CDC 3.0 real-time data integration framework we are officially releasing today," Wu Chong said.


Reportedly, the Flink CDC 3.0 real-time data integration framework is built on the Apache Flink core. At the engine layer, Flink CDC 3.0 exposes many advanced capabilities, including real-time synchronization, whole-database synchronization, and the merging of sharded databases and tables. At the connector layer, it already supports synchronization pipelines for MySQL, StarRocks, Doris, and Paimon, with Kafka, MongoDB, and other pipelines also planned. At the access layer, it provides a YAML + CLI API that reduces the cost of developing real-time data integration.
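A hedged sketch of such a YAML pipeline definition (hostnames, credentials, and database names are placeholders; the structure follows the open source Flink CDC 3.0 quickstart):

```yaml
# Pipeline definition submitted with the Flink CDC CLI, e.g.:
#   bash bin/flink-cdc.sh mysql-to-doris.yaml
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: flink_user
  password: "******"
  tables: app_db.\.*        # synchronize every table in app_db

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

pipeline:
  name: Sync app_db to Doris
  parallelism: 2
```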

"The release of the Flink CDC 3.0 real-time data integration framework is a technical milestone for Flink CDC. Behind it stand the power of the community and of open source, so we hope to give back to open source with our technology," Wu Chong said.

Subsequently, at the conference, Alibaba officially announced the donation of Flink CDC to the Apache Software Foundation, as part of Apache Flink.

4. From introduction to going global: an open source ecosystem with a global perspective

Beyond its continuous innovation and fruitful results at the technical level, Apache Flink has also drawn attention in recent years for building an international community ecosystem.

In China, the Flink Chinese community has become one of the most active technical communities. On its fifth anniversary, the community has published a cumulative 750 technical articles, with the active participation of 111 companies and 351 developers; these articles have accumulated 2.35 million reads, highlighting the Chinese contribution to Flink's global technical evangelism.

Globally, Flink has become one of the most active top-level projects of the Apache Foundation. Its community members span Europe, North America, and Southeast Asia, covering academia, industry, and open source communities. As the leader among them, Alibaba has steadily strengthened its international community stewardship; by its statistics, Alibaba has cultivated more than 70% of Flink's Project Management Committee (PMC) members and committers.


It is worth mentioning that in June this year, Flink received the SIGMOD Systems Award 2023 from SIGMOD, the top international conference on data management, for its technological innovation and global influence in real-time big data. Past recipients of the award have all been star projects in the global database field, such as Apache Spark, Postgres, and BerkeleyDB.

From being introduced into China from abroad to Chinese contributions now driving the globalization of Flink's open source technology, Chinese developers are playing an increasingly important role. They not only contribute to the project's development but also actively participate in community activities, jointly driving technical development and innovation. Their contributions have been key to the growth of the Apache Flink ecosystem.

Apache Flink's globally minded open source ecosystem has become an important force behind its continued prosperity. Within it, diverse technologies, cultures, and ideas are exchanged and integrated, providing a powerful impetus for innovation. We have every reason to believe that on this fertile open source soil, more and more developers will join and contribute together to the future of real-time computing.



Origin blog.csdn.net/weixin_44904816/article/details/135027683