CommunityOverCode Asia presentation on stream processing

Introduction

After several years of development, big data technology is no longer just a concept; it has been put into practice successfully across many industry segments. As the number of real-time scenarios grows, enterprises are placing higher demands on big data processing technology.

Stream processing is rapidly becoming a key technology for modernizing enterprise applications and improving real-time data analytics for data-driven applications, as it can help businesses gain a competitive advantage by quickly responding to changing market conditions, customer behavior, and other business-critical information.

The stream processing track of CommunityOverCode Asia 2023 (formerly ApacheCon Asia) will bring you the latest information on Apache-related projects. Let's take a look!

Producers

Li Yu (alias: Jue Ding)

ASF Member, Apache Flink & HBase PMC Member, Apache Paimon (incubating) & Celeborn (incubating) Champion, Alibaba Cloud EMR team leader, Alibaba senior technical expert.

Wang Xin

ASF Member; PMC Member and Committer of Apache Storm and the Apache Incubator; Committer of Apache RocketMQ, Apache IoTDB, and Apache StreamPipes; head of real-time data in Ant Group's Big Data Department.

Topics

Streaming data processing is a trend in today's big data field. Many companies are eager to gain insight into their data in a more timely manner, and the earlier "batch processing" mindset is rapidly being replaced by stream processing. More and more companies, large and small, are rethinking their technical architecture with real-time as the first consideration, and are starting to use powerful open source engines such as Apache Flink, Apache Spark, Apache Kafka, Apache Pulsar, Apache Storm, Apache StreamPark (incubating), and Apache Paimon (incubating) to build their own real-time computing platforms.

In this track, you will learn about the hands-on experience of leading companies applying these Apache projects in their production environments, as well as the latest developments in these projects' ecosystems and the future direction of stream computing technology.

Agenda Highlights

August 18, 13:30 - 16:45

■  Speech Topic: Apache Flink Stream Batch Adaptive Shuffle

Sharing time: 13:30 - 14:00, August 18

Topic introduction:

At Flink Forward Asia in 2022, we first proposed the Flink Shuffle 3.0 architecture with cloud native, stream-batch fusion, and self-adaptation as the core.

The new Shuffle architecture has the following advantages:

1. It adapts better to the resource orchestration and isolation characteristics of cloud-native environments;

2. It combines the advantages of traditional streaming and batch Shuffle technologies;

3. It can make adaptive adjustments according to the resources and load conditions at runtime, making it easier to use.

In this talk, we will introduce the latest progress in Flink 1.18 and the future plans in this area.

Guest introduction

Song Xintong丨Alibaba Cloud Senior Technical Expert

Apache Flink PMC Member & Committer, Alibaba Cloud Senior Technical Expert, Alibaba Cloud Flink Shuffle & SDK Team Leader.

Guest introduction

Tan Yuxin丨Alibaba Cloud Senior Development Engineer

Works in the open source big data department of Alibaba Cloud Computing Platform, focusing on the Apache Flink open source project.


■  Speech Topic: Building a Streaming Graph Processing System Based on Apache Calcite/Gremlin

Sharing time: 14:00 - 14:30, August 18

Topic introduction:

Typical stream computing mainly targets processing scenarios built on table models, while stream processing and analysis on graph models is still hard for general-purpose stream computing engines to support. This talk introduces GeaFlow, Ant Group's self-developed streaming graph engine, and explains how GeaFlow builds a streaming graph query language around Apache Calcite and Apache Gremlin. We will also share the practice and application of streaming graph computing at Ant.

Guest introduction

Pan Zhenxuan丨Ant Group Senior Technical Expert

Senior technical expert at Ant Financial, currently in charge of the streaming graph computing team in Ant's Graph Computing Department. He joined the Alibaba Group Data Platform in 2012 and the Ant Group Data Technology Department in 2016, and experienced the evolution of real-time computing at Alibaba and Ant from 0 to 1. Since the end of 2017, he has been responsible for building Ant's streaming graph system and its team from 0 to 1. He has an in-depth understanding of real-time computing and graph computing, as well as their upper-level application scenarios.


■  Speech topic: China Unicom's large-scale real-time computing production practice based on Apache StreamPark

Sharing time: 14:30 - 15:00, August 18

Topic introduction:

1. The big data real-time computing platform supports event-based low-latency processing and unified stream-batch data processing. It serves the real-time businesses of 30+ internal and external organizations and 10,000+ data service subscriptions, processes 2.3 trillion records (600TB+) every day on a dedicated cluster of 480+ servers, and supports more than a dozen production business product lines.

2. Built on Apache StreamPark, a one-stop management platform for real-time computing jobs, it manages 500+ Flink on YARN jobs in the production environment. Through a concise visual workflow it covers project management, job management, team management, permission management, alarm management, log management, version management, cluster management, resource configuration, Flink JAR and Flink SQL jobs, and a large-screen monitoring dashboard. This provides lifecycle management for real-time jobs and helps the team escape the quagmire of job operation and maintenance, improve management efficiency, reduce the failure rate, improve the quality of business support, and fully integrate real-time computing with platform-based management.

Guest introduction

Mu Chunjin丨Head of R&D of Big Data Real-time Computing Platform of China Unicom Digital Technology Co., Ltd.

Apache StreamPark PMC, head of big data real-time computing platform R&D, responsible for trillion-level Flink real-time computing development, operation and maintenance, and platform construction.


■  Speech topic: FlinkSQL's field lineage and data permission solution

Sharing time: 15:00 - 15:30, August 18

Topic introduction:

Data lineage and data security are indispensable capabilities for building an enterprise-level data warehouse. In recent years, as demand for real-time big data has grown across industries, real-time data warehouses represented by Flink have emerged rapidly. For offline data warehouses, solutions based on Apache Ranger and Apache Atlas are relatively mature. Because real-time data warehouses have had much less time to develop, however, these data lineage and security solutions do not yet support Flink SQL, and relying on Ranger and Atlas makes system deployment, operation, and maintenance too heavy. It is therefore particularly important to implement field lineage and data permission management for FlinkSQL without modifying the source code of Flink or Calcite. This talk will introduce the related solutions in detail to help the audience build Atlas- and Ranger-like capabilities for Flink real-time data warehouses.

Guest introduction

Bai Song丨Deputy General Manager of R&D Center of Hangzhou Shulan Technology Co., Ltd.

Co-founder of Shulan Technology Co., Ltd. and deputy general manager of its R&D center, with 9 years of experience in big data platform research and development, focusing on big data, real-time computing, and data permissions. He is responsible for the research and development of the company's core products, Shuqi Platform and Shuqi EMR. Shuqi products have become infrastructure tools that hundreds of companies at home and abroad use to build their data platforms, including CITIC Group, Foxconn, Vanke, BMW, and Zhejiang Communications Investment Group.


■  Speech topic: Streaming Apache Kudu within Apache Flink

Sharing time: 15:45 - 16:15, August 18

Topic introduction:

So far, CDC is not supported within Apache Kudu, so when integrating with Apache Flink there is no way to read data from it in a streaming style, as is possible with other CDC-enabled data sources. To overcome this, an Apache Flink source connector has been built that lets Apache Kudu stream data in a continuous and incremental way. In this talk, we will discuss and share the detailed design and implementation of the solution.
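As a rough illustration of what "continuous and incremental" reading can mean for a store without CDC, the Python sketch below polls a table and emits only the rows newer than a high-water mark. This is a generic technique shown under assumptions, not the design presented in the talk: the `(version, payload)` row shape, the `incremental_batches` helper, and the in-memory `table` are all hypothetical and do not use the Kudu or Flink APIs.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical row shape: (monotonically increasing version, payload).
Row = Tuple[int, str]


def incremental_batches(
    scan: Callable[[int], List[Row]], start: int, max_polls: int
) -> Iterator[List[Row]]:
    """Poll `scan` up to `max_polls` times, yielding only rows newer than the mark."""
    mark = start
    for _ in range(max_polls):
        rows = sorted(scan(mark))
        if rows:
            mark = rows[-1][0]  # advance the high-water mark
            yield rows


# In-memory stand-in for a table that has no change feed.
table: List[Row] = [(1, "a"), (2, "b")]


def scan_after(version: int) -> List[Row]:
    """Return all rows whose version is greater than `version`."""
    return [row for row in table if row[0] > version]


poller = incremental_batches(scan_after, start=0, max_polls=3)
first = next(poller)    # initial read of the existing rows
table.append((3, "c"))  # a new row arrives between polls
second = next(poller)   # only the new row is emitted
```

A polling scheme like this only sees rows that carry a growing version or timestamp column, and it cannot observe deletes, which is one reason a purpose-built connector may take a different approach.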

Guest introduction

Wei Chen丨eBay Staff Software Engineer

Wei focuses on empowering eBay's Notification Platform by leveraging big data and stream processing technologies. He is also a tech blog writer and actively contributes to the open source community. Wei received his bachelor's and master's degrees from Shanghai Jiao Tong University.


■  Speech topic: Shaping the Future: Unveiling High-Concurrency Streaming Analytics with Apache Druid

Sharing time: 16:15 - 16:45, August 18

Topic introduction:

Stream processing is rapidly evolving to meet the high-demand, real-time requirements of today's data-driven world. As organizations seek to leverage the real-time insights offered by streaming data, the need for robust, highly concurrent analytics platforms has never been greater.

This presentation introduces Apache Druid, a modern, open-source data store designed for such real-time analytical workloads. Apache Druid's key strength lies in its ability to ingest massive quantities of event data and provide sub-second queries, making it a leading choice for high-concurrency streaming analytics. Our exploration will cover Druid's architecture, its underlying principles, tuning principles, and the unique features that make it well suited to high-concurrency use cases. We'll dive into real-life applications, demonstrate how Druid addresses the challenge of immediate data visibility, and discuss its role in powering interactive, exploratory analytics on streaming data.

Participants will gain an in-depth understanding of Apache Druid’s value in the rapidly evolving landscape of streaming analytics and will be equipped with the knowledge to harness its power in their own data-intensive environments. Join us as we delve into the future of real-time analytics in 'Shaping the Future: Unveiling High-Concurrency Streaming Analytics with Apache Druid'.

Guest introduction

Tijo Thomas丨Imply Data Inc. Lead Solutions Architect

A lead with great passion for big data technology and 18+ years of experience in the software industry (engineering, professional services, product management). He helps customers in the field, negotiating feature requests with them and aligning those requests with the product roadmap. He has extensive experience across the stack in managing, architecting, designing, and implementing big data applications, frameworks, and platforms, including more than 4 years as a Solution Architect, and experience designing and implementing a highly scalable SaaS platform for the public cloud. He holds two patents in the area of big data.


August 19, 13:30 - 16:45

■  Speech topic: Alibaba Cloud's real-time data integration practice based on Flink CDC

Sharing time: 13:30 - 14:00, August 19

Topic introduction:

CDC (Change Data Capture) is a technology for capturing changes from a database. Flink CDC is a representative open source framework for real-time data integration, with technical advantages such as integrated full and incremental synchronization, lock-free reading, concurrent reading, and a distributed architecture, and it is very popular in the open source community. Flink CDC offers powerful data processing capabilities: it can join, aggregate, and widen database data in real time through SQL. With Flink's rich downstream ecosystem, the processed data can easily be written to Kafka, Hudi, Iceberg, Doris, and other systems, enabling real-time ingestion into data lakes and warehouses. In this talk, we will first introduce the core design and key implementation of Flink CDC and explain the new features of version 2.4.0 in detail. Then, drawing on specific business scenarios, we will share how Alibaba Cloud uses Flink CDC internally to address business pain points, such as lake and warehouse ingestion and Binlog expiration.
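To make the idea of change data capture concrete, here is a minimal Python sketch of a downstream consumer that starts from a full snapshot and then applies a stream of change events to keep its copy in sync. The event shape (`op`, `key`, `row`) and the `apply_changes` helper are hypothetical and chosen only for illustration; they are not the Flink CDC API or its record format.

```python
from typing import Any, Dict, Iterable

# Hypothetical change event:
# {"op": "insert" | "update" | "delete", "key": <primary key>, "row": <dict or None>}
ChangeEvent = Dict[str, Any]
Table = Dict[Any, Dict[str, Any]]


def apply_changes(table: Table, events: Iterable[ChangeEvent]) -> Table:
    """Apply change events in order so that `table` mirrors the source table."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            table[key] = dict(event["row"])  # upsert the latest row image
        elif op == "delete":
            table.pop(key, None)             # tolerate deletes of unknown keys
        else:
            raise ValueError(f"unknown op: {op!r}")
    return table


# Full phase: an initial snapshot of the source table.
snapshot: Table = {1: {"name": "alice", "city": "Hangzhou"}}

# Incremental phase: changes captured after the snapshot was taken.
changelog = [
    {"op": "insert", "key": 2, "row": {"name": "bob", "city": "Beijing"}},
    {"op": "update", "key": 1, "row": {"name": "alice", "city": "Shanghai"}},
    {"op": "delete", "key": 2, "row": None},
]

replica = apply_changes(dict(snapshot), changelog)
```

Applying the events strictly in order is what lets the replica converge to the source state; a real pipeline would also need to handle the hand-off between the snapshot and the change log, which this sketch leaves out.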

Guest introduction

Ruan Hang丨Alibaba Cloud Senior R&D Engineer

Alibaba Cloud Senior R&D Engineer, Flink CDC Maintainer & Apache Flink Contributor.


■ Speech topic: Deep analysis of Ziroom's large-scale On Kubernetes real-time computing production practice based on Apache StreamPark

Sharing time: 14:00 - 14:30, August 19

Topic introduction:

1. In this speech, we will discuss in depth how to use Apache StreamPark, a one-stop real-time computing job management platform, to finely manage more than 300 Flink On Kubernetes real-time jobs. Apache StreamPark provides us with an intuitive visual interface to help us manage many key functions, including Flink job development, job deployment to Kubernetes, Flink Docker image management, Flink Kubernetes Pod Template management, etc.

2. We have also explored some innovative practices based on StreamPark: we have further combined with the scheduling system to realize offline data synchronization based on FlinkSQL, thereby optimizing the data processing process.

Through Apache StreamPark, we have realized the full lifecycle management of real-time jobs, greatly improving the efficiency of development and management. This process vividly demonstrates the powerful capabilities of real-time computing platform management and its great value in the actual production environment.

Guest introduction

Chen Zhuoyu丨Ziroom Big Data Platform R&D Engineer

Apache StreamPark PPMC.


■  Speech topic: Flink K8S Operator AutoScaling

Sharing time: 14:30 - 15:00, August 19

Topic introduction:

Stream processing is a trend in today's big data field, and Apache Flink is a dark horse that keeps drawing attention, but the challenges of operating it around the clock cannot be ignored. With the current focus on reducing costs and increasing efficiency, effective resource utilization has become a central concern. This talk covers the Flink K8S Operator, a sub-project of the Apache Flink community. It briefly introduces the origin and development history of the project, explains in detail the working principles and best practices of the autoscaling feature (FLIP-271) introduced in the latest version, introduces the non-stop update feature (FLIP-291) that the community is implementing, and closes with some of the Flink community's future plans in this area.

Guest introduction

Zhengyu Chen丨Senior Big Data Development Engineer of Really Fun Games

Apache Flink/StreamPark Contributor. He has long been engaged in data development in the game industry and is currently in charge of building the company's cloud-native Flink big data job deployment platform and of job development. He built from 0 to 1 Really Fun Games' one-stop Flink intelligent operations platform for building, deploying, and submitting jobs, as well as its anti-cheat platform and data integration platform.


■  Speech Topic: RSQLDB Streaming Database Based on Message Queue

Sharing time: 15:00 - 15:30, August 19

Topic introduction:

With the deepening of digitalization and the explosive growth of data, the demands on the timeliness and correctness of data processing keep rising, and stream computing has emerged in response. At the same time, message queue products are widely used as data transfer platforms in big data computing architectures, and there are countless cases of stream computing built on message queues and message engines. In the era of cloud computing, however, reducing the cost of use has become a main goal of architecture design and evolution. RSQLDB is a distributed stream computing engine that uses the message queue RocketMQ as its storage. It can be deployed in production with as few as 2 nodes, and its standardized SQL interface greatly lowers the barrier to use. Functionally, RSQLDB supports windows, JOIN, state recovery, and more.

This speech will introduce RSQLDB from the following aspects:

1. The evolution of stream computing, why RSQLDB is needed;

2. RSQLDB architecture design principle;

3. The application practice of RSQLDB in Alibaba Cloud.

Guest introduction

Ni Ze丨Aliyun, Apache RocketMQ Committer, RocketMQ Streams maintainer, RSQLDB maintainer

Apache RocketMQ Committer, RocketMQ Streams maintainer, RSQLDB maintainer, cloud-native messaging team R&D computing expert.


■  Speech topic: State of Scala API in Apache Flink

Sharing time: 15:45 - 16:15, August 19

Topic introduction:

As a Scala developer writing a new Flink job, you expect to use the latest Scala 3 version rather than the one Flink was compiled with. Support for Scala 2.13 and Scala 3 was not really possible until Flink 1.15 came out. In this talk we will review how the Scala API was implemented in Apache Flink prior to version 1.15 and what changed in that release. To let Scala developers use any Scala version, Apache Flink chose quite the opposite approach from the Apache Spark project, which is an interesting discussion on its own.

During this talk we will go through an SBT example project that builds Flink jobs with Scala 3. We will look at the current community options for Scala wrappers around the Flink Java API and the challenges related to them. As a result, we will see that using Scala in Flink jobs is much more convenient than writing your streaming jobs with the Java API. An introduction to the Scala CLI makes the whole packaging experience of Scala jobs a pure joy.

Guest introduction

Alexey丨Ververica Solution Architect

Alexey is a Solution Architect who has worked on data solutions and products for the last 6 years. At Ververica, he focuses on helping clients solve their challenges in adopting data stream processing with Apache Flink. In his previous projects and companies he developed systems such as Data Lakes, Data Integration, and Data Virtualization Layers. He has also spent many years developing data services for investment banks, including currency trading software. In his spare time, he contributes to various open-source projects or starts his own for fun. His hobbies are astronomy, playing music, and the gym.


■  Speech topic: Construction practice of Xiaomi Flink real-time computing platform

Sharing time: 16:15 - 16:45, August 19

Topic introduction:

This talk focuses on the construction of a real-time computing platform. Drawing on Xiaomi's own business practice, it shares Xiaomi's exploration and construction work in real-time computing, aimed at creating a unified real-time computing platform with resource elasticity, low cost, and ease of use.

Content outline:

1. Introduction to Xiaomi's real-time computing platform: This part gives a business overview of real-time computing at Xiaomi and explains the pain points encountered and the solutions adopted as the platform evolved.

2. Real-time computing platform construction: This part introduces the overall architecture of Xiaomi's real-time computing platform and explores its usability in terms of unified metadata management, permission management, lineage, and scheduling management.

3. Platform operation, maintenance, and governance: This part explores the operation, maintenance, and governance of real-time computing in depth, shares Xiaomi's exploration at the framework layer and the platform layer, and shows how, guided by a closed-loop governance methodology, productization gives the platform resource elasticity, low cost, and ease of use.

4. Summary and outlook: This part briefly summarizes the talk and discusses the future evolution of the real-time computing platform.


Guest introduction

Chen Zihao丨Xiaomi Software R&D Engineer

Xiaomi software R&D engineer, mainly responsible for Xiaomi real-time computing platform and Flink framework kernel development.


As the official global conference series of the Apache Software Foundation (ASF), CommunityOverCode Asia attracts participants and communities from around the world every year to explore "tomorrow's technology". At the upcoming CommunityOverCode Asia 2023, held from August 18th to 20th, you can experience the latest developments and emerging innovations from Apache projects up close.

Origin blog.csdn.net/weixin_44904816/article/details/132074200