CommunityOverCode Asia Track Preview: Stream Processing

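Introduction

After several years of development, big data technology is no longer just a concept; it has been put into practice successfully in specialized segments of every industry. As scenarios that demand real-time processing multiply, enterprises are asking more of their big data processing technology.

Stream processing helps enterprises respond quickly to changing market conditions, customer behavior, and other key business information, and thereby gain a competitive edge. It is rapidly becoming a key technology for modernizing enterprise applications and for improving real-time analytics in data-driven applications.

The stream processing track at CommunityOverCode Asia 2023 (formerly ApacheCon Asia) brings you the latest news from related Apache projects. Let's take a look!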

Track Chairs

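Li Yu (李钰, alias: 绝顶)

ASF Member, Apache Flink & HBase PMC Member, Champion of Apache Paimon (incubating) & Celeborn (incubating), head of the Alibaba Cloud EMR team, and senior technical expert at Alibaba.

Wang Xin (王鑫)

ASF Member; PMC Member and Committer of Apache Storm and the Apache Incubator; Committer of Apache RocketMQ, Apache IoTDB, and Apache StreamPipes; head of real-time data in the big data department at Ant Group.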

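Track Overview

Streaming data processing is the prevailing trend in big data today. Many enterprises want more timely insight into their data, and the old "batch" mindset is quickly giving way to stream processing. A growing number of companies, large and small, treat real-time capability as the first consideration when rethinking their technical architecture, and are building their own real-time computing platforms on powerful open-source engines such as Apache Flink, Apache Spark, Apache Kafka, Apache Pulsar, Apache Storm, Apache StreamPark (incubating), and Apache Paimon (incubating).

In this track, you will hear first-hand how leading companies run these Apache projects in production, and learn about the latest developments in these projects' ecosystems and where stream computing technology is heading.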

Agenda Highlights

August 18, 13:30 - 16:45

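■ Talk: Adaptive Stream-Batch Shuffle in Apache Flink

Time: August 18, 13:30 - 14:00

Abstract:

At Flink Forward Asia 2022, we first proposed the Flink Shuffle 3.0 architecture, which is built around three ideas: cloud native, unified stream and batch, and adaptive.

The new shuffle architecture has the following advantages:

1. It is better suited to the resource orchestration and isolation characteristics of cloud-native environments;

2. It combines the strengths of traditional streaming and batch shuffle techniques;

3. It adapts to resources and load at runtime, which makes it easier to use.

In this talk, we will present the latest progress made in this area in Flink 1.18, along with future plans.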

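Speaker

Song Xintong (宋辛童)丨Senior Technical Expert, Alibaba Cloud

Apache Flink PMC Member & Committer, senior technical expert at Alibaba Cloud, and head of the Alibaba Cloud Flink Shuffle & SDK team.

Speaker

Tan Yuxin (谭玉新)丨Senior Development Engineer, Alibaba Cloud

Works in the open-source big data department of Alibaba Cloud's computing platform, focusing on the Apache Flink open-source project.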

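■ Talk: Building a Streaming Graph Processing System on Apache Calcite/Gremlin

Time: August 18, 14:00 - 14:30

Abstract:

Typical stream computing mainly targets the table model; general-purpose stream engines still struggle to support streaming processing and analysis over the graph model. This talk introduces GeaFlow, the streaming graph engine developed in-house at Ant Group, and explains how GeaFlow builds its streaming graph query language capabilities around Apache Calcite and Gremlin (the graph traversal language of Apache TinkerPop). It also shares Ant Group's internal practice and applications of streaming graph computing.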

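Speaker

Pan Zhenxuan (潘臻轩)丨Senior Technical Expert, Ant Group

Senior technical expert at Ant Group, currently leading the streaming graph computing team in Ant's graph computing department. He joined the Alibaba Group data platform in 2012 and Ant Group's data technology department in 2016, and lived through the evolution of real-time computing at Alibaba and Ant from zero to one. Since late 2017 he has been responsible for building the streaming graph system and its team, creating Ant's streaming graph system from scratch. He has a deep understanding of real-time computing, graph computing, and the application scenarios built on top of them.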

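■ Talk: China Unicom's Large-Scale Real-Time Computing in Production with Apache StreamPark

Time: August 18, 14:30 - 15:00

Abstract:

1. The big data real-time computing platform supports event-based low-latency processing as well as unified stream-batch data processing. It underpins real-time business for 30+ internal and external organizations and 10,000+ data service subscriptions, processes 2.3 trillion records and 600+ TB of data per day, runs on a dedicated cluster of 480+ servers, and serves more than ten production product lines.

2. Built on Apache StreamPark, a one-stop management platform for real-time computing jobs, the platform manages 500+ Flink on YARN real-time jobs in production. Through a simple, visual workflow it covers project, job, team, permission, alert, log, version, and cluster management, resource configuration, Flink JAR and Flink SQL jobs, and monitoring dashboards, delivering full-lifecycle management of real-time jobs. This has pulled the team out of the job-operations quagmire, improved management efficiency, reduced failure rates, and raised the quality of business support, bringing real-time computing under unified, platform-based management.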

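Speaker

Mu Chunjin (穆纯进)丨Head of R&D, Big Data Real-Time Computing Platform, China Unicom Digital Technology Co., Ltd.

Apache StreamPark PMC member and head of R&D for the big data real-time computing platform, responsible for development, operations, and platform construction for trillion-record-scale Flink real-time computing.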

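■ Talk: Field-Level Lineage and Data Permission Solutions for FlinkSQL

Time: August 18, 15:00 - 15:30

Abstract:

Data lineage and data security are indispensable capabilities for an enterprise-grade data warehouse. As demand for real-time big data has grown across industries in recent years, real-time data warehouses, with Flink as the leading example, have risen quickly. Because this field is relatively young, however, the mature lineage and security solutions from the offline warehouse world, built on Apache Ranger and Apache Atlas, do not yet support Flink SQL, and depending on Ranger and Atlas makes deployment and operations too heavy. It is therefore especially important to implement field-level lineage and data permission management for FlinkSQL without modifying the Flink or Calcite source code. This talk presents such a solution in detail and helps the audience build the equivalent of Atlas plus Ranger for Flink real-time data warehouses.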

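Speaker

Bai Song (白松)丨Deputy General Manager, R&D Center, Hangzhou Shulan Technology Co., Ltd. (杭州数澜科技有限公司)

Co-founder of Shulan Technology and deputy general manager of its R&D center, with 9 years of experience in big data platform development and a focus on big data, real-time computing, and data permissions. He leads R&D for the company's core products, the Shuqi (数栖) platform and Shuqi EMR. Shuqi products now serve as data middle-platform infrastructure for hundreds of companies in China and abroad, including CITIC Group, Foxconn, Vanke, BMW, and Zhejiang Communications Investment Group.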

■ Talk: Streaming Apache Kudu within Apache Flink

Time: August 18, 15:45 - 16:15

Abstract:

CDC is not yet supported in Apache Kudu, so when integrating with Apache Flink there is no way to read data from it in a streaming fashion, as is possible with other CDC-enabled data sources. To overcome this, an Apache Flink source connector has been built that lets Apache Kudu stream data continuously and incrementally. In this talk, we will discuss the detailed design and implementation of the solution.

Speaker

Wei Chen丨Staff Software Engineer, eBay

Wei focuses on empowering eBay's Notification Platform by leveraging big data and stream processing technologies. He is also a tech blog writer and an active contributor to the open source community. Wei received his bachelor's and master's degrees from Shanghai Jiao Tong University.

■ Talk: Shaping the Future: Unveiling High-Concurrency Streaming Analytics with Apache Druid

Time: August 18, 16:15 - 16:45

Abstract:

Stream processing is rapidly evolving to meet the high-demand, real-time requirements of today's data-driven world. As organizations seek to leverage the real-time insights offered by streaming data, the need for robust, highly concurrent analytics platforms has never been greater.

This presentation introduces Apache Druid, a modern, open-source data store designed for such real-time analytical workloads. Apache Druid's key strength lies in its ability to ingest massive quantities of event data and serve sub-second queries, making it a leading choice for high-concurrency streaming analytics. Our exploration will cover its architecture, underlying principles, tuning principles, and the unique features that make it well suited to high-concurrency use cases. We'll dive into real-life applications, demonstrate how Druid addresses the challenge of immediate data visibility, and discuss its role in powering interactive, exploratory analytics on streaming data.

Participants will gain an in-depth understanding of Apache Druid's value in the rapidly evolving landscape of streaming analytics and will be equipped with the knowledge to harness its power in their own data-intensive environments. Join us as we delve into the future of real-time analytics in "Shaping the Future: Unveiling High-Concurrency Streaming Analytics with Apache Druid".

Speaker

Tijo Thomas丨Lead Solutions Architect, Imply Data Inc.

A lead with great passion for big data technology and 18+ years of experience in the software industry (engineering, professional services, product management). He helps customers in the field, negotiating feature requests with them and aligning those requests with the product roadmap. He has extensive experience across the stack in managing, architecting, designing, and implementing big data applications, frameworks, and platforms, with more than 4 years as a Solution Architect, and experience designing and implementing a highly scalable SaaS platform for the public cloud. He holds two patents in the area of big data.

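August 19, 13:30 - 16:45

■ Talk: Real-Time Data Integration with Flink CDC at Alibaba Cloud

Time: August 19, 13:30 - 14:00

Abstract:

CDC (Change Data Capture) is a technique for capturing changes from a database. Flink CDC is a leading open-source framework for real-time data integration; its technical strengths include unified full and incremental reading, lock-free reads, concurrent reads, and a distributed architecture, and it is very popular in the open-source community. Flink CDC offers powerful data processing capabilities: with SQL, users can join, aggregate, and widen database data in real time. Combined with Flink's rich downstream ecosystem, the processed data can easily be written to Kafka, Hudi, Iceberg, Doris, and other sinks, bringing data into lakes and warehouses in real time. In this talk, we will first introduce the core design and key implementation of Flink CDC and walk through the new features in version 2.4.0. We will then use concrete business scenarios to share how Flink CDC is applied inside Alibaba Cloud to solve business pain points, such as lake and warehouse ingestion and expired binlogs.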
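To make the "full plus incremental" idea in the abstract concrete, here is a minimal Python sketch of replaying captured change events on top of an initial snapshot. It only illustrates the general CDC concept and is not Flink CDC's API or implementation; the `ChangeEvent` class, the `apply_changes` function, and the `c`/`u`/`d` operation codes are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change (hypothetical shape): op is 'c', 'u' or 'd'."""
    op: str                     # 'c' = create, 'u' = update, 'd' = delete
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # new row image; None for deletes


def apply_changes(snapshot: Dict[int, dict],
                  events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Replay change events, in order, on top of a full snapshot."""
    table = dict(snapshot)  # copy so the original snapshot is left untouched
    for ev in events:
        if ev.op in ("c", "u"):
            table[ev.key] = ev.row      # upsert the new row image
        elif ev.op == "d":
            table.pop(ev.key, None)     # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown op: {ev.op!r}")
    return table


# Full phase: the initial snapshot of the source table.
snapshot = {1: {"name": "alice", "city": "Hangzhou"}}

# Incremental phase: changes captured after the snapshot was taken.
events = [
    ChangeEvent("c", 2, {"name": "bob", "city": "Beijing"}),
    ChangeEvent("u", 1, {"name": "alice", "city": "Shanghai"}),
    ChangeEvent("d", 2),
]

result = apply_changes(snapshot, events)
print(result)
```

A downstream sink that applies the same upsert and delete semantics ends up with the same table state as the source, which is the property real-time lake and warehouse ingestion relies on.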

Speaker

Ruan Hang (阮航)丨Senior R&D Engineer, Alibaba Cloud

Senior R&D engineer at Alibaba Cloud, Flink CDC Maintainer & Apache Flink Contributor.

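■ Talk: An In-Depth Look at Ziroom's Large-Scale Real-Time Computing on Kubernetes in Production with Apache StreamPark

Time: August 19, 14:00 - 14:30

Abstract:

1. In this talk, we take a deep dive into how Ziroom uses Apache StreamPark, a one-stop management platform for real-time computing jobs, to manage more than 300 Flink on Kubernetes real-time jobs in fine detail. Apache StreamPark gives us an intuitive visual interface that covers many key functions, including Flink job development, job deployment to Kubernetes, Flink Docker image management, and Flink Kubernetes Pod Template management.

2. We have also explored some innovative practices on top of StreamPark: by integrating further with our scheduling system, we implemented FlinkSQL-based offline data synchronization, which streamlined our data processing.

With Apache StreamPark, we achieved full-lifecycle management of real-time jobs and greatly improved both development and management productivity. This journey demonstrates the power of platform-based management for real-time computing and its value in real production environments.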

Speaker

Chen Zhuoyu (陈卓宇)丨Big Data Platform R&D Engineer, Ziroom

Apache StreamPark PPMC member.

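■ Talk: Flink K8S Operator AutoScaling

Time: August 19, 14:30 - 15:00

Abstract:

Stream processing is prominent in today's big data landscape, and Apache Flink is a dark horse that keeps catching everyone's attention, but the round-the-clock operations burden it brings cannot be ignored. With cost reduction and efficiency gains now a priority, effective resource utilization has become a focus. This talk covers the Flink K8S Operator, a subproject that grew out of the Apache Flink community. It briefly introduces the project's origin and history, then presents the autoscaling feature introduced in recent releases (FLIP-271), explaining how it works and the best practices for using it. It then touches on the zero-downtime update feature the community is implementing (FLIP-291), and closes with the Flink community's future plans in this area.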
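As a rough illustration of what utilization-driven autoscaling means, the Python sketch below picks a parallelism so that each subtask runs near a target utilization. It is a simplified, hypothetical model and does not reproduce the Flink K8S Operator's actual algorithm or configuration; the function name, its parameters, and the default values are assumptions made for this example.

```python
import math


def suggest_parallelism(current: int,
                        input_rate: float,
                        max_processing_rate: float,
                        target_utilization: float = 0.5,
                        max_parallelism: int = 128) -> int:
    """Suggest a parallelism for one operator (hypothetical model).

    input_rate:          records/second arriving at the operator.
    max_processing_rate: records/second the operator could process at its
                         current parallelism if it were busy all the time.
    target_utilization:  desired busy fraction per subtask, in (0, 1].
    """
    if current < 1 or max_processing_rate <= 0:
        raise ValueError("current must be >= 1 and the rate must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    # Throughput a single subtask can sustain when fully busy.
    per_subtask_capacity = max_processing_rate / current
    # Subtasks needed so each one stays at the target utilization.
    needed = input_rate / (per_subtask_capacity * target_utilization)
    # Clamp to the allowed range; always keep at least one subtask.
    return max(1, min(max_parallelism, math.ceil(needed)))


# 4 subtasks can handle 8000 rec/s in total (2000 each). At 50% target
# utilization each should take about 1000 rec/s, so 4500 rec/s needs 5.
print(suggest_parallelism(current=4, input_rate=4500.0,
                          max_processing_rate=8000.0))  # prints 5
```

Scaling toward a target utilization below 100% leaves headroom for load spikes and catch-up after restarts, which is the trade-off between cost and stability that this kind of tuning has to balance.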

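Speaker

Chen Zhengyu (陈政羽)丨Senior Big Data Development Engineer, Zhenyouqu Games (真有趣游戏)

Apache Flink/StreamPark Contributor with long experience in data development for the gaming industry. He is currently responsible for building the company's cloud-native Flink job deployment platform and for job development. For Zhenyouqu Games he built from scratch a one-stop intelligent Flink platform for deploying and submitting jobs, an anti-cheat platform, and a data integration platform.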

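■ Talk: RSQLDB: A Streaming Database Built on a Message Queue

Time: August 19, 15:00 - 15:30

Abstract:

As digitalization deepens, data is growing explosively and placing ever higher demands on the timeliness and correctness of data processing; stream computing emerged in response. Message queue products, as data movement platforms, are widely used in big data computing architectures, and examples of stream computing built on message queues or messaging engines are too many to count. In the cloud computing era, however, cost of use has become a primary concern in architecture design and evolution. RSQLDB is a distributed stream computing engine that uses the RocketMQ message queue as its storage. It supports production deployments with as few as two nodes, and its standard SQL interface greatly lowers the barrier to entry. Functionally, RSQLDB supports windows, JOINs, state recovery, and more.

This talk introduces RSQLDB from the following angles:

1. The evolution of stream computing and why RSQLDB is needed;

2. The architecture and design principles of RSQLDB;

3. RSQLDB in practice at Alibaba Cloud.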

Speaker

Ni Ze (倪泽)丨Alibaba Cloud; Apache RocketMQ Committer; maintainer of RocketMQ Streams and RSQLDB

Apache RocketMQ Committer, RocketMQ Streams maintainer, RSQLDB maintainer, and computing R&D expert on the cloud-native messaging team.

■ Talk: State of Scala API in Apache Flink

Time: August 19, 15:45 - 16:15

Abstract:

As a Scala developer writing a new Flink job, you expect to use the latest Scala 3 version rather than the one Flink was compiled with. Support for Scala 2.13 and Scala 3 was not really possible until Flink 1.15 came out. In this talk we will review how the Scala API was implemented in Apache Flink prior to version 1.15 and what changed in that release. To let Scala developers use any Scala version, Apache Flink took nearly the opposite approach to the Apache Spark project, which is an interesting discussion in its own right.

During this talk we will go through an example SBT project for building Flink jobs with Scala 3. We will look at the current community options for Scala wrappers around the Flink Java API and the challenges they involve. As a result, we will see that using Scala in Flink jobs is much more convenient than writing your streaming jobs with the Java API. An introduction to Scala CLI makes the whole packaging experience for Scala jobs a pure joy.

Speaker

Alexey丨Solution Architect, Ververica

Alexey is a Solution Architect who has worked for the last 6 years on data solutions and products. At Ververica, he focuses on helping clients solve their challenges in adopting data stream processing with Apache Flink. In his previous projects and companies he developed systems such as data lakes, data integration layers, and data virtualization layers. He has also spent many years developing data services for investment banks, including currency trading software. In his spare time, he contributes to various open-source projects or starts his own for fun. His hobbies are astronomy, playing music, and the gym.

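■ Talk: Building Xiaomi's Flink Real-Time Computing Platform in Practice

Time: August 19, 16:15 - 16:45

Abstract:

This talk focuses on building a real-time computing platform. Drawing on Xiaomi's own business experience, it shares Xiaomi's exploration and work in real-time computing toward a unified real-time computing platform that is resource-elastic, low-cost, and easy to use.

Outline:

1. Introduction to Xiaomi's real-time computing platform: this part gives an overview of real-time computing workloads at Xiaomi and, following the platform's evolution, explains the pain points encountered and how they were solved.

2. Building the real-time computing platform: this part presents Xiaomi's overall platform architecture and discusses Xiaomi's work on platform usability through unified metadata management, permission management, lineage, scheduling management, and related areas.

3. Platform operations and governance: this part looks closely at operating and governing real-time computing, sharing Xiaomi's work at the framework and platform layers and how, guided by a closed-loop governance methodology, productization gave the platform resource elasticity, low cost, and ease of use.

4. Summary and outlook: this part briefly summarizes the talk and discusses where real-time computing platforms may evolve next.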

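Speaker

Chen Zihao (陈子豪)丨Software R&D Engineer, Xiaomi

Software R&D engineer at Xiaomi, mainly responsible for Xiaomi's real-time computing platform and Flink framework kernel development.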

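Track Agenda

As the official global conference series of the Apache Software Foundation (ASF), CommunityOverCode Asia each year draws participants and communities of all levels from around the world to explore "tomorrow's technology" together. At CommunityOverCode Asia 2023, taking place August 18 to 20, you can get an up-close look at the latest developments and emerging innovations from Apache projects.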