Alibaba Summer of Code Project: Apache Flink

Project introduction:

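Apache Flink is an open-source stream processing framework developed by the Apache Software Foundation. Its core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime system can execute both batch and stream processing programs. In addition, Flink's runtime natively supports the execution of iterative algorithms.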

Idea list

1. Add a new implementation of the HighAvailabilityServices using etcd: https://issues.apache.org/jira/browse/FLINK-11105

Mentor: 沙晟阳 @ 成阳; GitHub ID: [MalcolmSanders](https://github.com/MalcolmSanders); Apache YARN and Flink contributor; Senior Development Engineer, Alibaba Cloud Computing Platform

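2. Run Flink efficiently in environments with limited hardware resources, such as the Raspberry Pi, and apply Flink to IoT and edge computing scenarios.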

Mentor: 宋辛童 @ 五藏; GitHub ID: xintongsong; PhD, Peking University; Senior Development Engineer at Alibaba

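3. Write, remotely submit, and debug distributed Flink jobs in one place through IntelliJ IDEA. IntelliJ IDEA is an excellent IDE for programming languages, and Flink is a next-generation distributed big data processing engine. Combining the two to build a one-stop service in IntelliJ IDEA for writing Flink jobs, submitting them remotely, debugging them in a distributed setting, and operating them online would give Flink users a better experience. This project helps participants become proficient with Flink, improve their ability to develop and use big data processing and related tools, contribute code back to the community, and get involved in building the Flink ecosystem early.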

Mentor: 何健超 @ 迟南; GitHub ID: hejianchao; Technical Expert at Alibaba

4. State storage is on the critical path of Flink, a stateful computing engine. It is essentially a key-value store, but one with computing-related requirements, which makes it an interdisciplinary area. Gemini is a key-value store we designed for this scenario. Gemini stores data in elastic pages whose sizes range from a few bytes to tens of KB.
In this topic you need to implement a cache allocator for pages. It should support off-heap memory to reduce GC pressure, sustain high throughput, and always replace cold data with hot data to increase the cache hit ratio and memory utilization.
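To make the requirements above concrete, here is a minimal sketch of an off-heap page cache with least-recently-used eviction. It is not Gemini's allocator: the class name `OffHeapPageCache`, its methods, and the byte budget are all hypothetical, and LRU is just one simple way to approximate "replace cold data with hot data". A real allocator would likely carve pages out of pre-allocated arenas and avoid a single global lock.

```java
import java.nio.ByteBuffer;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: LRU cache of variable-size pages stored in direct (off-heap) buffers. */
final class OffHeapPageCache {
    private final long capacityBytes;
    private long usedBytes = 0;
    // accessOrder = true: iteration starts at the least recently used entry.
    private final LinkedHashMap<Long, ByteBuffer> pages =
            new LinkedHashMap<>(16, 0.75f, true);

    OffHeapPageCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Copies a page off-heap, evicting the coldest pages until it fits. */
    synchronized boolean put(long pageId, byte[] data) {
        if (data.length > capacityBytes) {
            return false; // the page can never fit in this cache
        }
        ByteBuffer old = pages.remove(pageId);
        if (old != null) {
            usedBytes -= old.capacity();
        }
        Iterator<Map.Entry<Long, ByteBuffer>> coldest = pages.entrySet().iterator();
        while (usedBytes + data.length > capacityBytes && coldest.hasNext()) {
            usedBytes -= coldest.next().getValue().capacity();
            coldest.remove();
        }
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data);
        buf.flip();
        pages.put(pageId, buf);
        usedBytes += data.length;
        return true;
    }

    /** Returns a copy of the page, or null on a miss. A hit marks the page as hot. */
    synchronized byte[] get(long pageId) {
        ByteBuffer buf = pages.get(pageId);
        if (buf == null) {
            return null;
        }
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() leaves the cached buffer's position untouched
        return out;
    }

    synchronized long usedBytes() {
        return usedBytes;
    }
}
```

Because the page bytes live in direct buffers, the garbage collector only tracks the small `ByteBuffer` handles rather than the page contents. The sketch allocates one direct buffer per page for clarity, which is slow in practice and is exactly the cost a dedicated allocator is meant to remove.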

Mentor: 李钰 @ 绝顶; GitHub ID: [email protected]
Apache HBase PMC member and committer, Flink/HDFS contributor; Senior Technical Expert at Alibaba

5. State storage is on the critical path of Flink, a stateful computing engine. It is essentially a key-value store, but one with computing-related requirements, which makes it an interdisciplinary area. Gemini is a key-value store we designed for this scenario. It is a two-component LSM-tree structure in which the C0 tree is the write buffer and the C1 tree can be an enhanced B+-tree or a hash table; a hash table offers faster random lookups than a sort-based index. In this topic you need to implement a CSBw-tree, a combination of the CSB+-tree [1] and the Bw-tree [2], which aims at both good CPU cache utilization (cache-consciousness) and fast random access.

[1] Making B+-Trees Cache Conscious in Main Memory, SIGMOD 2000
[2] The Bw-Tree: A B-tree for New Hardware Platforms, ICDE 2013
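The two-component layout described in this topic can be sketched as follows. This only illustrates the C0/C1 read and write paths, not a CSBw-tree and not Gemini's code: the class `TwoComponentStore`, the size-based flush threshold, and the choice of a sorted `TreeMap` for C0 and a `HashMap` for C1 are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a two-component LSM-style store: C0 write buffer plus C1 index. */
final class TwoComponentStore {
    private final int c0Limit;
    // C0: in-memory write buffer (assumed sorted here; the source does not specify its layout).
    private final TreeMap<String, String> c0 = new TreeMap<>();
    // C1: hash table variant, chosen for fast random lookups.
    private final Map<String, String> c1 = new HashMap<>();

    TwoComponentStore(int c0Limit) {
        this.c0Limit = c0Limit;
    }

    /** Writes always go to C0; a full C0 is merged into C1. */
    void put(String key, String value) {
        c0.put(key, value);
        if (c0.size() >= c0Limit) {
            flush();
        }
    }

    /** Reads check C0 first because it holds the newest version of a key. */
    String get(String key) {
        String fresh = c0.get(key);
        return fresh != null ? fresh : c1.get(key);
    }

    /** Merges C0 into C1; newer C0 values overwrite older C1 values. */
    void flush() {
        c1.putAll(c0);
        c0.clear();
    }

    int c0Size() {
        return c0.size();
    }

    int c1Size() {
        return c1.size();
    }
}
```

The topic itself is about replacing the C1 index with a structure that is both cache-conscious and fast for random access, so in this sketch only the type of `c1` would change while the C0-then-C1 lookup order stays the same.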

Mentor: 李钰 @ 绝顶; GitHub ID: [email protected]
Apache HBase PMC member and committer, Flink/HDFS contributor; Senior Technical Expert at Alibaba

6. Batch benchmarks have matured and are widely used to analyze the performance of batch processing technologies. However, there is no suitable benchmark for testing streaming frameworks, which have more performance dimensions and usage scenarios. We therefore need to develop a streaming benchmark to comprehensively test Flink and other stream processing frameworks, and to optimize Flink according to the benchmark results.
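As an illustration of two dimensions such a benchmark would typically report, the sketch below computes throughput and a latency percentile from recorded measurements. The class `StreamMetrics` and its methods are hypothetical and independent of Flink. The percentile uses the nearest-rank definition, which is one of several common conventions.

```java
import java.util.Arrays;

/** Hypothetical helper for summarizing streaming benchmark measurements. */
final class StreamMetrics {
    private StreamMetrics() {}

    /** Nearest-rank percentile of per-event latencies, with p in (0, 100]. */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = latenciesMs.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Events processed per second over a measurement window. */
    static double throughputPerSecond(long eventCount, long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        return eventCount * 1000.0 / windowMs;
    }
}
```

Tail percentiles such as p99 are usually more informative than the mean for streaming workloads, because a few slow events can hide behind a good average.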

Mentor: 胥平勇 @ 姬平; GitHub ID: XuPingyong; Apache Flink contributor; Technical Expert at Alibaba