Flink: Introduction and a Simple Hands-On
Flink is a distributed real-time computation framework, designed for stateful computations over unbounded and bounded data streams. Its wide adoption by large companies such as Alibaba has made it very popular in China.
Why is the Flink framework so popular? It owes this to its powerful feature set,
which lets Flink extract the maximum value from data.
Study tip: practice first, then dig deeper.
The main goal of these notes is to introduce Flink's use cases, its underlying principles, and some related details.
Industries Where Flink Is Used
E-commerce and marketing
Internet of Things (IoT)
Telecommunications
Banking and finance
The Concept of Streams in Flink
In Flink, all data is modeled as streams. DataStream is the abstraction over unbounded streams, while DataSet is the abstraction over bounded streams.
Bounded streams
A bounded stream has a defined start and a defined end.
Unbounded streams
An unbounded stream has a start but no defined end.
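A loose analogy in plain Scala (this is not the Flink API, just an illustration of the concept): a bounded stream behaves like a finite collection that can be fully aggregated, while an unbounded stream behaves like an endless iterator that can only be processed incrementally.

```scala
// Bounded: has a start and an end, so a full aggregation is possible.
val bounded = List(1, 2, 3)
val total = bounded.sum                  // the whole input can be summed: 6

// Unbounded: has a start but no end; we can only process it piece by piece,
// e.g. by taking a finite "window" of elements.
val unbounded = Iterator.from(1)
val firstFive = unbounded.take(5).toList // List(1, 2, 3, 4, 5)
```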
The Concept of Time in Flink
eventTime: the time at which an event actually occurred at its source.
processingTime: the time at which an operator processes the event.
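The difference matters when events arrive out of order. A plain-Scala sketch (hypothetical data, not the Flink API): each record carries the time it happened (event time), while its position in the list stands for the order in which it reaches the operator (processing time).

```scala
// A record stamped with the time it actually occurred.
case class Event(eventTime: Long, word: String)

// Arrives out of order: the event stamped 1 shows up last.
val arrivals = List(Event(2, "b"), Event(3, "c"), Event(1, "a"))

// Processing-time order is simply arrival order:
val byProcessingTime = arrivals.map(_.word)                 // b, c, a

// Event-time order re-sorts by when the events actually happened:
val byEventTime = arrivals.sortBy(_.eventTime).map(_.word)  // a, b, c
```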
A Simple Flink WordCount
1. Maven dependencies
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>1.7.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.11</artifactId>
        <version>1.7.2</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_2.11</artifactId>
        <version>1.7.2</version>
    </dependency>
</dependencies>
2. Writing the WordCount program
package com.shufang.flink.examples

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

import scala.collection.mutable.ArrayBuffer

/**
 * This example only tests word counting in local mode.
 */
object FlinkWordCount {

  def main(args: Array[String]): Unit = {

    /**
     * WordCount on a bounded stream (batch API)
     */
    // Obtain the bounded-stream (batch) execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    // Mock a batch of data
    val words: ArrayBuffer[String] = ArrayBuffer("a", "a", "a", "b", "b", "c", "c", "c", "a")

    // Note: an implicit conversion is required here, so we must import org.apache.flink.api.scala._
    val ds: DataSet[String] = env.fromCollection(words)

    // WordCount the bounded-stream way
    val counts: AggregateDataSet[(String, Int)] = ds.map((_, 1)).groupBy(0).sum(1)
    counts.print()

    /**
     * WordCount on an unbounded stream (streaming API)
     */
    val env1: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Parallelism can be set explicitly
    // env1.setParallelism(5)

    // Listen on a socket port for incoming data
    val dstream: DataStream[String] = env1.socketTextStream("localhost", 9999, '\n')

    val result: DataStream[(String, Int)] = dstream
      .flatMap(_.split(" ")).setParallelism(2)
      .filter(_.nonEmpty).setParallelism(2)
      .map((_, 1)).setParallelism(2)
      .keyBy(0)
      .reduce((a, b) => (a._1, a._2 + b._2))

    result.print()
    env1.execute("stream_wordcount")
  }
}
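To see what the batch pipeline computes without starting a Flink runtime, the core counting step can be mimicked with plain Scala collections. This is a sketch of the same map/group/sum logic as `ds.map((_, 1)).groupBy(0).sum(1)`, not Flink code:

```scala
// Same input as the ArrayBuffer in the example above.
val words = Seq("a", "a", "a", "b", "b", "c", "c", "c", "a")

val counts: Map[String, Int] =
  words
    .map((_, 1))             // pair each word with a count of 1 (field 1)
    .groupBy(_._1)           // group by the word (field 0)
    .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum field 1 per key
// counts == Map("a" -> 4, "b" -> 2, "c" -> 3)
```

To feed the streaming half of the program, start a socket source first (e.g. `nc -lk 9999` in a terminal) and type space-separated words; each keyed reduce emits a running count per word.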