Flink is an Apache open-source stream-processing framework, and Spark is an Apache open-source fast, general-purpose engine for large-scale data processing. Both are distributed frameworks and both support multiple languages, including Java and Scala. The examples below illustrate how each is used.
WordCount Based on Flink
1. Add the dependency to the pom.xml file
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>1.8.0</version>
</dependency>
2. Write the WordCount code in IntelliJ IDEA
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // Set up the streaming execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Read text from a local socket, splitting records on '\n'
    val lines: DataStream[String] = env.socketTextStream("localhost", 9999, '\n')
    val wordCount = lines
      .flatMap(_.split(" "))                        // split each line into words
      .map((_, 1))                                  // pair each word with an initial count of 1
      .keyBy(_._1)                                  // group by the word itself
      .timeWindow(Time.seconds(5), Time.seconds(1)) // 5-second window sliding every 1 second
      .reduce((x, y) => (x._1, x._2 + y._2))        // sum the counts per word within each window
    wordCount.print().setParallelism(1)
    env.execute("SocketWordCount")
  }
}
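To see what the chained operators do to the data, here is a dependency-free sketch in plain Scala collections (no Flink involved; the object and method names are illustrative only) of the transformation that each window effectively applies to its buffered lines:

```scala
object WindowReduceSketch {
  // Mirrors the pipeline's flatMap -> map -> keyBy -> reduce on one
  // window's worth of input lines, using ordinary Scala collections.
  def countWindow(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))  // split each line into words
      .map(w => (w, 1))       // pair each word with an initial count of 1
      .groupBy(_._1)          // the keyBy equivalent: group pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // the reduce equivalent

  def main(args: Array[String]): Unit = {
    // Feeding it the sample line from the netcat session below
    println(countWindow(Seq("hello world hello")))
  }
}
```

Running `countWindow(Seq("hello world hello"))` yields a map with `hello -> 2` and `world -> 1`, matching the job output shown below.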
3. Start a local server with netcat
$ nc -l 9999
hello world hello
4. Run the SocketWordCount code; the output is as follows:
(hello,2)
(world,1)
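Because the window is 5 seconds wide but slides every second, each record is counted in several overlapping windows, so the same counts print repeatedly for a few seconds. A small, Flink-free sketch of the assignment rule (the names here are illustrative, not Flink API) shows which windows contain a given event:

```scala
object SlidingWindowSketch {
  // For a sliding window of `size` seconds sliding every `slide` seconds,
  // return the [start, end) intervals that contain an event at time `t`.
  // This mirrors the assignment done by timeWindow(Time.seconds(5), Time.seconds(1))
  // with a zero offset: size / slide windows per event.
  def windowsFor(t: Long, size: Long, slide: Long): Seq[(Long, Long)] = {
    val lastStart = t - (t % slide) // latest window start at or before t
    (0L until size by slide).map(k => (lastStart - k, lastStart - k + size)).reverse
  }

  def main(args: Array[String]): Unit = {
    // An event at t = 7 with size 5 and slide 1 falls into 5 overlapping windows
    println(windowsFor(7, 5, 1))
  }
}
```

An event at second 7 lands in the windows [3,8), [4,9), [5,10), [6,11), and [7,12), so its word contributes to five consecutive 1-second emissions.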
WordCount Based on Spark Streaming
1. Add the dependency to the pom.xml file
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.2</version>
<scope>provided</scope>
</dependency>
2. Write the WordCount code in IntelliJ IDEA (since the dependency uses the provided scope, either remove that scope for local runs or enable "Include dependencies with 'Provided' scope" in the run configuration)
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingSocketWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads are needed, one of which is taken by the socket receiver
    val conf = new SparkConf().setAppName("SparkStreamingSocketWordCount").setMaster("local[2]")
    // Collect the input into micro-batches every 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCount = lines
      .flatMap(_.split(" ")) // split each line into words
      .map((_, 1))           // pair each word with an initial count of 1
      .reduceByKey(_ + _)    // sum the counts per word within the batch
    wordCount.print()
    ssc.start()            // start the computation
    ssc.awaitTermination() // block until the job is stopped
  }
}
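Note that reduceByKey counts only within one 5-second micro-batch: the counts reset with every batch, unlike the Flink job's sliding windows. Carrying running totals across batches would require a stateful operator such as updateStateByKey. A dependency-free sketch in plain Scala (not Spark API; the names are illustrative) of that merge logic:

```scala
object StatefulCountSketch {
  // One micro-batch: word counts for the lines received in that 5-second interval
  def batchCounts(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Merge one batch's counts into the running state, as a per-key
  // updateStateByKey function would
  def updateState(state: Map[String, Int], batch: Map[String, Int]): Map[String, Int] =
    batch.foldLeft(state) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n) }

  def main(args: Array[String]): Unit = {
    // Two consecutive micro-batches of input
    val batches = Seq(Seq("hello world hello"), Seq("hello flink"))
    println(batches.map(batchCounts).foldLeft(Map.empty[String, Int])(updateState))
  }
}
```

After both batches the running totals are `hello -> 3`, `world -> 1`, `flink -> 1`, whereas the plain reduceByKey job above would report each batch's counts separately.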
3. Start a local server with netcat
$ nc -l 9999
hello world hello
4. Run the SparkStreamingSocketWordCount code; the output is as follows:
-------------------------------------------
Time: 1560395435000 ms
-------------------------------------------
(hello,2)
(world,1)
Summary: developing with either framework reveals many similarities between the two. The difference is in latency: if low latency matters most, Flink's per-record streaming model is the better choice; if somewhat higher latency is acceptable, Spark Streaming's micro-batch model works well, and its richer API makes it more convenient for development.