The Spark Streaming + Flume integration is an older approach. Newer versions of Spark no longer recommend connecting directly to Flume, but sometimes we still need this setup to process data.
Spark can ingest Flume's data in two ways: Poll (pull) and Push. Of the two, the Poll mode is preferred.
In pull mode, Spark provides a sink, and Spark Streaming actively fetches data from the channel, taking data on its own terms, which gives good stability.
In push mode, Flume acts as a buffer that stores data and monitors Spark; if Spark is reachable, it pushes the data over (simple, loosely coupled). The drawback is that if the Spark Streaming program is not running, errors are reported on the Flume side, and Spark Streaming may fail to consume the data in time and lose it.
First, let's look at the Poll way of pulling data.
You need to import the following dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_2.11</artifactId>
<version>2.0.2</version>
</dependency>
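Besides the Flume connector, the project also needs Spark Streaming itself on the classpath (assuming it is not already declared elsewhere in your pom); a matching dependency with the same Spark version and Scala 2.11 would be:

```xml
<!-- Spark Streaming core, matching the connector's Scala/Spark versions -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.2</version>
</dependency>
```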
When using the Poll method, note that the Flume version must be 1.6 or above (I use 1.8), and make sure the following two jars are present in Flume's lib directory. My versions are as follows:
scala-library-2.11.8.jar
spark-streaming-flume-sink_2.11-2.0.2.jar
After putting scala-library-2.11.8.jar into lib, the scala-library-2.10.5.jar that ships with Flume needs to be renamed to scala-library-2.10.5.jar.BAK so it is no longer loaded.
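The lib-directory shuffle can be sketched as follows. This is a demo using placeholder files and an assumed FLUME_HOME; with a real install, point FLUME_HOME at your Flume 1.8 directory and copy the real jars instead of touching empty ones.

```shell
# Demo of the lib-directory setup with placeholder files; FLUME_HOME and
# the jar locations are assumptions, not paths from this article.
FLUME_HOME=./flume-demo
mkdir -p "$FLUME_HOME/lib"
touch "$FLUME_HOME/lib/scala-library-2.10.5.jar"   # stand-in for the jar Flume ships
touch scala-library-2.11.8.jar spark-streaming-flume-sink_2.11-2.0.2.jar
# Copy in the two jars the Spark sink needs
cp scala-library-2.11.8.jar spark-streaming-flume-sink_2.11-2.0.2.jar "$FLUME_HOME/lib/"
# Park the old Scala 2.10 jar so it is no longer on the classpath
mv "$FLUME_HOME/lib/scala-library-2.10.5.jar" "$FLUME_HOME/lib/scala-library-2.10.5.jar.BAK"
ls "$FLUME_HOME/lib"
```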
Next, write the Flume agent. The source here uses the nc tool; you can also use an agent of your own:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# What needs changing are the source and the destination; this is the source, using the nc tool
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink. This is the destination: configure an address for Spark to connect to
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname=192.168.182.146
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize= 2000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start Flume:
bin/flume-ng agent --conf conf --conf-file conf/socket-source.properties --name a1 -Dflume.root.logger=INFO,console
Once Flume is started, you can generate data through the nc port configured above (for example, run nc localhost 44444 and type some lines). In Poll mode no error is reported even if Spark is not yet running, and the data is buffered in the channel.
Now let's write and start the Spark code:
package com.stream
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamFromFlume {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("StreamFromFlume").setMaster("local[2]")
val scc = new StreamingContext(conf,Seconds(10))
// Set a checkpoint directory (optional here)
scc.checkpoint("D:\\checkpoint")
// Pull data from Flume; the ip and port here are the ones configured in your agent's sink
val flumeStream = FlumeUtils.createPollingStream(scc,"192.168.182.146",8888)
val lineStream= flumeStream.map(x=>new String(x.event.getBody.array()))
// Word count
val result = lineStream.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
result.print()
scc.start()
scc.awaitTermination()
}
}
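The flatMap/map/reduceByKey pipeline above is ordinary Scala, so you can sanity-check the word-count logic on a plain local collection with no Spark running (a sketch only; groupBy stands in for reduceByKey here):

```scala
// The same per-batch transformations, applied to a local List of lines
// so the logic can be checked without a StreamingContext.
val lines = List("hello spark", "hello flume")
val counts = lines
  .flatMap(_.split(" "))                         // split lines into words
  .map((_, 1))                                   // pair each word with 1
  .groupBy(_._1)                                 // local stand-in for reduceByKey
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
println(counts)                                  // word -> total occurrences
```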
Next, the Push method of pushing data.
Push is not really recommended, because here Flume takes the initiative, and it is unlikely to be used at work; after all, data loss is a painful moment for any data project.
With Push you must start Spark first, and then Flume actively pushes data using Flume's own avro serialization. This method can actually push data to most frameworks, not just Spark.
The receiving code differs from Poll only in the API used:
package com.stream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamFromFlume {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("StreamFromFlume").setMaster("local[2]")
val scc = new StreamingContext(conf,Seconds(10))
// Set a checkpoint directory to make consumption safer
scc.checkpoint("D:\\checkpoint")
// The ip and port are still the ones from the Flume sink configuration, i.e. where the Spark receiver listens
val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(scc,"192.168.182.146",8888,StorageLevel.MEMORY_AND_DISK)
val lineStream= flumeStream.map(x=>new String(x.event.getBody.array()))
// Word count
val result = lineStream.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
result.print()
scc.start()
scc.awaitTermination()
}
}
Once started, the Push-style Spark job simply waits for Flume to push data. Next, we prepare the Flume side:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# What needs changing are the source and the destination; this is the source, using the nc tool
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink. This is the destination: an avro sink pointing at the Spark receiver
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=192.168.182.146
a1.sinks.k1.port = 8888
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
After starting Flume (with the same flume-ng command as before, pointing --conf-file at this agent file), check Spark: as you type data into the nc tool, the corresponding output appears in Spark.
Finally, a supplement. If, while running either mode, the following error occurs:
org.apache.avro.AvroRuntimeException: Unknown datum type: java.lang.Exception:
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.streaming.flume.sink.EventBatch
This error means that Flume's avro serialization version is not compatible with Spark's. You need to import avro separately; add the following to your pom file:
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-ipc</artifactId>
<version>1.8.2</version>
</dependency>
Then copy these two jars (avro-1.8.2.jar and avro-ipc-1.8.2.jar, from your local Maven repository) into Flume's lib directory, and delete or rename the older avro jars already there.
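The jar swap on the Flume side can be sketched the same way. Again this is a placeholder demo: the old jar names and the idea of copying from the local Maven repository are assumptions, so check your own lib directory for the actual avro jars to remove.

```shell
# Placeholder demo of swapping Flume's avro jars; all paths and old
# version numbers are assumptions, not taken from this article.
FLUME_HOME=./flume-avro-demo
mkdir -p "$FLUME_HOME/lib"
touch "$FLUME_HOME/lib/avro-1.7.4.jar" "$FLUME_HOME/lib/avro-ipc-1.7.4.jar"  # stand-ins for the old jars
touch avro-1.8.2.jar avro-ipc-1.8.2.jar  # in reality, copy these from your local Maven repository
cp avro-1.8.2.jar avro-ipc-1.8.2.jar "$FLUME_HOME/lib/"
# Remove the old, incompatible avro jars
rm "$FLUME_HOME/lib/avro-1.7.4.jar" "$FLUME_HOME/lib/avro-ipc-1.7.4.jar"
ls "$FLUME_HOME/lib"
```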