SparkStreaming+Flume stream computing

The Spark Streaming + Flume integration is an older approach. Newer versions of Spark no longer recommend connecting to Flume directly, but sometimes we still need this setup to process data.

Spark can integrate Flume's data in two ways: Poll (pull) and Push (push). Of the two, Poll mode is preferred.

In Poll mode, Spark provides a special sink for Flume, and Spark Streaming takes the initiative to pull data from it, fetching data at its own pace, which gives good stability.

In Push mode, Flume acts as a buffer that stores the data and watches Spark; if Spark is reachable, it pushes the data over (simple, loosely coupled). The disadvantage is that if the Spark Streaming program is not running, Flume reports errors, and Spark Streaming may lose data it never gets the chance to consume.

Let's first introduce the Poll way of pulling data.

You need to add the following dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.0.2</version>
</dependency>

For Poll mode, note that the Flume version must be 1.6 or above (I use 1.8), and two jars must be present in Flume's lib directory. The versions I use are as follows:

scala-library-2.11.8.jar 
spark-streaming-flume-sink_2.11-2.0.2.jar

After scala-library-2.11.8.jar is placed in lib, the scala-library-2.10.5.jar that originally ships with Flume needs to be renamed to scala-library-2.10.5.jar.BAK so the two versions do not conflict.
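
As a rough sketch of that setup (assuming FLUME_HOME points at your Flume 1.8 installation and that the two jars have already been downloaded somewhere locally; adjust the paths to your environment):

cd $FLUME_HOME/lib
# back up the scala-library jar that ships with Flume so the versions do not conflict
mv scala-library-2.10.5.jar scala-library-2.10.5.jar.BAK
# copy in the jars matching the Scala/Spark versions used in this article
cp /path/to/scala-library-2.11.8.jar .
cp /path/to/spark-streaming-flume-sink_2.11-2.0.2.jar .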

The next step is to write the Flume agent configuration. The source here uses the nc tool; you can also use your own agent.

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Only the source and the destination need changing; this is the source, using the nc tool
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink: only the source and the destination need changing; this is the destination, and it configures an address for Spark to connect to
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname=192.168.182.146
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize= 2000 


# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start Flume:

bin/flume-ng agent --conf conf --conf-file conf/socket-source.properties --name a1 -Dflume.root.logger=INFO,console

After Flume is started, data can be generated through the nc port configured above. In Poll mode no error is reported even if Spark is not yet running, and the data is temporarily held in the channel.
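
For example, you can open a connection to the netcat source configured above (port 44444) and type a few words; each line becomes an event in the channel. The words typed here are only an illustration:

nc localhost 44444
hello spark hello flume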

Now let's write and start the Spark code.

package com.stream

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamFromFlume {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("StreamFromFlume").setMaster("local[2]")
    val scc = new StreamingContext(conf, Seconds(10))
    // Set a checkpoint directory; this can be skipped
    scc.checkpoint("D:\\checkpoint")
    // Pull data from Flume; the ip and port here are the ones configured for the sink in your agent
    val flumeStream = FlumeUtils.createPollingStream(scc, "192.168.182.146", 8888)

    val lineStream = flumeStream.map(x => new String(x.event.getBody.array()))

    // Word count
    val result = lineStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()
    scc.start()
    scc.awaitTermination()
  }

}

Next, let's introduce the Push method of pushing data.

Because Push mode relies on Flume taking the initiative, it is not really recommended and is unlikely to be used at work. After all, data loss is a painful moment for any data project.

With Push you must start the Spark program first; Flume then takes the initiative and pushes data using its own Avro serialization. This method can actually push data to most frameworks, not just Spark.

The receiving code differs from the Poll version only in the API that is called:

package com.stream

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamFromFlume {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("StreamFromFlume").setMaster("local[2]")
    val scc = new StreamingContext(conf, Seconds(10))
    // Set a checkpoint directory to make data consumption safer
    scc.checkpoint("D:\\checkpoint")
    // The ip and port are still the ones from the Flume configuration (the avro sink); the Spark receiver listens on this address
    val flumeStream: ReceiverInputDStream[SparkFlumeEvent] =
      FlumeUtils.createStream(scc, "192.168.182.146", 8888, StorageLevel.MEMORY_AND_DISK)

    val lineStream = flumeStream.map(x => new String(x.event.getBody.array()))

    // Word count
    val result = lineStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()
    scc.start()
    scc.awaitTermination()
  }

}

After the Spark program for Push mode is started, it just waits for Flume to push data. Next, we prepare the Flume side:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Only the source and the destination need changing; this is the source, using the nc tool
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink: only the source and the destination need changing; this is the destination
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=192.168.182.146
a1.sinks.k1.port = 8888


# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

After starting Flume, we check Spark and find that as we type data into the nc tool, the corresponding output appears in Spark.
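
For illustration only (the exact timestamp and counts depend on what you type and when), typing "hello spark hello" into nc within one 10-second batch produces output roughly like:

-------------------------------------------
Time: 1616985600000 ms
-------------------------------------------
(hello,2)
(spark,1)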


Finally, a supplement. If the following error occurs while running either of the two modes:
org.apache.avro.AvroRuntimeException: Unknown datum type: java.lang.Exception: 
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.streaming.flume.sink.EventBatch 

This error occurs because the Avro version shipped with Flume is not compatible with Spark, and you need to supply it separately. Add the following dependencies to your pom file:

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.8.2</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-ipc</artifactId>
    <version>1.8.2</version>
</dependency>

Then copy these two jars into Flume's lib directory and delete (or move aside) the original Avro jars that are there.
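
A rough sketch of that step, assuming the jars were downloaded by Maven into the default local repository (~/.m2) and that FLUME_HOME points at your Flume installation; adjust the paths to your environment:

cd $FLUME_HOME/lib
# move the Avro jars that ship with Flume out of lib
mkdir -p ../lib_backup
mv avro*.jar ../lib_backup/
# copy in the Avro 1.8.2 jars that Maven downloaded
cp ~/.m2/repository/org/apache/avro/avro/1.8.2/avro-1.8.2.jar .
cp ~/.m2/repository/org/apache/avro/avro-ipc/1.8.2/avro-ipc-1.8.2.jar .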
