Two ways to combine Flume and Spark Streaming -- the pull approach

Hello everyone:

This post shows how to connect Flume to Spark Streaming using the pull approach.

A brief introduction: in pull mode, Spark Streaming pulls the data from Flume.

---- Flume configuration file: flume-poll.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.17.108
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

 

Step 1: Start the Flume agent on the virtual machine first

bin/flume-ng agent -n a1 -c conf/ -f conf/flume-poll.conf -Dflume.root.logger=WARN,console

Prerequisite: copy the downloaded spark-assembly-1.6.1-hadoop2.6.0.jar and spark-streaming-flume-sink_2.10-1.6.1.jar into Flume's lib directory so that Flume can load the org.apache.spark.streaming.flume.sink.SparkSink class.
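
On the Spark side, the project that contains the streaming code below also needs the Flume integration on its classpath. A minimal sbt sketch, assuming Scala 2.10 and Spark 1.6.1 to match the jars above (adjust to your own build tool and versions):

// build.sbt -- illustrative dependencies for the Spark driver project
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.1",
  "org.apache.spark" %% "spark-streaming"       % "1.6.1",
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"
)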

--- The Spark Streaming code that pulls data from Flume is as follows:

package SparkStream

import java.net.InetSocketAddress

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/10/10.
  * Purpose: demonstrates combining Flume and Spark Streaming using the pull approach
  *
  */
object FlumePullDemon {
  def main(args: Array[String]): Unit = {
    //set the logging level (LoggerLevels is a small helper; a sketch is shown after this listing)
    LoggerLevels.setStreamingLogLevels()
    val conf=new SparkConf().setAppName("FlumePullDemon").setMaster("local[2]")
    val sc=new SparkContext(conf)
    val ssc=new StreamingContext(sc,Seconds(5))
    //pull data from flume; 192.168.17.108 is the flume address, and multiple addresses may be listed
//    val address=Seq(new InetSocketAddress("192.168.17.108",1111),new InetSocketAddress("192.168.17.109",1111))
    val address=Seq(new InetSocketAddress("192.168.17.108",8888)) // single-address form
    val flumeStream=FlumeUtils.createPollingStream(ssc,address,StorageLevel.MEMORY_ONLY_SER)

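    // decode each Flume event body into a string, split it on spaces and pair every word with 1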
    val words=flumeStream.flatMap(x=>new String(x.event.getBody.array()).split(" ").map((_,1)))
    val result=words.reduceByKey(_+_)
    result.print()


    //start the streaming context
    ssc.start()
    // wait for termination
    ssc.awaitTermination()
  }

}
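
The listing calls LoggerLevels.setStreamingLogLevels(), a helper object that is not shown above. A minimal sketch of such a helper, assuming log4j 1.x (the exact implementation in the original project may differ):

package SparkStream

import org.apache.log4j.{Level, Logger}

// Minimal sketch: reduce Spark's verbose INFO logging so the word counts are easy to read
object LoggerLevels {
  def setStreamingLogLevels(): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getRootLogger.setLevel(Level.WARN)
  }
}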

 

Step 2: Run the Spark Streaming program locally in IntelliJ IDEA.

Once started, Spark Streaming runs normally but prints empty batch results, because there is no data yet in the directory monitored by Flume.

Switch to the directory monitored by Flume and manually create some test data:

cd /root/flume

echo "bejing huan ying ni88" >> test.log

 

Observe the Spark Streaming output on the local machine (roughly as sketched below).

The word counts are displayed correctly, so the verification is complete.
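
For reference, with the single test line written above, result.print() for that batch should look roughly like this (timestamp and ordering are illustrative):

-------------------------------------------
Time: 1507600000000 ms
-------------------------------------------
(bejing,1)
(huan,1)
(ying,1)
(ni88,1)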

Notes:

1. The IP address in both the Spark Streaming code and the Flume configuration file refers to the address of the machine running Flume.

2. The pull method can read from multiple Flume agents; just list the additional addresses in the Seq (see the sketch after these notes).

3. After a batch file has been processed, Flume appends COMPLETED to the file name, e.g. "test.log" becomes "test.log.COMPLETED"; this is the same behaviour as with the push method.

4. During testing, Flume may report an error about a file name being reused. This is the same error encountered with the push method of connecting Flume to Spark Streaming, so it is not repeated here.
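
A sketch of the multiple-address form mentioned in note 2, assuming a second Flume agent with the same SparkSink configuration runs on 192.168.17.109 (the second host is illustrative):

// poll two Flume agents at once; each must run a SparkSink on the given port
val addresses = Seq(
  new InetSocketAddress("192.168.17.108", 8888),
  new InetSocketAddress("192.168.17.109", 8888)
)
val flumeStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_ONLY_SER)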

 

 


Origin blog.csdn.net/zhaoxiangchong/article/details/78380190