Two ways to combine Flume and Spark Streaming -- push

Hello everyone!

A brief introduction: in push mode, Flume pushes data to Spark Streaming.

The Spark Streaming code is as follows:

package SparkStream

import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/10/10.
  * Purpose: demonstrate combining Flume and Spark Streaming in push mode
  *
  */
object FlumePushDemon {
  def main(args: Array[String]): Unit = {
    // Set the log level
    LoggerLevels.setStreamingLogLevels()
    val conf=new SparkConf().setAppName("FlumePushDemon").setMaster("local[2]")
    val sc=new SparkContext(conf)
    val ssc=new StreamingContext(sc,Seconds(5))
    // Push mode: Flume sends data to Spark Streaming (using the receiver that the Flume integration provides)
    // 192.168.17.10 is the address of the local virtual machine network; it runs Spark Streaming and receives Flume's push
    val flumeStream=FlumeUtils.createStream(ssc,"192.168.17.10",8888)
    // The real payload of a Flume event is obtained via event.getBody (only one IP address can be bound)
    val words=flumeStream.flatMap(x=>new String(x.event.getBody.array()).split(" ").map((_,1)))
    val result=words.reduceByKey(_+_)
    result.print()


    // Start the streaming context
    ssc.start()
    // Wait for termination
    ssc.awaitTermination()
  }

}
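The LoggerLevels helper used above is not part of Spark; it is a small utility (in the spirit of StreamingExamples.setStreamingLogLevels from Spark's examples) that raises the log level so INFO messages do not drown out the streaming output. A minimal sketch, assuming log4j 1.x is on the classpath:

package SparkStream

import org.apache.log4j.{Level, Logger}

// Minimal sketch of the helper assumed by FlumePushDemon:
// raise the root log level to WARN so Spark's INFO output stays quiet.
object LoggerLevels {
  def setStreamingLogLevels(): Unit = {
    Logger.getRootLogger.setLevel(Level.WARN)
  }
}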

 

Step 1: Run the Spark Streaming program locally

First, run the Spark Streaming program in the local IDEA. Once it is running normally, the console output shows that Spark Streaming has started and is waiting for input from Flume.
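To compile and run this program in IDEA, the project needs the spark-streaming-flume integration on the classpath. A minimal build.sbt sketch; the Spark and Scala versions below are assumptions and should be matched to your own environment:

// build.sbt (sketch; versions are assumptions)
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "2.1.0",
  "org.apache.spark" %% "spark-streaming"       % "2.1.0",
  "org.apache.spark" %% "spark-streaming-flume" % "2.1.0"
)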

The flume-push.conf configuration file is as follows:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1


# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/flume
a1.sources.r1.fileHeader = true


# Describe the sink
a1.sinks.k1.type = avro
# This is the receiver (the host where Spark Streaming runs)
a1.sinks.k1.hostname = 192.168.17.10
a1.sinks.k1.port = 8888


# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100


# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

 

Step 2: Start Flume on the virtual machine. Here I start it from Flume's installation directory; adjust the paths to your own setup as needed.

bin/flume-ng agent -n a1 -c conf/ -f conf/flume-push.conf -Dflume.root.logger=WARN,console

Switch to the directory monitored by Flume and manually create some data:

cd /root/flume

echo "bejing huan ying ni1" >> test.log

 

Observe the running output of Spark Streaming on the local machine. The data just inserted manually shows up in the word counts and is displayed correctly, which completes the verification.
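With the test line above, the output of result.print() should look roughly like the following (the batch time and the ordering of the pairs will differ):

-------------------------------------------
Time: ... ms
-------------------------------------------
(bejing,1)
(huan,1)
(ying,1)
(ni1,1)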

Notes:

1. The IP address in both the Spark Streaming code and the Flume sink refers to the host that runs Spark Streaming and receives Flume's push. Because my Spark Streaming program runs locally, the address of the local virtual network adapter (VMnet1) is configured.

2. In push mode the Flume sink can only send to a single Spark Streaming address, which I think becomes a drawback when the data volume is large.

3. After the current batch file has been processed, Flume appends COMPLETED to its name, for example "test.log" becomes "test.log.COMPLETED". I think this marker is used to prevent the same data from being ingested twice.

4. During testing I found that if two files with the same name are generated at different times, Flume reports an error. For example, first generate test.log manually; after Flume processes it, the file is renamed to "test.log.COMPLETED". If you then manually generate another test.log, Flume reports a file-name reuse error. My understanding is that Flume uses the file name to decide whether a file has already been ingested (a simple workaround is sketched after this list).
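One simple workaround (my own habit, not something Flume requires) is to give every file dropped into the spooling directory a unique name, for example by appending a timestamp:

echo "bejing huan ying ni1" >> test_$(date +%s).log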

 

 

Source: blog.csdn.net/zhaoxiangchong/article/details/78380235