Day 24 of Getting Started with Big Data - Spark Streaming (2): integration with Flume and Kafka

In the previous article the data source was a socket, which is a bit of an unorthodox approach; in serious use, data is obtained from message queues such as Kafka, or from Flume!

The main supported sources, as listed on the official website, are as follows:

  Data can be acquired in two forms: push and pull.

1. Spark Streaming integration with Flume

  1. Push mode

    The pull mode described later is generally the more recommended approach.

    Import dependencies:

     <dependency>
         <groupId>org.apache.spark</groupId>
         <artifactId>spark-streaming-flume_2.10</artifactId>
         <version>${spark.version}</version>
     </dependency>
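
    If the project is built with sbt instead of Maven, the equivalent dependency would look roughly like the sketch below (the version value is an assumption; use whatever ${spark.version} resolves to in your build):

// build.sbt - minimal sketch; the Spark version shown is only an assumed example
val sparkVersion = "1.6.3"
// %% automatically appends the Scala version suffix (_2.10 / _2.11) to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % sparkVersion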

    Write code:

package com.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by ZX on 2015/6/22.
  */
object FlumePushWordCount {

  def main(args: Array[String]) {
    val host = args(0)
    val port = args(1).toInt
    val conf = new SparkConf().setAppName("FlumeWordCount") // .setMaster("local[2]")
    // This constructor builds the SparkContext internally, so there is no need to create sc explicitly
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: Flume sends data to Spark (the host and port here are the Streaming receiver's address and port; Flume's sink is pointed at this address)
    val flumeStream = FlumeUtils.createStream(ssc, host, port)
    // The real content of a Flume event is obtained through event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))

    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
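
    In push mode the receiver must be reachable at the host and port passed in as arguments, so the job is typically submitted to the cluster first and those values are supplied on the command line. A hypothetical submit command (the jar name and master URL are assumptions, not from the original article):

bin/spark-submit --master spark://mini1:7077 --class com.streaming.FlumePushWordCount streaming-flume-examples.jar 192.168.31.172 8888

    Note that for push mode the Spark Streaming application must already be running before the Flume agent starts, otherwise the avro sink has nothing to connect to.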

    flume-push.conf, the Flume-side configuration file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
#This is the receiver
a1.sinks.k1.hostname = 192.168.31.172
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
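    With the file above saved as conf/flume-push.conf (the file name and path are assumptions), the push-mode agent can be started in the usual way:

bin/flume-ng agent -c conf -f conf/flume-push.conf -n a1 -Dflume.root.logger=INFO,console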

  2. Pull mode

    This is the recommended way: Spark Streaming actively pulls the data that Flume has produced.

    Write the code (the dependency is the same as above):

package com.streaming

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePollWordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull data from Flume (this is the Flume agent's address); the Seq can hold multiple addresses so data can be pulled from several Flume agents
    val address = Seq(new InetSocketAddress("172.16.0.11", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
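
    As the comment in the code notes, createPollingStream accepts a sequence of addresses, so one application can pull from several Flume agents at the same time. A minimal sketch, assuming a hypothetical second agent named mini2 that also runs a SparkSink on port 8888:

    // Pull from two Flume agents at once; "mini2" is a hypothetical second host
    val addresses = Seq(
      new InetSocketAddress("mini1", 8888),
      new InetSocketAddress("mini2", 8888)
    )
    val multiStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK)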

      Configure Flume

  Pull mode requires placing the relevant JARs into Flume's lib directory (Flume needs Spark's custom sink so that the Spark program can pull from it); the specific JAR information can be found on the official website:
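
    According to the Spark Streaming + Flume integration guide, the JARs to copy into Flume's lib directory are the custom Spark sink and its dependencies; the version placeholders below are illustrative and must match your Scala and Spark versions:

    spark-streaming-flume-sink_2.10-<spark.version>.jar
    scala-library-<scala.version>.jar
    commons-lang3-<version>.jar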

  

    Configure flume:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink (this is the Flume agent's own address, where data is buffered and waits to be pulled from)
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = mini1
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

    Start Flume first, then start the Spark Streaming application in IDEA:

bin/flume-ng agent -c conf -f conf/flume-poll.conf -n a1 -Dflume.root.logger=INFO,console
# The -Dflume.root.logger parameter is optional
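
To generate some test data, drop a new text file into the spooling directory watched by the source (the file name here is just an example); the spooldir source only picks up complete new files, so copy a file in rather than appending to an existing one:

cp /export/data/words.txt /export/data/flume/

The word counts should then be printed every five seconds in the Streaming console.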

 
