In the previous article, the data source was a socket, which is a bit of a toy setup; in real deployments, data is usually ingested from a collector such as flume or a message queue such as kafka.
The main supported sources, per the official website, are as follows:
Data can be obtained in two ways: push and pull.
1. Spark Streaming integration with flume
1. The push approach
Of the two, the pull approach is the more recommended one.
Import dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>
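If the project is built with sbt instead of Maven, an equivalent dependency line (assuming the same Scala 2.10 artifact and a `sparkVersion` setting defined in the build) would look like:

```scala
libraryDependencies += "org.apache.spark" % "spark-streaming-flume_2.10" % sparkVersion
```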
Write code:
package com.streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Created by ZX on 2015/6/22.
*/
object FlumePushWordCount {

  def main(args: Array[String]): Unit = {
    val host = args(0)
    val port = args(1).toInt
    val conf = new SparkConf().setAppName("FlumeWordCount") // .setMaster("local[2]")
    // This constructor builds the SparkContext internally, so we don't create one ourselves
    val ssc = new StreamingContext(conf, Seconds(5))
    // Push mode: flume sends data to Spark Streaming. Note that host and port here
    // are the streaming receiver's address; flume's sink must point at this address.
    val flumeStream = FlumeUtils.createStream(ssc, host, port)
    // The real payload of a flume event is available through event.getBody()
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
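The flatMap / map / reduceByKey pipeline above can be sketched on plain Scala collections, without Spark, to show what happens to each event body (assuming UTF-8 encoded payloads; the sample strings are made up for illustration):

```scala
// Sample event bodies, standing in for flume events' getBody() byte arrays
val bodies = Seq("hello spark", "hello flume").map(_.getBytes("UTF-8"))

val counts = bodies
  .flatMap(b => new String(b, "UTF-8").split(" "))     // decode and split into words
  .map((_, 1))                                         // pair each word with 1
  .groupBy(_._1)                                       // local stand-in for reduceByKey
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(counts) // Map(hello -> 2, spark -> 1, flume -> 1)
```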
flume-push.conf, the flume-side configuration file:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# This is the receiver (the Spark Streaming host and port)
a1.sinks.k1.hostname = 192.168.31.172
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
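With this file saved as flume-push.conf, the agent can be launched with the same flume-ng command shown at the end of this article. In push mode, start the Spark Streaming application first, so flume's avro sink has a receiver to connect to:

```
bin/flume-ng agent -c conf -f conf/flume-push.conf -n a1 -Dflume.root.logger=INFO,console
```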
2. The pull approach
In pull mode, streaming actively pulls the data flume has collected; this is the recommended approach.
The code (the dependency is the same as above):
package com.streaming
import java.net.InetSocketAddress
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object FlumePollWordCount {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Pull data from flume (this is flume's address). The Seq can hold several
    // InetSocketAddress entries, so streaming can pull from multiple flume agents.
    val address = Seq(new InetSocketAddress("172.16.0.11", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_ + _)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
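As the comment above notes, `createPollingStream` accepts a `Seq` of addresses, so one streaming application can pull from several flume agents at once. A minimal sketch of such a sequence, with hypothetical hostnames `mini1` and `mini2`:

```scala
import java.net.InetSocketAddress

// Each entry points at the host/port of one flume agent's SparkSink
val addresses = Seq(
  new InetSocketAddress("mini1", 8888),
  new InetSocketAddress("mini2", 8888)
)
```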
Configuring flume
Pull mode requires placing the relevant JARs into flume's lib directory, because the sink that Spark Streaming pulls from (the spark-streaming-flume-sink artifact) runs inside the flume agent; the exact JAR names can be found on the official website:
Configure flume:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink (this is the address flume listens on, waiting to be pulled from)
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = mini1
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start flume first, then start the Spark Streaming application in IDEA:
bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
// Point -f at the configuration file created above; the -D logger option is optional