Spark Streaming (1): The Basics

I. Overview

Spark Streaming is an extension of Spark Core that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources (such as Kafka, Flume, or TCP sockets) and processed with complex streaming computations; the final results can then be stored in file systems or databases, or displayed on live dashboards.
Internally, Spark Streaming splits the received data stream into small batches (micro-batches); the Spark engine processes each micro-batch as an RDD and produces the final result stream batch by batch.
Spark Streaming provides a high-level abstraction called a discretized stream, or DStream. A DStream can be constructed from an external data source or derived by transforming another DStream (similar to how RDDs are used in Spark Core).

Conclusion: under the hood, a DStream is a sequence of RDDs (Seq[RDD]).

II. DStream Principles

The DStream is Spark Streaming's core abstraction. It represents a continuous stream of data (essentially an ordered sequence of RDDs); each RDD in a DStream contains the data from one fixed interval.
Any operation applied to a DStream is translated into operations on its underlying RDDs.
The core idea is micro-batching: under the hood, the discrete data stream is processed as a series of Spark RDDs.
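To make the micro-batch idea concrete, below is a minimal word-count sketch (not part of the original example code; the host localhost and port 8888 are placeholders matching the socket source shown later). Every batch interval, the lines received in that interval form one RDD, and the transformations are re-applied to that RDD:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QuickStart {
  def main(args: Array[String]): Unit = {
    // at least two local threads: one for the receiver, one for processing
    val conf = new SparkConf().setAppName("quick start").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    // each 5-second micro-batch of lines becomes one RDD of the DStream
    ssc.socketTextStream("localhost", 8888)
      .flatMap(_.split("\\s"))
      .map((_, 1))
      .reduceByKey(_ + _) // applied to every micro-batch RDD
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}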

III. Input Sources and Receivers

An input DStream is a DStream that represents the stream of data received from a source.
There are two ways to construct an input DStream:

  • Basic sources: can usually be created directly from the StreamingContext (ssc) without third-party dependencies, e.g. file systems and sockets
  • Advanced sources: typically require third-party dependencies, e.g. Kafka, Flume, and other streaming data systems

Basic sources:

  • File system (reads data files from a directory on any file system accessible through the HDFS API and uses them as the DStream's data source)
// Build a DStream from a file system. Note: the path points to a directory, not a specific file
val lines = ssc.textFileStream("hdfs://xxx:9000/data")

Note:

  • The path must point to a directory, not a specific file
  • The data directory supports wildcards, e.g. hdfs://xxx:9000/data*
  • All data files must share the same format; plain text is recommended
  • TCP socket
val lines = ssc.socketTextStream("localhost",8888)
  • RDD queue (a Queue holding multiple RDDs can be used to construct a DStream)
// Note: ssc wraps a SparkContext that can be obtained directly; there is no need to create one manually
val rdd1 = ssc.sparkContext.makeRDD(List("Hello Spark","Hello Kafka"))
val rdd2 = ssc.sparkContext.makeRDD(List("Hello Scala","Hello Hadoop"))

// Wrap the RDDs in a Queue and create a DStream
val queue = scala.collection.mutable.Queue(rdd1,rdd2)
val lines = ssc.queueStream(queue)
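Note that a queue-backed DStream is mainly useful for testing: by default queueStream dequeues one RDD from the queue per batch interval (or all queued RDDs at once if its oneAtATime flag is set to false), so a streaming job can be fed hand-made test data without any external service.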

Advanced sources:

  • Kafka-based
    (1) Import the dependency
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
   <version>2.4.4</version>
</dependency>

(2) Application code

package source

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaSource {
  def main(args: Array[String]): Unit = {
    //1. Initialize the StreamingContext (ssc)
    val conf = new SparkConf().setAppName("kafka wordcount").setMaster("local[*]")
    val ssc = new StreamingContext(conf,Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    //2. Initialize the Kafka configuration map
    val kafkaParams = Map[String,Object](
      (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092"),
      (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,classOf[StringDeserializer]),
      (ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,classOf[StringDeserializer]),
      (ConsumerConfig.GROUP_ID_CONFIG,"g1")
    )

    //3. Prepare an Array with the topics to subscribe to
    val arr = Array("spark")

    //4. Initialize the DStream via the KafkaUtils helper
    val lines = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent, // location strategy: spread partitions evenly across executors
      ConsumerStrategies.Subscribe[String,String](arr,kafkaParams)
    )

    //5. Process the data
    lines
      // Kafka record ---> value
      .map(record => record.value())
      .flatMap(_.split("\\s"))
      .map((_,1))
      .groupByKey()
      .map(t2 => (t2._1,t2._2.size))
      .print()

    //6. Start the streaming application
    ssc.start()

    //7. Block until the application is terminated
    ssc.awaitTermination()
  }
}
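A side note on step 5: groupByKey followed by size works, but for a plain count reduceByKey is generally preferable because it combines values within each partition before shuffling. A sketch of the same step rewritten that way (using the same lines DStream as above):

// Step 5, rewritten: aggregate with reduceByKey instead of groupByKey
lines
  .map(record => record.value())
  .flatMap(_.split("\\s"))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()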

(3) Start the Kafka service and a console producer

# Start ZooKeeper
bin/zkServer.sh start conf/zoo.cfg
# Start Kafka
bin/kafka-server-start.sh -daemon config/server.properties
# Create the spark topic
bin/kafka-topics.sh --create --topic spark --bootstrap-server xxx:9092 --partitions 1 --replication-factor 1
Created topic "spark".
# Start a console producer for the spark topic
bin/kafka-console-producer.sh --topic spark --broker-list xxx:9092
  • Flume-based
    (1) Flume configuration file
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 9999

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(2) Import the dependency

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-flume_2.11</artifactId>
  <version>2.4.4</version>
</dependency>

(3) Application code

package source

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumeSource {
  def main(args: Array[String]): Unit = {
    //1. Initialize the StreamingContext (ssc)
    val conf = new SparkConf().setAppName("flume wordcount").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    //2. Initialize the DStream via the FlumeUtils helper
    val dstream = FlumeUtils
      .createStream(ssc, "localhost", 9999)

    // wrap the event body bytes in a String
    dstream
      .map(event => new String(event.event.getBody.array()))
      .print()
    //3. Start the streaming application
    ssc.start()

    //4. Block until the application is terminated
    ssc.awaitTermination()
  }
}

(4) Start the Flume agent to collect data

bin/flume-ng agent --conf conf --conf-file conf/simple.conf --name a1 -Dflume.root.logger=INFO,console

(5) Start telnet and send some data

telnet localhost 44444
Hello Spark
OK
Hello Spark
OK

IV. DStream Output Operations

Output operations write the results of DStream processing to external storage systems such as databases, Redis, HDFS, or HBase.
saveAsTextFiles(prefix, [suffix]): saves the contents of the DStream as text files under the application's working directory.
saveAsObjectFiles(prefix, [suffix]): saves the contents of the DStream as SequenceFiles of serialized objects under the application's working directory.

lines
  .flatMap(_.split("\\s")) // DStream ----> DStream
  .map((_, 1L))
  .groupByKey()
  .map(t2 => (t2._1, t2._2.size))
  // Save the DStream results to the application's working directory
  //.saveAsTextFiles("result", "xyz")
  .saveAsObjectFiles("result", "xyz")

saveAsNewAPIHadoopFiles(prefix, [suffix]): saves the contents of the DStream to Hadoop files using the new Hadoop OutputFormat API (note: the results are stored on HDFS under `/user/<username>/`).

lines
  .flatMap(_.split("\\s")) // DStream ----> DStream
  .map((_, 1L))
  .groupByKey()
  .map(t2 => (t2._1, t2._2.size))
  .saveAsNewAPIHadoopFiles(
    "result",
    "xyz",
    classOf[Text],
    classOf[LongWritable],
    classOf[TextOutputFormat[Text, LongWritable]],
    conf = hadoopConf)
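The snippet above relies on pieces it does not show: Text, LongWritable, and TextOutputFormat come from the Hadoop API, and hadoopConf must be a Hadoop Configuration supplied by the caller. One possible setup is sketched below (the hdfs://xxx:9000 address is the same placeholder used earlier):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Hadoop configuration assumed by the saveAsNewAPIHadoopFiles call above
val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", "hdfs://xxx:9000")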

foreachRDD(func): applies func to the RDD of each micro-batch in the DStream, so the data of every micro-batch can be written to an arbitrary external storage system such as a database or Redis.

lines
  .flatMap(_.split("\\s")) // DStream ----> DStream
  .map((_, 1L))
  .groupByKey()
  .map(t2 => (t2._1, t2._2.size))
  .foreachRDD(rdd => {
    // Save the computed results to Redis
    // Method 1: one Jedis connection per partition (recommended)
    rdd.foreachPartition(iter => {
      val jedis = new Jedis("localhost", 6379)
      while (iter.hasNext) {
        val tuple = iter.next()
        val word = tuple._1
        val count = tuple._2
        jedis.set(word, count.toString)
      }
      jedis.close()
    })

    // Method 2: one Redis connection per element
    /*
    rdd.foreach(t2 => {
      val jedis = new Jedis("localhost", 6379)
      jedis.set(t2._1, t2._2.toString)
      jedis.close()
    })
    */
  })
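Method 1 is preferred because opening a connection is relatively expensive: with foreachPartition a single Jedis connection serves all records of a partition, whereas the per-record variant in method 2 would open and close a connection for every key.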

Remark:

Do not use local[1] as the master URL of a Spark Streaming application. Doing so leaves only a single thread, which will be fully occupied receiving the stream data, with no thread left for processing it. The number of threads must therefore be at least 2, e.g. local[2] or local[*].
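For example, a minimal configuration sketch (the application name is arbitrary):

// at least two local threads: one runs the receiver, the others process batches
val conf = new SparkConf()
  .setAppName("streaming app")
  .setMaster("local[2]")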



Origin blog.csdn.net/Mr_YXX/article/details/105033815