Scala Spark Streaming + Kafka + Zookeeper: Publishing and Consuming Data

I. Spark Streaming

  Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards.

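  To make that model concrete, here is a minimal, self-contained sketch (my own illustration, not part of the project below) of the classic streaming word count: it assumes a plain text source on localhost:9999 (for example started with nc -lk 9999) and applies flatMap/map plus a windowed reduce to the stream:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWindowWordCount {
  def main(args: Array[String]): Unit = {
    // Local context with a 2-second batch interval; one thread is used by the socket receiver
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWindowWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Assumption: a text source is listening on localhost:9999 (e.g. nc -lk 9999)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split lines into words and count them over a sliding 30-second window, recomputed every 10 seconds
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}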

II. Implementing Spark Streaming

  Kafka and Zookeeper were installed beforehand; the Kafka service will not run unless Zookeeper is installed first. (If you use a CloudKarafka cluster, however, Zookeeper is already installed and configured for you.)

  I set up a Scala Maven project; both the project and the environment run on a single machine.

  1. First, my pom.xml configuration:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.sparkstream</groupId>
    <artifactId>LyhSparkStreaming</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <spark.version>2.3.3</spark.version>
        <scala.version>2.11</scala.version>
    </properties>


    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>commons-beanutils</groupId>
                    <artifactId>commons-beanutils-core</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8 -->
        <!-- Kafka 0.8 direct-stream API used by the consumer below; its version should match the Spark version -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.19</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>

        </plugins>
    </build>

</project>

  The Scala version I use is 2.11.8.

  2. My Producer. The code reads the contents of text1.txt and sends each line to a Kafka topic named Hunter. You can change this name; if the topic does not exist in Kafka, it will be created automatically. Create the text1.txt file yourself: any file with one record per line will do, there are no special requirements. Change the file path to your own.

package KafkaAndStreaming

import java.io.{BufferedReader, FileInputStream, FileNotFoundException, IOException, InputStreamReader}
import java.util.Properties

import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}


object TestKafkaProducer {

  type Key = String
  type Val = String

  def getProducerConfig(): Properties = {
    /**
     * Kafka producer configuration.
     * There are other properties as well; look them up as needed.
     **/
    val props:Properties = new Properties()
    // Kafka broker address
    props.put("bootstrap.servers", "localhost:9092")
    // Note: group.id is a consumer setting, and replication.factor / min.insync.replicas are
    // broker/topic settings; none of them belong in a producer config, so they are omitted here.
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props
  }

  def main(args: Array[String]): Unit = {
    // Get the producer configuration
    val props: Properties = getProducerConfig()
    // Create the producer
    val producer = new KafkaProducer[String, String](props)
    try{
      // Read the input file
      val fis:FileInputStream = new FileInputStream("/Users/hunter/text1.txt")
      val isr:InputStreamReader = new InputStreamReader(fis, "UTF-8")
      val br:BufferedReader = new BufferedReader(isr)
      var line: String = br.readLine()
      var i: Int = 0

      while (line != null) {
        val messageIndex = i  // capture the current index so the async callback prints the right value
        producer.send(toMessage(line, Option(messageIndex.toString), Option("Hunter")), new Callback {
          override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
            println(s"Message $messageIndex, send to: " + recordMetadata.topic())
          }
        })
        i += 1
        line = br.readLine()
      }
      producer.close()
      br.close()
      isr.close()
      fis.close()
    } catch {
      case ex: FileNotFoundException =>
        println("Missing file exception")
      case ex: IOException =>
        println("IO Exception")
      case _: Throwable =>
        println("Some other error occurred")
    }
//    dealWithData
  }

  // Wrap the message in a ProducerRecord
  private def toMessage(value: String, key: Option[Key] = None, topic: Option[String] = None): ProducerRecord[Key, Val] = {
    val t = topic.getOrElse("test") // fall back to the default topic "test"
    require(!t.isEmpty, "Topic must not be empty")
    key match {
      case Some(k) => new ProducerRecord(t, k, value)
      case _ => new ProducerRecord(t, value)
    }
  }
}
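  A note on the producer above: send() is asynchronous, and the Callback only fires once the broker acknowledges the record. As a small alternative sketch (my own addition, assuming the same localhost broker and Hunter topic), you can block on the java.util.concurrent.Future that send() returns, which is sometimes easier to follow when debugging on a single machine:

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}

object BlockingSendExample {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // send() returns a java.util.concurrent.Future; get() blocks until the broker acknowledges the record
    val metadata: RecordMetadata =
      producer.send(new ProducerRecord[String, String]("Hunter", "0", "hello from blocking send")).get()
    println(s"Written to ${metadata.topic()} partition ${metadata.partition()} at offset ${metadata.offset()}")

    producer.close()
  }
}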

  3. My Consumer. Here the main function needs launch arguments; the arguments are: localhost:9092 Hunter


  The code is as follows:

package KafkaAndStreaming

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaAndPrintInSpark {
  // Check the input arguments: they must include brokers and topics, so args.length must be at least 2.
  // Example for a single-machine run against a topic named test: brokers=localhost:9092 topics=test
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        s"""
           |Usage: DirectKafkaWordCount <brokers> <topics>
           |  <brokers> is a list of one or more Kafka brokers
           |  <topics> is a list of one or more kafka topics to consume from
        """.stripMargin)
      System.exit(1)
    }

    // Read the args into brokers and topics
    val Array(brokers, topics) = args

    // Create the streaming context with a 5-second batch interval
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Create the Kafka direct stream for the given brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "bootstrap.servers" -> brokers,
//      "auto.offset.reset" -> "smallest",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

// The commented-out block below reads from custom offsets: 5L means start reading at offset 5.
// By default the stream reads the latest data, i.e. the offset continues from where the last read ended.
//    val offsetList = List(("Hunter", 0, 5L))
//    val fromOffsets = setFromOffsets(offsetList) // turn the list into the required format, Map[TopicAndPartition, Long]
//    val messageHandler = (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.message()) // build the MessageAndMetadata handler; it is written the same way in every use case
//    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
//      ssc, kafkaParams, fromOffsets, messageHandler)

    messages.foreachRDD( rdd => {
//      if(rdd.count()>0) {
        rdd.foreach( records => {
          println("_1: " + records._1)
          println("_2: " + records._2)
        })

        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreachPartition { iter =>
          val o: OffsetRange = offsetRanges(TaskContext.get.partitionId())
          println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
        }
//      }
    })
    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }

  def setFromOffsets(list: List[(String, Int, Long)]): Map[TopicAndPartition, Long] = {
    var fromOffsets: Map[TopicAndPartition, Long] = Map()
    for (offset <- list) {
      val tp = TopicAndPartition(offset._1, offset._2) // topic and partition
      fromOffsets += (tp -> offset._3)                 // starting offset
    }
    fromOffsets
  }
}
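  The consumer above only prints every record and its offset range. To tie this back to the map/reduce processing mentioned in section I, here is a short sketch (my own addition, using the same 0.8 direct-stream API and the same launch arguments) that turns the Kafka stream into a per-batch word count instead:

package KafkaAndStreaming

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val Array(brokers, topics) = args // e.g. localhost:9092 Hunter

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Same direct stream as in KafkaAndPrintInSpark above
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics.split(",").toSet)

    messages
      .map(_._2)                  // keep only the record value
      .flatMap(_.split(" "))      // split each line into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // count words within each 5-second batch
      .print()                    // print the counts of every batch to the driver log

    ssc.start()
    ssc.awaitTermination()
  }
}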

  III. Final Results

    1. Producer

       (screenshot of the Producer console output)

    2. Consumer

       (screenshot of the Consumer console output)


Reposted from www.cnblogs.com/Lyh1997/p/11458614.html