Spark Streaming Integration with Kafka Demo

 

This demo uses the low-level (direct) API. The high-level (receiver-based) API is harder to use, requires more complicated configuration, and is not automated enough, while the low-level API achieves the same result, so the low-level API is demonstrated here.

You need ZooKeeper and Kafka already installed.

My setup here uses three hosts.


Unlike the high-level API, the direct approach gives you simplified parallelism (no need to create multiple input streams; it automatically reads from the Kafka partitions in parallel), efficiency (data is not copied twice through a receiver), and exactly-once semantics (disadvantage: ZooKeeper-based monitoring tools cannot be used, since offsets are not stored in ZooKeeper).
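For comparison only, here is a rough sketch of what the receiver-based high-level API would look like against the same topic. This is not the approach used in this post; kafka_highlevel_group is just an illustrative group id, and node01:2181 is assumed to be one of the ZooKeeper nodes in this cluster.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverApiComparison {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReceiverApiComparison").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // The high-level API goes through ZooKeeper and stores the consumer group there.
    // To read Kafka partitions in parallel you would have to create several such
    // streams and union them, which is exactly what the direct API avoids.
    val receiverStream = KafkaUtils.createStream(
      ssc,
      "node01:2181",            // ZooKeeper address, not the Kafka brokers
      "kafka_highlevel_group",  // consumer group id kept in ZooKeeper
      Map("kafka_spark" -> 1)   // topic -> number of receiver threads
    )

    receiverStream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}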

 

1. Create a Maven project

First, add the pom dependencies. For the other required dependencies, refer to the earlier Spark Streaming integration WordCount post; in addition, add:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.0.2</version>
</dependency>

2. Start the ZooKeeper cluster

I wrote a script for the ZooKeeper cluster, so running that one script starts ZooKeeper on all the nodes.

 

ZooKeeper started successfully.

3. Start the Kafka cluster

I have three hosts here; all three need to be started.

Enter the directory

cd /export/servers/kafka/bin/

Start it up

kafka-server-start.sh -daemon /export/servers/kafka/config/server.properties 

 

Kafka started successfully.

4. Test Kafka

Create a topic

cd /export/servers/kafka_2.11-0.10.2.1
bin/kafka-topics.sh --create --zookeeper node01:2181 --replication-factor 1 --partitions 1 --topic kafka_spark

Send messages through the producer

cd /export/servers/kafka_2.11-0.10.2.1
bin/kafka-console-producer.sh --broker-list node01:9092 --topic  kafka_spark

Type whatever you want to send. Now we receive the data sent by the producer by writing an application.

Write the code

package SparkStreaming

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object SparkStreamingKafka {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf object
    val conf: SparkConf = new SparkConf()
      .setAppName("SparkStreamingKafka_Direct")
      .setMaster("local[2]")

    // 2. Create the SparkContext object
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // 3. Create the StreamingContext object
    /**
      * Parameter description:
      *   first parameter:  the SparkContext object
      *   second parameter: the batch interval
      */
    val ssc: StreamingContext = new StreamingContext(sc,Seconds(5))
    // Set the checkpoint directory

    ssc.checkpoint("./Kafka_Direct")

    // 4. Connect to Kafka via KafkaUtils.createDirectStream (this uses Kafka's low-level API; offsets are not managed by ZooKeeper)
    // 4.1 Configure the Kafka-related parameters
    val kafkaParams=Map("metadata.broker.list"->"192.168.52.110:9092,192.168.52.120:9092,192.168.52.130:9092","group.id"->"kafka_Direct")
    // 4.2 Define the topics
    val topics=Set("kafka_spark")

    val dstream: InputDStream[(String, String)] = KafkaUtils
      .createDirectStream[String,String,StringDecoder,StringDecoder](ssc,kafkaParams,topics)

    // 5. Get the message data from the topic
    val topicData: DStream[String] = dstream.map(_._2)

    // 6. Split each line and map every word to a count of 1
    val wordAndOne: DStream[(String, Int)] = topicData.flatMap(_.split(" ")).map((_,1))

    // 7. Sum up the occurrences of each word
    val resultDS: DStream[(String, Int)] = wordAndOne.reduceByKey(_+_)

    // 8. Print the data via an output operation
    resultDS.print()

    // 9. Start the streaming computation
    ssc.start()

    // Block so the application keeps running
    ssc.awaitTermination()



  }
}
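As the comment in step 4 notes, with the direct API the consumed offsets are not tracked in ZooKeeper. If you want to inspect or store them yourself, they can be read from each batch's RDD. A minimal sketch, reusing the dstream from the program above and registered before ssc.start():

import org.apache.spark.streaming.kafka.HasOffsetRanges

// Print the Kafka offset range consumed in each batch. With the direct API these
// offsets are not written to ZooKeeper, so saving them somewhere (ZooKeeper, a
// database, etc.) is up to the application if it does not want to rely only on
// the checkpoint directory for recovery.
dstream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
  }
}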

 

Produce some data from the producer console; the application receives it and prints the word-count results to the console.
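Note that reduceByKey only counts the words that arrived in the current 5-second batch. Since the program already sets a checkpoint directory, a running total across batches could be kept with updateStateByKey instead; here is a sketch of how steps 7 and 8 might be changed (the updateFunc name is mine):

// Accumulate counts across batches instead of per batch.
// This relies on the checkpoint directory set earlier with ssc.checkpoint(...).
def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

val totalDS: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunc)
totalDS.print()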

 


Origin www.cnblogs.com/BigDataBugKing/p/11233729.html