Connecting Kafka and Spark Streaming

Hello everyone!

   Below is my own summary of how to connect Kafka with Spark Streaming, for reference only. The Scala code is as follows:

package SparkStream

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/10/11.
  * Purpose: word count over Kafka data (connecting Kafka to Spark Streaming)
  *
  */
object KafkaWc {

  //State update function for updateStateByKey: for each word, add the counts from the
  //current batch (Seq[Int]) to the running total carried over in the state (Option[Int])
  val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
    iterator.flatMap { case (word, counts, state) => Some(counts.sum + state.getOrElse(0)).map(n => (word, n)) }
  }

  def main(args: Array[String]): Unit = {
    //Set the log level
    LoggerLevels.setStreamingLogLevels()
    //Receive the command-line arguments
    val Array(zkQuorum, groupId, topics, numThreads)=args
    val conf=new SparkConf().setAppName("KafkaWc").setMaster("local[2]")
    val sc=new SparkContext(conf)
    val ssc=new StreamingContext(sc,Seconds(5))
    //Set the checkpoint directory
    ssc.checkpoint("c://test//checkpoint1011")
    //Set topic info: how many threads the consumer uses to consume this topic
    val topicMap=topics.split(",").map((_,numThreads.toInt)).toMap
    //Pull data from Kafka and create a DStream
    val lines = KafkaUtils.createStream(ssc, zkQuorum ,groupId, topicMap, StorageLevel.MEMORY_ONLY)
    //Take the message value (._2) because Kafka records are key-value pairs, then split it into words
    val word = lines.map(_._2).flatMap(_.split(" ")).map((_,1))
    //Count each word, accumulating state across batches
    val result = word.updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    //Print the results to the console
    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
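The code above calls LoggerLevels.setStreamingLogLevels(), a small helper that is not shown in the post. A minimal sketch of what such a helper might look like, assuming the log4j 1.x API that ships with Spark 1.x:

package SparkStream

import org.apache.log4j.{Level, Logger}

//Hypothetical sketch of the LoggerLevels helper referenced in KafkaWc (not part of the original post)
object LoggerLevels {
  def setStreamingLogLevels(): Unit = {
    //Raise the root log level so Spark's INFO output does not drown out result.print()
    Logger.getRootLogger.setLevel(Level.WARN)
  }
}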

Step 1: Set the program arguments in IDEA

1.1 First open the run configuration window, the screenshot is as follows:

1.2 In the configuration parameter window, make the following settings, as shown in the screenshot below:

Notes:

1 The main class is the class to be run.

2 The program arguments "192.168.17.108:2181 1 20180815 3" correspond to this line in the code:

//Receive the command-line arguments

  val Array(zkQuorum, groupId, topics, numThreads)=args

The first argument, "192.168.17.108:2181", is the Zookeeper address.

The second argument, "1", is the consumer group id.

The third argument, "20180815", is the name of the topic that the Kafka producer writes to.

The fourth argument, "3", is the number of threads the consumer uses to consume this topic.
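For reference, the same four arguments could also be passed on the command line if the job were packaged and run outside IDEA; the jar name below is only illustrative:

spark-submit --class SparkStream.KafkaWc kafka-wc-example.jar 192.168.17.108:2181 1 20180815 3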

 

Step 2: Run the Kafka-to-Spark-Streaming program in IDEA, the screenshot is as follows:

As you can see from the screenshot, the program has started 3 threads to receive Kafka's topic 20180815. Because no data has been sent to Kafka yet, the Spark Streaming output is still empty.

Note: the local checkpoint directory must be deleted before running the Spark Streaming job.
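For example, on the Windows machine used in this post, the checkpoint directory from the code above could be cleared with something like:

rmdir /s /q c:\test\checkpoint1011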

 

Step 3: Create topic 20180815 in Kafka

[root@hadoop ~]# kafka-topics.sh --create --zookeeper hadoop:2181 --topic 20180815 --partitions 1 --replication-factor 1

Created topic "20180815".

Note: you must create a new topic rather than reuse a previously created one. Otherwise, an error is reported when data is typed into Kafka and Spark Streaming does not receive it.
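If you want to confirm that the topic was created as expected, the same script can describe it:

kafka-topics.sh --describe --zookeeper hadoop:2181 --topic 20180815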

 

Step 4: Start Kafka's console producer script so that Kafka can feed data to Spark Streaming

kafka-console-producer.sh  --broker-list hadoop:9092 --topic 20180815 

 

As you can see from the screenshot, the cursor is waiting for input, which means the Kafka producer script has started successfully. If you check the Spark Streaming output at this point it is still empty, which makes sense because the producer has not sent any data yet.
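Optionally, a console consumer can be started in another terminal to confirm that data actually reaches the topic; for the Zookeeper-based Kafka used in this post, the command would be roughly:

kafka-console-consumer.sh --zookeeper hadoop:2181 --topic 20180815 --from-beginning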

 

Step 5: In the producer window of Kafka, enter "bei jing huan ying ni", the screenshot is shown below:

Note: the words are separated by spaces to match the split(" ") in the Spark Streaming code; adjust the delimiter to your actual data.
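To see what the streaming code does with this line, the same transformation can be tried on a single string in the Scala REPL:

scala> "bei jing huan ying ni".split(" ").map((_, 1))
res0: Array[(String, Int)] = Array((bei,1), (jing,1), (huan,1), (ying,1), (ni,1))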

 

Step 6: View the Spark Streaming output in the local IDEA console, the screenshot is as follows:

 

As you can see from the screenshot, the words in a single batch interval have been counted.
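With the input above, the output of result.print() in the IDEA console should look roughly like this (the order of the pairs may vary):

-------------------------------------------
Time: ... ms
-------------------------------------------
(bei,1)
(jing,1)
(huan,1)
(ying,1)
(ni,1)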

 

Step 7: Verify the word accumulation in the Spark Streaming program by entering "shang hai huan ying ni" in the producer window, the screenshot is as follows:

Step 8: View the Spark Streaming output in the local IDEA console again, the screenshot is as follows:

 

As you can see from the screenshot, the three words huan, ying, and ni each have a count of 2, while the remaining four words each have a count of 1. This matches the data sent by the Kafka producer, so the verification is complete.
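This accumulation comes from updateFunc: for each word it adds the counts of the current batch to the total already held in the state. A quick check in the Scala REPL (assuming the KafkaWc object above is on the classpath) with the values "huan" has after the second batch:

scala> SparkStream.KafkaWc.updateFunc(Iterator(("huan", Seq(1), Some(1)))).toList
res1: List[(String, Int)] = List((huan,2))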

 

Summary points:

1 How to set the program arguments in IDEA.

2 The core of using Kafka as the data source for Spark Streaming is this line:

    val lines = KafkaUtils.createStream(ssc, zkQuorum ,groupId, topicMap, StorageLevel.MEMORY_ONLY)

 
