Spark-Streaming kafka count Case

This case does word counting on data streamed from Kafka. It is a bit more involved: Kafka here gets its data from Flume, so it amounts to a small end-to-end example.

1. Start kafka
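
The exact commands depend on your installation. As a rough sketch, assuming Kafka's bin directory is on the PATH (as in the consumer command in step 3), the config paths are relative to the Kafka installation directory, ZooKeeper runs on master:2181, and the topic kaka does not exist yet:

zookeeper-server-start.sh -daemon config/zookeeper.properties
kafka-server-start.sh -daemon config/server.properties
kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 1 --topic kaka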

2. Start flume

flume-ng agent -c conf -f conf/kafka_test.conf -n a1 -Dflume.root.logger=INFO,console

  The Flume configuration file is as follows:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /root/code/flume_exec_test.txt

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList=master:9092
a1.sinks.k1.topic=kaka
a1.sinks.k1.serializer.class=kafka.serializer.StringEncoder

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

  Here Flume takes its data from a file: whatever is appended to that file gets picked up by Flume, so for testing you only need to write data into the file.
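
For a quick manual check (before using the generator script in step 6), you can append a line to the file by hand; the record below is just made-up sample data:

echo "1,apple,5" >> /root/code/flume_exec_test.txt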

3. Start a Kafka console consumer to observe the data

kafka-console-consumer.sh --bootstrap-server master:9092 --topic kaka

4. The Spark Streaming word-count code is as follows

package com.hw.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }

    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    // take the second comma-separated field of each record as the word to count
    val words = lines.map(_.split(",")(1))
    // 10-second window sliding every 2 seconds; both durations must be multiples
    // of the batch interval (the slide duration of the parent DStream)
    val wordCounts = words.map((_, 1L)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(10), Seconds(2))
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  } 

}

5. Execute the script

# kafka count bash
$SPARK_HOME/bin/spark-submit\
        --class com.hw.streaming.KafkaWordCount\
        --master yarn-cluster \
        --executor-memory 1G \
        --total-executor-cores 2 \
        --files $HIVE_HOME/conf/hive-site.xml \
        --jars $HIVE_HOME/lib/mysql-connector-java-5.1.25-bin.jar,$SPARK_HOME/jars/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/jars/datanucleus-core-3.2.10.jar,$SPARK_HOME/jars/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/jars/guava-14.0.1.jar \
        ./SparkPro-1.0-SNAPSHOT-jar-with-dependencies.jar \
        master:2181 group_id_1 kaka 1
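
To confirm the application is actually running on YARN (and to find its application id for later log inspection), you can use the standard YARN CLI:

yarn application -list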

6. Write data into the file that Flume is monitoring

import random
import time

# Copy orders.csv line by line into the file that Flume is tailing,
# sleeping a random interval between lines to simulate a stream.
readFileName = "/root/orders.csv"
writeFileName = "/root/code/flume_exec_test.txt"

with open(writeFileName, 'a+') as wf:
    with open(readFileName, 'r') as f:
        for line in f.readlines():
            ss = line.strip()
            if len(ss) < 1:
                continue
            wf.write(ss + '\n')
            wf.flush()  # make the line visible to tail -f right away
            time.sleep(random.random())
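
Run the generator while the Flume agent and the Streaming job are both up. The file name write_data.py below is only an example; use whatever name you saved the script under:

python write_data.py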

7. Check whether the consumer is receiving data. When the script was first run, two problems came up: one with the window duration and one with the missing checkpoint.

If the window duration is set incorrectly, the following error is reported:

User class threw exception: java.lang.IllegalArgumentException: requirement failed: The window duration of ReducedWindowedDStream (3000 ms) must be multiple of the slide duration of parent DStream (10000 ms)
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.streaming.dstream.ReducedWindowedDStream.<init>(ReducedWindowedDStream.scala:39)
at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$reduceByKeyAndWindow$6.apply(PairDStreamFunctions.scala:348)
at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$reduceByKeyAndWindow$6.apply(PairDStreamFunctions.scala:343)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:693)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
at org.apache.spark.streaming.dstream.PairDStreamFunctions.reduceByKeyAndWindow(PairDStreamFunctions.scala:343)
at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$reduceByKeyAndWindow$5.apply(PairDStreamFunctions.scala:311)
at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$reduceByKeyAndWindow$5.apply(PairDStreamFunctions.scala:311)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:693)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
at org.apache.spark.streaming.dstream.PairDStreamFunctions.reduceByKeyAndWindow(PairDStreamFunctions.scala:310)
at com.badou.streaming.KafkaWordCount$.main(KafkaWordCount.scala:22)
at com.badou.streaming.KafkaWordCount.main(KafkaWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)

To fix this, the window duration (and the slide duration) must be a multiple of the parent DStream's slide duration, i.e. the batch interval. The code given above has already been corrected, so if you follow the steps as written you will not hit this error.

If you do not set a checkpoint directory, the following error is reported:

requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint().

Set the checkpoint directory accordingly:

# Add the following statement to the statistical code, right after
# val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("/root/checkpoint")
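
In yarn-cluster mode a checkpoint path without a scheme, such as /root/checkpoint above, is resolved against the cluster's default filesystem, normally HDFS. Assuming that is the case here, you can verify that checkpoint data is being written with:

hdfs dfs -ls /root/checkpoint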

Once everything above is running, you can open the cluster web UI in a browser and check the logs there to see the word-count statistics.

# web UI: 192.168.56.122:8080
# view the corresponding log information there
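
Because the job runs in yarn-cluster mode, the output of wordCounts.print() goes to the driver container's log rather than to the local console. With YARN log aggregation enabled you can also pull it from the command line (replace <application id> with the id shown by yarn application -list or in the web UI); depending on your Hadoop version this may only work after the application has finished, while a running job's logs remain reachable through the web UI:

yarn logs -applicationId <application id>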

Summary: during testing, the following error appeared when starting Flume:

[WARN - kafka.utils.Logging$class.warn(Logging.scala:83)]
Error while fetching metadata    partition 4    leader: none    replicas:    isr:
isUnderReplicated: false for topic partition [default-flume-topic,4]:
[class kafka.common.LeaderNotAvailableException]

The root cause was the Flume configuration file: the Kafka sink settings were not being applied, so although the intended topic was kaka, Flume was actually publishing to the default default-flume-topic. On inspection it turned out to be a careless typo: the property prefix sinks had been written as sink. Pay attention to the details, and learn to read the logs.

 
