This case uses Spark Streaming to compute statistics over data read from Kafka. It is a little more involved than the previous one: Kafka receives its data from Flume, so the whole pipeline amounts to a small end-to-end example.
1. Start Kafka
(see the earlier post: Spark-Streaming hdfs count Case)
2. Start Flume
flume-ng agent -c conf -f conf/kafka_test.conf -n a1 -Dflume.root.logger=INFO,console
The Flume configuration file is as follows:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /root/code/flume_exec_test.txt

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList=master:9092
a1.sinks.k1.topic=kaka
a1.sinks.k1.serializer.class=kafka.serializer.StringEncoder

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Here Flume reads its data from a file. Any line appended to that file is picked up by Flume, so for testing you only need to write data into the file.
3. Start a Kafka console consumer to observe the messages
kafka-console-consumer.sh --bootstrap-server master:9092 --topic kaka
4. The Streaming statistics code is as follows
package com.hw.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    // Batch interval: 2 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // Map each topic to the number of receiver threads
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Each Kafka record is a (key, message) pair; keep only the message
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    // Take the second comma-separated field of each line.
    // Note: flatMap over a String yields its characters; use map instead to count whole field values.
    val words = lines.flatMap(_.split(",")(1))
    // Window length 10 seconds, slide interval 2 seconds.
    // Both must be multiples of the batch interval (2 seconds here).
    // The inverse function (_ - _) requires a checkpoint directory, see step 7.
    val wordCounts = words.map((_, 1L)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(10), Seconds(2))
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
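The reduceByKeyAndWindow(_ + _, _ - _, ...) call above keeps a running total per key: the batch entering the window is added and the batch leaving it is subtracted. The following is a minimal pure-Python sketch of that idea, not Spark's implementation. The function name windowed_counts and the parameter window_batches are hypothetical; with a 2-second batch interval and a 10-second window, window_batches would be 5.

```python
from collections import Counter, deque


def windowed_counts(batches, window_batches):
    """Yield per-key counts over the last `window_batches` batches.

    Hypothetical helper that imitates the add/subtract idea behind
    reduceByKeyAndWindow(_ + _, _ - _, ...); it is not Spark code.
    """
    window = deque()    # per-batch counts currently inside the window
    totals = Counter()  # running counts for the current window
    for batch in batches:
        entering = Counter(batch)
        window.append(entering)
        totals.update(entering)        # the "_ + _" step
        if len(window) > window_batches:
            leaving = window.popleft()
            totals.subtract(leaving)   # the "_ - _" step
        # Drop keys whose count fell to zero so the output stays compact.
        yield {key: count for key, count in totals.items() if count > 0}


if __name__ == "__main__":
    sample = [["a", "b"], ["a"], ["c"], ["b", "b"]]
    for counts in windowed_counts(sample, window_batches=2):
        print(counts)
```

Because each new result depends on the previous totals, the running state has to be saved somewhere, which is why Spark asks for a checkpoint directory when an inverse function is used.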
5. Run the submit script
# kafka count
bash $SPARK_HOME/bin/spark-submit \
  --class com.hw.streaming.KafkaWordCount \
  --master yarn-cluster \
  --executor-memory 1G \
  --total-executor-cores 2 \
  --files $HIVE_HOME/conf/hive-site.xml \
  --jars $HIVE_HOME/lib/mysql-connector-java-5.1.25-bin.jar,$SPARK_HOME/jars/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/jars/datanucleus-core-3.2.10.jar,$SPARK_HOME/jars/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/jars/guava-14.0.1.jar \
  ./SparkPro-1.0-SNAPSHOT-jar-with-dependencies.jar \
  master:2181 group_id_1 kaka 1
6. Write test data into the file that Flume is monitoring
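import random
import time

readFileName = "/root/orders.csv"
writeFileName = "/root/code/flume_exec_test.txt"

# Append to the file that the Flume exec source is tailing
with open(writeFileName, 'a+') as wf:
    with open(readFileName, 'r') as f:
        for line in f:
            ss = line.strip()
            # Skip blank lines
            if len(ss) < 1:
                continue
            # Write each input line once
            wf.write(ss + '\n')
            # Flush so that tail -f sees the line right away
            wf.flush()
            # Sleep for a random interval of less than one second
            time.sleep(random.random())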
7. Check whether the consumer is receiving data. When the job was first run, two errors came up: one caused by the window duration and one caused by a missing checkpoint directory.
If the window duration is set incorrectly, the following error is reported:
User class threw exception: java.lang.IllegalArgumentException: requirement failed: The window duration of ReducedWindowedDStream (3000 ms) must be multiple of the slide duration of parent DStream (10000 ms)
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.streaming.dstream.ReducedWindowedDStream.<init>(ReducedWindowedDStream.scala:39)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$reduceByKeyAndWindow$6.apply(PairDStreamFunctions.scala:348)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$reduceByKeyAndWindow$6.apply(PairDStreamFunctions.scala:343)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:693)
    at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions.reduceByKeyAndWindow(PairDStreamFunctions.scala:343)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$reduceByKeyAndWindow$5.apply(PairDStreamFunctions.scala:311)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$reduceByKeyAndWindow$5.apply(PairDStreamFunctions.scala:311)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:693)
    at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
    at org.apache.spark.streaming.dstream.PairDStreamFunctions.reduceByKeyAndWindow(PairDStreamFunctions.scala:310)
    at com.badou.streaming.KafkaWordCount$.main(KafkaWordCount.scala:22)
    at com.badou.streaming.KafkaWordCount.main(KafkaWordCount.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
To fix this error, set the window duration to a multiple of the slide duration of the parent DStream, which here is the batch interval. The code given above already includes this fix, so if you follow the steps above you will not see this error.
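The rule behind this error can be checked before submitting the job. The sketch below is a hypothetical helper (check_window is not part of Spark) that tests whether the window and slide durations are multiples of the batch interval. The values 10000 ms and 3000 ms come from the error message above; the slide value passed in the second call is an assumption, since the log does not show it.

```python
def check_window(batch_ms, window_ms, slide_ms):
    """Return a list of problems with the durations (empty if they are valid).

    Hypothetical pre-flight check: both the window duration and the slide
    duration should be multiples of the batch interval.
    """
    problems = []
    if window_ms % batch_ms != 0:
        problems.append(
            f"window {window_ms} ms is not a multiple of batch interval {batch_ms} ms"
        )
    if slide_ms % batch_ms != 0:
        problems.append(
            f"slide {slide_ms} ms is not a multiple of batch interval {batch_ms} ms"
        )
    return problems


if __name__ == "__main__":
    # Settings used in the code of step 4: batch 2 s, window 10 s, slide 2 s.
    print(check_window(2000, 10000, 2000))
    # Settings from the error message: batch 10 s, window 3 s (slide assumed).
    print(check_window(10000, 3000, 10000))
```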
If no checkpoint directory is set, the job also fails, with the following error:
requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint().
The fix is to set a checkpoint directory.
// Add the following statement to the statistics code, after this line:
// val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("/root/checkpoint")
Once everything above is running, you can open the logs in a browser and see the corresponding statistics.
# Log address
192.168.56.122:8080
# View the corresponding log information there
Summary: during testing, starting Flume produced the following error:
[WARN - kafka.utils.Logging$class.warn(Logging.scala:83)] Error while fetching metadata partition 4 leader: none replicas: isr : isUnderReplicated: false for topic partition [default-flume-topic,4]: [class kafka.common.LeaderNotAvailableException]
The cause was in the Flume configuration file: the Kafka sink settings did not take effect. The sink should have been writing to the topic kaka, but the log shows it using the default topic default-flume-topic instead. After checking, the error turned out to be a careless typo: sinks had been written as sink. Pay attention to details, and learn to read the logs.
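A typo like this can be caught with a quick scan of the configuration file. The following sketch is a hypothetical helper, not a Flume tool: it flags property keys whose second segment is not one of the three component kinds used in this post (sources, sinks, channels). Flume supports other kinds as well, such as sinkgroups, so treat the result as a hint only. The mistyped key in the example is an assumption; the post does not say which line contained the typo.

```python
# Component kinds used in this post's configuration (an assumption of this sketch).
VALID_KINDS = {"sources", "sinks", "channels"}


def find_suspect_keys(config_text):
    """Return property keys whose second segment is not a known component kind."""
    suspects = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        parts = key.split(".")
        if len(parts) >= 2 and parts[1] not in VALID_KINDS:
            suspects.append(key)
    return suspects


if __name__ == "__main__":
    sample = """
    a1.sinks = k1
    a1.sinks.k1.brokerList=master:9092
    a1.sink.k1.topic=kaka
    """
    print(find_suspect_keys(sample))
```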