Real-time computing and wordcount program based on HDFS

Real-time computing on HDFS files works by monitoring an HDFS directory: whenever a new file appears in it, Spark Streaming processes that file as part of the next batch. In effect, the directory is treated as a real-time stream of files.

Java:  streamingContext.fileStream<KeyClass, ValueClass, InputFormatClass>(dataDirectory)
Scala: streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
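
As a concrete illustration of the Scala signature, the sketch below fills in the three type parameters for plain text files: `LongWritable` keys (byte offsets), `Text` values (lines), and the new-API `TextInputFormat`. This is the same combination that `textFileStream` uses internally. The object name `FileStreamSketch` is hypothetical; the directory is the one used later in this article. It is a minimal sketch that assumes the job is launched with spark-submit on a cluster that can reach that HDFS path.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical object name, used only for this illustration.
object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    // The master is expected to be supplied by spark-submit.
    val conf = new SparkConf().setAppName("FileStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Directory taken from the example in this article.
    val dataDirectory = "hdfs://hadoop01:8020/testdata/hadoop"

    // Key = byte offset of the line, value = the line itself.
    // Hadoop reuses Text objects, so each value is converted to a String right away.
    val lines: DStream[String] = ssc
      .fileStream[LongWritable, Text, TextInputFormat](dataDirectory)
      .map { case (_, text) => text.toString }

    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

For plain text input, `ssc.textFileStream(dataDirectory)` is the shorter way to get the same stream of lines; the generic form is mainly useful when the files need a different key type, value type, or input format.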

Spark Streaming monitors the specified HDFS directory and processes the new files that appear in it. Note the following: all files placed in the directory must have the same data format; a file must be put into the directory by moving or renaming it there, rather than being written in place; once a file has been processed, later changes to its content are ignored and the file is not processed again; and because a data source based on HDFS files has no Receiver, it does not occupy a CPU core.
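
The "move or rename" rule can be followed from code as well as from the command line. The sketch below, written against the Hadoop `FileSystem` API, first uploads a file to a staging directory that is not monitored and then renames it into the monitored directory, so the stream only ever sees the complete file. The local file `/tmp/words.txt`, the staging directory `/testdata/staging`, and the object name `StageAndRename` are assumptions made for illustration; the NameNode address and the monitored directory come from the example in this article.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper program, not part of the original article.
object StageAndRename {
  def main(args: Array[String]): Unit = {
    // NameNode address and monitored directory as used in this article.
    val fs = FileSystem.get(new URI("hdfs://hadoop01:8020"), new Configuration())
    val monitoredDir = new Path("/testdata/hadoop")

    // Assumed names: a local input file and an unmonitored staging directory.
    val localFile = new Path("/tmp/words.txt")
    val stagingDir = new Path("/testdata/staging")

    try {
      fs.mkdirs(stagingDir)
      fs.mkdirs(monitoredDir)

      // Slow step: the upload happens entirely outside the monitored directory.
      val staged = new Path(stagingDir, localFile.getName)
      fs.copyFromLocalFile(localFile, staged)

      // Fast step: the rename makes the finished file appear in the monitored directory.
      val moved = fs.rename(staged, new Path(monitoredDir, staged.getName))
      if (!moved) sys.error(s"Could not move $staged into $monitoredDir")
    } finally {
      fs.close()
    }
  }
}
```

Keeping the staging directory on the same HDFS instance matters here: the rename is then a metadata operation rather than a second copy.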

Case: Monitor the /testdata/hadoop directory on HDFS; whenever a new file is uploaded, count the words in it and print the result.
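		package day18

		import org.apache.log4j.{Level, Logger}
		import org.apache.spark.{SparkConf, SparkContext}
		import org.apache.spark.streaming.{Seconds, StreamingContext}

		object HDFSWordCountDemo {
		  def main(args: Array[String]): Unit = {
		    // Reduce log noise from Spark and Hadoop
		    Logger.getLogger("org").setLevel(Level.WARN)
		    // The master is supplied by spark-submit when the job is packaged and run on the cluster.
		    // With setMaster("local[2]") on Windows, no data was received during testing.
		    // A file stream has no Receiver, so no thread has to be reserved for receiving data.
		    val config = new SparkConf().setAppName("HDFSWordCountDemo") //.setMaster("local[2]")
		    val sc = new SparkContext(config)
		    // Batch interval: one RDD is generated every 2 seconds
		    val ssc = new StreamingContext(sc, Seconds(2))
		    // Each new file in the monitored directory is read as lines of text
		    val fileDstream = ssc.textFileStream("hdfs://hadoop01:8020/testdata/hadoop")
		    // Split lines into words, count each word within the batch, and print the counts
		    fileDstream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
		    ssc.start()
		    ssc.awaitTermination()
		  }
		}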
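
To see what the `flatMap` / `map` / `reduceByKey` chain computes for a single batch, the same logic can be tried on an ordinary Scala collection without a cluster. This is only a sketch: `groupBy` plus a sum stands in for Spark's `reduceByKey`, and the object name `BatchWordCount` and the sample lines are made up for illustration.

```scala
// Hypothetical stand-alone illustration; uses plain Scala collections, not Spark.
object BatchWordCount {
  // Counts the words in one batch of lines.
  // groupBy + sum plays the role that reduceByKey(_ + _) plays on a DStream.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // split every line into words
      .map((_, 1))             // pair each word with the count 1
      .groupBy(_._1)           // collect the pairs that share a word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Made-up content standing in for the lines of one newly arrived file.
    val batch = Seq("hello spark", "hello hdfs")
    countWords(batch).toSeq.sortBy(_._1).foreach(println)
    // prints:
    // (hdfs,1)
    // (hello,2)
    // (spark,1)
  }
}
```

In the streaming job the counts are computed separately for each 2-second batch, so a word's count covers only the files picked up in that batch rather than a running total.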

		# Shell script: submit the job to the cluster
		/home/kitty/opt/spark/bin/spark-submit \
		--class day18.HDFSWordCountDemo \
		--master spark://hadoop01:7077 \
		--driver-memory 512M \
		--executor-memory 512M \
		--total-executor-cores 2 \
		/home/kitty/mytmp/scala-1.0-SNAPSHOT.jar


