import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create the Spark configuration and set the application name
    //val conf = new SparkConf().setAppName("ScalaWordCount")
    val conf = new SparkConf().setAppName("ScalaWordCount").setMaster("local[4]")
    // Create the entry point for Spark execution
    val sc = new SparkContext(conf)
    // Specify where the data will be read from and create the RDD (resilient distributed dataset)
    val lines: RDD[String] = sc.textFile("hdfs://node-4:9000/wc1", 1)
    // Number of partitions
    val partitionsNum = lines.partitions.length
    // Split and flatten
    val words: RDD[String] = lines.flatMap(_.split(" "))
    // Pair each word with a 1
    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))
    // Aggregate by key
    val reduced: RDD[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    // Sort by count, descending
    val sorted: RDD[(String, Int)] = reduced.sortBy(_._2, false)
    // Save the result to HDFS (the sorted RDD, otherwise the sort above is dead code)
    sorted.saveAsTextFile(args(1))
    // Release resources
    sc.stop()
  }
}
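The same pipeline can be traced with plain Scala collections, with no Spark needed. This is only a sketch of what each transformation computes (the sample lines are made up), not of the distributed execution:

```scala
// Stand-in for the lines read from HDFS (hypothetical sample data)
val lines = List("spark hadoop spark", "hadoop flume spark")
// flatMap(_.split(" ")): split each line and flatten
val words = lines.flatMap(_.split(" "))
// map((_, 1)): pair each word with a 1
val wordAndOne = words.map((_, 1))
// reduceByKey(_ + _): group by key and sum the ones
val reduced = wordAndOne.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
// sortBy(_._2, false): descending by count
val sorted = reduced.toList.sortBy(-_._2)
println(sorted)  // List((spark,3), (hadoop,2), (flume,1))
```

On a real cluster each step runs per partition, but the per-record logic is exactly this.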
Parsing process
- Calling textFile() generates two RDDs: HadoopRDD[(K, V)] and MapPartitionsRDD[String]. HadoopRDD[(K, V)] converts the file in the Hadoop storage system into (K, V) pairs: the Key is the line's byte offset in the file (a LongWritable) and the Value is the content of that line (a Text), similar to the map input in MapReduce. Each partition reads its own input split.
- MapPartitionsRDD[String] takes the Value of the (K, V) pairs produced by the previous RDD and converts the Text into a String.
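Those first two RDDs can be mimicked locally: first the (offset, line) pairs a HadoopRDD would produce, then a map that keeps only the value. The sample text and the offset arithmetic here are illustrative, not Spark internals:

```scala
// Simulate HadoopRDD[(LongWritable, Text)]: (byte offset of line, line content).
// The sample text is made up for illustration.
val text = "hello spark\nhello hadoop\n"
val lines = text.split("\n").toList
// Each line's starting byte offset (+1 for the newline delimiter)
val offsets = lines.scanLeft(0L)((off, line) => off + line.length + 1).init
val hadoopLike: List[(Long, String)] = offsets.zip(lines)
// Simulate MapPartitionsRDD[String]: drop the key, keep the value as a String
val valuesOnly: List[String] = hadoopLike.map(_._2)
println(hadoopLike)  // List((0,hello spark), (12,hello hadoop))
println(valuesOnly)  // List(hello spark, hello hadoop)
```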
- Calling flatMap(_.split(" ")) turns the MapPartitionsRDD[String] into another MapPartitionsRDD[String]; its function splits each line into words and flattens the result.
- Calling map((_, 1)) generates a MapPartitionsRDD[(String, Int)]; its function converts each String into a (String, Int) pair.
- Calling reduceByKey(_ + _) first writes the data to disk and then creates a ShuffledRDD[(String, Int)]. This splits the job into two stages: the first performs local (partial) aggregation, and the second performs global aggregation, pulling the data belonging to each key over the network from the various upstream partitions.
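The two-stage aggregation behind reduceByKey can be sketched with local collections. The partition contents and the reducer count below are illustrative:

```scala
// Two input partitions of (word, 1) pairs (illustrative data)
val partitions = List(
  List(("spark", 1), ("hadoop", 1), ("spark", 1)),
  List(("hadoop", 1), ("spark", 1))
)
// Stage 1 (per partition, before the shuffle): local aggregation
val locallyCombined = partitions.map(
  _.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toList
)
// Shuffle: route each (key, partialSum) to a reducer by key hash
val numReducers = 2
val shuffled = locallyCombined.flatten.groupBy { case (k, _) =>
  math.abs(k.hashCode) % numReducers
}
// Stage 2 (per reducer, after the shuffle): global aggregation
val globallyCombined = shuffled.values.flatten
  .groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
println(globallyCombined.toList.sortBy(_._1))  // List((hadoop,2), (spark,3))
```

The local aggregation is what shrinks the data before it crosses the network; only the partial sums are shuffled, not every (word, 1) pair.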
- Calling saveAsTextFile(args(1)) creates a MapPartitionsRDD[(NullWritable, Text)]; its function converts each record into (NullWritable, Text) so it can be written to the storage system, similar to the Reduce output in MapReduce.
- The output written to HDFS consists of two files, since there are two partitions.

This process generates six RDDs in total. There are two Stages, each Stage has two tasks, and there are two kinds of Task.
Relationship between RDD, Partition, and Task
There are two kinds of Task, ShuffleMapTask and ResultTask, split at the shuffle boundary.
- ShuffleMapTask: also called MapTask; it runs before the shuffle and is responsible for reading the data, computing over it, and writing the intermediate results to disk.
- ResultTask: also called ReduceTask; it runs after the shuffle and is responsible for pulling data from upstream, computing over it, and writing the results to the storage system.
Stages are likewise divided at the shuffle boundary; within each Stage there are as many tasks as there are partitions.
After the shuffle, the number of partitions may change.
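That change in partition count can be simulated by redistributing the map-side data into a different number of key-hash buckets, as reduceByKey(_ + _, numPartitions) would with an explicit partition count. The data and the bucket count here are illustrative:

```scala
// Before the shuffle: 2 map-side partitions (illustrative data)
val mapSide = List(
  List(("spark", 2), ("hadoop", 1)),
  List(("hadoop", 1), ("spark", 1), ("flume", 1))
)
// After the shuffle: redistribute by key hash into 3 reduce-side partitions
val numReducers = 3
val reduceSide = (0 until numReducers).map { r =>
  mapSide.flatten.filter { case (k, _) => math.abs(k.hashCode) % numReducers == r }
}
println(mapSide.length)     // 2 partitions before the shuffle
println(reduceSide.length)  // 3 partitions after the shuffle
```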