Spark Process in Detail, Part 4: WordCount

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {

    // Create the Spark configuration and set the application name
    //val conf = new SparkConf().setAppName("ScalaWordCount")
    val conf = new SparkConf().setAppName("ScalaWordCount").setMaster("local[4]")
    // Create the entry point for Spark execution
    val sc = new SparkContext(conf)

    // Specify where the data will be read from, creating an RDD (resilient distributed dataset)
    val lines: RDD[String] = sc.textFile("hdfs://node-4:9000/wc1", 1)
    // Number of partitions (recorded for inspection)
    val partitionsNum = lines.partitions.length

    // Split each line and flatten the result
    val words: RDD[String] = lines.flatMap(_.split(" "))

    // Pair each word with a 1
    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))

    // Aggregate by key
    val reduced: RDD[(String, Int)] = wordAndOne.reduceByKey(_ + _)

    // Sort by count in descending order (computed here but not saved;
    // the analysis below traces the unsorted `reduced` RDD)
    val sorted: RDD[(String, Int)] = reduced.sortBy(_._2, false)

    // Save the result to HDFS
    reduced.saveAsTextFile(args(1))

    // Release resources
    sc.stop()

  }

}

Parsing process

  1. Calling textFile() generates two RDDs: a HadoopRDD[(K, V)] and a MapPartitionsRDD[String]. The HadoopRDD[(K, V)] converts the file in the Hadoop storage system into (K, V) pairs, where the Key is the line's byte offset in the file (a LongWritable) and the Value is the content of that line (a Text), similar to the input of a MapReduce map. Each partition reads its own input split.

  2. The MapPartitionsRDD[String] takes the Value out of the (K, V) pairs of the previous RDD and converts the Text into a String (see the first sketch after this list).

  3. Calling flatMap(_.split(" ")) generates another MapPartitionsRDD[String] from the previous MapPartitionsRDD[String]; its function splits each line into words and flattens the result.

  4. Calling map((_, 1)) generates a MapPartitionsRDD[(String, Int)]; its function converts each String into a (String, Int) pair.

  5. Calling reduceByKey(_ + _) first writes the data to disk and then generates a ShuffledRDD[(String, Int)]. This splits the lineage into two stages: the first performs local (partial) aggregation, and the second performs global aggregation, in which the data for each key is pulled over the network from the various partitions (see the second sketch after this list).

  6. Calling saveAsTextFile(args(1)) generates a MapPartitionsRDD[(NullWritable, Text)]; its function converts each record into (NullWritable, Text) so it can be written to the storage system, similar to the output of a MapReduce reduce (see the third sketch after this list).

  7. The output is written to HDFS; two files are generated, since there are two partitions.
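First sketch: a rough, simplified picture of what textFile() does internally in steps 1 and 2, assuming the classic hadoopFile API and the sc and imports from the program above (the real implementation differs in detail):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// HadoopRDD[(LongWritable, Text)]: each partition reads its own input split;
// the key is the line's byte offset, the value is the line's content
val kv = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://node-4:9000/wc1", 1)

// MapPartitionsRDD[String]: drop the offset and turn the Text value into a String
val linesEquivalent: RDD[String] = kv.map(pair => pair._2.toString)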
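Second sketch: the two-phase aggregation of step 5, written out by hand. This is a conceptual illustration, not Spark's actual implementation (which uses combiners and shuffle files); it only shows the idea of local aggregation followed by a global, shuffle-based aggregation:

// Phase 1 (before the shuffle): aggregate within each partition, no network I/O
val locallyReduced: RDD[(String, Int)] = wordAndOne.mapPartitions { iter =>
  iter.toSeq
    .groupBy { case (word, _) => word }
    .iterator
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}

// Phase 2 (after the shuffle): pull each key's partial sums over the network
// from all partitions and combine them into the global count
val globallyReduced: RDD[(String, Int)] =
  locallyReduced.groupByKey().mapValues(_.sum)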
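Third sketch: a simplified picture of what saveAsTextFile() does in step 6 (the real code goes through mapPartitions and Hadoop's output committer; this only shows the (NullWritable, Text) conversion the list item describes):

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat

// MapPartitionsRDD[(NullWritable, Text)]: wrap each record for the Hadoop
// output format, then write one part-file per partition
reduced
  .map(record => (NullWritable.get(), new Text(record.toString)))
  .saveAsHadoopFile(args(1), classOf[NullWritable], classOf[Text],
    classOf[TextOutputFormat[NullWritable, Text]])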

In this process, six RDDs are generated. There are two stages, each stage has two tasks, and there are two kinds of Task.
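One way to check the RDD count and the stage boundary is to print the lineage of the final RDD; the indentation in toDebugString's output marks the shuffle. The output below is illustrative only (IDs, call sites, and formatting vary by Spark version), and the sixth RDD, the MapPartitionsRDD[(NullWritable, Text)], only appears inside saveAsTextFile:

println(reduced.toDebugString)
// Illustrative output:
// (2) ShuffledRDD[4] at reduceByKey ...
//  +-(2) MapPartitionsRDD[3] at map ...
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  MapPartitionsRDD[1] at textFile ...
//     |  hdfs://node-4:9000/wc1 HadoopRDD[0] at textFile ...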


The relationship between RDDs, partitions, and tasks

There are two kinds of tasks, ShuffleMapTask and ResultTask, split at the shuffle boundary.

  • ShuffleMapTask: also called a MapTask. It runs before the shuffle and is responsible for reading data, processing it, and writing the intermediate results to disk.
  • ResultTask: also called a ReduceTask. It runs after the shuffle and is responsible for pulling data from upstream, processing it, and writing the results to the storage system.

Stages are likewise divided at the shuffle boundary, and within each stage there are as many tasks as there are partitions.
After the shuffle, the number of partitions may change.
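A small sketch of that last point, assuming the wordAndOne RDD from the program above: the shuffle operators take an optional partition count, so the stage after the shuffle can run a different number of tasks than the stage before it:

// Stage before the shuffle: as many tasks as input partitions
println(wordAndOne.partitions.length)   // e.g. 2

// reduceByKey accepts an optional numPartitions argument; the stage after
// the shuffle then runs that many tasks and writes that many part-files
val reducedWide = wordAndOne.reduceByKey(_ + _, 6)
println(reducedWide.partitions.length)  // 6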



Origin blog.csdn.net/heartless_killer/article/details/104592172