Spark RDD creation

The data in an RDD can come from two places: a local collection or an external data source.

  • Local collection: a custom Scala collection
  • External data source: a file or folder

1. Convert local collections/external data sources into RDDs

  • sc.parallelize(local collection, number of partitions)

Note: If you do not specify the number of partitions, this method uses the default parallelism, which is typically the total number of cores available to the application (e.g., all local cores when running with local[*])
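For example (a minimal sketch; it assumes a SparkContext named sc created with master local[*], as in the demo further below):

val withDefault = sc.parallelize(Seq(1, 2, 3, 4, 5))     // uses sc.defaultParallelism partitions
val withExplicit = sc.parallelize(Seq(1, 2, 3, 4, 5), 2) // exactly 2 partitions
println(sc.defaultParallelism)         // e.g. 8 on an 8-core machine
println(withDefault.getNumPartitions)  // same as defaultParallelism
println(withExplicit.getNumPartitions) // 2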

  • sc.makeRDD(local collection, number of partitions)

Note: Under the hood, this method simply calls parallelize
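In fact, the Seq overload of makeRDD in Spark's SparkContext is essentially a one-line delegation (simplified sketch of the real definition, which also wraps the call in withScope):

def makeRDD[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] =
  parallelize(seq, numSlices)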

  • sc.textFile(local file/HDFS file/folder, minimum number of partitions)

Note:

  1. Avoid using it to read large numbers of small files (each file produces at least one partition/task)
  2. If you do not specify the number of partitions, files are read with at least two partitions (minPartitions defaults to min(defaultParallelism, 2)); a short sketch follows this list
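The second argument of textFile is actually minPartitions, a lower bound rather than an exact count: the real partitioning follows the Hadoop input splits (roughly one per file or block), which is why rdd8 in the demo below still has 10 partitions even though only 3 were requested. A minimal sketch, assuming the words.txt sample file from the demo:

val lines: RDD[String] = sc.textFile("data/input/words.txt", 3) // at least 3 partitions
lines.take(2).foreach(println)  // each element is one line of text
println(lines.getNumPartitions) // >= 3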
  • sc.wholeTextFiles(local folder/HDFS folder, minimum number of partitions)

Note:

  1. This method is designed for reading many small files; each file becomes one (file name, file content) record (see the sketch after this list)
  2. If you do not specify the number of partitions, minPartitions defaults to min(defaultParallelism, 2), and many small files are packed into those few partitions rather than each file getting its own (in the demo below, 10 files yield 2 partitions)
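A minimal sketch, assuming the data/input/ratings10 folder of small files used in the demo below:

val files: RDD[(String, String)] = sc.wholeTextFiles("data/input/ratings10")
files.take(1).foreach { case (name, content) =>
  println(s"$name -> ${content.take(50)}") // (file name, whole file content) pairs
}
println(files.getNumPartitions) // small files are packed into a few partitions (e.g. 2)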

2. Get the number of RDD partitions

  • rdd.getNumPartitions

Returns the number of partitions of the RDD; under the hood it is simply partitions.length

  • rdd.partitions.length

Also returns the number of partitions of the RDD
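The equivalence between the two is visible in Spark's source, where getNumPartitions is defined directly on RDD (simplified):

// from org.apache.spark.rdd.RDD (simplified)
final def getNumPartitions: Int = partitions.length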

Code demo

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object RDDDemo01_Create {

  def main(args: Array[String]): Unit = {
    //TODO 0. env / create the environment
    val conf: SparkConf = new SparkConf().setAppName("spark").setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN")

    //TODO 1. source / load data / create RDDs
    val rdd1: RDD[Int] = sc.parallelize(1 to 10) //8
    val rdd2: RDD[Int] = sc.parallelize(1 to 10, 3) //3

    val rdd3: RDD[Int] = sc.makeRDD(1 to 10) // calls parallelize under the hood //8
    val rdd4: RDD[Int] = sc.makeRDD(1 to 10, 4) //4

    // RDD[one line of text per element]
    val rdd5: RDD[String] = sc.textFile("data/input/words.txt") //2
    val rdd6: RDD[String] = sc.textFile("data/input/words.txt", 3) //3
    // RDD[one line of text per element]
    val rdd7: RDD[String] = sc.textFile("data/input/ratings10") //10
    val rdd8: RDD[String] = sc.textFile("data/input/ratings10", 3) //10
    // RDD[(file name, file content), (file name, file content), ...]
    val rdd9: RDD[(String, String)] = sc.wholeTextFiles("data/input/ratings10") //2
    val rdd10: RDD[(String, String)] = sc.wholeTextFiles("data/input/ratings10", 3) //3

    println(rdd1.getNumPartitions) //8  // under the hood: partitions.length
    println(rdd2.partitions.length) //3
    println(rdd3.getNumPartitions) //8
    println(rdd4.getNumPartitions) //4
    println(rdd5.getNumPartitions) //2
    println(rdd6.getNumPartitions) //3
    println(rdd7.getNumPartitions) //10
    println(rdd8.getNumPartitions) //10
    println(rdd9.getNumPartitions) //2
    println(rdd10.getNumPartitions) //3

    //TODO 2. transformation
    //TODO 3. sink / output
  }
}
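Note that the partition counts in the comments (e.g. //8) come from running with local[*] on an 8-core machine; on a machine with a different core count, rdd1 and rdd3 will report that machine's core count instead. The data/input/words.txt and data/input/ratings10 paths refer to the post's sample data; any text file, or folder of small files, can be substituted to reproduce the output.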

Origin: blog.csdn.net/zh2475855601/article/details/115087086