Creating a Spark distributed dataset (RDD)

1. Start Spark

spark-shell --master local[2]
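
In the shell a SparkContext is already created for you as sc; local[2] runs Spark locally with 2 worker threads. As a quick sanity check (a small sketch, not part of the original post; exact values may differ):

// The spark-shell pre-creates a SparkContext named `sc`.
// `local[2]` means Spark runs locally with 2 worker threads.
sc.master               // e.g. String = local[2]
sc.defaultParallelism   // e.g. Int = 2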

2. Create the simplest possible RDD

val rdd = sc.makeRDD(List(1,2,3,4,5))

3. Inspect the RDD

rdd.collect()
which returns
res0: Array[Int] = Array(1, 2, 3, 4, 5)
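
An equivalent call is sc.parallelize; for a Seq, makeRDD simply delegates to parallelize. A short sketch (not from the original post; rdd2 is just an illustrative name):

// sc.parallelize is the more commonly used equivalent of sc.makeRDD.
val rdd2 = sc.parallelize(List(1, 2, 3, 4, 5))
rdd2.collect()   // Array(1, 2, 3, 4, 5)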

4. Specify the number of partitions for an RDD (here the 9 elements are spread across 3 partitions)

val rdd = sc.makeRDD(List(1,2,3,4,5,6,7,8,9),3)
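
To confirm the partition count without writing any helper (a quick check, added here for convenience):

// Number of partitions of the RDD.
rdd.getNumPartitions    // Int = 3
rdd.partitions.length   // Int = 3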

5. A method for inspecting partition contents

Run the following code to define rddUtil:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object rddUtil {
  // Print the contents of every partition of an RDD.
  // mapPartitionsWithIndex runs once per partition and receives the
  // partition index together with an iterator over that partition's elements.
  def lookPartition[T: ClassTag](rdd: RDD[T]): Unit = {
    rdd.mapPartitionsWithIndex { (i: Int, it: Iterator[T]) =>
      // Emit a single (partitionIndex, elements) pair for each partition.
      Iterator.single((i, it.toList))
    }.collect().foreach { case (partition, values) =>
      println("partition:[" + partition + "]")
      values.foreach(println)
    }
  }
}

Run it to see the partition contents:

 rddUtil.lookPartition(rdd)
partition:[0]
1
2
3
partition:[1]
4
5
6
partition:[2]
7
8
9
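
Spark also has a built-in way to get the same view without defining a helper: glom() wraps each partition's elements into an array. A minimal sketch (the output formatting here is my own, not from the original post):

// glom() turns each partition into an Array, so collect() returns
// one Array per partition.
rdd.glom().collect().zipWithIndex.foreach { case (values, i) =>
  println("partition:[" + i + "] " + values.mkString(", "))
}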

Reposted from blog.csdn.net/starkpan/article/details/86646981