1、reduce(func)
Aggregates all the elements in the RDD with the function func: the data is first aggregated within each partition, and then the partial results of the partitions are aggregated with each other.
scala> val rdd1 = sc.parallelize(1 to 100)
scala> rdd1.reduce(_ + _)
res0: Int = 5050
scala> val rdd2 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3)))
scala> rdd2.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
res2: (String, Int) = (abc,6)
Note: the order in which partition results are combined is not guaranteed, so the String part may come out in a different order; only the Int sum is deterministic.
2、collect
Returns all the elements of the RDD to the driver in the form of an array.
All of the data is pulled to the driver side, so use it with caution.
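A minimal spark-shell sketch (the input data here is illustrative):

```scala
scala> val rdd = sc.parallelize(1 to 5)
scala> rdd.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5)
```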
3、count
Returns the number of elements in the RDD.
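For example, in the spark-shell (note the result is a Long):

```scala
scala> sc.parallelize(1 to 100).count
res0: Long = 100
```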
4、take(n)
Returns an array made up of the first n elements of the RDD.
The data is pulled to the driver side, so it should only be used with small result sets.
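A short spark-shell sketch with illustrative data; take(n) returns the elements in partition order, which for parallelize is the input order:

```scala
scala> sc.parallelize(Array(2, 5, 1, 7)).take(2)
res0: Array[Int] = Array(2, 5)
```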
5、first
Returns the first element in the RDD. Similar to take(1).
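Continuing the same illustrative data:

```scala
scala> sc.parallelize(Array(2, 5, 1, 7)).first
res0: Int = 2
```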
6、takeOrdered(n,[ordering])
Returns the first n elements after sorting; ascending order by default.
The data is pulled to the driver side.
scala> val rdd1 = sc.makeRDD(Array(100, 20, 130, 500, 60))
scala> rdd1.takeOrdered(2)
res6: Array[Int] = Array(20, 60)
scala> rdd1.takeOrdered(2)(Ordering.Int.reverse)
res7: Array[Int] = Array(500, 130)
7、aggregate
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
The aggregate function first aggregates the elements inside each partition using seqOp and the initial value (zeroValue), and then uses the combOp function to combine the per-partition results, again together with the initial value (zeroValue).
The final return type of this function does not need to be the same as the element type of the RDD.
Note that zeroValue participates once in the aggregation inside each partition, and once more in the aggregation between partitions.
scala> val rdd1 = sc.makeRDD(Array(100, 30, 10, 30, 1, 50, 1, 60, 1), 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at makeRDD at <console>:24
scala> rdd1.aggregate(0)(_ + _, _ + _)
res12: Int = 283
scala> val rdd1 = sc.makeRDD(Array("a", "b", "c", "d"), 2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[9] at makeRDD at <console>:24
scala> rdd1.aggregate("-")(_ + _, _ + _)
res13: String = --ab-cd
8、fold
A folding operation, a simplified form of aggregate: when seqOp and combOp are the same function, you can use fold instead.
scala> val rdd1 = sc.makeRDD(Array(100, 30, 10, 30, 1, 50, 1, 60, 1), 2)
scala> rdd1.fold(0)(_ + _)
scala> val rdd1 = sc.makeRDD(Array("a", "b", "c", "d"), 2)
scala> rdd1.fold("-")(_ + _)
res17: String = --ab-cd
9、saveAsTextFile(path)
Role : saves the elements of the dataset as a text file to HDFS or another supported file system. For each element, Spark calls its toString method to turn it into a line of text in the file.
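A minimal sketch; the output path below is hypothetical, and one part-* file is written per partition:

```scala
scala> val rdd = sc.parallelize(1 to 10)
scala> rdd.saveAsTextFile("hdfs://hadoop102:9000/textFileOut") // hypothetical HDFS path
```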
10、saveAsSequenceFile(path)
Role : saves the elements of the dataset to the specified directory in Hadoop SequenceFile format; the target can be HDFS or another file system supported by Hadoop. (Only available on RDDs of key-value pairs.)
11、saveAsObjectFile(path)
Role : serializes the elements of the RDD and stores them in a file as objects.
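A sketch of both save formats side by side; the paths are hypothetical, and an object file can be read back with sc.objectFile:

```scala
scala> val rdd = sc.parallelize(Array(("a", 1), ("b", 2)))
scala> rdd.saveAsSequenceFile("/tmp/seqOut")  // key-value RDDs only; hypothetical path
scala> rdd.saveAsObjectFile("/tmp/objOut")    // uses Java serialization; hypothetical path
scala> sc.objectFile[(String, Int)]("/tmp/objOut").collect
```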
12、countByKey()
Role : for an RDD of type (K, V), returns a Map of (K, Long) giving the number of elements for each key.
Application : can be used to check whether the data is skewed.
scala> val rdd1 = sc.parallelize(Array(("a", 10), ("a", 20), ("b", 100), ("c", 200)))
scala> rdd1.countByKey()
res19: scala.collection.Map[String,Long] = Map(b -> 1, a -> 2, c -> 1)
13、foreach(func)
Role: executes func once for each element of the RDD.
The function is executed on the Executors, not on the driver side.
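Because func runs on the Executors, side effects such as println may not appear on the driver's console in cluster mode. A common sketch is to combine foreach with an accumulator (assuming the Spark 2.x longAccumulator API):

```scala
scala> val acc = sc.longAccumulator("sum")
scala> sc.parallelize(1 to 100).foreach(x => acc.add(x)) // acc.add runs on the Executors
scala> acc.value
res0: Long = 5050
```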