Spark Core RDD Action Operators

1、reduce(func)

Aggregates all the elements in the RDD through the function func: the data within each partition is aggregated first, and then the partial results of the partitions are aggregated with one another.

scala> val rdd1 = sc.parallelize(1 to 100)
scala> rdd1.reduce(_ + _)
res0: Int = 5050

scala> val rdd2 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3)))
scala> rdd2.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
res2: (String, Int) = (abc,6)

2、collect

Returns all the elements of the RDD to the driver program in the form of an array.

All of the data is pulled to the driver, so be careful with large datasets.
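
A minimal REPL sketch (the res numbers and exact output formatting are illustrative):

scala> val rdd1 = sc.parallelize(1 to 5)
scala> rdd1.collect()
res1: Array[Int] = Array(1, 2, 3, 4, 5)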

3、count

Returns the number of elements in the RDD.
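
A minimal sketch, reusing the RDD from the reduce example above:

scala> val rdd1 = sc.parallelize(1 to 100)
scala> rdd1.count()
res2: Long = 100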

4、take(n)

Returns an array made up of the first n elements of the RDD.

The data is pulled to the driver, so it should only be used on small datasets.
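
A minimal sketch; take(n) returns the elements in partition order:

scala> val rdd1 = sc.makeRDD(Array(100, 20, 130, 500, 60))
scala> rdd1.take(3)
res3: Array[Int] = Array(100, 20, 130)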

5、first

Returns the first element of the RDD; similar to take(1).
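
For example (a minimal sketch):

scala> sc.makeRDD(Array(100, 20, 130, 500, 60)).first()
res4: Int = 100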

6、takeOrdered(n,[ordering])

Returns the first n elements after sorting; the default order is ascending.

The data is pulled to the driver.

scala> val rdd1 = sc.makeRDD(Array(100, 20, 130, 500, 60))
scala> rdd1.takeOrdered(2)
res6: Array[Int] = Array(20, 60)
    
scala> rdd1.takeOrdered(2)(Ordering.Int.reverse)
res7: Array[Int] = Array(500, 130)

7、aggregate

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

The aggregate function first aggregates the elements within each partition using seqOp and the initial value (zeroValue), and then combines the per-partition results using combOp, again together with the initial value (zeroValue).

The final return type of this function does not have to be the same as the element type of the RDD.

zeroValue is used once for each within-partition aggregation and once more for the cross-partition aggregation. In the second example below, with 2 partitions the partition results are "-ab" and "-cd", and combining them prepends "-" once more, giving "--ab-cd".

scala> val rdd1 = sc.makeRDD(Array(100, 30, 10, 30, 1, 50, 1, 60, 1), 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at makeRDD at <console>:24

scala> rdd1.aggregate(0)(_ + _, _ + _)
res12: Int = 283

scala> val rdd1 = sc.makeRDD(Array("a", "b", "c", "d"), 2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[9] at makeRDD at <console>:24

scala> rdd1.aggregate("-")(_ + _, _ + _)
res13: String = --ab-cd

8、fold

A fold operation: a simplified form of aggregate. When seqOp and combOp are the same, fold can be used instead.

scala> val rdd1 = sc.makeRDD(Array(100, 30, 10, 30, 1, 50, 1, 60, 1), 2)
scala> rdd1.fold(0)(_ + _)
scala> val rdd1 = sc.makeRDD(Array("a", "b", "c", "d"), 2)
scala> rdd1.fold("-")(_ + _)
res17: String = --ab-cd

9、saveAsTextFile(path)

Role: saves the elements of the dataset as a text file to HDFS or another supported file system. For each element, Spark calls its toString method and writes it as a line of text in the file.
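
A minimal sketch; the output path and host name below are hypothetical, and the target directory must not already exist:

scala> val rdd1 = sc.parallelize(1 to 10)
scala> rdd1.saveAsTextFile("hdfs://hadoop102:9000/text_out")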

10、saveAsSequenceFile(path)

Role: saves the elements of the dataset to the specified directory in Hadoop SequenceFile format; the target can be HDFS or another file system supported by Hadoop.
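
A minimal sketch; saveAsSequenceFile is available on RDDs of key-value pairs, and the path below is hypothetical:

scala> val rdd2 = sc.parallelize(Array(("a", 1), ("b", 2), ("c", 3)))
scala> rdd2.saveAsSequenceFile("hdfs://hadoop102:9000/seq_out")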

11、saveAsObjectFile(path)

Role: serializes the elements of the RDD and stores them in a file as an object file.
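
A minimal sketch; the path below is hypothetical, and the data can be read back with sc.objectFile:

scala> val rdd1 = sc.parallelize(1 to 10)
scala> rdd1.saveAsObjectFile("hdfs://hadoop102:9000/obj_out")
scala> sc.objectFile[Int]("hdfs://hadoop102:9000/obj_out").collect()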

12、countByKey()

Role: for an RDD of type (K, V), returns a Map of (K, Long) giving the number of elements for each key.

Application: it can be used to check whether the data is skewed.

scala> val rdd1 = sc.parallelize(Array(("a", 10), ("a", 20), ("b", 100), ("c", 200)))

scala> rdd1.countByKey()
res19: scala.collection.Map[String,Long] = Map(b -> 1, a -> 2, c -> 1)

13、foreach(func)

Role: executes func once for each element of the RDD.

The function func is executed on the executors, not on the driver side.
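
A minimal sketch: because func runs on the executors, println inside foreach goes to the executor logs rather than the driver console; an accumulator is one way to bring a side effect back to the driver (output shown is illustrative):

scala> val acc = sc.longAccumulator("sum")
scala> sc.parallelize(1 to 100).foreach(x => acc.add(x))
scala> acc.value
res20: java.lang.Long = 5050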
