Spark Core: RDD Operators (the core data API)

RDD operators are divided into two categories:

  • Transformation (conversion) operators: return a new RDD
  • Action operators: the return value is not an RDD (either there is no return value, or a value of some other type is returned)

1. Transformation operators (lazy)

  • All transformations on an RDD are evaluated lazily, which means they are not computed immediately.
  • Spark only records the transformation logic applied to the RDD; the real computation happens only when an action operator (Action) is encountered, as shown in the sketch below.
  • RDDs are generally divided into Value types and Key-Value types.
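A minimal sketch of lazy evaluation (the SparkConf/SparkContext setup and the data below are illustrative only, not from the original post): the map and filter calls only record lineage, and nothing is computed until the count action runs.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("LazyEvalDemo").setMaster("local[*]")
val sc = new SparkContext(conf)

val nums = sc.parallelize(1 to 10)

// Transformations: only the lineage (the computation plan) is recorded here
val doubled = nums.map(_ * 2)
val big = doubled.filter(_ > 10)

// Action: triggers the actual computation of the whole chain
println(big.count())   // 5

sc.stop()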

The common transformation operators of RDD are as follows:

Transformation operator Meaning
map(func) Apply the function func to each element of the source RDD and return a new RDD
filter(func) Select the elements for which the function func returns true and return a new RDD
flatMap(func) Similar to map, but each input element can be mapped to zero or more output elements (func should return a Seq rather than a single element); the results are flattened into a new RDD
mapPartitions(func) Similar to map, but runs independently on each partition (slice) of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U]
mapPartitionsWithIndex(func) Similar to mapPartitions, but func takes an extra integer parameter indicating the index of the partition, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U]
sample(withReplacement, fraction, seed) Sample the data according to the proportion specified by fraction; withReplacement chooses whether to sample with replacement, and seed specifies the random number generator seed
union(otherDataset) Return a new RDD after the union of the source RDD and the parameter RDD
intersection(otherDataset) Return a new RDD after the intersection of the source RDD and the parameter RDD
distinct([numTasks]) Deduplicate the source RDD and return a new RDD
groupByKey([numTasks]) Called on an RDD of (K, V) pairs and returns an RDD of (K, Iterable[V]) pairs
reduceByKey(func, [numTasks]) Called on an RDD of (K, V) pairs; returns an RDD of (K, V) pairs in which the values of the same key are aggregated using the given reduce function func. Similar to groupByKey; the number of reduce tasks can be set through the optional second parameter
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) seqOp aggregates the elements within each partition, then combOp merges the per-partition results; both operations start from the initial value zeroValue. seqOp traverses all elements (T) in a partition: the first T is combined with zeroValue, the result is then combined with the second T, and so on until the whole partition has been traversed. combOp then aggregates the results of all partitions
sortByKey([ascending], [numTasks]) Sort by key, the default is ascending: Boolean = true
sortBy(func,[ascending], [numTasks]) Similar to sortByKey, but more flexible
join(otherDataset, [numTasks]) Called on RDDs of type (K, V) and (K, W); returns an RDD of (K, (V, W)) pairs with all elements for the same key paired together
cogroup(otherDataset, [numTasks]) Called on RDDs of type (K, V) and (K, W); returns an RDD of type (K, (Iterable[V], Iterable[W]))
cartesian(otherDataset) Cartesian product, when called on a data set of type T and U, returns a data set of (T, U) pairs (all pairs of elements)
pipe(command, [envVars]) Pipe each partition of the RDD through a shell command (e.g. a Perl or bash script). The RDD elements are written to the process's stdin, and lines written to its stdout are returned as an RDD of strings
coalesce(numPartitions) Reduce the number of partitions in the RDD to numPartitions
repartition(numPartitions) Randomly reshuffle the data in the RDD to create either more or fewer partitions and balance the data across them (always performs a shuffle)
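As a brief illustration of the key-value rows in the table above, the following sketch (assuming an existing SparkContext sc; the pairs data is made up) contrasts groupByKey, reduceByKey and aggregateByKey:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), numSlices = 2)

// groupByKey: ("a", Iterable(1, 3)), ("b", Iterable(2, 4))
val grouped = pairs.groupByKey()

// reduceByKey: pre-aggregates within each partition, then merges: ("a", 4), ("b", 6)
val summed = pairs.reduceByKey(_ + _)

// aggregateByKey: zeroValue = 0, seqOp takes the max within each partition,
// combOp sums the per-partition maxima across partitions
val agg = pairs.aggregateByKey(0)((acc, v) => math.max(acc, v), _ + _)

println(summed.collect().mkString(", "))   // collect() is an action, see section 2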

2. Action operators (non-lazy)

Actions Meaning
reduce(func) Use the function func (it accepts two parameters and returns one) to aggregate the elements of the data set
collect() Return all the elements of the data set as an array to the driver program
count() Returns the number of elements in the data set
first() Return the first element of the data set (similar to take(1))
take(n) Returns an array containing the first n elements of the data set
takeSample(withReplacement, num, [seed]) Return an array containing a random sample of num elements of the data set, with or without replacement; the random number generator seed can optionally be specified in advance
takeOrdered(n, [ordering]) Use RDD's natural order or custom comparator to return the first n elements of RDD
saveAsTextFile(path) Write the elements of the data set as a text file (or text file set) into a given directory in the local file system, HDFS or any other file system supported by Hadoop. Spark will call toString on each element and convert it to a line of text in the file
saveAsSequenceFile(path) Write the elements of the data set as a Hadoop SequenceFile to a given path in the local file system, HDFS, or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available for types that are implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double and String)
saveAsObjectFile(path) Write the elements of the data set in a simple format using Java serialization; they can then be loaded with SparkContext.objectFile()
countByKey() Only available on RDDs of type (K, V). Returns a Map of (K, Int) pairs giving the count of each key
foreach(func) Run the function func on each element of the data set
foreachPartition(func) Run the function func on each partition of the data set
fold def fold(zeroValue: T)(op: (T, T) ⇒ T): T. fold is a simplified form of aggregate in which seqOp and combOp are the same function op
lookup def lookup(key: K): Seq[V]. lookup is used on RDDs of type (K, V); given a key K, it returns all values V in the RDD corresponding to that key
aggregate def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U. Aggregates the elements within each partition with seqOp, then combines the per-partition results and the initial value (zeroValue) with the combOp function
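A minimal sketch of the aggregate, fold and lookup actions whose Scala signatures appear above (an existing SparkContext sc and made-up data are assumed):

val nums = sc.parallelize(1 to 4, 2)          // two partitions, e.g. [1, 2] and [3, 4]

// aggregate: seqOp sums inside each partition, combOp merges the partial sums
val total = nums.aggregate(0)(_ + _, _ + _)   // 10

// fold: like aggregate, but seqOp and combOp are the same function op
val folded = nums.fold(0)(_ + _)              // 10

// lookup: only on (K, V) RDDs; returns all values for the given key
val kv = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val aValues = kv.lookup("a")                  // Seq(1, 2)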

3. Examples

3.1 Transformation

See the excellent blog post listed in the reference materials below.
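Until you read that post, here is a hedged, self-contained sketch of several transformations from the table in section 1 (a local SparkContext sc and toy data are assumed):

val lines = sc.parallelize(Seq("hello spark", "hello rdd", "spark core"))

// flatMap: split each line into words; map: build (word, 1) pairs
val wordPairs = lines.flatMap(_.split(" ")).map((_, 1))

// filter + distinct on the plain words
val uniqueWords = lines.flatMap(_.split(" ")).filter(_.nonEmpty).distinct()

// sortBy: sort words by length, descending
val sorted = uniqueWords.sortBy(_.length, ascending = false)

// reduceByKey + join: (word, (count, length)) pairs
val counts = wordPairs.reduceByKey(_ + _)
val lengths = uniqueWords.map(w => (w, w.length))
val joined = counts.join(lengths)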

3.2 Action

Likewise, see the excellent blog post listed in the reference materials below.
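And a matching sketch of common actions (again assuming an existing SparkContext sc; the output path is hypothetical):

val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// collect / count / first / take bring results back to the driver
val all     = kv.collect()      // Array((a,1), (b,2), (a,3))
val n       = kv.count()        // 3
val firstKV = kv.first()        // (a,1)
val topTwo  = kv.take(2)

// countByKey: only on (K, V) RDDs, returns Map(a -> 2, b -> 1)
val keyCounts = kv.countByKey()

// foreach runs func on the executors (in a cluster, println output appears
// in executor logs rather than on the driver)
kv.foreach(println)

// saveAsTextFile writes one text file per partition under the given directory
kv.saveAsTextFile("output/kv_demo")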
Reference materials:
https://blog.csdn.net/and52696686/article/details/107822714
