RDD Operators
RDD operators fall into two categories:
- Transformation: returns a new RDD
- Action: returns something other than an RDD (a plain value, or nothing)
1. Transformation operators (lazy)
- All transformations on an RDD are evaluated lazily, which means they are not computed immediately.
- Spark only records the transformation logic applied to the RDD; the real computation happens when an action operator (Action) is encountered.
- RDDs are generally divided into Value types and Key-Value types.
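The lazy-evaluation behavior described above can be sketched with a toy example. This is plain Python, not Spark itself; `LazyDataset` is a made-up class for illustration. Transformations only record the operator chain, and nothing runs until an "action" such as collect() is called:

```python
# Toy illustration of lazy evaluation (not Spark; LazyDataset is hypothetical).
class LazyDataset:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded transformation logic, not results

    def map(self, f):                 # transformation: returns a new dataset
        return LazyDataset(self.data, self.ops + [("map", f)])

    def filter(self, f):              # transformation: returns a new dataset
        return LazyDataset(self.data, self.ops + [("filter", f)])

    def collect(self):                # action: triggers the real computation
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# At this point nothing has been computed; only the operator chain is stored.
print(rdd.collect())  # [20, 30, 40]
```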
The common transformation operators of RDD are as follows:
Transformation operator | Meaning |
---|---|
map(func) | Applies the function func to each element of the source RDD and returns a new RDD |
filter(func) | Selects the elements for which the function func returns true and returns a new RDD |
flatMap(func) | Similar to map, but each input element is mapped to a Seq of 0 or more output elements, which are then flattened into the new RDD |
mapPartitions(func) | Similar to map, but runs independently on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U] |
mapPartitionsWithIndex(func) | Similar to mapPartitions, but func takes an extra integer parameter indicating the index of the partition, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U] |
sample(withReplacement, fraction, seed) | Samples the data at the proportion specified by fraction; withReplacement controls whether sampling is done with replacement, and seed specifies the random number generator seed |
union(otherDataset) | Return a new RDD after the union of the source RDD and the parameter RDD |
intersection(otherDataset) | Return a new RDD after the intersection of the source RDD and the parameter RDD |
distinct([numTasks]) | Deduplicates the source RDD and returns a new RDD |
groupByKey([numTasks]) | Called on an RDD of (K,V) pairs; returns an RDD of (K, Iterable[V]) pairs |
reduceByKey(func, [numTasks]) | Called on an RDD of (K,V) pairs; returns an RDD of (K,V) pairs in which the values for each key are aggregated with the given reduce function func. As with groupByKey, the number of reduce tasks can be set through an optional second parameter |
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) | seqOp aggregates the elements within each partition, and combOp then merges the per-partition results; both operations start from the initial value zeroValue. Within a partition, seqOp first combines zeroValue with the first element T, then combines that result with the second element T, and so on until the whole partition has been traversed; combOp then aggregates the results of all partitions |
sortByKey([ascending], [numTasks]) | Sorts by key; ascending by default (ascending: Boolean = true) |
sortBy(func,[ascending], [numTasks]) | Similar to sortByKey, but more flexible |
join(otherDataset, [numTasks]) | Called on RDDs of type (K,V) and (K,W), and return a (K,(V,W)) RDD with all elements corresponding to the same key paired together |
cogroup(otherDataset, [numTasks]) | Called on RDDs of type (K,V) and (K,W); returns an RDD of type (K,(Iterable[V],Iterable[W])) |
cartesian(otherDataset) | Cartesian product, when called on a data set of type T and U, returns a data set of (T, U) pairs (all pairs of elements) |
pipe(command, [envVars]) | Pipes each partition of the RDD through a shell command (e.g. a Perl or bash script). RDD elements are written to the process's stdin, and lines written to its stdout are returned as an RDD of strings |
coalesce(numPartitions) | Reduces the number of partitions in the RDD to numPartitions |
repartition(numPartitions) | Randomly reshuffles the data in the RDD to create either more or fewer partitions and balance the data across them (always triggers a shuffle) |
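The semantics of a few of the transformations above can be illustrated with plain-Python equivalents. These are sketches only, not PySpark calls; list comprehensions and `itertools.groupby` stand in for Spark's distributed execution:

```python
from functools import reduce
from itertools import groupby

# Plain-Python sketches of transformation semantics (illustrative, not PySpark).
data = [1, 2, 3, 4]

mapped = [x * 2 for x in data]                 # map(func): one output per element
filtered = [x for x in data if x % 2 == 0]     # filter(func): keep elements where func is true
flat = [y for x in data for y in [x, x * 10]]  # flatMap(func): each element -> Seq, then flattened

pairs = [("a", 1), ("b", 2), ("a", 3)]
# groupByKey: (K,V) pairs -> (K, Iterable[V])
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted(pairs), key=lambda kv: kv[0])}
# reduceByKey(func): aggregate the values of the same key with func
reduced = {k: reduce(lambda a, b: a + b, vs) for k, vs in grouped.items()}

print(mapped)   # [2, 4, 6, 8]
print(flat)     # [1, 10, 2, 20, 3, 30, 4, 40]
print(grouped)  # {'a': [1, 3], 'b': [2]}
print(reduced)  # {'a': 4, 'b': 2}
```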
2. Action operators (non-lazy)
Actions | Meaning |
---|---|
reduce(func) | Aggregates the elements of the dataset using the function func (which takes two arguments and returns one) |
collect() | Returns all elements of the dataset as an array to the driver program |
count() | Returns the number of elements in the data set |
first() | Return the first element of the data set (similar to take(1)) |
take(n) | Returns an array containing the first n elements of the data set |
takeSample(withReplacement, num, [seed]) | Returns an array containing a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed |
takeOrdered(n, [ordering]) | Returns the first n elements of the RDD using either their natural order or a custom comparator |
saveAsTextFile(path) | Write the elements of the data set as a text file (or text file set) into a given directory in the local file system, HDFS or any other file system supported by Hadoop. Spark will call toString on each element and convert it to a line of text in the file |
saveAsSequenceFile(path) | Writes the elements of the dataset as a Hadoop SequenceFile to a given path in the local file system, HDFS, or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types such as Int, Double and String) |
saveAsObjectFile(path) | Writes the elements of the dataset in a simple format using Java serialization; they can then be loaded with SparkContext.objectFile() |
countByKey() | Only available on RDDs of type (K,V). Returns a Map of (K, Int) pairs giving the count of each key |
foreach(func) | Runs the function func on each element of the dataset |
foreachPartition(func) | Runs the function func on each partition of the dataset |
fold | def fold(zeroValue: T)(op: (T, T) ⇒ T): T. fold is a simplification of aggregate that uses the same function op for both seqOp and combOp |
lookup | def lookup(key: K): Seq[V]. lookup is used on RDDs of type (K,V); given a key K, it returns all values V corresponding to that key in the RDD |
aggregate | def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U. Aggregates the elements within each partition with seqOp, then uses combOp to combine the per-partition results with the initial value (zeroValue) |
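The aggregate/fold semantics described above can be sketched in plain Python, simulating an RDD as a list of partitions. This is an illustration of the two-phase (seqOp within a partition, combOp across partitions) behavior, not the Spark API itself; `aggregate` and `fold` here are hypothetical helper functions:

```python
from functools import reduce

# Plain-Python sketch of aggregate/fold semantics (illustrative, not Spark).
# An "RDD" is simulated as a list of partitions, each a list of elements.
def aggregate(partitions, zero_value, seq_op, comb_op):
    # Phase 1: seqOp folds each partition, starting from zeroValue.
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    # Phase 2: combOp merges the per-partition results with zeroValue.
    return reduce(comb_op, per_partition, zero_value)

def fold(partitions, zero_value, op):
    # fold is aggregate with seqOp == combOp == op.
    return aggregate(partitions, zero_value, op, op)

parts = [[1, 2, 3], [4, 5]]  # a dataset with two partitions
print(aggregate(parts, 0, lambda acc, x: acc + x, lambda a, b: a + b))  # 15
print(fold(parts, 1, lambda a, b: a * b))                               # 120
```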
3. Examples
3.1 Transformation
See the blog post in the reference materials below.
3.2 Action
See the blog post in the reference materials below.
Reference materials:
https://blog.csdn.net/and52696686/article/details/107822714