Commonly Used Operators in Spark and Flink, Explained in Detail

1. Spark Transformation Operators

1.1. map

  • The map function is called once for each element of the RDD
  • map takes a function, applies it to every element in the RDD, and the value the function returns becomes the corresponding element of the resulting RDD
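
A minimal sketch of map (runnable in spark-shell, where sc is the predefined SparkContext; the data and function are illustrative):

```scala
// Build an RDD of numbers and square each element with map.
val nums = sc.parallelize(Seq(1, 2, 3, 4))

// map calls the function once per element; each return value becomes
// the corresponding element of the new RDD.
val squares = nums.map(x => x * x)

println(squares.collect().mkString(", "))  // 1, 4, 9, 16
```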

1.2. lookup

  • lookup applies to RDDs of type (K, V): given a key K, it returns all the V values in the RDD associated with that key
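
A minimal sketch (runnable in spark-shell with the predefined sc; the sample pairs are illustrative):

```scala
// A (K, V) pair RDD; lookup returns every value stored under the given key.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val aValues: Seq[Int] = pairs.lookup("a")
println(aValues.mkString(", "))  // 1, 3
```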

1.3. mapPartitions

  • The mapPartitions function is called once for each partition
  • Returns a new RDD by applying a function to each partition of this RDD
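
A minimal sketch (runnable in spark-shell with the predefined sc; the per-partition sum is an illustrative use):

```scala
// The function receives each partition's elements as one Iterator and is
// called once per partition rather than once per element.
val nums = sc.parallelize(1 to 8, numSlices = 4)

val perPartitionSums = nums.mapPartitions(iter => Iterator(iter.sum))

println(perPartitionSums.collect().mkString(", "))  // 3, 7, 11, 15
```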

1.4. flatMap

  • flatMap applies the function f to every element in the RDD, where f returns a collection (Seq) for each element; the elements of all the generated collections are then taken out and flattened into a single new collection, which is returned as the new RDD
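
A minimal sketch (runnable in spark-shell with the predefined sc; the word-splitting function is illustrative):

```scala
// Each line maps to a Seq of words; flatMap flattens all of those
// sequences into one RDD of words.
val lines = sc.parallelize(Seq("hello world", "hello spark"))

val words = lines.flatMap(line => line.split(" "))

println(words.collect().mkString(", "))  // hello, world, hello, spark
```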

1.5. mapPartitionsWithIndex

  • Same as mapPartitions, except that the function also receives the index of the partition it is processing
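
A minimal sketch (runnable in spark-shell with the predefined sc):

```scala
// The first argument of the function is the partition index.
val nums = sc.parallelize(1 to 6, numSlices = 3)

val tagged = nums.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => s"partition $index -> $x")
}

tagged.collect().foreach(println)
// partition 0 -> 1, partition 0 -> 2, partition 1 -> 3, ...
```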

1.6. mapPartitionsWithContext

  • Same as mapPartitions, except that the function is also given the task context information
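
mapPartitionsWithContext was a developer API in Spark 1.x and is gone in later releases; in current versions the same information is reachable from TaskContext.get() inside a plain mapPartitions. A sketch of that equivalent (runnable in spark-shell with the predefined sc):

```scala
import org.apache.spark.TaskContext

val nums = sc.parallelize(1 to 6, numSlices = 3)

// TaskContext.get() exposes the task/partition information that
// mapPartitionsWithContext used to pass in explicitly.
val withCtx = nums.mapPartitions { iter =>
  val ctx = TaskContext.get()
  iter.map(x => s"partition ${ctx.partitionId()} attempt ${ctx.attemptNumber()} -> $x")
}

withCtx.collect().foreach(println)
```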

1.7. combineByKey [pair]

  • combineByKey() is the most general aggregation function for key-value RDDs. Like aggregate(), combineByKey() allows the return type to differ from the type of the input values.
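
A minimal sketch of the classic per-key average (runnable in spark-shell with the predefined sc); the (sum, count) accumulator shows the combiner type differing from the value type:

```scala
// Input values are Int, but the accumulator is a (sum, count) pair.
val scores = sc.parallelize(Seq(("a", 90), ("a", 80), ("b", 70)))

val sumCount = scores.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue, within a partition
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners, across partitions
)

val avgs = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
avgs.collect().foreach(println)  // (a,85.0), (b,70.0)
```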

1.8. reduceByKey

  • reduceByKey is implemented underneath via combineByKeyWithClassTag
  • The first parameter of combineByKeyWithClassTag defaults to (v: V) => v, so it has no effect on the elements
  • The second and third parameters are the same: the (V, V) => V function passed to reduceByKey, which merges two values into one
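
A minimal sketch (runnable in spark-shell with the predefined sc):

```scala
// Word count: the single (V, V) => V function is used both to merge
// values within a partition and to merge partial results across partitions.
val wordPairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

val counts = wordPairs.reduceByKey(_ + _)

counts.collect().foreach(println)  // (a,2), (b,1)
```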

1.9. groupByKey

  • groupByKey is implemented underneath via combineByKeyWithClassTag
  • groupByKey returns an RDD[(K, Iterable[V])]; the value is an iterable containing all the values from tuples whose key is K
  • The implementation is similar to reduceByKey, except the combine functions are pre-written for you, and the parameter mapSideCombine = false is passed; this means no combining is done on the map side, and all of it happens on the reduce side
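
A minimal sketch (runnable in spark-shell with the predefined sc):

```scala
import org.apache.spark.rdd.RDD

// All values of each key are collected into one Iterable; because
// mapSideCombine = false, every value is shuffled unaggregated.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val grouped: RDD[(String, Iterable[Int])] = pairs.groupByKey()

grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }
// a -> 1,3
// b -> 2
```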

1.10. aggregateByKey

  • aggregateByKey is similar to aggregate in that the aggregation happens twice; the difference is that aggregate works only at the partition level, while aggregateByKey further subdivides each partition by key. Underneath it also calls combineByKey. See the sketch below.
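
A minimal sketch (runnable in spark-shell with the predefined sc; the max-then-sum combination is illustrative):

```scala
// Phase 1 (seqOp): within each partition, fold each key's values into an
// accumulator seeded with zeroValue. Phase 2 (combOp): merge the per-key
// accumulators across partitions.
val pairs = sc.parallelize(
  Seq(("a", 3), ("a", 2), ("b", 5), ("a", 4), ("b", 1)), numSlices = 2)

val result = pairs.aggregateByKey(0)(
  (acc, v) => math.max(acc, v), // seqOp: per-key max within a partition
  (a, b) => a + b               // combOp: sum the partition maxima
)

result.collect().foreach(println)  // with this 2-way split: (a,7), (b,5)
```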

2. Spark Action Operators

2.1. aggregate

  • Like aggregateByKey, aggregate aggregates twice: seqOp folds the elements of each partition into an accumulator, and combOp then merges the per-partition results on the driver; combOp's input and output types are consistent (both are the accumulator type U)
  • Note: in the Spark source, the per-task results are merged on the driver by
    • val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
    • the closure's body is an assignment to jobResult, so it returns Unit
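
A minimal sketch (runnable in spark-shell with the predefined sc); computing (sum, count) in one pass shows the accumulator type U differing from the element type:

```scala
val nums = sc.parallelize(1 to 4, numSlices = 2)

// seqOp folds each partition's elements into a (sum, count) accumulator;
// combOp merges the per-partition accumulators on the driver.
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1), // seqOp, inside each partition
  (a, b) => (a._1 + b._1, a._2 + b._2)  // combOp, merge on the driver
)

println(s"sum=$sum count=$count")  // sum=10 count=4
```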

3. Spark Operator Blog Links

4. Flink DataStream Transformation Operators
