The difference between the map and mapPartitions operators in Spark

The difference:

  1. map applies a function to each element of the RDD.

  2. mapPartitions applies a function to the iterator over each partition of the RDD.
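
To make the two call patterns concrete, here is a minimal plain-Python sketch. It does not use Spark: the "RDD" is modelled as a list of partitions, and `toy_map`, `toy_map_partitions`, `double` and `double_all` are hypothetical names invented for illustration. In PySpark the corresponding calls would be `rdd.map(f)` and `rdd.mapPartitions(g)`.

```python
from typing import Callable, Iterable, Iterator, List

# Toy stand-in for an RDD (an assumption for illustration):
# a list of partitions, each partition a list of elements.
partitions: List[List[int]] = [[1, 2, 3], [4, 5], [6]]

def toy_map(parts: List[List[int]], f: Callable[[int], int]) -> List[List[int]]:
    """Call f once per element."""
    return [[f(x) for x in part] for part in parts]

def toy_map_partitions(
    parts: List[List[int]],
    g: Callable[[Iterator[int]], Iterable[int]],
) -> List[List[int]]:
    """Call g once per partition, passing an iterator over that partition."""
    return [list(g(iter(part))) for part in parts]

# Count how often each user function is invoked.
calls = {"map": 0, "mapPartitions": 0}

def double(x: int) -> int:
    calls["map"] += 1
    return x * 2

def double_all(it: Iterator[int]) -> Iterator[int]:
    calls["mapPartitions"] += 1
    for x in it:
        yield x * 2

mapped = toy_map(partitions, double)
mapped_by_partition = toy_map_partitions(partitions, double_all)
```

Both variants produce the same values; what differs is that `double` runs once per element while `double_all` runs once per partition.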


mapPartitions advantages:

  1. With an ordinary map, if a partition holds 10,000 records, the function is called 10,000 times. With mapPartitions, a task calls the function only once per partition: the function receives all of the partition's records through a single iterator, so any per-call overhead is paid once, which can improve performance.

  2. It suits work that would otherwise create extra objects over and over inside map. For example, when writing RDD data to a database through JDBC, map has to create a connection for every record, whereas mapPartitions creates just one connection per partition.
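
The connection-per-partition pattern can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark or JDBC code: `FakeConnection` is a hypothetical stand-in that only counts how many connections are opened, and the partitions are ordinary lists.

```python
from typing import Iterator, List

class FakeConnection:
    """Hypothetical stand-in for a JDBC connection; it only counts openings."""
    opened = 0

    def __init__(self) -> None:
        FakeConnection.opened += 1
        self.rows: List[int] = []

    def write(self, row: int) -> None:
        self.rows.append(row)

    def close(self) -> None:
        pass

def write_per_element(row: int) -> int:
    """map style: a new connection for every record."""
    conn = FakeConnection()
    try:
        conn.write(row)
    finally:
        conn.close()
    return row

def write_per_partition(rows: Iterator[int]) -> Iterator[int]:
    """mapPartitions style: one connection shared by the whole partition."""
    conn = FakeConnection()
    try:
        for row in rows:
            conn.write(row)
            yield row
    finally:
        conn.close()

partitions = [[1, 2, 3], [4, 5, 6]]

FakeConnection.opened = 0
per_element_result = [[write_per_element(r) for r in part] for part in partitions]
opened_per_element = FakeConnection.opened

FakeConnection.opened = 0
per_partition_result = [list(write_per_partition(iter(part))) for part in partitions]
opened_per_partition = FakeConnection.opened
```

With two partitions of three records each, the per-element version opens one connection per record and the per-partition version opens one per partition. Because `write_per_partition` is a generator, its connection is closed only after the last record of the partition has been consumed.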

Disadvantages:

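  mapPartitions may run out of memory (OOM) where map is far less likely to: a partition may be very large, and a function that loads the whole partition into memory at once can exhaust it, whereas map only ever handles one record at a time.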
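
How much of a partition is held in memory depends on how the function consumes its iterator. The plain-Python sketch below (again a toy model, not Spark; `counting_source`, `eager_transform` and `lazy_transform` are hypothetical names) contrasts a function that materializes the whole partition with one that streams it record by record.

```python
from itertools import islice
from typing import Dict, Iterator

# Tracks how many records have been pulled from the simulated partition.
pulled: Dict[str, int] = {"n": 0}

def counting_source(n: int) -> Iterator[int]:
    """Simulated partition iterator that counts every record it hands out."""
    for i in range(n):
        pulled["n"] += 1
        yield i

def eager_transform(it: Iterator[int]) -> Iterator[int]:
    """Loads the whole partition into a list first: memory grows with partition size."""
    rows = list(it)
    return iter([r + 1 for r in rows])

def lazy_transform(it: Iterator[int]) -> Iterator[int]:
    """Streams the partition: only one record is in flight at a time."""
    for r in it:
        yield r + 1

# Take only the first three results from each variant.
pulled["n"] = 0
eager_first = list(islice(eager_transform(counting_source(1000)), 3))
pulled_eager = pulled["n"]

pulled["n"] = 0
lazy_first = list(islice(lazy_transform(counting_source(1000)), 3))
pulled_lazy = pulled["n"]
```

The eager version has read all 1,000 records before it returns anything, while the lazy version has read only the three that were requested. A function passed to mapPartitions that behaves like `eager_transform` is the kind that risks OOM on a very large partition.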

Origin www.cnblogs.com/dretrtg/p/12687246.html