Spark: handling small file issues

coalesce and repartition can be used to solve the small-file problem.




repartition(numPartitions: Int)
 Returns a new RDD (or DataFrame) with exactly numPartitions partitions.
 It can increase or decrease the level of parallelism of the RDD, and internally uses a shuffle to redistribute the data.
 If you only want to reduce the number of partitions, consider using coalesce instead, which can avoid performing a shuffle.
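As a minimal sketch (the object name, local master, and dataset are illustrative, not from the original article), the effect of repartition on the partition count can be observed with a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

object RepartitionDemo {
  def main(args: Array[String]): Unit = {
    // Local session just for illustration.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartition-demo")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)
    println(rdd.getNumPartitions)                 // 8

    // repartition always shuffles; it can grow or shrink the partition count.
    println(rdd.repartition(16).getNumPartitions) // 16
    println(rdd.repartition(2).getNumPartitions)  // 2

    spark.stop()
  }
}
```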


 coalesce(numPartitions: Int, shuffle: Boolean = false)
 1) Returns a new RDD that is reduced to numPartitions partitions.
 2) This results in narrow dependencies: e.g., when going from 1000 partitions to 100, there is no shuffle; instead, each of the 100 new partitions claims 10 of the current partitions.
 3) If you are doing a drastic coalesce, such as reducing numPartitions from 1000 to 1, the computation will run on very few nodes (a single node when numPartitions = 1).
 4) To avoid this, you can pass shuffle = true, or call repartition directly. This adds a shuffle step, so the upstream partitions are still executed in parallel (whatever the current partitioning is).
  
5) Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, where a few partitions may be abnormally large. Calling coalesce(1000, shuffle = true) will redistribute the data across 1000 partitions using a hash partitioner.

Solve the problem of small files
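The coalesce behaviors described above can be checked directly from partition counts (a minimal sketch; the object name and data are arbitrary):

```scala
import org.apache.spark.sql.SparkSession

object CoalesceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("coalesce-demo")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 100)

    // Shrinking is a narrow dependency: each new partition claims 10 old ones.
    println(rdd.coalesce(10).getNumPartitions)                   // 10

    // Without a shuffle, coalesce cannot grow the partition count.
    println(rdd.coalesce(1000).getNumPartitions)                 // 100

    // With shuffle = true, the data is redistributed by a hash partitioner.
    println(rdd.coalesce(1000, shuffle = true).getNumPartitions) // 1000

    spark.stop()
  }
}
```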




 Data collection stage: configure reasonable Flume parameters (roll interval, roll size, etc.).




 Data cleaning stage: use coalesce or repartition to set a reasonable number of partitions, or use HBase to store the data.

File-writing example:

val df = sqlContext.createDataFrame(rowRdd, struct)
val newDF = df.coalesce(1)

Then import newDF into a Hive table, or write the data out through a DataFrame data source.

2. Solve the problem of small files: merge small files

1. Move the files under the source directory (srcDataPath) to a temporary directory (/mergePath/${mergeTime}/src).
2. Compute the size of the temporary directory (/mergePath/${mergeTime}/src) and derive the number of partitions from that size, e.g. 1024 MB / 128 MB = 8.
3. Call coalesce or repartition with that number of partitions and write the temporary directory's data to a temporary data directory (/mergePath/${mergeTime}/data).
4. Move the files from the temporary data directory back to the source directory (srcDataPath).
5. Delete the temporary directory (/mergePath/${mergeTime}/src)




${mergeTime} is a variable that serves as the unique identifier for each merge run.
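The five steps above can be sketched as follows (a sketch only: the object name, source path, and the numPartitionsFor helper are hypothetical, and it assumes text files on HDFS):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object MergeSmallFiles {
  // Step 2: derive the partition count from the directory size (128 MB per partition).
  def numPartitionsFor(dirSizeBytes: Long, blockSize: Long = 128L * 1024 * 1024): Int =
    math.max(1, math.ceil(dirSizeBytes.toDouble / blockSize).toInt)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("merge-small-files").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    val srcDataPath = "/data/events"                    // hypothetical source directory
    val mergeTime = System.currentTimeMillis().toString // unique id for this merge run
    val srcTmp  = new Path(s"/mergePath/$mergeTime/src")
    val dataTmp = new Path(s"/mergePath/$mergeTime/data")

    // 1. Move the source files to the temporary directory.
    fs.mkdirs(srcTmp.getParent)
    fs.rename(new Path(srcDataPath), srcTmp)

    // 2. Compute the directory size and the target number of partitions.
    val size  = fs.getContentSummary(srcTmp).getLength
    val parts = numPartitionsFor(size)

    // 3. Rewrite with a reasonable partition count into the temporary data directory.
    spark.sparkContext.textFile(srcTmp.toString)
      .coalesce(parts, shuffle = true)
      .saveAsTextFile(dataTmp.toString)

    // 4. Move the merged files back to the source directory.
    fs.rename(dataTmp, new Path(srcDataPath))

    // 5. Delete the temporary directory.
    fs.delete(srcTmp, true)

    spark.stop()
  }
}
```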






See the code example for a concrete implementation.




2. Reasons for generating small files in the shuffle phase






spark.sql.shuffle.partitions




This parameter adjusts the parallelism of the shuffle, i.e. the number of shuffle tasks.




Each shuffle partition corresponds to one task. More tasks give more parallelism, but when the data volume is small, a high task count simply produces many small output files.




Spark defaults to 200 partitions when performing an aggregation (i.e. a shuffle); this is controlled by the configuration variable "spark.sql.shuffle.partitions". This is why a large number of small files appear after a shuffle when using DataFrames or when integrating Spark with Hive.
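A minimal sketch of the parameter in action (the object name, dataset, and the value 8 are illustrative; adaptive query execution is disabled so the raw shuffle partition count is visible):

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-partitions-demo")
      .config("spark.sql.shuffle.partitions", "8")   // default is 200
      .config("spark.sql.adaptive.enabled", "false") // disable AQE so the count is exact
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // groupBy triggers a shuffle: the aggregation is written into 8 partitions
    // (and hence up to 8 output files) instead of the default 200.
    val agg = df.groupBy("key").sum("value")
    println(agg.rdd.getNumPartitions) // 8

    spark.stop()
  }
}
```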















