How Spark pulls a large dataset down to the local machine




Recently I have been using Spark to process and analyze some of the company's event-tracking (buried point) data, which is stored as JSON. I need to parse the JSON, extract a specific field, and run some statistical analysis, so the data sometimes has to be pulled from the cluster to the driver node for processing. A problem that often comes up is that the pulled result set is too large for the driver node's memory, which leads to the familiar OOM exception:
````
 java.lang.OutOfMemoryError: Java heap space
````


The code for this pattern usually looks like this:

````
import scala.collection.mutable.ArrayBuffer

//Load HDFS data
val rdd = sc.textFile("/data/logs/*")

//Result set collected on the driver
val datas = ArrayBuffer[String]()

//Pull all the data down to the driver side and process it there
rdd.collect.foreach(line => {
  datas += line.split('#')(1) //Extract one field
})

sc.stop()
````


The problem with this approach is that `collect` reads the data of all partitions to the driver node at once and only then starts processing, so when the data volume is large, memory overflow often occurs.


(Question 1) How can this be avoided?

Divide and conquer: pull the data of only one partition to the driver node at a time, process it, and then move on to the next partition.

(Question 2) What if the data in a single partition is still too large to fit in memory?

Repartition the data set into more partitions, turning each large partition into several smaller ones.

(Question 3) What if the overall result set is larger than the driver's memory?

Either increase the memory of the driver node, or persist each partition's data to a local file instead of keeping it all in memory (see the sketch below).
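For example, here is a minimal sketch of that last idea: each partition's extracted field is appended to a local file instead of being accumulated in memory. It is the same per-partition loop shown at the end of this article, just with a file writer instead of an ArrayBuffer; the output path and the '#'-delimited field extraction are illustrative assumptions.
````
import java.io.{BufferedWriter, FileWriter}

//Write each partition's extracted field to a local file so that the driver
//only ever holds one partition's data in memory at a time
val out = new BufferedWriter(new FileWriter("/tmp/fields.txt"))
val new_rdd = rdd.coalesce(10)
for (p <- new_rdd.partitions) {
  val idx = p.index
  //Keep only the data of partition idx (same trick as the full example below)
  val parRdd = new_rdd.mapPartitionsWithIndex[String]((i, it) => if (i == idx) it else Iterator(), true)
  parRdd.collect().foreach { line =>
    out.write(line.split('#')(1))
    out.newLine()
  }
}
out.close()
````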



Now let's look at the key question: how do we change the number of partitions of an RDD in Spark?

We know that an RDD is Spark's abstract model of a data source: it splits a large data source into multiple partitions of data and then processes that large data set in parallel.


By default, when Spark loads data from HDFS, the number of partitions is determined by the HDFS block size. Of course, we can also specify the number of partitions when loading:
````
textFile(path, partitionNums) //the second parameter specifies the (minimum) number of partitions
````
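For example, a quick way to check what you actually got (the path below is just an illustration):
````
//Ask for at least 200 partitions when reading the files
val rdd = sc.textFile("/data/logs/*", 200)
//Check how many partitions were actually created
println(rdd.getNumPartitions)
````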


If you did not specify the number of partitions when loading, Spark also provides two functions for repartitioning afterwards:

````
(1) def coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]
(2) def repartition(numPartitions: Int): RDD[T]
````


Next, let's look at the difference between the coalesce function and the repartition function.


Looking at the source code, we can see that repartition is really just a wrapper that calls coalesce with the shuffle parameter set to true. So let's focus on the coalesce function:

The first parameter of coalesce is the target number of partitions; the second parameter controls whether a shuffle is required.
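In the Spark source, repartition is indeed just a thin delegation; simplified (the implicit Ordering parameter is omitted here), it looks roughly like this:
````
//Simplified from RDD.scala: repartition delegates to coalesce with shuffle enabled
def repartition(numPartitions: Int): RDD[T] = {
  coalesce(numPartitions, shuffle = true)
}
````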


For example, suppose the current number of partitions of our RDD is 100:

(1) If you want to reduce it to 10, you should use
````
rdd.coalesce(10, false)
````
(2) If you want to increase it to 300, you should use
````
rdd.coalesce(300,true)
````
(3) If you want to reduce it to 1, you should use
````
rdd.coalesce(1,true)
````


Here is the explanation:

When the number of partitions goes from more to fewer, shuffle generally does not need to be enabled; this gives the best performance, because no data has to be shuffled across the network. You can still enable shuffle in specific scenarios, for example when the partition data is extremely unbalanced, but in general it is not recommended.


When the number of partitions goes from fewer to more, shuffle must be turned on; if it is not, the number of partitions simply will not change, because going from fewer to more partitions requires the data to be redistributed. Note that if the amount of data is very small, some of the resulting partitions may end up empty.


The last example is an extreme case: when going from many partitions down to 1, if shuffle is not turned on, the upstream computation may end up concentrated on a single node, the pressure on that node becomes particularly high, and cluster resources cannot be fully utilized. Turning on shuffle here lets the earlier stages still run in parallel and speeds up the overall computation.
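A quick way to see these rules in action is to check getNumPartitions after each call; the toy in-memory RDD below is just for illustration:
````
//Toy RDD with 100 partitions, only to illustrate the partition-count rules
val demo = sc.parallelize(1 to 10000, 100)

println(demo.coalesce(10, false).getNumPartitions)   //10  - fewer partitions, no shuffle needed
println(demo.coalesce(300, false).getNumPartitions)  //100 - without shuffle the count cannot grow
println(demo.coalesce(300, true).getNumPartitions)   //300 - shuffle enabled, count grows
println(demo.repartition(1).getNumPartitions)        //1   - repartition always shuffles
````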



Now that we understand how to change the number of partitions of an RDD, we can go back to the problem at the beginning of the article: pulling a large amount of data to the driver node. If the overall data set is too large, we can increase the number of partitions and pull them in a loop. However, the number of partitions has to be tuned for the specific scenario, because the more partitions there are, the more tasks Spark generates, and too many tasks also hurts the actual pull efficiency. In my case, the data read from HDFS had 144 partitions by default, a bit more than 1 GB of event-tracking data; without changing the number of partitions the processing took about 10 minutes, and after adjusting the number of partitions to 10 the pull took between 1 and 2 minutes, so it should be adjusted according to the actual situation.


The optimized version of the code from the beginning of the article is as follows:
````
import scala.collection.mutable.ArrayBuffer

//Keep only the data of the partition whose index equals seq
def pt_convert(idx: Int, ds: Iterator[String], seq: Int): Iterator[String] = {
  if (seq == idx) ds else Iterator()
}
//------------------------------
//Load HDFS data
val rdd = sc.textFile("/data/logs/*")

//Result set collected on the driver
val datas = ArrayBuffer[String]()

//Repartition to a reasonable number of partitions
val new_rdd = rdd.coalesce(10)

//Get the metadata of all partitions
val parts = new_rdd.partitions

//Loop over the partitions one at a time to avoid OOM
for (p <- parts) {
  //Partition index
  val idx = p.index
  //The second parameter is true so the original partitioning is preserved and no re-shuffle happens
  val parRdd = new_rdd.mapPartitionsWithIndex[String](pt_convert(_, _, idx), true)
  //Pull only this partition's data to the driver
  val data = parRdd.collect()
  //Process this partition's data
  for (line <- data) {
    datas += line.split('#')(1) //Extract one field
  }
}
````
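As a side note, the RDD API also provides toLocalIterator, which returns the data to the driver one partition at a time and achieves much the same effect as the hand-rolled loop above (a similar approach is discussed in the Stack Overflow link in the references). A brief sketch, reusing the same rdd and field extraction as before:
````
//toLocalIterator runs one job per partition, so the driver only needs to
//hold a single partition's rows at a time (plus whatever you accumulate)
val datas = ArrayBuffer[String]()
rdd.coalesce(10).toLocalIterator.foreach { line =>
  datas += line.split('#')(1)
}
````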




Finally, here is the submit command for the Spark job:
````
spark-submit --class SparkHdfsDataAnalysis \
  --conf spark.driver.maxResultSize=1g \
  --master yarn \
  --executor-cores 5 \
  --driver-memory 2g \
  --executor-memory 3g \
  --num-executors 10 \
  --jars $jars spark-analysis.jar $1 $2
````


The main parameters here are:
````
spark.driver.maxResultSize=1g  
driver-memory 2g
````


These are, respectively, the maximum size of the result set returned by a single pull and the memory of the driver node. If you pull large result sets down to the driver, you need to pay special attention to these two settings.
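spark.driver.maxResultSize can also be set programmatically when building the SparkContext; a small sketch (the application name is arbitrary):
````
import org.apache.spark.{SparkConf, SparkContext}

//spark.driver.maxResultSize caps the total serialized size that a single
//action such as collect may return to the driver; 0 means unlimited
val conf = new SparkConf()
  .setAppName("SparkHdfsDataAnalysis")
  .set("spark.driver.maxResultSize", "1g")
//driver memory itself is best left to spark-submit, since the driver JVM
//has already started by the time this configuration is read
val sc = new SparkContext(conf)
````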





Reference documentation:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.rdd.RDD
https://spark.apache.org/docs/latest/configuration.html

https://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine

If you have any questions, you can follow the WeChat public account woshigcs and leave a message in the background for discussion. Technical debt cannot be left unpaid, and neither can health debt. On the road of seeking the Tao, I walk with you.
