The code being executed is shown below
# encoding:utf-8
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setMaster('yarn')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

rdd = spark.read.csv('/spark/gps/GPS1.csv')
print(rdd.count())
print(rdd.repartition(10000).count())
print(rdd.repartition(10000).collect())  # fails with spark-OutOfMemory: GC overhead limit exceeded
Command executed
spark-submit --master yarn bigdata.py
Error message
spark-OutOfMemory:GC overhead limit exceeded
count() executes without problems no matter which parameters are used, but collect() always fails with this error.
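Why count() succeeds while collect() fails can be seen without Spark at all. The sketch below is a plain-Python analogy (no pyspark required, and not the author's code): materializing every row in one process, as collect() does on the driver, holds the whole dataset in memory at once, while streaming rows one at a time (analogous to counting, or to RDD.toLocalIterator()) keeps memory usage flat. The rows() generator and the 100,000-row size are illustrative.

```python
import sys

def rows(n):
    """Yield rows lazily, analogous to iterating partitions one at a time."""
    for i in range(n):
        yield ("gps", i)

n = 100_000
materialized = list(rows(n))        # analogous to rdd.collect(): all rows in one process
streamed = sum(1 for _ in rows(n))  # analogous to rdd.count(): rows visited, never stored

print(len(materialized))  # 100000 rows held in memory simultaneously
print(streamed)           # same 100000 rows counted, one at a time
# the full list dwarfs the generator object itself
print(sys.getsizeof(materialized) > sys.getsizeof(rows(n)))
```

With real Spark data the materialized list is what overflows the driver heap; the streaming pattern is why count() survives the same dataset.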
Cause Analysis
1. collect() returns the entire dataset to the driver, overflowing driver memory.
The solution is to increase driver memory:
spark-submit --master yarn --executor-cores 4 --driver-memory 3G bigdata.py
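Raising the driver heap alone may not be enough: Spark also caps the total serialized result size the driver will accept via spark.driver.maxResultSize (default 1g), and a large collect() can hit that limit before the heap fills. A fuller invocation might look like the sketch below; the 3g/2g values are illustrative, not tuned for this cluster.

```shell
# Sketch only: memory sizes are assumptions to adapt to the actual cluster.
# --driver-memory grows the driver JVM heap so collected rows have room;
# spark.driver.maxResultSize raises the cap on serialized results returned
# to the driver (the job aborts with a different error if this is exceeded).
spark-submit \
  --master yarn \
  --driver-memory 3g \
  --conf spark.driver.maxResultSize=2g \
  bigdata.py
```

If the data only needs to be inspected or saved, avoiding collect() entirely (e.g. take(n) for a sample, or writing results back to HDFS) sidesteps both limits.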
2. Too many cores per executor: the concurrent task threads compete with the garbage collector for CPU, so most of the time is spent in GC.
The solution is to reduce the number of cores per executor:
spark-submit --master yarn --executor-cores 1 bigdata.py
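Dropping to one core per executor reduces GC contention but also cuts parallelism, so in practice the executor count and memory are usually adjusted together. A sketch of such a trade-off is below; the executor count and memory size are illustrative assumptions, not measured values for this job.

```shell
# Sketch only: --num-executors and --executor-memory values are assumptions.
# One core per executor means one task per JVM, so tasks no longer starve
# the GC of CPU; extra executors restore the lost parallelism, and more
# memory per executor gives each task's working set headroom.
spark-submit \
  --master yarn \
  --executor-cores 1 \
  --num-executors 8 \
  --executor-memory 2g \
  bigdata.py
```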
References:
https://blog.csdn.net/amghost/article/details/45303315