Spark parameters

Copied from elsewhere and saved here for my own reference: http://guoke456.iteye.com/admin/blogs/2372445
The following are some of Spark's configuration parameters; see Spark Configuration in the official documentation for the full reference.

Spark provides three places to configure the system:

Spark properties: control most application parameters; they can be set with a SparkConf object or through Java system properties
Environment variables: per-machine settings, such as IP address and port, set through the conf/spark-env.sh script on each node
Logging: configured through log4j.properties
Spark properties control most application settings and are configured separately for each application. These properties can be set directly on a SparkConf object, which is then passed to SparkContext. SparkConf lets you configure common properties (such as the master URL and application name) as well as arbitrary key-value pairs through the set() method. For example, we can create an application that uses two threads as follows.

val conf = new SparkConf()
             .setMaster("local[2]")
             .setAppName("CountingSheep")
             .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
Dynamically Loading Spark Properties

In some cases, you may want to avoid hardcoding certain configurations in SparkConf. For example, you might want to run the same application with different masters or different amounts of memory. Spark allows you to simply create an empty conf:

val sc = new SparkContext(new SparkConf())
and then set the configuration values at runtime:

./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
The Spark shell and the spark-submit tool support two ways to load configuration dynamically. The first is command line options such as --master, as shown above. spark-submit can accept any Spark property via the --conf flag, but properties that take part in launching the Spark application are exposed as dedicated flags. Running ./bin/spark-submit --help will show the full list of options.

bin/spark-submit also reads configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace or an equals sign. For example:

spark.master spark://5.6.7.8:7077
spark.executor.memory 512m
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
Any values specified as flags or in the properties file are passed to the application and merged with those set via SparkConf. Properties set directly on SparkConf take the highest priority, then values passed to spark-submit or spark-shell, and finally values in the spark-defaults.conf file.

Priority order:

SparkConf > CLI > spark-defaults.conf
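As a sanity check on which value actually won, you can read the merged configuration back from a running SparkContext. A minimal sketch (the property names queried here are just examples):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("ConfCheck"))

// getConf reflects the merged result of SparkConf > CLI > spark-defaults.conf.
println(sc.getConf.get("spark.app.name"))
println(sc.getConf.getOption("spark.executor.memory").getOrElse("<built-in default>"))

// List every property that was explicitly set (built-in defaults are not included).
sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }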
Viewing Spark Properties

The application web UI at http://<driver>:4040 lists all Spark properties in the Environment tab. This is a useful place to check that the properties you set took effect.

Note: only values specified via spark-defaults.conf, SparkConf, or directly on the command line are displayed. For all other configuration properties, you can assume the default value is used.

Available Properties

Most properties that control internal settings have sensible default values. Some of the most common options are listed below:

Application Properties

Property Name Default Value Meaning
spark.app.name (none) The name of your application. This will appear in the UI and log data
spark.driver.cores 1 Number of CPU cores used by the driver process
spark.driver.maxResultSize 1g Limit on the total size of the serialized results of all partitions for each Spark action (such as collect). The value should be at least 1m; 0 means unlimited. Jobs are aborted if the total size exceeds this limit. A large limit may cause out-of-memory errors in the driver (depending on spark.driver.memory and the memory overhead of objects in the JVM)
spark.driver.memory 512m Amount of memory used by the driver process
spark.executor.memory 512m Amount of memory used by each executor process, in the same format as JVM memory strings (e.g. 512m, 2g)
spark.extraListeners (none) A comma-separated list of classes implementing SparkListener to register when the SparkContext is initialized (a sketch follows this table)
spark.local.dir /tmp The directory used for temporary storage space in Spark. In Spark 1.0 and higher, this property is overridden by the SPARK_LOCAL_DIRS(Standalone, Mesos) and LOCAL_DIRS(YARN) environment variables.
spark.logConf false Log valid SparkConf as INFO when SparkContext starts.
spark.master (none) The cluster manager to connect to
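For example, spark.extraListeners expects fully qualified names of classes that implement SparkListener and have a zero-argument constructor. A minimal sketch, with a made-up class name and behaviour:

package com.example

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Prints a line every time a job finishes; compiled onto the application classpath.
class JobEndLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished with result ${jobEnd.jobResult}")
}

It could then be enabled with --conf spark.extraListeners=com.example.JobEndLogger or the equivalent SparkConf.set call.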
Runtime Environment

property name default value meaning
spark.driver.extraClassPath (none) Extra classpath entries to append to the driver's classpath
spark.driver.extraJavaOptions (none) A string of JVM options to pass to the driver, for example GC settings or other logging options. Note that it is illegal to set Spark properties or heap size here; Spark properties should be set with a SparkConf object or the spark-defaults.conf file, and the driver heap size with spark.driver.memory
spark.driver.extraLibraryPath (none) Specifies the library path used when starting the driver's JVM
spark.driver.userClassPathFirst false (experimental) When loading classes in the driver, whether the user-added jar has a higher priority than Spark's own jar . This property can reduce the conflict between Spark dependencies and user dependencies. It's still an experimental feature for now.
spark.executor.extraClassPath (none) Extra classpath entities to append to the executors' classpath. The main purpose of this setting is the backward compatibility of Spark with older versions. Users generally do not need to set this option
spark.executor.extraJavaOptions (none) A string of JVM options to pass to executors, for example GC settings or other logging options. Note that it is illegal to set Spark properties or heap size here; Spark properties should be set with a SparkConf object or the spark-defaults.conf file used by spark-submit, and the executor heap size with spark.executor.memory (a sketch follows this table)
spark.executor.extraLibraryPath (none) specifies the library path used when starting the executor's JVM
spark.executor.logs.rolling.maxRetainedFiles (none) Sets the number of recent rolling log files retained by the system. Older log files will be deleted. Not enabled by default.
spark.executor.logs.rolling.size.maxBytes (none) Maximum rolling size of executor logs. Not enabled by default. Value set to bytes
spark.executor.logs.rolling.strategy (none) Sets the rolling strategy of the executor logs. Not enabled by default. Can be configured as time and size. For time, use spark.executor.logs.rolling.time.interval to set the rolling interval; for size, use spark.executor.logs.rolling.size.maxBytes to set the maximum rolling size
spark.executor.logs.rolling.time.interval daily Sets the time interval at which executor logs are rolled over. Not enabled by default. Valid values are daily, hourly, minutely, or any interval in seconds
spark.files.userClassPathFirst false (experimental) Whether user-added jars take precedence over Spark's own jars when loading classes in Executors. This property can reduce the conflict between Spark dependencies and user dependencies. It's still an experimental feature for now.
spark.python.worker.memory 512m Amount of memory to use per Python worker process during aggregation. If the memory used during aggregation exceeds this limit, the data is spilled to disk
spark.python.profile false Enable profiling in Python workers. Profile results are shown via sc.show_profiles(), or displayed before the driver exits. They can also be dumped to disk via sc.dump_profiles(path). If some profile results have already been displayed manually, they will not be shown automatically before the driver exits
spark.python.profile.dump (none) The directory in which to dump profile results before the driver exits. Results are dumped as one file per RDD and can be loaded via pstats.Stats(). If this is specified, profile results are not shown automatically
spark.python.worker.reuse true Whether to reuse Python workers. If yes, a fixed number of Python workers is used rather than fork()ing a new Python process for each task. This is useful when there is a large broadcast, because the broadcast then does not need to be transferred from the JVM to a Python worker once per task
spark.executorEnv.[EnvironmentVariableName] (none) Adds the environment variable specified by EnvironmentVariableName to the executor process. Multiple of these properties can be set to specify multiple environment variables
spark.mesos.executor.home driver side SPARK_HOME Set the Spark directory installed on the Mesos executor. By default, executors will use the driver's Spark local (home) directory, which is not visible to them. Note that this setting only works if the Spark binary package is not specified via spark.executor.uri
spark.mesos.executor.memoryOverhead executor memory * 0.07, minimum 384m This value is a supplement to spark.executor.memory. It is used to calculate the total memory for mesos tasks. Also, there is a hardcoded setting of 7%. The final value will choose spark.mesos.executor.memoryOverhead or 7% of spark.executor.memory, whichever is greater.
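A hedged sketch of how a few of the properties above might be combined on one SparkConf (the values and the DATA_DIR variable name are purely illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RuntimeEnvExample")
  // GC/logging flags for executors; the heap size itself must go through spark.executor.memory.
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  .set("spark.executor.memory", "2g")
  // Environment variable made visible inside every executor process.
  .set("spark.executorEnv.DATA_DIR", "/data/scratch")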
Shuffle Behavior

Property name default value meaning
spark.reducer.maxMbInFlight 48 Maximum size (in MB) of map outputs to fetch simultaneously from each reduce task. Since each output requires a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory
spark.shuffle.blockTransferService netty Implementation used to transfer shuffle and cached blocks between executors. There are two implementations available: netty and nio. Netty-based block transfer is simpler while being equally efficient
spark.shuffle.compress true whether or not to compress the output file of a map operation. In general, this is a good choice.
spark.shuffle.consolidateFiles false If set to "true", consolidated intermediate files are created during shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" when using the ext4 or xfs filesystems. On ext3, this option may degrade performance on machines with more than 8 cores due to filesystem limitations
spark.shuffle.file.buffer.kb 32 Size of the in-memory buffer for each shuffle file output stream, in KB. These buffers reduce the number of disk seeks and system calls made when creating intermediate shuffle files
spark.shuffle.io.maxRetries 3 (Netty only) Number of times fetches are automatically retried if an IO-related exception occurs
spark.shuffle.io.numConnectionsPerPeer 1 (Netty only) Number of connections between each pair of hosts, reused to reduce connection build-up in large clusters
spark.shuffle.io.preferDirectBufs true (Netty only) Whether to prefer off-heap (direct) buffers to reduce garbage collection during shuffle and cache block transfer
spark.shuffle.io.retryWait 5 (Netty only) How long, in seconds, to wait between fetch retries
spark.shuffle.manager sort Implementation used to shuffle data. Two implementations are available: sort and hash. Sort-based shuffle is more memory-efficient
spark.shuffle.memoryFraction 0.2 If spark.shuffle.spill is true, the fraction of Java heap memory used for aggregation and cogroup during shuffle. At any given time, the collective size of all in-memory maps used for shuffles is bounded by this limit; beyond it, the contents spill to disk. If spills happen frequently, consider increasing this value at the expense of spark.storage.memoryFraction (a sketch follows this table)
spark.shuffle.sort.bypassMergeThreshold 200 (Advanced) In the sort-based shuffle manager, avoid merge-sorting data if there is no map-side aggregation and there are at most this many reduce partitions
spark.shuffle.spill true If set to "true", limits the amount of memory used during reduces by spilling excess data to disk. The spill threshold is specified by spark.shuffle.memoryFraction
spark.shuffle.spill.compress true Whether to compress the spilled data during shuffle. The compression algorithm is specified via spark.io.compression.codec.
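These shuffle settings are usually tuned together; for example, a job that spills too often might give shuffles a slightly larger memory fraction and a larger file buffer. A sketch under the old memory model described in this table (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ShuffleTuningExample")
  .set("spark.shuffle.compress", "true")
  .set("spark.shuffle.memoryFraction", "0.3")  // more heap for shuffle aggregation, at the expense of storage
  .set("spark.shuffle.file.buffer.kb", "64")   // larger per-stream buffer, fewer disk seeks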
Spark UI

property name default value meaning
spark.eventLog.compress false Whether to compress the event log. Requires spark.eventLog.enabled to be true
spark.eventLog.dir file:///tmp/spark-events Base directory in which Spark events are logged. Within this base directory, Spark creates a subdirectory for each application and logs that application's events there. Users may want to set this to a unified location such as an HDFS directory so that history files can be read by the history server
spark.eventLog.enabled false Whether to log Spark events. This is useful for reconstructing the web UI after the application has finished
spark.ui.killEnabled true Allows stages and their corresponding jobs to be killed from the web UI
spark.ui.port 4040 Port for your application's dashboard, which shows memory and workload data
spark.ui.retainedJobs 1000 How many jobs the Spark UI and status APIs remember before garbage collecting
spark.ui.retainedStages 1000 How many stages the Spark UI and status APIs remember before garbage collecting

Serialization

Property Name Default Value Meaning
spark.broadcast.compress true Whether to compress broadcast variables before sending them
spark.closure.serializer org.apache.spark.serializer.JavaSerializer Serializer class used for closures. Currently only the Java serializer is supported
spark.io.compression.codec snappy The codec used to compress internal data such as RDD partitions, broadcast variables and shuffle outputs. By default, Spark provides three codecs: lz4, lzf and snappy. You can also specify a codec by its fully qualified class name
spark.io.compression.lz4.block.size 32768 Block size used in LZ4 compression. Reducing this block size will also reduce shuffle memory usage
spark.io.compression.snappy.block.size 32768 Block size used in Snappy compression. Reducing this block size will also reduce shuffle memory usage
spark.kryo.classesToRegister (none) If you serialize with Kryo, given a comma-separated list of custom class names representing classes to register
spark.kryo.referenceTracking true Whether to track references to the same object when serializing with Kryo. This is necessary if your object graph contains cycles, and useful for efficiency if it contains multiple copies of the same object. If you know neither applies, it can be disabled to improve performance
spark.kryo.registrationRequired false Whether registration with Kryo is required. If set to true, Kryo throws an exception when an unregistered class is serialized. If set to false, Kryo writes the unregistered class name along with each object, which can cause a significant performance overhead
spark.kryo.registrator (none) If you use Kryo serialization, set this class to register your custom class. This property is useful if you need to register your class in a custom way. Otherwise spark.kryo.classesToRegister would be simpler. It should set a class that inherits from KryoRegistrator
spark.kryoserializer.buffer.max.mb 64 Maximum allowable size of the Kryo serialization buffer, in MB. This must be larger than any object you attempt to serialize
spark.kryoserializer.buffer.mb 0.064 Initial size of the Kryo serialization buffer, in MB. There is one buffer per core on each worker. The buffer grows, if needed, up to the value set by spark.kryoserializer.buffer.max.mb
spark.rdd.compress false Whether to compress serialized RDD partitions. Saves substantial space at the cost of some extra CPU time
spark.serializer org.apache.spark.serializer.JavaSerializer Class used to serialize objects. The default Java serialization works with any serializable Java object but is quite slow, so org.apache.spark.serializer.KryoSerializer is recommended when speed matters (a sketch follows this table)
spark.serializer.objectStreamReset 100 When serializing with org.apache.spark.serializer.JavaSerializer, the serializer caches objects to avoid writing redundant data, but this prevents those objects from being garbage collected. Calling 'reset' flushes that information from the serializer and allows old objects to be collected. To turn off this periodic reset, set the value to -1. By default, the serializer is reset every 100 objects
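Putting the Kryo-related properties together: a typical setup switches the serializer and registers the application's own classes so Kryo does not have to write class names. A minimal sketch (MyCaseClass stands in for your own types):

import org.apache.spark.{SparkConf, SparkContext}

case class MyCaseClass(id: Long, name: String)  // placeholder for your own classes

val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max.mb", "128")
  // Equivalent to listing the class names in spark.kryo.classesToRegister.
  .registerKryoClasses(Array(classOf[MyCaseClass]))

val sc = new SparkContext(conf)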
Execution Behavior

Property name Default value Meaning
spark.broadcast.blockSize 4096 Size (in KB) of each block of a broadcast transmitted by TorrentBroadcastFactory. Too large a value reduces parallelism during broadcast; too small a value may hurt BlockManager performance
spark.broadcast.factory org.apache.spark.broadcast.TorrentBroadcastFactory Which broadcast implementation to use
spark.cleaner.ttl (infinite) Duration (seconds) for which Spark remembers any metadata (generated stages, generated tasks, etc.). Periodic cleaning ensures that metadata older than this is discarded, which is useful for long-running jobs such as Spark Streaming jobs running 24/7. Note that RDD data persisted in memory is also cleaned up once it is older than this
spark.default.parallelism local mode: number of cores on the machine; Mesos fine-grained mode: 8; otherwise: max(total executor cores, 2) Default number of tasks used by shuffle operations (groupByKey, reduceByKey, etc.) when the user does not set one (a sketch follows this table)
spark.executor.heartbeatInterval 10000 Interval, in milliseconds, between each executor's heartbeats to the driver
spark.files.fetchTimeout 60 Timeout, in seconds, for the driver to fetch files added through SparkContext.addFile()
spark.files.useFetchCache true Whether to use a local cache when fetching files added through SparkContext.addFile()
spark.files.overwrite false Whether to overwrite the file when calling SparkContext.addFile()
spark.hadoop.cloneConf false Whether to clone a hadoop configuration file for each task
spark.hadoop.validateOutputSpecs true Whether to verify the output
spark.storage.memoryFraction 0.6 Fraction of the Java heap used for Spark's memory cache. This should not be larger than the old generation of objects in the JVM, which by default is about 0.6 of the heap, but you can increase it if you configure your own old generation size
spark.storage.memoryMapThreshold 2097152 Size of a block, in bytes, above which Spark memory-maps it when reading from disk
spark.storage.unrollFraction 0.2 Fraction of spark.storage.memoryFraction to use for unrolling blocks in memory.
spark.tachyonStore.baseDir System.getProperty("java.io.tmpdir") Temporary directory in the Tachyon file system
spark.tachyonStore.url tachyon://localhost:19998 URL of the Tachyon file system
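spark.default.parallelism is what operations fall back to when no partition count is given; a quick way to see its effect is sketched below (run in local mode, values illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("ParallelismExample")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 100)  // numSlices defaults to spark.default.parallelism
println(rdd.partitions.length)      // expected to print 8 with the setting above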
Network

property name default value meaning
spark.driver.host (local hostname) Hostname or IP address for the driver to listen on. This is used for communicating with the executors and the standalone master
spark.driver.port (random) Port for the driver to listen on. This is used for communicating with the executors and the standalone master
spark.fileserver.port (random) Port for the driver's file server to listen on
spark.broadcast.port (random) Port for the driver's HTTP broadcast server to listen on
spark.replClassServer.port (random) Port for the driver's HTTP class server to listen on
spark.blockManager.port (random) The port on which the block manager listens. These exist in both driver and executors
spark.executor.port (random) The port on which the executor listens. Used to communicate with the driver
spark.port.maxRetries 16 Maximum number of retries when binding to a port before giving up
spark.akka.frameSize 10 Maximum message size allowed in "control plane" communication. If your task needs to send large results to the driver, increase this value
spark.akka.threads 4 The number of actor threads to communicate with. When the driver has many CPU cores, it is useful to increase it
spark.akka.timeout 100 Communication timeout between Spark nodes. Units are seconds
spark.akka.heartbeat.pauses 6000 This is set to a large value to disable the failure detector that comes built in to Akka. It can be enabled again if you plan to use this feature (not recommended). Acceptable heartbeat pause, in seconds, for Akka; this can be used to control sensitivity to GC pauses. Tune it in combination with spark.akka.heartbeat.interval and spark.akka.failure-detector.threshold if you need to.
spark.akka.failure-detector.threshold 300.0 This is set to a large value to disable the failure detector that comes built in to Akka. It can be enabled again if you plan to use this feature (not recommended). This maps to Akka's akka.remote.transport-failure-detector.threshold. Tune it in combination with spark.akka.heartbeat.pauses and spark.akka.heartbeat.interval if you need to.
spark.akka.heartbeat.interval 1000 This is set to a large value to disable the failure detector that comes built in to Akka. It can be enabled again if you plan to use this feature (not recommended). A larger interval, in seconds, reduces network overhead, while a smaller value (around 1 s) may be more informative for Akka's failure detector. Tune it in combination with spark.akka.heartbeat.pauses and spark.akka.failure-detector.threshold if you need to. The only positive use case for the failure detector is that a sensitive detector can help evict rogue executors quickly; however, this is usually not worthwhile, since GC pauses and network lag are expected in a real Spark cluster, and enabling it floods the network with heartbeat exchanges between nodes.
Scheduling

Attribute Name Default Value Meaning
spark.task.cpus 1 The number of cores allocated for each task
spark.task.maxFailures 4 Maximum number of failures of any particular task before giving up on the job
spark.scheduler.mode FIFO Scheduling mode between jobs submitted to the same SparkContext; a FAIR mode is also available
spark.cores.max (not set) When the application runs on a standalone cluster or a Mesos cluster in coarse-grained sharing mode, the maximum total number of CPU cores the application requests from the cluster (not per machine, but across the whole cluster). If not set, standalone clusters use the value of spark.deploy.defaultCores, while Mesos uses all cores available in the cluster
spark.mesos.coarse false If set to true, runs on a Mesos cluster in coarse-grained sharing mode
spark.speculation false The following parameters relate to Spark's speculative execution mechanism. This one controls whether speculative execution is used: if set to true, tasks that are running slowly in a stage are relaunched on other nodes, and the result of whichever copy finishes first is used (a sketch follows this table)
spark.speculation.interval 100 How often Spark checks for tasks to speculate, in milliseconds
spark.speculation.quantile 0.75 Fraction of tasks in a stage that must be complete before speculation is enabled for that stage
spark.speculation.multiplier 1.5 How many times slower than the median a task must be before it is considered for speculation
spark.locality.wait 3000 The following parameters relate to Spark data locality. This one is how long, in milliseconds, to wait to launch a data-local task before giving up and launching it at the next, less-local level. The same wait is used to step through each locality level (process-local -> node-local -> rack-local -> any). The wait for individual levels can also be customized with spark.locality.wait.process, spark.locality.wait.node and spark.locality.wait.rack
spark.locality.wait.process spark.locality.wait Locality wait time for the process-local level
spark.locality.wait.node spark.locality.wait Locality wait time for the node-local level
spark.locality.wait.rack spark.locality.wait Locality wait time for the rack-local level
spark.scheduler.revive.interval 1000 Interval, in milliseconds, at which the scheduler revives worker resource offers to run tasks, for example after a task could not be scheduled because local resources were insufficient
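For instance, a long-running, multi-user application might combine the FAIR scheduler with speculative execution. A sketch (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SchedulingExample")
  .set("spark.scheduler.mode", "FAIR")       // instead of the default FIFO
  .set("spark.speculation", "true")          // relaunch straggler tasks on other nodes
  .set("spark.speculation.interval", "200")  // check for stragglers every 200 ms
  .set("spark.locality.wait", "1000")        // wait at most 1 s per locality level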
Dynamic Allocation

property name Default value meaning
spark.dynamicAllocation.enabled false Whether to enable dynamic resource allocation, which scales the number of executors registered with the application up and down based on the workload (a sketch follows this table)
spark.dynamicAllocation.executorIdleTimeout 600 If dynamic allocation is enabled and an executor has been idle for more than this many seconds, it is removed
spark.dynamicAllocation.initialExecutors spark.dynamicAllocation.minExecutors Initial number of executors to run if dynamic allocation is enabled
spark.dynamicAllocation.maxExecutors Integer.MAX_VALUE Upper bound on the number of executors if dynamic allocation is enabled
spark.dynamicAllocation.minExecutors 0 Lower bound on the number of executors if dynamic allocation is enabled
spark.dynamicAllocation.schedulerBacklogTimeout 5 If dynamic allocation is enabled and there have been pending tasks backlogged for more than this many seconds, new executors are requested
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout schedulerBacklogTimeout Same as spark.dynamicAllocation.schedulerBacklogTimeout, but used for subsequent executor requests
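A sketch of turning dynamic allocation on. Note that, at least in the Spark versions this list describes, it also requires the external shuffle service (spark.shuffle.service.enabled), which is not listed in the table above:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("DynamicAllocationExample")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.shuffle.service.enabled", "true")  // external shuffle service, needed so executors can be removed safely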
Security

Property Name Default Value Meaning
spark.authenticate false Whether Spark authenticates its internal connections. If not running on YARN, see spark.authenticate.secret
spark.authenticate.secret None The secret key used for authentication between Spark components. Must be set if authentication is enabled and Spark is not running on YARN (a sketch follows this table)
spark.core.connection.auth.wait.timeout 30 How long, in seconds, a connection waits for authentication to occur before timing out
spark.core.connection.ack.wait.timeout 60 The time the connection waits for an answer. The unit is seconds. To avoid undesired timeouts, you can set a larger value
spark.ui.filters None Comma-separated list of filter class names to apply to the Spark web UI. The filter must be a standard javax servlet Filter. Parameters for each filter can also be specified by setting java system properties. spark.<class name of filter>.params='param1=value1,param2=value2'. For example -Dspark.ui.filters=com.test.filter1, -Dspark.com.test.filter1.params='param1=foo,param2=testing'
spark.acls.enable false Whether to enable Spark ACLs. If enabled, Spark checks whether the user has permission to view or modify a job. Note that the UI relies on the UI filters to authenticate and determine the user
spark.ui.view.acls empty Comma-separated list of users who have permission to view the Spark web UI. By default, only the user who started the Spark job has view permissions
spark.modify.acls empty Comma-separated list of users that have permission to modify the Spark job. By default, only the user who started the Spark job has permission to modify
spark.admin.acls empty Comma-separated list of users or administrators that have permission to view and modify all Spark jobs. This option is useful if you are running on a shared cluster and have a group of administrators or developers to help debug.
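A minimal sketch of enabling authentication and ACLs outside YARN; the secret and the user names are placeholders:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SecurityExample")
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "change-me")  // required when authentication is on and not running on YARN
  .set("spark.acls.enable", "true")
  .set("spark.ui.view.acls", "alice,bob")         // extra users allowed to view the web UI
  .set("spark.admin.acls", "ops-team")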
Encryption

Property name Default value Meaning
spark.ssl.enabled false Whether to enable ssl
spark.ssl.enabledAlgorithms Empty
spark.ssl.keyPassword None
spark.ssl.keyStore None
spark.ssl.keyStorePassword None
spark.ssl.protocol None
spark.ssl.trustStore None
spark.ssl.trustStorePassword None
Spark Streaming

property name default value meaning
spark.streaming.blockInterval 200 Interval (in milliseconds) at which data received by Spark Streaming receivers is chunked into blocks before being stored in Spark. The recommended minimum value is 50 ms (a sketch follows this table)
spark.streaming.receiver.maxRate infinite Maximum number of records per second that each receiver will accept. Effectively, each stream consumes at most this many records per second. Setting this to 0 or a negative number imposes no limit
spark.streaming.receiver.writeAheadLogs.enable false Enable write ahead logs for receivers. All the input data received through receivers will be saved to write ahead logs that will allow it to be recovered after driver failures
spark.streaming.unpersist true Forces RDDs generated and persisted by Spark Streaming to be automatically unpersisted from Spark's memory. The raw input data received by Spark Streaming is also cleared automatically. Setting this to false lets streaming applications access the raw data and persisted RDDs, since they are not cleared automatically, at the cost of higher memory usage
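In a streaming job these receiver-side properties sit next to the StreamingContext setup; a sketch (the socket host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[2]")                             // at least 2 cores: one for the receiver, one for processing
  .setAppName("StreamingConfExample")
  .set("spark.streaming.blockInterval", "100")       // smaller blocks -> more tasks per batch
  .set("spark.streaming.receiver.maxRate", "10000")  // cap each receiver at 10,000 records/s

val ssc = new StreamingContext(conf, Seconds(2))     // 2-second batches
val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
lines.count().print()
ssc.start()
ssc.awaitTermination()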
Cluster Management

Spark On YARN

Property Name Default Value Meaning
spark.yarn.am.memory 512m In client mode, the amount of memory for the YARN Application Master; in cluster mode, use spark.driver.memory instead
spark.driver.cores 1 In cluster mode, the number of CPU cores used by the driver. Since the driver runs inside the Application Master, this effectively sets the AM's core count; in client mode, use spark.yarn.am.cores instead
spark.yarn.am.cores 1 In client mode, the number of CPU cores of am
spark.yarn.am.waitTime 100000 Time, in milliseconds, for the YARN Application Master to wait: in cluster mode, for the SparkContext to be initialized; in client mode, for the driver to connect to it
spark.yarn.submit.file.replication 3 The number of copies of files uploaded to HDFS by the application
spark.yarn.preserve.staging.files false If true, preserve the staged files (Spark jar, application jar, distributed cache files) at the end of the job rather than deleting them
spark.yarn.scheduler.heartbeat.interval-ms 5000 Interval at which the Spark Application Master sends heartbeats to the YARN ResourceManager
spark.yarn.max.executor.failures 2 * number of executors, minimum 3 Maximum number of executor failures before the application is declared failed
spark.yarn.historyServer.address (none) Address of the Spark history server (without the http:// prefix). This address is given to the YARN ResourceManager when the Spark application finishes, so that the ResourceManager UI can link the application to the Spark history server UI
spark.yarn.dist.archives (none)
spark.yarn.dist.files (none)
spark.executor.instances 2 Number of executor instances
spark.yarn.executor.memoryOverhead executorMemory * 0.07, with minimum of 384 Amount of off-heap memory (in MB) to allocate per executor
spark.yarn.driver.memoryOverhead driverMemory * 0.07, with minimum of 384 Amount of off-heap memory (in MB) to allocate for the driver in cluster mode
spark.yarn.am.memoryOverhead AM memory * 0.07, with minimum of 384 Amount of off-heap memory (in MB) to allocate for the Application Master in client mode
spark.yarn.queue default The YARN queue to submit the application to
spark.yarn.jar (none)
spark.yarn.access.namenodes (none)
spark.yarn.appMasterEnv.[EnvironmentVariableName] (none) set am environment variable
spark.yarn.containerLauncherMaxThreads 25 Maximum number of threads the Application Master uses to launch executor containers
spark.yarn.am.extraJavaOptions (none)
spark.yarn.maxAppAttempts yarn.resourcemanager.am.max-attempts in YARN Maximum number of attempts used to submit the application
Spark on Mesos

These are used less often; refer to Running Spark on Mesos.

Spark Standalone Mode

Refer to Spark Standalone Mode.

Spark History Server

When you run Spark Standalone Mode or Spark on Mesos mode, you can view the job running status through Spark History Server.

Environment Variables for Spark History Server:

Property Name Meaning
SPARK_DAEMON_MEMORY Memory to allocate to the history server (default: 512m).
SPARK_DAEMON_JAVA_OPTS JVM options for the history server (default: none).
SPARK_PUBLIC_DNS The public address of the history server
SPARK_HISTORY_OPTS Configuration options for the history server (spark.history.* properties)
Spark History Server properties:

property name default meaning
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider Name of the class implementing the application history backend. Currently there is only one implementation, provided by Spark, which reads application logs stored in the file system
spark.history.fs.logDirectory file:/tmp/spark-events Directory containing the application event logs to be loaded by the history server (a sketch follows this table)
spark.history.updateInterval 10 Interval, in seconds, at which the information displayed by the Spark history server is updated. Each update checks the persisted event logs for changes
spark.history.retainedApplications 50 The maximum number of applications displayed on the Spark history server. If this value is exceeded, old application information will be deleted.
spark.history.ui.port 18080 In the official version, the default access port of Spark history server
spark.history.kerberos.enabled false Whether to use kerberos to log in to access the history server, which is useful for the persistence layer on HDFS in a secure cluster. If set to true, configure the following two properties.
spark.history.kerberos.principal empty Kerberos principal name for the Spark history server
spark.history.kerberos.keytab empty Location of the Kerberos keytab file for the Spark history server
spark.history.ui.acls.enable false Whether to check ACLs when authorizing users to view application information. If enabled, only the application's owner and the users listed in spark.ui.view.acls can view the application's information; if disabled, no checks are done
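The history server only has something to show if applications actually write event logs. A sketch of the two sides (the HDFS URI is a placeholder):

import org.apache.spark.SparkConf

// Application side: write event logs somewhere the history server can read.
val conf = new SparkConf()
  .setAppName("HistoryLoggingExample")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-events")  // placeholder URI
  .set("spark.eventLog.compress", "true")

// History-server side (usually set via SPARK_HISTORY_OPTS or spark-defaults.conf, not SparkConf):
//   spark.history.fs.logDirectory  hdfs://namenode:8020/spark-events
//   spark.history.ui.port          18080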
Environment Variables

Certain Spark settings can be configured through environment variables, which are read from the conf/spark-env.sh script in the Spark installation directory (or conf/spark-env.cmd on Windows). In standalone and Mesos modes, this file can provide machine-specific information such as hostnames. It is also sourced when running local Spark applications or the submission scripts.

Note that conf/spark-env.sh does not exist by default when Spark is installed. You can create it by copying conf/spark-env.sh.template.

The following variables can be set in spark-env.sh:

Environment variable meaning
JAVA_HOME Path of Java installation
PYSPARK_PYTHON Path of Python binary executable file used by PySpark
SPARK_LOCAL_IP Machine binding IP address
SPARK_PUBLIC_DNS The hostname your Spark application advertises to other machines
In addition to above, Spark standalone cluster scripts can also set some options. Such as the number of cores used per machine and the maximum memory.

Because spark-env.sh is a shell script, some of these settings can be computed programmatically; for example, you might compute SPARK_LOCAL_IP by looking up the IP address of a specific network interface.

Configuring Logging

Spark uses log4j for logging. You can configure it by adding a log4j.properties file in the conf directory. One way to start is to copy the log4j.properties.template file located there.
