MaxCompute Spark usage and common problems

1. Introduction to MaxCompute Spark

MaxCompute Spark is MaxCompute's open-source-compatible Spark computing service. Built on MaxCompute's unified computing resources and dataset permission system, it provides a Spark computing framework and lets users submit and run Spark jobs with the development tools and workflows they are already familiar with, covering a richer set of data processing and analysis scenarios.

1.1 Key features

  • Support for native, multi-version Spark jobs
    • Community-native Spark runs inside MaxCompute, fully compatible with the Spark APIs, and multiple Spark versions can run at the same time
  • Unified computing resources
    • Like MaxCompute SQL/MR and other task types, Spark jobs run on the unified computing resources activated for the MaxCompute project
  • Unified data and permission management
    • Jobs follow the MaxCompute project's authorization system and can only query data within the scope that the accessing user is authorized for
  • The same experience as the open source system
    • The native open-source, real-time Spark UI and query history log functions are provided

1.2 System structure

  • Native Spark can run in MaxCompute through the MaxCompute Cupid platform

1.3 Restrictions and limitations

  • Currently MaxCompute Spark supports the following applicable scenarios:
    • Offline computing scenarios: GraphX, Mllib, RDD, Spark-SQL, PySpark, etc.
    • Streaming scenarios
    • Read and write MaxCompute Table
    • Reference file resources in MaxCompute
    • Read and write services in the VPC environment, such as RDS, Redis, HBase, and services deployed on ECS
    • Read and write OSS unstructured storage
  • Use restrictions
    • Interactive use (Spark-Shell, Spark-SQL-Shell, PySpark-Shell, etc.) is not supported
    • Access to MaxCompute external tables, functions and UDFs is not supported
    • Only Local mode and Yarn-cluster mode are supported

2. Development environment setup

2.1 Operating mode

  • Submit via Spark client
    • Yarn-Cluster mode, submit tasks to the MaxCompute cluster
    • Local mode
  • Submit via Dataworks
    • This is essentially Yarn-Cluster mode as well; the task is submitted to the MaxCompute cluster

2.2 Submit via the client

2.2.1 Yarn-Cluster Mode

  • Download MC Spark client
    • Spark 1.6.3
    • Spark 2.3.0
  • Environment variable configuration
  • Parameter configuration
    • Rename $SPARK_HOME/conf/spark-defaults.conf.template to spark-defaults.conf
    • Configure the parameters by referring to the example below
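A minimal spark-defaults.conf sketch is shown below; all values are placeholders or assumptions (the endpoint pattern in particular depends on your region and network environment), so replace them with your own project settings:

# Illustrative spark-defaults.conf; replace the placeholders with your own values
spark.hadoop.odps.project.name = your_project_name
spark.hadoop.odps.access.id = your_access_id
spark.hadoop.odps.access.key = your_access_key
# Endpoints depend on your region and network environment
spark.hadoop.odps.end.point = http://service.<region>.maxcompute.aliyun.com/api
spark.hadoop.odps.runtime.end.point = http://service.<region>.maxcompute.aliyun-inc.com/api
# Public cloud platform version (see section 3.3)
spark.hadoop.odps.task.major.version = cupid_v2
# Required for Spark 2.3 (see section 3.3)
spark.sql.catalogImplementation = odps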
  • Prepare project engineering
git clone https://github.com/aliyun/MaxCompute-Spark.git
cd spark-2.x
mvn clean package
  • Task submission
// In a bash environment
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi \
/path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

// Command for submitting in a Windows environment
cd $SPARK_HOME/bin
spark-submit.cmd --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi
\path\to\MaxCompute-Spark\spark-2.x\target\spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

2.2.2 Local mode

  • Similar to Yarn-Cluster mode, users first need to complete the preparation steps above
  • Task submission
## Java/Scala
cd $SPARK_HOME
./bin/spark-submit --master local[4] --class com.aliyun.odps.spark.examples.SparkPi \
/path/to/odps-spark-examples/spark-examples/target/spark-examples-2.0.0-SNAPSHOT-shaded.jar

## PySpark
cd $SPARK_HOME
./bin/spark-submit --master local[4] \
/path/to/odps-spark-examples/spark-examples/src/main/python/odps_table_rw.py
  • Notes for debugging in IDEA
    • When running in Local mode from IDEA, the configuration in spark-defaults.conf cannot be read directly; the relevant configuration must be set manually in the code
    • Be sure to manually add the MaxCompute Spark client's dependencies (the jars directory) in IDEA, otherwise the following error appears: the value of spark.sql.catalogimplementation should be one of hive in-memory but was odps

2.3 Submit via DataWorks

2.3.1 Resource upload

  • Essentially the configuration of the MC Spark node corresponds to the parameters and options of the spark-submit command

  • Upload resources:
    • 0~50MB: You can create and upload resources directly on the DataWorks interface
    • 50MB~500MB: You can upload using MaxCompute client (CMD) first, and then add it to data development on the DataWorks interface
  • Resource reference:
    • After the resource is submitted, you can select the required resource (jar/python/file/archive) on the DataWorks Spark node interface
    • When the task is running: resource files will be uploaded to the current working directory of Driver and Executor by default

2.3.2 Parameters and configuration

  • Spark configuration items: corresponding to the --conf option of the spark-submit command
    • accessid, accesskey, projectname, endpoint, runtime.end.point and task.major.version do not need to be configured
    • Apart from those, the configurations in spark-defaults.conf need to be added to the DataWorks configuration items one by one
  • Pass parameters to the main class (such as bizdate)
    • First add a parameter in Scheduling -> Parameters, and then reference the parameter in the "Parameter" column of the Spark node. Multiple parameters are separated by spaces
    • This parameter is passed to the user's main class, where it can be parsed in code, as in the sketch below
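For illustration, a minimal Scala sketch of parsing such a parameter in the main class (the object name and the assumption that bizdate is the first argument are hypothetical):

object SparkJobMain {
  def main(args: Array[String]): Unit = {
    // Parameters from the DataWorks "Parameter" column arrive as space-separated args;
    // here we assume the first one is bizdate (e.g. 20240101).
    require(args.nonEmpty, "expected bizdate as the first argument")
    val bizdate = args(0)
    println(s"bizdate = $bizdate")
    // ... use bizdate in the job, e.g. as a partition filter
  }
}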

3. Configuration introduction

3.1 Location of configuration

3.1.1 Location of Spark configuration

  • When using MaxCompute Spark, there are several places where users can add Spark configuration, mainly:
    • Location 1: spark-defaults.conf, the Spark configuration added to the spark-defaults.conf file when submitting through the client
    • Location 2: DataWorks configuration items, the Spark configuration added to the configuration items when submitting through DataWorks; this part of the configuration is eventually added at location 3
    • Location 3: the --conf options of the spark-submit command in the startup script
    • Location 4: user code, the Spark configuration set when initializing the SparkContext
  • Spark configuration priority
    • user code > spark-submit --conf options > spark-defaults.conf > spark-env.sh > defaults

3.1.2 Two configurations that need to be distinguished

  • The first kind must be configured in spark-defaults.conf or in the DataWorks configuration items to take effect (it is needed before the task is submitted) and cannot be set in user code. The main characteristics of this kind of configuration are:
    • Related to the MaxCompute/Cupid platform: the parameter name usually contains odps or cupid, and the parameter is usually tied to task submission and resource application:
      • Resources such as the driver's memory, cores, disk and MaxCompute resources are obviously acquired before the task starts executing; if these parameters are set in code, the platform has no way to read them, so they must not be configured in code
      • Setting some of these parameters in code will not make the task fail, but they simply will not take effect
      • Setting some of these parameters in code may have side effects, for example setting spark.master to local in yarn-cluster mode
    • Parameters for accessing a VPC:
      • These parameters are also platform-related; the network connection is established when the task is submitted
  • The second kind can take effect in any of the locations above, but the configuration in code has the highest priority
  • It is recommended to set parameters related to task execution and tuning in code, and to put the platform- and resource-related configuration in spark-defaults.conf or the DataWorks configuration items, as illustrated in the sketch below.
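A sketch of this split, with illustrative parameter names and values: tuning-related settings are set in code when building the SparkSession, while resource and platform settings stay in spark-defaults.conf or the DataWorks configuration items.

import org.apache.spark.sql.SparkSession

// OK in code: execution/tuning parameters (highest priority)
val spark = SparkSession.builder()
  .appName("ConfigPlacementExample")
  .config("spark.sql.shuffle.partitions", "400")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// NOT in code: platform/resource parameters such as spark.executor.instances,
// spark.executor.memory or spark.hadoop.odps.* -- they are needed before the task
// is submitted, so put them in spark-defaults.conf or the DataWorks configuration items.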

3.2 Resource-related parameters

spark.executor.instances

  • Total number of executors requested
  • Ten to a few dozen executors are enough for common tasks; if a large amount of data is processed, more can be requested, e.g. 100 to 2000+

spark.executor.cores

  • Number of cores per executor
  • The maximum degree of parallelism of the job is the number of executors * the number of executor cores

spark.executor.memory

  • Represents the memory of the executor

spark.yarn.executor.memoryOverhead

  • Off-heap memory requested for each executor; the default unit is MB
  • Mainly used for the JVM itself, strings, NIO buffers and other overhead
  • The total memory of a single executor is: spark.executor.memory+spark.yarn.executor.memoryOverhead

spark.driver.cores

  • Similar to executor

spark.driver.memory

  • Similar to executor

spark.yarn.driver.memoryOverhead

  • Similar to executor

spark.driver.maxResultSize

  • Defaults to 1g; controls the size of the data that workers send back to the driver. Once the limit is exceeded, the driver terminates execution

spark.hadoop.odps.cupid.disk.driver.device_size

  • The size of the local (network) disk; the default value is 20g
  • When "No space left on device" appears, this value can be increased appropriately, up to a maximum of 100g
  • The value of this parameter must include the unit 'g'
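Putting the parameters above together, an illustrative configuration might look like the following (the values are examples, not recommendations):

# Illustrative values only -- tune them to the workload
spark.executor.instances = 20
spark.executor.cores = 2
spark.executor.memory = 8g
# Off-heap memory per executor, in MB; total per executor = 8g + 2g
spark.yarn.executor.memoryOverhead = 2048
spark.driver.cores = 2
spark.driver.memory = 4g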

3.3 Platform-related parameters

spark.hadoop.odps.project.name

  • The project where the Spark task is running

spark.hadoop.odps.access.id

  • The accessId of the submitted spark task

spark.hadoop.odps.access.key

  • The accessKey of the submitted spark task

spark.hadoop.odps.end.point

spark.hadoop.odps.runtime.end.point

spark.hadoop.odps.task.major.version

  • Represents the currently used platform version
  • Set the public cloud to cupid_v2

spark.sql.catalogImplementation

  • Spark 2.3 version needs to be set to odps
  • Spark 2.4 and later versions will be changed to hive
  • In order to facilitate job migration, it is recommended not to write the configuration in the code

spark.hadoop.odps.cupid.resources

  • This configuration item specifies the MaxCompute resources required for the program to run, in the format <projectname>.<resourcename>; multiple resources can be specified, separated by commas
  • The specified resources are downloaded to the working directory of the driver and executors; this parameter is often used to reference larger files
  • After a resource is downloaded, its default name in the working directory is <projectname>.<resourcename>
  • To rename it, use the format <projectname>.<resourcename>:<new resource name> in the configuration, as in the example below
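An illustrative example (project and resource names are placeholders); the second resource is renamed dict.txt in the working directory:

spark.hadoop.odps.cupid.resources = my_project.model.tar.gz,my_project.dict_v2.txt:dict.txt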

spark.hadoop.odps.cupid.vectorization.enable

  • Whether to enable vectorized read and write, the default is true

spark.hadoop.odps.input.split.size

  • Used to adjust the concurrency of reading Maxcompute table
  • Each partition is 256MB by default, and the parameter unit is MB

spark.hadoop.odps.cupid.vpc.domain.list

  • Parameter required for VPC access; the traditional way of accessing a VPC

spark.hadoop.odps.cupid.smartnat.enable

  • Parameter required for VPC access
  • If the region is Beijing or Shanghai, set this parameter to true

spark.hadoop.odps.cupid.eni.enable

  • If the user has opened a dedicated line, it needs to be configured as true

spark.hadoop.odps.cupid.eni.info

  • If the user has opened a dedicated line, you need to set this parameter
  • This parameter represents the vpc opened by the user

spark.hadoop.odps.cupid.engine.running.type

  • Ordinary jobs are forcibly reclaimed if they have not finished running within 3 days; streaming jobs need to set this value to longtime

spark.hadoop.odps.cupid.job.capability.duration.hours

  • Expiration time of the streaming job's permission file, in hours

spark.hadoop.odps.moye.trackurl.dutation

  • Expiration time of the streaming job's jobview, in hours
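For a long-running streaming job, these settings might be combined as in the sketch below (the hour values are placeholders):

spark.hadoop.odps.cupid.engine.running.type = longtime
# Permission file expiration, in hours
spark.hadoop.odps.cupid.job.capability.duration.hours = 8640
# Jobview expiration, in hours
spark.hadoop.odps.moye.trackurl.dutation = 8640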

4. Job diagnosis

4.1 Logview

4.1.1 Introduction to Logview

  • A log is printed when the task is submitted; it contains a Logview link (keyword: logview url)
  • The StdErr of the Master and Workers contains the logs output by the Spark engine, while StdOut contains what the user job prints to the console

4.1.2 Use Logview to troubleshoot problems

  • After getting the Logview, generally look at the driver's errors first; the driver usually contains some critical errors
  • If the driver reports that a class or method cannot be found, it is generally a jar packaging problem
  • If the driver reports a timeout when connecting to an external VPC or OSS, the relevant parameter configuration generally needs to be checked
  • If the driver reports that executors cannot be connected or that a chunk cannot be found, it is usually because executors exited early; check the executor errors, which may include OOM
    • Sort by End Time: the earlier the end time, the more likely that executor is the one with the problem
    • Sort by Latency: Latency represents the executor's lifetime, and the shorter the lifetime, the more likely it is the root cause

4.2 Spark UI and History Server

  • The Spark UI is consistent with the community version. Find the Spark UI link under the summary module of Logview:

  • The use of the Spark UI is consistent with the community-native version; refer to the documentation
  • Notes
    • The Spark UI requires authentication; only the Owner who submitted the task can open it
    • The Spark UI can only be opened while the job is running. If the task has ended, the Spark UI cannot be opened; in that case, check the Spark History Server UI.

5. Frequently Asked Questions

1. Problems running in local mode

  • Question 1: the value of spark.sql.catalogimplementation should be one of hive in-memory but was odps
    • The reason is that the user did not add the MaxCompute Spark jars directory to the classpath as described in the documentation, so the community version of the Spark packages was loaded; add the jars directory to the classpath according to the documentation
  • Question 2: IDEA Local mode cannot directly reference the configuration in spark-defaults.conf, and the Spark configuration items must be written in the code
  • Question 3: accessing OSS and VPC:
    • Local mode runs in the user's local environment, where the network is not isolated, while Yarn-Cluster mode runs in MaxCompute's network-isolated environment, so the VPC access parameters must be configured there
    • The endpoint for accessing OSS in Local mode is usually a public-network endpoint, while the endpoint used in Yarn-cluster mode is a classic-network (internal) endpoint

2. The problem of jar package packaging

  • Java/Scala programs often run into "class not found" or class conflict problems:
    • Class conflict: the user's jar conflicts with a jar that Spark or the platform depends on
    • Class not found: the user's jar was not built as a fat jar, or the class was masked by a class conflict
  • Things to pay attention to when packaging (see the illustrative pom snippet after this list):
    • The difference between the provided and compile scopes:
      • provided: the code depends on the jar, but it is only needed at compile time, not at runtime; at runtime the corresponding jar is found on the cluster
      • compile: the code depends on the jar at both compile time and runtime; these jars do not exist on the cluster and must be packaged into the user's own jar. They are generally third-party libraries that are related to the user's code logic rather than to running Spark itself
    • The jar submitted by the user must be a fat jar:
      • All compile-scoped dependencies must be packaged into the user's jar so that these classes can be loaded when the code runs
  • Jar packages that need to be set to provided
    • Jar package whose groupId is org.apache.spark
    • Platform-specific Jar package
      • cupid-sdk
      • hadoop-yarn-client
      • odps-sdk
  • Jar packages that need to be set to compile
    • OSS-related jar packages
      • hadoop-fs-oss
    • Jar packages used by users to access other services:
      • Such as mysql, hbase
    • Third-party libraries that user code needs to reference
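A minimal pom.xml sketch of this scope split (the group IDs, artifact IDs and versions are illustrative assumptions; check the MaxCompute-Spark example project for the actual coordinates):

<dependencies>
  <!-- Spark and platform jars: provided, they already exist on the cluster -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>cupid-sdk</artifactId>
    <version>3.3.8-public</version>
    <scope>provided</scope>
  </dependency>
  <!-- Third-party libraries used by user code: compile, packaged into the fat jar -->
  <dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
    <scope>compile</scope>
  </dependency>
</dependencies>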

3. Need to import Python package

  • Users often need external Python dependencies
    • First of all, it is recommended to use the packaged public resources we provide, which include commonly used data processing and computation libraries, as well as third-party libraries for connecting to external services (mysql, redis, hbase)
## Public resource: Python 2.7.13
spark.hadoop.odps.cupid.resources = public.python-2.7.13-ucs4.tar.gz
spark.pyspark.python = ./public.python-2.7.13-ucs4.tar.gz/python-2.7.13-ucs4/bin/python

## Public resource: Python 3.7.9
spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
    • If these public resources do not meet the user's needs, wheel packages can be uploaded and used on top of the public resource
    • If the wheel package's dependency chain is complicated, packaging can be done with a Docker container
  • Use Docker container packaging:
    • To keep consistent with the online environment and avoid "cannot find .so" problems at runtime, a Docker container should be used for packaging
    • The Docker container essentially only provides a compatible OS environment; the user packages inside the container, compresses the entire Python directory, uploads it as a MaxCompute resource, and then references it directly in the Spark task
    • See documentation

4. Need to import external files

  • Scenarios that need to reference external files
    • The user job needs to read some configuration files
    • The user job requires additional jar packages or Python libraries
  • There are two ways to upload resources:
    • Upload files through Spark parameters
    • Upload files through MaxCompute Resource
  • Upload files through Spark parameters
    • MaxCompute Spark supports the Spark community's native parameters such as --jars and --py-files; files can be uploaded through these parameters when the job is submitted, and they are placed in the user's working directory when the task runs
    • Add resources required by tasks through DataWorks, see above
  • MaxCompute Resource
    • The spark.hadoop.odps.cupid.resources parameter can directly reference resources in MaxCompute, and these resources will also be uploaded to the user's working directory when the task is running
    • How to use

(1) Upload files through MaxCompute client (a single file supports up to 500MB)

(2) Add the spark.hadoop.odps.cupid.resources parameter in the Spark job configuration: the format is <projectname>.<resourcename>; to reference multiple files, separate them with commas

(3) If you need to rename, the format is <projectname>.<resourcename>:<new resource name>

  • How to read uploaded files:
    • If you need to read the uploaded file resources, the file path is as follows:
val dir = new File(".")
val targetFile = "file://" + dir.getCanonicalPath + "/" + fileName
    • Or obtain the file path directly through the class loader and then read it, as in the sketch below
    • See the reference documentation
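A sketch of the class-loader approach in Scala (the resource name is a placeholder, and it assumes the working directory is on the classpath, as the note above implies):

import scala.io.Source

// Files uploaded via --files / DataWorks resources / spark.hadoop.odps.cupid.resources
// end up in the working directory and can be located through the class loader.
val url = Thread.currentThread().getContextClassLoader.getResource("my_config.json")
require(url != null, "resource not found on the classpath")
val content = Source.fromURL(url, "UTF-8").mkString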

5. VPC access issues

  • MaxCompute Spark runs in the MaxCompute cluster and is network-isolated from the outside, so it cannot directly access VPCs or the public network; the following configuration needs to be added.
  • Beijing and Shanghai Region use smartnat
    • Need to configure
      • spark.hadoop.odps.cupid.vpc.domain.list
      • spark.hadoop.odps.cupid.smartnat.enable=true
    • Accessing the public network: to access google.com:443, for example, two steps are needed:
      • Set up a project-level whitelist by submitting a ticket, adding google.com:443 to odps.security.outbound.internetlist
      • Configure the job-level public network access whitelist: spark.hadoop.odps.cupid.internet.access.list = google.com:443
  • Other Regions:
    • Only need to configure spark.hadoop.odps.cupid.vpc.domain.list
    • Cannot access the public network
  • Precautions:
    • vpc.domain.list needs to be compressed into a single line and cannot contain spaces (an illustrative example follows this list)
    • Accessing multiple VPCs in the same region at the same time is supported; the whitelist must cover all ip:port pairs to be accessed
    • An IP whitelist needs to be added on the service side to allow access from the 100.104.0.0/16 network segment
    • The user must make sure that every IP that may be accessed is included in vpc.domain.list. For example, to access services that span multiple nodes, such as HDFS or HBase, all nodes must be added, otherwise timeouts may occur
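As an illustration of the shape of vpc.domain.list only (region, VPC ID, IPs and ports are placeholders; check the official documentation for the exact schema required in your region), the value is a JSON document compressed into one line without spaces:

{"regionId":"cn-beijing","vpcs":[{"vpcId":"vpc-xxxxxxxx","zones":[{"urls":[{"domain":"192.168.0.10","port":3306},{"domain":"192.168.0.11","port":3306}]}]}]}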

6. OOM problem

  • Possible OOM situations:
    • Error 1: "Cannot allocate memory" appears on some executors, which generally means the system (off-heap) memory is insufficient. Adjust the spark.yarn.executor.memoryOverhead parameter; note that it counts toward the total memory, so it does not need to be increased by too much at once, just tune it carefully
    • Error 2: the executor throws java.lang.OutOfMemoryError: Java heap space
    • Error 3: GC overhead limit exceeded
    • Error 4: No route to host: workerd*********/Could not find CoarseGrainedScheduler. Such errors usually mean some executors exited early; if a single task processes a very large amount of data, OOM is likely
  • Driver OOM: less likely than executor OOM, but still possible
    • If the collect operator is used to pull all the RDD data to the driver for processing, the driver's memory must be large enough, otherwise an OOM occurs
    • SparkContext and the DAGScheduler run on the driver side, and stage splitting also runs there. If the user program has too many steps and splits into too many stages, this information consumes the driver's memory; in that case the driver's memory needs to be increased, and with too many stages the driver side may even hit stack overflow problems
  • Some solutions (see the sketch below):
    • Limit the executor's parallelism by reducing its cores: tasks running at the same time share the executor's memory, which reduces the memory available to a single task; reducing the parallelism relieves memory pressure
    • Increase the memory of a single executor
    • Increase the number of partitions to reduce the amount of data processed by each task
    • Consider data skew: skew means one task does not have enough memory while the other tasks do
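A sketch of the first three mitigations expressed as configuration (the values are illustrative):

# Fewer concurrent tasks per executor -> more memory per task
spark.executor.cores = 1
# More heap and off-heap memory per executor
spark.executor.memory = 12g
spark.yarn.executor.memoryOverhead = 4096
# Smaller input splits -> more partitions, less data per task (unit: MB)
spark.hadoop.odps.input.split.size = 128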

7. No space left on device

  • This error means the local disk is insufficient. It usually appears on executors and causes them to fail
  • Solutions
    • Provide more disk space directly: by default the driver and each executor get a 20g local disk; when that is insufficient, adjust spark.hadoop.odps.cupid.disk.driver.device_size (example below)
    • If the error still occurs after raising the local disk to 100g, the shuffle data written by a single executor has exceeded the upper limit, which usually indicates data skew; in that case repartition the data, or increase the number of executors
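For example (the value must carry the unit 'g' and can be raised up to 100g):

spark.hadoop.odps.cupid.disk.driver.device_size = 50g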

8. Questions about applying for resources

  • Common symptoms of being unable to apply for resources:

(1) The following logs are generally printed on the driver side

    • WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

(2) Only the driver can be seen in logview, and the number of workers is 0

(3) Only the driver can be seen in spark ui, and the number of workers is 0

  • Solutions:
    • Adjust the task's resources: adjust the total number of executors requested or the resources of a single executor (usually memory); if a single executor requests too much memory, the request may be harder to grant
    • Arrange task execution times reasonably
  • Other considerations:
    • Must configure spark.master=yarn-cluster to apply for resources correctly

9. Other issues

  • How to switch Spark versions
    • Version numbering: for example, spark-2.3.0-odps0.32.5
      • spark-2.3.0 is the community Spark version number; MaxCompute Spark is adapted from the community version
      • odps0.32.5 is the MaxCompute Spark minor version number; minor version upgrades may include bug fixes and SDK upgrades
    • The Spark version used for a submitted job can be determined in the following ways:
      • Case 1: the task is submitted directly through the local client, and the Spark version is the version of the user's local client
      • Case 2: the task is submitted through DataWorks and depends on the default Spark version of the DataWorks gateway; the current default on the public-cloud DataWorks shared resource group gateway is spark-2.3.0-odps0.32.1
      • Case 3: the task is submitted through DataWorks with the parameter spark.hadoop.odps.spark.version configured; the corresponding Spark client is then looked up by the configured version number. The user can set spark.hadoop.odps.spark.version=spark-2.3.0-odps0.32.5 to switch versions manually
      • Case 4: this case has the highest priority. The user can configure the following parameters when submitting the task from the local client or DataWorks; since this classpath has the highest class-loading priority, this Spark version is used first when the Spark task starts

spark.hadoop.odps.cupid.resources = public.__spark_libs__2.3.0odps0.32.5.zip
spark.driver.extraClassPath = ./public.__spark_libs__2.3.0odps0.32.5.zip/*
spark.executor.extraClassPath = ./public.__spark_libs__2.3.0odps0.32.5.zip/*

  • Accessing configuration items in code:
    • Parameters whose names start with "spark." can be read directly through the interface provided by the SparkConf class, as in the sketch below
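A minimal Scala sketch (the key name spark.myapp.biz.flag is a hypothetical example; custom keys must start with "spark." to be picked up):

val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
// Read a configuration item passed via --conf or the DataWorks configuration items
val bizFlag = spark.sparkContext.getConf.get("spark.myapp.biz.flag", "default_value")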
  • Spark History Server rendering speed is slow
    • You can add compression configuration: spark.eventLog.compress=true
  • How to correctly kill a running Spark task
    • Usually there are two ways to kill running Spark tasks

(1) Execute kill + instanceId through odps cmd;

(2) Execute stop through the dataworks interface

    • Note that a running Spark task cannot be killed by simply pressing Ctrl + C in the submission interface of the spark client or DataWorks.
  • If Chinese characters in the logs are garbled, add the following configuration
    • spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
    • spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
    • If it is a pyspark job, you need to set the following two parameters:
      • spark.yarn.appMasterEnv.PYTHONIOENCODING=utf8
      • spark.executorEnv.PYTHONIOENCODING=utf8
      • In addition, add the following code at the beginning of the python script:
# -*- coding: utf-8 -*-
import sys
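# reload() and sys.setdefaultencoding() only exist in Python 2; Python 3 already defaults to UTF-8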
reload(sys)
sys.setdefaultencoding('utf-8')

 


This article is the original content of Alibaba Cloud and may not be reproduced without permission.
