Table of contents
Getting Started with Spark Basics
Spark environment construction
PySpark development environment construction
Python On Spark execution principle
Better reading experience: Introduction to PySpark Basics (1): Basic Concepts + Environment Construction - Nuggets (juejin.cn)
Getting Started with Spark Basics
Version: Spark 3.2.0
Feature: Improved support for Pandas API
Basic concepts of Spark
- Apache Spark is a unified analytics engine for large-scale data processing.
- Spark's core data structure is the Resilient Distributed Dataset (RDD), which supports in-memory computing across large-scale clusters.
- Spark evolved from the ideas of MapReduce: it keeps the advantages of distributed parallel computing while fixing MapReduce's obvious shortcomings. Keeping intermediate data in memory improves running speed, and a rich set of data-manipulation APIs improves development speed.
- How should "unified analytics engine" be understood?
  - Spark can run custom computations over any kind of data: structured, semi-structured, unstructured, and so on.
  - Spark applications can be developed, and data processed, in multiple languages, including Python, Java, Scala, R, and SQL, as the short example below illustrates.
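A minimal sketch of this "unified" idea: one SparkSession can work on the same data through both the Python DataFrame API and SQL (the column names and values below are illustrative assumptions, not taken from the article):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# Structured data handled by the same engine
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])

# The Python (DataFrame) API ...
df.filter(df.age > 30).show()

# ... and the equivalent query expressed in SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()
```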
Comparison between Spark and Hadoop
- At the computing level, Spark has a huge performance advantage over MR (MapReduce)
- Spark only handles computation, while the Hadoop ecosystem includes not only computation (MR) but also storage (HDFS) and resource management/scheduling (YARN). HDFS and YARN remain core components of many big data systems.
- Differences between how Spark and MapReduce process data:
  - Spark can keep intermediate results in memory while processing data.
  - Spark provides a very rich set of operators (APIs), so complex tasks can be completed within a single Spark program.
*Pros and cons of Hadoop's process-based computing vs. Spark's thread-based computing
In Hadoop MR, each map/reduce task runs as a Java process. The advantage is that the processes are independent of each other: each task has exclusive use of its process resources without interference, which makes monitoring easy. The drawbacks are that sharing data between tasks is inconvenient and execution efficiency is relatively low; for example, when multiple map tasks read different data source files, the data source must be loaded into each map task separately, causing repeated loading and wasted memory. Thread-based computing exists precisely to share data and improve execution efficiency: Spark uses the thread as its smallest execution unit, at the cost of possible resource contention between threads.
Spark features
- Fast: Spark supports in-memory computing and uses a DAG (directed acyclic graph) execution engine to support acyclic data flows
- Highly general:
  - On top of Spark, multiple tool libraries are provided, including Spark SQL, Spark Streaming, MLlib, and GraphX, and these libraries can be used seamlessly within one application
  - Spark supports a variety of run modes, including running on Hadoop YARN and Mesos, its own Standalone mode, and (since 2.3) Kubernetes in the cloud
  - Spark can read data from many sources, such as HDFS, HBase, Cassandra, and Kafka
Spark framework modules
Spark Core: the core of Spark. Spark's core functionality is provided by the Spark Core module, which is the foundation everything else runs on. Spark Core uses the RDD as its data abstraction, provides APIs in Python, Java, Scala, and R, and can be programmed to run large-scale offline batch processing.
Spark SQL: built on Spark Core, it provides a module for processing structured data and supports processing data with the SQL language. Spark SQL itself targets offline computing scenarios; on top of it, Spark provides the Structured Streaming module, which performs stream computing based on Spark SQL.
Spark Streaming: built on Spark Core, it provides streaming computation over data.
MLlib: built on Spark Core, it performs machine learning computations and ships with a large number of machine learning algorithms and APIs, making distributed machine learning convenient.
GraphX: built on Spark Core, it performs graph computation and provides a large number of graph-computing APIs, making distributed graph computation convenient.
Spark running modes
- Local mode (single machine): local mode uses a single process and simulates the entire Spark runtime environment with multiple threads inside that process
- Standalone mode (cluster): each Spark role runs as an independent process; together they form a Spark cluster environment
- Hadoop YARN mode (cluster): each Spark role runs inside a YARN container; together they form a Spark cluster environment
- Kubernetes mode (container cluster): each Spark role runs inside a Kubernetes container; together they form a Spark cluster environment
Spark architecture
Analogy with the YARN architecture:
YARN has 4 main types of roles:
- Resource management level
  - Cluster resource manager (Master): ResourceManager
  - Single-node resource manager (Worker): NodeManager
- Task computing level
  - Single-task manager (Master): ApplicationMaster
  - Single-task executor (Worker): Task (the job role of the computing framework inside the container)
The Spark architecture likewise consists of 4 types of roles:
- Resource management:
  - Master: manages the resources of the entire cluster
  - Worker: manages the resources of a single server
- Task computation:
  - Driver: manages a single Spark task at runtime (single-task manager)
  - Executor: the worker of a single Spark task at runtime (single-task executor)
Spark environment construction
To install Spark, Anaconda needs to be installed first.
It can be downloaded from the Tsinghua University mirror source:
It can also be downloaded from the official website:
Taking the mirror source as an example:
Since Python 3.8 is used, the Anaconda version downloaded is: Anaconda3-2021.05-Linux-x86_64.sh
After the download is complete, upload it to the Linux server
Then install it with: sh <installer path>/Anaconda3-2021.05-Linux-x86_64.sh
After the installation is complete, create the pyspark environment with: conda create -n pyspark python=3.8
Then activate the environment with: conda activate pyspark
Then install the required packages in the virtual environment: pip install pyhive pyspark jieba -i https://pypi.tuna.tsinghua.edu.cn/simple
The jieba package is a commonly used Chinese word-segmentation library in Python; it splits Chinese text into words
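A quick way to check that jieba works in the new environment (the sentence below is just an illustrative input; jieba.lcut returns the segmented words as a list):

```python
import jieba

# Segment a short Chinese sentence into words
print(jieba.lcut("大数据分析引擎"))  # e.g. ['大数据', '分析', '引擎']
```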
Commonly used conda commands are as follows:
Disable automatic activation of the default base environment: conda config --set auto_activate_base false
Create an environment: conda create -n env_name
List all environments: conda info --envs
List all packages installed in the current environment: conda list
Show information about a specific package installed in the current environment: conda list <package_name>
Delete an environment: conda remove -n env_name --all
Activate an environment: conda activate env_name
Exit the current environment: conda deactivate
After installing anaconda, install spark:
- Download the installation package (version 3.2): Index of /spark
- Unzip the installation package to the corresponding path:
tar -zxvf spark-3.2.0-bin-hadoop3.2.tgz -C /opt/module/
- The installation directory name is long; it can be renamed with mv: mv spark-3.2.0-bin-hadoop3.2 spark
- Configure the environment variables, for example in my_env.sh:
Among them, JAVA_HOME and HADOOP_HOME were already configured when installing Hadoop.
PYSPARK_PYTHON specifies the Python interpreter, i.e. the Anaconda environment installed above.
Note the difference between HADOOP_CONF_DIR and HADOOP_HOME here:
The HADOOP_CONF_DIR environment variable points to the directory containing Hadoop's configuration files. Hadoop has many configuration files, such as core-site.xml, hdfs-site.xml, and mapred-site.xml; they hold the cluster's configuration, such as the HDFS replication factor, block size, and the addresses of the NameNode and DataNodes. When Hadoop starts, it reads these files and uses the configuration in them.
If you want to change or use this configuration, HADOOP_CONF_DIR tells other tools where these files live.
Since Spark may run in Spark-on-YARN mode, it needs to read yarn-site.xml, so this path must be configured.
HADOOP_HOME, by contrast, is simply the Hadoop installation path.
local mode
Local mode starts one JVM process (a single process with multiple threads) in which the Tasks are executed.
The number of threads used to simulate the Spark cluster environment can be limited with Local[N] or Local[*].
Here N means N threads may be used, each thread taking one CPU core. If N is not specified, the default is 1 thread (with 1 core). Usually you specify as many threads as the CPU has cores to make full use of the machine's computing power.
Note that one Local process runs only one Spark program; running multiple Spark programs starts multiple independent Local processes.
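A minimal sketch of pinning the thread count from code (the app name and master value are illustrative; local[*] uses as many threads as there are CPU cores, local[4] would use 4):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("local-mode-demo")
sc = SparkContext(conf=conf)

# The whole job runs inside this single local JVM process
print(sc.parallelize(range(100)).sum())

sc.stop()
```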
run in local mode
1. bin/pyspark: provides an interactive Python interpreter environment in which you can write ordinary Python code as well as Spark code
The running interface is as follows:
- SparkContext is one of Spark's core components and the main entry point for communicating with the Spark cluster. It talks to the cluster manager in order to launch applications on the cluster, and it distributes the application's code and data to the cluster nodes. Prior to Spark 2.0, SparkContext was the main entry point for working with RDDs programmatically.
- SparkSession is a new concept introduced in Spark 2.0. It is the new entry point for accessing all Spark functionality and lets you interact with the various Spark features through a small number of constructs. It also brings new features such as the DataFrame and Dataset APIs, which make working with Spark easier and more intuitive (a short example follows below).
- Port 4040: while a Spark program is running, it is bound to port 4040 of the machine where the Driver runs; if port 4040 is occupied, it falls back to 4041, 4042, and so on.
Open ip:4040 to see the monitoring page:
Since this is local mode, there is only one Driver.
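The relationship between the two entry points can be sketched as follows: since Spark 2.0 you normally build a SparkSession first and obtain the underlying SparkContext from it (master and app name below are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("entry-points").getOrCreate()
sc = spark.sparkContext  # the classic RDD entry point

rdd = sc.parallelize([1, 2, 3])                            # RDD API via SparkContext
df = spark.createDataFrame([(1, "a")], ["id", "value"])    # DataFrame API via SparkSession
print(rdd.count(), df.count())

spark.stop()
```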
2. bin/spark-shell: uses the Scala language; listed here only for completeness
3. bin/spark-submit: submits the specified Spark code to run in the Spark environment
Example: bin/spark-submit /home/wuhaoyi/module/spark/examples/src/main/python/pi.py 10
(10 is the argument passed to the program)
The result is as follows:
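For reference, a simplified script in the same spirit as the bundled pi.py example is sketched below; it estimates pi by random sampling, and the command-line argument (the 10 above) sets the number of partitions. This is an illustrative sketch, not the exact file shipped with Spark:

```python
import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonPi").getOrCreate()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def inside(_):
        # Throw a random dart at the unit square; count it if it lands in the circle
        x, y = random() * 2 - 1, random() * 2 - 1
        return 1 if x * x + y * y <= 1 else 0

    count = spark.sparkContext.parallelize(range(n), partitions).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()
```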
pyspark/spark-shell/spark-submit comparison
Standalone mode
StandAlone mode is a complete Spark runtime environment:
The Master role lives in the Master process and the Worker role in the Worker processes; the Driver and Executors run inside Worker processes, which provide the resources they run on
Three processes of the StandAlone cluster:
- Master node, Master process: the Master role; manages the resources of the entire cluster and hosts the Drivers that run the various tasks
- Slave nodes, Worker processes: the Worker role; manages the resources of each machine and allocates them to run Executors (Tasks);
- History server, HistoryServer (optional): after a Spark Application finishes, its event log data is saved to HDFS; starting the HistoryServer lets you view information about past application runs
StandAlone cluster construction
Three Linux virtual machines are used; the Anaconda environment needs to be installed on all of them
The files that need to be configured are as follows (each machine needs to be configured):
① workers: configure the three worker nodes
# A Spark Worker will be started on each of the machines listed below.
slave1
master
slave3
② spark-env.sh:
# Set the JAVA installation directory
JAVA_HOME=/usr/java/default
# Hadoop related
# Hadoop configuration file directory, for reading files on HDFS and running against the YARN cluster
HADOOP_CONF_DIR=/home/wuhaoyi/module/hadoop/etc/hadoop
YARN_CONF_DIR=/home/wuhaoyi/module/hadoop/etc/hadoop
# Master related
# Tell Spark which machine the master runs on
export SPARK_MASTER_HOST=slave1
# Communication port of the Spark master
export SPARK_MASTER_PORT=7077
# Web UI port of the Spark master
SPARK_MASTER_WEBUI_PORT=8080
# Worker related
# Number of CPU cores available to each worker
SPARK_WORKER_CORES=56
# Memory available to each worker
SPARK_WORKER_MEMORY=100g
# Communication port of the worker
SPARK_WORKER_PORT=7078
# Web UI port of the worker
SPARK_WORKER_WEBUI_PORT=8081
# History server settings
# Store the history logs of Spark programs in the /sparklog folder on HDFS
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://slave1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true"
The /sparklog folder above needs to be created in HDFS manually
③ spark-defaults.conf:
# Enable Spark event logging
spark.eventLog.enabled true
# Path where Spark event logs are stored
spark.eventLog.dir hdfs://slave1:8020/sparklog/
# Whether to compress Spark event logs
spark.eventLog.compress true
cluster start
Start the history server: sbin/start-history-server.sh
(the process name shown by jps is HistoryServer)
Start all Masters and Workers: sbin/start-all.sh
Stop all Masters and Workers: sbin/stop-all.sh
Start the Master/Worker on the current node: sbin/start-master.sh / sbin/start-worker.sh
Stop the Master/Worker on the current node: sbin/stop-master.sh / sbin/stop-worker.sh
After starting the cluster, you can view the Master's Web UI: http://10.245.150.47:8080/
You can also view the history server: http://10.245.150.47:18080/
Click an App ID to view the detailed record of a Spark program's run
Connect to StandAlone cluster
--master spark://<ip address>:7077
(7077 is the configured master communication port)
Example: bin/pyspark --master spark://slave1:7077
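The master address can also be set when building the session inside a program, instead of on the command line (the host name slave1 follows the cluster configured above; this is a sketch of the idea, not a required setup):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://slave1:7077")   # the SPARK_MASTER_PORT configured in spark-env.sh
    .appName("standalone-demo")
    .getOrCreate()
)
print(spark.range(10).count())
spark.stop()
```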
Spark application architecture
Submit a program to Spark: bin/spark-submit --master spark://slave1:7077 /home/wuhaoyi/module/spark/examples/src/main/python/pi.py 10
Check the running status of the program:
You can see that when the Spark Application runs on the cluster, it consists of two parts: Driver Program and Executors
1.Driver Program
- Equivalent to the AppMaster: the manager of the entire application, responsible for scheduling and executing all Jobs within the application;
- It is a JVM process that runs the program's MAIN function and must create the SparkContext context object;
- There is only one Driver per Spark Application;
2.Executors
- Equivalent to a thread pool: a JVM process containing many threads, where each thread runs one Task and each Task needs 1 CPU core, so the number of threads in an Executor can be regarded as equal to its number of CPU cores;
- A Spark Application can have multiple Executors, and their number and resources can be configured;
*The full process of program submission and execution
- When the user program creates a SparkContext, the newly created SparkContext instance connects to the ClusterManager. Based on the CPU, memory, and other settings supplied by the user at submission, the Cluster Manager allocates computing resources for this application and starts the Executors.
- The Driver divides the user program into different execution stages; each stage consists of a set of identical Tasks, which operate on different partitions of the data to be processed. After the stages are divided and the Tasks created, the Driver sends the Tasks to the Executors;
- After an Executor receives its Tasks, it downloads the Tasks' runtime dependencies; once the Task execution environment is ready, it starts executing the Tasks and reports their status back to the Driver;
- The Driver handles status updates according to the Task states it receives. Tasks come in two kinds: Shuffle Map Tasks, which reshuffle the data and save the shuffled results to the file system of the node where the Executor runs; and Result Tasks, which produce the result data;
- The Driver keeps scheduling Tasks and sending them to the Executors, and stops when all Tasks have completed correctly or when the execution limit has been exceeded without success;
Spark program execution hierarchy
- A Spark Application contains multiple Jobs;
- Each Job consists of multiple Stages and is executed according to its DAG
- Each Stage contains multiple Tasks; each Task is executed as a thread and requires 1 CPU core
The following describes the three core concepts of the Spark Application program runtime:
- Job: a parallel computation made up of multiple Tasks; generally, one action operation in Spark (such as save or collect) generates one Job
- Stage: the building block of a Job. A Job is divided into multiple Stages that depend on one another and execute in sequence; each Stage is a collection of Tasks, similar to the map and reduce stages in MapReduce
- Task: the unit of work assigned to an Executor and the smallest execution unit in Spark. Generally, there are as many Tasks as there are partitions (a physical-level concept: the data is split into parts that are processed in parallel), and each Task processes the data of only one partition
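A small sketch tying the three concepts together: the action (collect) triggers one Job, the shuffle introduced by reduceByKey splits that Job into two Stages, and each Stage runs one Task per partition (4 partitions here, so 4 Tasks per Stage; all names and numbers are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "job-stage-task-demo")

rdd = sc.parallelize(["a b", "b c", "a c", "c c"], 4)        # 4 partitions
counts = (
    rdd.flatMap(lambda line: line.split())                   # narrow dependency: same Stage
       .map(lambda word: (word, 1))
       .reduceByKey(lambda x, y: x + y)                      # shuffle: starts a new Stage
)

print(counts.collect())   # collect() is the action that triggers the Job
sc.stop()
```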
Spark On YARN mode
Nature:
The Master role is assumed by YARN's ResourceManager
The Worker role is assumed by YARN's NodeManager
Configuration process:
- Configure the Hadoop cluster
- Configure the HADOOP_CONF_DIR environment variable, so that Spark can read the relevant configuration files at runtime:
Connect to YARN
bin/pyspark --master yarn --deploy-mode client|cluster
# --deploy-mode specifies the deploy mode; the default is client mode
# client means client mode
# cluster means cluster mode
# --deploy-mode can only be used in YARN mode
Note: pyspark and spark-shell cannot run in cluster mode;
bin/spark-submit --master yarn --deploy-mode client|cluster /xxx/xxx/xxx.py <arguments>
spark-submit can run in cluster mode
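With HADOOP_CONF_DIR/YARN_CONF_DIR set, a PySpark program can also point at YARN when building its session; this runs in client deploy mode, since cluster mode has to go through spark-submit as noted above (a sketch under those assumptions, not a required setup):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")                  # resources come from YARN's ResourceManager/NodeManagers
    .appName("yarn-client-demo")
    .getOrCreate()
)
print(spark.range(100).count())
spark.stop()
```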
The difference between the two DeployModes
Where the Driver runs is different:
- Cluster mode: the Driver runs inside a YARN container, in the same container as the ApplicationMaster
- Client mode: the Driver runs in the client process, for example inside the spark-submit process
Cluster mode:
Client mode:
Usage scenarios of the two DeployModes
Client mode: used for learning and testing; not recommended for production (if used, performance and stability are slightly lower)
1. The Driver runs on the client, so communication with the cluster is expensive
2. The Driver's output is displayed on the client
Cluster mode : use this mode in the production environment
1. The Driver program runs inside the YARN cluster, so communication with the cluster is cheap
2. The Driver's output cannot be viewed on the client
3. In this mode, the Driver runs on the ApplicationMaster node and is managed by YARN; if a problem occurs, YARN restarts the ApplicationMaster (Driver)
Detailed execution process of the two DeployModes
Client:
1) The Driver runs on the local machine where the task is submitted. After the Driver starts, it will communicate with the ResourceManager to apply for starting the ApplicationMaster;
2) The ResourceManager then allocates a Container and starts the ApplicationMaster on a suitable NodeManager. In this mode the ApplicationMaster acts as an ExecutorLauncher: it is only responsible for applying to the ResourceManager for Executor resources and for starting the Executors
3) After the ResourceManager receives the resource application from the ApplicationMaster, it will allocate the Container, and then the ApplicationMaster will start the Executor process on the NodeManager specified by the resource allocation;
4) After the Executor processes start, they register back with the Driver; once all Executors have registered, the Driver starts executing the main function;
5) Later, when an Action operator is executed, a Job is triggered and Stages are divided according to wide dependencies. Each Stage generates a corresponding TaskSet, and the Tasks are then distributed to the Executors for execution.
In Client mode, since the Driver runs on the local machine, Spark task scheduling is done from that machine, so communication efficiency is relatively low
Cluster:
1) After the task is submitted, it will communicate with the ResourceManager to apply for starting the ApplicationMaster
2) The ResourceManager then allocates a Container and starts the ApplicationMaster on a suitable NodeManager; in this mode the ApplicationMaster is the Driver
3) After the Driver starts, apply for Executor memory from the ResourceManager, and the ResourceManager will allocate the Container after receiving the resource application from the ApplicationMaster, and then start the Executor process on the appropriate NodeManager
4) After the Executor process starts, it will reversely register with the Driver
5) After all Executors are registered, the Driver starts to execute the main function, and then when the Action operator is executed, a job is triggered, and stages are divided according to wide dependencies. Each stage generates a corresponding taskSet, and then the task is distributed to each Executor for execution
PySpark development environment construction
PySpark is the Python library officially provided by Spark. It contains the complete Spark API; with it you can write Spark applications and submit them to run on a Spark cluster.
Environment construction steps:
1. Install the Windows anaconda environment:
Download address: Index of /anaconda/archive/ | Tsinghua University Open Source Software Mirror Station | Tsinghua Open Source Mirror
Run the installer after downloading; the installation path can be chosen during the process, and the remaining options can be left unchecked
After the installation is complete, open the Anaconda Prompt program
A prompt showing base indicates that the installation was successful:
2. Configure the domestic mirror source:
Open Anaconda Prompt and enter: conda config --set show_channel_urls yes
The purpose of this setting is to show which channel each package is installed from
Then find the file C:\Users\<username>\.condarc and replace its contents with the following:
channels:
- defaults
show_channel_urls: true
default_channels:
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
3. Create a virtual environment:
# Create the virtual environment pyspark, based on Python 3.8
conda create -n pyspark python=3.8
# Switch into the virtual environment
conda activate pyspark
# Install packages inside the virtual environment
pip install pyhive pyspark jieba -i https://pypi.tuna.tsinghua.edu.cn/simple
4. Install pyspark: pip install pyspark -i https://pypi.tuna.tsinghua.edu.cn/simple
5. Configure the Hadoop patch file in Windows:
- Copy hadoop.dll from the bin folder to C:\Windows\System32
- Configure the HADOOP_HOME environment variable to point to the Hadoop patch folder
Download address:
mirrors / cdarlint / winutils · GitCode
or:
The required file content is as follows:
6. Configure the local interpreter in PyCharm
File->Settings->Python Interpreter
Click Add Interpreter and select Conda Interpreter:
It will then automatically load the environments already created in conda; if not, choose Load Environments in the upper right corner to load them manually;
After that select pyspark:
Click OK;
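A small script can be used to verify the Windows setup from PyCharm. The two paths below are illustrative assumptions: PYSPARK_PYTHON should point at the python.exe of your own pyspark conda environment, and HADOOP_HOME at the folder whose bin contains winutils.exe and hadoop.dll:

```python
import os
from pyspark import SparkConf, SparkContext

# Illustrative paths; replace them with your own locations
os.environ["PYSPARK_PYTHON"] = r"C:\ProgramData\Anaconda3\envs\pyspark\python.exe"
os.environ["HADOOP_HOME"] = r"D:\hadoop-winutils"

conf = SparkConf().setMaster("local[*]").setAppName("windows-check")
sc = SparkContext(conf=conf)
print(sc.parallelize(["hello pyspark"]).flatMap(lambda s: s.split()).collect())
sc.stop()
```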
7. Configure the Linux interpreter via SSH
The local interpreter performs worse and cannot handle some memory-intensive operations, so a Linux (remote) interpreter can also be configured:
Python On Spark execution principle
PySpark wraps a layer of Python API around the Spark architecture without changing Spark's existing runtime architecture. It uses Py4j to bridge between Python and Java, which is what makes it possible to write Spark applications in Python
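To make the Py4j bridge a little more concrete, the snippet below calls into the Driver's JVM from Python through the gateway that PySpark opens; _jvm is an internal handle, shown here purely as an illustration of the Python-to-Java interaction:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "py4j-demo")

# A Java method invoked from Python through the Py4j gateway
print(sc._jvm.java.lang.System.getProperty("java.version"))

sc.stop()
```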