Introduction to PySpark Basics (1): Basic Concepts + Environment Construction

Table of contents

Getting Started with Spark Basics

Basic concepts of Spark

Spark architecture

Spark environment construction

Local mode

Standalone mode

Spark On YARN mode

PySpark development environment construction

Python On Spark execution principle 

 

Better reading experience: Introduction to PySpark Basics (1): Basic Concepts + Environment Construction - Nuggets (juejin.cn)

 

Getting Started with Spark Basics

Version: Spark 3.2.0

Feature: Improved support for Pandas API

Basic concepts of Spark

  • Apache Spark is a unified analytics engine for large-scale data processing
    • Spark's core data structure is the Resilient Distributed Dataset (RDD), which supports in-memory computing on large-scale clusters
    • Spark grew out of the ideas of MapReduce: it retains the advantages of distributed parallel computing while fixing its obvious defects. Keeping intermediate data in memory increases processing speed, and the rich APIs for manipulating data increase development speed
  • How to understand "unified analytics engine"?
    • Spark can perform custom computations on any type of data, such as structured, semi-structured, and unstructured data;
    • Spark supports multiple languages, such as Python, Java, Scala, R, and SQL, for developing applications and processing data

Comparison between Spark and Hadoop

  • At the computing level, Spark has a huge performance advantage over MR (MapReduce)
  • Spark only does computation, while the Hadoop ecosystem includes not only computation (MR) but also storage (HDFS) and resource management/scheduling (YARN); HDFS and YARN are still core components of many big data systems
  • Differences between how Spark and MapReduce process data:
    • When Spark processes data, it can keep intermediate results in memory
    • Spark provides a very rich set of operators (APIs), so complex tasks can be completed within a single Spark program

*Advantages and disadvantages of Hadoop's process-based computing vs. Spark's thread-based computing

In Hadoop, each map/reduce task in MR runs as a Java process. The advantage is that the processes are independent of each other: each task has exclusive use of its process resources without interference, which makes monitoring easy. The problem is that sharing data between tasks is inconvenient and execution efficiency is relatively low. For example, when multiple map tasks read different data source files, the data source has to be loaded into each map task separately, causing repeated loading and wasted memory. Thread-based computing exists precisely to share data and improve execution efficiency: Spark uses the thread as its smallest execution unit, but the drawback is that threads within the same process may compete for resources.

Spark features

  1. Fast: Spark supports in-memory computing and supports acyclic data flows through its DAG (directed acyclic graph) execution engine
  2. Highly versatile:
    1. On top of Spark Core, Spark also provides multiple tool libraries, including Spark SQL, Spark Streaming, MLlib, and GraphX, and these libraries can be used seamlessly within one application
    2. Spark supports a variety of running modes, including running on Hadoop YARN and Mesos, its own Standalone mode, and Kubernetes (since 2.3)
    3. Spark supports reading data from multiple sources such as HDFS, HBase, Cassandra, and Kafka

Spark framework modules

Spark Core: the core of Spark. Spark's core functionality is provided by the Spark Core module, which is the foundation that Spark runs on. Spark Core uses the RDD as its data abstraction, provides APIs in Python, Java, Scala, and R, and can be used to program batch processing of massive offline data.

SparkSQL: based on Spark Core, it provides a processing module for structured data. SparkSQL supports processing data with the SQL language and itself targets offline computing scenarios. On top of SparkSQL, Spark also provides the StructuredStreaming module, which performs stream computing based on SparkSQL.

SparkStreaming: based on Spark Core, it provides streaming computation over data.

MLlib: based on Spark Core, it performs machine learning computation, with a large number of built-in machine learning libraries and algorithm APIs, making it convenient to do machine learning in a distributed computing mode.

GraphX: based on Spark Core, it performs graph computation and provides a large number of graph computing APIs, making distributed graph computation convenient.
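To make the relationship between these modules concrete, here is a minimal sketch showing Spark Core (RDD) and SparkSQL (DataFrame) sharing one session. It assumes only a locally installed pyspark package; the data and application name are made up for illustration:

from pyspark.sql import SparkSession

# One SparkSession gives access to Spark Core (RDDs) and Spark SQL (DataFrames)
spark = SparkSession.builder.master("local[*]").appName("modules-demo").getOrCreate()
sc = spark.sparkContext

# Spark Core: RDD API
rdd = sc.parallelize([("alice", 30), ("bob", 25)])
print(rdd.map(lambda kv: kv[1]).sum())  # 55

# Spark SQL: DataFrame API on the same session
df = spark.createDataFrame(rdd, ["name", "age"])
df.filter(df.age > 26).show()

spark.stop()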

Spark running modes

  1. Local mode (single machine): a single independent process simulates the entire Spark runtime environment using multiple threads inside that process
  2. Standalone mode (cluster): each Spark role exists as an independent process, and together they form a Spark cluster environment
  3. Hadoop YARN mode (cluster): each Spark role runs inside YARN containers, and together they form a Spark cluster environment
  4. Kubernetes mode (container cluster): each Spark role runs inside Kubernetes containers, and together they form a Spark cluster environment

Spark architecture

Analogy with the YARN architecture:

YARN mainly has 4 types of roles:

  • Resource management level
    • Cluster resource manager (Master): ResourceManager
    • Single-machine resource manager (Worker): NodeManager
  • Task computing level
    • Single task manager (Master): ApplicationMaster
    • Single task executor (Worker): Task (the computing framework's worker role running inside a container)

The Spark architecture likewise consists of 4 types of roles:

  • Resource management:
    • Master: manages the resources of the entire cluster
    • Worker: manages the resources of a single server
  • Task computation:
    • Driver: manages the runtime work of a single Spark task (single task manager)
    • Executor: a worker of a single task at runtime (single task executor)

Spark environment construction

Before installing Spark, you first need to install Anaconda.

It can be downloaded from the mirror source of Tsinghua University:

Index of /anaconda/archive/ | Tsinghua University Open Source Software Mirror Station | Tsinghua Open Source Mirror

You can also download it from the official website:

Free Download | Anaconda

Take the mirror source as an example:

Since Python 3.8 is used, the Anaconda installer to download is: Anaconda3-2021.05-Linux-x86_64.sh

After the download is complete, upload it to the Linux server.

Then install it by running: sh <installation package path>/Anaconda3-2021.05-Linux-x86_64.sh

After the installation is complete, create the pyspark environment: conda create -n pyspark python=3.8
Then activate it with: conda activate pyspark

Then install the required packages in the virtual environment: pip install pyhive pyspark jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

The jieba package is a commonly used Chinese word segmentation library in Python; its job is to split Chinese text into words.
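For a quick sense of what jieba does, a one-line illustration (the sentence and the exact segmentation shown are only an example):

import jieba

# lcut splits a Chinese sentence into a Python list of words
print(jieba.lcut("我爱自然语言处理"))  # e.g. ['我', '爱', '自然语言', '处理']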


Commonly used conda commands are as follows:

Disable activation of the default base environment:

conda config --set auto_activate_base false

Create an environment: conda create -n env_name

View all environments: conda info --envs

View all packages installed in the current environment: conda list

View information about a specific package installed in the current environment: conda list <package_name>

Delete an environment: conda remove -n env_name --all

Activate an environment: conda activate env_name

Exit the current environment: conda deactivate


After installing Anaconda, install Spark:

  1. Download the installation package (version 3.2): Index of /spark
  2. Unzip the installation package to the corresponding path: tar -zxvf spark-3.2.0-bin-hadoop3.2.tgz -C /opt/module/
  3. The installation path name is quite long, so you can rename it with mv: mv spark-3.2.0-bin-hadoop3.2 spark
  4. Configure the environment variables, for example in my_env.sh:

Among them, JAVA_HOME and HADOOP_HOME were already configured when installing Hadoop.

PYSPARK_PYTHON configures the Python executable, i.e. the Anaconda environment we installed.

Here you need to pay attention to the difference between HADOOP_CONF_DIR and HADOOP_HOME:

The HADOOP_CONF_DIR environment variable is the Hadoop configuration directory; it points to the directory where the Hadoop configuration files are located. Hadoop has many configuration files, such as core-site.xml, hdfs-site.xml, and mapred-site.xml. These files contain the various configuration settings of the Hadoop cluster, such as the HDFS replication factor, block size, and the addresses of the NameNode and DataNodes. When Hadoop starts, it reads these configuration files and uses the settings in them.

If you want to change or use these configuration settings, you can use the HADOOP_CONF_DIR environment variable to specify the directory where the files are located.

Since Spark may run in Spark on YARN mode, yarn-site.xml needs to be readable, so this path must be configured.

HADOOP_HOME, on the other hand, is the installation path of Hadoop.

Local mode

Local mode starts one JVM process (a single process with multiple threads inside) and executes Tasks in it.

Local mode can limit the number of threads used to simulate a Spark cluster environment, written as Local[N] or Local[*].

Here N means that N threads can be used, each thread taking one CPU core. If N is not specified, the default is 1 thread (with 1 core). Usually you specify as many threads as the CPU has cores to make full use of the machine's computing power.

Note that a Local-mode process can only run one Spark program; if multiple Spark programs are launched, they run as multiple independent Local processes.

Running in local mode

1. bin/pyspark: provides an interactive Python interpreter environment in which you can write ordinary Python code as well as Spark code

The running interface is as follows:

  • SparkContext is one of Spark's core components and the main entry point for communicating with a Spark cluster. The SparkContext is responsible for communicating with the cluster manager in order to launch applications on the cluster. It also distributes the application's code and data to the nodes of the cluster. Prior to Spark 2.0, the SparkContext was the main entry point for working with RDDs programmatically.
  • SparkSession is a new concept introduced in Spark 2.0. It is the new entry point for accessing all Spark functionality and provides a way to work with the various Spark features using a small number of constructs. It also brings new features such as the DataFrame and Dataset APIs, which make working with Spark easier and more intuitive (see the sketch after this list).
  • Port 4040: every running Spark program binds to port 4040 on the machine where the Driver is located; if port 4040 is occupied, it moves on to 4041, 4042, and so on.
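The same two entry points are available when writing a standalone script instead of using the bin/pyspark shell. A minimal sketch (the application name is made up; local[*] follows the Local[N] notation described above):

from pyspark.sql import SparkSession

# SparkSession is the unified entry point since Spark 2.0;
# local[*] uses as many threads as the machine has CPU cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("entry-points-demo") \
    .getOrCreate()

# The older SparkContext entry point is still reachable through the session
sc = spark.sparkContext
print(sc.master)     # local[*]
print(sc.uiWebUrl)   # URL of the 4040 (or 4041, 4042...) monitoring page

spark.stop()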

Open ip:4040 to see the monitoring page:

Since this is local mode, there is only one Driver.


2. bin/spark-shell: uses the Scala language; mentioned here only for completeness

3. bin/spark-submit: Submit the specified Spark code to run in the Spark environment

Use the sample code: bin/spark-submit /home/wuhaoyi/module/spark/examples/src/main/python/pi.py 10 (the trailing 10 is an argument passed to the program)

The result is as follows:
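For reference, the bundled pi.py estimates π with a Monte Carlo simulation; the following is a simplified sketch of the same idea, not the exact file shipped with Spark:

import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()
sc = spark.sparkContext

partitions = 10            # corresponds to the "10" argument passed to spark-submit
n = 100000 * partitions

def inside(_):
    # Sample a random point in the square [-1, 1] x [-1, 1] and test whether it lies in the unit circle
    x, y = random.random() * 2 - 1, random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(n), partitions).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()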

pyspark/spark-shell/spark-submit comparison

Standalone mode

StandAlone is a complete Spark runtime environment:

The Master role lives in the Master process; the Worker role lives in the Worker processes; Driver and Executor run inside the Worker processes, which provide the resources for them to run.

Three processes of the StandAlone cluster:

  • Master node, Master process: the Master role manages the resources of the entire cluster and hosts the Drivers of the running tasks
  • Slave nodes, Worker processes: the Worker role manages the resources of each machine and allocates resources to run Executors (Tasks);
  • History server, HistoryServer (optional): after a Spark application finishes running, its event log data is saved to HDFS, and the HistoryServer can be started to view information about the application run

StandAlone cluster construction

Three Linux virtual machines are used, and the Anaconda environment must be installed on all of them.

The files that need to be configured are as follows (each machine needs to be configured):

workers: Configure three worker nodes

# A Spark Worker will be started on each of the machines listed below.
slave1
master
slave3

spark-env.sh

# Set the JAVA installation directory
JAVA_HOME=/usr/java/default

# Hadoop related
# HADOOP configuration file directory, for reading files on HDFS and running on the YARN cluster
HADOOP_CONF_DIR=/home/wuhaoyi/module/hadoop/etc/hadoop
YARN_CONF_DIR=/home/wuhaoyi/module/hadoop/etc/hadoop

# Master related
# Tell Spark which machine the master runs on
export SPARK_MASTER_HOST=slave1
# Tell Spark the master's communication port
export SPARK_MASTER_PORT=7077
# Tell Spark the master's web UI port
SPARK_MASTER_WEBUI_PORT=8080

# Worker related
# Number of CPU cores available to each worker
SPARK_WORKER_CORES=56
# Memory available to each worker
SPARK_WORKER_MEMORY=100g
# Worker communication port
SPARK_WORKER_PORT=7078
# Worker web UI port
SPARK_WORKER_WEBUI_PORT=8081

# History server settings
# Store the history logs of Spark program runs in the /sparklog folder on HDFS
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://slave1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true"

The /sparklog folder above needs to be created in HDFS by yourself beforehand.

spark-defaults.conf

# Enable Spark event logging
spark.eventLog.enabled  true
# Set the path where Spark event logs are written
spark.eventLog.dir       hdfs://slave1:8020/sparklog/
# Whether to compress Spark event logs
spark.eventLog.compress         true

Cluster startup

Start the history server: sbin/start-history-server.sh

Its jps name is HistoryServer

Start all masters and workers: sbin/start-all.sh

Shut down all masters and workers: sbin/stop-all.sh

Start the master/worker on the current node: sbin/start-master.sh / sbin/start-worker.sh

Shut down the master/worker on the current node: sbin/stop-master.sh / sbin/stop-worker.sh

After starting the cluster, you can view the Master's web UI: http://10.245.150.47:8080/

You can also view the history server: http://10.245.150.47:18080/

Click an App ID to view the detailed record of a Spark program run

Connect to StandAlone cluster

--master spark://<ip address>:7077 (7077 is the communication port configured for the master)

Example: bin/pyspark --master spark://slave1:7077
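The same cluster can also be targeted from inside a Python script rather than from bin/pyspark; a small sketch (slave1:7077 is the master address configured above, the application name is made up):

from pyspark import SparkConf, SparkContext

# Point the application at the StandAlone master instead of local mode
conf = SparkConf().setMaster("spark://slave1:7077").setAppName("standalone-demo")
sc = SparkContext(conf=conf)

print(sc.parallelize(range(100)).sum())  # 4950, computed on the cluster's Executors

sc.stop()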

Spark application architecture

Submit a program to Spark: bin/spark-submit --master spark://slave1:7077 /home/wuhaoyi/module/spark/examples/src/main/python/pi.py 10

Check the running status of the program:

You can see that when the Spark Application runs on the cluster, it consists of two parts: Driver Program and Executors

1. Driver Program

  • Equivalent to the AppMaster: the manager of the entire application, responsible for scheduling and executing all Jobs in the application;
  • Runs as a JVM process and runs the program's MAIN function; it must create the SparkContext context object;
  • There is only one Driver per Spark Application;

2. Executors

  • Equivalent to a thread pool: each Executor is a running JVM process containing many threads; each thread runs one Task, and a Task needs 1 CPU core to run, so the number of threads in an Executor can be considered equal to its number of CPU cores;
  • A Spark Application can have multiple Executors, and their number and resources can be configured (see the sketch after this list);
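As noted in the last bullet, the number of Executors and their resources are configurable when the application is created. A hedged sketch using standard Spark configuration properties (the values are purely illustrative and should be adapted to your cluster):

from pyspark.sql import SparkSession

# spark.executor.memory : memory per Executor
# spark.executor.cores  : CPU cores per Executor (one Task per core)
# spark.cores.max       : total cores the application may use on a StandAlone cluster
spark = (SparkSession.builder
         .master("spark://slave1:7077")
         .appName("resource-demo")
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         .config("spark.cores.max", "8")
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)
spark.stop()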

*The whole process of program submission and operation

  1. When the user program creates a SparkContext, the newly created SparkContext instance connects to the Cluster Manager. The Cluster Manager allocates computing resources for this submission according to the CPU, memory, and other settings specified by the user at submit time, and starts the Executors.
  2. The Driver divides the user program into different execution stages; each stage consists of a set of identical Tasks, each acting on a different partition of the data to be processed. After stage division is complete and the Tasks are created, the Driver sends the Tasks to the Executors;
  3. After an Executor receives Tasks, it downloads the Tasks' runtime dependencies; once the Task execution environment is ready, it starts executing the Tasks and reports their status back to the Driver;
  4. The Driver handles status updates according to the received Task states. Tasks come in two kinds: Shuffle Map Tasks, which re-shuffle data and save the shuffled results to the file system of the node where the Executor is located; and Result Tasks, which are responsible for producing the result data;
  5. The Driver keeps dispatching Tasks to Executors for execution, and stops when all Tasks have executed correctly or when a Task still fails after exceeding the execution limit;

Spark program execution hierarchy

  1. A Spark Application contains multiple Jobs;

  2. Each Job consists of multiple Stages, and each Job is executed according to its DAG graph

  3. Each Stage contains multiple Tasks, and each Task executes as a thread (Thread), requiring 1 CPU core

The following describes the three core concepts of a Spark Application at runtime:

  1. Job: consists of multiple Tasks computed in parallel. Generally, an action operation in Spark (such as save or collect) generates one Job
  2. Stage: the building block of a Job. A Job is divided into multiple Stages, which depend on each other and are executed in sequence; each Stage is a collection of multiple Tasks, similar to the map and reduce stages
  3. Task: the unit of work assigned to each Executor and the smallest execution unit in Spark. Generally speaking, there are as many Tasks as there are Partitions (a physical-level concept: the data is split into parts so that they can be processed in parallel), and each Task processes the data of only a single partition
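A small word-count sketch makes the three levels visible (the input data is made up; the two stages come from the shuffle introduced by reduceByKey):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("job-stage-task-demo")
sc = SparkContext(conf=conf)

words = sc.parallelize(["a b a", "b c"], 2)   # 2 partitions -> 2 Tasks per Stage

# Transformations only build the DAG; nothing runs yet
pairs = words.flatMap(lambda line: line.split()).map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)  # reduceByKey introduces a shuffle

# collect() is an action: it triggers 1 Job with 2 Stages
# (map side before the shuffle, reduce side after), each Stage running 2 Tasks
print(counts.collect())

sc.stop()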

Spark On YARN mode

In essence:

The Master role is assumed by YARN's ResourceManager

The Worker role is assumed by YARN's NodeManager

Configuration process:

  • Configure the Hadoop cluster
  • Configure the HADOOP_CONF_DIR environment variable so that the relevant Hadoop configuration files can be read when Spark runs:

Connect to YARN

bin/pyspark --master yarn --deploy-mode client|cluster
# The --deploy-mode option specifies the deploy mode; the default is client mode
# client means client mode
# cluster means cluster mode
# --deploy-mode can only be used in YARN mode

Note: pyspark and spark-shell cannot run cluster mode;

bin/spark-submit --master yarn --deploy-mode client|cluster /xxx/xxx/xxx.py <arguments>

spark-submit can run in cluster mode
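In client mode the same connection can also be made from inside a Python program; a hedged sketch (it assumes HADOOP_CONF_DIR/YARN_CONF_DIR are set as described earlier, and the application name is made up). Cluster mode cannot be started this way and has to go through spark-submit:

from pyspark.sql import SparkSession

# Client-mode connection to YARN: the Driver stays inside this Python process
spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-client-demo")
         .getOrCreate())

print(spark.sparkContext.parallelize(range(10)).count())  # 10, computed on YARN Executors
spark.stop()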

The difference between the two DeployModes

The location where the Driver runs is different:

  • Cluster mode: the Driver runs inside a YARN container, in the same container as the ApplicationMaster
  • Client mode: the Driver runs in the client process, for example inside the process of the spark-submit program

Cluster mode:

Client mode:

Usage scenarios of the two DeployModes

Client mode: used for learning and testing; not recommended for production (if you do use it, performance and stability are slightly lower)

1. The Driver runs on the Client, so the communication cost with the cluster is high

2. The Driver's output results are displayed on the client

Cluster mode: use this mode in production environments

1. The Driver program is inside the YARN cluster, so the communication cost with the cluster is low

2. The Driver's output results are not displayed on the client

3. In this mode the Driver runs on the ApplicationMaster node and is managed by YARN; if a problem occurs, YARN will restart the ApplicationMaster (Driver)

Detailed run flow of the two DeployModes

Client:

1) The Driver runs on the local machine from which the task is submitted. After the Driver starts, it communicates with the ResourceManager to request starting the ApplicationMaster;

2) The ResourceManager then allocates a Container and starts the ApplicationMaster on an appropriate NodeManager. In this mode the ApplicationMaster acts essentially as an ExecutorLauncher: it is only responsible for applying to the ResourceManager for Executor resources and for starting the Executors;

3) After the ResourceManager receives the resource request from the ApplicationMaster, it allocates Containers, and the ApplicationMaster then starts Executor processes on the NodeManagers specified by the resource allocation;

4) After the Executor processes start, they register themselves back with the Driver; once all Executors have registered, the Driver starts executing the main function;

5) Later, when an Action operator is executed, a Job is triggered and Stages are divided according to wide dependencies. Each Stage generates a corresponding TaskSet, and the Tasks are then distributed to the Executors for execution.

In Client mode, since the Driver runs on the local machine, the scheduling of Spark tasks is done by that machine, so communication efficiency is relatively low.

Cluster:

1) After the task is submitted, it communicates with the ResourceManager to request starting the ApplicationMaster

2) The ResourceManager then allocates a Container and starts the ApplicationMaster on an appropriate NodeManager. In this mode the ApplicationMaster is the Driver

3) After the Driver starts, it applies to the ResourceManager for Executor resources. After the ResourceManager receives the request from the ApplicationMaster, it allocates Containers, and Executor processes are then started on appropriate NodeManagers

4) After the Executor processes start, they register themselves back with the Driver

5) Once all Executors have registered, the Driver starts executing the main function. When an Action operator is executed, a Job is triggered and Stages are divided according to wide dependencies; each Stage generates a corresponding TaskSet, and the Tasks are then distributed to the Executors for execution

PySpark development environment construction

PySpark is a Python library officially provided by Spark. It has the complete Spark API built in; you can use the PySpark library to write Spark applications and submit them to run on a Spark cluster.

Environment construction steps:

1. Install the Anaconda environment on Windows:

Download address: Index of /anaconda/archive/ | Tsinghua University Open Source Software Mirror Station | Tsinghua Open Source Mirror

You can run the installer directly after downloading; you can specify the installation path during the installation, and the remaining options can be left at their defaults.

After the installation is complete, open the Anaconda Prompt program.

A prompt showing base indicates that the installation was successful:

2. Configure the domestic mirror source:

Open Anaconda Prompt

Enter: conda config --set show_channel_urls yes

The purpose of this setting is to display the installation source (channel) of a package when installing it.

Then find the file C:\Users\<username>\.condarc and replace its original content with the following:

channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud

3. Create a virtual environment:

# Create the virtual environment pyspark, based on Python 3.8
conda create -n pyspark python=3.8

# Switch into the virtual environment
conda activate pyspark

# Install packages inside the virtual environment
pip install pyhive pyspark jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

4. Install pyspark: pip install pyspark -i https://pypi.tuna.tsinghua.edu.cn/simple

5. Configure the Hadoop patch file in Windows:

  • Copy hadoop.dll from the bin folder to: C:\Windows\System32
  • Configure the HADOOP_HOME environment variable to point to the path of the Hadoop patch folder

Download address:
mirrors / cdarlint / winutils · GitCode

or:

GitHub - steveloughran/winutils: Windows binaries for Hadoop versions (built from the git commit ID used for the ASF relase)

The required file content is as follows:

6. Configure the local interpreter in pycharm

File->Settings->Python Interpreter

Click Add Interpreter and select Conda Interpreter:

Then it will automatically load the environments already created in conda; if it does not, you can click Load Environments in the upper right corner to load them manually;

After that, select pyspark:

Click OK;
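After clicking OK, the interpreter can be verified with a tiny local-mode script. A hedged sketch: the two Windows paths below are hypothetical examples, only needed if Spark does not pick up HADOOP_HOME and PYSPARK_PYTHON from the system environment; point them at your own winutils folder and conda environment:

import os

from pyspark import SparkConf, SparkContext

# Hypothetical example paths - adjust them to your own machine, or remove these
# two lines if the environment variables are already configured system-wide
os.environ.setdefault("HADOOP_HOME", r"D:\hadoop-3.2.0")  # folder containing bin\winutils.exe and bin\hadoop.dll
os.environ.setdefault("PYSPARK_PYTHON", r"D:\anaconda3\envs\pyspark\python.exe")

conf = SparkConf().setMaster("local[*]").setAppName("env-check")
sc = SparkContext(conf=conf)
print(sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2).collect())  # [2, 4, 6, 8]
sc.stop()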

7. Configure the Linux interpreter via SSH

The local interpreter is slower, and some memory-intensive operations cannot be completed with it, so you can configure a remote Linux interpreter instead:

Python On Spark execution principle

PySpark wraps a Python API around the Spark architecture without changing Spark's existing runtime architecture. With the help of Py4j it enables interaction between Python and Java, which in turn makes it possible to write Spark applications in Python.
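A small illustration of that Py4j bridge (sc._jvm is an internal attribute, shown here only to make the mechanism visible; it should not be relied on in application code):

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("py4j-demo"))

# Every PySpark driver holds a Py4j gateway to a JVM running the Scala/Java Spark core;
# Python calls are translated into calls on JVM objects through this gateway.
print(type(sc._jvm))                                  # py4j JVMView
print(sc._jvm.java.lang.System.currentTimeMillis())   # calling a JVM method via Py4j

sc.stop()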


Origin blog.csdn.net/qq_51235856/article/details/130458061