[Spark] (task1) PySpark basic data processing

Summary

1. Introduction to Spark

Hadoop ecosystem (figure: Hadoop ecosystem diagram)

1.1 Scala and PySpark

(1) Scala is a multi-paradigm programming language designed to integrate various features of object-oriented programming and functional programming.

Scala runs on the Java Virtual Machine and is compatible with existing Java programs.
Scala source code is compiled to Java bytecode, so it can run on the JVM and can call existing Java class libraries.

(2) Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool, PySpark. Using PySpark, it is also possible to use RDDs in the Python programming language.
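
As a small, hedged illustration of using RDDs from Python (the data and transformations below are made up for demonstration; in the PySpark shell a SparkContext is already available as sc):

from pyspark import SparkContext

# Get or create a SparkContext (the PySpark shell provides one as `sc`)
sc = SparkContext.getOrCreate()

# Build an RDD from a small Python list, transform it, and collect the result
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 4).collect()
print(result)  # [6, 8, 10]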

(3) PySpark provides the PySpark Shell, which links the Python API to the Spark core and initializes the Spark context. Integrating Python with Spark makes data science work much more convenient.

Spark is developed in Scala, which showcases Scala's strengths in parallel and concurrent computing and is, at the micro level, a victory for the ideas of functional programming. Spark also borrows functional programming ideas at many macro design levels, such as its interfaces, lazy evaluation, and fault tolerance.

1.2 Principle of Spark

Spark is the mainstream big data processing tool in the industry.

  • Distributed: the computing nodes do not share memory and must exchange data through network communication.
  • Spark is a distributed computing platform. Its most typical deployment is built on a large number of cheap computing nodes. These nodes can be inexpensive hosts or virtual Docker containers.

Spark's architecture diagram:

  • Spark programs are scheduled and organized by the Manager Node (cluster manager)
  • Specific computing tasks are executed by the Worker Nodes (worker nodes)
  • The results are finally returned to the Driver Program (driver).

On the physical Worker Node, data is also divided into different partitions (data shards). It can be said that a partition is the basic data unit of Spark.
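
As a hedged sketch of what partitions look like in practice (it assumes a running SparkContext named sc; the partition count of 4 is chosen arbitrarily):

# Create an RDD explicitly split into 4 partitions
rdd = sc.parallelize(range(8), 4)

print(rdd.getNumPartitions())  # 4
print(rdd.glom().collect())    # [[0, 1], [2, 3], [4, 5], [6, 7]] - one inner list per partition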

Figure 1 Spark architecture diagram

The Spark computing cluster can deliver more computing power than a traditional single-machine high-performance server, and that power comes from the parallel work of hundreds, thousands, or even more than ten thousand Worker Nodes.

1.3 A concrete example

When executing a specific task, how does Spark coordinate so many Worker Nodes to obtain the final result through parallel computing? Here we use one task to walk through how Spark works.

A concrete task flow:
(1) First, read the file textFile from the local hard disk;
(2) Then, read the file hadoopFile from the distributed file system HDFS;
(3) Process each of them separately;
(4) Finally, join the two datasets on their IDs to obtain the final result.
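
The sketch below is a rough PySpark version of this pipeline, not the program the figure was generated from; the file paths, the comma-separated line format, and the use of the first field as the join ID are all hypothetical:

# (1)(2) Read a local file and a file on HDFS (placeholder paths)
text_rdd = sc.textFile('file:///data/textFile.txt')
hadoop_rdd = sc.textFile('hdfs:///data/hadoopFile.txt')

# (3) Process each one separately: drop empty lines and parse each line into an (id, value) pair
left = text_rdd.filter(lambda line: line).map(lambda line: line.split(',')).map(lambda f: (f[0], f[1]))
right = hadoop_rdd.filter(lambda line: line).map(lambda line: line.split(',')).map(lambda f: (f[0], f[1]))

# (4) Join the two datasets on their IDs and collect the final result
joined = left.join(right)
print(joined.collect())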

  • When this task is processed on the Spark platform, it is decomposed into a DAG (Directed Acyclic Graph) of sub-tasks, and the way each step of the program is executed is then determined according to this DAG. As can be seen from Figure 2, this Spark program reads data from textFile and hadoopFile, runs a series of operations such as map and filter on each, then joins them, and finally obtains the processing result.

Figure 2 Directed acyclic graph of the tasks of a Spark program
  • The most critical step is to understand which parts can be processed purely in parallel and which parts must be shuffled and reduced. Here, shuffle means that the data in all partitions must be redistributed before the next step can proceed; the most typical examples are the groupByKey and join operations in Figure 2. Taking join as an example, the textFile data and the hadoopFile data must be fully matched to obtain the joined result DataFrame (the structure Spark uses to store data). Similarly, groupByKey must merge all identical keys in the data, which also requires a global shuffle.

  • In contrast, operations such as map and filter only process and transform records one by one, without any interaction between records, so each partition can be processed in parallel.

  • Before the final result is obtained, the reduce operation aggregates the statistics from each partition. As the number of partitions gradually decreases, the degree of parallelism of the reduce operation decreases as well, until the final result is collected on the master node. It can be said that the points where shuffle and reduce operations are triggered determine the boundaries of the purely parallel processing stages.
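
A small sketch of this difference, again assuming the sc context from the earlier example: mapValues is a narrow, per-partition operation, while groupByKey forces a global shuffle.

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('b', 4)])

# Narrow transformation: each partition is processed independently, no shuffle
doubled = pairs.mapValues(lambda v: v * 2)

# Wide transformation: all values with the same key must be brought together, which triggers a shuffle
grouped = doubled.groupByKey().mapValues(list)
print(grouped.collect())  # e.g. [('a', [2, 6]), ('b', [4, 8])]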

Figure 3 DAG stages divided by shuffle operations

Note:
(1) The shuffle operation requires data exchange between different computing nodes and consumes a lot of computing, communication, and storage resources, so Spark programs should avoid shuffle operations as much as possible. A shuffle can be understood as a serial step: it has to wait until the preceding parallel work has finished before it can start.

(2) Spark's computation process in brief: within a Stage, data is processed efficiently in parallel; at the Stage boundaries, the resource-consuming shuffle operations, or the final reduce operation, take place.
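
One common way to keep the shuffle cheap, shown here only as an illustrative sketch, is to prefer reduceByKey over groupByKey: reduceByKey aggregates within each partition before the shuffle, so far less data crosses the network.

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('b', 4)])

# groupByKey ships every (key, value) pair across the network, then aggregates
sums_grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-aggregates inside each partition, then shuffles only the partial sums
sums_reduced = pairs.reduceByKey(lambda x, y: x + y)
print(sums_reduced.collect())  # e.g. [('a', 4), ('b', 6)]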

2. Installation method

  • Windows 10: not suitable for development, because command-line tooling is poorly supported, there are many hidden pitfalls, and little information exists on how to work around them
  • Windows Subsystem for Linux (WSL): requires installing extra software and configuring many environment variables, which is cumbersome
  • Ubuntu/CentOS: not tried, but presumably similar to WSL
  • Docker: simple, efficient, and portable

The Docker approach (in an Ubuntu environment):

  • install docker: curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun;
  • Pull the image:docker pull jupyter/pyspark-notebook
  • Create the container:
docker run \
    -d \
    -p 8022:22 \
    -p 4040:4040 \
    -v /home/fyb:/data \
    -e GRANT_SUDO=yes \
    --name myspark \
    jupyter/pyspark-notebook
  • Configure SSH login for the Docker container
    • Install openssh-server and other common software: apt update && apt install openssh-server htop tmux
    • Allow root to log in via ssh: echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
    • Restart the ssh service: service ssh --full-restart, then set the root password: passwd root
    • Test whether ssh into the Docker container works: ssh root@<host IP> -p 8022
  • Configure the Python environment inside the container:
    • After logging in via SSH as root, install pip: apt install pip
    • Install the PySpark dependencies: pip3 install pyspark numpy pandas tqdm
    • Verify the installation by running a bundled example: python3 /usr/local/spark/examples/src/main/python/pi.py

3. Test whether the installation is successful
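
Besides running the bundled pi.py example above, a minimal smoke test could look like the sketch below (the application name and DataFrame contents are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('install-test').getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter'])
df.show()              # the installation works if a small two-row table is printed
print(spark.version)   # also prints the Spark version in use
spark.stop()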

4. Module classification of Spark programs

Module name                            Module meaning
RDD                                    RDDs and related features: accumulators and broadcast variables
Spark SQL, Datasets, and DataFrames    processing structured data with relational queries
Structured Streaming                   processing structured data streams with relational queries
Spark Streaming                        processing data streams using DStreams
MLlib                                  applying machine learning algorithms
GraphX                                 processing graphs
PySpark                                processing data with Spark in Python
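
In PySpark these modules roughly correspond to different import paths; the following is only an indicative sketch (GraphX has no first-party Python API, so it is omitted):

from pyspark import SparkContext, SparkConf      # RDDs, accumulators, broadcast variables
from pyspark.sql import SparkSession, DataFrame  # Spark SQL, Datasets/DataFrames, Structured Streaming
from pyspark.streaming import StreamingContext   # Spark Streaming (DStreams)
from pyspark.ml import Pipeline                  # MLlib (DataFrame-based machine learning API)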

5. Data processing tasks

5.1 Connecting to the Spark environment from Python

import pandas as pd
from pyspark.sql import SparkSession

# Create the Spark application mypyspark
spark = SparkSession.builder.appName('mypyspark').getOrCreate()
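
A quick way to confirm the session was created (purely illustrative):

print(spark.version)        # the Spark version this session runs on
print(spark.sparkContext)   # the underlying SparkContext of this application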

5.2 Create DataFrame data

This is similar to tools such as pandas. When creating the table, note that the list of column names (the header) is passed after the data.

test = spark.createDataFrame([('001','1',100,87,67,83,98), ('002','2',87,81,90,83,83), ('003','3',86,91,83,89,63),
                            ('004','2',65,87,94,73,88), ('005','1',76,62,89,81,98), ('006','3',84,82,85,73,99),
                            ('007','3',56,76,63,72,87), ('008','1',55,62,46,78,71), ('009','2',63,72,87,98,64)],                           
                             ['number','class','language','math','english','physic','chemical'])
test.show(5)
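
To check which types Spark inferred for each column, printSchema() can be used; a minimal check:

# Inspect the inferred schema: 'number' and 'class' are strings, the score columns are longs
test.printSchema()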


5.3 Use Spark to find the number of rows and columns of the data

# View the first 2 rows of the table
test.head(2)

test.describe().show()

# List the column names
test.columns

# Show the first row of data
test.first()

# Data size (shape)
print('test.shape: %s rows %s columns' % (test.count(), len(test.columns)))
# The above prints: test.shape: 9 rows 7 columns
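
Closely related to the shape check above, dtypes returns each column name together with its type as a plain Python list of tuples:

# Column names with their types, e.g. [('number', 'string'), ('class', 'string'), ('language', 'bigint'), ...]
print(test.dtypes)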

5.4 Use Spark to filter samples whose class is 1

Here you can use either df.filter or df.where:

# Method 1
test.filter(test['class'] == 1).show()
# Method 2
test.filter('class == 1').show()
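
A third, equivalent variant (not in the original two methods) uses the col() helper from pyspark.sql.functions; note it compares against the string '1', since the class column holds strings:

from pyspark.sql.functions import col

# Method 3: column-expression style
test.filter(col('class') == '1').show()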


5.5 Use Spark to filter samples with language > 90 or math > 90

test.filter('language>90 or math>90').show()
test.where('language>90 or math>90').show()
test.filter((test['language']>90)|(test['math']>90)).show()


Summary of tasks:

Task name                                       Difficulty
Task 1: PySpark data processing                 Low, 1
Task 2: PySpark data statistics                 Medium, 1
Task 3: PySpark group-by aggregation            Medium, 2
Task 4: SparkSQL basic syntax                   High, 3
Task 5: SparkML basics: data encoding           Medium, 3
Task 6: SparkML basics: classification models   Medium, 3
Task 7: SparkML basics: clustering models       Medium, 2
Task 8: Spark RDD                               High, 3
Task 9: Spark Streaming                         High, 2

