[Spark] (task1) PySpark basic data processing

Summary

1. Introduction to Spark

Hadoop ecosystem (figure: Hadoop ecosystem diagram)

1.1 Scala and PySpark

(1) Scala is a multi-paradigm programming language designed to integrate various features of object-oriented programming and functional programming.

Scala runs on the Java Virtual Machine and is compatible with existing Java programs.
Scala source code is compiled to Java bytecode, so it can run on the JVM and can call existing Java class libraries.

(2) Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool, PySpark. Using PySpark, it is also possible to use RDDs in the Python programming language.
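
As a small, hedged illustration of using RDDs from Python (the data and transformations below are made up for demonstration; in the PySpark shell a SparkContext is already available as sc):

from pyspark import SparkContext

# Get or create a SparkContext (the PySpark shell provides one as `sc`)
sc = SparkContext.getOrCreate()

# Build an RDD from a small Python list, transform it, and collect the result
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 4).collect()
print(result)  # [6, 8, 10]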

(3) PySpark provides the PySpark Shell, which links the Python API to the Spark core and initializes the Spark context. Integrating Python with Spark makes data science work much more convenient.

Spark is developed in Scala, which showcases Scala's strengths in parallel and concurrent computing and is, at the micro level, a victory for the ideas of functional programming. Spark also borrows functional programming ideas at many macro design levels, such as its interfaces, lazy evaluation, and fault tolerance.

1.2 Principle of Spark

Spark is the mainstream big data processing tool in the industry.

  • Distributed: the computing nodes do not share memory and must exchange data through network communication.
  • Spark is a distributed computing platform. Its most typical deployment is built on a large number of cheap computing nodes. These nodes can be inexpensive hosts or virtual Docker containers.

Spark's architecture diagram:

  • Spark programs are scheduled and organized by the Manager Node (cluster manager)
  • Specific computing tasks are executed by the Worker Nodes (worker nodes)
  • The results are finally returned to the Driver Program (driver).

On the physical Worker Node, data is also divided into different partitions (data shards). It can be said that a partition is the basic data unit of Spark.
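
As a hedged sketch of what partitions look like in practice (it assumes a running SparkContext named sc; the partition count of 4 is chosen arbitrarily):

# Create an RDD explicitly split into 4 partitions
rdd = sc.parallelize(range(8), 4)

print(rdd.getNumPartitions())  # 4
print(rdd.glom().collect())    # [[0, 1], [2, 3], [4, 5], [6, 7]] - one inner list per partition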

Figure 1 Spark architecture diagram

The Spark computing cluster can deliver more computing power than a traditional single-machine high-performance server, and that power comes from the parallel work of hundreds, thousands, or even more than ten thousand Worker Nodes.

1.3 A concrete example

When executing a specific task, how does Spark coordinate so many Worker Nodes to obtain the final result through parallel computing? Here we use one task to walk through how Spark works.

A concrete task flow:
(1) First, read the file textFile from the local hard disk;
(2) Then, read the file hadoopFile from the distributed file system HDFS;
(3) Process each of them separately;
(4) Finally, join the two datasets on their IDs to obtain the final result.
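
The sketch below is a rough PySpark version of this pipeline, not the program the figure was generated from; the file paths, the comma-separated line format, and the use of the first field as the join ID are all hypothetical:

# (1)(2) Read a local file and a file on HDFS (placeholder paths)
text_rdd = sc.textFile('file:///data/textFile.txt')
hadoop_rdd = sc.textFile('hdfs:///data/hadoopFile.txt')

# (3) Process each one separately: drop empty lines and parse each line into an (id, value) pair
left = text_rdd.filter(lambda line: line).map(lambda line: line.split(',')).map(lambda f: (f[0], f[1]))
right = hadoop_rdd.filter(lambda line: line).map(lambda line: line.split(',')).map(lambda f: (f[0], f[1]))

# (4) Join the two datasets on their IDs and collect the final result
joined = left.join(right)
print(joined.collect())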

  • When this task is processed on the Spark platform, it is decomposed into a DAG (Directed Acyclic Graph) of sub-tasks, and the way each step of the program is executed is then determined according to this DAG. As can be seen from Figure 2, this Spark program reads data from textFile and hadoopFile, runs a series of operations such as map and filter on each, then joins them, and finally obtains the processing result.

Figure 2 Directed acyclic graph of the tasks of a Spark program
  • The most critical step is to understand which parts can be processed purely in parallel and which parts must be shuffled and reduced. Here, shuffle means that the data in all partitions must be redistributed before the next step can proceed; the most typical examples are the groupByKey and join operations in Figure 2. Taking join as an example, the textFile data and the hadoopFile data must be fully matched to obtain the joined result DataFrame (the structure Spark uses to store data). Similarly, groupByKey must merge all identical keys in the data, which also requires a global shuffle.

  • In contrast, operations such as map and filter only process and transform records one by one, without any interaction between records, so each partition can be processed in parallel.

  • Before the final result is obtained, the reduce operation aggregates the statistics from each partition. As the number of partitions gradually decreases, the degree of parallelism of the reduce operation decreases as well, until the final result is collected on the master node. It can be said that the points where shuffle and reduce operations are triggered determine the boundaries of the purely parallel processing stages.
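
A small sketch of this difference, again assuming the sc context from the earlier example: mapValues is a narrow, per-partition operation, while groupByKey forces a global shuffle.

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('b', 4)])

# Narrow transformation: each partition is processed independently, no shuffle
doubled = pairs.mapValues(lambda v: v * 2)

# Wide transformation: all values with the same key must be brought together, which triggers a shuffle
grouped = doubled.groupByKey().mapValues(list)
print(grouped.collect())  # e.g. [('a', [2, 6]), ('b', [4, 8])]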

Figure 3 DAG stages divided by shuffle operations

Note:
(1) The shuffle operation requires data exchange between different computing nodes and consumes a lot of computing, communication, and storage resources, so Spark programs should avoid shuffle operations as much as possible. A shuffle can be understood as a serial step: it has to wait until the preceding parallel work has finished before it can start.

(2) Spark's computation process in brief: within a Stage, data is processed efficiently in parallel; at the Stage boundaries, the resource-consuming shuffle operations, or the final reduce operation, take place.
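
One common way to keep the shuffle cheap, shown here only as an illustrative sketch, is to prefer reduceByKey over groupByKey: reduceByKey aggregates within each partition before the shuffle, so far less data crosses the network.

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('b', 4)])

# groupByKey ships every (key, value) pair across the network, then aggregates
sums_grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-aggregates inside each partition, then shuffles only the partial sums
sums_reduced = pairs.reduceByKey(lambda x, y: x + y)
print(sums_reduced.collect())  # e.g. [('a', 4), ('b', 6)]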

2. Installation method

  • Windows 10: not suitable for development, because command-line tooling is poorly supported, there are many hidden pitfalls, and little information exists on how to work around them
  • Windows Subsystem for Linux (WSL): requires installing extra software and configuring many environment variables, which is cumbersome
  • Ubuntu/CentOS: not tried, but presumably similar to WSL
  • Docker: simple, efficient, and portable

The Docker approach (in an Ubuntu environment):

  • install docker: curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun;
  • Pull the image:docker pull jupyter/pyspark-notebook
  • Create the container:
docker run \
    -d \
    -p 8022:22 \
    -p 4040:4040 \
    -v /home/fyb:/data \
    -e GRANT_SUDO=yes \
    --name myspark \
    jupyter/pyspark-notebook
  • Configure SSH login for the Docker container
    • Install openssh-server and other common software: apt update && apt install openssh-server htop tmux
    • Allow root to log in via ssh: echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
    • Restart the ssh service: service ssh --full-restart, then set the root password: passwd root
    • Test whether ssh into the Docker container works: ssh root@<host IP> -p 8022
  • Configure the Python environment inside the container:
    • After logging in via SSH as root, install pip: apt install pip
    • Install the PySpark dependencies: pip3 install pyspark numpy pandas tqdm
    • Verify the installation by running a bundled example: python3 /usr/local/spark/examples/src/main/python/pi.py

3. Test whether the installation is successful
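
Besides running the bundled pi.py example above, a minimal smoke test could look like the sketch below (the application name and DataFrame contents are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('install-test').getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter'])
df.show()              # the installation works if a small two-row table is printed
print(spark.version)   # also prints the Spark version in use
spark.stop()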

4. Module classification of Spark programs

Module name                            Module meaning
RDD                                    RDDs and related features: accumulators and broadcast variables
Spark SQL, Datasets, and DataFrames    processing structured data with relational queries
Structured Streaming                   processing structured data streams with relational queries
Spark Streaming                        processing data streams using DStreams
MLlib                                  applying machine learning algorithms
GraphX                                 processing graphs
PySpark                                processing data with Spark in Python
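
In PySpark these modules roughly correspond to different import paths; the following is only an indicative sketch (GraphX has no first-party Python API, so it is omitted):

from pyspark import SparkContext, SparkConf      # RDDs, accumulators, broadcast variables
from pyspark.sql import SparkSession, DataFrame  # Spark SQL, Datasets/DataFrames, Structured Streaming
from pyspark.streaming import StreamingContext   # Spark Streaming (DStreams)
from pyspark.ml import Pipeline                  # MLlib (DataFrame-based machine learning API)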

5. Data processing tasks

5.1 Connecting to the Spark environment from Python

import pandas as pd
from pyspark.sql import SparkSession

# Create the Spark application mypyspark
spark = SparkSession.builder.appName('mypyspark').getOrCreate()
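
A quick way to confirm the session was created (purely illustrative):

print(spark.version)        # the Spark version this session runs on
print(spark.sparkContext)   # the underlying SparkContext of this application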

5.2 Create DataFrame data

This is similar to tools such as pandas. When creating the table, note that the list of column names (the header) is passed after the data.

test = spark.createDataFrame([('001','1',100,87,67,83,98), ('002','2',87,81,90,83,83), ('003','3',86,91,83,89,63),
                            ('004','2',65,87,94,73,88), ('005','1',76,62,89,81,98), ('006','3',84,82,85,73,99),
                            ('007','3',56,76,63,72,87), ('008','1',55,62,46,78,71), ('009','2',63,72,87,98,64)],                           
                             ['number','class','language','math','english','physic','chemical'])
test.show(5)
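
To check which types Spark inferred for each column, printSchema() can be used; a minimal check:

# Inspect the inferred schema: 'number' and 'class' are strings, the score columns are longs
test.printSchema()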


5.3 Use Spark to find the number of rows and columns of the data

# View the first 2 rows of the table
test.head(2)

test.describe().show()

# List the column names
test.columns

# Show the first row of data
test.first()

# Data size (shape)
print('test.shape: %s rows %s columns' % (test.count(), len(test.columns)))
# The above prints: test.shape: 9 rows 7 columns
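
Closely related to the shape check above, dtypes returns each column name together with its type as a plain Python list of tuples:

# Column names with their types, e.g. [('number', 'string'), ('class', 'string'), ('language', 'bigint'), ...]
print(test.dtypes)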

5.4 Use Spark to filter samples whose class is 1

Here you can use either df.filter or df.where:

# Method 1
test.filter(test['class'] == 1).show()
# Method 2
test.filter('class == 1').show()
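
A third, equivalent variant (not in the original two methods) uses the col() helper from pyspark.sql.functions; note it compares against the string '1', since the class column holds strings:

from pyspark.sql.functions import col

# Method 3: column-expression style
test.filter(col('class') == '1').show()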


5.5 Use Spark to filter samples with language > 90 or math > 90

test.filter('language>90 or math>90').show()
test.where('language>90 or math>90').show()
test.filter((test['language']>90)|(test['math']>90)).show()


Summary of tasks:

Task name                                       Difficulty
Task 1: PySpark data processing                 Low, 1
Task 2: PySpark data statistics                 Medium, 1
Task 3: PySpark group-by aggregation            Medium, 2
Task 4: SparkSQL basic syntax                   High, 3
Task 5: SparkML basics: data encoding           Medium, 3
Task 6: SparkML basics: classification models   Medium, 3
Task 7: SparkML basics: clustering models       Medium, 2
Task 8: Spark RDD                               High, 3
Task 9: Spark Streaming                         High, 2

