Introduction and installation of PySpark

What is Spark

Definition: Apache Spark is a unified analytics engine for large-scale data processing.

Simply put, Spark is a distributed computing framework that can schedule clusters of hundreds or thousands of servers to process massive data at the TB, PB, or even EB scale.

Python on Spark

As one of the world's leading distributed computing frameworks, Spark supports development in many programming languages, and Python is a direction Spark focuses on supporting.

 PySpark

Spark's support for the Python language is mainly reflected in the Python third-party library PySpark.

PySpark is the official Python third-party library developed by the Spark project.

Python developers can install PySpark quickly with pip and use it just like any other third-party library.

 Why PySpark

Python has a rich set of application scenarios and career directions; among the most prominent are:

Big Data Development and Artificial Intelligence

Summary:

1. What is Spark, what is PySpark

  • Spark is a top-level open source project under the Apache Foundation, which is used for large-scale distributed computing on massive data.
  • PySpark is the Python implementation of Spark. It is the programming entry provided by Spark for Python developers. It is used to complete the development of Spark tasks with Python code.
  • PySpark can be used not only as a Python third-party library, but can also be submitted to a Spark cluster environment, scheduling large-scale clusters for execution.

2. Why learn PySpark?

Big data development is a star track among Python's many career directions, with high salaries and plentiful jobs, and Spark (PySpark) is the core technology in big data development.

Installation of the PySpark library

Like other third-party Python libraries, PySpark can also be installed using the pip program.

In the command prompt (CMD), enter:

pip install pyspark

Or use a domestic mirror (the Tsinghua University source):

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark
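
A quick way to confirm the installation succeeded is to import the package in a Python shell and print its version; this is just a minimal sanity check, not part of the original steps:

# Verify that PySpark can be imported
import pyspark

# Print the installed PySpark version
print(pyspark.__version__)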

 Construct the PySpark execution environment entry object

To use the PySpark library to complete data processing, you first need to build an execution environment entry object.

The entry point of PySpark's execution environment is an object of the SparkContext class.

"""
演示pyspark
"""
# 导包
from pyspark import SparkConf, SparkContext

# 创建SparkConf类对象
# 链式调用
conf = SparkConf().\
    setMaster("local[*]").\
    setAppName("test_spark_app")
# .setMaster设置运行模式
# .setAppName设置程序的名称
# 可以写成这样
# conf = SparkConf()
# conf.setMaster("local[*]")
# conf.setAppName("test_spark_app")

# 基于SparkConf类对象创建SparkContext类对象
sc = SparkContext(conf=conf)
# 打印PySpark类对象
print(sc.version)
# 停止SparkContext对象的运行(停止PySpark程序)
sc.stop()

PySpark's programming model

The SparkContext class object is the entry point for all functions in PySpark programming.

The programming of PySpark is mainly divided into the following three steps (a sketch follows the list):

 

  • Complete data input through the SparkContext object
  • After data input, an RDD object is obtained, and the RDD object is computed iteratively
  • Finally, complete the data output through the RDD object's member methods
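
As a concrete illustration of these three steps, here is a minimal sketch; the input list and the doubling function are made up for demonstration:

from pyspark import SparkConf, SparkContext

# Build the execution environment entry object
conf = SparkConf().setMaster("local[*]").setAppName("programming_model_demo")
sc = SparkContext(conf=conf)

# 1. Data input: create an RDD from a Python list
rdd = sc.parallelize([1, 2, 3, 4, 5])

# 2. Data calculation: call RDD member methods (here, map) to transform the data
doubled = rdd.map(lambda x: x * 2)

# 3. Data output: collect the results back into a Python list
print(doubled.collect())  # [2, 4, 6, 8, 10]

# Stop the SparkContext
sc.stop()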

Summary:

1. How to install the PySpark library

        pip install pyspark

2. Why should the SparkContext object be constructed as the execution entry point?

        All PySpark functionality starts from the SparkContext object.

3. What is the programming model of PySpark?

  • Data input: complete data reading through SparkContext
  • Data calculation: the input data is converted into an RDD object, and the RDD's member methods are called to perform the calculation
  • Data output: call the RDD's output-related member methods to write the results to lists, tuples, dictionaries, text files, databases, etc.
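
For reference, a short sketch of common output member methods; these are standard RDD methods, and "output_dir" is only an example path (on Windows, writing text files may additionally require a Hadoop winutils setup):

# Continuing from an existing SparkContext `sc` (see the example above)
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Output to a Python list
print(rdd.collect())                   # [1, 2, 3, 4, 5]

# Aggregate to a single value
print(rdd.reduce(lambda a, b: a + b))  # 15

# Take the first n elements / count the elements
print(rdd.take(3))                     # [1, 2, 3]
print(rdd.count())                     # 5

# Output to a text file (writes a directory of part files)
rdd.saveAsTextFile("output_dir")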
