pyspark installation tutorial

1. Configure the pyspark environment under Windows

Using pyspark in Python is not simply a matter of importing the pyspark package: several components must be installed and configured together to form a working Spark environment first.
The environment required for pyspark: Python 3, JDK, Spark, Scala, Hadoop (optional)

1.1 JDK download and installation

Download address: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Open the environment variable settings in Windows:

  • Create JAVA_HOME: C:\Program Files\Java\jdk1.8.0_181
  • Create CLASSPATH: .;%JAVA_HOME%\lib;%JAVA_HOME%\lib\tools.jar

Add to Path: %JAVA_HOME%\bin;
Test whether the installation is successful: open the cmd command line, enter java -version
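The environment variables above can also be double-checked from Python. This is a small illustrative sketch, not part of any library: `check_java` is a helper name invented for this tutorial, and the JDK path shown is only an example.

```python
import os
import shutil
import subprocess

def check_java(env=None):
    """Return (java_home, java_on_path) for the given environment mapping.

    A sanity check for the JDK setup described above. `env` defaults to
    the real process environment; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")          # e.g. C:\Program Files\Java\jdk1.8.0_181
    on_path = shutil.which("java") is not None  # is java reachable via Path?
    return java_home, on_path

if __name__ == "__main__":
    home, on_path = check_java()
    print("JAVA_HOME:", home)
    if on_path:
        # Note: `java -version` prints to stderr, not stdout.
        subprocess.run(["java", "-version"])
```

If `JAVA_HOME` comes back as `None` or `java` is not on the Path, revisit the variable settings above before continuing.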

1.2 Scala download and install

Download address: https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.msi
Download and run the installer.

  • Create SCALA_HOME: C:\Program Files (x86)\scala
  • Path added: %SCALA_HOME%\bin

Test whether the installation is successful: open the cmd command line and enter scala -version

1.3 spark download and install

Download address: http://mirror.bit.edu.cn/apache/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
You can also choose to download a specific version from: http://spark.apache.org/downloads.html
After downloading, decompress and place it in any directory, but the directory name cannot have spaces.
Environment variables:

  • Create SPARK_HOME: D:\spark-3.0.0-preview2-bin-hadoop2.7 (the directory where you unpacked Spark)
  • Path added: %SPARK_HOME%\bin

Test whether the installation is successful: open the cmd command line, enter spark-shell
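As an alternative to the system-wide variables above, the same configuration can be applied per-process from Python before pyspark is used. A minimal sketch, assuming the example install directories from this tutorial; `configure_spark_env` is a hypothetical helper, not a standard function:

```python
import os

# Example paths -- adjust to where you actually unpacked Spark/Hadoop.
SPARK_HOME = r"D:\spark-3.0.0-preview2-bin-hadoop2.7"
HADOOP_HOME = r"D:\hadoop-2.7.7"

def configure_spark_env(spark_home, hadoop_home=None, environ=None):
    """Point a process environment at a Spark (and optionally Hadoop) install.

    Mirrors the SPARK_HOME / HADOOP_HOME / Path settings described above.
    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if environ is None else environ
    env["SPARK_HOME"] = spark_home
    if hadoop_home:
        env["HADOOP_HOME"] = hadoop_home
    # Prepend the bin directories so spark-shell / hadoop are found first.
    bins = [os.path.join(spark_home, "bin")]
    if hadoop_home:
        bins.append(os.path.join(hadoop_home, "bin"))
    env["PATH"] = os.pathsep.join(bins + [env.get("PATH", "")])
    return env
```

Calling `configure_spark_env(SPARK_HOME, HADOOP_HOME)` at the top of a script has the same effect as the manual variable setup, but only for that one process.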

1.4 Download and install Hadoop

If you need to read data from HDFS, install Hadoop first.
Download address:
http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
Unzip to the specified directory.
Environment variables:

  • Create HADOOP_HOME: D:\hadoop-2.7.7
  • Path added: %HADOOP_HOME%\bin

Test whether the installation is successful: open the cmd command line, enter hadoop

If the hadoop test reports "Error: JAVA_HOME is incorrectly set", a common cause is a JAVA_HOME path containing spaces (e.g. C:\Program Files). Reference: https://blog.csdn.net/qq_24125575/article/details/76186309

1.5 pyspark download and install

Install the pyspark package in PyCharm (Settings → Project Interpreter → add pyspark), or run pip install pyspark.
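Once the package is installed, a short script can confirm the whole stack works. This is a minimal smoke test, not an official verification procedure; the plain `word_count` function (a helper written for this tutorial) mirrors what the Spark job computes, so the logic can be checked even without Spark present:

```python
from collections import Counter

def word_count(lines):
    """The same split-and-count that the Spark job below performs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

if __name__ == "__main__":
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        print("pyspark is not installed; run: pip install pyspark")
    else:
        # local[*] runs Spark inside this process, using all local cores.
        spark = (SparkSession.builder
                 .master("local[*]")
                 .appName("smoke-test")
                 .getOrCreate())
        rdd = spark.sparkContext.parallelize(["hello spark", "hello pyspark"])
        counts = rdd.flatMap(str.split).countByValue()
        print(dict(counts))
        spark.stop()
```

If the script prints the word counts without errors, the JDK, Spark, and pyspark installation steps above are all working together.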

2. Introduction to the principle of pyspark

The implementation mechanism of pyspark can be summarized as follows.

On the Python driver side, SparkContext uses Py4J to start a JVM and create a JavaSparkContext. Py4J is used only on the driver side, for communication between the local Python and Java SparkContext objects; transferring large volumes of data uses a different mechanism.
RDD transformations written in Python are mapped to transformations on PythonRDD objects in the Java environment. On the remote worker machines, each PythonRDD object starts Python child processes and communicates with them through pipes, sending over the user code and the data to be processed.
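The pipe-based driver/worker communication can be illustrated in miniature with ordinary Python subprocesses. This is only an analogy under simplifying assumptions — Spark's real worker protocol is binary and far more involved — but the transport idea (send work in over stdin, read results back from stdout) is the same:

```python
import subprocess
import sys

# The "user code" shipped to the child process: read numbers from stdin,
# write each doubled number to stdout.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(int(line) * 2)\n"
)

def run_in_child(values):
    """Send values to a Python child process over a pipe and collect results.

    A toy stand-in for how a PythonRDD feeds data to worker subprocesses.
    """
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input="\n".join(str(v) for v in values),  # data goes in via stdin
        capture_output=True,                       # results come back via stdout
        text=True,
        check=True,
    )
    return [int(x) for x in proc.stdout.split()]
```

Here the parent plays the role of the JVM-side PythonRDD and the child plays the Python worker; the pipes are the stdin/stdout of the subprocess.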


Origin blog.csdn.net/qq_51808107/article/details/131180756