pyspark installation tutorial
1. Configure the pyspark environment under Windows
Using pyspark in Python is not as simple as importing the pyspark package: several components must be installed and configured together to build a working Spark environment before pyspark can be used from Python.
Components required for pyspark: Python 3, JDK, Spark, Scala, Hadoop (optional)
1.1 JDK download and installation
Download address: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Open the environment variables in Windows:
- Create JAVA_HOME: C:\Program Files\Java\jdk1.8.0_181
- Create CLASSPATH: .;%JAVA_HOME%\lib;%JAVA_HOME%\lib\tools.jar
- Path added: %JAVA_HOME%\bin
Test whether the installation is successful: open the cmd command line and enter java -version
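The same check can be scripted from Python. A minimal sketch (the helper name `version_banner` is an illustration, not part of any library; it returns None when the tool is not on PATH):

```python
import subprocess

def version_banner(cmd):
    """Run a `<tool> -version`-style command and return its output,
    or None if the tool is not on PATH."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return None  # tool not installed or not on PATH
    # java prints its version banner to stderr rather than stdout
    return (proc.stderr + proc.stdout).strip()

# check the JDK installed above
print(version_banner(["java", "-version"]))
```

The same helper works for `scala -version` and `hadoop version` in the later steps.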
1.2 Scala download and install
Download address: https://downloads.lightbend.com/scala/2.12.8/scala-2.12.8.msi
After running the installer, set the environment variables:
- Create SCALA_HOME: C:\Program Files (x86)\scala
- Path added: %SCALA_HOME%\bin
Test whether the installation is successful: open the cmd command line and enter scala -version
1.3 Spark download and install
Download address: http://mirror.bit.edu.cn/apache/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
You can also choose to download a specific version: http://spark.apache.org/downloads.html
After downloading, extract it to any directory; the directory path must not contain spaces.
Environment variables:
- Create SPARK_HOME pointing at the extracted directory, e.g. D:\spark-2.2.0-bin-hadoop2.7 (match the version you actually downloaded)
- Path added: %SPARK_HOME%\bin
Test whether the installation is successful: open the cmd command line, enter spark-shell
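Before launching Spark it can help to confirm the environment variables point at real directories. A small sketch (the function `check_home_var` is illustrative, not a Spark API; the variable names follow the steps in this tutorial):

```python
import os

def check_home_var(name):
    """Return (is_ok, message) for an environment variable that should
    point to an existing installation directory."""
    value = os.environ.get(name)
    if not value:
        return False, f"{name} is not set"
    if not os.path.isdir(value):
        return False, f"{name} points to a missing directory: {value}"
    if " " in value:
        # Hadoop tooling in particular may fail when the path contains spaces
        return True, f"{name} is set but contains spaces (may break Hadoop): {value}"
    return True, f"{name} looks OK: {value}"

for var in ("JAVA_HOME", "SCALA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(check_home_var(var)[1])
```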
1.4 Hadoop download and install
If you need to read data from HDFS, install Hadoop first.
Download address:
http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
Unzip to the specified directory.
Environment variables:
- Create HADOOP_HOME: D:\hadoop-2.7.7
- Path added: %HADOOP_HOME%\bin
Test whether the installation is successful: open the cmd command line, enter hadoop
If the hadoop test reports "Error: JAVA_HOME is incorrectly set", it is often because JAVA_HOME contains spaces (as in C:\Program Files); see: https://blog.csdn.net/qq_24125575/article/details/76186309
1.5 pyspark download and install
Install the pyspark package, either with pip (pip install pyspark) or through PyCharm's package manager.
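Once everything is installed, a minimal smoke test is to create a local SparkSession and count a tiny DataFrame. A sketch, deliberately guarded so the script reports the problem instead of crashing when pyspark or the JVM is not available:

```python
# Minimal pyspark smoke test; wrapped so the script still runs
# (and reports the problem) if pyspark is not importable.
try:
    from pyspark.sql import SparkSession
except ImportError:
    SparkSession = None
    print("pyspark is not installed; run `pip install pyspark` first")

def smoke_test():
    """Create a tiny local DataFrame and return its row count,
    or None if Spark could not be started."""
    if SparkSession is None:
        return None
    try:
        spark = (SparkSession.builder
                 .master("local[*]")      # run Spark locally, no cluster needed
                 .appName("smoke-test")
                 .getOrCreate())
    except Exception as exc:  # e.g. JAVA_HOME misconfigured, JVM failed to start
        print(f"Spark failed to start: {exc}")
        return None
    try:
        df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
        return df.count()
    finally:
        spark.stop()

print(smoke_test())
```

If the environment is set up correctly, this prints 2.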
2. Introduction to the principle of pyspark
The implementation mechanism of pyspark can be summarized as follows.
On the Python driver side, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is used only on the driver, for communication between the local Python objects and the Java SparkContext; bulk data transfer relies on a different mechanism.
RDD transformations written in Python are mapped to PythonRDD objects in the JVM. On the remote worker machines, each PythonRDD object launches Python child processes and communicates with them through pipes, sending over the user code and the data to be processed.
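The driver-to-worker pipe mechanism can be mimicked in miniature with plain subprocesses. This is a toy analogy, not the actual PythonRDD protocol (Spark uses its own serialization format over the worker pipes):

```python
import json
import subprocess
import sys

# Toy analogue of the PythonRDD worker pipe: the "driver" serializes a
# partition of data, ships it to a Python child process over stdin, the
# child applies the user function, and the results come back over stdout.
worker_source = """
import json, sys
data = json.loads(sys.stdin.read())
result = [x * x for x in data]          # the "user code" applied per element
sys.stdout.write(json.dumps(result))
"""

proc = subprocess.run(
    [sys.executable, "-c", worker_source],
    input=json.dumps([1, 2, 3, 4]),
    capture_output=True,
    text=True,
)
print(json.loads(proc.stdout))  # [1, 4, 9, 16]
```

Real Spark workers stay alive across tasks and stream batches of serialized records, but the shape of the exchange — code and data down the pipe, results back up — is the same.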