1. Install the JDK
Download and install jdk-12.0.1_windows-x64_bin.exe, then configure the environment variables:
Create a new system variable JAVA_HOME whose value is the JDK installation path.
Create a new system variable CLASSPATH with the value .;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar; (note the leading dot).
Edit the system variable PATH and add %JAVA_HOME%\bin;%JAVA_HOME%\jre\bin.
In CMD, type java or java -version; if the command is recognized rather than reported as an unknown internal or external command, the installation succeeded.
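To double-check from Python, here is a minimal sketch (an illustration only, assuming java is already on PATH) that runs java -version via the standard subprocess module:

import subprocess

try:
    # "java -version" prints its banner to stderr, not stdout
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print("JDK found:", result.stderr.strip().splitlines()[0])
except FileNotFoundError:
    print("java is not on PATH; check the JAVA_HOME and PATH settings above")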
2. Install Hadoop and configure its environment variables
Download Hadoop: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
Extract hadoop-2.7.7.tar.gz to a specific path, such as D:\adasoftware\hadoop.
Add the system variable HADOOP_HOME: D:\adasoftware\hadoop
Add D:\adasoftware\hadoop\bin to the system variable PATH.
Install the winutils component: download the winutils build matching your Hadoop version and replace the bin directory in your Hadoop installation with its bin directory.
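A small sketch to confirm the variables are visible to Python and winutils.exe is in place (the D:\adasoftware\hadoop path above is only an example):

import os

hadoop_home = os.environ.get("HADOOP_HOME")  # e.g. D:\adasoftware\hadoop
if hadoop_home is None:
    print("HADOOP_HOME is not set")
else:
    winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
    print("winutils.exe found" if os.path.exists(winutils) else "winutils.exe missing")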
3. Spark environment variable configuration
Spark runs on top of Hadoop and calls Hadoop libraries at runtime. If the Hadoop runtime environment is not configured, Spark prints related error messages, although these do not affect operation.
Download the Spark build corresponding to your Hadoop version: http://spark.apache.org/downloads.html
Unzip the file to: D:\adasoftware\spark-2.4.3-bin-hadoop2.7
Add to PATH: D:\adasoftware\spark-2.4.3-bin-hadoop2.7\bin
Create a new system variable SPARK_HOME: D:\adasoftware\spark-2.4.3-bin-hadoop2.7
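To verify from Python that SPARK_HOME points at a real Spark installation, a minimal sketch (the path is the example used above; the Windows distribution ships bin\spark-submit.cmd):

import os

spark_home = os.environ.get("SPARK_HOME")  # e.g. D:\adasoftware\spark-2.4.3-bin-hadoop2.7
if spark_home and os.path.exists(os.path.join(spark_home, "bin", "spark-submit.cmd")):
    print("Spark found at", spark_home)
else:
    print("SPARK_HOME is missing or does not point at a Spark installation")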
4. Download and install Anaconda
Anaconda bundles the Python interpreter and most common Python libraries, so after installing it you do not need to install components such as pandas and NumPy separately. Finally, add Python to the PATH environment variable.
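A quick check that the bundled libraries are available (both imports should succeed in an Anaconda installation):

import sys

import numpy
import pandas

# Both packages ship with Anaconda, so no separate installation is needed
print("Python:", sys.version.split()[0])
print("numpy:", numpy.__version__, "| pandas:", pandas.__version__)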
5. Run pyspark in CMD; a startup screen similar to the one below indicates the installation and configuration are normal:
The warning shown there appears because the JDK version (12) is too high, but it does not affect operation.
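Once the pyspark shell starts, a quick sanity check is to run a tiny job with the SparkContext it creates for you (available in the shell as sc):

# Inside the pyspark shell, sc is already defined
rdd = sc.parallelize(range(100))
print(rdd.sum())  # should print 4950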
6. Configure Spark in PyCharm
Open PyCharm and create a project. Then select "Run" -> "Edit Configurations" -> click + to create a new Python configuration.
Select "Environment variables" and add the SPARK_HOME and PYTHONPATH directories:
SPARK_HOME: the Spark installation directory
PYTHONPATH: the python directory under the Spark installation directory
Select File -> Settings -> your project -> Project Structure
Click "Add Content Root" in the upper right corner and add py4j-some-version.zip and pyspark.zip (both files are under the python\lib folder of the Spark directory).
Save the settings.
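As an alternative to setting these variables by hand in PyCharm, the third-party findspark package (an assumption here: install it separately with pip install findspark) can add Spark to the Python path at runtime:

import findspark

# Point findspark at the Spark directory; if SPARK_HOME is set, findspark.init() alone is enough
findspark.init(r"D:\adasoftware\spark-2.4.3-bin-hadoop2.7")

from pyspark import SparkContext  # importable now without the PyCharm settings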
7. To test whether the configuration succeeded, create a Python program with the following code and run it:
import os
import sys

# Path of the Spark installation (use a raw string so backslashes are not escapes)
os.environ['SPARK_HOME'] = r"D:\adasoftware\spark"

# Append pyspark to the Python path
sys.path.append(r"D:\adasoftware\spark\python")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
If the program prints "Successfully imported Spark Modules", the environment has been configured correctly.
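For a fuller check than the import test, the sketch below actually starts a local SparkContext and runs a small job (assuming the step-7 script above already runs cleanly):

from pyspark import SparkConf, SparkContext

# Run Spark locally with as many worker threads as logical cores
conf = SparkConf().setMaster("local[*]").setAppName("InstallCheck")
sc = SparkContext(conf=conf)

# A trivial job: square the numbers 0..9 and collect the results to the driver
squares = sc.parallelize(range(10)).map(lambda x: x * x).collect()
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

sc.stop()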