This article explains how to set up Spark in a Windows environment.
1. Installation of JDK
1.1 Download JDK
First, you need to install the JDK and configure its environment variables. If you already have a JDK installed, you can skip this step. The JDK (Java Platform, Standard Edition Development Kit) can be downloaded from Oracle's official website on the Java SE Downloads page.
The two places marked in red in the picture above are clickable. After clicking, you can see more detailed information about this latest version, as shown in the following figure:
After downloading, we can install the JDK directly. Installing the JDK under Windows is very simple: as with any ordinary software, double-click the downloaded exe file and set your own installation directory (this installation directory is needed later when setting the environment variable).
1.2 JDK environment variable settings
Next, set the corresponding environment variable. The method is: right-click [Computer] on the desktop, choose [Properties] - [Advanced System Settings], then select [Advanced] - [Environment Variables] in the system properties. In the system variables, find the "Path" variable and click the "Edit" button. In the dialog box that appears, append the path of the bin folder under the JDK directory installed in the previous step. Here the bin folder's path is C:\Program Files\Java\jre1.8.0_92\bin, so add it to Path, taking care to separate entries with an English semicolon ";". As the picture shows:
After this is set, you can open a cmd command-line window in any directory and run the following command to check whether the setting succeeded:
java -version
Observe whether the Java version information is printed. If it is, the JDK installation is complete. As the picture shows:
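Beyond `java -version`, any JVM program can report which Java it is running on. As a minimal sketch (to be run once Scala is installed in the next section, since it uses the Scala runner), the two standard system properties below show the version and installation directory the JVM actually picked up from PATH:

```scala
// Standard JVM system properties; these reflect the JDK/JRE actually in use.
val javaVersion = System.getProperty("java.version") // e.g. "1.8.0_92"
val javaHome    = System.getProperty("java.home")    // the installation directory
println(s"Running Java $javaVersion from $javaHome")
```

If the printed directory is not the JDK you just installed, another Java entry earlier in PATH is shadowing it.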
2. Installation of Scala
We download Scala from the official website: http://www.scala-lang.org/ . The latest version at the time of writing is 2.12.3, as shown in the figure:
Since we are working in a Windows environment, which is the purpose of this article, we choose the corresponding Windows version to download, as shown in the figure:
After downloading Scala's msi file, double-click it to install. After a successful installation, the Scala bin directory is added to the PATH system variable by default (if not, add the bin directory under the Scala installation directory to PATH, just as in the JDK steps above). To verify the installation, open a new cmd window, enter scala, and press Enter. If the Scala interactive command environment starts normally, the installation succeeded. As shown below:
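Once you are at the `scala>` prompt, a couple of throwaway expressions confirm the interpreter really works. The snippet below is just a sanity check; `scala.util.Properties.versionNumberString` reports the version of the Scala library you are running:

```scala
// Paste at the scala> prompt to confirm the interpreter evaluates code.
val sum = (1 to 10).sum   // arithmetic over a range; evaluates to 55
val greeting = s"Scala ${scala.util.Properties.versionNumberString}"
println(sum)
println(greeting)
```

If both values print, the interpreter is evaluating expressions correctly.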
Note: If the version information cannot be displayed and the interactive command line of Scala cannot be entered, there are usually two possibilities:
1. The path of the bin folder under the Scala installation directory was not correctly added to the Path system variable; add it using the method described above.
2. Scala was not installed correctly; repeat the steps above.
3. Installation of Spark
We go to the Spark official website to download: http://spark.apache.org/ . We choose a Spark package pre-built for a Hadoop version, as shown in the figure:
After downloading, I got a file of about 200 MB: spark-2.2.0-bin-hadoop2.7.
The pre-built version is used here, meaning it has already been compiled and can be used directly. Spark's source code can also be downloaded, but it must be compiled manually before use. After the download completes, decompress the file (you may need to decompress it twice, since it is a .tgz archive). It is best to extract it to the root directory of a drive and rename the folder to Spark, which is simple and less error-prone. Note that Spark's directory path must contain no spaces, so folder names like "Program Files" are not allowed. We create a new Spark folder on the C drive to store it, as shown in the figure:
After decompression, Spark can basically be run from the cmd command line. But at this point, every time you run spark-shell (Spark's interactive command-line shell), you must first cd into the Spark installation directory, which is troublesome, so you can add Spark's bin directory to the PATH system variable. For example, my Spark bin directory here is D:\Spark\bin; add this path to PATH using the same method as the environment-variable setting during the JDK installation. Once set, you can directly execute the spark-shell command from a cmd window in any directory to start Spark's interactive command-line mode.
After the system variable is set, you can run spark-shell from cmd in any directory, but you may encounter various errors at this point. This is mainly because Spark depends on Hadoop, so a Hadoop runtime environment must also be configured. The error is as shown:
Next, we also need to install Hadoop.
4. Installation of Hadoop
On the Hadoop Releases page, you can see the various historical versions of Hadoop. Since the Spark we downloaded is built against Hadoop 2.7 (in the Spark download step we chose the "Pre-built for Hadoop 2.7" package), I choose version 2.7.1 here. After clicking the corresponding version, you will enter the detailed download page, as shown in the following figure:
Select the item marked in red in the figure to download. The src version above it is the source code; if you need to modify Hadoop or want to compile it yourself, download the corresponding src file. What I download here is the compiled version, i.e. the "hadoop-2.7.1.tar.gz" file in the figure.
Download and unzip it to the chosen directory, here C:\Hadoop, as shown in the figure:
Then go to the environment variables section and set HADOOP_HOME to the Hadoop decompression directory, as shown in the figure:
Then add the bin directory under this directory, here C:\Hadoop\bin, to the PATH system variable. Since the HADOOP_HOME system variable has been added, you can also use %HADOOP_HOME%\bin to specify the bin folder path. After these two system variables are set, open a new cmd window and enter the spark-shell command directly. As the picture shows:
Under normal circumstances, it runs successfully and enters the Spark command-line environment, but some users may encounter a null-pointer error. This is mainly because the winutils.exe file is missing from Hadoop's bin directory. The solution is:
Go to https://github.com/steveloughran/winutils , select the version number of the Hadoop you installed, and find winutils.exe in its bin directory. To download it, click the winutils.exe file; on the page that opens, there is a Download button in the upper-right part of the page. Click it to download. As the picture shows:
Download the winutils.exe file
After downloading winutils.exe, put it into Hadoop's bin directory, here C:\Hadoop\hadoop-2.7.1\bin.
In the open cmd window, enter:
C:\Hadoop\hadoop-2.7.1\bin\winutils.exe chmod 777 /tmp/hive
This modifies the permissions of the /tmp/hive directory; 777 grants all permissions.
But then we found that some other errors were reported (this error also occurs in a Linux environment):
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
The reason is that Spark does not have permission to write its metastore_db directory.
Solution: grant 777 permissions.
In a Linux environment, we operate as root:
sudo chmod 777 /home/hadoop/spark
# For convenience, you can grant write permission to all users
sudo chmod a+w /home/hadoop/spark
In the Windows environment:
The folder where Spark is stored must not be set to read-only or hidden, as shown in the figure:
Grant full control permissions as shown:
After these steps, open a new cmd window again; if everything is normal, you should be able to run Spark by directly typing spark-shell. The normal startup interface should look like the figure below:
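With everything in place, you can try a first small job. The snippet below is a minimal sketch to paste at the `scala>` prompt of spark-shell; it assumes the pre-built Spark 2.2.0 shell, which automatically creates the `sc` (SparkContext) and `spark` (SparkSession) values for you:

```scala
// Typed at the scala> prompt; sc and spark are provided by spark-shell itself.
spark.version                        // prints the Spark version, e.g. 2.2.0
val nums = sc.parallelize(1 to 100)  // distribute a local range as an RDD
nums.reduce(_ + _)                   // sum it (runs in local mode here): 5050
spark.range(5).show()                // a small DataFrame via the SparkSession
```

If the sum and the small table print without exceptions, both the RDD path and the SparkSession (and hence the metastore_db permissions fixed above) are working.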