Spark Study Notes: Building a Spark Environment on Windows (Repost)

This article explains how to set up a Spark environment on Windows.

1. Installation of JDK

1.1 Download JDK

  First, you need to install the JDK and configure its environment variables. If you already have it installed, you can skip this step. The JDK (Java Platform, Standard Edition Development Kit) can be downloaded from Oracle's official website; the download page is Java SE Downloads.

  The two places marked in red in the figure above are clickable. Clicking through shows more detailed information about this latest version, as shown in the following figure:

  After downloading, install the JDK directly. Installing the JDK on Windows is very simple: double-click the downloaded exe file and set your own installation directory (this directory is needed later when configuring the environment variable).

1.2 JDK environment variable settings

  Next, set the corresponding environment variable. Right-click [Computer] on the desktop, choose [Properties] - [Advanced System Settings], then in System Properties select [Advanced] - [Environment Variables]. Find the "Path" variable among the system variables and click the "Edit" button. In the dialog that appears, append the path of the bin folder under the JDK directory installed in the previous step; here that path is C:\Program Files\Java\jre1.8.0_92\bin. Add it to Path, taking care to separate it from existing entries with an English semicolon ";". As shown in the figure:
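
If you prefer the command line over the GUI, the user-level Path can also be appended with the built-in setx command. This is a minimal sketch reusing the example install path above; note the caveats in the comments.

REM Append the JDK bin directory to the user-level PATH.
REM Caveats: setx truncates values longer than 1024 characters, and the
REM change only appears in newly opened cmd windows, so the GUI editor
REM described above is the safer route for long PATH values.
setx PATH "%PATH%;C:\Program Files\Java\jre1.8.0_92\bin"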

  Once this is set, run the following command in a cmd window opened in any directory to check whether the configuration succeeded:

java -version

  If the relevant Java version information is printed, the JDK installation is complete. As shown in the figure:
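
For reference, on a JDK 8 installation the output looks roughly like the following (the exact version and build numbers depend on the update you installed):

java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)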

2. Installation of Scala

  Download Scala from the official website: http://www.scala-lang.org/. The latest version at the time of writing is 2.12.3, as shown in the figure. (Note that Spark 2.2.0 itself is built against Scala 2.11; the Scala installed here is only used as a standalone REPL, and spark-shell ships its own Scala, so this mismatch does not affect the steps below.)

Because this article targets the Windows environment, choose the corresponding Windows version to download, as shown in the figure:

  After downloading Scala's msi file, double-click it to run the installer. After a successful installation, the Scala bin directory is added to the PATH system variable by default (if not, add the bin directory under the Scala installation directory to PATH manually, following the same steps as for the JDK above). To verify the installation, open a new cmd window, type scala, and press Enter. If you enter the Scala interactive command environment normally, the installation succeeded. As shown below:
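
Inside the REPL, a quick expression or two serves as a sanity check; a minimal example:

scala> 1 + 1
res0: Int = 2

scala> util.Properties.versionString   // the standard library reports its own version
res1: String = version 2.12.3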

Note: If the version information is not displayed and the Scala interactive command line cannot be entered, there are usually two possibilities: 
1. The path of the bin folder under the Scala installation directory has not been correctly added to the Path system variable. Add it using the method described above. 
2. Scala was not installed correctly. Repeat the installation steps above.

3. Installation of Spark

Go to the Spark official website to download it: http://spark.apache.org/. Choose a Spark package pre-built for a Hadoop version, as shown in the figure:

  After downloading, you get a file of about 200 MB: spark-2.2.0-bin-hadoop2.7.tgz

  The Pre-built version is used here, meaning it has already been compiled and can be used directly. Spark's source code can also be downloaded, but it must be compiled manually before use. After the download completes, decompress the file (you may need to decompress it twice, since the .tgz archive contains a .tar file inside). It is best to extract it to the root directory of a drive and rename the folder to Spark; this is simple and less error-prone. Note that Spark's directory path must not contain spaces, so folder names like "Program Files" are not allowed. Here we create a new Spark folder on the C drive to hold it, as shown in the figure:
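
As an aside, recent versions of Windows 10 and 11 ship a bsdtar-based tar.exe, so the archive can be extracted in one step from cmd; a hedged sketch, assuming the C:\Spark target folder from above:

REM Create the target folder, then extract the archive into it;
REM --strip-components=1 drops the top-level spark-2.2.0-bin-hadoop2.7
REM directory so the files land directly in C:\Spark.
mkdir C:\Spark
tar -xzf spark-2.2.0-bin-hadoop2.7.tgz -C C:\Spark --strip-components=1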

  After decompression, Spark can basically be run from the cmd command line. At this point, however, every run of spark-shell (Spark's interactive command-line shell) requires you to cd into the Spark installation directory first, which is troublesome, so add Spark's bin directory to the system PATH variable. For example, my Spark bin directory here is D:\Spark\bin; add this path to PATH using the same method as in the JDK environment variable setup. After the variable is set, you can execute the spark-shell command directly from cmd in any directory to start Spark's interactive command-line mode.

  After the system variable is set, you can run spark-shell from cmd in any directory, but at this point you may encounter various errors. This is mainly because Spark is based on Hadoop, so a Hadoop runtime environment also has to be configured. The error looks like this:

Next, we also need to install Hadoop.

4. Installation of Hadoop

  On the Hadoop Releases page you can see the various historical versions of Hadoop. Since the Spark package we downloaded is based on Hadoop 2.7 (in the Spark download step we chose "Pre-built for Hadoop 2.7"), version 2.7.1 is chosen here. Clicking the corresponding version leads to the detailed download page, as shown in the following figure:

  Download the file marked in red in the figure. The src package above it is the source code; if you need to modify Hadoop or want to compile it yourself, download the src file instead. Here we download the compiled version, i.e. the "hadoop-2.7.1.tar.gz" file in the figure.

Download it and unzip it to the chosen directory, here C:\Hadoop, as shown in the figure:

Then, in the environment variable settings, set HADOOP_HOME to the directory Hadoop was extracted into (the folder that contains bin, here C:\Hadoop\hadoop-2.7.1), as shown in the figure:

Then add the bin directory under that directory to the system PATH variable, i.e. C:\Hadoop\hadoop-2.7.1\bin here. Since the HADOOP_HOME system variable has been added, you can also write %HADOOP_HOME%\bin for the bin folder path. After these two system variables are set, open a new cmd window and enter the spark-shell command directly. As shown in the figure:
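
These two variables can also be set from cmd with setx; a minimal sketch matching the C:\Hadoop\hadoop-2.7.1 layout above (open a new cmd window afterwards for the changes to become visible):

REM Set HADOOP_HOME at the user level (add /M in an administrator prompt
REM to set it machine-wide instead).
setx HADOOP_HOME "C:\Hadoop\hadoop-2.7.1"

REM Append Hadoop's bin folder to the user PATH. The literal path is used
REM because setx stores a plain string, so an embedded %HADOOP_HOME% would
REM not be expanded later; the 1024-character truncation caveat applies here too.
setx PATH "%PATH%;C:\Hadoop\hadoop-2.7.1\bin"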

  Under normal circumstances spark-shell now runs successfully and enters the Spark command-line environment, but some users may encounter a NullPointerException. This is mainly because the winutils.exe file is missing from Hadoop's bin directory. The solution is as follows: 

  Go to https://github.com/steveloughran/winutils and select the directory matching the Hadoop version you installed, then open its bin directory and find the winutils.exe file. To download it, click the winutils.exe file; on the page that opens there is a Download button in the upper-right area. Click it to download. As shown in the figure:

Download the winutils.exe file


  After downloading winutils.exe, place the file in Hadoop's bin directory, here C:\Hadoop\hadoop-2.7.1\bin.


Then enter the following in the open cmd window:

C:\Hadoop\hadoop-2.7.1\bin\winutils.exe chmod 777 /tmp/hive

This modifies the permissions of the \tmp\hive directory on the current drive; 777 grants all permissions (read, write, and execute) to everyone.

However, some other errors may still be reported (these errors also occur in the Linux environment):

<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql

The reason is that Spark has no permission to write its metastore_db directory (created by the embedded Derby metastore).

Fix: grant full (777) permissions.

In a Linux environment, run under root (or with sudo):

sudo chmod 777 /home/hadoop/spark

# A narrower alternative: just make the directory writable by everyone
sudo chmod a+w /home/hadoop/spark

In the Windows environment:

Make sure the folder where Spark is stored is not set to read-only or hidden, as shown in the figure:

Grant full control permissions as shown:
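
The same thing can be done from an elevated cmd window with the built-in attrib and icacls tools; a minimal sketch, assuming Spark lives in D:\Spark as above and an English-language Windows where the group is named Users:

REM Remove the read-only attribute recursively (/S recurses into
REM subfolders, /D processes the folders themselves as well).
attrib -R D:\Spark\* /S /D

REM Grant the Users group full control; (OI)(CI) makes the grant inherit
REM to contained files (object inherit) and subfolders (container inherit).
icacls D:\Spark /grant Users:(OI)(CI)F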

After these steps, open a new cmd window again. If everything is normal, you should be able to run Spark by typing spark-shell directly. The normal startup interface should look like the figure below:
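
Once the spark-shell prompt is up, a couple of one-liners confirm that the installation really works. This is a minimal sketch using the sc (SparkContext) and spark (SparkSession) objects that spark-shell creates for you:

scala> sc.parallelize(1 to 100).sum()   // distribute 1..100 as an RDD and sum it
res0: Double = 5050.0

scala> spark.range(10).count()   // a minimal Dataset query through the SparkSession
res1: Long = 10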
