The official Spark website provides packages pre-built with Hadoop and Scala, which greatly simplifies installation.
Avoid a pitfall: from what I have observed, the Hadoop bundled with Spark is not a complete Hadoop distribution; it only includes the components Spark depends on, such as the HDFS client libraries. If you also need full Hadoop functionality, you must install Hadoop separately alongside Spark, and this tutorial does not cover that case.
Below I will use a brand-new Linux virtual machine for the installation:
Virtual machine software: VMware® Workstation 16 Pro
System: ubuntu-22.04.1-desktop-amd64
Install Java
Note that the Java version should match what your Spark version supports; here I use Java 17.
Official website: Overview - Spark 3.3.0 Documentation
Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. For the Scala API, Spark 3.3.0 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
Be sure to set JAVA_HOME in your environment variables. I won't go into detail on how to install Java itself; there are plenty of tutorials online covering Java installation on Linux (I followed a post on cnblogs, the "Blog Garden" site).
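A minimal sketch of setting JAVA_HOME, appended to `~/.bashrc`. The JDK path below is an assumption (the typical location for Ubuntu's `openjdk-17-jdk` package); adjust it to wherever your JDK actually lives:

```shell
# Hypothetical JDK path for Ubuntu's openjdk-17-jdk package; adjust to your install
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
# Put the JDK's binaries (java, javac, ...) on the PATH
export PATH="$JAVA_HOME/bin:$PATH"
```

After editing `~/.bashrc`, run `source ~/.bashrc` or open a new terminal so the variables take effect.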
Download Spark
Official website download: Downloads | Apache Spark
In the first dropdown, be sure to select the package pre-built with Hadoop and Scala.
Install
Extract the archive to a directory of your choice:
sudo tar -xzvf [your downloaded file path] -C [your Spark install path]
Replace the paths in brackets with your own. After the change, it looks like this:
sudo tar -xzvf ~/Downloads/spark-3.3.0-bin-hadoop3-scala2.13.tgz -C ~/Software/Spark
Verify successful installation
Go to the directory where Spark was extracted (the spark-3.3.0-bin-hadoop3-scala2.13 folder inside your install path):
cd [your Spark install path]
Run the sample program that estimates pi (the argument 10 is the number of partitions to split the computation across, not a decimal precision):
./bin/run-example SparkPi 10
It will print a lot of log output; as long as a line like "Pi is roughly 3.14..." appears near the end, the installation works.
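As an optional convenience not required by the steps above, you can also export SPARK_HOME and put Spark's bin directory on your PATH so commands like spark-shell work from anywhere. The path below assumes the example install location used in the tar step; adjust it to yours:

```shell
# Example install path from the extraction step above; adjust to your own
export SPARK_HOME="$HOME/Software/Spark/spark-3.3.0-bin-hadoop3-scala2.13"
# Make spark-shell, spark-submit, run-example, etc. available everywhere
export PATH="$SPARK_HOME/bin:$PATH"
```

As with JAVA_HOME, put these lines in `~/.bashrc` and `source` it to make them permanent.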
It's that simple
Reference:
Official website documentation: Overview - Spark 3.3.0 Documentation