1. Environment overview
- Linux distribution: Ubuntu
- Virtual machine application: VMware Workstation Pro
- Hadoop version: 3.1.3 (pseudo-distributed cluster)
- JDK version: JDK1.8.0_162
- Spark version: 2.4.0
- Scala version: 2.12.8
- Python version: 3.6.8 | 3.7.16
2. Ubuntu
2.1 CD file
First, open the download page (Download Ubuntu Desktop | Download | Ubuntu) and download the CD image file
Place the downloaded CD image file in a path such as:
D:\Program Files\Virtual Machines
- The storage path itself is not important and can be chosen freely
- But remember it, because you will need to specify it later
2.2 Create a virtual machine
On the VMware Workstation home page, select Create a New Virtual Machine
Select the Typical (recommended) configuration, then click Next
Installation source: choose to install the operating system later, then click Next
Select the guest operating system, then click Next
Adjust the virtual machine name and file location as needed, then click Next
Set the disk capacity, choose to split the virtual disk into multiple files, then click Next
After the basic settings are done, click Customize Hardware
Under the CD/DVD device, point the drive at the image file we downloaded [here we need the file path recorded in 2.1]
Start the virtual machine. The language defaults to English; I chose Chinese here.
Updates and other software: just click Continue
Installation type: click Install Now
Confirm writing the changes to disk
Choose Shanghai, China as the region
Set the username and password
Then wait for the installation to complete
3. Hadoop
3.1 Sudo
First, create a dedicated hadoop user and set its password. Open a terminal and enter the following command:
sudo useradd -m hadoop -s /bin/bash
Without sudo, this command fails with an insufficient-permissions error
Set a login password for the new hadoop user:
sudo passwd hadoop
- Here I set the password of user hadoop to hadoop
- While setting the password, the command line may warn "Invalid password: the password is less than 8 characters" or "Failed to pass the dictionary check"
- These warnings can be ignored
- The password is set once the same password has been entered twice
Grant administrator privileges to the hadoop user:
sudo adduser hadoop sudo
Switch to the hadoop user
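For example:
su hadoop   # prompts for the hadoop password set above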
3.2 SSH
Still logged in as the hadoop user, configure SSH:
1. Update apt
To make sure the openssh-server installation goes smoothly, update the APT package index first:
sudo apt-get update
After entering the command, wait for the update to run
Proceed to the next step once it finishes
2. Install SSH
sudo apt-get install openssh-server
When asked whether to continue, answer Y
Wait for the installation to complete as above
3. After installation, you can log in to the local machine with:
ssh localhost
On the first login, SSH shows a host-authenticity prompt; type yes
Then enter the password hadoop when prompted
This logs you in to the local machine
4. Configure SSH login without password
exit                  # leave the ssh localhost session
cd ~/.ssh/            # if this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa     # press Enter at each prompt; an empty answer means no passphrase
cat ./id_rsa.pub >> ./authorized_keys   # authorize the key (no sudo needed in your own home directory)
- Now run ssh localhost again: you can log in directly without entering a password
- Passwordless login must not be skipped, otherwise starting the hadoop cluster fails with permission and authentication problems
- If you skip it, starting the hadoop cluster later reports errors like: localhost: permission denied
3.3 JDK
Download the jdk-8u162-linux-x64.tar.gz file from:
Java Archive Downloads - Java SE 8 (oracle.com)
Once the JDK file is on the host machine, transfer it to the virtual machine
- You can use VMware Tools (see Section 4)
- Or copy files between the host and the virtual machine with a USB flash drive:
1. Copy the file on the host
2. Attach the USB drive holding the file to the virtual machine
3. Copy the file inside the virtual machine, then eject the USB drive to complete the transfer
- With the tar file in place, you can decompress and install offline
- The JDK installation is shown here:
1. Unzip jdk
Log in as the hadoop user and decompress the archive with sudo:
sudo tar -zxvf jdk-8u162-linux-x64.tar.gz
2. Set the environment variables of JDK
vi ~/.bashrc
Determine where the JDK was extracted; in my virtual machine it sits under /opt
So add to your .bashrc file:
export JAVA_HOME=/opt/jdk1.8.0_162
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
3. Make the environment variables of the added JDK take effect
source ~/.bashrc
4. Check whether the JDK is installed successfully
java -version
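If everything is configured correctly, the output should look roughly like this (the exact build string may differ):
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)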
If the version is printed, the JDK has been installed successfully
3.4 Hadoop
Download the hadoop-3.1.3.tar.gz file from: Apache Hadoop
Log in as the hadoop user and decompress hadoop with the tar command under sudo:
sudo tar -zxf ./hadoop-3.1.3.tar.gz
The folder after decompression is completed:
Rename the folder to hadoop to drop the version number from future commands
sudo mv ./hadoop-3.1.3/ ./hadoop
Change the owner so the files belong to the hadoop user
sudo chown -R hadoop ./hadoop
Hadoop's commands only take effect once its environment variables are configured
Add to the .bashrc file:
export HADOOP_HOME=/opt/hadoop
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Make changes to environment variables take effect
source ~/.bashrc
View hadoop version information
/opt/hadoop/bin/hadoop version
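A successful installation prints the version banner, starting with:
Hadoop 3.1.3
followed by build and checksum information.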
3.5 Pseudo-distributed clusters
1. Enter the hadoop installation directory
cd ./hadoop
2. Log in to the hadoop account with modify permissions
su hadoop
3. Modify the configuration file core-site.xml
sudo vi ./etc/hadoop/core-site.xml
Add the following to the core-site.xml file:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/yt/桌面/files/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
4. Log in again as the hadoop user, which has permission to modify the file
su hadoop
5. Modify the configuration file hdfs-site.xml
vi ./etc/hadoop/hdfs-site.xml
Add the following to the hdfs-site.xml file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/yt/桌面/files/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/yt/桌面/files/hadoop/tmp/dfs/data</value>
</property>
</configuration>
6. Modify the startup scripts under sbin
Find the sbin folder in the Hadoop installation directory
and modify four files in it
Add at the top of the start-dfs.sh and stop-dfs.sh files:
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Add at the top of the start-yarn.sh and stop-yarn.sh files:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
7. Format the NameNode
./bin/hdfs namenode -format
Watch for the prompt [successfully formatted] in the output
(run the command under sudo if it fails with a permission error)
After formatting completes, several files appear in the tmp/name folder under the hadoop main folder
8. Adjust file permissions
Grant permissions on the folders:
sudo chmod -R 777 ./logs
sudo chown -R hadoop /opt/hadoop/tmp
9. Start hadoop
sudo /opt/hadoop/sbin/start-all.sh
10. Use the jps command to check whether hadoop has started successfully. If it displays: DataNode, NodeManager, NameNode, SecondaryNameNode and ResourceManager, it means it has started successfully.
jps
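The output should look roughly like this (the process IDs are examples and will differ on your machine):
8225 NameNode
8371 DataNode
8563 SecondaryNameNode
8759 ResourceManager
8887 NodeManager
9012 Jps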
11. Open the hadoop web interface in a browser
http://localhost:9870
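As an optional sanity check (not part of the original steps), you can create a home directory on HDFS and list the root:
/opt/hadoop/bin/hdfs dfs -mkdir -p /user/hadoop
/opt/hadoop/bin/hdfs dfs -ls /
If both commands succeed, HDFS is accepting reads and writes.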
12. stop hadoop
Go to the directory /opt/hadoop/sbin and run:
sudo ./stop-all.sh
4. VMware Tools
4.1 Installation
- This section shows how to install VMware Tools
- The main purpose is to enable drag-and-drop file transfer between the Windows host and the virtual machine
Open the VM menu and click Virtual Machine - Install VMware Tools
A new disc then appears in the file directory, containing the installer files
Copy the required .gz file to the desktop so it is easy to locate from the terminal
Open the terminal and enter the extraction command
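The command itself is not shown in the original; a tar extraction along these lines is meant (the exact file name depends on your VMware version):
tar -zxvf VMwareTools-*.tar.gz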
The extracted files appear in the same directory
Enter the extracted folder and open a terminal there
sudo ./vmware-install.pl
If you have no specific configuration requirements at the interactive prompts,
keep pressing Enter or answering yes until the installer finishes with "Enjoy"
Open the VM menu bar again and click Virtual Machine:
the "Install VMware Tools" item has changed to "Reinstall VMware Tools"
4.2 Usage
First, create a folder on the host machine's disk and note its location
Click the menu bar: Virtual Machine - Settings - Options
Check "Shared Folders", choose Enable, and click "Add"
Once enabling succeeds,
the shared folder appears inside the virtual machine under the path:
Other Locations > Computer > /mnt/hgfs/
This folder has the same structure and content as the folder we created on the host.
5. Spark
5.1 Scala
Switch to the hadoop user on the command line, entering its password when prompted
su hadoop
Go to the directory containing the compressed package
cd /opt
Decompress it with sudo tar
sudo tar -zxf scala-2.12.8.tgz
Rename the folder to drop the version number
sudo mv ./scala-2.12.8/ ./scala
Edit the configuration file
sudo vi ~/.bashrc
Add the Scala configuration
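The exports themselves are omitted in the original; given the /opt/scala path used above (and the SCALA_HOME value set in spark-env.sh later), they would be:
export SCALA_HOME=/opt/scala
export PATH=$PATH:$SCALA_HOME/bin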
Use the source command to make the modified configuration file take effect
source ~/.bashrc
Test whether the installation succeeded by entering the Scala environment, for example as below
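For example:
scala -version        # prints the installed Scala version
scala                 # launches the REPL; type :quit to leave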
5.2 spark
Unzip Spark
cd /opt
sudo tar -zxf spark-2.4.0-bin-without-hadoop.tgz
sudo mv ./spark-2.4.0-bin-without-hadoop/ ./spark
For a simple configuration, copy the template file and edit spark-env.sh with the vi editor:
cd /opt/spark/conf
sudo cp spark-env.sh.template spark-env.sh
sudo vi spark-env.sh
In the opened vi editing environment:
export SCALA_HOME=/opt/scala
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export SPARK_LOCAL_DIRS=/opt/spark
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
export JAVA_HOME=/opt/jdk1.8.0_162
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_MASTER_HOST=hostname   # replace hostname with your machine's actual host name
export SPARK_MASTER_PORT=7077
Edit environment variables
vi ~/.bashrc
In the .bashrc file add:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
Make it take effect
source ~/.bashrc
Edit the slaves file
cd /opt/spark/conf
cp slaves.template slaves
vi slaves
Comment out localhost and add your own host name on the line below it (hostname below is a placeholder)
#localhost
hostname
Modify spark-config.sh in the sbin directory to add jdk environment variables
export JAVA_HOME=/opt/jdk1.8.0_162
Start Spark
cd /opt/spark/bin
./spark-shell
If it can perform simple file reads and computations, the Spark installation has succeeded:
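For instance, a quick read-and-count inside the shell (the README path assumes the /opt/spark install above):
scala> val lines = sc.textFile("file:///opt/spark/README.md")
scala> lines.count()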
5.3 pyspark
First you need to install and configure the Python environment
- An older Python version must be used to run this pyspark; version 3.10 does not work
- Otherwise it fails with: TypeError: 'bytes' object cannot be interpreted as an integer
- After installing Python 3.7 (the installation is covered in Section 6, Python, of this post)
- the pyspark shell can perform simple file reads and computations
If you want to start pyspark directly in the Python environment, you need to configure the following environment variables:
sudo vi ~/.bashrc
Add the following line to the opened file:
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/pyspark:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
Use the source statement to make the environment configuration take effect
source ~/.bashrc
Test that pyspark can be used directly from the Python environment via import statements, as sketched below:
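A minimal check along these lines (the local master and the app name "test" are arbitrary choices):
python3.7
>>> from pyspark import SparkContext
>>> sc = SparkContext("local", "test")
>>> sc.parallelize(range(10)).sum()
45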
5.4 Pseudo-distributed
With the installation steps above, the pseudo-distributed configuration is already complete
Adjust the log level of the console output
cd /opt/spark/conf
cp log4j.properties.template log4j.properties
vi log4j.properties
Change log4j.rootCategory=INFO, console to: log4j.rootCategory=WARN, console
Start Spark
/opt/spark/sbin/start-all.sh
After starting, check with the jps command; output like the following means the pseudo-distributed cluster started successfully
$ jps
12770 Master
12949 Worker
Stop Spark
/opt/spark/sbin/stop-all.sh
6. Python
6.1 From source code
6.1.1 Python
Download Python-3.6.8.tgz from the Huawei mirror: Index of python-local/3.6.8 (huaweicloud.com)
Log in as the hadoop user
su hadoop
Unzip and install the Python package
sudo tar -zxf Python-3.6.8.tgz
Rename to remove the version number information
sudo mv ./Python-3.6.8/ ./python
Enter the python directory
cd python
Install the build dependencies
sudo apt-get install -y gcc make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev
Run the configure script, which checks dependencies and generates the Makefile:
sudo ./configure --enable-optimizations --prefix=/usr/local/bin/python3.6
If gcc is missing, this step fails and no Makefile is generated
Install the gcc library through the apt tool
sudo apt install gcc
If you hit an "unable to acquire lock /var/lib/dpkg/lock-frontend" error here,
kill the process holding the lock; take the process number from your own error message:
sudo kill 22035
If downloads fail with errors,
run the following command and check the network:
sudo apt-get update
After the Makefile has been created successfully,
continue on the command line with:
sudo make
sudo make install
Check the installation and the Python path
$ python3.6 --version
Python 3.6.8
The installation succeeded, but with multiple Python versions on the machine the commands can clash
Use the following commands
sudo ln -s -f /usr/local/bin/python3.6/bin/python3.6 /usr/bin/python3.6
sudo ln -s -f /usr/local/bin/python3.6/bin/pip3.6 /usr/bin/pip3.6
so that python3.6 and pip3.6 point at the interpreter installed above
6.1.2 pip
Get the pip-20.2.3.tar.gz file under the /opt path
Download link: Links for pip (tsinghua.edu.cn)
Log in as the hadoop user
su hadoop
Go to the directory containing the compressed package
cd /opt
Decompress it
sudo tar -zxvf pip-20.2.3.tar.gz
Enter the directory of the pip file
cd /opt/pip-20.2.3
Install, specifying the Python version (otherwise permission errors may be reported here):
sudo python3.6 setup.py build
sudo python3.6 setup.py install
6.1.3 numpy
After installing the pip tool, install the numpy library from a mirror:
pip install numpy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
The numpy library is now installed
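To verify, import the library and print its version:
python3.6 -c "import numpy; print(numpy.__version__)"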
6.2 Default Python
If there is no special requirement, you can use the Python version that ships with Ubuntu
For example, Python 3.10 is preinstalled on Ubuntu 22.04.2
Complete the installation and configuration of the Python environment with the following simple commands
python3 --version
sudo apt install python3-pip
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1
python --version
pip -V
sudo pip install numpy
- This method is the easiest, but because the Python version is so new it may cause some incompatibilities
6.3 Via apt
Update the apt tool
sudo apt update
add ppa source
sudo apt install software-properties-common
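The original does not name the PPA; for Python 3.7 on Ubuntu the commonly used one is deadsnakes, added like this:
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update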
Install Python 3.7
sudo apt install python3.7
After the installation, register the versions with update-alternatives, giving the version you need the higher priority
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.7 2
A version-selection menu lets you switch the Python version at any time:
sudo update-alternatives --config python
After this, entering python on the command line no longer points at an unknown command
Use the pip tool bundled with Python 3.7 to install third-party libraries such as numpy:
sudo python3.7 -m pip install numpy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com