Big Data: VMware | Ubuntu | Hadoop | Spark | VMware Tools | Python Installation and Configuration Summary

1. Environment overview

  • Linux distribution: Ubuntu
  • Virtual machine application: VMware Workstation Pro
  • Hadoop version: 3.1.3 (pseudo-distributed cluster)
  • JDK version: JDK1.8.0_162
  • Spark version: 2.4.0
  • Scala version: 2.12.8
  • Python version: 3.6.8 | 3.7.16

2. Ubuntu

2.1 ISO image file

First, open the "Download Ubuntu Desktop | Download | Ubuntu" page and download the image file

insert image description here

Put the downloaded image file into the following file path:

 D:\Program Files\Virtual Machines
  • The exact path is not important and can be chosen freely
  • But you need to remember it, because you will need to specify it later

2.2 Create a virtual machine

On the VMware Workstation home page, select Create a virtual machine

insert image description here

Select a typical recommended configuration - the next step

insert image description here

Installation source - choose to install the operating system later - next

insert image description here

Select Guest OS - Next Step

insert image description here

The virtual machine name and file location can be adjusted as needed; then click Next

insert image description here

Select disk capacity - split into multiple files - next step

insert image description here

After the basic settings of the virtual machine are created - click "Customize Hardware"

insert image description here

In the CD/DVD settings, specify the image file we downloaded [ use the file path we recorded in 2.1 ]

insert image description here

Click to start the virtual machine. The language item defaults to English. I have chosen Chinese here.

insert image description here

Updates and other software - just click continue

insert image description here

Installation type - click to install now

insert image description here

Confirm writing to disk:

insert image description here

Choose China-shanghai as the region

insert image description here

Set username and password

insert image description here

Then wait for the installation to complete:

insert image description here

3. Hadoop

3.1 Sudo

First, create a user named hadoop and set a password for it. Open a terminal and enter the following command:

sudo useradd -m hadoop -s /bin/bash

If you do not use sudo here, you will be prompted for insufficient permissions

Set the authentication password for the new user hadoop just created

sudo passwd hadoop 

insert image description here

  • Here I set the password of the hadoop user to hadoop
  • When setting the password, the command line may warn "Invalid password: the password is less than 8 characters" or "Failed to pass the dictionary check"
  • These warnings can be ignored
  • Enter the same password twice to complete the setting

Add administrator privileges to the hadoop user

sudo adduser hadoop sudo

insert image description here

switch to user hadoop

insert image description here

3.2 SSH

Continue to configure SSH under hadoop user login:

1. Update apt

In order to ensure that the openssh-server installation process goes smoothly, execute the following command to update the APT tool:

sudo apt-get update

After entering the command, wait while the package lists are updated over the network

insert image description here

insert image description here

Proceed to the next step when the "Finish" prompt appears

2. Install SSH

sudo apt-get install openssh-server 

After entering the command, apt will ask whether to continue; answer Y

insert image description here

Wait for the installation to complete as above

3. After installation, you can use the following command to log in to the machine

ssh localhost 

At this time, there will be the following prompt (SSH first login prompt), enter yes

insert image description here

Then enter the password (hadoop) at the prompt

insert image description here

This will log in to the machine

insert image description here

4. Configure SSH login without password

exit                 # exit the ssh localhost session opened above
cd ~/.ssh/           # if this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa    # press Enter at each prompt; an empty passphrase means login without a password

insert image description here

cat ./id_rsa.pub >> ./authorized_keys   # add the public key to authorized_keys

insert image description here

  • At this point, run the ssh localhost command again and you will log in directly without entering a password.

  • Passwordless login cannot be omitted; otherwise there will be permission and authentication problems when starting the Hadoop cluster

  • If you skip the passwordless login setup, starting the Hadoop cluster later will report the error: localhost: root localhost: permission denied

3.3 JDK

Download jdk-8u162-linux-x64.tar file address:
Java Archive Downloads - Java SE 8 (oracle.com)

insert image description here

After downloading the JDK file on the host machine, you need to transfer it into the virtual machine

  • You can use the VMware Tools approach described in section 4
  • Or use a USB flash drive to copy files between the host and the virtual machine

1. First copy the file in the host

insert image description here

2. Then attach the USB flash drive containing the required files to the virtual machine

insert image description here

3. Then copy the file inside the virtual machine and eject the USB drive to complete the file transfer

insert image description here

insert image description here

  • Once you have the tar file, you can decompress and install it offline
  • The JDK installation is shown here:

1. Unzip jdk

Log in to the hadoop user and use the privileged command sudo to decompress

sudo tar -zxvf jdk-8u162-linux-x64.tar.gz 

2. Set the environment variables of JDK

vi ~/.bashrc 

Determine the location of the JDK files; in my virtual machine the decompressed JDK is located under the /opt folder

insert image description here

So add in your .bashrc file:

export JAVA_HOME=/opt/jdk1.8.0_162

export JRE_HOME=${JAVA_HOME}/jre 

export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib 

export PATH=${JAVA_HOME}/bin:$PATH 

3. Make the environment variables of the added JDK take effect

 source ~/.bashrc 

4. Check whether the JDK is installed successfully

java -version

If the version information is printed, the JDK has been installed successfully

insert image description here

3.4 hadoop

Download hadoop-3.1.3.tar.gz file address: Apache Hadoop

Log in as the hadoop user and decompress Hadoop with the privileged tar command:

sudo tar -zxf ./hadoop-3.1.3.tar.gz 

The folder after decompression is completed:

insert image description here

Rename the folder to hadoop to remove the version number and simplify later commands

sudo mv ./hadoop-3.1.3/ ./hadoop 

insert image description here

Change the ownership so that the files belong to the hadoop user

sudo chown -R hadoop ./hadoop 

Hadoop needs to be added to the environment variables so that its commands take effect

In the .bashrc file add:

export HADOOP_HOME=/opt/hadoop
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Make changes to environment variables take effect

source ~/.bashrc

View hadoop version information

/opt/hadoop/bin/hadoop version

insert image description here

3.5 Pseudo-distributed clusters

1. Enter the hadoop installation directory

cd ./hadoop 

2. Log in to the hadoop account with modify permissions

su hadoop

3. Modify the configuration file core-site.xml

sudo vi ./etc/hadoop/core-site.xml

Add the following to the core-site.xml file:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/yt/桌面/files/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

insert image description here

4. Log in to the hadoop account that has the permission to modify the content of the file

su hadoop

5. Modify the configuration file hdfs-site.xml

vi ./etc/hadoop/hdfs-site.xml

Add in the hdfs-site.xml file

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/yt/桌面/files/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/yt/桌面/files/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

insert image description here

6. Modify the configuration under sbin

Find the sbin folder in the Hadoop installation directory

Modify four files in it

Add at the top of the start-dfs.sh and stop-dfs.sh files:

HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

Add at the top of the start-yarn.sh and stop-yarn.sh files:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

7. Perform NameNode formatting

./bin/hdfs namenode -format

Observe that the prompt "successfully formatted" appears in the window. If a permission error is reported instead, rerun the command with sudo:

sudo ./bin/hdfs namenode -format

insert image description here

After formatting is complete, several files appear under the dfs/name directory of the tmp folder configured above
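You can confirm this with ls, using the dfs.namenode.name.dir path configured in hdfs-site.xml above:

ls /home/yt/桌面/files/hadoop/tmp/dfs/name/current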

8. Adjust file permissions

Add permission to folder

sudo chmod -R 777 ./logs
sudo chown -R hadoop /opt/hadoop/tmp

9. Start hadoop

sudo /opt/hadoop/sbin/start-all.sh

10. Use the jps command to check whether Hadoop started successfully. If DataNode, NodeManager, NameNode, SecondaryNameNode, and ResourceManager are all displayed, the startup succeeded.

jps

insert image description here

11. Use a browser to enter the hadoop web interface

http://localhost:9870

insert image description here

12. stop hadoop

Go to the directory /opt/hadoop/sbin and run:

sudo ./stop-all.sh

4. VMware Tools

4.1 Installation

  • This section describes the installation of VMware Tools

  • The main purpose is to enable file transfer between the Windows host and the virtual machine by drag and drop

  • In the VMware menu bar, click Virtual Machine - Install VMware Tools

  • A new virtual CD drive will then appear in the file manager, containing the following files

insert image description here

Copy the required .gz file to the desktop so that it is easy to locate from the terminal

insert image description here

Open a terminal and enter the extraction command (a typical form is shown after the screenshot)

insert image description here
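A typical extraction command looks like the following; the exact archive name depends on your VMware version, so adjust the VMwareTools-*.tar.gz pattern (and the desktop path) as needed:

cd ~/Desktop
tar -zxvf VMwareTools-*.tar.gz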

The corresponding decompressed files will appear in the same directory

insert image description here

Enter the extracted folder, open a terminal there, and run the installer:

sudo ./vmware-install.pl

If there are no specific configuration requirements in the interactive prompts,

keep pressing Enter (or answering yes) to continue the installation until the final "Enjoy" message appears

insert image description here

insert image description here

Open the VM menu bar again and click Virtual Machine

The "Install VMware Tools" entry below has changed to "Reinstall VMware Tools", confirming the installation

4.2 Use

First, create a folder on the host machine (outside the virtual machine) and note its location

insert image description here

Click on the menu bar Virtual Machine - Settings - Options

insert image description here

Check "Shared Folders" Enable and click "Add"

insert image description here

insert image description here

insert image description here

Enabled successfully

insert image description here

Then, in the virtual machine's file manager, browse to the following path (Other Locations - Computer):

/mnt/hgfs/

The shared folder can be found there; it has the same structure and content as the folder we created on the host. It can also be checked from the terminal, as shown below.
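For example (the listing shows whatever share name you chose when adding the folder):

ls /mnt/hgfs/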

5. Spark

5.1 Scala

Switch to the hadoop user on the command line and enter the password when prompted

su hadoop

Enter the compressed package file path

cd /opt

The sudo tar command completes the decompression

sudo tar -zxf scala-2.12.8.tgz

insert image description here

Rename the folder

sudo mv ./scala-2.12.8/ ./scala

Modify the configuration file

sudo vi ~/.bashrc 

Add the Scala configuration (a sketch follows the screenshot)

insert image description here
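The screenshot shows the added lines; a minimal sketch of the usual entries, assuming Scala was extracted to /opt/scala as above:

export SCALA_HOME=/opt/scala
export PATH=$PATH:$SCALA_HOME/bin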

Use the source command to make the modified configuration file take effect

source ~/.bashrc 

Test whether the installation succeeded by entering the Scala environment (for example, as below)

insert image description here
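A quick check of this kind, assuming scala is now on the PATH:

scala -version      # prints the installed Scala version
scala               # enters the interactive Scala environment; type :quit to leave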

5.2 Spark

Unzip Spark

cd /opt 
sudo tar -zxf spark-2.4.0-bin-without-hadoop.tgz 
sudo mv ./spark-2.4.0-bin-without-hadoop/ ./spark

Next comes a simple configuration: copy the template file and edit spark-env.sh with the vi editor

cd /opt/spark/conf
sudo cp spark-env.sh.template spark-env.sh
sudo vi spark-env.sh

In the opened vi editing environment:

export SCALA_HOME=/opt/scala
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export SPARK_LOCAL_DIRS=/opt/spark
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
export JAVA_HOME=/opt/jdk1.8.0_162
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_MASTER_HOST=hostname    # replace hostname with your machine's actual host name
export SPARK_MASTER_PORT=7077

Edit environment variables

vi ~/.bashrc

In the .bashrc file add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin

Make it take effect

source ~/.bashrc

Edit the slaves file

cd /opt/spark/conf
cp slaves.template slaves
vi slaves

Comment out localhost and add your own host name (written here as hostname) below it

#localhost
hostname

Modify spark-config.sh in the sbin directory to add jdk environment variables

export JAVA_HOME=/opt/jdk1.8.0_162 

Start Spark

cd /opt/spark/bin
./spark-shell

If a simple file read and calculation can be performed, the Spark installation test is successful (a sketch follows the screenshot):

insert image description here
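As a sketch of such a test, a one-line job can be fed to spark-shell from the command line; the README.md path is only an example, and any local text file works:

cd /opt/spark
echo 'sc.textFile("file:///opt/spark/README.md").count()' | ./bin/spark-shell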

5.3 pyspark

First you need to install and configure the Python environment

  • An older version of Python must be used to run pyspark; Python 3.10 cannot be used

  • Otherwise the error "TypeError: 'bytes' object cannot be interpreted as an integer" will be reported

  • After installing Python 3.7 (the installation is covered in section 6, Python, of this post)

  • pyspark can perform a simple file-reading and calculation test (a sketch follows the screenshot)

insert image description here
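A comparable sketch for this test, again piping a one-line job into the shell (the file path is illustrative):

cd /opt/spark
echo 'print(sc.textFile("file:///opt/spark/README.md").count())' | ./bin/pyspark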

If you want to start pyspark directly in the Python environment, you need to configure the following environment variables:

sudo vi ~/.bashrc

Add the following statement in the open environment:

export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/pyspark:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
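The py4j file name in this path varies with the Spark release, so it is worth confirming the actual name before adding the line:

ls /opt/spark/python/lib/
# expect something like py4j-0.10.7-src.zip and pyspark.zip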

Use the source statement to make the environment configuration take effect

source ~/.bashrc

Test that pyspark can now be imported directly in the Python environment:

insert image description here
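A minimal command-line version of this import test, assuming python3.7 is the interpreter configured in section 6:

python3.7 -c "import pyspark; print(pyspark.__version__)"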

5.4 Pseudo-distributed

According to the previous installation process, the pseudo-distributed configuration has been completed

Adjust the printing level of the log file

cd /opt/spark/conf
cp log4j.properties.template log4j.properties
vi log4j.properties
# change log4j.rootCategory=INFO, console to: log4j.rootCategory=WARN, console

Start Spark

/opt/spark/sbin/start-all.sh  

After starting, check with the jps command; output like the following means the pseudo-distributed cluster has started successfully

$ jps
12770 Master
12949 Worker
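With the Master and Worker running, a shell can be pointed at the cluster using the host name and port 7077 configured in spark-env.sh (replace hostname with your actual machine name):

/opt/spark/bin/spark-shell --master spark://hostname:7077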

Stop Spark

/opt/spark/sbin/stop-all.sh

6. Python

6.1 Installation from source code

6.1.1 Python

Download Python-3.6.8.tgz from the Huawei mirror: Index of python-local/3.6.8 (huaweicloud.com)

Log in as the hadoop user

su hadoop

Unzip and install the Python package

sudo tar -zxf Python-3.6.8.tgz 

Rename to remove version number information

mv ./Python-3.6.8/ ./python 

Enter the python directory

cd python

Install the build dependencies

sudo apt-get install -y gcc make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev

Run configure to check dependencies and generate the Makefile

sudo ./configure --enable-optimizations --prefix=/usr/local/bin/python3.6

If gcc is missing here, an error will be reported and the makefile cannot be generated

Install the gcc library through the apt tool

sudo apt install gcc

If you encounter the "unable to obtain lock /var/lib/dpkg/lock-frontend" error here:

insert image description here

You can kill the process holding the lock; replace the process number with the one shown in your own error message.

sudo kill 22035

If there are errors such as download failures:

insert image description here

Run the following command and check the network connection

sudo apt-get update

After successfully creating the Makefile

insert image description here

Once the Makefile exists, continue with the following at the command line:

sudo make
sudo make install

Detect the installation and the Python path

$ python3.6 --version
Python 3.6.8

The installation succeeded, but with multiple Python versions on the machine the new one is not yet the default

Use the following command

sudo ln -s -f /usr/local/bin/python3.6/bin/python3.6 /usr/bin/python3.6
sudo ln -s -f /usr/local/bin/python3.6/bin/pip3.6 /usr/bin/pip3.6

This makes the Python installed above the version invoked by the python3.6 and pip3.6 commands

insert image description here

6.1.2 pip

Get the pip-20.2.3.tar.gz file under the /opt path

Download link: Links for pip (tsinghua.edu.cn)

Log in as the hadoop user

su hadoop

Go to the directory with the compressed package

cd /opt

Decompress it

sudo tar -zxvf pip-20.2.3.tar.gz

Enter the directory of the pip file

cd /opt/pip-20.2.3

Install (if the Python version is not specified explicitly, a permission error may be reported here)

sudo python setup.py build
sudo python setup.py install
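Afterwards you can check which interpreter the new pip is bound to (assuming the python3.6 link created in 6.1.1):

python3.6 -m pip --version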

6.1.3 numpy

After installing the pip tool

pip install numpy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

The numpy library has been installed successfully:

insert image description here
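A quick way to confirm that numpy can be imported (interpreter name as set up earlier):

python3.6 -c "import numpy; print(numpy.__version__)"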

6.2 Default Python

If there is no special requirement, you can directly use the built-in python version

For example, Python 3.10 comes preinstalled with Ubuntu 22.04.2

Complete the installation and configuration of the Python environment with the following simple commands

python3 --version
sudo apt install python3-pip
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1
python --version
pip -V
sudo pip install numpy
  • This method is the simplest, but the Python version may be too new and cause some incompatibilities

6.3 Via apt

Update the apt tool

sudo apt update

Install the package needed for adding PPA sources

sudo apt install software-properties-common

Install Python 3.7

sudo apt install python3.7

After the installation completes, register the versions with update-alternatives and give the one you need the higher priority

sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.7 2

A version selection menu appears, and the Python version can be changed at any time with:

sudo update-alternatives --config python

At this point, entering python on the command line no longer reports that the command points to an unknown target

insert image description here

Use the pip tool that comes with Python3.7 to install third-party libraries such as numpy

sudo python3.7 -m pip install numpy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

insert image description here

