Hadoop cluster deployment
- Preface
- 1. Virtual environment installation and configuration
- 2. Network configuration in the virtual machine
- 3. Hadoop pseudo-distributed environment installation and configuration
- Task 1: Download and install Java JDK-8u181 version
- Task 2: Java JDK-8u181 version environment variable configuration
- Task 3: Download and install Hadoop-2.10.0
- Task 4: Hadoop-2.10.0 version environment variable configuration
- Task 5: Hadoop-2.10.0 version core configuration file configuration
- Task 6: Format the DFS distributed file system
- Task 7: Start the hadoop-2.10.0 service
- Task 8: Hadoop HDFS file system operation
- 4. Use Ambari to install and deploy a Hadoop cluster
- Summary
Preface
Experimental background: a data analysis platform for campus community websites.
In this project, we will start with the installation and configuration of the Linux operating system in a virtual environment and gradually learn the cluster deployment of the big data analysis platform.
1. Virtual environment installation and configuration
(1) Install Xshell and Xftp. Xshell version: Xshell-6.0.0189p, Xftp version: Xftp-6.0.0185p.
For the installation process of this software, please see the blog: Install Xshell and Xftp
(2) Install the virtual machine and the CentOS operating system. VMware version: VMware 15.5.0; CD image file: CentOS-7-x86_64-DVD-1611.
For the installation process of this software, please see the blog: install the virtual machine and centos operating system
(3) Prepare the two archives jdk-8u181-linux-x64.tar.gz and hadoop-2.10.0.tar.gz.
2. Network configuration in the virtual machine
Step 1: View the local network configuration
Record the local machine's: (1) MAC address (2) IP address (3) subnet mask (4) default gateway.
Press Win+R to open the Run window, enter cmd, and then
run ipconfig /all to list all network adapters and find the one that is currently connected.
Step 2: Set up the virtual machine network environment
Here is my configuration:
(1) Turn off the firewall
[root@localhost lixu]# systemctl stop firewalld      # stop the firewalld firewall
[root@localhost lixu]# systemctl disable firewalld   # disable firewalld at boot
[root@localhost lixu]# systemctl status firewalld    # check that firewalld is stopped
(2) Edit the selinux file and change SELINUX=enforcing to SELINUX=disabled
vi /etc/sysconfig/selinux
(3) Configure and view the network card file
BOOTPROTO="static"      # change DHCP to static
IPADDR=192.168.43.79    # set according to your current LAN
NETMASK=255.255.255.0   # set according to your current LAN
DNS=192.168.43.1        # set according to your current LAN
GATEWAY=192.168.43.1    # set according to your current LAN
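Put together, a minimal static configuration for this setup might look like the sketch below. The interface name ens33 is an assumption (check the actual file name under /etc/sysconfig/network-scripts/), and note that ifcfg files use the key DNS1 for the first name server:

```shell
# /etc/sysconfig/network-scripts/ifcfg-ens33  (interface name is an assumption)
TYPE="Ethernet"
BOOTPROTO="static"       # static instead of DHCP
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"             # bring the interface up at boot
IPADDR=192.168.43.79     # match your current LAN
NETMASK=255.255.255.0
GATEWAY=192.168.43.1
DNS1=192.168.43.1        # ifcfg files use DNS1/DNS2 for name servers
```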
(4) Set the host name
hostnamectl set-hostname bp01
hostname
(5) Set the host name and IP address mapping
vi /etc/hosts
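For example, appending a line like the following to /etc/hosts maps the hostname set above to the static IP (adjust the address to your own network):

```
192.168.43.79   bp01
```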
(6) Restart the network service
service network restart
(7) Xshell to connect to the virtual machine:
a: Log in to the 192.168.43.79 host from Xshell
b: create the /opt/tools directory
cd /opt
mkdir tools
c: Create /opt/hadoop directory
cd /opt
mkdir hadoop
3. Hadoop pseudo-distributed environment installation and configuration
Task 1: Download and install Java JDK-8u181 version
1. Download the Java JDK-8u181 version from
Java JDK-8u181 . You can choose this version to install, or choose another one. Put the Java JDK-8u181 installation package in the /opt/tools directory.
Task 2: Java JDK-8u181 version environment variable configuration
1. Create the /opt/hadoop/java directory
su root
cd /opt/hadoop
mkdir java
2. Copy the installation media
cp /opt/tools/jdk-8u181-linux-x64.tar.gz /opt/hadoop/java/
3. File decompression
tar -xvf /opt/hadoop/java/jdk-8u181-linux-x64.tar.gz -C /opt/hadoop/java/
4. Configure Java environment variables
su root
vi /etc/profile
Add the following two lines to the profile file:
export JAVA_HOME=/opt/hadoop/java/jdk1.8.0_181   # set according to your own environment
export PATH=$PATH:$JAVA_HOME/bin                 # this line is always written like this
Then run source /etc/profile to apply the changes.
5. Verify the JAVA environment
su root
java -version
Task 3: Download and install Hadoop-2.10.0
Hadoop-2.10.0 download address
1. Unzip the Hadoop-2.10.0 version
su root
cd /opt/tools/
cp hadoop-2.10.0.tar.gz /opt/hadoop/
cd /opt/hadoop/
tar -xvf hadoop-2.10.0.tar.gz
Task 4: Hadoop-2.10.0 version environment variable configuration
1. Configure Hadoop environment variables
vi /etc/profile
Enter the following two lines in the profile file:
HADOOP_HOME=/opt/hadoop/hadoop-2.10.0   # set according to your actual environment
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Then apply the changes:
source /etc/profile
Next, configure Hadoop's core configuration files:
hadoop-env.sh
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
Task 5: Hadoop-2.10.0 version core configuration file configuration
(1) hadoop-env.sh
Description: this file configures Hadoop's runtime environment. Hadoop requires the JDK to run, so change the value of export JAVA_HOME to the path of the JDK we installed.
cd /opt/hadoop/hadoop-2.10.0/etc/hadoop
vi hadoop-env.sh
Enter the following in the hadoop-env.sh file:
export JAVA_HOME=/opt/hadoop/java/jdk1.8.0_181
(2) core-site.xml [Hadoop core configuration file]
cd /opt/hadoop/hadoop-2.10.0/etc/hadoop
vi core-site.xml
Enter the following in the core-site.xml file:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://bp01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/hadoop-2.10.0/tmp</value>
</property>
</configuration>
(3) hdfs-site.xml [HDFS core configuration file]
cd /opt/hadoop/hadoop-2.10.0/etc/hadoop
vi hdfs-site.xml
Enter the following in the hdfs-site.xml file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
(4)mapred-site.xml
cd /opt/hadoop/hadoop-2.10.0/etc/hadoop
vi mapred-site.xml
Enter the following in the mapred-site.xml file:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
(5) yarn-site.xml [Yarn framework configuration file]
cd /opt/hadoop/hadoop-2.10.0/etc/hadoop
vi yarn-site.xml
Enter the following in the yarn-site.xml file:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>bp01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
(6) Configure SSH password-free login.
1) Enter the .ssh directory under the hadoop user's home directory; if there is no .ssh directory, create it first.
2) Run ssh-keygen to generate a key pair for logging in to this machine with the local key.
3) Run cp id_rsa.pub authorized_keys to add the public key to the machine's own trusted list.
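The three steps above can be sketched as follows; the empty passphrase (-P '') is an assumption, and the chmod lines reflect the permissions sshd normally requires:

```shell
mkdir -p ~/.ssh && chmod 700 ~/.ssh          # create the .ssh directory if it is missing
cd ~/.ssh
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa     # generate a key pair with an empty passphrase
cp id_rsa.pub authorized_keys                # add our own public key to the trusted list
chmod 600 authorized_keys                    # sshd rejects world-readable key files
ssh localhost                                # should now log in without a password prompt
```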
Task 6: Format the DFS distributed file system
hdfs namenode -format
If you see the words "successfully formatted" in the format log, the formatting succeeded; otherwise it failed.
Task 7: Start the hadoop-2.10.0 service
Start DFS and resourcemanager
cd /opt/hadoop/hadoop-2.10.0/sbin
vim start-dfs.sh
vim start-yarn.sh
At the top of start-dfs.sh, add:
At the top of start-yarn.sh, add:
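With the scripts prepared, the services can be started and checked roughly like this (a sketch; the jps process names listed in the comments are the standard ones for Hadoop 2.x pseudo-distributed mode):

```shell
cd /opt/hadoop/hadoop-2.10.0/sbin
./start-dfs.sh     # starts NameNode, DataNode and SecondaryNameNode
./start-yarn.sh    # starts ResourceManager and NodeManager
jps                # list running Java processes to verify the daemons are up
```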
Note: after restarting, these two configurations gave an error and entering the IP in the browser could not reach the web page, so in core-site.xml (the Hadoop core configuration file) bp01 was changed to 192.168.43.128; the network changed during the experiment, which is why the address here is 192.168.43.128. Operation: first stop the two sets of processes, then restart after the modification is complete.
MapReduce management interface:
http://192.168.43.128:8088
Hadoop management interface:
http://192.168.43.128:50070
Task 8: Hadoop HDFS file system operation
Reference document address:
Hadoop HDFS file system shell commands: the file system (FS) shell includes various shell-like commands that interact directly with the Hadoop Distributed File System (HDFS) and with the other file systems Hadoop supports, such as the local FS, WebHDFS, S3 FS, and others.
View the file system help file
hadoop fs -help
1. View the remaining space of the file system
Syntax: hadoop fs -df [-h] URI [URI ...]
The -h option formats file sizes in a "human-readable" way (for example, 64.0m instead of 67108864).
View the remaining space of the entire file system
hadoop fs -df -h /
2. Create a file directory
Syntax: hadoop fs -mkdir [-p] <paths>
The -p option behaves like Unix mkdir -p, creating parent directories along the path.
Note: <paths> is the directory path to create.
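As an example, the FOC directory used later in this walkthrough can be created in one step:

```shell
hadoop fs -mkdir -p /1824113/FOC   # -p also creates the parent directory /1824113
hadoop fs -ls /1824113             # verify that FOC was created
```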
3. Upload the aviation FOC data file
Syntax: hadoop fs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> ... ] <dst>
-p: preserve access and modification times, ownership, and permissions (assuming the permissions can be propagated across file systems).
-f: overwrite the destination if it already exists.
-l: allow the DataNode to lazily persist the file to disk, and force a replication factor of 1. This flag reduces durability; use with caution.
-d: skip creation of the temporary file with the suffix ._COPYING_.
Create the /1824113/FOC subdirectory
Upload the T2020.csv file to the /1824113/FOC directory
vi T2020.csv
hadoop fs -put T2020.csv /1824113/FOC
4. Find aviation FOC data file
Syntax: hadoop fs -find <path> ... <expression> ...
hadoop fs -find / -name T2020.csv -print
5. Download aviation FOC data file
Syntax: hadoop fs -get [-ignorecrc] [-crc] [-p] [-f] <src> <localdst>
hadoop fs -get /T00/FOC/T2020.csv T2020.dat
Download T2020.csv locally and name it T2020.dat
6. View the access permissions of aviation FOC data files
Syntax: hadoop fs -getfacl [-R] <path>
hadoop fs -getfacl -R /
View the permissions of all files and directories in the root directory of the file system
7. View the size of aviation FOC data files
Syntax: hadoop fs -du [-s] [-h] [-v] [-x] URI [URI …]
The -s option displays an aggregate summary of file lengths rather than the individual files. Without the -s option, the calculation is done by going one level deep from the given path.
The -h option formats file sizes in a "human-readable" way (for example, 64.0m instead of 67108864).
The -v option displays the column names as a header line.
The -x option excludes snapshots from the result calculation. Without the -x option (the default), the result is always calculated from all INodes, including all snapshots under the given path.
In this example the file is 27 bytes.
8. Aviation FOC data file copy
Syntax: hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>
The -f option will overwrite the destination if it already exists.
The -p option preserves the file attributes [topax] (timestamps, ownership, permission, ACL, XAttr). If -p is specified with no argument, it preserves timestamps, ownership, and permission. If -pa is specified, it also preserves permission, because ACL is a super-set of permission. Whether raw namespace extended attributes are preserved is determined independently of the -p flag.
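A quick sketch with the paths from this walkthrough (the backup file name is hypothetical):

```shell
hadoop fs -cp -f /1824113/FOC/T2020.csv /1824113/FOC/T2020-backup.csv   # -f overwrites an existing backup
```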
9. Verify whether the FOC data file has been changed.
Syntax: hadoop fs -checksum URI
Returns the checksum information of the file.
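For example, running the command on the uploaded file before and after a transfer and comparing the two outputs shows whether the file was changed:

```shell
hadoop fs -checksum /1824113/FOC/T2020.csv   # prints the checksum algorithm and bytes for the file
```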
10. FOC data file addition
Syntax: hadoop fs -appendToFile <localsrc> ... <dst>
Append local data files to the end of a data file in the HDFS file system; several local files can be appended to the same HDFS file at once.
hadoop fs -appendToFile T2001.dat /T00/FOC/T2001.dat
hadoop fs -du -s -h /T00/FOC/T2001.dat
11. Merge and download FOC data files
Syntax: hadoop fs -getmerge [-nl] <src> <localdst>
Takes a source directory and a destination file as input and concatenates the files in src into the destination local file. Optionally, -nl can be set to add a newline character (LF) at the end of each file, and -skip-empty-file can be used to avoid unwanted newline characters in the case of empty files.
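For example, all files under the FOC directory can be merged into a single local file (the local file name is hypothetical):

```shell
hadoop fs -getmerge -nl /1824113/FOC merged-FOC.csv   # -nl inserts an LF after each source file
```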
12. FOC data file movement
Syntax: hadoop fs -mv URI [URI ...] <dest>
Move files from source to destination. This command allows multiple sources, in this case, the destination needs to be a directory. Moving files across file systems is not allowed.
hadoop fs -mv /T00/FOC/T2001.csv /T00/FOC/T2001-20180716.dat
MapReduce test
cd /opt/hadoop/hadoop-2.10.0/share/hadoop/mapreduce
Run the wordcount example against the file uploaded to HDFS with the following command:
hadoop jar hadoop-mapreduce-examples-2.10.0.jar wordcount /1824113/FOC/T2020.dat /out/1.csv
View Results:
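The wordcount job writes a directory rather than a single file, so the results can be listed and printed like this (part-r-00000 is the conventional name of the first reducer's output; confirm it with -ls first):

```shell
hadoop fs -ls /out/1.csv                 # the output path is a directory containing part files
hadoop fs -cat /out/1.csv/part-r-00000   # print the word counts
```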
4. Use Ambari to install and deploy a Hadoop cluster
1. Install the Ambari service
2. Use Ambari to install and configure the Hadoop cluster
This part is still being written...
Summary
1. Virtual environment installation and configuration
2. Network configuration in virtual machine
3. Hadoop pseudo-distributed environment installation and configuration
4. Use Ambari to install and deploy a Hadoop cluster