Hadoop-2-installation
0. Software versions
This article is based on Apache Hadoop 2.9.2 (released 19 Nov 2018)
The JDK is jdk1.8.0_212
The Linux distribution is CentOS release 6.8 (Final)
1. Configure Linux
1) Configure static IP
Modify the /etc/sysconfig/network-scripts/ifcfg-eth0 file
### First modify these 2 items
# Activate the NIC at system startup
ONBOOT=yes
# Set the IP to static; do not obtain an IP automatically
BOOTPROTO=static
### Then configure these 3 items
# Manually specify the IP
IPADDR=192.168.xxx.xxx
# Specify the gateway
GATEWAY=192.168.xxx.xxx
# Specify the DNS server; here it is the same as the gateway above
DNS1=192.168.xxx.xxx
After the modification, restart the network service with the service network restart command, and use the ping command to check that the configuration is correct
If the network service fails to restart, reboot the system
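A minimal verification sketch (the gateway address is a placeholder):
# restart the network service, then ping the gateway to confirm connectivity
service network restart
ping -c 4 192.168.xxx.xxx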
If you are using a cloned virtual machine, you also need to fix the MAC address
In the /etc/udev/rules.d/70-persistent-net.rules file, copy the ATTR{address} value from the bottom entry (the one with NAME="eth1") into the HWADDR attribute of the /etc/sysconfig/network-scripts/ifcfg-eth0 file; at the same time change NAME="eth1" to NAME="eth0" and delete the original eth0 entry above it
#/etc/udev/rules.d/70-persistent-net.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="???", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
#/etc/sysconfig/network-scripts/ifcfg-eth0
HWADDR=???
2) Modify the host name
Modify the /etc/sysconfig/network file
HOSTNAME=???
After the modification is completed, restart the system
3) Add hosts entries
Add the static IP and host name configured above to the /etc/hosts file
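For example, an entry like the following (the IP is a placeholder and the host name hadoop101 is made up):
# /etc/hosts
192.168.xxx.xxx hadoop101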
2. Local (independent) mode
1) Install JDK
Download JDK 8 from the Oracle official website and configure the environment variables
export JAVA_HOME=/???/jdk1.8.0_212
export PATH=$JAVA_HOME/bin:$PATH
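If the two lines above are appended to /etc/profile (one common choice), reload the file and verify the installation:
source /etc/profile
java -version    # should report java version "1.8.0_212"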
2) Install Hadoop
Download from the Hadoop official website and unzip
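A sketch of the unpack step, assuming the tarball sits in the current directory and /opt/module is the chosen install location:
tar -zxvf hadoop-2.9.2.tar.gz -C /opt/module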
3) Create a new user
Use the useradd command to create a new user
4) Assign Hadoop directory permissions
Use chown -R username:group hadoop-directory to give the newly created user ownership of the freshly unzipped Hadoop directory
5) Switch to the new user
Use the su username command
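A consolidated sketch of steps 3) to 5), assuming the user and group are both named hadoop and Hadoop was unzipped to /opt/module/hadoop-2.9.2:
useradd hadoop                                    # create the user
passwd hadoop                                     # set its password
chown -R hadoop:hadoop /opt/module/hadoop-2.9.2   # hand over the Hadoop directory
su hadoop                                         # switch to the new user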
6) Test
【1】grep
- First enter the Hadoop directory and create a new directory, for example input
- Copy all the XML files under etc/hadoop in the Hadoop directory to the input directory:
cp -v etc/hadoop/*.xml input
- Execute the Hadoop command
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar grep input output 'dfs[a-z.]+'
This command runs the grep program from the official examples jar; the input directory is input, the output directory is output, and the quoted string is the filter rule (a regular expression)
- Check the Hadoop directory: there is now a directory named output containing two files. One, named part-r-00000, is the output result; the other, named _SUCCESS, has a size of 0 and only indicates that the job succeeded
- Use the cat command to view the contents of the part-r-00000 file; the result shows 1 dfsadmin
Precautions:
- The output directory must not already exist, otherwise the job fails with org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/??? already exists (see the command after this list)
- The host name must be configured in the hosts file, otherwise the job fails with java.net.UnknownHostException: ???: ???: unknown name or service
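If the output directory is left over from a previous run, simply remove it before rerunning:
rm -rf output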
【2】wordcount
- First enter the Hadoop directory and create a new directory, for example wcinput
- Enter the newly created directory and create a file with any name, for example wc.input; type some words into the file, separating the words with spaces or tabs
- Execute the Hadoop command
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount wcinput wcoutput
This command runs the wordcount program from the official examples jar; the input directory is wcinput and the output directory is wcoutput
- Check the Hadoop directory: there is now a directory named wcoutput containing 2 files. One, named part-r-00000, is the output result; the other, named _SUCCESS, has a size of 0 and only indicates that the program ran successfully
- Use the cat command to view the contents of the part-r-00000 file: each line starts with a word, followed by the number of times that word appears
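For illustration, a possible wc.input and the part-r-00000 it produces (the words are made up for this example):
# wc.input
hadoop yarn
hadoop mapreduce
# wcoutput/part-r-00000
hadoop	2
mapreduce	1
yarn	1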
3. Pseudo-distributed mode [Run Hadoop on a single node]
1) Run Hadoop on HDFS
[1] Modify the configuration file
① Modify the default configuration parameters in etc/hadoop/core-site.xml
<configuration>
    <!-- Set the NameNode server IP and port; the default port is 9000 -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://local-host-IP:port</value>
    </property>
    <!-- Set the temporary file directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/specified-path</value>
    </property>
</configuration>
② Modify the default configuration parameters in etc/hadoop/hdfs-site.xml
<configuration>
    <!-- Set the replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
[2] Format the file system and start HDFS
① Execute the format command (formatting is only required before the first startup): bin/hdfs namenode -format
② Start HDFS (the daemons run in the background)
NameNode: sbin/hadoop-daemon.sh start namenode
DataNode: sbin/hadoop-daemon.sh start datanode
After startup, use the jps command to verify that the launch succeeded; the corresponding processes should appear
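A successful start looks roughly like this (the process IDs will differ):
3472 NameNode
3563 DataNode
3820 Jps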
[3] Use the management page to view the NameNode status
The link is http://filesystem-host-IP:50070/
[4] Create a directory on HDFS and upload files
① Create a directory: bin/hdfs dfs -mkdir -p /parent-path/child-path
② Upload a file: bin/hdfs dfs -put /local-source-file /HDFS-target-path
③ List a directory: bin/hdfs dfs -ls /HDFS-path
④ View a file: bin/hdfs dfs -cat /HDFS-target-file
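A minimal end-to-end sketch (the paths and file names are assumptions):
bin/hdfs dfs -mkdir -p /user/hadoop/wcinput
bin/hdfs dfs -put wcinput/wc.input /user/hadoop/wcinput
bin/hdfs dfs -ls /user/hadoop/wcinput
bin/hdfs dfs -cat /user/hadoop/wcinput/wc.input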
[5] Execute the MapReduce program and view the results
① Run the program: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar MapReduce-program /HDFS-input-path /HDFS-output-path
② View the result: bin/hdfs dfs -cat /HDFS-output-path/*
The * stands for all files in the path
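For example, running wordcount against the HDFS paths created above (the paths are assumptions):
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /user/hadoop/wcinput /user/hadoop/wcoutput
bin/hdfs dfs -cat /user/hadoop/wcoutput/*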
[6] Download files
① Use the NameNode management page: Utilities --> Browse the file system, find the corresponding file and click Download. Note that the download uses port 50075, and the browser's client machine also needs the filesystem host IP mapped in its hosts file
② Use the Hadoop command bin/hadoop fs -get /HDFS-target-file /local-target-path
[7] View and delete directories
① View: bin/hadoop fs -ls -R /HDFS-directory
② Delete: bin/hadoop fs -rm -r /HDFS-directory
[8] Stop HDFS
Execute the commands sbin/hadoop-daemon.sh stop namenode
and sbin/hadoop-daemon.sh stop datanode
2) Run Hadoop on YARN
[1] Modify the configuration file
① Copy etc/hadoop/mapred-site.xml.template to etc/hadoop/mapred-site.xml (a one-line command, shown after the configuration below), then modify the default configuration parameters
<configuration>
    <!-- Set the framework name used when running MapReduce jobs -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
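The copy step mentioned above is a single command, executed from the Hadoop directory:
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml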
② Modify the default configuration parameters in etc/hadoop/yarn-site.xml
<configuration>
    <!-- Set the auxiliary service run by the NodeManager -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Set the host address of the YARN ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>ResourceManager-host-IP</value>
    </property>
</configuration>
[2] Start YARN
Note that before starting YARN, you need to make sure that HDFS has been started
① Start the ResourceManager: sbin/yarn-daemon.sh start resourcemanager
② Start the NodeManager: sbin/yarn-daemon.sh start nodemanager
Remember to use jps to check that the corresponding processes have started
[3] Use the management page to view the ResourceManager status
The link is http://ResourceManager-host-IP:8088/
[4] Run a MapReduce job
Use the command bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar MapReduce-program /HDFS-input-path /HDFS-output-path
After the job finishes, you can see logs like the following
19/08/11 19:20:20 INFO mapreduce.Job: Running job: job_1565522114534_0001
19/08/11 19:20:29 INFO mapreduce.Job: Job job_1565522114534_0001 running in uber mode : false
19/08/11 19:20:29 INFO mapreduce.Job: map 0% reduce 0%
19/08/11 19:20:36 INFO mapreduce.Job: map 100% reduce 0%
19/08/11 19:20:42 INFO mapreduce.Job: map 100% reduce 100%
19/08/11 19:20:43 INFO mapreduce.Job: Job job_1565522114534_0001 completed successfully
[5] Stop YARN
① Stop the ResourceManager: sbin/yarn-daemon.sh stop resourcemanager
② Stop the NodeManager: sbin/yarn-daemon.sh stop nodemanager