Targeting the Data Warehouse: Hadoop, Part One

1. Review of the previous lesson

2. Getting acquainted with Hadoop

3. Homework for this lesson

1. Review of the previous lesson

  • https://blog.csdn.net/SparkOnYarn/article/details/104904205

2. Getting acquainted with Hadoop

Official websites: hadoop.apache.org, spark.apache.org, kafka.apache.org

Broad sense: the ecosystem built around Apache Hadoop (Hive, Sqoop, Flume, Flink, HBase, ...)
Narrow sense: the Apache Hadoop software itself

Apache Hadoop releases:
1.x: essentially no longer used
2.x: the current mainstream in the market, corresponding to CDH 5.x
3.x: some companies are trying it out; the corresponding CDH release is CDH 6.x

  • Download URL for the CDH builds of Hadoop. The components in this course mainly use the official CDH packages:
    http://archive.cloudera.com/cdh5/cdh/5/
    http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2.tar.gz

  • The file name shows that the Hadoop version is 2.6.0 and the CDH release is 5.16.2; functionally it is roughly comparable to Apache Hadoop 2.9:
    hadoop-2.6.0-cdh5.16.2.tar.gz
    apache hadoop 2.6.0 + later backported patches ≈ apache hadoop 2.9

  • Each CDH release repackages Hadoop as shown below. For example, if a component has a bug, we upgrade from CDH 5.14 to CDH 5.16 and read changes.log to see what was fixed:
    CDH 5.14.0  hadoop-2.6.0
    CDH 5.16.2  hadoop-2.6.0

  • Apache Hadoop 2.9 and 3.x have already been released. Apache Hadoop is open source under the Apache Foundation, and a large share of its bug fixes are submitted and driven by engineers from Cloudera.

  • The benefit of using the CDH build of Hadoop: version compatibility does not need to be considered. For example, when HBase is installed later, it should be taken from the same CDH 5.16.2 branch as Hadoop. A download sketch follows this list.
    http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2-changes.log
    http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.16.2-changes.log
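
For example, a download sketch (assuming wget is installed and the cloudera archive links above are still reachable; the HBase tarball name is inferred from the changes.log link above):

cd ~/software
# Hadoop and HBase taken from the same CDH 5.16.2 branch, so they are known to work together
wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2.tar.gz
wget http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.16.2.tar.gz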

2.1 Hadoop software

HDFS: storage
MapReduce: computation; jobs mine valuable data out of the raw data, and in practice this is done with Hive SQL, Spark, or Flink
YARN: resource (memory, vcore) scheduling plus job scheduling

Big data means massive amounts of data. A single machine cannot store it all, and a single machine is also a single point for computation. With, say, 1,000 machines, HDFS handles distributed storage, MapReduce handles distributed computation, and YARN schedules resources (CPU, memory) and jobs across them.
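
To make the division of labor concrete, here is a minimal sketch of submitting the wordcount example jar that ships with the distribution, once HDFS and YARN are both running (YARN deployment is covered in a later lesson; the jar path below is an assumption for this version):

# storage: put an input file into HDFS
hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put ruoze.log /wordcount/input/

# computation + scheduling: YARN allocates the containers, MapReduce does the counting
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
  wordcount /wordcount/input /wordcount/output

# read the result back from HDFS
hdfs dfs -cat /wordcount/output/part-r-00000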

Why is MapReduce rarely used directly in industry?
  • Development is difficult, the amount of code is large, maintenance is hard, and computation is slow, so hardly anyone writes raw MR any more.
  • Course version: hadoop-2.6.0-cdh5.16.2

2.2 Hadoop deployment

1. Create the user and unpack the software

1. Create the hadoop user:
useradd hadoop

2. Create the working directories: mkdir app data lib log software sourcecode tmp
[hadoop@hadoop ~]$ ll
total 28
drwxrwxr-x 3 hadoop hadoop 4096 Mar 20 16:21 app		folders extracted from tarballs; prefer symlinks here
drwxrwxr-x 2 hadoop hadoop 4096 Mar  8 17:49 data		data directory
drwxrwxr-x 2 hadoop hadoop 4096 Mar 20 16:23 lib		third-party jars
drwxrwxr-x 2 hadoop hadoop 4096 Mar 20 16:23 log		log directory
drwxrwxr-x 2 hadoop hadoop 4096 Mar  8 20:27 software		tarballs
drwxrwxr-x 2 hadoop hadoop 4096 Mar 20 16:23 sourcecode		source code for compilation
drwxrwxr-x 2 hadoop hadoop 4096 Mar 20 16:23 tmp			temporary files

// Linux already ships with a /tmp directory, so why create our own tmp? Because the system /tmp is purged periodically (every 30 days).

3. Extract the tarball and create a symlink:
[hadoop@hadoop ~]$ tar -xzvf hadoop-2.6.0-cdh5.16.2.tar.gz -C /home/hadoop/app/
[hadoop@hadoop app]$ ln -s hadoop-2.6.0-cdh5.16.2 hadoop
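
A quick optional check that the symlink points at the versioned directory:

[hadoop@hadoop app]$ ls -l
# expect a line like: hadoop -> hadoop-2.6.0-cdh5.16.2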

Prerequisites for installing the software: a Java environment and passwordless SSH.

2. Install the Java JDK environment:

1. Create the directory: mkdir /usr/java

2. Upload the JDK with rz, extract it into this directory, then configure the environment variables in /etc/profile as shown below:
#env
export JAVA_HOME=/usr/java/jdk1.8.0_45
#export JAVA_HOME=/usr/java/jdk1.7.0_45
export PATH=$JAVA_HOME/bin:$PATH
"/etc/profile" 82L, 1900C written

3. Make the environment variables take effect:
[root@hadoop java]# source /etc/profile
[root@hadoop java]# which java
/usr/java/jdk1.8.0_45/bin/java
[root@hadoop java]# echo $JAVA_HOME
/usr/java/jdk1.8.0_45
[root@hadoop java]# java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

3. Edit hadoop-env.sh

	[hadoop@hadoop hadoop]$ vi hadoop-env.sh 
	# The java implementation to use.
	export JAVA_HOME=/usr/java/jdk1.8.0_45

4. Understand several modes of hadoop deployment:

  • Local (Standalone) Mode: runs locally, not used
  • Pseudo-Distributed Mode: a single node; enough for learning and testing
  • Fully-Distributed Mode: the cluster mode used in production environments

5. Modify core-site.xml

// vi /etc/hosts: the IP-to-hostname mapping must be configured first:
106.54.226.205 hadoop

1. Edit etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop:9000</value>
    </property>
</configuration>

2. Edit etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
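
One thing to be aware of, though it is not configured in this lesson: by default the NameNode metadata ends up under /tmp/hadoop-hadoop (see the format step below), and as noted earlier the system /tmp is purged periodically. A common sketch, assuming you want to reuse the ~/tmp directory created above, is to set hadoop.tmp.dir in core-site.xml:

    <!-- optional sketch, not part of this lesson's configuration -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/tmp</value>
    </property>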

6. Configure passwordless SSH:

[root@hadoop ~]# pwd
/root
[root@hadoop ~]# rm -rf .ssh
[root@hadoop ~]# ssh-keygen
// press Enter three times
[root@hadoop ~]# cd .ssh
[root@hadoop .ssh]# ll
total 8
-rw------- 1 root root 1671 Mar 20 17:26 id_rsa
-rw-r--r-- 1 root root  393 Mar 20 17:26 id_rsa.pub

// Append the public key to the trusted file authorized_keys:
[root@hadoop .ssh]# cat id_rsa.pub >> authorized_keys

// Change the permissions to 600
[root@hadoop .ssh]# chmod 600 authorized_keys 

// Test: works as expected:
[hadoop@hadoop001 .ssh]$ ssh hadoop001 date
Tue Mar 24 15:19:49 CST 2020

Encountered a problem:

Connection timed out:

[hadoop@hadoop001 ~]$ ssh hadoop001 date

ssh: connect to host hadoop001 port 22: Connection timed out
The reason (the intranet IP + hostname mapping in /etc/hosts was wrong):
[hadoop@hadoop001 ~]$ ssh hadoop001 date
The authenticity of host 'hadoop001 (172.17.0.5)' can't be established.
RSA key fingerprint is 33:6f:23:f9:ff:10:39:2d:cd:42:72:66:c8:7d:5a:6a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hadoop001,172.17.0.5' (RSA) to the list of known hosts.
Tue Mar 24 15:25:19 CST 2020

Pitfall: when root sets up SSH, the 600 permissions do not need to be changed; when any other user does, authorized_keys must be changed to 600.
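
A minimal sketch of what that means for a non-root user such as hadoop (standard OpenSSH requirements, not specific to this course):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# sshd ignores authorized_keys if the file or the .ssh directory is too permissive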

7. Configure the environment variables, make them take effect, and verify with which hadoop:

#env
export HADOOP_HOME=/home/hadoop/app/hadoop
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH

source ~/.bashrc
[hadoop@hadoop001 app]$ source ~/.bashrc
[hadoop@hadoop001 app]$ which hadoop
~/app/hadoop/bin/hadoop
[hadoop@hadoop001 app]$ which hdfs
~/app/hadoop/bin/hdfs

8. Format the NameNode:

  • hdfs namenode -format. The following line indicates success: 20/03/24 15:42:16 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.

9. Start HDFS (the first start will ask us to enter yes):

[hadoop@hadoop001 sbin]$ which start-dfs.sh
~/app/hadoop/sbin/start-dfs.sh
[hadoop@hadoop001 sbin]$ start-dfs.sh
20/03/24 15:50:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/24 15:50:57 WARN hdfs.DFSUtil: Namenode for null remains unresolved for ID null.  Check your hdfs-site.xml file to ensure namenodes are configured properly.
Starting namenodes on [hadoop]
hadoop: ssh: Could not resolve hostname hadoop: Name or service not known
localhost: starting datanode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-datanode-hadoop001.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is 33:6f:23:f9:ff:10:39:2d:cd:42:72:66:c8:7d:5a:6a.
Are you sure you want to continue connecting (yes/no)? yes

Blog references:

  • http://blog.itpub.net/30089851/viewspace-2127102/ : startup failure (fix the user and user group).
    If ssh still asks for a password after the first authentication, check the system log.

  • http://blog.itpub.net/30089851/viewspace-1992210/ : SSH between multiple machines; the pitfalls will be covered in an advanced lesson.

2.3 Key file analysis of the Hadoop deployment

1. Note:
hadoop001 starts the namenode, 0.0.0.0 starts the secondarynamenode, and localhost starts the datanode.
Why do we need to enter yes the first time? Check ~/.ssh/known_hosts.

[hadoop@hadoop001 hadoop]$ start-dfs.sh
20/03/24 15:53:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop001]
hadoop001: starting namenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-namenode-hadoop001.out
localhost: starting datanode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-datanode-hadoop001.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-secondarynamenode-hadoop001.out
20/03/24 15:53:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

View the trust-file records: after you enter yes the first time, a record is written into known_hosts;

[hadoop@hadoop001 .ssh]$ cat known_hosts 
localhost ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA0RbK9Erb/qkvjdPvION6G35oFhaS8ZfA13dyILU069Hzd36i2tgUZa9D4IPbvZNk6DvDFW2zH8jW/j1RS99ZZ4+9yl/CrcrJRP9GjjBS2W1rQCUjolRdLf9PzAZs/AbFvpjxwMbd7vSb5AOeqQ0pTC4BFlvX6IJLGdZmhUYTqNCWj0e40l409o/Hidy0oEXByDaJWmRvRuI6jc5V1v9/FZNze96W/oJC6FLR7MrVgSJA2MmZvzvS3zbCgKU/umgD4ENy+JRBiifHwBWVTkIADHhVq7Ob14eFlawVnEY6tQkdgrwSc2LWBgFQXAYbYmeFLOrEJQKi3e/h4fMLaoeubQ==
hadoop001,172.17.0.5 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA0RbK9Erb/qkvjdPvION6G35oFhaS8ZfA13dyILU069Hzd36i2tgUZa9D4IPbvZNk6DvDFW2zH8jW/j1RS99ZZ4+9yl/CrcrJRP9GjjBS2W1rQCUjolRdLf9PzAZs/AbFvpjxwMbd7vSb5AOeqQ0pTC4BFlvX6IJLGdZmhUYTqNCWj0e40l409o/Hidy0oEXByDaJWmRvRuI6jc5V1v9/FZNze96W/oJC6FLR7MrVgSJA2MmZvzvS3zbCgKU/umgD4ENy+JRBiifHwBWVTkIADHhVq7Ob14eFlawVnEY6tQkdgrwSc2LWBgFQXAYbYmeFLOrEJQKi3e/h4fMLaoeubQ==
0.0.0.0 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA0RbK9Erb/qkvjdPvION6G35oFhaS8ZfA13dyILU069Hzd36i2tgUZa9D4IPbvZNk6DvDFW2zH8jW/j1RS99ZZ4+9yl/CrcrJRP9GjjBS2W1rQCUjolRdLf9PzAZs/AbFvpjxwMbd7vSb5AOeqQ0pTC4BFlvX6IJLGdZmhUYTqNCWj0e40l409o/Hidy0oEXByDaJWmRvRuI6jc5V1v9/FZNze96W/oJC6FLR7MrVgSJA2MmZvzvS3zbCgKU/umgD4ENy+JRBiifHwBWVTkIADHhVq7Ob14eFlawVnEY6tQkdgrwSc2LWBgFQXAYbYmeFLOrEJQKi3e/h4fMLaoeubQ==

Another pitfall: if the public key has changed, an error about the old record will appear when you run the command; find that record in known_hosts and delete it. Whenever you hit an SSH trust error at work, your first instinct should be to check this file.

Test: modify the localhost record in known_hosts:

1. If you modify the record and then try to restart, the start fails:

localhost: key_from_blob: can't read key type
localhost: key_read: key_from_blob EEEEB3NzaC1yc2EAAAABIwAAAQEA0RbK9Erb/qkvjdPvION6G35oFhaS8ZfA13dyILU069Hzd36i2tgUZa9D4IPbvZNk6DvDFW2zH8jW/j1RS99ZZ4+9yl/CrcrJRP9GjjBS2W1rQCUjolRdLf9PzAZs/AbFvpjxwMbd7vSb5AOeqQ0pTC4BFlvX6IJLGdZmhUYTqNCWj0e40l409o/Hidy0oEXByDaJWmRvRuI6jc5V1v9/FZNze96W/oJC6FLR7MrVgSJA2MmZvzvS3zbCgKU/umgD4ENy+JRBiifHwBWVTkIADHhVq7Ob14eFlawVnEY6tQkdgrwSc2LWBgFQXAYbYmeFLOrEJQKi3e/h4fMLaoeubQ==
localhost:  failed

2. Solution: open ~/.ssh/known_hosts, find the line corresponding to localhost, and delete it (e.g. with dd in vi).
Restart start-dfs.sh, enter yes again for authentication, and a new record is added to known_hosts.
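
As a convenience, instead of deleting the line by hand, the standard OpenSSH tool can remove a host's record for you; a sketch:

# removes the stale entry for the given host from ~/.ssh/known_hosts
ssh-keygen -R localhost
ssh-keygen -R hadoop001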

2.4 Starting all three processes with hadoop001

Process             Corresponding configuration file
namenode            core-site.xml
datanode            slaves
secondarynamenode   hdfs-site.xml

1. cd into the configuration directory etc/hadoop (under $HADOOP_HOME)

1. Edit slaves: delete the existing content and write hadoop001 into it, as verified below
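
After the edit, the file should contain only that one hostname; a quick check:

[hadoop@hadoop001 hadoop]$ cat slaves
hadoop001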

2. Edit hdfs-site.xml, inserting these two properties:
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop001:50090</value>
    </property>

    <property>
        <name>dfs.namenode.secondary.https-address</name>
        <value>hadoop001:50091</value>
    </property>

[hadoop@hadoop001 .ssh]$ jps
11488 NameNode
13634 Jps
12308 DataNode
11805 SecondaryNameNode
[hadoop@hadoop001 .ssh]$ netstat -nlp|grep 11805
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:50090               0.0.0.0:*                   LISTEN      11805/java        

The result is as follows: the three Hadoop processes all start under the machine name hadoop001:

[hadoop@hadoop001 hadoop]$ start-dfs.sh
20/03/24 16:48:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop001]
hadoop001: starting namenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-namenode-hadoop001.out
hadoop001: starting datanode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-datanode-hadoop001.out
Starting secondary namenodes [hadoop001]
hadoop001: starting secondarynamenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-secondarynamenode-hadoop001.out
20/03/24 16:48:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2.5 Summary of the Hadoop installation process

1. Create the user and its directories; switch to it with su - hadoop

2. Upload the tarball, extract it, and create a symlink

3. Environment requirements: Java 1.8 and SSH must be installed

4. Explicitly configure JAVA_HOME (in hadoop-env.sh)

5. Configure core-site.xml and hdfs-site.xml

6. SSH passwordless trust relationship

7. Configure the Hadoop environment variables

8. Format the namenode (hdfs namenode -format)

9. start-dfs.sh; on the first start, enter yes to confirm

10. Know where in the official documentation to find the parameters that control where the datanode and secondarynamenode start

11. The namenode is the name node (the boss); read and write requests go through it first. The datanode is the data node (the worker); it stores and retrieves data. The secondarynamenode is the second name node; it periodically (roughly every hour) checkpoints the namenode's metadata as a backup.

  • Because of the single point of failure and the one-hour checkpoint interval, a high-availability (HA) configuration will be introduced later.

Note:

  • Big data components are basically master-slave architectures, e.g. HDFS and HBase (in HBase, read and write requests do not go through the boss, i.e. the master process).

2.6 The Hadoop web UI

  • To access by hostname: configure the internal IP + hostname in /etc/hosts on the server, and configure the public IP + hostname in C:\Windows\System32\drivers\etc\hosts on the local Windows machine.
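
A sketch of the two entries, using the IPs that appear earlier in these notes (substitute your own addresses):

# on the Linux server, /etc/hosts (internal IP + hostname)
172.17.0.5      hadoop001

# on the Windows client, C:\Windows\System32\drivers\etc\hosts (public IP + hostname)
106.54.226.205  hadoop001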

1. "Safemode is off" means safe mode is not active.

2. Interpreting the disk space on the overview page: the configured capacity of the cluster is 49.21 GB, DFS used is 12 KB, non-DFS used is 13.39 GB, and DFS remaining is 33.33 GB. As df -h shows below, about 34 GB is free on the disk, and by default that free space is what DFS can use.

[hadoop@hadoop001 hadoop]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        50G   14G   34G  29% /
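
The same figures can also be read from the command line with the standard HDFS admin report (output abbreviated here):

[hadoop@hadoop001 hadoop]$ hdfs dfsadmin -report
# prints Configured Capacity, DFS Used, Non DFS Used and DFS Remaining
# for the whole cluster and for each datanode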
3. The HDFS file browser interface (screenshot).

2.7 Commands on HDFS

1. hdfs --help shows the command help.

Examples:

1. [hadoop@hadoop001 data]$ hdfs dfs -mkdir /ruozedata

2. [hadoop@hadoop001 data]$ hdfs dfs -put ruoze.log /ruozedata/

3. Download the file to a local directory:
[hadoop@hadoop001 data]$ hdfs dfs -get /ruozedata/ruoze.log ./
20/03/24 17:37:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 data]$ ll
total 4
-rw-r--r-- 1 hadoop hadoop 10 Mar 24 17:37 ruoze.log

4. Delete the file from HDFS:
[hadoop@hadoop001 data]$ hdfs dfs -rm -r /ruozedata/ruoze.log
20/03/24 17:38:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Deleted /ruozedata/ruoze.log
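
For the homework summary below, a few more commonly used hdfs dfs subcommands (all standard; paths are placeholders):

hdfs dfs -ls /                    # list a directory
hdfs dfs -cat /path/to/file       # print a file's contents
hdfs dfs -du -h /                 # space used per directory, human readable
hdfs dfs -cp /src /dst            # copy within HDFS
hdfs dfs -mv /src /dst            # move or rename within HDFS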

3. Homework for this lesson

1. Summarize the basic HDFS commands

2. Complete a pseudo-distributed deployment of Hadoop HDFS


Origin blog.csdn.net/SparkOnYarn/article/details/104997202