Introduction to the Big Data Framework Hadoop

1. Preface

Starting today, I am kicking off a new series on big data frameworks. I was curious about big data technologies back in school, but without a practical scenario I never got much hands-on experience with them. After joining the company, I feel more and more that big data knowledge matters: whether you do development or algorithms, you deal with massive amounts of data every day, and to truly understand the data in your hands you need to know its whole life cycle, from reading and storage to computation and application. So I want to use my spare time after work to gradually pick up big data knowledge. This area now has a very mature framework ecosystem and most of the material is already well summarized, so following the Shangda course for this part works well. Since this is spare-time study, my plan is to learn component by component: after finishing each component, write a blog post to summarize and review the notes, then move on to the next one. So, a new learning journey begins.

Today's article contains my notes on getting started with Hadoop. If you want to learn the big data framework stack systematically, Hadoop is a door you cannot avoid. Its popularity may be declining, but it was the starting point for big data storage and processing, so as an introduction it is very helpful for later learning to understand how it works, what components it has, and how it solves the storage and computation of big data. The learning path for Hadoop is actually very clear, and the Shangda summary of it is quite good: first an overview (what Hadoop is, what components it has, and how they cooperate to handle big data problems), then the why, then setting up a big data environment, and finally the details of each component (HDFS, YARN, MapReduce). After that round, Hadoop can be reviewed once more as a whole. This article opens the door: it covers the overview and builds the cluster environment, preparing for the component deep dives later.

OK, let's go!

2. Hadoop overview

2.1 What

2.1.1 Basic concepts and development history

Hadoop is a distributed system infrastructure developed by the Apache Foundation, which mainly solves: the storage of massive data and the analysis and calculation of massive data.

In a broad sense, Hadoop generally refers to the Hadoop ecosystem

To understand a new concept, it is necessary to understand its development history:

  • Doug Cutting, the founder of Hadoop, built and optimized a query engine and index engine on top of the Lucene framework (the same library behind ES), aiming for Google-like full-text search

  • In 2001, Lucene became a sub-project of the Apache Foundation

  • For massive-data scenarios, the Lucene framework ran into the same difficulties Google faced: storing huge volumes of data was hard, and retrieving it was slow

  • From 2003 onward, Google published its three famous big-data papers (the "troika": GFS, Map-Reduce, BigTable), which solved these problems; however, Google released only the papers, not open-source code

  • From 2003 to 2004, the Hadoop founders studied and imitated Google's solutions to these problems. On that basis they implemented a DFS and a MapReduce mechanism and wrapped them around the Lucene framework as the miniature search engine Nutch, which solved the puzzle above. The correspondence:

    • GFS → HDFS
    • Map-Reduce → MapReduce
    • BigTable → HBase

    So, Google's papers are the source of Hadoop's ideas

  • In 2005, Hadoop's code was officially brought into the Apache Foundation as part of Nutch, a sub-project of Lucene

  • In 2006, MapReduce and the Nutch Distributed File System (NDFS) were split out into the separate Hadoop project; Hadoop was officially born, marking the start of the big data era

Three Hadoop distributions: Apache (free, the most basic), Cloudera (paid support, packaged and more complete; product: CDH), Hortonworks (paid support, packaged; product: HDP)

2.1.2 Hadoop composition (interview focus)

(figure: Hadoop composition in 1.x vs. 2.x/3.x)
Here, "computation" means actually running the program, while resource scheduling decides which machine a computation runs on, how much memory it gets, and so on. The core point: in 1.x, MapReduce handled both computation and resource scheduling, which coupled them too tightly. In 2.x, YARN was introduced to take over part of MapReduce's job, so MapReduce is only responsible for computation and YARN handles resource scheduling.

So the three key components of Hadoop: HDFS manages data storage, YARN manages resource scheduling, and MapReduce manages computing.

2.1.3 Overview of HDFS architecture

HDFS is responsible for the storage of massive data, that is, how to store massive data on multiple machines. There are a few concepts that need to be understood first, and I will write an article later to sort out the details.

(figure: HDFS architecture)
Components:

  • NameNode (NN): stores file metadata, such as the file name, directory structure, and file attributes (creation time, number of replicas, permissions), plus each file's block list and the DataNodes where each block lives. It works like an index and does not store the actual data (a way to inspect this mapping is sketched after this list).
  • DataNode (DN): stores the file block data on its local file system, along with checksums for the block data
  • Secondary NameNode (2NN): backs up the NameNode metadata at regular intervals
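
As a concrete illustration, once the cluster from section 4 is running and a file has been uploaded (for example /input/word.txt from the test section below), hdfs fsck shows exactly this file-to-block-to-DataNode mapping (a quick sketch):

# Show the block list of a file and which DataNodes hold each block
hdfs fsck /input/word.txt -files -blocks -locations

# Show the metadata the NameNode keeps about the file (size, replication, permissions)
hadoop fs -ls /input/word.txt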

2.1.4 Overview of YARN architecture

Yet Another Resource Negotiator (YARN), a resource manager for Hadoop

(figure: YARN architecture)
This figure captures the essence of YARN. First, some concepts used in YARN:

  1. ResourceManager (RM): the boss of the entire cluster resources (memory, cpu)
  2. NodeManager(NM): The boss of a single node server resource
  3. ApplicationMaster (AM): the boss of a single job
  4. Container: equivalent to an independent small server; it encapsulates the resources a task needs to run

When a job arrives, an ApplicationMaster is first created on one of the server nodes. As the boss of that single job, the AM requests resources from the RM according to what the job needs to run. The RM then assigns nodes to run the tasks; these can be the node where the AM lives or other servers, and resources are allocated there. Once the resources are granted, a Container is opened on the node to run the task.
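
Once the cluster from section 4 is running, the YARN command line shows these roles directly (a quick sketch):

# List the NodeManagers registered with the ResourceManager
yarn node -list

# List submitted applications; each running one has its own ApplicationMaster and Containers
yarn application -list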

2.1.5 Overview of MapReduce architecture

MapReduce does not need much explanation here. It solves the computation problem: how to use multiple machines efficiently for one calculation. For now, just know that it has two phases:

  • Map: distribute the task to the machines and process the input data in parallel
  • Reduce: collect and summarize the Map results from each machine (a single-machine analogy follows below)
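
A very rough single-machine analogy for word count (this is not how Hadoop executes it, it only illustrates the two phases):

# "Map": split each line of the input into one word per line
# "Reduce": group identical words and count them
cat word.txt | tr -s ' ' '\n' | sort | uniq -c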

2.1.6 Relationship among HDFS, YARN, and MapReduce

Here we only need a preliminary understanding of how these components cooperate, under the overall Hadoop framework, to complete the storage and computation of big data. Roughly:

(figure: how HDFS, YARN, and MapReduce cooperate)
When a client submits a job, it first goes to the ResourceManager and says it wants to run a task. The RM picks a node, starts an ApplicationMaster on it, and hands the job over to the AM. The AM works out how many resources (memory, CPU) the job needs and applies to the RM for them. The RM looks across the cluster for nodes with free resources and starts Containers on them to run the tasks. Each Container computes independently; this is the map phase. When they finish, the results are aggregated in a Container; this is the reduce phase. When the results are written out, HDFS takes over: the DataNodes store the data, the NameNode records the metadata for that data, and the 2NN backs up the NN at intervals.

2.2 Why

Advantages of Hadoop (4 high):

  1. High reliability: the underlying layer keeps multiple replicas of the data, so the failure of a single compute element or storage node does not cause data loss; one piece of data is stored on several machines.

  2. High scalability: tasks and data are distributed across the cluster, which can easily be expanded to thousands of nodes (nodes can be added and removed dynamically).

  3. High efficiency: following the MapReduce idea, Hadoop works in parallel, which speeds up task processing.

  4. High fault tolerance: failed tasks are automatically reassigned.

3. Environment preparation and installation

I already went through a full big data environment setup back in school, installing multiple components across 3 virtual machines, so I won't record those details again here; I will only note the things that were new to me.

3.1 New knowledge of environment variable configuration

Before, I configured environment variables directly in /etc/profile with lines like export XXX=...:$PATH. After joining the company, I found that people generally do not edit /etc/profile directly. The reason lies in a few lines inside /etc/profile itself: it iterates over every .sh file under /etc/profile.d and sources it, so the variables defined there take effect globally. Therefore, to configure environment variables, you can create a new .sh script under /etc/profile.d and define them there.
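
On CentOS-style systems, the relevant lines inside /etc/profile look roughly like this (quoted from memory; your distribution may differ slightly):

# /etc/profile sources every *.sh file under /etc/profile.d
for i in /etc/profile.d/*.sh ; do
    if [ -r "$i" ]; then
        . "$i"
    fi
done

With that in mind, creating your own script looks like this: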

cd /etc/profile.d
vim bigdata_env.sh
# Inside the file:
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0
export PATH=$PATH:$JAVA_HOME/bin
....

# Make the environment variables take effect
source /etc/profile

In this way, the environment variables stay organized.

3.2 Hadoop Directory

Here, understand the role of each directory under the Hadoop installation directory:

  • bin directory: contains the hdfs, yarn, and mapred commands; very important, and should be added to the environment variables (see the script sketch after this list)
  • etc directory: holds the configuration files, including core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml; also very important
  • include directory: a large number of header files, generally not used
  • lib and libexec: dynamic link libraries, extension packages, etc., generally not used
  • sbin directory: commands to start and stop components, commonly start-all.sh, stop-all.sh, or the scripts that start each component separately; also very important and should be added to the environment variables
  • share directory: learning materials, documentation, and example jar packages
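
Combining this with section 3.1, a minimal environment script could look like the following (a sketch; the install paths are the ones assumed throughout this article):

# /etc/profile.d/bigdata_env.sh -- a sketch using the paths assumed in this article
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0
export PATH=$PATH:$JAVA_HOME/bin

#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin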

3.3 Hadoop operation mode

Hadoop three operating modes:

  • Local mode: data is stored on the local Linux filesystem
  • Pseudo-distributed mode: data is stored on HDFS, but on a single server
  • Fully distributed mode: data is stored on HDFS and multiple servers work together

3.4 Local mode: official wordcount case

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount wcinput/ ./wcoutput
  • There must be an input path and an output path for the results
  • If the output path already exists, an exception is thrown (a full example run is sketched below)
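
A minimal end-to-end run might look like this (a sketch; wcinput/word.txt is whatever sample text you prepare):

# Prepare some local input (local mode reads straight from the Linux filesystem)
mkdir wcinput
echo "hello hadoop hello mapreduce" > wcinput/word.txt

# Run the example, then look at the result
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount wcinput/ ./wcoutput
cat wcoutput/part-r-00000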

4. Hadoop cluster mode construction (emphasis)

The cluster mode is very important: only by building the cluster yourself a few times can you really get a feel for the overall architecture. I had already built one before, but I went through it again this time, partly because this is Hadoop 3.x (my earlier setup was older) and partly to re-familiarize myself with the whole process. I won't record every detail, but you should at least know the general steps needed to build a Hadoop cluster. Briefly:

  1. First of all, there must be N machines as nodes. I use 3 virtual machines here. Each virtual machine needs to be configured with a host name and an intranet ip address. These two are to ensure that the machines in the internal cluster can communicate.
  2. Then jdk and hadoop have to be installed on each machine. In the actual operation, it is usually installed in one machine, and then distributed to other machines through scripts. So, script distribution and ssh password-free login are also involved.
  3. After JDK and Hadoop are installed on each machine, the next step is to modify the relevant configuration files under hadoop's etc directory so that the machines can work together

This is the basic process, and the specific details are described below.

4.1 Document distribution

Generally, you configure everything on one server and then write a distribution script to push Hadoop and the JDK to the other machines. So we need file distribution commands. Two are commonly used:

  1. scp: copies files or directories between machines. Common form: scp -r $pdir/$fname (source) $user@$host:$pdir/$fname (destination)

  2. rsync: used for file synchronization; unlike scp, it only transfers the files that changed instead of copying everything: rsync -av $pdir/$fname (source) $user@$host:$pdir/$fname (destination)

    You can use scp for the first distribution, and you can use rsync for later changes, which will be more efficient.

  3. xsync cluster distribution script
    This is a script for cluster-wide file synchronization: it loops over all the nodes and copies the given files to the same directory on each of them (one-command sync). Create it with vim ~/bin/xsync:

    #! /bin/bash
    
    # 1. Check the number of arguments; exit if fewer than 1
    if [ $# -lt 1 ]
    then
    		echo Not Enough Arguments!
    		exit;
    fi
    
    # Loop over all machines in the cluster
    for host in hadoop102 hadoop103 hadoop104
    do 
    	echo ==============================$host============================
    	# Loop over every file given on the command line ($@ holds the arguments) and send them one by one
    	for file in "$@"
    	do
    		# If the file exists on this machine, distribute it to the other machines
    		if [ -e $file ]
    			then
    				# Get the parent directory; -P resolves symlinks to the real path, dirname returns the parent dir
    				pdir=$(cd -P $(dirname $file); pwd)
    				# Get the file name itself with basename
    				fname=$(basename $file)
    				# Passwordless ssh is configured, so ssh to the target host and create the parent dir; -p avoids an error if it already exists
    				ssh $host "mkdir -p $pdir"
    				rsync -av $pdir/$fname $host:$pdir
    			else
    				echo $file does not exist!
    		fi
    	done
    done
    

    Then chmod 777 xsync. A regular user can create a bin directory under their own home and put commands or scripts there, after which they can run them directly. After switching to root, however, the command is no longer found, because the shell then looks in root's own ~/bin, which obviously does not contain it.

    # If distribution to another machine fails with a permission error, use sudo,
    # but with sudo you must give the absolute path of the script -- especially when distributing environment variables
    sudo ./bin/xsync /etc/profile.d/my_env.sh
    
    # After distributing the environment variables, remember to source them on the other machines,
    # otherwise they still will not take effect. This could probably be scripted too.
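
    For reference, the first-time distribution of the software itself might then look like this (a sketch, using the install paths assumed in this article):

    # Distribute the JDK and Hadoop installed on hadoop102 to the other nodes
    xsync /opt/module/jdk1.8.0
    xsync /opt/module/hadoop-3.1.3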
    

4.2 ssh password-free login

I never really understood how ssh password-free login works before. Whether building a Hadoop cluster or configuring git, password-free login is often set up without thinking about the mechanism behind it. While studying this, I incidentally learned how two servers communicate securely. Taking git's password-free login as the example, the process goes like this.
(figure: ssh key-based authentication flow)

When we use GitHub, after configuring the local user information, we usually also configure ssh password-free access.

  1. Run the ssh-keygen command; it generates a public key and a private key in the local .ssh directory
  2. Copy the content of the public key into GitHub's SSH keys settings
  3. Now, when we access GitHub locally, the request or data is encrypted with the local private key A and then sent to GitHub
  4. When GitHub receives the data, it looks through the authorized keys for the public key we uploaded and uses it to decrypt the data; if no matching key is found, it reports a permission error
  5. After decrypting, it processes the request; when sending data back, it encrypts the data with public key A and sends it to us
  6. Locally, the reply is decrypted with private key A and we get the data

In this way, the encrypted transmission process of ssh is completed

Therefore, if the cluster machines need to access each other over ssh and password-free login is not configured, every file transfer requires typing the other machine's login password, which is tedious, so this is well worth configuring.

To let hadoop102 access 103 and 104 without a password, first generate a key pair on 102:

ssh-keygen -t rsa

# This creates id_rsa and id_rsa.pub under the .ssh directory
ssh-copy-id hadoop102
ssh-copy-id hadoop103
ssh-copy-id hadoop104

# Now 102 can log in to the other machines without a password. On 103 and 104, the .ssh directory
# gains an authorized_keys file, which records the public keys allowed to access that machine

# To let 103 and 104 access each other without passwords as well, repeat the same steps on them

If you switch to root at 102 at this time, you will find that you still need a password to log in to 103 and 104. So this ssh password-free login is for a specific user.
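
If the root user also needs password-free access between the machines, repeat the same steps as root (a sketch):

# On hadoop102, switch to root and repeat the key generation and copy
su - root
ssh-keygen -t rsa
ssh-copy-id hadoop102
ssh-copy-id hadoop103
ssh-copy-id hadoop104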

4.3 Cluster configuration

This is the highlight of the cluster setup. The new lesson for me: before building a Hadoop cluster, design and plan it first, mainly deciding which node hosts the NameNode, where the DataNodes go, which node hosts the ResourceManager, where the NodeManagers go, where the 2NN goes, and so on. When I built a cluster before, I piled all the heavyweight daemons onto one machine, and it turns out that is a bad idea.

Deployment plan:

  • hadoop102: NameNode, DataNode, NodeManager
  • hadoop103: ResourceManager, DataNode, NodeManager
  • hadoop104: SecondaryNameNode, DataNode, NodeManager

  • Do not install the NameNode and the SecondaryNameNode on the same server, because both consume a lot of memory
  • ResourceManager also consumes a lot of memory. Do not configure it on the same machine as NameNode and SecondaryNameNode

With this plan in hand, the configuration files below are modified accordingly. One more point: Hadoop has two kinds of configuration files, default configuration files and user-defined configuration files. Only when you want to override a default value do you need to set the corresponding property in the user-defined files.
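
For reference, the default files ship inside the Hadoop jars (core-default.xml, hdfs-default.xml, yarn-default.xml, mapred-default.xml). One way to peek at the defaults (a sketch; jar locations as in the 3.1.3 distribution):

# Extract core-default.xml from the hadoop-common jar to read the default values
cd /opt/module/hadoop-3.1.3
jar xf share/hadoop/common/hadoop-common-3.1.3.jar core-default.xml
less core-default.xml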

The user-defined files are the four key files under hadoop's etc/hadoop directory: core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml. We configure them according to the cluster plan above. The approach: configure everything on hadoop102, then use the distribution script to push it to 103 and 104.

  1. Configure core-site.xml
    On hadoop102, edit core-site.xml to specify the NameNode address, the Hadoop data storage directory, and so on:

    <configuration>
            <!-- Address of the NameNode -->
            <property>
                    <name>fs.defaultFS</name>
                    <value>hdfs://hadoop102:8020</value>
            </property>
            <!-- Hadoop data storage directory -->
            <property>
                    <name>hadoop.tmp.dir</name>
                    <value>/opt/module/hadoop-3.1.3/data</value>
            </property>
            <!-- Static user for the HDFS web UI: icss -->
            <property>
                    <name>hadoop.http.staticuser.user</name>
                    <value>icss</value>
                    <!-- Needed to delete files or directories from the HDFS web page; otherwise there is no permission -->
            </property>
    </configuration>
    

    fs.defaultFS sets the NameNode address; the default is file:///, and here we set it to hdfs://hadoop102:8020, i.e. the HDFS NameNode lives on hadoop102. hadoop.tmp.dir sets the Hadoop data storage directory; the default is under /tmp, which Linux cleans up after a while, so a persistent directory has to be configured manually.

  2. Configure hdfs-site.xml
    The NameNode address configured above (port 8020) is an internal address, used for communication between the modules. We also need a web address so that users can reach the NameNode from a browser instead of only from the command line. This file mainly configures the web addresses of the NameNode and the 2NN.

    <configuration>
            <!-- NameNode web UI address -->
            <property>
                    <name>dfs.namenode.http-address</name>
                    <value>hadoop102:9870</value>
            </property>
            <!-- 2NN web UI address -->
            <property>
                    <name>dfs.namenode.secondary.http-address</name>
                    <value>hadoop104:9868</value>
            </property>
    </configuration>
    
  3. Configure yarn-site.xml
    Specify that MR uses shuffle, set the ResourceManager address, and configure environment-variable inheritance. The last item works around a quirk in 3.1.x (not present in 3.2): HADOOP_MAPRED_HOME is missing from the default whitelist and has to be added.

    <configuration>
            <!-- Make MapReduce use the shuffle service -->
            <property>
                    <name>yarn.nodemanager.aux-services</name>
                    <value>mapreduce_shuffle</value>
            </property>
            <!-- Address of the ResourceManager -->
            <property>
                    <name>yarn.resourcemanager.hostname</name>
                    <value>hadoop103</value>
            </property>
            <!-- Environment variable inheritance -->
            <property>
                    <name>yarn.nodemanager.env-whitelist</name>
                    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
            </property>
    </configuration>
    
  4. Configure mapred-site.xml
    Specify that MapReduce programs run on YARN. Other schedulers can also be used; the default is local.

    <configuration>
            <!-- Run MapReduce programs on YARN -->
            <property>
                    <name>mapreduce.framework.name</name>
                    <value>yarn</value>
            </property>
    </configuration>
    

    After configuring on 102, distribute the configuration to 103 and 104: xsync hadoop/

  5. Configure the workers file
    This is a difference between 3.x and 2.x: in 2.x the file was called slaves; to avoid the discriminatory term, all nodes are now collectively called workers. The workers file is under etc/hadoop; open it with vim workers and enter the host names, with no trailing spaces and no blank lines:

    hadoop102
    hadoop103
    hadoop104
    

    The entries are the host names of the three nodes. Then distribute the file to every node: xsync workers

4.4 Cluster startup

If the cluster is started for the first time, the NameNode needs to be formatted on the hadoop102 node

hdfs namenode -format

After initialization, two new directories appear under hadoop102's Hadoop home: data and logs. Go into /opt/module/hadoop-3.1.3/data/dfs/name/current and cat the VERSION file there; it records the cluster's information, version numbers, and so on:

namespaceID=1937245237
clusterID=CID-b62fd8a6-0686-44b4-8a4e-88dde7336013
cTime=1670748093796
storageType=NAME_NODE
blockpoolID=BP-1316835940-192.168.56.102-1670748093796
layoutVersion=-64

Go to the sbin directory and start HDFS:

sbin/start-dfs.sh

On the hadoop102 virtual machine, open a browser and go to hadoop102:9870 to view the data stored on HDFS.

Start YARN on the node (hadoop103) configured with ResourceManager

sbin/start-yarn.sh

In the browser of the hadoop103 machine, enter the URL hadoop103:8088 to view the job information running on YARN


4.5 Cluster Test

4.5.1 Test hdfs

On hadoop102, use the following commands to create a directory on HDFS and upload word.txt into it:

hadoop fs -mkdir /input

hadoop fs -put input/word.txt /input

Now open the HDFS web page and you can see the /input directory with word.txt inside.
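
You can also check from the command line (same paths as above):

hadoop fs -ls /input
hadoop fs -cat /input/word.txt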

(screenshot: HDFS web UI showing /input/word.txt)
A question here: where does the data shown on this web page actually live on disk? Also note the Block Size column: it is the size of the data block, and each block can hold at most 128 MB.

Answer: when configuring HDFS we specified the Hadoop data storage directory, which is the data directory under hadoop-3.1.3. When HDFS is initialized, this directory is initialized too, so the data on HDFS lives inside it. Go down level by level:

cd data/dfs/data/current/BP-1316835940-192.168.56.102-1670748093796/current/finalized/subdir0/subdir0

Here, you will see two files

(screenshot: a blk_* block file and its .meta checksum file)
The same two files also exist under the corresponding paths on hadoop103 and hadoop104; replicas of the data are stored there, so even if the DataNode on 102 dies, the data survives. In other words, the Replication of the data is 3, and 3 copies are kept.
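
Since word.txt is much smaller than a block, it occupies a single block, and the block file on disk is simply the raw text. A quick check (the blk_... file name below is made up for illustration; the real id differs on every cluster):

# Illustrative only: the actual blk_... name will be different on your cluster
cat blk_1073741825
# the output is exactly the contents of word.txt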

Here is how to handle a cluster crash. If you delete the data directory on hadoop102 by mistake, the cluster's NameNode data is gone, and if you then simply run the format command again, the cluster will not start properly. The reason: every time a new cluster is initialized, the NameNode and the DataNodes each record the cluster ID of that new cluster, and they can only work together when the IDs match. You can see this in data/dfs/name/current/VERSION and data/dfs/data/current/VERSION. If the cluster crashes and only the NameNode is formatted, the NameNode gets a new cluster ID while the DataNodes still carry the old one; the IDs no longer match, the new NameNode cannot use the old DataNodes' data, and the cluster fails to come up.

Therefore, if the cluster runs into errors and the NameNode needs to be reformatted, first stop the NameNode and DataNode processes, delete the data and logs directories on all machines (the old data), and only then format.
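
As a command sketch (paths as used in this article; the rm must be run on every node):

# Stop HDFS (and YARN if it is running)
sbin/stop-dfs.sh

# On every node: remove the old data and logs directories
rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs

# Reformat the NameNode on hadoop102, then start again
hdfs namenode -format
sbin/start-dfs.sh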

4.5.2 Testing YARN

word.txt is already on HDFS, so now run wordcount in cluster mode to watch how the nodes work together. On hadoop102:

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /wcoutput

Now open the web interface on hadoop103 (port 8088) and you will see one more application running.

Looking at HDFS again, there is an extra /wcoutput directory, which holds the word-count results.

4.5.3 Configuring History Server

About the history server: after the wordcount program finishes, refresh the YARN page above and the Tracking UI column for the job changes to History.

Clicking History at this point gives a broken link, because no history server is configured, which also means we cannot inspect jobs that have already finished. To view the history of completed jobs, the history server has to be configured.
To configure it, edit mapred-site.xml on hadoop102 and add the following parameters:

  • History server internal address: an address and port for module-to-module communication
  • History server web address: an address and port for users to access it from a browser
<configuration>
        <!-- Run MapReduce programs on YARN -->
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <!-- History server internal address -->
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>hadoop102:10020</value>
        </property>
        <!-- History server web UI address -->
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>hadoop102:19888</value>
        </property>
</configuration>

Distribute to the other nodes with xsync hadoop/, restart YARN, and then start the history server on hadoop102:

bin/mapred --daemon start historyserver

# Use jps to check that the history server is running; view JobHistory at hadoop102:19888

# Run the wordcount program again

# Now, once the program finishes, clicking History jumps straight to the address above

However, clicking Logs to view the detailed aggregated logs still produces an error:

The reason is that log aggregation is not yet configured on the cluster. Log aggregation means that after an application finishes, its run logs are uploaded to HDFS; without it, nothing is uploaded and the page has nothing to show.

The benefit of log aggregation: it makes it easy to inspect a program's run details, which helps development and debugging.

Note: To enable the log aggregation function, NodeManager, ResourceManager and HistoryServer need to be restarted.

4.5.4 Configuring the Log Aggregation Function

Add the following configuration information to yarn-site.xml of hadoop102:

<configuration>
        <!-- Make MapReduce use the shuffle service -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <!-- Address of the ResourceManager -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>hadoop103</value>
        </property>
        <!-- Environment variable inheritance -->
        <property>
                <name>yarn.nodemanager.env-whitelist</name>
                <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
        </property>

        <!-- Enable log aggregation -->
        <property>
                <name>yarn.log-aggregation-enable</name>
                <value>true</value>
        </property>
        <!-- Log aggregation server URL -->
        <property>
                <name>yarn.log.server.url</name>
                <value>http://hadoop102:19888/jobhistory/logs</value>
        </property>
        <!-- Keep logs for 7 days -->
        <property>
                <name>yarn.log-aggregation.retain-seconds</name>
                <value>604800</value>
        </property>
</configuration>

Distributed to 103 and 104.

Close yarn and history server, and then restart.

mapred --daemon stop historyserver
sbin/stop-yarn.sh

mapred --daemon start historyserver
sbin/start-yarn.sh

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /wcoutput

Then run the wordcount program again (delete the old /wcoutput first or write to a new output directory, since an existing output path throws an exception), open the history server page, and click Logs; now the actual run log of the program is visible.
If the program throws an exception during the run, the logs make it quick to locate where.

4.5.5 Summary of cluster start/stop methods

# Start/stop HDFS as a whole
start-dfs.sh/stop-dfs.sh

# Start/stop YARN as a whole
start-yarn.sh/stop-yarn.sh

# Start/stop individual HDFS components
hdfs --daemon start/stop namenode/datanode/secondarynamenode

# Start/stop individual YARN components
yarn --daemon start/stop resourcemanager/nodemanager

5. Other

5.1 Two common scripts related to the cluster

Here are two handy cluster scripts. The first starts and stops the whole cluster with one command: create a new my_hadoop file under ~/bin and write the script:

#! /bin/bash

if [ $# -lt 1 ]
then
        echo "No Args Input..."
        exit;
fi

case $1 in
"start")
        echo "======================== start hadoop cluster =========================="
        echo "------------------------ start hdfs ------------------------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
        echo "----------------------- start yarn -------------------------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
        echo "---------------------- start historyserver ----------------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
        echo "======================= stop hadoop cluster =========================="
        echo "----------------------- stop historyserver---------------------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
        echo "----------------------- stop yarn -------------------------------"
        ssh hadoop103 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
        echo "---------------------- stop hdfs ----------------------"
        ssh hadoop102 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
        echo "Input Args Error..."
;;
esac

Give it execute permission with chmod a+x.

The second checks with one command whether the daemons on every node started properly: create a new jpsall file under ~/bin and write:

#! /bin/bash

for host in hadoop102 hadoop103 hadoop104
do
        echo ====================== $host =============================
        ssh $host jps
done

Give it chmod a+x, then xsync the ~/bin directory to the other nodes.

5.2 Two interview questions

Here are two interview questions; both can be answered from what was covered above:

  1. What are the common port numbers in Hadoop 3.x? NameNode internal communication: 8020; NameNode web UI: 9870; 2NN web UI: 9868; YARN web UI: 8088; history server: 10020 (internal) and 19888 (web UI).
  2. Which configuration files are needed when setting up a cluster? core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and workers.

5.3 Cluster time synchronization

Time synchronization matters because scheduled tasks run across the cluster; if the machines disagree on the time, problems follow.

Production environment: if the servers can reach the external network, no manual time synchronization is needed; all machines sync against internet time.

If the servers cannot reach the external network, the machines have to be synchronized internally: use hadoop102 as the reference, and have hadoop103 and hadoop104 sync with 102 on a schedule. Since my virtual machines can reach the external network, I did not actually set this up; the following simply records what to do when time synchronization is required.

First, configure the time service on hadoop102 (must be done as the root user):

# Check the ntpd service status and whether it is enabled at boot (on all nodes)
sudo systemctl status ntpd
sudo systemctl start ntpd
sudo systemctl is-enabled ntpd

Configure the ntp.conf configuration file of hadoop102

sudo vim /etc/ntp.conf

Modifications: authorize the machines on the cluster's subnet to query this node as a time server (the restrict line), and comment out the default internet server entries. Then add the following two lines at the end of the file, so that even when the node loses its network connection it can still serve its local time to the other nodes in the cluster:

server 127.127.1.0
fudge 127.127.1.0 stratum 10

Modify hadoop102's /etc/sysconfig/ntpd file and add the following so that the hardware clock is synchronized together with the system time (the hardware clock is generally more accurate):

sudo vim /etc/sysconfig/ntpd

# Add:
SYNC_HWCLOCK=yes

# Restart the ntpd service
sudo systemctl start ntpd
# Enable start on boot
sudo systemctl enable ntpd

Now 102 is set up as the reference time server; next, make 103 and 104 sync their time with 102 on a schedule.

# Stop the ntp service and its autostart on 103 and 104, so they do not also sync with internet time
# (they should sync only with 102, not alternate between 102 and the internet)
# Run the following on 103 and 104
sudo systemctl stop ntpd
sudo systemctl disable ntpd

# On 103 and 104, schedule a sync with 102's time server once a minute
sudo crontab -e
# Add the following cron job
*/1 * * * * /usr/sbin/ntpdate hadoop102

# On hadoop103, change the machine time to test
sudo date -s "2022-09-11 11:11:11"

# One minute later, check on 103 whether the time has synced back with 102
sudo date

6. Summary

This article mainly organizes the introductory knowledge of Hadoop. The overview part gives a basic picture of Hadoop (simplified here), and there are also a few scripts that make day-to-day work more efficient.

I think the key points here are:

  1. Hadoop components and how they work together
  2. The steps to build a Hadoop cluster: interviews may ask how a cluster is set up, which configuration files are modified, and what each configuration file does
  3. How to handle a Hadoop cluster crash, and why those steps work
  4. The port numbers used in Hadoop

As an introduction, this article does not contain that much material, yet it still took me two or three weeks to finish, partly because it is spare-time learning and partly because I caught a virus last week and was not in good enough shape to study. I will try to keep to an update at most every two weeks.

In the next article, we step into HDFS to look at the details and principles: how massive data is stored under HDFS, how the NameNode and DataNodes cooperate, what the 2NN actually does, and so on.


Origin blog.csdn.net/wuzhongqiang/article/details/128355800