Preliminary analysis of Hadoop distributed installation and configuration (can't stop now)

There are many installation tutorials for Hadoop on Linux, such as the ones I referenced while building it myself:
https://blog.csdn.net/weixin_44198965/article/details/89603788
https://blog.csdn.net/qq_25615395/article/details/89083580
https://juejin.im/post/6856984821059895303/

There are plenty of tutorials, but perhaps because the environment differs or the approach differs, you always run into problems of one kind or another while trying to follow them.
The likely reason is that everyone's technology stack is different: things the author assumes you already know, you may not actually know, and those unexplained details become obstacles when you try it yourself.
So, combining the tutorials above with my own hands-on steps, I think it is still worth writing down. If someone happens to find my way of describing things easy to follow, then this article will have that much more meaning.

Unzip

In the previous article, Hadoop installation environment preparation and related knowledge analysis, I already downloaded the installation package, so the first step is simply to extract it:

tar -zxvf hadoop-3.1.3.tar.gz

A distributed deployment needs at least 3 machines, so I prepared three: 192.168.139.9, 192.168.139.19 and 192.168.139.29, all CentOS 6.5 virtual machines with 1 GB of memory allocated each.
One thing worth mentioning: the third of the three links above was written by a colleague of mine. He had allocated only 512 MB of memory to the virtual machine and ran out of memory at runtime.

For the three machines, you can either extract, configure and run on each one separately, or extract and configure on just one machine and then use the scp command mentioned in the previous article to copy everything to the other two. Here I chose to extract and configure on one machine and then copy to the other two, as sketched below.
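As a rough sketch of that copy step (paths and IPs as in my setup, so adjust them to your own), once the configuration described below is finished, it could look like this when run from the first machine:

# the parent directory /root/soft/bigdata/hadoop must already exist on the target machines
scp -r /root/soft/bigdata/hadoop/hadoop-3.1.3 root@192.168.139.19:/root/soft/bigdata/hadoop/
scp -r /root/soft/bigdata/hadoop/hadoop-3.1.3 root@192.168.139.29:/root/soft/bigdata/hadoop/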

Configuration

A basic distributed setup mainly involves configuring the following files. "Basic" here means the goal is simply to start the cluster and verify that it is running successfully.

hadoop-env.sh
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml
workers

All of the files above live in the etc/hadoop directory under the Hadoop installation directory. For example, my installation directory is /root/soft/bigdata/hadoop/hadoop-3.1.3, so the path of the first file is /root/soft/bigdata/hadoop/hadoop-3.1.3/etc/hadoop/hadoop-env.sh.
Some tutorials only list the files above, but in actual operation I found that version 3.1.3 also needs the following few modified, otherwise startup fails; version 2.7 does not need them changed.

start-dfs.sh
stop-dfs.sh
start-yarn.sh
stop-yarn.sh

These files are in a different directory, the sbin directory under the installation directory. I will come back to this part of the configuration when describing the start-all.sh execution failure.

hadoop-env.sh

Just modify JAVA_HOME in this file, as follows:

export JAVA_HOME=/root/soft/jdk1.8.0_261

There is actually a puzzle here. My machine already has a global JAVA_HOME configured, so I originally thought that writing JAVA_HOME=$JAVA_HOME in this file would let it read the system environment variable directly, but in actual operation that is not what happened, so the file really does have to point to the actual JDK path.
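A quick sanity check, assuming my JDK path from above, is to confirm that the path really exists and that the export line actually landed in the file:

# the JDK directory must contain bin/java
ls /root/soft/jdk1.8.0_261/bin/java
# show the JAVA_HOME line now set in hadoop-env.sh
grep '^export JAVA_HOME' /root/soft/bigdata/hadoop/hadoop-3.1.3/etc/hadoop/hadoop-env.sh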

core-site.xml

<property>
	<name>fs.defaultFS</name>
	<value>hdfs://master:9000</value>
</property>
<property>
	<name>hadoop.tmp.dir</name>
	<value>/opt/hadoop/hadoopdata</value>
</property>

The two properties above need to be added between <configuration> and </configuration>, and the same goes for the snippets below.

As for the meaning of the two settings, for the first one I checked a lot of material but have not yet found an accurate explanation, so I will come back to it later.
What can be said is that the master configured inside it is actually the machine's hostname, i.e. the HOSTNAME value in /etc/sysconfig/network; the IP address also works, and I have tested both.
The name master is just what I chose; it does not have to be called that.
There is also a claim that a domain name can be configured here, but in general we don't use that ourselves. I didn't try it; in theory it should be feasible.
The second setting points to a temporary Hadoop storage directory. Some tutorials say it must be created manually in advance, but in my actual operation I did not create it manually; it is created automatically when hadoop namenode -format is executed later, which means it can point to a directory that does not yet exist.
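For reference, a minimal sketch of what the whole core-site.xml might look like with the values used above (the file shipped in the tarball already contains an empty <configuration> element, so the properties can simply be pasted inside it):

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://master:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/opt/hadoop/hadoopdata</value>
	</property>
</configuration>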

hdfs-site.xml

<property>
	<name>dfs.replication</name>
	<value>3</value>
</property>

The setting above appears to be the number of replicas, but when I changed the number during actual operation, the number of active nodes displayed afterwards stayed the same, so this is something I still need to understand; for now I am just copying what other people's tutorials do.
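For later reference, once the cluster is running and the Hadoop environment variables described further down are configured, two commands that can be used to check what replication factor is actually in effect (a sketch; the HDFS path is just an example) are:

# print the effective dfs.replication value as the client sees it
hdfs getconf -confKey dfs.replication
# after writing a file into HDFS, show the replication factor actually applied to it
hdfs dfs -stat %r /some/file/in/hdfs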

yarn-site.xml

<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>
<property>
	<name>yarn.resourcemanager.address</name>
	<value>master:18040</value>
</property>
<property>
	<name>yarn.resourcemanager.scheduler.address</name>
	<value>master:18030</value>
</property>
<property>
	<name>yarn.resourcemanager.resource-tracker.address</name>
	<value>master:18025</value>
</property>
<property>
	<name>yarn.resourcemanager.admin.address</name>
	<value>master:18141</value>
</property>
<property>
	<name>yarn.resourcemanager.webapp.address</name>
	<value>master:18088</value>
</property>

The specific meaning of the settings above is also something to fill in during later learning. The master inside them is again the hostname, and an IP works too, as explained above. As for the ports, if you compare other tutorials you will find many of them differ, so these are custom ports; their role should be that Hadoop starts processes listening on them, so just make sure they don't conflict with the ports of other applications on the machine.
One thing worth pointing out in advance is that the yarn.resourcemanager.webapp.address setting can actually be used to verify the number of active nodes in the cluster, and it can be accessed from a browser.
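Besides the browser, the same web address can be queried from the command line once the cluster is up; a small sketch, assuming the master:18088 address configured above:

# the ResourceManager exposes a REST API under /ws/v1; cluster/metrics includes the active node count
curl http://master:18088/ws/v1/cluster/metrics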

mapred-site.xml

<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>

The setting above looks easy to understand literally, but its actual effect is still not particularly clear to me. I only know that the default is local and that it needs to be changed to yarn here; this is another thing to understand later by reading the official documentation and other material.

workers

My understanding of this file is that it configures the worker (slave) nodes, and it is this that really affects the number of active nodes seen later, not the dfs.replication setting above.
This file is called slaves in Hadoop 2.7 and workers in 3.1.3. My configuration is as follows:

#localhost
node01
node02

The default file contains a single line, localhost, which is why some people running single-machine pseudo-distributed mode find that nothing seems to need changing here: the worker is the machine itself, and everything points to localhost, i.e. 127.0.0.1.
The two entries I configured are the hostnames of 192.168.139.19 and 192.168.139.29. The thing to note is that it apparently must be the hostname, and the host mapping must be set up, i.e. the /etc/hosts file must be modified. The configuration on my three machines is as follows:

192.168.139.19 node01
192.168.139.29 node02
192.168.139.9 master

Originally I thought this was just a mapping, so couldn't it be changed to a made-up domain name different from the hostname, given that hosts has to be configured anyway? In fact, when the mapping was changed to something different from the hostname, problems appeared at startup: during start-all.sh the two worker nodes could not be started.
With the configuration above, two nodes are configured in my workers file, so two active nodes will be seen in the web interface after startup. If the master's hostname were added here as well, the web interface would show three active nodes.
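Before going further, a quick way to sanity-check the hostname mappings, assuming the /etc/hosts entries above, is to resolve and ping each name from the master:

# getent reads /etc/hosts the same way applications do
getent hosts master node01 node02
# each name should answer from its real LAN address, not 127.0.0.1
ping -c 1 node01
ping -c 1 node02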

Configure hadoop environment variables

With the configuration above, things are almost done; you could go directly into the bin and sbin directories of the Hadoop installation and run the various commands from there.
But to make things more convenient, so that Hadoop commands can be run from any directory, you also need to configure Hadoop's environment variables. The method is the same as for JAVA_HOME and REDIS_HOME:

export HADOOP_HOME=/opt/hadoop/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Note: the configuration above can be either permanent or temporary. A temporary configuration means running the exports above directly on the bash command line; they are only valid for the current login session and are lost when the machine restarts. For a permanent configuration, put the lines above into /etc/profile and then run source /etc/profile. So for these environment variables it is best to modify the file, which is what I did.
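After sourcing /etc/profile, a quick check that the variables took effect (paths as in my setup) could be:

echo $HADOOP_HOME
# both should resolve to the bin and sbin directories under the installation
which hadoop
which start-all.sh
# prints the Hadoop version if everything is wired up correctly
hadoop version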

Format (initialize)

With the environment variables in place, the last step of the pure installation stage is the format operation, which needs to be performed on the master node and looks more like an initialization.

hadoop namenode -format

The operation above relies on the Hadoop environment variables being configured; if they are not configured, or are configured incorrectly, the hadoop command may not be found.
This operation initializes the directory configured as hadoop.tmp.dir in core-site.xml, in my case /opt/hadoop/hadoopdata. The directory did not exist beforehand; after running the command above it is created automatically, and some subdirectories and files are generated inside it.
What I want to say here is that some tutorials claim the other nodes need a copy of this content, but I did not copy it, and so far that seems to have had no effect.
There is also a claim that the clusterID, datanodeUuid and storageID in the VERSION file inside this directory affect the number of active nodes seen on the web interface. I did not copy or regenerate them, and it does not seem to affect the number I see.
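For what it's worth, with the default settings the namenode metadata ends up under dfs/name inside the hadoop.tmp.dir directory, so after formatting you can peek at what was generated (paths assume my /opt/hadoop/hadoopdata from above):

ls /opt/hadoop/hadoopdata/dfs/name/current
# the VERSION file mentioned above lives here and contains the clusterID
cat /opt/hadoop/hadoopdata/dfs/name/current/VERSION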

Start the Hadoop cluster (there is a pitfall here)

At this point the copy-the-tutorial style setup is complete, and all that seems to be left is startup and verification. According to some online tutorials, you run start-all.sh and then jps to check what started, so on the machine where I had run hadoop namenode -format I ran the start command:

start-all.sh

For version 2.7 this does seem to be how it goes: running the command above prompts for the ssh passwords of the other two machines, and after entering the corresponding login passwords everything starts.
I only got this far with version 2.7 and did not verify further.

And the reason I said it only seems that the setup is complete, with just startup and verification left, is that with version 3.1.3 it actually is not complete yet.
When the command above is executed, the following exception is thrown first:

ERROR: Attempting to operate on yarn nodemanager as root
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.

Searching online, I found that some user settings need to be configured, which is why the four files mentioned at the beginning, start-dfs.sh, stop-dfs.sh, start-yarn.sh and stop-yarn.sh, also need to be modified. Add the following at the top of the two dfs scripts:

HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

Add the following configuration to the two yarn.sh files:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
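As a side note, these user variables do not have to live in the start/stop scripts themselves; as far as I know, Hadoop 3 also picks them up from etc/hadoop/hadoop-env.sh (or from the shell environment), so an alternative sketch is to export them there once instead of editing four scripts:

# in etc/hadoop/hadoop-env.sh
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root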

With the configuration above in place, start again and you will find the following output:

node02: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
node01: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password)

And then it simply ended there. There was no prompt to enter ssh passwords as with 2.7, and running jps showed that this machine's output was far from what other documents describe as success, while running jps on the other two machines produced nothing except jps itself.
The reason is that for version 3.1.3 ssh keys must be configured. Version 2.7 can also use them, but there it does not seem mandatory; with 3.1.3 there is no choice.

sshkey configuration

The purpose of the ssh keys is passwordless login over ssh: public and private keys are configured so the machines can talk to each other without passwords. The steps are as follows:

ssh-keygen -t rsa
cat id_rsa.pub >>  authorized_keys
scp authorized_keys root@192.168.139.19:/root/.ssh/authorized_keys

The first line generates a public/private key pair on the local machine (i.e. 192.168.139.9 here), the second line appends the public key to the authorized_keys file, and the third copies that authorized_keys file to another machine.
There is a problem here: the 192.168.139.19 machine has not generated any keys yet, so it has no .ssh directory, and running the third line prompts that the directory or file does not exist. I don't know whether this is the case on all Linux systems.
So a more reliable order is: first generate each machine's own key pair on all three machines in turn, then append the first machine's public key to its authorized_keys file, and then copy authorized_keys to the other machines.
Some people online say the public keys of all the machines need to be appended into one file and then copied to every machine, but personally I don't think that is necessary.
Because as far as my current operation goes, the other machines can only be started normally by running start-all.sh on the namenode machine; even if, as described above, several public keys are appended into one file and copied to every machine, running the start command on a non-namenode node still ends with ResourceManager not being found.
It's just that almost every article says otherwise, while my actual experience does not match; it is not clear whether the other articles are wrong, or whether I happened to follow a wrong process whose result merely looks right.
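Putting the key exchange together, a sketch of the whole passwordless-ssh setup as run from the master, using ssh-copy-id (which appends the key on the remote side and creates ~/.ssh there if needed; IPs as above):

# on the master: generate a key pair if one does not exist yet
ssh-keygen -t rsa
# let the master log into itself without a password
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# push the master's public key to the two workers (asks for their passwords once)
ssh-copy-id root@192.168.139.19
ssh-copy-id root@192.168.139.29
# verify: these should print the remote hostnames without asking for a password
ssh node01 hostname
ssh node02 hostname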

At this point the configuration can be considered complete. Going back to the master node and running start-all.sh, you will see that startup succeeds, and running jps on the nodes then gives output like the following:

[root@master hadoop]# jps
52709 NodeManager
52008 NameNode
52329 SecondaryNameNode
52589 ResourceManager
53935 Jps

[root@master ~]# jps
1552 SecondaryNameNode
1937 Jps
1340 NameNode
1789 ResourceManager

Many tutorials seem to end here, as if this already meant success, but that is not the case. It only looks like success, or rather, only the single node that ran the command started successfully.

Node interaction problem

In the yarn-site.xml configuration above there is the following section:

<property>
	<name>yarn.resourcemanager.webapp.address</name>
	<value>master:18088</value>
</property>

You can access it through a browser and see the running status of the nodes, but when I visited, the number of active nodes shown was 1,
and the only active node was the one where the start command had been executed. (Note that the virtual machines are set up with a hostname mapping of their own; if you access the page from the browser of an external physical machine, you need to add that mapping to the physical machine's hosts file, otherwise the address will not resolve.)
This was rather embarrassing, so I checked material and logs and found entries like the following in the logs on 192.168.139.19 and .29:

org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.139.9:18025. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM

It looked like the YARN ResourceManager could not be reached. What I finally found was that in the hosts configuration, master was also mapped to 127.0.0.1, so when the other nodes connected to the ResourceManager, master resolved to 127.0.0.1 and the connection looped back to themselves.
So remove the mapping of master to 127.0.0.1 in /etc/hosts, run stop-all.sh first and then start-all.sh, and the web interface will then show that the number of active nodes is no longer 1.
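To make the fix concrete (values as in my setup), the check and the restart look roughly like this on the master:

# master must NOT appear on the 127.0.0.1 line; it should only map to 192.168.139.9
grep master /etc/hosts
stop-all.sh
start-all.sh
# the activeNodes field in the output should now match the number of workers
curl http://master:18088/ws/v1/cluster/metrics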


Origin blog.csdn.net/tuzongxun/article/details/107843326