[Big Data Foundation] Hadoop 3.1.3 Installation Tutorial

Source: https://dblab.xmu.edu.cn/blog/2441/

Preface: reinstalling solves all bugs! In fact, most of the derivative problems described in the "Problems and solutions" section below can be solved by reinstalling.

Experimental content

Create Hadoop user

First press Ctrl+Alt+T to open a terminal window, then enter the following command to create a new user:

sudo useradd -m hadoop -s /bin/bash

Then use the following command to set the password, which can be simply set to hadoop, and enter the password twice as prompted:

sudo passwd hadoop

You can also grant the hadoop user administrator privileges, which makes deployment more convenient and avoids some permission problems that are tricky for novices:

sudo adduser hadoop sudo

Change the apt software source
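The original post shows this step only in screenshots. As a rough command-line sketch (the Tsinghua mirror URL below is just one common choice, not something specified by the tutorial), you can back up the source list, point it at a faster mirror, and refresh the package index:

sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak    # keep a backup of the original source list
sudo sed -i 's|http://.*archive.ubuntu.com|https://mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list   # switch to the Tsinghua mirror
sudo apt-get update    # refresh the package index from the new source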
and install vim, the strongest editor (just kidding):

sudo apt-get install vim

The commonly used vim modes include command mode, insert mode, visual mode, and normal mode. This tutorial only uses normal mode and insert mode; being able to switch between the two is enough to get you through this guide.

Normal mode
Normal mode is mainly used for browsing text. vim opens in normal mode, and pressing the Esc key in any other mode returns you to normal mode.
Insert mode
Insert mode is used to add content to the text. In normal mode, press the i key to enter insert mode.
Exiting vim
If you use vim to modify any text, remember to save it: press Esc to return to normal mode, then type :wq to save the file and exit vim.
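As a minimal walk-through of the two modes (the file name test.txt is just an example):

vim test.txt      # opens the file in normal mode
# press i to switch to insert mode and type some text
# press Esc to go back to normal mode
# type :wq and press Enter to save the file and exit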

Install SSH, configure SSH passwordless login

Both cluster mode and single-node mode require SSH login (similar to remote login: you log in to a Linux host and run commands on it). Ubuntu has the SSH client installed by default; you also need to install the SSH server:

sudo apt-get install openssh-server

After installation, you can use the following command to log in to the machine:

ssh localhost

The first time you connect, SSH shows a prompt; type yes. Then enter the password hadoop as prompted, and you will be logged in to the local machine.
But logging in this way requires a password every time, so it is more convenient to configure SSH passwordless login.

First exit the ssh just now, and return to our original terminal window, then use ssh-keygen to generate a key, and add the key to the authorization:

exit                           # exit the ssh localhost session
cd ~/.ssh/                     # if this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa              # there will be prompts; just press Enter for all of them
cat ./id_rsa.pub >> ./authorized_keys  # add the key to the authorized list

The meaning of ~: in Linux, ~ stands for the user's home folder, that is, the directory "/home/username". If your user name is hadoop, then ~ stands for "/home/hadoop/". In addition, the text after # in a command is a comment; you only need to type the part before it.
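For example, since we are logged in as the hadoop user:

echo ~            # prints /home/hadoop
cd ~              # same as cd /home/hadoop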

Now the ssh localhost command logs you in directly, without asking for a password.

Install java environment

In the Linux command line, execute the following shell commands (note: the current login user is hadoop):

cd /usr/lib
sudo mkdir jvm                # create /usr/lib/jvm to hold the JDK files
cd ~                          # enter the hadoop user's home directory
cd Downloads                  # case-sensitive; the JDK package jdk-8u162-linux-x64.tar.gz was uploaded here via FTP earlier
sudo tar -zxvf ./jdk-8u162-linux-x64.tar.gz -C /usr/lib/jvm  # extract the JDK into /usr/lib/jvm

After the JDK archive is extracted, you can check the /usr/lib/jvm directory with the following commands:

cd /usr/lib/jvm
ls

As you can see, there is a jdk1.8.0_162 directory under the /usr/lib/jvm directory.
Next, continue to execute the following command to set the environment variable:

cd ~
vim ~/.bashrc

This opens the hadoop user's environment variable configuration file in vim. Add the following lines at the beginning of the file:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Save the .bashrc file and exit the vim editor. Then, continue to execute the following command to make the configuration of the .bashrc file take effect immediately:

source ~/.bashrc

At this time, you can use the following command to check whether the installation is successful:

java -version
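You can also double-check that the environment variable points at the JDK directory (an extra sanity check, not part of the original steps):

echo $JAVA_HOME   # should print /usr/lib/jvm/jdk1.8.0_162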

At this point the Java environment has been successfully installed, and you can proceed to the Hadoop installation.

Install Hadoop 3.1.3

It is best to keep a backup of the freshly extracted Hadoop files, so that it is easy to start over if something goes wrong later in the installation.

sudo tar -zxf ~/下载/hadoop-3.1.3.tar.gz -C /usr/local    # extract into /usr/local (~/下载 is the Downloads folder)
cd /usr/local/
sudo mv ./hadoop-3.1.3/ ./hadoop            # rename the folder to hadoop
sudo chown -R hadoop ./hadoop       # change the owner of the files
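If you want the backup suggested above, a plain copy of the freshly extracted directory is enough (the .bak name is just an example):

sudo cp -r /usr/local/hadoop /usr/local/hadoop.bak   # a clean copy to restore from if later configuration goes wrong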

Hadoop is ready to use after decompression. Enter the following command to check whether Hadoop is available, and if successful, the Hadoop version information will be displayed:

cd /usr/local/hadoop
./bin/hadoop version


Hadoop stand-alone configuration (non-distributed)

Hadoop's default mode is non-distributed (local) mode, which runs without additional configuration. Non-distributed means a single Java process, which is convenient for debugging.

Now we can execute the example to get a feel for Hadoop in action. Hadoop comes with rich examples (run ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar to see all examples), including wordcount, terasort, join, grep, etc.

Here we run the grep example: it takes all files in the input folder as input, filters out the words matching the regular expression dfs[a-z.]+, counts their occurrences, and writes the results to the output folder.

cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input   # use the configuration files as input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/*          # view the results

After successful execution, the job information is printed, and the output is the word dfsadmin, which matches the regular expression. (And if anything goes badly wrong here, remember: one reinstall solves all troubles.)
Note that Hadoop does not overwrite result files by default, so running the above example again will report an error; you need to delete ./output first.

rm -r ./output

Hadoop pseudo-distributed configuration

Hadoop can run in pseudo-distributed mode on a single node: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and the files it reads are stored in HDFS.

Hadoop's configuration files are located in /usr/local/hadoop/etc/hadoop/. For pseudo-distributed mode you need to modify two of them: core-site.xml and hdfs-site.xml. Hadoop configuration files are in XML format, and each setting is declared as a property with a name and a value.

Modify the configuration file core-site.xml (it is more convenient to edit with gedit: gedit ./etc/hadoop/core-site.xml), changing the current

<configuration>
</configuration>

into:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Similarly, modify the configuration file hdfs-site.xml:

gedit ./etc/hadoop/hdfs-site.xml


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

Hadoop configuration file description:

The running mode of Hadoop is determined by the configuration file (the configuration file will be read when running Hadoop), so if you need to switch from the pseudo-distributed mode to the non-distributed mode, you need to delete the configuration items in core-site.xml.

In addition, although pseudo-distributed mode only needs fs.defaultFS and dfs.replication to run (as in the official tutorial), if hadoop.tmp.dir is not configured, the default temporary directory is /tmp/hadoop-hadoop, and this directory may be cleared by the system on reboot, which would force you to format the NameNode again. So we set it explicitly, and also specify dfs.namenode.name.dir and dfs.datanode.data.dir, otherwise you may run into errors in the following steps.

After the configuration is complete, perform the formatting of the NameNode:

cd /usr/local/hadoop
./bin/hdfs namenode -format

Then start the NameNode and DataNode daemons.

cd /usr/local/hadoop
./sbin/start-dfs.sh  # start-dfs.sh is a single file name; there is no space in the middle

The following WARN message may appear during startup: "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable". This warning can be ignored and does not affect normal use.
You may run into various problems at this step; if so, see the "Problems and solutions" section below.
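If you are curious which native libraries Hadoop actually picks up (purely optional), Hadoop ships a built-in check:

./bin/hadoop checknative        # lists native libraries such as hadoop, zlib, snappy and whether each was loaded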
After startup completes, you can use the jps command to check whether it succeeded. If startup succeeded, the following processes are listed: "NameNode", "DataNode" and "SecondaryNameNode" (if SecondaryNameNode did not start, run ./sbin/stop-dfs.sh to stop the daemons and then try starting again). If NameNode or DataNode is missing, the configuration failed; carefully check the previous steps, or look at the startup logs to find the cause.
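For reference, a successful start looks roughly like this (the process IDs are placeholders and will differ on your machine):

jps
# 12300 NameNode
# 12450 DataNode
# 12632 SecondaryNameNode
# 12801 Jps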

After successful startup, you can visit the web interface at http://localhost:9870 to view NameNode and DataNode information, as well as browse the files in HDFS online.

Running Hadoop Pseudo-Distributed Instance

In the stand-alone mode above, the grep example reads local data, while pseudo-distributed mode reads data from HDFS. To use HDFS, you first need to create a user directory in HDFS:

./bin/hdfs dfs -mkdir -p /user/hadoop

Note: the textbook "Principles and Applications of Big Data Technology" uses shell commands starting with "./bin/hadoop dfs". In fact, there are three shell command forms:

  1. hadoop fs
  2. hadoop dfs
  3. hdfs dfs

hadoop fs works with any file system, such as the local file system and HDFS
hadoop dfs only works with HDFS
hdfs dfs is functionally the same as hadoop dfs and also only works with HDFS (see the short example below)
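As a quick illustration of the equivalence (listing the HDFS root directory /):

./bin/hadoop fs -ls /        # generic file system shell; here it resolves to HDFS
./bin/hdfs dfs -ls /         # HDFS-specific form of the same listing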

Then copy the XML files from ./etc/hadoop into the distributed file system as input files, that is, copy /usr/local/hadoop/etc/hadoop/*.xml to /user/hadoop/input in HDFS. Since we are using the hadoop user and have already created the corresponding user directory /user/hadoop, we can use a relative path such as input in the commands; its absolute path is /user/hadoop/input:

./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input

After the copy is complete, you can view the file list with the following command:

./bin/hdfs dfs -ls input

Running a MapReduce job in pseudo-distributed mode works the same way as in stand-alone mode; the difference is that the files are read from HDFS (to verify this, you can delete the local input and output folders created in the stand-alone step).

./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep input output 'dfs[a-z.]+'

View the results with the following command (this reads the output stored in HDFS):

./bin/hdfs dfs -cat output/*

Note that since we just changed the configuration files, the results differ from the earlier local-mode run.
We can also retrieve the running results locally:

rm -r ./output    # first delete the local output folder (if it exists)
./bin/hdfs dfs -get output ./output     # copy the output folder from HDFS to the local machine
cat ./output/*

When Hadoop runs a program, the output directory must not already exist, otherwise it reports the error "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists". So before running the job again, you need to delete the output folder:

./bin/hdfs dfs -rm -r output    # delete the output folder

To shut down Hadoop, run

./sbin/stop-dfs.sh

The next time you start Hadoop, there is no need to format the NameNode again; just run ./sbin/start-dfs.sh.

Problems and solutions

Note: problems 2~5 have a once-and-for-all solution: reinstall.
In fact, many of the problems in 2~5 stem from the same bug, but the more you tinker, the more tangled things become. Searching for information along the way teaches you a lot of relevant knowledge and deepens your understanding, but at some point it is better to draw the sword and cut the tangled knot, as Alexander did. Simple, crude, but effective.
If you want to deepen your understanding of the technology and what lies underneath, read on to problems 2~5:

1.Java installation failed: JAVA_HOME is not set


ERROR: JAVA_HOME is not set and could not be found.

Reference documents:

https://blog.csdn.net/qq_44081582/article/details/104640421

  1. Switch to the [hadoop]/etc/hadoop directory
  2. Run: vim hadoop-env.sh
  3. Change the JAVA_HOME and HADOOP_CONF_DIR paths to the actual installation paths, for example:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162
export HADOOP_CONF_DIR=/usr/local/hadoop

This completes the configuration.

2.ERROR: Attempting to operate on hdfs namenode as root

This is the root of all evil. Although adding sudo to commands can avoid many permission problems, this time try not to use sudo.
Reference documents:

https://blog.csdn.net/weixin_49736959/article/details/108897129

The fix is applied by editing the Hadoop configuration with vim.
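The original post shows the edits only as screenshots. A commonly used variant of this fix (an assumption, not a transcript of those screenshots) is to tell Hadoop 3.x which user should run each HDFS daemon, for example by adding the following to ./etc/hadoop/hadoop-env.sh:

export HDFS_NAMENODE_USER=hadoop
export HDFS_DATANODE_USER=hadoop
export HDFS_SECONDARYNAMENODE_USER=hadoop

and then starting the daemons as the hadoop user (without sudo), as the note above suggests.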

3.host:9000 failed on connection exception: java.net.ConnectException: Connection refused;

Description of the problem:
In fact, the port was occupied. Opening it in the firewall and changing 9000 to 8020 solved the problem.
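To confirm the conflict and apply the change (the lsof check is an extra suggestion, not from the original post):

sudo lsof -i :9000                  # shows which process, if any, is already listening on port 9000
gedit ./etc/hadoop/core-site.xml    # change the fs.defaultFS value from hdfs://localhost:9000 to hdfs://localhost:8020

Then restart HDFS with ./sbin/stop-dfs.sh followed by ./sbin/start-dfs.sh.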

4.localhost: ERROR: Cannot set priority of datanode process 6374

Description of the problem:
This problem is genuinely complicated; it usually means the configuration files have been modified quite heavily. None of the material I found solved it cleanly, and at that point it is not too late to reinstall.
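Before resorting to a reinstall, it is worth glancing at the DataNode log, which usually records the actual exception (the file name below follows Hadoop's usual hadoop-<user>-datanode-<host>.log naming under the logs directory, so treat the exact path as an assumption about your setup):

tail -n 50 /usr/local/hadoop/logs/hadoop-hadoop-datanode-*.log    # the last lines normally show why the DataNode exited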
After reinstalling, the final check with jps shows all of the expected processes:

jps


5.mkdir: Call From algernon-virtual-machine/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused;

This is the same as problem 3: change 9000 to 8020 and the job is done.

Successful result

The two key results: the localhost:9870 web interface opens, and the pseudo-distributed job produces the expected output.
In fact, the experiment ends with those two results; each one marks a stage completed successfully. The key thing about Linux configuration work is that if an earlier step is not done, the later steps naturally get stuck. Once the road is open, everything moves forward without hesitation; when you get stuck, it can take anywhere from a minute or two to a week or two.
As the saying goes, "one minute on stage, ten years of practice off stage": the real substance of the whole experiment report lies in those two results, but to get there I reconfigured I don't know how many files, and even reinstalled the system once.

Experience

Alexander the Great is said to have solved a famous problem in Gordium, the capital of Phrygia.
On his way into the city he found an old chariot, its yoke fastened with knots tied so tightly that it was impossible to see how they had been done up. An oracle said that whoever untied these knots would rule Asia.
Alexander studied the tangled knot for a while, then took two steps back, remarked that the oracle did not say how the knot had to be undone, drew his sword, and split it in two.

If you read the solutions to problems 2~5 carefully, you will find that some of the referenced documents contradict each other: one says to make the files owned by root, and when that still errors you keep searching and find advice to change them back to hadoop, and then yet another says to use the root user after all, which goes right back around again! That is exactly what happened to me in the lab class today. On top of that, the answers online are all over the place; some treat the symptom but not the root cause. If you modify your configuration files after every blog post you read, all you end up with is a pile of self-contradictory files. Of course, there is also some genuinely hard-core knowledge to be found along the way.

In fact, the three classic moves for computer problems, restarting, reinstalling, and buying a new one, are indeed crude but effective.
But every time we hit a problem, do we really have to reinstall?
I have always believed that reinstalling is a last resort.
If you just follow the tutorial blindly and run through the process again, it is fine when it works on the first try, but painful when you get stuck. Choosing to reinstall at that point, honestly following the tutorial, and trusting the rule of thumb that "so many people have done it" means following the textbook will probably work;
but what have you actually learned? Configuring an environment on Linux for an experiment is like finding your way out of a maze blindfolded: being led around by someone else and walked to the exit is no different from walking on flat ground, but if you hit the walls yourself, you pick up experience, methods, and lessons, and it becomes easier to find your bearings when no one is there to guide you.
So don't be afraid of tinkering; you become proficient by tinkering.
Rebuilding the system from scratch, like the Ship of Theseus, is never too late.
However, this puts high demands on backups: always keep snapshots ready, and remember to back up your software and data.
Likewise, folder layout, environment configuration, and the paths in configuration files are where bugs most often appear; paying attention to these conventions during the experiment avoids most problems.
Alexander had both the courage to cut the knot and the perseverance to overcome every difficulty.

Alexander was a Macedonian king active in the 4th century BC. During his expedition into the Persian territory of Lydia, he found a chariot enshrined in a temple, lashed to the temple pillar by the former king Gordius. A local legend said: "Whoever unties this knot will become the king of Asia." It was a knot that many skilled challengers had failed to untie.

Philosopher: So, what do you think Alexander the Great would have done with that knot?

Youth: He untied the knot very cleverly, and soon became the king of Asia, right?

Philosopher: No, not at all. Seeing that the knot was extremely tight, Alexander the Great immediately took out his dagger and cut it in two.

Youth: What? !

Philosopher: According to legend, he then said: "Fate is not determined by legends, but by my own sword. I do not need the power of legends; I will create my destiny with my own sword."

Origin: https://blog.csdn.net/Algernon98/article/details/129232375