Installing Hadoop 3.1.3 on Ubuntu (standalone and pseudo-distributed)

After installing Ubuntu, download Hadoop and install it.

1. Preparation

1.1 Default environment
Ubuntu 18.04 64-bit is used as the system environment (Ubuntu 14.04 or Ubuntu 16.04 also works, in either 32-bit or 64-bit).
Download address for the hadoop-3.1.3.tar.gz file: Portal (link)
You can use Thunder (Xunlei) to download, which is faster.
1.2 Create a hadoop user
Create a user named "hadoop" that uses /bin/bash as its shell:

sudo useradd -m hadoop -s /bin/bash

Use the following command to set the password. You can simply set it to 123456. Enter the password twice as prompted:

sudo passwd hadoop
123456
123456

Grant administrator privileges to the hadoop user to make deployment easier:

sudo adduser hadoop sudo
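
From this point on, the remaining steps are carried out as the hadoop user; you can log out and log back in as hadoop, or switch directly in the current terminal:

su - hadoop    # switch to the hadoop user (enter the password set above)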

1.3 Update apt
After logging in as the hadoop user, first update apt; we will later use apt to install software, and some packages may fail to install if the package index is not updated. Execute the following command:

sudo apt-get update


2. Install SSH and configure SSH passwordless login

Both cluster and single-node modes require SSH login (similar to remote login: you can log in to a Linux host and run commands on it). Ubuntu has the SSH client installed by default; you also need to install the SSH server:

sudo apt-get install openssh-server

After installation, you can use the following command to log in to the machine:

ssh localhost

At this point you will see the SSH first-login prompt; enter yes, then enter the password 123456 as prompted to log in to the machine.

Logging in this way requires entering the password every time. We configure SSH passwordless login to make it more convenient.

First exit the ssh session we just opened and return to the original terminal window. Then use ssh-keygen to generate a key and add it to the authorized keys:

exit                           # exit the ssh localhost session from before
cd ~/.ssh/                     # if this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa              # just press Enter at every prompt (3 or 4 times)
cat ./id_rsa.pub >> ./authorized_keys  # add the key to the authorized list

At this point, use the ssh localhost command to log in directly without entering a password.
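
If ssh localhost still asks for a password after this, a common cause is overly loose permissions on the key files; the following is a troubleshooting hint, not part of the original steps:

chmod 700 ~/.ssh                    # the .ssh directory should be accessible only by its owner
chmod 600 ~/.ssh/authorized_keys    # the key list should be readable/writable only by its owner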

3. Install the Java environment

Hadoop 3.1.3 requires JDK 1.8 or above.

  1. Install the JDK 1.8 package jdk-8u162-linux-x64.tar.gz.
  2. The JDK 1.8 package can be downloaded from Baidu Cloud Disk (extraction code: ziyu). Download the compressed file jdk-8u162-linux-x64.tar.gz to your local computer; here we assume it is saved in the "/home/hadoop/Desktop/" directory.
  3. Execute the following commands:
cd /usr/lib
sudo mkdir jvm                # create /usr/lib/jvm to hold the JDK files
cd ~                          # go to the hadoop user's home directory
cd Desktop                    # note the capitalization; the JDK package jdk-8u162-linux-x64.tar.gz was uploaded to this directory earlier (e.g. via FTP)
sudo tar -zxvf ./jdk-8u162-linux-x64.tar.gz -C /usr/lib/jvm  # extract the JDK into /usr/lib/jvm
  4. Set the environment variables
    Open the hadoop user's environment variable configuration file ~/.bashrc (for example with vim ~/.bashrc) and add the following lines at the beginning of the file:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Save the .bashrc file and exit the editor, then run the following command to make the configuration in .bashrc take effect immediately:
source ~/.bashrc
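
You can then verify that the variables took effect; the version string below assumes the JDK 1.8.0_162 package installed above:

echo $JAVA_HOME     # should print /usr/lib/jvm/jdk1.8.0_162
java -version       # should report java version "1.8.0_162"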


If java -version reports the JDK version as above, the Java environment is installed successfully.

4. Install Hadoop 3.1.3

  1. Install Hadoop into /usr/local/:
sudo tar -zxf ~/Desktop/hadoop-3.1.3.tar.gz -C /usr/local    # extract into /usr/local
cd /usr/local/
sudo mv ./hadoop-3.1.3/ ./hadoop            # rename the directory to hadoop
sudo chown -R hadoop ./hadoop       # change ownership so the hadoop user has full access
  2. Hadoop is ready to use once extracted. Run the following commands to check whether Hadoop works; on success the Hadoop version information is displayed:
cd /usr/local/hadoop
./bin/hadoop version
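
If everything is in order, the first line of the output is the version line (the remaining lines show build and checksum details):

Hadoop 3.1.3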


4.1 Hadoop standalone configuration (non-distributed)

Hadoop's default mode is non-distributed (local) mode, which runs without any additional configuration. In this mode Hadoop runs as a single Java process, which is convenient for debugging.

  1. Here we run the grep example: it takes all files in the input folder as input, filters for words matching the regular expression dfs[a-z.]+, counts the number of occurrences, and writes the results to the output folder.
cd /usr/local/hadoop
mkdir ./input
cp ./etc/hadoop/*.xml ./input   # use the configuration files as input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/*          # view the results
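
Note that Hadoop does not overwrite an existing output directory: to run the example again, delete the local ./output folder first (the same rule applies to the HDFS output directory in section 5):

rm -r ./output    # remove the previous results before re-running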


4.2 Hadoop pseudo-distributed installation

 Hadoop can run on a single node in pseudo-distributed mode, where the Hadoop daemons run as separate Java processes: the node acts as both NameNode and DataNode, and the files it reads are stored in HDFS.
 Hadoop's configuration files are located in /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires modifying two of them, core-site.xml and hdfs-site.xml. Hadoop configuration files are in XML format; each setting is declared as a property with a name and a value.
  1. Modify the configuration file core-site.xml (it is more convenient to edit it with gedit: gedit ./etc/hadoop/core-site.xml). Change
 <configuration>
 </configuration>

into

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
  2. Similarly, modify the configuration file hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
  3. Hadoop configuration file notes

    Hadoop's running mode is determined by its configuration files (they are read when Hadoop runs), so to switch from pseudo-distributed mode back to non-distributed mode you need to delete the configuration items added to core-site.xml.
    In addition, although pseudo-distributed mode only needs fs.defaultFS and dfs.replication to run (as in the official tutorial), if the hadoop.tmp.dir parameter is not configured, the default temporary directory /tmp/hadoop-hadoop is used. That directory may be cleared by the system on reboot, which would force you to run the format step again. So we set hadoop.tmp.dir and also specify dfs.namenode.name.dir and dfs.datanode.data.dir; otherwise errors may occur in the following steps.

  4. After the configuration is complete, format the NameNode:

cd /usr/local/hadoop
./bin/hdfs namenode -format
  If it succeeds, you will see a "successfully formatted" message.

If the error "Error: JAVA_HOME is not set and could not be found." appears at this step, it means the JAVA_HOME environment variable was not set correctly. Go back and set JAVA_HOME as described above, otherwise the later steps cannot proceed. If you have already set JAVA_HOME in the .bashrc file as described earlier but the error still occurs, then go to the Hadoop installation directory and edit the configuration file "/usr/local/hadoop/etc/hadoop/hadoop-env.sh": find the line "export JAVA_HOME=${JAVA_HOME}" and change it to the actual Java installation path, for example "export JAVA_HOME=/usr/lib/jvm/default-java", and then start Hadoop again.
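
With the JDK installed earlier in this tutorial, that line in hadoop-env.sh would read as follows (the path is taken from the JDK installation step above):

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162
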
  5. Start the NameNode and DataNode daemons:

cd /usr/local/hadoop
./sbin/start-dfs.sh  # start-dfs.sh is a single executable file; there are no spaces in its name
  6. After startup completes, use the jps command to check whether the daemons started successfully.

    If started successfully, the following processes will be listed: “NameNode”, “DataNode” and “SecondaryNameNode”
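    A successful jps check looks roughly like this (the process IDs are illustrative):

    12345 NameNode
    12456 DataNode
    12789 SecondaryNameNode
    13024 Jps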

5. Run a Hadoop pseudo-distributed instance

  In pseudo-distributed mode the data that is read comes from HDFS. To use HDFS, first create a user directory in HDFS:
./bin/hdfs dfs -mkdir -p /user/hadoop
 The command above uses the "./bin/hdfs dfs" form; in fact there are three shell command forms:
1. hadoop fs
2. hadoop dfs
3. hdfs dfs

hadoop fs works with any file system, such as the local file system and HDFS
hadoop dfs only works with HDFS
hdfs dfs has the same effect as hadoop dfs and likewise only works with HDFS
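
For example, the user directory created above can be listed in either of these equivalent ways (illustrative commands, run from /usr/local/hadoop):

./bin/hadoop fs -ls /user/hadoop
./bin/hdfs dfs -ls /user/hadoop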

Next, copy the xml files in ./etc/hadoop into the distributed file system as input files, i.e. copy /usr/local/hadoop/etc/hadoop into /user/hadoop/input in the distributed file system. Since we are using the hadoop user and have already created the corresponding user directory /user/hadoop, the commands can use a relative path such as input, whose absolute path is /user/hadoop/input:
./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input
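
Then run the same grep example as in the standalone section, this time reading the input from HDFS and writing the results to the HDFS directory output (the jar path is the same as before):

./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep input output 'dfs[a-z.]+'
./bin/hdfs dfs -cat output/*    # view the results stored in HDFS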

The results can also be retrieved to the local machine:

rm -r ./output    # first delete the local output folder (if it exists)
./bin/hdfs dfs -get output ./output     # copy the output folder from HDFS to the local machine
cat ./output/*
 When Hadoop runs a program, the output directory must not already exist; otherwise it reports the error "org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists". Therefore, to run the example again, you need to delete the output folder first:
 ./bin/hdfs dfs -rm -r output    # delete the output folder

The output directory must not exist when the program runs
To prevent results from being overwritten, the output directory specified by a Hadoop program (such as output) must not already exist, otherwise an error is reported, so the output directory has to be deleted before each run. When actually developing an application, you can add the following code to the program to delete the output directory automatically on every run and avoid this tedious command-line step:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf);   // new Job(conf) also works but is deprecated

/* Delete the output directory if it already exists */
Path outputPath = new Path(args[1]);
outputPath.getFileSystem(conf).delete(outputPath, true);

6. Finally, Hadoop must be shut down correctly; otherwise an error may be reported the next time it is started and it may need to be formatted again.

To shut down Hadoop, run

./sbin/stop-dfs.sh

Note
The next time you start Hadoop, there is no need to format the NameNode again; just run ./sbin/start-dfs.sh.
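
In other words, a later restart only requires:

cd /usr/local/hadoop
./sbin/start-dfs.sh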

7. Install a Hadoop cluster (to be updated when conditions permit)

Origin: blog.csdn.net/huangdxian/article/details/120773137