Hadoop has three modes, as mentioned before: stand-alone mode, pseudo-cluster mode, and cluster mode.
Stand-alone mode: Hadoop exists only as a library and executes MapReduce tasks on a single computer; it is used by developers to set up a learning and experimentation environment.
Pseudo-cluster mode: Hadoop runs on a single machine as daemon processes; this mode is also generally used by developers for learning and experimentation.
Cluster mode: the production mode of Hadoop, that is, the mode in which Hadoop actually provides production-grade services.
HDFS configuration and startup
Like a database, HDFS is started as a daemon process. To use HDFS, a client connects to the HDFS server over the network (sockets) to access the file system.
In the Hadoop running environment chapter, we configured the basic Hadoop environment in a container named hadoop_single. If you closed that container last time, or shut down the computer and the container stopped, start it and enter it again.
After entering the container, we confirm that Hadoop exists:
hadoop version
If the command prints a Hadoop version number, Hadoop is present.
Next we will move to the formal steps.
Create a new hadoop user
Create a new user named hadoop:
adduser hadoop
Install the small tools used for setting user passwords and managing privileges:
yum install -y passwd sudo
Set hadoop user password:
passwd hadoop
Enter the password twice when prompted, and be sure to remember it!
Modify the owner of the hadoop installation directory to be the hadoop user:
chown -R hadoop /usr/local/hadoop
Then edit the /etc/sudoers file with a text editor: after the line
root ALL=(ALL) ALL
add a new line:
hadoop ALL=(ALL) ALL
Then exit the container.
Stop the container hadoop_single and commit it as an image named hadoop_proto:
docker stop hadoop_single
docker commit hadoop_single hadoop_proto
Create new container hdfs_single:
docker run -d --name=hdfs_single --privileged hadoop_proto /usr/sbin/init
This way the new user is created.
Start HDFS
Now enter the newly created container:
docker exec -it hdfs_single su hadoop
You should now be the hadoop user; confirm with:
whoami
It should print "hadoop".
Generate SSH keys:
ssh-keygen -t rsa
Here you can keep pressing Enter until the generation ends.
Then add the generated key to the trust list:
ssh-copy-id [email protected]
View the container IP address:
ip addr | grep 172
From the output you can see that the container's IP address is 172.17.0.2; yours may differ.
Before starting HDFS, we make some simple configurations. All Hadoop configuration files are stored in the etc/hadoop subdirectory under the installation directory, so we can enter this directory:
cd $HADOOP_HOME/etc/hadoop
Here we modify two files: core-site.xml and hdfs-site.xml
In core-site.xml, add the following property inside the <configuration> tag:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://<your IP>:9000</value>
</property>
Similarly, add the following property inside the <configuration> tag in hdfs-site.xml:
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
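For reference, after this edit the whole core-site.xml should look roughly like the following (the two header lines come from the default file Hadoop ships; 172.17.0.2 stands in for your own IP):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://172.17.0.2:9000</value>
    </property>
</configuration>
```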
Format the file system structure (this initializes the NameNode):
hdfs namenode -format
Then start HDFS:
start-dfs.sh
The startup is divided into three steps, starting the NameNode, DataNode and Secondary NameNode respectively.
We can run jps to see the Java process:
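If the daemons started successfully, the output looks roughly like this (the process IDs are examples and will differ on your machine):

```
36 NameNode
152 DataNode
309 SecondaryNameNode
526 Jps
```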
At this point, the HDFS daemons are up and running. Since HDFS provides a built-in HTTP panel, we can visit http://<your container IP>:9870/ in a browser to view the HDFS panel and detailed information:
If this page appears, it means that HDFS is configured and started successfully.
Note: if you are using a Linux system without a desktop environment and have no browser, you can skip this step. If you are using Windows without Docker Desktop, this step may be difficult.
Using HDFS
HDFS Shell
Back in the hdfs_single container, the following commands are used to operate on HDFS:
# List files and subdirectories under the root directory / (absolute path)
hadoop fs -ls /
# Create a new directory (absolute path)
hadoop fs -mkdir /hello
# Upload a file
hadoop fs -put hello.txt /hello/
# Download a file
hadoop fs -get /hello/hello.txt
# Print file contents
hadoop fs -cat /hello/hello.txt
The commands above cover the most basic operations of HDFS; many other operations familiar from traditional file systems are also supported.
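A few more commonly used ones, as a sketch (the paths here are hypothetical examples, not files created earlier):

```
# Copy a file within HDFS
hadoop fs -cp /hello/hello.txt /hello/copy.txt
# Move / rename a file
hadoop fs -mv /hello/copy.txt /hello/renamed.txt
# Delete a file (add -r to delete a directory recursively)
hadoop fs -rm /hello/renamed.txt
# Show disk usage in human-readable form
hadoop fs -du -h /
```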
HDFS API
Many back-end platforms support HDFS. The official distribution ships C/C++ and Java programming interfaces, and HDFS client packages are also available through the Node.js and Python package managers.
Here is a list of dependencies for the package manager:
Maven:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.4</version>
</dependency>
Gradle:
providedCompile group: 'org.apache.hadoop', name: 'hadoop-hdfs-client', version: '3.1.4'
NPM:
npm i webhdfs
pip:
pip install hdfs
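Clients like the ones above typically talk to the WebHDFS REST interface that the NameNode exposes on its HTTP port (9870, the same port as the web panel). As a minimal sketch using only the Python standard library, the snippet below builds a WebHDFS URL and attempts a directory listing; the address 172.17.0.2:9870 and the path /hello mirror the examples in this chapter and should be adjusted to your setup.

```python
import json
from urllib.request import urlopen


def webhdfs_url(host, port, path, op):
    """Build a WebHDFS REST URL, e.g. .../webhdfs/v1/hello?op=LISTSTATUS."""
    return f"http://{host}:{port}/webhdfs/v1{path}?op={op}"


# Assumed address of the container's NameNode HTTP port
url = webhdfs_url("172.17.0.2", 9870, "/hello", "LISTSTATUS")
print(url)

try:
    with urlopen(url, timeout=2) as resp:
        # The NameNode answers with a JSON document describing the directory
        statuses = json.load(resp)["FileStatuses"]["FileStatus"]
        for entry in statuses:
            print(entry["pathSuffix"], entry["type"])
except OSError as err:
    # Reached when no NameNode is listening at the address above
    print("could not reach HDFS:", err)
```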
Here is an example of Java connecting to HDFS (don't forget to change the IP address):
Example:
package com.runoob;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class Application {
    public static void main(String[] args) {
        try {
            // Configure the connection address
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://172.17.0.2:9000");
            FileSystem fs = FileSystem.get(conf);
            // Open the file and print its contents
            Path hello = new Path("/hello/hello.txt");
            FSDataInputStream ins = fs.open(hello);
            int ch = ins.read();
            while (ch != -1) {
                System.out.print((char) ch);
                ch = ins.read();
            }
            System.out.println();
            ins.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}
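One way to try the example from inside the container (a sketch, assuming the source file sits at com/runoob/Application.java and the hadoop command is on the PATH) is to compile and run it against the jars Hadoop ships with:

```
# "hadoop classpath" prints the jar locations of the installed distribution
javac -cp "$(hadoop classpath)" com/runoob/Application.java
java -cp ".:$(hadoop classpath)" com.runoob.Application
```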