Getting Started with Big Data (3): HDFS Shell and the Java API

1. HDFS Shell

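Most HDFS Shell commands resemble their Unix shell counterparts. The main difference is that HDFS Shell commands operate on files stored on a remote Hadoop cluster, whereas Unix shell commands operate on local files.

The complete list of HDFS Shell commands is in the official documentation (FileSystemShell and the HDFS Commands Guide); you can also run hadoop fs -help to view it. A few commonly used commands are demonstrated below.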

Command	Description
hadoop fs	Works with a generic file system and can point to any supported one, such as the local file system, HDFS, HFTP, or S3. It has the widest scope of the three.
hadoop dfs	Specific to the HDFS distributed file system; deprecated.
hdfs dfs	Same as hadoop dfs; when hadoop dfs is used, it is internally redirected to hdfs dfs.

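mkdir usage: hadoop fs -mkdir [-p] <path> ...

Note: with -p, directories are created recursively. Parent directories that do not exist yet are created automatically, level by level.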

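put and copyFromLocal, upload files:

put usage: hadoop fs -put <localsrc> ... <dst>

Note: copies one or more source paths from the local file system to the destination file system. It can also read from standard input and write the data to the destination file system.

copyFromLocal: similar to put. According to the official FileSystemShell documentation, the only difference is that copyFromLocal restricts the source to a local file reference.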

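ls, list files:

Usage: hadoop fs -ls <args>

Note: with no path argument, -ls lists the contents of the user's HDFS home directory (/user/<username>), so it prints nothing when that directory is empty. HDFS has no notion of a current working directory, and there is no cd command.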


lsr: recursively lists the files and directories under a path. It is deprecated; use hadoop fs -ls -R <args> instead.

cat: writes the contents of the files at the given paths to stdout.

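get and copyToLocal, download files: copy files to the local file system. Usage: hadoop fs -get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>

-p: preserves access and modification times, ownership, and permissions (assuming the permissions can be propagated across file systems).

-f: overwrites the destination if it already exists.

-ignoreCrc: skips the CRC check on the downloaded files.

-crc: also writes CRC checksums for the downloaded files.

Further reading: CRC checksum files in Hadoop

copyToLocal: similar to get. The only difference is that the destination is restricted to a local file reference.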

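rm: deletes the files at the given paths. Usage: hadoop fs -rm [-f] [-r |-R] [-skipTrash] [-safely] URI [URI ...]

HDFS has a trash mechanism that is disabled by default. To enable it, set fs.trash.interval (in core-site.xml) to a value greater than zero; the unit is minutes. Once trash is enabled, HDFS moves deleted files into a trash directory (/user/<username>/.Trash/) instead of removing them immediately. To delete a file outright without going through the trash, pass the -skipTrash option.

For example, a value of 1440 (60*24 minutes) sets the retention period to one day.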
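As a sketch of the setting described above, the core-site.xml fragment below enables trash with a one-day retention period. The value 1440 is only the one-day figure used in this article; any positive number of minutes works the same way.

```xml
<configuration>
  <!-- Keep deleted files in trash for 1440 minutes (60 * 24 = one day). -->
  <!-- A value of 0, the default, leaves the trash mechanism disabled. -->
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
  </property>
</configuration>
```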

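The trash is implemented by a background thread in the NameNode called the Emptier (org.apache.hadoop.fs.TrashPolicyDefault.Emptier; the TrashPolicy class can be changed through fs.trash.classname). This thread periodically deletes files and directories in the trash that have exceeded their retention period. Users can also delete files from the trash manually. Deleting from the trash works the same way as deleting any other file or directory, except that HDFS detects that the path is inside the trash and does not move it into the trash again.

rmr: recursively deletes everything under a directory. It is deprecated; use hadoop fs -rm [-R | -r] <URI> instead.

expunge usage: hadoop fs -expunge [-immediate]

The first time hadoop fs -expunge runs, HDFS creates a new checkpoint and moves the recently deleted files in the trash under it. On a later run, checkpoints older than the retention threshold (fs.trash.interval) are permanently deleted. Note that if fs.trash.checkpoint.interval is configured in core-site.xml, checkpoints are also created automatically at that interval, without running the command by hand. (For details, see File Deletes and Undeletes.)

With the -immediate option, all trash files of the current user are deleted immediately, ignoring the fs.trash.interval setting.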

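du: displays the sizes of the files and directories under the given path. Usage: hadoop fs [generic options] -du [-s] [-h] [-v] [-x] <path> ...

Note: the first column is the total size of the files under the directory, the second column is the total space those files consume across the cluster (file size * replication factor), and the third column is the path that was queried.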
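The relationship between the first two du columns is plain multiplication. The following minimal Java sketch illustrates it without touching Hadoop; the class name, the 128 MiB file size, and the replication factor of 3 are all hypothetical values chosen for illustration.

```java
class DuColumns {
    /** Space consumed across the cluster: logical file size times replication factor. */
    static long spaceConsumed(long fileSizeBytes, int replicationFactor) {
        return fileSizeBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long size = 128L * 1024 * 1024; // hypothetical 128 MiB file (first du column)
        int replication = 3;            // hypothetical replication factor
        // Second du column: the same data counted once per replica
        System.out.println("size=" + size + " consumed=" + spaceConsumed(size, replication));
        // prints: size=134217728 consumed=402653184
    }
}
```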

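setrep: changes the replication factor of a file. Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>. If path is a directory, the command recursively changes the replication factor of every file in the directory tree rooted at path.

The -w flag makes the command wait until replication completes, which can take a very long time. The -R flag is accepted for backward compatibility and has no effect.

mv: moves files from source to destination. Usage: hadoop fs -mv URI [URI ...] <dest>. Multiple sources are allowed, in which case the destination must be a directory. Moving files across file systems is not permitted.

In the original demonstration, the entire dir2 directory was moved into dir1.

cp: copies files from source to destination. Usage: hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>. Multiple sources are allowed, in which case the destination must be a directory.

head: writes the first kilobyte of the file to stdout. Usage: hadoop fs -head URI
tail: writes the last kilobyte of the file to stdout. Usage: hadoop fs -tail [-f] URI. As in Unix, -f keeps printing data as it is appended to the file.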

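stat: prints statistics about the file or directory at the given path in the specified format. The format accepts permissions in octal (%a) and symbolic (%A) form, file size in bytes (%b), type (%F), group name of the owner (%g), name (%n), block size (%o), replication (%r), user name of the owner (%u), access date (%x, %X), and modification date (%y, %Y). %x and %y show the UTC date as "yyyy-MM-dd HH:mm:ss", while %X and %Y show milliseconds since January 1, 1970 UTC. If no format is specified, %y is used by default.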

hadoop fs -stat "type:%F perm:%a %u:%g size:%b mtime:%y atime:%x name:%n" /user/hadoop/dir1/testHDFS.txt
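stat can report the same timestamp either as a UTC date string or as milliseconds since the epoch. The sketch below shows how the two forms relate, using only the JDK; the class and method names are hypothetical, and this is not how Hadoop itself formats the value internally.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class StatTimeFormats {
    // Same pattern that stat documents for its date output, pinned to UTC
    private static final DateTimeFormatter UTC_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    /** Renders epoch milliseconds as a "yyyy-MM-dd HH:mm:ss" UTC string. */
    static String toUtcString(long epochMillis) {
        return UTC_FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(toUtcString(0L));          // 1970-01-01 00:00:00
        System.out.println(toUtcString(86_400_000L)); // 1970-01-02 00:00:00
    }
}
```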

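test usage: hadoop fs -test -[defswrz] URI

-d: returns 0 if the path is a directory.
-e: returns 0 if the path exists.
-f: returns 0 if the path is a file.
-s: returns 0 if the path is not empty.
-w: returns 0 if the path exists and write permission is granted.
-r: returns 0 if the path exists and read permission is granted.
-z: returns 0 if the file is zero length.

In bash, $? holds the exit status of the most recently executed command.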
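The same exit status can be read from a program instead of from $?. The following is a minimal sketch using the JDK's ProcessBuilder; the class and method names are hypothetical, and actually running it against HDFS assumes the hadoop command is on the PATH and a cluster is reachable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

class ExitStatusCheck {
    /** Runs a command to completion and returns its exit status. */
    static int exitStatus(List<String> command) {
        try {
            Process process = new ProcessBuilder(command).inheritIO().start();
            return process.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException("could not start " + command, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + command, e);
        }
    }

    /** Treats exit status 0 as "the tested condition holds", as hadoop fs -test does. */
    static boolean conditionHolds(List<String> command) {
        return exitStatus(command) == 0;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java ExitStatusCheck hadoop fs -test -e <path>");
            return;
        }
        // The command line is passed through unchanged, one argument per element
        boolean ok = conditionHolds(Arrays.asList(args));
        System.out.println(ok ? "condition holds" : "condition does not hold");
    }
}
```

For instance, running it with the arguments hadoop fs -test -e /user/hadoop/dir1/testHDFS.txt reports whether that path exists.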

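text: takes a source file and outputs it in text format. Usage: hadoop fs -text <src>. The allowed formats are zip and TextRecordInputStream.

touchz: creates a zero-length file. Usage: hadoop fs -touchz URI [URI ...]

Safe mode commands: the HDFS NameNode automatically enters safe mode on startup. Safe mode is a NameNode state in which the file system does not accept any modifications. Its purpose is to verify the validity of the data blocks on each DataNode at startup and to replicate or delete blocks as the policy requires. Once enough blocks meet the minimum replication requirement, the NameNode leaves safe mode automatically. Note that while HDFS is in safe mode, Hive and HBase may fail to start. Further reading: Understanding HDFS safe mode. The safe mode commands are as follows: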

# Enter safe mode
hdfs dfsadmin -safemode enter

# Leave safe mode
hdfs dfsadmin -safemode leave

# Check whether the cluster is in safe mode
hdfs dfsadmin -safemode get

2. HDFS Java API

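2.1 Connecting IDEA (Windows) to a Hadoop cluster in a virtual machine

Connecting from Windows to a Hadoop cluster running in a virtual machine requires some preparation:

First, take the hadoop-3.0.0.tar.gz downloaded in part one of this series (Getting Started with Big Data (1): Hadoop Pseudo-Distributed Installation) and extract it to a directory on Windows.
Then download hadoop.dll and winutils.exe from a winutils mirror and copy them into the bin directory of the extracted hadoop-3.0.0 directory.
Finally, set the environment variable HADOOP_HOME to the extracted directory, and add %HADOOP_HOME%\bin to the system Path variable.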
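If changing system environment variables is inconvenient, one alternative is to set the hadoop.home.dir Java system property before any Hadoop class is used. This relies on the assumption that the Hadoop client consults that property before falling back to HADOOP_HOME. The sketch below is a hypothetical helper, and the fallback directory is a placeholder, not a path from this article.

```java
class HadoopHomeSetup {
    /**
     * Returns the Hadoop home that is already configured, or sets the
     * hadoop.home.dir system property to fallbackDir when neither the
     * property nor the HADOOP_HOME environment variable is defined.
     */
    static String ensureHadoopHome(String fallbackDir) {
        String fromProperty = System.getProperty("hadoop.home.dir");
        if (fromProperty != null) {
            return fromProperty;
        }
        String fromEnv = System.getenv("HADOOP_HOME");
        if (fromEnv != null) {
            return fromEnv;
        }
        System.setProperty("hadoop.home.dir", fallbackDir);
        return fallbackDir;
    }

    public static void main(String[] args) {
        // Hypothetical location of the extracted distribution
        System.out.println(ensureHadoopHome("C:\\hadoop-3.0.0"));
    }
}
```

Calling such a helper at the very start of main, or in a static initializer of the test class, keeps the setting local to the project.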

2.2 Testing with an IDEA project

Create a new Spring Boot project with the following pom file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.5.0</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.example</groupId>
    <artifactId>hadoopdemo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>hadoopdemo</name>
    <description>Demo project for Hadoop</description>
    <properties>
        <java.version>1.8</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
        </dependency>

        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>

        <!-- hadoop-client dependency -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.1</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>log4j-over-slf4j</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-api</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- hadoop-common dependency -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.1.1</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>log4j-over-slf4j</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-api</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- hadoop-hdfs dependency -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>3.1.1</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <excludes>
                        <exclude>
                            <groupId>org.projectlombok</groupId>
                            <artifactId>lombok</artifactId>
                        </exclude>
                    </excludes>
                </configuration>
            </plugin>
        </plugins>
    </build>

</project>

Next comes the HDFS configuration class:

package com.example.hadoopdemo.config;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.springframework.context.annotation.Bean;

import java.net.URI;

@org.springframework.context.annotation.Configuration
public class HDFSConfig {
    // Address of the HDFS NameNode
    private static final String HDFS_URI = "hdfs://hadoop0:9000/";

    // Exposes a shared FileSystem client that connects as the "root" user
    @Bean
    public FileSystem getFileSystem() throws Exception {
        Configuration configuration = new Configuration();
        return FileSystem.get(new URI(HDFS_URI), configuration, "root");
    }
}

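With this in place, we can start using HDFS. Most of Hadoop's file-operation classes live in the org.apache.hadoop.fs package; the API supports opening, reading, writing, and deleting files, among other operations. Sample code follows: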

package com.example.hadoopdemo.utils;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.io.IOUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

@Component
public class HDFSUtils {
    @Autowired
    private FileSystem hdfs;

    // Check whether a file or directory exists
    public boolean isExists(String pathString) throws IOException {
        Path path = new Path(pathString);
        return hdfs.exists(path);
    }

    // Upload a local file to HDFS
    public void upload(String src, String dst) throws IOException {
        Path srcPath = new Path(src);
        Path dstPath = new Path(dst);
        hdfs.copyFromLocalFile(srcPath, dstPath);
    }

    // Download a file from HDFS to the local file system
    public void download(String src, String dst) throws IOException {
        Path srcPath = new Path(src);
        Path dstPath = new Path(dst);
        hdfs.copyToLocalFile(srcPath, dstPath);
    }

    // Create a file and write data to it
    public void createFile(String dst, String content) throws IOException {
        Path dstPath = new Path(dst);
        // try-with-resources closes the stream even if the write fails
        try (FSDataOutputStream outputStream = hdfs.create(dstPath)) {
            outputStream.write(content.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Create a directory
    public boolean createDir(String dir) throws IOException {
        return hdfs.mkdirs(new Path(dir));
    }

    // Rename a file
    public boolean renameFile(String src, String dst) throws IOException {
        return hdfs.rename(new Path(src), new Path(dst));
    }

    // Delete a file or directory; the recursive flag controls whether deletion is recursive
    public boolean delete(String src, boolean recursive) throws IOException {
        return hdfs.delete(new Path(src), recursive);
    }

    // Read a file and copy it to outputStream (the caller's stream is left open)
    public void readFile(String src, OutputStream outputStream) throws IOException {
        try (FSDataInputStream inputStream = hdfs.open(new Path(src))) {
            IOUtils.copyBytes(inputStream, outputStream, 1024, false);
        }
    }

    // List information about the files in the given directory
    public void list(String src) throws IOException {
        FileStatus[] fileStatusList = hdfs.listStatus(new Path(src));
        for (FileStatus fileStatus : fileStatusList) {
            // File path
            System.out.println(fileStatus.getPath());
            // Modification time of the file
            System.out.println(fileStatus.getModificationTime());
        }
    }

    // Print DataNode information; the cast only works when the FileSystem is HDFS
    public void getNodeMessage() throws IOException {
        DistributedFileSystem distributedFileSystem = (DistributedFileSystem) hdfs;
        DatanodeInfo[] datanodeInfos = distributedFileSystem.getDataNodeStats();
        for (int i = 0; i < datanodeInfos.length; i++) {
            System.out.println("DataNode_" + i + "_Name: " + datanodeInfos[i].getHostName());
        }
    }
}

The test code is as follows:

package com.example.hadoopdemo;

import com.example.hadoopdemo.utils.HDFSUtils;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
public class HDFSTest {
    @Autowired
    private HDFSUtils hdfsUtils;

    @Test
    public void test() throws Exception {
        System.out.println("/hdfs directory exists: " + hdfsUtils.isExists("/hdfs"));
        System.out.println("Created /hdfs directory: " + hdfsUtils.createDir("/hdfs"));
        hdfsUtils.upload("C:\\Users\\Minghui.Ni\\Desktop\\test.txt", "/hdfs");
        System.out.println("test.txt uploaded: " + hdfsUtils.isExists("/hdfs/test.txt"));
        System.out.println("Renamed test.txt to hdfsTest.txt: " + hdfsUtils.renameFile("/hdfs/test.txt", "/hdfs/hdfsTest.txt"));
        hdfsUtils.readFile("/hdfs/hdfsTest.txt", System.out);
        System.out.println("\n");
        hdfsUtils.createFile("/hdfs/test2.txt", "Hello HDFS\n");
        hdfsUtils.readFile("/hdfs/test2.txt", System.out);
        hdfsUtils.list("/hdfs");
        hdfsUtils.getNodeMessage();
        hdfsUtils.delete("/hdfs", true);
        System.out.println("/hdfs directory deleted: " + !hdfsUtils.isExists("/hdfs"));
    }
}

[Screenshot: output of the test run]

3. Other

Spring for Apache Hadoop - Reference Documentation