Hadoop from Beginner to Expert Series: 5. The HDFS API

I. Client Environment

1.1 Configuring environment variables

1.2 Eclipse/IDEA setup

II. HDFS API Operations

2.1 Creating the HDFS client object and testing directory creation

I. Client Environment

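An earlier post covered the HDFS shell. To recap: both bin/hadoop fs -<command> and bin/hdfs dfs -<command> work, and the commands closely mirror their Linux counterparts. In real environments, most HDFS work is done this way. There is another way, though: client mode, meaning you operate HDFS from code. The idea is simple: obtain a client object, then call its ready-made methods to operate on the HDFS cluster. If that is not clear yet, don't worry, just read on.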

1.1 Configuring environment variables

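Configure the environment variables on your local machine, as shown in the figure below:

This is much like configuring the JDK: first define a HOME variable (HADOOP_HOME), then add Hadoop (its bin directory) to Path. Afterwards, open a cmd window and run hadoop version to check. If the output looks roughly like the figure below, Hadoop is configured successfully.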
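The same check can be done from Java before running any client code. The sketch below is only an illustration under two assumptions not shown in the original screenshots: the variable is named HADOOP_HOME, and the Path entry is its bin subdirectory. The class and method names are made up for this example.

```java
import java.io.File;
import java.util.regex.Pattern;

class HadoopEnvCheck {

    // Normalizes a path for comparison: forward slashes, no trailing slash.
    public static String normalize(String p) {
        String s = p.trim().replace('\\', '/');
        while (s.length() > 1 && s.endsWith("/")) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }

    // Returns true when hadoopHome is set and one PATH entry equals <hadoopHome>/bin.
    // Entries are compared case-insensitively, which suits Windows paths.
    public static boolean isConfigured(String hadoopHome, String path, String separator) {
        if (hadoopHome == null || hadoopHome.trim().isEmpty() || path == null) {
            return false;
        }
        String expected = normalize(normalize(hadoopHome) + "/bin");
        for (String entry : path.split(Pattern.quote(separator))) {
            if (!entry.trim().isEmpty() && normalize(entry).equalsIgnoreCase(expected)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Reads the real environment of the current process.
        boolean ok = isConfigured(System.getenv("HADOOP_HOME"),
                                  System.getenv("PATH"),
                                  File.pathSeparator);
        System.out.println(ok
                ? "Hadoop environment looks configured"
                : "HADOOP_HOME or its bin entry in PATH is missing");
    }
}
```

If this prints the "missing" message right after you edited the variables, restart the IDE or terminal first: a running process keeps the environment it was started with.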

1.2 Eclipse/IDEA setup

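Eclipse and IDEA are broadly similar, but many people find IDEA nicer to use, mostly people who code for a living. IDEA's code completion and debugger really are better than Eclipse's. On the other hand, IDEA uses more resources than Eclipse, so I don't recommend it on a machine with less than 8 GB of memory. IDEA also has to be cracked to use for free. It's not that I can't do it (I actually enjoy cracking software), but cracked software does tend to be less stable. Eclipse is enough for everyday study, though IDEA is of course better.

Create a new Maven project in Eclipse and add the following dependencies to the project's pom.xml. Maven itself should need no introduction.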

<dependencies>
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>RELEASE</version>
		</dependency>
		<dependency>
			<groupId>org.apache.logging.log4j</groupId>
			<artifactId>log4j-core</artifactId>
			<version>2.8.2</version>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-common</artifactId>
			<version>2.7.2</version>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-client</artifactId>
			<version>2.7.2</version>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-hdfs</artifactId>
			<version>2.7.2</version>
		</dependency>
		<dependency>
			<groupId>jdk.tools</groupId>
			<artifactId>jdk.tools</artifactId>
			<version>1.8</version>
			<scope>system</scope>
			<systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
		</dependency>
</dependencies>

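Most of the entries above should look familiar: JUnit for testing, plus the Hadoop artifacts. One tip: if the import reports an error (typically a missing jdk.tools artifact), the last dependency above fixes it by pointing Maven at tools.jar from your JDK 1.8 installation. Keep in mind as well that older Maven setups default to Java 1.5.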

One more issue: if the Eclipse/IDEA console shows the warnings below, create a file named log4j.properties under src/main/resources with the content listed after them:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

With this in place, Eclipse/IDEA prints logs correctly.

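II. HDFS API Operations

Before starting, let's sort out the approach. What operations does HDFS need? Mostly listing, uploading, downloading, and the like. How are they carried out? First, you need a client object. Second, there are ready-made methods. You then call object.method() to operate on the cluster. That is the whole flow, and with it in mind the code is straightforward.

2.1 Creating the HDFS client object and testing directory creation

HDFS is a file system, so once you have a file system object you can operate on the cluster. In the code, the object comes from FileSystem.get(), which takes three arguments: 1. the cluster address, new URI("hdfs://hadoop102:9000"); 2. a Configuration; 3. the user name. Fill these in for your own cluster. With these arguments you get the file system object fs, and calling fs.mkdirs() creates a directory on the cluster. Finally, close the object.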
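To see what the first argument actually carries, the small sketch below parses the same address with the JDK's own java.net.URI, with no Hadoop classes involved. The class and method names are invented for this example. The scheme (hdfs) is what selects the HDFS implementation, and the host and port are the NameNode address of the author's cluster, which will differ on yours.

```java
import java.net.URI;
import java.net.URISyntaxException;

class ClusterAddress {

    // Splits a cluster address into the parts FileSystem.get() relies on.
    public static String describe(String address) throws URISyntaxException {
        URI uri = new URI(address);
        return "scheme=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", port=" + uri.getPort();
    }

    public static void main(String[] args) throws URISyntaxException {
        // prints "scheme=hdfs, host=hadoop102, port=9000"
        System.out.println(describe("hdfs://hadoop102:9000"));
    }
}
```

The host name hadoop102 has to be resolvable from the machine running the client, for example through an entry in the local hosts file; otherwise use the NameNode's IP address in the URI.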

Summary: operating on an HDFS cluster takes just three steps: 1. get the object (FileSystem.get), 2. use the object, 3. release resources.

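import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class HdfsClient {

	@Test
	public void testMkdirs() throws IOException, InterruptedException, URISyntaxException {
		// 1. Get the file system (cluster address, configuration, user name)
		Configuration configuration = new Configuration();
		FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "wanglei");
		// 2. Create the directory (missing parent directories are created too)
		fs.mkdirs(new Path("/1108/daxian/banzhang"));
		// 3. Release resources
		fs.close();
	}
}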
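The tests in this post call fs.close() as the last statement, so an exception thrown in step 2 would skip step 3. One common way to guarantee the close is try-with-resources. The sketch below shows the pattern with a hypothetical stand-in class (FakeClient is not a Hadoop class) so that it runs without a cluster.

```java
import java.io.Closeable;
import java.io.IOException;

class ThreeStepPattern {

    // Hypothetical stand-in for a file system client: records whether close() ran.
    static class FakeClient implements Closeable {
        boolean closed = false;

        boolean mkdirs(String path) throws IOException {
            if (path == null || path.isEmpty()) {
                throw new IOException("empty path");
            }
            return true;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Step 2 runs inside try-with-resources, so step 3 (close) happens
    // whether mkdirs() returns normally or throws.
    public static boolean makeDir(FakeClient client, String path) {
        try (FakeClient c = client) {
            return c.mkdirs(path);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        FakeClient ok = new FakeClient();
        System.out.println(makeDir(ok, "/1108/daxian/banzhang") + " closed=" + ok.closed);

        FakeClient failing = new FakeClient();
        System.out.println(makeDir(failing, "") + " closed=" + failing.closed);
    }
}
```

Hadoop's FileSystem implements java.io.Closeable, so the same shape should carry over to the real client: declare fs in the try header and drop the explicit fs.close() call.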

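2.2 Testing file upload

Same three steps as above; only the method for the actual operation changes. Uploading uses copyFromLocalFile, which mirrors the corresponding shell command (-copyFromLocal).

@Test
public void testCopyFromLocalFile() throws IOException, InterruptedException, URISyntaxException {

	// 1. Get the file system
	Configuration configuration = new Configuration();
	// Client-side override: files created by this client get 2 replicas
	configuration.set("dfs.replication", "2");
	FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "wanglei");
	// 2. Upload the file (local source path, HDFS destination path)
	fs.copyFromLocalFile(new Path("e:/banzhang.txt"), new Path("/banzhang.txt"));
	// 3. Release resources
	fs.close();
}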

2.3 Testing file download

@Test
public void testCopyToLocalFile() throws IOException, InterruptedException, URISyntaxException {

	// 1. Get the file system
	Configuration configuration = new Configuration();
	FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "wanglei");
	// 2. Download the file
	// boolean delSrc: whether to delete the source file on HDFS afterwards
	// Path src: HDFS path of the file to download
	// Path dst: local path to download the file to
	// boolean useRawLocalFileSystem: true writes through RawLocalFileSystem,
	//   so no local .crc checksum file is created
	fs.copyToLocalFile(false, new Path("/banzhang.txt"), new Path("e:/banhua.txt"), true);
	// 3. Release resources
	fs.close();
}

2.4 Testing directory deletion

@Test
public void testDelete() throws IOException, InterruptedException, URISyntaxException {

	// 1. Get the file system
	Configuration configuration = new Configuration();
	FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "wanglei");
	// 2. Delete the directory (true = delete recursively)
	fs.delete(new Path("/0508/"), true);
	// 3. Release resources
	fs.close();
}

2.5 Testing file details

Viewing file details needs a little explanation. The method is listFiles, which returns an iterator (a RemoteIterator), so you loop over it to read each file's length, name, permissions, and so on. To see where the blocks are stored, call getBlockLocations to get an array and loop over that as well. It is slightly more involved than the earlier examples.

@Test
public void testListFiles() throws IOException, InterruptedException, URISyntaxException {
	// Also import org.apache.hadoop.fs.BlockLocation, LocatedFileStatus and RemoteIterator

	// 1. Get the file system
	Configuration configuration = new Configuration();
	FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "wanglei");
	// 2. List file details, starting at the root (true = recurse into subdirectories)
	RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);
	while (listFiles.hasNext()) {
		LocatedFileStatus status = listFiles.next();
		// Print the details
		// File name
		System.out.println(status.getPath().getName());
		// Length in bytes
		System.out.println(status.getLen());
		// Permissions
		System.out.println(status.getPermission());
		// Group
		System.out.println(status.getGroup());

		// Get the block storage information
		BlockLocation[] blockLocations = status.getBlockLocations();

		for (BlockLocation blockLocation : blockLocations) {
			// Hosts (DataNodes) that store this block
			String[] hosts = blockLocation.getHosts();

			for (String host : hosts) {
				System.out.println(host);
			}
		}

		System.out.println("---------------------");
	}

	// 3. Release resources
	fs.close();
}
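The array returned by getBlockLocations has one entry per block of the file, so its length can be estimated from the file length. The sketch below does that arithmetic under one assumption that is not stated in this post: a block size of 128 MB, which is the Hadoop 2.x default for dfs.blocksize and may be configured differently on your cluster. The class and method names are made up for this example.

```java
class BlockCount {

    // Assumed block size: 128 MB, the Hadoop 2.x default for dfs.blocksize.
    public static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024;

    // Number of blocks needed to hold fileLength bytes (ceiling division).
    public static long expectedBlocks(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and blockSize > 0");
        }
        return (fileLength + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long threeHundredMb = 300L * 1024 * 1024;
        // 300 MB spans two full 128 MB blocks plus a 44 MB remainder: prints 3
        System.out.println(expectedBlocks(threeHundredMb, DEFAULT_BLOCK_SIZE));
    }
}
```

Comparing this number with blockLocations.length for a large file is a quick sanity check on the output of the test above; each entry then lists as many hosts as the block has replicas.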

2.6 Telling files from directories

@Test
public void testListStatus() throws IOException, InterruptedException, URISyntaxException {
	// Also import org.apache.hadoop.fs.FileStatus

	// 1. Get the file system
	Configuration configuration = new Configuration();
	FileSystem fs = FileSystem.get(new URI("hdfs://hadoop102:9000"), configuration, "wanglei");

	// 2. Check whether each entry under the root is a file or a directory
	FileStatus[] listStatus = fs.listStatus(new Path("/"));

	for (FileStatus fileStatus : listStatus) {
		if (fileStatus.isFile()) {
			// It is a file
			System.out.println("f:" + fileStatus.getPath().getName());
		} else {
			// It is a directory
			System.out.println("d:" + fileStatus.getPath().getName());
		}
	}

	// 3. Release resources
	fs.close();
}

I'm short on time today, so I'll stop here.