HDFS—Java API operation

Original author: jiangw-Tony

Original address: Basic use of HDFS

In production, HDFS applications are mainly developed on the client side. The core steps are to construct an HDFS client object from the API provided by HDFS, and then use that client object to manipulate (create, delete, modify, and read) files on HDFS.

One, Environment construction

1. Create a Maven project HdfsClientDemo

2. Add the following to the project's pom.xml file: import the corresponding dependency coordinates and add logging (a minimal log4j.properties sketch follows the dependency block below)

<dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.2</version>
        </dependency>
</dependencies>
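
For the logging part, a minimal log4j.properties placed under src/main/resources might look like the following sketch (the log level and output pattern here are assumptions, not taken from the original article):

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %p [%c] - %m%n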

3. Create the cn.itcast.hdfs package under src/main/java, and create a new Java file in that package.

At this point, all the jar packages needed to write business code against the HDFS API are on the classpath, and we can start writing code.

Two, Obtaining a FileSystem instance (important)

To operate HDFS in Java, you must first obtain a client instance:

  • Configuration conf = new Configuration()
  • FileSystem fs = FileSystem.get(conf)

Since our operation target is HDFS, the fs object obtained should be an instance of DistributedFileSystem. How does the get method decide which client class to instantiate? It judges from the value of the fs.defaultFS parameter in conf. If fs.defaultFS is not specified in our code and no corresponding configuration is provided on the project classpath, the default value in conf comes from core-default.xml inside the Hadoop jar, which is file:///. In that case the object obtained will not be a DistributedFileSystem instance, but a client object for the local file system.
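
As a minimal sketch, the client can be obtained like this (assuming a NameNode at hdfs://hadoop01:9000, the same address used in the streaming examples below):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClientInit {
	public static FileSystem getClient() throws Exception {
		Configuration conf = new Configuration();
		// Point the client at the HDFS NameNode; without this, the default file:/// is used
		conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
		// Alternatively, pass the URI and the user name explicitly:
		// return FileSystem.get(new URI("hdfs://hadoop01:9000"), conf, "root");
		return FileSystem.get(conf); // with an hdfs:// fs.defaultFS, this is a DistributedFileSystem
	}
}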

The DistributedFileSystem instance provides methods for all of the common file operations demonstrated in the next section.

Three, Commonly used HDFS Java API code demonstrations

1. Create a folder
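
The snippets in this section are minimal sketches: they assume a JUnit test class with a FileSystem field fs initialized in an @Before method, exactly as in the StreamAccess class shown in section Four, and the paths used are examples only. Creating a folder, for instance:

@Test
public void testMkdir() throws Exception {
	// Create a directory on HDFS (parent directories are created as needed)
	boolean created = fs.mkdirs(new Path("/aaa/bbb"));
	System.out.println(created);
}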

2. Upload files
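
Uploading a local file with copyFromLocalFile, as a sketch:

@Test
public void testUpload() throws Exception {
	// Copy a local file to HDFS
	fs.copyFromLocalFile(new Path("c:/111.txt"), new Path("/111.txt"));
}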

3. Download the file
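
Downloading an HDFS file with copyToLocalFile, as a sketch (the stream-based download in section Four is an alternative):

@Test
public void testDownload() throws Exception {
	// Copy an HDFS file to the local file system
	fs.copyToLocalFile(new Path("/111.txt"), new Path("c:/222.txt"));
}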

4. Delete files or folders
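
Deleting a file or folder, as a sketch; the second argument enables recursive deletion:

@Test
public void testDelete() throws Exception {
	// true = delete recursively, required for non-empty directories
	boolean deleted = fs.delete(new Path("/aaa/bbb"), true);
	System.out.println(deleted);
}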

5. Rename the file or folder
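
Renaming (or moving) a file or folder, as a sketch:

@Test
public void testRename() throws Exception {
	// Rename or move within HDFS
	boolean renamed = fs.rename(new Path("/111.txt"), new Path("/222.txt"));
	System.out.println(renamed);
}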

6. View directory information, displaying only the files under the folder
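
Listing only the files under a directory with listFiles, as a sketch (RemoteIterator and LocatedFileStatus come from org.apache.hadoop.fs):

@Test
public void testListFiles() throws Exception {
	// listFiles returns files only (recursively when the flag is true), not directories
	RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
	while (it.hasNext()) {
		LocatedFileStatus file = it.next();
		System.out.println(file.getPath() + "  " + file.getLen());
	}
}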

7. View file and folder information
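
Listing files and folders together with listStatus, as a sketch:

@Test
public void testListStatus() throws Exception {
	// listStatus returns both files and directories (non-recursive)
	FileStatus[] statuses = fs.listStatus(new Path("/"));
	for (FileStatus status : statuses) {
		System.out.println((status.isDirectory() ? "d  " : "f  ") + status.getPath().getName());
	}
}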

Four, HDFS streaming data access

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.junit.Before;
import org.junit.Test;
 
/**
 * Lower-level stream operations, as opposed to the wrapped convenience methods above.
 * When upper-layer computing frameworks such as MapReduce or Spark fetch data from HDFS,
 * they call this kind of low-level API.
 */
public class StreamAccess {
	FileSystem fs = null;
	@Before
	public void init() throws Exception {
		Configuration conf = new Configuration();
		System.setProperty("HADOOP_USER_NAME", "root");
		conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
		fs = FileSystem.get(conf);
		// fs = FileSystem.get(new URI("hdfs://hadoop01:9000"), conf, "hadoop");
	}
	@Test
	public void testDownLoadFileToLocal() throws IllegalArgumentException, IOException {
		// First obtain an input stream for the file on HDFS
		FSDataInputStream in = fs.open(new Path("/jdk-7u65-linux-i586.tar.gz"));
		// Then construct an output stream for the local file
		FileOutputStream out = new FileOutputStream(new File("c:/jdk.tar.gz"));
		// Copy the data from the input stream to the output stream (true closes both streams)
		IOUtils.copyBytes(in, out, 4096, true);
	}
	@Test
	public void testUploadByStream() throws Exception {
		// Output stream to the HDFS file
		FSDataOutputStream fsout = fs.create(new Path("/aaa.txt"));
		// Input stream from the local file
		FileInputStream fsin = new FileInputStream("c:/111.txt");
		IOUtils.copyBytes(fsin, fsout, 4096, true);
	}
	/**
	 * HDFS supports random-access reads: the stream can be positioned at an arbitrary
	 * offset and a specified number of bytes can be read, which is how upper-layer
	 * distributed computing frameworks process data concurrently.
	 */
	@Test
	public void testRandomAccess() throws IllegalArgumentException, IOException {
		// First obtain an input stream for the file on HDFS
		FSDataInputStream in = fs.open(new Path("/iloveyou.txt"));
		// The starting offset of the stream can be set explicitly
		in.seek(22);
		// Then construct an output stream for the local file
		FileOutputStream out = new FileOutputStream(new File("d:/iloveyou.line.2.txt"));
		// Copy 19 bytes and close both streams
		IOUtils.copyBytes(in, out, 19L, true);
	}
}

Five, A classic case

MapReduce, Spark, and other computing frameworks share a core idea: move the computation to the data, that is, make the computation as local as possible when processing data concurrently. This requires obtaining the location information of a file's blocks and then reading only the corresponding byte range. The following code simulates this: it gets all the block locations of a file and then reads the content of a specified block.

// This test belongs to the StreamAccess class above (it uses the fs field
// and the FileStatus/BlockLocation imports added there)
@Test
public void testCat() throws IllegalArgumentException, IOException {
	FSDataInputStream in = fs.open(new Path("/weblog/input/access.log.10"));
	// Get the file's status information
	FileStatus[] listStatus = fs.listStatus(new Path("/weblog/input/access.log.10"));
	// Get the locations of all blocks of this file
	BlockLocation[] fileBlockLocations = fs.getFileBlockLocations(listStatus[0], 0L, listStatus[0].getLen());
	// Length of the first block
	long length = fileBlockLocations[0].getLength();
	// Starting offset of the first block
	long offset = fileBlockLocations[0].getOffset();
	System.out.println(length);
	System.out.println(offset);
	// Write the first block to an output stream
	// IOUtils.copyBytes(in, System.out, (int) length);
	byte[] b = new byte[4096];
	FileOutputStream os = new FileOutputStream(new File("d:/block0"));
	// Seek to the start of the block, then read at most `length` bytes
	in.seek(offset);
	long remaining = length;
	int read;
	while (remaining > 0 && (read = in.read(b, 0, (int) Math.min(b.length, remaining))) != -1) {
		os.write(b, 0, read); // only write the bytes actually read
		remaining -= read;
	}
	os.flush();
	os.close();
	in.close();
}

 


Origin: blog.csdn.net/sanmi8276/article/details/113064616