Basic operations of HDFS (command line and Java code)

The initial setup of Hadoop in distributed mode is complete, and it appears to work from both the command line and the web interface, so it is time for the next step, which can be seen either as further verification or as HDFS-related learning.
HDFS is a distributed file storage system on which files can be created, deleted, modified, and read. It ships with a basic command line, and clients exist for various languages.
This part mainly records and practices the basic operations, and also serves as further verification that the environment installed earlier actually works.

Environment description

The following content is based on Hadoop 3.1.3.

Command line operations

Create a directory

When a file system meets reality, it comes down to directories and files, so the first operation is creating a directory:

hdfs dfs -mkdir /test1

Many online tutorials write hadoop here instead of hdfs, which is probably the older form of the command. The hadoop form still works, but it prints a prompt suggesting that hdfs be used instead.
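For example, both of the following create the directory, but the first is the older form and prints a notice recommending the hdfs command (the exact wording of the notice depends on the version):

hadoop dfs -mkdir /test1
hdfs dfs -mkdir /test1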

List files and directories

The directory has been created above. To further verify that it was created successfully, you can list directories and files with the following command:

hdfs dfs -ls /

The above lists the directories and files in the root directory of the HDFS file system. It is worth stressing that this is the HDFS file system, not the local file system of the machine where the command is executed.
In my case, I had already created several directories, so executing the command produces the following output:

Found 4 items
drwxr-xr-x   - root supergroup          0 2020-08-07 22:23 /demo1
drwxr-xr-x   - root supergroup          0 2020-08-07 01:27 /foodir
drwxr-xr-x   - root supergroup          0 2020-08-10 18:56 /hbase
drwxr-xr-x   - root supergroup          0 2020-08-10 19:15 /test1

File creation in Linux

Once the directory exists, you can upload files to HDFS. As a small digression, here is the simple operation of creating files in Linux. First, echo: as mentioned before, echo can create files in the following ways:

echo "test" >test2.txt
echo "test1" >> test3.txt

In both operations above, if the file does not exist it is created and the content is written. If the file already exists, the second form (>>) appends to the file, while the first form (>) overwrites its original content.
However, creating a file with echo like this requires some content to write. Running just echo test2.txt does not create a file; it only prints the text to the console. Creating a blank file requires touch test2.txt instead.
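In other words, the first command below only prints test2.txt to the console, while the second actually creates an empty file:

echo test2.txt
touch test2.txt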

File upload to HDFS

Once you have a file, you can upload it from the local file system to HDFS:

hdfs dfs -put test3.txt /test1/

The above command uploads the test3.txt file from the directory where the command is executed to the /test1 directory on HDFS.
Note that if the upload command is executed again, the upload no longer succeeds and an error is reported:

2020-08-10 19:35:54,862 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
put: `/test1/test3.txt': File exists

If the file already exists, uploading it again is not allowed. This touches on the design of HDFS file storage.
HDFS is a distributed file system. Under the hood, a large file is parsed into bytes and split into many smaller files (blocks).
Except for the last one, these small files all contain the same number of bytes, and each has an offset, which makes later computation over large amounts of data more efficient.
Because of this design, modifying a file could affect the offset of every small file and require the bytes to be reorganized and re-split, which raises many problems.
Therefore, HDFS file modification really only supports appending content at the end of a file; existing content cannot be modified in place. For the same reason, a file that already exists cannot simply be uploaded again, because underneath it is not a simple overwrite, as one might imagine.
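If you want to see this block structure for a concrete file, the fsck tool can report how a file is split into blocks; a rough check looks like the following, though the exact output depends on the cluster and the file size:

hdfs fsck /test1/test3.txt -files -blocks -locations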

Append file content

As mentioned above, HDFS file modification really only supports appending at the end, because appending at the end only involves handling the last small file rather than touching all of them. The append operation is as follows:

hdfs dfs -appendToFile test3.txt /test1/test3.txt

View the content of a file in HDFS

After the file has been uploaded, you can use the listing command above to confirm that the upload succeeded, but that only shows general information about the file. To confirm that the file content was actually written, use the operation that views file content:

hdfs dfs -cat /test1/test3.txt

When I run it here, the following is output, and the last line is the content of the file:

2020-08-10 19:31:00,019 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-08-10 19:31:02,046 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
test1

Delete files

In addition to the create, upload, and append operations above, a file system naturally also needs delete, which works as follows:

hdfs dfs -rm /test1/test2.txt
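The command above removes a single file. Deleting a directory and everything under it needs the recursive flag, for example:

hdfs dfs -rm -r /test1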

The above uses the command line that ships with Hadoop to perform some basic HDFS operations. To repeat, the prerequisite for running these commands from any directory is that the Hadoop-related environment variables have been configured.
As a Java programmer, limiting Hadoop learning to the command line would obviously not be enough, so I also tried the Java client to perform some simple operations similar to the command-line ones above.
The following content largely follows examples found online; the operations are so simple that there was not much to change and there is not much explanation, so it serves only as a basic verification and record.

Java operations

Dependency imports

First comes importing the dependencies. Although my project is a Spring Boot project, there does not seem to be an official Spring Boot starter for Hadoop, so I have to introduce the Hadoop jars directly. Some conflicting transitive dependencies are excluded, and the final Maven configuration is as follows:

<dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-hdfs</artifactId>
   <version>3.1.3</version>
   <exclusions>
      <exclusion> <groupId>org.slf4j</groupId> <artifactId>slf4j-log4j12</artifactId></exclusion>
      <exclusion> <groupId>log4j</groupId> <artifactId>log4j</artifactId> </exclusion>
      <exclusion> <groupId>javax.servlet</groupId> <artifactId>servlet-api</artifactId> </exclusion>
   </exclusions>
</dependency>
<dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-common</artifactId>
   <version>3.1.3</version>
   <exclusions>
      <exclusion> <groupId>org.slf4j</groupId> <artifactId>slf4j-log4j12</artifactId></exclusion>
      <exclusion> <groupId>log4j</groupId> <artifactId>log4j</artifactId> </exclusion>
      <exclusion> <groupId>javax.servlet</groupId> <artifactId>servlet-api</artifactId> </exclusion>
   </exclusions>
</dependency>
<dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-client</artifactId>
   <version>3.1.3</version>
   <exclusions>
      <exclusion> <groupId>org.slf4j</groupId> <artifactId>slf4j-log4j12</artifactId></exclusion>
      <exclusion> <groupId>log4j</groupId> <artifactId>log4j</artifactId> </exclusion>
      <exclusion> <groupId>javax.servlet</groupId> <artifactId>servlet-api</artifactId> </exclusion>
   </exclusions>
</dependency>

Obtain the HDFS file system object

To operate HDFS from Java, you first need to obtain the file system object, specifying the URL, the user name, and so on. There are many more connection settings, which can be added when a real project needs them. The basic, minimally working code is as follows:

private static String hdfsPath = "hdfs://192.168.139.9:9000";
/**
 * Obtain the HDFS file system object
 *
 * @return
 * @throws Exception
 */
private static FileSystem getFileSystem() throws Exception
{
    FileSystem fileSystem = FileSystem.get(new URI(hdfsPath), getConfiguration(), "root");
    return fileSystem;
}
/**
 * Get the HDFS configuration
 *
 * @return
 */
private static Configuration getConfiguration() {
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", hdfsPath);
    return configuration;
}

Basic HDFS create, read, update, and delete operations in Java

After connecting to hdfs and obtaining the file system object, you can perform related operations, as shown below:

/**
 * Create a directory on HDFS
 *
 * @param path
 * @return
 * @throws Exception
 */
public static void mkdir(String path) throws Exception
{
    FileSystem fs = getFileSystem();
    // target path
    Path srcPath = new Path(path);
    boolean isOk = fs.mkdirs(srcPath);
    fs.close();
}
/**
 * Read HDFS directory information
 *
 * @param path
 * @return
 * @throws Exception
 */
public static void readPathInfo(String path)
    throws Exception
{
    FileSystem fs = getFileSystem();
    // target path
    Path newPath = new Path(path);
    FileStatus[] statusList = fs.listStatus(newPath);
    List<Map<String, Object>> list = new ArrayList<>();
    if (null != statusList && statusList.length > 0) {
        for (FileStatus fileStatus : statusList) {
            System.out.print("filePath:"+fileStatus.getPath());
            System.out.println(",fileStatus:"+ fileStatus.toString());
        }
    }
}
/**
 * Create a file on HDFS
 *
 * @throws Exception
 */
public static void createFile()
    throws Exception
{
    File myFile = new File("C:\\Users\\tuzongxun\\Desktop\\tzx.txt");
    FileInputStream fis = new FileInputStream(myFile);
    String fileName = myFile.getName();
    FileSystem fs = getFileSystem();
    // target path: the upload directory with the original file name appended
    Path newPath = new Path("/demo1/" + fileName);
    // open an output stream to buffer the file content
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    byte[] b = new byte[1024];
    int n;
    while ((n = fis.read(b)) != -1) {
        bos.write(b, 0, n);
    }
    fis.close();
    bos.close();
    FSDataOutputStream outputStream = fs.create(newPath);
    outputStream.write(bos.toByteArray());
    outputStream.close();
    fs.close();
}
/**
 * List files
 * @param path
 * @throws Exception
 */
public static void listFile(String path)
    throws Exception
{
    FileSystem fs = getFileSystem();
    // target path
    Path srcPath = new Path(path);
    // recursively find all files
    RemoteIterator<LocatedFileStatus> filesList = fs.listFiles(srcPath, true);
    while (filesList.hasNext()) {
        LocatedFileStatus next = filesList.next();
        String fileName = next.getPath().getName();
        Path filePath = next.getPath();
        System.out.println("##########################fileName:" + fileName);
        System.out.println("##########################filePath:" + filePath.toString());
    }
    fs.close();
}
/**
 * Read the content of an HDFS file
 *
 * @param path
 * @return
 * @throws Exception
 */
public static String readFile(String path) throws Exception
{
    FileSystem fs = getFileSystem();
    // target path
    Path srcPath = new Path(path);
    FSDataInputStream inputStream = null;
    try {
        inputStream = fs.open(srcPath);
        // read as UTF-8 text to avoid garbled characters
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
        String lineTxt = "";
        StringBuffer sb = new StringBuffer();
        while ((lineTxt = reader.readLine()) != null) {
            sb.append(lineTxt);
        }
        return sb.toString();
    }
    finally {
        // guard against a null stream in case fs.open() failed
        if (inputStream != null) {
            inputStream.close();
        }
        fs.close();
    }
}
/**
 * Upload a file to HDFS
 *
 * @param path
 * @param uploadPath
 * @throws Exception
 */
public static void uploadFile(String path, String uploadPath) throws Exception
{
    if (StringUtils.isEmpty(path) || StringUtils.isEmpty(uploadPath)) {
        return;
    }
    FileSystem fs = getFileSystem();
    // local path to upload
    Path clientPath = new Path(path);
    // target path on HDFS
    Path serverPath = new Path(uploadPath);

    // copy via the file system; the first parameter controls whether the source file is deleted (true deletes it; the default is false)
    fs.copyFromLocalFile(false, clientPath, serverPath);
    fs.close();
}
/**
 * Entry point for calling the methods above
 * @param args
 */
public static void main(String[] args) {
    try {
        // create a directory
        //mkdir("/test2");
        // list directory information
        readPathInfo("/");
        // list files
        // listFile("/");
        // create a file
        //  createFile();
        // read file content
        String a = readFile("/test/test2.txt");
        // System.out.println("###########################" + a);
        // upload a file
        //uploadFile("C:\\Users\\tuzongxun\\Desktop\\tzx.txt", "/test2");
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
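The listing above does not cover delete or append, even though the command line section did. A minimal sketch of those two operations in the same style follows; the method names are my own, and append has to be supported by the cluster configuration:

/**
 * Delete a file or directory on HDFS (helper sketch, not from the original listing)
 *
 * @param path
 * @throws Exception
 */
public static void deleteFile(String path) throws Exception
{
    FileSystem fs = getFileSystem();
    // the second parameter enables recursive deletion, needed for non-empty directories
    boolean deleted = fs.delete(new Path(path), true);
    System.out.println("deleted:" + deleted);
    fs.close();
}
/**
 * Append content to the end of an existing HDFS file (helper sketch)
 *
 * @param path
 * @param content
 * @throws Exception
 */
public static void appendContent(String path, String content) throws Exception
{
    FileSystem fs = getFileSystem();
    FSDataOutputStream outputStream = fs.append(new Path(path));
    outputStream.write(content.getBytes("UTF-8"));
    outputStream.close();
    fs.close();
}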

Note: the dependencies above can also be switched to the Spring Boot version, in which case the commons-lang3 related jars need to be introduced additionally:

<dependency>
	<groupId>org.springframework.data</groupId>
	<artifactId>spring-data-hadoop</artifactId>
	<version>2.5.0.RELEASE</version>
</dependency>
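
The commons-lang3 dependency mentioned above is roughly the following; the version here is only an example, so use whatever fits your project:

<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-lang3</artifactId>
	<version>3.9</version>
</dependency>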

Origin blog.csdn.net/tuzongxun/article/details/107910712