Use fuse_dfs to map files in HDFS to the local file system (compile + use tutorial)

HDFS, the distributed file system that ships with Hadoop, has been very successful. Many companies use it as the underlying file system when building distributed prototype systems.

Unfortunately, operating HDFS requires special commands such as hdfs dfs -ls /. Is there a way to map HDFS to an ordinary local directory, so that reading and writing HDFS is as easy as reading and writing local files? Yes: the fuse_dfs tool that comes with Hadoop. With it, HDFS can be mounted into the local file system and accessed through ordinary file reads and writes.
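As a small illustration of what this buys you (a sketch assuming the /root/hdfs mount point used later in this article and a hypothetical /user/ipid directory in HDFS):

# The usual way, through the HDFS client
hdfs dfs -ls /user/ipid
hdfs dfs -put report.txt /user/ipid/

# The same operations through the FUSE mount, with plain file tools
ls /root/hdfs/user/ipid
cp report.txt /root/hdfs/user/ipid/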

Let me introduce how to compile and use this tool:

⚠️ Note: The usage of this tool differs greatly between Hadoop v2 and v3. Some articles (such as this one) describe the steps for v2, which no longer apply to v3.

The content of this article is mainly based on this blog post by sleeplessbeastie. You can also refer to that post when using fuse_dfs.

Compile fuse_dfs

1. First, install the JDK and Hadoop. There are plenty of Hadoop installation tutorials online, so that is beyond the scope of this article. After installation, create a non-root user (a minimal example is sketched after the commands below). As this user, download the source package matching the installed Hadoop version and unpack it:

# Fetch the source code for the matching version. Get the link for the source package you need from the Hadoop website;
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4-src.tar.gz

# Unpack the Hadoop source
tar xvf hadoop-3.3.4-src.tar.gz

# Enter the unpacked directory
cd hadoop-3.3.4-src
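If you do not yet have a suitable non-root user, here is a minimal sketch (the user name builder is just a placeholder):

# Create a regular (non-root) user for the build
sudo adduser builder

# After installing Docker in step 2, let this user run containers without sudo
sudo usermod -aG docker builder

# Switch to the new user before continuing
su - builder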

2. Install Docker:

# Use Docker's official one-step installation script
curl -fsSL https://get.docker.com -o - | sudo bash

3. Enter the build environment. Hadoop provides a Docker-based build environment: just run the following script. It automatically builds a Docker image, installs the packages required for compilation, and then drops you into the container:

./start-build-env.sh

⚠️ In this step, make sure the current user is not root, otherwise you will get a confusing error.

4. After entering the container, you will find that the ~/.m2 folder is owned by root, which will break Maven later, so fix it:

chown -R $(whoami):$(whoami) ~/.m2

5. Enter the hadoop folder in the HOME directory:

cd ~/hadoop

6. Build the Hadoop source with Maven. Maven is already installed in the Docker image, so there is no need to install it manually. This step can take a while; a server with fast access to the Maven repositories helps. The first build took 36 minutes on my machine:

mvn package -Pnative -Drequire.fuse=true -DskipTests -Dmaven.javadoc.skip=true

7. Check the ID of the current container and record it. In Docker, by default, the hostname is the ID of the container:

cat /etc/hostname
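If you prefer, the ID can also be read from the host side (a sketch; on my setup the build container is recognizable by its image name, which mentions hadoop-build, but treat that name as an assumption):

# On the host: list running containers and read the ID from the first column
docker ps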

8. Find the path of the fuse_dfs executable:

sudo find / -name 'fuse_dfs'

On my machine, with Hadoop version 3.2.2, the path found is:

/home/ipid/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/fuse-dfs/fuse_dfs

9. Enter the directory containing the fuse_dfs executable and check which shared libraries it depends on:

ldd fuse_dfs

The output on my machine is:

linux-vdso.so.1 (0x00007ffd357fa000)
libfuse.so.2 => /lib/x86_64-linux-gnu/libfuse.so.2 (0x00007f57ba209000)
libhdfs.so.0.0.0 => not found
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f57b9fea000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f57b9de2000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f57b99f1000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f57b97ed000)
/lib64/ld-linux-x86-64.so.2 (0x00007f57ba652000)

As you can see, it depends on libfuse.so.2 and libhdfs.so.0.0.0, so in the next steps we have to make these available outside the container; otherwise the compiled fuse_dfs will not run.

10. Find the location of the above two files:

sudo find / -name 'libfuse.so.2'
sudo find / -name 'libhdfs.so.0.0.0'

11. Exit the container:

exit

12. Deal with the two library dependencies. For libfuse, note that fuse_dfs depends on major version 2, so we can simply install libfuse2 with the system package manager:

sudo apt install libfuse2
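To confirm the library is now visible to the dynamic linker, you can check the linker cache:

# Should print an entry for libfuse.so.2 once libfuse2 is installed
ldconfig -p | grep libfuse.so.2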

13. The fuse_dfs binary itself and the libhdfs shared library must be copied out of the container. Remember the container ID and file paths recorded in steps 7 and 10? Copy both files out of their respective locations, then adjust the file permissions as needed:

# Create a folder to hold the files and keep things tidy
mkdir fuse_dfs
cd fuse_dfs

# Copy the compiled fuse_dfs out of the container
docker cp <container ID>:<path to fuse_dfs> ./fuse_dfs

# Copy the compiled libhdfs out of the container
# Note the -L flag on this cp command; the reason is explained below
docker cp -L <container ID>:/home/<username>/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/native/target/usr/local/lib/libhdfs.so.0.0.0 ./libhdfs.so.0.0.0

# The file may not be executable at this point, so add execute permission
chmod 755 ./libhdfs.so.0.0.0

The reason the cp command here has the -L flag is to resolve the symbolic link and copy the real file it points to.

On Linux, the convention is for a soft link carrying only the major version number to point to the shared library with the full file name. For example, you may find two files in a system library directory: libyaml-0.so.2 and libyaml-0.so.2.0.6, where libyaml-0.so.2 is a soft link pointing to libyaml-0.so.2.0.6. So if you copy without the -L switch, what you end up with is just the link.
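You can see this for yourself; a small sketch using the libyaml example above (the library path and version numbers are illustrative and may differ on your system):

# The short name is a symlink...
ls -l /lib/x86_64-linux-gnu/libyaml-0.so.2
# ...and resolving it yields the real file, which is what docker cp -L copies
readlink -f /lib/x86_64-linux-gnu/libyaml-0.so.2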

14. Running fuse_dfs requires several environment variables. Since setting them by hand every time is tedious, we simply write a script to do it. Create a shell script called fuse_dfs_wrapper.sh with the following content:

#!/bin/bash
# This script stands in for the fuse_dfs command; running it is equivalent to running fuse_dfs itself

# Fill in your Hadoop installation path here
export HADOOP_HOME=/usr/local/hadoop

# Fill in your JDK installation path here
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Check whether the current user is root; if not, exit with an error
if [ "$(id -u)" != "0" ]; then
    echo "This script must be run as root" 1>&2
    exit 1
fi

# Work out the directory this script resides in
CURRENT_SHELL_FILE_DIR=$(dirname "$(readlink -f "$0")")

# Add the script's directory to LD_LIBRARY_PATH,
# so that libhdfs.so.0.0.0 can be found at run time
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CURRENT_SHELL_FILE_DIR

# Find every HDFS-related jar in the Hadoop installation and add it to CLASSPATH
export CLASSPATH=.
while IFS= read -r -d '' file
do
  export CLASSPATH=$CLASSPATH:$file
done < <(find ${HADOOP_HOME}/share/hadoop/{common,hdfs} -name "*.jar" -print0)

# Forward the arguments passed to this script on to fuse_dfs
./fuse_dfs "$@"

Afterwards, the above script can be used instead of the fuse_dfs command itself.
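Before using it, make the script executable; it should sit in the same directory as fuse_dfs and libhdfs.so.0.0.0, since the LD_LIBRARY_PATH logic above relies on that:

# Make the wrapper executable
chmod +x fuse_dfs_wrapper.sh

# The three files should now live side by side
ls
# expected: fuse_dfs  fuse_dfs_wrapper.sh  libhdfs.so.0.0.0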

Use of fuse_dfs

1. Create a directory under /root to serve as the mount point:

sudo mkdir /root/hdfs

2. Then run the following command to mount HDFS into the local file system:

sudo ./fuse_dfs_wrapper.sh dfs://<NameNode address>:<port> /root/hdfs -oinitchecks
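As a concrete sketch, suppose fs.defaultFS in your core-site.xml points at namenode.example.com on port 8020 (both values are placeholders; substitute your own). The mount and a quick verification might look like this:

# Mount HDFS at /root/hdfs (NameNode address and port are placeholders)
sudo ./fuse_dfs_wrapper.sh dfs://namenode.example.com:8020 /root/hdfs -oinitchecks

# Ordinary tools should now see the HDFS namespace
ls -l /root/hdfs
df -h /root/hdfs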

3. When you want to stop, just kill the fuse_dfs process and then unmount /root/hdfs:

sudo killall fuse_dfs
sudo umount /root/hdfs

Postscript

Pretty simple, isn't it? I haven't benchmarked this setup myself, but another blog has, and its conclusion was that transfer speeds can essentially saturate the network card, so at the very least it works as a fallback option. Give it a try, and if you spot any mistakes or have anything to add, let me know in the comments!


Original article: blog.csdn.net/llbbzh/article/details/128394729