Traversing FTP directories to find files in specified folders and importing them into HDFS

Recently I had a requirement: an FTP host holds many directories with inconsistent nesting depths, and the specified files under every matching folder beneath a parent directory had to be downloaded to HDFS while preserving the directory structure. Since the data had already landed on the FTP host, and our company's internal ETL tool does not handle this kind of unevenly nested file import well, I implemented it in code instead.

Idea: 1. Traverse the specified FTP directory and find the paths of the folders with the target name
2. Take all of those folder paths and download their contents to HDFS

Implementation (only local test code is shown here):
1. FTP connection utility class


import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPReply;
import org.apache.log4j.Logger;

/**
 * @ClassName FTPUtils
 * @Description FTP connection utility class
 * @Author KGC
 * @Date 2019/03/21
 **/
public class FTPUtils {

    private static Logger logger = Logger.getLogger(FTPUtils.class);

    /**
     * Connect and log in to the FTP server
     */
    public static FTPClient loginFTP(String host, int port, String userName, String passWord) {
        FTPClient ftpClient = null;
        try {
            ftpClient = new FTPClient();
            // connect to the FTP server
            ftpClient.connect(host, port);
            // log in
            ftpClient.login(userName, passWord);
            // passive mode: the client opens the data connection
            ftpClient.enterLocalPassiveMode();
            // control-channel character encoding
            ftpClient.setControlEncoding("UTF-8");
            // ftpClient.setFileType(FTP.BINARY_FILE_TYPE);
            if (!FTPReply.isPositiveCompletion(ftpClient.getReplyCode())) {
                logger.error("FTP connection failed; check whether the user name or password is wrong");
                ftpClient.disconnect();
            } else {
                logger.info("Connected successfully!");
            }
        } catch (Exception e) {
            logger.error("Connection failed; check the user name and password", e);
        }
        return ftpClient;
    }
}

Very simple: it directly uses the API in org.apache.commons.net.ftp.FTPClient.
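As a quick sanity check, a minimal sketch of calling it might look like the following (the host, port, and credentials are placeholders for your own server):

import org.apache.commons.net.ftp.FTPClient;

public class FTPUtilsDemo {
    public static void main(String[] args) throws Exception {
        // placeholder host and credentials -- replace with your own FTP server
        FTPClient ftp = FTPUtils.loginFTP("127.0.0.1", 21, "user", "password");
        if (ftp != null && ftp.isConnected()) {
            // print the server-side working directory to verify the login
            System.out.println("Logged in, working dir: " + ftp.printWorkingDirectory());
            ftp.logout();
            ftp.disconnect();
        }
    }
}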

2. FTP directory traversal class


import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * @ClassName TraverseFtpDir, returns a dir List
 * @Description
 * @Author KGC
 * @Date DATE
 **/
public class TraverseFtpDir {

    static List<String> list = new ArrayList<String>();

    public static List<String> func(String file, String fileName, FTPClient ftpClient) throws IOException {

        FTPFile[] fs = ftpClient.listFiles(file);

        // walk the directory; recurse into every subdirectory
        for (FTPFile f : fs) {
            if (f.isDirectory()) {
                // recurse with the subdirectory's path, not the original path
                func(file + f.getName() + "/", fileName, ftpClient);

                // if this directory's name matches the target folder name,
                // record its full path
                if (f.getName().equals(fileName)) {
                    list.add(file + f.getName());
                }
            }
        }
        return list;
    }
}

To summarize the logic: ftpClient.listFiles gets every entry in the directory, folders and files alike. We iterate over the entries and check whether each one is a directory; if it is, we recurse into it, and we also check whether it is the specified folder, adding its path to the list if so. Note that the path argument changes on each recursive call: it is not the path passed in on the first call, but the path of the subdirectory currently being traversed, i.e. func(file + f.getName() + "/", fileName, ftpClient); otherwise you will only ever collect the matching paths directly under the first-level directory.
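To make the recursion concrete, here is a minimal usage sketch against a hypothetical layout (the paths, host, and credentials below are made up for illustration):

import org.apache.commons.net.ftp.FTPClient;
import java.util.List;

public class TraverseDemo {
    public static void main(String[] args) throws Exception {
        // hypothetical FTP layout:
        //   test/tableA/20190319/part-0000
        //   test/tableB/daily/20190319/part-0000
        //   test/tableC/20190320/part-0000
        FTPClient ftp = FTPUtils.loginFTP("127.0.0.1", 21, "user", "password");
        List<String> dirs = TraverseFtpDir.func("test/", "20190319", ftp);
        // with the layout above, dirs would contain:
        //   test/tableA/20190319
        //   test/tableB/daily/20190319
        for (String dir : dirs) {
            System.out.println(dir);
        }
        ftp.disconnect();
    }
}

Note that the traversal also descends into test/tableB/daily/, which is exactly why the nested 20190319 folder is found.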

3. Download data to HDFS



import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.log4j.BasicConfigurator;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

/**
 * @ClassName LoadDataToHdfs
 * @Description get the dirs and download their data to HDFS
 * @Author KGC
 * @Date 2019/3/24
 **/
public class LoadDataToHdfs {

    public static void loadDatatoHdfs(Configuration conf) throws IOException {

        // FTP client object; change these parameters for your environment
        FTPClient ftp = FTPUtils.loginFTP("10.211.55.6", 21, "wangjuncheng", "wjc5524568");

        InputStream inputStream = null;
        FSDataOutputStream outputStream = null;

        List<String> li = TraverseFtpDir.func("test/", "20190319", ftp);
        FileSystem fileSystem = FileSystem.get(conf);

        for (int i = 0; i < li.size(); i++) {
            if (li.get(i) != null) {
                // list all files under this directory
                FTPFile[] files = ftp.listFiles(li.get(i));
                // switch the FTP working directory first, otherwise
                // retrieveFileStream returns null (see the notes below)
                ftp.changeWorkingDirectory(li.get(i));
                for (FTPFile f : files) {
                    inputStream = ftp.retrieveFileStream(f.getName());
                    // adjust the HDFS target path (e.g. under /interface) here;
                    // append the file name so every file gets its own HDFS path
                    outputStream = fileSystem.create(new Path(li.get(i) + "/" + f.getName()));
                    IOUtils.copyBytes(inputStream, outputStream, conf, false);
                    outputStream.close();
                    if (inputStream != null) {
                        inputStream.close();
                        // finalize the transfer so the next retrieveFileStream
                        // does not return null
                        ftp.completePendingCommand();
                    }
                }
            }
        }

        ftp.disconnect();
    }

    public static void main(String[] args) throws IOException {
        BasicConfigurator.configure();
        Configuration conf = new Configuration();
        // conf.set("fs.defaultFS", "hdfs://10.211.55.6:9000");
        // System.setProperty("HADOOP_USER_NAME", "root");

        Path path = new Path("hdfs://10.211.55.6:9000/0001");
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(path);

        LoadDataToHdfs.loadDatatoHdfs(conf);
    }
}

There are a few pitfalls here as well. The inputStream obtained in the loop kept triggering a NullPointerException. Most advice online says inputStream must be closed manually after each iteration, i.e. adding the two lines inputStream.close(); and ftp.completePendingCommand();. But I had already added those, and it was not that only the first iteration had a value — every iteration returned null. The real cause turned out to be something else: before calling retrieveFileStream, the FTP working directory must be switched to the current directory by adding ftp.changeWorkingDirectory(li.get(i));
only then does the file path resolve on every iteration.
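Putting both fixes together, the per-file retrieval pattern that ends up working looks like this in isolation (the directory, host, and credentials are placeholders):

import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;
import java.io.InputStream;

public class RetrieveDemo {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = FTPUtils.loginFTP("127.0.0.1", 21, "user", "password");
        String dir = "test/tableA/20190319"; // placeholder directory
        // fix 1: switch the working directory before retrieving, otherwise
        // retrieveFileStream(name) cannot resolve the relative file name
        ftp.changeWorkingDirectory(dir);
        for (FTPFile f : ftp.listFiles(dir)) {
            InputStream in = ftp.retrieveFileStream(f.getName());
            if (in != null) {
                // ... copy the stream to its destination here ...
                in.close();
                // fix 2: finalize the transfer after every retrieveFileStream,
                // otherwise the next call returns null
                ftp.completePendingCommand();
            }
        }
        ftp.disconnect();
    }
}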

Origin blog.csdn.net/qq_33891419/article/details/88844267