Integrating the Kettle ETL Engine with Spring Boot

Introduction

ETL stands for Extract-Transform-Load. It describes the process of extracting data from a source, transforming it, and loading it into a destination, and it can handle distributed, heterogeneous data sources. Source data (for example, relational data) is extracted; "dirty" content such as incomplete, duplicate, or erroneous records is cleaned according to predefined rules; and the resulting "clean" data that meets the requirements is loaded into the data warehouse for storage. This clean data becomes the cornerstone of data analysis and data mining.
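
To make the extract, clean, and load stages concrete, here is a minimal Java sketch. It is not Kettle code: the `Customer` type, the sample rows, and the three cleaning rules (missing field, duplicate id, impossible age) are assumptions chosen purely for illustration, and the "warehouse" is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the Customer type, field names and rules are assumptions,
// not part of Kettle or of the original article.
final class EtlSketch {

    static final class Customer {
        final String id;
        final String email;
        final int age;

        Customer(String id, String email, int age) {
            this.id = id;
            this.email = email;
            this.age = age;
        }
    }

    // Extract: stand-in for reading rows from a relational source.
    static List<Customer> extract() {
        return Arrays.asList(
                new Customer("1", "ann@example.com", 34),
                new Customer("1", "ann@example.com", 34),  // duplicate
                new Customer("2", null, 51),               // incomplete
                new Customer("3", "bob@example.com", -7),  // erroneous
                new Customer("4", "eve@example.com", 28));
    }

    // Transform: apply the predefined cleaning rules.
    static List<Customer> transform(List<Customer> rows) {
        Map<String, Customer> byId = new LinkedHashMap<>();
        for (Customer c : rows) {
            boolean incomplete = c.id == null || c.email == null;
            boolean erroneous = c.age < 0 || c.age > 150;
            if (incomplete || erroneous) {
                continue;
            }
            byId.putIfAbsent(c.id, c); // keep the first row per id, drop duplicates
        }
        return new ArrayList<>(byId.values());
    }

    // Load: stand-in for writing to the warehouse; here it appends to a list.
    static int load(List<Customer> clean, List<Customer> warehouse) {
        warehouse.addAll(clean);
        return clean.size();
    }

    public static void main(String[] args) {
        List<Customer> warehouse = new ArrayList<>();
        int loaded = load(transform(extract()), warehouse);
        System.out.println("rows loaded: " + loaded);
    }
}
```

Running `main` prints `rows loaded: 2`: of the five sample rows, only ids 1 and 4 survive cleaning. In Kettle, each of these stages would be a step in a transformation rather than hand-written code.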

Kettle is an open source ETL tool. It provides a Java-based graphical interface that is very convenient to use. It also provides Java-based scripting, so the ETL process can be customized flexibly, which makes custom workflows, batch processing, and similar tasks possible. This is the part a programmer needs to handle, rather than just operating the Kettle user interface as if it were Word.

Environment integration:

Reference: "Java integrated Kettle tutorial (with sample code)", Cheng Weiping 2022's blog, CSDN

Code:

Added to pom.xml:

        <!-- MySQL JDBC driver and connection pool -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid</artifactId>
            <version>1.2.11</version>
        </dependency>
        <!-- Kettle jars loaded from the project's local lib directory -->
        <dependency>
            <groupId>pentaho-kettle</groupId>
            <artifactId>kettle-core</artifactId>
            <version>8.2.0.7-719</version>
            <scope>system</scope>
            <systemPath>${project.basedir}/lib/kettle-core-8.2.0.7-719.jar</systemPath>
        </dependency>
        <dependency>
            <groupId>pentaho-kettle</groupId>
            <artifactId>kettle-engine</artifactId>
            <version>8.2.0.7-719</version>
            <scope>system</scope>
            <systemPath>${project.basedir}/lib/kettle-engine-8.2.0.7-719.jar</systemPath>
        </dependency>
        <dependency>
            <groupId>pentaho-kettle</groupId>
            <artifactId>metastore</artifactId>
            <version>8.2.0.7-719</version>
            <scope>system</scope>
            <systemPath>${project.basedir}/lib/metastore-8.2.0.7-719.jar</systemPath>
        </dependency>
        <!-- Other dependencies required by Kettle -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-vfs2</artifactId>
            <version>2.2</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>17.0</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.2</version>
        </dependency>
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>commons-codec</groupId>
            <artifactId>commons-codec</artifactId>
            <version>1.10</version>
        </dependency>
        <dependency>
            <groupId>com.jcraft</groupId>
            <artifactId>jsch</artifactId>
            <version>0.1.54</version>
        </dependency>
        <dependency>
            <groupId>net.sourceforge.jexcelapi</groupId>
            <artifactId>jxl</artifactId>
            <version>2.6.12</version>
        </dependency>
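
Controller:

// On Spring Boot 3 the import below is jakarta.annotation.Resource.
import javax.annotation.Resource;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// R is not defined in this post; it is presumably the project's own response wrapper.
@RestController
@RequestMapping("${application.admin-path}/etl-kettl")
//@Api(tags = "ETL-Kettle demo endpoints")
public class KettleDemoContrllor {
	@Resource
	KettleService kettleService;

	// Runs a transformation (.ktr file) by file name.
	@GetMapping("/execKtr")
	//@ApiOperation("Run a ktr file")
	public Object runKtr(String filename) throws Exception {
		return R.buildOkData(kettleService.runTaskKtr(filename, null).toString());
	}

	// Runs a job (.kjb file) by file name.
	@GetMapping("/execKjb")
	//@ApiOperation("Run a kjb file")
	public Object runKjb(String filename) throws Exception {
		return R.buildOkData(kettleService.runTaskKjb(filename, null).toString());
	}
}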
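
The `filename` request parameter above is later appended to the script directory as-is, so a value such as `../other.ktr` could point outside that directory. One possible safeguard, sketched below, is to resolve and check the path before handing it to the service. `ScriptPathGuard` and its `resolve` method are hypothetical names introduced here for illustration; they are not part of the original project or of Kettle.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: validates a requested script name against the script directory.
final class ScriptPathGuard {

    // Returns the absolute path of the script, or throws if the name is not a
    // .ktr/.kjb file located inside scriptDir.
    static Path resolve(String scriptDir, String filename) {
        if (filename == null || filename.isEmpty()) {
            throw new IllegalArgumentException("filename is required");
        }
        if (!filename.endsWith(".ktr") && !filename.endsWith(".kjb")) {
            throw new IllegalArgumentException("only .ktr and .kjb files can be run");
        }
        Path base = Paths.get(scriptDir).toAbsolutePath().normalize();
        // normalize() collapses "..", so an escaping name no longer starts with base.
        Path target = base.resolve(filename).normalize();
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException("file is outside the script directory");
        }
        return target;
    }

    public static void main(String[] args) {
        System.out.println(resolve("scripts", "demo.ktr"));
        try {
            resolve("scripts", "../secret.ktr");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A controller could call such a check first and return an error response when it throws, instead of passing the raw parameter through.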

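Service interface:

import java.util.Map;

public interface KettleService {
    /**
     * Runs an ETL transformation (ktr file).
     *
     * @param taskFileName name of the transformation file to run (ktr)
     * @param params input parameters for the run
     * @return the result of the run
     * @throws Exception if the configuration file cannot be found; Kettle runtime errors are not thrown
     */
    Object runTaskKtr(String taskFileName, Map<String, String> params) throws Exception;
    /**
     * Runs an ETL job (kjb file).
     *
     * @param taskFileName name of the job file to run (kjb)
     * @param params input parameters for the run
     * @return the result of the run
     * @throws Exception if the configuration file cannot be found; Kettle runtime errors are not thrown
     */
    Object runTaskKjb(String taskFileName, Map<String, String> params) throws Exception;
}

Service implementation:
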
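import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

@Service
public class KettleServiceImpl implements KettleService {

    // Directory that holds the .ktr and .kjb files, taken from the application configuration.
    @Value("${kettle.script.path}")
    private String kettleScriptPath;

    private static final Logger logger = LoggerFactory.getLogger("kettle-service-log");

    // Definitions preloaded by init(), looked up by file name.
    private final List<KtrMeta> KTR_METAS = new ArrayList<>();
    private final List<KjbMeta> KJB_METAS = new ArrayList<>();

    // Returns the names of the regular files directly under path whose extension is subName.
    private List<String> getFiles(String path, String subName) {
        List<String> files = new ArrayList<>();
        File[] tempList = new File(path).listFiles();
        if (tempList == null) {
            return files;
        }
        for (File value : tempList) {
            // endsWith avoids the substring call, which fails on names shorter than three characters.
            if (value.isFile() && value.getName().endsWith("." + subName)) {
                files.add(value.getName());
            }
        }
        return files;
    }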

    // Singleton-style setup: initialise the Kettle environment and preload every
    // transformation and job definition once, so that later runs start faster.
    // @PostConstruct is commented out here, so init() only runs if it is called explicitly;
    // otherwise each run loads its file on demand.
    //@PostConstruct
    public void init() throws KettleException {
        logger.info("---------------------- Initialising ETL configuration ----------------------");
        KettleEnvironment.init();
        List<String> ktrFiles = getFiles(kettleScriptPath, "ktr");
        List<String> kjbFiles = getFiles(kettleScriptPath, "kjb");
        logger.info("Transformations to load: " + ktrFiles);
        logger.info("Jobs to load: " + kjbFiles);
        logger.info("---------------------- Loading ETL configuration ----------------------");
        for (String ktrFile : ktrFiles) {
            KtrMeta ktrMeta = new KtrMeta();
            ktrMeta.setName(ktrFile);
            // Use File.separator here too, matching the on-demand loading in runTaskKtr.
            ktrMeta.setTransMeta(new TransMeta(kettleScriptPath + File.separator + ktrFile));
            KTR_METAS.add(ktrMeta);
            logger.info("Loaded transformation: " + ktrFile);
        }
        for (String kjbFile : kjbFiles) {
            KjbMeta kjbMeta = new KjbMeta();
            kjbMeta.setName(kjbFile);
            kjbMeta.setJobMeta(new JobMeta(kettleScriptPath + File.separator + kjbFile, null));
            KJB_METAS.add(kjbMeta);
            logger.info("Loaded job: " + kjbFile);
        }
        logger.info("---------------------- All ETL configuration loaded ----------------------");
    }


    @Override
    public Object runTaskKtr(String ktrFileName, Map<String, String> params) {
        logger.info("Running transformation: " + ktrFileName);
        TransMeta transMeta = null;
        for (KtrMeta ktrMeta : KTR_METAS) {
            if (Objects.equals(ktrFileName, ktrMeta.getName())) {
                transMeta = ktrMeta.getTransMeta();
                break;
            }
        }
        try {
            // Not in the preloaded list: try loading the file directly.
            if (transMeta == null) {
                logger.warn("Transformation not found in cache: " + ktrFileName + ", loading it from disk");
                KettleEnvironment.init();
                // The constructor throws a KettleException if the file is missing or invalid,
                // so no null check is needed afterwards.
                transMeta = new TransMeta(kettleScriptPath + File.separator + ktrFileName);
            }
            Trans trans = new Trans(transMeta);
            if (params != null) {
                for (Map.Entry<String, String> entry : params.entrySet()) {
                    trans.setParameterValue(entry.getKey(), entry.getValue());
                }
            }
            //trans.prepareExecution(null);
            //trans.startThreads(); // start the step threads manually
            trans.execute(null);
            trans.waitUntilFinished();
            return trans.getResult();
        } catch (Exception e) {
            // Failures are logged and reported to the caller as the error message.
            logger.error("Transformation failed: " + ktrFileName, e);
            return e.getMessage();
        }
    }

    @Override
    public Object runTaskKjb(String objFileName, Map<String, String> params) throws Exception {
        logger.info("Running job: " + objFileName);
        JobMeta jobMeta = null;
        for (KjbMeta kjbMeta : KJB_METAS) {
            if (Objects.equals(objFileName, kjbMeta.getName())) {
                jobMeta = kjbMeta.getJobMeta();
                break;
            }
        }
        try {
            // Not in the preloaded list: try loading the file directly.
            if (jobMeta == null) {
                logger.warn("Job not found in cache: " + objFileName + ", loading it from disk");
                KettleEnvironment.init();
                // The constructor throws a KettleException if the file is missing or invalid,
                // so no null check is needed afterwards.
                jobMeta = new JobMeta(kettleScriptPath + File.separator + objFileName, null);
            }
            Job job = new Job(null, jobMeta);
            if (params != null) {
                for (Map.Entry<String, String> entry : params.entrySet()) {
                    job.setParameterValue(entry.getKey(), entry.getValue());
                }
            }
            job.start();
            job.waitUntilFinished();
            return job.getResult();
        } catch (Exception e) {
            // Failures are logged and reported to the caller as the error message.
            logger.error("Job failed: " + objFileName, e);
            return e.getMessage();
        }
    }
}

Holder classes (Lombok's @Data generates the getters and setters used above):

@Data
public class KtrMeta {
	private TransMeta transMeta;
	private String name;
}

@Data
public class KjbMeta {
	private JobMeta jobMeta;
	private String name;
}
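
The two holder classes exist only to pair a file name with its parsed definition, and the service scans a list to find a match before falling back to loading from disk. A map keyed by file name could express the same "use the cached definition, otherwise load it" logic more directly. The sketch below assumes a hypothetical `MetaCache` class and uses a plain string in place of `TransMeta` so that it runs without Kettle on the classpath. In real code the loader would have to wrap the checked `KettleException` thrown by the `TransMeta` and `JobMeta` constructors.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache: loads a definition the first time a file name is requested
// and returns the cached instance afterwards.
final class MetaCache<T> {
    private final Map<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> loader;

    MetaCache(Function<String, T> loader) {
        this.loader = loader;
    }

    // computeIfAbsent runs the loader only when the key is not cached yet.
    T get(String fileName) {
        return cache.computeIfAbsent(fileName, loader);
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        // Stand-in loader; a real one would parse the .ktr or .kjb file.
        MetaCache<String> cache = new MetaCache<>(name -> "parsed:" + name);
        System.out.println(cache.get("demo.ktr"));
        System.out.println(cache.get("demo.ktr")); // served from the cache
        System.out.println("entries: " + cache.size());
    }
}
```

With this shape, preloading at startup becomes a loop that calls `get` for each file, and the lookup and the on-demand fallback collapse into a single call.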

Summary:

Having done the integration, I feel there is no real need to embed Kettle in the project. What matters is learning how to use the tool itself for data collection and management.

Reference: "1. Overview of ETL and Kettle", bilibili

Download: Kettle tool download

Origin blog.csdn.net/qq_22824481/article/details/130984107