Things that Kettle automates

One, Kettle introduction

  Kettle is an open-source ETL tool from abroad, written in pure Java and portable (no installation required); it performs efficient, stable data extraction (a data migration tool). Kettle has two kinds of script files: transformations and jobs. A transformation performs the basic data processing, and a job controls the overall workflow.

Two, ETL introduction

  ETL is the process of extracting, cleaning, and transforming data from business systems and then loading it into a data warehouse. The purpose is to integrate the scattered, messy, and inconsistent data in an enterprise into a basis for decision-making analysis. It is an important part of a BI (Business Intelligence) project.
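As a minimal sketch of those three steps, using only Python's standard library (the table and column names here are invented for the example):

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a CSV source (an in-memory string here).
raw = "id,amount\n1, 100 \n2,\n3, 250 \n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: strip whitespace and drop rows with a missing amount.
clean = [(int(r["id"]), int(r["amount"].strip()))
         for r in rows if r["amount"].strip()]

# Load: write the cleaned rows into a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_amount (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO fact_amount VALUES (?, ?)", clean)
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM fact_amount").fetchone())
```

Real ETL adds business rules, validation, and scheduling on top, which is exactly what tools like Kettle package up.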

Three, ETL implementation details

  In fact, many tools can implement ETL. The ones I am familiar with and have used are: Informatica PowerCenter, Kettle, SQL, PL/SQL programming, Python, and so on.

A brief word on each of these ways of implementing ETL:

(1) Informatica PowerCenter

  This software is commercial. There are relatively few mature Chinese-language materials online, and those that exist cover older versions; most materials are in English, which confuses many beginners. The best-known expert in China is Yang Xiaodong; almost all the Chinese materials in circulation were shared by him, and they still cover versions 7.6 and 8.5. (When I was studying, I bought books myself, found videos on Taobao, and lurked in Yang Xiaodong's group. After 6 months of hard work I could get started and work with it normally.)

(2) Kettle

  This software is open source and written in pure Java. There are plenty of documents and video tutorials online, and many people share their cases on blogs. (From first contact with Kettle to using it flexibly took me a total of 2 days, mainly thanks to my accumulated technical background: Java programming, SQL, and experience with Informatica PowerCenter.)

(3) SQL

  When it comes to SQL, everyone in IT feels they already know it. Do you think that knowing select/insert/update/delete means you have implemented ETL? Wake up, buddy: if you think that way, you will never amount to much!!

The SQL I mean here must be combined with the business at your job: you have to understand the business first, and only then write the SQL. After running the SQL you have to verify it: is the result correct? How is the performance? If the performance is poor, you need to tune the SQL. On the subject of tuning, some people say they just search Baidu, where there are so many articles. You will indeed find a lot of articles, but they copy from each other shamelessly; and if you do not understand the business, your tuning is pure guesswork.

For example:

Xiao Ming has a big appetite; he eats a whole roast duck in one meal and says how happy he is to be full.

Your appetite is small; you normally eat half a bun per meal. Hearing what Xiao Ming said, you also eat a whole roast duck; now your belly is about to burst and the duck is about to come back up.

This little example shows that different business rules, data volumes, and access volumes call for different tuning methods. There are also some general tuning methods that come down to how the SQL itself is written; you can refer to my blog post: Oracle tuning notes (debunking the myths).
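One example of such a general method is writing set-based SQL instead of fetching and re-inserting rows one at a time in client code. A minimal sketch with SQLite (table and column names invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, fee INTEGER)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(i, i * 10) for i in range(5)])
conn.execute("CREATE TABLE dst (id INTEGER, fee INTEGER)")

# Set-based: one INSERT ... SELECT filters and moves all rows inside the
# database engine, instead of round-tripping each row through the client.
conn.execute("INSERT INTO dst SELECT id, fee FROM src WHERE fee >= 20")
print(conn.execute("SELECT COUNT(*) FROM dst").fetchone()[0])
```

Whether this (or an index, or a rewrite of the join) is the right fix still depends on the business and the data volume, as the example above argues.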

(4) PL/SQL programming

  PL/SQL is Oracle's procedural programming extension. If you want to implement ETL with PL/SQL, the basic requirements are much the same as for SQL in point (3) above; in addition, you must be familiar with PL/SQL syntax and able to handle the Exceptions raised at runtime. (I work as an Oracle DBA and am very familiar with PL/SQL programming, which is why it is mentioned here. Most databases have their own procedural language; as long as you are familiar with it, you can implement ETL.) If you are also interested in PL/SQL programming, you can refer to my blog post: the simplest data extraction in history.

(5) Python

  Python is a programming language with many built-in mathematical functions and utility libraries, which makes ETL operations easier to implement. If your data sources are Excel or CSV files, pandas is very convenient.
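For instance, a minimal pandas sketch of a CSV-in, CSV-out ETL step (the column names are invented for the example; a real file path would replace the in-memory string):

```python
import io
import pandas as pd

# Extract: read a CSV source.
csv_src = io.StringIO("hospital,fee\nA,100\nB,-5\nA,300\n")
df = pd.read_csv(csv_src)

# Transform: drop invalid fees and aggregate per hospital.
df = df[df["fee"] > 0]
summary = df.groupby("hospital", as_index=False)["fee"].sum()

# Load: write the result out as CSV.
out = io.StringIO()
summary.to_csv(out, index=False)
print(out.getvalue())
```

`pd.read_excel` works the same way for Excel sources.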

Four, the basic concepts of Kettle

  (1) ktr transformation: assembles one or more data sources into a data pipeline. According to business requirements, it processes the data with Kettle's built-in components and finally outputs it somewhere (a file or a database).

  (2) kjb job: schedules one or more of the transformations you have designed; it can also do some file processing (compare, delete, etc.), upload and download files over FTP, send e-mail, execute shell commands, and so on.

For example:

  ktr transformation: ktr files are written according to the business; each implements a different operation, and there will be many of them. A ktr is like a worker on a construction site, where each worker has a different skill (laborer, bricklayer, steel fixer, carpenter, crane driver).

  kjb job: a kjb manages multiple ktrs. A kjb is like the contractor, who manages multiple workers. When a task comes in, you go straight to the contractor, and the contractor finds workers with the right skills to complete the work according to the task's requirements.

Five, business description

  Our company's business has multiple sources. One business line serves the National Healthcare Security Administration: based on hospital HIS/PACS/LIS and financial data provided from all over the country, together with the administration's settlement data, we perform data analysis and screening, cooperate with business experts on unannounced inspections, and uncover problems such as irregular charging. The aim is to bring hospitals under control, achieve reasonable medical charges and harmony between doctors and patients, accomplish more with the least insurance expenditure, and benefit ordinary people.

Six, the work process

  Based on the business experts' experience, the various irregular-charging rules found in medical insurance are written as Kettle transformations. Each transformation is one screening rule, and there are several hundred such rules; some are shown in the screenshot below:

(screenshot of some of the screening-rule transformations omitted)

With this many transformations, it is impossible to run them one by one, so I assigned every 10 transformations to 1 job to manage and call them; a partial screenshot is below:

(partial screenshot of the managing jobs omitted)

This is much more convenient: I only need to call these few dozen jobs to indirectly call those several hundred transformations.
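The grouping itself can be scripted; a minimal sketch (file names invented for the example) of splitting transformations into batches of 10, one batch per managing job:

```python
# Split a list of transformation files into groups of 10,
# one group per managing job. Names are invented for the example.
ktrs = [f"rule_{i:03d}.ktr" for i in range(1, 26)]
groups = [ktrs[i:i + 10] for i in range(0, len(ktrs), 10)]
print(len(groups))                      # number of jobs needed
print(groups[0][0], groups[-1][-1])     # first and last transformation
```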

Seven, difficulties in the work

  At first there were only 10 hospitals in one region. We created a database per region and stored all of that region's hospital data in partitioned Hive tables in the corresponding database, using the hospital's medical-institution code as the partition key. Each hospital had its own screening transformation files, because the conditions in the transformation files use each hospital's distinct medical-institution code, and each hospital's screening results are saved to a different path. Each hospital corresponds to 500 transformations and 50 job files, so for 10 hospitals I modified 5,000 transformation files and 500 job files in total, using a tool's find-and-replace to do it quickly, of course.

  But now there are multiple regions, and the hospital count has grown from the original 10 to more than 100. If I still replaced everything by hand, it would kill me on the spot. The database connection information also has to be changed now, because multiple regions mean multiple databases. Later, when data from the whole country comes in, there will be far too many transformation and job files to modify. I would not survive it.

Eight, solving the difficulties

  As mentioned above, every hospital has its own several hundred screening transformations, but the data structure is the same for all hospitals. Could we keep just one generic set of screening transformation files? The screening rules are identical; only each hospital's medical-institution code, result file output path, and database connection information differ. So the idea was to call Kettle from Java, passing in different parameters on each call; inside Kettle, named parameters can receive them. This gives the program reusability and flexibility.

  The database-related replacement parameters are: database name, database IP, port, username, and password.

  The task-related replacement parameters are: the hospital's medical-institution code, the absolute path where the jobs are stored, and the path where the screening result files are saved.
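Kettle's command-line runners also accept named parameters as `-param:NAME=value` arguments, so the same generic job can be launched per hospital with different values. The sketch below only builds such a kitchen.sh command; all paths, parameter names, and values here are hypothetical examples, and the project itself passes the parameters from Java rather than the shell:

```python
# Build a kitchen.sh command that runs one generic job for one hospital.
# All names and paths here are hypothetical examples.
def build_kitchen_cmd(kjb_path, params):
    cmd = ["kitchen.sh", f"-file={kjb_path}", "-level=Basic"]
    # Kettle named parameters are passed as -param:NAME=value.
    cmd += [f"-param:{name}={value}" for name, value in params.items()]
    return cmd

cmd = build_kitchen_cmd(
    "/etl/jobs/screening.kjb",
    {
        "HOSPITAL_CODE": "H12345678",   # medical-institution code
        "DB_HOST": "10.0.0.21",
        "DB_PORT": "3306",
        "RESULT_DIR": "/etl/results/H12345678",
    },
)
print(" ".join(cmd))
```

Given a real Kettle installation, the command could then be executed with `subprocess.run(cmd)`.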

Attached are examples of the named parameters for jobs and transformations, in the screenshots below:

  (1) Global named parameters of the kjb job

(screenshot omitted)

(2) Named parameters of the transformation called inside the kjb job

(screenshot omitted)

(3) Global named parameters of the ktr transformation

(screenshot omitted)

(4) DB connection parameters of the ktr transformation

(screenshot omitted)

(5) Result-file output path parameter

(screenshot omitted)

  I have not done Java development for many years, and the thought of touching code again gives me a headache. But the work must go on and life continues. I searched Baidu and actually found articles on calling Kettle from Spring Boot. The rest was just getting my hands dirty and becoming a programmer again.

Nine, project code structure

(screenshot of the project code structure omitted)

Ten, running the project code

(1) Start Spring Boot

  On the InsuranceETLApplication class: right-click ------> Run As -----> Java Application

(2) Enter in the browser:

http://localhost:9090/kettle/task

  (3) Console output

(screenshot of the console output omitted)

From the console output you can see the load_violation_data_to_hive_01 job currently being called, along with detailed information about the transformation currently running in that job and the SQL statements being executed.

That is all for now. If you also work on ETL automation, let's keep in touch. I ran into plenty of pitfalls while developing this code; let's learn and progress together!


When you are tired of studying, make yourself smile and tell yourself that your future life will be beautiful!!!


Origin blog.51cto.com/51power/2546708