First, install Kettle
1. Quick install
This was my first contact with Kettle (before this I had only heard the name). After a few days of fumbling, the source build on macOS kept failing, so I fell back to a quick install. On macOS, the latest Kettle can be installed and launched as follows:
☁ ~ brew install kettle
☁ ~ cd /usr/local/Cellar/kettle/8.2.0.0-342/
☁ 8.2.0.0-342 cd libexec
☁ libexec ./spoon.sh
2. Building Kettle from source
git clone https://github.com/pentaho/pentaho-kettle
# or
git clone [email protected]:pentaho/pentaho-kettle.git
- Set up settings.xml
Place the settings.xml file in Maven's configuration directory ~/.m2 (Maven reads ~/.m2/settings.xml):
☁ pentaho-kettle [master] ⚡ ll /Users/zhangbocheng/.m2
total 8
drwxr-xr-x 97 zhangbocheng staff 3104 11 8 17:28 repository
-rw-r--r-- 1 zhangbocheng staff 2345 11 8 20:10 setting.xml
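Building from source fails unless Maven can resolve artifacts from Pentaho's public Nexus, which is what the settings.xml provides. Below is a minimal sketch, not the official file: the repository URL is taken from the error log further down, and the full settings.xml distributed with Pentaho's build instructions contains additional profiles and properties.

```xml
<!-- ~/.m2/settings.xml: minimal sketch assuming the pentaho-public Nexus
     group seen in the build error log; the official file has more. -->
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <mirrors>
    <mirror>
      <id>pentaho-public</id>
      <name>pentaho-public</name>
      <url>http://nexus.pentaho.org/content/groups/omni</url>
      <mirrorOf>*</mirrorOf>
    </mirror>
  </mirrors>
</settings>
```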
- Run the build
☁ pentaho-kettle [master] mvn clean install >> /Users/zhangbocheng/Desktop/kettle.log
- The error log
Without settings.xml in place, the build failed:
.....................................
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 47:49 min
[INFO] Finished at: 2019-11-08T17:44:01+08:00
[INFO] Final Memory: 230M/985M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project pdi-ce: Could not resolve dependencies for project org.pentaho.di:pdi-ce:pom:9.0.0.0-SNAPSHOT: Could not transfer artifact org.hitachivantara.karaf.assemblies:client:zip:9.0.0.0-20191107.125717-160 from/to pentaho-public (http://nexus.pentaho.org/content/groups/omni/): Failed to transfer file http://nexus.pentaho.org/content/groups/omni/org/hitachivantara/karaf/assemblies/client/9.0.0.0-SNAPSHOT/client-9.0.0.0-20191107.125717-160.zip with status code 502 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :pdi-ce
With settings.xml set up, the build ran; after that it was just a long wait.
Second, the experiment
This course experiment required setting up Kettle by hand for the first time, which turned out to be one of the more interesting opportunities for engineering practice: spend a minimum of time getting to know a popular and powerful ETL tool, Kettle.
1. The task
Task description: use Kettle to complete the following experiments, storing the results in MySQL (or CSV). Given: an Excel file containing the columns (name, age, ID card number, gender, registration date-time, outpatient number) and a number of data rows.
1) Generate data containing the columns (date, gender, child/young/middle-aged/elderly, count), where the age brackets for child/young/middle-aged/elderly are defined by yourself;
2) Generate data containing the columns (province, hour, visit count).
Since this was my first contact with Kettle, I kept things simple and considered only Excel for both input and output. First I fabricated test data matching the requirements, as shown below:
Excel field description:
Name (姓名): string
Age (年龄): integer
ID card number (身份证号码): string
Gender (性别): string
Registration date-time (挂号日期时间): datetime
Outpatient number (门诊号): integer
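Fabricating the test data can also be done programmatically. A minimal Python sketch under stated assumptions: pandas is available, and the names, ID numbers, and row count are made up purely for illustration.

```python
import random
from datetime import datetime, timedelta

import pandas as pd

random.seed(42)  # reproducible fake data

def fake_row(i):
    """One fabricated patient record matching the Excel schema above."""
    reg = datetime(2019, 11, 1) + timedelta(minutes=random.randrange(7 * 24 * 60))
    return {
        "name": f"patient_{i}",                                 # 姓名: string
        "age": random.randint(1, 90),                           # 年龄: integer
        "id_card": f"{random.randint(11, 65):02d}" + "0" * 16,  # 身份证号码: string
        "gender": random.choice(["M", "F"]),                    # 性别: string
        "reg_datetime": reg,                                    # 挂号日期时间: datetime
        "outpatient_no": 10000 + i,                             # 门诊号: integer
    }

df = pd.DataFrame(fake_row(i) for i in range(200))
# df.to_excel("patients.xlsx", index=False)  # writing the source file needs openpyxl
print(df.head())
```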
Enter the install directory /usr/local/Cellar/kettle/8.2.0.0-342/libexec and start Kettle:
Given the experiment requirements, the problems involved are really just input, output, and a transformation (grouped statistics). Before creating the task, it is worth a quick Baidu or Google search to see how Kettle implements input and output.
2. Warm-up example
The easiest case to implement is generating random numbers and storing them in a txt file.
1) Create a new transformation and save it as test_random (suffix .ktr). Using drag and drop, pull the "Generate random value" and "Text file output" steps out of Core Objects -> Input and Output respectively. Then click "Generate random value", hold down the Shift key, and drag to "Text file output" to create the arrow that indicates the direction of data flow. As shown below:
2) Edit the input step, i.e. the "Generate random value" step, as shown:
The supported random data types are:
3) Then edit the output step, i.e. the "Text file output" step, as shown:
The output file name supports preview: click the "Show filename(s)..." button in the figure:
4) Finally, run it and check the result.
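What this warm-up transformation does can be mirrored in a few lines of Python (the file name and row count are arbitrary): "Generate random value" corresponds to producing rows, and "Text file output" to writing them out one per line.

```python
import random

random.seed(0)

# One value per line, mimicking "Generate random value" -> "Text file output".
with open("test_random.txt", "w") as out:
    for _ in range(10):
        out.write(f"{random.random()}\n")  # a float in [0, 1)

with open("test_random.txt") as f:
    lines = f.read().splitlines()
print(len(lines))
```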
3. Experiment steps
The simple experiment above covers the basic input/output operations; now to the main task.
1) Change both the input and the output of the warm-up to Excel. The relevant configuration is as follows:
Excel input:
Under the File tab, set the spreadsheet type to match the actual file (xls or xlsx); next to "File or directory", click "Browse" to select the source data file, then click "Add";
Under the Sheets tab, click "Get sheetname(s)..." to add the worksheet, i.e. the sheet inside the Excel file;
Under the Fields tab, click "Get fields from header row..." to fetch the fields automatically. Integer columns in the original Excel come in as floating point, so they need to be changed back, as shown:
Finally, the data can be previewed.
Excel output: only the output file name needs to be configured; everything else can stay at its defaults.
2) Next comes the core step, the transformation. Start with the analysis for data set 1. Since grouping in Kettle requires sorting first, the points to handle are:
(1) truncate the registration date-time to the day;
(2) map age into brackets according to some (self-defined) standard;
(3) sort by the fields to be grouped;
(4) compute the grouped statistics.
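The four steps above map one-to-one onto ordinary dataframe operations. A pandas sketch for orientation: the age brackets, column names, and sample rows are my own illustrative choices, not anything Kettle prescribes.

```python
import pandas as pd

# A few fabricated registration records (schema as in the source Excel).
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M"],
    "age": [5, 23, 41, 70, 36],
    "reg_datetime": pd.to_datetime([
        "2019-11-08 09:15:00", "2019-11-08 10:02:00", "2019-11-08 13:45:00",
        "2019-11-09 08:30:00", "2019-11-09 17:20:00",
    ]),
})

# (1) Truncate the registration date-time to the day (the "Select values" step
#     with format yyyy-MM-dd).
df["date"] = df["reg_datetime"].dt.strftime("%Y-%m-%d")

# (2) Map age into brackets (the "Number range" step); boundaries self-defined.
bins = [0, 14, 35, 60, 200]
labels = ["child", "young", "middle-aged", "elderly"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

# (3) Sort by the grouping keys ("Sort rows"), then (4) count per group ("Group by").
result = (df.sort_values(["date", "gender", "age_group"])
            .groupby(["date", "gender", "age_group"], observed=True)
            .size()
            .reset_index(name="count"))
print(result)
```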
Following this approach, locate the corresponding steps under the "Transform" and "Statistics" core objects and wire up the basic data-flow nodes, as shown:
The "Select values" step handles the time. On its Meta-data tab, the Date must be converted to String with the format set to yyyy-MM-dd; fields can be renamed at the same time. The step can also select, modify, and remove fields. As shown:
Note: if the time is not converted to String, a small experiment shows that what ends up stored is still a date carrying a time component. I got stuck on this during the experiment and wrongly suspected that Kettle does not support sorting on multiple keys (more than two), as shown below:
After checking with more experienced users, it was clear that Kettle certainly does support multi-key sorting, so a bug in Kettle itself was ruled out. For a beginner, not knowing the rules Kettle expects you to follow is fatal; all I could do was test the other suspects one by one. I suspected the order of the sort keys, but testing showed that was not the root cause. The only preprocessing in the whole flow was on the date, and the errors pointed to one reasonable explanation: the rows were being sorted by the original date-time values from before the preprocessing. So I designed a separate experiment on the date: if the preprocessing takes effect, the output should match expectations.
- Date verification experiment
The input step, as shown:
Suppose the date type is not changed to String, as shown:
The output step, previewing the result, as shown:
The output step, writing to Excel, as shown:
The verification shows that the previewed data is not what actually gets stored in the output Excel; after converting to String, the two become consistent. This confirms again that Kettle's handling of date data leaves room for improvement.
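One way the symptom can be reproduced outside Kettle: if the date field keeps its time component, "one group per day" silently becomes "one group per timestamp". A minimal pandas demonstration (column names are illustrative):

```python
import pandas as pd

raw = pd.to_datetime(["2019-11-08 09:15:00", "2019-11-08 13:45:00"])
df = pd.DataFrame({"reg": raw, "n": [1, 1]})

# Grouping on the raw datetime: every distinct timestamp is its own group.
by_datetime = df.groupby("reg")["n"].sum()

# Converting to a yyyy-MM-dd string first (as in the "Select values" step)
# collapses both rows into the same day.
df["day"] = df["reg"].dt.strftime("%Y-%m-%d")
by_day = df.groupby("day")["n"].sum()

print(len(by_datetime), len(by_day))  # 2 groups vs 1 group
```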
The "Number range" step handles age; the classification criteria are self-defined (the definition below may have flaws), as shown:
The "Sort rows" step sorts by date, gender, and age, as the generated data requires, as shown:
The "Group by" step computes the grouped statistics, as shown:
3) Run it; the results are shown below:
4. A brief account of experiment 2
The analysis for data set 2 gives the following points to handle:
(1) set the registration date-time to String, because the hour cannot be extracted directly from the pre-formatted date, so it has to be cut out of the string;
(2) cut the date and ID card strings, extracting the date and the province code (the first two digits of the ID card number) respectively;
(3) sort by the fields to be grouped;
(4) apply value mappings for the province and the hour;
(5) compute the grouped statistics.
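These steps can likewise be sketched in pandas. The sample rows are fabricated, and the two-entry province mapping is an illustrative subset (a real "Value mapper" configuration would cover all province codes):

```python
import pandas as pd

df = pd.DataFrame({
    "id_card": ["110101199001011234", "440301198512054321", "110101197511110000"],
    "reg_datetime": ["2019-11-08 09:15:00", "2019-11-08 09:40:00", "2019-11-09 17:20:00"],
})

# Treat the datetime as a string and cut out the hour; cut the first two
# digits of the ID card as the province code (the "Strings cut" step).
df["hour"] = df["reg_datetime"].str.slice(11, 13)
df["province_code"] = df["id_card"].str.slice(0, 2)

# Value mapping: province code -> province name (the "Value mapper" step).
province_map = {"11": "Beijing", "44": "Guangdong"}  # illustrative subset
df["province"] = df["province_code"].map(province_map)

# Sort by the grouping keys, then count visits per (province, hour).
result = (df.sort_values(["province", "hour"])
            .groupby(["province", "hour"])
            .size()
            .reset_index(name="visits"))
print(result)
```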
The overall data-flow design is shown below:
The "Strings cut" step is configured as follows:
The "province value mapper" and "hour value mapper" steps are configured as follows:
The run results are shown below:
Third, summary
Through this experiment I gained an initial understanding of what a powerful ETL tool Kettle is. To acquire more knowledge you have to experiment more, and you learn more from reflecting on mistakes than from successes. A tool can only be mastered through plenty of experiments, which confirms the classic phrase: "practice makes perfect."