[Java] Kettle in practice: starting from one experiment

First, install Kettle

1, quick installation of Kettle

This was my first contact with Kettle (before this I had only heard of it). After fumbling for a few days, the source installation on the Mac failed, so I switched to the quick installation. The commands to install the latest version of Kettle on the Mac and start it successfully are as follows:

☁  ~  brew install kettle
☁  ~  cd /usr/local/Cellar/kettle/8.2.0.0-342/
☁  8.2.0.0-342  cd libexec
☁  libexec  spoon.sh

2, trying to install Kettle from source

git clone https://github.com/pentaho/pentaho-kettle
# or
git clone git@github.com:pentaho/pentaho-kettle.git
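  • Set up settings.xml

Put the settings.xml file (see: settings.xml) in your Maven user directory, ~/.m2. Maven only reads a file with exactly that name, so a file saved as setting.xml (as in the listing below) is ignored.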

☁  pentaho-kettle [master] ⚡  ll /Users/zhangbocheng/.m2
total 8
drwxr-xr-x  97 zhangbocheng  staff  3104 11  8 17:28 repository
-rw-r--r--   1 zhangbocheng  staff  2345 11  8 20:10 setting.xml
  • Installation
☁  pentaho-kettle [master] mvn clean install >> /Users/zhangbocheng/Desktop/kettle.log
  • About the error log

The error reported when settings.xml has not been set up:

.....................................
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 47:49 min
[INFO] Finished at: 2019-11-08T17:44:01+08:00
[INFO] Final Memory: 230M/985M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project pdi-ce: Could not resolve dependencies for project org.pentaho.di:pdi-ce:pom:9.0.0.0-SNAPSHOT: Could not transfer artifact org.hitachivantara.karaf.assemblies:client:zip:9.0.0.0-20191107.125717-160 from/to pentaho-public (http://nexus.pentaho.org/content/groups/omni/): Failed to transfer file http://nexus.pentaho.org/content/groups/omni/org/hitachivantara/karaf/assemblies/client/9.0.0.0-SNAPSHOT/client-9.0.0.0-20191107.125717-160.zip with status code 502 -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :pdi-ce

After setting up settings.xml, the build simply kept waiting.

Second, the test case

This course experiment was the first time I had to set up Kettle myself. It turned out to be one of the more interesting opportunities for engineering practice: spend as little time as possible getting to know Kettle, one of the more popular and powerful ETL tools.

1, about the experiment task

Task description: use Kettle to complete the following experiments and store the results in MySQL (or CSV). Given: an Excel file containing the columns (name, age, ID card number, gender, registration date and time, outpatient number) and a number of data rows.

Generated data 1: contains the columns (date, gender, children / youth / middle-aged / elderly, number of people), where the age ranges for children / youth / middle-aged / elderly are defined by yourself;
Generated data 2: contains the columns (province, hour, number of visits)

Since this was my first contact with Kettle, I kept things simple and used Excel for both input and output. First I fabricated some data according to the task requirements, as shown below:
Metadata
Excel field description:

Name: string
Age: integer
ID card number: string
Gender: string
Registration date and time: datetime
Outpatient number: integer

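Go to the installation directory /usr/local/Cellar/kettle/8.2.0.0-342/libexec and start Kettle:
welcome
According to the experiment requirements, the problems involved are really just input, output, and transformation (grouped statistics). Before creating the task, it is worth searching Baidu or Google to see how input and output are implemented in Kettle.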

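2, warm-up example

The easiest case to implement is generating random numbers and storing them in a txt file.

1) Create a new transformation and save it as test_random (with the suffix .ktr). Using drag and drop, pull the "Generate random number" and "Text file output" steps from Core Objects -> Input and Output respectively. Then click "Generate random number", hold down the Shift key, and point the mouse at "Text file output" to create an arrow that indicates the direction of the data flow. As shown below: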
test_random
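2) Edit the input stream, that is, the "Generate random number" step, as shown:
Generates a random number
The supported random data types are:
Random data types
3) Then edit the output stream, that is, the "Text file output" step, as shown:
Text file output
The output file name can be previewed by clicking the "Show file name..." button in the figure:
Displays the file name
4) Finally, run it and look at the results.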
log
preview_data
text
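
For comparison, the same warm-up can be sketched outside Kettle in a few lines of Python. This only illustrates what the two steps do together, not how Kettle implements them; the function name, field names, separator, row count, and output location are all assumptions made for the example.

```python
import csv
import random
import tempfile
from pathlib import Path


def write_random_rows(path, row_count=10, seed=None):
    """Write row_count rows of random values to a delimited text file."""
    rng = random.Random(seed)  # the seed only makes the sketch repeatable
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=";")
        writer.writerow(["random_number", "random_integer"])  # header line
        for _ in range(row_count):
            # One float in [0, 1) and one non-negative integer per row
            writer.writerow([rng.random(), rng.randint(0, 2**31 - 1)])
    return path


if __name__ == "__main__":
    # Hypothetical output location in the system temp directory
    target = Path(tempfile.gettempdir()) / "test_random.txt"
    write_random_rows(target, row_count=5, seed=42)
    print(target.read_text(encoding="utf-8"))
```

The "input" here is the random generator and the "output" is the csv writer, which mirrors the arrow drawn between the two steps in the transformation.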

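3, experiment steps

The simple experiment above covered the basic operations on input and output streams, so now we can get to the main topic.

1) Change both the input and the output of the experiment above to Excel. The relevant configuration is as follows:

Excel input:

On the Files tab, set the spreadsheet type to match the actual file (xls or xlsx). Next to "File or directory", click "Browse" to select your source data file, then click "Add";

On the Sheets tab, click "Get sheet names..." to add the worksheet, that is, the sheet in Excel;

On the Fields tab, click "Get fields from header row..." to fetch the fields automatically. Because integer data in the original Excel file turns into floating-point data when it is read in, the field types need to be changed, as shown: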
Field configuration
Finally, you can preview the data.
Preview data
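
The integer-to-float issue is not specific to Kettle: spreadsheet cells that hold whole numbers are often handed over as floating-point values. As a rough illustration, assuming the age and outpatient number arrive as values like 23.0, a small hypothetical helper can convert them back and refuse anything that is not a whole number:

```python
def to_integer(value):
    """Convert a whole-number value such as 23.0 or "23.0" to int.

    Raises ValueError for values with a fractional part, so that bad
    data is reported instead of being silently truncated.
    """
    number = float(value)
    if not number.is_integer():
        raise ValueError(f"not a whole number: {value!r}")
    return int(number)


if __name__ == "__main__":
    # Made-up cell values as they might arrive from a spreadsheet reader
    print([to_integer(cell) for cell in (23.0, 1001.0, "45.0")])
```

In Kettle itself the fix described above is simply to set the field type to Integer on the Fields tab.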
Excel output: only the output file name needs to be configured; everything else is left at its default.
Excel output
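2) What comes next is the core step, namely the transformation. Start by analyzing generated data 1. Because grouping in Kettle requires the rows to be sorted first, the points to handle are:

(1) truncate the registration date and time to the day;

(2) convert the age according to some standard (defined by yourself);

(3) sort by the fields to be grouped;

(4) compute the grouped statistics.

Following this approach, find the corresponding components under the "Transform" and "Statistics" core objects and complete the basic configuration of the data flow nodes, as shown: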
Data flow nodes
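In the "Select values" component, process the time. On the Meta-data tab, convert the Date to a String with the format set to yyyy-MM-dd; the field can be renamed at the same time. Fields can also be selected, modified, and removed here. As shown: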
Processing time
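Note that if the time is not set to String here, a small experiment shows that what is finally stored is still the date with its time part. I got stuck on this hurdle during the experiment and wrongly assumed that Kettle does not support sorting on multiple keys (more than two), as shown below: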
ERROR1
error2
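After checking with several more experienced people, I was convinced that Kettle could not possibly lack support for multi-key sorting, which ruled out a bug in Kettle itself. For a beginner, not knowing the rules that Kettle expects you to follow is fatal, so all I could do was test the other suspected causes one by one. Along the way I suspected the order of the sort keys, but tests showed that this was not the root cause. The only field that had been preprocessed in the whole flow was the date, and the errors showed that the only reasonable explanation for the wrong order was that the date was still being sorted by its original datetime value from before the preprocessing. So I designed a separate experiment for the date: if the preprocessing takes effect, the output should match the expected result.

  • Date verification experiment

Input stream, as shown:
Date input stream
Suppose the date type is not changed to String, as shown:
Date
Output stream, result preview, as shown:
Result Preview
Output stream, Excel output, as shown:
Excel output
The verification experiment showed that the previewed data was not what ended up in the output Excel file; after converting to String, the two matched. A likely explanation is that a format on a Date field only changes how the value is displayed, while the underlying value still carries the time. Either way, it confirmed once more that Kettle's handling of date data leaves room for improvement.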
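
The effect can be reproduced outside Kettle. The following Python sketch is an analogy rather than Kettle's actual implementation: it assumes that showing a date value with a day-only format does not change the value itself, and the sample timestamps are made up.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical registration timestamps: two on one day, one on the next
registrations = [
    datetime(2019, 11, 8, 9, 15),
    datetime(2019, 11, 8, 17, 40),
    datetime(2019, 11, 9, 8, 5),
]


def count_groups(keys):
    """Sort the keys, then count runs of equal keys (a sorted group-by)."""
    return [(key, len(list(run))) for key, run in groupby(sorted(keys))]


# Grouping on the datetime itself: the time part still separates the rows,
# even though each key would look like the same day with a day-only format.
by_datetime = count_groups(registrations)

# Grouping on the formatted string: the time part is really gone.
by_day_string = count_groups(d.strftime("%Y-%m-%d") for d in registrations)

print(len(by_datetime))   # 3 groups, one per row
print(by_day_string)      # [('2019-11-08', 2), ('2019-11-09', 1)]
```

Only the string version collapses the two visits on the same day into one group, which matches what the experiment showed after the field was converted to String.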

In the "Number range" component, process the age by defining your own classification criteria (the definition below may have flaws), as shown:
Age treatment
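
What such a range lookup does can be sketched in Python with the standard bisect module. The boundaries and labels below are hypothetical (the values actually used in the experiment are only visible in the screenshot), and the sketch assumes half-open ranges in which each lower bound is included and each upper bound is excluded.

```python
from bisect import bisect_right

# Hypothetical boundaries: each value is the first age of the next group
UPPER_BOUNDS = [15, 36, 60]
LABELS = ["children", "youth", "middle-aged", "elderly"]


def age_group(age):
    """Map an age to a label using half-open ranges [lower, upper)."""
    if age < 0:
        raise ValueError(f"age must be non-negative, got {age}")
    # bisect_right counts how many boundaries are <= age,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(UPPER_BOUNDS, age)]


if __name__ == "__main__":
    print([age_group(a) for a in (3, 15, 40, 72)])
```

Keeping the boundaries in one sorted list makes gaps and overlaps between the ranges impossible, which is the kind of defect a hand-written range table can easily contain.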
In the "Sort rows" component, following the requirements for generated data 1, the rows need to be sorted by date, gender, and age, as shown:
Sorting records
In the "Group by" component, compute the grouped statistics, as shown:
Group by
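
Why the sort has to come before the grouping can be shown with Python's itertools.groupby, which, like a group-by over a row stream, only merges rows that sit next to each other. This is a sketch by analogy; the sample rows and short gender codes are made up.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows after the date and age steps: (date, gender, age group)
rows = [
    ("2019-11-08", "F", "youth"),
    ("2019-11-09", "M", "elderly"),
    ("2019-11-08", "M", "youth"),
    ("2019-11-08", "F", "youth"),
]

key = itemgetter(0, 1, 2)  # the three grouping fields


def group_count(items):
    """Count runs of consecutive rows that share the same key."""
    return [(k, sum(1 for _ in run)) for k, run in groupby(items, key=key)]


# Without sorting, the ("2019-11-08", "F", "youth") key shows up twice.
unsorted_counts = group_count(rows)

# After sorting on the grouping fields, each key appears exactly once.
sorted_counts = group_count(sorted(rows, key=key))

print(unsorted_counts)
print(sorted_counts)
```

The unsorted run yields four lines with a duplicated key, while the sorted run yields three lines with the correct count of 2 for the repeated combination.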
3) Run it; the results are shown below:
operation result
Excel output

4, a brief description of experiment 2

For generated data 2, the points that need to be handled are:

(1) set the registration date and time to String; the hour cannot be extracted directly through the date format beforehand, so it has to be obtained by cutting the string;

(2) cut the date string and the ID card number string to extract the hour and the province code (the first two digits of the ID card number) respectively;

(3) sort by the fields to be grouped;

(4) apply value mapping to the time period and the province;

(5) compute the grouped statistics.
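
The steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions: the datetime string has the layout yyyy-MM-dd HH:mm:ss, the lookup table covers only three province codes, the ID numbers are made up, and the mapping from hours to time periods is left out because its settings are only visible in the screenshots. A Counter stands in for Kettle's sort-then-group steps.

```python
from collections import Counter

# Hypothetical lookup for a few province codes (first two ID digits)
PROVINCE_BY_CODE = {"11": "Beijing", "31": "Shanghai", "44": "Guangdong"}


def visits_by_province_and_hour(records):
    """Count visits per (province, hour) from (id_number, datetime_text) pairs."""
    counts = Counter()
    for id_number, registered_at in records:
        # String cut 1: the first two digits of the ID give the province code
        province = PROVINCE_BY_CODE.get(id_number[:2], "unknown")
        # String cut 2: characters 11-12 of "yyyy-MM-dd HH:mm:ss" are the hour
        hour = registered_at[11:13]
        counts[(province, hour)] += 1
    return sorted(counts.items())


if __name__ == "__main__":
    # Made-up sample records
    records = [
        ("110101199001010011", "2019-11-08 09:15:00"),
        ("110101198505050022", "2019-11-08 09:40:00"),
        ("440101197003030033", "2019-11-08 14:05:00"),
    ]
    print(visits_by_province_and_hour(records))
```

Unknown province codes fall back to "unknown" instead of raising an error, which is one possible choice for rows that a value mapping does not cover.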

The overall design of the data flow is shown below:
The overall design data flow diagram
The "Strings cut" component is configured as follows:
Cut the string
The "province value map" and "time value map" components are configured as follows:
Province value map
Time value map
The run results are shown below:
Run results: preview data
Run results: Excel

Third, the summary

Through this experiment I gained an initial understanding of how powerful Kettle is as an ETL tool. To learn more you have to experiment more, and you gain more from reflecting on your mistakes than from your successes. As with any tool, only a lot of experimenting lets you master it, which confirms the classic saying: "practice makes perfect."

Origin www.cnblogs.com/zhangbc/p/11841340.html