1. ETL concept
ETL is short for Extract-Transform-Load, that is, the process of extracting, transforming, and loading data. The term is most often used in connection with data warehouses, but ETL is not limited to them.
ETL is an important part of building a data warehouse: the data the user needs is extracted from the data sources, cleaned, and finally loaded into the data warehouse according to a predefined warehouse model.
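The three stages can be illustrated without Kettle. The following is only a minimal sketch using standard shell tools; the file names, the column layout, and the cleaning rule (trim spaces, drop rows with an empty city) are assumptions made up for the illustration.

```shell
#!/bin/sh
# Minimal extract-transform-load sketch using only standard tools.
# All file names and the column layout (id,name,city) are hypothetical.
workdir=$(mktemp -d)

# Source data with some dirty rows.
cat > "$workdir/source.csv" <<'EOF'
id,name,city
1, Alice ,Paris
2,Bob,
3,Carol,Berlin
EOF

# Extract: skip the header row.
# Transform: trim spaces around each field, drop rows with an empty city.
# Load: write the cleaned rows to the target file.
tail -n +2 "$workdir/source.csv" |
  awk -F, 'BEGIN { OFS = "," }
    {
      for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
      if ($3 != "") print $1, $2, $3
    }' > "$workdir/warehouse.csv"

cat "$workdir/warehouse.csv"
```

With the sample data, the row for Bob is dropped and the other two rows are written without the stray spaces.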
Kettle (official name: Pentaho Data Integration) is an open source ETL tool written in Java. It has an easy-to-learn graphical design interface in which data flows are assembled as workflows; once you are familiar with it, it can save a lot of development effort and improve productivity. Apart from the commercial DataStage, it is arguably the best tool available.
Kettle lets you manage data from different sources, including different databases, Excel/CSV and other files, email, and crawled website content. Besides extracting and converting data, it also supports file operations and sending email. Transformations (Trans) and workflows (Jobs) are designed through its graphical interface.
Kettle has two kinds of script files, transformation and job:
- Transformation: performs the basic data conversion; abbreviated as Trans below.
- Job: controls the entire workflow. A job is a collection of transformation operations and is used to implement more complex logic.
Kettle family: Spoon, Pan, Kitchen.
- Spoon (graphical interface): used to design ETL transformations (Trans) and workflows (Jobs).
- Pan (background batch runner): runs the transformations (Trans) designed in Spoon in batch.
- Kitchen (background batch runner): runs the workflows (Jobs) designed in Spoon in batch.
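The split between Pan and Kitchen can be sketched as a small helper that picks the runner from the file type. This is only an illustration: it assumes transformations are saved as .ktr files and jobs as .kjb files, `kettle_cmd` is a hypothetical name, and the function prints the command line instead of running it, so Kettle does not need to be installed.

```shell
#!/bin/sh
# Print the command line that would run a Kettle file in batch mode.
# Assumption: transformations are saved as .ktr and jobs as .kjb.
kettle_cmd() {
  case "$1" in
    *.ktr) echo "pan.sh -file=$1 -level=Basic" ;;      # Trans -> Pan
    *.kjb) echo "kitchen.sh -file=$1 -level=Basic" ;;  # Job -> Kitchen
    *) echo "unknown Kettle file type: $1" >&2; return 1 ;;
  esac
}

kettle_cmd /path/demo.ktr
kettle_cmd /path/demo.kjb
```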
2. Installation
Installation: see "Kettle Learning Series: downloading, installing, and first use of Kettle (Windows platform)"
Usage: see "Kettle Getting Started Tutorial"
3. Practical use
3.1 Creating a new Trans (transformation)
A transformation is Kettle's most basic task; it defines how data is converted. The shortcut for creating a new transformation is Ctrl-N. The left side of the interface is the object area, where you select the components needed for the transformation; the right side is the workspace. Simply drag components to the right to assemble them.
This demonstration only converts MySQL data to a text file, so three components are enough: Input > Table input, Transform > Select values (field selection), and Output > Text file output.
Select a component, hold down Shift, and drag the mouse to the target component to connect the two.
Each component needs to be configured; double-click a component to open its editing dialog.
- Editing the Table input step
- Editing the Select values (field selection) step
  Using a field selection step decouples the input and output tables.
- Editing the Text file output step
- Running the transformation
3.2 Creating a Job
A job is a logical collection that contains transformations and other components.
3.3 Extracting variables
Every transformation and job eventually runs on site, but during development it is easy to hard-code many specific IP addresses, paths, and so on, which differ between environments. Extracting them into variables is the best choice.
Simply write these parameters into Kettle's configuration file (kettle.properties). On Windows, this file is in the .kettle folder under the user directory; on Linux, it is in the .kettle directory under the user's home directory.
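As a rough sketch of what such a file might look like, the script below writes a kettle.properties into a temporary stand-in for the home directory and reads a value back. The variable names DB_HOST, DB_PORT, and OUTPUT_DIR, their values, and the `get_prop` helper are all hypothetical. Inside a Trans or Job, a variable defined this way is normally referenced as `${DB_HOST}` in fields that accept variables.

```shell
#!/bin/sh
# Sketch: keep environment-specific values in kettle.properties.
# A temporary directory stands in for the real home directory,
# so an existing ~/.kettle/kettle.properties is never overwritten.
# DB_HOST, DB_PORT and OUTPUT_DIR are hypothetical variable names.
demo_home=$(mktemp -d)
mkdir -p "$demo_home/.kettle"

cat > "$demo_home/.kettle/kettle.properties" <<'EOF'
# Values for this environment only
DB_HOST=192.168.1.10
DB_PORT=3306
OUTPUT_DIR=/data/etl/out
EOF

# Look up one property by name, ignoring comment lines.
get_prop() {
  grep -v '^#' "$demo_home/.kettle/kettle.properties" |
    grep "^$1=" | cut -d= -f2-
}

get_prop DB_HOST
get_prop OUTPUT_DIR
```

Moving to another environment then only means editing this one file; the Trans and Job files stay unchanged.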
3.4 Database Connection Sharing
The previous steps create a new database connection, but such a connection belongs exclusively to one Trans. Every new Trans has to set up the connection again, which is tedious. Kettle provides a sharing feature that makes a database connection available to other transformations and jobs.
In the main object tree, under "DB connections", select the connection you want to share, right-click it, choose Share, and you're done. A shared database connection is displayed in bold:
Shared database connections are actually stored in a file: the .kettle directory also contains a shared.xml file, which holds the information on the DB connections that have been shared.
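To check which connections are shared, the names can be pulled out of shared.xml with standard tools. The sketch below works on a simplified sample file that it creates itself; the layout (one `<connection>` element per shared connection, each with a `<name>` child) and the connection names are assumptions, so compare it with a real shared.xml before relying on it.

```shell
#!/bin/sh
# Sketch: list the names of shared connections.
# The sample file is a simplified, assumed layout of shared.xml;
# mysql_demo and oracle_demo are hypothetical connection names.
demo_dir=$(mktemp -d)

cat > "$demo_dir/shared.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<sharedobjects>
  <connection>
    <name>mysql_demo</name>
    <server>localhost</server>
  </connection>
  <connection>
    <name>oracle_demo</name>
    <server>dbhost</server>
  </connection>
</sharedobjects>
EOF

# Print the text between <name> and </name> on each matching line.
list_shared_connections() {
  sed -n 's:.*<name>\(.*\)</name>.*:\1:p' "$1"
}

list_shared_connections "$demo_dir/shared.xml"
```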
3.5 Deployment
As just seen, each new Trans and Job is a separate file. These files can be executed directly on site:
- Deploy the Kettle package on the on-site ETL machine and add the directory containing pan and kitchen to the PATH variable;
- Copy the configuration file and the shared.xml file from the .kettle directory;
- Place the prepared Trans and Job files in the specified directory on the ETL machine;
- Run the kitchen command to run the Job.
- The following is an example of a kitchen command:
kitchen.sh -file=/path/demo.kjb -level=Minimal
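For unattended runs, for example from a scheduler, the kitchen call is usually wrapped so that a failure is noticed. The following is a minimal sketch that assumes kitchen exits with status 0 on success and non-zero on failure; `run_job` and the `KITCHEN` variable are hypothetical names, and the job path is the placeholder used above.

```shell
#!/bin/sh
# Sketch of a wrapper that runs a job and reports failure by exit code.
# KITCHEN defaults to kitchen.sh on the PATH and can be overridden.
KITCHEN=${KITCHEN:-kitchen.sh}

run_job() {
  job_file=$1
  if "$KITCHEN" -file="$job_file" -level=Minimal; then
    echo "job succeeded: $job_file"
  else
    status=$?   # exit status of the kitchen call
    echo "job failed with exit code $status: $job_file" >&2
    return "$status"
  fi
}

# Example call (requires Kettle to be installed):
# run_job /path/demo.kjb
```

Passing the exit status through lets a scheduler or a calling script decide whether to retry or raise an alert.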