Kettle: a free, open source ETL tool

1. ETL concept

ETL is short for Extract-Transform-Load, i.e. the process of extracting, transforming, and loading data. The term is most commonly used in data warehousing, but its use is not limited to data warehouses.
ETL is an important part of building a data warehouse: the required data is extracted from the source systems, cleaned, and then loaded into the data warehouse according to a predefined warehouse model.
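
As a rough illustration of the extract, clean, and load steps described above, here is a minimal Python sketch using only the standard library. It is not how Kettle works internally: the table names (`src_users`, `dw_users`), the columns, and the cleaning rule are all invented for the example, and in-memory SQLite databases stand in for the real source system and warehouse.

```python
import sqlite3


def extract(source):
    """Extract: read the raw rows from the (hypothetical) source table."""
    return source.execute("SELECT id, name, email FROM src_users").fetchall()


def transform(rows):
    """Transform: trim names, lowercase emails, drop rows without an email."""
    cleaned = []
    for user_id, name, email in rows:
        if not email or not email.strip():
            continue  # cleaning rule assumed for this example
        cleaned.append((user_id, name.strip(), email.strip().lower()))
    return cleaned


def load(warehouse, rows):
    """Load: write the cleaned rows into the predefined warehouse table."""
    warehouse.executemany("INSERT INTO dw_users VALUES (?, ?, ?)", rows)
    warehouse.commit()


def run_demo():
    # Source system with three raw rows, one of which has no email.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE src_users (id INTEGER, name TEXT, email TEXT)")
    source.executemany(
        "INSERT INTO src_users VALUES (?, ?, ?)",
        [(1, " Ann ", "ANN@Example.com"), (2, "Bob", None), (3, "Cy", " cy@example.com ")],
    )
    # Warehouse with a table that follows the predefined model.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE dw_users (id INTEGER, name TEXT, email TEXT)")
    load(warehouse, transform(extract(source)))
    return warehouse.execute("SELECT id, name, email FROM dw_users ORDER BY id").fetchall()


if __name__ == "__main__":
    print(run_demo())
```
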
Kettle (official name: Pentaho Data Integration) is an open source ETL tool written in Java. It has an easy-to-learn graphical design interface in which data transfer is laid out as a workflow; once you are familiar with it, it can save a lot of development effort and improve productivity. It is arguably the best ETL tool apart from the commercial DataStage.
Kettle lets you manage data from different sources, including various databases, Excel/CSV and other files, email, and crawled website data. Besides extracting and transforming data, it also supports file operations, sending email, and more. Transformations (Trans) and workflow tasks (Jobs) are designed through the graphical interface.

Kettle has two kinds of script files: transformations and jobs.

  • Transformation: performs the basic data conversion; abbreviated as Trans below.
  • Job: controls the entire workflow. A job is a collection of transformations and other operations, used to implement more complex logic.

Kettle family: Spoon, Pan, Kitchen.

  • Spoon (graphical interface): used to design ETL transformations (Trans) and workflows (Jobs).
  • Pan (background batch runner): runs the transformations (Trans) designed in Spoon in batch mode.
  • Kitchen (background batch runner): runs the workflows (Jobs) designed in Spoon in batch mode.
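
The division of labor between Pan and Kitchen can be sketched as a small Python helper that picks the runner from the file extension. The `-file` and `-level` options are the ones used in the kitchen example in section 3.5. The helper name `build_command` and the paths are hypothetical, and the sketch assumes the usual convention that transformations are saved as `.ktr` files and jobs as `.kjb` files. It only builds the command line; it does not run anything.

```python
from pathlib import PurePosixPath

# Assumed mapping: Spoon saves transformations as .ktr and jobs as .kjb.
RUNNERS = {".ktr": "pan.sh", ".kjb": "kitchen.sh"}


def build_command(script_path, level="Basic"):
    """Return the argv list that would run a Trans with Pan or a Job with Kitchen."""
    suffix = PurePosixPath(script_path).suffix.lower()
    if suffix not in RUNNERS:
        raise ValueError(f"not a Kettle script: {script_path}")
    return [RUNNERS[suffix], f"-file={script_path}", f"-level={level}"]


if __name__ == "__main__":
    print(" ".join(build_command("/path/demo.ktr")))
    print(" ".join(build_command("/path/demo.kjb", level="Minimal")))
```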

2. Installation

Installation: see "Kettle Learning Series: download, installation, and first use of Kettle (Windows platform)".
Usage: see "Kettle Getting Started Tutorial".

3. Practical use

3.1 Create a new Trans (transformation)

The transformation is Kettle's most basic task; it defines how data is converted. The shortcut for creating a new transformation is Ctrl-N. The left side of the interface is the object area, where you pick the components needed for the transformation; the right side is the workspace. Simply drag components to the right to assemble them.
This demonstration only converts a MySQL table to a text file, which needs just three components: Input > Table input, Transform > Select values, and Output > Text file output.
To connect two components, select the first one, hold down Shift, and drag the mouse to the target component.
Each component needs to be configured; double-click a component to open its editing dialog.

  • Edit the table input
  • Edit the field selection
    Using a field selection step decouples the input table from the output
  • Edit the text file output
  • Run the transformation
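
The flow of the three steps (table input, field selection, text file output) can be approximated outside Kettle as follows. This is a sketch only: rows are plain dictionaries instead of a MySQL result set, and the field names, the `FIELD_MAP` mapping, and the `;` separator are assumptions made for the example. It shows why the field selection decouples the two ends: the output side only sees the names listed in the mapping.

```python
import csv
import io

# Hypothetical mapping for the field selection step: source column -> output field.
FIELD_MAP = {"user_id": "id", "user_name": "name"}


def table_input():
    """Stand-in for the table input step; a real Trans would query MySQL here."""
    yield {"user_id": 1, "user_name": "ann", "password_hash": "x1"}
    yield {"user_id": 2, "user_name": "bob", "password_hash": "x2"}


def select_fields(rows, field_map):
    """Keep only the mapped columns and rename them for the output side."""
    for row in rows:
        yield {new: row[old] for old, new in field_map.items()}


def text_file_output(rows, field_map, stream):
    """Write a header line and one delimited line per row."""
    writer = csv.DictWriter(stream, fieldnames=list(field_map.values()),
                            delimiter=";", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def run_demo():
    # An in-memory buffer replaces the output text file.
    buffer = io.StringIO()
    text_file_output(select_fields(table_input(), FIELD_MAP), FIELD_MAP, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(run_demo(), end="")
```

If a source column is renamed, only `FIELD_MAP` has to change; the output step keeps writing the same field names.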

3.2 Create a Job

A job is a logical collection that contains transformations and other components.

3.3 Extract variables

Every transformation and job eventually has to run on site, but while building them it is easy to hard-code specific IP addresses, paths, and so on, which differ between environments. Extracting these values into variables is the best choice.

You only need to write these parameters in Kettle's configuration file (kettle.properties). On Windows, this file is in the .kettle folder under the user directory; on Linux, it is in the .kettle directory under the user's home directory.
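
As an illustration, a kettle.properties file is a list of `KEY=value` lines, and a Trans or Job refers to a value as `${KEY}`. The Python sketch below mimics that substitution to show the idea. The variable names (`DB_HOST`, `DB_PORT`, `OUTPUT_DIR`) and their values are invented, and the parser handles only simple `KEY=value` lines, not the full Java properties format.

```python
import re

# Invented example content for ~/.kettle/kettle.properties.
SAMPLE_PROPERTIES = """\
# environment-specific settings
DB_HOST=192.168.1.10
DB_PORT=3306
OUTPUT_DIR=/data/etl/out
"""


def parse_properties(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def substitute(template, values):
    """Replace each ${KEY} with its value; unknown keys are left untouched."""
    return re.sub(r"\$\{(\w+)\}", lambda m: values.get(m.group(1), m.group(0)), template)


if __name__ == "__main__":
    props = parse_properties(SAMPLE_PROPERTIES)
    print(substitute("jdbc:mysql://${DB_HOST}:${DB_PORT}/demo", props))
    print(substitute("${OUTPUT_DIR}/users.txt", props))
```

Moving to another environment then means editing kettle.properties only, while the Trans and Job files stay unchanged.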

3.4 Database Connection Sharing

The previous steps created new database connections, but each of those connections belongs to a single Trans, so every new Trans has to set up its connection again, which is tedious. Kettle provides a sharing feature that lets you share database connections.

In the main object tree, under "DB Connection", select the connection you want to share, right-click it, and choose Share. Shared database connections are displayed in bold.
The database connections are actually stored in a file: the same .kettle directory contains a shared.xml file, which holds the information for the shared DB connections.
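
To check which connections have been shared, shared.xml can be read with any XML parser. The Python sketch below is an example under assumptions: the element names in the sample (`sharedobjects`, `connection`, `name`, `type`, `server`, and so on) reflect a simplified, typical file and may differ between Kettle versions, and the connection values are invented. Compare it with your own shared.xml before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified shape of shared.xml; a real file holds more elements.
SAMPLE_SHARED_XML = """\
<sharedobjects>
  <connection>
    <name>mysql_demo</name>
    <server>192.168.1.10</server>
    <type>MYSQL</type>
    <database>demo</database>
    <port>3306</port>
  </connection>
</sharedobjects>
"""


def list_shared_connections(xml_text):
    """Return (name, type, server) for every <connection> element."""
    root = ET.fromstring(xml_text)
    return [
        (conn.findtext("name"), conn.findtext("type"), conn.findtext("server"))
        for conn in root.iter("connection")
    ]


if __name__ == "__main__":
    for name, db_type, server in list_shared_connections(SAMPLE_SHARED_XML):
        print(f"{name}: {db_type} on {server}")
```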

3.5 Deployment

As shown above, each new Trans and Job is a separate file. These files can be executed directly on site:

  • Deploy the Kettle package on the on-site ETL machine and add the directory containing pan and kitchen to the PATH variable;
  • Copy the configuration file and shared.xml from the .kettle directory to the ETL machine;
  • Place the preconfigured Trans and Job files in the designated directory on the ETL machine;
  • Run the kitchen command to execute the Job.
  • The following is an example of a kitchen command: kitchen.sh -file=/path/demo.kjb -level=Minimal
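
The checklist above can be turned into a small pre-flight check to run before calling kitchen. The Python sketch below is an illustration, not part of Kettle: the function name `check_deployment` and the paths are made up, and it only verifies that the expected files exist and that `kitchen.sh` can be found on the PATH.

```python
import shutil
from pathlib import Path


def check_deployment(kettle_home, job_file, which=shutil.which):
    """Return a list of problems; an empty list means the job can be started."""
    problems = []
    # Step 1: the Kettle scripts must be reachable through PATH.
    if which("kitchen.sh") is None:
        problems.append("kitchen.sh is not on the PATH")
    # Step 2: the configuration files must be present in the .kettle directory.
    for name in ("kettle.properties", "shared.xml"):
        if not (Path(kettle_home) / name).is_file():
            problems.append(f"missing {name} in {kettle_home}")
    # Step 3: the job file must be in the expected place.
    if not Path(job_file).is_file():
        problems.append(f"missing job file {job_file}")
    return problems


if __name__ == "__main__":
    home = Path.home() / ".kettle"
    for problem in check_deployment(home, "/path/demo.kjb"):
        print("WARNING:", problem)
```

The `which` parameter exists only so the PATH lookup can be replaced in tests; by default it uses `shutil.which`.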

Origin blog.csdn.net/u013467442/article/details/89519789