ETL-Kettle study notes (getting started: introduction and simple operations)

Kettle: Introduction
ETL: Introduction
ETL (Extract-Transform-Load) refers to the process of extracting, transforming, and loading data. In business and industrial applications we often need to process, convert, and migrate all kinds of data, so understanding and mastering an ETL tool is essential. Kettle is a powerful ETL tool.
Kettle: concept

Kettle is an open-source ETL tool written in pure Java. It runs on Windows, Linux, and Unix, requires no installation ("green" software), and extracts data efficiently and stably.

Kettle's Chinese name is 水壶 (kettle); the project's lead programmer, Matt, wanted to put all kinds of data into one pot and then pour it out in a specified format.

Kettle is an ETL tool set that lets you manage data from different databases; through a graphical user environment you describe what you want to do, rather than how you want to do it.

Kettle has two kinds of script files: transformations (.ktr) and jobs (.kjb). A transformation performs the basic data conversion, while a job controls the entire workflow.

Kettle: four families (core components)

Chef (chef), Kitchen (kitchen), Spoon (spoon), Pan (pan)

Chef - job design tool (GUI).

Kitchen - job execution tool (command line).

Spoon - transformation design tool (GUI).

Pan - transformation execution tool (command line).

Difference between Job and Transformation: a Transformation focuses on data ETL, while a Job covers a broader scope; a job entry can be a Transformation, but it can also be mail, SQL, a shell script, FTP, or even another Job.

Kettle: conceptual model

Kettle execution is divided into two levels: Job and Transformation. The most important difference between the two levels lies in how data is passed and how they run.

1. Transformation: defines a container for data operations, in which data goes from input through processing to output. It can be understood as a container of finer granularity than a Job: we decompose a task into Jobs, then break each Job down into one or more Transformations, and each Transformation completes only part of the work.

2. Step: the smallest unit inside a Transformation; each Step performs a specific function.
3. Job: responsible for organizing Transformations together to complete a piece of work. We usually decompose a large task into several logically isolated Jobs; when all of these Jobs are finished, the task is complete.

4. Job Entry: a Job Entry is the execution unit inside a Job; each Job Entry performs a specific function, such as verifying that a table exists or sending mail. A Job can execute another Job or a Transformation, i.e. both Transformations and Jobs can be used as Job Entries.
5. Hop: connects Steps within a Transformation, or Job Entries within a Job; it is a graphical representation of the data flow.

In Kettle, the Job Entries in a Job are executed serially, so a Job must have a Start entry; the Steps in a Transformation are executed in parallel.

Kettle: directory files

Kettle: deployment
Install the JDK:

Because Kettle is developed in Java, the software depends on a Java runtime environment; you therefore need to install a JDK to prepare the runtime environment.

Configure environment variables:

JAVA_HOME: JDK installation directory

KETTLE_HOME: the directory where Kettle was unzipped
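
For example (Windows, the paths below are illustrative only; use your own installation directories):

JAVA_HOME = C:\Program Files\Java\jdk1.8.0_211
KETTLE_HOME = D:\data-integration
and append %JAVA_HOME%\bin (and optionally %KETTLE_HOME%) to the Path variable.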

Kettle: the graphical interface

Kettle: core concepts
Visual programming:

Kettle can be classified as a visual programming language (Visual Programming Language), because with Kettle you can define complex ETL programs and workflows graphically.

The graphs in Kettle are the transformations and jobs.

Visual programming has always been a core concept of Kettle; it allows complex ETL jobs to be built quickly and reduces maintenance effort. By hiding many technical details, it brings the IT world closer to the business world.

Transformation:

A transformation is the most important part of an ETL solution; it handles the extraction, transformation, and loading operations on rows of data.

A transformation contains one or more steps, such as reading a file, filtering data rows, cleaning data, or loading data into a database.

The steps in a transformation are connected by hops; a hop defines a one-way channel that allows data to flow from one step to another.

In Kettle, the unit of data is the row; the data stream is the movement of data rows from one step to the next.

The data stream is sometimes also called the record stream.

Step:
A step (control) is the basic building block of a transformation.

A step has the following key characteristics:
A step must have a name, and the name must be unique within the scope of the transformation.
Every step reads and writes data rows (the only exception is the "Generate Rows" step, which only writes data).
A step writes its output data to one or more outgoing hops, which pass it to the steps at their other end.
Most steps can have several outgoing hops. Data sent out of a step can be distributed or copied: with distribute, the target steps receive the records in turn; with copy, all records are sent to all target steps at the same time.
Hop:
A hop is an arrowed connection between steps; it defines the data path between those steps.

A hop is actually a data-row cache between two steps, called a row set (its size can be defined in the transformation settings).

When the row set is full, the step writing to it stops writing until the row set has free space again.

When the row set is empty, the step reading from it stops reading until the row set contains readable rows again.

Data rows - data types:
Data moves along the steps in the form of rows; a row is a collection of zero or more data fields. Field types include the following.

String: character data
Number: double-precision floating-point number
Integer: signed long integer (64-bit)
BigNumber: arbitrary-precision number
Date: date-time value with millisecond precision
Boolean: boolean value, true or false
Binary: binary field that can contain images, audio, video, and other binary data
Data rows - metadata:
Every step describes the fields in the rows it outputs; this description is the metadata of the data rows.

It includes the following information.

Name: the field name, which should be unique within the row
Data type: the data type of the field
Format: the way the data is displayed, for example #0.00 for a number
Length: the length of a String or BigNumber field
Precision: the decimal precision of a BigNumber field
Currency symbol: for example ¥
Decimal symbol: the decimal-point symbol of numeric data; it differs across locales, usually a period (.) or a comma (,)
Grouping symbol: the digit-grouping symbol of numeric data; it differs across locales, usually a comma (,), a period (.), or a single quote (')
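
For example (an illustrative value, not from the original notes): with a format mask of #,##0.00, a decimal symbol of "." and a grouping symbol of ",", the value 1234567.891 would be displayed as 1,234,567.89.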
Parallelism:
The row-set cache rule of hops allows each step to run as a separate thread, achieving the highest possible degree of concurrency. The rule also allows data to be processed in the stream with minimal memory consumption. In data warehousing we often deal with very large amounts of data, so concurrent processing with low memory consumption is a core requirement for an ETL tool.

For a Kettle transformation it is not possible to define an execution order, because all steps run concurrently: when the transformation starts, all steps start at the same time. Each step reads data from its input hops and writes the processed data to its output hops; when a step learns that there is no more input data, it stops running, and when all steps have stopped, the whole transformation stops. (The execution order must be distinguished from the direction of the data flow, because the steps run in parallel.)

Kettle input controls

(1) XML input (the "Get data from XML" control)
XML: XML is the extensible markup language; it is designed to transmit and store data. (To parse XML data we use XPath.)

XPath: XPath is the XML path language, a language used to locate parts of an XML document.

XPath is based on the tree structure of XML and provides the ability to find nodes in the tree.

XPath syntax:

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path or a series of steps.

Expression - Description:

nodename - selects all child nodes of the named node
/ - selects from the root node
// - selects nodes in the document, starting from the current node, that match the selection, no matter where they are
. - selects the current node
.. - selects the parent of the current node
@ - selects attributes

Example:

Path expression - Result:

bookstore - selects all child nodes of the bookstore element
/bookstore - selects the root element bookstore (note: if a path starts with a forward slash (/), it always represents an absolute path to an element)
bookstore/book - selects all book elements that are children of bookstore
//book - selects all book elements, no matter where they are in the document
bookstore//book - selects all book elements that are descendants of the bookstore element, no matter where they are under bookstore
//@lang - selects all attributes named lang
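
As a quick illustration, a minimal XML document (made up for these notes) against which the expressions above can be evaluated:

<bookstore>
  <book lang="en">
    <title>Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book lang="zh">
    <title>Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>

Here //book selects both book elements, bookstore/book selects them as children of bookstore, and //@lang selects the two lang attributes.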

Example:

Read the XML file with the "Get data from XML" input control.

Set the file address and the loop-read path (the XPath to loop over).

Configure the field parameters.

Output.

(2) JSON input
JSON (JavaScript Object Notation) is a lightweight data-interchange format

The core concepts of JSON: arrays, objects, and properties

Array: []

Object: {}

Properties: key: value

JSONPath:

JSONPath locates parts of a JSON document the way XPath does for XML; JSONPath expressions are typically used to search for or pick out paths in a JSON structure.

Its expressions can be written in "dot notation" or "bracket notation":

Dot notation: $.store.book[0].title

Bracket notation: $['store']['book'][0]['title']

JSONPath operators:

Symbol - Description:

$ - the root object; used to refer to the whole JSON document, whether it is an array or an object
@ - the current node being processed by a filter predicate, similar to this in Java
* - wildcard; can stand for a name or a number
.. - deep scan (recursive descent); can be used wherever a name is required
. - denotes a child node
['<name>' (, '<name>')] - denotes one or more named child nodes
[<number> (, <number>)] - denotes one or more array indexes
[start:end] - array slice over the interval [start, end); end is not included
[?(<expression>)] - filter expression; the expression must evaluate to a boolean
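
As a quick illustration (the sample data below is made up for these notes):

{
  "store": {
    "book": [
      { "title": "Everyday Italian", "price": 30.00 },
      { "title": "Learning XML", "price": 39.95 }
    ]
  }
}

Against this document, $.store.book[*].title returns both titles, $..book[0] returns the first book object, and $.store.book[?(@.price < 35)] returns only the first book.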

Example:

Get the .js file that stores the JSON data and add it under "Selected files".

The output field names can be defined freely, but the corresponding JSONPath must match.

Output the JSON data.

Output
Output is the second category inside a transformation; it corresponds to the L in ETL, i.e. Loading (loading the data).

(1) Table output
First add an Excel input and get the field information.

Create a database connection and get the table information.

Run it.

Transformation controls (key points)
Concat Fields (control): concatenates multiple fields into a new field.

Value Mapping (control): maps the values of one field to other values.

Add Constants (control): adds a constant column to the data stream; every row has the same value in this column.

Add Sequence (control): adds a sequence field to the data stream.

Field selection (control): selects fields from the data stream, renames them, or changes their data types.

Calculator (control): creates new fields from a set of built-in functions; a field can also be marked for removal (a temporary field).

Cut String (control): cuts out the specified positions of an input-stream field to generate a new field.

String Operations (control): trims the spaces at both ends of a string, switches case, and generates a new field.

String Replace (control): specifies the content to search for and whether to replace it; if the search content matches the input-stream field, a new field with the replacement is generated.

Remove Duplicates (control): removes identical rows from the data stream (sort the rows before performing this operation).

Sort Rows (control): sorts the data stream in ascending or descending order by the specified fields.

Unique Rows (HashSet) (control): removes duplicate rows from the data stream. (Note: Unique Rows (HashSet) has the same effect as Sort Rows plus Remove Duplicates, but the implementation principle is different.)

Split Fields (control): splits one field into two or more fields according to a separator.

Split Field to Rows (control): splits a field into multiple rows according to the specified separator.

Column-to-row (control): when a column contains the same values, converts multiple rows of data into one row according to the specified fields, removing some of the original column names and turning one column of data into fields. (Sort the data stream before this operation.)

Row-to-column (control): converts the values of a data field into field names, turning rows into columns.

Flattener (control): flattens multiple rows of the same group into a single row. Note: it can only be used when the data rows in the stream are in a consistent order; the data stream must be sorted first.

Kettle flow controls (key points)
Flow controls are mainly used to control the data flow and its direction.

Switch / Case (control): routes one data stream to multiple destinations.

Filter Rows (control): splits one data stream into two (like an IF statement in programming: a true branch and a false branch).

Dummy / no-op (control): serves as the end point of a data stream (performs no operation).

Abort (control): ends the data stream; if any data reaches it, an error is thrown (used when testing data).

Kettle query controls (key points)
Query controls are used to query a data source and merge the result into the main data stream.

HTTP Client (control): submits a GET request and retrieves the returned page content.

Database Lookup (control): performs a left join with a database table.

A database connection can perform two database queries, as well as a single-table table input.

Kettle script controls (key points)
Script controls perform complex operations directly through code.

JavaScript script
The JavaScript script control uses the JavaScript language to operate on the data stream through code.

JavaScript has many built-in functions, which can be viewed while writing JS code.

There are two different modes: compatibility mode and non-compatibility mode.

Non-compatibility mode: the default and recommended mode.

Compatibility mode: compatible with older versions of Kettle.

Getting a field value:
Non-compatibility mode:
MyVar = fieldName; (use the field name directly as a variable)

Compatibility mode: use different methods depending on the field type.

MyVar = fieldName.getString(); (string)

MyVar = fieldName.getValue(); (number)

Assigning a value to a field:

Non-compatibility mode: use the field name directly.

fieldName = MyVar;

Compatibility mode: use

        fieldName.setValue(MyVar);
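
For example, a minimal non-compatibility-mode snippet (the field names firstname and lastname are made up for illustration):

// non-compatibility mode: incoming fields are available directly as variables
var fullName = firstname + ' ' + lastname;
// to pass fullName on as a new field, declare it in the Fields table at the bottom of the step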

Java script
The Java script control uses the Java language to operate on the data stream through code.

Many built-in functions can be used.

Main:

The main function corresponds to the processRow() function; processRow() is where the processing logic for the data stream is placed.

SQL script (control): can execute SQL statements, for example an update statement used to update information in a table.

Job
Overview: most ETL projects require a variety of maintenance tasks to be performed.

For example, transferring files, or verifying that database tables exist, and so on. These operations must be executed in a certain order. Because transformations execute in parallel, a job, which executes serially, is needed to handle these operations.

A job consists of one or more job entries, which are executed in a certain order. The execution order is determined by the job hops between entries and by the execution result of each entry.

Job entry
A job entry is the basic building block of a job; like a transformation step, a job entry is also represented graphically by an icon.

However, if you look closely, you will still find some differences between job entries and steps:

Result objects can be passed between job entries. A result object contains data rows, but they are not passed as a data stream; instead, they are held until the current job entry has finished executing and are then passed to the next entry.

Job hop
A job hop is the connecting line between job entries. It defines the execution path of the job. The different execution results of each job entry determine the different execution paths of the job.

① Unconditional execution: the next job entry is executed whether the previous entry succeeded or failed. This is shown as a blue connecting line with a lock icon on it.

② Execute when the result is true: the next job entry is executed only when the previous entry's execution result is true. This is commonly used when the flow should continue only if there were no errors. It is shown as a green connecting line with a check-mark icon on it.

③ Execute when the result is false: the next job entry is executed only when the previous entry's result is false or it did not execute successfully. It is shown as a red connecting line with a red stop icon on it.

Parameters:
Parameter passing is a very important part of ETL, because it determines how business parameters are passed in and extracted.

Parameters come in two kinds: global parameters and local parameters.

Global parameters: defined in the kettle.properties file in the current user's .kettle folder.

They are defined as key=value pairs, for example: start_date=120;

Note: Kettle must be restarted before newly configured variables take effect.

Local parameters: set via the "Set Variables" and "Get Variables" steps.

Note: a variable set with "Set Variables" cannot be used immediately in the current transformation; it must be used in a subsequent step of the job.

Using parameters: Kettle references variables in two ways: (1) %%variable_name%% (2) ${variable_name}

Note: when using variables in SQL, the "Replace variables" option must be checked, otherwise the variables will not take effect.
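
A minimal sketch (the table and column names here are made up): with start_date defined in kettle.properties as above, a Table input step with the "Replace variables" option checked could use:

SELECT * FROM sales_fact WHERE day_id > ${start_date}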

Constant propagation:
Constant propagation means first defining constant data, then using ? placeholders in the Table input SQL statement in place of the constants.

The order of the ? placeholders follows the order in which the constants are passed.
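
A minimal sketch (names are illustrative): if a previous step supplies a constant value 2020 and is selected as the "Insert data from step" of the Table input, the SQL can be written as:

SELECT * FROM orders WHERE order_year = ?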

Named parameters of a transformation:
Named parameters are variables defined inside a transformation; their scope is internal to that transformation.

Right-click in the blank area of the transformation and open the transformation settings to see them.

Set variables and get variables:
Inside a transformation there is a "Job" category of steps, which contains the "Set Variables" and "Get Variables" controls.

Note: a variable set in a transformation cannot be used immediately in that transformation; it must be retrieved with "Get Variables" in the next step (transformation) of the current job.

Variables can also be set inside a job (the "Set Variables" entry under the general category of job entries).

Origin blog.csdn.net/xiaohuangren_123/article/details/105057866