What can Kettle do?

Introduction 

Kettle is an open source ETL tool from abroad, written in pure Java. It runs on Windows, Linux, and Unix, needs no installation (just unzip and run), and performs data extraction efficiently and stably.

The name Kettle is literal: Matt, the project's lead programmer, wanted to be able to pour data of all kinds into a kettle and have it flow out in a specified format.

Kettle is a set of ETL tools for managing data from different databases. It provides a graphical design environment in which you describe what you want to do, rather than how to do it.

Kettle has two kinds of script files: transformations and jobs. A transformation performs the actual data transformation, while a job controls the overall workflow.

Kettle can be downloaded from http://kettle.pentaho.org/. 

    Terminology

1. Transformation: a transformation can be understood as a data pipeline that assembles one or more different data sources and finally outputs the result somewhere, such as a file or a database.

2. Job: a job can schedule designed transformations, perform file operations (compare, delete, etc.), upload and download files via FTP, send emails, execute shell commands, and so on.

3. Hop: a connection between transformation steps or between job entries (in effect, it defines the execution order). A transformation hop mainly indicates the flow of data: from input, through filtering and other transform operations, to output.

       A job hop can carry an execution condition: (1) run unconditionally; (2) run when the result of the previous job entry is true; (3) run when the result of the previous job entry is false. The sketch below shows how steps and hops fit together.
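To make the relationship between steps and hops concrete, here is a minimal sketch using Kettle's own Java API (org.pentaho.di). It builds an in-memory transformation with two illustrative steps and one hop; the class name and step names are just examples, not part of Kettle itself:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.TransHopMeta;
    import org.pentaho.di.trans.TransMeta;
    import org.pentaho.di.trans.step.StepMeta;
    import org.pentaho.di.trans.steps.dummytrans.DummyTransMeta;
    import org.pentaho.di.trans.steps.rowgenerator.RowGeneratorMeta;

    public class HopDemo {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle environment (registers the step plugins)
            KettleEnvironment.init();

            TransMeta transMeta = new TransMeta();
            transMeta.setName("hop-demo");

            // Two steps: a row generator feeding a dummy step
            RowGeneratorMeta genMeta = new RowGeneratorMeta();
            genMeta.setDefault();                       // allocate empty field arrays
            StepMeta from = new StepMeta("Generate rows", genMeta);
            StepMeta to = new StepMeta("Do nothing", new DummyTransMeta());
            transMeta.addStep(from);
            transMeta.addStep(to);

            // The hop: data flows from "Generate rows" to "Do nothing"
            transMeta.addTransHop(new TransHopMeta(from, to));

            // The hop shows up under <order> in the transformation XML
            System.out.println(transMeta.getXML());
        }
    }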

 

 

     Application scenarios

  • Table view mode: a very common situation: within the same network environment, we extract, filter, and clean table data from various data sources. Historical data synchronization, data exchange between heterogeneous systems, and data publishing or backup all belong to this mode. Traditional implementations usually require custom development (with a few exceptions, e.g. synchronization between two tables with identical structure can be done with SQL Server publish/subscribe), and once complex business logic is involved, hand-written code is prone to all kinds of bugs.

  • Front-end mode: a typical data exchange scenario. The two parties, A and B, cannot reach each other over the network, but both can connect to a front-end machine C. The two parties usually agree on a data structure for the front-end machine; this structure generally differs from the internal structures of both A and B, so each application has to push its data to the front-end machine according to the agreed standard, and the development workload is still considerable.

  • File mode: the two parties A and B are completely physically isolated, so data can only be exchanged through files, for example in XML format. Application A provides an interface that generates XML in the agreed standard format; at scheduled times the XML files are copied via a USB flash drive or other media and delivered to application B, which parses the files according to the standard interface and loads the data.

   Composition of Kettle

   SPOON: Allows you to design ETL transformations (Transformation) through a graphical interface.
   PAN: Allows you to run transformations designed in Spoon in batch mode (e.g. from a time-based scheduler). Pan is a background program with no graphical interface.
   CHEF: Allows you to create jobs (Job). Jobs make it easier to automate the complex work of updating a data warehouse by chaining together transformations, other jobs, scripts, and more.
   KITCHEN: Allows you to run jobs designed in Chef in batch mode (e.g. from a time-based scheduler). Kitchen is also a background program.

     Tips: To execute a job on Linux: kitchen.sh -file=/PRD/updateWarehouse.kjb -level=Minimal

              To execute a transformation: pan.sh -file=/PRD/updateWarehouse.ktr -level=Minimal
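The same thing can be done from your own Java code instead of the shell scripts. A minimal sketch using the Kettle API, assuming the org.pentaho.di libraries are on the classpath (the file paths are the same placeholders as above):

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunFromJava {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            // Programmatic equivalent of pan.sh: run a .ktr transformation
            TransMeta transMeta = new TransMeta("/PRD/updateWarehouse.ktr");
            Trans trans = new Trans(transMeta);
            trans.execute(null);          // no command-line arguments
            trans.waitUntilFinished();
            if (trans.getErrors() > 0) {
                throw new RuntimeException("Transformation finished with errors");
            }

            // Programmatic equivalent of kitchen.sh: run a .kjb job
            JobMeta jobMeta = new JobMeta("/PRD/updateWarehouse.kjb", null);
            Job job = new Job(null, jobMeta);
            job.start();
            job.waitUntilFinished();
            if (job.getErrors() > 0) {
                throw new RuntimeException("Job finished with errors");
            }
        }
    }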

 

   Transformation component tree introduction

  The nodes in Transformation are described as follows:

  • Main Tree: lists the basic properties of a transformation, which can be viewed node by node.
  • DB connection: displays the database connections in the current transformation. The database connections of each transformation must be configured separately.
  • Steps: the list of steps used in the transformation.
  • Hops: the list of hops (node connections) used in the transformation.

  The Core Objects menu lists the step types that can be used in a transformation; a step is added by dragging it onto the canvas with the mouse. The steps are grouped into categories such as:

  • Input: input steps
  • Output: output steps
  • Lookup: lookup steps
  • Transform: transform steps
  • Joins: join steps
  • Scripting: scripting steps

  Job component tree introduction

The nodes in the Job are described as follows:

    • Main Tree: lists the basic properties of a job, which can be viewed node by node.
    • DB connection: displays the database connections in the current job. The database connections of each job must be configured separately.
    • Job entries: the list of job entries used in the job.

    The Core Objects menu lists the entries that can be called in a job. Each entry can be added to the main window by dragging it with the mouse, and entries can be connected to each other by dragging with Shift + mouse.

 Frequently encountered problems:

1. How to connect to the repository?

If no repository exists yet, create one first, then connect to it.


2. How to connect to the database?

Before connecting to a database, first make sure you are on a transformation page. Then click "Main Object Tree" in the panel on the left, right-click "DB Connection", and select "New".

Of course, you can also set other connection options, such as zeroDateTimeBehavior=round&characterEncoding=utf8.
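If you ever need to define the same connection from code (for example when generating transformations programmatically), DatabaseMeta covers both the basic settings and these extra options. A minimal sketch, assuming a local MySQL database; the connection name, host, schema, and credentials are placeholders:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.core.database.DatabaseMeta;

    public class DbConnectionDemo {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            // name, type, access, host, database, port, user, password
            DatabaseMeta dbMeta = new DatabaseMeta(
                    "my_mysql", "MySQL", "Native(JDBC)",
                    "localhost", "test", "3306", "user", "password");

            // Extra options, equivalent to appending
            // ?zeroDateTimeBehavior=round&characterEncoding=utf8 to the JDBC URL
            dbMeta.addExtraOption("MYSQL", "zeroDateTimeBehavior", "round");
            dbMeta.addExtraOption("MYSQL", "characterEncoding", "utf8");

            System.out.println(dbMeta.getURL());
        }
    }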

 


3. How to solve the problem that table fields are not updated in time?

Sometimes the fields of a table in the database have changed (columns added or deleted), but the "Get SQL select statement" function of the "Table input" step still returns the original fields. This is caused by Kettle's database cache; clear the cache and fetch the fields again.

4. How to fix the "Unable to read file" error?

Sometimes, after a job or transformation has been moved to another directory, executing it fails with an "Unable to read file" error. Open the settings page of the current transformation and correct the directory there.

5. How to solve the problem of tinyint data loss?

When Kettle connects to MySQL via JDBC, fields of type tinyint may be read as boolean values, which can cause data loss. For example, suppose a tinyint field named status has three values: 0, 1, and 2. When Kettle reads it, 0 is likely converted to false, while 1 and 2 become true; on output, false is written as 0 and true as 1, so rows whose status is 2 in the source data end up incorrectly set to 1.
To solve this problem, cast status to an integer or character type when reading the source data, for example SELECT CAST(status AS signed) AS status FROM <table_name> or SELECT CAST(status AS char) AS status FROM <table_name>.
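If the column is declared as tinyint(1), another workaround (assuming the MySQL Connector/J driver) is the documented driver option tinyInt1isBit=false, added to the connection options just like the options shown in question 2, so that the driver returns the column as an integer rather than a bit/boolean:

    jdbc:mysql://localhost:3306/test?tinyInt1isBit=false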
