Basic use of ETL resource library

1. Metadata

The general concept of metadata: "descriptive data", or "data about data".

ETL metadata: describes the tasks that the ETL process is to perform.

How Kettle stores metadata:

  1. Resource library: either a file resource library or a database resource library. From Kettle 4.0 onward, further resource library types can be added through plug-ins.
  2. XML file: the root node of a .ktr transformation file must be <transformation>; the root node of a .kjb job file is <job>.
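The root-node rule can be checked with a few lines of Java. The following is a minimal sketch using only the JDK's XML parser; the class name KettleFileKind, the method kindOf, and the trimmed XML strings are made up for illustration and are not part of Kettle.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class KettleFileKind {

    // Returns "transformation", "job", or "unknown" depending on the XML root element.
    static String kindOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String root = doc.getDocumentElement().getNodeName();
            if ("transformation".equals(root)) {
                return "transformation";
            }
            if ("job".equals(root)) {
                return "job";
            }
            return "unknown";
        } catch (Exception e) {
            // Malformed XML or parser setup failure.
            throw new IllegalArgumentException("Not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily trimmed file contents; real files hold far more.
        String ktr = "<transformation><info><name>demo</name></info></transformation>";
        String kjb = "<job><name>demo_job</name></job>";
        System.out.println(kindOf(ktr));
        System.out.println(kindOf(kjb));
    }
}
```

For a real file, read its contents into a string first and pass that to kindOf.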

2. Resource Library

If you do not use a resource library, transformations and jobs can be saved directly as .ktr or .kjb files.

2.1 Data Resource Library

Kettle's metadata is serialized to a database. For example, the R_TRANSFORMATION table stores the name, description, and other attributes of each Kettle transformation. The database resource library is created and upgraded from Spoon.

2.2 File Resource Library

A file-based resource library that implements the org.pentaho.di.repository.Repository interface. This resource library type is available in Kettle 4.0 or later.

2.3 How to choose a resource library

Disadvantages of the database resource library:

  1. It cannot store multiple versions of a transformation or job.
  2. It relies heavily on the database's locking mechanism to prevent loss of work.
  3. It was not designed with team development in mind: a developer cannot lock a job to work on it exclusively.

Disadvantages of the file resource library:

  1. Relationships between objects (transformations, jobs, database connections, and so on) are hard to track, so operations such as deleting and renaming are more troublesome.
  2. It keeps no version history.
  3. It is difficult to use for team development.

Without a resource library: save transformations and jobs as files and use SVN for version control.

3. Use of Kettle Resource Library

3.1 Kettle Data Resource Library

3.1.1 Create Data Resource Library

Create a database repository


Set up the data source connection

Clicking Test reports an error: Driver class 'oracle.jdbc.driver.OracleDriver' could not be found, make sure the 'Oracle' driver (jar file) is installed. oracle.jdbc.driver.OracleDriver

A MySQL connection did not have this problem. For an Oracle connection, copy the Oracle ojdbc6.jar into the lib directory under the ETL (Kettle) directory, then restart Spoon.
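The error above means the driver class could not be loaded from the classpath. As a rough sketch, the same check can be reproduced outside Spoon with plain Java; the class name DriverCheck and the method isOnClasspath are made up for this example, and the result only reflects the classpath the program itself is started with.

```java
class DriverCheck {

    // True if the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing, or it was found but failed to load or initialize.
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the error message reported by Spoon.
        String driver = "oracle.jdbc.driver.OracleDriver";
        System.out.println(driver + " available: " + isOnClasspath(driver));
    }
}
```

To test against the same jars Spoon sees, start the program with Kettle's lib directory on the classpath, for example java -cp "lib/*;." DriverCheck on Windows (use : instead of ; on Linux or macOS).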


Click Finish after restarting.


The last step is to connect. It is best to create a new database for this purpose, because it serves as an independent resource library for Kettle. The default user name and password are both admin.


Check the database: a set of resource library tables has now been created in it.


3.1.2 Disconnecting, modifying, and deleting a data resource library


3.1.3 Adding a transformation, saving, and exporting in the data resource library

Add a transformation


Save: Ctrl + S


View


Import and Export


When asked whether to add rules, choose No; the export then proceeds.


3.2 Kettle File Resource Library

Setting up a file resource library is simpler than setting up a data resource library, and most operations are similar.


It can be used right away; no user name or password is required.


4. Managing Resource Libraries

Stages of ETL development: development, testing, validation, and release.

Resource library for each stage: development resource library, test (validation) resource library, release resource library.

Promotion between stages:

  1. From the development resource library to the test resource library:
    1.1 Follow the naming rules
    1.2 Have one person do the release, to avoid conflicts
    1.3 Two migration methods: disconnect and reconnect, or export/import
  2. From the test (validation) resource library to the release resource library: export/import

Without a resource library: use SVN for version control, tag the test versions, and create a branch for each release build.

5. Parameterization

Why parameterize: when jobs are migrated between resource libraries, the environment differs at each stage, so metadata such as the database connections used in a job must not be hard-coded.

Ways to parameterize: use the kettle.properties file, which by default is in the .kettle folder under Java's user.home directory; read a custom properties file through the Property Input step; or use a parameter table.
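To make the properties-file approach concrete, here is a minimal sketch in plain Java. It assumes the default layout, in which Kettle keeps kettle.properties in a .kettle folder under user.home and KETTLE_HOME is not set. The class name KettlePropertiesSketch and the keys DB_HOST and DB_PORT are invented for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

class KettlePropertiesSketch {

    // Assumed default location: <user.home>/.kettle/kettle.properties
    static Path defaultLocation() {
        return Paths.get(System.getProperty("user.home"), ".kettle", "kettle.properties");
    }

    // Parses key=value lines in the standard Java properties format.
    static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println("Expected file: " + defaultLocation());

        // Hypothetical entries standing in for the contents of a real file.
        Properties props = load(new StringReader("DB_HOST=localhost\nDB_PORT=3306\n"));
        System.out.println("DB_HOST = " + props.getProperty("DB_HOST"));
        System.out.println("DB_PORT = " + props.getProperty("DB_PORT"));
    }
}
```

Each environment gets its own copy of the file with the same keys and different values, so the job definition itself stays unchanged when it is migrated.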

To find out where Java's user.home directory is, create a Java file with the following contents.

public class PrintUserHome {

    public static void main(String[] args) {
        // Print the current user's home directory as seen by the JVM.
        System.out.println(System.getProperty("user.home"));
    }
}

Run the following commands in cmd:

javac PrintUserHome.java
java PrintUserHome


The structure of the parameter table:

Environment  parameter_name  parameter_value  valid_from  valid_to
Dev          host_name       localhost        2011-01-01  2099-01-01
Test         host_name       192.168.12.10    2011-01-01  2013-05-01
Test         host_name       192.168.12.11    2011-05-02  2099-01-01

Meaning of the columns:

  1. Environment: the environment the row applies to, for example Dev (development) or Test (testing).
  2. parameter_name: the name of the parameter, for example host_name (the host).
  3. parameter_value: the value of the parameter in that environment.
  4. valid_from: the date from which the parameter value is valid.
  5. valid_to: the date after which the parameter value is no longer valid.
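A lookup against such a table could look like the sketch below, which holds the rows in memory instead of a database. The class and method names are hypothetical. Note that the two Test rows as listed overlap between 2011-05-02 and 2013-05-01; this sketch resolves the overlap by preferring the row with the latest valid_from, which is an assumption and not something the table itself specifies.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class ParameterTableSketch {

    // One row of the parameter table.
    static final class Row {
        final String environment;
        final String name;
        final String value;
        final LocalDate validFrom;
        final LocalDate validTo;

        Row(String environment, String name, String value, String validFrom, String validTo) {
            this.environment = environment;
            this.name = name;
            this.value = value;
            this.validFrom = LocalDate.parse(validFrom);
            this.validTo = LocalDate.parse(validTo);
        }
    }

    // The rows from the table above.
    static final List<Row> ROWS = Arrays.asList(
            new Row("Dev", "host_name", "localhost", "2011-01-01", "2099-01-01"),
            new Row("Test", "host_name", "192.168.12.10", "2011-01-01", "2013-05-01"),
            new Row("Test", "host_name", "192.168.12.11", "2011-05-02", "2099-01-01"));

    // Finds the value valid on the given date (both bounds inclusive).
    // If several rows match, the one with the latest valid_from wins (assumption).
    static Optional<String> lookup(String environment, String name, LocalDate on) {
        return ROWS.stream()
                .filter(r -> r.environment.equals(environment) && r.name.equals(name))
                .filter(r -> !on.isBefore(r.validFrom) && !on.isAfter(r.validTo))
                .max(Comparator.comparing((Row r) -> r.validFrom))
                .map(r -> r.value);
    }

    public static void main(String[] args) {
        System.out.println(lookup("Dev", "host_name", LocalDate.of(2020, 1, 1)).orElse("(none)"));
        System.out.println(lookup("Test", "host_name", LocalDate.of(2011, 3, 1)).orElse("(none)"));
        System.out.println(lookup("Test", "host_name", LocalDate.of(2020, 1, 1)).orElse("(none)"));
    }
}
```

In a real job, the same filter would normally be expressed as a query against the parameter table, with the environment name supplied from outside the job.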


Origin blog.csdn.net/YKenan/article/details/112406203