Basic use of ETL resource library
1. Metadata
The general concept of metadata: "descriptive data", or "data about data".
Metadata in ETL: describes the tasks that the ETL process is to perform.
How Kettle stores metadata:
- Resource library: either a file resource library or a database resource library. Since Kettle 4.0, additional resource library types can be added through plug-ins.
- XML file: the root node of a .ktr transformation file must be <transformation>, and the root node of a .kjb job file must be <job>.
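The root-node rule can be checked with the XML parser that ships with the JDK. The following is a minimal sketch, not part of Kettle: the class name KettleFileKind and the tiny inline XML strings are made up for illustration, and a real .ktr or .kjb file contains far more than a bare root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper (not a Kettle class): classifies Kettle XML by its root element.
final class KettleFileKind {

    // Returns "transformation", "job", or "unknown" depending on the root element name.
    static String detect(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            String root = doc.getDocumentElement().getNodeName();
            if ("transformation".equals(root)) {
                return "transformation";
            }
            if ("job".equals(root)) {
                return "job";
            }
            return "unknown";
        } catch (Exception e) {
            // Malformed XML or parser setup failure
            throw new IllegalArgumentException("Not a readable XML document", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative inputs only; real files would be read from disk.
        System.out.println(detect("<transformation><info/></transformation>"));
        System.out.println(detect("<job><name>demo</name></job>"));
    }
}
```

Checking the root element before loading a file is a cheap way to tell a transformation from a job when the file extension cannot be trusted.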
2. Resource Library
If you do not use a resource library, transformations and jobs can be saved directly as .ktr or .kjb files.
2.1 Database Resource Library
Serializes Kettle's metadata into a database. For example, the R_TRANSFORMATION table stores the name, description, and other attributes of each Kettle transformation. The database resource library is created and upgraded from within Spoon.
2.2 File Resource Library
A file-based resource library that implements the org.pentaho.di.repository.Repository interface. This resource library type is available in Kettle 4.0 and later.
2.3 How to choose a resource library
Disadvantages of the database resource library:
- Cannot store multiple versions of a transformation or job
- Relies heavily on the database locking mechanism to prevent loss of work
- Was not designed with team development in mind: a developer cannot lock a job to work on it exclusively
Disadvantages of the file resource library:
- Relationships between objects (transformations, jobs, database connections, and so on) are hard to manage, so operations such as deleting and renaming are more troublesome
- No version history
- Difficult to use for team development
Not using a resource library: use SVN for version control of the files.
3. Use of Kettle Resource Library
3.1 Kettle Database Resource Library
3.1.1 Create a Database Resource Library
Create a database resource library
Set up the data source connection
Clicking Test reports an error:
Driver class 'oracle.jdbc.driver.OracleDriver' could not be found, make sure the 'Oracle' driver (jar file) is installed. oracle.jdbc.driver.OracleDriver
A MySQL connection did not have this problem. An Oracle connection needs the Oracle JDBC driver: copy ojdbc6.jar into the lib directory under the ETL installation directory, then restart.
After restarting, click Finish.
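The error above means the JVM could not load the driver class. The following is a small stand-alone sketch of that kind of check, assuming only the class name from the error message. DriverCheck is a hypothetical helper, not part of Kettle, and it inspects the classpath of the JVM it runs in, not Spoon's, so it has to be started with the driver jar on its classpath to return true.

```java
// Hypothetical diagnostic (not a Kettle class): reports whether a JDBC driver class can be loaded.
final class DriverCheck {

    // True if the named class is visible on the classpath of the current JVM.
    static boolean isAvailable(String driverClassName) {
        try {
            Class.forName(driverClassName);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the error message above.
        String driver = "oracle.jdbc.driver.OracleDriver";
        System.out.println(driver + " available: " + isAvailable(driver));
    }
}
```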
The last step is to connect. It is best to create a new database for this purpose, because the resource library is independent and used only by Kettle. The default user name and password are both admin.
Checking the database shows that a number of tables have been created.
3.1.2 Disconnect, Modify, and Delete a Database Resource Library
3.1.3 Add a Transformation, Save, and Export in the Database Resource Library
Add a transformation
Save: Ctrl + S
View
Import and export
When asked whether to add rules, choose No; the export then proceeds.
3.2 Kettle File Resource Library
The process is simpler than for the database resource library, and most operations are similar.
It can be created directly; no user name or password is needed.
4. Management Resource Library
Stages of ETL development: development, testing, validation, and release.
Resource library for each stage: development resource library, test (validation) resource library, release resource library.
Moving between stages:
- From the development resource library to the test resource library:
1.1 Follow the naming rules
1.2 Have one person do the release, to avoid conflicts
1.3 Two migration methods: disconnect and reconnect, or export/import
- From the test (validation) resource library to the release resource library: export/import
Not using a resource library: use SVN version control, tag test versions, and create branches for releases and builds.
5. Parameterization
Why parameterize: when jobs are migrated between resource libraries, the environment differs at each stage, so metadata such as the database connections used in a job must not be hard-coded.
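To make the idea concrete: Kettle lets a field refer to a variable as ${NAME} instead of a literal value. The following stand-alone sketch only imitates that substitution to show why it helps; it is not Kettle's implementation. The class name VariableSubstitution, the JDBC URL, and the variable name host_name are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration (not Kettle code): replaces ${name} placeholders with per-environment values.
final class VariableSubstitution {

    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    // Placeholders with no matching variable are left unchanged.
    static String resolve(String template, Map<String, String> variables) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = variables.get(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group(0);
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The same connection definition, resolved for one environment.
        Map<String, String> dev = new HashMap<>();
        dev.put("host_name", "localhost");
        System.out.println(resolve("jdbc:mysql://${host_name}:3306/etl", dev));
    }
}
```

Because only the variable values change between environments, the job definition itself can be moved from one resource library to the next unchanged.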
Methods of parameterization:
- The kettle.properties file, which by default is in the .kettle directory under Java's user.home directory
- A custom properties file, read with the Property Input step
- A parameter table
To find Java's user.home directory, create a Java file with the following contents:
public class PrintUserHome {
    public static void main(String[] args) {
        // Print the home directory of the user running the JVM
        System.out.println(System.getProperty("user.home"));
    }
}
Then compile and run it from cmd:
javac PrintUserHome.java
java PrintUserHome
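Once user.home is known, the properties file can be read with java.util.Properties. The sketch below assumes the default location, the .kettle directory under user.home. KettlePropertiesReader is a hypothetical name, and the host_name entry used when no file is found is made up for the demonstration.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper (not a Kettle class): loads key=value pairs from a properties file.
final class KettlePropertiesReader {

    // Assumed default location: <user.home>/.kettle/kettle.properties
    static Path defaultLocation() {
        return Paths.get(System.getProperty("user.home"), ".kettle", "kettle.properties");
    }

    static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Path file = defaultLocation();
        if (Files.exists(file)) {
            try (Reader reader = new InputStreamReader(
                    Files.newInputStream(file), StandardCharsets.ISO_8859_1)) {
                // Print only the keys; the values may contain passwords.
                load(reader).stringPropertyNames().forEach(System.out::println);
            }
        } else {
            // No file found: fall back to an inline sample entry.
            Properties sample = load(new StringReader("host_name=localhost\n"));
            System.out.println(sample.getProperty("host_name"));
        }
    }
}
```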
The structure of the parameter table:
Environment  parameter_name  parameter_value  valid_from  valid_to
Dev          host_name       localhost        2011-01-01  2099-01-01
Test         host_name       192.168.12.10    2011-01-01  2013-05-01
Test         host_name       192.168.12.11    2011-05-02  2099-01-01
Meaning of the columns:
- Environment: the environment, for example Dev (development) or Test (testing).
- parameter_name: the name of the parameter, for example host_name (the host).
- parameter_value: the value of the parameter in that environment.
- valid_from: the date from which the parameter value is valid.
- valid_to: the date until which the parameter value is valid.
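A lookup against such a table can be sketched in memory as follows, using the three rows from the table above. The class ParameterTable is hypothetical, and two behaviours are assumptions rather than something the text specifies: both dates are treated as inclusive, and because the two Test rows as listed overlap between 2011-05-02 and 2013-05-01, the row with the later valid_from wins when more than one row matches.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory model of the parameter table (not a Kettle class).
final class ParameterTable {

    private static final class Row {
        final String environment;
        final String name;
        final String value;
        final LocalDate validFrom;
        final LocalDate validTo;

        Row(String environment, String name, String value, String validFrom, String validTo) {
            this.environment = environment;
            this.name = name;
            this.value = value;
            this.validFrom = LocalDate.parse(validFrom);
            this.validTo = LocalDate.parse(validTo);
        }
    }

    private final List<Row> rows = new ArrayList<>();

    ParameterTable add(String environment, String name, String value,
                       String validFrom, String validTo) {
        rows.add(new Row(environment, name, value, validFrom, validTo));
        return this;
    }

    // Finds the value valid on the given date (inclusive bounds, an assumption).
    // If several rows match, the one with the latest valid_from wins (also an assumption).
    Optional<String> lookup(String environment, String name, LocalDate on) {
        Row best = null;
        for (Row row : rows) {
            boolean matches = row.environment.equals(environment)
                    && row.name.equals(name)
                    && !on.isBefore(row.validFrom)
                    && !on.isAfter(row.validTo);
            if (matches && (best == null || row.validFrom.isAfter(best.validFrom))) {
                best = row;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.value);
    }

    // The three rows from the table in the text.
    static ParameterTable sample() {
        return new ParameterTable()
                .add("Dev", "host_name", "localhost", "2011-01-01", "2099-01-01")
                .add("Test", "host_name", "192.168.12.10", "2011-01-01", "2013-05-01")
                .add("Test", "host_name", "192.168.12.11", "2011-05-02", "2099-01-01");
    }

    public static void main(String[] args) {
        ParameterTable table = sample();
        System.out.println(table.lookup("Dev", "host_name", LocalDate.of(2020, 1, 1)).orElse("(none)"));
        System.out.println(table.lookup("Test", "host_name", LocalDate.of(2011, 3, 1)).orElse("(none)"));
    }
}
```

In a real setup the rows would live in a database table and the same filter would be expressed as a query; the in-memory version is only meant to show how Environment, parameter_name, and the validity dates select one value.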