First, go to the official website. The page is entirely in English, so I had it translated.
https://community.hitachivantara.com/s/article/data-integration-kettle
Click the link framed in red in the screenshot to download it.
After downloading, unzip the archive.
Kettle is open-source software written in pure Java. It runs on a local JDK 1.7 or above (the 8.3 release used below needs Java 8). After unzipping, it can be used directly without installation.
Second, set the PENTAHO_JAVA_HOME environment variable. Its value is the path of the local JDK.
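Before launching Spoon, it can help to confirm that the variable really points at a JDK. The following is a minimal sketch of such a check, not part of Kettle itself; the function name find_java is made up for illustration, and it only assumes that a JDK keeps its java executable under a bin folder.

```python
import os
from pathlib import Path


def find_java(env=None):
    """Return the java executable under PENTAHO_JAVA_HOME, or None if it is missing.

    find_java is a hypothetical helper for this article, not a Kettle API.
    """
    env = os.environ if env is None else env
    home = env.get("PENTAHO_JAVA_HOME")
    if not home:
        return None  # the variable is not set at all
    bin_dir = Path(home) / "bin"
    # java.exe on Windows, plain java on other systems
    for name in ("java.exe", "java"):
        candidate = bin_dir / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    java = find_java()
    print(java if java else "PENTAHO_JAVA_HOME is not set or does not point to a JDK")
```

If the script prints a path, the variable is set to a folder that contains a java executable; otherwise fix the variable before starting Spoon.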
After configuration, double-click Spoon.bat.
Spoon takes a while to open, so wait patiently.
3. Create a database connection
Click Transformations and switch to the main object tree, where the DB connection entry is visible. Click DB connection.
Select MySQL as the connection type and enter the connection information.
Then click Test. The following error occurs.
This is because the MySQL driver package is missing. Put the MySQL driver JAR under pdi-ce-8.3.0.0-371\data-integration\lib and restart Spoon so that it is loaded. Choose the driver that matches your MySQL version: if the driver is too old for the server, the error Unknown system variable 'query_cache_size' appears and the database cannot be connected.
I downloaded the driver package mysql-connector-java-5.1.8.jar. With it in place, the connection test succeeds.
Click OK to confirm.
4. Synchronizing data
Create a new transformation, then drag a Table input step from the Input category and a Table output step from the Output category onto the canvas.
In the Table input step, select an existing database connection or create a new one.
Then click the button that gets the SQL query statement.
Select the table you want to read from and click OK.
After you click Yes, the following error is reported.
My guess is that the MySQL server version conflicts with the version of the MySQL connection driver (mysql-connector-java).
The current environment is as follows:
MySQL version: check it by executing select version();
mysql-connector-java version: 5.1.8
I tried several versions of the connection driver and finally found that 5.1.47 solves the problem.
Explanation:
When connecting to the database, the old JDBC driver sends the statement SET OPTION SQL_SELECT_LIMIT=DEFAULT, and MySQL 5.6 and later no longer support the SET OPTION syntax.
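To make the version rule concrete, here is a small sketch that takes the string returned by select version() and tells whether that server would still accept the old SET OPTION statement. The helper names parse_version and accepts_set_option are my own, and the sample version strings are only examples, not the versions from my environment.

```python
import re


def parse_version(text):
    """Turn a string such as '5.7.26-log' into a tuple of ints, e.g. (5, 7, 26)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch number (as in '8.0') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def accepts_set_option(server_version):
    """True if the server is older than 5.6 and so still accepts SET OPTION ..."""
    return parse_version(server_version)[:2] < (5, 6)


if __name__ == "__main__":
    for version in ("5.5.62", "5.6.10", "5.7.26-log"):
        print(version, accepts_set_option(version))
```

If this returns False for your server, a driver as old as 5.1.8 will hit the error above, and a newer driver is needed.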
After the SQL is executed, the result is as shown below.
Next, insert the fields from table A into table B.
The Table output step simply writes the incoming rows to another table.
Settings for Table output:
Running result (user_copy table data): the data of table A has been copied to table B.
When we run it a second time, Kettle reports an error saying that the primary key already exists.
This means that Table output only inserts, so the same rows can be written only once. If a row with the same primary key already exists in the target table, it is not updated and an error is reported.
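The same behaviour can be reproduced outside Kettle. The sketch below uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, and the name column and sample rows are assumptions of mine; it only illustrates that a plain INSERT into a table with a primary key works the first time and fails the second time.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the columns and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE user_copy (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO user VALUES (?, ?)", [(1, "a"), (2, "b")])


def table_output(conn):
    """Copy every row of user into user_copy with plain INSERT statements."""
    rows = conn.execute("SELECT id, name FROM user").fetchall()
    conn.executemany("INSERT INTO user_copy (id, name) VALUES (?, ?)", rows)
    return len(rows)


first_run = table_output(conn)  # succeeds: user_copy was empty
try:
    table_output(conn)  # fails: the same ids already exist in user_copy
    second_run_error = None
except sqlite3.IntegrityError as exc:
    second_run_error = str(exc)

print(first_run, second_run_error)
```

The first call copies both rows; the second call raises an integrity error on the primary key, which is the situation Kettle reports.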
Now modify the Table output settings and specify the output fields explicitly, as follows:
Running result (user_copy table data):
https://blog.csdn.net/qqfo24/article/details/82190535
You can refer to the URL above for how to update or add data from one table into another table.
The operation steps are as follows:
Click the core objects and create a new transformation.
Then click the main object tree and select DB connection.
After that, go back to the core objects. Select Input, then Table input.
Then add the Insert/Update step.
Now let's look at the data in the user table.
Then take a look at the data in the test table.
Then double-click the Insert/Update step.
The first picture only explains the settings; the picture below shows my own configuration.
Click OK, then run the transformation.
Click Start and save the transformation.
After the run finishes, the results appear below, including the log and the data preview. They show how many rows were read in total and how many were inserted or updated.
This completes the simplest transformation: reading data from one table and inserting into or updating another table.
Now look at the test table: the row with id 4 has been updated from order to method.
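What Insert/Update does for each row can be sketched roughly as follows: look the row up by its key, insert it if it is missing, and update it if it exists but differs. This is only an illustration using in-memory SQLite in place of MySQL; the name column and the second sample row are assumptions, and insert_update is a name I made up.

```python
import sqlite3


def insert_update(conn, rows):
    """Look up each row by id in test; insert it if missing, update it if changed."""
    inserted = updated = 0
    for row_id, name in rows:
        found = conn.execute(
            "SELECT name FROM test WHERE id = ?", (row_id,)
        ).fetchone()
        if found is None:
            conn.execute("INSERT INTO test (id, name) VALUES (?, ?)", (row_id, name))
            inserted += 1
        elif found[0] != name:
            conn.execute("UPDATE test SET name = ? WHERE id = ?", (name, row_id))
            updated += 1
        # identical rows are left untouched
    return inserted, updated


# The target table already holds id 4 with the old value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO test VALUES (4, 'order')")

source_rows = [(4, "method"), (5, "new")]  # hypothetical rows from the source table
counts = insert_update(conn, source_rows)
repeat_counts = insert_update(conn, source_rows)
print(counts, repeat_counts)
```

Unlike the plain insert of Table output, running this a second time does nothing and raises no error, because every row is already present and unchanged.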
If you want to run this transformation periodically, you need to use a job.
Create a new job and click General.
Drag START, Transformation, and Success from the left to the right and connect them with lines.
Double-click START to configure the job's run interval; here it is configured to run every hour.
Double-click the Transformation entry and select the transformation you created earlier.
Click Run to run the job, and click Stop to stop it. In the execution result below, you can see the running log.
I add a new row with an id of 1 to the user table.
Now run this job.
An hour turned out to be too long to wait, so I set the interval to 3 minutes. Running result:
Now let's check whether the new row has arrived in the test table in the database.
The screenshot above shows that the scheduled job inserted the row successfully.
If you want the scheduled job to keep repeating, check the Repeat option.
With Repeat checked, the job keeps running until you stop it. Click the Stop button when you no longer want it to run.
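The effect of the interval and the Repeat option can be pictured with a small loop. This is only a rough model of the behaviour described above, not Kettle's own scheduler: run_job, max_runs (which plays the role of the Stop button), and the choice to wait before each run are all assumptions of this sketch.

```python
import time


def run_job(task, interval_seconds, repeat, max_runs=None, sleep=time.sleep):
    """Wait interval_seconds, run task, and loop while repeat is True.

    max_runs is a hypothetical stand-in for clicking Stop.
    """
    runs = 0
    while True:
        sleep(interval_seconds)  # wait for the configured interval (assumed to come first)
        task()
        runs += 1
        if not repeat or (max_runs is not None and runs >= max_runs):
            return runs


if __name__ == "__main__":
    # A 1-second interval keeps the demo short; 3 minutes would be 180.
    total = run_job(lambda: print("transformation ran"), 1, repeat=True, max_runs=3)
    print("runs:", total)
```

With repeat set to False the task runs once and the loop ends; with repeat set to True it keeps running at the chosen interval until it is stopped.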
Summary
Insert/Update is used more often because it can update existing data.
Table output easily fails on or inserts duplicate data, so use it with caution.
A scheduled job updates the data automatically once started, which reduces the cost of manual operation.