Implementation summary of Oracle-based incremental data acquisition

  • Project packaging scheme
The article "Oracle-based incremental data collection" proposed a data collection scheme built on triggers, materialized views, stored procedures, java sources, and an external program. This article gives an initial implementation, packaged with maven-assembly-plugin. The output structure has three directories: bin, conf, and lib, holding the command files, configuration files, and jar packages respectively. Note that the command files in the bin directory add conf and lib to the classpath; see start.bat and clear.bat for details:
@echo off & setlocal enabledelayedexpansion

set LIB_JARS=""
cd ..\lib
for %%i in (*) do set LIB_JARS=!LIB_JARS!;..\lib\%%i
cd ..\bin

java -Xms64m -Xmx1024m -classpath ..\conf;%LIB_JARS% com.service.data.sync.oracle.producer.SyncDataEnv init
goto end

:end
pause
Currently only two command files are provided. start.bat initializes the data synchronization environment (creates the java source, stored procedure, materialized views, and triggers); clear.bat clears it (drops the java source, stored procedure, materialized views, and triggers). clear.bat only needs init replaced with clear in the command above.

maven-jar-plugin is another Maven packaging plugin; it can add lib dependencies, specify the startup class, and so on, and its output is launched with java -jar [jar package] [args]. This project uses the assembly plugin instead.
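The bin/conf/lib layout described above could be produced by an assembly descriptor along these lines. This is a minimal sketch, not the project's actual descriptor; the `src/main/bin` and `src/main/conf` source paths are assumptions.

```xml
<!-- assembly.xml: hypothetical descriptor producing the bin/conf/lib layout -->
<assembly>
  <id>dist</id>
  <formats>
    <format>dir</format>
  </formats>
  <fileSets>
    <!-- command files (start.bat, clear.bat) -->
    <fileSet>
      <directory>src/main/bin</directory>
      <outputDirectory>bin</outputDirectory>
    </fileSet>
    <!-- configuration files -->
    <fileSet>
      <directory>src/main/conf</directory>
      <outputDirectory>conf</outputDirectory>
    </fileSet>
  </fileSets>
  <!-- all runtime dependency jars go to lib -->
  <dependencySets>
    <dependencySet>
      <outputDirectory>lib</outputDirectory>
    </dependencySet>
  </dependencySets>
</assembly>
```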

  • Notes
a. Specify the tables to be synchronized in the configuration file; executing start.bat automatically creates a materialized view and trigger for each table, skipping any that already exist
b. Binary data and overly long rows are not currently supported, because the data concatenated in the trigger is of type VARCHAR2, which is limited to 4000 bytes; such rows need special handling, and this has not been verified
c. Publishing data should use an asynchronous sending scheme, otherwise it delays commits in the business database. It has been verified that the business transaction can commit successfully once the java source has executed
d. The trigger is created on the materialized view rather than on the base table: a trigger on the table itself fires before the transaction has committed, whereas before commit the change exists only in the materialized view log and does not fire the materialized view's trigger
e. Minimize jar dependencies, to avoid loading too many jars into Oracle
f. Check whether the OracleJVM is installed: as the sys user, execute select * from dba_registry where comp_id='JAVAVM'. No record means it is not installed; use the Database Configuration Assistant to install the Java component, or run the $ORACLE_HOME/javavm/install/initjvm.sql script
g. Before initializing the data synchronization environment, the jar packages that the java source depends on must be loaded into Oracle. The simplest approach at present is to package the three classes AbstractSend, HttpUrlSend, and SyncDataRunner from this project and load them into Oracle. Only HttpURLConnection is used to send data out, with no dependencies beyond the Java runtime, and only for testing; a production environment should consider other asynchronous sending schemes, such as a message queue
loadjava -r -f -verbose -user username/password xxx.jar
loadjava -r -f -user username/password xxx.class
dropjava -user username/password file_list
These commands must be executed directly on the database machine's command line; they cannot be run from PL/SQL Developer
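Note b above hits the 4000-byte VARCHAR2 limit on the string concatenated inside the trigger. One possible workaround on the Java side, sketched below as a hypothetical helper (not part of the project), is to split oversized payloads into numbered chunks before they are enqueued:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for note b: a VARCHAR2 used in SQL is capped at 4000
// bytes, so a row whose concatenated values exceed that cannot pass through
// the trigger as one string. Splitting into chunks is one possible workaround
// (the chunks would still need sequencing/reassembly downstream).
public class MessageChunker {

	// VARCHAR2 limit taken from the article; note it is bytes, not characters,
	// so multi-byte charsets need a smaller chunk size in practice
	static final int MAX_CHUNK = 4000;

	/** Splits a message into chunks of at most MAX_CHUNK characters. */
	public static List<String> split(String message) {
		List<String> chunks = new ArrayList<>();
		for (int i = 0; i < message.length(); i += MAX_CHUNK) {
			chunks.add(message.substring(i, Math.min(message.length(), i + MAX_CHUNK)));
		}
		return chunks;
	}
}
```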

  • Code
package com.service.data.sync.oracle.producer.send;

/**
 * Sends the monitored data to the outside
 * @author sheungxin
 *
 */
public abstract class AbstractSend {
	
	/**
	 * send messages
	 * @param message
	 */
	public void send(String message){
		beforeSend(message);
		exeSend(message);
		afterSend(message);
	}
	
	/**
	 * Execute before sending message
	 * @param message
	 */
	private void beforeSend(String message){
		//do something
	}
	
	/**
	 * The actual send operation, implemented by subclasses
	 * @param message
	 * @return true if the message was sent successfully
	 */
	public abstract boolean exeSend(String message);

	/**
	 * Execute after sending message
	 * @param message
	 */
	private void afterSend(String message){
		//do something
	}
}
An abstract message-sending class: it sends the monitored data change information, with hooks for operations before and after sending.
package com.service.data.sync.oracle.producer.send;

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

/**
 * Uses HttpURLConnection to send messages out via GET; the message length is limited, for testing only
 * @author sheungxin
 *
 */
public class HttpUrlSend extends AbstractSend{
	
	private static final String sendPath="http://192.168.19.99:8181/jeesite/f/list-7.html?params=";
			
	public boolean exeSend(String message){
		boolean flag=true;
		try{
			// append the URL-encoded message as the query parameter
			URL url = new URL(sendPath + URLEncoder.encode(message, "UTF-8"));
			HttpURLConnection conn = (HttpURLConnection) url.openConnection();
			conn.setRequestMethod("GET");
			conn.setDoInput(true);
			InputStream is = conn.getInputStream();
			is.close();
			conn.disconnect();
		}catch(Exception e){
			e.printStackTrace();
			flag=false;
		}
		return flag;
	}

}
This message-sending implementation is for testing only and is not recommended. As mentioned in the notes, an asynchronous sending mechanism should be used to avoid affecting the commit of business database transactions.
package com.service.data.sync.oracle.producer;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import com.service.data.sync.oracle.producer.send.AbstractSend;
import com.service.data.sync.oracle.producer.send.HttpUrlSend;

/**
 * Data synchronization execution class
 * @author sheungxin
 *
 */
public class SyncDataRunner {
	
	private static final BlockingQueue<String> taskQueue=new LinkedBlockingQueue<String>();
	
	/**
	 * Adds a message to the queue; returns false when the queue is full
	 * (LinkedBlockingQueue's default capacity is Integer.MAX_VALUE)
	 * @param message
	 * @return true if the message was enqueued
	 */
	public static boolean addTask(String message){
		return taskQueue.offer(message);
	}
	
	/**
	 * Consume data in the queue and send out messages
	 */
	public static void executeTask(){
		AbstractSend httpSend=new HttpUrlSend();
		while(true){
			String message;
			try {
				// block until a message is available
				message = taskQueue.take();
				httpSend.send(message);
			} catch (InterruptedException e) {
				// restore the interrupt flag and stop consuming
				Thread.currentThread().interrupt();
				return;
			}
		}
	}

}
The actual data synchronization class exposes only two methods: addTask puts monitored changes into the queue and is called from the java source inside Oracle; executeTask sends the queued messages out. Sending here is single-threaded; multi-threading could be considered, but its impact on Oracle performance should be watched, and depending on the business, single-threaded sending plus an MQ should meet most needs. Also note that messages live only in the in-memory queue and can be lost under abnormal conditions, so the possible exceptions need special handling.
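The multi-threaded variant mentioned above could look roughly like this. It is a standalone sketch, not the project's code: the Sender interface stands in for AbstractSend, and several pool threads drain the same BlockingQueue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a multi-threaded executeTask(): N consumer threads share one
// queue. Sender is a stand-in for the article's AbstractSend.
public class MultiThreadedRunner {

	interface Sender { void send(String message); }

	private final BlockingQueue<String> taskQueue = new LinkedBlockingQueue<>();
	private final ExecutorService pool;

	public MultiThreadedRunner(int threads, Sender sender) {
		pool = Executors.newFixedThreadPool(threads);
		for (int i = 0; i < threads; i++) {
			pool.submit(() -> {
				try {
					while (!Thread.currentThread().isInterrupted()) {
						// blocks until a message arrives
						sender.send(taskQueue.take());
					}
				} catch (InterruptedException e) {
					Thread.currentThread().interrupt(); // restore flag and exit
				}
			});
		}
	}

	/** Same contract as addTask: false when the queue is full. */
	public boolean addTask(String message) {
		return taskQueue.offer(message);
	}

	public void shutdown() {
		pool.shutdownNow();
	}
}
```

As the text warns, messages still live only in memory here; a persistent queue (MQ) is the safer production choice.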

Regarding the Java code that creates the java source, stored procedure, materialized views, and triggers: the code is not posted here, but two pitfalls are worth noting.
1. The trigger DDL concatenates data using :old and :new, the row images before and after the triggering event. Creating it through JDBC kept failing, although copying the same SQL into PL/SQL Developer worked fine; by elimination the cause turned out to be ":old, :new". The fix is to use Statement instead of PreparedStatement: PreparedStatement precompiles the SQL and trips over :old/:new, and it never succeeded.
2. Creating the java source through JDBC also kept failing, while the same text ran fine when copied into PL/SQL Developer. Final solution: Statement.setEscapeProcessing(false). When this is true, the driver escapes and rewrites the SQL statement before sending it to the database; false lets the database parse it itself.
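Both pitfalls can be illustrated in one place. This is a sketch only: the table, `mv_` prefix, and the `sync_pkg.publish` call are invented names, and `install` is never run against a real database here.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrates the two JDBC pitfalls. The trigger body references :new/:old,
// so it must go through a plain Statement: a PreparedStatement would try to
// precompile :new and :old as bind parameters and fail. setEscapeProcessing
// (false) stops the driver from rewriting the SQL before Oracle sees it.
public class TriggerInstaller {

	/** Builds trigger DDL for a hypothetical table; :new/:old stay literal. */
	public static String buildTriggerSql(String table) {
		return "CREATE OR REPLACE TRIGGER trg_" + table + "_sync\n"
			 + "AFTER INSERT OR UPDATE OR DELETE ON mv_" + table + "\n"
			 + "FOR EACH ROW\n"
			 + "BEGIN\n"
			 + "  sync_pkg.publish(:old.id, :new.id);\n" // hypothetical publisher
			 + "END;";
	}

	/** Executes the DDL; needs an open Oracle connection. */
	public static void install(Connection conn, String table) throws SQLException {
		try (Statement st = conn.createStatement()) {
			st.setEscapeProcessing(false); // pitfall 2: let Oracle parse the block
			st.execute(buildTriggerSql(table)); // pitfall 1: Statement, not PreparedStatement
		}
	}
}
```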

  • Other ideas
add2ws wrote
Is it really worth this much trouble? Kettle can insert and update directly, saving the time and effort of Oracle triggers
Prompted by this comment, Kettle's incremental synchronization schemes were investigated:
1. Incremental update by full comparison
2. Incremental update using a timestamp
3. Incremental update using a trigger + snapshot table
Scheme 1's full comparison clearly performs poorly. Scheme 2 requires timestamp columns, which the original business tables may not have. Scheme 3 uses a snapshot table to record data changes; some solutions found online propose that the snapshot table record only modifications and deletions, periodically delete the matching rows from the target table, and then copy over the rows the target is missing. Deleting modified data and re-inserting it this way is unacceptable for some businesses. The advantage, as I understand it, is that data can be processed in batches, since replaying changes one by one per operation type is too inefficient. So real-time behavior, performance, consistency, and complexity all have to be balanced against the actual business scenario.
The implementation above could therefore be replaced with trigger + snapshot table + snapshot table scanner (Kettle could also be considered). The advantage is looser coupling with the database, though real-time behavior and efficiency may suffer. Materialized views and ETL each have their place; choose whichever suits your scenario.
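The trigger + snapshot table + scanner alternative could be sketched as a polling loop. All names here are invented; SnapshotStore abstracts the JDBC access so the loop itself is database-agnostic.

```java
import java.util.List;

// Sketch of the "trigger + snapshot table + scanner" idea: the trigger writes
// change rows into a snapshot table, and an external scanner polls it,
// forwards each change downstream, and deletes what it has processed.
public class SnapshotScanner {

	public interface SnapshotStore {
		List<String> fetchBatch(int limit);       // oldest pending change rows
		void deleteProcessed(List<String> rows);  // remove rows already forwarded
	}

	public interface Publisher { void publish(String row); }

	/** Drains everything currently pending; run this on a timer. */
	public static int scanOnce(SnapshotStore store, Publisher out, int batchSize) {
		int total = 0;
		List<String> batch;
		while (!(batch = store.fetchBatch(batchSize)).isEmpty()) {
			for (String row : batch) {
				out.publish(row);             // forward the change downstream
			}
			store.deleteProcessed(batch);     // only after successful publish
			total += batch.size();
		}
		return total;
	}
}
```

Deleting only after publishing gives at-least-once delivery; deduplication then falls to the consumer, which is the usual trade-off in such scanners.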
