Program-based migration of million-level data

JVM study notes: http://blog.csdn.net/cutesource/article/details/5904501
Principle of heap memory setting: http://blog.csdn.net/sivyer123/article/details/17139443/
GC log analysis of JVM: http://blog.csdn.net/lan861698789/article/details/51985188
The difference between JVM client mode and server mode: http://developer.51cto.com/art/201009/228035.htm
Garbage collector: http://yueyemaitian.iteye.com/blog/1185301
VisualVM Analyzer: http://www.cnblogs.com/linghu-java/p/5689227.html
MemoryAnalyzer Usage: http://wensong.iteye.com/blog/1986449
The difference between Retained and Shallow in MAT: http://bjyzxxds.iteye.com/blog/1532937
Application scenario: all the data in the tables of Mdb needs to be migrated to Sdb. There are three approaches: first, use a tool such as Kettle; second, write scripts; third, write a migration program. The first two are fine even for large tables. The third is the problem case: for a table at the million-record level, reading the full data set out of Mdb and then inserting all of it into Sdb places high demands on memory. The more effective method is to query page by page and then insert. Here we first look at extracting the full data set from Mdb and inserting it into Sdb in batches; paging combined with batch insertion (and, to relieve memory pressure further, the query result can also be written to a file and then read back and inserted) is covered afterwards.
Main code:
int counts = 0; // number of records processed
//Data d = null;
while (rs.next()) {
	Data d = getData(rs); // build a Data object from the current row
	insertList.add(d);
	d = null; // release the reference so the object can be reclaimed by GC
	counts++;
	if (counts % 5000 == 0) { // flush every 5000 records
		batchSave(insert, insertList);
		insertList.clear();
		log.info("============RecordsSave:" + counts);
	}
	if (rs.isLast()) { // flush whatever is left after the last row
		batchSave(insert, insertList);
		insertList.clear();
	}
	log.info("============Records:" + counts);
}
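The batchSave helper is not shown above. As a reference, here is a minimal sketch of what it could look like with JDBC batching; the INSERT statement, the column mapping and the commit handling are assumptions, not the original implementation (requires java.sql.PreparedStatement, java.sql.SQLException, java.util.List):

// Hypothetical batchSave: flushes the buffered Data objects with JDBC batching.
// Assumes "insert" is a PreparedStatement such as
// "INSERT INTO target_table (col1, col2) VALUES (?, ?)" and that Data has matching getters.
private void batchSave(PreparedStatement insert, List<Data> insertList) throws SQLException {
	for (Data d : insertList) {
		insert.setString(1, d.getCol1()); // hypothetical getter
		insert.setString(2, d.getCol2()); // hypothetical getter
		insert.addBatch();                // queue the row instead of executing it one by one
	}
	insert.executeBatch();                // send the whole batch to Sdb in one round trip
	insert.getConnection().commit();      // assuming auto-commit was turned off for the migration
	insert.clearBatch();
}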

The number of records in the following test table is 280,000
Tomcat JavaOPT defaults:
-client
-Xmx256M
By default the JVM starts in client mode, and the garbage collectors are Serial New + Serial Old.
With counts % 10000 == 0 (a batch of 10,000), an OOM is thrown: java heap space.
Moving the declaration of d outside the loop (Data d = null;) still throws OOM: java heap space.
When the batch is 5000:
JConsole heap information:
Time: 2016-09-29 16:40:42
Used: 107,876 Kb
Allocation: 253,440 Kb
Max: 253,440 Kb
GC Time:
Copy (326 collections) took 2.518 seconds
MarkSweepCompact (171 collections) took 1 minute
Time taken: 220.66 s
VisualVM: (screenshot)

Visual GC: (screenshot)
After removing log.info("============Records:"+counts); from the loop:
JConsole heap info:
Time: 2016-09-29 16:25:44
Used: 244,748 Kb
Allocation: 253,440 Kb
Max: 253,440 Kb
GC Time:
Copy (332 collections) took 2.456 seconds
MarkSweepCompact (114 collections) took 39.902 seconds
Time taken:
This shows that unnecessary logging should not be printed inside the loop.
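Per-record logging in a hot loop costs string concatenation plus I/O on every row. If some progress output is still wanted, a hedged alternative is to log only when a batch is flushed and to guard the call, assuming a log4j/commons-logging style logger that supports isInfoEnabled():

// Log progress only once per flushed batch, and skip the string concatenation
// entirely when INFO logging is disabled.
if (counts % 5000 == 0 && log.isInfoEnabled()) {
	log.info("============Records:" + counts);
}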
VisualVM: (screenshot)

Visual GC: (screenshot)
When batch is 2000:
JConsole heap info:
Time: 2016-09-29 16:50:28
Used: 252,652 Kb
Allocation: 253,440 Kb
Max: 253,440 Kb
GC Time:
Copy (321 collections) took 2.369 seconds
MarkSweepCompact (106 collections) took 36.438 seconds
Time taken: 145.524 s

VisualVM: (screenshot)

Visual GC: (screenshot)
Compared with the batch of 5000, the heap-memory trend is gentler, and the numbers of Copy and MarkSweepCompact collections are lower, so the elapsed time drops.

When the batch is 1000:
JConsole heap info:
Time: 2016-09-29 17:03:10
Used: 249,215 Kb
Allocation: 253,440 Kb
Max: 253,440 Kb
GC Time: 
Copy (330 collections) took 2.299 seconds
MarkSweepCompact (100 collections) took 34.530 seconds
Time taken: 159.459 s
In terms of elapsed time and Copy collections this is no better than the 2000 batch; only the MarkSweepCompact count is lower, by 6 collections.
Comparing 5000, 2000 and 1000 shows that the batch size affects processing speed, so an appropriate batch size should be chosen.

Now let's adjust the JavaOPT settings:
-server
-XX:+PrintGCDetails
-Xloggc:E:\gc.log
In server mode the garbage collectors default to PS Scavenge + PS MarkSweep (Parallel Scavenge + Parallel Old).
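A quick way to confirm which collectors the running JVM actually selected (the same names JConsole reports: Copy/MarkSweepCompact in client mode versus PS Scavenge/PS MarkSweep in server mode) is to query the GC MXBeans; a small self-contained sketch:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ShowCollectors {
	public static void main(String[] args) {
		// Each bean corresponds to one collector; getName() matches the names shown in JConsole.
		for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
			System.out.println(gc.getName() + ": " + gc.getCollectionCount()
					+ " collections, " + gc.getCollectionTime() + " ms");
		}
	}
}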
My machine has 8 GB of memory; in default server mode Xmx is about 1024M. With the same processing logic:
When batch is 2000:
JConsole heap info:
Time: 2016-09-29 17:12:41
Used: 337,969 Kb
Allocation: 574,784 Kb
Max: 932,096 Kb
GC Time: 
PS MarkSweep (6 collections) took 1.475 seconds
PS Scavenge (74 collections) took 1.090 seconds
Time taken: 114.25 s

When the batch is 5000:
Time: 2016-09-29 17:19:26
Used: 365,969 Kb
Allocation: 547,968 Kb
Max: 932,096 Kb
GC Time: 
PS MarkSweep (6 collections) took 1.400 seconds
PS Scavenge (70 collections) took 1.212 seconds
Time taken: 107.12 s
From the above it can be seen that the number of young-generation collections drops and the elapsed time drops, which means that as the heap memory grows, the batch size can be adjusted accordingly to improve processing efficiency.
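Since the workable batch size tracks the available heap, one option is to derive it at runtime instead of hard-coding it. The sketch below is only a heuristic; the per-record size estimate and the clamping bounds are assumptions that would have to be tuned against the real Data objects:

// Heuristic: scale the batch size with the maximum heap instead of hard-coding it.
long maxHeap = Runtime.getRuntime().maxMemory();   // roughly the -Xmx value, in bytes
long bytesPerRecord = 2 * 1024;                    // assumed average footprint of one Data row
long candidate = maxHeap / 100 / bytesPerRecord;   // keep one batch far below the total heap
int batchSize = (int) Math.max(1000L, Math.min(10000L, candidate)); // clamp to a sane range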


Now let's look at the maximum number of records that can be processed in -server mode:
java OPTS:
-server
-Xms512M
-Xmx1024M
-XX:NewSize=384M
-XX:ParallelGCThreads=4
-XX:+PrintGCDetails
-Xloggc:E:\gc.log
Time: 2016-09-29 18:16:19
Used: 1,004,657 Kb
Allocation: 1,028,288 Kb
Max: 1,028,288 Kb
GC Time:
PS MarkSweep (424 collections) took 10 minutes
PS Scavenge (121 collections) took 3.631 seconds
Under the above configuration, the maximum number of records that can be processed is about 1,040,000.
Summary:
From the above analysis, the batch size should be adjusted as the heap memory changes in order to reach the processing efficiency required by the project;
with 1 GB of heap memory, data volumes below the million level can still be handled.
Paging batch processing:
java opts:
-server
-Xms1024M
-Xmx1536M
-XX:NewSize=512M
-XX:ParallelGCThreads=4
Main logic:
  sql = "SELECT COUNT(*) FROM " + test;
	ps = con.prepareStatement(sql);
	countResultSet = ps.executeQuery();
	int sums =0;
	while(countResultSet.next()){
		sums =countResultSet.getInt(1);
	}			
	int batches = 0;
	if( sums > 100000){
		if(sums % 100000 ==0){
			batches = sums/100000;
		}
		else{
			batches = sums/100000  + 1;
		}
	}
	int counts =0;//Number of records
	for(int i =1;i<=batches;i++){
	logger.info("==============第"+i+"页start==========");
	counts+=InsertRecordByPages((i-1)*100000+1,(i)*100000);
	logger.info("==============Number of records updated: "+counts);
	logger.info("==============第"+i+"页end==========");
       }

In the InsertRecordByPages function, local variables that are no longer needed should be set to null so they can be reclaimed by GC; a minimal sketch of the function follows below.
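This sketch of InsertRecordByPages assumes a MySQL-style LIMIT clause for the page query and reuses the con/test/insert/getData/batchSave names from the snippets above; the SQL dialect and the mapping are assumptions, not the original code:

// Hypothetical paged step: reads rows [start, end] of the source table and batch-inserts them.
private int InsertRecordByPages(int start, int end) throws SQLException {
	int inserted = 0;
	// MySQL-style pagination; an Oracle source would need ROWNUM instead.
	String pageSql = "SELECT * FROM " + test + " LIMIT " + (start - 1) + ", " + (end - start + 1);
	PreparedStatement ps = con.prepareStatement(pageSql);
	ResultSet rs = ps.executeQuery();
	List<Data> insertList = new ArrayList<Data>();
	try {
		while (rs.next()) {
			insertList.add(getData(rs));
			inserted++;
			if (inserted % 5000 == 0) {   // reuse the batch size chosen earlier
				batchSave(insert, insertList);
				insertList.clear();
			}
		}
		if (!insertList.isEmpty()) {      // flush the tail of the page
			batchSave(insert, insertList);
			insertList.clear();
		}
	} finally {
		insertList = null;                // drop the reference, as recommended above
		rs.close();
		ps.close();
	}
	return inserted;
}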
Result: about 1.4 million records can be processed.
==============Page 15 start==========
2016-09-30 16:16:10 -885050 ==============Number of records updated: 1409864
2016-09-30 16:16:10 -885050 ==============Page 15 end==========
Anything larger than that can only be migrated with a tool.










