1. Introduction
Since moving to the new environment, problems have surfaced one after another; without the experience accumulated over the years, I might well have given up by now. Even as a database full-stack engineer (Oracle/MySQL/SQL Server/SAP HANA/PostgreSQL/MongoDB/Redis), there is clearly still a lot to improve in this environment. Today a colleague came to me and said that data in one of the report libraries had been missing since April 5, and asked whether there was a problem with the OGG data synchronization. I was taken aback. First, the problem occurred on April 5 and went unnoticed for almost a month, which shows the monitoring mechanism is far from complete. Second, the business department reported it far too late. Still, when there is a problem, solve the problem first.
2. Troubleshooting
Log in to the OGG source side and check the related processes:
[oracle@dg dirprm]$ ../ggsci
GGSCI (dg) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     DPRPT01     00:00:00      666:57:30
EXTRACT     ABENDED     EXTRPT01    00:00:00      666:57:38
GGSCI (dg) 2> info EXTRPT01

EXTRACT    EXTRPT01  Last Started 2017-10-27 18:12   Status ABENDED
Checkpoint Lag       00:00:00 (updated 666:58:56 ago)
Log Read Checkpoint  Oracle Redo Logs
                     2018-04-05 23:00:24  Seqno 282927, RBA 832138240
                     SCN 12.2143954677 (53683562229)
The check above shows that the extract process on the source side has been abended for about 27 days, i.e. since 23:00 on April 5. What is the specific cause? Check the OGG error log:
[oracle@dg ogg]$ cd dirrpt/
[oracle@dg dirrpt]$ vi DPRPT010.rpt
2018-04-05 23:00:36  ERROR   OGG-01098  Could not flush "./dirdat/e1000004383" (error 28, No space left on device).
Failed to save data to 'dirdmp/gglog-EXTRPT01.dmp', error 28 - No space left on device
The specific cause of the hang is insufficient disk space: the extracted data could not be written to the trail file. Checking now, the disk has plenty of free space, so try restarting the extract process.
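A quick way to confirm the space situation is to check the filesystem holding the trail files. A minimal sketch (the TRAIL_DIR path and the 5 GB threshold are assumptions; substitute your own OGG home and policy):

```shell
#!/bin/sh
# Check free space on the filesystem holding the OGG trail files.
# TRAIL_DIR is an assumed path -- adjust to your environment.
TRAIL_DIR=${TRAIL_DIR:-/tmp}

# df -P prints one POSIX-format line per filesystem; field 4 is KB available.
avail_kb=$(df -P "$TRAIL_DIR" | awk 'NR==2 {print $4}')
echo "Available space under $TRAIL_DIR: ${avail_kb} KB"

# Warn when less than ~5 GB remains (threshold is an arbitrary example).
if [ "$avail_kb" -lt 5242880 ]; then
    echo "WARNING: low disk space, extract may abend with OGG-01098 (error 28)"
fi
```

Running such a check from cron before the disk actually fills would have caught this incident weeks earlier.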
GGSCI (dg) 1> start EXTRPT01

Sending START request to MANAGER ...
EXTRACT EXTRPT01 starting

GGSCI (dg) 3> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     DPRPT01     00:00:00      667:04:13
EXTRACT     RUNNING     EXTRPT01    00:00:00      667:04:21

About one minute later:

GGSCI (dg) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     DPRPT01     00:00:00      667:05:10
EXTRACT     ABENDED     EXTRPT01    00:00:00      667:05:17
The startup failed. Check the error log DPRPT01.rpt again; the error is as follows:
2018-05-03 11:36:52  ERROR   OGG-00446  Could not find archived log for sequence 282927 thread 1 under default destinations SQL <SELECT name FROM v$archived_log WHERE sequence# = :ora_seq_no AND thread# = :ora_thread AND resetlogs_id = :ora_resetlog_id AND archived = 'YES' AND deleted = 'NO'>, error retrieving redo file name for sequence 282927, archived = 1, use_alternate = 0. Not able to establish initial position for sequence 282927, rba 790067728.
2018-05-03 11:36:52  ERROR   OGG-01668  PROCESS ABENDING.
This shows that when the extract process tried to resume reading from the source's archive logs, it could not find the archived log for sequence 282927 (presumably it had already been cleaned up). Confirm from the database side:
col name for a55
set line 200
set pagesize 20000
select sequence#, name, COMPLETION_TIME, STATUS
  from v$archived_log
 where sequence# >= 282926
   and rownum <= 30;
The query confirms that the archived logs from sequence 282927 through May 2 have already been deleted. At this point the cause and the current situation are clear.
To summarize: the extract process abended because the disk filled up; synchronization was then left broken for a long time, during which the source's archived logs were purged. As a result, OGG data synchronization can no longer be recovered simply by restarting the extract process.
3. Solutions
1. Restore the archived logs from backup and let OGG catch up from sequence 282927. (Given that roughly 80 GB of archive logs are generated per day, nearly a month of archives would have to be restored and then replayed, so this method was not adopted.)
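For reference, had this route been taken, the missing archives could be restored with RMAN along these lines (a sketch only: the starting sequence comes from the error above, while the destination path is a placeholder assumption and must point somewhere with enough space):

-- connected to the source database as RMAN target
RMAN> run {
  set archivelog destination to '/backup/arch_restore';
  restore archivelog from sequence 282927;
}

After the restore, the extract process would be restarted and allowed to mine the recovered archives until it catches up with the online redo.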
2. Redeploy the OGG source and target processes and re-initialize the data. A check shows that 11 tables need to be synchronized, the largest holding about 62.5 million rows, so the initial load is reasonably fast.
select count(1) from testuser.t_t1;  --     1163
select count(1) from testuser.t_t2;  --  3794574
select count(1) from testuser.t_t3;  -- 14461070
select count(1) from testuser.t_t4;  --   135962
select count(1) from testuser.t_t5;  --  3331344
select count(1) from testuser.t_t6;  --  5961455
select count(1) from testuser.t_t7;  --   131280
select count(1) from testuser.t_t8;  --  7459898
select count(1) from testuser.t_t9;  --     8698
select count(1) from testuser.t_t10; -- 62504749
select count(1) from testuser.t_t11; -- 11581710
3. After data synchronization is restored, database monitoring must be improved, including but not limited to: DG primary/standby synchronization status, OGG process status, instance status, disk space, etc.
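As one concrete piece of that monitoring, the OGG process check can be as simple as parsing `info all` output for any process whose status is not RUNNING. A minimal sketch (in production the input would come from something like `echo "info all" | $OGG_HOME/ggsci`, where the ggsci path and any alerting command are assumptions; here a captured sample is parsed so the script is self-contained):

```shell
#!/bin/sh
# Sketch of an OGG status check: flag any process not in RUNNING state.
# Sample captured from "info all"; in production, generate this live via ggsci.
sample='Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     DPRPT01     00:00:00      667:05:10
EXTRACT     ABENDED     EXTRPT01    00:00:00      667:05:17'

# Skip the header line; keep lines whose second field (Status) is not RUNNING.
bad=$(printf '%s\n' "$sample" | awk 'NR > 1 && $2 != "RUNNING" {print $1, $3}')

if [ -n "$bad" ]; then
    echo "ALERT: OGG processes not running: $bad"
    # hook in mail/SMS alerting here
fi
```

Scheduled from cron every few minutes, a check like this would have flagged the abended extract on April 5 instead of a month later.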
4. The redeployment steps are omitted here; a separate article will cover the details of OGG data synchronization.