Oracle Case 06: OGG-01098 Could not flush "./dirdat/e1000004383" (error 28, No space left on device)

1. Introduction

Since I moved to the new environment, problems have been turning up one after another. Without the experience I had built up before, I would probably have been stuck. It seems that being a full-stack database engineer (oracle/mysql/sqlserver/sap hana/pg/mongodb/redis) still has its benefits (the new environment needs a lot of improvement...), O(∩_∩)O haha~. Today a colleague told me that the April 5 data was missing from a reporting database and asked whether something was wrong with the OGG data synchronization. I was stunned. First, the problem occurred on April 5 and went unnoticed for almost a month, which shows that the monitoring is inadequate. Second, the business department reported it far too late. Still, when there is a problem, the first thing to do is solve it.

2. Troubleshooting

Log in to the OGG source server and check the status of the related processes:

[oracle@dg dirprm]$ ../ggsci

GGSCI (dg) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPRPT01     00:00:00      666:57:30   
EXTRACT     ABENDED     EXTRPT01    00:00:00      666:57:38   

GGSCI (dg) 2> info EXTRPT01

EXTRACT    EXTRPT01  Last Started 2017-10-27 18:12   Status ABENDED
Checkpoint Lag       00:00:00 (updated 666:58:56 ago)
Log Read Checkpoint  Oracle Redo Logs
                     2018-04-05 23:00:24  Seqno 282927, RBA 832138240
                     SCN 12.2143954677 (53683562229)
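
The figures in the output above can be cross-checked with a little arithmetic. The sketch below converts the hours:minutes:seconds checkpoint age into days and rebuilds the full SCN from the wrap.base pair, assuming the usual 32-bit base (wrap × 2^32 + base). The function names are illustrative and not part of OGG.

```python
def checkpoint_age_days(age: str) -> float:
    """Convert a GGSCI 'hours:minutes:seconds' age such as '666:57:30' to days."""
    hours, minutes, seconds = (int(part) for part in age.split(":"))
    return (hours * 3600 + minutes * 60 + seconds) / 86400


def full_scn(wrap: int, base: int) -> int:
    """Combine an SCN shown as wrap.base into one number (wrap * 2**32 + base)."""
    return wrap * 2**32 + base


print(round(checkpoint_age_days("666:57:30"), 1))  # 27.8
print(full_scn(12, 2143954677))                    # 53683562229
```

The second result matches the value in parentheses on the SCN line, and the first shows that the checkpoint has not moved for nearly four weeks.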

The output shows that the extract process on the source abended about 27 days ago, at around 23:00 on April 5. To find the specific cause, check the OGG report file:

[oracle@dg ogg]$ cd dirrpt/

[oracle@dg dirrpt]$ vi DPRPT010.rpt

2018-04-05 23:00:36  ERROR   OGG-01098  Could not flush "./dirdat/e1000004383" (error 28, No space left on device).
Failed to save data to 'dirdmp/gglog-EXTRPT01.dmp', error 28 - No space left on device
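
Error 28 in the message is the operating system's ENOSPC code ("No space left on device"). As a minimal sketch, free space on the file system holding the trail files could be checked as follows. The `./dirdat` path comes from the error message, while the function name and the 10 GB threshold are assumptions made for illustration.

```python
import errno
import os
import shutil


def has_enough_space(path: str, min_free_bytes: int) -> bool:
    """Return True if the file system containing `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes


# errno 28 is ENOSPC on Linux, the code reported in OGG-01098.
print(errno.ENOSPC, os.strerror(errno.ENOSPC))

# Hypothetical threshold: warn when less than 10 GB is free for the trail directory.
# Falls back to the current directory when ./dirdat does not exist.
trail_dir = "./dirdat" if os.path.isdir("./dirdat") else "."
if not has_enough_space(trail_dir, 10 * 1024**3):
    print(f"WARNING: low disk space under {trail_dir}")
```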

The process abended because the disk ran out of space, so the extract could not write to the trail file. The disk currently has enough free space, so try restarting the extract process:

GGSCI (dg) 1> start EXTRPT01
Sending START request to MANAGER ...
EXTRACT EXTRPT01 starting

GGSCI (dg) 3> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPRPT01     00:00:00      667:04:13   
EXTRACT     RUNNING     EXTRPT01    00:00:00      667:04:21   

One minute later:

GGSCI (dg) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPRPT01     00:00:00      667:05:10   
EXTRACT     ABENDED     EXTRPT01    00:00:00      667:05:17   

The restart failed. Checking the report file DPRPT01.rpt again shows the following error:

2018-05-03 11:36:52  ERROR   OGG-00446  Could not find archived log for sequence 282927 thread 1 under default destinations SQL <SELECT  name    FROM v$archived_log   WHERE sequence# = :ora_seq_no AND         thread# = :ora_thread AND         resetlogs_id = :ora_resetlog_id AND         archived = 'YES' AND         deleted = 'NO' >, error retrieving redo file name for sequence 282927, archived = 1, use_alternate = 0
Not able to establish initial position for sequence 282927, rba 790067728.

2018-05-03 11:36:52  ERROR   OGG-01668  PROCESS ABENDING.

This shows that the extract process could not find the archived log it needs on the source database (it has presumably been cleaned up). Query the source database to confirm:

col name for a55;
set line 200;
set pagesize 20000;

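-- order by sequence# so that rownum keeps the first 30 logs from the checkpoint onward
select * from (
  select sequence#, name, completion_time, status from v$archived_log
   where sequence# >= 282926 order by sequence#
) where rownum <= 30;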

The query confirms that the archived logs from sequence 282927 through May 2 have been deleted. At this point, both the cause and the current state of the problem are clear.

To summarize the cause: the extract process abended when the disk ran out of space, OGG synchronization was not restored for a long time, and in the meantime the archived logs on the source were cleaned up. Simply restarting the extract process therefore cannot restore synchronization.
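
The condition behind this failure can be stated simply: the extract can resume only if every archived log from its checkpoint sequence onward is still available. A minimal sketch of that check follows. The function name and the sample sequence numbers after 282927 are hypothetical, and the set of available sequences is assumed to come from a query against v$archived_log like the one above.

```python
def missing_sequences(checkpoint_seq: int, latest_seq: int, available: set[int]) -> list[int]:
    """List the archived log sequences in [checkpoint_seq, latest_seq] that are not available."""
    return [seq for seq in range(checkpoint_seq, latest_seq + 1) if seq not in available]


# Hypothetical sample: the extract needs 282927 onward, but only later logs remain.
available = {282930, 282931}
gaps = missing_sequences(282927, 282931, available)
print(gaps)  # [282927, 282928, 282929]
print("can resume" if not gaps else "cannot resume without restoring logs")
```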

3. Solutions

1. Restore the archived logs from backup and let OGG catch up. (The source generates about 80 GB of archived logs per day, so a month of logs is a large volume to restore, and the extract would then still have to process all of it. This method was not adopted.)

2. Redeploy the OGG replication from source to target and resynchronize the data. Only 11 tables need to be synchronized, and the largest holds about 62.5 million rows, so a full resynchronization is relatively fast.

select count(1) from testuser.t_t1;   --     1163
select count(1) from testuser.t_t2;   --  3794574
select count(1) from testuser.t_t3;   -- 14461070
select count(1) from testuser.t_t4;   --   135962
select count(1) from testuser.t_t5;   --  3331344
select count(1) from testuser.t_t6;   --  5961455
select count(1) from testuser.t_t7;   --   131280
select count(1) from testuser.t_t8;   --  7459898
select count(1) from testuser.t_t9;   --     8698
select count(1) from testuser.t_t10;  -- 62504749
select count(1) from testuser.t_t11;  -- 11581710

3. Once synchronization is restored, improve the database monitoring, including but not limited to Data Guard primary/standby status, OGG process status, and instance status.

4. The redeployment steps are omitted here. (A separate article will explain OGG data synchronization in detail.)
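
For the monitoring mentioned in point 3, one simple option is to parse the text of `info all` periodically and alert on any process that is not RUNNING or whose checkpoint is stale. The sketch below works on captured output only. How the output is collected and how alerts are delivered are left out, and the function name and the one-hour threshold are assumptions.

```python
import re

# Matches lines such as: "EXTRACT     ABENDED     EXTRPT01    00:00:00      666:57:38"
LINE_RE = re.compile(
    r"^(EXTRACT|REPLICAT)\s+(\S+)\s+(\S+)\s+(\d+:\d{2}:\d{2})\s+(\d+):(\d{2}):(\d{2})\s*$"
)


def find_problems(info_all_output: str, max_chkpt_age_hours: float = 1.0) -> list[str]:
    """Return alert messages for processes that are not RUNNING or have a stale checkpoint."""
    problems = []
    for line in info_all_output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # header, MANAGER line, or blank line
        program, status, group, _lag, hours, minutes, seconds = match.groups()
        age_hours = int(hours) + int(minutes) / 60 + int(seconds) / 3600
        if status != "RUNNING":
            problems.append(f"{program} {group} is {status}")
        elif age_hours > max_chkpt_age_hours:
            problems.append(f"{program} {group} checkpoint is {age_hours:.0f} hours old")
    return problems


# Sample taken from the `info all` output captured during this incident.
sample = """\
MANAGER     RUNNING
EXTRACT     RUNNING     DPRPT01     00:00:00      666:57:30
EXTRACT     ABENDED     EXTRPT01    00:00:00      666:57:38
"""
for problem in find_problems(sample):
    print(problem)
```

On the output from this incident, both extract lines would have raised an alert: EXTRPT01 for being ABENDED, and DPRPT01 for a checkpoint that had not moved in 667 hours even though its status was RUNNING.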
