质疑页修复实践

 
 

背景

最近我们在例行检查一台新迁不久的数据库服务器的JOB执行失败,查看历史记录,发现 错误提示如下:
Date        2019-01-06 0:25:00
Log        Job History (新增门店更新数据)

Step ID        1
Server        ********
Job Name        新增门店更新数据
Step Name        更新信息
Duration        00:00:47
Sql Severity    24
Sql Message ID    824
Operator Emailed    
Operator Net sent    
Operator Paged    
Retries Attempted    0

Message
Executed as user: NT AUTHORITY\SYSTEM. SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xca6a4ba5; actual: 0xca6a4b25). It occurred during a read of page (5:3768184) in database ID 5 at offset 0x0000072fef0000 in file 'F:\data\********_3.mdf'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. [SQLSTATE HY000] (Error 824).  The step failed.
 

操作环境

Microsoft SQL Server 2012 (SP4) (KB4018073) - 11.0.7001.0 (X64) 
 Aug 15 2017 10:23:29 
 Copyright (c) Microsoft Corporation
 Web Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: ) (Hypervisor)
 

操作步骤

通过上述的错误提示,初步判断是页质疑了,查看对应的页质疑的记录
SELECT * FROM [msdb].[dbo].[suspect_pages];
GO
 
在问题数据库的查询窗口下,执行如下代码
 
dbcc checkdb

主要的错误信息如下:

Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data), page (5:3768184). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4.
Msg 8928, Level 16, State 1, Line 1
Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data): Page (5:3768184) could not be processed.  See other errors for details.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768184) was not seen in the scan although its parent (4:3692451) and previous (1:6408870) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768185) is missing a reference from previous page (5:3768184). Possible chain linkage problem.

通过object id 2131746897,可以先查询一下是哪个表

SELECT OBJECT_NAME(2131746897)
 
如果是较大型的业务库,几百G 以上的库,直接dbcc checkdb 耗时是非常久的,可以根据之前的job错误提示,直接执行如下:
可以快速定位到哪个表,什么索引;
本想尝试使用 restore page ,可惜因维护策略已自动清理出现质疑前的备份,该方法只能放弃;
尝试把 全部记录插入新表也是失败
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xca6a4ba5; actual: 0xca6a4b25). It occurred during a read of page (5:3768184) in database ID 5 at offset 0x0000072fef0000 in file 'F:\data\******_3.mdf'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

为了验证能否修复,我们把生产环境的备份拷贝到测试环境并恢复。

a.先验证恢复的库问题是否真实存在;
dbcc checktable('dbo.sfa_t_tmonvivid')
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data), page (5:3768184). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4.
Msg 8928, Level 16, State 1, Line 1
Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data): Page (5:3768184) could not be processed.  See other errors for details.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768184) was not seen in the scan although its parent (4:3692451) and previous (1:6408870) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768185) is missing a reference from previous page (5:3768184). Possible chain linkage problem.
DBCC results for 'sfa_t_TMonVivid'.
There are 7548683 rows in 629541 pages for object "sfa_t_TMonVivid".
CHECKTABLE found 0 allocation errors and 4 consistency errors in table 'sfa_t_TMonVivid' (object ID 2131746897).
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (*********.dbo.sfa_t_TMonVivid).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
 
     暂时先不使用repair_allow_data_loss 修复,看错误提示是index id 1 试一下直接删除索引、再新建或重建;
b.把该库置换成单用户;
     这个步骤方便维护后续进行 丢失数据修复
c.删除问题表上已有的索引或重建;
    在已打开的单用户模式下删除已有的索引
  重建索引
都是报一样的错误,看来新建重建索引都不行。下图是未修改单用户时查看问题表的记录量的截图,单用户模式下可以使用 sp_spaceused [问题表名] 来查看记录数;
 
d.使用丢失数据修复;
dbcc checktable ('dbo.sfa_t_tmonvivid',repair_allow_data_loss )
DBCC results for 'sfa_t_TMonVivid'.
Repair: The Clustered index successfully rebuilt for the object "dbo.sfa_t_TMonVivid" in database "xw_dl_1008307_error".
Repair: The page (5:3768184) has been deallocated from object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data).
Msg 8945, Level 16, State 1, Line 1
Table error: Object ID 2131746897, index ID 1 will be rebuilt.
        The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data): Page (5:3768184) could not be processed.  See other errors for details.
        The error has been repaired.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data), page (5:3768184). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 2057 and -4.
        The error has been repaired.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768184) was not seen in the scan although its parent (4:3692451) and previous (1:6408870) refer to it. Check any previous errors.
        The error has been repaired.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2131746897, index ID 1, partition ID 72057594803716096, alloc unit ID 72057595502329856 (type In-row data). Page (5:3768185) is missing a reference from previous page (5:3768184). Possible chain linkage problem.
        The error has been repaired.
There are 7548683 rows in 629541 pages for object "sfa_t_TMonVivid".
CHECKTABLE found 0 allocation errors and 4 consistency errors in table 'sfa_t_TMonVivid' (object ID 2131746897).
CHECKTABLE fixed 0 allocation errors and 4 consistency errors in table 'sfa_t_TMonVivid' (object ID 2131746897).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
 
e.验证问题表问题是否已修复;
f.结论
质疑页的问题是解决了,但丢失了7548695-7548683=12条记录;
好消息当时迁移完之后,原来旧的服务器还未释放资源即原来的环境还在且页未出现质疑,通过恢复问题数据库至旧的服务器上与原来的库对较一下,找出丢失的12条记录。 该表 sfa_t_tmonvivid 有一个自增主键字段VividID,只要比对旧服务器对应的表的最大VividID代码如下:
use [迁移前的库名]
select a.* into wen from sfa_t_tmonvivid a  left join [迁移后的生产备份库_error].dbo.sfa_t_tmonvivid b
on a.VividID=b.VividID
where b.VividID is null and a.VividID<=7531837
 
再把表wen 的记录导成insert values 脚本,在问题库上执行
 
USE [xw_dl_****]
GO
set identity_insert  [sfa_t_tmonvivid] on ;
--INSERT [dbo].[sfa_t_tmonvivid] ([VividID], [StoreID], [UserNumber]
--插入实际丢失的数据
set identity_insert  [sfa_t_tmonvivid] off ;
 
g.重新验证问题表,数据量一致且无错误提示
看来平时备份并验证有效性是非常有必要,关键时刻能起大作用;

参考

 
 

猜你喜欢

转载自www.cnblogs.com/jil-wen/p/10405054.html