Troubleshooting and Fixing SQL Server Page Level Corruption
Corrupt SQL Server databases are the worst nightmare of any SQL Server professional. In any environment, from small business to enterprise, the compromise of integrity and availability of the data can constitute a business emergency. This is especially the case in those organizations reliant on an OLTP data model, for a high-volume website. SQL Server database corruption and disruption of the transaction processing system can cause business repercussions such as large financial losses, a drop in reputation or customer retention, or contractual SLA problems with the service provider, if not managed in-house. This tip will demonstrate the diagnosis process, discuss one method of correcting page-level corruption without using the REPAIR options with the DBCC CHECKDB command and outline how to get the SQL Server back online.
Diagnosis – SQL Server Corruption
Normally any business will have contingency plans to deal with SQL Server database corruption scenarios, and any good database professional will have immediately thought of a number of strategies to deal with these. Items such as disaster recovery plans, replication, Database MirroringAlwaysOn, Clustering, etc. However, sometimes these options are not appropriate. Imagine you have a 20GB database populated with records of your customers and records of each transaction linked back to your customers. You have two .mdf data files. You are using the full recovery model, full backups are taken daily at 21:00 with transaction log backups taken every 15 minutes. You don’t use mirroring, replication or clustering, instead of relying on a robust backup model to protect your data. At 20:00, a message like this occurs:
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: unable to decrypt page due to missing DEK. It occurred during a read of page (3:0) in database ID 10 at offset 0000000000000000 in file 'c:delcorruption_secondary.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
It is clear that the SQL Server database integrity has been compromised. As this sample error reports, there’s a problem in the ‘corruption_secondary.mdf’ file (my second data file in this test database) at page 3:0, offset 000..000 (right at the beginning). So what do you do? The first instinct is to think, ‘Restore! Restore! Restore!’ But in this scenario, this will involve the following steps as a minimum:
- Set database in single-user mode
- Tail log backup of all transactions since last transaction log backup
- Full restore of backup from today – 1
- Restore of 4 transaction logs per hour multiplied by 23 hours = 92 individual logs
- Restore of tail log up to the point of corruption (point in time recovery)
- DBCC check of the database
- Set database online
But wait – there’s two things to consider. One, what’s your recovery time objective (RTO)? Is it, let’s say, four hours? Are you sure you can perform 93 individual backups in four hours, over 20GB of data? What if you’re using SATA drives – IOPS constraints will make this a close-run race. What about configuration of the script to do this? What happens if one of those logs is also corrupt? And what happens if the corruption pre-dates the error message (i.e. the error message wasn’t a direct response to a very recent event, like disk failure) but instead has festered inside your database? What if (horror of horrors)… the full backups include this corrupted data?
Fortunately, there’s a few ways to identify corruption in the database. The first, and the most well-known, is DBCC CHECKDB. This utility will perform systematic data integrity checks throughout the datafiles and identify areas of concern. As documented in other excellent articles, this comes with three main options for recovery: REPAIR_ALLOW_DATA_LOSS, REPAIR_FAST and REPAIR_REBUILD. These options are not ideal, especially REPAIR_ALLOW_DATA_LOSS, although it is perfectly possible to restore without data loss using these tools. This article is not going to demonstrate the use of CHECKDB, rather it will demonstrate a different method. Here’s the warning about CHECKDB repairs from Books Online:
"Use the REPAIR options only as a last resort. To repair errors, we recommend restoring from a backup. Repair operations do not consider any of the constraints that may exist on or between tables. If the specified table is involved in one or more constraints, we recommend running DBCC CHECKCONSTRAINTS after a repair operation. If you must use REPAIR, run DBCC CHECKDB without a repair option to find the repair level to use. If you use the REPAIR_ALLOW_DATA_LOSS level, we recommend that you back up the database before you run DBCC CHECKDB with this option."