6.5 Disaster Recovery Plan

Download the disaster recovery template in Word format

Any sound disaster recovery plan needs to encompass every known situation that can result in loss of service or data. Each of these shall be addressed in this document along with specific steps for recovering from a service or data loss.

The disaster recovery plan must be validated and tested. This is not a one-time process. Testing and validation of the disaster recovery plan must occur on a regularly scheduled basis. The entire plan needs to be tested and validated once before being rolled out. After that, the DBA staff needs to test and probe at least a part of the disaster recovery plan on a weekly basis. This testing is required simply because data volumes grow and environments change. What was adequate to meet disaster recovery requirements this week may be insufficient next week.

This process requires a test server for the DBA staff that can accommodate one of the largest production systems. This server gives the DBA staff the ability to continuously probe, test, break, fix, enhance, and validate the disaster recovery plan. No plan can account for everything that can occur in a production environment. The ability to continuously test the disaster plan allows the team to find and close holes before they are encountered in a production system. It also gives the team practice and familiarity with the disaster recovery plan. This shortens the response time in the event of an emergency and reduces the errors that can occur. Recovering a system after having been through the procedure dozens of times is much better than doing it for the first time while the system is down. Practice brings systems back online in a much more efficient manner. The service level also improves, because the time estimates are based on practice and experience instead of guesswork.

The way this process works is that each week, the lead DBA selects a segment of the disaster recovery plan to test. Each DBA in turn must complete this segment of the recovery process. Each area tested is documented and the process is timed for statistical purposes. Any problems encountered, as well as workarounds, are documented. At the end of each recovery a second person, normally the lead DBA, verifies that the recovery is valid and that the system is ready to go into production. Once the verification is done, the entire process needs to be evaluated both for performance and for any problems encountered. All problems should be documented and the disaster recovery plan updated to address them. This new addition to the disaster recovery plan should then be tested during the next testing cycle.
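The record keeping described above (who ran which segment, how long it took, whether it was verified, and what went wrong) could be tracked with something as small as the following Python sketch. The names `DrillResult`, `summarize`, and `open_problems` are hypothetical and are not part of the original plan or template. They only illustrate one way to turn timed drills into experience-based estimates.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DrillResult:
    """One DBA's timed run of one segment of the recovery plan (hypothetical record)."""
    dba: str
    segment: str
    minutes: float
    verified: bool          # True once a second person has validated the recovery
    problems: tuple = ()    # free-text notes on problems and workarounds


def summarize(results, segment):
    """Return run count, mean time, and worst time for verified runs of a segment."""
    times = [r.minutes for r in results if r.segment == segment and r.verified]
    if not times:
        return None  # no verified runs yet, so no estimate can be given
    return {
        "runs": len(times),
        "mean_minutes": mean(times),
        "worst_minutes": max(times),
    }


def open_problems(results):
    """Collect every distinct problem noted so the plan can be updated."""
    return sorted({p for r in results for p in r.problems})


# Example data, invented purely for illustration.
results = [
    DrillResult("dba_a", "restore master", 40.0, True),
    DrillResult("dba_b", "restore master", 50.0, True, ("tape label mismatch",)),
    DrillResult("dba_c", "restore master", 95.0, False, ("restore failed verification",)),
]
summary = summarize(results, "restore master")
```

Only verified runs feed the time estimate here, on the assumption that a recovery which failed verification should not count toward the expected outage time. Its problems are still collected so the plan can be updated before the next cycle.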

In addition to the regularly scheduled recovery drills, the lead DBA should randomly choose at least one time every month for a live disaster drill. This disaster drill is treated just like a real emergency. The date should be selected by the lead DBA and approved by management. No one else on the DBA staff should know when it is going to occur. Since this is treated just like a live disaster, nothing should interfere with the drill. Support gets put on hold or handled exactly the way it would be in a real emergency. This last part of the disaster recovery program obviously needs to be approved and supported by all levels of management. The entire purpose of a drill of this kind is practice in a live situation. This sets a real-life benchmark for everyone who would be affected by an outage. The DBAs and management learn how long a recovery will take and how the business could be impacted. The users get to understand how service levels for routine operations will be affected during a recovery scenario.
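As a small illustration of choosing the drill date at random, the Python sketch below picks one date per month. The function name `pick_drill_date` is hypothetical, and restricting the choice to Monday through Friday is an assumption made here, not a requirement stated in the plan.

```python
import calendar
import random
from datetime import date


def pick_drill_date(year, month, rng=None):
    """Pick one random weekday in the given month for the unannounced drill.

    Assumption: drills only happen Monday through Friday. Pass a seeded
    random.Random as rng to make the choice repeatable for testing.
    """
    rng = rng or random.Random()
    days_in_month = calendar.monthrange(year, month)[1]
    candidates = [
        date(year, month, day)
        for day in range(1, days_in_month + 1)
        if date(year, month, day).weekday() < 5  # 0-4 are Monday-Friday
    ]
    return rng.choice(candidates)


drill = pick_drill_date(2024, 2, random.Random(0))
```

The chosen date still has to go to management for approval, and it should not be shared with the rest of the DBA staff.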

This process gives you a highly trained DBA team that can quickly and accurately diagnose and recover from problems in your environment. With the practice also comes familiarity with recovery procedures and the specific processes for your environment. This allows you to set, and reasonably expect, specific levels of service and outage time frames.

The other things that make your systems much more stable are environmental in nature. Every server should have clean, conditioned power running to it. This protects the hardware from power surges, spikes, and fluctuations. The servers should also have a UPS attached to them to protect the hardware in the case of a power failure. While most people will place the computers on a UPS, don’t overlook having the tape drives on a UPS as well. You need to ensure that the backup will continue to run even though power has gone out and you are running off your UPS. The last thing you want is for the power to go out and abruptly shut off the server. Many power outages are very short lived, and without a UPS you can have an even more serious situation when the power comes back on. The return of power generally produces a surge, and the machine will immediately start back up. This can very easily overload a circuit and cause an immediate shutdown of a machine that was only partially powered up. If you do not have a UPS connected to the servers, you run a very significant risk of hardware failure, data loss, and data corruption during a power outage. The UPS protects the hardware during power up and also gives the staff the time to properly shut down hardware if the power fails. Many of the UPS products currently on the market also give you configuration options to properly power down a server when the UPS reaches a certain threshold.

In addition to having clean power supplied to them, the servers should be housed in an environmentally controlled room. The server hardware will generate a large amount of heat. Many of the newer servers have automatic safety cutoffs that trigger when the internal temperature exceeds a safe threshold. This will cause the server to shut down unexpectedly. Controlling the temperature in the server room also has the benefit of prolonging the life of the hardware.
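The UPS shutdown threshold mentioned in this section comes down to a simple rule, sketched below in Python. This is not any vendor's API. The function name `should_shut_down` and the 20 percent default are hypothetical, and in practice the rule would be set in the configuration software shipped with the UPS.

```python
def should_shut_down(on_battery, battery_percent, threshold_percent=20.0):
    """Return True when an orderly server shutdown should begin.

    Assumptions: the UPS reports whether it is running on battery and the
    remaining charge as a percentage. The 20 percent default is an arbitrary
    illustrative value and should be sized to how long a clean shutdown takes.
    """
    return on_battery and battery_percent <= threshold_percent
```

The threshold should leave enough battery runtime for the slowest server on that UPS to finish shutting down cleanly.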

Michael R. Hotek

All content on this site, except where noted, represents an original work of Michael R. Hotek and is protected by applicable copyright laws. The SQL Server FAQ is the sole work of Neil Pike. No page, portion of a page, or download may be used for commercial purposes in whole or in part without the express, written permission of the applicable author.