Planning for Disaster Recovery
A number of technologies provide fault tolerance in the event of a failure, such as fault-tolerant disk configurations with hot-swappable drives, server clusters, and uninterruptible power supplies. However, none of these high-availability technologies can substitute for a reliable backup of mission-critical data.
In a complete site disaster, for example, it is possible that both online and offline availability technologies are destroyed (for example, all cluster nodes and all disks in a RAID array). Following such a system failure or disaster at a particular site, you must be able to recover data and systems from backup. Recovering systems or sites from failure is a daunting task unless you have thoroughly planned and prepared by implementing scheduled backups, providing for backups of open files, and configuring for Automated System Recovery (ASR). You must also thoroughly test your ability to restore data. The tasks involved in planning for disaster recovery are shown in Figure 1.10.
Figure 1.10: Planning for Disaster Recovery
Planning a Backup Schedule
The backup schedule is perhaps the most important consideration when planning for disaster recovery. It is critical to plan your backups around your restore requirements. For example, suppose you run only incremental backups of a file server over an extended period of time. This approach delivers very fast backups that use little network bandwidth. However, to restore the server, you would need to restore all of the incremental backups, which could take days to complete. A backup plan presumes a restore plan: your backup plan should always be based on your requirements for restoring data. You should carefully test and document all of the processes and the time required to fully recover your servers from backup data. Use the results of this testing to establish the requirements for your backup schedules.
Backup Decision Points
To plan an effective backup schedule that meets your organization's data protection and recovery requirements, consider both the available backup types and your restore plan. When considering backup types, remember that your choice must accommodate restoring a system within the time specified in your restore plan. Available backup types can be categorized as fundamental and advanced.
Fundamental Backup Types
The fundamental backup types available are:
Full. Full backups are the baseline for all other backups, and contain all the data on a system (or all the data in the folders or volumes that are defined to be backed up). Because full backups secure all server data, frequent full backups can provide you with greater guarantees as to the speed and success of restore operations. Remember that when you add additional backup types to the backup set, restore jobs are prolonged.
Incremental. An incremental backup stores all files that have changed since the last backup, regardless of the backup type. The advantage of incremental backups is that they take the least time to complete. However, during a restore operation, each incremental backup in the backup set is applied, which could result in a lengthy restore job.
Differential. Differential backups contain all data that has changed since the last full backup. The advantage of differential backups is that they can shorten restore time compared to a backup set that includes just full and incremental backups. However, if you perform too many differential backups within a single backup set, the size of the differential backup might grow to be as large as the baseline full backup.
Copy. Copy backups are identical to full backups, with the exception that they do not mark files as backed up, and thus do not impact any other backup types. A full backup marks the beginning of a new backup cycle, while a copy backup does not. Copy backups are most frequently used to create offsite copies of backup data. In this way, you can maintain a local full backup for quick restore purposes, and keep an offsite copy backup for use in the event that your local backup copy is lost.
Your backup schedule should probably include some combination of all backup types. Regular full backups take the longest to complete, but offer the quickest restore performance. Because full backups copy all files defined in the backup job, they also consume the most backup media. Incremental backups consume the least backup media and take the shortest time to complete; however, they result in lengthier restores. Consequently, it is best to balance the backup types so as to satisfy your organization's restore needs. For example, you might elect to perform weekly full backups of a crucial file server, and supplement them with daily differential backups and incremental backups every two hours. This approach gives you a high degree of data protection, while keeping restore time within reasonable boundaries.
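To see how the mix of backup types affects a restore, the following Python sketch works out which backup sets would have to be applied, oldest first, to bring a server back to its most recent state. It follows the definitions of full, incremental, differential, and copy backups given above and is an illustration only: the Backup record, the restore_chain function, and the sample labels are hypothetical names, not part of any backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    label: str  # hypothetical label used only for illustration
    kind: str   # "full", "differential", "incremental", or "copy"


def restore_chain(history):
    """Return the backups to apply, oldest first, for a history listed oldest first."""
    full_positions = [i for i, b in enumerate(history) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup to use as a baseline")
    start = full_positions[-1]
    chain = [history[start]]
    later = history[start + 1:]

    # A differential holds everything changed since the full backup, so only the
    # newest one is needed, and incrementals taken before it can be skipped.
    diff_positions = [i for i, b in enumerate(later) if b.kind == "differential"]
    if diff_positions:
        newest = diff_positions[-1]
        chain.append(later[newest])
        later = later[newest + 1:]

    # Every remaining incremental must be applied in order. Copy backups are
    # ignored because they do not take part in the backup cycle.
    chain.extend(b for b in later if b.kind == "incremental")
    return chain


# Assumed sample data: a mixed schedule and an incrementals-only schedule.
week = [
    Backup("Sun full", "full"),
    Backup("Mon 10:00 incremental", "incremental"),
    Backup("Mon 12:00 incremental", "incremental"),
    Backup("Mon night differential", "differential"),
    Backup("Tue 10:00 incremental", "incremental"),
    Backup("Tue 12:00 incremental", "incremental"),
]
incrementals_only = [Backup("Sun full", "full")] + [
    Backup(f"incremental {n}", "incremental") for n in range(1, 31)
]

print([b.label for b in restore_chain(week)])
print(len(restore_chain(incrementals_only)))
```

With the mixed schedule, the restore needs four backup sets (the full backup, the newest differential, and the two incrementals taken after it). With incrementals only, every set since the full backup must be applied, which is 31 sets in this example.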
Advanced Backup Types
Advanced backup types include:
Backup applications that leverage the Volume Shadow Copy service. Backup applications that support the Volume Shadow Copy service can back up open files, and all the backed-up files are from a single point-in-time. For more information about backing up open files, see "Planning to Back Up Open Files" later in this chapter.
Server-free backups. In a server-free backup, a data mover on a SAN copies data on a SAN disk to another storage device, such as a library on the SAN. A data mover is software that can reside on a SAN device, such as a switch or router, or on a server connected to the SAN. Because the data mover performs the copy operations, the server can maintain a high level of performance for clients. Server-free backup applications must have a way of creating a point-in-time snapshot, such as by leveraging the Volume Shadow Copy service or by using some other technology.
Hardware-based shadow copies. External storage arrays that support hardware-based shadow copies (also called snapshots) can implement Volume Shadow Copy service providers to fully utilize advanced storage array features.
Your choice of how to integrate advanced backup types into your backup schedule will be dictated by your restore requirements. Some systems might require hourly shadow copies, while other, low priority systems might only warrant weekly full backups and daily incremental backups. Your best strategy is to align your backup plan for each system with the restore plan for that system. If the system requires quick recovery of user files, frequent shadow copies are ideal. If the data on the same system is required to be retained for seven years, then you will also need to perform frequent full backups of the system to media that can be stored in an offsite location.
For more information about backup and recovery planning, see the Server Management Guide of the Windows Server 2003 Resource Kit (or see the Server Management Guide on the Web at http://www.microsoft.com/reskit).
Planning to Back Up Open Files
Open files are files that are normally skipped during the backup process. This is usually because the files are locked by a service or application, such as an operating system, word processing program, or application database. In Windows 2000 and Windows NT 4.0, if the operating system could not back up a file that was locked by an application, the file was skipped, and thus not backed up. In Windows Server 2003, open files are backed up by using the Volume Shadow Copy service.
When a backup is initiated by an application that can use the Volume Shadow Copy service, such as the Backup program in Windows Server 2003, the Volume Shadow Copy service makes a shadow copy of the volume to be backed up. The shadow copy constitutes a read-only copy of the volume data that is read by the backup application during the backup job. Applications can continue to access the files on the volume itself, uninterrupted by the backup. After the backup is completed, the shadow copy of the volume is deleted, because it is no longer needed. The backed-up data is stored on the backup media.
By default, Windows Server 2003 temporarily consumes free disk space on a volume for the shadow copy. The amount of disk space consumed depends on the amount of data that changes on the volume during the backup.
In the event that a shadow copy is unsuccessful, for example, when there is not enough temporary disk space available on the volume, Backup continues without using shadow copy techniques and, as in previous versions of Windows, reads files from the original volume and does not back up any open files.
To take advantage of open-file backups, purchase a backup application that works with the Volume Shadow Copy service.
Planning for Bare Metal Restores with ASR
Automated System Recovery (ASR) is a new tool for use with Backup (NTBackup.exe) and other programs created by ISVs. It has a limited but critical purpose: to help you automatically restore your system after a system failure. Previously, restoring the complete system required that you first reinstall the operating system and then restore the data. With a bare metal restore, you can boot a system from the operating system CD, and then use an ASR floppy disk to recover the system directly from a backup.
ASR works with Windows Setup to rebuild the storage configuration of the physical disks and writes the critical operating system files to the boot and system partitions in order to allow the system to boot successfully. This process is referred to as a bare metal restore, because the system is restored to hardware that has no installed software. The process uses an ASR floppy disk that defines the state of the storage prior to the disaster and the process to be used for restoring the server. After an ASR restore completes, you can restore any needed user or application files.
How ASR Works
ASR represents a new approach to backup and recovery. Prior to ASR, after a large-scale failure, you needed to reinstall Windows, configure all physical storage to the original settings, and then perform a complete restore of the data. The process of rebuilding the operating system could be lengthy, and you needed to perform many of these tasks at the local computer. ASR significantly automates this process. In addition to automating the restore of a single system, ASR can be used with Remote Installation Services (RIS) to automate the system state recovery of several systems across the network.
To prepare for ASR recovery, you must run the Automated System Recovery Wizard, which is part of Backup. To access this wizard when you are running Backup in Advanced Mode, click Tools and select ASR Wizard. You can start Backup in Advanced Mode by clearing the Always start in Wizard Mode check box when Backup starts.
The wizard backs up the operating system boot volumes and system volumes, but does not back up other volumes, such as program or data volumes. To secure data on other volumes, you must back up those volumes separately by using Backup or another backup tool. You can, however, choose to back up All information on this computer when running Backup. This option creates a full backup of your entire system, including ASR data. This means that you can recover the entire system through the ASR process in the event of failure.
When an ASR restore is initiated, ASR first reads the disk configurations from the ASR floppy disk and restores all disk signatures and volumes on the disks from which the system boots. In the ASR process, these are known as critical disks, because they are required by the operating system. Noncritical disks (disks that might store user or application data) are not backed up as a part of a normal ASR backup, and are not included in an ASR restore. If these disks are not corrupted, their data will still be accessible after the ASR restore completes. If you want to secure data on noncritical disks from disk failure, you can do so by backing it up separately.
After the critical disks are recreated, ASR performs a simple installation of Windows Server 2003 and automatically starts a restore from backup using the backup media originally created by the ASR Wizard. During an ASR restore, any Plug and Play devices on the system are detected and installed.
Before performing an ASR restore, ensure that the target system to which the restore will be made meets the following requirements:
The target system hardware (except for hard disks, video cards, and network adapters) is identical to that of the original system.
There are enough disks to restore all the critical system disks.
The number and storage capacity of the critical disks are at least as great as those of the corresponding original disks.
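The disk-count and capacity requirements above can be checked ahead of time with a short script. The following Python sketch is only a planning aid under stated assumptions: the function name target_disks_sufficient and the sample capacities are hypothetical, it assumes that any sufficiently large target disk can stand in for an original critical disk, and it does not check the hardware-identity requirement or reproduce how ASR itself matches disks.

```python
def target_disks_sufficient(original_critical_gb, target_gb):
    """Return True if the target has enough disks, each large enough,
    to hold every original critical disk (capacities in GB)."""
    if len(target_gb) < len(original_critical_gb):
        return False  # not enough disks on the target system
    originals = sorted(original_critical_gb, reverse=True)
    targets = sorted(target_gb, reverse=True)
    # Pair the largest original disk with the largest target disk, the second
    # largest with the second largest, and so on. Extra target disks are unused.
    return all(t >= o for o, t in zip(originals, targets))


# Assumed sample capacities, for illustration only.
print(target_disks_sufficient([36, 18], [72, 36, 18]))
print(target_disks_sufficient([36, 36], [72, 18]))
```

The first call reports that the target is sufficient, because both critical disks fit on larger or equal disks. The second reports that it is not, because only one target disk can hold a 36 GB critical disk.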
Caution: Do not depend on ASR to back up and recover user data files stored on the boot and system volumes. In addition, because your system volume is formatted during the ASR recovery process, any user files or directories located on those volumes are lost.
You normally access the ASR state file (Asr.sif) through a local floppy disk drive. If the computer does not have a floppy disk drive, or you want to perform an ASR restore over a network or remotely, you can use a Remote Installation Services (RIS) server to fully automate the ASR process. RIS uses Pre-boot eXecution Environment (PXE) technology to enable client computers without an operating system to boot remotely to a RIS server that performs installation of a supported operating system over a TCP/IP network connection. Consequently, the remote installation client computer must have a PXE-enabled network adapter.
For more information about using RIS to perform remote installations, see "Designing RIS Installations" in Automating and Customizing Installations of this kit.
Note: ASR behaves differently from the Emergency Repair Disk feature in Windows 2000 Server, which ASR replaces. Emergency Repair Disk replaces missing or corrupt system files without formatting drives or reconfiguring storage. ASR, by contrast, always formats the boot volume and might format the system volume.
Guidelines for Using ASR
To use ASR successfully in your disaster recovery plan, follow these guidelines:
Run ASR backup regularly, preferably by using automatic settings.
Plan for making the required resources, including tape backup drives and removable and hard disks, available for ASR recovery.
Perform any needed file system conversions before running your first ASR backup.
Plan for conditions that might prevent a fully successful ASR restore. Under the following conditions, ASR might not be able to restore all disk configurations:
If a critical volume is not accessible during an ASR restore, the restore will fail.
Noncritical disks that are a part of the ASR backup are not restored if they are not found during the ASR restore, but the balance of the restore will complete successfully. Disk types that might not appear to the restore process include IEEE 1394, USB, or Jaz disks.
Plan to protect the critical files Asr.sif and Asrpnp.sif generated by Backup and copied to your ASR floppy disk. If the ASR floppy disk that contains these files is lost, you can recover the files from the systemroot\Repair folder on the host system. If these files are not accessible on the original host, you can recover them from the ASR backup media by using another system. By storing these files in three locations (the ASR floppy disk, the Repair folder, and the ASR backup media), you have three levels of protection against their loss.
For more information about ASR, see the Server Management Guide of the Windows Server 2003 Resource Kit (or see the Server Management Guide on the Web at http://www.microsoft.com/reskit) and see "Automated System Recovery (ASR) overview" in Help and Support Center for Windows Server 2003.
Testing Restores
The single most overlooked aspect of disaster recovery planning is testing restores. Receiving confirmation that a backup completed successfully does not guarantee that a backup can be restored. To prepare for recovery and to validate backup data, you should periodically test your backups of mission-critical servers. If a server cannot be taken offline for testing restores, you can instead restore its backup data to a test server. Practicing restore operations allows you to prepare for problems that you might encounter when recovering a complete system after a failure. These problems include the following:
When restoring to a new server with different hardware — for example, if the original server used SCSI storage and the server where you restore the data uses IDE storage — you might need to filter files such as Boot.ini from the restore job. If the restored data replaces the Boot.ini file, the system might not be able to boot following the restore, and the parameters in the restored Boot.ini file would need to be modified.
When restoring to a new server with different hardware, you might need different drivers, such as drivers for the network adapter or display adapter. Make sure that these drivers are available during the restore.
The time required to restore data might exceed the time allotted in your recovery plan, which means you must change your backup schedule, your process, or your equipment to improve the speed of the restoration.
Backup media might be corrupt. Your disaster recovery plan should specify preventive measures (such as redundant backups), and indicate the procedures to follow in the event of corrupt media (such as the location of any redundant backups, or whether out-of-date backups are to be used).
Testing restores can help you plan how to deal with such problems. For example, if your organization requires that a critical file server be returned to full operation within a four-hour window, but your test restore took six hours to complete, you can plan either to adjust your backup schedule or to justify the purchase of faster backup media drives or network components. Without thoroughly testing your restore procedures, you cannot conclusively document recovery procedures or recovery time, both of which are crucial to your organization's disaster recovery plan.
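The comparison between a measured test restore and the required restore window is simple arithmetic, and recording it alongside each test makes the shortfall explicit. The following Python sketch uses the four-hour window and six-hour test restore from the example above; the function name restore_gap is hypothetical and the calculation is an illustration, not a sizing method.

```python
def restore_gap(measured_hours, window_hours):
    """Return (hours over the window, speedup factor needed to fit the window)."""
    if measured_hours <= 0 or window_hours <= 0:
        raise ValueError("durations must be positive")
    overrun = max(0.0, measured_hours - window_hours)
    speedup = max(1.0, measured_hours / window_hours)
    return overrun, speedup


overrun, speedup = restore_gap(measured_hours=6, window_hours=4)
print(f"Over the window by {overrun:g} hours; restore must run {speedup:g}x faster")
```

For the example figures, the restore overruns the window by 2 hours and would need to run 1.5 times faster, whether that improvement comes from a different backup schedule, a different process, or faster equipment.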