Microsoft Windows Server 2003 Deployment Kit: Planning Server Deployments

Microsoft Corporation

Protecting Data from Failure or Disaster

You must take steps to protect your data and ensure its integrity. If your organization spans more than one physical site, for example, you might want to cluster resources across sites in what is known as a geographically dispersed cluster. Geographically dispersed clusters maintain high availability even when communication with another site is disrupted or lost. In addition, all server clusters, whether in a single site or spread across two sites, need a plan for backing up application data regularly and for recovering data in the event of a failure or disaster.

Figure 7.22 shows the process for ensuring data integrity in your server cluster.

Figure 7.22: Protecting Data from Failure or Disaster

Designing a Geographically Dispersed Cluster

Geographically dispersed server clusters ensure that a complete outage at one site does not cause a loss of access to the application being hosted. All nodes hosting an application must exist within the same cluster. Therefore, to provide fault tolerance, a single cluster spans multiple sites.

Important

Windows Server 2003 supports two-site geographically dispersed clusters. However, Microsoft does not provide a software mechanism for replicating application data from one site to another. Instead, Microsoft works with hardware and software vendors to provide a complete solution. For more information about qualified clustering solutions, see the Windows Server Catalog link or the Geographically Dispersed Clusters link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.

A geographically dispersed cluster requires multiple storage arrays, at least one in each site, to ensure that in the event of failure at one site, the remaining site will have local copies of the data. In addition, the nodes of a geographic cluster are connected to storage in such a way that when there is a failure at one site or a failure in communication between sites, the functioning nodes can still connect to storage at their own site. For example, in the simple two-site cluster configuration shown in Figure 7.23, the nodes in Site A are directly connected to the storage in Site A, so they can access data with no dependencies on Site B.

Figure 7.23: Geographically Dispersed Cluster

In Figure 7.23, Nodes A and B are connected to one array in Storage Controller A, while Nodes C and D are connected to another array in Storage Controller B. These storage arrays present a single view of the disks spanning both arrays. In other words, Disks A and B are combined into a single logical device (by using mirroring, either at the controller level or the host level). Logically, the arrays appear to be a single device that can fail over between Nodes A, B, C, and D.

Note

For a job aid to assist you in gathering your requirements for a multisite server cluster and organizing the necessary, preparatory information, see "Planning Checklist for Geographically Dispersed Clusters" (Sdcclu_1.doc) on the Microsoft Windows Server 2003 Deployment Kit companion CD (or see "Planning Checklist for Geographically Dispersed Clusters" on the Web at http://www.microsoft.com/reskit).

Design Requirements for Geographically Dispersed Clusters

Although there are many requirements in setting up a geographically dispersed cluster, at the most fundamental level your design has to meet two requirements:

Both sites must have independent copies of the same data. The goal of a geographically dispersed cluster is that if one site is lost, such as from a natural disaster, the applications can continue running at the other site. For read-only data, the challenge is relatively simple: the data can be cloned, and one instance can run at each site. However, in a typical stateful application, in which data is updated frequently, you must consider how changes made to the data at one site are replicated to the other site.

When there is a failure at one site, the application must restart at the other site. Even if the application data is replicated to all sites, you need to know how the application is restarted at an alternate site when the site that was running the application fails.

Important

The hardware and software configuration of a geographically dispersed cluster must be certified and listed in the Windows Server Catalog. Also, be sure to involve your hardware manufacturer in your design decisions. Frequently, third-party software and drivers are required for geographically dispersed clusters to function correctly. Microsoft Product Support Services might not be aware of how these components interact with Windows Clustering. For more information about hardware and software configurations, see the Windows Server Catalog link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.

Network Requirements for Geographically Dispersed Clusters

The following network requirements pertain to all server clusters, but are particularly important with regard to geographically dispersed clusters:

All nodes must be on the same subnet. The nodes in the cluster can be on different physical networks, but the private and public network connections between cluster nodes must appear as a single, nonrouted Local Area Network (LAN). You can do this by creating virtual LANs (VLANs).

Each VLAN must fail independently of all other cluster networks. As with regular LANs, the purpose of this rule is to avoid a single point of failure.

The roundtrip communication latency between any pair of nodes cannot exceed 500 milliseconds. If communication latency exceeds this limit, the Cluster service assumes that the node has failed and might fail over resources.
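The subnet and latency requirements above lend themselves to a simple pre-deployment check. The following Python sketch is illustrative only: the node names, addresses, subnet, and latency measurements are hypothetical, and the script does not measure anything itself. It only evaluates values that you have collected by other means.

```python
import ipaddress

MAX_ROUNDTRIP_MS = 500  # limit stated in the network requirements above


def same_subnet(node_addresses, subnet):
    """Return True if every node address falls inside the given subnet."""
    network = ipaddress.ip_network(subnet)
    return all(ipaddress.ip_address(addr) in network for addr in node_addresses)


def pairs_over_latency_limit(roundtrip_ms):
    """Return the node pairs whose measured roundtrip exceeds the limit.

    roundtrip_ms maps a (node, node) pair to a latency in milliseconds.
    """
    return sorted(pair for pair, ms in roundtrip_ms.items() if ms > MAX_ROUNDTRIP_MS)


# Hypothetical addresses and measurements, for illustration only.
nodes = {
    "NodeA": "10.1.0.11",
    "NodeB": "10.1.0.12",
    "NodeC": "10.1.0.21",
    "NodeD": "10.1.0.22",
}
measured = {
    ("NodeA", "NodeB"): 1.2,
    ("NodeA", "NodeC"): 38.0,
    ("NodeB", "NodeD"): 612.0,
}

print(same_subnet(nodes.values(), "10.1.0.0/24"))
print(pairs_over_latency_limit(measured))
```

In this made-up data set, all four nodes share one subnet, but the NodeB to NodeD link exceeds the 500-millisecond limit and would need attention before deployment.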

Data Replication in Geographically Dispersed Clusters

Data can be replicated between sites with different techniques, and at different levels in the clustering infrastructure. At the block level (known as disk-device-level replication or disk-device-level mirroring), replication is performed by the storage controllers or by mirroring software. At the file-system level (replication of file system changes), the host software performs the replication. Finally, at the application level, the applications themselves can replicate data. An example of application-level replication is Microsoft SQL Server log shipping.

The method of data replication that is used depends on the requirements of the application and the business needs of the organization that owns the cluster.

Synchronous vs. Asynchronous Replication

When planning geographically dispersed clusters, you need to understand your data consistency needs in different failure and recovery scenarios and work with the solution vendors to meet your requirements. Different geographically dispersed cluster solutions provide different replication and redundancy strategies. Determine the support requirements of your applications with regard to replication. In geographically dispersed server clusters, the type of data replication is just as important as the level at which it occurs. There are two types of data replication: synchronous and asynchronous.

With synchronous replication, an operation that an application performs on a node at one site is not completed until the change has also been made at the other site. Using synchronous, block-level replication as an example, if an application at Site A writes a block of data to a disk mirrored to Site B, the I/O operation will not be completed until the change has been made to both the disk on Site A and the disk on Site B. Because every write must wait for the remote site, synchronous replication adds latency that can slow or otherwise detract from application performance for your users.

With asynchronous replication, a change is made to the data on Site A and is propagated to Site B at some later time. If an application at Site A writes a block of data to a disk mirrored to Site B, the I/O operation is complete as soon as the change is made to the disk at Site A. The replication software then transfers the change to Site B in the background and applies it there. With asynchronous replication, the data at Site B can be out of date with respect to Site A at any point in time.
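The difference between the two replication types can be sketched with a toy model. The following Python example is not how any vendor's product works: the MirroredDisk class and its methods are hypothetical names invented for illustration, and the "disks" are plain dictionaries. It only shows when a write is considered complete in each mode.

```python
from collections import deque


class MirroredDisk:
    """Toy model of one disk at Site A mirrored to Site B (hypothetical)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.site_a = {}
        self.site_b = {}
        self.pending = deque()  # changes not yet applied at Site B

    def write(self, block, data):
        """Complete a write according to the replication mode."""
        self.site_a[block] = data
        if self.synchronous:
            # The I/O does not complete until Site B also has the change.
            self.site_b[block] = data
        else:
            # The I/O completes now; Site B catches up in the background.
            self.pending.append((block, data))

    def replicate_pending(self):
        """Apply queued changes at Site B in the order they were written."""
        while self.pending:
            block, data = self.pending.popleft()
            self.site_b[block] = data


sync_disk = MirroredDisk(synchronous=True)
sync_disk.write(0, "payroll v2")
print(sync_disk.site_a == sync_disk.site_b)    # both sites match after the write

async_disk = MirroredDisk(synchronous=False)
async_disk.write(0, "payroll v2")
print(async_disk.site_a == async_disk.site_b)  # Site B is still out of date
async_disk.replicate_pending()
print(async_disk.site_a == async_disk.site_b)  # Site B has caught up
```

In a real deployment the synchronous path includes a network round trip to the other site, which is where the performance cost described above comes from.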

Different vendors implement asynchronous replication in different ways. Some preserve the order of operations and others do not. This is very important, because if a solution preserves ordering, then the disk at Site B might be out of date, but it will always represent a state that existed at Site A at some point in the past. This means Site B is crash consistent: the data at Site B represents the data that would exist at Site A if Site A had crashed at that point in time. Conversely, if a solution does not preserve ordering, the I/O operations might be applied at Site B in an arbitrary order. In this case, the data set at Site B might never have existed at Site A. Many applications can recover from crash-consistent states; very few can recover from out-of-order I/O operation sequences.

Caution

Geographically dispersed server clusters must never use asynchronous replication unless the order of I/O operations is preserved. If this order is not preserved, the data that is replicated to the second site can appear corrupt to the application and be totally unusable.
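To see why ordering matters, consider the following Python sketch. It models a disk as a dictionary and a workload as a list of writes; the function names and the sample workload are hypothetical. A replica is treated as crash consistent if it equals some state that the source disk actually passed through.

```python
def apply_writes(writes):
    """Apply (block, data) writes in order and return the resulting disk image."""
    disk = {}
    for block, data in writes:
        disk[block] = data
    return disk


def states_seen_at_source(writes):
    """Return every disk image that existed at the source, including the empty one."""
    return [apply_writes(writes[:n]) for n in range(len(writes) + 1)]


def is_crash_consistent(replica, writes):
    """True if the replica matches a state the source disk actually passed through."""
    return replica in states_seen_at_source(writes)


# Hypothetical workload: log the intent, update the data, then log the commit.
site_a_writes = [
    ("log", "begin T1"),
    ("data", "balance=90"),
    ("log", "commit T1"),
]

# Ordering preserved: the first two writes have reached Site B, in order.
ordered_replica = apply_writes(site_a_writes[:2])

# Ordering not preserved: only the last write has reached Site B so far.
unordered_replica = apply_writes([site_a_writes[2]])

print(is_crash_consistent(ordered_replica, site_a_writes))
print(is_crash_consistent(unordered_replica, site_a_writes))
```

The ordered replica is out of date but recoverable, because the application sees an uncommitted transaction and can roll it back. The unordered replica shows a commit record for data that never arrived, a state that never existed at Site A.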

Mirroring and replication solutions are implemented differently, depending on the vendor, the business needs of the organization, and the logistics of the cluster. In general terms, however, for every disk there is a master copy and one or more secondary copies. The master is modified, and then the changes are propagated to the secondary copies.

In the case of disk-device-level replication, the secondary disk might not be visible to the applications. If it is visible, it will be a read-only copy of the device. When there is a failover, a cluster resource typically designates one of the secondary disks to be the new master, and the old master becomes a secondary. In other words, most mirroring solutions are master-secondary, one-way mirror sets. This is usually on a per-disk basis, so some disks might have the master at one site and others might have the master at another site.

Majority Node Set Quorum in Geographically Dispersed Clusters

Although geographically dispersed clusters can use a standard quorum, presenting the quorum as a single logical shared drive between sites can create design issues. Majority node set clusters solve these issues by allowing the quorum to be stored on the local hard disk of each node.

Applications in a geographically dispersed cluster are typically set up to fail over in the same manner as those in a single-site cluster; however, the total failover solution is inherently more complex. The Cluster service provides health monitoring and failure detection of the applications, the nodes, and the communications links, but there are cases where it cannot differentiate between various failure modes.

Consider two identical sites, each having the same number of nodes and running the same software. If a complete failure of all communication (both network and storage fabric) occurs between the sites, neither site can continue without human intervention because neither site has sufficient information about the other site's continuity.

As discussed in "Choosing a Cluster Model" earlier in this chapter, the server cluster architecture requires that a single-quorum resource in the cluster be used as the tiebreaker to avoid split-brain scenarios. Although split-brain scenarios can happen in single-site clusters, they are much more likely to occur in geographically dispersed clusters.

If communication between two sites in a geographically dispersed server cluster were to fail, none of the cluster nodes in either site could determine which of the following is true:

Communication between sites failed and the other site is still alive.

The other site is dead and no longer available to run applications.

This problem is solved when one of the partitions takes control of the quorum resource. However, with majority node set clusters, the process is different from that in a traditional (single-site, single-quorum) server cluster. A traditional server cluster can continue as long as one of the nodes owns the quorum disk, even if that node is the only one of the configured set of nodes that is running. By contrast, a server cluster running with a majority node set quorum resource will start up (or continue running) only if a majority of the nodes configured for the cluster are running, and if all nodes in that majority can communicate with each other.

When designing your server cluster with a majority node set quorum, consider this difference, because it can affect how majority node set clusters behave when the server cluster is partitioned and during split-brain scenarios. This "more than half" quorum requirement makes the number of nodes in your server cluster very important. For more information about majority node set quorum cluster models, see "Model 3: Majority node set server cluster configuration" in Help and Support Center for Windows Server 2003.
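The "more than half" rule can be expressed in a few lines. The following Python sketch uses a hypothetical function name and hypothetical site layouts; it is a way to reason about node counts during design, not a representation of how the Cluster service is implemented.

```python
def has_majority(configured_nodes, communicating_nodes):
    """True if the communicating nodes are more than half of the configured nodes."""
    return 2 * communicating_nodes > configured_nodes


def surviving_sites(nodes_per_site):
    """Return the sites that keep a majority when all links between sites fail.

    nodes_per_site maps a site name to the number of running nodes at that site.
    """
    total = sum(nodes_per_site.values())
    return [site for site, count in nodes_per_site.items() if has_majority(total, count)]


# Hypothetical layouts, for illustration only.
print(surviving_sites({"Site A": 2, "Site B": 2}))  # even split
print(surviving_sites({"Site A": 3, "Site B": 2}))  # uneven split
```

With two nodes at each site, neither partition holds more than half of the four configured nodes, so neither site continues on its own. With three nodes at Site A and two at Site B, only Site A can continue after the sites are partitioned.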

Single Quorum in Geographically Dispersed Clusters

It is possible to use a single quorum disk resource in a geographically dispersed cluster, provided the quorum is replicated across sites, such as through disk mirroring. When a single quorum is used across multiple sites, the quorum data must always use synchronous replication, although application data can be replicated either synchronously or asynchronously (assuming there are no application-specific restrictions). This is because asynchronous replication could leave different quorum data at each site, corrupting the quorum and causing your server cluster to fail.

Backing Up and Restoring Cluster Data

To maintain high availability on your server clusters, it is important to back up server cluster data on a regular basis. The Backup or Restore Wizard in Windows Server 2003 (Ntbackup.exe) has been enhanced to back up and restore the local cluster database, and to restore configurations both locally and to all nodes in a cluster.

Note

In addition to the Backup or Restore Wizard, Ntbackup.exe is also available as a command-line tool. For more information about Ntbackup.exe, in Help and Support Center for Windows Server 2003, click Tools, and then click Command-line reference A–Z.

Although there are some scenarios in which you can restore data without having a backup data set available, your options are limited. Providing true high availability to your users is possible only by backing up your data on a regular basis. For information about what you can do to restore your server cluster without a backup, see the Server Cluster (MSCS) Backup and Recovery Best Practices link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.

A new backup and restore feature in Windows Server 2003 is Automated System Recovery (ASR), which helps you recover a system that will not start. ASR is a two-part recovery option consisting of ASR backup and ASR restore, both of which are accessed through the Advanced mode of the Backup or Restore Wizard. Note that ASR backs up only the partition used by the operating system. You must back up other partitions by using the standard Backup or Restore Wizard.

You can use ASR to recover a cluster node, the quorum disk, and the signatures on any lost shared disks. You can use ASR to restore a Windows Server 2003 installation when:

A node cannot start as the result of damaged or missing Windows Server 2003 system files. ASR verifies and replaces damaged critical system files.

There is a hardware failure, such as disk failure. ASR restores the installation to the state of your most recent ASR backup.

The Cluster service will not start on the local node because the Cluster database is corrupt or missing.

The disk signature has changed on one of the shared disks as a result of a disk replacement or other disk-related issue, causing a shared disk to fail to come online.

For more information about ASR, see "Automated System Recovery (ASR) overview" in Help and Support Center for Windows Server 2003.

For specific information about backing up cluster disk signatures and partition layouts, the cluster quorum, and data on cluster nodes, see "Backing up and restoring server clusters" in Help and Support Center for Windows Server 2003. Details include how to restore the cluster database on a local node, how to restore the contents of a cluster quorum disk for all nodes in a cluster, and how to repair a damaged cluster node by using ASR.

Help and Support Center for Windows Server 2003 also includes:

The specific permissions that a cluster administrator needs in order to perform each backup and restore procedure. For example, to back up the data on cluster nodes or to restore a damaged cluster node using Automated System Recovery, you must be a member of the Administrators group on the local computer, or you must have been delegated the appropriate authority. For the specific access control permissions you need to perform each backup and restore operation, see "Backing up and restoring server clusters" in Help and Support Center for Windows Server 2003.

Ten failover and restore scenarios that describe failure from cluster disk data loss, cluster quorum corruption, cluster disk corruption or failure, and complete cluster failure. Scenarios include symptoms that are common indications of a server cluster failure, and methods for recovering from such failures. For more information, see "Backing up and restoring server clusters" in Help and Support Center for Windows Server 2003.

Important

For information about backing up applications that run on your server cluster, consult the documentation for the application.