Microsoft Windows Server 2003 Deployment Kit [Electronic resources]: Planning Server Deployments - Text Version


Microsoft Corporation


Using IT Procedures to Increase Availability and Scalability


The following sections introduce best practices for optimizing your Windows Server 2003 deployment for high availability and scalability. A well-planned deployment strategy can increase system availability and scalability while reducing the support costs and failure recovery times of a system. Figure 6.3 displays the process for deploying your servers and network infrastructure in a fault-tolerant manner that also provides manageability.


Figure 6.3: Using IT Procedures to Increase Availability and Scalability


To aid in this planning, Microsoft recommends the Microsoft Operations Framework (MOF). MOF is a flexible, open-ended set of guidelines and concepts that you can adapt to your specific operations needs. Adopting MOF practices provides greater organization and contributes to regular communication between your IT department, your end users, and other departments in your company that might be affected. For complete information about MOF and how to implement it in your organization, see the Microsoft Operations Framework (MOF) link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.


Planning and Designing Hardware for High Availability


Fault tolerance means that a system contains no single point of failure that can bring the entire system down. The following sections highlight IT practices for ensuring reliable hardware performance and a fault-tolerant IT infrastructure, including a summary of ways to safeguard your servers for optimal performance. For detailed information about fault-tolerant hardware strategies and highly available system designs, see the Microsoft Solutions Framework link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.


Planning and Designing Fault-Tolerant Hardware Solutions


Effective hardware strategies can improve the availability of a system. These strategies range from adopting commonsense practices to deploying dedicated fault-tolerant equipment.

Using Standardized Hardware


To ensure full compatibility with Windows operating systems, choose hardware from the Windows Server Catalog only. For more information, see the Windows Server Catalog link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.

When selecting your hardware from the Windows Server Catalog, adopt a single hardware standard and apply it as broadly as possible. To do this, pick one type of computer and use the same kinds of components, such as network cards, disk controllers, and graphics cards, on all your computers. Use this computer type for all applications, even if it provides more capacity than some applications need. The only parameters that you should vary are the amount of memory, the number of CPUs, and the hard disk configuration.

Standardizing hardware has the following advantages:



Having only one platform reduces the amount of testing needed.



When testing driver updates or application-software updates, only one test is needed before deploying to all your computers.



With only one system type, fewer spare parts are required.



Because only one type of system must be supported, support personnel require less training.



For help choosing standardized hardware for your file and print servers, see "Designing and Deploying File Servers" and "Designing and Deploying Print Servers" in this book.


Using Spares and Standby Servers


This chapter discusses clustering as a means of providing high availability for your applications and services. However, two alternatives to clustering can also provide flexibility or redundancy in your hardware design: spares and standby systems.

Spares

Keep spare parts on-site, and include spares in any hardware budget. One of the advantages of using a standard configuration is the reduced number of spares that must be kept on-site. If all of the hard drives are of the same type and manufacturer, for example, you can keep fewer drives in stock as spares. This reduces the cost and complexity associated with providing spares.

The number of spares that you need to keep on hand varies according to the configuration and failure conditions that users and operations personnel can tolerate. Another concern is availability of replacement parts. Some parts, such as memory and CPU, are easy to find years later. Other parts, like hard drives, are often difficult to locate after only a few years. For parts that may be hard to find, and where exact matches must be used, plan to buy spares when you buy the equipment. Consider using service companies or contracts with a vendor to delegate the responsibility, or consider keeping one or two of each of the critical components in a central location.

Standby Systems

Consider the possibility of maintaining an entire standby system, possibly even a hot standby to which data is replicated automatically. For file servers, for example, the Windows Server 2003 Distributed File System (DFS) allows you to logically group folders located on different servers by transparently connecting them to one or more hierarchical namespaces. When DFS is combined with File Replication service (FRS), clients can access data even if one of the file servers goes down, because the other servers have identical content. DFS and FRS are discussed in detail in "Designing and Deploying File Servers" in this book.

If the costs of downtime are very high and clustering is not a viable option, you can use standby systems to decrease recovery times. Using standby systems can also be important if failure of the computer can result in high costs, such as lost profits from server downtime or penalties from a Service Level Agreement violation.

A standby system can quickly replace a failed system or, in some cases, act as a source of spare parts. Also, if a system has a catastrophic failure that does not involve the hard drives, it might be possible to move the drives from the failed system to a working system (possibly in combination with using backup media) to restore operations relatively quickly. This scenario does not happen often, but it does happen, in particular with CPU or motherboard component failures. (Note that this transfer of data after a failure is performed automatically in a server cluster.)


One advantage to using standby equipment to recover from an outage is that the failed unit is available for careful after-the-fact diagnosis to determine the cause of the failure. Getting to the root cause of the failure is extremely important in preventing repeated failures.

Standby equipment should be certified and running on a 24-hours-a-day, 7-days-a-week basis, just like the production equipment. If you do not keep the standby equipment operational, you cannot be sure it will be available when you need it.

Using Fault-Tolerant Components


Using fault-tolerant technology improves both availability and performance. The following sections describe some basic fault-tolerant considerations in two key areas of your deployment: storage and network components. In both cases you should also consult hardware vendors for details specific to each product, especially if you are considering deploying server clusters. For more information about storage options and strategies for server clusters, see "Designing and Deploying Server Clusters" in this book.

Storage Strategies

When planning how to store your data, consider the following points:



The type and quantity of information that must be stored. For example, will a particular computer be used to store a large database needing frequent reads and writes?



The cost of the equipment. It does not make sense to spend more money on the storage system than you expect to recover in saved time and data if a failure occurs.



Specific needs for protecting data or making data constantly available. Do you need to prevent data loss, or do you need to make data constantly available? Or are both necessary? For preventing data loss, a RAID arrangement is recommended. For high availability of an application or service, consider multiple disk controllers, a RAID array, or a Windows clustering solution. (Clustering is discussed later in this chapter.)



A good backup and recovery plan is essential. Downtime is inevitable, but a sound and proven backup and recovery plan can minimize the time it takes to restore services to your users. For more information, see "Backing up and recovering data" in Help and Support Center for Windows Server 2003.



Physical memory copying, or memory mirroring, provides fault tolerance through memory replication. Memory-mirroring techniques include having two sets of RAM in one computer, each a mirror of the other, or mirroring the entire system state, which includes RAM, CPU, adapter, and bus states. Memory mirroring must be developed and implemented in conjunction with the original equipment manufacturer (OEM).



For more information about storage strategies, see "Planning for Storage" in this book.


Network Components

The network adapter is a potential single point of failure. Fortunately, the network adapter is, on average, very reliable. However, other components outside the computer can fail, causing the same effect that you would experience with the loss of the network adapter. These include the network cable to the computer; the switch or hub; the router; and the Dynamic Host Configuration Protocol (DHCP), Domain Name System (DNS), and Windows Internet Name Service (WINS) systems. Any one of these components can fail and cause the failure of one or more servers and, potentially, all the servers.

You can contend with such failures through redundancy in your network design. Many components lend themselves to backup or load-sharing strategies. The following list describes redundancy strategies for the network hardware (hub or switch, network adapter, and wiring), the routers, and DNS or WINS.





Note

The following is a list of general planning considerations for all networks. For more information about network considerations for specific cluster types, see "Designing Network Load Balancing" and "Designing and Deploying Server Clusters" in this book.


Network hardware Although hubs, switches, network adapters, and wiring are very reliable, if a service must be guaranteed, it is still important to use redundancy for these components. Consult with the vendors who provide your network hardware and support for recommendations on how to build redundancy into your network. For more information about building redundancy into your network, see "Designing a TCP/IP Network" in Deploying Network Services of this kit.

Routers Routers do not frequently fail, but when they do, entire computer centers can go down. Having redundant routing capability in the computer center is critical. Your router vendor is a recommended source of information about how to protect against router failures.

DHCP For the servers on which you must maintain the highest degree of availability, use fixed IP addresses rather than DHCP. This prevents an outage caused by the failure of the DHCP server, and it can also simplify name resolution by DNS servers that do not support the dynamic address registration that DHCP provides. For more information about DHCP, see "Deploying DHCP" in Deploying Network Services of this kit.


DNS and WINS DNS and WINS infrastructure components are easy to replicate. Both were designed to support replication of their name tables and other information. Make sure that when you use multiple DNS and WINS servers, you place them on different network segments. For information about WINS and DNS servers and replication, see "Deploying DNS" and "Deploying WINS" in Deploying Network Services of this kit.

For information about replication options for WINS servers, see "Configuring WINS replication" in Help and Support Center for Windows Server 2003. For information about replicating DNS zones, see "DNS zone replication in Active Directory" in Help and Support Center for Windows Server 2003.

For more information about network infrastructure, see the Microsoft Systems Architecture link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.

Safeguarding the Physical Environment of the Servers


An important practice for high availability of servers is to maintain high standards for the environment in which the servers must run. The following list contains information to consider if you want to increase the longevity and reliability of your hardware:



Temperature and humidity. Install mission-critical servers in a room set aside for that purpose, where you can carefully control temperature and humidity. Computers perform best at approximately 70 degrees Fahrenheit. In an office, temperature is not normally an issue, but be aware of the effect of a long holiday weekend in the summer with the air conditioning turned off.



Dust or contaminants. Protect servers and other equipment from dust and contaminants where possible, and check for dust periodically. Dust and other contaminants can cause components to short-circuit or overheat, which can cause intermittent failures. Whenever the case of a server is opened for any reason, perform a quick check to determine whether the unit needs cleaning. If so, check all the other units in the area.



Power supplies. Planning for power outages, like any disaster-recovery planning, is best done long before you anticipate outages, and it involves identifying the resources that are most critical to the operation of the company. When possible, provide power from at least two different circuits to the computer room and divide redundant power supplies between the power sources. Ideally, the circuits should originate from two different sources external to the building. Be aware of the maximum amount of power a location can provide. It is possible that a location could have so many servers that there is not sufficient power for any additional servers you might want to install. Consider a backup power supply for use in the event of a power failure in your computer center. It may be necessary to continue providing computer service to other buildings in the area or to areas geographically remote from the computer center. Short outages can be dealt with through uninterruptible power supply (UPS) units. Longer duration outages can be handled using standby generators. Include network equipment, such as routers, when reviewing equipment that requires backup power during an outage.




Maintenance of cables. Prevent physical damage to cables in the computer room by making sure cables are neat and orderly, either with a cable management system or tie wraps. Cables should never be loose in a cabinet, where they can be disconnected by mistake. Make sure all cables are securely attached at both ends where possible, and make sure pull-out, rack-mounted equipment has enough slack in the cables, and that the cables do not bind and are not pinched or scraped. Set up good pathways for redundant sets of cables. If you use multiple sources of power or network communications, try to route the cables into the cabinets from different points. If one cable is severed, the other can continue to function. Do not plug dual power supplies into the same power strip. If possible, use separate power outlets or UPS units (ideally, connected to separate circuits) to avoid a single point of failure.



Security of the computer room. For servers that must maintain high availability, restrict physical access for all but designated individuals. In addition, consider the extent to which you need to restrict physical access to network hardware. The details of how you implement this depend on your physical facilities and your organization's structure and policies. When reviewing the security in place for the computer room, also review your methods for restricting access to remote administration of servers. Make sure that only designated individuals have remote access to your configuration information and your administration tools.




Implementing Software Monitoring and Error-Detection Tools


Constant vigilance of your network and applications is essential for high availability. Software-monitoring tools and techniques allow you to determine the health of your system and identify potential trouble spots before an error occurs.

This section assumes that you have selected software that supports the high availability features you require. Not all software supports features such as redundancy or clustering. For an application that requires 99 percent uptime, this might not matter; an application that requires 99.9 percent or greater availability must support such features. Monitoring tools can reveal performance trends and other indications that a loss of service is imminent before an error actually occurs. If an error does occur, monitoring tools can provide analytic data that administrators can use to prevent the problem from happening again.

Before deploying software, check the application's hardware requirements and consult the documentation or software vendor to be sure the application supports online backup. When your monitoring tools detect a problem or error in an application, online backup allows you to fix the problem with no disruption of service.


Planning Long-Term Monitoring for Trend Analysis


Whenever possible, establish monitoring over the long term so you can carry out trend analysis and capacity planning. The advantages of long-term monitoring include:



Increasing your ability to predict when expansion of the system is necessary.



Helping with troubleshooting for problems such as memory leaks or abnormal disk consumption.



Assisting in identifying strategies for load balancing.



Capacity planning plays an important part in the long-term success of a high availability system. A good capacity plan can limit avoidable failures by anticipating system usage, scalability, and growth requirements. Do not limit capacity planning to items such as disks. Ensure that your plans encompass all parts of a system that could become bottlenecks.
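As a concrete illustration of trend analysis for capacity planning, the sketch below fits a least-squares line to historical utilization samples and projects when the guideline threshold will be crossed. This is a minimal, hypothetical example (the sample data and threshold are illustrative, not from any real server), not a tool described in this kit:

```python
# Hypothetical sketch: project when a monitored resource will reach
# capacity, using a least-squares linear trend over historical samples.
# The sample values and the threshold below are illustrative only.

def project_capacity_day(samples, threshold):
    """Given (day, utilization) samples, fit a linear trend and return
    the day on which utilization is projected to reach the threshold,
    or None if the trend is flat or decreasing."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    ss_xx = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = ss_xy / ss_xx
    if slope <= 0:
        return None          # usage is not growing; no projection
    intercept = mean_y - slope * mean_x
    return (threshold - intercept) / slope

# Example: disk utilization grows 1 percent per day from 50 percent,
# so the 85 percent guideline is projected to be reached on day 35.
history = [(day, 50.0 + day) for day in range(30)]
print(project_capacity_day(history, 85.0))   # 35.0
```

A projection like this is only as good as the monitoring data behind it, which is why the long-term monitoring described above matters.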

Potential bottlenecks for a server include the CPU, memory, disks, and networking components such as cables, routers, switches, and telecommunications carriers.

CPU As a general guideline, provide enough processing power to keep CPU utilization below 75 percent.

Memory It is important to understand physical memory and its relation to caching, virtual memory, and disk paging. The goal is to provide enough memory to avoid excessive paging, because paging can never be as efficient as directly transferring information to or from physical memory.

Disks A general observation for disks is that utilization above 85 percent tends to cause longer seek times, adding to overall response time. As a general guideline, size your disks, and distribute the load between them, to keep utilization below 85 percent.

Networking Create a logical map that includes each type of network component (cabling, routers, switches, and telecommunications carriers) and its location, along with servers and user access points. This type of diagram shows how information travels between a given user and a given server. Along with using the diagram, gather and organize as much information as possible about the capacities of your network components and about the network loads generated in your enterprise.

Domain controller traffic When monitoring network performance, be sure to consider domain controller traffic and other services inherent in your networking infrastructure. The number of domain controllers that you deploy and their location in your topology can affect performance. For more information about domain controllers, see "Planning Domain Controller Capacity" in Designing and Deploying Directory and Security Services of this kit.
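The utilization guidelines above (CPU below 75 percent, disks below 85 percent) can be checked mechanically against monitoring samples. The following sketch is illustrative; the sample figures are invented, and a real deployment would feed it data from Performance Logs and Alerts or a similar source:

```python
# Sketch of the guideline thresholds above; the sample figures are
# illustrative, not measurements from a real server.

GUIDELINES = {"cpu": 75.0, "disk": 85.0}   # percent utilization ceilings

def over_threshold(samples, guidelines=GUIDELINES):
    """Return the resources whose average utilization exceeds the
    guideline, flagging candidates for scaling or load balancing."""
    flagged = {}
    for resource, values in samples.items():
        avg = sum(values) / len(values)
        if avg > guidelines.get(resource, 100.0):
            flagged[resource] = round(avg, 1)
    return flagged

# A CPU averaging 82 percent violates the 75 percent guideline.
print(over_threshold({"cpu": [78, 86, 82], "disk": [60, 70, 65]}))
# {'cpu': 82.0}
```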


Choosing Monitoring Tools


After you deploy your software, establish routine and automated monitoring and error detection for your operating system and applications. If you can detect application and system errors immediately after they occur, you have a better chance of responding before a system shutdown. Monitoring can also alert you if scaling is necessary somewhere in your organization. For example, if one or more servers are operating at capacity some or all of the time, you can decide if you need to add more servers or increase the CPUs of existing servers.

You can use the following tools and methods to monitor your programs. For more information about monitoring, consult "Monitoring and status tools" in Help and Support Center for Windows Server 2003.

When choosing software to run in a server cluster or Network Load Balancing cluster, see "Designing Network Load Balancing" and "Designing and Deploying Server Clusters" in this book. For additional information, see "Choosing applications to run on a server cluster" and "Determining which applications to use with Network Load Balancing" in Help and Support Center for Windows Server 2003.

Windows Management Instrumentation

Windows Management Instrumentation (WMI) helps you manage your network and applications as they become larger and more complex. With WMI, you can monitor, track, and control system events that are related to software applications, hardware components, and networks. WMI includes a uniform scripting application programming interface (API), which defines all managed objects under a common object framework that is based on the Common Information Model (CIM). Scripts use the WMI API to access information from different sources. WMI can submit queries that filter requests for very specific information, and it can subscribe to WMI events based on your particular interests, rather than being limited to events predefined by the original developers. For more information, see "Windows Management Instrumentation overview" in Help and Support Center for Windows Server 2003.
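WMI queries are expressed in WQL, a SQL-like query language. The sketch below assembles WQL text for the standard Win32_Service class; actually submitting the query requires a Windows host, so the connection attempt here is an assumption, shown only behind an import guard for the third-party "wmi" package:

```python
# Minimal sketch: build the WQL text for a WMI query. Submitting it
# requires a Windows host; one common route is the third-party "wmi"
# package, shown here only behind an import guard.

def build_wql(wmi_class, properties=("*",), where=None):
    """Assemble a WQL SELECT statement for the given WMI class."""
    query = "SELECT {} FROM {}".format(", ".join(properties), wmi_class)
    if where:
        query += " WHERE " + where
    return query

# Find services that are set to start automatically but are not running.
query = build_wql("Win32_Service",
                  ("Name", "State", "StartMode"),
                  "StartMode = 'Auto' AND State <> 'Running'")
print(query)

try:
    import wmi                    # third-party, Windows only
    for svc in wmi.WMI().query(query):
        print(svc.Name, svc.State)
except ImportError:
    pass                          # not on Windows; the query text prints above
```

A query like this one is a natural building block for the automated error detection discussed in this section.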

Microsoft Operations Manager 2000

The Microsoft Operations Manager 2000 management packs have a full set of features to help administrators monitor and manage both the events and performance of their IT systems based on the Microsoft Windows 2000 or Windows Server 2003 operating systems. Microsoft Operations Manager 2000 Application Management Pack improves the availability of Windows-based networks and server applications. Microsoft Operations Manager 2000 and Windows Server 2003 are sold separately. For information about Microsoft Operations Manager 2000 and the Application Management Pack, see the Microsoft Operations Manager (MOM) link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources.


Simple Network Management Protocol

Simple Network Management Protocol (SNMP) allows you to capture configuration and status information on systems in your network and have the information sent to a designated computer for event monitoring. For more information, see "SNMP" in Help and Support Center for Windows Server 2003.

Event logs

When you diagnose a system problem, event logs are often the best place to start. By using the event logs in Event Viewer, you can gather important information about hardware, software, and system problems. Windows Server 2003 records this information in the system log, the application log, and the security log. In addition, some system components such as the Cluster service and FRS also record events in a log. For more information about event logs, see "Checking event logs" in Help and Support Center for Windows Server 2003.
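Exported event logs can also be scanned automatically. The sketch below counts error entries in a text export; the CSV column names and the sample events are assumptions for illustration, not a fixed Windows export contract, so adjust them to match your own export format:

```python
# Sketch of scanning an exported event log for error entries. The CSV
# column names and sample rows below are assumptions for illustration;
# adjust them to match your own export format.

import csv
import io

EXPORT = """Level,Source,EventID,Message
Information,Service Control Manager,7036,The Spooler service entered the running state.
Error,Disk,7,The device has a bad block.
Warning,W32Time,36,The time service has not synchronized.
Error,Disk,7,The device has a bad block.
"""

def count_by_level(csv_text, level="Error"):
    """Count events of the given severity in an exported log."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for row in reader if row["Level"] == level)

print(count_by_level(EXPORT))   # 2 error events in the sample export
```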

Service Control Manager

Service Control Manager (SCM), a tool introduced with the release of the Microsoft Windows NT version 4.0 operating system, maintains a database of installed services in the registry. SCM can help provide high availability because you can configure it to restart services automatically after they fail. For more information about SCM, see the topic "Service Control Manager" in the Windows Platform SDK documentation.

Performance Logs and Alerts

Performance Logs and Alerts collects performance data automatically from local or remote computers. You can collect a variety of information on key resources such as CPU, memory, disk space, and the resources needed by the application. When planning your performance logging, determine the information you need and collect it at regular intervals. Be aware, however, that performance sampling consumes CPU and memory resources, and that excessively large performance logs are hard to store and hard to extract useful information from. For more information about automatically collecting performance data, see "Performance Logs and Alerts overview" in Help and Support Center for Windows Server 2003.
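One way to keep performance logs from growing without bound, as cautioned above, is to retain only a sliding window of recent samples. This is a minimal sketch of that idea (the capacity and sample values are illustrative), not a feature of Performance Logs and Alerts itself:

```python
# Sketch of the log-size concern above: keep only a bounded window of
# performance samples so the log stays easy to store and to analyze.
# The capacity and sample values are illustrative.

from collections import deque

class BoundedSampleLog:
    """Retain at most `capacity` recent samples per counter."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = {}

    def record(self, counter, value):
        window = self.samples.setdefault(counter,
                                         deque(maxlen=self.capacity))
        window.append(value)

    def recent_average(self, counter):
        window = self.samples[counter]
        return sum(window) / len(window)

log = BoundedSampleLog(capacity=3)
for v in (40, 50, 60, 90):          # the oldest sample (40) is evicted
    log.record("cpu_percent", v)
print(log.recent_average("cpu_percent"))   # (50 + 60 + 90) / 3
```

In practice you would trade window size against how far back your trend analysis needs to see; long-term trends are better served by periodically archiving downsampled data than by keeping every raw sample.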

Shutdown Event Tracker

You can document the reasons for shutdowns and save the information in a standard format by using Shutdown Event Tracker. You can use codes to categorize the major and minor reasons for each shutdown and record a comment for the shutdown. For more information, see "Shutdown Event Tracker overview" in Help and Support Center for Windows Server 2003.



Planning for Unreliable Applications


An unreliable application is an application, usually proprietary to an organization, that your business cannot do without, but that does not meet high standards for reliability. There are two basic approaches you can take if you must work with such an application:



Remove the unreliable applications from the servers that are most critical to your enterprise. If an application is known to be unreliable, take steps to isolate it, and do not run the application on a mission-critical server.



Provide sufficient monitoring, and use automatic restarting options where appropriate. Sufficient monitoring requires taking snapshots of important system performance measurements at regular intervals. You can set up automatic restarting of an application or service by using the Services snap-in. For more information about services, see "Services overview" in Help and Support Center for Windows Server 2003.
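The restart-with-a-limit pattern behind those automatic restarting options can be sketched as follows. In production this role belongs to the Services snap-in and the Service Control Manager; this loop is only a hedged illustration of the pattern, with an invented command for demonstration:

```python
# Hedged sketch of the automatic-restart idea: in production this role
# belongs to the Services snap-in / Service Control Manager. This loop
# only illustrates the restart-with-a-limit pattern.

import subprocess
import sys

def supervise(cmd, max_restarts):
    """Run cmd; restart it on failure, up to max_restarts times.
    Returns (final_exit_code, total_attempts)."""
    attempts = 0
    while True:
        attempts += 1
        code = subprocess.run(cmd).returncode
        if code == 0 or attempts > max_restarts:
            return code, attempts

# A command that always fails is retried twice, then given up on.
code, attempts = supervise([sys.executable, "-c", "raise SystemExit(1)"],
                           max_restarts=2)
print(code, attempts)   # 1 3
```

Capping the number of restarts matters: an application that fails immediately on every start would otherwise restart in a tight loop, masking the root cause that the surrounding sections urge you to find.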




Avoiding Operational Pitfalls


The following sections describe operational practices that can limit the availability of applications and computers, in both clustered and nonclustered environments.

Supporting multiple versions of the operating system, service packs, and out-of-date applications

Support of a highly available system becomes much more difficult when multiple combinations of different versions of software (and hardware) are used together in one system or in systems that interact on the network. Older software, protocols, and drivers (and the associated hardware) become impractical when they do not support new technologies.

Set aside resources and time for planning, testing, and installing new operating systems, applications, and (where appropriate) hardware. When planning software upgrades, work with users to identify the features they require. Provide training to ease users through software transitions. In your budget for software and support, provide funds for upgrading applications and operating systems in the future.

Installing incompatible hardware

Maintain and follow a hardware standard for new systems, spare parts, and replacement parts.

Failing to plan for future capacity requirements

Capacity planning is critical to the success of highly available systems. Study and monitor your system during peak loads to understand how much extra capacity currently exists in the system.

Performing outdated procedures

Make sure you remove any outdated procedures from operation and support schedules when a root system problem is fixed. For example, when software is replaced or upgraded, certain procedures might become unnecessary or no longer be valid. Pay special attention to procedures that may have become routine. Be sure that all procedures are necessary and not simply temporary fixes for issues for which the root cause has not been found.


Failing to monitor the system

If you do not use adequate monitoring, you might not have the ability to catch problems before they become critical and cause system failures. Without monitoring, an application or server failure may be the only notification you receive of a problem.

Failing to determine the nature of the problem before reacting

If the operations staff is not trained and directed to analyze problems carefully before reacting, your personnel can spend large amounts of time responding inappropriately to a problem. They also might not use monitoring tools effectively in the crucial time between the first signs of a problem and an actual failure.

Treating symptoms instead of root cause

Symptom treatment is an effective strategy for restoring service when an unexpected failure occurs or when performing short-term preventative maintenance. However, symptom treatments that are added to standard operating procedures can become unmanageable. Support personnel can be overwhelmed with symptom treatment and might not be able to react properly to new failures.

Stopping and restarting to end error conditions

Stopping and restarting a computer may be necessary at times. However, if this process temporarily fixes a problem but leaves the root cause untouched, it can create more problems than it solves.
