Defining Availability and Scalability Goals
Defining availability and scalability goals is the first step toward ensuring that your efforts are focused on the elements of the system that matter the most to your organization. Figure 6.2 shows how to quantify and determine your availability goals.

Figure 6.2: Defining Availability and Scalability Goals
Availability goals allow you to accomplish the following tasks:
Design, operate, and evaluate your systems in relationship to a consistent set of priorities, and place new requests or problems in context.
Keep efforts focused where they are needed. Without clear goals, efforts can become uncoordinated, or resources can be spread so thin that none of the organization's most important needs are met.
Limit costs. You can direct expenditures toward the areas where they make the most difference.
Recognize when tradeoffs must be made, and make them in appropriate ways.
Clarify areas where one set of goals might conflict with another, and avoid making plans that require a system to handle two conflicting goals simultaneously.
Provide a way for operators and support staff to prioritize unexpected problems when they arise by referring to the availability goals for that component or service.
Quantifying Availability and Scalability for Your Organization
Your goal in quantifying availability is to compare the costs of your current IT environment, including the actual costs of outages, with the cost of implementing high availability solutions. The cost of these solutions includes training for your staff as well as facilities costs, such as new hardware. After you have calculated the costs, IT managers can use these numbers to make business decisions, not just technical decisions, about your high availability solution. For information about monitoring tools that can help you measure the availability of your services and systems, see "Implementing Software Monitoring and Error-Detection Tools" later in this chapter.
Scalability is more difficult to quantify because it is based on future needs and therefore requires a certain amount of estimation and prediction. Remember, though, that scalability is tied to availability: if your system cannot grow to meet increased demand, certain services will become less available to your users.
Determining Availability Requirements
Availability can be expressed numerically as the percentage of the time that a service is available for use. The exact level of availability must be determined in the context of the service and the organization that uses the service. Table 6.1 displays common availability levels that many organizations try to achieve. The following formula is used to calculate these levels:
Percentage of availability = ((total elapsed time - sum of downtime)/total elapsed time) x 100
Availability | Yearly Downtime |
---|---|
99.999% | 5 minutes |
99.99% | 53 minutes |
99.9% | 8 hours, 45 minutes |
Availability requirements can vary depending on the server role. Your users can probably continue to work if a print server is down, for example, but if a server hosting a mission-critical database fails, your business might feel the effects immediately.
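As a quick check on the availability formula and the table values, the following Python sketch computes an availability percentage from elapsed time and downtime, and converts an availability level back into a yearly downtime budget. The function names and the 365-day year are assumptions made for illustration; they do not come from any Windows tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def availability_pct(total_elapsed, downtime):
    """Return (total elapsed time - downtime) / total elapsed time, as a percentage."""
    return (total_elapsed - downtime) / total_elapsed * 100


def yearly_downtime_minutes(availability):
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability / 100)


# Reproduce the three availability levels listed in the table.
for level in (99.999, 99.99, 99.9):
    print(f"{level}% allows about {yearly_downtime_minutes(level):.1f} minutes of downtime per year")
```

The results are close to the rounded figures in the table: roughly 5 minutes, 53 minutes, and 526 minutes (a little under 8 hours, 46 minutes) per year.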
Determining Reliability Requirements
Reliability is related to availability, and it is generally measured by computing the time between failures. Mean time between failures (MTBF) is calculated by using the following equation:
MTBF = (total elapsed time - sum of downtime)/number of failures
A related measurement is mean time to repair (MTTR), which is the average amount of time that it takes to bring an IT service or component back to full functionality after a failure.
A system is more reliable if it is fault tolerant. Fault tolerance is the ability of a system to continue functioning when part of the system fails. This is achieved by designing the system with a high degree of hardware redundancy. If any single component fails, the redundant component takes its place with no appreciable downtime. For more information about fault-tolerant components, see "Planning and Designing Fault-Tolerant Hardware Solutions" later in this chapter.
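The two reliability measurements can be computed from a simple outage log. The sketch below is illustrative only: the function names and the sample outage durations are hypothetical, and it assumes that each recorded outage counts as one failure and that its duration is the repair time.

```python
def mtbf(total_elapsed, downtimes):
    """MTBF = (total elapsed time - sum of downtime) / number of failures."""
    if not downtimes:
        raise ValueError("MTBF is undefined when no failures were recorded")
    return (total_elapsed - sum(downtimes)) / len(downtimes)


def mttr(downtimes):
    """MTTR = average time taken to restore service after a failure."""
    if not downtimes:
        raise ValueError("MTTR is undefined when no failures were recorded")
    return sum(downtimes) / len(downtimes)


# Hypothetical log: three outages, in hours, over a 30-day period.
outage_hours = [2.0, 0.5, 1.5]
period_hours = 30 * 24

print(f"MTBF: {mtbf(period_hours, outage_hours):.1f} hours")
print(f"MTTR: {mttr(outage_hours):.2f} hours")
```

Under these definitions, MTBF / (MTBF + MTTR) equals the availability fraction given by the availability formula, which is one way to see how reliability and availability are related.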
Determining Scalability Requirements
You need to consider scalability now to provide your organization a certain amount of flexibility in the future. If you believe your hardware budget will be sufficient, you can plan to purchase hardware at regular intervals to add to your existing deployment. The amount of hardware you purchase depends on the exact increase in demand. If you have budget limitations, purchase servers that you can scale up later by adding RAM or CPUs to meet a rise in users or client requests.
Looking at past growth can help you determine how demand on your IT system might grow. However, because business technology is becoming increasingly complex, and reliance on that technology is growing every year, you must consider other factors as well. If you anticipate growth, realize that some aspects of your deployment may grow at different rates. You might need many more Web servers than print servers, for example, over a certain period of time. For some types of servers, it might be sufficient to add CPU power when network traffic increases, while in other cases, such as with a Network Load Balancing cluster, the most practical scaling solution might be to add more servers.
Recreate your Windows deployment as accurately as possible in a test environment and, either manually or through a simulation program, put as much workload as possible on different areas of your deployment. Observing your system under such circumstances can help you formulate scaling priorities and anticipate where you might need to scale first.
After your system is deployed, software-monitoring tools can alert you when certain components of your system are near or at capacity. Use these tools to monitor performance levels and system capacity so that you know when a scaling solution is needed. For more information about monitoring performance levels, see "Implementing Software Monitoring and Error-Detection Tools" later in this chapter.
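One rough way to turn past growth into a scaling estimate is to assume that demand compounds at a steady yearly rate and then ask when it will exceed the capacity you measured in testing. The sketch below does this in Python. The compound-growth model, the function names, and the sample figures are all assumptions made for illustration; real demand rarely grows this smoothly, so treat the result as a starting point only.

```python
def annual_growth_rate(past_load, current_load, years):
    """Estimate a compound yearly growth rate from two load measurements."""
    return (current_load / past_load) ** (1 / years) - 1


def years_until_capacity(current_load, capacity, growth_rate):
    """Return the number of whole years until projected load first exceeds capacity.

    Returns None if the load is flat or shrinking, because capacity is never exceeded.
    """
    if growth_rate <= 0:
        return None
    years = 0
    load = current_load
    while load <= capacity:
        load *= 1 + growth_rate
        years += 1
    return years


# Hypothetical figures: peak requests per second two years ago and today,
# and the peak capacity observed in a test environment.
rate = annual_growth_rate(past_load=1000, current_load=1440, years=2)
print(f"Estimated growth: {rate:.0%} per year")
print(f"Capacity exceeded in about {years_until_capacity(1440, 2500, rate)} years")
```

Running the same calculation separately for each server role (Web servers, print servers, and so on) reflects the point that different parts of a deployment can grow at different rates.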
Analyzing Risk
When planning a highly available Windows Server 2003 environment, consider all available alternatives and measure the risk of failure for each alternative. Begin with your current organization and then implement design changes that increase reliability to varying degrees. Evaluate the costs of each alternative against its risk factors and the impact of downtime to your organization.
Often, achieving a certain level of availability can be relatively inexpensive, but to go from 98 percent availability to 99 percent, for example, or from 99.9 percent availability to 99.99 percent, can be very costly. This is because bringing your organization to the next level of availability might entail a combination of new or costly hardware solutions, additional staff, and support staff for non-peak hours. As you determine how important it is to maintain productivity in your IT environment, consider whether those added days, hours, and minutes of availability are worth the price.
Every operations center needs a risk management plan. When assessing risks in your proposed Windows Server 2003 deployment, remember that network and server failures can cause considerable loss to businesses. After you evaluate risks versus costs, and after you design and deploy your system, your IT staff needs sound guidelines and plans of action in case a failure in the system does occur.
For more information about designing a risk management plan for your Windows Server 2003 deployment, see the Microsoft Operations Framework (MOF) link on the Web Resources page at http://www.microsoft.com/windows/reskits/webresources. The MOF Web page provides complete information about creating and maintaining a flexible risk management plan, with emphasis on change management and physical environment management, in addition to staffing and team role recommendations.
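The comparison of alternatives can be sketched as a small calculation: for each alternative, add the yearly cost of the solution to the expected yearly cost of the downtime that remains. The Python example below is a minimal sketch under stated assumptions. The alternatives, their availability levels, their costs, and the flat cost per hour of outage are all hypothetical figures, and a real analysis would also weigh risks that are hard to price.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year


@dataclass
class Alternative:
    name: str
    availability: float  # expected availability, as a percentage
    annual_cost: float   # hardware, staffing, and training (hypothetical)


def expected_outage_cost(alt, cost_per_hour):
    """Expected yearly cost of the downtime that the alternative still allows."""
    downtime_hours = HOURS_PER_YEAR * (1 - alt.availability / 100)
    return downtime_hours * cost_per_hour


def total_annual_cost(alt, cost_per_hour):
    """Solution cost plus expected outage cost for one year."""
    return alt.annual_cost + expected_outage_cost(alt, cost_per_hour)


# Hypothetical alternatives and a hypothetical outage cost of 1,000 per hour.
alternatives = [
    Alternative("current environment", 99.0, 0),
    Alternative("redundant hardware", 99.9, 40_000),
    Alternative("redundant hardware plus off-hours staff", 99.99, 250_000),
]
cost_per_hour = 1_000

for alt in sorted(alternatives, key=lambda a: total_annual_cost(a, cost_per_hour)):
    print(f"{alt.name}: {total_annual_cost(alt, cost_per_hour):,.0f} per year")
```

With these sample numbers the middle alternative has the lowest total cost, which illustrates the point that each further step up in availability can cost more than the downtime it prevents.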
Developing Availability and Scalability Goals
Begin establishing goals by reviewing information that is readily available within your organization. Existing Service Level Agreements, for example, define the availability goals for specific IT services or systems. Gather information from those individuals and groups who are most directly affected, such as the users or departments that depend on the services and the people who make decisions about IT staffing.
The following questions provide a starting point for developing a list of availability goals. These goals, and the factors that influence them, vary from organization to organization. By identifying the goals appropriate to your situation, you can clarify your priorities as you work to increase system availability and reliability.
Organization's Central Purposes
These fundamental questions will help you prioritize the applications and services that are most important to your organization and the extent to which you rely on your IT infrastructure for certain key tasks.
What are the organization's central purposes?
What must the organization accomplish to survive and flourish?
Details on Record That Help Define Availability Requirements
The questions in this section can help you quantify your availability needs, which is the first step in addressing those needs.
If your organization has attempted to evaluate the need for high availability in the past, do you have existing documents that already outline availability goals?
Do you have current or previous Service Level Agreements, Operating Level Agreements, or similar agreements that define service levels?
Have you defined acceptable and unacceptable service levels?
Do you have data about the cost of outages or the effect of service delays or outages (for example, information about the cost of an outage at 9 A.M. versus the cost of an outage at 9 P.M.)?
Do you have any data from groups that practice incident management, problem management, availability management, or similar disciplines?
Users of IT Services
It is important to define the needs of your users to provide them with the availability they need to do their work. There is often a tradeoff between providing high availability and paying the cost of hardware, training, and support. Categorizing your users can make these kinds of business decisions easier.
Who are the end users? What groups or categories do they fall into? What expertise levels do they have?
How important is each user group or category to the organization's central goals?
Requirements and Requests of End Users
These questions help pinpoint the needs of your users. You can more easily customize your high availability solutions and anticipate scalability issues if you know exactly what your users need.
Among the tasks that users commonly perform, which are the most important to the organization's central purposes?
When end users try to accomplish the most important tasks, what do they expect to see on their screens (or access through some other device)? Described another way, what data (or other resources) do users need to access, and what applications or services do they need when working with that data?
For the users and tasks most important to the organization, what defines a satisfactory level of service?
Requirements for User Accounts, Networks, or Similar Types of Infrastructure
It is important to know about supporting services — even services that you do not control — when evaluating availability needs and defining availability goals. Your system will be only as fault tolerant as the systems that support it.
What types of network infrastructure and directory services are required so that users can accomplish the tasks that you have identified as requirements for end users? In other words, what types of behind-the-scenes services do users require?
For these behind-the-scenes services, what defines the difference between satisfactory and unsatisfactory results for the organization?
Time Requirements and Variations
Keeping support staff on-site to maintain a system can be expensive. Costs can be minimized if support personnel are on-site only during critical periods. Similarly, knowing when workload is highest can help you anticipate when availability is most important, and possibly when a failure is likely to occur.
Are services needed on a 24-hours-a-day, 7-days-a-week basis, or on some other schedule (such as 9 A.M. to 5 P.M. on weekdays)?
What are the normal variations in load over time?
What increments of downtime are significant (for example, five seconds, five minutes, an hour) during peak and nonpeak hours?
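If a service is needed only on a fixed schedule, it can be useful to count only the downtime that falls inside that schedule. The Python sketch below measures how many minutes of an outage overlap a service window of 9 A.M. to 5 P.M. on weekdays. The schedule constants, the function name, and the sample outage times are assumptions made for illustration.

```python
from datetime import datetime, time, timedelta

# Assumed service schedule: 9 A.M. to 5 P.M., Monday through Friday.
SERVICE_START = time(9, 0)
SERVICE_END = time(17, 0)


def service_window_downtime(start, end):
    """Return the minutes of the outage [start, end) that fall inside service hours."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday is 0, Friday is 4
            window_start = datetime.combine(day, SERVICE_START)
            window_end = datetime.combine(day, SERVICE_END)
            overlap = min(end, window_end) - max(start, window_start)
            if overlap > timedelta():
                total += overlap
        day += timedelta(days=1)
    return total.total_seconds() / 60


# Hypothetical outage that starts on a Monday afternoon and runs past closing time.
outage_start = datetime(2024, 1, 8, 16, 0)
outage_end = datetime(2024, 1, 8, 18, 30)
print(service_window_downtime(outage_start, outage_end), "minutes during service hours")
```

In this example only the first hour of the two-and-a-half-hour outage counts against the schedule. Feeding the result into the availability formula, with the total scheduled service time as the elapsed time, gives an availability figure for the hours that matter to users.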