The Business Case for Storage Networks [Electronic resources]

Bill Williams






Storage Technology Primer


This section recaps the various storage technologies previously touched on and places them in a matrix before moving on to the actual implementation discussion.

At the disk level, there are four connectivity options:

Direct-attached storage (DAS) configured in a one-to-one relationship with the host

Network-attached storage (NAS) residing on an IP-based network

Storage Area Network (SAN)-attached storage shared between hosts via a private Fibre Channel network

SAN-attached storage shared between hosts over a routable IP network


As noted earlier, in addition to being expensive, inefficient, and difficult to scale, DAS presents multiple single points of failure. NAS allows storage to be accessed remotely and utilizes capacity on the IP network infrastructure. Fibre Channel (FC) SANs allow sharing of storage resources over a private Fibre Channel network, increasing allocation efficiency. IP SANs allow for low-cost networked storage as well as the extension of SANs over long distances. Consequently, both FC and IP SANs offer significant advantages over DAS and NAS.

A Fibre Channel SAN infrastructure can consist of small, independent SAN islands built on fixed Fibre Channel switches (16- and 32-port switches). A SAN infrastructure can also be built by joining multiple larger switches together into one large fabric using interswitch links (ISLs). Virtualization, discussed in Chapter 5, "Maximizing Storage Investments," simplifies the management of storage devices by abstracting the devices themselves to increase operational efficiencies. Virtualization also reduces capital expenditures by increasing utilization rates. Software products in this category include MonoSphere Storage Manager, VERITAS Storage Foundation for Networks, and IBM SAN Volume Controller.

Table 4-1 highlights the solution types and their application in the enterprise.

Table 4-1. Storage Solutions Matrix

Business Need                 | Solution Format    | Solution Technology                    | Solution Type
Data storage                  | Hardware           | DAS                                    | Expensive disk
Shared data storage           | Hardware           | FC SANs                                | Small fabric, fixed Fibre Channel switch
Shared data storage           | Hardware           | FC SANs                                | Large fabric, modular Fibre Channel switch
Shared, low-cost data storage | Hardware           | NAS                                    | Inexpensive, redundant, shared disk
Shared, low-cost data storage | Hardware, Protocol | iSCSI storage network                  | Small fixed device, blade, multiprotocol modular switch
SAN Extension                 | Hardware, Protocol | CWDM, DWDM, SONET/SDH, and dark fiber  | Optical, long-distance carry
SAN Extension                 | Hardware, Protocol | iSCSI storage network                  | Blade, multiprotocol modular switch between hosts and SCSI storage
SAN Extension                 | Hardware, Protocol | FCIP storage network                   | Blade, multiprotocol modular switch between Fibre Channel devices
SAN Extension                 | Hardware, Protocol | iFCP storage network                   | Gateway device between Fibre Channel devices
SRM                           | Software           | Resource and asset management          | Agent devices on hosts
SAN Management                | Software           | Device management                      | Centralized device administration and management
Virtualization                | Software           | Virtualization of resources            | Abstraction of resources to increase utilization

Given the array of storage networking solutions on the market designed to solve the inadequacies of DAS, it is critical to use a business case methodology to evaluate both the financial impact of each solution and how well the solution fits the business need. Consequently, the application of storage networking technology must be framed within a discussion of the total cost of ownership (TCO) of the solution. In other words, the costs associated with the solution must match the business need; the most appropriate way of ensuring that they do is with a tiered storage strategy.


TCO, Tiered Storage, and Capacity Planning


Calculating TCO is covered in Chapter 2; however, it is important to take that discussion to the next level. Discussions of TCO eventually lead to the discussion of a tiered storage architecture as part of an Information Lifecycle Management (ILM) framework. In an ILM framework, processes and procedures dictate the movement of information through storage tiers as its criticality and frequency of access decrease over time.

Regulations from the U.S. Securities and Exchange Commission (SEC) and new requirements for compliance with the Sarbanes-Oxley Act and the Health Insurance Portability and Accountability Act (HIPAA) have lengthened retention times for many types of data to seven years or more. It is not cost-effective to store all types of data in the same storage format for such long periods of time, because the criticality of information typically decreases over time. This is not to say that the information becomes worthless as it ages (if a certain record is required after a long period of time, the crucial nature of that information increases tenfold), but the frequency of its access decreases, and therefore the nature of the storage solution should change as well. This is a simple cost-benefit analysis. Even much of the information that is considered mission-critical changes little after it is written and might be accessed infrequently, if at all. Abstraction or virtualization of the storage through software can allow an application to transparently access the information even after it is moved to a different tier.


Information Lifecycle Management


ILM is not a new concept; processes similar to those now being introduced as part of an ILM framework have been used by businesses for years to manage the storage of information. As requirements for retention change and the frequency of access declines, the information moves through a tiered infrastructure. In this manner, the TCO is reduced because the hardware solutions at each tier have different cost structures. Ultimately, the information is archived to tape for long-term, offline storage, converted to some type of flat file format for access by any application, or deleted entirely, as appropriate. Figure 4-2 outlines a generic tiered ILM infrastructure.


Figure 4-2. Tiered Storage and Information Lifecycle Management

A typical Tier 1 storage platform can be built with highly available, redundant disks, whereas a Tier 2 platform might be the same disks configured in a non-redundant fashion. Tier 3 might be configured with redundant, inexpensive disks such as Serial Advanced Technology Attachment (SATA) drives for online storage of rarely accessed information.
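
The tier-assignment logic implied by this model can be sketched in a few lines of code. The following Python fragment is a hypothetical illustration only; the tier descriptions echo the ones above, but the age and access-frequency thresholds are assumptions chosen for the example, not figures from the text.

```python
from dataclasses import dataclass

@dataclass
class Record:
    name: str
    days_since_last_access: int
    accesses_per_month: int

def assign_tier(record: Record) -> str:
    """Map a record to a storage tier based on how recently and how often
    it is accessed. Thresholds are illustrative assumptions only."""
    if record.accesses_per_month > 100 or record.days_since_last_access <= 7:
        return "Tier 1: redundant, highly available disk"
    if record.days_since_last_access <= 90:
        return "Tier 2: same class of disk, non-redundant"
    if record.days_since_last_access <= 365:
        return "Tier 3: redundant SATA for rarely accessed data"
    return "Archive: tape or flat-file export"

# Example: a report untouched for six months lands on Tier 3.
print(assign_tier(Record("q2_report.pdf", days_since_last_access=180, accesses_per_month=1)))
```

In practice, the thresholds would be driven by the retention and service-level requirements of each application rather than hard-coded values.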

An accompanying strategy for long-term, nearline storage is to modify the data from its original format to release the data from its application-specific requirements. Under this procedure, the contents of the original data records would be scraped into a different file format, such as Hypertext Markup Language (HTML), Extensible Markup Language (XML), or plain text. In the long term, as backup applications expire or are decommissioned, the means to access the data from tape might no longer be available. Converting the contents of the file to a standardized format (without dependence on a vendor-specific solution) enables the data to be more easily retrieved when needed.
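
As a minimal sketch of this scraping step, the Python fragment below serializes a record to plain XML using only the standard library. The record fields and values are hypothetical; the point is simply that the output no longer depends on the original application or backup vendor.

```python
import xml.etree.ElementTree as ET

def record_to_xml(record: dict, root_tag: str = "record") -> bytes:
    """Serialize a flat record (field name -> value) to vendor-neutral XML."""
    root = ET.Element(root_tag)
    for field, value in record.items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="utf-8")

# Hypothetical invoice record pulled from a decommissioned application.
invoice = {"invoice_id": "INV-1001", "customer": "ACME", "amount": "1250.00"}
print(record_to_xml(invoice).decode("utf-8"))
# <record><invoice_id>INV-1001</invoice_id><customer>ACME</customer><amount>1250.00</amount></record>
```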

Just as there are hardware savings with the tiered storage strategy, there are also returns to scale with regard to management of the data as it moves through each phase of its lifecycle. The management requirements for high performance storage and frequently accessed or briskly growing data far outstrip the associated management requirements of offline (tape storage) technologies. Enterprise Resource Planning (ERP) installations and On-Line Transactional Processing (OLTP) environments require rigorous capacity planning and monitoring, whereas records retained just for the purpose of regulatory compliance require little or no hands-on attention.

Software tools designed to move information through its lifecycle (until the data is finally deleted at the end of its useful life) are slowly coming to market. Independent Software Vendors (ISVs) and Original Equipment Manufacturers (OEMs) are working to build lifecycle management functionality into their current frameworks for storage management. Such software products decrease the management overhead associated with moving data between tiers, and consequently lower the TCO throughout the entire lifecycle.

As mentioned previously, the final stage of the information lifecycle is to delete the record or file and reclaim the associated storage. Deletion of the file when the requirements for retention have expired is mandatory to release the costs associated with this storage and to relieve the burden on the storage management team.

An essential component of planning for a storage networking project, after addressing the need for a tiered infrastructure, is planning for performance.


Performance Planning


Critical to supporting a tiered infrastructure and managing information through its lifecycle is the ability to accurately size an application based on its performance traits (such as heavy writes in online transaction processing, heavy reads in data warehousing, or bursts of reads and writes in batch processing). In addition, it is necessary to match the storage subsystem to the functional requirements of the application (high availability, nearline storage, length of retention).

Estimating an application's performance requirements can be as simple as running system tools, such as the system activity reporter (sar), to find I/O rates over a set period of time and then sizing the solution to handle periods of peak throughput based on the block size of the application. Similarly, the same type of data can be collected at the database level with regard to cache hits and misses.
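
The arithmetic behind this sizing step is simple enough to script. The sketch below assumes peak I/O rates have already been pulled from sar or iostat output into a list; the sample values, block size, port speed, and utilization target are illustrative assumptions, not figures from the text.

```python
# Peak I/O operations per second observed per sampling interval
# (for example, extracted from sar/iostat output); values are illustrative.
peak_iops_samples = [4200, 5100, 4800, 6300, 5900]

block_size_bytes = 8 * 1024          # application block size: 8 KB (assumed)
port_speed_gbps = 2.0                # 2-Gbps Fibre Channel port
utilization_target = 0.7             # plan to run ports at 70% of line rate

peak_iops = max(peak_iops_samples)
peak_throughput_gbps = peak_iops * block_size_bytes * 8 / 1e9

# Round up to whole ports using ceiling division.
ports_needed = -(-peak_throughput_gbps // (port_speed_gbps * utilization_target))
print(f"Peak throughput: {peak_throughput_gbps:.2f} Gbps")
print(f"Ports required at 70% utilization: {int(ports_needed)}")
```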

Alternatively, capacity planning and application sizing can be done with proprietary capacity planning tools or with capacity planning frameworks, such as those offered by TeamQuest or BMC Software. If one of these larger frameworks is not already installed, however, it is rarely worth buying and installing one just to size a solution prior to investing in a SAN. Ultimately, as they mature, SRM software and SAN management software will greatly simplify the process of capacity planning for SAN environments.

Using system tools and doing the math required to accurately size and plan for disk capacity and I/O throughput prior to installing a SAN are relatively straightforward processes. The final requirement for performance planning is a holistic understanding of the application environment, which is where corporate knowledge plays a big role. When are the backup windows? When are the periods of peak processing? When was the last time the system was upgraded? This type of data is crucial to solid analysis and cannot be obtained entirely through the use of software. Although this information might not be inherently quantifiable, it can increase the granularity of your quantitative analysis and help you make a better business decision.


Oversubscription


Another aspect of performance planning is oversubscription. Oversubscription in the SAN arena, just as in LAN or WAN networking, is a capacity planning technique that guarantees a minimum rate of throughput to an application or set of applications, while at the same time provisioning for an estimated maximum rate of throughput. In other words, the capacity of the network is oversold under the assumption that not all applications burst and reach maximum throughput at the same time. In this way, provisions are made for maximum bandwidth demand with significantly less infrastructure cost.

The following is a generic example designed to illustrate the concept of oversubscription exclusive of the concept of Quality of Service (QoS). Also note that oversubscription, in the context of host-to-switch connectivity, is often referred to as the fan-in or fan-out ratio.

A small SAN-island backbone (64 2-Gbps ports on a director-class switch) is oversubscribed to five applications on each of six hosts. Each application's demand bursts at 2 Gbps on each channel for a total application demand of 60 Gbps (6 hosts x 5 applications x 2 Gbps), as shown in Figure 4-3.


Figure 4-3. Oversubscription

The aggregate bandwidth supplied over two connections to each host is 48 Gbps full duplex (2 ports x 2 Gbps x 6 hosts = 24 Gbps half duplex, or 48 Gbps full duplex). Therefore, with the configuration shown in Figure 4-3, the 60 Gbps of potential demand against 48 Gbps of supply yields a 5:4 oversubscription ratio.

By providing a minimum committed information rate you are also providing for a maximum information rate based on assumptions about characteristic application performance. Figure 4-3 demonstrates this concept.
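
The ratio reduces to simple arithmetic, as the short sketch below shows. The host, application, and port counts are taken from the Figure 4-3 example; the helper function is a hypothetical convenience, not part of any SAN tool.

```python
from math import gcd

def oversubscription_ratio(demand_gbps: float, supply_gbps: float) -> str:
    """Express demand:supply as a reduced ratio, e.g. 60:48 -> 5:4."""
    d, s = int(demand_gbps), int(supply_gbps)
    g = gcd(d, s)
    return f"{d // g}:{s // g}"

hosts, apps_per_host, burst_gbps = 6, 5, 2                 # from the Figure 4-3 example
ports_per_host, port_gbps = 2, 2

demand = hosts * apps_per_host * burst_gbps                # 60 Gbps aggregate burst demand
supply_half_duplex = hosts * ports_per_host * port_gbps    # 24 Gbps half duplex
supply_full_duplex = supply_half_duplex * 2                # 48 Gbps full duplex

print(oversubscription_ratio(demand, supply_full_duplex))  # 5:4
```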

Note

In the real world, special operating system tools and application scheduling software are required to manage application-level planning for oversubscription. Mapping application requirements to bandwidth restrictions is currently beyond the scope of SAN software and hardware.

The concept of oversubscription also applies when dealing with ISLs. ISLs, as mentioned earlier, are interswitch links or ports on a switch reserved for connections to other switches, and are used for building large fabrics. If the small SAN island in our previous example is expected to scale to two 64-port directors, then a minimum of two ports on each switch (four ports total) are required to link the switches together for half-duplex bandwidth of 4 Gbps. Depending on the amount of maximum overall throughput required now and in the future, it might be necessary to reserve as many as 16 ports for ISLs for greater bandwidth. Figure 4-4 demonstrates the use of ISLs to provide connectivity and throughput for a larger fabric.


Figure 4-4. Oversubscription with ISLs

In this diagram, two director-class switches are linked together with ISLs to provide connectivity to the two disk cabinets on the right. Hosts 1 through 4 are attached through Switch 1 at the top to both Disk Frame 1 and Disk Frame 2 via the ISLs between Switch 1 and Switch 2. Hosts 5 and 6 are likewise attached to both Disk Frame 1 and Disk Frame 2 via the ISLs between Switch 1 and Switch 2. There is a committed information rate of 4 Gbps half duplex over the ISLs.

If the applications in this environment all burst at the same time, there is a bandwidth demand of 12 Gbps, while the ISLs can provide only 4 Gbps. In this particular instance, it is important to note that the 4-Gbps ISL bandwidth is a possible bottleneck for all hosts using the ISLs.
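
A quick congestion check for this scenario can be scripted as shown below. The 12-Gbps figure from the text is assumed here to come from six hosts each bursting at 2 Gbps across the ISLs; that breakdown, and the helper function itself, are illustrative assumptions.

```python
def isl_headroom(host_burst_gbps: list, isl_capacity_gbps: float) -> float:
    """Return spare ISL capacity in Gbps (negative means congestion)
    if every host crossing the ISLs bursts at the same time."""
    return isl_capacity_gbps - sum(host_burst_gbps)

# Hosts whose traffic must cross the ISLs, each bursting at 2 Gbps (illustrative).
crossing_hosts = [2.0] * 6            # 12 Gbps of simultaneous demand
isl_capacity = 4.0                    # 2 ISLs x 2 Gbps, half duplex

headroom = isl_headroom(crossing_hosts, isl_capacity)
print(f"ISL headroom: {headroom} Gbps")   # -8.0 Gbps -> the ISLs are a bottleneck
```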

If the application environments and storage subsystems are appropriately sized, then oversubscription provides the proper bandwidth at the least cost. If not, and if the traffic between the hosts and the disk frames is sustained at greater than 4 Gbps (possibly during a backup window), then there will be congestion on the ISLs.

If significantly greater throughput is required, this environment is undersized and requires a different architecture. The entire backbone could be replaced with an intelligent switch platform, such as a director-class switch with 140 ports (the McData Intrepid 6140) or the Cisco MDS 9509, which is configurable with up to 224 ports.

Bear in mind that increasing the number of ISLs raises the average per port cost of the solution insofar as ISLs are ports consumed solely for interswitch communication and cannot be doled out directly to the host or the storage device.

In summary, oversubscription is one way of cost-effectively allocating bandwidth supply to environments without purchasing dedicated resources to meet anticipated maximum demand. In this manner it is possible to keep the TCO to a minimum and still supply a maximum information rate at times of peak load.

Note

Oversubscription is a term often used in a pejorative sense, but it should be emphasized that oversubscription is not inherently evil. Oversubscription does cause problems when an environment is oversubscribed to the point of causing congestion, and changes in an application that cause unpredicted periods of peak processing can disrupt the modeled service levels built into an oversubscribed architecture. There is nothing inherently wrong with oversubscription, however; it is an industry best practice commonly used to contain costs.

Depending on the application environment, the amount of throughput required, and the amount of oversubscription needed to avoid congestion, either a core topology or a core-edge topology (the two most common SAN topologies) can be implemented. A brief overview of each topology follows before a discussion of the differences between redundancy and resiliency.


Core Topology


The previous examples demonstrated a core topology using a single FC switch at the core of the architecture. Core topology SANs may also use a large SAN fabric built on multiple core switches linked together by ISLs. Both of these scenarios are demonstrated in Figure 4-5.


Figure 4-5. Core Topology

Note

It is important to note that these examples require additional switches for redundancy and separate fabrics for resiliency.

The primary advantage of a core topology is fewer points of management. This architecture works well in an environment with a small number of hosts accessing large amounts of data. With a core topology, the process of designing and provisioning for application performance for a small number of hosts, allocating ISLs, and managing oversubscription is significantly easier than in a core-edge topology.

One significant disadvantage to a core topology is the cost of expansion as capacity needs increase. Adding additional core switches can be expensive, and reallocating ports across the environment can be labor-intensive.


Core-Edge Topology


A core-edge topology, as shown in Figure 4-6, works well in an environment with a large number of hosts accessing small amounts of data. To provide adequate service to all hosts, and to address the challenges of locality and the many fiber runs across the datacenter back to the core, smaller departmental (fixed, or edge) switches are inserted at fixed points in the topology to provide fan-in for many hosts.


Figure 4-6. Core-Edge Topology

As you can imagine, the process of calculating application throughput over multiple edge switches going back to a single core (or a core comprised of a few director-class switches) becomes challenging quite quickly as more hosts are attached to the SAN. ISL throughput becomes crucial as the number of hosts increases.

The advantage of a core-edge topology should be obvious: with many hosts that have less stringent performance requirements, it is possible to build a cost-effective storage network by adding smaller fixed FC switches as capacity requirements dictate. The disadvantage to this topology should also be obvious: the overhead involved with increased points of management is not trivial. The additional cost of increased ISL throughput is also significant.

In the example in Figure 4-6, ISL allocation becomes quite costly, adding to the TCO of this environment. For example, if the required ISL throughput reaches 8 Gbps, then four ISLs would be required at the edge. If the per port cost is $1000, then the TCO for this environment increases by $4000, as these ports cannot be allocated to the hosts using storage, but can be used only for switch-to-switch communication.
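
The cost figure in this example follows directly from the per-port price, as sketched below; the 8-Gbps ISL requirement and the $1000-per-port cost are the assumptions used in the paragraph above, and the helper function is purely illustrative.

```python
import math

def isl_cost(required_isl_gbps: float, port_gbps: float, cost_per_port: float):
    """Number of ISL ports needed on one switch and the added cost,
    given that ISL ports cannot be allocated to hosts or storage."""
    ports = math.ceil(required_isl_gbps / port_gbps)
    return ports, ports * cost_per_port

ports, added_cost = isl_cost(required_isl_gbps=8.0, port_gbps=2.0, cost_per_port=1000.0)
print(f"{ports} ISL ports, adding ${added_cost:,.0f} to the TCO")   # 4 ISL ports, adding $4,000
```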

Other topologies include mesh, star, bus (chain), and collapsed-core. Whereas mesh, star, and bus topologies are rarely used, the collapsed-core topology is used more often. The collapsed-core is essentially a combination of the core and core-edge topologies, with an intelligent switch platform at the core and a suitable mix of 16- and 32-port line cards in the same chassis.


Redundancy and Resiliency


In the discussion of topologies, it is important to understand the nature of redundancy and resiliency in building a SAN architecture as well as the cost differences between these two features.

Previous examples glossed over the topic of redundancy to simplify the concepts of core and core-edge topologies. In these examples, redundant hardware is needed to provide for maximum availability and uptime, while a resilient architecture protects against a single component failure.

To implement full redundancy, there must be at least one additional switch with separate power, separate networking, and multiple paths from the hosts as shown in Figure 4-7. This design also utilizes a separate fabric at each core that prevents failures associated with single-fabric service outages. Multipathing software such as VERITAS DMP or EMC PowerPath provides for multiple paths to the disk at the operating system level, which provides path failover capability at the host level. Obviously, with redundant hardware at each level from the host bus adapter (HBA) to the disk, this solution has the highest price tag. But this solution also provides the best protection against failure and interruption in service, and is a suitable architecture for a Tier 1 infrastructure requiring maximum availability.


Figure 4-7. Redundant Architecture

A resilient architecture, on the other hand, provides for recovery from failure in either path without providing full redundancy at the fabric level. As shown in Figure 4-8, the two core switches are joined together into a single fabric that provides fault recovery, but not full failover capabilities. In this example, a resilient architecture is capable of protecting against a single software or hardware failure.


Figure 4-8. Resilient Architecture

A fully redundant architecture provides maximum protection for the entire solution but at a significantly higher cost than a resilient architecture. The high TCO for such an environment must match the business requirements; otherwise, a less costly solution should be architected. Depending on the revenue associated with a particular environment and the service level agreement (SLA) guiding the delivery of the services residing on the SAN architecture in question, the business needs to perform a cost-benefit analysis and potentially make a choice between the two types of architectures.

As both solutions need to scale to provide added throughput and additional storage for growth, the cost advantages of resiliency (and the cost disadvantages of redundancy) become apparent. Still, for maximum availability, a redundant solution is the best choice. Depending on the business requirements, however, there might not be an opportunity to choose between the two. A highly available ERP environment requires a highly available SAN and likely requires a redundant architecture. A development environment with fewer performance and availability restrictions might find a resilient architecture sufficient in terms of recoverability and scalability.

Thus far, this chapter has covered in detail the concepts behind ILM, as well as the most commonly used topologies and architectures. With the advantages and disadvantages of each solution highlighted, you are now prepared to make an informed decision regarding the implementation steps, the first of which is choosing the right vendor.
