The Move to Storage Networks
The first SAN at this firm was built in April 1999, long before the creation of a dedicated storage support team and long before many firms had even considered using Fibre Channel SAN technology in production environments. Rather than move a small test environment to a SAN infrastructure, the storage services team decided to tackle one of the largest data warehouse environments, which at the time approached 30 TB in size. The resulting data warehouse SAN consisted of 10 small, 16-port fixed switches, 12 of the largest hosts, and approximately 10 external storage arrays.

Behind the decision to build a Fibre Channel SAN was a critical business driver: the need for flexibility. Ease of provisioning large amounts of storage is a fundamental component of operational efficiency, and the ability to provision storage quickly for rapidly growing environments became a prerequisite for increasing the number of terabytes managed by each storage administrator. The data warehouse environment in which the first SAN was implemented was growing quickly, with an estimated 2 TB added every six months. Because the environment used direct-attached Small Computer Systems Interface (SCSI) storage, adding new disks required long outages that could not be scheduled during the day because of system availability requirements. Migrating the data warehouse to a Fibre Channel SAN allowed the team to add storage on an ad hoc basis to meet rampant and often unpredictable growth demands.

An additional driver behind the migration to SAN storage was the need for solid, first-hand exposure to what the team believed would ultimately become the new standard for storage deployments. Although many of the system administrators had already been exposed to Fibre Channel protocols while deploying smaller Fibre Channel Arbitrated Loop (FC-AL) JBOD devices, management believed it was crucial for the rest of the staff to become experts quickly in what would inevitably become a heavily leveraged technology. In particular, senior staff wanted to familiarize themselves with switched Fibre Channel ahead of the mainstream adoption curve so that they could better understand how performance and interoperability issues might affect their environments.
Interoperability of Storage Solutions
Although this firm certainly falls into the early adopter category for Fibre Channel switches, it is interesting to note that the storage services team faced few of the tribulations common to other early adopters, in particular, interoperability issues with products from various HBA and disk vendors. Because the operating systems in use at the time had strong support for Fibre Channel, the system administrators were able to make the jump from FC-AL to switched Fibre Channel with little difficulty. The team's earlier experience with FC-AL technology, coupled with the operating systems' capability to integrate easily with switched Fibre Channel, allowed the group to sidestep many of the problems familiar to users of operating systems with weaker Fibre Channel support.

The cutovers in the FC-AL environments, which already used World Wide Names (WWNs), the method of identifying objects on a SAN fabric, were smoother than the typical parallel SCSI migration to Fibre Channel. In these circumstances, HBA compatibility and interoperability issues were minimized by the early adoption of Fibre Channel.
Networked Storage Implementation
The current SAN implementation has grown considerably since 1999, from 10 16-port fixed switches to more than 100 switches (a mix of 16-, 32-, and 128-port models), and now hosts almost all of the organization's mission-critical data. These switches make up multiple fabrics in redundant configurations, for a total of over 2000 SAN ports. The FC fabrics themselves are location-based, similar to typical IP networks. Each datacenter is divided into multiple raised-floor rooms, and each room hosts its own redundant, dual-fabric SAN. Over time, the storage services team expects to see the fabrics in each datacenter connected via SAN extensions; currently, however, the environments remain separate and distinct in an effort to limit the impact of propagated errors. For now, the architecture remains a collection of tiered fabrics with no distinct core and no distinct edge.
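To make the location-based layout concrete, the following is a minimal sketch, assuming hypothetical datacenter and room names, switch counts, and port sizes; it is not the team's actual inventory or tooling, only an illustration of the per-room, dual-fabric pattern described above.

from dataclasses import dataclass
from typing import Dict, List

# Illustrative sketch only: datacenter and room names, switch counts, and
# port sizes below are assumptions, not the firm's actual inventory.

@dataclass
class Fabric:
    name: str
    switch_port_counts: List[int]   # one entry per switch: 16, 32, or 128 ports

    @property
    def ports(self) -> int:
        return sum(self.switch_port_counts)

@dataclass
class Room:
    """Each raised-floor room hosts its own redundant, dual-fabric SAN."""
    name: str
    fabric_a: Fabric
    fabric_b: Fabric

    @property
    def ports(self) -> int:
        return self.fabric_a.ports + self.fabric_b.ports

# Rooms remain separate and distinct (no inter-room SAN extensions yet),
# which limits the impact of propagated errors.
datacenter: Dict[str, List[Room]] = {
    "dc-east": [
        Room("room-1",
             Fabric("room-1-A", [128, 32, 32, 16]),
             Fabric("room-1-B", [128, 32, 32, 16])),
        Room("room-2",
             Fabric("room-2-A", [128, 16, 16]),
             Fabric("room-2-B", [128, 16, 16])),
    ],
}

total = sum(room.ports for rooms in datacenter.values() for room in rooms)
print(f"Total SAN ports in this (hypothetical) datacenter: {total}")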
Tiered Storage Implementation
To lower costs, the storage services team implemented a three-tiered storage architecture. At the high end are the typical large, high-performance arrays with single or multiple FC fabrics as needed for redundancy. In the middle are the same high-performance arrays or mid-sized arrays configured without redundant switch fabrics. At the low end, Serial ATA (SATA) drive arrays are provided for large data sets that are written once and rarely read. Environments that fall into this category are low-criticality applications with minimal performance requirements or database dumps that are used to quickly recreate databases in the event of data loss or corruption.

The use of approximately 200 TB of SATA drives since the second quarter of 2003 has had a significant financial impact in terms of immediate cost avoidance. Large, high-performance arrays previously utilized for these types of applications were released for use by applications with higher I/O performance requirements. SATA disk drives have an acquisition cost of approximately one-third or less that of typical external array drives, and if used appropriately, management costs are minimized. The lower mean time between failures (MTBF) of SATA drives scares off some decision makers; however, using these devices in a RAID or mirrored configuration and for less I/O-intensive applications should offset the costs associated with a higher failure rate. In addition, data with lower availability requirements can be offline on occasion, making SATA drive-based arrays an ideal choice, as long as the corresponding service level agreements (SLAs) are adjusted accordingly.

Even though the storage services team has built a tiered infrastructure, the team is not currently implementing solutions as part of a broader Information Lifecycle Management (ILM) strategy. For the foreseeable future, the team plans to stick with the current architecture, whereby data with the highest availability requirements is hosted on redundant, dual-fabric SANs with multiple, replicated copies of the data stored on large, high-performance storage arrays. Applications or data sets with more modest availability and uptime requirements are stored on SATA drives. Data whose requirements fall in between is hosted on dual fabrics and large or mid-sized arrays with no replication. The storage services team believes that today the costs of implementing a more rigid ILM framework and managing heterogeneous storage across the enterprise outweigh the benefits. Accordingly, there are currently no plans to utilize network-attached storage (NAS) to build out additional tiers.

Note
There are plans, however, to utilize NAS storage for office-automation applications and user home directories, but those environments are not managed by the storage services team.

The storage services team concedes that client SLAs are currently poorly documented and that the few SLAs that are documented do not explicitly state a client's expected availability in terms of percentages. Although these SLAs are not rigid instruments used to adjust charge-back schedules, they are well-established agreements between parties that document which applications reside on what type of storage and which applications receive priority status.

As far as charge-back is concerned, the storage services team knows approximately how much storage its clients consume; however, the mechanism required to do a complete departmental charge-back has not been fully developed.
The storage services team is currently updating the request and reservation systems and plans to move to a full charge-back system by the end of 2004. To execute this initiative, the team needs access to accurate and up-to-date TCO and utilization data, which is difficult to come by. In the absence of these numbers, however, the storage services team can adjust storage cost estimates based on the storage lease schedule, which sees frames regularly rotating out of the datacenter as leases expire.
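A minimal sketch of how such a charge-back estimate might be computed follows, assuming illustrative per-GB monthly rates for the three storage tiers (for example, derived from the lease schedule) and hypothetical departmental allocations; none of the names or figures come from the firm's actual model.

# Hypothetical per-GB monthly rates by storage tier, for example derived from
# the lease schedule. None of these figures come from the firm's actual model.
TIER_RATES = {
    "tier1_dual_fabric_replicated": 0.050,  # high-end arrays, dual fabric, replicated copies
    "tier2_single_fabric":          0.025,  # high-end or mid-sized arrays, no replication
    "tier3_sata":                   0.010,  # SATA arrays for write-once, rarely read data
}

# Hypothetical departmental allocations, in GB per tier.
ALLOCATIONS = {
    "data_warehouse": {"tier1_dual_fabric_replicated": 200_000, "tier3_sata": 50_000},
    "crm":            {"tier1_dual_fabric_replicated": 40_000, "tier2_single_fabric": 20_000},
}

def monthly_chargeback(allocations, rates):
    """Estimate each department's monthly storage charge from its per-tier usage."""
    return {
        dept: sum(gigabytes * rates[tier] for tier, gigabytes in usage.items())
        for dept, usage in allocations.items()
    }

if __name__ == "__main__":
    for dept, charge in monthly_chargeback(ALLOCATIONS, TIER_RATES).items():
        print(f"{dept}: ${charge:,.2f} per month")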
Storage TCO
The storage services team has an estimate of the overall TCO per MB: it is a floating number that depends primarily on the number and frequency of backups and on administrator efficiency. With six individuals supporting approximately 180 TB each, and with data storage still growing, the team continues to push the limit of TB supported per administrator.

The storage services team uses a home-grown spreadsheet solution to model storage usage and associated costs, but with costs continually fluctuating and the number of storage frames onsite changing weekly, determining an exact TCO number is often a futile exercise. To keep the picture as accurate as possible, new incoming arrays and old outgoing arrays are added to and removed from the spreadsheet as soon as possible, so the spreadsheet remains a valuable tool for capacity planning and analysis. Each individual environment must be analyzed separately to determine its actual TCO per MB, but for the purposes of discussing an enterprise-wide cost model, the generalizations are worthy of a closer look.

The most enlightening part of the analysis is that backups are by far the most expensive component of the storage TCO. Like most TCO calculators, the team's model accounts for the costs of datacenter floor tiles, utilization efficiencies, full-time equivalent (FTE) labor, hardware and software acquisition, and backups. The team also factors in the costs of tape retention for multiple years. Factor in numerous backups of multi-terabyte environments along with retention periods of up to seven years for some data sets, and backup costs become significant.

Note
A TCO that includes the costs of seven-year tape retention is significantly higher than the typical one-year snapshot of TCO. Both models are effective mechanisms for calculating storage TCO; however, a seven-year model is better for estimating the value of the data stored on tape. The seven-year model (and the associated backup costs) also suggests that there is potential return on investment (ROI) to be achieved by adjusting tape retention policies and by archiving less critical data. Ultimately, there is no one right way to model storage TCO; either method is appropriate, as long as it provides value to the firm.

In summary, backups represent over 80 percent of the cost structure for storage in the IT hosting environment. Hardware and software acquisition costs are only 10 percent of the TCO, and FTE time is only 3 percent. Approximately 1 percent of the total cost is attributed to datacenter and facilities expenses (power, cooling, and floor space). The TCO cost components are illustrated in Figure 7-1.
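As a rough worked example of how these percentage shares translate into dollars, the following sketch splits an assumed annual storage budget across the reported components; the budget figure is hypothetical, "over 80 percent" is treated as 80 percent, and the unattributed remainder is simply lumped together.

# Component shares as reported for storage TCO in the IT hosting environment;
# "over 80 percent" for backups is treated as 0.80 for simplicity.
TCO_SHARES = {
    "backups_and_tape_retention": 0.80,
    "hardware_and_software":      0.10,
    "fte_labor":                  0.03,
    "datacenter_and_facilities":  0.01,  # power, cooling, and floor space
}
# The source does not itemize the remainder; it is lumped together here.
TCO_SHARES["other"] = round(1.0 - sum(TCO_SHARES.values()), 2)

def breakdown(annual_budget: float) -> dict:
    """Split a hypothetical annual storage budget across the TCO components."""
    return {component: annual_budget * share for component, share in TCO_SHARES.items()}

if __name__ == "__main__":
    for component, cost in breakdown(10_000_000).items():  # assumed $10M annual budget
        print(f"{component:28s} ${cost:12,.0f}")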
Figure 7-1. TCO Components

Figure 7-2. Retention-Adjusted TCO Components

Replication
All 900 TB of business-critical data is spread across four datacenters. Dark fiber between five metro-area endpoints, including the four datacenters and a metropolitan-area network point of presence (MAN POP), provides connectivity for Fibre Channel (FC) over Dense Wavelength Division Multiplexing (DWDM) and for data replication. A portion of the replication is continuous, but the majority of the copies are done with simple sync-and-splits over the wire, either nightly or during business hours, depending on performance requirements and the availability of application downtime windows. For example, the extract, transform, and load (ETL) and data warehouse applications are read-only for most of the day, so that data is synchronized at night when the daily data load processes are complete. Financial and CRM applications are also typically copied after business hours, when utilization of those systems is low.

According to the storage services team, the firm is prepared for typical disaster recovery scenarios (weather events being the most common), with the most critical data copied frequently enough between the regional datacenters to allow for quick recovery and optimal business uptime.
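The split between continuous replication and scheduled sync-and-split copies can be pictured as a simple scheduling policy. The sketch below assumes hypothetical application names, modes, and copy windows; it illustrates the pattern described above rather than the team's actual configuration.

from dataclasses import dataclass
from typing import List

@dataclass
class ReplicationPolicy:
    application: str
    mode: str     # "continuous" or "sync_and_split"
    window: str   # when sync-and-split copies run; not used for continuous mode

# Hypothetical policies reflecting the pattern described above: a minority of
# data is replicated continuously, and the rest is copied with scheduled
# sync-and-splits in whatever downtime window the application allows.
POLICIES: List[ReplicationPolicy] = [
    ReplicationPolicy("order-entry",    "continuous",     "n/a"),
    ReplicationPolicy("data-warehouse", "sync_and_split", "nightly, after daily data loads"),
    ReplicationPolicy("etl-staging",    "sync_and_split", "nightly, after daily data loads"),
    ReplicationPolicy("financials",     "sync_and_split", "after business hours"),
    ReplicationPolicy("crm",            "sync_and_split", "after business hours"),
]

def copies_due(period: str) -> List[str]:
    """Return the applications whose sync-and-split copies run in the given window."""
    return [p.application for p in POLICIES
            if p.mode == "sync_and_split" and period in p.window]

if __name__ == "__main__":
    print("Nightly copies:    ", copies_due("nightly"))
    print("After-hours copies:", copies_due("after business hours"))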
Organizational Impact
From the inception of the first SAN until mid-2002, system administrators performed storage support duties on a strictly ad hoc, server-centric basis. Prior to the creation of the storage services team, system administrators handled storage provisioning requests as part of their daily activities.

In November 2002, the dedicated storage support team, currently consisting of six full-time storage administrators, was formed to focus specifically on the ongoing consolidation efforts and other storage-related initiatives. Working almost full-time on consolidation efforts to reduce points of management, increase utilization efficiencies, and conserve datacenter resources, the storage services team directly addresses storage supply issues by managing the storage inventory onsite.

As the storage support processes mature, the storage services team is better prepared to outline in specific detail how much storage each application group consumes. The team admits, however, that it is not its policy to police storage