10.5 Complete Site Failure

Protection from the complete failure of your primary Oracle site poses significant challenges. Your company must carefully evaluate the risks to its primary site. These risks include physical and environmental problems as well as hardware risks. For example, is the data center in an area prone to floods, tornadoes, or earthquakes? Are power failures a frequent occurrence? Previous versions of this book treated events such as "a terrorist attack or an airplane crash into the data center" as remote possibilities, but, unfortunately, these scenarios no longer seem so implausible.

Protection from primary site failure involves monitoring of and redundancy controls for the following:

- Data center power supply
- Data center climate control facilities
- Database server redundancy
- Database redundancy
- Data redundancy

The first three items on the list are aimed at preventing the failure of the data center. Database server redundancy, through simple hardware failover or Real Application Clusters, provides protection from node failure within a data center but not from complete data center loss.

Should the data center fail completely, the last two items, database redundancy and data redundancy, provide for disaster recovery.

10.5.1 Oracle Data Guard: Standby Database for Redundancy

Oracle's physical standby database functionality was introduced in Oracle 7.3 to provide database redundancy. In Oracle9i, this concept was extended to include support for a logical standby database. The enhanced feature set is called Oracle Data Guard.

The concept of a physical standby database is simple: keep a copy of the database files at a second location, ship the redo logs to the second site as they are filled, and apply them to the copy of the database. This process keeps the standby database "a few steps" behind the primary database. If the primary site fails, the standby database is opened and becomes the production database.
The potential data loss is limited to the transactions in any redo logs that have not been shipped to the standby site. Figure 10-10 illustrates the standby database feature.

Figure 10-10. Standby database

The physical standby database can be opened only for read-only access. You can use read-only access to offload reporting, such as end-of-day reports, from the primary server to the standby server. Offloading reporting requests provides flexibility for reporting and queries, can help performance on the primary server, and puts the standby server to productive use.

While the physical standby database is open for reporting, archived redo information from the primary site is not applied; recovery resumes only when the standby database is closed again. This has important implications for the time it will take to recover from an outage with the standby database. If the primary site fails while the standby database is open for reporting, all the archived redo information that accumulated during that time must be applied before the standby is brought online, which increases the duration of the outage. You'll need to weigh the benefits of using the standby database for reporting against the longer recovery time should a failure occur while archived redo information is not being applied at the standby. Oracle Database 10g introduces real-time apply, which applies redo data at the standby as soon as it is received rather than waiting for each log to be archived, keeping a standby in recovery mode more closely synchronized with the primary.

Once a standby database is opened for read/write access, as opposed to read-only access, it can no longer be used as a standby database, and you cannot resume applying archived redo information later. If the standby database is opened accidentally in read/write mode, it must be "re-cloned" from the primary site.
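To get a feel for how reporting on the standby can stretch an outage, the following Python sketch does the arithmetic. It is a back-of-the-envelope model, not an Oracle tool: the function names are hypothetical, the redo generation and apply rates are assumptions you would replace with measurements from your own system, and it ignores the time needed to close the standby and activate it as the new primary.

```python
def redo_backlog_mb(redo_rate_mb_per_min: float, minutes_read_only: float) -> float:
    """Redo shipped to the standby but not applied while it was open read-only.

    Hypothetical model: assumes a constant redo generation rate on the primary.
    """
    return redo_rate_mb_per_min * minutes_read_only


def extra_outage_minutes(redo_rate_mb_per_min: float,
                         apply_rate_mb_per_min: float,
                         minutes_read_only: float) -> float:
    """Time needed to apply that backlog before the standby can take over.

    Hypothetical model: assumes a constant apply rate at the standby.
    """
    if apply_rate_mb_per_min <= 0:
        raise ValueError("apply rate must be positive")
    backlog = redo_backlog_mb(redo_rate_mb_per_min, minutes_read_only)
    return backlog / apply_rate_mb_per_min


if __name__ == "__main__":
    # Assumed figures: 20 MB/min of redo, 100 MB/min apply rate, and the
    # standby open for reporting for two hours when the primary fails.
    print(redo_backlog_mb(20, 120))            # 2400
    print(extra_outage_minutes(20, 100, 120))  # 24.0
```

With these assumed rates, a two-hour reporting window adds about 24 minutes of redo application to the failover; the longer the standby stays open, the larger that penalty grows.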
10.5.1.1 Logical standby database

Oracle Data Guard also offers a logical standby database capability. With this capability, the standard Oracle archive logs are transformed into SQL transactions, and these are applied to an open standby database. The logical standby database is physically different from the primary database and can be used for different tasks. For example, the primary database might be indexed for transaction processing while the standby database might be indexed for data warehousing. Although physically different from the primary database, the standby database is logically the same and can take over processing in case the primary fails. As archive logs are shipped from the primary to the standby, undo records in the shipped archive log can be compared to the logical standby undo records to guard against potential corruption. As of Oracle Database 10g, you can instantiate the logical standby database without quiescing the primary.

10.5.1.2 Oracle Data Guard management

The Oracle Data Guard broker provides monitoring and control for physical and logical standby databases and their components. A single command can be used to perform failover. Oracle Enterprise Manager provides a Data Guard Manager GUI for setting up, monitoring, and managing the standby database.

The Oracle Database 10g Data Guard broker adds support for creating and managing configurations containing RAC primary and standby databases. The Data Guard broker leverages the Cluster Ready Services in Oracle Database 10g.

10.5.2 Possible Causes of Lost Data with a Physical Standby Database

Even if you use a physical standby database, you may still lose data.
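The idea of a physically different but logically identical standby can be illustrated with a toy Python example that uses SQLite (from the standard library) as a stand-in for both databases. The change-record format and the helper names are hypothetical and do not reflect how Data Guard represents or mines redo internally; the point is only that the same logical changes, replayed as SQL, yield the same rows in databases that are indexed differently.

```python
import sqlite3

# Hypothetical row-level change records: (operation, primary key, column values).
# A simplified stand-in for the logical content of redo, not Oracle's format.
CHANGES = [
    ("insert", 1, {"customer": "acme", "amount": 250.0}),
    ("insert", 2, {"customer": "globex", "amount": 99.5}),
    ("update", 1, {"amount": 300.0}),
    ("delete", 2, {}),
]


def to_sql(op, key, cols):
    """Turn one change record into a parameterized SQL statement."""
    if op == "insert":
        names = ", ".join(["id", *cols])
        marks = ", ".join("?" * (len(cols) + 1))
        return f"INSERT INTO orders ({names}) VALUES ({marks})", (key, *cols.values())
    if op == "update":
        sets = ", ".join(f"{c} = ?" for c in cols)
        return f"UPDATE orders SET {sets} WHERE id = ?", (*cols.values(), key)
    if op == "delete":
        return "DELETE FROM orders WHERE id = ?", (key,)
    raise ValueError(f"unknown operation {op!r}")


def make_db(index_sql):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.execute(index_sql)  # the physical design differs per database
    return conn


def apply_changes(conn, changes):
    with conn:  # one transaction, committed on success
        for op, key, cols in changes:
            conn.execute(*to_sql(op, key, cols))


def rows(conn):
    return conn.execute("SELECT id, customer, amount FROM orders ORDER BY id").fetchall()


primary = make_db("CREATE INDEX orders_by_customer ON orders (customer)")  # OLTP-style
standby = make_db("CREATE INDEX orders_by_amount ON orders (amount)")      # reporting-style
apply_changes(primary, CHANGES)
apply_changes(standby, CHANGES)
print(rows(standby))                   # [(1, 'acme', 300.0)]
print(rows(primary) == rows(standby))  # True
```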
There are three possible causes of lost data in the event of primary site failure:

- Archived redo logs have not been shipped to the standby site
- Filled online redo logs have not been archived yet
- The current online redo log is not a candidate for archiving until a log switch occurs

These three potential problems are addressed in different ways, as described in the following sections.

10.5.2.1 Copying archived redo logs to a standby site

Prior to Oracle8i, copying of archived redo logs from the primary to the standby site was not automated. You were free to use any method to copy the files across the network. For example, you could schedule a batch job that copies archived logs to the standby site every N minutes. If the primary site fails, these copies would limit the lost redo information (and therefore the lost data) to a maximum of N minutes of work.

Oracle8i first provided support for archiving redo logs to a destination on the primary server as well as on multiple remote servers. This feature automates the copying of archived redo logs to one or more standby sites, and Oracle also automatically applies the archived redo logs to the standby database as they arrive. The lost data is then limited to the contents of any filled redo logs that have not been completely archived, as well as the current online redo log.

Oracle9i added a zero data loss option for a standby database. In this mode, all changes to a local log file are written synchronously to a remote log file. This mode guarantees that failing over to the standby database will not result in any lost data. As you might guess, this mode may impact performance, as each log write must also be completed to a remote log file. Oracle provides an option to wait for the remote log write for only a specified period of time, so that a network failure will not bring database processing to a halt.
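The three causes of loss, and the effect of Oracle9i's synchronous option, can be summarized in a small Python model. The states, function names, and timeout behavior below are simplified, hypothetical bookkeeping for illustration; they are not Oracle parameters, views, or APIs.

```python
from enum import Enum


class RedoState(Enum):
    """Hypothetical status of one redo log sequence at the moment of failure."""
    CURRENT = "current online log, not yet eligible for archiving"
    FILLED_UNARCHIVED = "filled online log, not yet archived"
    ARCHIVED_NOT_SHIPPED = "archived locally, not yet at the standby"
    AT_STANDBY = "received by the standby"


def lost_on_site_failure(logs, synchronous_remote_write=False):
    """Return the sequence numbers whose redo is lost if the primary site vanishes.

    logs maps sequence number -> RedoState. With synchronous_remote_write, every
    local log write is also completed to a remote log, so nothing is lost.
    """
    if synchronous_remote_write:
        return []
    return sorted(seq for seq, state in logs.items() if state is not RedoState.AT_STANDBY)


def commit_with_remote_wait(remote_latency_ms, timeout_ms):
    """Return (wait_ms, standby_current) for one synchronous redo write with a timeout.

    If the remote write finishes within timeout_ms, the commit waits for it and the
    standby stays current. Otherwise the primary stops waiting and keeps running,
    but the standby may now be missing this redo.
    """
    if remote_latency_ms <= timeout_ms:
        return remote_latency_ms, True
    return timeout_ms, False


logs = {
    101: RedoState.AT_STANDBY,
    102: RedoState.AT_STANDBY,
    103: RedoState.ARCHIVED_NOT_SHIPPED,
    104: RedoState.FILLED_UNARCHIVED,
    105: RedoState.CURRENT,
}
print(lost_on_site_failure(logs))                                 # [103, 104, 105]
print(lost_on_site_failure(logs, synchronous_remote_write=True))  # []
print(commit_with_remote_wait(4, 30))    # (4, True)
print(commit_with_remote_wait(500, 30))  # (30, False)
```

The timeout case shows the trade-off described above: a stalled network no longer halts the primary, but the zero data loss guarantee holds only while remote writes keep completing in time.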
10.5.2.2 Unarchived redo information and the role of geomirroring

If you cannot allow primary site failure to result in the loss of any data, and you choose not to use the zero data loss option of Data Guard, the solution is to mirror all redo log and control file activity from the primary site to the standby site.

You can provide this level of reliability by using a remote mirroring technology sometimes known as geomirroring. Essentially, all writes to the online redo log files and the control files at the primary site must be mirrored synchronously to the standby site. For simplicity, you can also geomirror the archived log destination, which duplicates the archived logs at the remote site and, in effect, copies the archived redo logs from the primary to the standby site. This approach can simplify operations: you use one solution for all the mirroring requirements, as opposed to having Oracle copy the archived logs and having geomirroring handle the other critical files.

Geomirroring of the online redo logs results in every committed transaction being written to both the online redo log at the primary site and the copy of the online redo log at the standby site. This process adds some time to each transaction for the mirrored write to reach the standby site. Depending on the distance between the sites and the network used, geomirroring can hamper performance, so you should test its impact on the normal operation of your database.

Geomirroring provides the most complete protection against primary site failure and, accordingly, it's a relatively expensive solution. You will need to weigh the cost of the sophisticated disk subsystems and high-speed telecommunication lines needed for nonintrusive geomirroring against the cost of losing the data in any unarchived redo logs and the current online redo log. See Appendix B for sources of more information about geomirroring.
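Because every mirrored write must travel to the standby site and be acknowledged before the commit completes, distance alone sets a floor on the added latency. The following Python sketch estimates that floor, assuming light covers roughly 200 km per millisecond in optical fiber and that you know the cable route length (usually longer than the straight-line distance). The device overhead parameter and the function names are hypothetical placeholders for your own measurements.

```python
# Assumption: light in optical fiber travels roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def added_commit_latency_ms(route_km, device_overhead_ms=0.0):
    """Extra time one synchronous mirrored write adds to a commit.

    The write must reach the remote site and be acknowledged, so the signal
    covers the route twice. device_overhead_ms is a hypothetical allowance for
    switches, storage controllers, and the remote disk write.
    """
    round_trip_ms = 2 * route_km / FIBER_KM_PER_MS
    return round_trip_ms + device_overhead_ms


def max_serial_commits_per_second(local_commit_ms, route_km, device_overhead_ms=0.0):
    """Upper bound for one session that commits transactions strictly one after another."""
    per_commit_ms = local_commit_ms + added_commit_latency_ms(route_km, device_overhead_ms)
    return 1000.0 / per_commit_ms


print(added_commit_latency_ms(100))                  # 1.0
print(added_commit_latency_ms(1000, 0.5))            # 10.5
print(max_serial_commits_per_second(2.0, 1000, 0.5)) # 80.0
```

The bound applies to a single session committing serially. Oracle's log writer can combine redo from many committing sessions into one write, so aggregate throughput may be higher, which is one more reason to test with a realistic workload rather than rely on the estimate alone.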