Oracle Essentials [Electronic resources] : Oracle Database 10g, 3rd Edition نسخه متنی

10.2 System Crashes

The abrupt
failure of the server machine running the database is one of the most
common causes of unplanned downtime. The server may crash because of
hardware problems, such as the failure of a power supply, or because
of software problems, such as a process that begins to consume all
the machine's CPU resources. Even if the underlying
server platform is fine, the Oracle instance itself can fail.
Whatever the cause of the crash, the effect on Oracle is the
samethe instance cannot deliver its promised functionality.
Remember that when an Oracle database crashes, it's
the instance that crashes, not the database, as described in Chapter
2. Even when a system crashes, the crash will not imperil any data
that's already safely stored within the disk files
used by the Oracle database.

The impact of the crash will depend on the activity in progress at
the time of the crash. Any connected sessions will no longer have a
server process to which to talk. All active queries and transactions
will be abruptly terminated. The process of cleaning up the resulting
mess is called instance recovery
or crash recovery.

Telltale Error Messages

The
following two error messages are often good indicators that an Oracle
instance is down:

ORA-03113:  End-of-file on communication channel

This message is usually received by
clients that try to resubmit an operation that failed due to an
instance failure. The message is somewhat cryptic but becomes clear
if you interpret it very literally. Oracle works using a pipe to
communicate between the client application and its associated server
process in the Oracle instance. When the instance fails, the
client's server process ceases to exist, so there is
no one listening on the other end of the pipe. The communication
channel between the client and the server is no longer valid.

ORA-01034:  Oracle not available

This terse message means that when the
client requested a connection to the Oracle instance, the instance
was not there. Clients that try to connect to a failed instance will
typically get this message. The client can connect to the Listener,
but when the Listener attempts to hand the client off to the
requested Oracle instance, the ORA-01034 condition results.

10.2.1 What Is Instance Recovery?

When you
restart an Oracle instance after a failure, Oracle detects that a
crash occurred using information in the control file and the headers
of the database files. Oracle then performs instance recovery
automatically and uses the online redo logs to guarantee that the
physical database is restored to a consistent state as it existed at
the time of the crash. This requires two actions:

All committed transactions will be recovered.

In-flight transactions will be rolled back or undone.

Note that an in-flight transaction might be one that a user
didn't commit or one that was committed by the user
but not confirmed by Oracle before the system failure. A
transaction
isn't considered committed until Oracle has written
the relevant details of the transaction to the current online redo
log and has sent back a message to the client application confirming
the committed transaction.

10.2.2 Phases of Instance Recovery

Instance
recovery has two phases: rollforward and rollback.

Recovering an instance requires the use of the redo logs, described
in Chapter 2. The redo logs contain a recording of all the physical
changes made to the database as a result of transactional activity,
both committed and uncommitted.

The checkpoint
concept, also introduced in Chapter 2, is critical to understanding
crash recovery. When a transaction is committed, Oracle writes all
associated database block changes to the current online redo log. The
actual database blocks may have already been flushed to disk, or may
be flushed at some later point. This means that the online redo log
can contain changes not yet reflected in the actual database blocks
stored in the datafiles. Oracle periodically ensures that the data
blocks in the datafiles on disk are synchronized with the redo log to
reflect all the committed changes up to a point in time. Oracle does
this by writing all the database blocks changed by those committed
transactions to the datafiles on disk. This operation is called a
checkpoint. Completed
checkpoints are recorded in the
control file,
datafile headers, and
redo log.

10.2.2.1 Rollforward

At any point in
time, the online redo logs will be ahead of the datafiles by a
certain amount of time or number of committed transactions. Instance
recovery closes this gap and ensures that the datafiles reflect all
committed transactions up to the time the instance crashed. Oracle
performs instance recovery by rolling forward through the online redo
log and replaying all the changes from the last completed checkpoint
to the time of instance failure. This operation is called the
rollforward phase of
instance recovery.

While implementing rollforward recovery, Oracle reads the necessary
database blocks into the System Global Area and reproduces the
changes that were originally applied to the blocks. This process
includes reproducing the undo or rollback information, in addition to
the data changes. Rollback segments are composed of extents and data
blocks just like tables, and all changes to rollback segment blocks
are part of the redo for a given transaction. For example, suppose
that a user changed an employee name from
"John" to
"Jonathan". As Oracle applies the
redo log, it will read the block containing the employee row into the
cache and redo the name change. As part of recovering the
transaction, Oracle will also write the old name
"John" to a rollback segment, as
was done for the original transaction.

When the rollforward phase is finished, all the changes for committed
and uncommitted transactions have been reproduced. The uncommitted
transactions are in-flight once again, just as they were at the time
the crash occurred. This leads to the next logical phase of instance
recoveryrollback. But before we discuss rollbacks themselves,
we need to look at how Oracle uses checkpoints and how the timing of
checkpoints can affect recovery time.

10.2.2.2 Fast-start fault recovery and bounded recovery time

Checkpoints cause an increase in I/O as
the database writer flushes all the database blocks to disk to bring
the datafiles up to the time of the checkpoint. Prior to Oracle8,
DBAs controlled the checkpoint frequency using
initialization file parameters and
the size of the redo log files. The relevant initialization file
parameters are:

LOG_CHECKPOINT_INTERVAL

The number of redo log blocks
(operating system blocks) written between checkpoints.

LOG_CHECKPOINT_TIMEOUT

The number of seconds between
checkpoints. A 0 value disables time-based checkpoints.

Oracle performs a checkpoint whenever a log switch occurs, regardless
of the values used for these parameters.

Reducing the checkpoint interval or timeout
can result in smaller amounts of data between checkpoints. This
reduced data can lead to faster recovery times, but can also
introduce the overhead of more frequent checkpoints and their
associated disk activity. For additional details about the subtleties
of these parameters, see the section "Managing the
Online Redo Log" in the appropriate backup and
recovery documentation.

A common strategy for minimizing the number of checkpoints is to set
these initialization file parameters so that checkpoints occur only
with log switches. For example, a DBA might set
LOG_CHECKPOINT_TIMEOUT to 0 and LOG_ CHECKPOINT_INTERVAL to a value
higher than the size of the redo log file. These settings result in
applying a minimum of one redo log file's worth of
redo in the event of a crash, which can result in unacceptably long
recovery times.

Oracle8i
introduced an initialization file parameter that provides
a simpler and more accurate way to control recovery times:
FAST_START_IO_TARGET. The bulk of
recovery activity involves performing I/O for reading database blocks
into the cache so that redo can be applied to them. This parameter
sets a target ceiling on how many database blocks Oracle will have to
read in applying redo information. Oracle will dynamically vary the
checkpoint frequency in an attempt to limit the number of blocks that
will need to be read for recovery to the value of this parameter.
This variance will, in turn, more effectively limit the associated
recovery time without setting checkpoint rates to a fixed value that
isn't valid for different workloads.

Oracle9i
further sped this recovery process. Beginning at the last checkpoint,
the redo log is scanned for data blocks that contain unsaved changes
and need to be recovered. In the subsequent scan, changes are applied
only where needed. Because the subsequent scan is a sequential read
and reading unnecessary blocks (random I/O) is eliminated, the
recovery time is reduced.

Oracle9i also introduced a fast-start time-based
recovery feature. DBAs can specify a target for recovery time in
seconds (in the FAST_START_MTTR_TARGET initialization parameter,
where MTTR stands for Mean Time to Recover) in order to meet Service
Level Agreements and other requirements. The database will
automatically determine values for FAST_START_IO_TARGET and
LOG_CHECKPOINT_INTERVAL. Estimated MTTR values are calculated and
placed in V$INSTANCE_RECOVERY, thereby providing a means for
real-world calibration and more accurate estimates over
time.

10.2.2.3 Rollback

The
rollforward phase re-creates uncommitted transactions and their
associated rollback information. These in-flight transactions must be
rolled back to return to a consistent state.

In Oracle releases prior to Version 7.3, the database
isn't available until all uncommitted transactions
had been rolled back. Although a DBA can control the checkpoint
frequency and therefore control the time required for the rollforward
phase of instance recovery, the number of uncommitted transactions at
the time of the crash can vary tremendously and cannot really be
accurately controlled or predicted. In a busy OLTP system, there are
typically a fair number of in-flight transactions requiring rollback
after a crash. This situation led to variable and unpredictable times
for crash recovery. The solution to this problem,
deferred rollback,
was introduced in Oracle 7.3.

10.2.2.4 Deferred rollback

In Version 7.3 and later, Oracle opens the database after the
rollforward phase of recovery and performs the rollback of
uncommitted transactions in the background. This process reduces
database downtime and helps to reduce the variability of recovery
times by deferring the rollback phase.

But what if a user's transaction begins working in a
database block that contains some changes left behind by an
uncommitted transaction? If this happens, the user's
transaction will trigger a foreground rollback to undo the changes
and will then proceed when rollback is complete. This action is
transparent to the userhe doesn't receive
error messages or have to resubmit the transaction.

10.2.2.5 Fast-start rollback

Oracle8i
further optimized the deferred rollback
process by limiting the rollback triggered by a user transaction to
the block in which the transaction is interested. For example,
suppose there is a large uncommitted transaction that affected 500
database blocks. Prior to Oracle8i, the first
user transaction that touches one of those 500 blocks will trigger a
foreground rollback and will absorb the overhead of rolling back the
entire transaction. Leveraging fast-start rollback, the
user's transaction will roll back only the changes
to the block in which it's interested. New
transactions don't have to wait for the complete
rollback of large uncommitted transactions.

Oracle Essentials [Electronic resources] : Oracle Database 10g, 3rd Edition نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی