Network Recovery Design
The network availability requirements differ significantly between the PSTN traffic and the rest of the traffic. A convergence time of a few seconds is perfectly tolerable and in line with the SLAs for the Internet and Layer 3 MPLS VPN traffic. For the PSTN traffic, however, the objective is a convergence time of a few tens of milliseconds in case of a single link, SRLG, or node failure (similar to the availability provided by TK's former SDH infrastructure). Note that the PSTN traffic must also be rerouted within a few tens of milliseconds in case of a node failure in the MPC network. This was not possible with TK's previous PSTN network: the links were protected by SDH, but when a Class 4 voice switch failed, all the voice calls were dropped and the communications had to be reestablished. There was no possibility for a voice call to survive a node failure. That said, a Class 4 voice switch failure was extremely rare.
Network Recovery Design for the Internet and Layer 3 MPLS VPN Traffic
With an objective of a few seconds for the Internet and Layer 3 MPLS VPN traffic in case of failure, aggressive OSPF timer tuning clearly was not required. TK therefore opted for conservative OSPF protocol tuning.
Failure Detection Time
By default, OSPF is configured with a 10-second hello interval and a 40-second RouterDeadTimer on most commercial router platforms. Because both the NAS and BAS devices are connected by means of Layer 2 switches, the default configuration does not meet the requirement of a few seconds of convergence in case of failure: the OSPF hello protocol must be used for failure detection, because there is no lower-layer fast failure detection mechanism as in the case of SDH and DWDM links. On the other hand, on SDH and DWDM links (which represent the vast majority of the links in the MPC network), network failures are detected within a few milliseconds.

Note
The case of point-to-point Gigabit Ethernet interfaces without intervening Layer 2 switches is quite different. Upon a fiber cut, a loss of signal (LoS) is quickly detected, making tuning of the OSPF hello interval unnecessary. But in the case of TK, Layer 2 switches are used to reduce the number of required ports. Consequently, the failure of a link or port would not be detected by equipment connected behind the Layer 2 switch.

The hello interval has been set to 1 second with a RouterDeadTimer of 3 seconds. This effectively means that in worst-case failure scenarios the failure is detected within 3 seconds. The configuration template for these changes is shown in Example 4-8.
Example 4-8. OSPF Timer Configuration Template
interface pos3/0
ip ospf hello-interval 1
ip ospf dead-interval 3
!
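Although the template shows a POS interface, this tuning matters most on the interfaces facing the Layer 2 switches that connect the NAS and BAS devices, where no lower-layer failure detection exists. The following is a sketch of the same template applied there; the interface name is hypothetical:

interface gigabitethernet1/0
ip ospf hello-interval 1
ip ospf dead-interval 3
!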
LSA Generation
As soon as the failure has been detected and reported to the OSPF process, the first step is to originate a new LSA to inform the other routers of the topology change. As mentioned in Chapter 2, the challenge is to originate a new LSA quickly so as to improve the IGP convergence time while preserving network stability in the presence of unstable network resources (such as a flapping link). To that end, modern routers provide dynamic mechanisms such as the exponential back-off algorithm described in Chapter 2. TK elected to use the configuration shown in Example 4-9.
Example 4-9. OSPF LSA Origination Configuration
The meaning of the configured values is explained in Chapter 2.
router ospf 1
timers throttle lsa all 0 40 5000
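With this template, the first LSA reflecting a topology change is originated immediately (0-ms start interval). If the same LSA must be refreshed again, the router waits at least 40 ms, and this hold time doubles on each subsequent change up to a maximum of 5000 ms. (This describes the generic exponential back-off behavior from Chapter 2 as applied by the timers throttle lsa command; the exact semantics may vary slightly by software release.)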
Failure Notification Time
For the traffic to be rerouted along an alternate path if a failure occurs, the LSA originated by the node that detects the failure must first be received by the rerouting router, which might be several hops away from the failure. This period (usually called the failure notification time) is therefore the sum of the propagation, queuing, and processing delays along the path between those two nodes. Note that the processing delay may be optimized by means of various mechanisms on some router platforms, but this component of the failure notification time is considered sufficiently small not to require any further tuning.

TK conducted studies showing that the failure notification time under worst-case conditions in its network (considering the high degree of meshing and the low propagation delays) rarely exceeded 100 ms. This is negligible considering the overall goal of a few seconds of total convergence time.
SPF Triggering
Similar to the LSA origination case, on a Cisco router an exponential back-off mechanism can be used for the SPF triggering. TK chose the configuration shown in Example 4-10.
Example 4-10. Exponential Back-Off Configuration
The meaning of these variables is explained in Chapter 2.
router ospf 1
timers throttle spf 50 50 10000
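Here the first SPF computation is scheduled 50 ms after the first topology change notification. A subsequent SPF is held off for at least 50 ms, with the hold time doubling on each new trigger up to a maximum of 10 seconds. (Again, this reflects the generic exponential back-off behavior described in Chapter 2; the exact semantics of the timers throttle spf parameters may vary by release.)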
Note
On modern routers the SPF complexity is usually close to n * log(n), where n is the number of routers in the network; this complexity determines the SPF computation duration. TK measured the SPF duration in its network and found that it was always less than 40 ms. Thus, SPF computation optimizations such as incremental SPF were not required.
RIB and FIB Updates
The RIB and FIB update times are, of course, highly hardware-dependent, but TK measured that those times were systematically less than 0.5 seconds in its network on any router platform.
OSPF Design Conclusions
TK's OSPF design clearly allows for rerouting times on the order of a few seconds, in line with TK's objective for the Internet and Layer 3 MPLS VPN traffic in case of failure. It is also worth mentioning that in case of failure of the inter-POP links (SDH and DWDM), significantly faster rerouting times (about 1 second) can be achieved thanks to the ability to quickly detect the failure. The worst case is a failure within a POP, where OSPF itself must detect the failure (3 seconds with the elected design).

In case of link, SRLG, or node failure, congestion may occur. This congestion is handled by the DiffServ mechanisms that are in place to protect traffic according to its respective importance, thanks to appropriate queuing. That said, based on capacity planning, the OSPF metrics have been computed to limit the likelihood of degraded service that would impact the traffic SLA should a single failure occur in the MPC network.
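As a rough cross-check of the worst case (a failure within a POP), summing the figures quoted in the preceding sections gives the following back-of-the-envelope estimate (not a measured value):

3 s (failure detection via OSPF hellos) + 0 ms (initial LSA origination delay) + 100 ms (failure notification) + 50 ms (initial SPF wait) + 40 ms (SPF computation) + 0.5 s (RIB and FIB updates) ≈ 3.7 s

This confirms that the conservative tuning meets the few-seconds objective.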
Network Recovery Design for the PSTN Traffic
Because the PSTN traffic must be rerouted within a few tens of milliseconds in case of link, SRLG, or node failure, and because such traffic is routed onto TE LSPs, the most appropriate network recovery mechanism is undoubtedly MPLS TE Fast Reroute.
Failure Detection
A key aspect to consider when choosing a network recovery strategy is the network element failure detection time, because it might represent a nonnegligible part of the overall rerouting time. In the case of the MPC network, TK decided to exclusively rely on the SDH and DWDM alarms reported in case of link failure by its SDH and DWDM equipment. Consequently, a link failure is usually detected within a few milliseconds.

It is worth elaborating on the case of a router failure, because the P routers of the MPC network are all based on a distributed architecture. This has the advantage that in case of a control-plane failure, the traffic does not suffer any disruption, so the failure detection time matters less. In the case of a control-plane failure, it is sufficient to rely on the expiration of the RouterDeadTimer (3 seconds) and the subsequent failure of the routing adjacency to trigger a reroute of the IP traffic, because traffic is not affected in the meantime. For the PSTN traffic routed onto TE LSPs, as soon as their respective headend routers are informed of the control-plane failure, the TE LSPs are rerouted along another path avoiding the failed router; here too, the traffic remains unaffected by the control-plane failure. Note that this does not require any specific mechanism and should be considered the default behavior of a distributed-architecture platform. That said, this assumes that the route processor does not reboot: upon reloading its software, the route processor may update its line cards' control-plane processors, which may lead to traffic disruption. The failure case considered here is a simple route processor failure, such as a hardware failure.

Note
The case of a PE-PSTN node failure is studied later in this section.

A power supply failure results in the failure of all the router-attached links. Similarly, a line card failure provokes the failure of all its links. Consequently, such failures are equivalent to link failures in terms of triggering network rerouting.
Set of Backup Tunnels
Two types of backup tunnels must be provisioned in the MPC network. The first type is next-hop (NHOP) backup tunnels, which protect the PSTN traffic from the failure of a link or SRLG. The second type is next-next-hop (NNHOP) backup tunnels, which protect the PSTN traffic from a node failure (such as a hardware node failure that affects both the control and forwarding planes).
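For illustration, the following sketch shows what a manually configured NHOP backup tunnel could look like on a PLR protecting the link attached to pos3/0. All interface names, addresses, and the explicit path are hypothetical, and the command spelling follows the style of the other examples in this chapter (actual keywords vary by platform and release). As described later, TK's design instead computes these tunnels automatically.

ip explicit-path name AVOID-PROTECTED-LINK enable
next-address 10.1.1.2
next-address 10.1.2.2
!
interface Tunnel100
description NHOP backup tunnel protecting link on pos3/0
ip unnumbered Loopback0
tunnel mode mpls traffic-engineering
tunnel destination 10.0.0.3
tunnel mpls traffic-engineering path-option 1 explicit name AVOID-PROTECTED-LINK
!
interface pos3/0
mpls traffic-engineering backup-path Tunnel100
!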
Backup Tunnel Constraints
The first constraint a backup tunnel path must meet is to be diversely routed from the protected facility.

In the case of an NHOP backup tunnel, the backup tunnel path must be diverse from the link under protection. If the link belongs to an SRLG, the backup tunnel must also be diversely routed from the SRLGs the protected link belongs to; in other words, the backup tunnel must not traverse any link that shares one or more SRLGs with the protected link. Otherwise, an SRLG failure would provoke the failure of both the protected link and the backup tunnel used to protect it.

A more optimal solution would be to have one backup tunnel protect the link and another backup tunnel protect the SRLG; in case of failure, the Point of Local Repair (PLR) would then select the appropriate backup tunnel. The same concept could be applied to overlapping SRLGs: instead of one backup tunnel that is diverse from all the SRLGs the protected link belongs to, you could have one backup tunnel per SRLG. Unfortunately, this is not a viable option, because a router acting as a PLR cannot differentiate a link failure from an SRLG failure. Hence, when a link belongs to an SRLG, the NHOP backup tunnel must systematically be SRLG-diverse. This important concept requires some additional explanation. Consider Figure 4-32, which shows the set of SRLGs in the MPC network.
Figure 4-32. Telecom Kingland SRLG Membership

Backup Tunnel Design Between Level 1 POPs
One of the objectives of the TK design is to protect any TE LSP from link, SRLG, or node failure. To achieve this aim, one SRLG-diverse NHOP backup tunnel is required per protected link, and one SRLG-diverse NNHOP backup tunnel per next-next hop. To illustrate this Fast Reroute design, consider the example of the cw2-c1 link attached to the router cw2 in the Center-West POP.

Protecting the cw2-c1 link requires computing an NHOP backup tunnel path that is SRLG-diverse from that link. You can do this manually, by considering each link in the network, the SRLG memberships, and so on, and explicitly configuring the backup tunnel path. Alternatively, this can be done automatically by each router.

TK opted for automatic computation and configuration of the backup tunnels. To that end, each router in charge of computing a backup tunnel path for each of its neighbors must be aware of the SRLG memberships of all the links (such as the fact that the links cw2-c1 and c1-s1 belong to the same SRLG). The Internet Engineering Task Force (IETF) has specified IGP extensions to flood the SRLG membership. In the case of OSPF, [OSPF-GMPLS] defines several new sub-TLVs carried in the Link TLV (Type 2) that provide additional link characteristics; one of them is the SRLG membership (sub-TLV 16). On a Cisco router, the SRLG membership of a given link is configured only once, on that link, as indicated in Example 4-11. It is then automatically flooded by means of OSPF to the other routers in the same OSPF area (because the opaque LSA used for the MPLS Traffic Engineering extensions is of Type 10, which has area scope).
Example 4-11. SRLG Membership Configuration
interface POS3/0
mpls traffic-engineering srlg 1

Following the configuration in Example 4-11, the set of SRLGs each link belongs to is configured on that link. The SRLG membership is then passed to OSPF and flooded throughout the area in the TE LSA (opaque LSA Type 10). Figure 4-33 shows the OSPF SRLG sub-TLV format.
Figure 4-33. OSPF SRLG Sub-TLV Format

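A link that belongs to several SRLGs would simply carry one such statement per SRLG. The following sketch assumes a hypothetical link belonging to SRLGs 2 and 5:

interface POS3/1
mpls traffic-engineering srlg 2
mpls traffic-engineering srlg 5
!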
On a Cisco router, a single global command (shown next) is then configured on each router. This does the following:

- Automatically configures the NHOP and NNHOP backup tunnel(s)
- Ensures that the backup tunnels are SRLG-diverse when possible
mpls traffic-engineering auto-tunnel backup srlg exclude preferred
This command triggers the following set of actions:

- By examining its OSPF topology database, each router first determines its set of links where a routing adjacency is established. For each link, an SRLG-diverse NHOP backup tunnel path is computed and presignaled, provided that at least one primary TE LSP traverses the protected link (if no such TE LSP exists, there is no need to instantiate a backup tunnel).
- The router then determines its set of next-next hops and configures for each of them an SRLG-diverse NNHOP backup tunnel, provided that at least one primary TE LSP follows this protected section.
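Note that the srlg exclude preferred keywords in the command above express SRLG diversity as a preference rather than a strict constraint: if no SRLG-diverse path can be found, a backup tunnel is still established along the best available path. This is consistent with the relaxation discussed later in the section "Relaxing the SRLG Diversity Constraint." (This reading follows from the "when possible" behavior described above; the exact keyword semantics may vary by software release.)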
Figure 4-34 provides examples of NHOP and NNHOP backup tunnels.
Figure 4-34. Example of NHOP and NNHOP Backup Tunnels

Note
Although both NHOP and NNHOP backup tunnels are configured, the only TE LSPs that are rerouted onto an NHOP backup tunnel in case of a link/SRLG failure are the TE LSPs that terminate on the next hop. All other TE LSPs are systematically rerouted onto their NNHOP backup tunnel, because a PLR cannot differentiate a link failure from a node failure. Consider the failure of the link cw2-c1 versus a power failure of the node c1: both result in a failure of the link cw2-c1. Thus, upon the failure of the link cw2-c1, the PLR router cw2 has to assume a node failure to be on the safe side. If the problem turns out to be a link failure, the TE LSPs might be rerouted onto a longer backup path (the path of an NNHOP backup tunnel is usually longer than the path the traffic would have followed via the NHOP backup tunnel), but this is preferable to rerouting onto the NHOP backup tunnel when the failure is in fact a node failure. Some schemes (based on probing) have been proposed to distinguish a link failure from a node failure. The idea is to send a probe message over the NHOP backup tunnel right after the occurrence of a link failure, which allows the PLR to determine whether the next-hop neighbor is alive: if a response is received, the failure is just a link failure; otherwise, it is a node failure. Given this, the designer has two choices:

- Assume a link failure, and switch over to the NNHOP backup tunnel if the PLR determines that the failure is in fact a node failure.
- Assume a node failure, and switch back to the NHOP backup tunnel (potentially offering a more optimal backup path) if the failure is characterized as a link failure.
In the first mode, the rerouted TE LSPs follow a more optimal path, but in case of a node failure the traffic disruption is significantly longer, because the PLR requires some time to determine that the failure was in fact a node failure. In the second mode, the path in case of a link failure is potentially slightly longer, but the rerouting time is always minimized. The drawback of a potentially less-optimal backup path for a limited period of time (until the rerouted TE LSP is reoptimized along a more optimal path) is limited compared to the advantage of always minimizing the traffic disruption. Therefore, most current implementations have elected the second mode, without any mechanism to switch back to the NHOP backup tunnel if the failure is a link failure. Indeed, it would take some time for the PLR to characterize the failure, so the time during which the rerouted TE LSPs would benefit from the NHOP backup tunnel would be very limited.

The ability to keep track of the SRLG membership is of the utmost importance. Therefore, TK maintains a database of the MPC links' SRLG memberships that is populated by the team in charge of the network infrastructure. An SRLG membership change could occur, for example, because the team in charge of the transport network decided to reroute some optical light paths along another route.

Each time an SRLG membership is modified in the database, an alarm is triggered, telling the team in charge of the MPC IP/MPLS network to reconfigure the SRLG membership accordingly on the relevant links. On a Cisco router, a change in the SRLG membership configuration is automatically detected by all the routers in the network, and they all trigger the recomputation of their sets of backup tunnels to ensure backup tunnel path SRLG diversity. Note that such an SRLG membership change has no impact on the primary TE LSPs; it potentially impacts only the backup tunnels.
Relaxing the SRLG Diversity Constraint
In some situations the constraint of computing an SRLG-diverse backup tunnel path might keep the PLR from finding a solution. Indeed, in networks where overlapping SRLGs are very common, there might be regions of the network where no SRLG-diverse backup tunnel can be found, and yet a backup tunnel would still be useful to protect against interface and router failures. This is not the case in the MPC network: in steady state, an SRLG-diverse backup tunnel can always be found. That said, the inability to find an SRLG-diverse backup tunnel could still occur in case of multiple failures. (In case of multiple failures, the QoS objectives may no longer be met, but it is still useful to have a backup tunnel in place after the first failure has occurred, in case a second failure occurs.) Consider the case of a first failure of SRLG2, as shown in Figure 4-35.
Figure 4-35. Relaxation of the SRLG Diversity Constraint

Design of the Backup Tunnels Between Level 2 and Level 1 POPs
Because exactly two links connect a Level 2 POP to a Level 1 POP, TK ensured that they do not share any SRLG with any other link. (Otherwise, a single SRLG failure could isolate a POP, which is unacceptable.) The same design as in the Level 1 POP case applies here, with the additional simplification of not having to deal with any SRLG. Each router is configured to automatically compute the required set of NHOP and NNHOP backup tunnels; the only constraint is diversity from the protected section (link or node). This is shown in Figure 4-36: NHOP backup tunnel B1 protects against the failure of the link x2-c1, and NNHOP backup tunnel B2 protects against the failure of c1 for the tunnels transiting c1 and s1.
Figure 4-36. Backup Tunnel Design Between Level 2 and Level 1 POPs

Period of Time During Which Backup Tunnels Are in Use
MPLS TE Fast Reroute (FRR) is a temporary mechanism: the protected TE LSPs are locally rerouted by the node immediately upstream of the failure (the PLR) until they are rerouted along a potentially more optimal path by their headend router. In the case of the MPC network, TK conducted some analysis to approximate the period during which a TE LSP would remain on its backup tunnel after a network element failure. This was particularly important because the decision had been made to use zero-bandwidth backup tunnels.

Consider Figure 4-37. Upon a failure of the link cw2-c1, a primary TE LSP T1 (between PE-PSTN2-1 and PE-PSTN2-4) would be locally rerouted onto the NNHOP backup tunnel B3 within a few tens of milliseconds. Extensive lab testing established that such local fast rerouting would take 60 ms in the worst case, including the time to detect the failure and effectively reroute all the TE LSPs onto their respective backup tunnels. An RSVP Path Error message is then sent to the headend router PE-PSTN2-1 to notify it of the local reroute. Such an RSVP message must be processed by each intermediate hop before it is forwarded toward the headend. The receipt of this notification by the headend router immediately triggers the computation of a new path for T1; the path computation in the case of the MPC network takes less than 2 ms. Because a headend router can potentially have multiple affected TE LSPs, the worst-case reoptimization time is obtained by multiplying the CSPF duration by the maximum number of affected TE LSPs (at most 62, the maximum number of TE LSPs per headend router).
Figure 4-37. Estimation of the Time During Which a Backup Tunnel Is Active
As explained in the section "Quality of Service Design," the queuing delay along each hop is negligible. Finally, the processing delay at each hop was estimated to be at most 10 ms. Consequently, because the maximum number of visited hops is ten and the maximum one-way propagation delay is 15 ms, the round-trip signaling time is always less than [10 * 10 ms (processing delay) + 15 ms (propagation delay)] * 2 (one factor for each direction), which equals 230 ms. Consequently, the maximum amount of time necessary to reroute a TE LSP at the headend router after a failure is 469 ms:
[(10 * 10 ms) + 15 ms] (time for the headend to receive the RSVP Path Error message notifying it of the failure) + (62 * 2 ms) (time to recompute the new paths) + 230 ms (round-trip signaling time) = 469 ms
This means that in the very worst-case scenario, upon a link, SRLG, or node failure, a TE LSP would be rerouted to its backup tunnel within 60 ms. It would use the backup tunnel for a period of 469 ms before being reoptimized by its headend router. Note that in reality this time typically is significantly shorter for most TE LSPs. (The preceding computation uses the very worst case of a TE LSP following a ten-hop path, a very remote failure, and the improbable case of a headend router having to reroute all its TE LSPs affected by the failure.)
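Expressed generically (a sketch using the quantities above, where H is the maximum number of hops, d_proc the per-hop RSVP processing delay, d_prop the one-way propagation delay, N the number of affected TE LSPs, and d_CSPF the per-LSP path computation time):

Time on backup ≈ (H * d_proc + d_prop) + (N * d_CSPF) + 2 * (H * d_proc + d_prop)

With H = 10, d_proc = 10 ms, d_prop = 15 ms, N = 62, and d_CSPF = 2 ms, this gives 115 + 124 + 230 = 469 ms, matching the figure above.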
Configuration of a Hold-Off Timer
The need for a hold-off timer was explained in Chapter 2. As a reminder, when network recovery schemes are available at multiple layers (optical, SONET/SDH, IP/MPLS), it is desirable to introduce some delay at each layer before triggering the recovery, to give a lower-layer network recovery mechanism a chance to recover from the fault.

The MPC network has two link types:

- Unprotected STM-16 and STM-64 links. In this case, no hold-off timer is required. As soon as the fault is detected, the network recovery mechanisms (IP routing and MPLS TE Fast Reroute) are triggered.
- STM-1 links protected by SDH, between some Level 2 and Level 1 POPs. Conversely, in this case it is desirable to wait for some period of time before triggering MPLS TE Fast Reroute. This way, in case of link failure, the SDH layer first tries to recover from the fault. If, after some timer X has elapsed, the fault is not recovered, the SDH layer could not recover the affected resource. The inability of the SDH layer to recover the affected link can be caused by an SDH equipment failure or by a fault outside the SDH layer's protection scope (for example, a router or router interface failure).

TK determined that the SDH recovery time in its network was bounded by 80 ms (based on its SDH ring sizes, number of Add/Drop Multiplexers (ADMs), and so on). Hence, TK decided to set the hold-off timer to 100 ms.
Note
The activation of such a timer has the following consequence: in case of a router interface failure or a router failure, the MPLS TE Fast Reroute time is increased by 100 ms.

On a Cisco router, the hold-off timer can be configured by means of the carrier-delay command (configured on each interface), as shown in Example 4-12.
Example 4-12. Configuration of Carrier Delay
interface pos3/0
carrier-delay ms x (where x is the timer value)
!
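For instance, with the 100-ms hold-off value that TK selected, the template above would be instantiated as follows (the keyword spelling varies by platform and release; many IOS versions use the msec keyword):

interface pos3/0
carrier-delay msec 100
!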
Failure of a PE-PSTN Router
The case of a PE-PSTN failure is shown in Figure 4-38.
Figure 4-38. PE-PSTN Node Failure
