Case Study 13-1: L2TPv3 Path MTU Discovery
In this section, you learn about the PMTUD and fragmentation issues that large packets present in L2TPv3 networks. Just like any other encapsulation protocol, L2TPv3 adds a series of overheads, or protocol control information (PCI), to a service data unit (SDU) that is being encapsulated to create the L2TPv3 protocol data unit (PDU). This section uses the network and L2TPv3 pseudowire shown in Figure 13-1 to cover PMTUD. The maximum transmission unit (MTU) is left at its default of 1500 bytes for all serial links in the network.
Figure 13-1. L2TPv3 PMTUD Topology
[View full size image]
The Problem: MTU and Fragmentation with L2TPv3
With any tunneling protocol, the packet that results from the encapsulation is N bytes longer than the original Layer 2 frame being tunneled. With L2TPv3 over IP as the tunneling protocol, N ranges from 4 to 16 bytes, not including the outermost IP header; the exact value depends on the cookie size and the presence of the Layer 2-Specific Sublayer. As such, the combined size of the encapsulated frame plus the encapsulation data might exceed the packet-switched network (PSN) path MTU, leading to the ingress PE performing fragmentation and the egress PE performing reassembly. The reassembly operation is expensive in terms of processing power; avoid it in the PE device whenever possible. Cisco IOS has several configurations and features that prevent fragmentation and reassembly in the switching path if you adjust or "tune" the MTU.
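The range of N can be illustrated with a short Python sketch of the overhead arithmetic (the function name is illustrative, not part of any Cisco tool):

```python
def l2tpv3_overhead(cookie_size=0, l2_specific_sublayer=False):
    """L2TPv3-over-IP overhead N, excluding the outer IP header.

    The session ID is always 4 bytes; the cookie is 0, 4, or 8 bytes;
    the default Layer 2-Specific Sublayer adds 4 bytes when present.
    """
    assert cookie_size in (0, 4, 8), "cookie is 0, 4, or 8 bytes"
    return 4 + cookie_size + (4 if l2_specific_sublayer else 0)

print(l2tpv3_overhead())           # 4: minimum overhead
print(l2tpv3_overhead(8, True))    # 16: maximum overhead
print(l2tpv3_overhead(4, True))    # 12: this case study's configuration
```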
Note
An L2TP node that exists at either end of an L2TP control connection is referred to as L2TP control connection endpoint (LCCE). An LCCE can either be an L2TP access concentrator (LAC) when tunneled frames are processed at the data link layer (Layer 2) or an L2TP network server (LNS) when tunneled frames are processed at the network layer (Layer 3). A LAC cross-connects an L2TP session directly to a data link and is analogous to a Pseudowire Emulation Edge to Edge (PWE3) provider edge (PE). This chapter uses the terms LCCE, LAC (from L2TP nomenclature), and PE (from pseudowire name assignment) interchangeably to refer to an L2TPv3 tunnel endpoint, unless specifically noted.
You configure an L2TPv3 HDLC pseudowire with a remote and local cookie size of 4 bytes and with sequencing enabled. The configuration for the SanFran end is shown in Example 13-1. The NewYork PE configuration is analogous to this one.
Example 13-1. L2TPv3 HDLC Pseudowire (HDLCPW) Configuration
!
hostname SanFran
!
l2tp-class l2tpv3-wan
cookie size 4
!
pseudowire-class wan-l2tpv3-pw
encapsulation l2tpv3
sequencing both
protocol l2tpv3 l2tpv3-wan
ip local interface Loopback0
!
interface Serial5/0
no ip address
no cdp enable
xconnect 10.0.0.203 50 pw-class wan-l2tpv3-pw
!
You can see in Example 13-2 that the L2TPv3 session is UP, and the encapsulation size is 32 bytes. The command show sss circuit also displays the encapsulation size.
Example 13-2. L2TPv3 HDLCPW Verification
SanFran#show l2tun session all vcid 50 | include Session is|state|encap
Session is L2TP signalled
Session state is established, time since change 00:05:41
Circuit state is UP
encap size = 32 bytes
SanFran#
The encapsulation size of 32 bytes comes from the following:
20 bytes of IPv4 Delivery header
4 bytes of L2TPv3 Session ID
4 bytes of L2TPv3 cookie
4 bytes of the default Layer 2-Specific Sublayer header used for sequencing
You also need to add the transport overhead for High-Level Data Link Control (HDLC), which is constant and equal to 4 extra bytes. Refer to Chapter 12 for details about MTU considerations.
In the transport and tunneling of HDLC frames over HDLC pseudowires (HDLCPW), encapsulating an HDLC frame received from a customer edge (CE) in L2TPv3 over IP adds 36 bytes in the core to the enclosed IP packet that the CE device sends. Before you can address what would happen if you were to send IP packets that were larger than the core MTU minus 36 bytes, you would need a baseline sample for comparison.
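Before running the experiment, you can verify the overhead arithmetic with a few lines of Python (a minimal check using the values from this case study):

```python
# Overhead components from this case study (all values in bytes)
IP_DELIVERY = 20    # outer IPv4 delivery header
SESSION_ID  = 4     # L2TPv3 session ID
COOKIE      = 4     # configured cookie size
L2_SUBLAYER = 4     # default Layer 2-Specific Sublayer (sequencing on)
HDLC        = 4     # HDLC transport overhead

encap = IP_DELIVERY + SESSION_ID + COOKIE + L2_SUBLAYER  # "encap size" in show output
total = encap + HDLC                                     # total added to the CE IP packet
core_mtu = 1500

print(encap)             # 32, matching Example 13-2
print(total)             # 36
print(core_mtu - total)  # 1464: largest CE IP packet that avoids core fragmentation
```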
Start by sending 500 Internet Control Message Protocol (ICMP) ping packets of 1464 bytes each, which is exactly 1500 bytes (core MTU) - 36 bytes (total encapsulation overhead), from the Oakland CE to the Albany CE. While these packets are being sent, profile the IP Input IOS process by using the command show processes cpu, which displays detailed CPU utilization statistics for Cisco IOS processes. The IP Input process takes care of process switching received IP packets in Cisco IOS. Example 13-3 shows the CPU profile for the IP Input process while the packets are being transferred.
Example 13-3. Baseline IP Input CPU Profile
NewYork#show processes cpu | include util|PID|IP Input
CPU utilization for five seconds: 5%/0%; one minute: 5%; five minutes: 5%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
18 5368 648 8283 0.07% 0.08% 0.12% 0 IP Input
NewYork#
You can see from Example 13-3 that the IP Input process CPU utilization is low. That is consistent with the fact that those L2TPv3 packets from the HDLC-PW are being Cisco Express Forwarding (CEF) switched. They are switched in the fast path and not in the process path.
Perform a similar experiment with packets that are larger than the core MTU minus the encapsulation overhead. Disable HDLC keepalives and Cisco Discovery Protocol (CDP) on the CE devices so that you have an accurate count of packets existing in the core routers (see Example 13-4).
Example 13-4. CE Configuration
!
hostname Oakland
!
interface Serial5/0
ip address 192.168.105.1 255.255.255.252
no keepalive
serial restart-delay 0
no cdp enable
!
Next, send 500 ICMP echo packets of 1465 bytes (1 byte larger than 1500 bytes minus 36 bytes) with the don't fragment (DF) bit set in the IP header from the Oakland CE (see Example 13-5).
Example 13-5. 1465-Byte Packets from the Oakland CE
Oakland#ping 192.168.105.2 repeat 500 size 1465 df-bit
Type escape sequence to abort.
Sending 500, 1465-byte ICMP Echos to 192.168.105.2, timeout is 2 seconds:
Packet sent with the DF bit set
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!Output omitted for brevity
!!!!!!!!!!
Success rate is 100 percent (500/500), round-trip min/avg/max = 20/42/388 ms
Oakland#
Note that even though the DF bit is set and the packets do not fit, the pings are successful. Although the DF bit is set in packets that are sent from the Oakland CE device, they are further encapsulated in L2TPv3 over IPv4. The DF bit in this outer IPv4 delivery header is not set. Therefore, these oversized packets that are carrying ICMP over IPv4 over HDLC over L2TPv3 over IPv4 are being fragmented after tunnel encapsulation.
Check the CPU utilization for the IP Input process in the NewYork PE device (see Example 13-6).
Example 13-6. IP Input CPU Profile During IP Fragmentation Reassembly
NewYork#show processes cpu | include util|PID|IP Input
CPU utilization for five seconds: 20%/0%; one minute: 6%; five minutes: 1%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
18 1700 199 8542 16.04% 3.54% 0.41% 0 IP Input
NewYork#
The line labeled CPU utilization for five seconds shows that the CPU utilization was 20 percent, with 0 percent spent at the interrupt level. This means that all the CPU was consumed at the process level, with the IP Input process contributing most of the difference. The reason for this large difference is the reassembly of the fragmented IP packets carrying L2TPv3. Not surprisingly, the CPU utilization for the IP Input process jumps by roughly 16 percentage points when compared to the CPU baseline profile shown in Example 13-3.
Note
Although this case study shows numbers for the CPU utilization and its variation because of reassembly, consider the numbers qualitatively and not quantitatively. The impact of reassembly in CPU performance varies significantly across platforms and traffic patterns. The intent of this case study is to show the effect of reassembly in a router CPU, although the actual variation usually deviates from the sample values shown.
In essence, IP fragmentation that is defined in RFC 791, "Internet Protocol," involves breaking up an IP datagram into several pieces that are sent in different IP packets and reassembling them later. The fact that fragmentation and reassembly is occurring is proven in Example 13-7 using the command show ip traffic.
Example 13-7. IP Fragmentation and Reassembly in NewYork PE
NewYork#show ip traffic | include IP stat|frag|reass
IP statistics:
Frags: 500 reassembled, 0 timeouts, 0 couldn't reassemble
501 fragmented, 0 couldn't fragment
NewYork#
A Cisco IOS router does not attempt to reassemble all IP fragments; it reassembles only those that are destined to the router itself and that must be reassembled before decapsulation. Several issues should make you want to avoid IP fragmentation and reassembly. In a router, reassembly is an expensive operation. A router architecture is designed to switch packets as quickly as possible; holding a packet for a relatively long period of time is more of a host operation than a router operation. When fragmenting a packet, a router needs to make copies of the original IP packet. When fragments are received, a Cisco IOS device allocates the largest buffer, 18 KB, because the total packet length is unknown when the first fragment is received and before the fragments are coalesced. This is an inefficient use of buffers, but even more important is the fact that IP packets are process switched (the process-level, or slow, switching path) for reassembly. This can degrade throughput and performance and increase CPU utilization.
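The RFC 791 mechanics can be modeled with a short sketch (an illustration, not IOS code): each fragment carries its own IP header, and fragment offsets are expressed in 8-byte units, so every non-final fragment payload is a multiple of 8 bytes. The 1501-byte example below corresponds to a 1465-byte CE packet after HDLC and L2TPv3 encapsulation.

```python
def ip_fragments(total_len, mtu, header_len=20):
    """Split an IP datagram of total_len bytes into (offset_units, length)
    fragments that each fit within the MTU, per RFC 791."""
    payload = total_len - header_len
    max_piece = (mtu - header_len) // 8 * 8  # per-fragment payload, 8-byte aligned
    frags, offset = [], 0
    while offset < payload:
        piece = min(max_piece, payload - offset)
        frags.append((offset // 8, header_len + piece))
        offset += piece
    return frags

# A 1501-byte L2TPv3 packet over a 1500-byte MTU yields two fragments:
print(ip_fragments(1501, 1500))   # [(0, 1500), (185, 21)]
```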
Note
In Cisco IOS, multiple fragments from an IP packet are counted as a single IP packet. Therefore, the counters from Example 13-7 and Example 13-8 indicate 500 packets.
Example 13-8. IP Reassembly Is Process Switched
NewYork#show interfaces Serial 5/0 stats
Serial5/0
Switching path Pkts In Chars In Pkts Out Chars Out
Processor 0 0 500 734500
Route cache 500 734500 0 0
Total 500 734500 500 734500
NewYork#show interfaces Serial 5/0 switching
Serial5/0
Throttle count 0
Drops RP 0 SP 0
SPD Flushes Fast 0 SSE 0
SPD Aggress Fast 0
SPD Priority Inputs 0 Drops 0
Protocol Path Pkts In Chars In Pkts Out Chars Out
Other Process 0 0 500 734500
Cache misses 0
Fast 500 734500 0 0
Auton/SSE 0 0 0 0
NewYork#
To verify that the reassembly process takes place in the process path, you can use these two show interface commands from the NewYork PE: show interfaces stats and show interfaces switching. These commands are hidden in some IOS releases (see Example 13-8).
Example 13-8 shows that packets coming into Serial5/0 interface and sent into the tunnel are fast switched (CEF switched), but packets that are sent out of interface Serial5/0 coming from the L2TPv3 session and sent to Albany CE are process switched (switched in the process level switching path by a software process level component). You can see that the 500 IP packets sent from the Oakland CE and fragmented by the SanFran PE are process switched at the NewYork PE because of reassembly and then sent to the Albany CE device.
In summary, stay away from reassembly by avoiding fragmentation by means of MTU tuning. As you will learn in the upcoming examples, sweep ping is a useful tool for identifying fragmentation issues and their boundary conditions.
The Solution: Path MTU Discovery
The solution to the MTU and fragmentation problem is L2TPv3 PMTUD, which is based on RFC 1191, "Path MTU Discovery." To prevent reassembly by the L2TPv3 edge routers, PMTUD allows the PE to dynamically adjust the session MTU; it is supported only for dynamic sessions.
Understanding PMTUD
Enabling PMTUD in the PE device further enables a set of new behaviors, as follows:
The ingress LCCE copies the DF bit from the IP header in the CE IPv4 packet into the IPv4 delivery header. The DF bit is reflected from the inner IP header to the tunnel IP header.
The ingress LCCE listens to ICMP Unreachable messages with code 4 to find out the path MTU and records the discovered path MTU for the session.
The ingress LCCE inspects the IPv4 packet inside the Layer 2 frame it receives from the CE. If the IPv4 packet has the DF bit cleared and the resulting L2TPv3 packet would exceed the discovered MTU, the LCCE determines the number of fragments so that each fragment plus the encapsulation overhead is smaller than the path MTU. It fragments the CE IPv4 packet, copies the original Layer 2 header onto each of the generated fragments, and sends multiple L2TPv3 packets. This procedure effectively pushes the computationally expensive IPv4 reassembly onto the receiving CE device and relieves the PE from being a centralized reassembly point. Note that this action occurs only after the path MTU is discovered.
The ingress LCCE generates ICMP unreachable messages to the CE device when the IPv4 CE packet contains the DF bit set and the resulting L2TPv3 packet exceeds the discovered MTU. The MTU value informed by the PE to the CE in this ICMP unreachable is called the adjusted MTU. The adjusted MTU is the discovered path MTU (PMTU) in the core minus the L2TPv3 overhead (IPv4 header, Session ID, cookie, and Layer 2-Specific Sublayer). Consequently, this adjusted MTU plus the L2TPv3 overhead adds up to the core discovered PMTU and enables PMTUD applications in the customer (C) network to work correctly. Note that this action occurs only after the path MTU is discovered.
If the path MTU has not been discovered, the ingress LCCE performs only the actions in the first two bulleted points.
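The adjusted MTU arithmetic can be sketched as follows, using this case study's HDLC pseudowire values (the function name is ours; note that the tunneled HDLC header rides inside the L2TPv3 payload and therefore also consumes part of the path MTU):

```python
def adjusted_mtu(path_mtu, encap=32, l2_header=4):
    """Adjusted MTU reported to the CE: the discovered core path MTU minus
    the L2TPv3 encapsulation (outer IPv4 header + session ID + cookie +
    Layer 2-Specific Sublayer = 32 bytes here) and the tunneled Layer 2
    header (4 bytes of HDLC), which is also carried inside the payload."""
    return path_mtu - encap - l2_header

print(adjusted_mtu(1500))   # 1464: the maximum MTU the CE is told it can use
```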
Note
With PMTUD enabled, the PE device decodes ICMP Destination Unreachable (Type 3) messages with code 4 ("The datagram is too big. Packet fragmentation is required, but the DF bit in the IP header is set" ) and updates the session MTU accordingly. This ICMP message is also referred to as the Datagram Too Big message and is defined in RFC 792, "Internet Control Message Protocol."
To illustrate the behavior and the new rules, compare a sample network behavior with and without PMTUD. Figure 13-2 shows a sample network where the PMTUD feature is disabled.
Figure 13-2. Processing with L2TPv3 PMTUD Disabled
[View full size image]
The operational procedures are as follows:
1.
|
CE1 sends an IP packet that is encapsulated in HDLC.
|
2.
|
PE1 encapsulates the Layer 2 frame in L2TPv3 and sends the single L2TPv3 packet onto P1. The outer IPv4 header always has the DF bit cleared.
|
3.
|
P1 determines that the MTU of the outgoing interface is smaller than the L2TPv3 over IPv4 packet size. Because the DF bit in the delivery header is cleared, P1 fragments the packet and sends two fragments of an IPv4 packet to P2.
|
4.
|
P2 switches the two fragments that PE2 receives. PE2 reassembles the IPv4 packet that contains the L2TPv3 packet and decapsulates the reassembled L2TPv3 packet.
|
5.
|
PE2 sends the Layer 2 PDU that contains the CE IPv4 packet toward CE2.
|
Note
The fragmentation that occurs in Step 3 is called post-fragmentation because it is the IPv4 delivery packet that is fragmented. The prefix "post" is used in reference to encapsulation; therefore, post-fragmentation means fragmentation after L2TPv3 encapsulation. PE2 carries out the processor-intensive reassembly in Step 4.
In contrast, Figure 13-3 presents the case in which PMTUD is enabled. This assumes that PE1 has already discovered the path MTU by processing an ICMP unreachable "datagram too big" message from the core P1 router, and the session MTU has been updated with a value equal to 1400 bytes.
Figure 13-3. Processing with L2TPv3 PMTUD Enabled
[View full size image]
The following steps take place when PMTUD is enabled:
1.
|
CE1 sends an IP packet that is encapsulated in HDLC.
|
2.
|
PE1 determines that the resulting L2TPv3 over IPv4 packet is greater than the discovered path MTU. PE1 proceeds to fragment the IPv4 CE packet inside the HDLC frame and appends a copy of the HDLC header to the second fragment. The result is that two Layer 2 frames are passed to L2TPv3 for encapsulation, and two L2TPv3 over IPv4 packets are sent to P1.
|
3.
|
The P1 router does not need to perform fragmentation.
|
4.
|
PE2 receives the two L2TPv3 data packets; as far as PE2 knows, they come from two different Layer 2 frames. PE2 then decapsulates the two L2TPv3 packets to end up with two Layer 2 frames containing two fragments of a single IPv4 packet from CE1.
|
5.
|
PE2 sends two Layer 2 PDUs toward CE2, each containing a fragment of the CE IPv4 packet. CE2 reassembles the two fragments into an IPv4 packet.
|
Note
The process that is described in Step 2 is called prefragmentation because it fragments the tunneled data rather than the delivery packet. The prefix "pre" is used in reference to encapsulation; therefore, prefragmentation means fragmentation before L2TPv3 encapsulation. CE2 carries out the processor-intensive reassembly in Step 5.
You can see that PMTUD forces the CPU-intensive reassembly to happen in the receiving CE device. In essence, fragmentation of IP packets from the CE occurs before data enters the pseudowire (prefragmentation). The goal is that tunneled L2TPv3 packets are not fragmented along the way through the IP PSN, so the receiving PE does not perform reassembly. You have learned that the default behavior is to fragment L2TPv3 packets that are larger than the MTU.
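With the path MTU known, the prefragmentation decision reduces to a size comparison (an illustrative sketch using this case study's numbers):

```python
def needs_prefragmentation(ce_packet_len, path_mtu=1500, l2_header=4, encap=32):
    """The ingress PE prefragments when the would-be L2TPv3 packet
    (CE IP packet + tunneled Layer 2 header + L2TPv3 encapsulation)
    would exceed the discovered path MTU."""
    return ce_packet_len + l2_header + encap > path_mtu

print(needs_prefragmentation(1464))  # False: fits in a single L2TPv3 packet
print(needs_prefragmentation(1465))  # True: split into two tunneled frames
```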
Note
Another important aspect of MTU handling is that the Layer 2 frames being tunneled should fall within the MTU of the remote attachment circuit. In a bidirectional communication, this means that attachment circuit MTUs need to match. As opposed to Any Transport over MPLS (AToM), where pseudowires do not come up if an MTU mismatch occurs between the attachment circuits, the attachment circuit MTU is not advertised or enforced in L2TPv3.
Implementing PMTUD
Now that you have learned the operational procedures of PMTUD, it is time to see it in action. Example 13-9 shows the configuration changes that are required to enable PMTUD. This configuration is applied in the SanFran and NewYork PE devices.
Example 13-9. Enabling PMTUD
!
hostname SanFran
!
pseudowire-class wan-l2tpv3-pw-pmtu
encapsulation l2tpv3
sequencing both
protocol l2tpv3 l2tpv3-wan
ip local interface Loopback0
ip pmtu
!
interface Serial5/0
no ip address
no ip directed-broadcast
no cdp enable
no clns route-cache
xconnect 10.0.0.203 50 pw-class wan-l2tpv3-pw-pmtu
!
Example 13-9 shows the ip pmtu command added into a new pseudowire class for the L2TPv3 pseudowire. You can also hard-code the maximum path MTU for the session by adding the max keyword and a maximum path MTU value to the ip pmtu command. This is most useful for accounting for the extra overhead when the core network adds further encapsulations. Example 13-10 highlights a new line of output that specifies that PMTUD is enabled for the session.
Example 13-10. Verifying PMTUD
SanFran#show l2tun session all vcid 50
Session Information Total tunnels 1 sessions 3
Tunnel control packets dropped due to failed digest 0
Session id 61603 is up, tunnel id 51402
Call serial number is 2310500000
Remote tunnel name is NewYork
Internet address is 10.0.0.203
Session is L2TP signalled
Session state is established, time since change 00:00:23
0 Packets sent, 0 received
0 Bytes sent, 0 received
Receive packets dropped:
out-of-order: 0
total: 0
Send packets dropped:
exceeded session MTU: 0
total: 0
Session vcid is 50
Session Layer 2 circuit, type is HDLC, name is Serial5/0
Circuit state is UP
Remote session id is 5399, remote tunnel id 51995
Session PMTU enabled, path MTU is not known
DF bit off, ToS reflect disabled, ToS value 0, TTL value 255
Session cookie information:
local cookie, size 4 bytes, value 0B B4 A2 90
remote cookie, size 4 bytes, value BA 12 10 7F
FS cached header information:
encap size = 32 bytes
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
Sequencing is on
Ns 0, Nr 0, 0 out of order packets received
SanFran#
Session PMTU is now enabled, but the path MTU is still unknown because the PE has not yet received an ICMP unreachable "packet too big" message. Until the path MTU is known, the default behavior is the same as before: prefragmentation is not possible, so if you were to repeat the initial experiment with this setup, the result would be analogous to before.
To trigger the path MTU to be discovered and the session PMTU to be updated, you need to send an IP packet from the Oakland CE that is at least 1465 bytes long and has the DF bit set, which is copied over to the delivery header. Meanwhile, enable debug ip icmp. This example uses the same network from Figure 13-1 (see Example 13-11).
Example 13-11. Triggering PMTU Discovery
Oakland#debug ip icmp
ICMP packet debugging is on
Oakland#ping 192.168.105.2 size 1465 df-bit
Type escape sequence to abort.
Sending 5, 1465-byte ICMP Echos to 192.168.105.2, timeout is 2 seconds:
Packet sent with the DF bit set
.MMMM
Success rate is 0 percent (0/5)
Oakland#
02:13:02: ICMP: dst (192.168.105.1) frag. needed and DF set unreachable rcv from
192.168.105.2
02:13:02: ICMP: dst (192.168.105.1) frag. needed and DF set unreachable rcv from
192.168.105.2
02:13:02: ICMP: dst (192.168.105.1) frag. needed and DF set unreachable rcv from
192.168.105.2
02:13:02: ICMP: dst (192.168.105.1) frag. needed and DF set unreachable rcv from
192.168.105.2
Oakland#
You can see in the Oakland CE that the first ping times out ("." ). This is because the SanFran PE drops the first ping packet in the P network, which triggers the ICMP unreachable message that the SanFran PE absorbs, inspects, and uses to discover the path MTU. For the remaining four ICMP echo packets, you see an M character standing for MTU, which means "Could not fragment." The four M characters correspond to the four ICMP frag. needed and DF set unreachable messages sent by the SanFran PE, received by the Oakland CE, and shown in the debug output. Although the source for these ICMP unreachables is 192.168.105.2, the SanFran PE generates these ICMP unreachable messages by using a source IP address that is equal to the destination IP address in the ICMP echo packets.
The first ping that is dropped triggers an ICMP packet too big unreachable in the core that is sent toward and terminated in the SanFran PE, because the DF bit is copied onto the IPv4 core delivery header. In the basic network in Figure 13-1, the ICMP error is sent from SanFran to SanFran because all the MTUs are the same and equal to 1500 bytes. After the first ping is dropped, the PMTU is discovered. Example 13-12 shows the respective output in the SanFran PE with debug ip icmp and debug vpdn l2x-events enabled.
Example 13-12. Discovering PMTUD in the SanFran PE
SanFran#debug ip icmp
ICMP packet debugging is on
SanFran#debug vpdn l2x-events
L2X protocol events debugging is on
SanFran#
*Jul 6 03:09:47.799: ICMP: dst (10.0.0.203) frag. needed and DF set unreachable sent
to 10.0.0.201
*Jul 6 03:09:47.835: ICMP: dst (10.0.0.201) frag. needed and DF set unreachable rcv
from 10.0.0.201
*Jul 6 03:09:47.835: Tnl46820 L2TP: Socket MTU changed to 1500
SanFran#
SanFran#show l2tun session all vcid 50 | include PMTU
Session PMTU enabled, path MTU is 1500 bytes
SanFran#
You can also see the ICMP unreachables that the SanFran PE generates with a sweep ping with verbose output. This is, in fact, how devices in the C network learn about the adjusted path MTU when they perform their own PMTUD by setting the DF bit (see Example 13-13).
Example 13-13. ICMP Unreachables Sent from the SanFran PE with PMTUD Enabled
Oakland#ping
Protocol [ip]:
Target IP address: 192.168.105.2
Repeat count [5]: 1
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface:
Type of service [0]:
Set DF bit in IP header? [no]: y
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]: v
Loose, Strict, Record, Timestamp, Verbose[V]:
Sweep range of sizes [n]: y
Sweep min size [36]: 1460
Sweep max size [18024]: 1470
Sweep interval [1]:
Type escape sequence to abort.
Sending 11, [1460..1470]-byte ICMP Echos to 192.168.105.2, timeout is 2 seconds:
Packet sent with the DF bit set
Reply to request 0 (20 ms) (size 1460)
Reply to request 1 (20 ms) (size 1461)
Reply to request 2 (36 ms) (size 1462)
Reply to request 3 (20 ms) (size 1463)
Reply to request 4 (28 ms) (size 1464)
Unreachable from 192.168.105.2, maximum MTU 1464 (size 1465)
Unreachable from 192.168.105.2, maximum MTU 1464 (size 1466)
Unreachable from 192.168.105.2, maximum MTU 1464 (size 1467)
Unreachable from 192.168.105.2, maximum MTU 1464 (size 1468)
Unreachable from 192.168.105.2, maximum MTU 1464 (size 1469)
Unreachable from 192.168.105.2, maximum MTU 1464 (size 1470)
Success rate is 45 percent (5/11), round-trip min/avg/max = 20/24/36 ms
Oakland#
To prove the benefits of PMTUD, perform the original experiment sending 500 1465-byte packets from the Oakland CE to the Albany CE and checking the NewYork PE and Albany CE counters. First clear all counters (see Example 13-14).
Example 13-14. 1465-Byte Packets from the Oakland CE with PMTUD
Oakland#ping 192.168.105.2 size 1465 repeat 500
Type escape sequence to abort.
Sending 500, 1465-byte ICMP Echos to 192.168.105.2, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!Output omitted for brevity
!!!!!!!!!!
Success rate is 100 percent (500/500), round-trip min/avg/max = 16/32/360 ms
Oakland#
Example 13-15 shows the switching statistics in the NewYork PE.
Example 13-15. Switching Statistics in the NewYork PE
NewYork#show interfaces Serial 5/0 stats
Serial5/0
Switching path Pkts In Chars In Pkts Out Chars Out
Processor 0 0 0 0
Route cache 500 734500 1000 746500
Total 500 734500 1000 746500
NewYork#
NewYork#show interfaces Serial 5/0 switching
Serial5/0
Throttle count 0
Drops RP 0 SP 0
SPD Flushes Fast 0 SSE 0
SPD Aggress Fast 0
SPD Priority Inputs 0 Drops 0
Protocol Path Pkts In Chars In Pkts Out Chars Out
Other Process 0 0 0 0
Cache misses 0
Fast 500 734500 1000 746500
Auton/SSE 0 0 0 0
NewYork#
Example 13-15 shows that from the 500 packets that the Oakland CE sent and the SanFran PE received, the NewYork PE received 1000 packets from the Denver PE and sent them to the Albany CE, twice as many. The CE IPv4 packet inside each frame received from SanFran was divided into two fragments and sent in two separate HDLC over L2TPv3 packets, each with its respective HDLC transport overhead. See Example 13-16 for the SanFran PE statistics.
Example 13-16. CE IPv4 Fragmentation and Packet Statistics in the SanFran PE
SanFran#show ip traffic | include IP stat|frag|reass
IP statistics:
Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
500 fragmented, 1 couldn't fragment
SanFran#
SanFran#show l2tun session packets vcid 50
Session Information Total tunnels 2 sessions 3
Tunnel control packets dropped due to failed digest 0
LocID RemID TunID Pkts-In Pkts-Out Bytes-In Bytes-Out
4437 64786 48966 1000 1001 746500 748001
SanFran#
Example 13-16 shows that 500 packets were fragmented. One packet could not be fragmented but triggered PMTUD. From the 500 fragmented IPv4 CE packets, 1000 L2TPv3 packets were sent into the tunnel. You can also validate the results using debug ip packet so that you can see the fragments in the CE device, as shown in Example 13-17.
Example 13-17. IP Fragments in the Oakland CE
Oakland#debug ip packet
IP packet debugging is on
Oakland#ping 192.168.105.2 size 1465 repeat 1
Type escape sequence to abort.
Sending 1, 1465-byte ICMP Echos to 192.168.105.2, timeout is 2 seconds:
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 32/32/32 ms
Oakland#
03:41:58: IP: s=192.168.105.1 (local), d=192.168.105.2 (Serial5/0), len 1465, sending
03:41:58: IP: s=192.168.105.1 (local), d=192.168.105.2 (Serial5/0), len 1465, sending
full packet
03:41:58: IP: s=192.168.105.2 (Serial5/0), d=192.168.105.1, len 44, rcvd 2
03:41:58: IP: recv fragment from 192.168.105.2 offset 0 bytes
03:41:58: IP: s=192.168.105.2 (Serial5/0), d=192.168.105.1, len 1441, rcvd 2
03:41:58: IP: recv fragment from 192.168.105.2 offset 24 bytes
03:41:58: ICMP: echo reply rcvd, src 192.168.105.2, dst 192.168.105.1
Oakland#
Notice in Example 13-17 that although the Oakland and Albany CE devices send full unfragmented IP packets, they receive fragmented IP packets. You can see an IP packet composed of two fragments of lengths 44 bytes and 1441 bytes. Coalescing the two and removing the extra IP header yields the 1465-byte packet (44 bytes + 1441 bytes - 20 bytes = 1465 bytes).
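The counters in Examples 13-15 and 13-17 are mutually consistent, as a quick arithmetic check shows (all values taken from the command output):

```python
ip_header, hdlc = 20, 4
frag1, frag2 = 44, 1441   # fragment lengths received by the Oakland CE

# Coalescing the two fragments removes the second fragment's IP header:
print(frag1 + frag2 - ip_header)             # 1465: the original packet size

# Byte counters on the NewYork PE serial interface (Example 13-15):
packets = 500
print(packets * (1465 + hdlc))               # 734500 Chars In: one frame per CE packet
print(packets * (frag1 + frag2 + 2 * hdlc))  # 746500 Chars Out: two frames per CE packet
```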
The most important thing, however, is that the NewYork PE device does not perform reassembly, and the 1000 packets are switched in the fast switching path, which is CEF-switched in this case. As far as the NewYork PE and the Albany CE can see, it is as if the Oakland CE fragmented the packets. The CPU-intensive reassembly is now pushed onto the Albany CE (see Example 13-18).
Example 13-18. Fragmentation Statistics in the Albany CE
Albany#show ip traffic | include IP stat|frag|reass
IP statistics:
Frags: 500 reassembled, 0 timeouts, 0 couldn't reassemble
0 fragmented, 0 couldn't fragment
Albany#
You can see from Example 13-18 that the Albany CE reassembles the 500 packets. This is the goal of prefragmentation: to effectively push the costly reassembly operation to the CE device.
Combining PMTUD with DF Bit
As powerful as the PMTUD feature is, by itself it has a weak link: its correct functioning depends on the discovery of the path MTU. As you have seen already, to discover the path MTU, a large packet with the DF bit set must be sent from the CE device, which means the whole process is controlled by the CE. Moreover, after a timer that defaults to 10 minutes expires, the discovered path MTU is restored to a default value (see Example 13-19).
Example 13-19. PMTUD Timeout
SanFran#debug vpdn l2x-events
L2X protocol events debugging is on
SanFran#
*Jul 6 03:19:47.852: L2X: Restoring default pmtu for peer 10.0.0.203
Also, PE devices that have PMTUD enabled but have not yet discovered the path MTU copy the DF bit from the inner CE IPv4 header into the outer IPv4 delivery header but otherwise act as if PMTUD were disabled. If PMTUD is configured but the path MTU is not discovered and CE packets do not have the DF bit set, reassembly occurs in the PE device (see Example 13-20).
Example 13-20. PMTU Not Discovered
SanFran#show l2tun session all vcid 50 | include PMTU
Session PMTU enabled, path MTU is not known
SanFran#
Oakland#ping 192.168.105.2 size 1465 repeat 500
Type escape sequence to abort.
Sending 500, 1465-byte ICMP Echos to 192.168.105.2, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!Output omitted for brevity
!!!!!!!!!!
Success rate is 100 percent (500/500), round-trip min/avg/max = 28/41/108 ms
Oakland#
NewYork#show l2tun session all vcid 50 | include Packets
500 Packets sent, 500 received
NewYork#show interface Serial5/0 stats
Serial5/0
Switching path Pkts In Chars In Pkts Out Chars Out
Processor 0 0 500 734500
Route cache 500 734500 0 0
Total 500 734500 500 734500
NewYork#
You can see in Example 13-20 that PMTU is not discovered and reassembly occurs in the NewYork PE. If you trigger the discovery of path MTU and perform the same exercise, IP fragmentation reassembly is pushed onto the CE device (see Example 13-21).
Example 13-21. PMTU Discovered
Oakland#
! Triggering PMTUD
Oakland#ping 192.168.105.1 size 1465 df-bit repeat 1
Type escape sequence to abort.
Sending 1, 1465-byte ICMP Echos to 192.168.105.1, timeout is 2 seconds:
Packet sent with the DF bit set
.
Success rate is 0 percent (0/1)
Oakland#ping 192.168.105.2 size 1465 repeat 500
Type escape sequence to abort.
Sending 500, 1465-byte ICMP Echos to 192.168.105.2, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!Output omitted for brevity
!!!!!!!!!!
Success rate is 100 percent (500/500), round-trip min/avg/max = 8/31/252 ms
Oakland#
NewYork#show l2tun session all vcid 50 | include Packets
1000 Packets sent, 1500 received
NewYork#show interface Serial5/0 stats
Serial5/0
          Switching path    Pkts In   Chars In   Pkts Out  Chars Out
               Processor          0          0        500     734500
             Route cache       1000    1469000       1000     746500
                   Total       1000    1469000       1500    1481000
NewYork#
Observe in Example 13-21 that the 1465-byte packet with the DF bit set triggers PMTUD, and the 500 packets that the Oakland CE sends to the Albany CE are received as 1000 packets in the NewYork PE and then switched in the fast path. The highlighted Route cache line in the show interface stats command shows that these packets are fast switched. On the other hand, PMTUD is not triggered in the other direction (return path from the Albany CE to the Oakland CE); therefore, only 500 packets are sent from the NewYork PE to the SanFran PE on the way back, and the SanFran PE does the reassembly.
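The doubling of the packet count follows from the session MTU. As a rough sketch (it ignores the 8-byte offset alignment and the per-fragment IP header that real IP fragmentation adds), once the 1500-byte core path MTU is known, each 1469-byte tunneled frame exceeds the 1468 bytes left after the 32-byte encapsulation and is split in two:

```python
import math

CORE_PMTU = 1500   # discovered path MTU in the core
ENCAP = 32         # outer IPv4 plus L2TPv3 headers
FRAME = 1469       # 1465-byte IP packet plus 4 bytes of HDLC framing

session_mtu = CORE_PMTU - ENCAP             # 1468 bytes of frame per packet
fragments = math.ceil(FRAME / session_mtu)  # 2 fragments per frame
print(fragments * 500)                      # 1000 L2TPv3 packets for 500 pings
```

One byte of overflow is enough: every frame becomes two L2TPv3 packets, so 500 pings arrive at the NewYork PE as 1000 packets.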
Note
For this reason, and to add predictability to the PMTUD process by decoupling it from the CE devices, use PMTUD in conjunction with setting the DF bit. Otherwise, packets are not prefragmented but post-fragmented, unless a CE device happens to send a large packet with the DF bit set to trigger PMTUD. Combining PMTUD with setting the DF bit allows the PE to obtain the path MTU more quickly and predictably.
The PE device can take a more active role in the PMTUD process by setting the DF bit in the outer IPv4 delivery header of all packets, which prevents reassembly in the PE devices. The required configuration, shown in Example 13-22, uses the ip dfbit set command in the pseudowire class. You create a new pseudowire class exactly like the previous one, add the ip dfbit set command, and apply the new class in the Serial 5/0 xconnect.
Example 13-22. PMTUD Combined with DF Bit Setting Configuration
!
hostname SanFran
!
pseudowire-class wan-l2tpv3-pw-pmtu-df
encapsulation l2tpv3
sequencing both
protocol l2tpv3 l2tpv3-wan
ip local interface Loopback0
ip pmtu
ip dfbit set
!
interface Serial5/0
no ip address
no cdp enable
xconnect 10.0.0.203 50 pw-class wan-l2tpv3-pw-pmtu-df
!
Caution
Before you enable PMTUD, make sure that end-to-end PMTUD works. If it does not work, you could break applications just by setting the DF bit. PMTUD might not operate correctly if ICMP unreachables are blocked or end devices are noncompliant.
You can see the DF bit configuration in the show l2tun session command output, as shown in Example 13-23.
Example 13-23. PMTUD Combined with DF Bit Setting Verification
SanFran#show l2tun session all vcid 50
Session Information Total tunnels 1 sessions 3
Tunnel control packets dropped due to failed digest 0
Session id 58502 is up, tunnel id 56513
Call serial number is 905100000
Remote tunnel name is NewYork
Internet address is 10.0.0.203
Session is L2TP signalled
Session state is established, time since change 00:09:13
0 Packets sent, 0 received
0 Bytes sent, 0 received
Receive packets dropped:
out-of-order: 0
total: 0
Send packets dropped:
exceeded session MTU: 0
total: 0
Session vcid is 50
Session Layer 2 circuit, type is HDLC, name is Serial5/0
Circuit state is UP
Remote session id is 38115, remote tunnel id 47670
Session PMTU enabled, path MTU is not known
DF bit on, ToS reflect disabled, ToS value 0, TTL value 255
Session cookie information:
local cookie, size 4 bytes, value 17 72 1D 8B
remote cookie, size 4 bytes, value 3D 49 99 2F
FS cached header information:
encap size = 32 bytes
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
Sequencing is on
Ns 0, Nr 0, 0 out of order packets received
SanFran#
With this configuration, the PE sets the DF bit in the IPv4 delivery header and participates in PMTUD regardless of the DF bit setting in the packets it receives. If the CE devices do not set the DF bit in IPv4 packets, the session PMTU is still discovered and acted upon.
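The resulting forwarding behavior can be summarized as a simplified decision model. This is a sketch of the behavior described in this section, not IOS source code, and pe_forward is a hypothetical name:

```python
def pe_forward(frame_len, encap=32, link_mtu=1500, session_pmtu=None):
    """Simplified ingress PE decision with 'ip pmtu' and 'ip dfbit set'.

    With 'ip dfbit set', the outer DF bit is always on, regardless of the
    DF bit in the CE packet.
    """
    if session_pmtu and frame_len + encap > session_pmtu:
        # Path MTU already learned: fragment the frame before encapsulation.
        return "prefragment, then encapsulate each fragment"
    if frame_len + encap > link_mtu:
        # Oversized outer packet with DF set: drop it, generate an ICMP
        # type 3 code 4 message, and record the session path MTU.
        return "drop, ICMP type 3 code 4, learn PMTU"
    return "encapsulate and forward"
```

For the 1469-byte HDLC frames in this example, the first call (no session PMTU yet) returns the drop-and-learn action; once the session PMTU of 1500 is known, subsequent calls return the prefragment action.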
Next, examine a complete example of PMTUD when the PE has ip dfbit set explicitly configured. Example 13-24 shows 500 1465-byte IP packets without the DF bit set sent from the Oakland CE to the Albany CE.
Example 13-24. PMTUD Combined with DF Bit Setting Operation
Oakland#ping 192.168.105.2 size 1465 repeat 500
Type escape sequence to abort.
Sending 500, 1465-byte ICMP Echos to 192.168.105.2, timeout is 2 seconds:
..!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!Output omitted for brevity
!!!!!!!!!!
Success rate is 99 percent (498/500), round-trip min/avg/max = 4/19/120 ms
Oakland#
You can see that only 498 of the 500 pings are successful. The following steps outline the process:
1. The Oakland CE sends the first IP/ICMP request packet over HDLC. The SanFran PE receives it, encapsulates it with the L2TPv3 and IPv4 delivery headers, and hands it to the IP layer, setting the DF bit in the delivery header because of the ip dfbit set command.

2. The IP packet is dropped at the IP layer in the SanFran PE because it is too big and has the DF bit set. The drop triggers an ICMP type 3 code 4 message, which is used to discover the path MTU. Because the drop occurs locally, this ICMP packet is both sourced from and destined to the SanFran PE: the source IP address comes from the outgoing interface of the originating device, and the destination address comes from the source IP address of the dropped L2TPv3 packet. In the general case, the ICMP type 3 code 4 packet would be sourced from a router in the IP cloud and destined to the PE device. Note that as far as L2TPv3 is concerned, the dropped packet was sent and shows up in the L2TPv3 counters. This first ping times out in the Oakland CE.

3. The Oakland CE sends the second IP packet over HDLC. The SanFran PE prefragments this packet (before encapsulation) and sends two L2TPv3 packets to the NewYork PE and onto the Albany CE.

4. The Albany CE reassembles the IP packet and replies with a 1465-byte ICMP Echo reply, which is encapsulated in L2TPv3 in the NewYork PE but dropped at the IP layer in the NewYork PE. The drop triggers an ICMP type 3 code 4 message that discovers the path MTU on the NewYork PE for the return path; as in Step 2, the ICMP packet is both sourced from and destined to the NewYork PE. This second ping times out in the Oakland CE.

5. The Oakland CE sends a third IP packet over HDLC. The SanFran PE receives it, prefragments it, and sends two L2TPv3 packets to the NewYork PE, which forwards them as two IPv4 over HDLC fragments to the Albany CE.

6. The Albany CE performs the reassembly and replies with an ICMP Echo reply. The NewYork PE, which now knows the path MTU, receives the reply, prefragments the IPv4 CE packet, and sends two L2TPv3 packets to the SanFran PE. The SanFran PE sends the two IPv4 over HDLC fragments to the Oakland CE.

7. The Oakland CE reassembles the two fragments and receives the ping reply. This third ping is successful.

8. The remaining 497 packets follow the same process as the third packet.
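The steps above can be replayed as a toy simulation (a sketch of the sequence only, not a packet-level model) that reproduces the 498-of-500 success rate:

```python
def simulate(pings=500):
    fwd_pmtu_known = False  # SanFran's session PMTU (forward direction)
    rev_pmtu_known = False  # NewYork's session PMTU (return direction)
    successes = 0
    for _ in range(pings):
        if not fwd_pmtu_known:
            # Step 2: the echo is dropped in SanFran; forward PMTU learned.
            fwd_pmtu_known = True
            continue
        if not rev_pmtu_known:
            # Step 4: the reply is dropped in NewYork; return PMTU learned.
            rev_pmtu_known = True
            continue
        # Steps 5-7: prefragmented in both directions, reassembled by the CEs.
        successes += 1
    return successes

print(simulate())  # 498
```

Exactly one packet is sacrificed per direction to learn the path MTU, which matches the two lost pings in Example 13-24.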
This complete process is depicted in Figure 13-4.
Figure 13-4. Processing with L2TPv3 PMTUD and DF Bit Setting Enabled
You can also track various packet counters along the way, starting from the SanFran PE (see Example 13-25).
Example 13-25. PMTUD and DF Bit Counters in the SanFran PE
SanFran#show ip traffic | include IP stat|frag|reass
IP statistics:
Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
499 fragmented, 1 couldn't fragment
SanFran#show l2tun session packet vcid 50
Session Information Total tunnels 1 sessions 3
Tunnel control packets dropped due to failed digest 0
LocID RemID TunID Pkts-In Pkts-Out Bytes-In Bytes-Out
12967 49396 62563 996 999 743514 746508
SanFran#
In the output of the show ip traffic command, you can see that the SanFran PE could not fragment one packet: packet number 1, which was dropped to trigger PMTUD. This packet still shows up in the L2TPv3 session counters. The SanFran PE fragmented the remaining 499 packets, creating 2 * 499 = 998 L2TPv3 packets. These 999 (1 + 998) packets appear as Pkts-Out, meaning packets sent into the tunnel, in the L2TPv3 session packet counters. Example 13-26 shows the respective counters in the NewYork PE, including the 998 L2TPv3 packets that are counted as Pkts-In from the SanFran PE.
Example 13-26. PMTUD and DF Bit Counters in the NewYork PE
NewYork#show ip traffic | include IP stat|frag|reass
IP statistics:
Frags: 0 reassembled, 0 timeouts, 0 couldn't reassemble
498 fragmented, 1 couldn't fragment
NewYork#show l2tun session packet vcid 50
Session Information Total tunnels 1 sessions 3
Tunnel control packets dropped due to failed digest 0
LocID RemID TunID Pkts-In Pkts-Out Bytes-In Bytes-Out
49396 12967 35876 998 997 745007 745015
NewYork#
The output of the show ip traffic command on the NewYork PE shows that one packet, the reply to packet number 2, could not be fragmented. The remaining 498 packets (packets 3 through 500) were prefragmented, creating the 996 L2TPv3 packets that appear in Example 13-25 in the SanFran PE as Pkts-In from the tunnel. These 997 (1 + 996) packets appear as Pkts-Out, meaning packets sent into the L2TPv3 tunnel, in the output of the show l2tun session packet command in NewYork.
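The session counters in Examples 13-25 and 13-26 follow directly from this per-direction arithmetic:

```python
# SanFran (Example 13-25): packet 1 could not be fragmented but still
# counts as sent; packets 2 through 500 are prefragmented into two each.
sanfran_out = 1 + 2 * 499
print(sanfran_out)  # 999, the Pkts-Out counter on SanFran

# NewYork (Example 13-26): the reply to packet 2 could not be fragmented;
# replies 3 through 500 are prefragmented into two each.
newyork_out = 1 + 2 * 498
print(newyork_out)  # 997, the Pkts-Out counter on NewYork

# What each PE receives from the tunnel:
print(2 * 498)  # 996 Pkts-In on SanFran (prefragmented replies)
print(2 * 499)  # 998 Pkts-In on NewYork (prefragmented echoes)
```

Every counter in the two examples is accounted for by one unfragmentable PMTUD trigger packet plus two fragments for each remaining packet.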
Example 13-27. PMTUD and DF Bit Counters in the Albany CE
Albany#show ip traffic | i IP stat|frag|reass|ICMP|echo
IP statistics:
Frags: 499 reassembled, 0 timeouts, 0 couldn't reassemble
0 fragmented, 0 couldn't fragment
ICMP statistics:
499 echo, 0 echo reply, 0 mask requests, 0 mask replies, 0 quench
Sent: 0 redirects, 0 unreachable, 0 echo, 499 echo reply
Albany#
The new configuration effectively pushes the reassembly into the CE devices. Albany reassembled 499 packets (all except packet number 1) from ICMP Echo messages and replied to them.
Example 13-28 shows the IP traffic counters for the Oakland CE.
Example 13-28. PMTUD and DF Bit Counters in the Oakland CE
Oakland#show ip traffic | i IP stat|frag|reass|ICMP|echo
IP statistics:
Frags: 498 reassembled, 0 timeouts, 0 couldn't reassemble
0 fragmented, 0 couldn't fragment
ICMP statistics:
0 echo, 498 echo reply, 0 mask requests, 0 mask replies, 0 quench
Sent: 0 redirects, 0 unreachable, 500 echo, 0 echo reply
Oakland#
You can see that the Oakland CE reassembled 498 packets from the 498 respective Echo replies (all except packets 1 and 2, which were dropped in the core to discover the PMTU).