Archive for the ‘Convergence’ Category

Ethernet, STP, Topology change and the behaviour of Ethernet

June 24, 2014 2 comments


This post is inspired by a post at IEOC about Uplinkfast and TCN which
can be found here.

Before we get to those parts, let’s recap how Ethernet and STP work together.

Spanning Tree

The Spanning Tree Algorithm builds a loop free tree by comparing Bridge ID(BID) and
least cost paths to the root bridge. By doing this it blocks all links not leading
to the root.


MAC Learning

Switches learn where to forward frames by looking at the source MAC address of the frame
on the port that the frame was received on. This learning is done in the data plane
as opposed to routing where the routes are learned in control plane. I will come back
to this later in the post.

MAC learn1

S4 learns that A is located on port 1 after A has sent a frame. This is stored in
the MAC address table located in Content Addressable Memory (CAM). The CAM is a
fast memory optimized for quick lookups in the table. By default there is a 300
second aging timeout for learned MAC addressesm, meaning that if the switch
does not see any traffic from a source MAC within five minutes the entry will
age out of the table. This is used to remove stale entries and to keep the
MAC address table from becoming too large.

Potential Issues

As I mentioned briefly earlier in the post, MAC learning is done in the data plane.
When we exchange routes through protocols such as OSPF, EIGRP and BGP, this is
done in the control plane. If there is a /24 route in the routing table pointing
at a router, then those up to 254 hosts are behind that router. With MAC learning
every source MAC has its own entry, which would be the same as if we had /32 routes
for every host in the network. Not very effecient! This can also become a scalibility
issue in large networks if there are more hosts than the CAM can hold.

There are also other issues such as not being able to use all the links in the
network. Spanning tree will block the redundant links so we don’t get more bandwidth
if we add more links unless we put them into an Etherchannel or use technologies
such as vPC. In datacenter designs, using STP will lead to low bisectional bandwidth,
meaning that even if there are lots of links between a section in the network, most of
them will actually be blocked.

Another issue is that broadcast and unknown unicast traffic is flooded in the network.
Imagine a scenario as below where A is sending unicast traffic to B and it’s
an unidirectional flow. B rarely sends any traffic so its entry has been aged out
of the MAC address table.

Unknown unicast

In this scenario the unknown unicast will be flooded to all the switches and
all servers will have to receive the 300 Mbit/s stream and then discard the
traffic until the switches have learned the MAC of B again!

There is also a potential for black holing of traffic. In the topology below there
are four switches connected together and the primary path is through S4-S1-S2-S3.


Then the link between S1 and S2 fails.


When using 802.1D, there is no synchronization of the topology. It will take up to
50 seconds for the link between S3 and S4 to come up unless Backbonefast has been
deployed. When traffic is going from A to B, it will be blackholed. S4 still has an
entry for B towards S1. When the traffic reaches S1 it has nowhere to go.
Without aging of stale entries, this would take up to five minutes. This is
the purpose of topology change in STP, to faster age out stale entries.

Topology Change

Like I described above, without a mechanism for topology change, traffic could
potentially be black holed for quite a while. In 802.1D, when a link goes up
or down, the switch will generate a TCN BPDU which is a special BPDU sent out
the root port. Normally switches only relay BPDUs from the root on their designated
ports but this is a special case. A switch that receives a TCN BPDU will reply
to it with a configuration BPDU with the TC Acknowledge bit set.


The TCN BPDU will eventually reach the root which will then send out a configuration
BPDU with the TC bit set. This is done for a duration of MaxAge + FwDelay
seconds which is 20 + 15 seconds by default.


When switches receive this BPDU from the root with the TC bit set, they will age out
entries in the CAM at a faster pace. The aging timeout will be set to 15 seconds.
This will age out any stale entries in the CAM. If there are active flows they will
not be aged out because the age will be reset as the switch sees frames coming in
with the source MAC in question. As I described earlier there could be unidirectional
flows leading to flooding. Also flows that are inactive for a while and then resume
can get flooded if their entries time out during the period that the root bridge is
sending out these configuration BPDUs with TC set.


Uplinkfast is a feature deployed on access switches which have dual links to
the distribution layer. Because the switches are located at the edge of the network
it is safe to bring up an alternate port immediately without going through the regular
listening and learning phase, saving up to 30 seconds.

After a switch has failed over to the alternate link it will start to send out
dummy multicast frames. This is to speed up convergence. Even if a configuration
BPDU with TC set is sent by the root, it can still take up to 15 seconds before
stale entries age out.


So based on the thread at IEOC, what is the consequence of Uplinkfast and TC together?
The configuration BPDU with TC is sent for 35 seconds by default. Dummy multicast frames
will be sent out for a duration that is unknown. It depends on how many entries there are
in the CAM and the rate that the packets are sent at. So depending on when the multicast
frame is sent and if you have an unidirectional flow or a host gone silent, then yes
the configuration BPDU with TC could be counter productive. Traffic would reach its
destination though but it would be through flooding of the traffic.

In reality I doubt this would be much of an issue and most networks would be running
RSTP today. RSTP works differently by synchronizing the topology and when the TC bit
is set in BPDUs the entire CAM is flushed on all ports except where the BPDU was

Why fast IGP timers aren’t always beneficial

March 31, 2014 4 comments


When tuning your IGP of choice, the first thing people look at is usually the hello
and dead interval. This is a flawed logic, it is true that it can help in certain
cases but convergence consists of much more than just hello timers.

Why tune timers?

Detecting that the other side of the link is down is an important part of converging.
That’s why your design should avoid putting any bump in the wires such as converters
or a L2 cloud between the L3 endpoints. If you avoid such things when one end of the
link goes down the other end will as well which provides fast detection of failure.

In rare cases you can have the link being up but traffic is not passing over it. For
such cases or for those cases where there was no chance of avoiding a converter or
L2 cloud, tuning the hello timers can help with failure detection. The answer is almost
always BFD though, if the platform supports it.

Topologies where tuning timers is bad

When using a topology where VSS is involved such as Catalyst 6500 or Catalyst 4500,
tuning the timers is very bad. A common topology might look like this:


The L3 switches are dually connected to the VSS. These L3 switches might be in the
distribution layer and the VSS is part of the core. The distribution switches run
LACP towards the VSS which acts as one device from an outside perspective.

The VSS runs Stateful Switchover (SSO) which syncs configuration, boots the standby
supervisor with the software and has the line cards ready to go in case of failure
of the primary chassis. Hardware forwarding tables are also synchronized, SSO
switchover takes somewhere up to 10 seconds.


The active VSS chassis runs the control plane. Routing protocols such as OSPF are not
HA aware, meaning that the state of the routing protocols is not synchronized between
the chassis.

When using fast timers and a switchover occurs, what happens is that OSPF detects that the
neighbor is not replying and tears down the adjacency. The secondary chassis then has to go
bring the adjacency back up by sending out hello packets, exchanging LSAs and updating
RIB/FIB. This may take as long as 20 seconds with the time included from the switchover.


Non Stop Forwarding (NSF)

NSF combined with graceful restart is a technology used to forward packets when
a switchover has occured. The goal of NSF is to delay the failure detection which
may sound strange from a convergence perspective. Remember though that the VSS acts
as one device.

With NSF the forwarding is done according to the last known FIB entries. After a
switchover the secondary VSS will use graceful restart to inform its neighbors that
it has restarted and needs to synchronize its LSDB. This is done by sending hello packets
with a special bit set and the synchronization is done Out Of Band (OOB) to not tear
down the existing adjacency. The neighbors exchange LSAs and run SPF as normal. The
RIB and FIB can then be updated and and normal forwarding ensues.

This process is dependant on that the neighbors are also NSF aware otherwise they
would tear down the adjacency when the secondary VSS is restarting its routing
processes. So the key here is that the adjacency must stay up and that’s why timers
should be left at default if running VSS. This goes for both the VSS and any routers
that are neighbors to the VSS.


When using VSS always leave IGP timers at the default. Fast timers ruins the NSF
process and will lead to much higher convergence times than leaving them at the

Resilient Ethernet Protocol (REP)

November 11, 2013 8 comments


I’m writing a short summary of REP as part of my CCDE studies. REP is an alternative protocol
used in place of STP and is most often run in ring based topologies. It is not limited to
these topologies however and it can also interact with STP if there is a desire to do so.
REP is Cisco proprietary, other vendors have similar protocols like EAPS from Extreme Networks.

Basic REP

REP uses the concept of segments. A segment ID is configured on all switches
belonging to the same segment. Two edge ports are selected where the REP
segment ends. These edge ports must not have connectivity with each other.

One port is blocking and this port is called the alternate port. All other
ports are transit ports.


Traffic flows towards the edge ports.

REP port roles

REP ports are either failed, open or alternate.

  • All regular segment ports start out as failed ports
  • After adjacencies have been determined, ports move to Alternate state. After negotiations on Alternate port is done the remaining ports move to open state while one port stays in Alternate state.
  • When a failure occurs on a link all ports move to failed state. When the Alternate port receives the notification it is moved to open state.

Failure Detection

REP does not work the same way that EAPS does. EAPS sends out a poll on one port
and expects to see it back on the other port facing the ring. It has a master node
that is responsible for this action.

REP works by detecting link failure (Loss of Signal). REP also forms adjacencies
with directly connected switches. Because the main method of converging is to detect LoS
that means that the network should be designed without converters or shared segments that
could affect the detection of a failure. REP Link Status Layer (LSL) is responsible for
detecting REP aware neighbors and establishing connectivity within a segment. After
connectivity has been setup, REP will choose which port is to be alternate and the other
ports will be forwarding. The alternate port can also manually be selected if desired.


Like mentioned earlier the main mechanism is to detect Loss of Signal. In the rare case
that the interface does not go down but connectivity it lost, REP must rely on timers.
The default is that the interface will stay up for five seconds when LSL hellos have
not been received from a neighbor.

When a link fails a notification is sent to a multicast destination address. This notification
is flooded in hardware speeding up the convergence. When a switch receives the notification
it must flush its L2 MAC table.

Interaction with STP

REP can interact with STP by generating TCN BPDUs. This could be desirable if you run REP
in a metro network and then have STP running in the network above that. Generally though
it would be best to not have that a large L2 segment so the REP segment should be
connected to a PE that runs MPLS/IP to the core.

End Port Advertisements

Starting from the edge ports End Port Advertisements (ESA) are sent out every four seconds.
These messages are used to discover the REP topology. The messages are relayed by all
intermediate ports and means that all the switches in the same segment knows what the
topology looks like and the state of all the ports in the segment. This can also be used
to see what the topology looked like before a failure because REP has an archive feature.

Other features of REP

REP supports preemption, meaning that when a failed link comes back the network can go
back to what it looked like before the failure. Manual preemption can also be used but
it will cause a temporary loss of traffic.

REP also supports VLAN load balancing meaning that the topology can look different
depending on the VLAN. However REP is not per VLAN in the sense that the hellos are
always sent on one VLAN compared to PVST+/RPVST+ which sends BPDUs per VLAN.
REP uses a concept of administrative VLAN which can be configured, the default is
to use VLAN 1.


Like any control plane protocols that are running in our networks, they can be open for
attacks. What would happen if someone faked PDUs for REP trying to make the network
converge in an unexpected manner or kept sending these PDUs to flap ports at a
very high rate.

Obviously this could be a dangerous scenario. Cisco thought of this and implemented a key
mechanism that starts from the Alternate port. The key consists of a port ID and a random
generated number created when the port activates. This key is distributed through the
segment to the other devices which can then use this key to unblock the alternate port.


REP is a Cisco proprietary protocol mainly used in metro based ring networks. It is likely
to converge faster than STP and can achieve best case convergence of around 50 ms. REP
can interact with STP by sending TCN BPDUs. REP is a similar technology to EAPS with some
differences. REP is supported on Cisco ME switches.

In the future I think protocols like REP and EAPS will start to fade away as metro based
networks go all MPLS/IP.

Categories: Convergence, Ethernet Tags: , , , ,

Detecting Network Failure

September 26, 2013 7 comments


In todays networks, reliability is critical. Reliability needs to be high and
convergence needs to be fast. There are several ways of detecting network failure
but not all of them scale. This post takes a look at different methods of
detection and discusses when one or the other should be used.

Routing Convergence Components

There are mainly four components of routing convergence:

  1. Failure detection
  2. Failure propagation (flooding)
  3. Topology/Routing recalculation
  4. Update of the routing and forwarding table (RIB and FIB)

With modern networking networking equipment and CPUs it’s actually the first
one that takes most time and not the flooding or recalculation of the topology.

Failure can be detected at different level of the OSI model. It can be layer 1, 2
or 3. When designing the network it’s important to look at complexity and cost
vs the convergence gain. A more complex solution could increase the Mean Time
Between Failure (MTBF) but also increase the Mean Time To Repair (MTTR) leading
to a lower reliability in the end.


Layer 1 Failure Detection – Ethernet

Ethernet has builtin detection of link failure. This works by sending
pulses across the link to test the integrity of it. This is dependant on
auto negotiation so don’t hard code links unless you must! In the case of
running a P2P link over a CWDM/DWDM network make sure that link failure
detection is still operational or use higher layer methods for detecting

Carrier Delay

  • Runs in software
  • Filters link up and down events, notifies protocols
  • By default most IOS versions defaults to 2 seconds to suppress flapping
  • Not recommended to set it to 0 on SVI
  • Router feature

Debounce Timer

  • Delays link down event only
  • Runs in firmware
  • 100 ms default in NX-OS
  • 300 ms default on copper in IOS and 10 ms for fiber
  • Recommended to keep it at default
  • Switch feature

IP Event Dampening

If modifying the carrier delay and/or debounce timer look at implementing IP
event dampening. Otherwise there is a risk of having the interface flap a lot
if the timers are too fast.

Layer 2 Failure Detection

Some layer 2 protocols have their own keepalives like Frame Relay and PPP. This
post only looks at Ethernet.


  • Detects one-way connections due to hardware failure
  • Detects one-way connections due to soft failure
  • Detects miswiring
  • Runs on any single Ethernet link even inside a bundle
  • Typically centralized implementation

UDLD is not a fast protocol. Detecting a failure can take more than 20 seconds so
it shouldn’t be used for fast convergence. There is a fast version of UDLD but this
still runs centralized so it does not scale well and should only be used on a select
few ports. It supports sub second convergence.

Spanning Tree Bridge Assurance

  • Turns STP into a bidirectional protocol
  • Ensures spanning tree fails “closed” rather than “open”
  • If port type is “network” send BPDU regardless of state
  • If network port stops receiving BPDU it’s put in BA-inconsistent state


Bridge Assurance (BA) can help protect against bridging loops where a port becomes
designated because it has stopped receiving BPDUs. This is similar to the function
of loop guard.


It’s not common knowledge that LACP has builtin mechanisms to detect failures.
This is why you should never hardcode Etherchannels between switches, always
use LACP. LACP is used to:

  • Ensure configuration consistence across bundle members on both ends
  • Ensure wiring consistency (bundle members between 2 chassis)
  • Detect unidirectional links
  • Bundle member keepalive

LACP peers will negotiate the requested send rate through the use of PDUs.
If keepalives are not received a port will be suspended from the bundle.
LACP is not a fast protocol, default timers are usually 30 seconds for keepalive
and 90 seconds for dead. The timer can be tuned but it doesn’t scale well if you
have many links because it’s a control plane protocol. IOS XR has support for
sub second timers for LACP.

Layer 3 Failure Detection

There are plenty of protocol timers available at layer 3. OSPF, EIGRP, ISIS,
HSRP and so on. Tuning these from their default values is common and many of
these protocols support sub second timers but because they must run to the
RP/CPU they don’t scale well if you have many interfaces enabled. Tuning these
timers can work well in small and controlled environments though. These are
some reasons to not tune layer 3 timers too low:

  • Each interface may have several protocols like PIM, HSRP, OSPF running
  • Increased supervisor CPU utilization leading to false positives
  • More complex configuration and bandwidth wasted
  • Might not support ISSU/SSO


Bidirectional Forwarding Detection (BFD) is a lightweight protocol designed to
detect liveliness over links/bundles. BFD is:

  • Designed for sub second failure detection
  • Any interested client (OSPF, HSRP, BGP) registers with BFD and is notified when BFD detects loss
  • All registered clients benefit from uniform failure detection
  • Uses UDP port 3784/3785 (echo)

Because any interested protocol can register with BFD there are less packets
going across the link which means less wasting of bandwidth and the packets
are also smaller in size which reduces this even more.

Many platforms also support offloading BFD to line cards which means that the
CPU does not get increased load when BFD is enabled. It also supports ISSU/SSO.

BFD negotiates the transmit and receive interval. If we have a router R1
that wants to transmit at 50 ms interval but R2 can only receive at 100 ms
then R1 has to transmit at 100ms interval.

BFD can run in asynchronous mode or echo mode. In asynchronous mode the BFD
packets go to the control plane to detect liveliness. This can also be combined
with echo mode which sends a packet with a source and destination IP of the
sending router itself. This way the packet is looped back at the other end
testing the data plane. When echo mode is enabled the control plane packets
are sent at a slower pace.

Link bundles

There can be challenges running BFD over link bundles. Due to CEF polarization
control plane/data plane packets might only be sent over the same link. This
means that not all links in the bundle can be properly tested. There is
a per link BFD mode but it seems to have limited support so far.

Event Driven vs Polled

Generally event driven mechanisms are both faster and scale better than polling
based mechanisms of detecting failure. Rely on event driven if you have the option
and only use polled mechanisms when neccessary.


Detecting a network failure is a very important part of network convergence. It
is generally the step that takes the most time. Which protocols to use depends
on network design and the platforms used. Don’t enable all protocols on a link
without knowing what they actually do. Don’t tune timers too low unless you
know why you are tuning them. Use BFD if you can as it is faster and uses
less resources. For more information refer to BRKRST-2333.