Posts Tagged ‘Ethernet’

Ethernet, STP, Topology change and the behaviour of Ethernet

June 24, 2014 2 comments


This post is inspired by a post at IEOC about Uplinkfast and TCN which
can be found here.

Before we get to those parts, let’s recap how Ethernet and STP work together.

Spanning Tree

The Spanning Tree Algorithm builds a loop free tree by comparing Bridge ID (BID) and
least cost paths to the root bridge. By doing this it blocks all links not leading
to the root.
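
As a rough illustration of the BID comparison, the bridge with the numerically lowest BID (priority first, then MAC address) wins the root election. The switch names, priorities and MAC addresses below are made up for the example, and the sketch leaves out the path cost calculation entirely.

```python
# Hypothetical bridges: name -> (priority, MAC address). Values are illustrative.
bridges = {
    "S1": (32768, "00:1a:00:00:00:01"),
    "S2": (4096, "00:1a:00:00:00:02"),
    "S3": (32768, "00:1a:00:00:00:03"),
    "S4": (4096, "00:1a:00:00:00:04"),
}


def bridge_id(priority, mac):
    """Return the BID as a comparable tuple: priority first, then MAC as an integer."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """The bridge with the numerically lowest BID becomes the root."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


print(elect_root(bridges))
```

S2 and S4 tie on priority here, so the lower MAC address decides the election.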


MAC Learning

Switches learn where to forward frames by looking at the source MAC address of the frame
on the port that the frame was received on. This learning is done in the data plane
as opposed to routing where the routes are learned in control plane. I will come back
to this later in the post.

MAC learn1

S4 learns that A is located on port 1 after A has sent a frame. This is stored in
the MAC address table located in Content Addressable Memory (CAM). The CAM is a
fast memory optimized for quick lookups in the table. By default there is a 300
second aging timeout for learned MAC addresses, meaning that if the switch
does not see any traffic from a source MAC within five minutes the entry will
age out of the table. This is used to remove stale entries and to keep the
MAC address table from becoming too large.
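
The learning and aging behaviour can be sketched with a toy table. This is only a model of the logic, not of how a real CAM works; the class and method names are invented for the example, and time is passed in explicitly to keep it simple.

```python
class MacTable:
    """Toy model of a MAC address table with aging (not real CAM behaviour)."""

    def __init__(self, aging_time=300):
        self.aging_time = aging_time  # seconds, 300 is the default from the text
        self.entries = {}             # MAC -> (port, time the source was last seen)

    def learn(self, src_mac, port, now):
        # Learning uses the SOURCE MAC and the port the frame arrived on.
        self.entries[src_mac] = (port, now)

    def lookup(self, dst_mac, now):
        # Return the egress port, or None when the frame has to be flooded.
        entry = self.entries.get(dst_mac)
        if entry is None:
            return None
        port, last_seen = entry
        if now - last_seen >= self.aging_time:
            del self.entries[dst_mac]  # stale entry ages out
            return None
        return port


table = MacTable()
table.learn("aa:aa:aa:aa:aa:aa", port=1, now=0)
print(table.lookup("aa:aa:aa:aa:aa:aa", now=10))   # still known
print(table.lookup("aa:aa:aa:aa:aa:aa", now=301))  # aged out, so the frame is flooded
```

Note that only frames sent by A refresh the entry. Traffic sent to A does not.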

Potential Issues

As I mentioned briefly earlier in the post, MAC learning is done in the data plane.
When we exchange routes through protocols such as OSPF, EIGRP and BGP, this is
done in the control plane. If there is a /24 route in the routing table pointing
at a router, then those up to 254 hosts are behind that router. With MAC learning
every source MAC has its own entry, which would be the same as if we had /32 routes
for every host in the network. Not very efficient! This can also become a scalability
issue in large networks if there are more hosts than the CAM can hold.

There are also other issues such as not being able to use all the links in the
network. Spanning tree will block the redundant links so we don’t get more bandwidth
if we add more links unless we put them into an Etherchannel or use technologies
such as vPC. In datacenter designs, using STP will lead to low bisectional bandwidth,
meaning that even if there are lots of links between a section in the network, most of
them will actually be blocked.

Another issue is that broadcast and unknown unicast traffic is flooded in the network.
Imagine a scenario as below where A is sending unicast traffic to B and it’s
a unidirectional flow. B rarely sends any traffic so its entry has been aged out
of the MAC address table.

Unknown unicast

In this scenario the unknown unicast will be flooded to all the switches and
all servers will have to receive the 300 Mbit/s stream and then discard the
traffic until the switches have learned the MAC of B again!

There is also a potential for black holing of traffic. In the topology below there
are four switches connected together and the primary path is through S4-S1-S2-S3.


Then the link between S1 and S2 fails.


When using 802.1D, there is no synchronization of the topology. It will take up to
50 seconds for the link between S3 and S4 to come up unless Backbonefast has been
deployed. When traffic is going from A to B, it will be blackholed. S4 still has an
entry for B towards S1. When the traffic reaches S1 it has nowhere to go.
If we relied on the normal aging of stale entries, this could last up to five
minutes. This is the purpose of topology change in STP: to age out stale entries faster.

Topology Change

Like I described above, without a mechanism for topology change, traffic could
potentially be black holed for quite a while. In 802.1D, when a link goes up
or down, the switch will generate a TCN BPDU which is a special BPDU sent out
the root port. Normally switches only relay BPDUs from the root on their designated
ports but this is a special case. A switch that receives a TCN BPDU will reply
to it with a configuration BPDU with the TC Acknowledge bit set.


The TCN BPDU will eventually reach the root which will then send out a configuration
BPDU with the TC bit set. This is done for a duration of MaxAge + FwDelay
seconds which is 20 + 15 seconds by default.


When switches receive this BPDU from the root with the TC bit set, they will age out
entries in the CAM at a faster pace. The aging timeout will be set to 15 seconds.
This will age out any stale entries in the CAM. If there are active flows they will
not be aged out because the age will be reset as the switch sees frames coming in
with the source MAC in question. As I described earlier there could be unidirectional
flows leading to flooding. Also flows that are inactive for a while and then resume
can get flooded if their entries time out during the period that the root bridge is
sending out these configuration BPDUs with TC set.
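
A small sketch of the timers involved, using the default values mentioned above. It assumes the shortened aging time equals the forward delay (15 seconds) and that a switch sees TC BPDUs for the whole MaxAge + FwDelay window; the function names are made up for the example.

```python
MAX_AGE = 20        # seconds, default
FWD_DELAY = 15      # seconds, default
NORMAL_AGING = 300  # seconds, default MAC aging time


def tc_duration():
    """How long the root keeps setting the TC bit in its configuration BPDUs."""
    return MAX_AGE + FWD_DELAY


def aging_time(seconds_since_tc_started):
    """Aging time in use: shortened while BPDUs with TC set are being received."""
    if 0 <= seconds_since_tc_started < tc_duration():
        return FWD_DELAY
    return NORMAL_AGING


def entry_survives(idle_seconds, seconds_since_tc_started):
    """An entry stays only if its source MAC was heard within the aging time."""
    return idle_seconds < aging_time(seconds_since_tc_started)


print(tc_duration())
print(entry_survives(idle_seconds=20, seconds_since_tc_started=10))
print(entry_survives(idle_seconds=20, seconds_since_tc_started=40))
```

A host that has been silent for 20 seconds loses its entry during the TC window but would have kept it under normal aging, which is exactly the case that leads to flooding.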


Uplinkfast is a feature deployed on access switches which have dual links to
the distribution layer. Because the switches are located at the edge of the network
it is safe to bring up an alternate port immediately without going through the regular
listening and learning phase, saving up to 30 seconds.

After a switch has failed over to the alternate link it will start to send out
dummy multicast frames. This is to speed up convergence. Even if a configuration
BPDU with TC set is sent by the root, it can still take up to 15 seconds before
stale entries age out.


So based on the thread at IEOC, what is the consequence of Uplinkfast and TC together?
The configuration BPDU with TC is sent for 35 seconds by default. Dummy multicast frames
will be sent out for a duration that is unknown. It depends on how many entries there are
in the CAM and the rate that the packets are sent at. So depending on when the multicast
frame is sent and if you have a unidirectional flow or a host gone silent, then yes
the configuration BPDU with TC could be counterproductive. Traffic would still reach its
destination, but only by being flooded.

In reality I doubt this would be much of an issue and most networks would be running
RSTP today. RSTP works differently by synchronizing the topology and when the TC bit
is set in BPDUs the entire CAM is flushed on all ports except where the BPDU was received.


Detecting Network Failure

September 26, 2013 7 comments


In today’s networks, reliability is critical. Reliability needs to be high and
convergence needs to be fast. There are several ways of detecting network failure
but not all of them scale. This post takes a look at different methods of
detection and discusses when one or the other should be used.

Routing Convergence Components

There are mainly four components of routing convergence:

  1. Failure detection
  2. Failure propagation (flooding)
  3. Topology/Routing recalculation
  4. Update of the routing and forwarding table (RIB and FIB)

With modern networking equipment and CPUs it’s actually the first
one that takes most time and not the flooding or recalculation of the topology.

Failure can be detected at different levels of the OSI model. It can be layer 1, 2
or 3. When designing the network it’s important to look at complexity and cost
vs the convergence gain. A more complex solution could increase the Mean Time
Between Failure (MTBF) but also increase the Mean Time To Repair (MTTR) leading
to a lower reliability in the end.


Layer 1 Failure Detection – Ethernet

Ethernet has built-in detection of link failure. This works by sending
pulses across the link to test the integrity of it. This is dependent on
auto negotiation so don’t hard code links unless you must! In the case of
running a P2P link over a CWDM/DWDM network make sure that link failure
detection is still operational or use higher layer methods for detecting
failures.
Carrier Delay

  • Runs in software
  • Filters link up and down events, notifies protocols
  • Most IOS versions default to 2 seconds to suppress flapping
  • Not recommended to set it to 0 on SVI
  • Router feature

Debounce Timer

  • Delays link down event only
  • Runs in firmware
  • 100 ms default in NX-OS
  • 300 ms default on copper in IOS and 10 ms for fiber
  • Recommended to keep it at default
  • Switch feature

IP Event Dampening

If modifying the carrier delay and/or debounce timer look at implementing IP
event dampening. Otherwise there is a risk of having the interface flap a lot
if the timers are too fast.

Layer 2 Failure Detection

Some layer 2 protocols have their own keepalives like Frame Relay and PPP. This
post only looks at Ethernet.


  • Detects one-way connections due to hardware failure
  • Detects one-way connections due to soft failure
  • Detects miswiring
  • Runs on any single Ethernet link even inside a bundle
  • Typically centralized implementation

UDLD is not a fast protocol. Detecting a failure can take more than 20 seconds so
it shouldn’t be used for fast convergence. There is a fast version of UDLD but this
still runs centralized so it does not scale well and should only be used on a select
few ports. It supports sub second convergence.

Spanning Tree Bridge Assurance

  • Turns STP into a bidirectional protocol
  • Ensures spanning tree fails “closed” rather than “open”
  • If port type is “network” send BPDU regardless of state
  • If network port stops receiving BPDU it’s put in BA-inconsistent state


Bridge Assurance (BA) can help protect against bridging loops where a port becomes
designated because it has stopped receiving BPDUs. This is similar to the function
of loop guard.


It’s not common knowledge that LACP has built-in mechanisms to detect failures.
This is why you should never hardcode Etherchannels between switches, always
use LACP. LACP is used to:

  • Ensure configuration consistency across bundle members on both ends
  • Ensure wiring consistency (bundle members between 2 chassis)
  • Detect unidirectional links
  • Bundle member keepalive

LACP peers will negotiate the requested send rate through the use of PDUs.
If keepalives are not received a port will be suspended from the bundle.
LACP is not a fast protocol, default timers are usually 30 seconds for keepalive
and 90 seconds for dead. The timer can be tuned but it doesn’t scale well if you
have many links because it’s a control plane protocol. IOS XR has support for
sub second timers for LACP.

Layer 3 Failure Detection

There are plenty of protocol timers available at layer 3. OSPF, EIGRP, ISIS,
HSRP and so on. Tuning these from their default values is common and many of
these protocols support sub second timers but because they must run to the
RP/CPU they don’t scale well if you have many interfaces enabled. Tuning these
timers can work well in small and controlled environments though. These are
some reasons to not tune layer 3 timers too low:

  • Each interface may have several protocols like PIM, HSRP, OSPF running
  • Increased supervisor CPU utilization leading to false positives
  • More complex configuration and bandwidth wasted
  • Might not support ISSU/SSO


Bidirectional Forwarding Detection (BFD) is a lightweight protocol designed to
detect liveness over links/bundles. BFD is:

  • Designed for sub second failure detection
  • Any interested client (OSPF, HSRP, BGP) registers with BFD and is notified when BFD detects loss
  • All registered clients benefit from uniform failure detection
  • Uses UDP port 3784/3785 (echo)

Because any interested protocol can register with BFD there are fewer packets
going across the link which means less wasting of bandwidth and the packets
are also smaller in size which reduces this even more.

Many platforms also support offloading BFD to line cards which means that the
CPU does not get increased load when BFD is enabled. It also supports ISSU/SSO.

BFD negotiates the transmit and receive interval. If we have a router R1
that wants to transmit at 50 ms interval but R2 can only receive at 100 ms
then R1 has to transmit at 100 ms interval.
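
The negotiation boils down to taking the slower of the two values. The sketch below also shows the resulting detection time, which relies on a detect multiplier that is not discussed in this post; the multiplier of 3 is only an assumption for the example, and the function names are invented.

```python
def negotiated_tx_interval(local_desired_tx_ms, remote_required_rx_ms):
    """A router may not send faster than its peer is willing to receive."""
    return max(local_desired_tx_ms, remote_required_rx_ms)


def detection_time(peer_tx_interval_ms, peer_multiplier):
    """Time without packets from the peer before the session is declared down."""
    return peer_tx_interval_ms * peer_multiplier


# R1 wants to send every 50 ms, R2 can only receive every 100 ms.
r1_tx = negotiated_tx_interval(local_desired_tx_ms=50, remote_required_rx_ms=100)
print(r1_tx)

# Assumed multiplier of 3 on R1: R2 declares R1 down after this many ms of silence.
print(detection_time(r1_tx, peer_multiplier=3))
```

With these numbers R1 sends every 100 ms and R2 would detect a failure after 300 ms.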

BFD can run in asynchronous mode or echo mode. In asynchronous mode the BFD
packets go to the control plane to detect liveness. This can also be combined
with echo mode which sends a packet with a source and destination IP of the
sending router itself. This way the packet is looped back at the other end
testing the data plane. When echo mode is enabled the control plane packets
are sent at a slower pace.

Link bundles

There can be challenges running BFD over link bundles. Due to CEF polarization
control plane/data plane packets might only be sent over the same link. This
means that not all links in the bundle can be properly tested. There is
a per link BFD mode but it seems to have limited support so far.

Event Driven vs Polled

Generally event driven mechanisms are both faster and scale better than polling
based mechanisms of detecting failure. Rely on event driven if you have the option
and only use polled mechanisms when necessary.


Detecting a network failure is a very important part of network convergence. It
is generally the step that takes the most time. Which protocols to use depends
on network design and the platforms used. Don’t enable all protocols on a link
without knowing what they actually do. Don’t tune timers too low unless you
know why you are tuning them. Use BFD if you can as it is faster and uses
less resources. For more information refer to BRKRST-2333.

The history of Ethernet – DIX vs 802.3

June 6, 2012 14 comments

I’m planning to do a post on BPDUs sent by Cisco switches and analyze why they are sent. To fully understand the coming post first we need to understand the different versions of Ethernet. There is more than one version? Yes, there is although mainly one is used for all communication.

Most people will know that Robert Metcalfe was one of the inventors of Ethernet. Robert was working for Xerox back then. Digital, Intel and Xerox worked together on standardizing Ethernet. This is why it is often referred to as a DIX frame. The DIX version 1 standard was published in 1980 and the version used today is version 2. This is why we refer to Ethernet II or Ethernet version 2. The DIX version is the frame type that is most often used.

IEEE was also working on standardizing Ethernet. They began working on it in February 1980 and that is why the standard is called 802 where 802.3 is the Ethernet standard. We refer to it as Ethernet even though when IEEE released their standard it was called “IEEE 802.3 Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications”. So here we see the term CSMA/CD for the first time.

I’m not here to give you a history lesson but instead explain the frame types and briefly discuss the fields in them. We start with the DIX frame or Ethernet II frame. This is the frame that is most commonly used today. It looks like this.

The preamble is a pattern of alternating ones and zeroes and ending with two ones. When this pattern is received it is known that anything that comes after this pattern is the actual frame.

The source and destination MAC is used for switching based on the MAC.

The EtherType field specifies the upper level protocol. Some of the most well known ones are:

0x0800 – IP
0x8100 – 802.1Q tagged frame
0x0806 – ARP
0x86DD – IPv6

After that follows the actual payload which should be between 46 and 1500 bytes in size.

At the end there is a Frame Check Sequence (FCS) which is used to check the validity of the frame. If the CRC check fails the frame is dropped.

In total the frame will be maximum 1514 bytes or 1518 if counting the FCS.

When it comes to 802.3 Ethernet there are actually two frame formats. One is 802.3 with 802.2 LLC SAP header. It looks like this.

This was the original version from the IEEE. Many of the fields are the same. Let’s look at those that are not.

The preamble is now divided in preamble and Start Frame Delimiter (SFD) but the function is the same.

The length field is used to indicate how many bytes of data are following this field before the FCS. It can also be used to distinguish between a DIX frame and an 802.3 frame, as for DIX the values in this field will be higher, e.g. 0x0806 for ARP. If this value is greater than or equal to 1536 (0x0600) then it is a DIX frame and the value is an Ethertype value.
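
A minimal sketch of that check, assuming the frame is handed over as raw bytes without preamble and FCS, so the two MAC addresses occupy the first 12 bytes. The function name and the sample frames are made up for the example.

```python
import struct


def classify(frame: bytes):
    """Classify a frame by the two bytes that follow the destination and source MAC."""
    (value,) = struct.unpack("!H", frame[12:14])
    if value >= 0x0600:
        return ("DIX", value)       # the value is an EtherType
    if value <= 1500:
        return ("802.3", value)     # the value is the payload length
    return ("undefined", value)     # 1501-1535 is valid for neither format


# Dummy frames: 12 zero bytes for the MACs, the type/length field, then padding.
arp_frame = bytes(12) + struct.pack("!H", 0x0806) + bytes(46)
llc_frame = bytes(12) + struct.pack("!H", 46) + bytes(46)

print(classify(arp_frame))
print(classify(llc_frame))
```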

Then we have some interesting values called DSAP, SSAP and Control. SAP stands for Service Access Point, the S and D in SSAP and DSAP stand for source and destination.

They have a similar function to the Ethertype. The SAP is used to distinguish between different data exchanges on the same station. The SSAP indicates from which service the LLC data unit was sent and the DSAP indicates the service to which the LLC data unit is being sent. IP has a SAP of 0x06 and 802.1D (STP) has a SAP of 0x42. It would be very strange to have a different SSAP and DSAP so these values should be the same. IP to IP would be SSAP of 0x06 and DSAP of 0x06. One bit (LSB) in the DSAP is used to indicate if it is a group address or an individual address. If it is set to zero it refers to an individual address going to a Local SAP (LSAP). One bit in the SSAP (LSB) indicates if it is a command or response packet. That leaves us with 128 possible different SAPs for SSAP and DSAP.

The control field is used to select if communication should be connection-less or connection-oriented. Usually error recovery and flow control are performed by higher level services such as TCP.

The IEEE had trouble addressing all the layer 3 processes due to the short DSAP and SSAP fields in the header. This is why they introduced a new frame format called Subnetwork Access Protocol (SNAP). Basically this header is using the type field found in the DIX header. If the SSAP and DSAP are set to 0xAA and the Control field is set to 0x03 then SNAP encapsulation will follow. SNAP has a five byte extension to the standard 802.2 LLC header and it consists of a three byte OUI and a two byte Type field.

From a vendor perspective this is good because then they can have an OUI and then create their own types to use. If we look at PVST+ BPDUs from a Cisco device we will see that they are SNAP encapsulated where the organization code is Cisco (0x00000c) and the PID is PVSTP+ (0x010b). CDP is also using SNAP and it has a PID of CDP (0x0200). I will talk more about BPDUs and STP in a following post but first I wanted to provide the background on the Ethernet frame types used.
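
To make the header layout concrete, here is a small sketch that parses the LLC header and, when the 0xAA/0xAA/0x03 pattern is present, the SNAP extension. It assumes the input starts right after the length field; the function name and the sample byte strings are made up for the example, using the values mentioned above.

```python
def parse_llc(payload: bytes):
    """Parse the 802.2 LLC header, plus the SNAP extension if it is present."""
    dsap, ssap, control = payload[0], payload[1], payload[2]
    header = {"dsap": dsap, "ssap": ssap, "control": control}
    if dsap == 0xAA and ssap == 0xAA and control == 0x03:
        header["oui"] = int.from_bytes(payload[3:6], "big")  # 3 byte OUI
        header["pid"] = int.from_bytes(payload[6:8], "big")  # 2 byte type/PID
    return header


# SNAP header as seen in a PVST+ BPDU: Cisco OUI 0x00000c, PID 0x010b.
pvst_header = bytes([0xAA, 0xAA, 0x03, 0x00, 0x00, 0x0C, 0x01, 0x0B])
# Plain LLC header with the STP SAP and no SNAP extension.
stp_header = bytes([0x42, 0x42, 0x03])

print(parse_llc(pvst_header))
print(parse_llc(stp_header))
```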

In summary there are three different Ethernet frame types used. DIX frame, also called Ethernet II, IEEE 802.3 with LLC and IEEE 802.3 with SNAP encapsulation. There are others out there as well but these are the three major ones and the DIX one is by far the most common one.

Categories: CCIE, Ethernet

Ethernet – notes

December 16, 2010 Leave a comment

RJ 45 pinouts

10BASE-T and 100BASE-TX use pairs two and three; gigabit Ethernet uses all four pairs.
Pinout for straight cable: 1-1;2-2;3-3;6-6
Pinout for crossover cable: 1-3;2-6;3-1;6-2

A standard PC transmits on pins one and two and receives on pins three and six. A switchport is
the opposite. If two alike devices are connected a crossover cable should be used although
auto MDI-X is standard today.

Cisco switches can detect the speed of a link through Fast Link Pulses (FLP) even if autonegotiation is disabled but the duplex cannot be detected and this means that half duplex must be assumed. This is true for 10BASE-T and 100BASE-TX. Gigabit Ethernet uses all four pairs in the cable and can only use full duplex mode of operation. Also note that for gigabit Ethernet autonegotiation is mandatory although it is possible to hardcode speed and duplex.

Ethernet uses Carrier Sense Multiple Access/Collision Detection (CSMA/CD). Before a client can send a frame it listens to the wire to see that it is not busy. It sends the frame and listens to ensure a collision has not occurred. If a collision occurs all stations that sent a frame send a jamming signal to ensure that all stations recognized the collision. The senders of the original collided frames wait for a random amount of time before sending again.
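
The random wait is not spelled out in these notes. The sketch below assumes the truncated binary exponential backoff defined in 802.3, where the range of possible waits doubles with each collision up to the tenth and the frame is dropped after 16 attempts. The function and constant names are made up for the example.

```python
import random

SLOT_TIME_BITS = 512  # slot time for 10/100 Mbit/s Ethernet, in bit times
MAX_ATTEMPTS = 16     # the frame is dropped after 16 failed attempts


def backoff_slots(collision_count, rng=random):
    """Pick how many slot times to wait after the nth collision of a frame."""
    if collision_count >= MAX_ATTEMPTS:
        raise RuntimeError("excessive collisions, frame dropped")
    exponent = min(collision_count, 10)  # the range stops growing after 10
    return rng.randrange(2 ** exponent)  # uniform in 0 .. 2^exponent - 1


for n in (1, 2, 3):
    slots = backoff_slots(n)
    print(n, slots, slots * SLOT_TIME_BITS)  # collision number, slots, bit times
```

Because both senders draw independently, the chance that they pick the same slot again shrinks with every collision.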

Deferred frames

Frames that were meant to be sent but were paused because frames were being received at the moment. In half duplex, sending and receiving cannot occur at the same time.


Collisions that are detected while the first 64 bytes are being transmitted are called collisions and collisions detected after the first 64 bytes are called late collisions.


Provides synchronization and signal transitions to allow proper clocking of the transmitted signal. Consists of 62 alternating ones and zeroes and then ends with a pair of ones.

I/G bit and U/L bit

The I/G bit is the least significant bit of the first (most significant) byte of the MAC address, which makes it the first bit transmitted on the wire. If set to zero it is an Individual (I) address and if set to one it is a Group (G) address. IPv4 multicast at layer two always sends to addresses starting with 01.00.5E, which means that the G bit is set. The next bit in the first byte is the U/L bit, which indicates whether it is a Universally (U) administered address or a Locally (L) assigned address. If it is a MAC address set by a manufacturer this should be set to zero.
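
A short sketch of how the two bits can be read from an address in its usual written form, where the I/G bit is the lowest bit of the first octet (the first bit sent on the wire) and the U/L bit is the one next to it. The function name and the sample addresses are made up for the example.

```python
def mac_flags(mac: str):
    """Return (is_group, is_locally_administered) for a colon separated MAC."""
    first_octet = int(mac.split(":")[0], 16)
    is_group = bool(first_octet & 0x01)  # I/G bit: lowest bit of the first octet
    is_local = bool(first_octet & 0x02)  # U/L bit: second lowest bit
    return is_group, is_local


print(mac_flags("01:00:5e:00:00:01"))  # IPv4 multicast address
print(mac_flags("00:1a:2b:3c:4d:5e"))  # ordinary burned-in unicast address
print(mac_flags("02:00:00:00:00:01"))  # locally administered unicast address
```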


SPAN and RSPAN are used to mirror traffic. The source of traffic can be a VLAN, a switchport or a routed port. Traffic can be mirrored from both rx and tx or just one of them. SPAN sends the traffic to a local destination port; RSPAN sends the traffic to an RSPAN VLAN which is used to transfer the traffic to its destination. Note that some layer two frames are not sent by default, including CDP, VTP, DTP, BPDU and PAgP. To include these use the encapsulation replicate option. SPAN is configured with the monitor session command.

Categories: CCIE, Ethernet, Notes

The facts of Ethernet – Round three

August 9, 2010 Leave a comment

The previous post talked about autonegotiation. This time I will talk about cables and pinouts and how auto MDIX works. Although I’m not very old I still like to do it the old school way. I don’t rely on auto MDIX, instead I use the right cable. Let’s look at a pinout for T568B:

A regular end device like a PC transmits on pins one and two and receives on pins three and six. Although we have four pairs only two are actually used, unless we are using gigabit Ethernet but that is another topic. A device like a switch does the opposite, it receives on pins one and two and sends on pins three and six. This is why we use a straight through cable. When connecting similar devices like a switch to a switch we need to use a cross over cable since they want to send on the same pins and receive on the same pins. So when choosing a cable remember that similar devices require a cross over cable and different devices need a straight through cable.

An engineer at HP developed the auto MDIX standard since he was tired of looking for cross over cables. But how does it work?

The NIC expects to receive Fast Link Pulses (FLP) on pins three and six. If it receives FLPs it will know that the configuration is correct. If it doesn’t receive FLPs it will switch over to MDI-X mode. This is a very simplified view of it, the process involves different timers and an XOR algorithm. If you want to know more check out the IEEE 802.3 specification section 3, clause 40.4.4.

Categories: Ethernet

The facts of Ethernet – Round two

August 7, 2010 1 comment

Autonegotiation – Either you love it or you hate it but pretty much everyone has an opinion on it. I was going to write something more lengthy at first but decided a blog was the wrong place.

Autonegotiation works by sending electrical pulses. In 10Base-T these are called Normal Link Pulses (NLP). They are sent every 16 ms with a tolerance of 8 ms. They are only sent when the Network Interface Card (NIC) is not receiving or sending traffic. They look like this:

In the fast Ethernet standard (802.3u) these are called Fast Link Pulses (FLP) and they look like this:

These electrical pulses let us determine the speed and duplex mode that is available in autonegotiation. The priority for choosing a speed and duplex mode goes like this:

  • 1000Base-T – Full duplex
  • 1000Base-T – Half duplex
  • 100Base-T2 – Full duplex
  • 100Base-TX – Full duplex
  • 100Base-T2 – Half duplex
  • 100Base-T4
  • 100Base-TX – Half duplex
  • 10Base-T – Full duplex
  • 10Base-T – Half duplex

If one side is set to auto and the other side hardcoded, parallel detection kicks in. Parallel detection can determine the speed by looking at the format of the electrical pulses it is receiving from its link partner. Duplex can’t be detected so that will default to half duplex. This is why we sometimes see links with 100/half duplex. If one side is auto and the other 100/full the auto side will be set to 100/half.
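
The two cases can be sketched as follows: when both sides advertise, the highest priority mode they have in common wins, and when the partner is hardcoded only the speed can be detected. The function names and the advertised sets are made up for the example, and 100Base-T4 is entered as half duplex since the list gives it no duplex of its own.

```python
# Priority order from highest to lowest, following the list above.
PRIORITY = [
    ("1000Base-T", "full"),
    ("1000Base-T", "half"),
    ("100Base-T2", "full"),
    ("100Base-TX", "full"),
    ("100Base-T2", "half"),
    ("100Base-T4", "half"),
    ("100Base-TX", "half"),
    ("10Base-T", "full"),
    ("10Base-T", "half"),
]


def autonegotiate(local, partner):
    """Both sides advertise: pick the highest priority mode they share."""
    for mode in PRIORITY:
        if mode in local and mode in partner:
            return mode
    return None


def parallel_detect(partner_speed):
    """Partner is hardcoded: speed is read from the signal, duplex falls back to half."""
    return (partner_speed, "half")


nic = {("100Base-TX", "full"), ("100Base-TX", "half"), ("10Base-T", "full")}
switch = {("1000Base-T", "full"), ("100Base-TX", "full"), ("100Base-TX", "half")}
print(autonegotiate(nic, switch))
print(parallel_detect("100Base-TX"))  # other side hardcoded to 100/full
```

The second result is the duplex mismatch described above: the hardcoded side runs 100/full while the auto side ends up at 100/half.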

Half duplex is of course very bad, it leads to frame errors, dropped packets and late collisions.

The facts of Ethernet round one

August 5, 2010 Leave a comment

Ethernet is the most used layer 2 protocol today and its dominance is not likely to end anytime soon. I decided to make a section with some quick facts about Ethernet. There is a lot to know about Ethernet but we usually neglect this because we are very focused on IP. Take a look at an Ethernet frame:

The preamble field is not known to many people. It won’t show up in a packet capture since the network card will already have stripped it before it’s available for capture. So what is the purpose of preamble? The preamble field contains a synchronization pattern that consists of alternating ones and zeros and ends with two consecutive ones. It is used to synchronize node communication but also to indicate where the frame starts. Because it is not processed in the same way as the rest of the frame we do not have to count the eight bytes of preamble when calculating Ethernet frame size. This is what preamble looks like:


Categories: Ethernet