Archive

Archive for February, 2015

A Quick Look at Cisco FabricPath

February 26, 2015 6 comments

Cisco FabricPath is a proprietary protocol that uses ISIS to populate a “routing table” that is used for layer 2 forwarding.

Whether we like it or not, there is often a need for layer 2 in the Datacenter, for the following reasons:

  • Some applications or protocols require layer 2 adjacency
  • It allows for virtual machine/workload mobility
  • Systems administrators are more familiar with switching than routing

A traditional network with layer 2 and Spanning Tree (STP) has a lot of limitations that make it less than optimal for a Datacenter:

  • Local problems have a network-wide impact
  • The tree topology provides limited bandwidth
  • The tree topology also introduces suboptimal paths
  • MAC address tables don’t scale

In the traditional network, because STP is running, a tree topology is built. This works well for North-South flows, meaning that traffic passes from the Access layer up to Distribution, to the Core, and then down through Distribution to the Access layer again. It puts a lot of strain on Core interconnects, however, and is not well suited for East-West traffic, which is the name for server-to-server traffic.

A traditional Datacenter design will look something like this:

DC1

If we want end-to-end L2, we could build a network like this:

DC2

What would be the implications of building such a network though?

  • Large failure domain
  • Unknown unicast and broadcast flooding through large parts of the network
  • A large number of STP instances needed unless using MST
  • Topology change will have a large impact on the network and may cause flooding
  • Large MAC address tables
  • Difficult to troubleshoot
  • A very brittle network

So let’s agree that we don’t want to build a network like this. What other options do we have if we still need layer 2? One of the options is Cisco FabricPath.

FabricPath provides the following benefits:

  • Reduction/elimination of STP
  • Better stability and convergence characteristics
  • Simplified configuration
  • Leverage parallel paths
  • Deterministic throughput and latency using typical designs
  • “VLAN anywhere” flexibility

The FabricPath control plane consists of the following elements:

  • Routing table – Uses ISIS to learn Switch IDs (SIDs) and build a routing table
  • Multidestination trees – Elects roots and builds multidestination trees
  • Mroute table – IGMP snooping learns group membership at the edge, Group Member LSPs (GM-LSPs) are flooded by ISIS into the fabric

Observe that LSPs have nothing to do with MPLS in this case, and that this is not MAC-based routing; routing is based on SIDs.

FabricPath ISIS learns the shortest path to each SID based on link metrics/path cost. Up to 16 equal-cost (ECMP) routes can be installed. Choosing a path is based on a hash of Src IP/Dst IP/L4/VLAN, which should be good for avoiding polarization.
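
As a rough illustration, enabling FabricPath on a Nexus 7000 involves installing the feature set, assigning a switch ID, and marking VLANs and core ports as FabricPath. The switch ID, VLAN and interface below are made up for the example, and exact syntax can vary by platform and release:

```
! Hypothetical sketch - switch ID, VLAN and interface are illustrative
install feature-set fabricpath
feature-set fabricpath
fabricpath switch-id 11
!
vlan 100
  mode fabricpath
!
interface ethernet1/1
  switchport mode fabricpath
```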

FabricPath supports multidestination trees with the following capabilities:

  • Multidestination traffic is contained to a tree topology, a network-wide identifier (Ftag) is assigned to each tree
  • A root switch is elected for each multidestination tree
  • Multipathing is supported through multiple trees

Note that root here has nothing to do with STP, think of it in terms of multicast routing.

DC3

Multidestination trees do not dictate forwarding for unicast, only for multidestination packets.

The FabricPath data plane behaves according to the following forwarding rules:

  • MAC table – Hardware performs MAC lookup at CE/FabricPath edge only
  • Switch table – Hardware performs destination SID lookups to forward unicast frames to other switches
  • Multidestination table – A hashing function selects the tree, multidestination table identifies on which interfaces to flood based on selected tree

The Ftag used in FabricPath identifies which ISIS topology to use for unicast packets and for multidestination packets, which tree to use.

If a FabricPath switch belongs to a topology, all VLANs of that topology should be configured on that switch to avoid blackholing issues.

FabricPath supports 802.1p but can also match/set DSCP and match on other L2/L3/L4 information.

With FabricPath, edge switches only need to learn:

  • Locally connected host MACs
  • MACs with which those hosts are bidirectionally communicating

This reduces the MAC address table capacity requirements on Edge switches.

FabricPath Designs

There are different designs that can be used together with FabricPath. The first one is routing at the Aggregation layer.

DC4

The first design is the most classic one, where STP has been replaced by FabricPath in the Access layer and routing is used above the Aggregation layer.

This design has the following characteristics:

  • Evolution of current design practices
  • The Aggregation layer functions as FabricPath spine and L2/L3 boundary
    – FabricPath switching for East – West intra VLAN traffic
    – SVIs for East – West inter VLAN traffic
    – Routed uplinks for North – South routed flows

  • Access layer provides pure L2 functions
    – FabricPath core ports facing Aggregation layer
    – CE edge ports facing hosts
    – Optionally vPC+ can be used for active/active host connections

This design is the simplest option and is an extension of regular Access/Aggregation designs. It provides the following benefits:

  • Simplified configuration
  • Removal of STP
  • Traffic distribution over all uplinks without the use of vPC
  • Active/active gateways
  • “VLAN anywhere” at the Access layer
  • Topological flexibility
    – Direct-path forwarding option
    – Easily provision additional Access/Aggregation bandwidth
    – Easily deploy L4-L7 services
    – Can use vPC+ towards legacy Access switches

There is also the centralized routing design which looks like the following:

DC5

Centralized routing has the following characteristics:

  • Traditional Aggregation layer becomes pure FabricPath spine
    – Provides uniform any-to-any connectivity between leaf switches
    – In simplest case, only FabricPath switching occurs in spine
    – Optionally, some CE edge ports exist to provide external router connections

  • FabricPath leaf switches, connecting to spine, have specific “personality”
    – Most of the leaf switches will provide server connectivity, like traditional access switches in “Routing at Aggregation” designs
    – Two or more leaf switches provide L2/L3 boundary, inter-VLAN routing and North-South routing
    – Other or same leaf switches may provide L4-L7 services

  • Decouples L2/L3 boundary and L4-L7 services provisioning from Spine
    – Simplifies Spine design

The different traffic flows in this design look like the following:

DC6

Another design is the multi-pod design which can look like the following:

DC7

The multi-pod design has the following characteristics:

  • Allows for more elegant DC-wide versus pod-local VLAN definition/isolation
    – No need for pod-local VLANs to exist in core
    – Can support VLAN id reuse in multiple pods

  • Define FabricPath VLANs -> map VLANs to topology -> map topology to FabricPath core port(s)
  • Default topology always includes all FabricPath core ports
    – Map DC-wide VLANs to default topology

  • Pod-local core ports also mapped to pod-local topology
    – Map pod-local VLANs to pod-local topology
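
The VLAN-to-topology mapping steps above could look roughly like this on NX-OS; the topology ID, VLAN range and interface are invented for the example, and the exact commands should be verified against the platform documentation:

```
! Hypothetical sketch - IDs, VLANs and interfaces are illustrative
fabricpath topology 2
  member vlan 100-199
!
interface ethernet2/1
  switchport mode fabricpath
  fabricpath topology-member 2
```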

This post briefly describes Cisco FabricPath, a technology for building scalable L2 topologies, allowing for more bisection bandwidth to support the East-West flows that are common in Datacenters. To dive deeper into FabricPath, visit the Cisco Live website.

Categories: CCDE

CLUS Keynote Speaker – It’s a Dirty Job but Somebody’s Gotta Do It

February 14, 2015 Leave a comment

Did you guess by the title who will be the celebrity keynote speaker for CLUS San Diego? It’s none other than Mike Rowe, also known as the dirtiest man on TV.

Mike is the man behind “Dirty Jobs” on the Discovery Channel. Little did he know when pitching the idea to Discovery that they would order 39 episodes of it. Mike traveled through all 50 states and completed 300 different jobs, going through swamps, sewers, oil derricks, lumberjack camps and whatnot.

Mike is also a narrator and can be heard in “American Chopper”, “American Hot Rod”, “Deadliest Catch”, “How the Universe Works” and other TV shows.

He is also a public speaker and often hired by Fortune 500 companies to tell their employees frightening stories of maggot farmers and sheep castrators.

Mike also believes in skilled trades and in working smart AND hard. He has written extensively on the country’s relationship with work and the skill gap.

I’m sure Mike’s speech will be very interesting…and maybe a bit gross…

The following two links take you to Cisco Live main page and the registration packages:

Cisco Live
Cisco Live registration packages

HSRP AWARE PIM

February 13, 2015 Leave a comment

In environments that require redundancy towards clients, HSRP will normally be running. HSRP is a proven protocol and it works, but how do we handle clients that need multicast? What triggers multicast to converge when the Active Router (AR) goes down? The following topology is used:

PIM1

One thing to notice here is that R3 is the PIM DR even though R2 is the HSRP AR. The network has been set up with OSPF and PIM, and R1 is the RP. Both R2 and R3 will receive IGMP reports but only R3 will send PIM Joins, due to it being the PIM DR. R3 builds the (*,G) towards the RP:

R3#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 02:54:15/00:02:20, RP 1.1.1.1, flags: SJC
  Incoming interface: Ethernet0/0, RPF nbr 13.13.13.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:25:59/00:02:20
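
For reference, a sketch of the baseline multicast and HSRP configuration assumed on R2 and R3 (the original post does not show it; the HSRP group name HSRP1 matches the redundancy command used later, and the addresses follow the topology):

```
! Assumed baseline on R2/R3 - a sketch, not the author's actual config
ip multicast-routing
ip pim rp-address 1.1.1.1
!
interface Ethernet0/2
 ip pim sparse-mode
 standby 1 ip 10.0.0.1
 standby 1 name HSRP1
```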

We then ping 239.0.0.1 from the multicast source to build the (S,G):

S1#ping 239.0.0.1 re 3
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 239.0.0.1, timeout is 2 seconds:

Reply to request 0 from 10.0.0.10, 35 ms
Reply to request 1 from 10.0.0.10, 1 ms
Reply to request 2 from 10.0.0.10, 2 ms

The (S,G) has been built:

R3#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 02:57:14/stopped, RP 1.1.1.1, flags: SJC
  Incoming interface: Ethernet0/0, RPF nbr 13.13.13.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:28:58/00:02:50

(41.41.41.10, 239.0.0.1), 00:02:03/00:00:56, flags: JT
  Incoming interface: Ethernet0/0, RPF nbr 13.13.13.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:02:03/00:02:50

The unicast and multicast topologies are not currently congruent; this may or may not be important. What happens when R3 fails?

R3(config)#int e0/2
R3(config-if)#sh
R3(config-if)#

No replies to the pings come in until PIM on R2 detects that R3 is gone and takes over the DR role; this will take between 60 and 90 seconds with the default timers in use.

S1#ping 239.0.0.1 re 100 ti 1
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 239.0.0.1, timeout is 1 seconds:

Reply to request 0 from 10.0.0.10, 18 ms
Reply to request 1 from 10.0.0.10, 2 ms....................................................................
.......
Reply to request 77 from 10.0.0.10, 10 ms
Reply to request 78 from 10.0.0.10, 1 ms
Reply to request 79 from 10.0.0.10, 1 ms
Reply to request 80 from 10.0.0.10, 1 ms
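
One blunt way to shorten that window is to tune the PIM hello (query) interval down, at the cost of extra control-plane load; a sketch, with an aggressive one-second value chosen purely for illustration:

```
interface Ethernet0/2
 ip pim query-interval 1
```

The PIM neighbor holdtime defaults to 3.5 times the hello interval, so with one-second hellos a dead neighbor is detected in a few seconds rather than a minute or more.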

We can increase the DR priority on R2 to make it become the DR.

R2(config-if)#ip pim dr-priority 50  
*Feb 13 12:42:45.900: %PIM-5-DRCHG: DR change from neighbor 10.0.0.3 to 10.0.0.2 on interface Ethernet0/2

HSRP aware PIM is a feature that first appeared in IOS 15.3(1)T and makes the HSRP AR become the PIM DR. It will also send PIM messages from the virtual IP, which is useful in situations where you have a router with a static route towards a virtual IP (VIP). This is how Cisco describes the feature:

HSRP Aware PIM enables multicast traffic to be forwarded through the HSRP active router (AR), allowing PIM to leverage HSRP redundancy, avoid potential duplicate traffic, and enable failover, depending on the HSRP states in the device. The PIM designated router (DR) runs on the same gateway as the HSRP AR and maintains mroute states.

In my topology, I am running HSRP towards the clients, so even though this feature sounds like a perfect fit, it will not help me in converging my multicast. Let’s configure this feature on R2:

R2(config-if)#ip pim redundancy HSRP1 hsrp dr-priority 100
R2(config-if)#
*Feb 13 12:48:20.024: %PIM-5-DRCHG: DR change from neighbor 10.0.0.3 to 10.0.0.2 on interface Ethernet0/2

R2 is now the PIM DR, R3 will now see two PIM neighbors on interface E0/2:

R3#sh ip pim nei e0/2
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.0.0.1          Ethernet0/2              00:00:51/00:01:23 v2    0 / S P G
10.0.0.2          Ethernet0/2              00:07:24/00:01:23 v2    100/ DR S P G

R2 now has the (S,G) and we can see that it was the Assert winner, because R3 was previously sending multicast to the LAN segment.

R2#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 00:20:31/stopped, RP 1.1.1.1, flags: SJC
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:16:21/00:02:35

(41.41.41.10, 239.0.0.1), 00:00:19/00:02:40, flags: JT
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:00:19/00:02:40, A

What happens when R2’s LAN interface goes down? Will R3 become the DR? And how fast will it converge?

R2(config)#int e0/2
R2(config-if)#sh

HSRP changes to Active on R3, but the PIM DR role does not converge until the PIM query interval has expired (3x the hello interval).

*Feb 13 12:51:44.204: HSRP: Et0/2 Grp 1 Redundancy "hsrp-Et0/2-1" state Standby -> Active
R3#sh ip pim nei e0/2
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.0.0.1          Ethernet0/2              00:04:05/00:00:36 v2    0 / S P G
10.0.0.2          Ethernet0/2              00:10:39/00:00:36 v2    100/ DR S P G
R3#
*Feb 13 12:53:02.013: %PIM-5-NBRCHG: neighbor 10.0.0.2 DOWN on interface Ethernet0/2 DR
*Feb 13 12:53:02.013: %PIM-5-DRCHG: DR change from neighbor 10.0.0.2 to 10.0.0.3 on interface Ethernet0/2
*Feb 13 12:53:02.013: %PIM-5-NBRCHG: neighbor 10.0.0.1 DOWN on interface Ethernet0/2 non DR

We lose a lot of packets while waiting for PIM to converge:

S1#ping 239.0.0.1 re 100 time 1
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 239.0.0.1, timeout is 1 seconds:

Reply to request 0 from 10.0.0.10, 5 ms
Reply to request 0 from 10.0.0.10, 14 ms...................................................................
Reply to request 68 from 10.0.0.10, 10 ms
Reply to request 69 from 10.0.0.10, 2 ms

Reply to request 70 from 10.0.0.10, 1 ms

HSRP aware PIM didn’t really help us here… So when is it useful? Let’s use the following topology instead:

PIM2

The router R5 has been added and the receiver now sits behind R5 instead. R5 does not run a routing protocol with R2 and R3, only static routes pointing at the RP and the multicast source:

R5(config)#ip route 1.1.1.1 255.255.255.255 10.0.0.1
R5(config)#ip route 41.41.41.0 255.255.255.0 10.0.0.1

Without HSRP aware PIM, the RPF check would fail because PIM would peer with the physical addresses. With the feature enabled, R5 sees three neighbors on the segment, where one is the VIP:

R5#sh ip pim nei
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.0.0.2          Ethernet0/0              00:03:00/00:01:41 v2    100/ DR S P G
10.0.0.1          Ethernet0/0              00:03:00/00:01:41 v2    0 / S P G
10.0.0.3          Ethernet0/0              00:03:00/00:01:41 v2    1 / S P G

R2 will be the one forwarding multicast during normal conditions, due to it being the PIM DR via its HSRP Active router state:

R2#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 00:02:12/00:02:39, RP 1.1.1.1, flags: S
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:02:12/00:02:39

Let’s try a ping from the source:

S1#ping 239.0.0.1 re 3
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 239.0.0.1, timeout is 2 seconds:

Reply to request 0 from 20.0.0.10, 1 ms
Reply to request 1 from 20.0.0.10, 2 ms
Reply to request 2 from 20.0.0.10, 2 ms

The ping works and R2 has the (S,G):

R2#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 00:04:18/00:03:29, RP 1.1.1.1, flags: S
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:04:18/00:03:29

(41.41.41.10, 239.0.0.1), 00:01:35/00:01:24, flags: T
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:01:35/00:03:29

What happens when R2 fails?

R2#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R2(config)#int e0/2
R2(config-if)#sh
R2(config-if)#
S1#ping 239.0.0.1 re 200 ti 1
Type escape sequence to abort.
Sending 200, 100-byte ICMP Echos to 239.0.0.1, timeout is 1 seconds:

Reply to request 0 from 20.0.0.10, 9 ms
Reply to request 1 from 20.0.0.10, 2 ms
Reply to request 1 from 20.0.0.10, 11 ms....................................................................
......................................................................
............................................................

The pings time out because when the PIM Join from R5 comes in, R3 does not realize that it should process the Join.

*Feb 13 13:20:13.236: PIM(0): Received v2 Join/Prune on Ethernet0/2 from 10.0.0.5, not to us
*Feb 13 13:20:32.183: PIM(0): Generation ID changed from neighbor 10.0.0.2

As it turns out, the PIM redundancy command must be configured on the secondary router as well for it to process PIM Joins to the VIP.

R3(config-if)#ip pim redundancy HSRP1 hsrp dr-priority 10

After this has been configured, the incoming Join will be processed. R3 triggers R5 to send a new Join because the GenID in the PIM hello is set to a new value.

*Feb 13 13:59:19.333: PIM(0): Matched redundancy group VIP 10.0.0.1 on Ethernet0/2 Active, processing the Join/Prune, to us
*Feb 13 13:40:34.043: PIM(0): Generation ID changed from neighbor 10.0.0.1

After configuring this, the multicast converges as fast as HSRP fails over; I’m using BFD in this scenario.
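
The fast HSRP failover comes from pairing HSRP with BFD. A sketch of what that could look like on R2 and R3 (timer values are illustrative):

```
! Illustrative sketch - BFD timers are example values
interface Ethernet0/2
 bfd interval 50 min_rx 50 multiplier 3
 standby 1 ip 10.0.0.1
 standby 1 preempt
 standby bfd
```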

The key concepts for understanding HSRP aware PIM are:

  • Initially configuring PIM redundancy on the AR will make it the DR
  • PIM redundancy must be configured on the secondary router as well; otherwise it will not process PIM Joins to the VIP
  • The PIM DR role does not converge until the PIM hellos have timed out, but the secondary router will process the Joins, so the multicast will converge

This feature is not very well documented, so I hope this post has taught you a bit about how it really works. Note that the feature does not help when the receivers are on the HSRP LAN itself, because the DR role is NOT moved until the PIM adjacency expires.

Categories: Multicast

Network Design Webinar With Yours Truly at CLN

February 12, 2015 3 comments

I’m hosting a network design webinar at the Cisco Learning Network on Feb 19th, 20.00 UTC+1.

As you may know, I am studying for the CCDE, so I’m focusing on design right now, but my other reason for hosting this is to remind people that with all the buzzwords like SDN and NFV going around, the networking fundamentals still hold true. TCP/IP is as important as ever, and building a properly designed network is a must if you want to run overlays on it. If you build a house and do a sloppy job with the foundation, what will happen? The same holds true in networking.

I will introduce the concepts of network design. What does a network designer do? What tools are used? What is CAPEX? What is OPEX? What certifications are available? What is important in network design? We will also look at a couple of design scenarios and reason about the impact of our choices. There is always a tradeoff!

If you are interested in network design or just want to tune in to yours truly, follow this link to CLN.

I hope to see you there!

Cisco Live in San Diego – Will You Make It?

February 8, 2015 3 comments

“Make it” was one of the first singles released by the band Aerosmith. Since then these guys have been rocking away for about 40 years. What does this have to do with Cisco Live? Aerosmith will be the band playing at the Customer Appreciation Event (CAE). A good time is pretty much guaranteed; Aerosmith knows how to entertain a crowd.

Aero - new version LOGO copy

The CAE will take place at Petco Park, the home of the San Diego Padres. This photo shows the arena in the evening, looks quite spectacular to me.

Petco-Park-Photo1000x1000 (3)

Cisco Live is much more than just having fun, though. If you want to make it in the IT industry, there is a lot to gain by going to Cisco Live. Here are some of my reasons for wanting to go:

  • Stay on top of new technologies – Where is ACI going?
  • Dip my toes into other technologies that I find interesting
  • Gain deep level knowledge of platforms or features that will benefit me and my customers
  • Go to sessions that will aid me on my certification path
  • Connect with people!
  • Learn a lot while having fun at the same time!
  • Learn from the experience of others

When you are in the IT industry, there is a lot going on – always! It can be easier to focus on following industry trends when you don’t have to check your phone or e-mail constantly. The keynotes are also great for hearing what is coming and what the vision for the technology is.

At Cisco Live you will find deep dives into the architectures of platforms and how to troubleshoot them. As an example, I have a few Catalyst 4500-X switches showing high CPU; how do you troubleshoot that? General troubleshooting is easy, but how do you go beyond that? Cisco Live is perfect for that. If you’re lucky, you will even get to ask a few questions during or at the end of a session relating to your specific case. And the person answering will be a real expert, and you might even get to stay in contact with that person after CLUS.

I’m moving towards the CCDE. When you go to CLUS, you normally get to take a free exam. If I haven’t taken the CCDE written by then, I might do it there. More importantly, I will try to go to design-related sessions and attend a techtorial or labtorial related to the CCDE, if I can.

One of the best things about going to CLUS is that you will meet a lot of people. Just hanging out and talking with these people is a great experience. I have gained a lot of friends and contacts, which has proven to be very valuable when I need to bounce ideas or get input on a project.

Going to Cisco Live is fun! It’s learning and relaxing at the same time! You have to go there to experience it.

If you are interested in going to Cisco Live, I am including some links. The first one is to the main page and the second one is for the registration packages.

Cisco Live
Cisco Live registration packages

I hope I’ll see you there!

STP Notes for CCDE

February 8, 2015 Leave a comment

These are my study notes for CCDE based on “CCIE Routing and Switching v5.0 Official Cert Guide, Volume 1, Fifth Edition” and “Designing Cisco Network Service Architectures (ARCH) Foundation Learning Guide: (CCDP ARCH 642-874), Third Edition“, “INE – Understanding MSTP” and “Spanning Tree Design Guidelines for Cisco NX-OS Software and Virtual PortChannels“. This post is not meant to cover STP and all its aspects, it’s a summary of key concepts and design aspects of running STP.

STP

STP was originally defined in IEEE 802.1D and improvements were defined in amendments to the standard. RSTP was defined in amendment 802.1w and MSTP was defined in 802.1s. The latest 802.1D-2004 standard does not include “legacy STP”, it covers RSTP. MSTP was integrated into 802.1Q-2005 and later revisions.

STP has two types of BPDUs: Configuration BPDUs and Topology Change Notification BPDUs. To handle topology change, there are two flags in the Configuration BPDU: Topology Change Acknowledgment flag and Topology Change flag.

MessageAge is an estimate of the age of the BPDU since it was generated by the root; the root sends it with an age of 0 and other switches increment the value by 1. The lifetime of a BPDU is MaxAge – MessageAge. MaxAge, HelloTime and ForwardDelay are values set by the root; locally configured values will only be used if that switch becomes the root.

STP works by comparing which Configuration BPDU is superior according to the following ordered list where lower values are better:

  1. Root Bridge ID (RBID)
  2. Root Path Cost (RPC)
  3. Sender Bridge ID (SBID)
  4. Sender Port ID (SPID)
  5. Receiver Port ID (RPID; not included in the BPDU, evaluated locally)

Each port stores the superior BPDU that has been sent or received, depending on the port role. Root and blocking ports store the received BPDU, designated ports store the sent BPDU.

To determine port roles and which ports forward and block, the following three-step process is used:

  1. Elect the root switch
  2. Determine each switch’s Root port
  3. Determine the Designated port for each segment

The root bridge is elected based on the lowest Bridge ID, which consists of a 4-bit Priority, a 12-bit System ID Extension and a 6-byte System ID (MAC address). Before 802.1t, a lot of MAC addresses were consumed to make the BID unique when using PVST+ or MST.
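
With the extended system ID, the configurable priority occupies only the upper 4 bits, so it must be a multiple of 4096, and the VLAN number fills the 12-bit extension. For example, configuring priority 4096 in VLAN 10 yields an advertised priority field of 4096 + 10 = 4106:

```
! Priority must be a multiple of 4096 when extended system ID is in use
spanning-tree vlan 10 priority 4096
```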

BPDUs are only forwarded on designated ports, root ports and blocking ports do not send them since they would be inferior on the segment. A designated port is a port with a superior BPDU on a segment.

Topology Change

A topology change event occurs when:

  • A TCN BPDU is received by a Designated Port of a switch
  • A port moves to the Forwarding state and the switch has at least one Designated Port
  • A port moves from Learning or Forwarding to Blocking
  • A switch becomes the root switch

STP is slow to converge, especially with indirect failures where a link fails between the root switch and an intermediary switch. When inferior BPDUs are received, MaxAge has to expire before a switch will act on them.

When the topology has changed, the CAM table needs to be updated on all switches; a timer equal to ForwardDelay is used to time out unused entries.

A topology change starts at a switch, which sends a TCN BPDU out its root port. The designated switch on that segment sets the TCA bit in its Configuration BPDU to acknowledge the TCN. The TCN then travels upstream until it reaches the root. The root will then send Configuration BPDUs with the TC bit set for MaxAge + ForwardDelay seconds, and all switches will shorten the aging time of the CAM table to ForwardDelay seconds.

PVST+

PVST+ runs one spanning tree instance per VLAN. This does not scale well for a large number of VLANs and normally there will only be a few logical topologies anyway.

Switches that do not support PVST+ run Common Spanning Tree (CST) which has one instance of STP for all VLANs. Cisco switches can interact with CST through VLAN 1 by sending untagged BPDUs. All other VLANs in the PVST+ region will tag their BPDUs and tunnel the BPDUs through the CST region by using a special destination MAC address. The CST region is treated as a loop-free shared segment from the viewpoint of the PVST+ region. The destination MAC address is a multicast address that will get flooded by the CST switches.

RPVST+

RSTP has four different port roles:

  • Root Port
  • Designated Port
  • Alternate Port
  • Backup port

The first two are the same as in legacy STP and the last two are new. An alternate port is a port that is a potential backup for the Root Port. A backup port is a replacement for a Designated Port, you would rarely, if ever, see a Backup Port because it is only used on shared segments.

RSTP uses a synchronization process to achieve fast convergence. This only works on links that are point-to-point, which is detected from the duplex mode of an interface. The link type can be hard coded for the rare case where a port is half duplex but still not on a shared segment.

RSTP uses more bits in the Configuration BPDU to encode additional information. These are the Proposal bit, Port Role bits, Learning bit, Forwarding bit and Agreement bit.

RSTP switches send their own BPDUs as opposed to only relaying the root's BPDU as in legacy STP. If no BPDU is heard for 3x the hello interval, the stored BPDU is expired; RSTP does not rely on the MaxAge timer for this. RSTP can also act on inferior BPDUs directly instead of waiting for MaxAge to expire, which speeds up indirect link failure scenarios.

RSTP uses a proposal/agreement process where switches negotiate which port will become Designated. If the Proposal bit is set, the switch is proposing that its port should become Designated, and the other switch replies with an Agreement to immediately allow this. When ports first come up they are in a Designated Discarding state. To avoid creating a temporary loop during the synchronization process, all Non-Edge Designated ports are put into a Discarding state. I have described this process in detail in an earlier post.

With RSTP, only ports moving to a Forwarding state will cause a topology change. RSTP sets the TC bit in the BPDU to notify of a topology change and sends it out its Root Port and Designated Ports that are Non-Edge. MAC addresses are immediately flushed on these ports.

MST

MST uses the same underlying mechanisms as RSTP with regards to BPDU parameters, but it decouples VLANs from spanning tree instances: multiple VLANs can be mapped to a single instance. MST is more efficient because the operator can define the number of instances needed and map the VLANs to those instances. MST is the only standards-based protocol that supports mapping VLANs to instances, making it suitable in a multi-vendor environment.

MST switches organize the network into regions, switches within a region use MST in a consistent way. For switches to be in the same region, the name, revision and instance to VLAN mapping must match.
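As a rough illustration of the matching rule, the membership check could be sketched as follows. The switch names and mappings are invented, and this is simplified: real switches exchange an MD5 digest of the VLAN-to-instance table in their BPDUs rather than the table itself:

```python
# Sketch of the MST region membership rule: configuration name,
# revision and VLAN-to-instance mapping must all match for two
# switches to be in the same region.

def same_region(a: dict, b: dict) -> bool:
    return (a["name"] == b["name"]
            and a["revision"] == b["revision"]
            and a["vlan_map"] == b["vlan_map"])

agg1 = {"name": "DC1", "revision": 1, "vlan_map": {10: 1, 20: 2}}
agg2 = {"name": "DC1", "revision": 1, "vlan_map": {10: 1, 20: 2}}
acc1 = {"name": "DC1", "revision": 1, "vlan_map": {10: 1, 20: 1}}

assert same_region(agg1, agg2)      # same region
assert not same_region(agg1, acc1)  # mapping differs: region boundary
```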

The System ID in MST uses the Instance ID instead of the VLAN ID to create the BID, used in BPDUs. MST sends a single BPDU containing information about all instances. In MST, a port sends BPDUs if it is Designated for at least one MST instance.

MST instance 0 is special and contains all VLANs by default; it is called the Internal Spanning Tree (IST). The IST interacts with STP switches that are outside the region. The port role and state determined by the interaction of the IST with a neighboring switch will be inherited by all VLANs on that port, not just the VLANs mapped to the IST. This behavior makes the region appear as a single switch to the outside. If running multiple regions, each region can be seen as a single switch from the outside. The resulting network can still contain loops if there are multiple inter-region links. MST blocks these loops by building a Common Spanning Tree (CST) running between the regions. The CST is also used to interact with non-MST switches. The tree built by the CST will be used for all VLANs. The IST and CST are then merged together and called the Common and Internal Spanning Tree (CIST).

The CIST Root switch is elected based on the lowest BID among all switches in any region. This switch will also become the root for the IST (instance 0) within its own region, where it is called the CIST Regional Root.

In regions that do not contain the CIST Root, only boundary switches are allowed to become the IST Root. A boundary switch is a switch that has a link (or several) to other MST regions. The IST Root is elected based on external root path cost, which is the cost of using the inter region links between MST regions. If there is a tie in cost, the lowest BID is used as a tiebreaker to elect the CIST Regional Root. Cost inside a region is not taken into account.
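The tiebreak described above can be expressed as an ordered comparison of (external root path cost, BID). The boundary switches, costs and BIDs below are hypothetical:

```python
# Sketch of CIST Regional Root election in a region that does not
# contain the CIST Root: lowest external root path cost wins, and
# the lowest BID breaks ties. Internal (intra-region) cost is ignored.
candidates = [
    {"switch": "border1", "ext_cost": 2000, "bid": 32778},
    {"switch": "border2", "ext_cost": 2000, "bid": 28682},  # lower BID
    {"switch": "border3", "ext_cost": 4000, "bid": 4106},   # best BID, but higher cost
]
regional_root = min(candidates, key=lambda c: (c["ext_cost"], c["bid"]))
assert regional_root["switch"] == "border2"
```

Note that border3 loses despite the lowest BID: external cost is compared first, and the BID only matters between border1 and border2.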

The CIST Regional Root switch will have its Root Port towards the CIST Root, this is called the master port and this port is used by all MST instances to reach the CIST Root.

The following pictures show the different concepts of MST, starting with a physical topology:

MST1

The IST runs within the region to block ports, to break up the physical loop. One switch will be the CIST root and one switch will be the CIST Regional root.

MST2

In reality, all these things tie in together and happen simultaneously but to solidify the understanding, we divide them into steps. The IST has run internally and blocked ports. This is what the CST looks like:

MST3

The CST runs between regions and/or non MST devices and makes sure there is no loop between regions or to non MST domains. If we combine the CST and the IST, we get the CIST which is the final topology:

MST4

Interoperability Between MST and Other STP Versions

When communicating with an IEEE STP or RSTP switch, the MST switch must share the role and state of the port towards the non-MST switch for all VLANs. STP or RSTP can't see into the MST region, so it is treated as a single logical switch. The MST switch speaks by using the IST (instance 0) on boundary ports and formats the BPDUs as STP or RSTP. The IST also processes inbound BPDUs from the non-MST switch.

When communicating with a PVST+ or RPVST+ region, things get a bit more complex. One STP instance runs per VLAN, and port role and state are calculated individually per VLAN. The IST communicates with the non-MST switch and must make sure that each PVST+/RPVST+ instance receives the same information so that all instances make a consistent choice. MST and PVST+ must arrive at the same port role and state for all instances even though only a single MST instance and a single PVST+ instance directly interact with each other. This is also known as the PVST Simulation mechanism.

The IST will replicate BPDUs for all active VLANs towards the PVST+ switch, meaning that the PVST+ switch will make a consistent choice for port role and state for all VLANs. The IST does this by formatting the BPDUs as PVST+ BPDUs.

In the opposite direction, the IST takes the BPDU from VLAN 1 as a representative for the entire PVST+ region and processes it in the IST. The boundary port's role and state will be binding for all active VLANs on that port. The MST switch must make certain that the result of the IST's interaction with the VLAN 1 STP instance is consistent with the state of the STP instances run in the other VLANs.

An MST boundary port will become a Designated Port if the BPDUs it sends out are superior to incoming VLAN 1 PVST+ BPDUs. The port will then be forwarding for all VLANs. To make sure that other PVST+ instances make a consistent decision, the MST switch must check that all incoming PVST+ BPDUs are inferior to its own outgoing BPDUs. If not, the PVST Simulation mechanism will fail.
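The consistency check on a Designated boundary port can be sketched like this, with BIDs reduced to plain integers where lower means superior. These are hypothetical values, greatly simplified from real BPDU comparison:

```python
# Simplified sketch of the PVST Simulation check on a Designated
# boundary port: every incoming PVST+ BPDU, for every VLAN, must be
# inferior (here: a numerically higher BID) to the BPDU the MST
# switch sends, or the simulation fails and the port is blocked.

def pvst_simulation_ok(own_bid: int, incoming_bids_per_vlan: dict) -> bool:
    return all(bid > own_bid for bid in incoming_bids_per_vlan.values())

own = 4096  # the MST switch's own (superior) BID

assert pvst_simulation_ok(own, {10: 32768, 20: 28672})  # consistent
assert not pvst_simulation_ok(own, {10: 32768, 20: 0})  # VLAN 20 superior: fail
```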

The CIST Root can be located in the PVST+ region, and the boundary port can take the Root Port role if the incoming VLAN 1 PVST+ BPDUs are not only superior to the MST switch's own but also better than any VLAN 1 PVST+ BPDUs received on other boundary ports. Once again, to check the consistency of the port role, all root bridges must be located in the PVST+ region and be reached through the same boundary port. The PVST Simulation mechanism will check that incoming PVST+ BPDUs for VLANs other than VLAN 1 are identical or superior to the VLAN 1 PVST+ BPDUs.

An MST boundary port will become Non-Designated if it receives VLAN 1 PVST+ BPDUs that are superior to its own but not superior enough to make it a Root Port.

It is recommended to have the MST region appear as a Root switch to all PVST+ instances by lowering the IST root’s priority below the priorities of all PVST+ switches in all VLANs.

When an MST switch is communicating to a PVST+ or RPVST+ switch it will always revert back to PVST+. There is less state involved with PVST+ due to not having a Proposal/Agreement process which simplifies the interworking of MST and PVST+.

Portfast Ports

  • Transitions directly to Forwarding state, saving 2x ForwardDelay
  • Does not generate topology change events
  • Does not flush CAM due to topology change
  • DOES send BPDUs
  • Does not expect to receive BPDUs
  • Not influenced by the Sync step in the Proposal/Agreement procedure (RSTP)

Portfast-enabled ports may also be referred to as Edge ports. If a Portfast-enabled port receives BPDUs, it will lose its Portfast status until the port has gone down and up again. RSTP uses the Proposal/Agreement process, and when going through Sync it will put all Non-Edge Designated ports into a Discarding state. Unless end-user ports are configured as Edge ports, they will be affected and briefly lose connectivity during the Sync process. Portfast is also important so that when a PC boots up and requests an IP address via DHCP, it gets one assigned before the process times out while waiting for the port to move into a Forwarding state. Portfast can be enabled per port or globally for all access ports.

  • BPDU Guard: Enabled per port or globally for all Portfast-enabled ports; will error-disable the port upon receiving ANY BPDU
  • Root Guard: Only enabled per port; ignores any superior BPDUs received to prevent the port from becoming a Root Port. If a superior BPDU is received, the port is put into a root-inconsistent blocking state, ceasing to forward and receive data frames until the superior BPDUs cease

After BPDU Guard has error-disabled a port, it must be recovered manually or by using the error-disable recovery feature.

Root Guard will block the port if a superior BPDU comes in; this does not have to be the best BPDU, simply better than what the local switch is originating. Root Guard will recover the port after the superior BPDU has expired, which is MaxAge – MessageAge for STP or 3x Hello for RSTP.

BPDU Filter

  • If enabled on a port it will unconditionally stop sending and receiving BPDUs
  • If enabled globally for Edge ports, it will send 11 BPDUs after enabling the feature and then stop sending BPDUs. If a BPDU is received at any point in time, BPDU Filter is operationally disabled on the port and will revert to normal STP rules, sending and receiving BPDUs.

Protecting Against Unidirectional Link Issues

Several mechanisms are available to protect against unidirectional links, such as Loop Guard, UDLD, the RSTP Dispute mechanism and Bridge Assurance.

UDLD

UDLD is a Cisco-proprietary layer 2 protocol that serves as an echo mechanism between a pair of devices. It sends UDLD messages advertising its identity and port identifier pair, as well as a list of all neighboring switch/port pairs heard on the same segment. The following explicit conditions are used by UDLD to detect a unidirectional link:

  • UDLD messages arriving from a neighbor that do not contain the exact switch/port pair matching the receiving switch and its port in the list of detected neighbors. This would suggest that either the neighbor does not hear this switch at all (fiber cut) or that neighbor’s port sending these UDLD messages is different from the neighbor’s port receiving the UDLD messages. This could be the case if the TX fiber is plugged into a different port than the RX fiber.
  • If the incoming UDLD messages contain the same switch/port originator pair as the receiving switch, which would indicate that the port is self-looped.
  • A switch has detected only a single neighbor but the neighbor's UDLD messages contain several switch/port pairs in the list of neighbors; this would indicate shared media and a lack of visibility between all connected devices.

The above are explicit conditions that will error-disable a port for being unidirectional. UDLD runs in either normal or aggressive mode. If incoming UDLD messages are lost, UDLD tries to reconnect with its neighbor(s) up to 8 times. Normal mode does not react to this implicit condition; aggressive mode will error-disable the port if the reconnect attempts fail. UDLD can be enabled globally or per port; enabling it globally will only enable UDLD on fiber ports.

Loop Guard prevents Root and Alternate ports from becoming Designated in the case of a loss of incoming BPDUs. When the stored BPDU on a port expires, Loop Guard will put the port into a loop-inconsistent state. Loop Guard can be configured globally or per port.

Bridge Assurance is another mechanism that is available on select platforms and works with RPVST+ and MST on point-to-point links. A port will send BPDUs regardless of state if Bridge Assurance is enabled. If BPDUs are not received, the port will be put into a BA-inconsistent state. This protects against unidirectional links as well as malfunctioning switches that stop participating in RPVST+/MST.

Finally, the Dispute mechanism available in RPVST+/MST works by checking the incoming BPDU flags. If an inferior BPDU is received but its flags indicate Designated Learning or Forwarding, the local port will move into a Discarding state.

Port Channel

Interfaces can be bundled into a Port Channel, which increases the available bandwidth by carrying frames over multiple links. A hashing mechanism run over selected address fields of each frame determines which physical link the frame is sent over. The hashing is deterministic, meaning that frames of the same flow will travel over the same physical link.

Load sharing can be based on MAC addresses, IP addresses or, on some platforms, even port numbers. A choice needs to be made, depending on the type of flows, as to which load sharing mechanism will be most beneficial. Normally only one type of load sharing can be used for all flows on a switch. Load sharing will normally be more balanced when the number of links is a power of 2, although this varies by platform and the number of hash buckets.
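As a toy illustration of deterministic load sharing (real platforms use their own hash and a fixed number of buckets, so this is only the principle), the same address pair always maps to the same member link:

```python
# Toy flow-to-link hash for a port channel: XOR the source and
# destination MAC addresses and map the result onto a member link.
# Deterministic: the same addresses always pick the same link.

def select_link(src_mac: str, dst_mac: str, n_links: int) -> int:
    s = int(src_mac.replace(":", ""), 16)
    d = int(dst_mac.replace(":", ""), 16)
    return (s ^ d) % n_links

# Frames of the same flow always select the same physical link:
first = select_link("00:1a:2b:00:00:01", "00:1a:2b:00:00:09", 4)
again = select_link("00:1a:2b:00:00:01", "00:1a:2b:00:00:09", 4)
assert first == again
```

With a power-of-2 link count the hash buckets divide evenly across the members, which is why such bundles tend to balance better.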

To bring interfaces into a bundle, several parameters must match, such as speed, duplex, trunk/access, allowed VLANs, STP cost and so on.

It is recommended to run a dynamic protocol such as LACP to set up the bundle; this prevents failure modes where a switching loop is created because one side unconditionally bundles its links while the other side has not yet formed the bundle. Port channels are treated as a single logical interface by STP, and a single physical interface is responsible for transmitting BPDUs for the bundle. EtherChannel misconfig guard can protect against failures where BPDUs with different source MAC addresses arrive on ports in the bundle.

STP Scalability and vPC

MST offers greater scalability than RPVST+ due to sending only one BPDU and the decoupling of VLANs from instances. Normally two instances are enough with MST. With MST, VLANs can be created without affecting the STP instances. MST can also better support stretched layer 2 domains through the use of regions.

To achieve load balancing with MST, at least two STP instances need to be defined and different switches will be the root for each of these instances.

Recommendations for MST:

  • Define a region configuration to be copied to all the switches that are part of the Layer 2 topology
  • As part of the region configuration, define to which instances all the VLANs belong. Normally two instances would be enough
  • Define primary and secondary root switches for all the instances that you have defined, also for instance 0. Typically one switch would be the root for instance 0 and instance 1 and a redundant aggregation switch for instance 2
  • Preprovision all VLAN mappings and topologies and later create VLANs as needed

Special Considerations for Spanning Tree with vPCs

Virtual Port Channel (vPC) is a technology used on Nexus switches where two switches act as if they were one by having the primary switch generate BPDUs, LACP messages and so on. The two switches use a link between them to synchronize state and to pass traffic over; this link is called the vPC peer link. Ports that are not configured for vPC behave as normal ports, meaning that BPDUs are generated by the local switch.

Some modifications have been done to STP to be used in combination with vPC, they are the following:

  • The peer link should never be blocking because it carries important traffic such as Cisco Fabric Services over Ethernet (CFSoE) Protocol. The peer link is always forwarding
  • On vPC ports, only the primary switch generates BPDUs. The secondary switch will relay incoming BPDUs to the primary switch

The following picture shows the behavior of Spanning Tree on Nexus switches:

VPC1

The operational primary switch sends BPDUs towards Access1 even though it is not the STP Root. BPDUs that come from Access1 are relayed by Agg2. On ports that are not members of a vPC, normal rules apply, meaning that both Agg switches will send BPDUs towards Access2.

It is recommended to align the operational primary role with the STP Root role. If the peer link fails, the vPC ports on the secondary switch will be shut down. To keep SVIs up for non-vPC VLANs if the peer link fails, use a backup link between the switches that is independent of the peer link, or use the dual-active exclude command. If using an extra link, remove all the non-vPC VLANs from the vPC peer link.

MST and vPC Best Practices

  • Associate the root and secondary root role at the aggregation layer and match the vPC primary and secondary roles with the STP root role.
  • One MST instance is enough
  • Configure regions during the deployment phase
  • If changing the VLAN to instance mapping, change both the primary and secondary vPC to avoid global inconsistency
  • Use dual-active exclude command to not isolate non vPC VLANs when the peer-link is lost

If using RPVST+, use the long path-cost method so that lower-speed interfaces do not get the same cost as higher-speed interfaces. This should be the default for MST but may vary by platform.

Scaling Considerations

Scaling may be affected by the following parameters:

  • The number of PortChannels
  • The number of VLANs supported by the switch
  • Logical interface count
  • Oversubscription rate

The logical port count is the sum over all physical ports of the number of VLANs carried on each port. When vPC is used, the secondary device passes BPDUs to the primary device, which increases the logical interface count. A PortChannel is a logical interface, so it counts as a single logical port regardless of the number of links it contains. To calculate the logical ports for vPCs, multiply the number of vPCs by the number of VLANs on each vPC. For non-vPC switches, the logical port count is the number of trunks times the number of VLANs plus the number of access ports. For a switch with 10 trunks carrying 100 VLANs each and 10 access ports, that is 1010 logical ports.
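The arithmetic above, together with the logical-versus-virtual counting rules for port channels, can be checked in a few lines (the trunk and VLAN numbers are the ones from the text):

```python
# Logical port arithmetic: trunks * VLANs per trunk + access ports.
trunks, vlans_per_trunk, access_ports = 10, 100, 10
logical_ports = trunks * vlans_per_trunk + access_ports
assert logical_ports == 1010

# A 4-link port channel trunking 100 VLANs counts as one logical
# interface (100 logical ports), but for virtual ports every
# physical member counts individually: 4 * 100 = 400.
po_logical = 1 * 100
po_virtual = 4 * 100
assert (po_logical, po_virtual) == (100, 400)
```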

Virtual ports are a per-line-card limitation: each line card supports a maximum number of virtual ports. Virtual ports are calculated the same way as logical ports, except that for a PortChannel all physical member interfaces count individually.

To reduce the number of logical ports, the following concepts are important:

  • Implement multiple aggregation modules
  • Perform manual pruning on trunks
  • Use MST instead of (R)PVST+
  • Distribute trunks and access ports across line cards
  • Remove unused VLANs going to Content Switching Modules (CSM) – The CSM automatically has all VLANs defined in the system configuration

This post has described the key concepts of STP, various STP optimizations and the scaling factors that are important when designing a layer 2 network.