Archive

Archive for the ‘Multicast’ Category

Next Generation Multicast – NG-MVPN

April 10, 2015 Leave a comment

Introduction

Multicast is a great technology that although it provides great benefits, is seldomly deployed. It’s a lot like IPv6 in that regard. Service providers or enterprises that run MPLS and want to provide multicast services have not been able to use MPLS to provide multicast  Multicast has then typically been delivered by using Draft Rosen which is a mGRE technology to provide multicast. This post starts with a brief overview of Draft Rosen.

Draft Rosen

Draft Rosen uses GRE as an overlay protocol. That means that all multicast packets will be encapsulated inside GRE. A virtual LAN is emulated by having all PE routers in the VPN join a multicast group. This is known as the default Multicast Distribution Tree (MDT). The default MDT is used for PIM hello’s and other PIM signaling but also for data traffic. If the source sends a lot of traffic it is inefficient to use the default MDT and a data MDT can be created. The data MDT will only include PE’s that have receivers for the group in use.

Rosen1

Draft Rosen is fairly simple to deploy and works well but it has a few drawbacks. Let’s take a look at these:

  • Overhead – GRE adds 24 bytes of overhead to the packet. Compared to MPLS which typically adds 8 or 12 bytes there is 100% or more of overhead added to each packet
  • PIM in the core – Draft Rosen requires that PIM is enabled in the core because the PE’s must join the default and or data MDT which is done through PIM signaling. If PIM ASM is used in the core, an RP is needed as well. If PIM SSM is run in the core, no RP is needed.
  • Core state – Unneccessary state is created in the core due to the PIM signaling from the PE’s. The core should have as little state as possible
  • PIM adjacencies – The PE’s will become PIM neighbors with each other. If it’s a large VPN and a lot of PE’s, a lot of PIM adjacencies will be created. This will generate a lot of hello’s and other signaling which will add to the burden of the router
  • Unicast vs multicast – Unicast forwarding uses MPLS, multicast uses GRE. This adds complexity and means that unicast is using a different forwarding mechanism than multicast, which is not the optimal solution
  • Inefficency – The default MDT sends traffic to all PE’s in the VPN regardless if the PE has a receiver in the (*,G) or (S,G) for the group in use

Based on this list, it is clear that there is a room for improvement. The things we are looking to achieve with another solution is:

  • Shared control plane with unicast
  • Less protocols to manage in the core
  • Shared forwarding plane with unicast
  • Only use MPLS as encapsulation
  • Fast Restoration (FRR)

NG-MVPN

To be able to build multicast Label Switched Paths (LSPs) we need to provide these labels in some way. There are three main options to provide these labels today:

  • Multipoint LDP(mLDP)
  • RSVP-TE P2MP
  • Unicast MPLS + Ingress Replication(IR)

MLDP is an extension to the familiar Label Distribution Protocol (LDP). It supports both P2MP and MP2MP LSPs and is defined in RFC 6388.

RSVP-TE is an extension to the unicast RSVP-TE which some providers use today to build LSPs as opposed to LDP. It is defined in RFC 4875.

Unicast MPLS uses unicast and no additional signaling in the core. It does not use a multipoint LSP.

Multipoint LSP

Normal unicast forwarding through MPLS uses a point to point LSP. This is not efficient for multicast. To overcome this, multipoint LSPs are used instead. There are two different types, point to multipoint and multipoint to multipoint.

P2MP1

  • Replication of traffic in core
  • Allows only the root of the P2MP LSP to inject packets into the tree
  • If signaled with mLDP – Path based on IP routing
  • If signaled with RSVP-TE – Constraint-based/explicit routing. RSVP-TE also supports admission control

MP2MP1

  • Replication of traffic in core
  • Bidirectional
  • All the leafs of the LSP can inject and receive packets from the LSP
  • Signaled with mLDP
  • Path based on IP routing

Core Tree Types

Depending on the number of sources and where the sources are located, different type of core trees can be used. If you are familiar with Draft Rosen, you may know of the default MDT and the data MDT.

Coretree1

Signalling the Labels

As mentioned previously there are three main ways of signalling the labels. We will start by looking at mLDP.

  • LSPs are built from the leaf to the root
  • Supports P2MP and MP2MP LSPs
    • mLDP with MP2MP provides great scalability advantages for “any to any” topologies
      • “any to any” communication applications:
        • mVPN supporting bidirectional PIM
        • mVPN Default MDT model
        • If a provider does not want  tree state per ingress PE source
  • Supports Fast Reroute (FRR) via RSVP-TE unicast backup path
  • No periodic signaling, reliable using TCP
  • Control plane is P2MP or MP2MP
  • Data plane is P2MP
  • Scalable due to receiver driven tree building
  • Supports MP2MP
  • Does not support traffic engineering

RSVP-TE can be used as well with the following characteristics.

  • LSPs are built from the head-end to the tail-end
  • Supports only P2MP LSPs
  • Supports traffic engineering
    • Bandwidth reservation
    • Explicit routing
    • Fast Reroute (FRR)
  • Signaling is periodic
  • P2P technology at control plane
    • Inherits P2P scaling limitations
  • P2MP at the data plane
    • Packet replication in the core

RSVP-TE will mostly be interesting for SPs that are already running RSVP-TE for unicast or for SPs involved in video delivery. The following table shows a comparision of the different protocols.

Core protocols

Assigning Flows to LSPs

After the LSPs have been signalled, we need to get traffic onto the LSPs. This can be done in several different ways.

  • Static
  • PIM
    • RFC 6513
  • BGP Customer Multicast (C-Mcast)
    • RFC 6514
    • Also describes Auto-Discovery
  • mLDP inband signaling
    • RFC 6826

Static

  • Mostly applicable to RSVP-TE P2MP
  • Static configuration of multicast flows per LSP
  • Allows aggregation of multiple flows in a single LSP

PIM

  • Dynamically assigns flows to an LSP by running PIM over the LSP
  • Works over MP2MP and PPMP LSP types
  • Mostly used but not limited to default MDT
  • No changes needed to PIM
  • Allows aggregation of multiple flows in a single LSP

BGP Auto-Discovery

  • Auto-Discovery
    • The process of discovering all the PE’s with members in a given mVPN
  • Used to establish the MDT in the SP core
  • Can also be used to discover set of PE’s interested in a given customer multicast group (to enable S-PSMSI creation)
    • S-PMSI = Data MDT
  • Used to advertise address of the originating PE and tunnel attribute information (which kind of tunnel)

BGP MVPN Address Family

  • MPBGP extensions to support mVPN address family
  • Used for advertisement of AD routes
  • Used for advertisement of C-mcast routes (*,G) and (S,G)
  • Two new extended communities
    • VRF route import – Used to import mcast routes, similar to RT for unicast routes
    • Source AS – Used for inter-AS mVPN
  • New BGP attributes
    • PMSI Tunnel Attribute (PTA) – Contains information about advertised tunnel
    • PPMP label attribute – Upstream generated label used by the downstream clients to send unicast messages towards the source
  • If mVPN address family is not used the address family ipv4 mdt must be used

BGP Customer Multicast

  • BGP Customer Multicast (C-mcast) signalling on overlay
  • Tail-end driven updates is not a natural fit for BGP
    • BGP is more suited for one-to-many not many-to-one
  • PIM is still the PE-CE protocol
  • Easy to use with SSM
  • Complex to understand and troubleshoot for ASM

MLDP Inband Signaling

  • Multicast flow information encoded in the mLDP FEC
  • Each customer mcast flow creates state on the core routers
    • Scaling is the same as with default MDT with every C-(S,G) on a Data MDT
  • IPv4 and IPv6 multicast in global or VPN context
  • Typical for SSM or PIM sparse mode sources
  • IPTV walled garden deployment
  • RFC 6826

The natural choice is to stick with PIM unless you need very high scalability. Here is a comparison of PIM and BGP.

Slide1

BGP C-Signaling

  • With C-PIM signaling on default MDT models, data needs to be monitored
    • On default/data tree to detect duplicate forwarders over MDT and to trigger the assert process
    • On default MDT to perform SPT switchover (from (*,G) to (S,G))
  • On default MDT models with C-BGP signaling
    • There is only one forwarder on MDT
      • There are no asserts
    • The BGP type 5 routes are used for SPT switchover on PEs
  • Type 4 leaf AD route used to track type 3 S-PMSI (Data MDT) routes
  • Needed when RR is deployed
  • If source PE sets leaf-info-required flag on type 3 routes, the receiver PE responds with with a type 4 route

Migration

If PIM is used in the core, this can be migrated to mLDP. PIM can also be migrated to BGP. This can be done per multicast source, per multicast group and per source ingress router. This means that migration can be done gradually so that not all core trees must be replaced at the same time.

It is also possible to have both mGRE and MPLS encapsulation in the network for different PE’s.

To summarize the different options for assigning flows to LSPs

  • Static
    • Mostly applicable to RSVP-TE
  • PIM
    • Well known, has been in use since mVPN introduction over GRE
  • BGP A-D
    • Useful where head-end assigns the flows to the LSP
  • BGP C-mcast
    • Alternative to PIM in mVPN context
    • May be required in dual vendor networks
  • MLDP inband signaling
    • Method to stitch a PIM tree to a mLDP LSP without any additional signaling

Optimizing the MDT

There are some drawbacks with the normal operation of the MDT. The tree is signalled even if there is no customer traffic leading to unneccessary state in the core. To overcome these limitations there is a model called the partitioned MDT running over mLDP with the following characteristics.

  • Dynamic version of default MDT model
  • MDT is only built when customer traffic needs to be transported across the core
  • It addresses issues with the default MDT model
    • Optimizes deployments where sources are located in a few sites
    • Supports anycast sources
      • Default MDT would use PIM asserts
    • Reduces the number of PIM neighbors
      • PIM neighborship is unidirectional – The egress PE sees ingress PEs as PIM neighbors

Conclusion

There are many many different profiles supported, currently 27 profiles on Cisco equipment. Here are some guidelines to guide you in the selection of a profile for NG-MVPN.

  • Label Switched Multicast (LSM) provides unified unicast and multicast forwarding
  • Choosing a profile depends on the application and scalability/feature requirements
  • MLDP is the natural and safe choice for general purpose
    • Inband signalling is for walled garden deployments
    • Partitioned MDT is most suitable if there are few sources/few sites
    • P2MP TE is used for bandwidth reservation and video distribution (few source sites)
    • Default MDT model is for anyone (else)
  • PIM is still used as the PE-CE protocol towards the customer
  • PIM or BGP can be used as an overlay protocol unless inband signaling or static mapping is used
  • BGP is the natural choice for high scalability deployments
    • BGP may be the natural choice if already using it for Auto-Discovery
  • The beauty of NG-MVPN is that profile can be selected per customer/VPN
    • Even per source, per group or per next-hop can be done with Routing Policy Language (RPL)

This post was heavily inspired and is basically a summary of the Cisco Live session BRKIPM-3017 mVPN Deployment Models by Ijsbrand Wijnands and Luc De Ghein. I recommend that you read it for more details and configuration of NG-MVPN.

Advertisements
Categories: Multicast Tags: , , ,

HSRP AWARE PIM

February 13, 2015 Leave a comment

In environments that require redundancy towards clients, HSRP will normally be running. HSRP is a proven protocol and it works but how do we handle when we have clients that need multicast? What triggers multicast to converge when the Active Router (AR) goes down? The following topology is used:

PIM1

One thing to notice here is that R3 is the PIM DR even though R2 is the HSRP AR. The network has been setup with OSPF, PIM and R1 is the RP. Both R2 and R3 will receive IGMP reports but only R3 will send PIM Join, due to it being the PIM DR. R3 builds the (*,G) towards the RP:

R3#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 02:54:15/00:02:20, RP 1.1.1.1, flags: SJC
  Incoming interface: Ethernet0/0, RPF nbr 13.13.13.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:25:59/00:02:20

We then ping 239.0.0.1 from the multicast source to build the (S,G):

S1#ping 239.0.0.1 re 3
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 239.0.0.1, timeout is 2 seconds:

Reply to request 0 from 10.0.0.10, 35 ms
Reply to request 1 from 10.0.0.10, 1 ms
Reply to request 2 from 10.0.0.10, 2 ms

The (S,G) has been built:

R3#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 02:57:14/stopped, RP 1.1.1.1, flags: SJC
  Incoming interface: Ethernet0/0, RPF nbr 13.13.13.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:28:58/00:02:50

(41.41.41.10, 239.0.0.1), 00:02:03/00:00:56, flags: JT
  Incoming interface: Ethernet0/0, RPF nbr 13.13.13.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:02:03/00:02:50

The unicast and multicast topology is not currently congruent, this may or may not be important. What happens when R3 fails?

R3(config)#int e0/2
R3(config-if)#sh
R3(config-if)#

No replies to the pings coming in until PIM on R2 detects that R3 is gone and takes over the DR role, this will take between 60 to 90 seconds with the default timers in use.

S1#ping 239.0.0.1 re 100 ti 1
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 239.0.0.1, timeout is 1 seconds:

Reply to request 0 from 10.0.0.10, 18 ms
Reply to request 1 from 10.0.0.10, 2 ms....................................................................
.......
Reply to request 77 from 10.0.0.10, 10 ms
Reply to request 78 from 10.0.0.10, 1 ms
Reply to request 79 from 10.0.0.10, 1 ms
Reply to request 80 from 10.0.0.10, 1 ms

We can increase the DR priority on R2 to make it become the DR.

R2(config-if)#ip pim dr-priority 50  
*Feb 13 12:42:45.900: %PIM-5-DRCHG: DR change from neighbor 10.0.0.3 to 10.0.0.2 on interface Ethernet0/2

HSRP aware PIM is a feature that started appearing in IOS 15.3(1)T and makes the HSRP AR become the PIM DR. It will also send PIM messages from the virtual IP which is useful in situations where you have a router with a static route towards an Virtual IP (VIP). This is how Cisco describes the feature:

HSRP Aware PIM enables multicast traffic to be forwarded through the HSRP active router (AR), allowing PIM to leverage HSRP redundancy, avoid potential duplicate traffic, and enable failover, depending on the HSRP states in the device. The PIM designated router (DR) runs on the same gateway as the HSRP AR and maintains mroute states.

In my topology, I am running HSRP towards the clients, so even though this feature sounds as a perfect fit it will not help me in converging my multicast. Let’s configure this feature on R2:

R2(config-if)#ip pim redundancy HSRP1 hsrp dr-priority 100
R2(config-if)#
*Feb 13 12:48:20.024: %PIM-5-DRCHG: DR change from neighbor 10.0.0.3 to 10.0.0.2 on interface Ethernet0/2

R2 is now the PIM DR, R3 will now see two PIM neighbors on interface E0/2:

R3#sh ip pim nei e0/2
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.0.0.1          Ethernet0/2              00:00:51/00:01:23 v2    0 / S P G
10.0.0.2          Ethernet0/2              00:07:24/00:01:23 v2    100/ DR S P G

R2 now has the (S,G) and we can see that it was the Assert winner because R3 was previously sending multicasts to the LAN segment.

R2#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 00:20:31/stopped, RP 1.1.1.1, flags: SJC
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:16:21/00:02:35

(41.41.41.10, 239.0.0.1), 00:00:19/00:02:40, flags: JT
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:00:19/00:02:40, A

What happens when R2’s LAN interface goes down? Will R3 become the DR? And how fast will it converge?

R2(config)#int e0/2
R2(config-if)#sh

HSRP changes to active on R3 but the PIM DR role does not converge until the PIM query interval has expired (3x hellos).

*Feb 13 12:51:44.204: HSRP: Et0/2 Grp 1 Redundancy "hsrp-Et0/2-1" state Standby -> Active
R3#sh ip pim nei e0/2
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.0.0.1          Ethernet0/2              00:04:05/00:00:36 v2    0 / S P G
10.0.0.2          Ethernet0/2              00:10:39/00:00:36 v2    100/ DR S P G
R3#
*Feb 13 12:53:02.013: %PIM-5-NBRCHG: neighbor 10.0.0.2 DOWN on interface Ethernet0/2 DR
*Feb 13 12:53:02.013: %PIM-5-DRCHG: DR change from neighbor 10.0.0.2 to 10.0.0.3 on interface Ethernet0/2
*Feb 13 12:53:02.013: %PIM-5-NBRCHG: neighbor 10.0.0.1 DOWN on interface Ethernet0/2 non DR

We lose a lot of packets while waiting for PIM to converge:

S1#ping 239.0.0.1 re 100 time 1
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 239.0.0.1, timeout is 1 seconds:

Reply to request 0 from 10.0.0.10, 5 ms
Reply to request 0 from 10.0.0.10, 14 ms...................................................................
Reply to request 68 from 10.0.0.10, 10 ms
Reply to request 69 from 10.0.0.10, 2 ms

Reply to request 70 from 10.0.0.10, 1 ms

HSRP aware PIM didn’t really help us here… So when is it useful? If we use the following topology instead:

PIM2

The router R5 has been added and the receiver sits between R5 instead. R5 does not run routing with R2 and R3, only static routes pointing at the RP and the multicast source:

R5(config)#ip route 1.1.1.1 255.255.255.255 10.0.0.1
R5(config)#ip route 41.41.41.0 255.255.255.0 10.0.0.1

Without HSRP aware PIM, the RPF check would fail because PIM would peer with the physical address but R5 sees three neighbors on the segment, where one is the VIP:

R5#sh ip pim nei
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.0.0.2          Ethernet0/0              00:03:00/00:01:41 v2    100/ DR S P G
10.0.0.1          Ethernet0/0              00:03:00/00:01:41 v2    0 / S P G
10.0.0.3          Ethernet0/0              00:03:00/00:01:41 v2    1 / S P G

R2 will be the one forwarding multicast during normal conditions due to it being the PIM DR via HSRP state of active router:

R2#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 00:02:12/00:02:39, RP 1.1.1.1, flags: S
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:02:12/00:02:39

Let’s try a ping from the source:

S1#ping 239.0.0.1 re 3
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 239.0.0.1, timeout is 2 seconds:

Reply to request 0 from 20.0.0.10, 1 ms
Reply to request 1 from 20.0.0.10, 2 ms
Reply to request 2 from 20.0.0.10, 2 ms

The ping works and R2 has the (S,G):

R2#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 00:04:18/00:03:29, RP 1.1.1.1, flags: S
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:04:18/00:03:29

(41.41.41.10, 239.0.0.1), 00:01:35/00:01:24, flags: T
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:01:35/00:03:29

What happens when R2 fails?

R2#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R2(config)#int e0/2
R2(config-if)#sh
R2(config-if)#
S1#ping 239.0.0.1 re 200 ti 1
Type escape sequence to abort.
Sending 200, 100-byte ICMP Echos to 239.0.0.1, timeout is 1 seconds:

Reply to request 0 from 20.0.0.10, 9 ms
Reply to request 1 from 20.0.0.10, 2 ms
Reply to request 1 from 20.0.0.10, 11 ms....................................................................
......................................................................
............................................................

The pings time out because when the PIM Join from R5 comes in, R3 does not realize that it should process the Join.

*Feb 13 13:20:13.236: PIM(0): Received v2 Join/Prune on Ethernet0/2 from 10.0.0.5, not to us
*Feb 13 13:20:32.183: PIM(0): Generation ID changed from neighbor 10.0.0.2

As it turns out, the PIM redundancy command must be configured on the secondary router as well for it to process PIM Joins to the VIP.

R3(config-if)#ip pim redundancy HSRP1 hsrp dr-priority 10

After this has configured, the incoming Join will be processed. R3 triggers R5 to send a new Join because the GenID is set in the PIM hello to a new value.

*Feb 13 13:59:19.333: PIM(0): Matched redundancy group VIP 10.0.0.1 on Ethernet0/2 Active, processing the Join/Prune, to us
*Feb 13 13:40:34.043: PIM(0): Generation ID changed from neighbor 10.0.0.1

After configuring this, the PIM DR role converges as fast as HSRP allows. I’m using BFD in this scenario.

The key concept for understanding HSRP aware PIM here is that:

  • Initially configuring PIM redundancy on the AR will make it the DR
  • PIM redundancy must be configured on the secondary router as well, otherwise it will not process PIM Joins to the VIP
  • The PIM DR role does not converge until PIM hellos have timed out, the secondary router will process the Joins though so the multicast will converge

This feature is not very well documented so I hope you have learned a bit from this post how this feature really works. This feature does not work when you have receiver on a HSRP LAN, because the DR role is NOT moved until PIM adjacency expires.

Categories: Multicast Tags: , , ,

The Tale of the Mysterious PIM Prune

December 15, 2014 3 comments

Christmas is lurking around the corner and in the spirit of Denise “Fish” Fishburne, I give you the “The Tale of the Mysterious PIM Prune”.

I have been working a lot with multicast lately which is also why I’ve blogged about it. To start off this story, let’s begin with a network topology.

Topology1

The multicast source is located in AS 65000 and contains two routers that are connected to the multicast source. The routers run BFD, OSPF, iBGP, PIM internally and the RP is located on C1. There is a local receiver in AS 65000 and a remote one in AS 64512. The networks 10.0.1.0/24 and 10.0.21.0/24 come off the same physical interface. If you want to replicate this lab, all the configs are provided here.

This network requires fast convergence and I have been troubleshooting a scenario where the active multicast router (R1) has its LAN interface go down, meaning that the traffic from the source must come in on R2. In this scenario I have seen convergence in up to 60 seconds which is not acceptable. The BGP design is for R2 to still exit out via R1 if the link is available towards C1. The picture below shows the normal multicast flow.

Normal flow

When R1 has its LAN interface go down, the traffic will pass from R2 over the link to R1 and out to C1.

Backup flow

R1 and R2 have a default route learned via BGP that points at C1. This will be an important piece of the puzzle later. Let’s go through what happens, step by step when R1 has its LAN interface go down. To simulate the multicast traffic I have an Ubuntu machine acting as the source. The receivers are CSR1000v routers with debug ip icmp to see how often the traffic is coming in. I’m sending ICMP packets every 100 ms.

*Dec 14 19:48:08.922: HSRP: Gi3.1 Interface DOWN

The interface on R1 goes down.

*Dec 14 19:48:08.926: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 19:48:08.926: RT: interface GigabitEthernet3.1 removed from routing table
*Dec 14 19:48:08.926: RT: del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.926: RT: delete subnet route to 10.0.1.0/24
*Dec 14 19:48:08.926: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 19:48:08.927: RT(multicast): interface GigabitEthernet3.1 removed from routing table
*Dec 14 19:48:08.927: RT(multicast): del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.927: RT(multicast): delete subnet route to 10.0.1.0/24
*Dec 14 19:48:08.927: RT: del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.927: RT: delete subnet route to 10.0.1.2/32
*Dec 14 19:48:08.927: RT(multicast): del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.927: RT(multicast): delete subnet route to 10.0.1.2/32

It takes roughly 5 ms to remove the route from the RIB and the MRIB.

*Dec 14 19:48:08.935: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
*Dec 14 19:48:08.936: RT: updating bgp 10.0.21.0/24 (0x0)  :
    via 10.0.255.2   0 1048577

R1 starts to install the route to the source via R2 into the RIB, but the route is not yet installed! This is a key concept.

*Dec 14 19:48:08.937: MRT(0): Delete GigabitEthernet1/239.0.0.1 from the olist of (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:08.937: MRT(0): Reset the PIM interest flag for (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:08.938: MRT(0): set min mtu for (10.0.1.10, 239.0.0.1) 1500->1500
*Dec 14 19:48:08.938: MRT(0): (10.0.1.10,239.0.0.1), RPF change from GigabitEthernet3.1/0.0.0.0 to GigabitEthernet1/10.0.11.1
*Dec 14 19:48:08.938: MRT(0): Reset the F-flag for (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:08.938: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.11.1's queue
*Dec 14 19:48:08.938: PIM(0): Building Join/Prune packet for nbr 10.0.11.1
*Dec 14 19:48:08.938: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 19:48:08.938: PIM(0): Send v2 join/prune to 10.0.11.1 (GigabitEthernet1)
*Dec 14 19:48:08.938: RT: updating bgp 10.0.1.0/24 (0x0)  :
    via 10.0.255.2   0 1048577

R1 cleans up some multicast state and then it sends a PIM Join towards C1, but the source is not located in that direction! The giveaway here is the message about RPF change and 0.0.0.0 which is the default route pointing towards C1. This default route is already installed into the RIB so the PIM Join is sent out the RPF interface which for a brief period happens to be towards C1. The receiver is still located in that direction though.

*Dec 14 19:48:08.938: RT: closer admin distance for 10.0.1.0, flushing 1 routes
*Dec 14 19:48:08.938: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
*Dec 14 19:48:08.939: MRT(0): (10.0.1.10,239.0.0.1), RPF change from GigabitEthernet1/10.0.11.1 to GigabitEthernet2/10.0.112.2
*Dec 14 19:48:08.939: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.112.2's queue
*Dec 14 19:48:08.939: PIM(0): Insert (10.0.1.10,239.0.0.1) prune in nbr 10.0.11.1's queue
*Dec 14 19:48:08.939: PIM(0): Building Join/Prune packet for nbr 10.0.11.1
*Dec 14 19:48:08.939: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Prune
*Dec 14 19:48:08.939: PIM(0): Send v2 join/prune to 10.0.11.1 (GigabitEthernet1)
*Dec 14 19:48:08.939: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 19:48:08.939: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 19:48:08.939: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)

R1 then installs the route via R2, it tries to cover its tracks by pruning off the interface towards C1. Hold on a second here though, isn’t that where the receiver is located? Indeed it is, this means the (S,G) tree is broken until C1 sends a periodic Join which could take from 0-60 seconds depending on when the last Join came in. The arrows on the topology shows the events in order, first a Join towards C1, then a Prune towards C1, then a Join towards R2.

Join and Prune

*Dec 14 19:48:09.932: PIM(0): Insert (10.0.1.10,239.0.0.1) prune in nbr 10.0.112.2's queue - deleted
*Dec 14 19:48:09.932: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 19:48:09.932: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Prune
*Dec 14 19:48:09.932: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)

R1 no longer has any receivers interested in the traffic, it pruned off the only remaining receiver located towards C1 so it sends a Prune towards R2.

*Dec 14 19:48:50.864: PIM(0): Received v2 Join/Prune on GigabitEthernet1 from 10.0.11.1, to us
*Dec 14 19:48:50.864: PIM(0): Join-list: (10.0.1.10/32, 239.0.0.1), S-bit set
*Dec 14 19:48:50.864: MRT(0): WAVL Insert interface: GigabitEthernet1 in (10.0.1.10,239.0.0.1) Successful
*Dec 14 19:48:50.864: MRT(0): set min mtu for (10.0.1.10, 239.0.0.1) 18010->1500
*Dec 14 19:48:50.864: MRT(0): Add GigabitEthernet1/239.0.0.1 to the olist of (10.0.1.10, 239.0.0.1), Forward state - MAC built
*Dec 14 19:48:50.864: PIM(0): Add GigabitEthernet1/10.0.11.1 to (10.0.1.10, 239.0.0.1), Forward state, by PIM SG Join
*Dec 14 19:48:50.864: MRT(0): Add GigabitEthernet1/239.0.0.1 to the olist of (10.0.1.10, 239.0.0.1), Forward state - MAC built
*Dec 14 19:48:50.864: MRT(0): Set the PIM interest flag for (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:50.864: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.112.2's queue
*Dec 14 19:48:50.864: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 19:48:50.864: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 19:48:50.864: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)

C1 is still interested in this multicast traffic, it sends a periodic Join towards R1 which then triggers R1 to send a Join towards R2. There was roughly a 40 second delay between these events. This can also be seen from the ping that I had running.

*Dec 14 19:48:08.842: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:08.942: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:09.042: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.054: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.158: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.255: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.355: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0

This story shows how the unicast routing table and a race condition can effect your multicast traffic. What would a story be without a happy ending though? What can we do to solve the race condition?

The key here is that the default route in already installed into the RIB. To beat it we will have to put a longer match into the RIB or into the MRIB. This can be done by putting a static unicast route or a static multicast route. I prefer to use a static multicast route since that will have no effect on unicast traffic.

ip mroute 10.0.1.0 255.255.255.0 10.0.112.2

or

ip route 10.0.1.0 255.255.255.0 10.0.112.2

The connected route is the best route until R1 has its LAN interface go down. The MRIB will then use the next match which is 10.0.1.0 via 10.0.112.2. Let’s run another test now that we have altered the MRIB.

*Dec 14 20:29:03.128: HSRP: Gi3.1 Interface going DOWN

Interface goes down.

*Dec 14 20:29:03.142: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 20:29:03.142: RT: interface GigabitEthernet3.1 removed from routing table
*Dec 14 20:29:03.142: RT: del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.142: RT: delete subnet route to 10.0.1.0/24
*Dec 14 20:29:03.143: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 20:29:03.143: RT(multicast): interface GigabitEthernet3.1 removed from routing table
*Dec 14 20:29:03.143: RT(multicast): del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.143: RT(multicast): delete subnet route to 10.0.1.0/24
*Dec 14 20:29:03.143: RT: del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.143: RT: delete subnet route to 10.0.1.2/32
*Dec 14 20:29:03.143: RT(multicast): del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.143: RT(multicast): delete subnet route to 10.0.1.2/32

Remove 10.0.1.0/24 from the RIB and MRIB.

*Dec 14 20:29:03.150: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
*Dec 14 20:29:03.150: RT: updating bgp 10.0.21.0/24 (0x0)  :
    via 10.0.255.2   0 1048577
*Dec 14 20:29:03.150: RT: add 10.0.21.0/24 via 10.0.255.2, bgp metric [200/0]

Start installing the BGP route.

*Dec 14 20:29:03.152: MRT(0): (10.0.1.10,239.0.0.1), RPF change from GigabitEthernet3.1/0.0.0.0 to GigabitEthernet2/10.0.112.2
*Dec 14 20:29:03.152: MRT(0): Reset the F-flag for (10.0.1.10, 239.0.0.1)
*Dec 14 20:29:03.152: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.112.2's queue
*Dec 14 20:29:03.152: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 20:29:03.152: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 20:29:03.152: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)
*Dec 14 20:29:03.153: RT: updating bgp 10.0.1.0/24 (0x0)  :
    via 10.0.255.2   0 1048577
*Dec 14 20:29:03.153: RT: closer admin distance for 10.0.1.0, flushing 1 routes
*Dec 14 20:29:03.153: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]

The RPF interface changes but now points towards the correct destination, which is R2. The multicast traffic can flow again. How many packets did we lose?

*Dec 14 20:29:02.017: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.115: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.216: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.317: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.418: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.519: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.620: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.720: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.820: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.920: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.021: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.122: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.223: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.324: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.424: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.526: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.626: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.727: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.828: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.927: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0

No packets lost! I would probably have to send packets more often than every 100ms to catch the tree converging here. In a real network you would see some delay because pulling the cable is different than shutting down an interface, the carrier delay would come into play here. Here are some key concepts you should learn from this post.

  • Unicast routing table will impact the multicast routing table
  • Never assume anything, verify
  • Installing routes takes time

There are a lot of moving parts here. If you don’t understand it all at once, don’t worry. It’s a complex scenario and don’t be afraid to ask questions in the comments section. Here are the key concepts of converging in this scenario.

  • Interface goes down (0-2 seconds)
  • Remove stale route (14 ms)
  • Install new route (3 ms)
  • RPF change/PIM Join (2 ms)

These are values from my tests and you will see higher values on production equipment. Because PIM Joins are triggered by the routing table changing, convergence can be very fast depending on how fast you detect it. As you can see from the values above, you can achieve convergence somewhere between 0.5 – 3 seconds realistically depending on how aggressively you tune your timers.

After reading this post you should have a better understanding of multicast, the RIB, MRIB and the unicast routing table impact on multicast flows.

Categories: Multicast Tags: , , ,

Using EEM to Speed up Multicast Convergence when Receiver is Dually Connected

November 22, 2014 3 comments

When deploying PIM ASM, the Designated Router (DR) role plays a significant part in how PIM ASM works. The DR on a segment is responsible for registering mulicast sources with the Rendezvous Point (RP) and/or sending PIM Joins for the segment. Routers with PIM enabled interfaces send out PIM Hello messages every 30 seconds by default.

EEM1

After missing three Hellos the secondary router will take over as the DR. With the standard timer value, this can take between 60 to 90 seconds depending on when the last Hello came in. Not really acceptable in a modern network.

The first thought is to lower the PIM query interval, this can be done and it supports sending PIM Hellos at msec level. In my particular case I needed convergence within two seconds. I tuned the PIM query interval to 500 msec meaning that the PIM DR role should converge within 1.5 seconds. The problem though is that these Hellos are sent at process level. Even though my routers were barely breaking a sweat CPU wise I would see PIM adjacencies flapping.

The answer to my problems would be to have Bidirectional Forwarding Dectection (BFD) for PIM but it’s only supported on a very limited set of platforms. I already have BFD running for OSPF and BGP but unfortunately it’s not supported for PIM. The advantage of BFD is that the Hellos are more light weight and they are sent through interrupt instead of process level. This provides more deterministic behavior than than the regular PIM Hellos.

So how did I solve my problem? I need something that detects failure, I need BFD. Hot Standby Routing Protocol (HSRP) detects failures, HSRP has support for BFD. I could then use HSRP to detect the failure and act on the Syslog message generated by HSRP. Even though I didn’t really need HSRP on that segment it helped me in moving the PIM DR role which I wrote this Embedded Event Manager (EEM) applet for. A thank you to Peter Paluch for providing this idea and support 🙂

The configuration of the interface is this:

interface GigabitEthernet0/2.100
 description *** Receiver LAN ***   
 encapsulation dot1Q 100
 ip address 10.0.100.3 255.255.255.0
 no ip redirects
 no ip unreachables
 no ip proxy-arp
 ip pim sparse-mode
 standby version 2
 standby 1 ip 10.0.100.1
 standby 1 preempt delay reload 180
 standby 1 name HSRP-1
 bfd interval 300 min_rx 300 multiplier 3

BFD is sending Hellos every 300 msec so it will converge within 900 msec. The key is then to find the Syslog message that HSRP generates when it detects a failure. These messages look like this:

%HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Standby -> Active
%HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Speak -> Standby

It is then possible to write an EEM applet acting on this message and setting the DR priority on the secondary router.

event manager applet CHANGE-DR-UP-RECEIVER
 event syslog pattern "%HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Standby -> Active"
 action 1.0 syslog msg "Changing DR on interface Gi0/2.100 due to AR is DOWN"
 action 1.1 cli command "enable"
 action 1.2 cli command "conf t"
 action 1.3 cli command "interface gi0/2.100"
 action 1.4 cli command "ip pim dr-priority 100"
 action 1.5 cli command "end"

When HSRP has detected the failure, the EEM apple will trigger very quickly and set the priority.

116072: Nov 20 13:03:04.544 UTC: %HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Standby -> Active
116080: Nov 20 13:03:04.552 UTC: %HA_EM-6-LOG: CHANGE-DR-UP-RECEIVER : DEBUG(cli_lib) : : CTL : cli_open called.
116120: Nov 20 13:03:04.604 UTC: PIM(0): Changing DR for GigabitEthernet0/2.100, from 10.0.100.2 to 10.0.100.3 (this system)
116121: Nov 20 13:03:04.604 UTC: %PIM-5-DRCHG: DR change from neighbor 10.0.100.2 to 10.0.100.3 on interface GigabitEthernet0/2.100

It took 60 msec from HSRP detecting the failure through BFD until the DR role had converged. It’s then possible to recover from a failure within a second.

It’s also important to set the DR priority back after the network converges. We use another applet for this:

event manager applet CHANGE-DR-DOWN-RECEIVER
 event syslog pattern "%HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Speak -> Standby"
 action 1.0 syslog msg "Changing DR on interface Gi0/2.100 due to AR is UP"
 action 1.1 cli command "enable"
 action 1.2 cli command "conf t"
 action 1.3 cli command "interface gi0/2.100"
 action 1.4 cli command "no ip pim dr-priority 100"
 action 1.5 cli command "end"

This works very well. There are some considerations when running EEM. Firstly, if you are running AAA then the EEM applet will fail authorization. This can be bypassed with the following command:

event manager applet CHANGE-DR-UP-RECEIVER authorization bypass

It’s also important to note that the EEM applet will use a VTY line when executing so make sure that there are available VTY’s when the applet runs.

After the PIM DR role has converged, the router will send out a PIM Join and the multicast will start flowing to the receiver.

Categories: Multicast Tags: , , ,

Lessons Learned from Deploying Multicast

November 10, 2014 2 comments

Lately I have been working a lot with multicast, which is fun and challenging! Even if you have a good understanding of multicast unless you work on it a lot there may be some concepts that fall out of memory or that you only run into in real life and not in the lab. Here is a summary of some things I’ve noticed so far.

PIM Register

PIM Register are control plane messages sent from the First Hop Router (FHR) towards the Rendezvous Point (RP). These are unicast messages encapsulating the multicast from the multicast source. There are some considerations here, firstly because these packets are sent from the FHR control plane to the RP control plane, they are not subject to any access list configured outbound on the FHR. I had a situation where I wanted to route the multicast locally but not send it outbound.

PIM Register 1

 

Even if the ACL was successful, care would have to be taken to not break the control plane between the FHR and the RP or all multicast traffic for the group would be at jeopardy.

The PIM Register messages are control plane messages, this means that the RP has to process them in the control plane which will tax the CPU. Depending on the rate that the FHR is sending to the RP and the number of sources, this can be very stressful on the CPU. As a safeguard the following command can be implemented:


ip pim register-rate-limit 20000

This command is applied on FHRs and limits the rate of the PIM Register messages to 20 kbit/s. By default there is no limit, set it to something that makes sense in your environment.

Storm Control

If you have switches in your multicast environement, and most likely you will, implement storm control. If a loop forms you don’t want to have an unlimited amount of broadcast and multicast flooding your layer 2 domain. Combined with the PIM Register this can be a real killer for the control plane if your FHR is trying to register sources at a very high packet rate.


storm-control broadcast level pps 100

storm-control multicast level pps 1k

The above is just an example, you have to set it to something that fits your environment, make sure to leave some room for more traffic than expected but not enough to hurt your devices if somethings goes wrong.

S,G Timeout

PIM Any Source Multicast (ASM) relies on using a RP when setting up the flow between the multicast sender and receiver. The receiver will first join the (*,G) tree which is rooted at the RP. After the receiver learns of the source it can switch over to the source tree (S,G). The (S,G) mroute in the Multicast Routing Information Base (MRIB) has a standard lifetime of 180 seconds. It can be beneficial to raise this timeout depending on the topology. Look at the following topology:

Timeout

 

If something happens to the source making it go away for three minutes, the (S,G) state will time out. Let’s then say that the source comes back but the RP is not available, then the FHR will not be able to register the source and no traffic can flow between the source and the receiver. If a higher timeout was configured for the (S,G) then the traffic would start flowing again when the source came back online. It’s not a very common scenario but can be a reasonable safe guard for important multicast groups. The drawback of configuring is that you will keep state for a longer time even if it is not needed.


ip pim sparse sg-expiry-timer <value>

The maximum timeout is 57600 seconds which is 16h. Setting it to a couple of hours may cut you some slack if something happens to the RP. Be careful if you have a lot of groups running though.

These are some important aspects but certainly not all. What lessons have you learned from deploying multicast?

 

Categories: Multicast Tags: , , , ,

Potential Issues with Multicast within a VLAN Spanning Switches

July 11, 2014 7 comments

Background

I ran into an interesting issue yesterday at work. There is a new video system
being installed, which takes the video output from computers, encodes it and
sends it as multicast to a controller. The controller then displays it on
a video wall. I had been told that the network has to support multicast.
As all the devices were residing in the same VLAN, I did not expect any issues.
However, the system was not able to receive the multicast. At first we expected
it could be the virtual environment and that the vSwitch did not support multicast,
because one server was deployed on the ESX cluster. The topology was this:

Snooping1

Multicast at Layer 2

Before describing the issue, let’s think about how multicast at layer 2 works.
The source will send to a multicast destination IP. This IP is the converted to a
destination MAC address. If the group is 227.0.0.1, this would map to the MAC
address 0100.5e00.0001. Switches forward multicast and broadcast frames to all
ports in a VLAN. This is not effective in the case of multicast as the traffic
may not have been requested by the host connected to a port receiving the
traffic.

IGMP Snooping

IGMP is the protocol used by hosts to indicate that they are interested in
receiving traffic for a multicast group. Cisco switches will by default run a
feature called IGMP snooping. Instead of forwarding multicast traffic to all ports
in the VLAN, the switch will only forward it to the ports that are interested in
receiving it. This works by “snooping” on the IGMP reports from the hosts, if the
switch sees an IGMP report on a port for a group, it knows that the host is interested
in receiving the traffic. The switch will then only forward that traffic to ports
where interested hosts reside. This is a nice feature but it can cause issues
if running multicast in the same VLAN between different switches.

The Issue

Going back to the topology at the start, the source is now sending multicast
traffic into the VLAN. The receiver has sent an IGMP report, because it wants
to receive the traffic.

Snooping2

The multicast traffic from the source reaches the 3650 stack but never leaves,
keeping the traffic from passing through the other switches to the receiver.
Why? The switches are running IGMP snooping, when the Catalyst 3550 at the far
right receives the IGMP report, it adds the port of receiver to receive traffic
for the multicast group. It does not however forward the IGMP report.
This means that the 3650 stack does not know that there are receivers that want
to receive this multicast traffic. The 3650 stack has no entry for the multicast
group in its snooping table. The traffic is essentially black holed.

The Fix

To understand the fix, you must first know about the mrouter port. The mrouter port
is a port leads to a multicast enabled router. For multicast traffic to pass through
IGMP snooping enabled switches, there must be a mrouter port. This port can be
statically assigned or can be dynamically learned. When the switch has a mrouter
port it will forward some of the IGMP reports out the mrouter port, which means that
the IGMP reports will reach the switch where the source is located. Not all reports
must be relayed, it’s enough that the switch learns that there are receivers out there.

Snooping3

In my case I configured the 3650 stack to be an IGMP querier.

SW-1#sh run | i querier
ip igmp snooping querier

This is a dynamic feature that when configured on a switch it considers itself
to be a mrouter, acting as a proxy instead of a router. It will send out general
queries and when the other switches sees this query, they will learn that as a
mrouter port.

SW-3#show ip igmp snooping mrouter 
Vlan    ports
----    -----
  10    Gi0/1(dynamic)

After turning on the IGMP querier feature, the issue was solved.
I hope this post can help someone in case you have issues with multicast in a
switched environment. There is an excellent post about it here from Cisco.

More on SSM

June 9, 2012 4 comments

As you’ve noticed I’ve been studying SSM and what better way to learn than to blog about it. I recently got a Safari subscription which has been great so far. I’ve been reading some in the book Interdomain Multicast Routing: Practical Juniper Networks and Cisco Systems Solutions which has been great so far.

We are still using the same topology and now we will look a bit more in detail what is happening.

R1 will be the source, sending traffic from its loopback. R3 will be the client running IGMPv3 on its upstream interface to R2. As explained in previous post I am doing this to simulate an end host otherwise I would configure it on R3 downstream interface and then it would sen a PIM Join upstream.

To run SSM we need IGMPv3 or use some form of mapping as described in previous post. It is important to note though that IGMPv3 is not specific for SSM. With SSM a (S,G) pair is described as a channel. Instead of join/leave it is now called subscribe and unsubscribe.

So the first thing that happens is that the client (computer or STB) sends IGMPv3 membership report to the destination IP 224.0.0.22. This is the IP used for IGMPv3. This is how the packet looks in Wireshark.

The destination IP is 224.0.0.22 which corresponds to the multicast MAC 01-00-5E-00-00-16. 16 in hex is 22 in decimal.

We clearly see that it is version 3 and the type is Membership Report 0x22. Number of group records show how many groups are being joined.

Then the actual group record is shown (225.0.0.1) and the type is Allow New Sources. The number of sources is 1. And then we see the channel (S,G) that is joined.

Then R2 sends a PIM Join towards the source.

We can see that it is a (S,G) join. The SPT is built.

R2 will send general IGMPv3 queries to see if there are still any receivers connected to the LAN segment.

The query is sent to all multicast hosts (224.0.0.1) and if still receiving the multicast the host will reply with a report.

The type is Membership Query (0x11). The Max Response Time is 10 seconds which is the time that the host has to reply within.

We can see in this report that the record type is Mode is include (1) compared to Allow New Sources when the first report was sent.

Now R3 unsubscribes to the channel and the IGMP report is used once again.

The type is now Block Old Sources (6).

After this has been sent the IGMP querier (router) has to make sure that there are no other subscribers to the channel so it sends out a channel specific query.

If noone responds to this the router will send a PIM Prune upstream as can be seen here.

Finally. How can we see which router is the IGMP querier? Use the show ip igmp interface command.

R2#show ip igmp interface fa0/0
FastEthernet0/0 is up, line protocol is up
  Internet address is 23.23.23.2/24
  IGMP is enabled on interface
  Current IGMP host version is 3
  Current IGMP router version is 3
  IGMP query interval is 60 seconds
  IGMP querier timeout is 120 seconds
  IGMP max query response time is 10 seconds
  Last member query count is 2
  Last member query response interval is 1000 ms
  Inbound IGMP access group is not set
  IGMP activity: 2 joins, 1 leaves
  Multicast routing is enabled on interface
  Multicast TTL threshold is 0
  Multicast designated router (DR) is 23.23.23.2 (this system)
  IGMP querying router is 23.23.23.2 (this system)
  Multicast groups joined by this system (number of users):
      224.0.1.40(1)

We can see some interesting things here. We can see which router is the designated router and IGMP querier. By default the IGMP querier is the router with the lowest IP and the DR is the one with highest IP. DR can be affected by chancing the DR priority. We can also see which timers are used for query interval and max response time among other timers.

So by now you should have a good grasp of SSM. It does not have a lot of moving parts which is nice.

Categories: CCIE, Multicast Tags: , , ,