
Posts Tagged ‘PIM’

Many to Many Multicast – PIM BiDir

August 9, 2015

Introduction

This post describes PIM BiDir: why it is needed and the design considerations for using it. The focus is on technology overview and design rather than configuration.

Multicast Applications

Multicast is a technology that is mainly used for one-to-many and many-to-many applications. The following are examples of applications that use or can benefit from using multicast.

One-to-many

One-to-many applications have a single sender and multiple receivers. These are examples of applications in the one-to-many model.

Scheduled audio/video: IP-TV, radio, lectures

Push media: News headlines, weather updates, sports scores

File distributing and caching: Web site content or any file-based updates sent to distributed end-user or replicating/caching sites

Announcements: Network time, multicast session schedules

Monitoring: Stock prices, security system or other real-time monitoring applications

Many-to-many

Many-to-many applications have many senders and many receivers. One-to-many applications are unidirectional and many-to-many applications are bidirectional.

Multimedia conferencing: Audio/video and whiteboard is the classic conference application

Synchronized resources: Shared distributed databases of any type

Distance learning: One-to-many lecture but with “upstream” capability where receivers can question the lecturer

Multi-player games: Many multi-player games are distributed simulations and also have chat group capabilities.

Overview of PIM

PIM has different implementations to handle the above applications. There are three main implementations of PIM: PIM Any Source Multicast (ASM), PIM Source Specific Multicast (SSM), and PIM Bidirectional (BiDir).

PIM ASM

PIM ASM was the first implementation and is well suited for one-to-many applications. ASM means that traffic from any source to a group will be delivered to the receiver(s). PIM ASM uses the concept of a Rendezvous Point Tree (RPT) and a Shortest Path Tree (SPT). The RPT is the tree built from the receiver towards a Rendezvous Point (RP). The tree from a multicast source to a receiver is called the SPT. Before the receiver can learn the source and build the SPT, the RP will have sent a PIM Join towards the source to build the SPT between the source and the RP. In the mroute table, RPT state is shown as (*,G) and SPT state is shown as (S,G).

[Figure: PIM1]

The responsibilities of the RP are:

  • Receive PIM Register messages from the First Hop Router (FHR) and send Register Stop
  • Join the SPT and the RPT so the receivers get traffic and find out the source of the multicast

Initially traffic flows through the RP, but there is a more efficient path. When the Last Hop Router (LHR) starts receiving the multicast, it will switch over to the SPT. The SPT is a more optimal path and will (likely) introduce lower delay between the source and the receiver.

[Figure: PIM2]

PIM ASM can support both one-to-many and many-to-many applications, since it can use both the SPT and the RPT. To prevent the LHR from switching to the SPT, the ip pim spt-threshold command can be used. It can either be set to switch over at a certain traffic rate (kbps) or be set to infinity to always stay on the RPT. This can be combined with an ACL so that certain groups always stay on the RPT while others switch over; PIM ASM can therefore use the SPT for some groups and the RPT for others (a configuration sketch follows the list below). There are still drawbacks to PIM ASM, a few of which are mentioned here:

  • Complex protocol state with Register messages
  • Redundancy requires the use of MSDP
  • Any source can send, which opens an attack vector for DoS and allows traffic from spoofed sources
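Although this post stays away from full configurations, here is a minimal IOS-style sketch of the spt-threshold usage described above. The group range and ACL number are illustrative assumptions:

! Groups matching ACL 10 always stay on the RPT; all other
! groups switch to the SPT using the default behavior
access-list 10 permit 239.1.0.0 0.0.255.255
ip pim spt-threshold infinity group-list 10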

PIM SSM

PIM SSM was created to work better with one-to-many flows than PIM ASM. In PIM SSM there is no complex handling of state, and there is only an SPT, no RPT. That also means that there is no need for an RP. PIM SSM is much easier to set up and use, but it does require clients to support IGMPv3, so that the IGMP Report can contain which source the receiver wants to receive the traffic from. Since there is no RP, there also has to be some way for the receiver to learn which sources send to which groups; this has to be handled by some form of Out Of Band (OOB) mechanism. The most common use for SSM is IP-TV, where the Set Top Box (STB) receives a list of sources and groups by contacting a server.
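As an illustration, enabling SSM on IOS can be as simple as the following sketch, assuming the default SSM range 232.0.0.0/8 and an arbitrary interface name:

ip multicast-routing
! Treat the default SSM range 232.0.0.0/8 as SSM (no RP, no RPT)
ip pim ssm default
!
interface GigabitEthernet0/1
 ip pim sparse-mode
 ! IGMPv3 is required so receivers can signal (S,G) memberships
 ip igmp version 3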

The drawback of PIM SSM is that (S,G) state is created, requiring more memory. Depending on the number of sources, this may or may not be a factor.

PIM BiDir

Bidirectional PIM was created to work better with many-to-many applications. PIM BiDir uses only the RPT and no SPT, which means that there has to be an RP. With bidirectional PIM, however, the RP does not perform any of the functions it has in PIM ASM, such as sending Register Stop messages or joining the SPT; remember, in PIM BiDir there is no SPT. The RP in PIM BiDir does not even have to be a physical device, since the RP is not performing any control plane functions. It is simply a way of forwarding traffic the right way; think of it as a vector. The RP can be a physical device, and in that case it is a normal RP, just without the responsibilities of an RP as we know it from PIM ASM. When configuring PIM BiDir with redundant RPs, the RP is sometimes called a Phantom RP, because it does not have to reside on a physical device.

PIM BiDir is often used in “hoot n holler” and financial applications. PIM BiDir and PIM SSM are at different ends of the spectrum, while PIM ASM can serve both types of applications.

PIM uses the concept of Reverse Path Forwarding (RPF) to ensure loop-free forwarding. RPF ensures that traffic comes in on the interface that would be used to send traffic out towards the source. PIM BiDir can send traffic both up and down the RPT. This is not normally supported with RPF alone; to support it, PIM BiDir elects a Designated Forwarder (DF) on each segment, even point-to-point segments. The main responsibility of the DF is to forward traffic upstream towards the RP. The DF is elected based on the metric towards the RP, essentially building a tree along the best path without having to install any (S,G) state. RPF is still used to find the appropriate path towards the Rendezvous Point Link (RPL), but it is the DF mechanism that ensures loop-free forwarding.

RP Considerations

In PIM BiDir there is no MSDP; since PIM BiDir does not use (S,G) state, this is expected. To provide a redundant RP in PIM BiDir, a Phantom RP is used. The Phantom RP is a virtual RP that is not assigned to a physical device; it is often implemented by having two routers use a loopback with different subnet mask lengths.

[Figure: PIM3]

The routers are assigned the RP address of 192.0.2.1, which is the Phantom RP. The actual routers that the traffic will flow through have been assigned 192.0.2.2 and 192.0.2.3, but with different net mask lengths. Normal best path rules will then forward traffic along the longest prefix match, which will be towards RP1 when it is available and towards RP2 when RP1 is not. It is important not to configure the RP address as a physical interface address, since this would break the redundancy: if a router were configured with the real address, it would not forward the traffic, because the traffic would be destined for one of its own addresses.
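Here is a hedged sketch of the Phantom RP addressing described above. The interface names are assumptions, and the OSPF point-to-point network type ensures the loopback subnets are advertised with their configured masks:

! RP1 - /30 mask, the longest match covering 192.0.2.1
interface Loopback0
 ip address 192.0.2.2 255.255.255.252
 ip pim sparse-mode
 ip ospf network point-to-point
!
! RP2 - /29 mask, less specific, only used when RP1's /30 disappears
interface Loopback0
 ip address 192.0.2.3 255.255.255.248
 ip pim sparse-mode
 ip ospf network point-to-point
!
! On all routers, point to the phantom address for the BiDir groups
ip pim bidir-enable
ip pim rp-address 192.0.2.1 bidir

Traffic to 192.0.2.1 follows the /30 as long as RP1 advertises it and falls back to the /29 towards RP2 when it disappears.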

Since the RP is so critical, redundancy must be provided. All traffic passes through the RP, which means that certain links in the network may have to carry a lot of the traffic. For this reason it can be necessary to have several RPs acting as RPs for different multicast groups. The placement of the RP also becomes very important, since traffic must flow through it.

PIM BiDir Considerations

PIM BiDir uses the DF mechanism, and for the election to succeed, all the PIM routers on the segment must support PIM BiDir; otherwise the DF election will fail and PIM BiDir will not be supported on the segment. It is still possible to have non-BiDir routers on a segment if a PIM neighbor filter is implemented so that PIM adjacencies are not formed with those routers. That way PIM BiDir can be implemented gradually in the network.
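A neighbor filter is configured per interface. A minimal sketch, where the ACL number and the address of the non-BiDir router are assumptions:

! Do not form a PIM adjacency with the router that lacks BiDir support
access-list 20 deny 192.0.2.100
access-list 20 permit any
!
interface GigabitEthernet0/1
 ip pim sparse-mode
 ip pim neighbor-filter 20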

Closing Thoughts

PIM ASM supports all multicast models but at the cost of complexity. One could say that it’s a jack of all trades that does not excel at anything. PIM SSM is less complex and the best choice for one-to-many applications if the receivers support IGMPv3. PIM BiDir is best suited for many-to-many applications and keeps the least state of all the PIM implementations. Every PIM implementation has its use case, and as an architect/designer it’s your job to know all the models and pick the best one based on business requirements.


IPv6 Multicast

July 14, 2015

These are my notes on IPv6 multicast for the CCDE exam.

Overview

  • Prefix FF00::/8 reserved for multicast
  • Multicast Listener Discovery (MLD) replaces IGMP
    • MLD is part of ICMPv6
    • MLDv1 equivalent to IGMPv2
    • MLDv2 equivalent to IGMPv3
  • ASM, SSM and Bidir supported
  • PIM identified by IPv6 next header 103
  • BSR and static RP supported
  • No support for MSDP
    • Anycast supported through PIM, defined in RFC4610
  • Any Source Multicast (ASM)
    • PIM-SM, PIM-BiDir
    • Default for generic multicast and unicast prefix-based multicast
    • Starts with FF3x::/12
  • Source Specific Multicast (SSM)
    • PIM-SSM
    • FF3X::/32 is allocated for SSM by IANA
    • Currently the prefix and plen fields are zero, so FF3X::/96 is usable for SSM
  • Embedded RP groups
    • PIM-SM, PIM-BIDir
    • Starts with FF70::/12

IPv6 Multicast Addressing

The IPv6 multicast address format includes variable bits that define what type of address it is and what the scope of the multicast group is. The scope can be:

1 – Node

2 – Link

3 – Subnet

4 – Admin

5 – Site

8 – Organization

E – Global

The flags define whether embedded RP is used, whether the address is based on a unicast prefix, and whether the address is IANA-assigned or temporary. The unicast-prefix-based IPv6 multicast address allows an organization to create globally unique IPv6 multicast groups based on its unicast prefixes. This is similar to GLOP addressing in IPv4 but does not require an Autonomous System Number (ASN). IPv6 also allows embedding the RP address into the multicast address itself. This provides a static RP-to-multicast-group mapping mechanism and can be used to provide interdomain IPv6 multicast, as there is no MSDP in IPv6. When using Ethernet, the destination MAC address of the frame starts with 33:33 and the remaining 32 bits consist of the low-order 32 bits of the IPv6 multicast address.
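As a worked example of the MAC mapping, using an arbitrary group address:

IPv6 multicast group:   FF3E::1234:5678
Low-order 32 bits:      12:34:56:78
Destination MAC:        33:33:12:34:56:78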

Well Known Multicast Addresses

FF02::1 – All Nodes

FF02::2 – All Routers

FF02::5 – OSPF All Routers

FF02::6 – OSPF DR Routers

FF02::A – EIGRP Routers

FF02::D – PIM Routers

Neighbor Solicitation and DAD

IPv6 also uses multicast to replace ARP, through the Neighbor Solicitation (NS) process. For this, the solicited-node multicast address is used: the prefix is FF02::1:FF00:0/104 and the last 24 bits are taken from the low-order 24 bits of the IPv6 unicast address. If Host A needs the MAC of Host B, Host A sends the NS to the solicited-node multicast address of Host B. IPv6 also performs Duplicate Address Detection (DAD) to check that no one else is using the same IPv6 address, and this also uses the solicited-node multicast address: if Host A is checking the uniqueness of its own IPv6 address, the message is sent to the solicited-node multicast address of Host A.
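A worked example, using an arbitrary unicast address:

IPv6 unicast address:      2001:DB8::2AA:FF:FE28:9C5A
Low-order 24 bits:         28:9C5A
Solicited-node address:    FF02::1:FF28:9C5A
Destination MAC:           33:33:FF:28:9C:5A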

Multicast Listener Discovery (MLD)

  • MLDv1 messages
    • Listener Query
    • Listener Report
    • Listener Done
  • MLDv2 messages
    • Listener Query
    • Listener Report

MLDv2 does not use a specific Done message (the equivalent of the Leave message in IGMP). Instead, a host either stops sending Reports or sends a Report that excludes the source it was previously interested in.

Protocol Independent Multicast (PIM) for IPv6

  • PIM-SM (RP is required)
    • Many-to-many applications (multiple sources, single group)
    • Uses shared tree initially but may switch to source tree
  • PIM-BiDir (RP is required)
    • Bidirectional many-to-many applications (hosts can be sources and receivers)
    • Only uses shared tree, less state
  • PIM-SSM
    • One-to-many applications (single source, single group)
    • Always uses source tree
    • Source must be learned through an out-of-band mechanism

Anycast RP

IPv6 does not have support for MSDP, but it can support anycast RP through PIM, which implements this feature (RFC 4610). All the RPs doing anycast use the same IPv6 address, but they also require a unique IPv6 address that is used to relay the PIM Register messages coming from the multicast sources. An RP-set is defined with the RPs that should be included in the anycast RP, and the PIM Register messages are relayed to all the RPs defined in the RP-set. If a PIM Register message comes from an IPv6 address that is defined in the RP-set, the Register will not be relayed further, which is a form of split horizon to prevent looping of control plane messages. When an RP relays a PIM Register, this is done from its unique IPv6 address, which is similar to how MSDP works.

Sources will find the RP based on the unicast metric, as is normally done when implementing anycast RP. If an RP goes offline, messages will be routed to the next RP, which now has the best metric.
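Here is a hedged sketch of the addressing side of this on IOS. The RP-set and Register-relay configuration is platform dependent and omitted here, and all addresses are assumptions:

! Both RPs share the anycast address and keep a unique address
! that is used when relaying PIM Register messages
interface Loopback0
 ipv6 address 2001:DB8::1/128
!
interface Loopback1
 ipv6 address 2001:DB8::11/128
!
! All routers in the domain point to the shared anycast RP address
ipv6 pim rp-address 2001:DB8::1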

Interdomain Multicast

These are my thoughts on interdomain multicast, since there is no MSDP for IPv6. Embedded RP can be used, which means that the other organization needs to use your RP. Define an RP prefix that is used for interdomain multicast only, or use a prefix that is also used internally but implement a data plane filter to drop requests for groups that should not cross organizational boundaries. This could also be done by filtering on the scope of the multicast address.

Another option would be to run anycast RP with the other organization, but this could get a lot messier unless an RP is defined for only the set of groups that are used for interdomain multicast. Each side would then have an RP defined for the groups, and PIM Register messages would be relayed. The drawback is that both sides could have sources, while the policy may be that only one side should have sources and the other side only listeners. This would be difficult to implement in a data plane filter. It might be possible to solve in the control plane by defining which sources the RP will allow to Register.

If using SSM, there is no need for an RP, which makes it easier to implement interdomain multicast. There is always the consideration of joining two PIM domains, but this could be solved by using static joins at the edge and implementing data plane filtering. Interdomain multicast is not implemented a lot, and it requires some thought not to merge into one failure domain and one administrative domain.

Final Thoughts

Multicast is used a lot in IPv6; it is more tightly integrated into the protocol than in IPv4, and it’s there whether you see it or not. The addressing, flags, and scope can be a bit confusing at first, but they allow multicast to be used in a better way in IPv6 than in IPv4.


Next Generation Multicast – NG-MVPN

April 10, 2015

Introduction

Multicast is a great technology that, although it provides great benefits, is seldom deployed. It’s a lot like IPv6 in that regard. Service providers or enterprises that run MPLS and want to provide multicast services have not been able to use MPLS to transport that multicast. Instead, multicast has typically been delivered using Draft Rosen, an mGRE-based technology. This post starts with a brief overview of Draft Rosen.

Draft Rosen

Draft Rosen uses GRE as an overlay protocol. That means that all multicast packets are encapsulated inside GRE. A virtual LAN is emulated by having all PE routers in the VPN join a multicast group. This is known as the default Multicast Distribution Tree (MDT). The default MDT is used for PIM hellos and other PIM signaling, but also for data traffic. If the source sends a lot of traffic, it is inefficient to use the default MDT, and a data MDT can be created instead. The data MDT will only include PEs that have receivers for the group in use.

[Figure: Rosen1]

Draft Rosen is fairly simple to deploy and works well but it has a few drawbacks. Let’s take a look at these:

  • Overhead – GRE adds 24 bytes of overhead to each packet. Compared to MPLS, which typically adds 8 or 12 bytes, that is 100% or more extra overhead per packet
  • PIM in the core – Draft Rosen requires PIM in the core because the PEs must join the default and/or data MDT, which is done through PIM signaling. If PIM ASM is used in the core, an RP is needed as well; if PIM SSM is run in the core, no RP is needed
  • Core state – Unnecessary state is created in the core due to the PIM signaling from the PEs. The core should have as little state as possible
  • PIM adjacencies – The PEs become PIM neighbors with each other. In a large VPN with a lot of PEs, a lot of PIM adjacencies will be created. This generates a lot of hellos and other signaling, adding to the burden of the routers
  • Unicast vs multicast – Unicast forwarding uses MPLS while multicast uses GRE. This adds complexity and means that unicast uses a different forwarding mechanism than multicast, which is not optimal
  • Inefficiency – The default MDT sends traffic to all PEs in the VPN regardless of whether a PE has a receiver in the (*,G) or (S,G) for the group in use

Based on this list, it is clear that there is room for improvement. The things we are looking to achieve with another solution are:

  • Shared control plane with unicast
  • Less protocols to manage in the core
  • Shared forwarding plane with unicast
  • Only use MPLS as encapsulation
  • Fast Reroute (FRR)

NG-MVPN

To be able to build multicast Label Switched Paths (LSPs) we need to provide these labels in some way. There are three main options to provide these labels today:

  • Multipoint LDP (mLDP)
  • RSVP-TE P2MP
  • Unicast MPLS + Ingress Replication (IR)

mLDP is an extension to the familiar Label Distribution Protocol (LDP). It supports both P2MP and MP2MP LSPs and is defined in RFC 6388.

RSVP-TE is an extension to the unicast RSVP-TE which some providers use today to build LSPs as opposed to LDP. It is defined in RFC 4875.

Unicast MPLS uses unicast and no additional signaling in the core. It does not use a multipoint LSP.

Multipoint LSP

Normal unicast forwarding through MPLS uses a point-to-point LSP, which is not efficient for multicast. To overcome this, multipoint LSPs are used instead. There are two different types: point-to-multipoint (P2MP) and multipoint-to-multipoint (MP2MP).

[Figure: P2MP1 – P2MP LSP]

  • Replication of traffic in core
  • Allows only the root of the P2MP LSP to inject packets into the tree
  • If signaled with mLDP – Path based on IP routing
  • If signaled with RSVP-TE – Constraint-based/explicit routing. RSVP-TE also supports admission control

[Figure: MP2MP1 – MP2MP LSP]

  • Replication of traffic in core
  • Bidirectional
  • All the leafs of the LSP can inject and receive packets from the LSP
  • Signaled with mLDP
  • Path based on IP routing

Core Tree Types

Depending on the number of sources and where the sources are located, different type of core trees can be used. If you are familiar with Draft Rosen, you may know of the default MDT and the data MDT.

[Figure: Coretree1 – core tree types]

Signalling the Labels

As mentioned previously, there are three main ways of signalling the labels. We will start by looking at mLDP; a configuration sketch follows the list below.

  • LSPs are built from the leaf to the root
  • Supports P2MP and MP2MP LSPs
    • mLDP with MP2MP provides great scalability advantages for “any to any” topologies
      • “any to any” communication applications:
        • mVPN supporting bidirectional PIM
        • mVPN Default MDT model
        • If a provider does not want tree state per ingress PE source
  • Supports Fast Reroute (FRR) via RSVP-TE unicast backup path
  • No periodic signaling, reliable using TCP
  • Control plane is P2MP or MP2MP
  • Data plane is P2MP
  • Scalable due to receiver driven tree building
  • Supports MP2MP
  • Does not support traffic engineering
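A hedged sketch of a default MDT mLDP mVPN on IOS XE follows; the VRF name, RD/RT values, and root address are assumptions, and the exact syntax varies by release and profile:

vrf definition CUST-A
 rd 65000:1
 address-family ipv4
  route-target export 65000:1
  route-target import 65000:1
  ! MP2MP LSP rooted at 192.0.2.1 replaces the GRE default MDT
  mdt default mpls mldp 192.0.2.1
  ! Allow up to 100 data MDTs for high-rate (S,G) flows
  mdt data mpls mldp 100
 exit-address-family
!
! Some platforms require the 'distributed' keyword here
ip multicast-routing vrf CUST-A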

RSVP-TE can be used as well with the following characteristics.

  • LSPs are built from the head-end to the tail-end
  • Supports only P2MP LSPs
  • Supports traffic engineering
    • Bandwidth reservation
    • Explicit routing
    • Fast Reroute (FRR)
  • Signaling is periodic
  • P2P technology at control plane
    • Inherits P2P scaling limitations
  • P2MP at the data plane
    • Packet replication in the core

RSVP-TE will mostly be interesting for SPs that are already running RSVP-TE for unicast or for SPs involved in video delivery. The following table shows a comparison of the different protocols.

[Figure: Core protocols comparison]

Assigning Flows to LSPs

After the LSPs have been signalled, we need to get traffic onto the LSPs. This can be done in several different ways.

  • Static
  • PIM
    • RFC 6513
  • BGP Customer Multicast (C-Mcast)
    • RFC 6514
    • Also describes Auto-Discovery
  • mLDP inband signaling
    • RFC 6826

Static

  • Mostly applicable to RSVP-TE P2MP
  • Static configuration of multicast flows per LSP
  • Allows aggregation of multiple flows in a single LSP

PIM

  • Dynamically assigns flows to an LSP by running PIM over the LSP
  • Works over MP2MP and P2MP LSP types
  • Mostly used but not limited to default MDT
  • No changes needed to PIM
  • Allows aggregation of multiple flows in a single LSP

BGP Auto-Discovery

  • Auto-Discovery
    • The process of discovering all the PEs with members in a given mVPN
  • Used to establish the MDT in the SP core
  • Can also be used to discover the set of PEs interested in a given customer multicast group (to enable S-PMSI creation)
    • S-PMSI = Data MDT
  • Used to advertise the address of the originating PE and tunnel attribute information (which kind of tunnel)

BGP MVPN Address Family

  • MP-BGP extensions to support the mVPN address family (a configuration sketch follows this list)
  • Used for advertisement of AD routes
  • Used for advertisement of C-mcast routes (*,G) and (S,G)
  • Two new extended communities
    • VRF route import – Used to import mcast routes, similar to RT for unicast routes
    • Source AS – Used for inter-AS mVPN
  • New BGP attributes
    • PMSI Tunnel Attribute (PTA) – Contains information about advertised tunnel
    • PPMP label attribute – Upstream generated label used by the downstream clients to send unicast messages towards the source
  • If the mVPN address family is not used, the address family ipv4 mdt must be used
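A minimal sketch of enabling the mVPN address family on IOS XE; the AS number and neighbor address are assumptions:

router bgp 65000
 neighbor 192.0.2.10 remote-as 65000
 neighbor 192.0.2.10 update-source Loopback0
 !
 address-family ipv4 mvpn
  neighbor 192.0.2.10 activate
 exit-address-family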

BGP Customer Multicast

  • BGP Customer Multicast (C-mcast) signalling on overlay
  • Tail-end driven updates are not a natural fit for BGP
    • BGP is more suited for one-to-many than many-to-one
  • PIM is still the PE-CE protocol
  • Easy to use with SSM
  • Complex to understand and troubleshoot for ASM

MLDP Inband Signaling

  • Multicast flow information encoded in the mLDP FEC
  • Each customer mcast flow creates state on the core routers
    • Scaling is the same as with default MDT with every C-(S,G) on a Data MDT
  • IPv4 and IPv6 multicast in global or VPN context
  • Typical for SSM or PIM sparse mode sources
  • IPTV walled garden deployment
  • RFC 6826

The natural choice is to stick with PIM unless you need very high scalability. Here is a comparison of PIM and BGP.

[Figure: Slide1 – PIM vs BGP comparison]

BGP C-Signaling

  • With C-PIM signaling on default MDT models, data needs to be monitored
    • On default/data tree to detect duplicate forwarders over MDT and to trigger the assert process
    • On default MDT to perform SPT switchover (from (*,G) to (S,G))
  • On default MDT models with C-BGP signaling
    • There is only one forwarder on MDT
      • There are no asserts
    • The BGP type 5 routes are used for SPT switchover on PEs
  • Type 4 leaf AD route used to track type 3 S-PMSI (Data MDT) routes
  • Needed when RR is deployed
  • If the source PE sets the leaf-info-required flag on type 3 routes, the receiver PE responds with a type 4 route

Migration

If PIM is used in the core, it can be migrated to mLDP. PIM can also be migrated to BGP. This can be done per multicast source, per multicast group, or per source ingress router, which means that the migration can be done gradually so that not all core trees must be replaced at the same time.

It is also possible to have both mGRE and MPLS encapsulation in the network for different PEs.

To summarize the different options for assigning flows to LSPs:

  • Static
    • Mostly applicable to RSVP-TE
  • PIM
    • Well known, has been in use since mVPN introduction over GRE
  • BGP A-D
    • Useful where head-end assigns the flows to the LSP
  • BGP C-mcast
    • Alternative to PIM in mVPN context
    • May be required in dual vendor networks
  • MLDP inband signaling
    • Method to stitch a PIM tree to a mLDP LSP without any additional signaling

Optimizing the MDT

There are some drawbacks to the normal operation of the MDT. The tree is signalled even if there is no customer traffic, leading to unnecessary state in the core. To overcome these limitations there is a model called the partitioned MDT, running over mLDP, with the following characteristics.

  • Dynamic version of default MDT model
  • MDT is only built when customer traffic needs to be transported across the core
  • It addresses issues with the default MDT model
    • Optimizes deployments where sources are located in a few sites
    • Supports anycast sources
      • Default MDT would use PIM asserts
    • Reduces the number of PIM neighbors
      • PIM neighborship is unidirectional – The egress PE sees ingress PEs as PIM neighbors

Conclusion

There are many different profiles supported, currently 27 on Cisco equipment. Here are some guidelines to help you select a profile for NG-MVPN.

  • Label Switched Multicast (LSM) provides unified unicast and multicast forwarding
  • Choosing a profile depends on the application and scalability/feature requirements
  • MLDP is the natural and safe choice for general purpose
    • Inband signalling is for walled garden deployments
    • Partitioned MDT is most suitable if there are few sources/few sites
    • P2MP TE is used for bandwidth reservation and video distribution (few source sites)
    • Default MDT model is for anyone (else)
  • PIM is still used as the PE-CE protocol towards the customer
  • PIM or BGP can be used as an overlay protocol unless inband signaling or static mapping is used
  • BGP is the natural choice for high scalability deployments
    • BGP may be the natural choice if already using it for Auto-Discovery
  • The beauty of NG-MVPN is that profile can be selected per customer/VPN
    • Even per source, per group or per next-hop can be done with Routing Policy Language (RPL)

This post was heavily inspired by, and is basically a summary of, the Cisco Live session BRKIPM-3017 mVPN Deployment Models by Ijsbrand Wijnands and Luc De Ghein. I recommend that you read it for more details and configuration of NG-MVPN.


HSRP Aware PIM

February 13, 2015

In environments that require redundancy towards clients, HSRP is normally running. HSRP is a proven protocol and it works, but how do we handle clients that need multicast? What triggers multicast to converge when the Active Router (AR) goes down? The following topology is used:

[Figure: PIM1 – topology]

One thing to notice here is that R3 is the PIM DR even though R2 is the HSRP AR. The network has been set up with OSPF and PIM, and R1 is the RP. Both R2 and R3 will receive IGMP Reports, but only R3 will send PIM Joins, due to it being the PIM DR. R3 builds the (*,G) towards the RP:

R3#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 02:54:15/00:02:20, RP 1.1.1.1, flags: SJC
  Incoming interface: Ethernet0/0, RPF nbr 13.13.13.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:25:59/00:02:20

We then ping 239.0.0.1 from the multicast source to build the (S,G):

S1#ping 239.0.0.1 re 3
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 239.0.0.1, timeout is 2 seconds:

Reply to request 0 from 10.0.0.10, 35 ms
Reply to request 1 from 10.0.0.10, 1 ms
Reply to request 2 from 10.0.0.10, 2 ms

The (S,G) has been built:

R3#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 02:57:14/stopped, RP 1.1.1.1, flags: SJC
  Incoming interface: Ethernet0/0, RPF nbr 13.13.13.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:28:58/00:02:50

(41.41.41.10, 239.0.0.1), 00:02:03/00:00:56, flags: JT
  Incoming interface: Ethernet0/0, RPF nbr 13.13.13.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:02:03/00:02:50

The unicast and multicast topologies are not currently congruent; this may or may not be important. What happens when R3 fails?

R3(config)#int e0/2
R3(config-if)#sh
R3(config-if)#

No replies to the pings come in until PIM on R2 detects that R3 is gone and takes over the DR role; this will take between 60 and 90 seconds with the default timers in use.

S1#ping 239.0.0.1 re 100 ti 1
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 239.0.0.1, timeout is 1 seconds:

Reply to request 0 from 10.0.0.10, 18 ms
Reply to request 1 from 10.0.0.10, 2 ms....................................................................
.......
Reply to request 77 from 10.0.0.10, 10 ms
Reply to request 78 from 10.0.0.10, 1 ms
Reply to request 79 from 10.0.0.10, 1 ms
Reply to request 80 from 10.0.0.10, 1 ms

We can increase the DR priority on R2 to make it become the DR.

R2(config-if)#ip pim dr-priority 50  
*Feb 13 12:42:45.900: %PIM-5-DRCHG: DR change from neighbor 10.0.0.3 to 10.0.0.2 on interface Ethernet0/2

HSRP Aware PIM is a feature that first appeared in IOS 15.3(1)T and makes the HSRP AR become the PIM DR. It will also send PIM messages from the virtual IP, which is useful in situations where you have a router with a static route towards a Virtual IP (VIP). This is how Cisco describes the feature:

HSRP Aware PIM enables multicast traffic to be forwarded through the HSRP active router (AR), allowing PIM to leverage HSRP redundancy, avoid potential duplicate traffic, and enable failover, depending on the HSRP states in the device. The PIM designated router (DR) runs on the same gateway as the HSRP AR and maintains mroute states.

In my topology, I am running HSRP towards the clients, so even though this feature sounds like a perfect fit, it will not help me in converging my multicast. Let’s configure this feature on R2:

R2(config-if)#ip pim redundancy HSRP1 hsrp dr-priority 100
R2(config-if)#
*Feb 13 12:48:20.024: %PIM-5-DRCHG: DR change from neighbor 10.0.0.3 to 10.0.0.2 on interface Ethernet0/2

R2 is now the PIM DR, R3 will now see two PIM neighbors on interface E0/2:

R3#sh ip pim nei e0/2
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.0.0.1          Ethernet0/2              00:00:51/00:01:23 v2    0 / S P G
10.0.0.2          Ethernet0/2              00:07:24/00:01:23 v2    100/ DR S P G

R2 now has the (S,G), and we can see that it was the Assert winner (the A flag on the outgoing interface) because R3 was previously sending multicast onto the LAN segment.

R2#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 00:20:31/stopped, RP 1.1.1.1, flags: SJC
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:16:21/00:02:35

(41.41.41.10, 239.0.0.1), 00:00:19/00:02:40, flags: JT
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:00:19/00:02:40, A

What happens when R2’s LAN interface goes down? Will R3 become the DR? And how fast will it converge?

R2(config)#int e0/2
R2(config-if)#sh

HSRP changes to Active on R3, but the PIM DR role does not converge until the PIM neighbor has timed out (3x the query interval).

*Feb 13 12:51:44.204: HSRP: Et0/2 Grp 1 Redundancy "hsrp-Et0/2-1" state Standby -> Active
R3#sh ip pim nei e0/2
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.0.0.1          Ethernet0/2              00:04:05/00:00:36 v2    0 / S P G
10.0.0.2          Ethernet0/2              00:10:39/00:00:36 v2    100/ DR S P G
R3#
*Feb 13 12:53:02.013: %PIM-5-NBRCHG: neighbor 10.0.0.2 DOWN on interface Ethernet0/2 DR
*Feb 13 12:53:02.013: %PIM-5-DRCHG: DR change from neighbor 10.0.0.2 to 10.0.0.3 on interface Ethernet0/2
*Feb 13 12:53:02.013: %PIM-5-NBRCHG: neighbor 10.0.0.1 DOWN on interface Ethernet0/2 non DR

We lose a lot of packets while waiting for PIM to converge:

S1#ping 239.0.0.1 re 100 time 1
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 239.0.0.1, timeout is 1 seconds:

Reply to request 0 from 10.0.0.10, 5 ms
Reply to request 0 from 10.0.0.10, 14 ms...................................................................
Reply to request 68 from 10.0.0.10, 10 ms
Reply to request 69 from 10.0.0.10, 2 ms

Reply to request 70 from 10.0.0.10, 1 ms

HSRP Aware PIM didn’t really help us here… so when is it useful? Let’s use the following topology instead:

[Figure: PIM2 – topology with R5]

The router R5 has been added, and the receiver now sits behind R5 instead. R5 does not run a routing protocol with R2 and R3, only static routes pointing at the RP and the multicast source:

R5(config)#ip route 1.1.1.1 255.255.255.255 10.0.0.1
R5(config)#ip route 41.41.41.0 255.255.255.0 10.0.0.1

Without HSRP Aware PIM, the RPF check towards the VIP would fail, because PIM would peer with the physical addresses. With the feature enabled, R5 sees three neighbors on the segment, one of which is the VIP:

R5#sh ip pim nei
PIM Neighbor Table
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,
      P - Proxy Capable, S - State Refresh Capable, G - GenID Capable
Neighbor          Interface                Uptime/Expires    Ver   DR
Address                                                            Prio/Mode
10.0.0.2          Ethernet0/0              00:03:00/00:01:41 v2    100/ DR S P G
10.0.0.1          Ethernet0/0              00:03:00/00:01:41 v2    0 / S P G
10.0.0.3          Ethernet0/0              00:03:00/00:01:41 v2    1 / S P G

R2 will be the one forwarding multicast during normal conditions, due to it being the PIM DR (derived from its HSRP Active state):

R2#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 00:02:12/00:02:39, RP 1.1.1.1, flags: S
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:02:12/00:02:39

Let’s try a ping from the source:

S1#ping 239.0.0.1 re 3
Type escape sequence to abort.
Sending 3, 100-byte ICMP Echos to 239.0.0.1, timeout is 2 seconds:

Reply to request 0 from 20.0.0.10, 1 ms
Reply to request 1 from 20.0.0.10, 2 ms
Reply to request 2 from 20.0.0.10, 2 ms

The ping works and R2 has the (S,G):

R2#sh ip mroute 239.0.0.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry, E - Extranet,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report, 
       Z - Multicast Tunnel, z - MDT-data group sender, 
       Y - Joined MDT-data group, y - Sending to MDT-data group, 
       G - Received BGP C-Mroute, g - Sent BGP C-Mroute, 
       N - Received BGP Shared-Tree Prune, n - BGP C-Mroute suppressed, 
       Q - Received BGP S-A Route, q - Sent BGP S-A Route, 
       V - RD & Vector, v - Vector, p - PIM Joins on route
Outgoing interface flags: H - Hardware switched, A - Assert winner, p - PIM Join
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.0.0.1), 00:04:18/00:03:29, RP 1.1.1.1, flags: S
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:04:18/00:03:29

(41.41.41.10, 239.0.0.1), 00:01:35/00:01:24, flags: T
  Incoming interface: Ethernet0/0, RPF nbr 12.12.12.1
  Outgoing interface list:
    Ethernet0/2, Forward/Sparse, 00:01:35/00:03:29

What happens when R2 fails?

R2#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R2(config)#int e0/2
R2(config-if)#sh
R2(config-if)#
S1#ping 239.0.0.1 re 200 ti 1
Type escape sequence to abort.
Sending 200, 100-byte ICMP Echos to 239.0.0.1, timeout is 1 seconds:

Reply to request 0 from 20.0.0.10, 9 ms
Reply to request 1 from 20.0.0.10, 2 ms
Reply to request 1 from 20.0.0.10, 11 ms....................................................................
......................................................................
............................................................

The pings time out because when the PIM Join from R5 comes in, R3 does not realize that it should process the Join.

*Feb 13 13:20:13.236: PIM(0): Received v2 Join/Prune on Ethernet0/2 from 10.0.0.5, not to us
*Feb 13 13:20:32.183: PIM(0): Generation ID changed from neighbor 10.0.0.2

As it turns out, the PIM redundancy command must be configured on the secondary router as well for it to process PIM Joins sent to the VIP.

R3(config-if)#ip pim redundancy HSRP1 hsrp dr-priority 10

After this has been configured, the incoming Join will be processed. R3 also triggers R5 to send a new Join, because the GenID in the PIM Hello is set to a new value.

*Feb 13 13:59:19.333: PIM(0): Matched redundancy group VIP 10.0.0.1 on Ethernet0/2 Active, processing the Join/Prune, to us
*Feb 13 13:40:34.043: PIM(0): Generation ID changed from neighbor 10.0.0.1

After configuring this, the multicast converges as fast as HSRP allows. I’m using BFD in this scenario.

The key concepts for understanding HSRP Aware PIM are:

  • Initially configuring PIM redundancy on the AR will make it the DR
  • PIM redundancy must be configured on the secondary router as well, otherwise it will not process PIM Joins to the VIP
  • The PIM DR role does not converge until the PIM hellos have timed out; the secondary router will process the Joins, though, so the multicast will converge

This feature is not very well documented, so I hope this post has taught you a bit about how it really works. Note that this feature does not help when you have receivers on an HSRP LAN, because the DR role is NOT moved until the PIM adjacency expires.


The Tale of the Mysterious PIM Prune

December 15, 2014

Christmas is lurking around the corner, and in the spirit of Denise “Fish” Fishburne, I give you “The Tale of the Mysterious PIM Prune”.

I have been working a lot with multicast lately which is also why I’ve blogged about it. To start off this story, let’s begin with a network topology.

[Figure: Topology1]

The multicast source is located in AS 65000, which contains two routers connected to the multicast source. The routers run BFD, OSPF, iBGP, and PIM internally, and the RP is located on C1. There is a local receiver in AS 65000 and a remote one in AS 64512. The networks 10.0.1.0/24 and 10.0.21.0/24 come off the same physical interface. If you want to replicate this lab, all the configs are provided here.

This network requires fast convergence, and I have been troubleshooting a scenario where the active multicast router (R1) has its LAN interface go down, meaning that the traffic from the source must come in via R2. In this scenario I have seen convergence take up to 60 seconds, which is not acceptable. The BGP design is for R2 to still exit via R1 if the link towards C1 is available. The picture below shows the normal multicast flow.

[Figure: Normal flow]

When R1 has its LAN interface go down, the traffic will pass from R2 over the link to R1 and out to C1.

[Figure: Backup flow]

R1 and R2 have a default route learned via BGP that points at C1; this will be an important piece of the puzzle later. Let’s go through, step by step, what happens when R1 has its LAN interface go down. To simulate the multicast traffic I have an Ubuntu machine acting as the source. The receivers are CSR1000v routers running debug ip icmp, so I can see how often the traffic is coming in. I’m sending ICMP packets every 100 ms.

*Dec 14 19:48:08.922: HSRP: Gi3.1 Interface DOWN

The interface on R1 goes down.

*Dec 14 19:48:08.926: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 19:48:08.926: RT: interface GigabitEthernet3.1 removed from routing table
*Dec 14 19:48:08.926: RT: del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.926: RT: delete subnet route to 10.0.1.0/24
*Dec 14 19:48:08.926: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 19:48:08.927: RT(multicast): interface GigabitEthernet3.1 removed from routing table
*Dec 14 19:48:08.927: RT(multicast): del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.927: RT(multicast): delete subnet route to 10.0.1.0/24
*Dec 14 19:48:08.927: RT: del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.927: RT: delete subnet route to 10.0.1.2/32
*Dec 14 19:48:08.927: RT(multicast): del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 19:48:08.927: RT(multicast): delete subnet route to 10.0.1.2/32

It takes roughly 5 ms to remove the route from the RIB and the MRIB.

*Dec 14 19:48:08.935: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
*Dec 14 19:48:08.936: RT: updating bgp 10.0.21.0/24 (0x0)  :
    via 10.0.255.2   0 1048577

R1 starts to install the route to the source via R2 into the RIB, but the route is not yet installed! This is a key concept.

*Dec 14 19:48:08.937: MRT(0): Delete GigabitEthernet1/239.0.0.1 from the olist of (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:08.937: MRT(0): Reset the PIM interest flag for (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:08.938: MRT(0): set min mtu for (10.0.1.10, 239.0.0.1) 1500->1500
*Dec 14 19:48:08.938: MRT(0): (10.0.1.10,239.0.0.1), RPF change from GigabitEthernet3.1/0.0.0.0 to GigabitEthernet1/10.0.11.1
*Dec 14 19:48:08.938: MRT(0): Reset the F-flag for (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:08.938: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.11.1's queue
*Dec 14 19:48:08.938: PIM(0): Building Join/Prune packet for nbr 10.0.11.1
*Dec 14 19:48:08.938: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 19:48:08.938: PIM(0): Send v2 join/prune to 10.0.11.1 (GigabitEthernet1)
*Dec 14 19:48:08.938: RT: updating bgp 10.0.1.0/24 (0x0)  :
    via 10.0.255.2   0 1048577

R1 cleans up some multicast state and then sends a PIM Join towards C1, but the source is not located in that direction! The giveaway here is the message about the RPF change and 0.0.0.0, which is the default route pointing towards C1. This default route is already installed in the RIB, so the PIM Join is sent out the RPF interface, which for a brief period happens to be towards C1. The receiver is still located in that direction though.

*Dec 14 19:48:08.938: RT: closer admin distance for 10.0.1.0, flushing 1 routes
*Dec 14 19:48:08.938: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
*Dec 14 19:48:08.939: MRT(0): (10.0.1.10,239.0.0.1), RPF change from GigabitEthernet1/10.0.11.1 to GigabitEthernet2/10.0.112.2
*Dec 14 19:48:08.939: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.112.2's queue
*Dec 14 19:48:08.939: PIM(0): Insert (10.0.1.10,239.0.0.1) prune in nbr 10.0.11.1's queue
*Dec 14 19:48:08.939: PIM(0): Building Join/Prune packet for nbr 10.0.11.1
*Dec 14 19:48:08.939: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Prune
*Dec 14 19:48:08.939: PIM(0): Send v2 join/prune to 10.0.11.1 (GigabitEthernet1)
*Dec 14 19:48:08.939: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 19:48:08.939: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 19:48:08.939: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)

R1 then installs the route via R2 and tries to cover its tracks by pruning off the interface towards C1. Hold on a second though: isn’t that where the receiver is located? Indeed it is. This means the (S,G) tree is broken until C1 sends a periodic Join, which could take 0-60 seconds depending on when the last Join came in. The arrows on the topology show the events in order: first a Join towards C1, then a Prune towards C1, then a Join towards R2.

[Figure: Join and Prune]

*Dec 14 19:48:09.932: PIM(0): Insert (10.0.1.10,239.0.0.1) prune in nbr 10.0.112.2's queue - deleted
*Dec 14 19:48:09.932: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 19:48:09.932: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Prune
*Dec 14 19:48:09.932: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)

R1 no longer has any receivers interested in the traffic; it pruned off the only remaining receiver located towards C1, so it sends a Prune towards R2.

*Dec 14 19:48:50.864: PIM(0): Received v2 Join/Prune on GigabitEthernet1 from 10.0.11.1, to us
*Dec 14 19:48:50.864: PIM(0): Join-list: (10.0.1.10/32, 239.0.0.1), S-bit set
*Dec 14 19:48:50.864: MRT(0): WAVL Insert interface: GigabitEthernet1 in (10.0.1.10,239.0.0.1) Successful
*Dec 14 19:48:50.864: MRT(0): set min mtu for (10.0.1.10, 239.0.0.1) 18010->1500
*Dec 14 19:48:50.864: MRT(0): Add GigabitEthernet1/239.0.0.1 to the olist of (10.0.1.10, 239.0.0.1), Forward state - MAC built
*Dec 14 19:48:50.864: PIM(0): Add GigabitEthernet1/10.0.11.1 to (10.0.1.10, 239.0.0.1), Forward state, by PIM SG Join
*Dec 14 19:48:50.864: MRT(0): Add GigabitEthernet1/239.0.0.1 to the olist of (10.0.1.10, 239.0.0.1), Forward state - MAC built
*Dec 14 19:48:50.864: MRT(0): Set the PIM interest flag for (10.0.1.10, 239.0.0.1)
*Dec 14 19:48:50.864: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.112.2's queue
*Dec 14 19:48:50.864: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 19:48:50.864: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 19:48:50.864: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)

C1 is still interested in this multicast traffic, so it sends a periodic Join towards R1, which then triggers R1 to send a Join towards R2. There was roughly a 40-second delay between these events. This can also be seen from the ping I had running.

*Dec 14 19:48:08.842: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:08.942: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:09.042: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.054: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.158: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.255: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 19:48:51.355: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0

This story shows how the unicast routing table and a race condition can affect your multicast traffic. What would a story be without a happy ending, though? What can we do to solve the race condition?

The key here is that the default route is already installed in the RIB. To beat it, we have to put a longer match into the RIB or into the MRIB. This can be done with either a static unicast route or a static multicast route. I prefer to use a static multicast route, since that has no effect on unicast traffic.

ip mroute 10.0.1.0 255.255.255.0 10.0.112.2

or

ip route 10.0.1.0 255.255.255.0 10.0.112.2

The connected route is the best route until R1 has its LAN interface go down. The MRIB will then use the next-longest match, which is 10.0.1.0 via 10.0.112.2. Let’s run another test now that we have altered the MRIB.

*Dec 14 20:29:03.128: HSRP: Gi3.1 Interface going DOWN

Interface goes down.

*Dec 14 20:29:03.142: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 20:29:03.142: RT: interface GigabitEthernet3.1 removed from routing table
*Dec 14 20:29:03.142: RT: del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.142: RT: delete subnet route to 10.0.1.0/24
*Dec 14 20:29:03.143: is_up: GigabitEthernet3.1 0 state: 6 sub state: 1 line: 1
*Dec 14 20:29:03.143: RT(multicast): interface GigabitEthernet3.1 removed from routing table
*Dec 14 20:29:03.143: RT(multicast): del 10.0.1.0 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.143: RT(multicast): delete subnet route to 10.0.1.0/24
*Dec 14 20:29:03.143: RT: del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.143: RT: delete subnet route to 10.0.1.2/32
*Dec 14 20:29:03.143: RT(multicast): del 10.0.1.2 via 0.0.0.0, connected metric [0/0]
*Dec 14 20:29:03.143: RT(multicast): delete subnet route to 10.0.1.2/32

Remove 10.0.1.0/24 from the RIB and MRIB.

*Dec 14 20:29:03.150: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]
*Dec 14 20:29:03.150: RT: updating bgp 10.0.21.0/24 (0x0)  :
    via 10.0.255.2   0 1048577
*Dec 14 20:29:03.150: RT: add 10.0.21.0/24 via 10.0.255.2, bgp metric [200/0]

Start installing the BGP route.

*Dec 14 20:29:03.152: MRT(0): (10.0.1.10,239.0.0.1), RPF change from GigabitEthernet3.1/0.0.0.0 to GigabitEthernet2/10.0.112.2
*Dec 14 20:29:03.152: MRT(0): Reset the F-flag for (10.0.1.10, 239.0.0.1)
*Dec 14 20:29:03.152: PIM(0): Insert (10.0.1.10,239.0.0.1) join in nbr 10.0.112.2's queue
*Dec 14 20:29:03.152: PIM(0): Building Join/Prune packet for nbr 10.0.112.2
*Dec 14 20:29:03.152: PIM(0):  Adding v2 (10.0.1.10/32, 239.0.0.1), S-bit Join
*Dec 14 20:29:03.152: PIM(0): Send v2 join/prune to 10.0.112.2 (GigabitEthernet2)
*Dec 14 20:29:03.153: RT: updating bgp 10.0.1.0/24 (0x0)  :
    via 10.0.255.2   0 1048577
*Dec 14 20:29:03.153: RT: closer admin distance for 10.0.1.0, flushing 1 routes
*Dec 14 20:29:03.153: RT: add 10.0.1.0/24 via 10.0.255.2, bgp metric [200/0]

The RPF interface changes, but now it points towards the correct destination, which is R2. The multicast traffic can flow again. How many packets did we lose?

*Dec 14 20:29:02.017: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.115: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.216: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.317: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.418: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.519: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.620: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.720: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.820: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:02.920: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.021: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.122: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.223: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.324: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.424: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.526: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.626: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.727: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.828: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0
*Dec 14 20:29:03.927: ICMP: echo reply sent, src 10.0.32.10, dst 10.0.1.10, topology BASE, dscp 0 topoid 0

No packets lost! I would probably have to send packets more often than every 100 ms to catch the tree converging here. In a real network you would see some delay, because pulling a cable is different from shutting down an interface; carrier delay would come into play here. Here are some key concepts you should take away from this post.

  • Unicast routing table will impact the multicast routing table
  • Never assume anything, verify
  • Installing routes takes time

There are a lot of moving parts here. If you don’t understand it all at once, don’t worry; it’s a complex scenario, and don’t be afraid to ask questions in the comments section. Here are the key steps of convergence in this scenario.

  • Interface goes down (0-2 seconds)
  • Remove stale route (14 ms)
  • Install new route (3 ms)
  • RPF change/PIM Join (2 ms)

These are values from my tests; you will likely see higher values on production equipment. Because PIM Joins are triggered by the routing table changing, convergence can be very fast once the failure is detected. As the values above show, you can realistically achieve convergence somewhere between 0.5 and 3 seconds depending on how aggressively you tune your timers.
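To make the timer tuning concrete, here is a minimal sketch on Cisco IOS of the knobs involved: carrier delay to detect the link failure faster, and BFD to speed up IGP convergence. The interface name, OSPF process ID and timer values are examples only and need to fit your environment:

interface GigabitEthernet2
 ! React to loss of carrier immediately instead of waiting out the default delay
 carrier-delay msec 0
 ! Subsecond failure detection for the IGP; PIM reacts as soon as the RIB changes
 bfd interval 300 min_rx 300 multiplier 3
!
router ospf 1
 ! Register all OSPF interfaces with BFD
 bfd all-interfaces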

After reading this post you should have a better understanding of multicast, the RIB, the MRIB and the impact the unicast routing table has on multicast flows.

Categories: Multicast Tags: , , ,

Using EEM to Speed up Multicast Convergence when Receiver is Dually Connected

November 22, 2014 3 comments

When deploying PIM ASM, the Designated Router (DR) role plays a significant part in how PIM ASM works. The DR on a segment is responsible for registering multicast sources with the Rendezvous Point (RP) and/or sending PIM Joins for the segment. Routers with PIM-enabled interfaces send out PIM Hello messages every 30 seconds by default.

EEM1

After missing three Hellos, the secondary router will take over as the DR. With the standard timer value this can take between 60 and 90 seconds, depending on when the last Hello came in. That is not really acceptable in a modern network.

The first thought is to lower the PIM query interval, which can be done; PIM supports sending Hellos at msec level. In my particular case I needed convergence within two seconds, so I tuned the PIM query interval to 500 msec, meaning that the PIM DR role should converge within 1.5 seconds. The problem is that these Hellos are processed at process level. Even though my routers were barely breaking a sweat CPU-wise, I would see PIM adjacencies flapping.
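For reference, that tuning is a single interface command. A sketch, with the interface name chosen to match the configuration shown later in this post:

interface GigabitEthernet0/2.100
 ! Send PIM Hellos every 500 msec; the DR role fails over after
 ! three missed Hellos, i.e. within roughly 1.5 seconds
 ip pim query-interval 500 msec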

The answer to my problems would be Bidirectional Forwarding Detection (BFD) for PIM, but it’s only supported on a very limited set of platforms. I already have BFD running for OSPF and BGP, but unfortunately it’s not supported for PIM here. The advantage of BFD is that the Hellos are more lightweight and are processed at interrupt level instead of process level. This provides more deterministic behavior than the regular PIM Hellos.

So how did I solve my problem? I needed something that detects failures fast, which means BFD. Hot Standby Router Protocol (HSRP) detects failures, and HSRP has support for BFD. I could therefore use HSRP to detect the failure and act on the Syslog message generated by HSRP. Even though I didn’t really need HSRP on that segment, it helped me move the PIM DR role via the Embedded Event Manager (EEM) applet I wrote for this. A thank you to Peter Paluch for providing this idea and support 🙂

The configuration of the interface is this:

interface GigabitEthernet0/2.100
 description *** Receiver LAN ***   
 encapsulation dot1Q 100
 ip address 10.0.100.3 255.255.255.0
 no ip redirects
 no ip unreachables
 no ip proxy-arp
 ip pim sparse-mode
 standby version 2
 standby 1 ip 10.0.100.1
 standby 1 preempt delay reload 180
 standby 1 name HSRP-1
 bfd interval 300 min_rx 300 multiplier 3
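Depending on the IOS release, HSRP may need to be explicitly told to register with BFD. To my knowledge the peering comes up automatically on most versions, but if it doesn’t, the following interface command enables it:

standby bfd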

BFD sends Hellos every 300 msec, so failure detection happens within 900 msec. The key is then to find the Syslog message that HSRP generates when it detects a failure. These messages look like this:

%HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Standby -> Active
%HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Speak -> Standby

It is then possible to write an EEM applet acting on this message and setting the DR priority on the secondary router.

event manager applet CHANGE-DR-UP-RECEIVER
 event syslog pattern "%HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Standby -> Active"
 action 1.0 syslog msg "Changing DR on interface Gi0/2.100 due to AR is DOWN"
 action 1.1 cli command "enable"
 action 1.2 cli command "conf t"
 action 1.3 cli command "interface gi0/2.100"
 action 1.4 cli command "ip pim dr-priority 100"
 action 1.5 cli command "end"

When HSRP has detected the failure, the EEM applet triggers very quickly and sets the priority.

116072: Nov 20 13:03:04.544 UTC: %HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Standby -> Active
116080: Nov 20 13:03:04.552 UTC: %HA_EM-6-LOG: CHANGE-DR-UP-RECEIVER : DEBUG(cli_lib) : : CTL : cli_open called.
116120: Nov 20 13:03:04.604 UTC: PIM(0): Changing DR for GigabitEthernet0/2.100, from 10.0.100.2 to 10.0.100.3 (this system)
116121: Nov 20 13:03:04.604 UTC: %PIM-5-DRCHG: DR change from neighbor 10.0.100.2 to 10.0.100.3 on interface GigabitEthernet0/2.100

It took 60 msec from HSRP detecting the failure through BFD until the DR role had converged. It’s then possible to recover from a failure within a second.

It’s also important to set the DR priority back once the primary router returns and the network converges. We use another applet for this:

event manager applet CHANGE-DR-DOWN-RECEIVER
 event syslog pattern "%HSRP-5-STATECHANGE: GigabitEthernet0/2.100 Grp 1 state Speak -> Standby"
 action 1.0 syslog msg "Changing DR on interface Gi0/2.100 due to AR is UP"
 action 1.1 cli command "enable"
 action 1.2 cli command "conf t"
 action 1.3 cli command "interface gi0/2.100"
 action 1.4 cli command "no ip pim dr-priority 100"
 action 1.5 cli command "end"

This works very well. There are some considerations when running EEM, though. Firstly, if you are running AAA, the EEM applet will fail authorization. This can be bypassed with the following command:

event manager applet CHANGE-DR-UP-RECEIVER authorization bypass

It’s also important to note that the EEM applet will use a VTY line when executing, so make sure that there are VTYs available when the applet runs.
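An alternative to bypassing authorization, and a way to make sure a VTY is free, is sketched below; the username and VTY range are examples, not taken from the original setup:

! Let EEM run its CLI session as a user known to AAA,
! instead of bypassing authorization entirely
event manager session cli username eem-user
!
! Size the VTY pool so operators plus the EEM session always fit
line vty 0 15
 transport input ssh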

After the PIM DR role has converged, the router will send out a PIM Join and the multicast will start flowing to the receiver.

Categories: Multicast Tags: , , ,

Lessons Learned from Deploying Multicast

November 10, 2014 2 comments

Lately I have been working a lot with multicast, which is fun and challenging! Even if you have a good understanding of multicast, unless you work with it regularly there are concepts that fade from memory or that you only run into in real life and not in the lab. Here is a summary of some things I’ve noticed so far.

PIM Register

PIM Register messages are control plane messages sent from the First Hop Router (FHR) towards the Rendezvous Point (RP). These are unicast messages encapsulating the multicast traffic from the multicast source. There are some considerations here. Firstly, because these packets are sent from the FHR control plane to the RP control plane, they are not subject to any access list configured outbound on the FHR. I had a situation where I wanted to route the multicast locally but not send it outbound.

PIM Register 1


Even if the ACL approach had worked, care would have to be taken not to break the control plane between the FHR and the RP, or all multicast traffic for the group would be in jeopardy.
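As an aside, while outbound data plane ACLs on the FHR won’t match the Registers, the RP can control which sources are allowed to register at all. A sketch using ip pim accept-register on the RP, where the ACL name and addresses are examples:

ip access-list extended REGISTER-FILTER
 ! Allow registers only for sources in 10.0.1.0/24; everything else is denied
 permit ip 10.0.1.0 0.0.0.255 any
!
ip pim accept-register list REGISTER-FILTER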

PIM Register messages are control plane messages, which means that the RP has to process them in the control plane, taxing the CPU. Depending on the rate at which the FHR is sending to the RP and the number of sources, this can be very stressful for the CPU. As a safeguard, the following command can be implemented:


ip pim register-rate-limit 20000

This command is applied on FHRs and limits the rate at which PIM Register messages are sent. By default there is no limit; set it to something that makes sense in your environment.

Storm Control

If you have switches in your multicast environment, and most likely you will, implement storm control. If a loop forms, you don’t want an unlimited amount of broadcast and multicast traffic flooding your layer 2 domain. Combined with PIM Register traffic, this can be a real killer for the control plane if your FHR is trying to register sources at a very high packet rate.


storm-control broadcast level pps 100

storm-control multicast level pps 1k

The above is just an example; you have to set it to something that fits your environment. Make sure to leave some room for more traffic than expected, but not so much that it hurts your devices if something goes wrong.

S,G Timeout

PIM Any Source Multicast (ASM) relies on an RP when setting up the flow between the multicast sender and receiver. The receiver will first join the (*,G) tree, which is rooted at the RP. After the receiver learns of the source, it can switch over to the source tree (S,G). The (S,G) mroute in the Multicast Routing Information Base (MRIB) has a standard lifetime of 180 seconds. It can be beneficial to raise this timeout depending on the topology. Look at the following topology:

Timeout


If something happens to the source, making it go away for three minutes, the (S,G) state will time out. Let’s then say that the source comes back but the RP is not available; the FHR will then not be able to register the source, and no traffic can flow between the source and the receiver. If a higher timeout was configured for the (S,G), the traffic would start flowing again when the source came back online. It’s not a very common scenario, but it can be a reasonable safeguard for important multicast groups. The drawback of configuring this is that you will keep state for a longer time even when it is not needed.


ip pim sparse sg-expiry-timer <value>

The maximum timeout is 57600 seconds, which is 16 hours. Setting it to a couple of hours may cut you some slack if something happens to the RP. Be careful if you have a lot of groups running, though.
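For example, a two-hour (S,G) lifetime would look like this (the value is just an illustration):

ip pim sparse sg-expiry-timer 7200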

These are some important aspects but certainly not all. What lessons have you learned from deploying multicast?


Categories: Multicast Tags: , , , ,