Posts Tagged ‘BGP’

Busting Myths – IPv6 Link Local Next Hop into BGP

August 30, 2015 2 comments

In some publications it is mentioned that a link local next-hop can’t be used when redistributing routes into BGP because routers receiving the route will not know what to do with the next-hop. That is one of the reason why HSRPv2 got support for global IPv6 addresses. One such scenario is described in this link.

The topology used for this post is the following.


I have just setup enough of the topology to prove that it works with the next-hop, so I won’t be running any pings and so on. The routers R1 and R2 have a static route for the network behind R3 and R4.

ipv6 route 2001:DB8:100::/48 GigabitEthernet0/1 FE80::5:73FF:FEA0:1

When routing towards a link local address, the exit interface must be specified. R1 then runs BGP towards R5, notice that I’m not using next-hop-self.

router bgp 100
bgp router-id
bgp log-neighbor-changes
neighbor 2001:DB8:1::5 remote-as 100
address-family ipv6
redistribute static
neighbor 2001:DB8:1::5 activate

If we look in the BGP RIB, we can see that the route is installed with a link local next-hop.

R1#sh bgp ipv6 uni
BGP table version is 2, local router ID is
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  2001:DB8:100::/48
                                                0         32768 ?

What next-hop do we have at R5 though?

R5#sh bgp ipv6 uni
BGP table version is 10, local router ID is
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>i 2001:DB8:100::/48
                       2001:DB8:1::1            0    100      0 ?

We see the next-hop of R1 and not the link local address. How did this happen? We aren’t using next-hop-self. If we debug at R1, we will see what happens.

R1#debug ip bgp updates
R1#debug ip bgp ipv6 uni
*Aug 30 06:19:15.863: BGP(1): 2001:DB8:1::5 NEXT_HOP part 1 net 2001:DB8:100::/48, 
next FE80::5:73FF:FEA0:1
*Aug 30 06:19:15.863: BGP(1): Can't advertise 2001:DB8:100::/48 to 2001:DB8:1::5 
with NEXT_HOP FE80::5:73FF:FEA0:1
*Aug 30 06:19:15.863: BGP(1): (base) 2001:DB8:1::5 send UPDATE (format) 
2001:DB8:100::/48, next 2001:DB8:1::1, metric 0, path Local

We can see that BGP was going to advertise it with the link local next-hop but then realized that this would not work. It then replaced the link local next-hop with a global next-hop.

While it may have been true at some point that routes must point to a global next-hop, this does not hold true in modern code. BGP will automatically advertise its updates with a global next-hop.

Categories: BGP, IPv6 Tags: , , ,

Next Generation Multicast – NG-MVPN

April 10, 2015 Leave a comment


Multicast is a great technology that although it provides great benefits, is seldomly deployed. It’s a lot like IPv6 in that regard. Service providers or enterprises that run MPLS and want to provide multicast services have not been able to use MPLS to provide multicast  Multicast has then typically been delivered by using Draft Rosen which is a mGRE technology to provide multicast. This post starts with a brief overview of Draft Rosen.

Draft Rosen

Draft Rosen uses GRE as an overlay protocol. That means that all multicast packets will be encapsulated inside GRE. A virtual LAN is emulated by having all PE routers in the VPN join a multicast group. This is known as the default Multicast Distribution Tree (MDT). The default MDT is used for PIM hello’s and other PIM signaling but also for data traffic. If the source sends a lot of traffic it is inefficient to use the default MDT and a data MDT can be created. The data MDT will only include PE’s that have receivers for the group in use.


Draft Rosen is fairly simple to deploy and works well but it has a few drawbacks. Let’s take a look at these:

  • Overhead – GRE adds 24 bytes of overhead to the packet. Compared to MPLS which typically adds 8 or 12 bytes there is 100% or more of overhead added to each packet
  • PIM in the core – Draft Rosen requires that PIM is enabled in the core because the PE’s must join the default and or data MDT which is done through PIM signaling. If PIM ASM is used in the core, an RP is needed as well. If PIM SSM is run in the core, no RP is needed.
  • Core state – Unneccessary state is created in the core due to the PIM signaling from the PE’s. The core should have as little state as possible
  • PIM adjacencies – The PE’s will become PIM neighbors with each other. If it’s a large VPN and a lot of PE’s, a lot of PIM adjacencies will be created. This will generate a lot of hello’s and other signaling which will add to the burden of the router
  • Unicast vs multicast – Unicast forwarding uses MPLS, multicast uses GRE. This adds complexity and means that unicast is using a different forwarding mechanism than multicast, which is not the optimal solution
  • Inefficency – The default MDT sends traffic to all PE’s in the VPN regardless if the PE has a receiver in the (*,G) or (S,G) for the group in use

Based on this list, it is clear that there is a room for improvement. The things we are looking to achieve with another solution is:

  • Shared control plane with unicast
  • Less protocols to manage in the core
  • Shared forwarding plane with unicast
  • Only use MPLS as encapsulation
  • Fast Restoration (FRR)


To be able to build multicast Label Switched Paths (LSPs) we need to provide these labels in some way. There are three main options to provide these labels today:

  • Multipoint LDP(mLDP)
  • Unicast MPLS + Ingress Replication(IR)

MLDP is an extension to the familiar Label Distribution Protocol (LDP). It supports both P2MP and MP2MP LSPs and is defined in RFC 6388.

RSVP-TE is an extension to the unicast RSVP-TE which some providers use today to build LSPs as opposed to LDP. It is defined in RFC 4875.

Unicast MPLS uses unicast and no additional signaling in the core. It does not use a multipoint LSP.

Multipoint LSP

Normal unicast forwarding through MPLS uses a point to point LSP. This is not efficient for multicast. To overcome this, multipoint LSPs are used instead. There are two different types, point to multipoint and multipoint to multipoint.


  • Replication of traffic in core
  • Allows only the root of the P2MP LSP to inject packets into the tree
  • If signaled with mLDP – Path based on IP routing
  • If signaled with RSVP-TE – Constraint-based/explicit routing. RSVP-TE also supports admission control


  • Replication of traffic in core
  • Bidirectional
  • All the leafs of the LSP can inject and receive packets from the LSP
  • Signaled with mLDP
  • Path based on IP routing

Core Tree Types

Depending on the number of sources and where the sources are located, different type of core trees can be used. If you are familiar with Draft Rosen, you may know of the default MDT and the data MDT.


Signalling the Labels

As mentioned previously there are three main ways of signalling the labels. We will start by looking at mLDP.

  • LSPs are built from the leaf to the root
  • Supports P2MP and MP2MP LSPs
    • mLDP with MP2MP provides great scalability advantages for “any to any” topologies
      • “any to any” communication applications:
        • mVPN supporting bidirectional PIM
        • mVPN Default MDT model
        • If a provider does not want  tree state per ingress PE source
  • Supports Fast Reroute (FRR) via RSVP-TE unicast backup path
  • No periodic signaling, reliable using TCP
  • Control plane is P2MP or MP2MP
  • Data plane is P2MP
  • Scalable due to receiver driven tree building
  • Supports MP2MP
  • Does not support traffic engineering

RSVP-TE can be used as well with the following characteristics.

  • LSPs are built from the head-end to the tail-end
  • Supports only P2MP LSPs
  • Supports traffic engineering
    • Bandwidth reservation
    • Explicit routing
    • Fast Reroute (FRR)
  • Signaling is periodic
  • P2P technology at control plane
    • Inherits P2P scaling limitations
  • P2MP at the data plane
    • Packet replication in the core

RSVP-TE will mostly be interesting for SPs that are already running RSVP-TE for unicast or for SPs involved in video delivery. The following table shows a comparision of the different protocols.

Core protocols

Assigning Flows to LSPs

After the LSPs have been signalled, we need to get traffic onto the LSPs. This can be done in several different ways.

  • Static
  • PIM
    • RFC 6513
  • BGP Customer Multicast (C-Mcast)
    • RFC 6514
    • Also describes Auto-Discovery
  • mLDP inband signaling
    • RFC 6826


  • Mostly applicable to RSVP-TE P2MP
  • Static configuration of multicast flows per LSP
  • Allows aggregation of multiple flows in a single LSP


  • Dynamically assigns flows to an LSP by running PIM over the LSP
  • Works over MP2MP and PPMP LSP types
  • Mostly used but not limited to default MDT
  • No changes needed to PIM
  • Allows aggregation of multiple flows in a single LSP

BGP Auto-Discovery

  • Auto-Discovery
    • The process of discovering all the PE’s with members in a given mVPN
  • Used to establish the MDT in the SP core
  • Can also be used to discover set of PE’s interested in a given customer multicast group (to enable S-PSMSI creation)
    • S-PMSI = Data MDT
  • Used to advertise address of the originating PE and tunnel attribute information (which kind of tunnel)

BGP MVPN Address Family

  • MPBGP extensions to support mVPN address family
  • Used for advertisement of AD routes
  • Used for advertisement of C-mcast routes (*,G) and (S,G)
  • Two new extended communities
    • VRF route import – Used to import mcast routes, similar to RT for unicast routes
    • Source AS – Used for inter-AS mVPN
  • New BGP attributes
    • PMSI Tunnel Attribute (PTA) – Contains information about advertised tunnel
    • PPMP label attribute – Upstream generated label used by the downstream clients to send unicast messages towards the source
  • If mVPN address family is not used the address family ipv4 mdt must be used

BGP Customer Multicast

  • BGP Customer Multicast (C-mcast) signalling on overlay
  • Tail-end driven updates is not a natural fit for BGP
    • BGP is more suited for one-to-many not many-to-one
  • PIM is still the PE-CE protocol
  • Easy to use with SSM
  • Complex to understand and troubleshoot for ASM

MLDP Inband Signaling

  • Multicast flow information encoded in the mLDP FEC
  • Each customer mcast flow creates state on the core routers
    • Scaling is the same as with default MDT with every C-(S,G) on a Data MDT
  • IPv4 and IPv6 multicast in global or VPN context
  • Typical for SSM or PIM sparse mode sources
  • IPTV walled garden deployment
  • RFC 6826

The natural choice is to stick with PIM unless you need very high scalability. Here is a comparison of PIM and BGP.


BGP C-Signaling

  • With C-PIM signaling on default MDT models, data needs to be monitored
    • On default/data tree to detect duplicate forwarders over MDT and to trigger the assert process
    • On default MDT to perform SPT switchover (from (*,G) to (S,G))
  • On default MDT models with C-BGP signaling
    • There is only one forwarder on MDT
      • There are no asserts
    • The BGP type 5 routes are used for SPT switchover on PEs
  • Type 4 leaf AD route used to track type 3 S-PMSI (Data MDT) routes
  • Needed when RR is deployed
  • If source PE sets leaf-info-required flag on type 3 routes, the receiver PE responds with with a type 4 route


If PIM is used in the core, this can be migrated to mLDP. PIM can also be migrated to BGP. This can be done per multicast source, per multicast group and per source ingress router. This means that migration can be done gradually so that not all core trees must be replaced at the same time.

It is also possible to have both mGRE and MPLS encapsulation in the network for different PE’s.

To summarize the different options for assigning flows to LSPs

  • Static
    • Mostly applicable to RSVP-TE
  • PIM
    • Well known, has been in use since mVPN introduction over GRE
  • BGP A-D
    • Useful where head-end assigns the flows to the LSP
  • BGP C-mcast
    • Alternative to PIM in mVPN context
    • May be required in dual vendor networks
  • MLDP inband signaling
    • Method to stitch a PIM tree to a mLDP LSP without any additional signaling

Optimizing the MDT

There are some drawbacks with the normal operation of the MDT. The tree is signalled even if there is no customer traffic leading to unneccessary state in the core. To overcome these limitations there is a model called the partitioned MDT running over mLDP with the following characteristics.

  • Dynamic version of default MDT model
  • MDT is only built when customer traffic needs to be transported across the core
  • It addresses issues with the default MDT model
    • Optimizes deployments where sources are located in a few sites
    • Supports anycast sources
      • Default MDT would use PIM asserts
    • Reduces the number of PIM neighbors
      • PIM neighborship is unidirectional – The egress PE sees ingress PEs as PIM neighbors


There are many many different profiles supported, currently 27 profiles on Cisco equipment. Here are some guidelines to guide you in the selection of a profile for NG-MVPN.

  • Label Switched Multicast (LSM) provides unified unicast and multicast forwarding
  • Choosing a profile depends on the application and scalability/feature requirements
  • MLDP is the natural and safe choice for general purpose
    • Inband signalling is for walled garden deployments
    • Partitioned MDT is most suitable if there are few sources/few sites
    • P2MP TE is used for bandwidth reservation and video distribution (few source sites)
    • Default MDT model is for anyone (else)
  • PIM is still used as the PE-CE protocol towards the customer
  • PIM or BGP can be used as an overlay protocol unless inband signaling or static mapping is used
  • BGP is the natural choice for high scalability deployments
    • BGP may be the natural choice if already using it for Auto-Discovery
  • The beauty of NG-MVPN is that profile can be selected per customer/VPN
    • Even per source, per group or per next-hop can be done with Routing Policy Language (RPL)

This post was heavily inspired and is basically a summary of the Cisco Live session BRKIPM-3017 mVPN Deployment Models by Ijsbrand Wijnands and Luc De Ghein. I recommend that you read it for more details and configuration of NG-MVPN.

Categories: Multicast Tags: , , ,

Noction Intelligent Routing Platform (IRP) – What is it?

January 27, 2015 6 comments

I was contacted by some people at Noction and asked if I was interested in writing about their platform, the Intelligent Routing Platform (IRP). Since it’s a product that uses Border Gateway Protocol (BGP), it peaked my interest. First let’s make the following things clear:

  • I am not being paid to write this blog post
  • My opinions can’t be bought
  • I will only write about a product if it’s something that interests me

BGP is the glue of the Internet (with DNS) and what keeps everything running. BGP is a well designed and scalable protocol which has been around for a long time. It has grown from carrying a few hundred routes to half a million routes. However, there will always be use cases where BGP might not fit your business model.

In Noction’s white paper they define the following as the network’s major challenges:

  • Meeting the customer’s demand for 100% uptime
  • Facing the low latency requirement
  • Achieving reliable data transmission
  • Avoiding network congestion and blackouts
  • Achieving consistency of throughput
  • Keeping bandwidth usage below predefined commit levels
  • Reducing the cost and time of network troubleshooting

The product is designed for multihomed networks running BGP. You can’t optimize network flows if you don’t have any other paths to switch to. Some of these challenges apply to all networks and some may be a bit more local. As an example, in Sweden (where I live), you usually pay a fixed amount for your bandwidth and you can use that all you want without going above some threshold defined by the Service Provider (SP).

So why do we have these challenges? Is it BGP’s fault? BGP has a lot of knobs but they are quite blunt tools. We need to keep in mind that BGP runs between organizations and every organization must make their own decisions on how to forward traffic. This means that there is no end to end policy to optimize the traffic flowing across these organizations.

If history has learned us anything, it is that protocols that try to keep too much state will eventually fail or hit scaling limitations. These protocols seem very intelligent and forward thinking at first but as soon as they hit large scale, the burden becomes too much. One such protcol is Resource Reservation Protocol (RSVP). BGP’s design is what has kept the Internet running for decades, this would not be the case if we were to inject all kind of metrics, latencies, jitter etc for all of the Network Layer Reachability Information (NLRI). As communities have grown more popular there could be a use case where information is tagged along as communities for the NLRI. The question is then, how often do we update the communities?

Does this mean that these are not real challenges or that there is no room for a product like Noction IRP? No, it means that unique forwarding decisions and intelligence needs to be kept at the edge of the network, not in the core. We should keep as little state as possible in the core for networks that need high availability.

How does BGP select which routes are the best? The default is to simply look at the AS-path:, the shorter AS-path, the better. Meaning that the traffic will pass through as few organizations as possible. This does not however give any consideration to how much bandwidth is available, nor takes into account latency and jitter of the path and the availability of the path.

How does this product work? The following picture shows the key components of IRP:


There is a collector that passively analyzes the traffic flowing to see which prefixes are being used the most, between which endpoints is the traffic flowing and so on. The collector can gather this data from a mirror port or preferably from Netflow/sFlow.

The Explorer will actively probe relevant prefixes for metrics such as latency, jitter and packet loss. This data is then sent to the Core.

The Core is based on the data received from the Explorer calculating improvements to optimize metrics such as latency, jitter and packet loss or the most cost effective path. These improvements are sent to the BGP daemon which will advertise BGP Updates to the edge router(s).

IRP is non-intrusive and does not sit in the data path. If IRP were to fail, traffic would fall back to their normal paths following the shortest AS-path or any other policies defined on the edge router. IRP can also act in BGP non-intrusive mode where it will report potential improvements without applying them.

If we pause here for a second, this sounds a lot like Performance Routing (PfR), doesn’t it? So what value would IRP add that PfR does not? I see mainly two benefits here. PfR may require a more senior network administrator to setup and administer, however PfR has been greatly simplified in later releases. The other main factor is the reporting through the frontend. PfR does not give you the monitoring platform, which is not to be expected of course.

When you login to the IRP you get a dashboard showing the status of the system and the number of prefixes being probed and how many of those prefixes are being improved.


In the demo, there are two service providers called “SwiftWay” and “FiberRing”. There is a graph to show how many prefixes have been rerouted to one of the providers.


There is also a list that shows you which prefixes were moved, what’s the AS number and the reason for being moved. If you do a mouseover on the flash symbol, it will show if the improvement was due to loss or latency.


There are a lot of different reports that can be generated. A nice feature is that all reports are exportable to CSV, XLS or PDF.


This report shows how loss has been improved: 75% of loss was totally avoided and 25% of loss was reduced.


There are also graphs showing top usage of traffic by AS or, as in this case, the bandwidth used per provider.


The monitoring and reports are extensive and easy to use. The IRP is certainly an interesting platform and depending on the business case it could be very useful. The main considerations would be how sensitive are you to loss and latency? How much does it cost you if you are not choosing the most optimal path? Do you trust a system to make these decisions for you? If you do, then certainly take a look at the Noction IRP.

Categories: Announcement Tags: , ,

Unique RD per PE in MPLS VPN for Load Sharing and Faster Convergence

January 11, 2015 3 comments

This post describes how load sharing and faster convergence in MPLS VPNs is possible by using an unique RD per VRF per PE. It assumes you are already familiar with MPLS but here is a quick recap.

The Route Distinguisher (RD) is used in MPLS VPNs to create unique routes. With IPv4, an IP address is 32 bits long but several customers may and probably will use the same networks. If CustomerA uses and CustomerX also uses, we must in some way make this route unique to transport it over MPBGP. The RD does exactly this by prepending a 64 bit value and together with the IPv4 address, creating a 96-bit VPNv4 prefix. This is all the RD does, it has nothing to do with the VPN in itself. It is common to create RD consisting of AS_number:VPN_identifier so that a VPN has the same RD on all PEs where it exists.

The Route Target (RT) is what defines the VPN, which routes are imported to the VPN and the topology of the VPN. These are extended communities that are tagged on to the BGP Update and transported over MPBGP.

MPLS uses labels, the transport label which is used to transport the packet through the network is generated by LDP. The VPN label which is used to make sure the packets make it to the right VPN is generated by MPBGP and can be per prefix or per VRF.

Below is a configuration snipper for creating a VRF with the newer syntax that is used.

PE1#sh run vrf
Building configuration...

Current configuration : 401 bytes
vrf definition CUST1
 address-family ipv4
  route-target export 64512:1
  route-target import 64512:1
interface GigabitEthernet1
 vrf forwarding CUST1
 ip address
 negotiation auto
router bgp 64512
 address-family ipv4 vrf CUST1
  neighbor remote-as 65000
  neighbor activate

The values for the RD and RT are defined under the VRF. Now the topology we will be using is the one below.


This topology uses a Route Reflector (RR) like most decently sized net works will to overcome the scalability limitations of a BGP full mesh. The negative part of using a RR is that we will have less routes because only the best routes will be reflected. This means that load sharing may not take place and that convergence takes longer time when a link between a PE and a CE goes down.

This diagram shows PE1 and PE2 advertising the same network to the RR. The RR then picks one as best and reflects that to PE3 (and others). This means that the path through PE2 will never be used until something happens with PE1. This is assuming that they are both using the same RD.



When PE1 loses its prefix it sends a BGP WITHDRAW to the RR, the RR then sends a WITHDRAW to PE3 and then it sends an UPDATE which is the prefix via PE2. The path via PE2 is not used until this happens. This means that load sharing is not taking place and that all traffic destined for has to converge.

If every PE is using unique RD for the VRF per PE then they become two different routes and both can be reflected by the RR. The RD is then usually written in the form PE_loopback:VPN_identifier. This also helps with troubleshooting to see where the prefix originated from.


PE3 now has two routes to in its routing table.

PE3#sh ip route vrf CUST1

Routing Table: CUST1
Routing entry for
  Known via "bgp 64512", distance 200, metric 0
  Tag 65000, type internal
  Last update from 01:10:52 ago
  Routing Descriptor Blocks:
  * (default), from, 01:10:52 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65000
      MPLS label: 17
      MPLS Flags: MPLS Required (default), from, 01:10:52 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65000
      MPLS label: 28
      MPLS Flags: MPLS Required

The PE is now doing load sharing meaning that some traffic will take the path over PE1 and some over PE2.


We have achieved load sharing and this also means that if something happens with PE1 or PE2, not all traffic will be effected. To see which path is being used from PE3 we can use the show ip cef exact-route command.

PE3#sh ip cef vrf CUST1 exact-route -> => label 17 label 16TAG adj out of GigabitEthernet1, addr
PE3#sh ip cef vrf CUST1 exact-route -> => label 28 label 17TAG adj out of GigabitEthernet1, addr

What is the drawback of using this? It consumes more memory because the prefixes are now unique, in effect doubling the required memory to store BGP Paths. The PEs have to store several copies with different RD for the prefix before it can import it into the RIB.

PE3#sh bgp vpnv4 uni all
BGP table version is 46, local router ID is
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher:
 *>i              0    100      0 65000 i
Route Distinguisher:
 *>i              0    100      0 65000 i
Route Distinguisher: (default for vrf CUST1)
 *>               0             0 65001 i
 *mi              0    100      0 65000 i
 *>i                      0    100      0 65000 i

For the multipathing to take place, PE3 must allow more than one route to be installed via BGP. This is done through the maximum-paths eibgp command.

address-family ipv4 vrf CUST1
  maximum-paths eibgp 2

In newer releases there are other features to overcome the limitation of only reflecting one route, such as BGP Add Path. This post showed the benefits of enabling unique RD for a VRF per PE to enable load sharing and better convergence. It also showed that doing so will use more memory due to having to store multiple copies of essentially the same route. Because multiple routes get installed into the FIB, that should also be a consideration depending on how large the FIB is for your platform.

Categories: BGP, MPLS Tags: , , , ,

Some pointers on OSPF as PE to CE protocol

February 23, 2014 5 comments

There was a discussion at the Cisco Learning Network (CLN) about OSPF as PE to CE
I wanted to provide some pointers on using OSPF as PE to CE protocol.

RFC 4577 describes how to use OSPF as PE to CE protocol. When using BGP to carry the
OSPF routes the MPLS backbone is seen as a super backbone. This adds another level of
hierarchy making OSPF three levels compared to the usual two when using plain OSPF.


Because the the MPLS backbone is seen as a super area 0, that means that OSPF routes
going across the MPLS backbone can never be better than type 3 summary LSA. Even if
the same area is used on both sides of the backbone and the input is a type 1 or type 2
LSA it will be advertised as a summary LSA on the other side.

LSA across superbackbone

The only way to keep the type 1 or type 2 LSAs as they are is to use a sham link.
Sham links sets up a control plane mechanism acting as a tunnel for the LSAs passing
over the MPLS backbone. Sham links are outside the scope of this article.

A LSA can never be “better” than it originally was input as. This means that if the input
to the PE isa type 3 LSA this can never be converted to a type 1 or type 2 LSA on the other
side. If the LSA was type 5 external to begin it will be sent as type 5 on the other side
as well.

To understand how the LSAs are sent over the backbone, look at this picture.


OSPF LSA is sent to PE which is running OSPF in a VRF with the CPE. The PE installs
the LSA as a route in the OSPF RIB. If the route is the best one known to the router
it can install it to the global RIB.

The PE redistributes from OSPF into BGP. Only routes that are installed as OSPF in
the RIB will be redistributed. To be able to carry OSPF specific information the PE
has to add extended communities. To make the IPv4 route a VPNv4 route the PE has
to add the RD and RT values. The OSPF specific communities consist of:


The domain ID can either be hard coded or derived from the OSPF process running.
It is used to identify if LSAs are sent into the same domain as they originated
from. If the domain ID matches then type 3 summary LSAs can be sent for routes
that were internal or inter area. If the domain ID does not match then all routes
must be sent as external.

Domain ID match

Domain ID 1

Domain ID non match

Domain ID 2

OSPF Route Type

The route type consists of area number, route type and options.

Route Type

If we look at a MPBGP update we can see the route type encoded.

R4#sh bgp vpnv4 uni rd 1:1
BGP routing table entry for 1:1:, version 5
Paths: (1 available, best #1, table cust)
Flag: 0x820
  Not advertised to any peer
  Local (metric 21) from (
      Origin incomplete, metric 11, localpref 100, valid, internal, best
      Extended Community: RT:1:1 OSPF DOMAIN ID:0x0005:0x000000020200 
      mpls labels in/out nolabel/18

Something that is a bit peculiar is that this update has a route type of 2 even though
it originated from a type 1 LSA. In the end it doesn’t make a difference because it will
be advertised as type 3 LSA to the CPE.

OSPF Router ID

The router ID of the router that originated the LSA (PE) is also carried as an extended

R4#sh bgp vpnv4 uni rd 1:1
BGP routing table entry for 1:1:, version 5
Paths: (1 available, best #1, table cust)
Flag: 0x820
  Not advertised to any peer
  Local (metric 21) from (
      Origin incomplete, metric 11, localpref 100, valid, internal, best
      Extended Community: RT:1:1 OSPF DOMAIN ID:0x0005:0x000000020200 
      mpls labels in/out nolabel/18


The MED is set to the OSPF metric + 1 as defined by the RFC.

R4#sh bgp vpnv4 uni rd 1:1
BGP routing table entry for 1:1:, version 5
Paths: (1 available, best #1, table cust)
Flag: 0x820
  Not advertised to any peer
  Local (metric 21) from (
      Origin incomplete, metric 11, localpref 100, valid, internal, best
      Extended Community: RT:1:1 OSPF DOMAIN ID:0x0005:0x000000020200 
      mpls labels in/out nolabel/18

The goal of these extended communities is to extend BGP so that OSPF LSAs can be
carried transparently as if BGP hadn’t been involved at all. LSAs are translated
to BGP updates and then translated back to LSAs.

If we look at a packet capture we can see the extended communities attached.
This BGP Update originated from a type 5 external LSA with metric-type 1.


When using OSPF as the PE to CE protocol it is important to remember the design
rules of OSPF. Because of that you should avoid designs like this:


In this design area 1 is used on both sides but the CPE is then connected to area 0
which makes it an ABR. The rules of OSPF dictate that summary LSAs must only be
received over area 0 if it is an ABR. This means this topology is broken and would
require changing area or using a virtual link.

OSPF as PE to CE protocol has some complexity but must of it is still plain OSPF
which is in itself a complicated protocol. Combine that with BGP and MPLS and
it is easy to get confused which protocol is responsible for what. That is also
one of the reasons that I recommend to use eBGP or static when customers connect
to their ISP.

Categories: BGP, MPLS, OSPF Tags: , , , , ,

Scaling PEs in MPLS VPN – Route Target Constraint (RTC)

September 23, 2013 13 comments


In any decent sized service provider or even an enterprise network running
MPLS VPN, it will most likely be using Route Reflectors (RR). As described in
a previous post iBGP fully meshed does not really scale. By default all
PEs will receive all routes reflected by the RR even if the PE does not
have a VRF configured with an import matching the route. To mitigate this
ineffecient behavior Route Target Constraint (RTC) can be configured. This
is defined in RFC 4684.

Route Target Constraint

The way this feature works is that the PE will advertise to the RR which RTs
it intends to import. The RR will then implement an outbound filter only sending
routes matching those RTs to the PE. This is much more effecient than the default
behavior. Obviously the RR still needs to receive all the routes so no filtering
is done towards the RR. To enable this feature a new Sub Address Family (SAFI) is
used called rtfilter. To show this feature we will implement the following topology.


The scenario here is that PE1 is located in a large PoP where there are already plenty
of customers. It currently has 255 customers. PE2 is located in a new PoP and so far only
one customer is connected there. It’s unneccessary for the RR to send all routes to PE2
for all of PE1 customers because it does not need them. To simulate the customers I wrote
a simple bash script to create the VRFs for me in PE1.

for i in {0..255}
   echo "ip vrf $i"
   echo "rd 1:$i"
   echo "route-target 1:$i"
   echo "interface loopback$i"
   echo "ip vrf forwarding $i"
   echo "ip address 10.0.$i.1"
   echo "router bgp 65000"
   echo "address-family ipv4 vrf $i"
   echo "network 10.0.$i.0 mask"

PE2 will not import these due to that the RT is not matching any import statements in
its only VRF that is currently configured. If we debug BGP we can see lots of messages

BGP(4): Incoming path from
BGP(4): rcvd UPDATE w/ attr: nexthop, origin i, localpref 100, 
metric 0, originator, clusterlist, extended community RT:1:104
BGP(4): rcvd 1:104:, label 120 -- DENIED due to:  extended 
community not supported;

In this case we have 255 routes but what if it was 1 million routes? That would be
a big waste of both processing power and bandwidth, not to mention that the RR would
have to format all the BGP updates. These are the benefits of enabling RTC:

  • Eliminating waste of processing power on PE and RR and waste of bandwidth
  • Less VPNv4 formatted Updates
  • BGP convergence time is reduced

Currently the RR is advertising 257 prefixes to PE2.

RR#sh bgp vpnv4 uni all neighbors advertised-routes | i Total
Total number of prefixes 257


Implementing RTC is simple. It has to be supported on both the RR and the PE though.
Add the following commands under BGP:


RR(config)#router bgp 65000
RR(config-router)#address-family rtfilter unicast
RR(config-router-af)#nei activate
RR(config-router-af)#nei route-reflector-client


PE2(config)#router bgp 65000
PE2(config-router)#address-family rtfilter unicast
PE2(config-router-af)#nei activate

The BGP session will be torn down when doing this! Now to see how many routes the RR is

RR#sh bgp vpnv4 uni all neighbors advertised-routes | i Total
Total number of prefixes 0

No prefixes! To see the rt filter in effect use this command:

RR#sh bgp rtfilter unicast all
BGP table version is 3, local router ID is
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
     0:0:0:0                                0 i
 *>i 65000:2:1:256                  0    100  32768 i

Now we add an import under the VRF in PE2 and one route should be sent.

PE2(config)#ip vrf 0
PE2(config-vrf)#route-target import 1:1
PE2#sh ip route vrf 0

Routing Table: 0
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area 
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       ia - IS-IS inter area, * - candidate default, U - per-user static route
       o - ODR, P - periodic downloaded static route, H - NHRP, l - LISP
       + - replicated route, % - next hop override

Gateway of last resort is not set is variably subnetted, 3 subnets, 2 masks
B [200/0] via, 00:00:16
C is directly connected, Loopback1
L is directly connected, Loopback1
RR#sh bgp vpnv4 uni all neighbors advertised-routes | i Total
Total number of prefixes 1 
RR#sh bgp rtfilter unicast all                                       
BGP table version is 4, local router ID is
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
     0:0:0:0                                0 i
 *>i 65000:2:1:1                  0    100  32768 i
 *>i 65000:2:1:256                  0    100  32768 i

Works as expected. From the output we can see that the AS is 65000, the extended
community type is 2 and the RT that should be exported is 1:1 and 1:256.


Route Target Constraint is a powerful feature that will lessen the load on both your
Route Reflectors and PE devices in an MPLS VPN enabled network. It can also help
with making BGP converging faster. Support is needed on both PE and RR and the BGP
session will be torn down when enabling it so it has to be done during maintenance

Categories: BGP, MPLS Tags: , , ,

iBGP – Fully meshed vs Route Reflection

September 9, 2013 4 comments


This post looks at the pros and cons with BGP Route Reflection compared to running
an iBGP full mesh.

Full mesh

Because iBGP routes are not propagated to iBGP sessions there must be a full mesh
inside the BGP network. This leads to scalability issues. For every N routers
there will be (N-1) iBGP neighbors and (N*(N-1))/2 BGP sessions. For a medium
sized ISP network with 100 routers running BGP this would be 99 iBGP neighbors
and 4950 BGP sessions in total.

Full mesh

There are 4 routers in AS 2 which gives 3 iBGP neighbors and 6 iBGP sessions in total.

Benefits of a full mesh:

  • Optimal Traffic Forwarding
  • Path Diversity
  • Convergence
  • Robustness

Optimal Traffic Forwarding:

Because all BGP speaking routers are fully meshed they will receive iBGP updates
from all peers. If no manipulating of attributes have been done then the tiebreaker
will be the metric to the next-hop (IGP) so traffic will take the optimal path.

Path Diversity:

Due to the full mesh the BGP speaking router will have multiple paths to choose
from. If it was connected to a RR it would generally only have one path, the one
the RR decided was the best.


Because the BGP speaking router has multiple paths if the current best one should fail
it can start using one of the alternate paths. Also the BGP UPDATE messages are sent
directly between the iBGP peers instead of passing through an additional router (RR)
which would have to process it and the packets would have to travel additional distance
unless the RR is located in the same PoP as the routers.


If one BGP speaking router fails then only the networks behind that router are
not reachable any longer. If a RR fails then all networks that were reachable via
clients to that RR would no longer be reachable.

Caveats of a full mesh:

  • Lack of Scalability
  • Management Overhead
  • Duplication of Information

Lack of Scalability:

Having hundreds of BGP sessions on all routers would mean a lot of BGP processing.
The number of BGP Updates coming in would be massive.
This would put a great burden on the CPU/RP of the router. For really large networks
this could potentially be more than the router can handle. In a network with 300 routers
there would be 44850 iBGP sessions. The RIB-in size would be very large because of the
large number of peers.

Management Overhead:

Adding a new device to the network means reconfiguring all the existing devices.
Configurations would be very big considering all the lines needed to setup the
full mesh.

Duplication of Information:

For every external network there could potentially be multiple paths internally
leading to using lots of RIB/FIB space on the devices. It does not make much sense
to install all paths into RIB/FIB.

Benefits of Route Reflection:

  • Scalability
  • Reduced Operational Cost
  • Reduced RIB-in Size
  • Reduced Number of BGP Updates
  • Incremental Deployability


The number of iBGP sessions needed is greatly reduced. A client only needs one session
or preferably two to have route reflector redundancy. A route reflector needs
(K*(K-1))/2 + C where K is the number of route reflectors and C is the number of
clients. The route reflectors still need to be in full mesh with each other.

Reduced Operational Cost:

With a full mesh when adding a new device it requires reconfiguring all the existing
devices. This requires operator intervention which is an added cost. With route reflection
when adding a new device only the new device and the RR it peers with needs new configuration.

Reduced RIB-in Size:

RIB-in contains the unprocessed BGP information. After processing this information
the best paths are installed into the Loc-RIB. The RIB-in grows proportionally with
the number of neighbors that the router peers with. If there is n routers and p prefixes
then the router would have a RIB-in that is of size n * p. In a full mesh n is very high
but with route reflection n is only the number of RRs that the router peers with.

Reduced Number of BGP Updates:

In a full mesh a router will receive N – 1 updates where N is the number of routers.
This is a large amount of updates. With route reflection N is small since this is
only the number of route reflectors the router peers with.

Incremental Deployability:

Route reflection does not require massive changes in the existing network like with
confederations. It can be deployed incrementally and routers can be migrated to the
RR topology gradually. Not all routers need to be moved at once.

Caveats of Route Reflection:

  • Robustness
  • Prolonged Routing Convergence
  • Potential Loops
  • Reduced Path Diversity
  • Suboptimal Routes


With a full mesh if a single router fails that only impacts the networks behind
that router. If a route reflector fails it affects all the networks that were
behind all of the route reflectors clients. To avoid single points of failure,
RRs are usually deployed in pairs.

Prolonged Routing Convergence:

In a full mesh every BGP update only travels a single hop. With route reflection
the number of hops is increased and if the route reflectors are setup in a
hierarchical topology the update could travel through several RRs. Every RR
will add some processing delay and propagation delay before the update reaches
the client.

Potential Loops:

In a topology where clients are connected to a single RR there should be no
data plane loops. When clients are connected to two RRs there is a risk
of a loop forming if the control plane topology does not match the physical
topology. Because of that it is important to try to match the two topologies.

Reduced Path Diversity:

In a full mesh if there are multiple paths to an external network then
all paths will be announced and the local router makes a decision which one
is the best. With route reflection the RR makes the decision which path is
the best and announces this path only. This leads to fewer paths being
announced which could lead to longer convergence delays.

There are drafts for announcing more than one best path which would help
with this issue. Some newer IOS releases supports this feature.

Suboptimal routes:

The RR will select a best path based on its own local routing information.
This could lead to routers using suboptimal paths because there may be
a shorter path available from a routers perspective but this is not the
path that the RR had chosen. Therefore it’s important to consider where
the RRs are placed.


This post takes a look at the benefits and caveats of a fully meshed iBGP network
vs route reflection. Although because of scalability it’s almost impossible to not
go with route reflection one should still consider the caveats of route reflection.
It’s important to consider the placement and the number of RRs in the topology.
This post is the first of posts that will focus on CCDE topics.

Categories: BGP Tags: , , ,