Archive for the ‘BGP’ Category

Busting Myths – IPv6 Link Local Next Hop into BGP

August 30, 2015

Some publications state that a link local next-hop can’t be used when redistributing routes into BGP because routers receiving the route will not know what to do with the next-hop. That is one of the reasons why HSRPv2 got support for global IPv6 addresses. One such scenario is described in this link.

The topology used for this post is the following.

Topo1

I have set up just enough of the topology to show what happens to the next-hop, so I won’t be running any pings and so on. The routers R1 and R2 each have a static route for the network behind R3 and R4.

ipv6 route 2001:DB8:100::/48 GigabitEthernet0/1 FE80::5:73FF:FEA0:1

When routing towards a link local address, the exit interface must be specified. R1 then runs BGP towards R5. Notice that I’m not using next-hop-self.

router bgp 100
bgp router-id 1.1.1.1
bgp log-neighbor-changes
neighbor 2001:DB8:1::5 remote-as 100
!
address-family ipv6
redistribute static
neighbor 2001:DB8:1::5 activate
exit-address-family

If we look in the BGP RIB, we can see that the route is installed with a link local next-hop.

R1#sh bgp ipv6 uni
BGP table version is 2, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  2001:DB8:100::/48
                       FE80::5:73FF:FEA0:1
                                                0         32768 ?

What next-hop do we have at R5 though?

R5#sh bgp ipv6 uni
BGP table version is 10, local router ID is 5.5.5.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>i 2001:DB8:100::/48
                       2001:DB8:1::1            0    100      0 ?

We see the next-hop of R1 and not the link local address. How did this happen? We aren’t using next-hop-self. If we debug at R1, we will see what happens.

R1#debug ip bgp updates
R1#debug ip bgp ipv6 uni
*Aug 30 06:19:15.863: BGP(1): 2001:DB8:1::5 NEXT_HOP part 1 net 2001:DB8:100::/48, 
next FE80::5:73FF:FEA0:1
*Aug 30 06:19:15.863: BGP(1): Can't advertise 2001:DB8:100::/48 to 2001:DB8:1::5 
with NEXT_HOP FE80::5:73FF:FEA0:1
*Aug 30 06:19:15.863: BGP(1): (base) 2001:DB8:1::5 send UPDATE (format) 
2001:DB8:100::/48, next 2001:DB8:1::1, metric 0, path Local

We can see that BGP was going to advertise it with the link local next-hop but then realized that this would not work. It then replaced the link local next-hop with a global next-hop.
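The behaviour seen in the debug can be sketched in a few lines of Python. This only illustrates the observed result and is not how IOS implements it; the function name `advertised_next_hop` and its parameters are made up for this example.

```python
import ipaddress

def advertised_next_hop(route_next_hop: str, local_global_addr: str) -> str:
    """Return the next hop to place in an UPDATE sent to a peer.

    Hypothetical helper: it mimics the result seen in the debug, where a
    link local next hop is replaced with the router's own global address.
    """
    next_hop = ipaddress.IPv6Address(route_next_hop)
    if next_hop.is_link_local:
        # A link local address only has meaning on its own link, so the
        # peer could not resolve it. Fall back to our global address.
        return str(ipaddress.IPv6Address(local_global_addr))
    return str(next_hop)

print(advertised_next_hop("FE80::5:73FF:FEA0:1", "2001:DB8:1::1"))
```

Python prints the address in lower case, but it is the same 2001:DB8:1::1 that R5 shows as next-hop.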

While it may have been true at some point that routes must point to a global next-hop, this does not hold true in modern code. BGP will automatically advertise its updates with a global next-hop.

Categories: BGP, IPv6

Unique RD per PE in MPLS VPN for Load Sharing and Faster Convergence

January 11, 2015

This post describes how load sharing and faster convergence in MPLS VPNs are possible by using a unique RD per VRF per PE. It assumes you are already familiar with MPLS, but here is a quick recap.

The Route Distinguisher (RD) is used in MPLS VPNs to create unique routes. With IPv4, an IP address is 32 bits long but several customers may, and probably will, use the same networks. If CustomerA uses 10.0.0.0/24 and CustomerX also uses 10.0.0.0/24, we must in some way make this route unique to transport it over MPBGP. The RD does exactly this: a 64-bit value is prepended to the 32-bit IPv4 address, creating a 96-bit VPNv4 prefix. This is all the RD does; it has nothing to do with the VPN in itself. It is common to create an RD consisting of AS_number:VPN_identifier so that a VPN has the same RD on all PEs where it exists.
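To make the 64 + 32 = 96 bit arithmetic concrete, here is a small Python sketch that builds the RD bytes and prepends them to an IPv4 network address. The helper names `encode_rd` and `vpnv4_key` are made up for illustration, only RD type 0 (2-byte AS:number) and type 1 (IPv4:number) are handled, and the prefix length is left out for brevity.

```python
import ipaddress
import struct

def encode_rd(rd: str) -> bytes:
    """Encode an RD written as 'ASN:number' (type 0) or 'IPv4:number' (type 1)."""
    admin, assigned = rd.rsplit(":", 1)
    if "." in admin:
        # Type 1: 2-byte type, 4-byte IPv4 administrator, 2-byte assigned number
        return struct.pack("!H4sH", 1, ipaddress.IPv4Address(admin).packed, int(assigned))
    # Type 0: 2-byte type, 2-byte AS number, 4-byte assigned number
    return struct.pack("!HHI", 0, int(admin), int(assigned))

def vpnv4_key(rd: str, prefix: str) -> bytes:
    """Prepend the 8-byte RD to the 4-byte IPv4 network address."""
    network = ipaddress.IPv4Network(prefix)
    return encode_rd(rd) + network.network_address.packed

customer_a = vpnv4_key("64512:1", "10.0.0.0/24")  # RD values are example assumptions
customer_x = vpnv4_key("64512:2", "10.0.0.0/24")
print(len(customer_a) * 8, customer_a != customer_x)
```

The same IPv4 network ends up as two different 96-bit values, which is what allows both customers to be carried in the same BGP table.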

The Route Target (RT) is what defines the VPN, which routes are imported to the VPN and the topology of the VPN. These are extended communities that are tagged on to the BGP Update and transported over MPBGP.

MPLS uses labels. The transport label, which is used to transport the packet through the network, is generated by LDP. The VPN label, which is used to make sure the packets make it to the right VPN, is generated by MPBGP and can be per prefix or per VRF.

Below is a configuration snippet for creating a VRF with the newer syntax.

PE1#sh run vrf
Building configuration...

Current configuration : 401 bytes
vrf definition CUST1
 rd 11.11.11.11:1
 !
 address-family ipv4
  route-target export 64512:1
  route-target import 64512:1
 exit-address-family
!
!
interface GigabitEthernet1
 vrf forwarding CUST1
 ip address 111.0.0.0 255.255.255.254
 negotiation auto
!
router bgp 64512
 !
 address-family ipv4 vrf CUST1
  neighbor 111.0.0.1 remote-as 65000
  neighbor 111.0.0.1 activate
 exit-address-family
!         
end

The values for the RD and RT are defined under the VRF. Now the topology we will be using is the one below.

MPLS1

This topology uses a Route Reflector (RR), like most decently sized networks will, to overcome the scalability limitations of a BGP full mesh. The negative part of using a RR is that we will have fewer routes because only the best routes will be reflected. This means that load sharing may not take place and that convergence takes longer when a link between a PE and a CE goes down.

This diagram shows PE1 and PE2 advertising the same network 10.0.10.0/24 to the RR. The RR then picks one as best and reflects that to PE3 (and others). This means that the path through PE2 will never be used until something happens with PE1. This is assuming that they are both using the same RD.

MPLS BGP1

MPLS BGP2

When PE1 loses its prefix it sends a BGP WITHDRAW to the RR. The RR then sends a WITHDRAW to PE3, followed by an UPDATE carrying the prefix via PE2. The path via PE2 is not used until this happens. This means that load sharing is not taking place and that all traffic destined for 10.0.10.0/24 has to converge.

If every PE uses a unique RD for the VRF, the two advertisements become two different VPNv4 routes and both can be reflected by the RR. The RD is then usually written in the form PE_loopback:VPN_identifier. This also helps with troubleshooting because it shows where the prefix originated.
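The effect of the RD on what the RR reflects can be sketched as a dictionary keyed on RD plus prefix. This is a deliberately simplified model: `reflect` is a made-up helper, the shared RD 64512:1 is an assumed example value, and the first path seen wins where a real RR would run the full BGP best-path algorithm.

```python
def reflect(paths):
    """Return the paths a RR would reflect, one per VPNv4 prefix (RD + prefix).

    paths is a list of (rd, prefix, next_hop) tuples.
    """
    best = {}
    for rd, prefix, next_hop in paths:
        # Simplification: keep the first path per key instead of running
        # the real best-path selection.
        best.setdefault((rd, prefix), next_hop)
    return best

same_rd = [
    ("64512:1", "10.0.10.0/24", "11.11.11.11"),  # from PE1
    ("64512:1", "10.0.10.0/24", "22.22.22.22"),  # from PE2
]
unique_rd = [
    ("11.11.11.11:1", "10.0.10.0/24", "11.11.11.11"),  # from PE1
    ("22.22.22.22:1", "10.0.10.0/24", "22.22.22.22"),  # from PE2
]
print(len(reflect(same_rd)), len(reflect(unique_rd)))
```

With the same RD the two advertisements collide on one key and only one path is reflected. With a unique RD per PE there are two keys, so both paths reach PE3.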

MPLS BGP3

PE3 now has two routes to 10.0.10.0/24 in its routing table.

PE3#sh ip route vrf CUST1 10.0.10.0 255.255.255.0

Routing Table: CUST1
Routing entry for 10.0.10.0/24
  Known via "bgp 64512", distance 200, metric 0
  Tag 65000, type internal
  Last update from 11.11.11.11 01:10:52 ago
  Routing Descriptor Blocks:
  * 22.22.22.22 (default), from 111.111.111.111, 01:10:52 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65000
      MPLS label: 17
      MPLS Flags: MPLS Required
    11.11.11.11 (default), from 111.111.111.111, 01:10:52 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1
      Route tag 65000
      MPLS label: 28
      MPLS Flags: MPLS Required

The PE is now doing load sharing meaning that some traffic will take the path over PE1 and some over PE2.

MPLS BGP4

We have achieved load sharing and this also means that if something happens with PE1 or PE2, not all traffic will be affected. To see which path is being used from PE3 we can use the show ip cef exact-route command.

PE3#sh ip cef vrf CUST1 exact-route 10.0.0.10 10.0.10.1
10.0.0.10 -> 10.0.10.1 => label 17 label 16TAG adj out of GigabitEthernet1, addr 23.23.23.0
PE3#sh ip cef vrf CUST1 exact-route 10.0.0.5 10.0.10.1 
10.0.0.5 -> 10.0.10.1 => label 28 label 17TAG adj out of GigabitEthernet1, addr 23.23.23.0
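Per-flow load sharing of this kind can be pictured as hashing the source and destination pair onto one of the installed paths. The sketch below is purely illustrative: `pick_path` is a made-up helper and CRC32 is an assumption standing in for whatever hash CEF really uses, so it will not necessarily pick the same path as the router did above.

```python
import zlib

def pick_path(src: str, dst: str, paths: list) -> str:
    """Map a (source, destination) pair onto one path, always the same one."""
    # CRC32 is only a stand-in; the real CEF hash is not reproduced here.
    index = zlib.crc32(f"{src}->{dst}".encode()) % len(paths)
    return paths[index]

paths = ["11.11.11.11", "22.22.22.22"]  # PE1 and PE2 as BGP next hops
for src in ("10.0.0.10", "10.0.0.5"):
    print(src, "->", pick_path(src, "10.0.10.1", paths))
```

The point is that a given flow always maps to the same path, so packets within a flow are not reordered, while different flows can spread over both PEs.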

What is the drawback of using this? It consumes more memory because the prefixes are now unique, in effect doubling the memory required to store BGP paths. The PEs have to store several copies of the prefix, each with a different RD, before they can import it into the RIB.

PE3#sh bgp vpnv4 uni all
BGP table version is 46, local router ID is 33.33.33.33
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 11.11.11.11:1
 *>i 10.0.10.0/24     11.11.11.11              0    100      0 65000 i
Route Distinguisher: 22.22.22.22:1
 *>i 10.0.10.0/24     22.22.22.22              0    100      0 65000 i
Route Distinguisher: 33.33.33.33:1 (default for vrf CUST1)
 *>  10.0.0.0/24      32.32.32.1               0             0 65001 i
 *mi 10.0.10.0/24     22.22.22.22              0    100      0 65000 i
 *>i                  11.11.11.11              0    100      0 65000 i

For the multipathing to take place, PE3 must allow more than one route to be installed via BGP. This is done through the maximum-paths eibgp command.

address-family ipv4 vrf CUST1
  maximum-paths eibgp 2

In newer releases there are other features to overcome the limitation of only reflecting one route, such as BGP Add Path. This post showed the benefits of enabling unique RD for a VRF per PE to enable load sharing and better convergence. It also showed that doing so will use more memory due to having to store multiple copies of essentially the same route. Because multiple routes get installed into the FIB, that should also be a consideration depending on how large the FIB is for your platform.

Categories: BGP, MPLS

Some pointers on OSPF as PE to CE protocol

February 23, 2014

There was a discussion at the Cisco Learning Network (CLN) about OSPF as PE to CE
protocol, and I wanted to provide some pointers on using it.

RFC 4577 describes how to use OSPF as PE to CE protocol. When using BGP to carry the
OSPF routes the MPLS backbone is seen as a super backbone. This adds another level of
hierarchy making OSPF three levels compared to the usual two when using plain OSPF.

Superbackbone

Because the MPLS backbone is seen as a super area 0, OSPF routes
going across the MPLS backbone can never be better than a type 3 summary LSA. Even if
the same area is used on both sides of the backbone and the input is a type 1 or type 2
LSA, it will be advertised as a summary LSA on the other side.

LSA across superbackbone

The only way to keep the type 1 or type 2 LSAs as they are is to use a sham link.
A sham link sets up a control plane mechanism acting as a tunnel for the LSAs passing
over the MPLS backbone. Sham links are outside the scope of this article.

An LSA can never be “better” than it originally was when input. This means that if the input
to the PE is a type 3 LSA, it can never be converted to a type 1 or type 2 LSA on the other
side. If the LSA was type 5 external to begin with, it will be sent as type 5 on the other side
as well.

To understand how the LSAs are sent over the backbone, look at this picture.

MPBGP

OSPF LSA is sent to PE which is running OSPF in a VRF with the CPE. The PE installs
the LSA as a route in the OSPF RIB. If the route is the best one known to the router
it can install it to the global RIB.

The PE redistributes from OSPF into BGP. Only routes that are installed as OSPF in
the RIB will be redistributed. To be able to carry OSPF specific information the PE
has to add extended communities. To make the IPv4 route a VPNv4 route the PE has
to add the RD and RT values. The OSPF specific communities consist of:

Domain-ID

The domain ID can either be hard coded or derived from the OSPF process running.
It is used to identify if LSAs are sent into the same domain as they originated
from. If the domain ID matches then type 3 summary LSAs can be sent for routes
that were internal or inter area. If the domain ID does not match then all routes
must be sent as external.
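The rules described so far can be summarised in a small decision function. This is a sketch of the behaviour as explained in this post only; `lsa_type_on_far_side` and its string route types are made up for illustration, and sham links and NSSA (type 7) are deliberately ignored.

```python
def lsa_type_on_far_side(route_type: str, domain_id_matches: bool) -> int:
    """Return the LSA type the remote PE sends to its CE.

    route_type is 'intra', 'inter' or 'external', describing how the route
    entered the local PE (hypothetical labels for this example).
    """
    if route_type == "external":
        # An external route stays external regardless of the domain ID.
        return 5
    if route_type in ("intra", "inter"):
        # Crossing the super backbone costs at least one level: the best
        # possible outcome is a type 3 summary, and only if the domain
        # ID matches. Otherwise the route is sent as external.
        return 3 if domain_id_matches else 5
    raise ValueError(f"unknown route type: {route_type}")

for route_type in ("intra", "inter", "external"):
    print(route_type, lsa_type_on_far_side(route_type, True),
          lsa_type_on_far_side(route_type, False))
```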

Domain ID match

Domain ID 1

Domain ID non match

Domain ID 2

OSPF Route Type

The route type consists of area number, route type and options.

Route Type

If we look at a MPBGP update we can see the route type encoded.

R4#sh bgp vpnv4 uni rd 1:1 1.1.1.1/32
BGP routing table entry for 1:1:1.1.1.1/32, version 5
Paths: (1 available, best #1, table cust)
Flag: 0x820
  Not advertised to any peer
  Local
    2.2.2.2 (metric 21) from 2.2.2.2 (2.2.2.2)
      Origin incomplete, metric 11, localpref 100, valid, internal, best
      Extended Community: RT:1:1 OSPF DOMAIN ID:0x0005:0x000000020200 
        OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:22.22.22.22:0
      mpls labels in/out nolabel/18

Something that is a bit peculiar is that this update has a route type of 2 even though
it originated from a type 1 LSA. In the end it doesn’t make a difference because it will
be advertised as type 3 LSA to the CPE.

OSPF Router ID

The router ID of the router that originated the LSA (PE) is also carried as an extended
community.

R4#sh bgp vpnv4 uni rd 1:1 1.1.1.1/32
BGP routing table entry for 1:1:1.1.1.1/32, version 5
Paths: (1 available, best #1, table cust)
Flag: 0x820
  Not advertised to any peer
  Local
    2.2.2.2 (metric 21) from 2.2.2.2 (2.2.2.2)
      Origin incomplete, metric 11, localpref 100, valid, internal, best
      Extended Community: RT:1:1 OSPF DOMAIN ID:0x0005:0x000000020200 
        OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:22.22.22.22:0
      mpls labels in/out nolabel/18

MED

The MED is set to the OSPF metric + 1 as defined by the RFC.


R4#sh bgp vpnv4 uni rd 1:1 1.1.1.1/32
BGP routing table entry for 1:1:1.1.1.1/32, version 5
Paths: (1 available, best #1, table cust)
Flag: 0x820
  Not advertised to any peer
  Local
    2.2.2.2 (metric 21) from 2.2.2.2 (2.2.2.2)
      Origin incomplete, metric 11, localpref 100, valid, internal, best
      Extended Community: RT:1:1 OSPF DOMAIN ID:0x0005:0x000000020200 
        OSPF RT:0.0.0.0:2:0 OSPF ROUTER ID:22.22.22.22:0
      mpls labels in/out nolabel/18

The goal of these extended communities is to extend BGP so that OSPF LSAs can be
carried transparently as if BGP hadn’t been involved at all. LSAs are translated
to BGP updates and then translated back to LSAs.

If we look at a packet capture we can see the extended communities attached.
This BGP Update originated from a type 5 external LSA with metric-type 1.

Capture

When using OSPF as the PE to CE protocol it is important to remember the design
rules of OSPF. Because of that you should avoid designs like this:

OSPF1

In this design area 1 is used on both sides but the CPE is then connected to area 0
which makes it an ABR. The rules of OSPF dictate that an ABR only uses summary LSAs
received over area 0. This means this topology is broken and would
require changing the area or using a virtual link.

OSPF as PE to CE protocol has some complexity but most of it is still plain OSPF,
which is in itself a complicated protocol. Combine that with BGP and MPLS and
it is easy to get confused about which protocol is responsible for what. That is also
one of the reasons that I recommend using eBGP or static routing when customers connect
to their ISP.

Categories: BGP, MPLS, OSPF

Scaling PEs in MPLS VPN – Route Target Constraint (RTC)

September 23, 2013

Introduction

Any decent sized service provider, or even an enterprise network running
MPLS VPN, will most likely be using Route Reflectors (RR). As described in
a previous post, an iBGP full mesh does not really scale. By default all
PEs will receive all routes reflected by the RR even if the PE does not
have a VRF configured with an import matching the route. To mitigate this
inefficient behavior Route Target Constraint (RTC) can be configured. This
is defined in RFC 4684.

Route Target Constraint

The way this feature works is that the PE will advertise to the RR which RTs
it intends to import. The RR will then implement an outbound filter, only sending
routes matching those RTs to the PE. This is much more efficient than the default
behavior. Obviously the RR still needs to receive all the routes so no filtering
is done towards the RR. To enable this feature a new Subsequent Address Family
Identifier (SAFI) called rtfilter is used. To show this feature we will implement
the following topology.

RTC
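Before going through the lab, the filtering idea itself can be sketched in Python. This is a toy model and not the RFC 4684 encoding: `rr_outbound` is a made-up helper, and the RT values mirror the lab in this post (256 VRFs on PE1 with RTs 1:0 to 1:255, and a PE2 that initially only imports 1:256).

```python
def rr_outbound(routes, wanted_rts):
    """Return the prefixes the RR sends to a PE.

    routes is a list of (prefix, set_of_rts) held by the RR, and
    wanted_rts is the set of RTs the PE has signalled that it imports.
    """
    # A route is sent only if it carries at least one RT the PE asked for.
    return [prefix for prefix, rts in routes if rts & wanted_rts]

# One route per VRF on PE1, each tagged with its own RT.
routes = [(f"10.0.{i}.0/24", {f"1:{i}"}) for i in range(256)]

print(len(rr_outbound(routes, {"1:256"})))
print(rr_outbound(routes, {"1:256", "1:1"}))
```

As long as PE2 only asks for 1:256 nothing matches, and once it also asks for 1:1 exactly one route, 10.0.1.0/24, passes the filter.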

The scenario here is that PE1 is located in a large PoP where there are already plenty
of customers. It currently has 255 customers. PE2 is located in a new PoP and so far only
one customer is connected there. It’s unnecessary for the RR to send all the routes for
PE1’s customers to PE2 because it does not need them. To simulate the customers I wrote
a simple bash script to create the VRFs for me in PE1.

#!/bin/bash
for i in {0..255}
do
   echo "ip vrf $i"
   echo "rd 1:$i"
   echo "route-target 1:$i"
   echo "interface loopback$i"
   echo "ip vrf forwarding $i"
   echo "ip address 10.0.$i.1 255.255.255.0"
   echo "router bgp 65000"
   echo "address-family ipv4 vrf $i"
   echo "network 10.0.$i.0 mask 255.255.255.0"
done

PE2 will not import these because the RT does not match any import statement in
its only VRF that is currently configured. If we debug BGP we can see lots of messages
like:

BGP(4): Incoming path from 4.4.4.4
BGP(4): 4.4.4.4 rcvd UPDATE w/ attr: nexthop 1.1.1.1, origin i, localpref 100, 
metric 0, originator 1.1.1.1, clusterlist 4.4.4.4, extended community RT:1:104
BGP(4): 4.4.4.4 rcvd 1:104:10.0.104.0/24, label 120 -- DENIED due to:  extended 
community not supported;

In this case we have 255 routes but what if it was 1 million routes? That would be
a big waste of both processing power and bandwidth, not to mention that the RR would
have to format all the BGP updates. These are the benefits of enabling RTC:

  • Eliminating waste of processing power on PE and RR and waste of bandwidth
  • Less VPNv4 formatted Updates
  • BGP convergence time is reduced

Currently the RR is advertising 257 prefixes to PE2.

RR#sh bgp vpnv4 uni all neighbors 3.3.3.3 advertised-routes | i Total
Total number of prefixes 257

Implementation

Implementing RTC is simple. It has to be supported on both the RR and the PE though.
Add the following commands under BGP:

RR:

RR(config)#router bgp 65000
RR(config-router)#address-family rtfilter unicast
RR(config-router-af)#nei 3.3.3.3 activate
RR(config-router-af)#nei 3.3.3.3 route-reflector-client

PE2:

PE2(config)#router bgp 65000
PE2(config-router)#address-family rtfilter unicast
PE2(config-router-af)#nei 4.4.4.4 activate

The BGP session will be torn down when doing this! Now to see how many routes the RR is
sending.

RR#sh bgp vpnv4 uni all neighbors 3.3.3.3 advertised-routes | i Total
Total number of prefixes 0

No prefixes! To see the rt filter in effect use this command:

RR#sh bgp rtfilter unicast all
BGP table version is 3, local router ID is 4.4.4.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
     0:0:0:0          0.0.0.0                                0 i
 *>i 65000:2:1:256    3.3.3.3                  0    100  32768 i

Now we add an import under the VRF in PE2 and one route should be sent.

PE2(config)#ip vrf 0
PE2(config-vrf)#route-target import 1:1
PE2#sh ip route vrf 0

Routing Table: 0
Codes: L - local, C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area 
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, su - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2
       ia - IS-IS inter area, * - candidate default, U - per-user static route
       o - ODR, P - periodic downloaded static route, H - NHRP, l - LISP
       + - replicated route, % - next hop override

Gateway of last resort is not set

      10.0.0.0/8 is variably subnetted, 3 subnets, 2 masks
B        10.0.1.0/24 [200/0] via 1.1.1.1, 00:00:16
C        10.1.1.0/24 is directly connected, Loopback1
L        10.1.1.1/32 is directly connected, Loopback1
RR#sh bgp vpnv4 uni all neighbors 3.3.3.3 advertised-routes | i Total
Total number of prefixes 1 
RR#sh bgp rtfilter unicast all                                       
BGP table version is 4, local router ID is 4.4.4.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
     0:0:0:0          0.0.0.0                                0 i
 *>i 65000:2:1:1      3.3.3.3                  0    100  32768 i
 *>i 65000:2:1:256    3.3.3.3                  0    100  32768 i

Works as expected. From the output we can see that the AS is 65000, the extended
community type is 2 and the RTs that PE2 wants to receive routes for are 1:1 and 1:256.

Conclusion

Route Target Constraint is a powerful feature that will lessen the load on both your
Route Reflectors and PE devices in an MPLS VPN enabled network. It can also help
BGP converge faster. Support is needed on both PE and RR, and the BGP
session will be torn down when enabling it, so it has to be done during a maintenance
window.

Categories: BGP, MPLS

iBGP – Fully meshed vs Route Reflection

September 9, 2013

Intro

This post looks at the pros and cons with BGP Route Reflection compared to running
an iBGP full mesh.

Full mesh

Because routes learned over iBGP are not propagated to other iBGP peers, there must
be a full mesh inside the BGP network. This leads to scalability issues. With N routers,
each router has (N-1) iBGP neighbors and there are (N*(N-1))/2 BGP sessions in total.
For a medium sized ISP network with 100 routers running BGP this would be 99 iBGP
neighbors per router and 4950 BGP sessions in total.

Full mesh

There are 4 routers in AS 2 which gives 3 iBGP neighbors and 6 iBGP sessions in total.
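The full mesh arithmetic is easy to check with a few lines of Python. The function name `full_mesh` is just a label for this example; the router counts are the ones used in this post.

```python
def full_mesh(n: int) -> tuple:
    """Return (iBGP neighbors per router, total iBGP sessions) for n routers."""
    # Every router peers with all the others, and each session is
    # shared by two routers, hence the division by two.
    return n - 1, n * (n - 1) // 2

for routers in (4, 100, 300):
    neighbors, sessions = full_mesh(routers)
    print(f"{routers} routers: {neighbors} neighbors each, {sessions} sessions")
```

The session count grows with the square of the number of routers, which is why the full mesh stops being practical long before the network becomes large.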

Benefits of a full mesh:

  • Optimal Traffic Forwarding
  • Path Diversity
  • Convergence
  • Robustness

Optimal Traffic Forwarding:

Because all BGP speaking routers are fully meshed they will receive iBGP updates
from all peers. If no attributes have been manipulated then the tiebreaker
will be the metric to the next-hop (IGP) so traffic will take the optimal path.

Path Diversity:

Due to the full mesh the BGP speaking router will have multiple paths to choose
from. If it was connected to a RR it would generally only have one path, the one
the RR decided was the best.

Convergence:

Because the BGP speaking router has multiple paths if the current best one should fail
it can start using one of the alternate paths. Also the BGP UPDATE messages are sent
directly between the iBGP peers instead of passing through an additional router (RR)
which would have to process it and the packets would have to travel additional distance
unless the RR is located in the same PoP as the routers.

Robustness:

If one BGP speaking router fails then only the networks behind that router are
not reachable any longer. If a RR fails then all networks that were reachable via
clients to that RR would no longer be reachable.

Caveats of a full mesh:

  • Lack of Scalability
  • Management Overhead
  • Duplication of Information

Lack of Scalability:

Having hundreds of BGP sessions on all routers would mean a lot of BGP processing.
The number of BGP Updates coming in would be massive.
This would put a great burden on the CPU/RP of the router. For really large networks
this could potentially be more than the router can handle. In a network with 300 routers
there would be 44850 iBGP sessions. The RIB-in size would be very large because of the
large number of peers.

Management Overhead:

Adding a new device to the network means reconfiguring all the existing devices.
Configurations would be very big considering all the lines needed to set up the
full mesh.

Duplication of Information:

For every external network there could potentially be multiple paths internally
leading to using lots of RIB/FIB space on the devices. It does not make much sense
to install all paths into RIB/FIB.

Benefits of Route Reflection:

  • Scalability
  • Reduced Operational Cost
  • Reduced RIB-in Size
  • Reduced Number of BGP Updates
  • Incremental Deployability

Scalability:

The number of iBGP sessions needed is greatly reduced. A client only needs one session,
or preferably two to have route reflector redundancy. With route reflection the network
needs (K*(K-1))/2 + C sessions, where K is the number of route reflectors and C is the
number of clients, each peering with one RR. The route reflectors still need to be in
full mesh with each other.
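A quick Python sketch of the session count with route reflection is shown below. The `rrs_per_client` parameter is my own addition for the redundant case and is not part of the formula in the text; with its default of 1 the function is exactly (K*(K-1))/2 + C.

```python
def rr_sessions(k: int, c: int, rrs_per_client: int = 1) -> int:
    """Total iBGP sessions with k fully meshed RRs and c clients.

    rrs_per_client is an assumption added for this example: how many
    RRs each client peers with (2 gives route reflector redundancy).
    """
    rr_mesh = k * (k - 1) // 2           # RRs are still fully meshed
    client_sessions = c * rrs_per_client
    return rr_mesh + client_sessions

# 100 routers in total: 2 RRs and 98 clients.
print(rr_sessions(2, 98))                    # each client peers with one RR
print(rr_sessions(2, 98, rrs_per_client=2))  # each client peers with both RRs
```

That gives 99 sessions, or 197 with redundant RRs, compared to 4950 for a full mesh of the same 100 routers.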

Reduced Operational Cost:

With a full mesh, adding a new device requires reconfiguring all the existing
devices. This requires operator intervention which is an added cost. With route reflection,
only the new device and the RR it peers with need new configuration.

Reduced RIB-in Size:

RIB-in contains the unprocessed BGP information. After processing this information
the best paths are installed into the Loc-RIB. The RIB-in grows proportionally with
the number of neighbors that the router peers with. If there are n neighbors and p prefixes
then the router would have a RIB-in of size n * p. In a full mesh n is very high
but with route reflection n is only the number of RRs that the router peers with.

Reduced Number of BGP Updates:

In a full mesh a router will receive N – 1 updates where N is the number of routers.
This is a large amount of updates. With route reflection N is small since this is
only the number of route reflectors the router peers with.

Incremental Deployability:

Route reflection does not require massive changes in the existing network like with
confederations. It can be deployed incrementally and routers can be migrated to the
RR topology gradually. Not all routers need to be moved at once.

Caveats of Route Reflection:

  • Robustness
  • Prolonged Routing Convergence
  • Potential Loops
  • Reduced Path Diversity
  • Suboptimal Routes

Robustness:

With a full mesh if a single router fails that only impacts the networks behind
that router. If a route reflector fails it affects all the networks that were
behind all of the route reflector’s clients. To avoid single points of failure,
RRs are usually deployed in pairs.

Prolonged Routing Convergence:

In a full mesh every BGP update only travels a single hop. With route reflection
the number of hops is increased and if the route reflectors are set up in a
hierarchical topology the update could travel through several RRs. Every RR
will add some processing delay and propagation delay before the update reaches
the client.

Potential Loops:

In a topology where clients are connected to a single RR there should be no
data plane loops. When clients are connected to two RRs there is a risk
of a loop forming if the control plane topology does not match the physical
topology. Because of that it is important to try to match the two topologies.

Reduced Path Diversity:

In a full mesh if there are multiple paths to an external network then
all paths will be announced and the local router makes a decision which one
is the best. With route reflection the RR makes the decision which path is
the best and announces this path only. This leads to fewer paths being
announced which could lead to longer convergence delays.

There are drafts for announcing more than one best path which would help
with this issue. Some newer IOS releases support this feature.

Suboptimal routes:

The RR will select a best path based on its own local routing information.
This could lead to routers using suboptimal paths because there may be
a shorter path available from a router’s perspective but this is not the
path that the RR had chosen. Therefore it’s important to consider where
the RRs are placed.

Conclusion

This post takes a look at the benefits and caveats of a fully meshed iBGP network
vs route reflection. Although scalability makes it almost impossible not to
go with route reflection, one should still consider its caveats.
It’s important to consider the placement and the number of RRs in the topology.
This post is the first in a series of posts that will focus on CCDE topics.

Categories: BGP

BGP wedgies – Why isn’t my routing policy having effect?

September 8, 2013

Intro

Brian McGahan from INE introduced me to something interesting the other day.
BGP wedgie, what is that? I had never heard of it before although I’ve heard
of such things occurring. A BGP wedgie is when a BGP configuration can lead
to different end states depending on the order in which routes are sent. There is
actually an RFC for this: RFC 4264.

Peering relationships

To understand this RFC you need to have some knowledge of BGP and the different
kinds of peering relationships between service providers and customers.

Service providers are usually described as Tier 1 or Tier 2. A Tier 1 service provider
is one that does not need to buy transit. They have private peerings with other service
providers to reach all networks in the Default Free Zone (DFZ). This is the
theory although it’s difficult in the real world to see who is Tier 1 or not.

Tier 2 service providers don’t have private peerings to reach all the networks so they
must buy transit from one or more Tier 1 service providers. This is a paid
service.

Service providers have different preference for routes coming in. The most
preferred routes are those coming from customers. After that it is preferred
to send traffic over private peerings since in theory this should be cheaper than
transit. The least preferred is to send traffic towards your transit.

Why is my policy not working?

Assume that you are a customer buying capacity from two service providers.
You want to use one service provider as primary and one as secondary.
This is usually done by sending a community towards your secondary provider
which then sets local preference. Keep in mind that providers will still have
their best economic result in mind though. Take a look at the following diagram.

Wedgie1

We will be configuring AS1. We want the network 1.1.1.0/24 to be reached primarily
via AS4, with AS2 as the secondary path. We will use communities to achieve this. We
set up the primary path first.

This is the configuration of AS1 so far:

router bgp 1
 no synchronization
 bgp log-neighbor-changes
 neighbor 12.12.12.2 remote-as 2
 neighbor 12.12.12.2 description backup
 neighbor 12.12.12.2 shutdown
 neighbor 12.12.12.2 send-community
 neighbor 12.12.12.2 route-map set-backup out
 neighbor 14.14.14.4 remote-as 4
 neighbor 14.14.14.4 description primary
 no auto-summary
!
ip bgp-community new-format
!
route-map set-backup permit 10
 set community 2:50

The backup will be turned up later.

Looking from AS2 perspective we now have the correct path.

AS2#sh bgp ipv4 uni   
BGP table version is 2, local router ID is 23.23.23.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 1.1.1.0/24       23.23.23.3                             0 3 4 1 i
AS2#traceroute 1.1.1.1

Type escape sequence to abort.
Tracing the route to 1.1.1.1

  1 AS3 (23.23.23.3) 80 msec 36 msec 20 msec
  2 AS4 (34.34.34.4) 64 msec 56 msec 48 msec
  3 AS1 (14.14.14.1) 84 msec *  68 msec

Now the backup service is turned up.

AS1(config-router)#no nei 12.12.12.2 shut
AS1(config-router)#
%BGP-5-ADJCHANGE: neighbor 12.12.12.2 Up

AS2 still prefers the correct path due to local preference.

AS2#sh bgp ipv4 uni   
BGP table version is 2, local router ID is 23.23.23.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*  1.1.1.0/24       12.12.12.1               0     50      0 1 i
*>                  23.23.23.3                             0 3 4 1 i
AS2#traceroute 1.1.1.1

Type escape sequence to abort.
Tracing the route to 1.1.1.1

  1 AS3 (23.23.23.3) 84 msec 44 msec 20 msec
  2 AS4 (34.34.34.4) 56 msec 60 msec 44 msec
  3 AS1 (14.14.14.1) 100 msec *  100 msec

AS3 and AS4 have the following route-map to increase local preference for
customer routes.

AS3#sh route-map
route-map customer, permit, sequence 10
  Match clauses:
  Set clauses:
    local-preference 150
  Policy routing matches: 0 packets, 0 bytes

Now what happens if there is a failure between AS1 and AS4?
AS2 now only has one path available.

AS2#sh bgp ipv4 uni
BGP table version is 3, local router ID is 23.23.23.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 1.1.1.0/24       12.12.12.1               0     50      0 1 i

This is advertised to AS3, which sets local preference to 150.

AS3#sh bgp ipv4 uni
BGP table version is 4, local router ID is 34.34.34.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 1.1.1.0/24       23.23.23.2                    150      0 2 1 i

Now the primary circuit comes back. AS3 will prefer to go via AS2 because
that is a customer route.

AS3#sh bgp ipv4 uni
BGP table version is 4, local router ID is 34.34.34.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*  1.1.1.0/24       34.34.34.4                             0 4 1 i
*>                  23.23.23.2                    150      0 2 1 i

We now have a BGP wedgie. The same BGP configuration has generated two
different outcomes depending on the order in which the routes were announced.
The only way of breaking the wedgie now is to stop announcing the backup, let
the network converge and then bring up the backup again. AS2 now has the correct
path again.

AS2#sh bgp ipv4 uni
BGP table version is 5, local router ID is 23.23.23.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*  1.1.1.0/24       12.12.12.1               0     50      0 1 i
*>                  23.23.23.3                             0 3 4 1 i

To describe what is actually happening, take a look at this diagram.

Wedgie2

The numbers describe the order in which the UPDATEs are sent. AS2 has two paths but
the one directly to AS1 has a local pref of 50 due to AS1 using it as a backup.
This means that AS2 does not send this path to AS3 so AS3 has to use the path
via AS4. This is the key. Now what happens when the circuit between AS1 and AS4
fails?

Wedgie3

The key here is step 3 where AS2 sends its only current path to AS3. AS3 will then
set local preference to 150 because this is a customer route. Then the primary
circuit comes back.

Wedgie4

AS1 announces the network to AS4. AS4 announces this to AS3. AS3 does NOT
advertise this to AS2 because it already has a best path via AS2 where
the local preference is 150. This means that the network can not converge
to the primary path until the backup path has been removed.
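
The sequence above can be replayed with a small Python model. This is a simplified sketch, not a BGP implementation: it tracks a single prefix, the best path is chosen on local preference and then AS path length only, and a route whose AS path already contains the receiver is dropped. The Network class and its methods are my own names. The local preference values are the ones from the outputs above, and 100 is assumed as the default.

```python
from collections import deque

ORIGIN = 1          # AS1 originates 1.1.1.0/24
DEFAULT_PREF = 100  # assumed local preference when no policy matches
# (receiver, sender) -> local preference set on import.
# 150 = customer route, 50 = backup community honoured by AS2.
IMPORT_PREF = {(4, 1): 150, (3, 2): 150, (2, 1): 50}


class Network:
    def __init__(self):
        # Provider links exist from the start; AS1's circuits come up later.
        self.links = {frozenset({2, 3}), frozenset({3, 4})}
        self.rib_in = {2: {}, 3: {}, 4: {}}  # asn -> {neighbor: (pref, path)}

    def neighbors(self, asn):
        return sorted(next(iter(link - {asn})) for link in self.links if asn in link)

    def best(self, asn):
        """Highest local preference wins, then the shortest AS path."""
        if asn == ORIGIN:
            return (DEFAULT_PREF, ())
        routes = self.rib_in[asn].values()
        return max(routes, key=lambda r: (r[0], -len(r[1])), default=None)

    def advertise(self, start):
        """Send best paths hop by hop until no best path changes."""
        queue = deque([start])
        while queue:
            sender = queue.popleft()
            best = self.best(sender)
            for nbr in self.neighbors(sender):
                if nbr == ORIGIN:
                    continue  # AS1 would drop its own prefix (AS-path loop)
                path = None if best is None else (sender,) + best[1]
                if path is None or nbr in path:
                    new = None  # withdraw, or rejected because of a loop
                else:
                    new = (IMPORT_PREF.get((nbr, sender), DEFAULT_PREF), path)
                if new == self.rib_in[nbr].get(sender):
                    continue
                old_best = self.best(nbr)
                if new is None:
                    del self.rib_in[nbr][sender]
                else:
                    self.rib_in[nbr][sender] = new
                if self.best(nbr) != old_best:
                    queue.append(nbr)

    def link_up(self, a, b):
        self.links.add(frozenset({a, b}))
        self.advertise(a)
        self.advertise(b)

    def link_down(self, a, b):
        self.links.discard(frozenset({a, b}))
        for local, remote in ((a, b), (b, a)):
            self.rib_in.get(local, {}).pop(remote, None)
        self.advertise(a)
        self.advertise(b)


net = Network()
net.link_up(1, 4)       # primary circuit first
net.link_up(1, 2)       # then the backup
print(net.best(2)[1])   # (3, 4, 1): AS2 uses the primary path
net.link_down(1, 4)     # the primary fails ...
net.link_up(1, 4)       # ... and comes back
print(net.best(2)[1])   # (1,): AS2 is stuck on the backup
net.link_down(1, 2)     # break the wedgie: withdraw the backup
net.link_up(1, 2)       # and announce it again
print(net.best(2)[1])   # (3, 4, 1)
```

The policy table never changes between the three print statements. Only the order of the events differs, which is exactly what makes this a wedgie.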

Conclusion

BGP is a path vector protocol and sometimes the same configuration can
give different outcomes depending on the order in which updates are sent. Keep
this in mind when setting up BGP and try to learn as much as possible about
your service providers' peerings.

Categories: BGP

Default routes in BGP

June 12, 2013

I have seen in forums and in other places that some people find the
default route in BGP a bit confusing. There are multiple ways of
originating a default route in BGP. To start, this is the topology
used:

Default

The following configurations are there from the start:

R1

interface FastEthernet0/0
 ip address 12.12.12.1 255.255.255.0
ip route 3.3.3.3 255.255.255.255 12.12.12.2
ip route 4.4.4.4 255.255.255.255 12.12.12.2

R2

interface FastEthernet0/0
 ip address 12.12.12.2 255.255.255.0
!
interface FastEthernet0/1
 ip address 23.23.23.2 255.255.255.0
!
interface FastEthernet1/0
 ip address 24.24.24.2 255.255.255.0
!
router bgp 2
 neighbor 23.23.23.3 remote-as 2
 neighbor 24.24.24.4 remote-as 4
!
ip route 0.0.0.0 0.0.0.0 12.12.12.1

R3

interface Loopback0
 ip address 3.3.3.3 255.255.255.255
!
interface FastEthernet0/0
 ip address 23.23.23.3 255.255.255.0
!
router bgp 2
 network 3.3.3.3 mask 255.255.255.255
 neighbor 23.23.23.2 remote-as 2

R4

interface Loopback0
 ip address 4.4.4.4 255.255.255.255
!
interface FastEthernet0/0
 ip address 24.24.24.4 255.255.255.0
!
router bgp 4
 network 4.4.4.4 mask 255.255.255.255
 neighbor 24.24.24.2 remote-as 2

R2 is learning the loopbacks from R3 and R4. R2 has a default route towards R1.
The goal is to announce a default route in BGP. Surely redistribute static
should be enough to announce the default route?

R2(config)#router bgp 2
R2(config-router)#redistribute static

We are not seeing it being advertised to the peers…

R4#sh bgp ipv4 uni
BGP table version is 3, local router ID is 4.4.4.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 3.3.3.3/32       24.24.24.2                             0 2 i
*> 4.4.4.4/32       0.0.0.0                  0         32768 i

Is it in the BGP RIB of R2?

R2#sh bgp ipv4 uni
BGP table version is 5, local router ID is 24.24.24.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i3.3.3.3/32       23.23.23.3               0    100      0 i
*> 4.4.4.4/32       24.24.24.4               0             0 4 i

It is not. BGP does not redistribute a static default route unless the
default-information originate command is used. This protects against someone
accidentally redistributing a default route into BGP, which could potentially
be disastrous.
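
The rule above can be written down as a small filter. This Python sketch models only the behaviour described in this post, not IOS itself. The function name and the extra 10.0.0.0/8 route are made up for illustration.

```python
import ipaddress

DEFAULT = ipaddress.ip_network("0.0.0.0/0")


def redistribute_static(static_routes, default_information_originate=False):
    """Return the static prefixes that end up in the BGP table.

    Models only the behaviour described above: the default route is held
    back unless default-information originate is configured.
    """
    prefixes = [ipaddress.ip_network(route) for route in static_routes]
    return [
        str(prefix)
        for prefix in prefixes
        if prefix != DEFAULT or default_information_originate
    ]


statics = ["0.0.0.0/0", "10.0.0.0/8"]  # 10.0.0.0/8 is a made-up extra route
print(redistribute_static(statics))        # ['10.0.0.0/8']
print(redistribute_static(statics, True))  # ['0.0.0.0/0', '10.0.0.0/8']
```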

R2(config)#router bgp 2
R2(config-router)#default-information originate
R2(config-router)#^Z

R2#sh bgp ipv4 un
BGP table version is 18, local router ID is 24.24.24.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          12.12.12.1               0         32768 ?
R3#sh bgp ipv4 uni
BGP table version is 18, local router ID is 3.3.3.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
* i0.0.0.0          12.12.12.1               0    100      0 ?
R4#sh bgp ipv4 uni
BGP table version is 16, local router ID is 4.4.4.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          24.24.24.2               0             0 2 ?

Now the default route is advertised to both peers. A default route received
via OSPF can be redistributed as well. Don’t forget to match externals or you
will have a facepalm moment like I did while writing this post.

R2#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R2(config)#router bgp 2
R2(config-router)#no redistribute static
R2(config-router)#no ip route 0.0.0.0 0.0.0.0 12.12.12.1
R2(config)#int f0/0
R2(config-if)#ip ospf 1 area 0
R1#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R1(config)#int f0/0
R1(config-if)#ip ospf 1 area 0
R1(config-if)#router ospf 1
R1(config-router)#default-information originate always

There is now a default route learned via OSPF.

R2#sh ip route ospf
O*E2 0.0.0.0/0 [110/1] via 12.12.12.1, 00:02:54, FastEthernet0/0

Now to redistribute OSPF into BGP.

R2(config)#router bgp 2
R2(config-router)#redistribute ospf 1 match external
R2(config-router)#^Z
R2#sh bgp ipv
*Mar  1 02:13:18.267: %SYS-5-CONFIG_I: Configured from console by console
R2#sh bgp ipv4 uni
BGP table version is 20, local router ID is 24.24.24.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          12.12.12.1               1         32768 ?
R3#sh bgp ipv4 uni
BGP table version is 18, local router ID is 3.3.3.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
* i0.0.0.0          12.12.12.1               1    100      0 ?
R4#sh bgp ipv4 uni
BGP table version is 18, local router ID is 4.4.4.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          24.24.24.2               1             0 2 ?

So the default-information originate command must always be accompanied by
a redistribute statement. It can be static or from a dynamic protocol but
there must be a redistribute statement.

It is also possible to use the network command.

R2(config)#router bgp 2
R2(config-router)#no redistribute ospf 1
R2(config-router)#int f0/0
R2(config-if)#no ip ospf 1 area 0
R2(config-if)#  
*Mar  1 02:15:41.559: %OSPF-5-ADJCHG: Process 1, Nbr 12.12.12.1 on FastEthernet0/0 from FULL to DOWN, Neighbor Down: Interface down or detached
R2(config-if)#ip route 0.0.0.0 0.0.0.0 12.12.12.1
R2(config)#router bgp 2
R2(config-router)#network 0.0.0.0
R2(config-router)#^Z
R2#sh bgp ipv4 uni
BGP table version is 22, local router ID is 24.24.24.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          12.12.12.1               0         32768 i
R3#sh bgp ipv4 uni
BGP table version is 18, local router ID is 3.3.3.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
* i0.0.0.0          12.12.12.1               0    100      0 i
R4#sh bgp ipv4 uni
BGP table version is 20, local router ID is 4.4.4.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          24.24.24.2               0             0 2 i

The difference here is that network 0.0.0.0 will pick it up if there is
a default route in the RIB. There is no need to redistribute. Now for OSPF
as well.

R2#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R2(config)#no ip route 0.0.0.0 0.0.0.0
R2(config)#int f0/0
R2(config-if)#ip ospf 1 area 0
R2(config-if)#^Z
R2#
%OSPF-5-ADJCHG: Process 1, Nbr 12.12.12.1 on FastEthernet0/0 from LOADING to FULL, Loading Done

R2#sh bgp ipv4 uni
BGP table version is 24, local router ID is 24.24.24.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          12.12.12.1               1         32768 i
R3#sh bgp ipv4 uni        
BGP table version is 18, local router ID is 3.3.3.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
* i0.0.0.0          12.12.12.1               1    100      0 i
R4#sh bgp ipv4 uni
BGP table version is 22, local router ID is 4.4.4.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          24.24.24.2               1             0 2 i

What if we don’t want to use a default route on the local router, or only
want to generate a default route towards a specific neighbor? That is when the
default-originate command is used towards a neighbor.

R2(config)#int f0/0
R2(config-if)#no ip ospf 1 area 0
R2(config-if)#router bgp 2
R2(config-router)#
*Mar  1 02:22:29.035: %OSPF-5-ADJCHG: Process 1, Nbr 12.12.12.1 on FastEthernet0/0 from FULL to DOWN, Neighbor Down: Interface down or detached
R2(config-router)#no network 0.0.0.0
R2(config-router)#nei 24.24.24.4 default-originate  
R2(config-router)#do sh ip route 0.0.0.0
% Network not in table
R2(config-router)#do sh bgp ipv4 uni
BGP table version is 25, local router ID is 24.24.24.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i3.3.3.3/32       23.23.23.3               0    100      0 i
*> 4.4.4.4/32       24.24.24.4               0             0 4 i

As you can see, there is no default route in R2's RIB or BGP RIB. R3 should not
have a default route now.

R3#sh bgp ipv4 uni
BGP table version is 18, local router ID is 3.3.3.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 3.3.3.3/32       0.0.0.0                  0         32768 i
* i4.4.4.4/32       24.24.24.4               0    100      0 4 i

R4 has it.

R4#sh bgp ipv4 uni
BGP table version is 24, local router ID is 4.4.4.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 0.0.0.0          24.24.24.2               0             0 2 i
*> 3.3.3.3/32       24.24.24.2                             0 2 i
*> 4.4.4.4/32       0.0.0.0                  0         32768 i

So to summarize, there are three ways of advertising a default route in BGP.
The network 0.0.0.0 command picks up a default route that exists in the RIB.
It can be used to inject only a default without redistributing static or
dynamically learned routes.

The default-information originate command is used if you are redistributing
routes but the default route is not getting included. This command must always
be matched by a redistribute statement.

Default-originate is used to advertise a default only to a specific neighbor.
It does not insert a default route into the BGP RIB and does not require a
default to exist in the RIB at all.
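
The three methods can be compared side by side with a small decision function. This Python sketch only encodes the behaviour observed in the outputs of this post. The function name, the parameter names and the result strings are my own, and "all neighbors" simply means the route is in the BGP table and is advertised like any other route.

```python
def default_route_result(method, default_in_rib=False, redistributing=False):
    """Return (in_local_bgp_table, advertised_to) for one origination method.

    advertised_to is "all neighbors", "configured neighbor only" or "nobody".
    """
    if method == "network 0.0.0.0":
        # Needs a default route in the RIB, but no redistribute statement.
        ok = default_in_rib
        return (ok, "all neighbors" if ok else "nobody")
    if method == "default-information originate":
        # Needs a redistribute statement whose source carries the default.
        ok = default_in_rib and redistributing
        return (ok, "all neighbors" if ok else "nobody")
    if method == "neighbor default-originate":
        # Works without any default route on the local router.
        return (False, "configured neighbor only")
    raise ValueError(f"unknown method: {method}")


print(default_route_result("network 0.0.0.0", default_in_rib=True))
# (True, 'all neighbors')
print(default_route_result("default-information originate", default_in_rib=True))
# (False, 'nobody')
print(default_route_result("neighbor default-originate"))
# (False, 'configured neighbor only')
```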

The last command would probably be the only one used in a real life case but
for the CCIE lab you need to know them all.