More PIM-BiDir Considerations

August 12, 2015

Introduction

From my last post on PIM BiDir I got some great comments from my friend Peter Palúch. I still had some concepts that weren't totally clear to me, and I don't like to leave unfinished business. There is also a lack of resources properly explaining the behavior of PIM BiDir. For that reason I would like to clarify some concepts and write some more about what the potential gains of PIM BiDir are. First we must be clear on the terminology used in PIM BiDir.

Terminology

Rendezvous Point Address (RPA) – The RPA is an address that is used as the root of the distribution tree for a range of multicast groups. This address must be routable in the PIM domain but does not have to reside on a physical interface or device.

Rendezvous Point Link (RPL) – The physical link to which the RPA belongs. The RPL is the only link where DF election does not take place. The RFC also says "In BIDIR-PIM, all multicast traffic to groups mapping to a specific RPA is forwarded on the RPL of that RPA." In scenarios where the RPA is virtual, there may not be an RPL, though.

Upstream – Traffic towards the root (RPA) of the tree. This is the direction used by packets traveling from source(s) to the RPL.

Downstream – Traffic going away from the root. The direction from which packets travel from the RPL to the receivers.

Designated Forwarder (DF) – A single DF exists for every RPA on a link, whether point-to-point or multipoint. The only exception, as noted above, is the RPL. The DF is the router with the best metric to the RPA. The DF is responsible for forwarding downstream traffic onto its link and forwarding upstream traffic from its link towards the RPL. The DF on a link is also responsible for processing Join messages from downstream routers on the link and ensuring packets are forwarded to local receivers as discovered by IGMP or MLD.

RPF Interface – The RPF interface is determined by looking up the RPA in the Multicast Routing Information Base (MRIB). The RPF information then determines which interface of the router would be used to send packets towards the RPL of the group.

RPF Neighbor – The next node on the shortest path towards the RPA.

Sources in PIM BiDir

One of the confusing parts in PIM BiDir is how traffic travels from the source(s) to the RPA. There is no (S,G) and no PIM Register messages in PIM BiDir, so how is this handled?

When a source starts sending traffic, it will send it towards the RPA regardless of whether there are any receivers or not. The DF on the segment is responsible for sending the traffic upstream towards the RPA. The packet then travels through the PIM domain until it reaches the root of the tree (RPA). Some articles on PIM BiDir mention that there is no RPF check. This is not entirely true: RPF is used to find the right interface towards the RPA, but it is the DF mechanism, not RPF, that ensures loop-free forwarding.

Bitbucket

The picture shows a source sending traffic to a multicast group while there are no interested receivers yet. Traffic from the source travels towards the RPA, which is not a physical device but only exists virtually on a shared segment with the routers R1, R2 and R3. These routers are connected to the RPL, and on the RPL there is no DF election. This means that they are free to forward the packets; however, there are no receivers yet and hence no interfaces in the Outgoing Interface List (OIL). The traffic will simply get dropped in the bitbucket. This is a waste of bandwidth until at least one receiver joins the shared tree.

Considerations for PIM BiDir

Since there is no way to control what the multicast sources are sending, what we give up in exchange for minimal state in the PIM domain is bandwidth. It is unlikely, though, that a many-to-many multicast application will have no receivers at all, so this may be an acceptable sacrifice.

Due to keeping minimal state, PIM BiDir will use less memory compared to PIM ASM or PIM SSM. It will also use less CPU since there are fewer PIM messages to generate, receive and process. Is this a consideration on modern platforms? It might be, it might not be. What is known, though, is that it is a less complex protocol than PIM ASM because it does not have a PIM Register process. Due to this, the RP, which does not even have to be a real router, can't get overwhelmed by unicast PIM Register messages. This also provides an easier mechanism for RP redundancy compared to PIM ASM, which requires Anycast RP and MSDP to provide the same (PIM SSM avoids the problem entirely by not using an RP).

PIM BiDir Not Working in IOSv and CSR1000v

While writing this post I needed to run some tests, so I booted up VIRL and tested on IOSv but could not get PIM BiDir to work properly. I then tested with CSR1000v, both within VIRL and directly on an ESXi server, with the same results. These images were quite new and it seems something is not working properly in them. When the routers were running in BiDir mode, they would not process and forward the multicast. Just a fair warning that if you try it, you may run into similar results; please share if you discover something interesting.
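
For reference, a minimal BiDir configuration of the kind being tested would look roughly like this. This is a sketch in IOS syntax; 192.0.2.1 as the RPA and the interface name are assumptions:

    ip multicast-routing
    ! Enable BiDir support globally, then point all routers at the RPA
    ip pim bidir-enable
    ip pim rp-address 192.0.2.1 bidir
    !
    interface GigabitEthernet0/1
     ip pim sparse-mode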

Putting it All Together

To get an understanding of the whole process, let us describe all of the steps from when the source starts sending until the receiver starts receiving the traffic from the source.

The source starts sending traffic on the local segment to R4. R4 is the DF on the link since it’s the only router present.

BiDir1

R4 does an RPF check for 192.0.2.1, which is the RPA. Through this process it finds the upstream interface and starts forwarding traffic towards the RPA, where R1 is the RPF neighbor and the next router on the path towards the RPA.

BiDir2

R1 forwards traffic towards the RPA on its upstream interface, but there are no interested receivers yet, so the traffic will get dropped in the bitbucket. Please note that both R2 and R3 do receive this traffic, but if there is no interface in the OIL, the traffic simply gets dropped.

BiDir3

The PC then generates an IGMP Report and sends it towards R5.

BiDir4

R5 is the DF on the segment, which also means that it is the Designated Router (DR). It will generate a PIM (*,G) Join and send it towards the RPA on its upstream interface, where R3 is the RPF neighbor. Only the DF, and hence the DR, on a link may act on the IGMP Report.

BiDir5

R3 receives the PIM Join from R5, and since traffic is already being sent out by R1 on the RPL, R3 is allowed to start forwarding the traffic. Remember that there is no DF elected on the RPL. We now have end-to-end multicast flowing.

BiDir6

Conclusion

The main goal of this post was to show how the RP can be a virtual device on a shared segment. This means that redundancy can be designed into the RP role without the complex mechanisms used in PIM ASM. I also wanted to clarify some concepts around the forwarding and the terminology, since there seem to be quite a few posts out there that are slightly wrong or not using the correct terminology.


Many to Many Multicast – PIM BiDir

August 9, 2015

Introduction

This post will describe PIM BiDir, why it is needed and the design considerations for using it. The post is focused on technology overview and design rather than detailed configuration; only a few small configuration sketches are included for illustration.

Multicast Applications

Multicast is a technology that is mainly used for one-to-many and many-to-many applications. The following are examples of applications that use or can benefit from using multicast.

One-to-many

One-to-many applications have a single sender and multiple receivers. These are examples of applications in the one-to-many model.

Scheduled audio/video: IP-TV, radio, lectures

Push media: News headlines, weather updates, sports scores

File distributing and caching: Web site content or any file-based updates sent to distributed end-user or replicating/caching sites

Announcements: Network time, multicast session schedules

Monitoring: Stock prices, security system or other real-time monitoring applications

Many-to-many

Many-to-many applications have many senders and many receivers. One-to-many applications are unidirectional and many-to-many applications are bidirectional.

Multimedia conferencing: Audio/video and whiteboard is the classic conference application

Synchronized resources: Shared distributed databases of any type

Distance learning: One-to-many lecture but with “upstream” capability where receivers can question the lecturer

Multi-player games: Many multi-player games are distributed simulations and also have chat group capabilities.

Overview of PIM

PIM has different implementations to be able to handle the above applications. There are mainly three implementations of PIM: PIM Any Source Multicast (ASM), PIM Source Specific Multicast (SSM) and PIM BiDirectional (BiDir).

PIM ASM

PIM ASM was the first implementation and is well suited for one-to-many applications. ASM means that traffic from any source to a group will be delivered to the receiver(s). PIM ASM uses the concept of a Rendezvous Point Tree (RPT) and a Shortest Path Tree (SPT). The RPT is a tree built from the receiver to a Rendezvous Point (RP). The tree from a multicast source to a receiver is called the SPT. Before the receiver can learn the source and build the SPT, the RP will have sent a PIM Join towards the source to build the SPT between the source and the RP. When looking in the mroute table, RPT state is shown as (*,G) and SPT state is shown as (S,G).

PIM1

The responsibilities of the RP are:

  • Receive PIM Register messages from the First Hop Router (FHR) and send Register Stop
  • Join the SPT and the RPT so the receivers get traffic and find out the source of the multicast

Initially traffic flows through the RP; there is a more efficient path, though. When the Last Hop Router (LHR) starts receiving the multicast, it will switch over to the SPT. The SPT will be a more optimal path and (likely) introduce lower delay between the source and the receiver.

PIM2

PIM ASM can support both one-to-many and many-to-many applications since it can use both the SPT and the RPT. To prevent the LHR from switching to the SPT, the ip pim spt-threshold command can be used. It can either be set to switch over at a certain traffic rate (kbps) or be set to infinity to always stay on the RPT. This can be combined with an ACL to have certain groups always stay on the RPT while others switch over; a configuration sketch follows the list below. PIM ASM can therefore use the SPT for some groups and the RPT for other groups. There are still drawbacks to PIM ASM, a few of which are mentioned here:

  • Complex protocol state with Register messages
  • Redundancy requires the use of MSDP
  • Any source can send which opens attack vector for DoS and sending traffic from spoofed source
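
As a sketch of the spt-threshold control mentioned above (IOS syntax; the ACL number and group range are assumptions):

    ! Keep groups in 239.1.1.0/24 on the shared tree forever,
    ! let all other groups switch to the SPT as usual
    access-list 10 permit 239.1.1.0 0.0.0.255
    ip pim spt-threshold infinity group-list 10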

PIM SSM

PIM SSM was created to work better with one-to-many flows compared to PIM ASM. In PIM SSM there is no complex handling of state, and there is only the SPT, no RPT. That also means that there is no need for an RP. PIM SSM is much easier to set up and use, but it does require clients to support IGMPv3 so that the IGMP Report can contain which source the receiver wants to receive the traffic from. Since there is no RP, there has to be some way for the receiver to know which sources send to which groups; this has to be handled by some form of Out Of Band (OOB) mechanism. The most common use for SSM is IP-TV, where the Set Top Box (STB) receives a list of sources and groups by contacting a server.

The drawback of PIM SSM is that (S,G) state is created requiring more memory. Depending on the number of sources, this may be a factor or not.
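
For reference, enabling SSM itself is simple; a minimal sketch in IOS syntax (the interface name is an assumption):

    ! Enable SSM for the default 232.0.0.0/8 range
    ip pim ssm default
    !
    interface GigabitEthernet0/1
     ip pim sparse-mode
     ! Receivers on this segment must signal (S,G) via IGMPv3
     ip igmp version 3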

PIM BiDir

Bidirectional PIM was created to work better with many-to-many applications. PIM BiDir uses only the RPT and no SPT, which means that there has to be an RP. With bidirectional PIM, the RP does not perform any of the PIM ASM functions though, such as sending Register Stop messages or joining the SPT. Remember, in PIM BiDir there is no SPT. The RP in PIM BiDir does not have to be a physical device, since the RP is not performing any control plane functions; it is simply a way of forwarding traffic the right way, think of it as a vector. The RP can be a physical device, and in that case it is a normal RP, just without the responsibilities of an RP as we know it in PIM ASM. When configuring PIM BiDir with redundant RPs, the RP is sometimes called a Phantom RP, because it does not have to reside on a physical device.

PIM BiDir is often used in "hoot n holler" and financial applications. PIM BiDir and PIM SSM are at different ends of the spectrum, where PIM ASM can serve both types of applications.

PIM uses the concept of Reverse Path Forwarding (RPF) to ensure loop-free forwarding. RPF ensures that traffic comes in on the interface that would be used to send traffic out towards the source. PIM BiDir can send traffic both up and down the RPT, which is not normally possible with RPF alone. To support this, PIM BiDir elects a Designated Forwarder (DF) on each segment, even point-to-point segments. The main responsibility of the DF is to forward traffic upstream towards the RP. The DF is elected based on the metric towards the RP, essentially building a tree along the best path without having to install any (S,G) state. RPF is still used to find the appropriate path towards the Rendezvous Point Link (RPL), but it is the DF mechanism that ensures loop-free forwarding.

RP Considerations

In PIM BiDir there is no MSDP; since it does not use (S,G) state, this is expected. To provide a redundant RP in PIM BiDir, a Phantom RP is used. The Phantom RP is a virtual RP which is not assigned to a physical device. It is often implemented by having two routers use a loopback in the same subnet but with different subnet mask lengths.

PIM3

Routers are configured with the RP address 192.0.2.1, which is the Phantom RP. The actual routers that the traffic will flow through have been assigned 192.0.2.2 and 192.0.2.3, but with different net mask lengths. Normal best path rules will then forward traffic towards the longest prefix match, which will be RP1 when it is available and RP2 when RP1 is not available. It is important not to configure the RP address as a physical interface address, since this would break the redundancy. If a router was configured with the real address, it would not forward the traffic, since the traffic would be destined for one of its own addresses.
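
A minimal Phantom RP sketch in IOS syntax, using the addresses above. The OSPF network type is there so the loopbacks are advertised with their configured masks rather than as /32s:

    ! RP1: the /30 is the longest match covering the RPA 192.0.2.1
    interface Loopback0
     ip address 192.0.2.2 255.255.255.252
     ip ospf network point-to-point
    !
    ! RP2: the shorter /29 also covers 192.0.2.1 and takes over if the /30 disappears
    interface Loopback0
     ip address 192.0.2.3 255.255.255.248
     ip ospf network point-to-point
    !
    ! On all routers: the RPA itself, owned by no device
    ip pim rp-address 192.0.2.1 bidir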

Since the RP is so critical, redundancy must be provided. All traffic will pass through the RP, which means that certain links in the network may have to carry a lot of the traffic. For this reason it can be necessary to have several RPs, each acting as the RP for a different set of multicast groups. The placement of the RP also becomes very important, since traffic must flow through it.

PIM BiDir Considerations

PIM BiDir uses the DF mechanism, and for the election to succeed, all the PIM routers on the segment must support PIM BiDir; otherwise the DF election will fail and PIM BiDir will not be supported on the segment. It is possible to have non-BiDir routers on a segment if a PIM neighbor filter is implemented so that PIM adjacencies are not formed with those routers; a sketch follows below. That way PIM BiDir can be gradually introduced into the network.
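
A sketch of such a neighbor filter in IOS syntax (the ACL number and the 198.51.100.10 address of the non-BiDir router are assumptions):

    ! Refuse a PIM adjacency with the non-BiDir router on this segment
    access-list 1 deny 198.51.100.10
    access-list 1 permit any
    !
    interface GigabitEthernet0/2
     ip pim sparse-mode
     ip pim neighbor-filter 1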

Closing Thoughts

PIM ASM supports all multicast models but at the cost of complexity. One could say that it's a jack of all trades that does not excel at anything. PIM SSM is less complex and the best choice for one-to-many applications if the receivers support IGMPv3. PIM BiDir is best suited for many-to-many applications and keeps the least state of all the PIM implementations. Every PIM implementation has its use case, and as an architect/designer it's your job to know all the models and pick the best one based on business requirements.


Interview with CCDE/CCAr Program Manager Elaine Lopes

August 2, 2015

I am currently studying for the CCDE exam. Elaine Lopes is the program manager for the CCDE and CCAr certifications. I've had the pleasure of interacting with her online and meeting her at Cisco Live as well. The CCDE is a great certification, and I wanted to give you some insight into the program and to ask about the future of the CCDE. A big thanks to Elaine and Cisco for agreeing to do the interview.

Daniel: Hi Elaine, and welcome. It was nice seeing you at Cisco Live! Can you please give a brief introduction of yourself to the readers?

Elaine: Hi, it was nice to see you, too! My name is Elaine Lopes and I'm the CCDE and CCAr Certification Program Manager. I've been with Cisco's Learning@Cisco team since 1999 – I'm passionate about how people's lives can change for the better through education and certification.

Daniel: Elaine, why did Cisco create an expert level design program? What kind of people should be looking at the CCDE?

Elaine: Cisco has very well established expert-level certifications for network engineers in various fields which assess configuration, implementation, troubleshooting and operations skills; however, these certifications were never aimed to assess design skills. The root cause of many network failures is poor network design and the CCDE helps to fill this gap. The certification was created to assess a candidate’s skills in real-life network design. The candidate should mainly be able to meet business requirements through their network designs as well as understand design principles such as network resiliency, scalability and manageability! CCDE focuses on design by making technology decisions and justifying the choices made. Since CCDE is meant to assess design skills, it targets infrastructure network designers.

Daniel: Are there any prerequisites before taking the CCDE?

Elaine: There are no pre-requisites for the CCDE certification, although it is recommended that candidates have 7+ years of experience on network design in diverse environments.

Daniel: What kind of experience should a candidate have to be a good fit for the CCDE? What is the technology range that needs to be covered such as RS, SP, datacenter, security?

Elaine: CCDE is a role-based certification, and therefore it is desirable that candidates have experience (breadth and depth) in large-scale network designs, as they will be tested on making design decisions within constraints. CCDE focuses mainly on Layer 3 control plane, Layer 2 control plane and network virtualization technologies, but also assesses QoS, security, network management with a little of wireless, optical and storage technologies.

Daniel: Can you give us a short description of the exam process and at which locations the exam is available?

Elaine: The CCDE certification is divided into two steps. The first step is the CCDE written exam, which focuses on design aspects of the various technologies described above, and can be taken at any Pearson VUE testing center at any time. Once CCDE candidates pass the written exam, they then need to pass the CCDE practical exam, which is made of four different scenarios where technologies and design concepts are interconnected. The CCDE practical exam is tridimensional: the same technologies tested on the CCDE written exam plus the different job tasks (merge/divest, add technology, replace technology, scaling and design failure), and the task domains (analyze requirements, design, plan for the design deployment, and validate and optimize network designs). The CCDE practical exam is administered four days a year at any of the 275 Pearson Professional Centers (PPCs) worldwide.

Daniel: You are also responsible for the CCAr program. What is the difference between design and architecture? What kind of candidates should be looking at this exam?

Elaine: CCArs collaborate with senior leadership to create a vision for the network, and their outputs are the business and technical requirements which will be input for CCDEs to create a network design that meets these requirements. The pre-requisite for CCAr is to be a CCDE in good standing, with the target audience being the infrastructure architects who navigate between the technical and business worlds.

Daniel: What kind of study resources are available for the CCDE? I know you have been working hard on providing guidance in the written blueprint, what else is coming?

Elaine: The biggest challenge for CCDE candidates seems to be how to get started, so we recently launched the Streamlined Preparation Resources. This site offers a study methodology and links to many recordings with information about the CCDE program. It's mainly a list of preparation resources that can be personalized for one's own needs and offers diverse resource types in a very prescriptive way for CCDE candidates to prepare for the CCDE written exam, but it can also be helpful for the practical exam. Since the CCDE practical exam is situation-based, the team decided to provide materials to make candidates think as network designers. The materials are not mapped 1:1 to the blueprint, but our long-term objective will be to release the materials in bits and pieces as they become available.

Daniel: Elaine, tell us a bit about the upcoming CCDE study guide. Why was this book written and how should it be used to prepare for the CCDE?

Elaine: Marwan [Al-Shawi] approached me saying he wanted to author a CCDE book, so we had some conversations and exchanged many emails which helped shape the book outline, aiming to be an “all-in-one” study guide for the CCDE practical exam. He then went through the whole publishing process with Cisco Press. There are great technical reviewers involved in it, and the book is to be released soon – I can’t wait to get my copy!

Daniel: After my last blog post on the CCIE program, I received some comments where people questioned the integrity of the CCIE exam. How do you work with the integrity of the CCDE?

Elaine: Integrity has always been top of mind when considering the delivery of the CCDE practical exam: it’s Windows-based and administered at the secure PPCs. The exam changes between administrations, and the nature of the scenarios makes it hard to guess the responses.

Daniel: I know you have designed the CCDE to be as timeless and generic as possible while still covering the relevant technologies. How will the exam be affected of new technologies and forwarding paradigms, such as SDN?

Elaine: True. I’d expect the CCIEs and CCDEs out there to be at the forefront of adoption of these new technologies in the field, and we’re already making plans for incorporating these new technologies into the CCIE tracks. CCDE will be no different, but I don’t have details yet I can share.

Daniel: Creating exams is very difficult and people often have opinions on the material being tested. It’s not well known that you can comment on the exam while taking it. Isn’t it true that comments are one of your sources of feedback on the quality of the exam?

Elaine: I heavily rely on statistical analysis to understand both item and exam performance before making any adjustments to the exam. To get a more holistic view, I also read the comments candidates make on items while taking the exam. These comments sometimes provide good insight on how to fix low-performing items.

Daniel: Elaine, how can people give feedback on the program outside of comments while taking the exam?

Elaine: Just send me an email elopes@cisco.com. To be informed on what’s going on in the CCDE world, you can connect with me on LinkedIn (Elaine Lopes) and/or Twitter (@elopes01).

Daniel: There is a Subject Matter Expert (SME) recruitment program for certifications within Cisco. Do you have SMEs for the CCDE and how can any CCDE’s out there contact Cisco if they want to be part of the SME program?

Elaine: SMEs are critical to assure the exams are relevant, so yes, I do have several CCDE certified SMEs participating on the various phases of the exam, from development teams to authoring/reviewing/editing items, to working on the preparation resources, to the maintenance of the existing exams, etc. If you are CCDE-certified and want to participate, join the program and I’ll contact you for the next opportunity to participate.

Daniel: Thank you so much for your time, Elaine! I hope we’ll meet soon again. Do you have any final words and where do you see the CCDE program going in the future?

Elaine: I wanted to give a hint to CCDE candidates taking the practical exam: take the time to read and connect with each scenario and don’t make decisions or assumptions outside the context of the scenario. If you read a question and the answer is not glaring, go back to the scenario materials! CCDE design principles don’t change, so when the time comes I see “sprinkling” design aspects of new technologies in the CCDE exams. Hope to see you soon!! It’s been a pleasure to participate, thank you for inviting me!


Service Provider IPv6 Deployment

June 29, 2015

These are my study notes regarding IPv6 deployment in SP networks in preparation for the CCDE exam.

Drivers for implementing IPv6

  • External drivers
    • SP customers that need access to IPv6 resources
    • SP customers that need to interconnect their IPv6 sites
    • SP customers that need to interface with their own customers over IPv6
  • Internal drivers
    • Handle problems that may be hard to fix with IPv4, such as large numbers of devices (cell phones, IP cameras, sensors etc)
    • Public IPv4 address exhaustion
    • Private IPv4 address exhaustion
  • Strategic drivers
    • Long term expansion plans and service offerings
    • Preparing for new services and gaining competitive advantage

Infrastructure

  • SP Core Infrastructure
    • Native IPv4 core
    • L2TPv3 for VPNs
    • MPLS core
    • MPLS VPNs

My reflection is that most cores would be MPLS enabled; however, there are projects such as Terastream at Deutsche Telekom where the entire core is IPv6 enabled and L2TPv3 is used in place of MPLS.

  • IPv6 in Native IPv4 Environments
    • Tunnel v6 in v4
    • Native v6 with dedicated resources
    • Dual stack

The easiest way to get going with v6 was to tunnel it over v4. The next logical step was to enable v6, but on separate interfaces, to not disturb the "real" traffic and to be able to experiment with the protocol. The end goal is dual stack, at least in a non-MPLS enabled network.

  • IPv6 in MPLS environments
    • 6PE
    • 6VPE

6PE is a technology to run IPv6 over an IPv4 enabled MPLS network. 6VPE does the same but with VRFs.

  • Native IPv6 over Dedicated Data Link
    • Dedicated data links between core routers
    • Dedicated data links to IPv6 customers
    • Connection to an IPv6 IX
  • Dual stack
    • All P + PE routers capable of v4 + v6 transport
    • Either two IGPs or one IGP for both v4 + v6
    • Requires more memory due to two routing tables
    • IPv6 multicast natively supported
    • All IPv6 traffic is routed in global space (no MPLS)
    • Good for content distribution and global services (Internet)
  • 6PE
    • IPv6 global connectivity over an IPv4 MPLS core
    • Transition mechanism (debatable)
    • PEs are dual stacked and need 6PE configuration
    • IPv6 reachability exchanged via MPBGP over iBGP sessions
    • IPv6 packets transported from 6PE to 6PE inside MPLS
    • The next-hop is an IPv4 mapped IPv6 address such as ::FFFF:1.1.1.1
    • BGP label assigned for the IPv6 prefix
    • Bottom label used due to P routers not v6 capable and for load sharing
    • neighbor send-label is configured under BGP address-family ipv6

6PE is viewed as a transition mechanism, but this is arguable; if you transport IPv4 over MPLS, you may want to do the same with IPv6 for consistency. Running 6PE does mean that there is fate sharing between v4 and v6, though, which could mean that an outage affects both protocols. This could be avoided by running MPLS for IPv4 but v6 natively.
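
A minimal 6PE sketch on a PE router in IOS syntax, matching the bullets above (the AS number and the neighbor address, the other PE's loopback, are assumptions):

    router bgp 65000
     neighbor 10.0.0.2 remote-as 65000
     neighbor 10.0.0.2 update-source Loopback0
     !
     address-family ipv6
      neighbor 10.0.0.2 activate
      ! Exchange IPv6 prefixes plus an MPLS label over the IPv4 iBGP session
      neighbor 10.0.0.2 send-label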

  • Core network (P routers) left untouched
  • IPv6 traffic inherits MPLS benefits such as fast-reroute and TE
  • Incremental deployment possible (upgrade PE routers first)
  • Each site can be v4-only, v4-VPN-only, v4+v6, v4-VPN+v6 and so on
  • Scalability concerns due to separate RIB and FIB required per customer
  • Mostly suitable for SPs with a limited number of PEs
  • 6vPE
    • Equivalent of VPNv4 but for IPv6
    • Add VPNv6 address family under MPBGP
    • Send extended communities for the prefixes under the address family

It is a common misconception with 6PE and 6vPE that traceroutes are not possible; that is, however, not entirely true. A P router can generate ICMPv6 messages that will follow the LSP to the egress PE, and then the ICMPv6 error message is forwarded back to the originator of the traceroute.

  • Route reflectors for 6PE and 6vPE
    • Needed to scale BGP full mesh
    • Dedicated RRs or data path RRs
    • Either dedicated RR per AF or have multiple AFs per RR
    • 6PE-RR must support IPv6 + label functionality
    • 6vPE-RR must support IPv6 + label and extended communities functionality

PA vs PI

  • PA advantages
    • Aggregation towards upstreams
    • Minimizes Internet routing table size
  • PA disadvantages
    • Customer is “locked” with the SP
    • Renumbering can be painful
    • Multi-homing and TE problems

The main driver here is whether you are going to multi-home or not. Renumbering is always painful, but at least less so with IPv6 due to being able to advertise multiple IPv6 prefixes through Router Advertisements (RA).

  • PI advantages
    • Customers are not “locked” to the SP
    • Multi homing is straight forward
  • PI disadvantages
    • Larger Internet routing table due to lack of efficient aggregation
    • Memory and CPU needs on BGP speakers

Infrastructure Addressing (LLA vs global)

What type of addresses should be deployed on infrastructure links?

  • Link Local Address FE80::/10
    • Non-routable address
    • Less attack surface
    • Smaller routing tables
    • Can converge faster due to smaller RIB/FIB
    • Less need for iACL at edge of network
    • Can’t ping links
    • Can’t traceroute links
    • May be more complex to manage with NMS
    • Use global address on loopback for ICMPv6 messages
    • Will not work with RSVP-TE tunnels
  • Global only 2000::/3 (current IANA prefix)
    • Globally routable
    • Larger attack surface unless prefix suppression is used
    • Use uRPF and iACL at edge to protect your links
    • Easier to manage

It would be interesting to hear if you have seen any deployments with LLA-only infrastructure links. In theory it's a nice idea, but it may corner you in some cases, preventing you from implementing other features that you wish to deploy in your network.

Use /126 or /127 on P2P links, which is the equivalent of /30 or /31 on IPv4 links. For loopbacks, use /128 prefixes. Always assign addresses from a contiguous range so that creating ACLs and iACLs becomes less tedious.
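
As a sketch in IOS syntax, using the documentation prefix 2001:DB8::/32 (the interface names and exact subnets are assumptions):

    ! P2P link, /127 as per RFC 6164
    interface GigabitEthernet0/0
     ipv6 address 2001:DB8:0:FFFE::/127
    !
    ! Loopback, /128
    interface Loopback0
     ipv6 address 2001:DB8::1/128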

Using another prefix than /64 on an interface will break the following features:

  • Neighbor Discovery (ND)
  • Secure Neighbor Discovery (SEND)
  • Privacy extensions
  • PIM-SM with embedded RP

This is of course for segments where there are end users.

Prefix Allocation Practices

  • Many SPs offer /48, /52, /56, /60 or /64 prefixes
  • Enterprise customers receive one /48 or more
  • Small business customers receive /52 or /56 prefix
  • Broadband customers may receive /56 or /60 via DHCP Prefix Delegation (DHCP-PD)

Debating prefix allocation practices is like debating religion, politics or your favourite OS. Whatever you choose, make sure that you can revise your practice as future services and needs arise.
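
A DHCP-PD sketch in IOS syntax (the pool names, the /40 aggregate and the /56 delegation size are assumptions):

    ! Carve /56 delegations out of a /40 aggregate
    ipv6 local pool CUST-PD 2001:DB8:100::/40 56
    !
    ipv6 dhcp pool BROADBAND
     prefix-delegation pool CUST-PD
    !
    interface GigabitEthernet0/0
     ipv6 dhcp server BROADBAND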

Carrier Grade NAT (CGN)

  • Short-term solution to IPv4 exhaustion without changing the Residential Gateway (RG) or SP infrastructure
  • Subscriber uses NAT44 and SP does CGN with NAT44
  • Multiplexes several customers onto the same public IPv4 address
  • CGN performance and capabilities should be analysed in the planning phase
  • May provide challenges in logging sessions
  • Long term solution is to deploy IPv6

I really don’t like CGN, it slows down the deployment of IPv6. It’s a tool like anything else though that may be used selectively if there is no other solution available.

IPv6 over L2TP Softwires

  • Dual stack IPv4/IPv6 on RG LAN side
  • PPPoE or IPv4oE terminated on v4-only BNG
  • L2TPv2 softwire between RG and IPv6-dedicated L2TP Network Server (LNS)
  • Stateful architecture on LNS
    • Offers dynamic control and granular accounting of IPv6 traffic
  • Limited investment needed and limited impact on existing infrastructure

I have never seen IPv6 deployed over softwires, what about you readers?

6RD

  • Uses 6RD CE (Customer Edge) and 6RD BR (Border Relay)
  • Automatic prefix delegation on 6RD CE
  • Stateless and automatic IPv6-in-IPv4 encap and decap functions on the 6RD CE and BR
  • Follows IPv4 routing
  • 6RD BRs are addressed with IPv4 anycast for load sharing and resiliency
  • Limited investment and impact on existing infrastructure

IPv4 via IPv6 Using DS-Lite with NAT44

  • Network has migrated to IPv6 but needs to provide IPv4 services
  • IPv4 packets are tunneled over IPv6
  • Introduces two components: B4 (Basic Bridging Broadband Element) and AFTR (Address Family Transition Router)
    • B4 typically sits in the RG
    • AFTR is located in the core infrastructure
  • Does not allow IPv4 and IPv6 hosts to talk to each other
  • AFTR device terminates the tunnel and decapsulates IPv4 packet
  • AFTR device performs NAT44 on customer private IP to public IP addresses
  • Encapsulation increases packet size; be aware of MTU and fragmentation

Connecting IPv6-only with IPv4-only (AFT64)

  • Only applicable where IPv6-only hosts need to communicate with IPv4-only hosts
  • Stateful or stateless v6 to v4 translation
  • Includes NAT64 and DNS64

MAP (Mapping of Address and Port)

  • MAP-T Stateless 464 translation
  • MAP-E Stateless 464 encapsulation
  • Allows sharing of IPv4 address across an IPv6 network
    • Each shared IPv4 endpoint gets a unique TCP/UDP port range via “rules”
    • All or part of the IPv4 address can be derived from the IPv6 prefix
      • This allows for route summarization
    • Need to allocate TCP/UDP port ranges to each CPE
  • Stateless border relays in SP network
    • Can be implemented in hardware for superior performance
    • Can use anycast and have asymmetric routing
    • No single point of failure
  • Leverages IPv6 in the network
  • No CGN inside SP network
  • No need for logging or ALGs
  • Dependent on CPE router

NAT64

  • Stateful or stateless translation
  • Stateful
    • 1:N translation
    • “PAT”
    • TCP, UDP, ICMP
    • Shares IPv4 addresses
  • Stateless
    • 1:1 translation
    • “NAT”
    • Any protocol
    • No IPv4 address savings

DNS64 is often required in combination with NAT64 to synthesize AAAA responses for the IPv6-only hosts in case the server only exists in the v4 world.
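
A stateful NAT64 sketch in IOS-XE syntax (the interface names, pool, ACL name and address ranges are assumptions; 64:FF9B::/96 is the well-known prefix, and DNS64 itself runs on a DNS server, not on the router):

    ! Interfaces participating in translation
    interface GigabitEthernet0/0/0
     nat64 enable
    !
    interface GigabitEthernet0/0/1
     nat64 enable
    !
    nat64 prefix stateful 64:FF9B::/96
    nat64 v4 pool POOL1 203.0.113.1 203.0.113.10
    ! Translate IPv6 sources matched by the ACL, sharing the v4 pool
    nat64 v6v4 list NAT64-ACL pool POOL1 overload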

464XLAT

  • Somewhere around 15% of apps break with native v6 or NAT64
  • Skype is one of these apps
  • 464XLAT can help with most of these applications
  • Handset does stateless 4 to 6 translation
  • Network does NAT64
  • Deployed by T-Mobile

Design Considerations for North/South Flows in the Data Center

May 28, 2015

Traditional data centers have been built by using standard switches and running Spanning Tree (STP). STP blocks redundant links and builds a loop-free tree which is rooted at the STP root. This kind of topology wastes a lot of links which means that there is a decrease in bisectional bandwidth in the network. A traditional design may look like below where the blocking links have been marked with red color.

DC1-STP

If we then remove the blocked links, the tree topology becomes very clear and you can see that there is only a single path between the servers. This wastes a lot of bandwidth and does not provide enough bisectional bandwidth. Bisectional bandwidth is the bandwidth that is available from the left half of the network to the right half of the network.

DC2-STP

The traffic flow is highlighted below.

DC3-Bisectional

Technologies like FabricPath (FP) or TRILL can overcome these limitations by running IS-IS and building loop-free topologies without blocking any links. They can also take advantage of Equal Cost Multi Path (ECMP) to provide load sharing without doing any complex VLAN manipulations like with STP. A leaf and spine design is most commonly used to provide a high amount of bisectional bandwidth.

DC1-Multipath

Hot Standby Routing Protocol (HSRP) has been around for a long time, providing First Hop Redundancy (FHR) in our networks. The drawback of HSRP is that there is only one active forwarder, so even if we run a layer 2 multipath network through FP, there will only be one active path for routed traffic flows.

DC-FP-1

The reason for this is that FP advertises its Switch ID (SID) and that the Virtual MAC (vMAC) will be available behind the FP switch that is the HSRP active device. Switched flows can still use all of the available bandwidth.

To overcome this, there is the possibility of running VPC+ between the switches and having the switches advertise an emulated SID, pretending to be one switch so that the vMAC will be available behind that SID.

DC-FP-2

There are some drawbacks to this, however. It requires that you run VPC+ in the spine layer, and you can still only have two active forwarders; if you have more spine devices, they will not be utilized for North/South flows. To overcome this there is a feature called Anycast HSRP.

DC-FP-3

Anycast HSRP works in a similar way by advertising a virtual SID, but it does not require links between the spines or VPC+. It also currently supports up to four active forwarders, which provides double the bandwidth compared to VPC+.

Modern data centers provide for a lot more bandwidth and bisectional bandwidth than previous designs, but you still need to consider how routed flows can utilize the links in your network. This post should give you some insights on what to consider in such a scenario.

Introduction to Storage Networking and Design

April 29, 2015

Introduction

Storage and storage protocols are not generally well known by network engineers. Networking and storage have traditionally been two silos. Modern networks and data centers are looking to consolidate these two networks into one and to run them on a common transport such as Ethernet.

Hard Disks and Types of Storage

Hard disks can use different type of connectors and protocols.

  • Advanced Technology Attachment (ATA)
  • Serial ATA (SATA)
  • Fibre Channel (FC)
  • Small Computer System Interface (SCSI)
  • Serial Attached SCSI (SAS)

ATA and SCSI are older standards; newer disks will typically use SATA or SAS, where SAS is more geared towards the enterprise market. FC is used to attach to a Storage Area Network (SAN).

Storage can either be file-level storage or block-level storage. File-level storage provides access to a file system through protocols such as Network File System (NFS) or Common Internet File System (CIFS). Block-level storage can be seen as raw storage that does not come with a file system. Block-level storage presents Logical Unit Numbers (LUNs) to servers, and the server may then format that raw storage with a file system. VMware uses the VMware File System (VMFS) to format raw devices.

DAS, NAS and SAN

Storage can be accessed in different ways. Directly Attached Storage (DAS) is storage that is attached to a server; it may also be described as captive storage. There is no efficient sharing of storage, and it can be complex to implement and manage. To be able to share files, the storage needs to be connected to the network. Network Attached Storage (NAS) enables the sharing of storage through the network via protocols such as NFS and CIFS. Internally, SCSI and RAID will commonly be implemented. A Storage Area Network (SAN) is a separate network that provides block-level storage, as compared to the NAS that provides file-level storage.

Virtualization of Storage

Everything is being abstracted and virtualized these days, storage is no exception. The goal of anything being virtualized is to abstract from the physical layer and to provide a better utilization and less/no downtime when making changes to the storage system. It is also key in scaling since direct attached storage will not scale well. It also helps in decreasing the management complexity if multiple pools of storage can be accessed from one management tool. One basic form of virtualization is creating virtual disks that use a subset of the storage available on the physical device such as when creating a virtual machine in VmWare or with other hypervisors.

Virtualization exists at different levels such as block, disk, file system and file virtualization.

One form of file system virtualization is the concept of NAS where the storage is accessed through NFS or CIFS. The file system is shared among many hosts which may be running different operating systems such as Linux and Windows.

Block level storage can be virtualized through virtual disks. The goal of virtual disks is to make them flexible, being able to increase and decrease in size, provide as fast storage as needed and to increase the availability compared to physical disks.

There are also other forms of virtualization/abstracting where several LUNs can be hidden behind another LUN or where virtual LUNs are sliced from a physical LUN.

Storage Protocols

There are a number of protocols available for transporting storage traffic. Some of them are:

Internet Small Computer System Interface (iSCSI) – Transports SCSI requests over TCP/IP. Not suitable for high performance storage traffic

Fibre Channel Protocol (FCP) – It’s the interface protocol of SCSI on fibre channel

Fibre Channel over IP (FCIP) – A form of storage tunneling or FC tunneling where FC information is tunneled through the IP network. Encapsulates the FC block data and transports it through a TCP socket

Fibre Channel over Ethernet (FCoE) – Encapsulating FC information into Ethernet frames and transporting them on the Ethernet network

Storage protocols

Fibre Channel

Fibre channel is a technology to attach to and transfer storage. FC requires lossless transfer of storage traffic which has been difficult/impossible to provide on traditional IP/Ethernet based networks. FC has provided more bandwidth traditionally than Ethernet, running at speeds such as 8 Gbit/s and 16 Gbit/s but Ethernet is starting to take over the bandwidth race with speeds of 10, 40, 100 or even 400 Gbit/s achievable now or in the near future.

There are a lot of terms in Fibre Channel which are not familiar to us coming from the networking side. I will go through some of them here:

Host Bus Adapter (HBA) – A card with FC ports to connect to storage, the equivalent of a NIC

N_Port – Node port, a port on a FC host

F_Port – Fabric port, port on a switch

E_Port – Expansion port, port connecting two fibre channel switches and carrying frames for configuration and fabric management

TE_Port – Trunking E_Port, Cisco MDS switches use Enhanced Inter Switch Link (EISL) to carry these frames. VSANs are supported with TE_Ports, carrying traffic for several VSANs over one physical link

World Wide Name (WWN) – All FC devices have a unique identity called WWN which is similar to how all Ethernet cards have a MAC address. Each N_Port has its own WWN

World Wide Node Name (WWNN) – A globally unique identifier assigned to each FC node or device. For servers and hosts, the WWNN is unique for each HBA; if a server has two HBAs, it will have two WWNNs.

World Wide Port Number (WWPN) – A unique identifier for each FC port of any FC device. A server will have a WWPN for each port of the HBA. A switch has WWPN for each port of the switch.

Initiator – Clients, called initiators, issue SCSI commands to request services from logical units on a server that is known as a target

Fibre channel has many similarities to IP (TCP) when it comes to communicating.

  • Point to point oriented – facilitated through device login
    • Similar to TCP session establishment
  • N_Port to N_Port connection – logical node connection point
    • Similar to TCP/UDP sockets
  • Flow controlled – hop by hop and end-to-end basis
    • Similar to TCP flow control but a different mechanism where no drops are allowed
  • Acknowledged – For certain types of traffic but not for others
    • Similar to how TCP acknowledges segments
  • Multiple connections allowed per device
    • Similar to TCP/UDP sockets

Buffer to Buffer Credits

FC requires lossless transport and this is achieved through B2B credits.

  • Source regulated flow control
  • B2B credits used to ensure that FC transport is lossless
  • The number of credits is negotiated between ports when the link is brought up
  • The number of credits is decremented with each packet placed on the wire
    • Does not rely on packet size
    • If the number of credits is 0, transmission is stopped
  • The number of credits is incremented when a receiver ready (R_RDY) message is received
  • The number of B2B credits needs to be taken into consideration as bandwidth and/or distance increases

Virtual SAN (VSAN)

Virtual SANs allow better utilization of the physical fabric, essentially providing the same functionality that 802.1Q does for Ethernet.

  • Virtual fabrics created from a larger cost-effective and redundant physical fabric
  • Reduces waste of ports of a SAN island approach
  • Fabric events are isolated per VSAN, allowing for higher availability and isolation
  • FC features can be configured per VSAN, allowing for greater versatility

Fabric Shortest Path First (FSPF)

To find the best path through the fabric, FSPF can be used. The concept should be very familiar if you know OSPF.

  • FSPF routes traffic based on the destination Domain ID
  • For FSPF, a Domain ID identifies a switch within a VSAN
    • The maximum number of switches supported in a fabric is then limited to 239
  • FSPF
    • Performs hop-by-hop routing
    • The total cost is calculated to find the least cost path
    • Supports the use of equal cost load sharing over links
  • Link costs can be manually adjusted to affect the shortest paths
  • Uses Dijkstra algorithm
  • Runs only on E_Ports or TE_Ports and provides loop free topology
  • Runs on a per VSAN basis

Zoning

To provide security in the SAN, zoning can be implemented; a configuration sketch follows the list below.

  • Zones are a basic form of data path security
    • A bidirectional ACL
    • Zone members can only “see” and talk to other members of the zone. Similar to PVLAN community port
    • Devices can be members of several zones
    • By default, devices that are not members of a zone will be isolated from other devices
  • Zones belong to a zoneset
  • The zoneset must be active to enforce the zoning
    • Only one active zoneset per fabric or per VSAN
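
Tying the list above together, a zoning sketch in Cisco MDS (NX-OS) syntax; the zone/zoneset names, the VSAN number and the WWPNs are assumptions:

    zone name Z_HOST1_ARRAY1 vsan 10
     member pwwn 21:00:00:e0:8b:05:05:01
     member pwwn 50:06:01:60:3b:e0:01:44
    !
    zoneset name ZS_FABRIC_A vsan 10
     member Z_HOST1_ARRAY1
    !
    ! Zoning is only enforced once the zoneset is activated
    zoneset activate name ZS_FABRIC_A vsan 10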

SAN Drivers

What are the drivers for implementing a SAN?

  • Lower Total Cost of Ownership (TCO)
  • Consolidation of storage
  • To provide better utilization of storage resources
  • Provide a high availability
  • Provide better manageability

Storage Design Principles

These are some of the important factors when designing a SAN:

  • Plan a network that can handle the number of ports now and in the future
  • Plan the network with a given end-to-end performance and throughput level in mind
  • Don’t forget about physical requirements
  • Connectivity to remote data centers may be needed to meet the business requirements of business continuity and disaster recovery
  • Plan for an expected lifetime of the SAN and make sure the design can support the SAN for its expected lifetime

Device Oversubscription and Consolidation

  • Most SAN designs will have oversubscription or fan-out from the storage devices to the hosts.
    • Follow guidelines from the storage vendor to not oversubscribe the fabric too heavily.
  • Consolidate the storage but be aware of the larger failure domain and fate sharing
    • VSANs enable consolidation while still keeping separate failure domains

When consolidating storage, there is an increased risk that all of the storage or a large part of it can be brought offline if the fabric or storage controllers fail. Also be aware that when using virtualization techniques such as VSANs, there is fate sharing, because several virtual topologies use the same physical links.

Convergence and Stability

  • To support fast convergence, the number of switches in the fabric should not be too large
  • Be aware of the number of parallel links; a lot of links will increase processing time and SPF run time
  • Implement appropriate levels of redundancy in the network layer and in the SAN fabric

The above guidelines are very general, but the key here is that providing too much redundancy may actually decrease the availability, as the Mean Time To Repair (MTTR) increases in case of a failure. The more nodes and links in the fabric, the larger the link state database gets, and this will lead to SPF runs taking longer. The general rule is that two links are enough and three is the maximum; anything more than that is overdoing it. The use of port channels can help in achieving redundancy while keeping the number of logical links in check.

SAN Security

Security is always important, but in the case of storage it can be very critical and regulated by PCI DSS, HIPAA, SOX or other standards. Having poor security on the storage may then not only get you fired but put you behind bars, so security is key when designing a SAN. These are some recommendations for SAN security:

  • Use secure role-based management with centralized authentication, authorization and logging of all the changes
  • Centralized authentication should be used for the networking devices as well
    • Only authorized devices should be able to connect to the network
  • Traffic should be isolated and secured with access controls so that devices on the network can send and receive data securely while being protected from other activities of the network
  • All data leaving the storage network should be encrypted to ensure business continuance
    • Don’t forget about remote vaulting and backup
  • Ensure the SAN and network passes any regulations such as PCI DSS, HIPAA, SOX etc

SAN Topologies

There are a few common designs in SANs depending on the size of the organization. We will discuss a few of them here and their characteristics and strong/weak points.

Collapsed Core Single Fabric

Collapsed-core-single-fabric

In the collapsed core, both the initiator and the target are connected through the same device. This means all traffic can be switched without using any Inter Switch Links (ISL). This provides full non-blocking bandwidth, and there should be no oversubscription. It's a simple design to implement and support, and it's also easy to manage compared to more advanced designs.

The main concern of this design is how redundant the single switch is. Does it provide redundant power? Does it have a single fabric or an extra fabric for redundancy? Does it have redundant supervisors? At the end of the day, a single device may go belly up, so you have to consider the time it would take to restore your fabric and whether this downtime is acceptable compared to a design with more redundancy.

Collapsed Core Dual Fabric

Collapsed-core-dual-fabric

The dual fabric design removes the Single Point of Failure (SPoF) of the single switch design. Every host and storage device is connected to both fabrics, so there is no need for an ISL. The ISL would only be useful in case the storage device loses its port towards fabric A and the server loses its port towards fabric B. This scenario may not be that likely, though.

The drawback compared to the single fabric is the cost of getting two of every equipment to create the dual fabric design.

Core Edge Dual Fabric

Core-edge-dual-fabric

For large scale SAN designs, the fabric is divided into a core and edge part where the storage is connected to the edge of the fabric. This design is dual fabric to provide high availability. The storage and servers are not connected to the same device, meaning that storage traffic must pass the ISL links between the core and the edge. The ISL links must be able to handle the load so that the oversubscription ratio is not too high.

The more devices that get added to a fabric, the more complex it gets and the more devices you have to manage. For a large design you may not have many options though.

Fibre Channel over Ethernet (FCoE)

Maintaining one network for storage and one for normal user data is costly and complex. It also means that you have a lot of devices to manage. Wouldn’t it be better if storage traffic could run on the normal network as well? That is where FCoE comes into play. The FC frames are encapsulated into Ethernet frames and can be sent on the Ethernet network. However, Ethernet isn’t lossless, is it? That is where Data Center Bridging (DCB) comes into play.

Data Center Bridging (DCB)

Ethernet is not a lossless protocol. Some devices may have support for the use of PAUSE frames but these frames would stop all communication, meaning your storage traffic would come to a halt as well. There was no way of pausing only a certain type of traffic. To provide lossless transfer of frames, new enhancements to Ethernet had to be added.

Priority Flow Control (PFC)

  • PFC is defined in 802.1Qbb and provides PAUSE based on 802.1p CoS
  • When link is congested, CoS assigned to “no-drop” will be paused
  • Other traffic assigned to other CoS values will continue to transmit and rely on upper layer protocols for retransmission
  • PFC is not limited to FCoE traffic

It is also desirable to be able to guarantee traffic a certain amount of the bandwidth available and to not have a class of traffic use up all the bandwidth. This is where Enhanced Transmission Selection (ETS) has its use.

Enhanced Transmission Selection (ETS)

  • Defined in 802.1Qaz and prevents a single traffic class from using all the bandwidth leading to starvation of other classes
  • If a class does not fully use its share, that bandwidth can be used by other classes
  • Helps to accommodate classes that have a bursty nature

The concept is very similar to doing egress queuing through MQC on a Cisco router.

We now have support for lossless Ethernet but how can we tell if a device has implemented these features? Through the use of Data Center Bridging eXchange (DCBX).

Data Center Bridging Exchange (DCBX)

  • DCBX is LLDP with new TLV fields
  • Negotiates PFC, ETS, CoS values between DCB capable devices
  • Simplifies management because parameters can be distributed between nodes
  • Responsible for logical link up/down signaling of Ethernet and Fibre Channel

What is the goal of running FCoE? What are the drivers for running storage traffic on our normal networks?

Unified Fabric

Data centers require a lot of cabling, power and cooling. Because storage and servers have required separate networks, a lot of cabling has been used to build these networks. With a unified fabric, a lot of cabling can be removed and the storage traffic can use the regular IP/Ethernet network, so that half the number of cables are needed. The following are some reasons for striving for a unified fabric:

  • Reduced cabling
    • Every server only requires 2xGE or 2x10GE instead of 2 Ethernet ports and 2 FC ports
  • Fewer access layer switches
    • A typical Top of Rack (ToR) design may have two switches for networking and two for storage, two switches can then be removed
  • Fewer network adapters per server
    • A Converged Network Adapter (CNA) combines networking and storage functionality so that half of the NICs can be removed
  • Power and cooling savings
    • Fewer NICs mean less power, which also saves on cooling. The reduced cabling may also improve the airflow in the data center
  • Management integration
    • A single network infrastructure and less devices to manage decreases the overall management complexity
  • Wire once
    • There is no need to recable to provide network or storage connectivity to a server

Conclusion

This post is aimed at giving the network engineer an introduction to storage. Traditionally there have been silos between the server, storage and networking people, but these roles overlap a lot more in modern networks. We will see networks built to provide for both data and storage traffic and to reduce storage complexity. Protocols like iSCSI may get a larger share of the storage world in the future, and FCoE may become larger as well.


OSPF Design Considerations

March 6, 2015

Introduction

Open Shortest Path First (OSPF) is a link state protocol that has been around for a long time. It is generally well understood, but design considerations often focus on the maximum number of routers in an area. What other design considerations are important for OSPF? What can we do to build a scalable network with OSPF as the Interior Gateway Protocol (IGP)?

Prefix Suppression

The main goal of any IGP is to be stable, converge quickly and provide loop-free connectivity. OSPF is a link state protocol, and all routers within an area maintain an identical Link State Database (LSDB). How the LSDB is built is out of scope for this post, but one relevant factor is that OSPF by default advertises stub links for all OSPF-enabled interfaces. This means that every router running OSPF installs these transit links into the routing table. In most networks these routes are not needed; only connectivity between loopbacks is needed, because peering is set up between the loopbacks. What is the drawback of this default behavior?

  • Larger LSDB
  • SPF run time increased
  • Growth of the routing table

To change this behavior, there is a feature called prefix suppression. When this feature is enabled, the stub links are no longer advertised. The benefits of using prefix suppression are:

  • Smaller LSDB
  • Shorter SPF run time
  • Faster convergence
  • Remote attack vector decreased

If there needs to be connectivity to these prefixes for the sake of monitoring or other reasons, these prefixes can be carried in BGP.
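
A sketch of enabling the feature in IOS syntax (the process number and interface name are assumptions):

    ! Enable for the whole OSPF process
    router ospf 1
     prefix-suppression
    !
    ! Or selectively per interface
    interface GigabitEthernet0/0
     ip ospf prefix-suppression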

How Many Routers in an Area?

The most common question is "How many routers in an area?" As usual, the answer is: it depends… In the past, router hardware such as the CPU and memory severely limited the scalability of an IGP, but these are not much of a factor on modern devices. So what factors decide how many routers can be deployed in an area?

  • Number of adjacent neighbors
  • How much information is flooded in the area? How many LSAs are in the area?
  • Keep router LSA under MTU size
    – Implies lots of interfaces (and possibly lots of neighbors)
    – Exceeding the MTU leads to IP fragmentation, which should be avoided

It's impossible to give an exact answer to how many routers fit into an area. There are ISPs out there running hundreds or even thousands in the same area. Doing so creates a very large failure domain though; a misbehaving router or link will cause all routers in the area to run SPF. To create smaller failure domains, areas could be used; on the other hand, MPLS does not play well with areas… So it depends… That is also why we see technologies like BGP-LS, where you can have IGP islands glued together by BGP.

How many ABRs in an Area?

How many ABRs are suitable in an area? The ABR is very critical in OSPF due to the distance vector behavior between areas; traffic must pass through the ABR. Having one ABR may not be enough, but adding too many adds complexity, increases the flooding of LSAs and grows the LSDB.

ABR1

  • More ABRs create more Type 3 LSA replication within the backbone and in other areas
  • This can create scalability issues in large scale routing
  • 10 prefixes in area 0 and 10 prefixes in area 1 would generate 60 summary LSAs with just 3 ABRs
  • Increasing the number of areas or the number of ABRs would worsen the situation

How Many Areas per ABR?

Based on what we learned above, increasing the number of areas on an ABR quickly adds up to a lot of LSAs.

ABR2

This ABR is in four areas, if every area contains 10 prefixes, the ABR has to generate 120 Type 3 summary LSAs in total.

  • More areas per ABR puts more burden on the ABR
  • More type 3 LSAs will be generated by the ABR
  • Increasing the number of areas will worsen the situation

Considerations for Intra-Area Routing Scalability

To build a stable and scalable intra-area network, take the following parameters into consideration:

  • Physical link flaps can cause instability in OSPF
    – Consider using IP event dampening

  • Avoid having physical links in OSPF through the use of prefix suppression
  • BGP can be used to carry transit links for monitoring purpose

Considerations for Inter-Area Routing Scalability

  • Filter physical links outside the area through type 3 filtering feature
  • Every area should only carry loopback addresses for all routers
  • NMS station may keep track of physical links if needed
  • These can be redistributed into BGP

OSPF Border Connections

OSPF always prefers intra-area paths to inter-area paths, regardless of metric. This may cause suboptimal routing under certain conditions.

Border1

  • Assume the link between D and E is in area 0
  • If the link between D and F fails, traffic will follow the intra area path D -> G, G -> E and E -> F

This could be solved by adding an extra interface/subinterface between D and E in area 1. This would increase the number of LSAs though…

OSPF Hub and Spoke

  • Make spoke areas as stubby as possible (a configuration sketch follows this list)
    – If possible, make the area totally stubby
    – If redistribution is needed, make the area totally not-so-stubby

  • Be aware of reachability issues; make sure the hub router becomes the DR and use a network type of Point to Multipoint (P2MP) if needed
    – P2MP has a smaller DB size compared to Point to Point (P2P)
    – P2P will use more address space and increase the DB size compared to P2MP, but it may be beneficial for other reasons when trying to achieve fast convergence

  • If the number of spokes is small, the hub and spokes can be placed within an area such as the backbone
  • If the number of spokes is large, make the hub an ABR and split off the spokes in area(s)
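
The stub options above as a sketch in IOS syntax (the area number is an assumption; the stub flag must match on all routers in the area):

    ! Hub ABR: make area 10 totally stubby
    router ospf 1
     area 10 stub no-summary
    !
    ! Spoke routers in area 10
    router ospf 1
     area 10 stub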

Summary

OSPF is a link state protocol that can scale to large networks if the network is designed according to the characteristics of OSPF. This post described design considerations and features such as prefix suppression that will help in scaling OSPF. For a deeper look at OSPF design, go through BRKRST-2337, available at Cisco Live 365.
