Archive for the ‘CCDE’ Category

More PIM-BiDir Considerations

August 12, 2015

Introduction

From my last post on PIM BiDir I got some great comments from my friend Peter Palúch. I still had some concepts that weren’t totally clear to me, and I don’t like to leave unfinished business. There is also a lack of resources properly explaining the behavior of PIM BiDir. For those reasons I would like to clarify some concepts and write some more about the potential gains of PIM BiDir. First we must be clear on the terminology used in PIM BiDir.

Terminology

Rendezvous Point Address (RPA) – The RPA is an address that is used as the root of the distribution tree for a range of multicast groups. This address must be routable in the PIM domain but does not have to reside on a physical interface or device.

Rendezvous Point Link (RPL) – It is the physical link to which the RPA belongs. The RPL is the only link where DF election does not take place. The RFC also says “In BIDIR-PIM, all multicast traffic to groups mapping to a specific RPA is forwarded on the RPL of that RPA.” In some scenarios where the RPA is virtual, there may not be an RPL though.

Upstream – Traffic towards the root (RPA) of the tree. This is the direction used by packets traveling from source(s) to the RPL.

Downstream – Traffic going away from the root. The direction from which packets travel from the RPL to the receivers.

Designated Forwarder (DF) – A single DF exists for every RPA on a link, whether point-to-point or multipoint. The only exception, as noted above, is the RPL. The DF is the router with the best metric to the RPA. The DF is responsible for forwarding downstream traffic onto its link and forwarding upstream traffic from its link towards the RPL. The DF on a link is also responsible for processing Join messages from downstream routers on the link and ensuring packets are forwarded to local receivers as discovered by IGMP or MLD.

RPF Interface – The RPF interface is determined by looking up the RPA in the Multicast Routing Information Base (MRIB). The RPF information then determines which interface of the router would be used to send packets towards the RPL of the group.

RPF Neighbor – The next node on the shortest path towards the RPA.

Sources in PIM BiDir

One of the confusing parts in PIM BiDir is how traffic travels from the source(s) to the RPA. There is no (S,G) and no PIM Register messages in PIM BiDir, so how is this handled?

When a source starts sending traffic, it will send it towards the RPA regardless of whether there are any receivers or not. The DF on the segment is responsible for sending the traffic upstream towards the RPA. The packet then travels through the PIM domain until it reaches the root of the tree (RPA). Some articles on PIM BiDir mention that there is no RPF check. This is not entirely true: RPF is used to find the right interface towards the RPA, but it is not used to ensure loop-free forwarding; the DF mechanism handles that instead.

Bitbucket

The picture shows a source sending traffic to a multicast group while there are no interested receivers yet. Traffic from the source travels towards the RPA, which is not a physical device but only exists virtually on a shared segment with the routers R1, R2 and R3. These routers are connected to the RPL, and on the RPL there is no DF election. This means that they are free to forward the packets; however, there are no receivers yet and hence no interfaces in the Outgoing Interface List (OIL). The traffic will simply get dropped in the bitbucket. This is a waste of bandwidth until at least one receiver joins the shared tree.

Considerations for PIM BiDir

Since there is no way to control what the multicast sources are sending, what we are giving up in exchange for minimal state in the PIM domain is bandwidth. It is not very likely, though, that a many-to-many multicast application will have no receivers at all, so this may be an acceptable sacrifice.

Due to keeping minimal state, PIM BiDir will use less memory compared to PIM ASM or PIM SSM. It will also use less CPU since there are fewer PIM messages to be generated, received and processed. Is this a consideration on modern platforms? It might be, it might not be. What is known, though, is that it is a less complex protocol than PIM ASM because it does not have a PIM Register process. Because of this, the RP, which does not even have to be a real router, can’t get overwhelmed by unicast PIM Register messages. This also provides an easier mechanism for RP redundancy compared to PIM ASM, which requires Anycast RP and MSDP to provide the same.

PIM BiDir Not Working in IOSv and CSR1000v

While writing this post I needed to run some tests, so I booted up VIRL and tested on IOSv but could not get PIM BiDir to work properly. I then tested with CSR1000v, both within VIRL and directly on an ESXi server, with the same results. These images were quite new and it seems something is not working properly in them. When the routers were running in BiDir mode, they would not process the multicast and forward it on. A fair warning that if you try it, you may run into similar results; please share if you discover something interesting.
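For reference, a minimal IOS-style sketch of the kind of configuration used for these tests. This is a sketch only: the interface name is a placeholder and 192.0.2.1 is the RPA used in the walkthrough below.

ip multicast-routing
! Enable BiDir mode processing, must be configured on all routers in the domain
ip pim bidir-enable
! Flag the static RP (the RPA) as BiDir, here for all groups
ip pim rp-address 192.0.2.1 bidir
!
interface GigabitEthernet0/1
 ip pim sparse-mode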

Putting it All Together

To get an understanding of the whole process, let us walk through all of the steps from when the source starts sending until the receiver starts receiving the traffic.

The source starts sending traffic on the local segment to R4. R4 is the DF on the link since it’s the only router present.

BiDir1

R4 does an RPF check for 192.0.2.1, which is the RPA. Through this process it finds the upstream interface and starts forwarding traffic towards the RPA, where R1 is the RPF neighbor and the next router on the path.

BiDir2

R1 forwards traffic towards the RPA on its upstream interface, but there are no interested receivers yet, so the traffic will get dropped in the bitbucket. Please note that both R2 and R3 do receive this traffic, but since there is no interface in the OIL, the traffic simply gets dropped.

BiDir3

The PC then starts to generate an IGMP Report and sends it towards R5.

BiDir4

R5 is the DF on the segment, which also means that it is the Designated Router (DR). It will generate a PIM (*,G) Join and send it towards the RPA on its upstream interface, where R3 is the RPF neighbor. Only the DF, and hence the DR, on a link may act on the IGMP Report.

BiDir5

R3 receives the PIM Join from R5, and since traffic is already being sent out by R1 on the RPL, R3 is allowed to start forwarding the traffic. Remember that there is no DF elected on the RPL. We now have end-to-end multicast flowing.

BiDir6

Conclusion

The main goal of this post was to show how the RP can be a virtual device on a shared segment. This means that redundancy can be designed into the RP role without the complex mechanisms used in PIM ASM. I also wanted to clarify some concepts around forwarding and terminology, since there are quite a few posts out there that are slightly wrong or not using the correct terminology.


Many to Many Multicast – PIM BiDir

August 9, 2015

Introduction

This post will describe PIM BiDir, why it is needed, and the design considerations for using it. The post is focused on technology overview and design; only a few short configuration sketches are included for illustration.

Multicast Applications

Multicast is a technology that is mainly used for one-to-many and many-to-many applications. The following are examples of applications that use or can benefit from using multicast.

One-to-many

One-to-many applications have a single sender and multiple receivers. These are examples of applications in the one-to-many model.

Scheduled audio/video: IP-TV, radio, lectures

Push media: News headlines, weather updates, sports scores

File distributing and caching: Web site content or any file-based updates sent to distributed end-user or replicating/caching sites

Announcements: Network time, multicast session schedules

Monitoring: Stock prices, security system or other real-time monitoring applications

Many-to-many

Many-to-many applications have many senders and many receivers. One-to-many applications are unidirectional and many-to-many applications are bidirectional.

Multimedia conferencing: Audio/video and whiteboard is the classic conference application

Synchronized resources: Shared distributed databases of any type

Distance learning: One-to-many lecture but with “upstream” capability where receivers can question the lecturer

Multi-player games: Many multi-player games are distributed simulations and also have chat group capabilities.

Overview of PIM

PIM has different implementations to be able to handle the above applications. There are mainly three implementations of PIM: PIM Any Source Multicast (ASM), PIM Source Specific Multicast (SSM) and PIM Bidirectional (BiDir).

PIM ASM

PIM ASM was the first implementation and is well suited for one-to-many applications. ASM means that traffic from any source to a group will be delivered to the receiver(s). PIM ASM uses the concept of a Rendezvous Point Tree (RPT) and a Shortest Path Tree (SPT). The RPT is a tree built from the receiver to a Rendezvous Point (RP). The tree from a multicast source to a receiver is called the SPT. Before the receiver can learn the source and build the SPT, the RP will have sent a PIM Join towards the source to build the SPT between the source and the RP. When looking in the mroute table, RPT state will be shown as (*,G) and SPT state will be shown as (S,G).

PIM1

The responsibilities of the RP are:

  • Receive PIM Register messages from the First Hop Router (FHR) and send Register Stop
  • Join the SPT and the RPT so the receivers get traffic and find out the source of the multicast

Initially traffic flows through the RP; there is a more efficient path, though. When the Last Hop Router (LHR) starts receiving the multicast, it will switch over to the SPT. The SPT will be a more optimal path and (likely) introduce lower delay between the source and the receiver.

PIM2

PIM ASM can support both one-to-many and many-to-many applications since it can use both the SPT and the RPT. To prevent the LHR from switching to the SPT, the ip pim spt-threshold command can be used (see the sketch after the list below). It can either be set to switch over at a certain traffic rate (kbps) or be set to infinity to always stay on the RPT. This can be combined with an ACL so that certain groups always stay on the RPT while others switch over. PIM ASM can therefore use the SPT for some groups and the RPT for other groups. There are still drawbacks to PIM ASM, a few of which are mentioned here:

  • Complex protocol state with Register messages
  • Redundancy requires the use of MSDP
  • Any source can send, which opens an attack vector for DoS and for traffic from spoofed sources
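To illustrate the spt-threshold command mentioned above, a short IOS sketch. The group range and ACL number are examples only:

! Keep groups in 239.1.1.0/24 on the RPT forever, let everything else
! switch over to the SPT immediately (the default behavior)
access-list 10 permit 239.1.1.0 0.0.0.255
ip pim spt-threshold infinity group-list 10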

PIM SSM

PIM SSM was created to work better with one-to-many flows compared to PIM ASM. In PIM SSM, there is no complex handling of state and there is only an SPT, no RPT. That also means that there is no need for an RP. PIM SSM is much easier to set up and use, but it does require clients to support IGMPv3 so that the IGMP Report can contain which source the receiver wants to receive the traffic from. Since there is no RP, there has to be some way for the receiver to know which sources send to which groups; this has to be handled by some form of Out Of Band (OOB) mechanism. The most common use for SSM is IP-TV, where the Set Top Box (STB) receives a list of sources and groups by contacting a server.

The drawback of PIM SSM is that (S,G) state is created, requiring more memory. Depending on the number of sources, this may or may not be a factor.
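A minimal sketch of enabling PIM SSM on IOS, assuming the default 232.0.0.0/8 SSM range and IGMPv3-capable receivers behind the interface:

ip multicast-routing
! Use the IANA default SSM range 232.0.0.0/8
ip pim ssm default
!
interface GigabitEthernet0/1
 ip pim sparse-mode
 ip igmp version 3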

PIM BiDir

Bidirectional PIM was created to work better with many-to-many applications. PIM BiDir uses only an RPT and no SPT, which means that there has to be an RP. With bidirectional PIM, however, the RP does not perform any of the functions of a PIM ASM RP, such as sending Register Stop messages or joining the SPT. Remember, in PIM BiDir there is no SPT. The RP in PIM BiDir does not have to be a physical device since it is not performing any control plane functions. It is simply a way of forwarding traffic the right way; think of it as a vector. The RP can be a physical device, and in that case it is a normal RP, just without the responsibilities of an RP as we know it from PIM ASM. When configuring PIM BiDir with redundant RPs, the RP is sometimes called a Phantom RP, because it does not have to reside on a physical device.

PIM BiDir is often used in “hoot n holler” and financial applications. PIM BiDir and PIM SSM are at different ends of the spectrum, while PIM ASM can serve both types of applications.

PIM uses the concept of Reverse Path Forwarding (RPF) to ensure loop-free forwarding. RPF ensures that traffic comes in on the interface that would be used to send traffic out towards the source. PIM BiDir can send traffic both up and down the RPT. This is not normally supported with plain RPF; to support it, PIM BiDir uses a Designated Forwarder (DF) on each segment, even point-to-point segments. The main responsibility of the DF is to forward traffic upstream towards the RP. The DF is elected based on the metric towards the RP, essentially building a tree along the best path without having to install any (S,G) state. RPF is still used to find the appropriate path towards the Rendezvous Point Link (RPL), but it is the DF mechanism that ensures loop-free forwarding.

RP Considerations

In PIM BiDir there is no MSDP; since PIM BiDir does not use (S,G) state, this is expected. To provide a redundant RP in PIM BiDir, a Phantom RP is used. The Phantom RP is a virtual RP which is not assigned to a physical device; it is often implemented by having two routers use a loopback with different subnet mask lengths.

PIM3

Routers are assigned the RP address of 192.0.2.1, which is the Phantom RP. The actual routers that the traffic will flow through have been assigned 192.0.2.2 and 192.0.2.3, but with different net mask lengths. Normal best path rules will then forward traffic towards the longest prefix match, which will be RP1 when it is available and RP2 when RP1 is not available. It is important not to configure the RP address as a physical interface address, since this would break the redundancy. If a router was configured with the real address, it would not forward the traffic since the traffic would be destined for one of its own addresses.
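A sketch of what the Phantom RP in the picture could look like. RP1 wins the longest prefix match with its /30, and RP2 takes over with its /29 if RP1 disappears. Note that the loopbacks must be advertised with their real mask, hence the ip ospf network point-to-point command if OSPF is the IGP:

! RP1
interface Loopback0
 ip address 192.0.2.2 255.255.255.252   ! /30 that covers 192.0.2.1
 ip pim sparse-mode
 ip ospf network point-to-point
!
! RP2
interface Loopback0
 ip address 192.0.2.3 255.255.255.248   ! /29 that also covers 192.0.2.1
 ip pim sparse-mode
 ip ospf network point-to-point
!
! On all routers, point to the phantom address as the BiDir RP
ip pim rp-address 192.0.2.1 bidir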

Since the RP is so critical, redundancy must be provided. All traffic will pass through the RP, which means that certain links in the network may have to carry a lot of the traffic. For this reason it can be necessary to have several RPs, each acting as the RP for different multicast groups. The placement of the RP also becomes very important since traffic must flow through it.

PIM BiDir Considerations

PIM BiDir uses the DF mechanism, and for the election to succeed, all the PIM routers on the segment must support PIM BiDir; otherwise the DF election will fail and PIM BiDir will not be supported on the segment. It is possible to have non-BiDir routers on a segment if a PIM neighbor filter is implemented so that PIM adjacencies are not formed with certain routers. That way PIM BiDir can be gradually implemented into the network.

Closing Thoughts

PIM ASM supports all multicast models but at the cost of complexity. One could say that it’s a jack of all trades that does not excel at anything. PIM SSM is less complex and the best choice for one-to-many applications if the receivers support IGMPv3. PIM BiDir is best suited for many-to-many applications and keeps the least state of all the PIM implementations. Every PIM implementation has its use case, and as an architect/designer it’s your job to know all the models and pick the best one based on business requirements.


Interview with CCDE/CCAr Program Manager Elaine Lopes

August 2, 2015

I am currently studying for the CCDE exam. Elaine Lopes is the program manager for the CCDE and CCAr certification. I’ve had the pleasure of interacting with her online and meeting her at Cisco Live as well. The CCDE is a great certification and I wanted you to get some insight into the program and ask about the future of the CCDE. A big thanks to Elaine and Cisco for agreeing to do the interview.

Daniel: Hi Elaine, and welcome. It was nice seeing you at Cisco Live! Can you please give a brief introduction of yourself to the readers?

Elaine: Hi, it was nice to see you, too! My name is Elaine Lopes and I’m the CCDE and CCAr Certification Program Manager. I’ve been with Cisco’s Learning@Cisco team since 1999, and I’m passionate about how people’s lives can change for the better through education and certification.

Daniel: Elaine, why did Cisco create an expert level design program? What kind of people should be looking at the CCDE?

Elaine: Cisco has very well established expert-level certifications for network engineers in various fields which assess configuration, implementation, troubleshooting and operations skills; however, these certifications were never aimed to assess design skills. The root cause of many network failures is poor network design and the CCDE helps to fill this gap. The certification was created to assess a candidate’s skills in real-life network design. The candidate should mainly be able to meet business requirements through their network designs as well as understand design principles such as network resiliency, scalability and manageability! CCDE focuses on design by making technology decisions and justifying the choices made. Since CCDE is meant to assess design skills, it targets infrastructure network designers.

Daniel: Are there any prerequisites before taking the CCDE?

Elaine: There are no pre-requisites for the CCDE certification, although it is recommended that candidates have 7+ years of experience on network design in diverse environments.

Daniel: What kind of experience should a candidate have to be a good fit for the CCDE? What is the technology range that needs to be covered such as RS, SP, datacenter, security?

Elaine: CCDE is a role-based certification, and therefore it is desirable that candidates have experience (breadth and depth) in large-scale network designs, as they will be tested on making design decisions within constraints. CCDE focuses mainly on Layer 3 control plane, Layer 2 control plane and network virtualization technologies, but also assesses QoS, security, network management with a little of wireless, optical and storage technologies.

Daniel: Can you give us a short description of the exam process and at which locations the exam is available?

Elaine: The CCDE certification is divided into two steps. The first step is the CCDE written exam, which focuses on design aspects of the various technologies described above, and can be taken at any Pearson VUE testing center at any time. Once CCDE candidates pass the written exam, they then need to pass the CCDE practical exam, which is made of four different scenarios where technologies and design concepts are interconnected. The CCDE practical exam is tridimensional: the same technologies tested on the CCDE written exam plus the different job tasks (merge/divest, add technology, replace technology, scaling and design failure), and the task domains (analyze requirements, design, plan for the design deployment, and validate and optimize network designs). The CCDE practical exam is administered four days a year at any of the 275 Pearson Professional Centers (PPCs) worldwide.

Daniel: You are also responsible for the CCAr program. What is the difference between design and architecture? What kind of candidates should be looking at this exam?

Elaine: CCArs collaborate with senior leadership to create a vision for the network, and their outputs are the business and technical requirements which will be input for CCDEs to create a network design that meets these requirements. The pre-requisite for CCAr is to be a CCDE in good standing, with the target audience being the infrastructure architects who navigate between the technical and business worlds.

Daniel: What kind of study resources are available for the CCDE? I know you have been working hard on providing guidance in the written blueprint, what else is coming?

Elaine: The biggest challenge for CCDE candidates seems to be how to get started, so we recently launched the Streamlined Preparation Resources. This site offers a study methodology and links to many recordings with information about the CCDE program. It’s mainly a list of preparation resources that can be personalized for one’s own needs and offers diverse resource types in a very prescriptive way for CCDE candidates to prepare for the CCDE written exam, but it can also be helpful for the practical exam. Since the CCDE practical exam is situation-based, the team decided to provide materials to make candidates think like network designers. The materials are not mapped 1:1 to the blueprint, but our long-term objective is to release the materials in bits and pieces as they become available.

Daniel: Elaine, tell us a bit about the upcoming CCDE study guide. Why was this book written and how should it be used to prepare for the CCDE?

Elaine: Marwan [Al-Shawi] approached me saying he wanted to author a CCDE book, so we had some conversations and exchanged many emails which helped shape the book outline, aiming to be an “all-in-one” study guide for the CCDE practical exam. He then went through the whole publishing process with Cisco Press. There are great technical reviewers involved in it, and the book is to be released soon – I can’t wait to get my copy!

Daniel: After my last blog post on the CCIE program, I received some comments where people questioned the integrity of the CCIE exam. How do you work with the integrity of the CCDE?

Elaine: Integrity has always been top of mind when considering the delivery of the CCDE practical exam: it’s Windows-based and administered at the secure PPCs. The exam changes between administrations, and the nature of the scenarios makes it hard to guess the responses.

Daniel: I know you have designed the CCDE to be as timeless and generic as possible while still covering the relevant technologies. How will the exam be affected of new technologies and forwarding paradigms, such as SDN?

Elaine: True. I’d expect the CCIEs and CCDEs out there to be at the forefront of adoption of these new technologies in the field, and we’re already making plans for incorporating these new technologies into the CCIE tracks. CCDE will be no different, but I don’t have details yet I can share.

Daniel: Creating exams is very difficult and people often have opinions on the material being tested. It’s not well known that you can comment on the exam while taking it. Isn’t it true that comments are one of your sources of feedback on the quality of the exam?

Elaine: I heavily rely on statistical analysis to understand both item and exam performance before making any adjustments to the exam. To get a more holistic view, I also read the comments candidates make on items while taking the exam. These comments sometimes provide good insight on how to fix low-performing items.

Daniel: Elaine, how can people give feedback on the program outside of comments while taking the exam?

Elaine: Just send me an email elopes@cisco.com. To be informed on what’s going on in the CCDE world, you can connect with me on LinkedIn (Elaine Lopes) and/or Twitter (@elopes01).

Daniel: There is a Subject Matter Expert (SME) recruitment program for certifications within Cisco. Do you have SMEs for the CCDE and how can any CCDE’s out there contact Cisco if they want to be part of the SME program?

Elaine: SMEs are critical to assure the exams are relevant, so yes, I do have several CCDE certified SMEs participating on the various phases of the exam, from development teams to authoring/reviewing/editing items, to working on the preparation resources, to the maintenance of the existing exams, etc. If you are CCDE-certified and want to participate, join the program and I’ll contact you for the next opportunity to participate.

Daniel: Thank you so much for your time, Elaine! I hope we’ll meet soon again. Do you have any final words and where do you see the CCDE program going in the future?

Elaine: I wanted to give a hint to CCDE candidates taking the practical exam: take the time to read and connect with each scenario and don’t make decisions or assumptions outside the context of the scenario. If you read a question and the answer is not glaring, go back to the scenario materials! CCDE design principles don’t change, so when the time comes I see “sprinkling” design aspects of new technologies in the CCDE exams. Hope to see you soon!! It’s been a pleasure to participate, thank you for inviting me!


IPv6 Multicast

July 14, 2015

These are my notes for IPv6 multicast for the CCDE exam.

Overview

  • Prefix FF00::/8 reserved for multicast
  • Multicast Listener Discovery (MLD) replaces IGMP
    • MLD is part of ICMPv6
    • MLDv1 equivalent to IGMPv2
    • MLDv2 equivalent to IGMPv3
  • ASM, SSM and Bidir supported
  • PIM identified by IPv6 next header 103
  • BSR and static RP supported
  • No support for MSDP
    • Anycast RP supported through PIM, defined in RFC 4610
  • Any Source Multicast (ASM)
    • PIM-SM, PIM-BiDir
    • Default for generic multicast and unicast prefix-based multicast
    • Starts with FF3x::/12
  • Source Specific Multicast (SSM)
    • PIM-SSM
    • FF3X::/32 is allocated for SSM by IANA
    • Currently the prefix and plen fields are zero, so FF3X::/96 is usable for SSM
  • Embedded RP groups
    • PIM-SM, PIM-BIDir
    • Starts with FF70::/12

IPv6 Multicast Addressing

IPv6 multicast address format includes variable bits to define what type of address it is and what the scope is of the multicast group. The scope can be:

1 – Node

2 – Link

3 – Subnet

4 – Admin

5 – Site

8 – Organization

E – Global

The flags define if embedded RP is used, if the address is based on a unicast prefix, and if the address is IANA assigned or not (temporary). The unicast prefix-based IPv6 multicast address allows an organization to create globally unique IPv6 multicast groups based on their unicast prefixes. This is similar to GLOP addressing in IPv4 but does not require an Autonomous System Number (ASN). IPv6 also allows for embedding the RP address into the multicast address itself. This provides a static RP-to-multicast-group mapping mechanism and can be used to provide interdomain IPv6 multicast, as there is no MSDP in IPv6.

When using Ethernet, the destination MAC address of the frame will start with 33:33 and the remaining 32 bits will consist of the low order 32 bits of the IPv6 multicast address.
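As a worked example using documentation prefixes: an organization holding 2001:DB8::/32 can build unicast prefix-based groups of the form FF3E:20:2001:DB8::/96, where 3 in the flags field marks the address as prefix-based, E is global scope, 0x20 says the embedded prefix is 32 bits long, and 2001:DB8 is the prefix itself. A group such as FF3E:20:2001:DB8::1234 would then map on Ethernet to the destination MAC 33:33:00:00:12:34, taken from the low order 32 bits of the group address.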

Well Known Multicast Addresses

FF02::1 – All Nodes

FF02::2 – All Routers

FF02::5 – OSPF All Routers

FF02::6 – OSPF DR Routers

FF02::A – EIGRP Routers

FF02::D – PIM Routers

Neighbor Solicitation and DAD

IPv6 also uses multicast to replace ARP through the neighbor solicitation process. For this, the solicited-node multicast address is used: the prefix is FF02::1:FF00:0/104 and the last 24 bits are taken from the lower 24 bits of the IPv6 unicast address. If Host A needs to get the MAC address of Host B, Host A will send the NS to the solicited-node multicast address of Host B. IPv6 also does Duplicate Address Detection (DAD) to check that no one else is using the same IPv6 address, and this also uses the solicited-node multicast address. If Host A is checking the uniqueness of its IPv6 address, the message will be sent to the solicited-node multicast address of Host A.
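As a worked example with a documentation address: a host with the address 2001:DB8::1234:5678 listens on the solicited-node group FF02::1:FF34:5678, which is the FF02::1:FF00:0/104 prefix plus the low 24 bits 34:5678. An NS or DAD probe for that address is therefore only received by hosts sharing those last 24 bits, and on Ethernet it maps to the MAC address 33:33:FF:34:56:78.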

Multicast Listener Discovery (MLD)

  • MLDv1 messages
    • Listener Query
    • Listener Report
    • Listener Done
  • MLDv2 messages
    • Listener Query
    • Listener Report

MLDv2 does not use a specific Done message (the equivalent of the Leave message in IGMP). Instead, a host will stop sending Reports or send a Report which excludes the source it was previously interested in.

Protocol Independent Multicast (PIM) for IPv6

  • PIM-SM (RP is required)
    • Many to many applications (multiple sources, single group)
    • Uses shared tree initially but may switch to source tree
  • PIM-BiDir (RP is required)
    • Bidirectional many to many applications (hosts can be sources and receivers)
    • Only uses shared tree, less state
  • PIM-SSM
    • One to many applications (single source, single group)
    • Always uses source tree
    • Source must be learnt through out of band mechanism

Anycast RP

IPv6 does not have support for MSDP, but it can support anycast RP through PIM, which can implement this feature. All the RPs doing anycast will use the same IPv6 address, but they also require a unique IPv6 address that will be used to relay the PIM Register messages coming from the multicast sources. An RP-set is defined with the RPs that should be included in the Anycast RP, and the PIM Register messages will be relayed to all the RPs defined in the RP-set. If the PIM Register message comes from an IPv6 address that is defined in the RP-set, the Register will not be sent along, which is a form of split horizon to prevent looping of control plane messages. When an RP relays a PIM Register, this is done from a unique IPv6 address, which is similar to how MSDP works.

Sources will find the RP based on the unicast metric, as is normally done when implementing anycast RP. If an RP goes offline, messages will be routed to the next RP, which now has the best metric.

Interdomain Multicast

These are my thoughts on interdomain multicast, since there is no MSDP for IPv6. Embedded RP can be used, which means that the other organization needs to use your RP. Define an RP prefix that is used for interdomain multicast only, or use a prefix that is used for internal usage but implement a data plane filter to filter out requests for groups that should not cross organizational boundaries. This could also be done by filtering on the scope of the multicast address.

Another option would be to run anycast RP with the other organization, but this could get a lot messier unless an RP is defined for only a set of groups that are used for interdomain multicast. Each side would then have an RP defined for the groups, and PIM Register messages would be relayed. The drawback is that both sides could have sources, while the policy may be that only one side should have sources and the other side only listeners. This would be difficult to implement in a data plane filter. It might be possible to solve in the control plane by defining which sources the RP will allow to Register.

If using SSM, there is no need for an RP, which makes it easier to implement interdomain multicast. There is always the consideration of joining two PIM domains, but this could be solved by using static joins at the edge and implementing data plane filtering. Interdomain multicast is not implemented a lot, and it requires some thought to avoid merging into one failure domain and one administrative domain.

Final Thoughts

Multicast is used a lot in IPv6; it is more tightly integrated into the protocol than in IPv4, and it’s there whether you see it or not. The addressing, flags and scope can be a bit confusing at first, but they allow for using multicast in a better way in IPv6 than in IPv4.


Service Provider IPv6 Deployment

June 29, 2015

These are my study notes regarding IPv6 deployment in SP networks in preparation for the CCDE exam.

Drivers for implementing IPv6

  • External drivers
    • SP customers that need access to IPv6 resources
    • SP customers that need to interconnect their IPv6 sites
    • SP customers that need to interface with their own customers over IPv6
  • Internal drivers
    • Handle problems that may be hard to fix with IPv4 such as large number of devices (cell phones, IP cameras, sensors etc)
    • Public IPv4 address exhaustion
    • Private IPv4 address exhaustion
  • Strategic drivers
    • Long term expansion plans and service offerings
    • Preparing for new services and gaining competitive advantage

Infrastructure

  • SP Core Infrastructure
    • Native IPv4 core
    • L2TPv3 for VPNs
    • MPLS core
    • MPLS VPNs

My reflection is that most cores would be MPLS enabled, however there are projects such as Terastream in Deutsche Telekom where the entire core is IPv6 enabled and L2TPv3 is used in place of MPLS.

  • IPv6 in Native IPv4 Environments
    • Tunnel v6 in v4
    • Native v6 with dedicated resources
    • Dual stack

The easiest way to get going with v6 was to tunnel it over v4. The next logical step was to enable v6 but on separate interfaces to not disturb the “real” traffic and to be able to experiment with the protocol. The end goal is dual stack, at least in a non MPLS enabled network.

  • IPv6 in MPLS environments
    • 6PE
    • 6VPE

6PE is a technology to run IPv6 over an IPv4 enabled MPLS network. 6VPE does the same but with VRFs.

  • Native IPv6 over Dedicated Data Link
    • Dedicated data links between core routers
    • Dedicated data links to IPv6 customers
    • Connection to an IPv6 IX
  • Dual stack
    • All P + PE routers capable of v4 + v6 transport
    • Either two IGPs or one IGP for both v4 + v6
    • Requires more memory due to two routing tables
    • IPv6 multicast natively supported
    • All IPv6 traffic is routed in global space (no MPLS)
    • Good for content distribution and global services (Internet)
  • 6PE
    • IPv6 global connectivity over an IPv4 MPLS core
    • Transition mechanism (debatable)
    • PEs are dual stacked and need 6PE configuration
    • IPv6 reachability exchanged via MPBGP over iBGP sessions
    • IPv6 packets transported from 6PE to 6PE inside MPLS
    • The next-hop is an IPv4 mapped IPv6 address such as ::FFFF:1.1.1.1
    • BGP label assigned for the IPv6 prefix
    • Bottom label used due to P routers not v6 capable and for load sharing
    • neighbor send-label is configured under BGP address-family ipv6

6PE is viewed as a transition mechanism, but this is arguable; if you transport IPv4 over MPLS, you may want to do the same with IPv6 for consistency. Running 6PE does mean that there is fate sharing between v4 and v6, though, which could mean that an outage affects both protocols. This could be avoided by running MPLS for IPv4 but v6 natively.
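A minimal 6PE sketch on a PE router (IOS syntax; the AS number and neighbor address are placeholders). The iBGP session runs over IPv4 and a label is advertised for each IPv6 prefix:

router bgp 64500
 neighbor 192.0.2.11 remote-as 64500
 neighbor 192.0.2.11 update-source Loopback0
 !
 address-family ipv6
  neighbor 192.0.2.11 activate
  ! Exchange a BGP label with the IPv6 prefixes (6PE)
  neighbor 192.0.2.11 send-label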

  • Core network (P routers) left untouched
  • IPv6 traffic inherits MPLS benefits such as fast-reroute and TE
  • Incremental deployment possible (upgrade PE routers first)
  • Each site can be v4-only, v4-VPN-only, v4+v6, v4-VPN+v6 and so on
  • Scalability concerns due to separate RIB and FIB required per customer
  • Mostly suitable for SPs with a limited number of PEs
  • 6vPE
    • Equivalent of VPNv4 but for IPv6
    • Add VPNv6 address family under MPBGP
    • Send extended communities for the prefixes under the address family
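For 6vPE, a corresponding hedged sketch adds a VRF and the VPNv6 address family (the RD/RT values and names are examples):

vrf definition CUSTOMER-A
 rd 64500:1
 address-family ipv6
  route-target both 64500:1
!
router bgp 64500
 address-family vpnv6
  neighbor 192.0.2.11 activate
  ! Extended communities carry the route-targets
  neighbor 192.0.2.11 send-community extended
!
interface GigabitEthernet0/1
 vrf forwarding CUSTOMER-A
 ipv6 address 2001:DB8:0:1::1/64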

It is a common misconception about 6PE and 6vPE that traceroutes are not possible; that is not entirely true. A P router can generate ICMPv6 messages that will follow the LSP to the egress PE, and the ICMPv6 error message is then forwarded back to the originator of the traceroute.

  • Route reflectors for 6PE and 6vPE
    • Needed to scale BGP full mesh
    • Dedicated RRs or data path RRs
    • Either dedicated RR per AF or have multiple AFs per RR
    • 6PE-RR must support IPv6 + label functionality
    • 6vPE-RR must support IPv6 + label and extended communities functionality

PA vs PI

  • PA advantages
    • Aggregation towards upstreams
    • Minimizes Internet routing table size
  • PA disadvantages
    • Customer is “locked” with the SP
    • Renumbering can be painful
    • Multi-homing and TE problems

The main driver here is whether you are going to multihome or not. Renumbering is always painful, but at least less so with IPv6 due to being able to advertise multiple IPv6 prefixes through Router Advertisements (RA).

  • PI advantages
    • Customers are not “locked” to the SP
    • Multi homing is straight forward
  • PI disadvantages
    • Larger Internet routing table due to lack of efficient aggregation
    • Memory and CPU needs on BGP speakers

Infrastructure Addressing (LLA vs global)

What type of addresses should be deployed on infrastructure links?

  • Link Local Address FE80::/10
    • Non routeable address
    • Less attack surface
    • Smaller routing tables
    • Can converge faster due to smaller RIB/FIB
    • Less need for iACL at edge of network
    • Can’t ping links
    • Can’t traceroute links
    • May be more complex to manage with NMS
    • Use global address on loopback for ICMPv6 messages
    • Will not work with RSVP-TE tunnels
  • Global only 2000::/3 (current IANA prefix)
    • Globally routeable
    • Larger attack surface unless prefix suppression is used
    • Use uRPF and iACL at edge to protect your links
    • Easier to manage

It would be interesting to hear if you have seen any deployments with LLA-only infrastructure links. In theory it’s a nice idea, but it may corner you in some cases, preventing you from implementing other features that you wish to deploy in your network.

Use /126 or /127 on P2P links, which is the equivalent of /30 or /31 on IPv4 links. For loopbacks, use /128 prefixes. Always assign addresses from a range so that creating ACLs and iACLs becomes less tedious.
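A small sketch of what this could look like on IOS, using documentation prefixes and a /127 on the P2P link:

interface Loopback0
 ipv6 address 2001:DB8:0:FFFF::1/128
!
interface GigabitEthernet0/0
 description P2P link to neighbor P2
 ipv6 address 2001:DB8:0:12::1/127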

Using a prefix other than /64 on an interface will break the following features:

  • Neighbor Discovery (ND)
  • Secure Neighbor Discovery (SEND)
  • Privacy extensions
  • PIM-SM with embedded RP

This is of course for segments where there are end users.

Prefix Allocation Practices

  • Many SPs offer /48, /52, /56, /60 or /64 prefixes
  • Enterprise customers receive one /48 or more
  • Small business customers receive /52 or /56 prefix
  • Broadband customers may receive /56 or /60 via DHCP Prefix Delegation (DHCP-PD)

Debating prefix allocation practices is like debating religion, politics or your favourite OS. Whatever you choose, make sure that you can revise your practice as future services and needs arise.

Carrier Grade NAT (CGN)

  • Short term solution to IPv4 exhaustion without changing the Residential Gateway (RG) or SP infrastructure
  • Subscriber uses NAT44 and SP does CGN with NAT44
  • Multiplexes several customers onto the same public IPv4 address
  • CGN performance and capabilities should be analysed in the planning phase
  • May provide challenges in logging sessions
  • Long term solution is to deploy IPv6

I really don’t like CGN; it slows down the deployment of IPv6. It’s a tool like anything else, though, that may be used selectively if there is no other solution available.

IPv6 over L2TP Softwires

  • Dual stack IPv4/IPv6 on RG LAN side
  • PPPoE or IPv4oE terminated on v4-only BNG
  • L2TPv2 softwire between RG and IPv6-dedicated L2TP Network Server (LNS)
  • Stateful architecture on LNS
    • Offers dynamic control and granular accounting of IPv6 traffic
  • Limited investment needed and limited impact on existing infrastructure

I have never seen IPv6 deployed over softwires. What about you, readers?

6RD

  • Uses 6RD CE (Customer Edge) and 6RD BR (Border Relay)
  • Automatic prefix delegation on 6RD CE
  • Stateless and automatic IPv6 in IPv4 encap and decap functions on 6RD
  • Follows IPv4 routing
  • 6RD BRs are addressed with IPv4 anycast for load sharing and resiliency
  • Limited investment and impact on existing infrastructure
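To make the stateless mapping concrete (documentation prefixes, hypothetical subscriber): with a 6RD prefix of 2001:DB8::/32 and the full 32-bit IPv4 address embedded, a CE with the WAN address 192.0.2.1 (hex C0.00.02.01) derives the delegated prefix 2001:DB8:C000:201::/64. Reaching that prefix is then simply a matter of routing towards 192.0.2.1, which is why 6RD follows IPv4 routing.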

IPv4 via IPv6 Using DS-Lite with NAT44

  • Network has migrated to IPv6 but needs to provide IPv4 services
  • IPv4 packets are tunneled over IPv6
  • Introduces two components: B4 (Basic Bridging Broadband Element) and AFTR (Address Family Transition Router)
    • B4 typically sits in the RG
    • AFTR is located in the core infrastructure
  • Does not allow IPv4 and IPv6 hosts to talk to each other
  • AFTR device terminates the tunnel and decapsulates IPv4 packet
  • AFTR device performs NAT44 on customer private IP to public IP addresses
  • Increased MTU, be aware of fragmentation

Connecting IPv6-only with IPv4-only (AFT64)

  • Only applicable where IPv6-only hosts need to communicate with IPv4-only hosts
  • Stateful or stateless v6 to v4 translation
  • Includes NAT64 and DNS64

MAP (Mapping of Address and Port)

  • MAP-T Stateless 464 translation
  • MAP-E Stateless 464 encapsulation
  • Allows sharing of IPv4 address across an IPv6 network
    • Each shared IPv4 endpoint gets a unique TCP/UDP port range via “rules”
    • All or part of the IPv4 address can be derived from the IPv6 prefix
      • This allows for route summarization
    • Need to allocate TCP/UDP port ranges to each CPE
  • Stateless border relays in SP network
    • Can be implemented in hardware for superior performance
    • Can use anycast and have asymmetric routing
    • No single point of failure
  • Leverages IPv6 in the network
  • No CGN inside SP network
  • No need for logging or ALGs
  • Dependent on CPE router

NAT64

  • Stateful or stateless translation
  • Stateful
    • 1:N translation
    • “PAT”
    • TCP, UDP, ICMP
    • Shares IPv4 addresses
  • Stateless
    • 1:1 translation
    • “NAT”
    • Any protocol
    • No IPv4 address savings

DNS64 is often required in combination with NAT64 to send an AAAA response to the IPv6-only hosts in case the server only exists in the v4 world.
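A worked example with the well-known prefix 64:FF9B::/96: an IPv6-only host queries for a v4-only server whose A record is 192.0.2.5 (hex C0.00.02.05). DNS64 synthesizes the AAAA record 64:FF9B::C000:205, the host connects to that address, and the NAT64 device recognizes the prefix and translates the flow towards 192.0.2.5.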

464XLAT

  • Somewhere around 15% of apps break with native v6 or NAT64
  • Skype is one of these apps
  • 464XLAT can help with most of these applications
  • Handset does stateless 4 to 6 translation
  • Network does NAT64
  • Deployed by T-Mobile

Design Considerations for North/South Flows in the Data Center

May 28, 2015

Traditional data centers have been built by using standard switches and running Spanning Tree (STP). STP blocks redundant links and builds a loop-free tree rooted at the STP root. This kind of topology wastes a lot of links, which means that there is a decrease in bisectional bandwidth in the network. A traditional design may look like below, where the blocking links have been marked in red.

DC1-STP

If we then remove the blocked links, the tree topology becomes very clear and you can see that there is only a single path between the servers. This wastes a lot of bandwidth and does not provide enough bisectional bandwidth. Bisectional bandwidth is the bandwidth that is available from the left half of the network to the right half of the network.

DC2-STP

The traffic flow is highlighted below.

DC3-Bisectional

Technologies like FabricPath (FP) or TRILL can overcome these limitations by running ISIS and building loop-free topologies without blocking any links. They can also take advantage of Equal Cost Multipath (ECMP) to provide load sharing without doing any complex VLAN manipulations as with STP. A leaf and spine design is most commonly used to provide a high amount of bisectional bandwidth.

DC1-Multipath

Hot Standby Router Protocol (HSRP) has been around for a long time providing first hop redundancy in our networks. The drawback of HSRP is that there is only one active forwarder, so even if we run a layer 2 multipath network through FP, there will only be one active path for routed traffic flows.

DC-FP-1

The reason for this is that FP advertises its Switch ID (SID) and that the Virtual MAC (vMAC) will be available behind the FP switch that is the HSRP active device. Switched flows can still use all of the available bandwidth.

To overcome this, there is the possibility of running vPC+ between the switches and having them advertise an emulated SID, pretending to be one switch so that the vMAC will be available behind that SID.

DC-FP-2
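A hedged NX-OS sketch of the vPC+ part; the only additions compared to regular vPC are the emulated switch-id under the vPC domain and the peer-link being a FabricPath core port:

vpc domain 10
  ! Emulated SID that both peers advertise on behalf of the pair
  fabricpath switch-id 1000
!
interface port-channel1
  switchport mode fabricpath
  vpc peer-link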

There are some drawbacks to this, however. It requires that you run vPC+ in the spine layer, and you can still only have 2 active forwarders; if you have more spine devices, they will not be utilized for North/South flows. To overcome this there is a feature called Anycast HSRP.

DC-FP-3

Anycast HSRP works in a similar way by advertising a virtual SID, but it does not require links between the spines or vPC+. It also currently supports up to 4 active forwarders, which provides double the bandwidth compared to vPC+.

Modern data centers provide for a lot more bandwidth and bisectional bandwidth than previous designs, but you still need to consider how routed flows can utilize the links in your network. This post should give you some insights on what to consider in such a scenario.

A Quick Look at Cisco FabricPath

February 26, 2015

Cisco FabricPath is a proprietary protocol that uses ISIS to populate a “routing table” that is used for layer 2 forwarding.

Whether we like it or not, there is often a need for layer 2 in the Datacenter for the following reasons:

  • Some applications or protocols require layer 2 adjacency
  • It allows for virtual machine/workload mobility
  • Systems administrators are more familiar with switching than routing

A traditional network with layer 2 and Spanning Tree (STP) has a lot of limitations that makes it less than optimal for a Datacenter:

  • Local problems have a network-wide impact
  • The tree topology provides limited bandwidth
  • The tree topology also introduces suboptimal paths
  • MAC address tables don’t scale

In the traditional network, because STP is running, a tree topology is built. This works better for flows that are North to South, meaning that traffic passes from the Access layer, up to Distribution, to the Core and then down to Distribution and the Access layer again. This puts a lot of strain on Core interconnects and is not well suited for East-West traffic, which is the name for server-to-server traffic.

A traditional Datacenter design will look something like this:

DC1

If we want end-to-end L2, we could build a network like this:

DC2

What would be the implications of building such a network though?

  • Large failure domain
  • Unknown unicast and broadcast flooding through large parts of the network
  • A large number of STP instances needed unless using MST
  • Topology change will have a large impact on the network and may cause flooding
  • Large MAC address tables
  • Difficult to troubleshoot
  • A very brittle network

So let’s agree that we don’t want to build a network like this. What other options do we have if we still need layer 2? One of the options is Cisco FabricPath.

FabricPath provides the following benefits:

  • Reduction/elimination of STP
  • Better stability and convergence characteristics
  • Simplified configuration
  • Leverage parallel paths
  • Deterministic throughput and latency using typical designs
  • “VLAN anywhere” flexibility

The FabricPath control plane consists of the following elements:

  • Routing table – Uses ISIS to learn Switch IDs (SIDs) and build a routing table
  • Multidestination trees – Elects roots and builds multidestination trees
  • Mroute table – IGMP snooping learns group membership at the edge, Group Member LSPs (GM-LSPs) are flooded by ISIS into the fabric

Observe that LSPs have nothing to do with MPLS in this case, and that this is not MAC-based routing; routing is based on SIDs.

FabricPath ISIS learns the shortest path to each SID based on link metrics/path cost. Up to 16 equal-cost (ECMP) routes can be installed. Choosing a path is based on a hashing function using Src IP/Dst IP/L4/VLAN, which should be good for avoiding polarization.

FabricPath supports multidestination trees with the following capabilities:

  • Multidestination traffic is contained to a tree topology, a network-wide identifier (Ftag) is assigned to each tree
  • A root switch is elected for each multidestination tree
  • Multipathing is supported through multiple trees

Note that root here has nothing to do with STP, think of it in terms of multicast routing.

DC3

Multidestination trees do not dictate forwarding for unicast, only for multidestination packets.

The FabricPath data plane behaves according to the following forwarding rules:

  • MAC table – Hardware performs MAC lookup at CE/FabricPath edge only
  • Switch table – Hardware performs destination SID lookups to forward unicast frames to other switches
  • Multidestination table – A hashing function selects the tree, multidestination table identifies on which interfaces to flood based on selected tree

The Ftag used in FabricPath identifies which ISIS topology to use for unicast packets and for multidestination packets, which tree to use.

If a FabricPath switch belongs to a topology, all VLANs of that topology should be configured on that switch to avoid blackholing issues.

FabricPath supports 802.1p but can also match/set DSCP and match on other L2/L3/L4 information.

With FabricPath, edge switches only need to learn:

  • Locally connected host MACs
  • MACs with which those hosts are bidirectionally communicating

This reduces the MAC address table capacity requirements on Edge switches.
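Before looking at the designs, a minimal NX-OS sketch of turning FabricPath on; the switch-id, VLAN and interface are examples:

install feature-set fabricpath
feature-set fabricpath
!
fabricpath switch-id 11
!
vlan 10
  mode fabricpath
!
interface Ethernet1/1
  description FabricPath core port towards spine
  switchport mode fabricpath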

FabricPath Designs

There are different designs that can be used together with FabricPath. The first one is routing at the Aggregation layer.

DC4

The first design is the most classic one where STP has been replaced by FP in the Access layer and routing is used above the Aggregation layer.

This design has the following characteristics:

  • Evolution of current design practices
  • The Aggregation layer functions as FabricPath spine and L2/L3 boundary
    – FabricPath switching for East – West intra VLAN traffic
    – SVIs for East – West inter VLAN traffic
    – Routed uplinks for North – South routed flows

  • Access layer provides pure L2 functions
    – FabricPath core ports facing Aggregation layer
    – CE edge ports facing hosts
    – Optionally vPC+ can be used for active/active host connections

This design is the simplest option and is an extension of regular Access/Aggregation designs. It provides the following benefits:

  • Simplified configuration
  • Removal of STP
  • Traffic distribution over all uplinks without the use of vPC
  • Active/active gateways
  • “VLAN anywhere” at the Access layer
  • Topological flexibility
    – Direct-path forwarding option
    – Easily provision additional Access/Aggregation bandwidth
    – Easily deploy L4-L7 services
    – Can use vPC+ towards legacy Access switches

There is also the centralized routing design which looks like the following:

DC5

Centralized routing has the following characteristics:

  • Traditional Aggregation layer becomes pure FabricPath spine
    – Provides uniform any-to-any connectivity between leaf switches
    – In simplest case, only FabricPath switching occurs in spine
    – Optionally, some CE edge ports exist to provide external router connections

  • FabricPath leaf switches, connecting to spine, have specific “personality”
    – Most of the leaf switches will provide server connectivity, like traditional access switches in “Routing at Aggregation” designs
    – Two or more leaf switches provide L2/L3 boundary, inter-VLAN routing and North-South routing
    – Other or same leaf switches may provide L4-L7 services

  • Decouples L2/L3 boundary and L4-L7 services provisioning from Spine
    – Simplifies Spine design

The different traffic flows in this design looks like the following:

DC6

Another design is the multi-pod design which can look like the following:

DC7

The multi-pod design has the following characteristics:

  • Allows for more elegant DC-wide versus pod-local VLAN definition/isolation
    – No need for pod-local VLANs to exist in core
    – Can support VLAN id reuse in multiple pods

  • Define FabricPath VLANs -> map VLANs to topology -> map topology to FabricPath core port(s)
  • Default topology always includes all FabricPath core ports
    – Map DC-wide VLANs to default topology

  • Pod-local core ports also mapped to pod-local topology
    – Map pod-local VLANs to pod-local topology

This post briefly describes Cisco FabricPath which is a technology for building scalable L2 topologies, allowing for more bisectional bandwidth to support East-West flows which are common in Datacenters. To dive deeper into FabricPath, visit the Cisco Live website.


STP Notes for CCDE

February 8, 2015

These are my study notes for CCDE based on “CCIE Routing and Switching v5.0 Official Cert Guide, Volume 1, Fifth Edition”, “Designing Cisco Network Service Architectures (ARCH) Foundation Learning Guide: (CCDP ARCH 642-874), Third Edition”, “INE – Understanding MSTP” and “Spanning Tree Design Guidelines for Cisco NX-OS Software and Virtual PortChannels”. This post is not meant to cover STP in all its aspects; it is a summary of key concepts and design aspects of running STP.

STP

STP was originally defined in IEEE 802.1D and improvements were defined in amendments to the standard. RSTP was defined in amendment 802.1w and MSTP in 802.1s. The latest 802.1D-2004 standard does not include “legacy STP”; it covers RSTP. MSTP was integrated into 802.1Q-2005 and later revisions.

STP has two types of BPDUs: Configuration BPDUs and Topology Change Notification BPDUs. To handle topology change, there are two flags in the Configuration BPDU: Topology Change Acknowledgment flag and Topology Change flag.

MessageAge is an estimation of the age of the BPDU since it was generated by the root; the root sends it with an age of 0 and other switches increment this value by 1. The lifetime of a BPDU is MaxAge – MessageAge. MaxAge, HelloTime and ForwardDelay are values set by the root, and locally configured values will only be used if that switch becomes the root.

STP works by comparing which Configuration BPDU is superior according to the following ordered list where lower values are better:

  1. Root Bridge ID (RBID)
  2. Root Path Cost (RPC)
  3. Sender Bridge ID (SBID)
  4. Sender Port ID (SPID)
  5. Receiver Port ID (RPID; not included in the BPDU, evaluated locally)

Each port stores the superior BPDU that has been sent or received, depending on the port role. Root and blocking ports store the received BPDU, while designated ports store the sent BPDU.

To determine port roles and which ports forward and block, the following three-step process is used:

  1. Elect the root switch
  2. Determine each switch’s Root port
  3. Determine the Designated port for each segment

The root bridge is elected based on the lowest Bridge ID, which consists of a 4-bit Priority, a 12-bit System ID Extension and a 6-byte System ID (MAC address). Before 802.1t, a lot of MAC addresses were consumed to make the BID unique when using PVST+ or MST.

BPDUs are only forwarded on designated ports; root ports and blocking ports do not send them since they would be inferior on the segment. A designated port is a port with a superior BPDU on a segment.

Topology Change

A topology change event occurs when:

  • A TCN BPDU is received by a Designated Port of a switch
  • A port moves to the Forwarding state and the switch has at least one Designated Port
  • A port moves from Learning or Forwarding to Blocking
  • A switch becomes the root switch

STP is slow to converge, especially with indirect failures where a link fails between the root switch and an intermediary switch. When inferior BPDUs are received, MaxAge has to expire before a switch will act on them.

When the topology has changed, the CAM table needs to be updated on all switches; a timer equivalent to ForwardDelay is used to time out unused entries.

A topology change starts at a switch, which sends a TCN BPDU out its root port. The designated switch sets the TCA bit in the configuration BPDU to acknowledge the TCN. The TCN then travels upstream until it reaches the root. The root will then send configuration BPDUs with the TC bit set for MaxAge + ForwardDelay seconds, and all switches will shorten the aging time of the CAM table to ForwardDelay seconds.

PVST+

PVST+ runs one spanning tree instance per VLAN. This does not scale well for a large number of VLANs and normally there will only be a few logical topologies anyway.

Switches that do not support PVST+ run Common Spanning Tree (CST) which has one instance of STP for all VLANs. Cisco switches can interact with CST through VLAN 1 by sending untagged BPDUs. All other VLANs in the PVST+ region will tag their BPDUs and tunnel the BPDUs through the CST region by using a special destination MAC address. The CST region is treated as a loop-free shared segment from the viewpoint of the PVST+ region. The destination MAC address is a multicast address that will get flooded by the CST switches.

RPVST+

RSTP has four different port roles:

  • Root Port
  • Designated Port
  • Alternate Port
  • Backup port

The first two are the same as in legacy STP and the last two are new. An Alternate Port is a potential replacement for the Root Port. A Backup Port is a replacement for a Designated Port; you would rarely, if ever, see one, because it only appears on shared segments.

RSTP uses a synchronization process to achieve fast convergence. This only works on links that are point-to-point, which is detected from the duplex mode of an interface. The link type can be hard-coded for the rare case where a port is half duplex but still not on a shared segment, as shown below.
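A minimal sketch of overriding the detected link type (interface number is an example):

  interface GigabitEthernet0/1
   ! Treat this port as point-to-point so that the RSTP
   ! proposal/agreement process can run despite half duplex
   spanning-tree link-type point-to-point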

RSTP uses more bits in the Configuration BPDU to encode additional information. These are the Proposal bit, Port Role bits, Learning bit, Forwarding bit and Agreement bit.

RSTP switches send their own BPDUs, as opposed to only relaying the root's BPDU as in legacy STP. If no BPDU is heard for 3x the hello interval, the stored BPDU is expired; RSTP does not rely on the MaxAge timer to expire BPDUs. RSTP can also act on inferior BPDUs directly instead of waiting for MaxAge to expire, which speeds up indirect link failure scenarios.

RSTP uses a proposal/agreement process where switches negotiate which port will become Designated. If the Proposal bit is set, the switch is proposing that its port should become Designated, and the other switch replies with an Agreement to immediately allow this. When ports first come up, they are in the Designated Discarding state. To not create a temporary loop during the synchronization process, all Non-Edge Designated ports are put into a Discarding state. I have described this process in detail in an earlier post.

With RSTP, only ports moving to a Forwarding state will cause a topology change. RSTP sets the TC bit in the BPDU to notify of a topology change and sends it out its Root Port and Designated Ports that are Non-Edge. MAC addresses are immediately flushed on these ports.

MST

MST uses the same underlying mechanics as RSTP with regard to BPDU parameters, but it decouples VLANs from spanning tree instances; multiple VLANs can be mapped to a single instance. MST is more efficient because the operator can define the number of instances needed and map the VLANs to these instances. MST is the only standardized version that is VLAN-aware, making it suitable in a multi-vendor environment.

MST switches organize the network into regions; switches within a region use MST in a consistent way. For switches to be in the same region, the name, revision and instance-to-VLAN mapping must match.
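A minimal sketch of a region configuration that would be copied to every switch in the region (name, revision and VLAN ranges are examples):

  spanning-tree mode mst
  spanning-tree mst configuration
   name CORE-REGION
   revision 1
   instance 1 vlan 10-19
   instance 2 vlan 20-29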

In MST, the System ID Extension carries the Instance ID instead of the VLAN ID when forming the BID used in BPDUs. MST sends a single BPDU containing information about all instances. In MST, a port sends BPDUs if it is Designated for at least one MST instance.

MST instance 0 is special and contains all VLANs by default; it is called the Internal Spanning Tree (IST). The IST interacts with STP switches that are outside the region. The port role and state determined by the interaction of the IST with a neighboring switch are inherited by all VLANs on that port, not just the VLANs mapped to the IST. This behavior makes the region appear as a single switch to the outside. If running multiple regions, each region can be seen as a single switch from the outside. The resulting network can still contain loops if there are multiple inter-region links. MST blocks these loops by building a Common Spanning Tree (CST) running between the regions. The CST is also used to interact with non-MST switches, and the tree built by the CST is used for all VLANs. The IST and the CST are then merged together and called the Common and Internal Spanning Tree (CIST).

The CIST Root switch is elected based on the lowest BID among all switches in all regions. This switch also becomes the root for the IST (instance 0) within its own region; the IST root of a region is called the CIST Regional Root.

In regions that do not contain the CIST Root, only boundary switches are allowed to become the IST Root. A boundary switch is a switch that has one or several links to other MST regions. The IST Root is elected based on external root path cost, which is the cost of the inter-region links between MST regions. If there is a tie in cost, the lowest BID is used as a tiebreaker to elect the CIST Regional Root. Cost inside a region is not taken into account.

The CIST Regional Root switch will have its Root Port towards the CIST Root; this is called the master port, and it is used by all MST instances to reach the CIST Root.

The following pictures show the different concepts of MST, starting with a physical topology:

MST1

The IST runs within the region and blocks ports to break up the physical loop. One switch will be the CIST Root and one switch will be the CIST Regional Root.

MST2

In reality, all these things tie in together and happen simultaneously, but to solidify the understanding we divide them into steps. The IST has run internally and blocked ports. This is what the CST looks like:

MST3

The CST runs between regions and/or non-MST devices and makes sure there is no loop between regions or to non-MST domains. If we combine the CST and the IST, we get the CIST, which is the final topology:

MST4

Interoperability Between MST and Other STP Versions

When communicating with an IEEE STP or RSTP switch, the MST switch must share the same port role and state towards the non-MST switch for all VLANs. STP or RSTP can’t see into the MST region, so it is treated as a single logical switch. The MST switch speaks by using the IST (instance 0) on boundary ports and formats the BPDUs as STP or RSTP. The IST also processes inbound BPDUs from the non-MST switch.

When communicating with a PVST+ or RPVST+ region, things get a bit more complex. One STP instance is run for each VLAN, and port role and state are calculated individually per VLAN. The IST communicates with the non-MST switch and must make sure that each PVST+/RPVST+ instance receives the same information so that they all make a consistent choice. MST and PVST+ must arrive at the same port role and state for all instances, even though only a single MST instance and a single PVST+ instance directly interact with each other. This is known as the PVST Simulation mechanism.

The IST will replicate BPDUs for all active VLANs towards the PVST+ switch, meaning that the PVST+ switch will make a consistent choice for port role and state for all VLANs. The IST does this by formatting the BPDUs as PVST+ BPDUs.

In the opposite direction, the IST takes the BPDU from VLAN 1 as a representative for the entire PVST+ region and processes it in the IST. The boundary port's role and state will be binding for all active VLANs on that port. The MST switch must make certain that the result of the IST interaction with the VLAN 1 STP instance is consistent with the state of the STP instances run in the other VLANs.

An MST boundary port will become a Designated Port if the BPDUs it sends out are superior to incoming VLAN 1 PVST+ BPDUs. The port will then be forwarding for all VLANs. To make sure that other PVST+ instances make a consistent decision, the MST switch must check that all incoming PVST+ BPDUs are inferior to its own outgoing BPDUs. If not, the PVST Simulation mechanism will fail.

The CIST Root can be located in the PVST+ region, and the boundary port can have a port role of Root if the incoming VLAN 1 PVST+ BPDUs are not only superior to the MST switch's own but also better than any other VLAN 1 PVST+ BPDUs received on any other boundary port. Once again, to keep the port role consistent, all per-VLAN root bridges must be located in the PVST+ region and be reached through the same boundary port. The PVST Simulation mechanism will check that incoming PVST+ BPDUs for VLANs other than VLAN 1 are identical or superior to the VLAN 1 PVST+ BPDUs.

An MST boundary port will become Non-Designated if it receives VLAN 1 PVST+ BPDUs that are superior to its own but not superior enough to make it a Root Port.

It is recommended to have the MST region appear as the Root switch to all PVST+ instances by lowering the IST root's priority below the priorities of all PVST+ switches in all VLANs.
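A minimal sketch on the MST side, assuming that priority 0 is lower than any priority used in the PVST+ domain:

  ! Lower the IST (instance 0) priority so the MST region wins
  ! the root election for every PVST+ VLAN
  spanning-tree mst 0 priority 0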

When an MST switch communicates with a PVST+ or RPVST+ switch, it always reverts to PVST+. There is less state involved with PVST+ since it has no Proposal/Agreement process, which simplifies the interworking of MST and PVST+.

Portfast Ports

  • Transitions directly to Forwarding state, saving 2x ForwardDelay
  • Does not generate topology change events
  • Does not flush CAM due to topology change
  • DOES send BPDUs
  • Does not expect to receive BPDUs
  • Not influenced by the Sync step in the Proposal/Agreement procedure (RSTP)

Portfast-enabled ports may also be referred to as Edge ports. If a Portfast-enabled port receives BPDUs, it will lose its Portfast status until the port has gone down and up again. RSTP uses the Proposal/Agreement process, and when going through Sync it will put all Non-Edge Designated ports into a Discarding state. Unless end-user ports are configured as Edge ports, they will be affected and briefly lose connectivity during the Sync process. Portfast is also important so that when a PC boots up and requests an IP address via DHCP, it gets one assigned before the process times out while waiting for the port to go into a Forwarding state. Portfast can be enabled per port or globally for all access ports, as shown below.
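A minimal sketch of both options (interface number is an example):

  ! Per port
  interface GigabitEthernet0/2
   spanning-tree portfast
  ! Or globally for all access (non-trunking) ports
  spanning-tree portfast default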

  • BPDU Guard: Enabled per port or globally for all Portfast-enabled ports; will error-disable the port upon receiving ANY BPDU
  • Root Guard: Only enabled per port; ignores any superior BPDUs received to prevent the port from becoming a Root Port. If a superior BPDU is received, the port is put into a root-inconsistent blocking state and ceases forwarding and receiving data frames until the superior BPDUs cease

After BPDU Guard has error-disabled a port, it must be recovered manually or by using the error-disable recovery feature.

Root Guard will block the port if a superior BPDU comes in; this does not have to be the best BPDU, simply better than what the local switch is originating. Root Guard will recover the port after the superior BPDU has expired, which is MaxAge – MessageAge for STP or 3x Hello for RSTP.
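A hedged sketch combining these protection features (interface number and recovery timer are examples):

  ! Error-disable any Portfast port that receives a BPDU
  spanning-tree portfast bpduguard default
  ! Automatically recover error-disabled ports after 300 seconds
  errdisable recovery cause bpduguard
  errdisable recovery interval 300
  ! Prevent a downstream switch from taking over the root role
  interface GigabitEthernet0/3
   spanning-tree guard root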

BPDU Filter

  • If enabled on a port it will unconditionally stop sending and receiving BPDUs
  • If enabled globally for Edge ports, it will send 11 BPDUs after enabling the feature and then stop sending BPDUs. If a BPDU is received at any point in time, BPDU Filter is operationally disabled on the port and will revert to normal STP rules, sending and receiving BPDUs.
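A minimal sketch of both variants:

  ! Per port: unconditionally stop sending and receiving BPDUs
  interface GigabitEthernet0/4
   spanning-tree bpdufilter enable
  ! Globally: applies only to Portfast (Edge) ports and reverts to
  ! normal STP rules if a BPDU is received
  spanning-tree portfast bpdufilter default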

Protecting Against Unidirectional Link Issues

Several mechanisms are available to protect against unidirectional links, such as Loop Guard, UDLD, the RSTP Dispute mechanism and Bridge Assurance.

UDLD

UDLD is a Cisco-proprietary layer 2 protocol that serves as an echo mechanism between a pair of devices. It sends UDLD messages advertising its identity and port identifier pair, as well as a list of all neighboring switch/port pairs heard on the same segment. The following explicit conditions are used by UDLD to detect a unidirectional link:

  • UDLD messages arriving from a neighbor that do not contain the exact switch/port pair matching the receiving switch and its port in the list of detected neighbors. This would suggest that either the neighbor does not hear this switch at all (fiber cut) or that neighbor’s port sending these UDLD messages is different from the neighbor’s port receiving the UDLD messages. This could be the case if the TX fiber is plugged into a different port than the RX fiber.
  • The incoming UDLD messages contain the same switch/port originator pair as the receiving switch, which would indicate that the port is self-looped.
  • A switch has detected only a single neighbor, but the neighbor's UDLD messages contain several switch/port pairs in the list of neighbors; this would indicate shared media and a lack of visibility between all connected devices.

The above are explicit conditions that will error-disable a port due to it being unidirectional. UDLD runs in either normal or aggressive mode. In normal mode, UDLD tries to reconnect with its neighbor(s) up to 8 times if there is a loss of incoming UDLD messages, but takes no action on this implicit condition if unsuccessful. Aggressive mode will error-disable the port if it stops receiving UDLD messages and the reconnect attempts fail. UDLD can be enabled globally or per port; enabling it globally will only enable UDLD on fiber ports.

Loop Guard prevents Root and Alternate Ports from becoming Designated in the case of loss of incoming BPDUs. When the stored BPDU on a port expires, Loop Guard will put the port into a loop-inconsistent state. Loop Guard can be configured globally or per port.
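A minimal sketch of enabling both mechanisms per port (both also have global variants):

  ! UDLD aggressive mode on a fiber uplink
  interface GigabitEthernet0/5
   udld port aggressive
  ! Loop Guard on a port that should never become Designated just
  ! because BPDUs stopped arriving
  interface GigabitEthernet0/6
   spanning-tree guard loop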

Bridge Assurance is another mechanism that is available on select platforms and works with RPVST+ and MST on point-to-point links. A port will send BPDUs regardless of state if Bridge Assurance is enabled. If BPDUs are not received, the port will be put into a BA-inconsistent state. This protects against unidirectional links as well as malfunctioning switches that stop participating in RPVST+/MST.

Finally, the Dispute mechanism available in RPVST+/MST works by checking the flags of incoming BPDUs. If an inferior BPDU is received but its flags indicate Designated and Learning or Forwarding, the local port will move into a Discarding state.

Port Channel

Interfaces can be bundled into a Port Channel, which increases the available bandwidth by carrying frames over multiple links. A hashing mechanism run over selected address fields of the frame determines which physical link to send the frame over. The hashing is deterministic, meaning that frames of the same flow will travel over the same physical link.

Load sharing can be based on MAC address, IP address or, on some platforms, even port numbers. A choice needs to be made, depending on the type of flows, as to which load sharing mechanism will be most beneficial. Normally only one type of load sharing can be used for all flows on a switch. Load sharing will normally be more balanced when using a number of links that is a power of 2; this varies by platform and the number of hash buckets.

To bring interfaces into a bundle, several parameters must match, such as speed, duplex, trunk/access, allowed VLANs, STP cost and so on.

It is recommended to run a dynamic protocol such as LACP to set up the bundle; this prevents failure modes where a switching loop is created because one side unconditionally bundles links while the other side has not yet formed the bundle. Port Channels are treated as a single logical interface by STP, and a single physical interface is responsible for transmitting BPDUs for the bundle. EtherChannel misconfig guard can protect against failures where BPDUs arrive with different source MAC addresses on ports in the bundle.
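A minimal LACP sketch (interface and channel numbers are examples):

  interface range GigabitEthernet0/1 - 2
   ! 'active' negotiates LACP; the bundle only forms if both sides agree
   channel-group 1 mode active
  ! Err-disable ports if BPDUs reveal a misconfigured bundle on the peer
  spanning-tree etherchannel guard misconfig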

STP Scalability and vPC

MST offers greater scalability than RPVST+ due to sending only one BPDU and decoupling VLANs from instances. Normally two instances are enough with MST. With MST, VLANs can be created without affecting the STP instances. MST can also better support stretched layer 2 domains through the use of regions.

To achieve load balancing with MST, at least two STP instances need to be defined and different switches will be the root for each of these instances.

Recommendations for MST:

  • Define a region configuration to be copied to all the switches that are part of the Layer 2 topology
  • As part of the region configuration, define to which instances all the VLANs belong. Normally two instances would be enough
  • Define primary and secondary root switches for all the instances that you have defined, also for instance 0. Typically one switch would be the root for instance 0 and instance 1 and a redundant aggregation switch for instance 2 (see the sketch below)
  • Preprovision all VLAN mappings and topologies and later create VLANs as needed
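A hedged sketch of that root placement on the two aggregation switches, assuming the two-instance region configuration shown earlier (priority values are examples):

  ! Agg1: root for instances 0 and 1, backup root for instance 2
  spanning-tree mst 0-1 priority 24576
  spanning-tree mst 2 priority 28672
  ! Agg2: root for instance 2, backup root for instances 0 and 1
  spanning-tree mst 0-1 priority 28672
  spanning-tree mst 2 priority 24576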

Special Considerations for Spanning Tree with vPCs

Virtual Port Channel (vPC) is a technology used on Nexus switches where two switches act as if they were one by having the primary switch generate BPDUs, LACP messages and so on. The two switches use a link between them to synchronize state and to pass traffic over; this link is called the vPC peer link. Ports that are not configured for vPC behave as normal ports, meaning that BPDUs get generated by the local switch.

Some modifications have been done to STP to be used in combination with vPC, they are the following:

  • The peer link should never be blocking because it carries important traffic such as Cisco Fabric Services over Ethernet (CFSoE) Protocol. The peer link is always forwarding
  • On vPC ports, only the primary switch generates BPDUs. The secondary switch will relay incoming BPDUs to the primary switch

The following picture shows the behavior of Spanning Tree on Nexus switches:

VPC1

The operational primary switch sends BPDUs towards Access1 even though it is not the STP Root. BPDUs that come from Access1 are relayed by Agg2. On ports that are not members of a vPC, normal rules apply, meaning that both Agg switches will send BPDUs towards Access2.

It is recommended to align the operational primary role with the STP Root role. If the peer link fails, the vPC ports on the secondary switch will be shut down. To keep SVIs up for non-vPC VLANs if the peer link fails, use a backup link between the switches that is independent of the peer link, or use the dual-active exclude command. If using an extra link, remove all the non-vPC VLANs from the vPC peer link.

MST and vPC Best Practices

  • Associate the root and secondary root role at the aggregation layer and match the vPC primary and secondary roles with the STP root role.
  • One MST instance is enough
  • Configure regions during the deployment phase
  • If changing the VLAN to instance mapping, change both the primary and secondary vPC to avoid global inconsistency
  • Use dual-active exclude command to not isolate non vPC VLANs when the peer-link is lost

If using RPVST+, use the long path-cost method so that lower-speed interfaces do not get the same cost as higher-speed interfaces. This should be the default for MST but may vary by platform.
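A one-line sketch:

  ! Use 32-bit port costs so that higher-speed interfaces get distinct values
  spanning-tree pathcost method long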

Scaling Considerations

Scaling may be affected by the following parameters:

  • The number of PortChannels
  • The number of VLANs supported by the switch
  • Logical interface count
  • Oversubscription rate

The logical port count is the sum, over all physical ports, of the number of VLANs carried on each port. When vPC is used, the secondary device passes BPDUs to the primary device, which increases the logical interface scale. A Port Channel is a logical interface, so it counts as a single logical port regardless of the number of links it contains. To calculate the logical ports for vPCs, multiply the number of vPCs by the number of VLANs on each vPC. For non-vPC switches, the logical port count is the number of trunks times the number of VLANs plus the number of access ports. For a switch with 10 trunks carrying 100 VLANs each and 10 access ports, that is 10 x 100 + 10 = 1010 logical ports.

Virtual ports are a per-line-card limitation: each line card supports a maximum number of virtual ports. They are calculated the same way as logical ports, except that for a Port Channel all physical member interfaces count individually.

To reduce the number of logical ports, the following concepts are important:

  • Implement multiple aggregation modules
  • Perform manual pruning on trunks
  • Use MST instead of (R)PVST+
  • Distribute trunks and access ports across line cards
  • Remove unused VLANs going to Content Switching Modules (CSM) – The CSM automatically has all VLANs defined in the system configuration

This post describes key concepts of STP, different STP optimizations and which scaling factors are important in designing a layer 2 network.

QoS Design Notes for CCDE

January 17, 2015 14 comments

Trying to get my CCDE studies going again. I’ve finished the End to End QoS Design book (relevant parts) and here are my notes on QoS design.

Basic QoS

Different applications require different treatment, the most important parameters are:

  • Delay: The time it takes for a packet to travel from the sending endpoint to the receiving endpoint
  • Jitter: The variation in end-to-end delay between sequential packets
  • Packet loss: The percentage of sent packets that never reach the receiver

Characteristics of voice traffic:

  • Smooth
  • Benign
  • Drop sensitive
  • Delay sensitive
  • UDP priority

One-way requirements for voice:

  • Latency ≤ 150 ms
  • Jitter ≤ 30 ms
  • Loss ≤ 1%
  • Bandwidth (30-128Kbps)

Characteristics for video traffic:

  • Bursty
  • Greedy
  • Drop sensitive
  • Delay sensitive
  • UDP priority

One-way requirements for video:

  • Latency ≤ 200-400 ms
  • Jitter ≤ 30-50 ms
  • Loss ≤ 0.1-1%
  • Bandwidth (384Kbps-20+ Mbps)

Characteristics for data traffic:

  • Smooth/bursty
  • Benign/greedy
  • Drop insensitive
  • Delay insensitive
  • TCP retransmits

Quality of Service (QoS) – Managed unfairness, measured numerically in latency, jitter and packet loss

Quality of Experience (QoE) – End user perception of network performance, subjective and can’t be measured

Tools

Classification and marking tools: Sessions, or flows, are analyzed to determine what class the packets belong to and what treatment they should receive. Packets are marked so that analysis happens only a limited number of times, usually at ingress as close to the source as possible. Reclassification and remarking are common as the packets traverse the network.

Policing, shaping and markdown tools: Different classes of traffic are allotted portions of the network resources. Traffic may be selectively dropped, delayed or remarked to avoid congestion when it exceeds the available network resources. Traffic can be dropped (policing), slowed down (shaping) or remarked (markdown) to conform.

Congestion management or scheduling tools: When there is more traffic than available network resources, it will be queued. Traffic classes that do not react well to queuing can be denied access by a scheduling tool to avoid lowering the quality of the existing flows.

Link-specific tools: Link fragmentation and interleaving fits into this category.

Packet Header

The IPv4 packet has an 8-bit Type of Service (ToS) field; the IPv6 packet has an 8-bit Traffic Class field. The first three bits are the IP Precedence (IPP) bits, for a total of 8 classes. The first three bits in combination with the next three are known as DSCP, for a total of 64 classes.

At layer two, the most common markings are 802.1p Class of Service (CoS) and MPLS EXP bits, each using three bits for a total of 8 classes.

QoS Deployment Principles

  1. Define the business/organizational objectives of the QoS deployment. This may include provisioning real-time services for voice/video traffic, guaranteeing bandwidth for critical business applications and managing scavenger traffic. Seek executive endorsement of the business objectives to not derail the process later on.
  2. Based on the business objectives, determine how many classes of traffic are needed. Define an end-to-end strategy for how to identify the traffic and treat it across the network.
  3. Analyze the requirements of each application class so that the proper QoS tools can be deployed to meet these requirements.
  4. Design platform-specific QoS policies to meet the requirements with consideration for appropriate Place In the Network (PIN).
  5. Test the QoS designs in a controlled environment.
  6. Begin deployment with a closely monitored and evaluated pilot rollout.
  7. The tested and pilot proven QoS designs can be deployed to the production network in phases during scheduled downtime.
  8. Monitor service levels to make sure that the QoS objectives are being met.

The common mistake is to make it a technical process only and not research the business objectives and requirements.

QoS Feature Sequencing

Classification: The identification of each traffic stream.

Pre-queuing: Admission decisions, and dropping and marking the packet, are best applied before the packet enters a queue for egress scheduling and transmission.

Queueing: Scheduling the order of packets before transmission.

Post-queueing: Usually optional; sometimes needed to apply actions that depend on the transmission order of packets, such as sequence numbering (e.g. compression and encryption), which isn't known until the QoS scheduling function dequeues the packets based on the priority rules.

Security and QoS

Trust Boundaries

A trust boundary is a network location where packet markings are not accepted and may be rewritten. Trust domains are network locations where packet markings are accepted and acted on.

Network Attacks

QoS tools can mitigate the effects of worms and DoS attacks to keep critical applications available during an attack.

Recommendations and Guidelines

  • Classify and mark traffic as close to the source as technically and administratively feasible
  • Classification and marking can be done on ingress or egress but queuing and shaping are usually done on egress
  • Use an end-to-end Diffserv PHB model for packet marking
  • Less granular fields such as CoS and MPLS EXP should be mapped to DSCP as close to the traffic source as possible
  • Set a trust boundary and mark or remark traffic that comes in beyond the boundary
  • Follow standards-based Diffserv PHB markings if possible to ensure interoperability with SP networks, enterprise networks or when merging networks together
  • Set dscp and set precedence should be used to mark all IP traffic; set ip dscp and set ip precedence only mark IPv4 packets (see the sketch after this list)
  • When using tunnel interfaces, think of feature sequencing to make sure that the inner or outer packet headers (or both) are marked as intended
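A hedged MQC sketch of classifying and marking at the access edge (class names and the NBAR2 match are illustrative):

  class-map match-any CM-SIGNALING
   match protocol sip
  !
  policy-map PM-EDGE-MARKING
   class CM-SIGNALING
    ! 'set dscp' marks both IPv4 and IPv6, unlike 'set ip dscp'
    set dscp cs3
   class class-default
    set dscp default
  !
  interface GigabitEthernet0/1
   service-policy input PM-EDGE-MARKING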

Policing and Shaping Tools

Policer: Checks traffic against a configured rate. Does not delay packets; takes immediate action to drop or remark packets that exceed the rate.

Shaper: Traffic smoothing tool with the objective of buffering packets instead of dropping them, smoothing out any peaks of traffic arrival so as not to exceed the configured rate.

Characteristics of a policer:

  • Causes TCP resends when traffic is dropped
  • Inflexible and inadaptable; makes instantaneous packet drop decisions
  • An ingress or egress interface tool
  • Does not add any delay or jitter to packets
  • Rate limiting without buffering

Characteristics of a shaper:

  • Typically delays rather than drops exceeding traffic, causes fewer TCP resends
  • Adapts to congestion by buffering exceeding traffic
  • Typically an egress interface tool
  • Adds delay and jitter when the arrival rate exceeds the shaped rate
  • Rate limiting with buffering

Placing Policers and Shapers in the Network

Policers make instantaneous decisions and should be deployed ingress, don’t transport packets if they are going to be dropped anyway. Policers can also be placed on egress to limit a traffic class at the edge of the network.

Shapers are often deployed as egress tools, commonly on enterprise-to-SP links to not exceed the committed rate of the SP.

Tail Drop and Random Drop

Tail drop means dropping the packet that is at the end of a queue. The Tx-Ring is always FIFO; if a voice packet is trying to get into the Tx-Ring while it is full, it will get dropped because it is at the tail of the queue. Random drop via Random Early Detection (RED) or Weighted Random Early Detection (WRED) tries to keep the queues from becoming full by dropping packets from traffic classes to cause TCP to slow down.

Recommendations and Guidelines

  • Police as close to the source as possible, preferably on ingress.
  • A single-rate three-color policer handles bursts better than a single-rate two-color policer, resulting in fewer TCP retransmissions
  • Use a shaper on interfaces where speed mismatches, such as buying a lower rate than physical speed or between a remote-end access link and the aggregated head-end link
  • When shaping on an interface carrying real-time traffic, set the Tc value to 10 ms
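A hedged sketch of both tools, tying in the Tc recommendation above (rates, class names and interfaces are examples). For a shaper, Tc = Bc / CIR, so a CIR of 50 Mbps with Bc = 500,000 bits gives Tc = 500,000 / 50,000,000 = 10 ms:

  policy-map PM-INGRESS-POLICE
   class CM-SCAVENGER
    ! Police scavenger traffic to 1 Mbps on ingress, no buffering
    police 1000000 conform-action transmit exceed-action drop
  !
  policy-map PM-EGRESS-SHAPE
   class class-default
    ! 50 Mbps CIR with Bc = 500,000 bits -> Tc = 10 ms
    shape average 50000000 500000
  !
  interface GigabitEthernet0/0
   service-policy input PM-INGRESS-POLICE
   service-policy output PM-EGRESS-SHAPE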

Scheduling Algorithms

Strict priority: Lower priority queues are only served when higher priority queues are empty. Can potentially starve traffic in lower priority queues.

Round robin: Queues are served in a set sequence, does not starve traffic but can add unpredictable delays in real-time, delay sensitive traffic.

Weighted fair: Packets in the queue are weighted, usually by IP precedence, so that some queues get served more often than others. It does not provide a bandwidth guarantee; the bandwidth per flow varies based on the number of flows and the weight of each flow.

WRED is a congestion avoidance tool that manages the tail of the queue. The goal is to avoid TCP global synchronization, where all TCP flows speed up and slow down at the same time, which leads to poor utilization of the link. WRED has little or no effect on UDP flows. WRED can also set the RFC 3168 IP ECN bits to indicate that the link is experiencing congestion.

Recommendations and Guidelines

  • Critical applications like VoIP require service guarantees regardless of network conditions. This requires enabling queuing on all nodes with a potential for congestion.
  • A large number of applications end up in the default class; reserve 25% for this default Best Effort class (see the sketch after this list)
  • For a link carrying a mix of voice, video and data traffic, limit the priority queue to 33% of the link bandwidth
  • Enable LLQ if real-time, latency sensitive traffic is present
  • Use WRED for congestion avoidance on TCP flows, but evaluate its effect on classes that mostly carry UDP flows
  • Use DSCP-based WRED wherever possible
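A hedged egress queuing sketch following these numbers, assuming an HQF-capable IOS and class maps that match on DSCP (names and percentages are examples):

  policy-map PM-WAN-EGRESS
   class CM-VOICE
    ! LLQ capped at 33% of the link bandwidth
    priority percent 33
   class CM-CONTROL
    bandwidth percent 7
   class CM-TRANSACTIONAL
    bandwidth percent 35
    random-detect dscp-based
   class class-default
    ! At least 25% for the default Best Effort class
    bandwidth percent 25
    fair-queue
    random-detect
  !
  interface Serial0/0
   service-policy output PM-WAN-EGRESS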

Bandwidth Reservation Tools

Measurement based: Counting mechanism to only allow a limited number of calls (sessions). Normally statically configured by an administrator.

Resource based: Based on the availability of resources in the network, usually bandwidth. Uses the current status of the network to base its decision.

Resource Reservation Protocol (RSVP) is a resource based protocol, commonly used with MPLS-TE. The drawback of RSVP is that it requires a lot of state in the devices.

Admission control (AC) functionality is most effectively deployed at the application level, such as with Cisco Unified Communications Manager (CUCM). It works well in networks with limited complexity and where flows are of predictable bandwidth.

RSVP can be used in combination with Diffserv in an Intserv/Diffserv model where RSVP is only responsible for admission control and Diffserv for the queuing.

An RSVP proxy can be used because end devices such as phones and video endpoints usually don't support the RSVP stack. A router close to the endpoint is then used as a proxy, together with CUCM, to act as an AC mechanism.

Recommendations and Guidelines

Cisco recommends using RSVP Intserv/Diffserv model with a router-based proxy device. This allows for scaling of policies together with a dynamic network aware AC.
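A hedged sketch of the Intserv/Diffserv split on a WAN interface, where RSVP only performs admission control and Diffserv handles classification and queuing (bandwidth values are examples):

  interface Serial0/0
   ! Admit reservations up to 500 kbps total, 100 kbps per flow
   ip rsvp bandwidth 500 100
   ! RSVP does admission control only; data packets are left to
   ! the Diffserv queuing policy on the interface
   ip rsvp resource-provider none
   ip rsvp data-packet classification none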

IPv6 and QoS

IPv6 headers are larger in size, so bandwidth consumption for small packet sizes is higher; the IPv4 header is normally 20 bytes, while the IPv6 header is 40 bytes. IPv6 has a 20-bit Flow Label field and an 8-bit Traffic Class field.

Medianet

Modern applications can be difficult to classify and can consist of multiple types of traffic. Webex provides text, audio, instant messaging, application sharing and desktop video conferencing through the same application. NBAR2 can be used to identify applications.

Application Visibility Control (AVC)

Consists of NBAR2, Flexible Netflow (FNF) and MQC. NBAR2 is used to identify traffic through Deep Packet Inspection (DPI), FNF reports on usage and MQC is used for the configuration.

FNF uses NetFlow v9 and IPFIX to export flow record information. It can monitor L2 to L7 and identify applications by port and through NBAR2. When using NBAR2, CPU and memory usage may increase significantly; this is also true for FNF. Consider the performance impact before deploying them.

QoS Requirements and Recommendations by Application Class

Voice requirements:

  • One-way latency should be no more than 150 ms
  • One-way peak-to-peak jitter should be no more than 30 ms
  • Per-hop peak-to-peak jitter should be no more than 10 ms
  • Packet loss should be no more than 1%
  • A range of 20 – 320 Kbps of guaranteed priority bandwidth per call (depends on sampling rate, codec and L2 overhead)

Voice recommendations:

  • Mark to Expedited Forwarding (EF) / DSCP 46
  • Treat with EF PHB (priority queuing)
  • Voice should be admission controlled

Jitter buffers may be used to reduce the effects of jitter; however, they add delay. Voice packets are constant in size, which means bandwidth can be provisioned accurately. Don't forget to account for L2 overhead.

Broadcast video requirements:

  • Packet loss should be no more than 0.1%

Broadcast video recommendations:

  • Mark to CS5 / DSCP 40
  • May be treated with EF PHB (priority queuing)
  • Should be admission controlled

Flows are usually unidirectional and include application level buffering. Does not have strict jitter or latency requirements.

Real-time interactive video requirements:

  • One-way latency should be no more than 200 ms
  • One-way peak-to-peak jitter should be no more than 50 ms
  • Per-hop peak-to-peak jitter should be no more than 10 ms
  • Packet loss should be no more than 0.1%
  • Provisioned bandwidth depends on codec, resolution, frame rates, additional data components and network overhead

Real-time interactive video recommendations:

  • Should be marked with CS4 / DSCP 32
  • May be treated with an EF PHB (priority queuing)
  • Should be admission controlled

Multimedia conferencing requirements:

  • One-way latency should be no more than 200 ms
  • Packet loss should be no more than 1%

Multimedia conferencing recommendations:

  • Mark to AF4 class (AF41/AF42/AF43 or DSCP 34/36/38)
  • Treat with AF PHB with guaranteed bandwidth and DSCP-based WRED
  • Should be admission controlled

Multimedia streaming requirements:

  • One-way latency should be no more than 400 ms
  • Packet loss should be no more than 1%

Multimedia streaming recommendations:

  • Should be marked to AF3 class (AF31/AF32/AF33 or DSCP 26/28/30)
  • Treat with AF PHB with guaranteed bandwidth and DSCP-based WRED
  • May be admission controlled

Data applications can be divided into Transactional Data (low latency) or Bulk Data (high throughput)

Transactional data recommendations:

  • Should be marked to AF2 class (AF21/AF22/AF23 or DSCP 18/20/22)
  • Treat with AF PHB with guaranteed bandwidth and DSCP-based WRED

This class may be subject to policing and remarking. Applications in this class can be Enterprise Resource Planning (ERP) or Customer Relationship Management (CRM).

Bulk data recommendations:

  • Should be marked to AF1 class (AF11/AF12/AF13 or DSCP 10/12/14)
  • Treat with AF PHB with guaranteed bandwidth and DSCP-based WRED
  • Deployed in a moderately provisioned queue to provide a degree of bandwidth constraint during congestion, to prevent long TCP sessions from dominating network bandwidth

Example applications are e-mail, backup operations, FTP/SFTP transfers, video and content distribution.

Best effort data recommendations:

  • Mark to DF (DSCP 0)
  • Provision in dedicated queue
  • May be provisioned with guaranteed bandwidth allocation and WRED/RED

Scavenger traffic recommendations:

  • Should be marked to CS1 (DSCP 8)
  • Should be assigned a minimally provisioned queue

Example traffic is Youtube, Xbox Live/360 movies, iTunes, Bittorrent.

Control plane traffic can be divided into Network Control, Signaling and Operations/Administration/Management (OAM).

Network Control recommendations:

  • Should be marked to CS6 (DSCP 48)
  • May be assigned a moderately provisioned guaranteed bandwidth queue

Do not enable WRED. Example traffic is EIGRP, OSPF, BGP, HSRP and IKE.

Signaling traffic recommendations:

  • Should be marked to CS3 (DSCP 24)
  • May be assigned a moderately provisioned guaranteed bandwidth queue

Do not enable WRED. Example traffic is SCCP, SIP and H.323.

OAM traffic recommendations:

  • Should be marked to CS2 (DSCP 16)
  • May be assigned a moderately provisioned guaranteed bandwidth queue

Do not enable WRED. Example traffic is SSH, SNMP, Syslog, HTTP/HTTPs.

QoS Design Recommendations:

  • Always enable QoS in hardware as opposed to software if possible
  • Classify and mark as close to the source as possible
  • Use DSCP markings where available
  • Follow standards based DSCP PHB markings
  • Police flows as close to source as possible
  • Mark down traffic according to standards based rules if possible
  • Enable queuing at every node that has potential for congestion
  • Limit LLQ to 33% of link capacity
  • Use AC mechanism for LLQ
  • Do not enable WRED for LLQ
  • Provision at least 25% for Best Effort traffic

QoS Models:

Four-Class Model:

  • Voice
  • Control
  • Transactional Data
  • Best Effort

Eight-Class Model:

  • Voice
  • Multimedia-conferencing
  • Multimedia-streaming
  • Network Control
  • Signaling
  • Transactional Data
  • Best Effort
  • Scavenger

Twelve-Class Model:

  • Voice
  • Broadcast Video
  • Real-time interactive
  • Multimedia-conferencing
  • Multimedia-streaming
  • Network Control
  • Signaling
  • Management/OAM
  • Transactional Data
  • Bulk Data
  • Best Effort
  • Scavenger

This picture shows how the smaller models can be expanded into the larger ones, and vice versa.

QoS models

Campus QoS Design Considerations and Recommendations:

The primary role of QoS in campus networks is not to control latency or jitter, but to manage packet loss. Endpoints normally connect to the campus at high speeds; it may take only a few milliseconds of congestion to overrun the buffers of switches/linecards/routers.

Trust Boundaries:

Conditionally trusted endpoints: Cisco IP phones, Cisco Telepresence, Cisco IP video surveillance cameras, Cisco digital media players.

Trusted endpoints: Centrally administered PCs and endpoints, IP video conferencing units, managed APs, gateways and other similar devices.

Untrusted endpoints: Unsecure PCs, printers and similar devices.

Port-Based QoS versus VLAN-based QoS versus Per-Port/Per-VLAN QoS

Design recommendations:

  • Use port-based QoS when simplicity and modularity are the key design drivers
  • Use VLAN-based QoS when looking to scale policies for classification, trust and marking
  • Do not use VLAN-based QoS to scale (aggregate) policing policies
  • Use per-port/per-VLAN when supported and policy granularity is the key design driver

EtherChannel QoS

  • Load balance based on source and destination IP or what is expected to give the best distribution of traffic
  • Be aware that multiple real-time flows may end up on the same physical link, oversubscribing the real-time queue

EtherChannel QoS will vary by platform and some policies are applied to the bundle and some to the physical interface.
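A minimal sketch of steering the hash toward IP addresses (syntax varies by platform):

  ! Hash on source and destination IP for better flow distribution
  port-channel load-balance src-dst-ip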

Ingress QoS Models:

Design recommendations:

  • Deploy ingress QoS models such as trust, classification and policing on all access edge ports
  • Deploy ingress queuing (if supported and required)

The probability for congestion on ingress is less than on egress.

Egress QoS Models:

Design recommendations:

  • Deploy egress queuing policies on all switch ports
  • Use a 1P3Q (one priority queue plus three normal queues) or better queuing structure

Enable trust on ports leading to network infrastructure and similar devices.

Trusted Endpoint:

  • Trust DSCP
  • Optional ingress marking and/or policing
  • Minimum 1P3Q

Untrusted Endpoint:

  • No trust
  • Optional ingress marking and/or policing
  • Minimum 1P3Q

Conditionally Trusted Endpoint:

  • Conditional trust with trust CoS
  • Optional ingress marking and/or policing
  • Minimum 1P3Q

Switch to Switch/Router Port QoS:

  • Trust DSCP
  • Minimum 1P3Q
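A hedged Catalyst-style sketch of the trust models above, assuming a platform that uses the mls qos syntax (interface numbers are examples):

  mls qos
  ! Trusted endpoint or switch/router uplink
  interface GigabitEthernet0/10
   mls qos trust dscp
  ! Conditionally trusted endpoint: trust CoS only if a phone is detected
  interface GigabitEthernet0/11
   mls qos trust cos
   mls qos trust device cisco-phone
  ! Untrusted endpoint: no trust statement, so markings are reset to 0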

Control Plane Policing

Can be used to harden the network infrastructure. Packets handled by the main CPU typically include the following (a sketch follows the list):

  • Routing protocols
  • Packets destined to the local IP of the router
  • Packets from management protocols such as SNMP
  • Interactive access protocols such as Telnet and SSH
  • ICMP or packets with IP options, which may have to be handled by the CPU
  • Layer two packets such as BPDUs, CDP, DTP and so on
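A hedged CoPP sketch that rate-limits ICMP destined to the device itself (ACL, rate and class names are examples):

  access-list 100 permit icmp any any
  !
  class-map match-all CM-COPP-ICMP
   match access-group 100
  !
  policy-map PM-COPP
   class CM-COPP-ICMP
    ! Allow 64 kbps of ICMP to the CPU, drop the excess
    police 64000 conform-action transmit exceed-action drop
  !
  control-plane
   service-policy input PM-COPP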

Wireless QoS

The IEEE 802.11e amendment added QoS enhancements to the 802.11 standard in 2005, and these were later incorporated into IEEE 802.11-2012. The Wi-Fi Alliance has a compatibility certification called Wi-Fi Multimedia (WMM).

In Wi-Fi networks, only one station may transmit at a time, a physical constraint that does not exist on wired networks. The Radio Frequency (RF) medium is shared between devices, similar to a hub environment. Wireless networks also operate at variable speeds.

Distributed Coordination Function (DCF) is responsible for scheduling and transmitting frames onto the wireless medium.

Wireless uses Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA); it actively tries to avoid collisions. A wireless client waits a random period before sending traffic to try to avoid collisions.

DCF evolved to Enhanced Distributed Channel Access (EDCA) which is a MAC layer protocol. It has the following additions compared to DCF:

  • Four priority queues, or access categories
  • Different interframe spacing for each AC as compared to a single fixed value for all traffic
  • Different contention window for each AC
  • Transmission Opportunity (TXOP)
  • Call admission control (TSpec)

802.11e frames use a 3-bit field known as User Priority (UP) for traffic marking. It is analogous to 802.1p CoS. One difference is that voice is marked with UP 6, as compared to CoS 5.

Interframe spacing is the time a client needs to wait before starting to send traffic; the wait time is lower for higher priority traffic.

The contention window is used when the wireless medium is not free; higher priority traffic waits a shorter period of time before trying to send again than lower priority traffic.

TXOP is a bounded period of time during which the client is allowed to send, so that it does not hog the medium for a long period of time.

TSpec is used for admission control; the client sends its requirements, such as data rate and frame size, to the AP, and the AP only admits it if there is available bandwidth.

Upstream QoS is packets from the wireless network onto the wired network. Downstream QoS is packets from the wired network onto the wireless network.

Wireless markings may not be consistent with wired markings, so a mapping may be needed to place traffic into the correct classes on the wired network.

Upstream QoS:

  1. The 802.11e UP marking on an upstream frame from client to AP is translated to a DSCP value on the outside of the CAPWAP tunnel. The inner DSCP marking is preserved
  2. After the CAPWAP packet is decapsulated at the WLC, the original IP header's DSCP value is used to derive the 802.1p CoS value

Downstream QoS:

  1. A frame with an 802.1p CoS marking arrives at a WLC wired interface. The DSCP value of the IP packet is used to set the DSCP of the outer CAPWAP header
  2. The DSCP value of the CAPWAP header is used to set the 802.11e UP value on the wireless frame

The 802.1p CoS value is not used in the above process.

Data Center QoS

Primary goal is to manage packet loss. A few milliseconds of traffic during congestion can cause buffer overruns.

Various data center designs have different QoS needs. These are a few data center architectures:

  • High-Performance Trading (HPT)
  • Big data architectures, including High-Performance Computing (HPC), High-Throughput Computing (HTC) and grid data
  • Virtualized Multiservice Data Center (VMDC)
  • Secure Multitenant Data Center (SMDC)
  • Massively Scalable Data center (MSDC)

High-Performance Trading:

Minimal or no QoS requirements because the goal of the architecture is to introduce as little delay as possible using low latency platforms such as the Nexus.

Big Data (HPC/HTC/Grid) Architectures

Have similar QoS needs as a campus network. The goal is to process large and complex data sets that are too difficult to handle by traditional data processing applications.

High-Performance Computing: Uses large amounts of computing power for a short period of time. Often measured in Floating-point Operations Per Second (FLOPS)

High-Throughput Computing: Also uses large amounts of computing power, but for a longer period of time. More focused on operations per month or year.

Grid: A federation of computer resources from multiple locations to reach a common goal; a distributed system with noninteractive workloads that involve a large number of files. Compared to HPC, grid is usually more heterogeneous, loosely coupled and geographically dispersed.

Virtualized Multiservice Data Center (VMDC):

VMDC comes with unique requirements due to compute and storage virtualization, including provisioning a lossless Ethernet service.

  • Applications no longer map to physical servers (or cluster of servers)
  • Storage is no longer tied to a physical disk (or array)
  • Network infrastructure is no longer tied to hardware

Lossless compute and storage virtualization protocols such as RoCE and FCoE need to be supported as well as Live Migration/vMotion.

Secure Multitenant Data Center (SMDC):

Virtualization is leveraged to support multitenants over a common infrastructure and this affects the QoS design. SMDC has similar needs as VMDC but a different marking model.

Massively Scalable Data Center:

A framework used to build elastic data centers that host a few applications that are distributed across thousands of servers. Geographically distributed homogenous pools of compute and storage. The goal is to maximize throughput. Common to use a leaf and spine design.

Data Center Bridging Toolset

IEEE 802.1 Data Center Bridging Task Group has defined enhancements to Ethernet to support requirements of converged data center networks.

  • Priority flow control (IEEE 802.1Qbb)
  • Enhanced transmission selection (IEEE 802.1Qaz)
  • Congestion notification (IEEE 802.1Qau)
  • DCB exchange (DCBX) (IEEE 802.1Qaz combined with 802.1AB)

Priority Flow Control (802.1Qbb): PFC provides a link-level flow control mechanism that can be controlled independently for each 802.1p CoS priority. The goal is to provide zero frame loss due to congestion in DCB networks, mitigating Head of Line (HoL) blocking. Uses PAUSE frames.

Skid Buffers

Buffer management is critical to PFC, if transmit or receive buffers are overflowed, transmission will not be lossless. A switch needs sufficient buffers to:

  • Store frames sent during the time it takes to send the PAUSE frame across the network between stations
  • Store frames that are already in transit when the sender receives the PFC PAUSE frame

The buffers used for this are called skid buffers and usually engineered on a per port basis in hardware on ingress.

An incast flow is a flow from many senders to one receiver.

Virtual Output Queuing (VOQ)

VOQ artificially induces congestion on ingress ports where an incast flow is going to a host. This lessens the need for deep buffers on egress. VOQ absorbs congestion at every ingress port and optimizes switch buffering capacity for incast flows, so traffic does not consume fabric bandwidth only to be dropped on the egress port.

Enhanced Transmission Selection – IEEE 802.1Qaz

Uses a virtual lane concept on a DCB-enabled NIC, also called a Converged Network Adapter (CNA). Each virtual interface queue is accountable for managing its allotted bandwidth for its traffic group. If a group is not using all its bandwidth, it may be used by other groups.

ETS virtual interface queues can be serviced as follows:

  • Priority – a virtual lane can be assigned a strict priority service
  • Guaranteed bandwidth – a percentage of the physical link capacity
  • Best effort – the default virtual lane service

Congestion Notification IEEE 802.1Qau

A layer two traffic management system that pushes congestion to the edge of the network by instructing rate limiters to shape the traffic that is causing congestion. The congestion point, such as a distribution switch connecting several access switches, can instruct these switches, called reaction points, to throttle the traffic by sending control frames.

Data Center Bridging Exchange (DCBX) IEEE 802.1Qaz + 802.1AB

DCB capabilities:

  • DCB peer discovery
  • Mismatched configuration detection
  • DCB link configuration of peers

The following DCB parameters can be exchanged by DCBX:

  • PFC
  • ETS
  • Congestion notification
  • Applications
  • Logical link-down
  • Network interface virtualization

DCBX can be used between switches and with some endpoints.

Data Center Transmission Control Protocol (DCTCP)

A goal of the data center is to maximize the goodput which is the application level throughput excluding protocol overhead. Goodput is reduced by TCP flow control and congestion avoidance, specifically TCP slow start.

DCTCP is based on two key concepts:

  • React in proportion to the extent of congestion, not its presence – this reduces variance in sending rates
  • Mark ECN based on instantaneous queue length – this enables fast feedback and corresponding window adjustments to better deal with bursts

Considerations affecting the marking model to be used in the data center include the following:

  • Data center applications and protocols
  • CoS/DSCP marking
  • CoS 3 overlapping considerations
  • Application-based marking models
  • Application- and tenant-based marking models

Data Center Applications and Protocols

Recommendations:

  • Consider what applications/protocols are present in the data center and may not already be reflected in the enterprise QoS model and how these may be integrated
  • Consider what applications/protocols may not be present or have a significantly reduced presence in the DC

Compute Virtualization Protocols:

Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE):
Supports direct memory access from one computer into another over converged Ethernet without involving either one's operating system. Permits high-throughput, low-latency networking, especially useful in massively parallel computer clusters. It's a link layer protocol that allows communication between any two hosts in the same broadcast domain. RoCE requires lossless service via PFC. When implemented along with FCoE, it should be assigned its own no-drop class/virtual lane, such as CoS 4. Other applications such as video using CoS 4 need to be reassigned to improve RoCE performance.

Internet Wide Area RDMA Protocol (iWARP):
Extends the reach of RDMA over IP networks. Does not require lossless service because it runs over TCP or SCTP, which provide reliable transport. It can be marked to an unused CoS/DSCP or combined with internetwork control (CS6/CoS 6) or network control (CS7/CoS 7).

Virtual machine control and live migration protocols (VM control):
Virtual Machines (VMs) require control traffic to be passed between hypervisors. VM control is control plane traffic and should be marked to CoS 6 or CoS 7, depending on QoS model in use.

Live migration:
Protocols that support the process of moving a running VM (or application) between different physical machines without disconnecting the client or the application. Memory, storage and network connections are moved from the original host machine to the destination, a common example being vMotion. It can be argued to be a candidate for internetwork control (CoS 6) due to being a control plane protocol, but it sends too much traffic to be put in that class. Use an available marking or combine with CoS 4, CoS 2 or even CoS 1.

Storage Virtualization Protocols:

Fibre Channel over Ethernet (FCoE):
Encapsulates Fibre Channel (FC) frames over Ethernet networks. FCoE is a layer two protocol that can't be natively routed. It requires lossless service via PFC and is usually marked with CoS 3, which should then be dedicated to FCoE.

Internet Protocol Small Computer System Interface (iSCSI):
Encapsulates SCSI commands within IP to enable data transfers. Can be used to transmit data over LANs, WANs or even the Internet and can enable location-independent data storage and retrieval. Does not require lossless service due to using TCP. Can be provisioned in a dedicated class or in another class such as CoS 2 or CoS 1.

CoS/DSCP Marking:

Recommendations:

  • Some layer two protocols within the DC require CoS marking
  • CoS marking has limitations so consider a hybrid CoS/DSCP model (when supported)

CoS 3 Overlap Considerations and Tactical Options:

Recommendations:

  • Recognize the potential overlap of signaling (and multimedia streaming) markings with FCoE
  • Select a tactical option to address this overlap

Signaling is normally marked with CoS 3, but so is FCoE. Some administrators prefer to dedicate CoS 3 to FCoE, but that leaves the question of what to do with signaling. Options to handle the overlap:

Hardware Isolation:
Some platforms and interface modules do not support FCoE, such as the Nexus 7000 M-Series modules, while the F-Series modules do. M-Series modules can connect to CUCM and multimedia streaming servers, and F-Series modules to the DCB extended fabric supporting FCoE.

Layer 2 Versus Layer 3 Classification:
Signaling and multimedia streaming can be classified by DSCP values (CS3 and AF3) to be assigned to queues and FCoE can be classified by CoS 3 to its own dedicated queue.

Asymmetrical CoS/DSCP Marking:
Asymmetrical meaning that the three bits forming the CoS do not match the first three bits of the DSCP value. Signaling could be marked with CoS 4 but DSCP CS3.

DC/Campus DSCP Mutation:
Perform ingress and egress DSCP mutation on data center to campus links. Signaling and multimedia streams can be assigned DSCP values that map to CoS 4 (rather than CoS 3).

Coexistence:
Allow signaling and FCoE to coexist in CoS 3, the reasoning being that if the CUCM server has a CNA, both signaling and FCoE will be provided a lossless service.

Data Center QoS Models:

Trusted Server Model:
Trust the L2/L3 markings sent by application servers. Only approved servers should be deployed in the DC.

Untrusted Server Model:
Do not trust markings, reset markings to 0.

Single-Application Server Model:
Same as the untrusted server model, but traffic is remarked to a non-zero value.

Multi-Application Server Model:
Access lists are used for classification, and traffic is marked to multiple codepoints. The application server either does not mark traffic at all or marks it to different values than the enterprise QoS model.

Server Policing Model:
One or more application classes are metered via one-rate or two-rate policers, with conforming, exceeding and optionally violating traffic marked to different DSCP values.

Lossless Transport Model:
Provision lossless service to FCoE.

Trusted Server/Network Interconnect:

  • Trust CoS/DSCP
  • Ingress queuing
  • Egress queuing

Untrusted Server:

  • Set CoS/DSCP to 0
  • Ingress queuing
  • Egress queuing

Single-App Server:

  • Set CoS/DSCP to non zero value
  • Ingress queuing
  • Egress queuing

Multi-App Server:

  • Classify by ACL
  • Set CoS/DSCP values
  • Ingress queuing
  • Egress queuing

Policed Server:

  • Police flows
  • Remark/drop
  • Ingress queuing
  • Egress queuing

Lossless Transport:

  • Enable PFC
  • Enable ETS
  • Enable DCBX
  • Ingress queuing
  • Egress queuing

WAN & Branch QoS Design Considerations & Recommendations:

  • To manage packet loss (and jitter) by queuing policies
  • To enhance classification granularity by leveraging deep packet inspection engines

Packet jitter is most apparent at the WAN/branch edge because of the downshift in link speeds.

Latency and Jitter:
Recommendations:

  • Choose service provider paths to target 150 ms of one-way latency. If this target can't be met, 200 ms is generally acceptable
  • Only queuing delay is manageable by QoS policies

Network latency consists of:

  • Serialization delay (fixed)
  • Propagation delay (fixed)
  • Queuing delay (variable)

Serialization delay is the time it takes to convert a layer two frame into electrical or optical pulses onto the transmission media. The delay is fixed and a function of the line rate.

Propagation delay is also fixed and a function of the physical distance between endpoints. The gating factor is the speed of light, 300,000 km/s in a vacuum, but the speed in fiber circuits is roughly a third lower than that. Propagation delay is then approximately 6.3 microseconds per km. Over a WAN, propagation delay makes up most of the network delay.

Queuing delay is variable and a function of whether a node is congested or not and if scheduling policies have been applied to resolve congestion events.

Tx-Ring:
Recommendation:

  • Be aware of the Tx-Ring function and depth; tune only if necessary

The Tx-Ring is the final IOS output buffer for an interface. It is a relatively small FIFO queue that maximizes physical link bandwidth utilization by matching the outbound packet rate on the router with the physical interface rate. If the Tx-Ring is too large, packets will be subject to latency and jitter while waiting to be served. If it is too small, the CPU will be continually interrupted, causing higher CPU usage.

LLQ:
Recommendations:

  • Use a dual-LLQ design when deploying voice and real-time video applications
  • Limit sum of all LLQs to 33% of bandwidth
  • Tune the burst parameter if needed

Some applications, like TelePresence, may be bursty by nature; the burst value may have to be adjusted to account for this.
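
A minimal dual-LLQ sketch; the class names, match criteria and percentages are assumptions, and the two priority classes together stay within the 33% guideline:

  class-map match-all VOICE
   match dscp ef
  class-map match-all REALTIME-VIDEO
   match dscp cs4

  policy-map WAN-EDGE-DUAL-LLQ
   ! Voice and video each get their own implicitly policed priority queue
   class VOICE
    priority percent 10
   class REALTIME-VIDEO
    priority percent 23
   class class-default
    fair-queue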

WRED:
Recommendations:

  • Optionally tune WRED thresholds as required
  • Optionally enable ECN

To match the behavior of the AF PHB defined in RFC 2597, use these values (a sketch follows the list):

  • Set minimum WRED threshold for AFx3 to 60% of queue depth
  • Set minimum WRED threshold for AFx2 to 70% of queue depth
  • Set minimum WRED threshold for AFx1 to 80% of queue depth
  • Set all WRED maximum thresholds to 100%
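
With the default queue depth of 64 packets, those percentages translate roughly as in the sketch below; the class name and bandwidth value are assumptions:

  policy-map WAN-EDGE-WRED
   class MULTIMEDIA-STREAMING
    bandwidth percent 10
    random-detect dscp-based
    ! Min thresholds at 60/70/80% of 64 packets; max thresholds at 100%
    random-detect dscp af33 38 64
    random-detect dscp af32 45 64
    random-detect dscp af31 51 64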

RSVP
Recommendations:

  • Enable RSVP for dynamic network-aware admission control requirements
  • Use the Intserv/Diffserv RSVP model to increase efficiency and scalability
  • Use application-identification RSVP policies for greater policy granularity

Ingress QoS Models
Recommendations:

  • DSCP is trusted by default in IOS
  • Enable ingress classification with NBAR2 on LAN edges, as required
  • Enable ingress/internal queuing, if required

Egress QoS Models
Recommendations:

  • Deploy egress queuing policies on all WAN edge interfaces
  • Egress queuing policies may not be required on LAN edge interfaces

Recommendation for queues (a combined policy sketch follows these lists):
LLQ:

  • Limit the sum of all LLQs to 33%
  • Use an admission control mechanism
  • Do not enable WRED

Multimedia/Data:

  • Provision guaranteed bandwidth according to application requirements
  • Enable fair-queuing presorters
  • Enable DSCP-based WRED

Control:

  • Provision guaranteed bandwidth according to control traffic requirements
  • Do not enable presorters
  • Do not enable WRED

Scavenger:

  • Provision with a minimum bandwidth allocation such as 1%
  • Do not enable presorters
  • Do not enable WRED

Default/Best effort:

  • Allocate at least 25% for the default/Best effort queue
  • Enable fair-queuing pre-sorters
  • Enable WRED
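
Pulling these queue recommendations together, a hedged sketch of a WAN edge egress policy might look like the following (all class names and percentages are illustrative, with class-maps assumed to be defined as sketched earlier):

  policy-map WAN-EDGE-EGRESS
   class VOICE
    priority percent 10
   class CONTROL
    bandwidth percent 5
   class MULTIMEDIA-STREAMING
    bandwidth percent 20
    fair-queue
    random-detect dscp-based
   class SCAVENGER
    bandwidth percent 1
   class class-default
    bandwidth percent 25
    fair-queue
    random-detect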

WAN and Branch Interface QoS Roles:

WAN aggregator LAN edge:

  • Ingress DSCP trust should be enabled
  • Ingress NBAR2 classification and marking policies may be applied
  • Ingress Medianet metadata classification and marking policies may be applied
  • Egress LLQ/CBWFQ/WRED policies may be applied (if required)

WAN aggregator WAN edge:

  • Ingress DSCP trust should be enabled
  • Egress LLQ/CBWFQ/WRED policies should be applied
  • RSVP policies may be applied
  • Additional VPN specific policies may be applied

Branch WAN edge:

  • Ingress DSCP trust should be enabled
  • Egress LLQ/CBWFQ/WRED policies should be applied
  • RSVP policies may be applied
  • Additional VPN specific policies may be applied

Branch LAN edge:

  • Ingress DSCP trust should be enabled
  • Ingress NBAR2 classification and marking policies may be applied
  • Ingress Medianet metadata classification and marking policies may be applied
  • Egress LLQ/CBWFQ/WRED policies may be applied (if required)

MPLS VPN QoS Design Considerations & Recommendations
The role of QoS over MPLS VPNs may include the following:

  • Shaping traffic to contracted service rates
  • Performing hierarchical queuing and dropping within these shaped rates
  • Mapping enterprise classes to the service provider classes
  • Policing traffic according to contracted rates
  • Restoring packet markings

MEF Ethernet Connectivity Services

E-line:
A service connecting two customer Ethernet ports over a WAN. It is based on a point-to-point Ethernet Virtual Connection (EVC).

Ethernet Private Line (EPL):
A basic point-to-point service characterized by low frame delay, frame delay variation and frame loss ratio. Service multiplexing is not allowed. No CoS bandwidth profiling is allowed, only a Committed Information Rate (CIR).

Ethernet Virtual Private Line (EVPL):
Multiplexing of EVCs is allowed. The individual EVCs can be defined with different bandwidth profiles and Layer 2 control processing methods.

E-LAN:
A multipoint service connecting customer endpoints and acting as a bridged Ethernet network. It is based on multipoint EVC and service multiplexing is allowed. It can be configured with a CIR, Committed Burst Size (CBS) and Excess Information Rate (EIR).

E-Tree:
A point-to-multipoint version of the E-LAN; essentially a hub-and-spoke topology where the spokes can only communicate with the hub but not with each other. Common for franchise operations.

Sub-Line-Rate Ethernet Design Implications
Recommendations:

  • Sub line rate may require hierarchical shaping with nested queuing policies
  • Configure the CE shaper’s Committed Burst (Bc) value to be no more than half of the SP’s policer Bc

If the Bc of the shaper is set too high, packets may be dropped by the policer even though the shaper is shaping to the CIR of the service.

When using a sub-line-rate service there will be no congestion on the physical interface; congestion is artificially induced by using a shaper and then a nested policy for the queuing. This is often referred to as Hierarchical QoS (HQoS).
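
A minimal HQoS sketch for a Gigabit port carrying an assumed 50 Mbps sub-line-rate service, reusing the WAN-EDGE-EGRESS policy sketched earlier; the Bc of 250,000 bits (5 ms at 50 Mbps) is an assumption to be validated against the SP’s policer Bc:

  policy-map SUB-RATE-HQOS
   class class-default
    ! Parent shaper creates artificial back pressure at the service rate
    shape average 50000000 250000
    ! Nested child policy performs the actual queuing
    service-policy WAN-EDGE-EGRESS

  interface GigabitEthernet0/1
   service-policy output SUB-RATE-HQOS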

QoS Paradigm Shift
Recommendation:

  • Enterprises and service providers must cooperate to jointly administer QoS over MPLS VPNs

MPLS VPNs offer a full mesh of connectivity between campus and branch networks. This fully meshed connectivity has implications for the QoS design. Previously, WANs were usually point-to-point or hub-and-spoke, which made the QoS design simpler: branch-to-branch traffic would pass through the hub, which controlled the QoS.

When using MPLS VPNs, traffic from branch to branch will not pass through the hub, meaning that QoS needs to be deployed on all the branches as well. However, this is not enough: contending traffic may not be coming from the same site, it could be coming from any site. To overcome this, the service provider needs to deploy QoS policies on the PE routers that are compatible with the enterprise policies. This is a paradigm shift in QoS administration and requires the enterprise and SP to jointly administer the QoS policies.

Service Provider Class of Service Models
Recommendations:

  • Fully understand the CoS models of the SP
  • Select the model that most closely matches your strategic end-to-end model

MPLS DiffServ Tunneling Modes
Recommendations:

  • Understand the different MPLS Diffserv tunneling modes and how they affect customer DSCP markings
  • Short pipe mode offers enterprise customers the most transparency and control of their traffic classes

Uniform Mode
Recommendation:

  • If the provider uses uniform mode, be aware that your packets’ DSCP values may be remarked

Uniform mode is generally used when the customer and SP share the same Diffserv domain, which would be the case for an enterprise deploying MPLS.

Uniform mode is the default mode. The first three bits of the IP ToS field (IPP) are mapped to the MPLS EXP bits on the ingress PE when it adds the label. If a policer or other mechanism remarks the MPLS EXP value, this value is copied to lower-level labels, and at the egress PE the MPLS EXP value is used to set the IPP value.

Short Pipe Mode

It is used when the customer and SP are in different Diffserv domains. This mode is useful when the SP wants to enforce its own Diffserv policy but the customer wants its Diffserv information to be preserved across the MPLS VPN.

The ingress PE sets the MPLS EXP value based on the SP’s policies. Any remarking will only propagate to the MPLS EXP bits of the labels, not to the IPP bits of the customer’s IP packet. On egress, queuing is based on the IPP marking of the customer’s packet, giving the customer maximum control.

Pipe Mode

Pipe mode is the same as short pipe mode except that queuing at the egress PE is based on the MPLS EXP bits and not on the customer’s IPP marking.

Enterprise-to-Service Provider Mapping
Recommendation:

  • Map the enterprise application classes to the SP CoS classes as efficiently as possible

Enterprise to service provider mapping considerations include the following:

  • Mapping real-time voice and video traffic
  • Mapping signaling and control traffic
  • Separating TCP-based applications from UDP-based applications (where possible)
  • Remarking and restoring packet markings (where required)

Mapping Real-Time Voice and Video
Recommendation:

  • Balance the service level requirements for real-time voice and video with the SP premium for real-time bandwidth
  • In either scenario, use a dual-LLQ policy at the CE egress edge

SPs often offer only a single real-time CoS; if you are deploying both real-time voice and video, you will have to choose whether or not to put the video in the real-time class. Putting both voice and video into the real-time class may be costly or even cost prohibitive. You should still use a dual LLQ at the CE edge, since that is under your control and it protects voice from video. Downgrading video to a non-real-time class may only produce slightly lower quality, which could be acceptable.

Mapping Control and Signaling Traffic
Recommendation:

  • Avoid mixing control plane traffic with data plane traffic in a single SP CoS

Signaling should be separated from data traffic if possible, since signaling could get dropped if the class is oversubscribed, producing voice/video instability. If the SP does not offer enough classes to put signaling in its own class, consider putting it in the real-time class since these flows are lightweight but critical.

Separating TCP from UDP
Recommendation:

  • Separate TCP traffic from UDP traffic when mapping to SP CoS classes

It is generally best not to mix TCP-based traffic with UDP-based traffic (especially if the UDP traffic is streaming video, such as broadcast video) within a single SP CoS. These protocols behave differently under congestion. Some UDP applications may have application-level windowing, flow control and retransmission capabilities, but most UDP transmitters are oblivious to drops and do not lower their transmission rates.

When TCP and UDP share an SP CoS and that class experiences congestion, the TCP flows continually lower their transmission rates, potentially giving up their bandwidth to UDP flows that are oblivious to drops. This is called TCP starvation/UDP dominance.

Even if WRED is enabled, the same behavior will be seen, because WRED (primarily) manages congestion only on TCP-based flows.

Re-Marking and Restoring Markings
Recommendation:

  • Remark application classes on the CE edge on egress (as required)
  • Restore markings on the CE edge on ingress via deep packet inspection policies (as required)

If packets need to be remarked to fit the SP CoS model, do it at the CE edge on egress. This requires less effort than doing it in the campus.

To restore DSCP markings, traffic can be classified on ingress at the CE edge via DPI.
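
A hedged sketch of both directions at the CE edge; the SP class codepoint (AF31), the NBAR2 protocol match, and the restored value (AF41) are assumptions for illustration:

  ! Egress: map the enterprise multimedia class into the SP's assumed CoS
  class-map match-all MULTIMEDIA-CONFERENCING
   match dscp af41
  policy-map CE-EGRESS-REMARK
   class MULTIMEDIA-CONFERENCING
    set dscp af31

  ! Ingress: restore the enterprise marking via deep packet inspection
  class-map match-all MM-CONF-RESTORE
   match protocol rtp
  policy-map CE-INGRESS-RESTORE
   class MM-CONF-RESTORE
    set dscp af41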

MPLS VPN QoS Roles

CE LAN edge:

  • Ingress DSCP trust should be enabled (enabled by default)
  • Ingress NBAR2 classification and marking policies may be applied
  • Ingress Medianet metadata classification and marking policies may be applied
  • Egress LLQ/CBWFQ/WRED policies may be applied (if required)

CE VPN edge:

  • Ingress DSCP trust should be enabled (enabled by default)
  • Ingress NBAR2 classification and marking policies may be applied (to restore markings lost in transit)
  • Ingress Medianet metadata classification and marking policies may be applied (to restore markings lost in transit)
  • RSVP policies may be applied
  • Egress LLQ/CBWFQ/WRED policies should be applied
  • Egress hierarchical shaping with nested LLQ/CBWFQ/WRED policies may be applied
  • Egress DSCP remarking policies may be applied (used to map application classes into specific SP CoS)

PE customer-facing edge:

  • Ingress DSCP trust should be enabled (enabled by default)
  • Ingress policing policies to meter customer traffic should be applied
  • Ingress MPLS tunneling mode policies may be applied
  • Egress MPLS tunneling mode policies may be applied
  • Egress LLQ/CBWFQ/WRED policies should be applied

PE core-facing edge:

  • Ingress DSCP trust should be enabled (enabled by default)
  • Ingress policing policies to meter customer traffic should be applied
  • Egress MPLS EXP-based LLQ/CBWFQ policies should be applied
  • Egress MPLS EXP-based WRED policies may be applied

P edges:

  • Ingress DSCP trust should be enabled (enabled by default)
  • Egress MPLS EXP-based LLQ/CBWFQ policies may be applied
  • Egress MPLS EXP-based WRED policies may be applied

IPSEC QoS Design

Tunnel Mode

Tunnel mode is the default IPSEC mode of operation on Cisco IOS routers. The entire IP packet is protected by IPSEC: the sending VPN router encrypts the entire original IP packet and adds a new IP header to it. Tunnel mode supports multicast and routing protocols.

Transport Mode

Often used for encrypting peer-to-peer communications, transport mode does not encase the original IP packet in a new packet. Only the payload is encrypted while the original IP header is preserved, in effect being reused as the header of the new packet. Because the header is left intact, it’s not possible to run multicast or routing protocols in transport mode.

IPSEC with GRE

GRE can be used to enable VPN services that connect disparate networks. It’s a key building block when using VRF Lite, a technology allowing related Virtual Routing and Forwarding (VRF) instances running on different routers to be interconnected across an IP network, while maintaining their separation from both the global routing table and other VRFs.

When using GRE as a VPN technology, it is often desirable to encrypt the GRE tunnel so that privacy and authentication of the connection can be ensured. GRE can be used with IPSEC tunnel mode or transport mode, but if the tunnel transits a NAT or PAT device, tunnel mode is required.
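
A minimal sketch of an encrypted GRE tunnel using an IPSEC profile with tunnel protection; the addresses, names and transform set are assumptions, and the ISAKMP policy and keying are omitted for brevity:

  crypto ipsec transform-set AES-SHA esp-aes 256 esp-sha-hmac
   mode tunnel

  crypto ipsec profile GRE-PROTECTION
   set transform-set AES-SHA

  interface Tunnel0
   ip address 172.16.0.1 255.255.255.252
   tunnel source GigabitEthernet0/0
   tunnel destination 192.0.2.2
   tunnel protection ipsec profile GRE-PROTECTION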

Remote-Access VPNs

Cisco’s primary remote-access VPN client is AnyConnect Secure Mobility Client, which supports both IPSEC and Secure Sockets Layer (SSL) encryption.

AnyConnect uses Datagram Transport Layer Security (DTLS) to optimize real-time flows over the SSL-encrypted tunnel. AnyConnect connects to a remote headend concentrator (such as an ASA firewall) through TCP-based SSL. All traffic from the client, including voice, video and data, traverses the SSL TCP connection. When TCP loses packets it pauses and waits for them to be resent, which is not good for real-time UDP-based traffic.

DTLS is a datagram technology, meaning it uses UDP packets instead of TCP. After AnyConnect establishes the TCP SSL tunnel, it also establishes a UDP-based DTLS tunnel which is reserved for the use of real-time applications. This allows RTP voice and video packets to be sent unhindered, and in case of packet loss the session does not pause.

The decision on which tunnel to send the packets over is dynamic and made by the AnyConnect client.

QoS Classification of IPsec Packets
Recommendation:

  • Understand the default behavior of Cisco VPN routers to copy the ToS byte from the inner packet to the VPN packet header

Cisco routers by default copy the ToS field from the original IP packet and write it into the new IPSEC packet header, thus allowing classification to still be accomplished by matching DSCP values. The same holds true for GRE packets. Since the IP packet is encrypted, it is not possible to match on other fields such as IP addresses, ports or protocol without using another feature.

The IOS Preclassify Feature
Recommendations:

  • Be aware of the limitations of QoS classification when using something other than the ToS byte
  • Use the IOS preclassify feature for all non-ToS types of QoS classification
  • As a best practice, enable this feature for all VPN connections

Normally, tunneling and encryption take place before QoS classification in the order of operations; QoS preclassify reverses the order so that classification can be done on the IP header before it gets encrypted. Actually, the order isn’t really reversed: the router clones the original IP header and keeps it in memory so that it can be used for QoS classification after tunneling and encryption.

This feature is only applicable on the encrypting router’s outbound interface (physical or tunnel). Downstream routers can’t make decisions on the inner header because the packet is encrypted at that point. Always enable the feature, since tests have shown that enabling it has very little impact on the router’s performance.
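
Enabling the feature is a one-liner on the tunnel interface, or under a crypto map for crypto-map-based IPSEC; the interface and map names below are assumptions:

  interface Tunnel0
   ! Clone the inner IP header for classification after encryption
   qos pre-classify

  crypto map VPN-MAP 10 ipsec-isakmp
   qos pre-classify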

MTU Considerations
Recommendations:

  • Be aware that MTU issues can severely impact network connectivity and the quality of user experience in VPN networks

When tunneling technologies are used there is always the risk of exceeding the MTU somewhere in the path. Unless jumbo frames are available end to end, MTU issues will almost always need to be addressed when dealing with any kind of VPN technology. A common symptom of MTU issues is that applications using small packets, such as voice, work, while e-mail, file server connections and many other applications do not.

Path MTU Discovery (PMTUD) can be used to discover the MTU along the path, but it relies on ICMP messages, which may be blocked by intermediary devices.

TCP Adjust-MSS

TCP Maximum Segment Size (MSS) is the maximum amount of payload data that a host is willing to accept in a single TCP/IP datagram. During TCP connection setup between two hosts (TCP SYN), each side reports its MSS to the other. It is the responsibility of the sending host to limit the size of the datagram to a value less than or equal to the receiving host’s MSS.

For a 1500-byte IP packet using TCP, the MSS is 1460 bytes: 20 bytes for the IP header and 20 bytes for the TCP header are subtracted from the 1500-byte packet.

Two hosts may not be aware that they are communicating through a tunnel and send a TCP SYN with an MSS of 1460 even though the path MTU is lower. TCP Adjust-MSS rewrites the MSS of the SYN packet so that the receiving host sees a lower value that allows traffic to pass through the tunnel without fragmentation. The receiving host will then reply with this value to the sending host. The router is acting as a middleman for the TCP session.

When using IPSEC over GRE, an MTU of 1378 bytes can be used (a configuration sketch follows the list):

  • Original IP packet = 1500 bytes
  • Subtract 20 bytes for the first IP header = 1480 bytes
  • Subtract 20 bytes for the second IP header = 1460 bytes
  • Subtract 24 bytes for GRE header = 1436 bytes
  • Subtract a maximum of 58 bytes for IPSEC = 1378 bytes
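
Following the arithmetic above, the matching MSS is 1378 − 40 = 1338 bytes (20 bytes each for the inner IP and TCP headers); a sketch on an assumed tunnel interface:

  interface Tunnel0
   ip mtu 1378
   ! Rewrite the MSS in transiting TCP SYNs to avoid fragmentation
   ip tcp adjust-mss 1338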

Adjusting MSS is a CPU-intensive process. Enable it at the remote sites rather than at the headend, since the headend may be terminating a large number of tunnels. Adjusting MSS only needs to be done at one point in the path.

TCP Adjust-MSS only affects TCP packets; UDP packets are less likely to be large compared to TCP.

Compression Strategies Over VPN
Recommendations:

  • Compression can improve overall throughput, latency and user experience on VPN connections
  • Some compression technologies tunnel and may hide the fields used for QoS classification

TCP Optimization Using WAAS

Wide Area Application Services (WAAS) is a WAN accelerator; it uses compression technologies such as LZ compression, Data Redundancy Elimination (DRE) and application-specific Application Optimizers (AO). This significantly reduces the amount of data sent over the WAN or VPN. For a technology like WAAS to work, the compression must take place before encryption.

Compression technologies can have a significant effect on the QoE, but they work mainly for TCP traffic. Some WAN acceleration solutions may break classification if the traffic is tunneled so that the original IP header is obscured. WAAS only compresses the data portion of the packet and keeps the header intact, leaving the ToS byte available for classification.

Using Voice Codecs over a VPN Connection

To improve voice quality over bandwidth-constrained VPN links, administrators may use compression codecs such as iLBC or G.729.

G.729 uses about a third of the bandwidth of G.711, but this also increases the effect of packet loss since more audio is lost with every packet. To overcome this, when a packet is lost and the jitter buffer expires, the audio from the previous packet can be replayed to hide the gap, essentially tricking the listener. Through this technique, up to 5% packet loss can be acceptable.

Internet Low Bitrate Codec (iLBC) uses 15.2 Kbit/s or 13.33 Kbit/s and performs similarly to G.729; the Mean Opinion Score (MOS) for iLBC is significantly better, though, when there is packet loss.

Compressed Real-Time Protocol (cRTP) is not compatible with IPSEC, because the packets are already encrypted when cRTP would try to compress them.

Antireplay Implications
Recommendation:

  • Antireplay drops may occur in an IPSEC VPN network with QoS enabled

When ESP authentication is configured in an IPSEC transform set, every Security Association (SA) keeps a 64-packet sliding window in which it checks the incoming sequence numbers of the encrypted packets. This stops anyone from replaying packets and is called connectionless integrity. If packets arrive out of order due to queuing, they must still fall inside the window or they will be dropped and counted as antireplay errors. A data packet may get stuck behind voice in a queue so long that it falls outside its sliding window, and the packet then gets dropped. To overcome this, use a separate line in the crypto ACL for every type of traffic, such as voice, data and video. This creates an SA, and therefore a separate antireplay window, for each type of traffic.
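
A sketch of a crypto ACL with one entry per traffic type, so that each entry yields its own SA and antireplay window; the subnets and port range are assumptions:

  ip access-list extended CRYPTO-ACL
   ! Voice (assumed RTP port range) -> first SA
   permit udp 10.1.0.0 0.0.255.255 10.2.0.0 0.0.255.255 range 16384 32767
   ! TCP data -> second SA
   permit tcp 10.1.0.0 0.0.255.255 10.2.0.0 0.0.255.255
   ! Everything else -> third SA
   permit ip 10.1.0.0 0.0.255.255 10.2.0.0 0.0.255.255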

TCP will be affected by this packet loss; it has no way of knowing that the packets were dropped due to antireplay.

Antireplay drops are around 1 to 1.5% on congested VPN links with queuing enabled. A CBWFQ policy will often hold 64 packets per queue; decreasing this leads to fewer antireplay drops, as packets are dropped before traversing the VPN, but it may also increase CPU usage.

DMVPN QoS Design

DMVPN offers some advantages regarding QoS compared to IPSEC, such as the following:

  • Reduction of overall hub router QoS configuration
  • Scalability to thousands of sites, with QoS for each tunnel on the hub router
  • Zero-touch QoS support on the hub router for new spokes
  • Flexibility of both hub-and-spoke and spoke-to-spoke (full mesh) deployment models

DMVPN Building Blocks

mGRE: Multipoint GRE allows a single tunnel interface to serve a large number of remote spokes. One outbound QoS policy can be applied instead of one per tunnel, as with normal point-to-point GRE.

Dynamic discovery of IPSEC tunnel endpoints and crypto profiles: Crypto maps are created dynamically; there is no need to statically build a crypto map for each tunnel endpoint.

NHRP: Allows spokes to be configured with dynamically assigned IP addresses. Also enables the zero-touch deployment that makes DMVPN spokes easy to set up. Think of the hub router as a “next-hop server” rather than a traditional VPN router. NHRP is also used for the Per-Tunnel QoS feature.

The Per-Tunnel QoS for DMVPN Feature

Allows the administrator to enable QoS on a per-tunnel or per-spoke basis. The QoS policy is applied to the mGRE tunnel interface. This protects spokes from each other and keeps one spoke from using all the bandwidth, leaving none for the others. The QoS policy at the hub is automatically generated for each tunnel when a spoke registers with the hub.

Queuing only kicks in when there is congestion; to signal to the router’s QoS mechanism that there is congestion, a shaper is used. Shaping the traffic flows to the real VPN tunnel bandwidth produces artificial back pressure. With Per-Tunnel QoS for DMVPN, a shaper is automatically applied by the system to each and every tunnel. This allows the router to implement differentiated services for the data flows of each tunnel. This technique is called the Hierarchical Queuing Framework (HQF).

Using NHRP, multiple spokes can be grouped together to use the same QoS policy.
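
A sketch of the grouping, with assumed names and a 1.5 Mbps spoke rate; the hub binds a shaping/queuing policy to an NHRP group, and each spoke announces its group when it registers:

  ! Hub: bind a per-spoke shaping/queuing policy to an NHRP group
  policy-map PER-SPOKE-1500K
   class class-default
    shape average 1500000
    service-policy WAN-EDGE-EGRESS

  interface Tunnel0
   ip nhrp map group SPOKE-1500K service-policy output PER-SPOKE-1500K

  ! Spoke: advertise its group membership at registration
  interface Tunnel0
   ip nhrp group SPOKE-1500K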

This technique provides QoS in the egress direction of the hub towards the spokes. For QoS from the spokes to the hub, a QoS policy needs to be applied at the spokes.

At this time it is not possible to have a unique policy for spoke-to-spoke traffic, because the spokes do not have access to the NHRP database.

GET VPN QoS Design

Group Encrypted Transport (GET) VPN is a technology for encrypting traffic between IPSEC endpoints without the use of tunnels. Packets are transmitted using IPSEC tunnel mode, but the encryption is not defined by traditional point-to-point IPSEC SAs.

Because there are no tunnels, the QoS configuration is simplified.

GET VPN QoS Overview

DMVPN is suitable for hub-and-spoke VPNs over a public untrusted network such as the Internet, while GET VPN is suitable for private networks such as an MPLS VPN. An MPLS VPN is private but not encrypted, and GET VPN can encrypt the traffic between the MPLS sites. GET VPN has no real concept of hub and spoke, which simplifies the QoS architecture: there is no single major hub aggregating all the remote sites and being liable to massive oversubscription.

These are some of the major differences between the DMVPN and GET VPN models:

[Figure: “Choosing VPN” — DMVPN versus GET VPN comparison]

Group Domain of Interpretation (GDOI)

GDOI is a technology that supports any-to-any IPSEC VPNs without the use of tunnels. There is no concept of an SA between specific routers; instead a group SA is used by all the encrypting nodes in the network. No per-tunnel QoS is needed since there are no tunnels; QoS is simply applied egress on each GET VPN router.

The GDOI control plane protocol uses UDP port 848 and ISAKMP on UDP port 500. These packets are normally marked DSCP CS6 by the router.

IP Header Preservation

Normally, with IPSEC tunnel mode, the ToS byte is copied to the new IP header but the original IP header is not preserved. On a public network such as the Internet it makes good sense to hide the source and destination IP addresses, but GET VPN is deployed on MPLS networks, which are private.

GET VPN keeps the original IP header intact, which simplifies QoS, dynamic routing and multicast. The packet is still considered an ESP IPSEC packet, not TCP or UDP, so to classify based on port numbers the QoS preclassify feature is still needed.

How and When to Use the QoS Preclassify Feature
Design principles:

  • If classification is based on source or destination IP, preclassify is not needed but still recommended
  • If classification is based on TCP or UDP port numbers, QoS preclassify is needed
  • Enable the QoS preclassify feature in GET VPN deployments

A Case for Combining GET VPN and DMVPN

DMVPN has some drawbacks: the spoke-to-hub tunnel is always up, but spoke-to-spoke tunnels are brought up dynamically. This causes a delay that can take a second or two and may have a negative impact on real-time traffic. The delay is not caused by NHRP or the packetization of the GRE tunnel but rather by the exchange of ISAKMP messages and the establishment of the IPSEC SAs between the routers.

DMVPN could then be used solely for setting up the GRE tunnels and GET VPN for encrypting the packets going into them. This allows tunnels to be established quickly with encryption already in place, improving the overall user experience.

Working with Your Service Provider When Deploying GET VPN
Design principles:

  • Ensure that the service provider handles DSCP consistently throughout the MPLS WAN network
Categories: CCDE, QoS