IPv6 Multicast

July 14, 2015 2 comments

These are my notes for IPv6 multicast for the CCDE exam. Overview

  • Prefix FF::/8 reserved for multicast
  • Multicast Listener Discovery (MLD) replaces IGMP
    • MLD is part of ICMPv6
    • MLDv1 equivalent to IGMPv2
    • MLDv2 equivalent to IGMPv3
  • ASM, SSM and Bidir supported
  • PIM identified by IPv6 next header 103
  • BSR and static RP supported
  • No support for MSDP
    • Anycast supported through PIM, defined in RFC4610
  • Any Source Multicast (ASM)
    • PIM-SM, PIM-BiDir
    • Default for generic multicast and unicast prefix-based multicast
    • Starts with FF3x::/12
  • Source Specific Multicast (SSM)
    • PIM-SSM
    • FF3X::/32 is allocated for SSM by IANA
    • Currently prefix and plen is zero so FF3X::/96 is useable for SSM
  • Embedded RP groups
    • PIM-SM, PIM-BIDir
    • Starts with FF70::/12

IPv6 Multicast Addressing

IPv6 multicast address format includes variable bits to define what type of address it is and what the scope is of the multicast group. The scope can be:

1 – Node

2 – Link

3 – Subnet

4 – Admin

5 – Site

8 – Organization

E – Global

The flags define if embedded RP is used, if the address is based on unicast and if the address is IANA assigned or not (temporary). The unicast based IPv6 multicast address allows an organization to create globally unique IPv6 multicast groups based on their unicast prefixes. This is similar to GLOP addressing in IPv4 but does not require an Autonomous System Number (ASN). IPv6 also allows for embedding the RP address into the multicast address itself. This provides a static RP to multicast group mapping mechanism and can be used to provide interdomain IPv6 multicast as there is no MSDP in IPv6. When using Ethernet, the destination MAC address of the frame will start with 33:33 and the remaining 32 bits will consist of the low order 32 bits of the IPv6 multicast address.

Well Known Multicast Addresses

FF02::1 – All Nodes

FF02::2 – All Routers

FF02::5 – OSPF All Routers

FF02::6 – OSPF DR Routers

FF02::A – EIGRP Routers

FF02::D – PIM Routers

Neighbor Solicitation and DAD

IPv6 also uses multicast to replace ARP through the neighbor solicitation process. To do this the solicited node multicast address is used and the prefix is FF02::1:FF/104 and the last 24 bits are taken from the lower 24 bits of the IPv6 unicast address. If Host A needs to get the MAC of Host B, Host A will send the NS to the solicited node multicast address of B. IPv6 also does Duplicate Address Detection (DAD) to check that noone else is using the same IPv6 address and this also uses the solicited node multicast address. If Host A is checking uniqueness of its IPv6 address, the message will be sent to the solicited node multicast address of Host A.

Multicast Listener Discovery (MLD)

  • MLDv1 messages
    • Listener Query
    • Listener Report
    • Listener Done
  • MLDv2 messages
    • Listener Query
    • Listener Report

MLDv2 does not use a specific Done message which is equivalent to the Leave message in IGMP. It will stop sending Reports or send a Report which excludes the source it was previously interested in.

Protocol Independent Multicast (PIM) for IPv6

  • PIM-SM (RP is required)
    • Many to many applications (multiple sources, single group)
    • Uses shared tree initially but may switch to source tree
  • PIM-BiDir (RP is required)
    • Bidirectional many to many applications (hosts can be sources and receivers)
    • Only uses shared tree, less state
    • One to many applications (single source, single group)
    • Always uses source tree
    • Source must be learnt through out of band mechanism

Anycast RP

IPv6 does not have support for MSDP. It can support anycast RP through the use of PIM which can implement this feature. All the RPs doing anycast will use the same IPv6 address but they also require a unique IPv6 address that will be used to relay the PIM Register messages coming from the multicast sources. A RP-set is defined with the RPs that should be included in the Anycast RP and the PIM Register messages will be relayed to all the RPs defined in the RP-set. If the PIM Register message comes from an IPv6 address that is defined in the RP-set, the Register will not be sent along which is a form of split horizon to prevent looping of control plane messages. When a RP relays a PIM Register, this is done from a unique IPv6 address which is similar to how MSDP works.

Sources will find the RP based on the unicast metric as is normally done when implementing anycast RP. If a RP goes offline, messages will be routed to the next RP which now has the best metric.

Interdomain Multicast

These are my thoughts on interdomain multicast since there is no MSDP for IPv6. Embedded RP can be used which means that other organization needs to use your RP. Define a RP prefix that is used for interdomain multicast only or use a prefix that is used for internal usage but implement a data plane filter to filter out requests for groups that should not cross organizational boundaries. This could also be done by filtering on the the scope of the multicast address.

Another option would be to anycast RP with the other organization but this could get a lot messier unless a RP is defined for only a set of groups that are used for interdomain multicast. Each side would then have a RP defined for the groups and PIM Register messages would be relayed. The drawback would be that both sides could have sources but the policy may be that only one side should have sources and the other side only has listeners. This would be difficult to implement in a data plane filter. It might be possible to solve in the control plane by defining which sources the RP will allow to Register.

If using SSM, there is no need for a RP which makes it easier to implement interdomain multicast. There is always the consideration of joining two PIM domains but this could be solved by using static joins at the edge and implementing data plane filtering. Interdomain multicast is not something that is implemented a lot and it requires some thought to not merge into one failure domain and one administrative domain.

Final Thoughts

Multicast is used a lot in IPv6, multicast is more tightly integrated into the protocol than in IPv4, and it’s there even if you see it or not. The addressing, flags and scope can be a bit confusing at first but it allows for using multicast in a better way in IPv6 than in IPv4.

Categories: CCDE, IPv6 Tags: , , , ,

Service Provider IPv6 Deployment

June 29, 2015 2 comments

These are my study notes regarding IPv6 deployment in SP networks in preparation for the CCDE exam.

Drivers for implementing IPv6

  • External drivers
    • SP customers that need access to IPv6 resources
    • SP customers that need to interconnect their IPv6 sites
    • SP customers that need to interface with their own customers over iPv6
  • Internal drivers
    • Handle problems that may be hard to fix with IPv4 such as large number of devices (cell phones, IP cameras, sensors etc)
    • Public IPv4 address exhaustion
    • Private IPv4 address exhaustion
  • Strategic drivers
    • Long term expansion plans and service offerings
    • Preparing for new services and gaining competitive advantage


  • SP Core Infrastructure
    • Native IPv4 core
    • L2TPv3 for VPNs
    • MPLS core
    • MPLS VPNs

My reflection is that most cores would be MPLS enabled, however there are projects such as Terastream in Deutsche Telekom where the entire core is IPv6 enabled and L2TPv3 is used in place of MPLS.

  • IPv6 in Native IPv4 Environments
    • Tunnel v6 in v4
    • Native v6 with dedicated resources
    • Dual stack

The easiest way to get going with v6 was to tunnel it over v4. The next logical step was to enable v6 but on separate interfaces to not disturb the “real” traffic and to be able to experiment with the protocol. The end goal is dual stack, at least in a non MPLS enabled network.

  • IPv6 in MPLS environments
    • 6PE
    • 6VPE

6PE is a technology to run IPv6 over an IPv4 enabled MPLS network. 6VPE does the same but with VRFs.

  • Native IPv6 over Dedicated Data Link
    • Dedicated data links between core routers
    • Dedicated data links to IPv6 customers
    • Connection to an IPv6 IX
  • Dual stack
    • All P + PE routers capable of v4 + v6 transport
    • Either two IGPs or one IGP for both v4 + v6
    • Requires more memory due to two routing tables
    • IPv6 multicast natively supported
    • All IPv6 traffic is routed in global space (no MPLS)
    • Good for content distribution and global services (Internet)
  • 6PE
    • IPv6 global connectivity over an IPv4 MPLS core
    • Transition mechanism (debatable)
    • PEs are dual stacked and need 6PE configuration
    • IPv6 reachability exchanged via MPBGP over iBGP sessions
    • IPv6 packets transported from 6PE to 6PE inside MPLS
    • The next-hop is an IPv4 mapped IPv6 address such as ::FFFF:
    • BGP label assigned for the IPv6 prefix
    • Bottom label used due to P routers not v6 capable and for load sharing
    • neighbor send-label is configured under BGP address-family ipv6

6PE is viewed as a transition mechanism but this is arguable, if you transport IPv4 over MPLS, you may want to do the same with IPv6 as well for consistency. Running 6PE means that there is fate sharing between v4 and v6 though, which could mean that an outage may affect both protocols. This could be avoided by running MPLS for IPv4 but v6 natively.

  • Core network (P routers) left untouched
  • IPv6 traffic inherits MPLS benefits such as fast-reroute and TE
  • Incremental deployment possible (upgrade PE routers first)
  • Each site can be v4-only, v4-VPN-only, v4+v6, v4-VPN+v6 and so on
  • Scalability concerns due to separate RIB and FIB required per customer
  • Mostly suitable for SPs with limited amount of PEs
  • 6vPE
    • Equivalent of VPNv4 but for IPv6
    • Add VPNv6 address family under MPBGP
    • Send extended communities for the prefixes under the address family

It is a common misconception for 6PE and 6vPE that traceroutes are not possible, that is however not entirely true. A P router can generate ICMPv6 messages that will follow the LSP to the egress PE and then the ICMPv6 error message is forwarded back to the originator of the traceroute.

  • Route reflectors for 6PE and 6vPE
    • Needed to scale BGP full mesh
    • Dedicated RRs or data path RRs
    • Either dedicated RR per AF or have multiple AFs per RR
    • 6PE-RR must support IPv6 + label functionality
    • 6vPE-RR must support IPv6 + label and extended communities functionality

PA vs PI

  • PA advantages
    • Aggregation towards upstreams
    • Minimizes Internet routing table size
  • PA disadvantages
    • Customer is “locked” with the SP
    • Renumbering can be painful
    • Multi-homing and TE problems

The main driver here is if you are going to multi home or not. Renumbering is always painful but at least less so on IPv6 due to being able to advertise multiple IPv6 prefixes through Router Advertisements (RA).

  • PI advantages
    • Customers are not “locked” to the SP
    • Multi homing is straight forward
  • PI disadvantages
    • Larger Internet routing table due to lack of efficient aggregation
    • Memory and CPU needs on BGP speakers

Infrastructure Addressing (LLA vs global)

What type of addresses should be deployed on infrastructure links?

  • Link Local Address FE80::/10
    • Non routeable address
    • Less attack surface
    • Smaller routing tables
    • Can converge faster due to smaller RIB/FIB
    • Less need for iACL at edge of network
    • Can’t ping links
    • Can’t traceroute links
    • May be more complex to manage with NMS
    • Use global address on loopback for ICMPv6 messages
    • Will not work with RSVP-TE tunnels
  • Global only 2000::/3 (current IANA prefix)
    • Globally routeable
    • Larger attack surface unless prefix suppression is used
    • Use uRPF and iACL at edge to protect your links
    • Easier to manage

It would be interesting to hear if you have seen any deployments with LLA only on infrastructure links. In theory it’s a nice idea but it may corner you in some cases, preventing you from implementing other features that you wish to deploy in your network.

Use /126 or /127 on P2P links which is the equivalent of /30 or /31 on IPv4 links. For loopbacks use /128 prefixes. Always assign addresses from a range so that creating ACLs and iACLs becomes less tedious.

Using another prefix than /64 on an interface will break the following features:

  • Neighbor Discovery (ND)
  • Secure Neighbor Discovery (SEND)
  • Privacy extensions
  • PIM-SM with embedded RP

This is of course for segments where there are end users.

Prefix Allocation Practices

  • Many SPs offer /48, /52, /56, /60 or /64 prefixes
  • Enterprise customers receive one /48 or more
  • Small business customers receive /52 or /56 prefix
  • Broadband customers may receive /56 or /60 via DHCP Prefix Delegation (DHCP-PD)

Debating prefix allocation prefixes is like debating religion, politics or your favourite OS. Whatever you choose, make sure that you can revise your practice as future services and needs arrise.

Carrier Grade NAT(CGN)

  • Short term solution to IPv4 exhaustage without changing Residential Gateway (RG) or SP infrastructure
  • Subscriber uses NAT44 and SP does CGN with NAT44
  • Multiplexes several customers onto the same public IPv4 address
  • CGN performance and capabilities should be analysed in the planning phase
  • May provide challenges in logging sessions
  • Long term solution is to deploy IPv6

I really don’t like CGN, it slows down the deployment of IPv6. It’s a tool like anything else though that may be used selectively if there is no other solution available.

IPv6 over L2TP Softwires

  • Dual stack IPv4/IPv6 on RG LAN side
  • PPPoE or IPv4oE terminated on v4-only BNG
  • L2TPv2 softwire between RG and IPv6-dedicated L2TP Network Server (LNS)
  • Stateful architecture on LNS
    • Offers dynamic control and granular accounting of IPv6 traffic
  • Limited investment needed and limited impact on existing infrastructure

I have never seen IPv6 deployed over softwires, what about you readers?


  • Uses 6RD CE (Customer Edge) and 6RD BR (Border Relay)
  • Automatic prefix delegation on 6RD CE
  • Stateless and automatic IPv6 in IPv4 encap and decap functions on 6RD
  • Follows IPv4 routing
  • 6RD BRs are adressed with IPv4 anycast for load sharing and resiliency
  • Limited investment and impact on existing infrastructure

IPv4 via IPv6 Using DS-Lite with NAT44

  • Network has migrated to IPv6 but needs to provide IPv4 services
  • IPv4 packets are tunneled over IPv6
  • Introduces two components: B4 (Basic Bridging Broadband Element) and AFTR (Address Family Transition Router)
    • B4 typically sits in the RG
    • AFTR is located in the core infrastructure
  • Does not provide IPv4 and IPv6 hosts to talk to each other
  • AFTR device terminates the tunnel and decapsulates IPv4 packet
  • AFTR device performs NAT44 on customer private IP to public IP addresses
  • Increased MTU, be aware of fragmentation

Connecting IPv6-only with IPv4-only (AFT64)

  • Only applicable where IPv6-only hosts need to communicate with IPv4-only hosts
  • Stateful or stateless v6 to v4 translation
  • Includes NAT64 and DNS64

MAP (Mapping of Address and Port)

  • MAP-T Stateless 464 translation
  • MAP-E Stateless 464 encapsulation
  • Allows sharing of IPv4 address across an IPv6 network
    • Each shared IPv4 endpoint gets a unique TCP/UDP port range via “rules”
    • All or part of the IPv4 address can be derived from the IPv6 prefix
      • This allows for route summarization
    • Need to allocate TCP/UDP port ranges to each CPE
  • Stateless border relays in SP network
    • Can be implemented in hardware for superior performance
    • Can use anycast and have asymmetric routing
    • No single point of failure
  • Leverages IPv6 in the network
  • No CGN inside SP network
  • No need for logging or ALGs
  • Dependent on CPE router


  • Stateful or stateless translation
  • Stateful
    • 1:N translation
    • “PAT”
    • TCP, UDP, ICMP
    • Shares IPv4 addresses
  • Stateless
    • 1:1 translation
    • “NAT”
    • Any protocol
    • No IPv4 address savings

DNS64 is often required in combination with NAT64 to send AAAA response to the IPv6-only hosts in case the server only exists in the v4 world.


  • Somewhere around 15% of apps break with native v6 or NAT64
  • Skype is one of these apps
  • 464XLAT can help with most of these applications
  • Handset does stateless 4 to 6 translation
  • Network does NAT64
  • Deployed by T-Mobile
Categories: CCDE, IPv6 Tags: , ,

Coming Updates to the CCIE Program

June 21, 2015 2 comments

With everything going on in the industry, what is happening to the CCIE program?

I recently watched a webinar on coming updates to the CCIE program. I have also been talking to the CCIE and CCDE program managers which I am proud to call my friends. The certifications are a big part of Cisco’s business, people are afraid that certifications will lose value as Software Defined Networking (SDN) gains more traction in the industry. What is Cisco’s response to the ever changing landscape of networking?

We have already seen Cisco announce the CCNA cloud and CCNA industrial which shows that Cisco follows the market. Will we see a CCIE cloud or CCIE SDN? Doubtful… Why? Because SDN is not a track in itself, it will be part of all tracks… The CCIE DC will be refreshed to include topics like Application Centric Infrastructure (ACI) in the blueprint. When? It’s not official yet which means you have at least 6 months. My guess is that we will see an announcement before this year ends which would mean that the update is around a year away.

CCIE DC is the natural fit for SDN. What about the other tracks? Expect other tracks to get updated as well. The CCIE RS will add the Application Policy Infrastructure Controller Enterprise Module (APIC-EM) for sure and maybe some other topics as well. We will definitely see more of Intelligent WAN (IWAN) in the next update. The CCIE RS was recently bumped to version 5 so I would expect it to take a bit longer than the DC to refresh but it should not be that far out either. I think we can expect more refreshes since the networking is moving at a much faster pace now.

The CCIE SP will include topics such as Segment Routing (SR), Network Function Virtualizaiton (NFV), service chaining, Netconf and YANG and so on. At least that is what I expect. The CCIE SP recently moved to version 4 so I don’t expect it to change just yet but I’m sure Cisco is working on the next refresh already.

A change we have all been waiting to see is that Cisco is going to implement dual monitors in the CCIE lab. This has been discussed for a long time. According to Cisco only 6% of candidates have requested the dual monitors though which shows how important it is to give Cisco feedback. I’m sure more than 6% were bothered by the single screen in the lab. The delay in implementing it has been due to make sure that all lab centers get the same conditions at the same time to not create any debate about the testing environment.

Cisco is also working a lot with exam integrity, they have made changes to the lab delivery system in the backend to prevent people from leaking the material. There is also a much bigger pool of questions and topologies, a lot thanks to the virtualized environment. The Diagnostic (DIAG) section has also been successful in getting the passing rates down to the expected levels. Cisco does a lot of work with statistics to see how their material is received and what makes sense to ask about and if they need to rephrase something or remove it from the topology. They can also do statistical analysis to look for strange behavior from the candidates at the lab. Exam integrity is the #1 focus from my discussions with Cisco.

You have the chance to leave comments when you are taking an exam. I have been lazy in supplying comments in my tests which I will change from now on. From my discussions with the CCIE program managers this is very important feedback for them and their main source of information for how the test is being received.

If you are truly interested in improving the certifications of Cisco and you are already certified, you can apply to become a Subject Matter Expert (SME), SME’s help Cisco in exam development and in picking out the path of the certifications to include new topics and remove old ones.

I still believe in the CCIE program, it’s not going away. I think it would be a huge mistake for people to start diving into SDN without first getting the basic concepts straight. Everything can’t magically go into a fabric and never fail. Read some of Ivan Pepelnjak’s posts to get some perspective on large layer 2 domains. History always repeats itself.

Categories: CCIE Tags: , , , , , , ,

Design Considerations for North/South Flows in the Data Center

May 28, 2015 4 comments

Traditional data centers have been built by using standard switches and running Spanning Tree (STP). STP blocks redundant links and builds a loop-free tree which is rooted at the STP root. This kind of topology wastes a lot of links which means that there is a decrease in bisectional bandwidth in the network. A traditional design may look like below where the blocking links have been marked with red color.


If we then remove the blocked links, the tree topology becomes very clear and you can see that there is only a single path between the servers. This wastes a lot of bandwidth and does not provide enough bisectional bandwidth. Bisectional bandwidth is the bandwidth that is available from the left half of the network to the right half of the network.


The traffic flow is highlighted below.


Technologies like FabricPath (FP) or TRILL can overcome these limitations by running ISIS and building loop-free topologies but not blocking any links. They can also take advantage of Equal Cost Multi Path (ECMP) paths to provide load sharing without doing any complex VLAN manipulations like with STP. A leaf and spine design is most commonly used to provide for a high amount of bisectional bandwidth.


Hot Standby Routing Protocol (HSRP) has been around for a long time providing First Hop Redundancy (FHR) in our networks. The drawback of HSRP is that there is only one active forwarder. So even if we run a layer 2 multipath network through FP, for routed traffic flows, there will only be one active path.


The reason for this is that FP advertsises its Switch ID (SID) and that the Virtual MAC (vMAC) will be available behind the FP switch that is the HSRP active device. Switched flows can still use all of the available bandwidth.

To overcome this, there is the possibility of running VPC+ between the switches and having the switches advertise an emulated SID, pretending to be one switch so that the vMAC will be available behind that SID.


There are some drawbacks to this however. It requires that you run VPC+ in the spine layer and you can still only have 2 active forwarders. if you have more spine devices they will not be uitilized for Nort/South flows. To overcome this there is a feature called Anycast HSRP.


Anycast HSRP works in a similar way by advertising a virtual SID but it does not require links between the spines or VPC+. It also supports up to 4 active forwarders currently which provides for double the bandwidth compared to VPC+

Modern data centers provide for a lot more bandwidth and bisectional bandwidth than previous designs, but you still need to consider how routed flows can utilize the links in your network. This post should give you some insights on what to consider in such a scenario.

Introduction to Storage Networking and Design

April 29, 2015 3 comments


Storage and storage protocols are not generally well known by network engineers. Networking and storage have traditionally been two silos. Modern networks and data centers are looking to consolidate these two networks into one and to run them on a common transport such as Ethernet.

Hard Disks and Types of Storage

Hard disks can use different type of connectors and protocols.

  • Advanced Technology Attachment (ATA)
  • Serial ATA (SATA)
  • Fibre Channel (FC)
  • Small Computer System Interface (SCSI)
  • Serial Attached SCSI (SAS)

ATA and SATA and SCSI are older standards, newer disks will typically use SATA or SAS where SAS is more geared towards the enterprise market. FC is used to attach to Storage Area Network (SAN)

Storage can either be file-level storage or block-level storage. File-level storage provides access to a file system through protocols such as Network File System (NFS) or Common Internet File System (CIFS). Block-level storage can be seen as raw storage that does not come with a file system. Block-level storage presents Logical Unit Number (LUN) to servers and the server may then format that raw storage with a file system. VmWare uses VmWare File System (VMFS) to format raw devices.


Storage can be accessed in different ways. Directly Attached Storage (DAS) is storage that is attached to a server, it may also be described as captive storage. There is no efficient sharing of storage and can be complex to implement and manage. To be able to share files the storage needs to be connected to the network. Network Attached Storage (NAS) enables the sharing of storage through the network and protocols such as NFS and CIFS. Internally SCSI and RAID will commonly be implemented. Storage Area Network (SAN) is a separate network that provides block-level storage as compared to the NAS that provides file-level storage.

Virtualization of Storage

Everything is being abstracted and virtualized these days, storage is no exception. The goal of anything being virtualized is to abstract from the physical layer and to provide a better utilization and less/no downtime when making changes to the storage system. It is also key in scaling since direct attached storage will not scale well. It also helps in decreasing the management complexity if multiple pools of storage can be accessed from one management tool. One basic form of virtualization is creating virtual disks that use a subset of the storage available on the physical device such as when creating a virtual machine in VmWare or with other hypervisors.

Virtualization exists at different levels such as block, disk, file system and file virtualization.

One form of file system virtualization is the concept of NAS where the storage is accessed through NFS or CIFS. The file system is shared among many hosts which may be running different operating systems such as Linux and Windows.

Block level storage can be virtualized through virtual disks. The goal of virtual disks is to make them flexible, being able to increase and decrease in size, provide as fast storage as needed and to increase the availability compared to physical disks.

There are also other forms of virtualization/abstracting where several LUNs can be hidden behind another LUN or where virtual LUNs are sliced from a physical LUN.

Storage Protocols

There are a number of protocols available for transporting storage traffic. Some of them are:

Internet Small Computer System Interface (iSCSI) – Transports SCSI requests over TCP/IP. Not suitable for high performance storage traffic

Fibre Channel Protocol (FCP) – It’s the interface protocol of SCSI on fibre channel

Fibre Channel over IP (FCIP) – A form of storage tunneling or FC tunneling where FC information is tunneled through the IP network. Encapsulates the FC block data and transports it through a TCP socket

Fibre Channel over Ethernet (FCoE) – Encapsulating FC information into Ethernet frames and transporting them on the Ethernet network

Storage protocols

Fibre Channel

Fibre channel is a technology to attach to and transfer storage. FC requires lossless transfer of storage traffic which has been difficult/impossible to provide on traditional IP/Ethernet based networks. FC has provided more bandwidth traditionally than Ethernet, running at speeds such as 8 Gbit/s and 16 Gbit/s but Ethernet is starting to take over the bandwidth race with speeds of 10, 40, 100 or even 400 Gbit/s achievable now or in the near future.

There are a lot of terms in Fibre channel which are not familiar for us coming from the networking side. I will go through some of them here:

Host Bus Adapter (HBA) – A card with FC ports to connect to storage, the equivalent of a NIC

N_Port – Node port, a port on a FC host

F_Port – Fabric port, port on a switch

E_Port – Expansion port, port connecting two fibre channel switches and carrying frames for configuration and fabric management

TE_Port – Trunking E_Port, Cisco MDS switches use Enhanced Inter Switch Link (EISL) to carry these frames. VSANs are supported with TE_Ports, carrying traffic for several VSANs over one physical link

World Wide Name (WWN) – All FC devices have a unique identity called WWN which is similar to how all Ethernet cards have a MAC address. Each N_Port has its own WWN

World Wide Node Name (WWNN) – A globally unique identifier assigned to each FC node or device. For servers and hosts, the WWNN is unique for each HBA, if a server has two HBAs, it will have two WWNNs.

World Wide Port Number (WWPN) – A unique identifier for each FC port of any FC device. A server will have a WWPN for each port of the HBA. A switch has WWPN for each port of the switch.

Initiator – Clients called initiators issues SCSI commands to request services from logical units on a server that is known as a target

Fibre channel has many similarities to IP (TCP) when it comes to communicating.

  • Point to point oriented – facilitated through device login
    • Similar to TCP session establishment
  • N_Port to N_Port connection – logical node connection point
    • Similar to TCP/UDP sockets
  • Flow controlled – hop by hop and end-to-end basis
    • Similar to TCP flow control but a different mechanism where no drops are allowed
  • Acknowledged – For certain types of traffic but not for others
    • Similar to how TCP acknowledges segments
  • Multiple connections allowed per device
    • Similar to TCP/UDP sockets

Buffer to Buffer Credits

FC requires lossless transport and this is achieved through B2B credits.

  • Source regulated flow control
  • B2B credits used to ensure that FC transport is lossless
  • The number of credits is negotiated between ports when the link is brought up
  • The number of credits is decremented  with each packet placed on the wire
    • Does not rely on packet size
    • If the number of credits is 0, transmission is stopped
  • Number of credits incremented when “transfer ready” message received
  • The number of B2B credits needs to be taken into consideration as bandwidth and/or distance increases

Virtual SAN (VSAN)

Virtual SANs allow to utilize the physical fabric better, essentially providing the same functionality as 802.1Q does to Ethernet.

  • Virtual fabrics created from a larger cost-effective and redundant physical fabric
  • Reduces waste of ports of a SAN island approach
  • Fabric events are isolated per VSAN, allowing for higher availability and isolation
  • FC features can be configured per VSAN, allowing for greater versability

Fabric Shortest Path First (FSPF)

To find the best path through the fabric, FSPF can be used. The concept should be very familiar if you know OSPF.

  • FSPF routes traffic based on the destination Domain ID
  • For FSPF a Domain ID identifies a VSAN in a single switch
    • The number of maximum switches supported in a fabric is then limited to 239
  • FSPF
    • Performs hop-by-hop routing
    • The total cost is calculated to find the least cost path
    • Supports the use of equal cost load sharing over links
  • Link costs can be manually adjusted to affect the shortest paths
  • Uses Dijkstra algorithm
  • Runs only on E_Ports or TE_Ports and provides loop free topology
  • Runs on a per VSAN basis


To provide security in the SAN, zoning can be implemented.

  • Zones are a basic form of data path security
    • A bidirectional ACL
    • Zone members can only “see” and talk to other members of the zone. Similar to PVLAN community port
    • Devices can be members of several zones
    • By default, devices that are not members of a zone will be isolated from other devices
  • Zones belong to a zoneset
  • The zoneset must be active to enforce the zoning
    • Only one active zoneset per fabric or per VSAN

SAN Drivers

What are the drivers for implementing a SAN?

  • Lower Total Cost of Ownership (TCO)
  • Consolidation of storage
  • To provide better utilization of storage resources
  • Provide a high availability
  • Provide better manageability

Storage Design Principles

These are some of the important factors when designing a SAN:

  • Plan a network that can handle the number of ports now and in the future
  • Plan the network with a given end-to-end performance and throughput level in mind
  • Don’t forget about physical requirements
  • Connectivity to remote data centers may be needed to meet the business requirements of business continuity and disaster recovery
  • Plan for an expected lifetime of the SAN and make sure the design can support the SAN for its expected lifetime

Device Oversubscription and Consolidation

  • Most SAN designs will have oversubscription or fan-out from the storage devices to the hosts.
    • Follow guidelines from the storage vendor to not oversubscribe the fabric too heavily.
  • Consolidate the storage but be aware of the larger failure domain and fate sharing
    • VSANs enable consolidation while still keeping separate failure domains

When consolidating storage, there is an increased risk that all of the storage or a large part of it can be brought offline if the fabric or storage controllers fail. Also be aware that when using virtualization techniques such as vSANS, there is fate sharing because several virtual topologies use the same physical links.

Convergence and Stability

  • To support fast convergence, the number of switches in the fabric should not be too large
  • Be aware of the number of parallell links, a lot of links will increase processing time and SPF run time
  • Implement appropriate levels of redundancy in the network layer and in the SAN fabric

The above guidelines are very general but the key here is that providing too much redundancy may actually decrease the availability as the Mean Time to Repair (MTTR) increases in case of a failure. The more nodes and links in the fabric the larger the link state database gets and this will lead to SPF runs taking a longer period of time. The general rule is that two links is enough and that three is the maximum, anything more than that is overdoing it. The use of portchannels can help in achieving redundancy while keeping the number of logical links in check.

SAN Security

Security is always important  but in the case of storage it can be very critical and regulated by PCI DSS, HIPAA, SOX or other standards. Having poor security on the storage may then not only get you fired but behind bars so security is key when designing a SAN. These are some recommendations for SAN security:

  • Use secure role-based management with centralized authentication, authorization and logging of all the changes
  • Centralized authentication should be used for the networking devices as well
    • Only authorized devices should be able to connect to the network
  • Traffic should be isolated and secured with access controls so that devices on the network can send and receive data securely while being protected from other activities of the network
  • All data leaving the storage network should be encrypted to ensure business continuane
    • Don’t forget about remote vaulting and backup
  • Ensure the SAN and network passes any regulations such as PCI DSS, HIPAA, SOX etc

SAN Topologies

There are a few common designs in SANs depending on the size of the organization. We will discuss a few of them here and their characteristics and strong/weak points.

Collapsed Core Single Fabric


In the collapsed core, both the iniator and the target are connected through the same device. This means all traffic can be switched without using any Inter Switch Links (ISL). This provides for full non-blocking bandwidth and there should be no oversubscription. It’s a simple design to implement and support and it’s also easy to manage compared to more advanced designs.

The main concern of this design is how redundant the single switch is. Does it provided for redundant power, does it have a single fabric or an extra fabric for redundancy? Does the switch have redundant supervisors? At the end of the day, a single device may go belly up so you have to consider the time it would take to restore your fabric and if this downtime is acceptable compared to a design with more redundancy.

Collapsed Core Dual Fabric


The dual fabric designs removes the Single Point of Failure (SPoF) of the single switch design. Every host and storage device is connected to both fabrics so there is no need for an ISL. The ISL would only be useful in case the storage device loses its port towards fabric A and the server loses its port towards fabric B. This scenario may not be that likely though.

The drawback compared to the single fabric is the cost of getting two of every equipment to create the dual fabric design.

Core Edge Dual Fabric


For large scale SAN designs, the fabric is divided into a core and edge part where the storage is connected to the edge of the fabric. This design is dual fabric to provide high availability. The storage and servers are not connected to the same device, meaning that storage traffic must pass the ISL links between the core and the edge. The ISL links must be able to handle the load so that the oversubscription ratio is not too high.

The more devices that get added to a fabric, the more complex it gets and the more devices you have to manage. For a large design you may not have many options though.

Fibre Channel over Ethernet (FCoE)

Maintaining one network for storage and one for normal user data is costly and complex. It also means that you have a lot of devices to manage. Wouldn’t it be better if storage traffic could run on the normal network as well? That is where FCoE comes into play. The FC frames are encapsulated into Ethernet frames and can be sent on the Ethernet network. However, Ethernet isn’t lossless, is it? That is where Data Center Bridging (DCB) comes into play.

Data Center Bridging (DCB)

Ethernet is not a lossless protocol. Some devices may have support for the use of PAUSE frames but these frames would stop all communication, meaning your storage traffic would come to a halt as well. There was no way of pausing only a certain type of traffic. To provide lossless transfer of frames, new enhancements to Ethernet had to be added.

Priority Flow Control (PFC)

  • PFC is defined in 802.1Qbb and provides PAUSE based on 802.p CoS
  • When link is congested, CoS assigned to “no-drop” will be paused
  • Other traffic assigned to other CoS values will continue to transmit and rely on upper layer protocols for retransmission
  • PFC is not limited to FCoE traffic

It is also desirable to be able to guarantee traffic a certain amount of the bandwidth available and to not have a class of traffic use up all the bandwidth. This is where Enhanced Transmission Selection (ETS) has its use.

Enhanced Transmission Selection (ETS)

  • Defined in 802.1Qaz and prevents a single traffic class from using all the bandwidth leading to starvation of other classes
  • If a class does not fully use its share, that bandwidth can be used by other classes
  • Helps to accomodate for classes that have a bursty nature

The concept is very similar to doing egress queuing through MQC on a Cisco router.

We now have support for lossless Ethernet but how can we tell if a device has implemented these features? Through the use of Data Center Bridging eXchange (DCBX).

Data Center Bridging Exchange (DCBX)

  • DCBX is LLDP with new TLV fields
  • Negotiates PFC, ETS, CoS values between DCB capable devices
  • Simplifies management because parameters can be distributed between nodes
  • Responsible for logical link up/down signaling of Ethernet and Fibre Channel

What is the goal of running FCoE? What are the drivers for running storage traffic on our normal networks?

Unified Fabric

Data centers require a lot of cabling, power and cooling. Because storage and servers have required separate networks, a lot of cabling has been used to build these networks. With a unified fabric, a lot of cabling can be removed and the storage traffic can use the regular IP/Ethernet network,so that half of the number of cables are needed. The following are some reasons for striving for a unified fabric:

  • Reduced cabling
    • Every server only requires 2xGE or 2x10GE instead of 2 Ethernet ports and 2 FC ports
  • Fewer access layer switches
    • A typical Top of Rack (ToR) design may have two switches for networking and two for storage, two switches can then be removed
  • Fewer network adapters per server
    • A Converged Network Adapter (CNA) combines networking and storage functionality so that half of the NICs can be removed
  • Power and cooling savings
    • Less NICs, mean less power which then also saves on cooling. The reduced cabling may also improve the airflow in the data center
  • Management integration
    • A single network infrastructure and less devices to manage decreases the overall management complexity
  • Wire once
    • There is no need to recable to provide network or storage connectivity to a server


This post is aimed at giving the network engineer an introduction into storage. Traditionally there have been silos between servers, storage and networking people but these roles are seeing a lot of more overlap in modern networks. We will see networks be built to provide both for data and storage traffic and to provide less complex storage. Protocols like iSCSI may get a larger share of the storage world in the future and FCoE may become larger as well.

Categories: Storage Tags: , , , ,

Next Generation Multicast – NG-MVPN

April 10, 2015 Leave a comment


Multicast is a great technology that although it provides great benefits, is seldomly deployed. It’s a lot like IPv6 in that regard. Service providers or enterprises that run MPLS and want to provide multicast services have not been able to use MPLS to provide multicast  Multicast has then typically been delivered by using Draft Rosen which is a mGRE technology to provide multicast. This post starts with a brief overview of Draft Rosen.

Draft Rosen

Draft Rosen uses GRE as an overlay protocol. That means that all multicast packets will be encapsulated inside GRE. A virtual LAN is emulated by having all PE routers in the VPN join a multicast group. This is known as the default Multicast Distribution Tree (MDT). The default MDT is used for PIM hello’s and other PIM signaling but also for data traffic. If the source sends a lot of traffic it is inefficient to use the default MDT and a data MDT can be created. The data MDT will only include PE’s that have receivers for the group in use.


Draft Rosen is fairly simple to deploy and works well but it has a few drawbacks. Let’s take a look at these:

  • Overhead – GRE adds 24 bytes of overhead to the packet. Compared to MPLS which typically adds 8 or 12 bytes there is 100% or more of overhead added to each packet
  • PIM in the core – Draft Rosen requires that PIM is enabled in the core because the PE’s must join the default and or data MDT which is done through PIM signaling. If PIM ASM is used in the core, an RP is needed as well. If PIM SSM is run in the core, no RP is needed.
  • Core state – Unneccessary state is created in the core due to the PIM signaling from the PE’s. The core should have as little state as possible
  • PIM adjacencies – The PE’s will become PIM neighbors with each other. If it’s a large VPN and a lot of PE’s, a lot of PIM adjacencies will be created. This will generate a lot of hello’s and other signaling which will add to the burden of the router
  • Unicast vs multicast – Unicast forwarding uses MPLS, multicast uses GRE. This adds complexity and means that unicast is using a different forwarding mechanism than multicast, which is not the optimal solution
  • Inefficency – The default MDT sends traffic to all PE’s in the VPN regardless if the PE has a receiver in the (*,G) or (S,G) for the group in use

Based on this list, it is clear that there is a room for improvement. The things we are looking to achieve with another solution is:

  • Shared control plane with unicast
  • Less protocols to manage in the core
  • Shared forwarding plane with unicast
  • Only use MPLS as encapsulation
  • Fast Restoration (FRR)


To be able to build multicast Label Switched Paths (LSPs) we need to provide these labels in some way. There are three main options to provide these labels today:

  • Multipoint LDP(mLDP)
  • Unicast MPLS + Ingress Replication(IR)

MLDP is an extension to the familiar Label Distribution Protocol (LDP). It supports both P2MP and MP2MP LSPs and is defined in RFC 6388.

RSVP-TE is an extension to the unicast RSVP-TE which some providers use today to build LSPs as opposed to LDP. It is defined in RFC 4875.

Unicast MPLS uses unicast and no additional signaling in the core. It does not use a multipoint LSP.

Multipoint LSP

Normal unicast forwarding through MPLS uses a point to point LSP. This is not efficient for multicast. To overcome this, multipoint LSPs are used instead. There are two different types, point to multipoint and multipoint to multipoint.


  • Replication of traffic in core
  • Allows only the root of the P2MP LSP to inject packets into the tree
  • If signaled with mLDP – Path based on IP routing
  • If signaled with RSVP-TE – Constraint-based/explicit routing. RSVP-TE also supports admission control


  • Replication of traffic in core
  • Bidirectional
  • All the leafs of the LSP can inject and receive packets from the LSP
  • Signaled with mLDP
  • Path based on IP routing

Core Tree Types

Depending on the number of sources and where the sources are located, different type of core trees can be used. If you are familiar with Draft Rosen, you may know of the default MDT and the data MDT.


Signalling the Labels

As mentioned previously there are three main ways of signalling the labels. We will start by looking at mLDP.

  • LSPs are built from the leaf to the root
  • Supports P2MP and MP2MP LSPs
    • mLDP with MP2MP provides great scalability advantages for “any to any” topologies
      • “any to any” communication applications:
        • mVPN supporting bidirectional PIM
        • mVPN Default MDT model
        • If a provider does not want  tree state per ingress PE source
  • Supports Fast Reroute (FRR) via RSVP-TE unicast backup path
  • No periodic signaling, reliable using TCP
  • Control plane is P2MP or MP2MP
  • Data plane is P2MP
  • Scalable due to receiver driven tree building
  • Supports MP2MP
  • Does not support traffic engineering

RSVP-TE can be used as well with the following characteristics.

  • LSPs are built from the head-end to the tail-end
  • Supports only P2MP LSPs
  • Supports traffic engineering
    • Bandwidth reservation
    • Explicit routing
    • Fast Reroute (FRR)
  • Signaling is periodic
  • P2P technology at control plane
    • Inherits P2P scaling limitations
  • P2MP at the data plane
    • Packet replication in the core

RSVP-TE will mostly be interesting for SPs that are already running RSVP-TE for unicast or for SPs involved in video delivery. The following table shows a comparision of the different protocols.

Core protocols

Assigning Flows to LSPs

After the LSPs have been signalled, we need to get traffic onto the LSPs. This can be done in several different ways.

  • Static
  • PIM
    • RFC 6513
  • BGP Customer Multicast (C-Mcast)
    • RFC 6514
    • Also describes Auto-Discovery
  • mLDP inband signaling
    • RFC 6826


  • Mostly applicable to RSVP-TE P2MP
  • Static configuration of multicast flows per LSP
  • Allows aggregation of multiple flows in a single LSP


  • Dynamically assigns flows to an LSP by running PIM over the LSP
  • Works over MP2MP and PPMP LSP types
  • Mostly used but not limited to default MDT
  • No changes needed to PIM
  • Allows aggregation of multiple flows in a single LSP

BGP Auto-Discovery

  • Auto-Discovery
    • The process of discovering all the PE’s with members in a given mVPN
  • Used to establish the MDT in the SP core
  • Can also be used to discover set of PE’s interested in a given customer multicast group (to enable S-PSMSI creation)
    • S-PMSI = Data MDT
  • Used to advertise address of the originating PE and tunnel attribute information (which kind of tunnel)

BGP MVPN Address Family

  • MPBGP extensions to support mVPN address family
  • Used for advertisement of AD routes
  • Used for advertisement of C-mcast routes (*,G) and (S,G)
  • Two new extended communities
    • VRF route import – Used to import mcast routes, similar to RT for unicast routes
    • Source AS – Used for inter-AS mVPN
  • New BGP attributes
    • PMSI Tunnel Attribute (PTA) – Contains information about advertised tunnel
    • PPMP label attribute – Upstream generated label used by the downstream clients to send unicast messages towards the source
  • If mVPN address family is not used the address family ipv4 mdt must be used

BGP Customer Multicast

  • BGP Customer Multicast (C-mcast) signalling on overlay
  • Tail-end driven updates is not a natural fit for BGP
    • BGP is more suited for one-to-many not many-to-one
  • PIM is still the PE-CE protocol
  • Easy to use with SSM
  • Complex to understand and troubleshoot for ASM

MLDP Inband Signaling

  • Multicast flow information encoded in the mLDP FEC
  • Each customer mcast flow creates state on the core routers
    • Scaling is the same as with default MDT with every C-(S,G) on a Data MDT
  • IPv4 and IPv6 multicast in global or VPN context
  • Typical for SSM or PIM sparse mode sources
  • IPTV walled garden deployment
  • RFC 6826

The natural choice is to stick with PIM unless you need very high scalability. Here is a comparison of PIM and BGP.


BGP C-Signaling

  • With C-PIM signaling on default MDT models, data needs to be monitored
    • On default/data tree to detect duplicate forwarders over MDT and to trigger the assert process
    • On default MDT to perform SPT switchover (from (*,G) to (S,G))
  • On default MDT models with C-BGP signaling
    • There is only one forwarder on MDT
      • There are no asserts
    • The BGP type 5 routes are used for SPT switchover on PEs
  • Type 4 leaf AD route used to track type 3 S-PMSI (Data MDT) routes
  • Needed when RR is deployed
  • If source PE sets leaf-info-required flag on type 3 routes, the receiver PE responds with with a type 4 route


If PIM is used in the core, this can be migrated to mLDP. PIM can also be migrated to BGP. This can be done per multicast source, per multicast group and per source ingress router. This means that migration can be done gradually so that not all core trees must be replaced at the same time.

It is also possible to have both mGRE and MPLS encapsulation in the network for different PE’s.

To summarize the different options for assigning flows to LSPs

  • Static
    • Mostly applicable to RSVP-TE
  • PIM
    • Well known, has been in use since mVPN introduction over GRE
  • BGP A-D
    • Useful where head-end assigns the flows to the LSP
  • BGP C-mcast
    • Alternative to PIM in mVPN context
    • May be required in dual vendor networks
  • MLDP inband signaling
    • Method to stitch a PIM tree to a mLDP LSP without any additional signaling

Optimizing the MDT

There are some drawbacks with the normal operation of the MDT. The tree is signalled even if there is no customer traffic leading to unneccessary state in the core. To overcome these limitations there is a model called the partitioned MDT running over mLDP with the following characteristics.

  • Dynamic version of default MDT model
  • MDT is only built when customer traffic needs to be transported across the core
  • It addresses issues with the default MDT model
    • Optimizes deployments where sources are located in a few sites
    • Supports anycast sources
      • Default MDT would use PIM asserts
    • Reduces the number of PIM neighbors
      • PIM neighborship is unidirectional – The egress PE sees ingress PEs as PIM neighbors


There are many many different profiles supported, currently 27 profiles on Cisco equipment. Here are some guidelines to guide you in the selection of a profile for NG-MVPN.

  • Label Switched Multicast (LSM) provides unified unicast and multicast forwarding
  • Choosing a profile depends on the application and scalability/feature requirements
  • MLDP is the natural and safe choice for general purpose
    • Inband signalling is for walled garden deployments
    • Partitioned MDT is most suitable if there are few sources/few sites
    • P2MP TE is used for bandwidth reservation and video distribution (few source sites)
    • Default MDT model is for anyone (else)
  • PIM is still used as the PE-CE protocol towards the customer
  • PIM or BGP can be used as an overlay protocol unless inband signaling or static mapping is used
  • BGP is the natural choice for high scalability deployments
    • BGP may be the natural choice if already using it for Auto-Discovery
  • The beauty of NG-MVPN is that profile can be selected per customer/VPN
    • Even per source, per group or per next-hop can be done with Routing Policy Language (RPL)

This post was heavily inspired and is basically a summary of the Cisco Live session BRKIPM-3017 mVPN Deployment Models by Ijsbrand Wijnands and Luc De Ghein. I recommend that you read it for more details and configuration of NG-MVPN.

Categories: Multicast Tags: , , ,

My CLUS 2015 Schedule for San Diego

April 5, 2015 2 comments

With roughly two months to go before Cisco Live starts, here is my preliminary schedule for San Diego.

CLUS San Diego Schedule

I have two CCDE sessions booked to help me prepare for the CCDE exam. I have the written scheduled on wednesday and we’ll see how that goes.

I have a pretty strong focus on DC because I want to learn more in that area and that should also help me prepare for the CCDE.

I have the Routed Fast Convergence because it’s a good session and Denise Fishburne is an amazing instructor and person.

Are you going? Do you have any sessions in common? Please say hi if we meet in San Diego.

Categories: Uncategorized