My automation/orchestration thoughts

I’ve been meaning for a while to write about my experience in the field of network automation, and thanks to Richard I’ve finally found a place to share my thoughts and findings.
I will start with some non-technical dreck and then move on to some more technical stuff. I hope you will appreciate both.

The views and opinions expressed in this post are solely my own and do not express the views or opinions of my employer.

What?

To commence the discussion, let me warn you: automation/orchestration/SDN are some of the most abused and confused terms in IT, and are as misleading as they are vague.
The way I see it, “orchestration” is something “higher” that provides guidance to the clever workhorses working behind the scenes, telling them how to do repetitive stuff with the right information.
There are no particular technologies or tools that give you “automation” or “orchestration”.
Orchestrating is about changing mindsets, organisational models and, ultimately, changing how we do things. It’s not about flipping tables, as that usually does not bring people together, but apart.

Why?

The reason everyone is craving automation nowadays is simply that it’s something the net-execs love to hear: “We can lower OPEX by automating more!!”
The fact is that if achieving automation costs you more money than it saves, the discussion is a non-starter.
Also, if you don’t tackle the problem the right way from the start, your investment may well end up in the bin because it turns out to be very short-lived.

Besides, it is pretty evident that having some clever non-human “functions” that are able to make our lives easier and let us spend more time in the pub would be nice, right?

Assume you have 3K+ switches on which you want to change the NTP server address, all in one go.
You certainly don’t want to have to log in to 3K+ boxes one by one. A script could do it all for you, couldn’t it? (Something like the throwaway sketch below.)
But that’s not the point. We’ve always had scripts, we’ve always had geeks writing 3,000-line bash scripts to carry out such tasks (which no-one but them could read or tweak because they were so badly written).
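
To make that concrete, here is roughly what such a throwaway script looks like. This is a minimal sketch only: it assumes netmiko, a hypothetical devices.txt inventory, uniform credentials and an IOS-style CLI, none of which reflect our actual estate.

    # Minimal sketch: push a new NTP server to a list of devices, one by one.
    # Assumes a plain-text inventory (devices.txt) and uniform credentials,
    # which is rarely true in real life.
    from netmiko import ConnectHandler

    NEW_NTP = "192.0.2.123"  # hypothetical NTP server address

    with open("devices.txt") as f:
        hosts = [line.strip() for line in f if line.strip()]

    for host in hosts:
        conn = ConnectHandler(
            device_type="cisco_ios",  # assumption: every box speaks an IOS-style CLI
            host=host,
            username="automation",
            password="changeme",
        )
        conn.send_config_set([f"ntp server {NEW_NTP}"])
        conn.save_config()
        conn.disconnect()
        print(f"{host}: NTP server updated")

It works, but it is exactly the kind of artefact that only its author ever dares to touch, which is the point the rest of this post is about.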

The point of automation is having a much more comprehensive and coherent framework where tasks like changing an NTP server address or the MTU on an interface do not require ANY scripting skills, but can be done easily by anyone, even oompa-loompas.
Having such an environment dramatically lowers the chance of human error and cross-configuration incompatibilities, which, in turn, lowers the risk of MPIs and, ultimately, of revenue loss.
We are in the days of continuous testing and delivery of software, so why can’t we borrow the same philosophy in the networking world?
A device config, at the end of the day, is a set of instructions that can be regarded exactly as a piece of code: something that can be version-controlled, automatically delivered and continuously tested.
We have to transform our network labs into continuous-testing facilities driven by tools like Jenkins/Git/etc.!
My testing friends are probably going to hate me now. But can you imagine how quick and cheap it would be to test a new service/feature just by defining the expected results and waiting for the outcome to appear in your mailbox? (Something like the toy test below.)
I’m not saying we have to get rid of people, but that we could concentrate on much cleverer stuff just by re-working our mindsets.
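
For the avoidance of doubt, the “expected results” really can be that mundane. Here is a toy, self-contained sketch of the kind of test a Jenkins job could run on every commit; the render_config() helper is stubbed out and purely hypothetical.

    # test_ntp.py -- a toy config test a CI job (e.g. Jenkins) could run per commit.
    # In real life render_config() would come from the templating pipeline; it is
    # stubbed out here so the example is self-contained.

    def render_config(device: str) -> str:
        # Stand-in for "render the candidate config for this device from model + data + template".
        return "hostname {}\nntp server 192.0.2.123\n".format(device)

    EXPECTED = "ntp server 192.0.2.123"   # the expected result of this change
    LEGACY = "ntp server 198.51.100.1"    # the old server that must be gone

    def test_new_ntp_server_present():
        assert EXPECTED in render_config("edge-router-01").splitlines()

    def test_legacy_ntp_server_removed():
        assert LEGACY not in render_config("edge-router-01")

Run it with pytest and wire the result into whatever notifies you, mailbox included.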

How?

I work for one of the biggest ISPs in the UK. My company is a fairly complex organisation due to its history of acquisitions and massive growth rate in the last few years.
Introducing orchestration in big and complex organisations is a particularly challenging task, not only because you have to get people from various departments on your side, but especially because you have to start operating on heterogeneous network estates whose stability is critical to guaranteeing the level of service our customers expect (and pay us for).

We could spend hours discussing the specific tools or software that can be used to orchestrate network functions, but we wouldn’t be tackling the main point.
Orchestrating a network (I’m taking SDN out of the picture here) essentially means “driving” the network configurations in a clever way, from a centralised system that I enjoy calling the SPOT (single-point-of-truth).
You can drive configurations quite easily, even if your network is multi-vendor, but you cannot do it without having a model to drive them with. You need some good-quality data to render your configurations. The recipe for an accurate network configuration has three ingredients: a good model, good data, and a good template. Et voilà.
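
As a toy illustration of those three ingredients in Python (the “model” here is just a dict and the template is deliberately trivial, so treat it as a sketch rather than anything production-worthy):

    # Toy example of "model + data + template = configuration".
    # pip install jinja2
    from jinja2 import Template

    # The "data", shaped by a simple "model" describing an interface
    # (a stand-in for a proper YANG-derived structure).
    data = {
        "interface": {"name": "ge-0/0/1", "description": "uplink-to-core", "mtu": 9000}
    }

    # The "template": how that model is rendered for one particular vendor.
    template = Template(
        "interface {{ interface.name }}\n"
        " description {{ interface.description }}\n"
        " mtu {{ interface.mtu }}\n"
    )

    print(template.render(**data))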

Network configuration templates

Writing templates is easy, very easy. We opted for Jinja2 as a templating language, as it seems very widely used and is supported by tools like Ansible.
The challenge with templates is really just one: re-usability. I do not want to write a template for how to configure a VLAN each and every time I write the templates for a project. How can others re-use my templates? Are my templates flexible enough to let others extend them should the need arise?
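
One way to get that re-usability with Jinja2 is macros and includes, so the “how do I configure a VLAN” logic lives in exactly one place. A minimal sketch (template names and contents are made up):

    # Sketch of template re-use with a Jinja2 macro: the VLAN logic is defined once
    # in a shared template and imported wherever it is needed.
    from jinja2 import DictLoader, Environment

    templates = {
        # Shared library template: the only place that knows how to render a VLAN.
        "vlan_macros.j2": (
            "{% macro vlan(id, name) -%}\n"
            "vlan {{ id }}\n"
            " name {{ name }}\n"
            "{%- endmacro %}"
        ),
        # Project template: re-uses the macro instead of re-inventing it.
        "access_switch.j2": (
            "{% from 'vlan_macros.j2' import vlan %}"
            "{{ vlan(100, 'office') }}\n"
            "{{ vlan(200, 'voip') }}"
        ),
    }

    env = Environment(loader=DictLoader(templates))
    print(env.get_template("access_switch.j2").render())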

Network models

Putting together a model is – fairly – simple. The big network players (coordinated by the IETF NETMOD WG) are trying to meet in the middle and agree on some standardised models, all written in YANG (www.yang-central.org).
A network device is typically used for more than one purpose (a BNG can be used to terminate subscribers, as a pure MPLS PE router to which CE devices connect, etc.), and I expect the model that best represents such a device would be composed of multiple services, each in the form of a mix of YANG model instances.

Network state data

What is NOT easy at all – especially in a brownfield scenario – is putting together the data needed to populate the nice model you engineered so carefully.
Where do you fetch your IP addressing from? And the L3/L2 MPLS VPN data, the interface descriptions, the physical network topology?
You might already have an IPAM, or a database of some sort where you store that stuff, but can they talk to each other? Do they expose any APIs? Are those APIs powerful enough to fulfil your needs?
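
If the answer is “yes, there is an API”, pulling that data into your model can be as mundane as the sketch below. Everything here is hypothetical: the endpoint, the token and the JSON shape will differ per product, so this is an illustration of the idea rather than a recipe.

    # Hypothetical sketch: fetch prefix data from an IPAM REST API so it can feed
    # the configuration model. Endpoint, auth scheme and fields are made up.
    import requests

    IPAM_URL = "https://ipam.example.net/api/v1/prefixes"  # hypothetical endpoint
    HEADERS = {"Authorization": "Token not-a-real-token"}

    resp = requests.get(IPAM_URL, headers=HEADERS, params={"site": "london-pop"}, timeout=10)
    resp.raise_for_status()

    for prefix in resp.json():
        print(prefix["prefix"], prefix.get("description", ""))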

I think the essence of the problem is about tool co-ordination. “Because that data needs to be somewhere”. You just need to talk to people and try to understand how you can make use of the data you already have.

The immaturity of YANG, the lack of models for ‘everything’, and the lack of network devices that can be configured through NETCONF purely by means of YANG data structures certainly do not help businesses steer with determination down the YANG road. (I suspect there is a lot of politics behind why YANG is not yet so popular, even though it has been around for some years now.)
The above is maybe one of the reasons why the landscape is blossoming with proprietary solutions, developed (or acquired – as in the case of Cisco NCS, the former Tail-f) by multiple vendors, that can help bridge the gap.
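
For the devices that do support it, driving configuration over NETCONF with YANG-modelled data looks roughly like the sketch below (using ncclient; the host details and the XML payload, loosely shaped after the standard ietf-interfaces model, are illustrative only):

    # Sketch: push a YANG-modelled change (an interface description) over NETCONF.
    # pip install ncclient. Host, credentials and payload are illustrative only.
    from ncclient import manager

    CONFIG = """
    <config xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
      <interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
        <interface>
          <name>ge-0/0/1</name>
          <description>uplink-to-core</description>
        </interface>
      </interfaces>
    </config>
    """

    with manager.connect(
        host="192.0.2.1", port=830,
        username="automation", password="changeme",
        hostkey_verify=False,
    ) as m:
        m.edit_config(target="candidate", config=CONFIG)
        m.commit()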

When?

NOW. This part is not going to be long and boring. You have to start NOW. The reason is that it’s a fairly complicated and time-consuming process (if you’re not starting from scratch).

Who?

That’s the $1BN question. I think everyone has to be involved in the process. No-one should feel left out of, nor immune from, the revolution.
If you don’t like it, change jobs. If you love CLIs, you don’t need to worry: you’re still going to need CLIs. It’s probably not going to be IOS or JUNOS anymore, but you can still call it a CLI if it makes you feel better.

Summary

Introducing orchestration in big ISPs can be regarded as the first industrial revolution of the Internet after the advent of IP (just kidding), where efficiencies are introduced by automating dumb and time-consuming tasks.
The industrial revolutions took 20 years each to complete, and I expect this one to last for a while too. There will be turmoil, riots and rallies arranged by the “non-automation” parties. But it is going to happen no matter what.

The biggest resistance to the revolution may well come from up above, as everything has to have a business case nowadays, and building one for a revolution is rather difficult.
But wait, do we need one? This is not just about going out to the market and buying a shiny and very pricey “jack of all trades” that does everything at the click of a button or hiring 200 employees to code a monolithic tool.
What we can do is start tackling some simple problems wearing the hat of the orchestration architect (and therefore not writing 30,000-line Python scripts that no-one can read), orchestrating small portions of our networks at minimal cost.
Then, when everyone else sees what you’ve done within the (small) budget of some other projects, and how much you’ve simplified your life and everybody else’s, they’ll say: “Wait a minute. Why don’t we jump on the bandwagon?”

IPv6 Tunnel and Failing TCP Sessions

As part of our IPv6 deployment we had to upgrade the firmware on our CPEs. We have a small variety of different models, but the majority of them are based on a Broadcom chipset. This firmware upgrade included all the features we needed for IPv6 (the DHCPv6 client for the WAN, RA announcements on the LAN, etc.), but it also included other fixes and enhancements unrelated to IPv6.

We spend a lot of time and effort regression-testing these firmware pushes, and are generally pretty confident in them by the time we go to mass-push them out via TR-069. However, shortly after the firmware upgrade we started hearing complaints that this firmware had broken a very specific use case that we obviously hadn’t tested for:

IPv6 tunnels, such as the 6in4 ones offered for free by Hurricane Electric. Odd: we hadn’t started enabling native IPv6 prefixes for these customers yet, but we did deploy the firmware with ULA RAs enabled. Could that be affecting things? We didn’t think so, but obviously we had to investigate.

Problem Statement:

6in4 tunnel problem

6in4 tunnel client configured behind the router, inside the DMZ (not firewalled).

6in4 tunnel server on the internet, provided by Hurricane Electric.

Tunnel establishes correctly.  Client gets an IPv6 prefix, can ping6 tunnel end-point as well as other v6 connected servers on the internet.

However, TCP sessions don’t establish over the tunnel.

Diagnosing:

ICMPv6 echo requests and replies are passing fine, so it doesn’t appear to be an obvious routing issue.

First step is to bust out netcat and tcpdump on the client, watch the TCP 3-way handshake and see how far we get.

We see the initial SYN go out fine, then the SYN/ACK reply from the remote v6 host, and we send the ACK in response. All looks good, myth busted, time for the pub… but wait, in comes a duplicate SYN/ACK. Wait, what? We send another ACK, but yet another duplicate SYN/ACK comes back.

[screenshot of the duplicate SYN/ACKs on the client] OK, sounds like our ACKs aren’t making it back to the remote host; time to check the remote side of the session and confirm.

Sure enough, our hypothesis is correct: [screenshot of the remote host re-sending SYN/ACKs]. But why? It’s not a routing issue; we confirmed that earlier. Maybe it’s getting eaten inside Hurricane Electric land for some reason. An MTU issue seems likely: it’s always a bloody MTU issue, right? Especially when we start faffing with encapsulating tunnels.

Hrm, nope: this is IPv6, the minimum MTU is 1280 and that’s definitely enough for a small ACK. Check the he-ipv6 tunnel iface MTU: 1480. Yup, makes sense: 1500 minus 20 bytes for the 6in4 encapsulation. Quick check to make sure we can get at least 1280 through unfragged:

[screenshot of ping6 at 1280 bytes getting through unfragmented]

Yup, we’re golden. We can also ping6 up to 1432 bytes unfragged, which matches the advertised MSS value above once we include the 8-byte ICMP header (but I didn’t screen-cap that). Side note: there’s also another 40 bytes for the IPv6 header, which adds up to the 1480-byte tunnel MTU we confirmed above.
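
For anyone who likes to see the header budget in one place, a quick back-of-the-envelope check (nothing authoritative, just the arithmetic from above):

    # Back-of-the-envelope MTU budget for the 6in4 tunnel described above.
    WAN_MTU = 1500
    IPV4_HDR = 20      # outer IPv4 header added by the 6in4 encapsulation
    IPV6_HDR = 40      # inner IPv6 header
    ICMPV6_HDR = 8     # ICMPv6 echo header

    tunnel_mtu = WAN_MTU - IPV4_HDR                         # 1480, matches the he-ipv6 iface MTU
    max_ping6_payload = tunnel_mtu - IPV6_HDR - ICMPV6_HDR  # 1432, matches the test above

    print(tunnel_mtu, max_ping6_payload)  # 1480 1432
    assert tunnel_mtu >= 1280             # comfortably above the IPv6 minimum MTU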

Right, let’s tcpdump on the WAN interface of the router and have a look at the 6in4 encapsulated packets to see what’s up.

Now this is where it gets fun. Tcpdump on the router’s WAN interface… and the 3-way TCP handshake establishes. Wait… what? Stop the tcpdump and no new sessions establish successfully.

Riiiiiiiight… is the kernel looking at the 6in4 payload destination address and throwing it on the ground unless promiscuous mode is enabled? Nope: tcpdumping with -p (--no-promiscuous-mode) and the TCP sessions still work. OK, what if we tcpdump on the internal bridge interface instead of the WAN? That results in the broken behaviour!

tcpdump on WAN iface and TCP sessions work.  tcpdump on LAN iface and TCP breaks.

This. Is. Odd.

We spent a wee while going back and forth over a few hypotheses and then proving them wrong. Then, after a quick chat with one of our CPE developers, he informed us of an MTU issue they saw a while ago affecting large packets when using the hardware forwarding function of the Broadcom chipset known as “fastpath”. But this shouldn’t affect our small 72-byte packet, and besides, it was fixed ages ago.

This gets us thinking: it would explain why tcpdumping could “fix” the issue by punting the packets to the CPU and forcing software forwarding. Let’s test this theory and disable the hardware fastpath… hrmm, nope, no dice. Let’s disable another fastpath function, this time a software one called “flow cache”. Lo and behold, the TCP sessions establish!

Myth confirmed, bug logged upstream with Broadcom, time for a well earned pint.

Cheers to @NickMurison for raising the issue and helping me with the diagnostics.

DNS Behaviour of Dual-stacked Hosts

DNS is one of those ancillary services that can often get overlooked, be it recursive, authoritative, forward or reverse.

Assigning a Recursive DNS server (RDNSS):

There are two main ways to tell a client what RDNSS to use: stateless DHCPv6 (or stateful, if you want to use DHCP for IPv6 address assignment), or building it into the ICMPv6 Router Advertisements (RAs) sent by a router, à la RFC6106.

Sadly, Microsoft and Google (Android) disagree on which of these methods is best; Microsoft will only use DHCPv6 and Android will only use RAs. The fallout is that if you want both types of system on your IPv6 network, you need to announce your RDNSS via both RAs and DHCPv6. The alternative is to just rely on the RDNSS being handed out via DHCPv4 if you’re dual-stacked, which is obviously not a future-proof solution.

See here for a list of support in other operating systems: https://en.wikipedia.org/wiki/Comparison_of_IPv6_support_in_operating_systems

Resolving AAAA Records:

AAAA or “quad A” records are the IPv6 equivalent of the IPv4 A record, used for forward resolution from a domain name to an IP address. They are intrinsic to your use of the internet: without them, your client would not know which server to connect to.

Now, to dispel one misconception right off the bat: it is not essential that your RDNSS be contactable over IPv6 in order to deliver an AAAA record. I.e. a dual-stacked host can use an IPv4 RDNSS and still be able to browse the IPv6 internet.

For anyone starting to deploy IPv6 in a dual stack environment, it’s important to realise that clients will start requesting both an A and an AAAA record for each resolution attempt, which will effectively double the load on your RDNSS infrastructure.
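
You can see the doubling from any dual-stack host: a standard resolver call like getaddrinfo() asks for both record types and returns both address families, roughly as in the sketch below (actual behaviour depends on the OS resolver and what records exist for the name).

    # A dual-stack name lookup returns both IPv6 (AAAA) and IPv4 (A) results,
    # which is why resolver query volume roughly doubles.
    import socket

    results = socket.getaddrinfo("www.example.com", 443, proto=socket.IPPROTO_TCP)
    for family, _type, _proto, _canonname, sockaddr in results:
        label = "IPv6/AAAA" if family == socket.AF_INET6 else "IPv4/A"
        print(label, sockaddr[0])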

But when should a client start requesting an AAAA record? The easy answer is obvious: when the client has a public IPv6 address. In reality, however, it’s not quite as simple as that. As we discovered above, different operating systems, and even different clients within said operating systems, behave differently.

We found that Windows (7/8/10) and OS X all start requesting AAAA records as soon as they have an IPv6 address that isn’t a link-local or Teredo address. This means that even if you haven’t enabled IPv6 on the WAN side just yet, you will still see a drastic increase in RDNSS load when you upgrade your CPEs’ firmware to support IPv6, if they start handing out Unique Local Addresses (ULAs).

Oddly enough, Android itself and Chrome on any OS do not seem to request AAAA records when only presented with a ULA address. To me this seems broken, as routable networks can be built using only ULA addressing, and thus could quite feasibly want forward DNS entries, i.e. just like an IPv4 LAN using RFC1918.

 

Deploying IPv6 – The Residential ISP’s Challenges

There’s a lot of support for deploying IPv6; no one is really saying that you shouldn’t, but the lacklustre uptake from large eyeball ISPs tends to grate on the most vocal IPv6 evangelists. Rest assured, however, that most of the larger ones have been silently working away on this for some time, making their large-scale deployments to the mass market as seamless as possible. Those smaller ISPs that lack budget, or those that are flush with IPv4 space, are happy to ride it out and let others deal with the 0-day vendor bugs. That’s fine: with the already tight margins in this market, why should they spend money when they don’t need to?

Anywho, I decided to document some of the issues we’ve faced and had to overcome over the past few years as we descend over the precipice of our residential mass-market IPv6 rollout. This is by no means a complete list, just a few of the interesting ones that spring to mind.

Authentication

PPPoE with CHAP for authentication will continue to be fine.  Your PPP session establishes as normal, IPv6CP negotiates link-local addressing, and then DHCPv6 hands out an IA_NA and/or IA_PD over the top.

With IPoE, on the other hand, authentication is typically done when the BNG receives a DHCPv4 DISCOVER or DHCPv6 SOLICIT. It’s done with what’s colloquially referred to as “port-based authentication”, using either the Circuit-ID or Remote-ID that’s inserted by the DHCP Relay Agent on the Access Node, i.e. the DSLAM or OLT. In DHCPv4 land this is often referred to simply as “Option 82”, with Circuit-ID being sub-option 1 and Remote-ID sub-option 2.

First issue: not all Access Nodes, or their service providers, support DHCPv6 LDRA insertion of the Remote-ID or Interface-ID (the v6 equivalent of Circuit-ID). Openreach is our primary example here in the UK: although their Huawei MSANs do actually support it, Openreach don’t. The impact is that you’re reliant on DHCPv4 for gleaning this information, and can’t go single-stack native IPv6 just yet. Doesn’t seem like that big a deal, right? Which brings us to the second issue:

In lieu of having the native Remote-ID/Interface-ID, some BNGs can attribute the v6 session to the same subscriber as the v4 one if the DHCPv6 SOLICIT is received within a few seconds of the DISCOVER. A nice wee kludge that works when a CPE is freshly booted, but it can fall out of sync depending on timers, if v6 is enabled after the CPE is already online, or due to other CPE quirks. If this happens, the CPE’s existing PD may become non-routable, blackholing the end user’s traffic, or the CPE just won’t get a new PD.

OK, so you own your own access network and your DSLAM/OLT supports the Lightweight DHCPv6 Relay Agent to insert a Remote-ID. Great, we now know who sent us that DHCPv6 SOLICIT. Except that the DHCPv4 Option 82.2 Remote-ID is different to the DHCPv6 Option 37 Remote-ID: RFC4649 prepends an extra 4 bytes to the front of the Remote-ID to include the IANA-registered enterprise number of the relay agent vendor. Now you’re going to need extra RADIUS logic to strip off those first 4 bytes; to do that it needs to be able to reliably identify an IPv6-triggered Access-Request, which can be a challenge in itself as the differently formatted Remote-IDs get inserted into the same RADIUS attribute, the Broadband Forum’s “Agent-Remote-ID”.
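
The stripping itself is trivial once you know which family triggered the request; here is a minimal sketch of the idea (the surrounding RADIUS plumbing is vendor-specific and not shown, and the example values are made up):

    # Sketch: normalise a DHCPv6 Remote-ID (RFC 4649) so it matches the DHCPv4
    # Option 82 Remote-ID, by stripping the 4-byte enterprise-number prefix.
    import struct

    def normalise_remote_id(raw: bytes, is_ipv6_triggered: bool) -> bytes:
        """Return the bare remote-id, dropping the enterprise-number if present."""
        if is_ipv6_triggered:
            if len(raw) < 4:
                raise ValueError("Remote-ID too short to contain an enterprise-number")
            enterprise_number = struct.unpack("!I", raw[:4])[0]  # available for logging
            return raw[4:]
        return raw

    # Hypothetical example: remote-id "node42" prefixed with an enterprise number.
    v6_remote_id = struct.pack("!I", 3561) + b"node42"
    assert normalise_remote_id(v6_remote_id, is_ipv6_triggered=True) == b"node42"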

Resourcing

On a BNG, as on other routers, you have to keep a close eye on the resources being used. On the Alcatel-Lucent 7750-SR, each connected subscriber chews up what they call a “host resource”. As soon as you dual-stack a subscriber, that takes up another host resource, thus halving the total number of subscribers you can host on that one BNG.

If you hand out an IA_NA as well as an IA_PD, that uses up yet another host resource. To avoid this waste, you don’t have to assign public IPv6 point-to-point addressing for the CPE’s WAN interface (IA_NA); the BNG and CPE can just use link-local addressing to talk to each other.

Side note: In a later R12 release of SR OS it no longer uses a host resource each for IA_NA and IA_PD.

CPE

As we’re not allocating an IA_NA address for the WAN interface but rather using link-local addressing, the CPE no longer has a public IP address. Not really a major issue, apart from perhaps confusing some helpdesk agents or end users who can no longer ping their CPE as proof that their connection is up. We mitigate this with a small custom tweak on the CPE: it claims the first ::1 address from the PD and uses it as a loopback of sorts.
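
The tweak itself is nothing exotic; deriving that ::1 address from the delegated prefix is a one-liner with Python’s ipaddress module (shown here with a documentation prefix, not our real addressing):

    # Derive a "loopback of sorts" for the CPE: the first ::1 address out of the
    # delegated prefix (example uses a documentation prefix).
    import ipaddress

    delegated_pd = ipaddress.ip_network("2001:db8:abcd:1200::/56")
    cpe_address = delegated_pd[1]   # the ::1 address within the PD

    print(cpe_address)              # 2001:db8:abcd:1200::1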

For firewalling, we’ve chosen to follow the RFC6092 recommendations on CPE IPv6 security, which means we’ll be, by default, allowing all inbound IPsec but blocking all other unsolicited inbound traffic.

That poses an issue that we haven’t really resolved, and I’m unsure what the exact impact will be. As a result of NAT on IPv4, a lot of applications utilise UPnP to open up inbound ports for connectivity; this means not just a DNAT entry, but also an inbound firewall rule. Whilst we no longer need NAT with IPv6, that inbound firewall rule will still be required on CPEs that have a default-deny policy.

As mentioned in a previous post, there are new UPnP functions which allow for the dynamic creation of IPv6 firewall rules and are actually mandated in the IGD:2 specs. Sadly, not many CPEs meet the IGD:2 specifications yet, and even then it will require application developers to update their applications to make use of these new functions.

Another potential issue I foresee with firewalling is more of an end-user training one than a technical one. Most modern OSes will make use of privacy addressing: a method whereby an end host pseudo-randomly assigns itself temporary addresses to use for outbound connections and then deprecates them after a while, replacing them with new ones. The end result is that an end host will have a multitude of IPv6 addresses on an interface, including:

  • Link local addressing which will start with fe80::
  • Unique Local Address (ULA) which starts with fd
  • A static EUI-64 address based on the interface’s MAC address.
  • Several of the aforementioned Privacy Extension addresses. (Only 1 being used for new outbound connections, but possibly multiple deprecated addresses that were used for older flows)

Hopefully people will realise pretty quickly that the first two aren’t globally routable and aren’t to be used for inbound firewall rules. The issue is that the last two types of address are both assigned out of the same prefix handed out via RAs from the CPE, and aren’t instantly recognisable by their format. Thankfully, most OSes will include the word “temporary” next to the privacy addresses, which will hopefully steer end users towards the EUI-64 address for any IPv6 firewall rules they decide to enter manually.
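
For what it’s worth, telling the address types apart programmatically is straightforward; here is a rough sketch (the EUI-64 check keys off the ff:fe marker in the interface identifier, which is a heuristic rather than a guarantee):

    # Rough classification of the IPv6 addresses a host might hold.
    import ipaddress

    def classify(addr: str) -> str:
        ip = ipaddress.IPv6Address(addr)
        if ip.is_link_local:                          # fe80::/10
            return "link-local"
        if ip in ipaddress.ip_network("fc00::/7"):    # ULA range
            return "ULA"
        if ip.packed[11:13] == b"\xff\xfe":           # EUI-64 marker in the interface ID
            return "global (EUI-64, stable)"
        return "global (likely a temporary/privacy address)"

    for a in ["fe80::1", "fd12:3456:789a::1",
              "2001:db8::211:22ff:fe33:4455", "2001:db8::a1b2:c3d4:e5f6:789a"]:
        print(a, "->", classify(a))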

Right, that’s enough for now, and that’s just a small snippet focusing on the very end of the internet chain. Hopefully it helps some people, or gives others an idea of the kind of things they should be looking at when doing their own IPv6 deployment.

Apple and IPv6

Apple have recently made waves in the networking world by announcing that IPv6 support will be mandatory for apps in iOS 9. Now, this is not actually as scary as it sounds for app developers; all it really means is that they must avoid the use of IPv4 literals and make use of iOS’s high-level networking APIs. In this way, the network is abstracted from the developer and they don’t really need to care about the underlying transport; iOS will decide what to use. If only IPv4 is available, it will use that. If only IPv6, it will use that, although perhaps with the help of DNS64/NAT64 in the provider’s network if the content itself is IPv4-only. Things get more interesting when both IPv4 and IPv6 are available; the decision is then more complicated.

In dual-stack networks, you could be forgiven for thinking that iOS will make the “right” decision when presented with a dual-stack connection; unfortunately, real-world experience is not proving this to be the case. Of course, the “right decision” means different things to different people. For example, Facebook have recently shown that they see a significant performance increase (up to a 40% decrease in page load time) when using IPv6 on a major US mobile network. The interesting thing here, though, is that they also report that iOS will only choose the IPv6 path ~20% of the time. This seems to be because Apple are basing their decision on RTT, even though HTTP throughput is higher via IPv6. This has actually led Facebook to develop mobile proxygen so they have control over the protocol selection process.

So what are developers meant to do? How will they even know if their apps are IPv6 compliant? Fortunately, Apple have thought of this and are providing OS X with the ability to create an IPv6-only hotspot. As far as I know, the details of exactly how this works are not yet public but the best guess seems to be that they will present an IPv6-only connection and then use NAT64/DNS64 to provide connectivity to any content that may be IPv4-only.

This is all great news, but there is an elephant in the room: 464XLAT. This is a technique that combines existing translation mechanisms (RFC6146 and RFC6145) to provide IPv4 connectivity to devices connected to an IPv6-only network. It does so by providing a client-side IPv4 CLAT: traffic is allowed to enter as IPv4 and is then mapped to IPv6 for transport across the IPv6-only network. This is great for the mobile world; it means UEs (User Equipment) that support 464XLAT can be connected to operators’ IPv6-only networks without fear of poorly coded apps breaking. The UE itself “fakes” IPv4 connectivity towards any applications that can’t live without it. So why have Apple stubbornly refused to support 464XLAT, and why do they show no sign of changing this stance?

It is important to remember that 464XLAT solves a very specific problem: IPv6-only networks, even with DNS64/NAT64, cannot achieve parity with IPv4-only networks when some apps break on an IPv6-only network. This typically happens because the app is using IPv4 literals or making use of legacy IPv4-only APIs. 464XLAT gets around this by presenting IPv4 connectivity so these apps can continue to work. Because of this, major mobile networks (T-Mobile US being a prime example) have been able to provide IPv6-only service to their Android users. But Apple are in a unique position: they exercise great control over their store, and apps must meet certain requirements to gain entry. One of these requirements will now be that the app *must* work on an IPv6-only network. If developers have coded apps in such a way that they break without IPv4, they will be required to fix this for iOS 9. This is an excellent move; just like so many things before it, Apple are consigning “IPv4-only” to the history books.

There is still plenty more for Apple (and the rest of the industry) to do regarding IPv6, but I can’t help being optimistic after hearing this announcement. I hope this move really does help weed out legacy code in apps, and that Apple listen to the experience gained by Facebook, and others, around improvements to protocol selection. I guess we will find out over the coming months.

Segment Routing, the VxLAN for Service Providers

Software Defined Networking is inarguably the most popular and simultaneously least favourite buzzword within the network world over the last few years.

VxLAN is a protocol that’s used to get end-to-end connectivity across whatever layer 3 jungle may lie in the middle, and it has been quickly adopted to provide SDN solutions within enterprise data centre and hosting environments. Service providers and carriers have less of a reason to use it, as we already have MPLS, pseudowires and VPLS, right?

Typically an MPLS network will use one of two signalling protocols (sometimes both), LDP or RSVP-TE, to build label switched paths from one PE to another. The network engineer will then either tweak the IGP metrics to steer the LDP LSPs around, or build stitched up RSVP-TE LSPs with strict or loose hops.  RSVP-TE LSPs are pre-signalled, sometimes even with multiple backup FRR paths also pre-signalled.  LDP solves the manual configuration overhead of creating these LSPs everywhere by leaving the forwarding decisions up to the IGP, but it doesn’t solve the scalability issues of having a fully meshed network with bidirectional LSPs between every PE.

Some providers, and even Facebook, have offloaded this route calculation to dedicated Path Computation Element (PCE) platforms that can build traffic engineered paths based on a wider view of the network topology, QoS awareness, as well as any other variables such as scheduled outage windows fed in from a change control system.  Pretty neat stuff, and pretty close to service provider SDN you could say.

You could also say that Alcatel-Lucent’s 5620 SAM platform, or Huawei’s U2000, have been doing SDN-style network programmability for years by automating the end-to-end service build. Great if you have a ubiquitous single-vendor network and all you care about is static service provisioning.

The downside to both of these solutions is that they’re still reliant on the in-band control planes, and with RSVP-TE there’s a lot of state to keep those LSPs standing.

So here’s where the beauty of Segment Routing comes in. SR uses your current IGP as a label distribution protocol, albeit with some extensions. The data-plane concepts of label switching that you already know still remain; however, the transit LSRs in the middle don’t need to know about all the LSPs that pass through them. The LSPs aren’t pre-signalled; they’re not really signalled at all.

If the ingress PE (A) doesn’t care about the path to the egress PE (Z), it can just push Z’s node label onto the stack and send the packet on its way, letting the IGP topology decide the best path. However, if A wants a specific path to Z, it can stack a series of labels, one for each explicit hop, onto the payload. These segment labels can either be a node label, to define an LSR hop, or an adjacency label, to specify a connected interface on an LSR. So there we have both the simplicity of LDP and the traffic engineering of RSVP-TE, but without the signalling overhead.
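
As a mental model (nothing vendor-specific, and the SID values are made up), the “signalling” the ingress PE needs for an explicit path really is just an ordered list of labels:

    # Toy model of a Segment Routing label stack: an explicit path is just an
    # ordered list of segments pushed at the ingress PE. SID values are made up.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Segment:
        label: int
        kind: str   # "node" (reach this LSR) or "adjacency" (use this specific link)

    def build_stack(path: List[Segment]) -> List[int]:
        """Return the MPLS labels to push, top of stack first."""
        return [seg.label for seg in path]

    # A wants a specific path to Z: via node B, then a particular link on C, then Z.
    explicit_path = [
        Segment(16005, "node"),       # node SID of B
        Segment(24001, "adjacency"),  # adjacency SID for one of C's links
        Segment(16010, "node"),       # node SID of Z
    ]
    print(build_stack(explicit_path))              # [16005, 24001, 16010]

    # If A doesn't care about the path, Z's node SID alone is enough:
    print(build_stack([Segment(16010, "node")]))   # [16010]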

That’s the basic gist of SR, and I won’t go into more detail about SR itself as there are plenty of presentations and write-ups that do it better than I would, e.g. http://packetpushers.net/introduction-to-segment-routing/, or this very good explanation of the problems that led to SR by one of the draft’s authors, Rob Shakir.

By using the existing IGP and removing the requirement for an extra LSP signalling protocol, Segment Routing lends itself perfectly to providing scalable and dynamic transport over an MPLS network. The ability to push a stack of labels to an ingress PE on the fly, with no control-plane overhead, whilst still utilising MPLS switching for the data plane is, in my opinion, quite a nifty thing.

Which leads me back to my initial analogy.  VxLAN allows endpoints to dynamically build layer 2 networks across a routed transport network, without the routed network needing to know.  This abstraction of network creation from the underlay network is what enables cloud hosting providers to build your on-demand virtual host farm at the click of a button.

Segment Routing will enable similar functionality in the MPLS world, providing scalable, dynamic and on demand route orchestration for both layer 3 and layer 2 services.

CGN vs IPv6

There are two rather big misconceptions I often see mentioned when it comes to the IPv6 versus Carrier Grade NAT debate.

Firstly, that CGN and IPv6 are mutually exclusive. Some folks seem to think that it’s an either/or choice when it comes to deploying these technologies, rather than having them work alongside each other in parallel.

IPv4 isn’t going away any time soon; CGN should be treated as a transition mechanism, not as a way to procrastinate on deploying IPv6. CGN is capex-intensive, requiring special hardware to perform the NAT and keep track of the translation state table. This CGN hardware is typically going to be limited by the number of flows and the bandwidth throughput, i.e. how many customers and how much they’re using. Providing an IPv6 prefix delegation dual-stacked alongside a private CGN IP allows native IPv6 traffic to bypass these resource-intensive (and expensive) CGN gateways, meaning more capacity out of them, better IP-saving efficiency and, most importantly, a better user experience.

ISPs don’t want to deploy CGN: it costs us money, our customers don’t like it, and that’s not to mention the compliance issues involved. However, until the IPv4 internet gets turned off, we’re going to need some form of CGN, NAT64, 464XLAT, 4rd, etc.

The second misconception is that IPv6 is a silver bullet, solving all of our NAT issues and providing ubiquitous end-to-end connectivity.

I can’t imagine that many ISPs are going to deploy IPv6 to their end users with the CPE firewall disabled by default, so there goes that theory. If you’re lucky, your ISP will provide a CPE that has a nice web UI with an IPv6 firewall that allows you to open up ports to the public IPv6 addresses on your LAN. If you’re really lucky, your CPE will allow IPsec to your end hosts by default, as per RFC6092. But what about users who don’t want to, or don’t know how to, configure manual firewall rules, let alone know that they should use the EUI-64 address instead of the privacy-extension addresses? What happens if the dynamic PD changes: do they have to log in and update the CPE firewall rules each time?

To make sure IPv6 doesn’t negatively impact the user experience, we need some sort of dynamic firewall and port-forwarding mechanism similar to what UPnP has provided us in IPv4. Well, there is a UPnP function to allow this, but it’s a (relatively) new one specified in IGD:2 that requires both CPE and application support. I don’t know the stats on how widely this is supported by residential CPEs, but my personal experience has indicated “not very”, and even where it is, it requires every application that currently does an IPv4 AddPortMapping() request to do a new IPv6 AddPinhole().

Thankfully, as of 30th March 2015 the UPnP Forum has deprecated IGD:1, so hopefully we will see a bit more uptake of IGD:2 amongst the commodity CPE vendors.

As a side note, the IGD:2 spec also includes an AddAnyPortMapping() function that allows a CPE to return a different port to the one requested, in case the original is in use or otherwise unavailable. This would be great for CGN deployments that incorporate PCP to allow for dynamic port forwarding, but once again it requires application developers to support the new function, as well as a UPnP-PCP Interworking Function on the CPE.