Tuesday, July 12, 2011

Long-Distance vMotion: Part 3

This is the third part in a multi-part series on the topic of Long-Distance vMotion. Part 1 introduced us to disaster recovery as practiced today and laid the foundation to build upon. Part 2 built out the Long-Distance vMotion architecture with a few different approaches.

There are some limitations and challenges that we must consider when designing LDVM or other workload mobility technologies. If it were too easy, everyone would be doing it. The first area we’ll address is commonly referred to as the traffic trombone.

Traffic Trombone

Understanding what I mean by a traffic trombone requires a bit of visualization. When we have one site, everything is local: my storage is connected within my datacenter, and my other subnets are routed by a local router. The path to these devices is very short, measured in meters, so latency is very small. When we migrate VMs to another datacenter, the VMs have moved, but unless we take extra precautions, the network and storage traffic continues to flow back to the original router and storage controllers. When we send a packet or a read/write, it travels back to the original datacenter, gets serviced, then returns to the new datacenter where we're now running. That back-and-forth is what we refer to as tromboning, hence the traffic trombone. (My PowerPoint presentation drives this home.) I'll address this in two parts: network and storage.
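To get a feel for the cost of that back-and-forth, here's a rough back-of-the-envelope sketch. The distances and the ~200 km/ms fiber propagation figure are illustrative assumptions, not measurements from any particular deployment:

```python
# Rough estimate of the extra latency a "tromboned" packet pays.
# Assumption: light in fiber travels roughly 200 km per millisecond
# (about two-thirds of the speed of light in a vacuum).

FIBER_KM_PER_MS = 200.0

def trombone_penalty_ms(site_distance_km: float) -> float:
    """Extra round-trip time added when traffic hairpins back to the
    original site: out to the old site and back again."""
    return 2 * site_distance_km / FIBER_KM_PER_MS

for km in (10, 50, 100):
    print(f"{km:>4} km between sites -> +{trombone_penalty_ms(km):.1f} ms per round trip")
```

A fraction of a millisecond per round trip sounds small, but it applies to every packet and every I/O, which is why eliminating the trombone matters for chatty workloads.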

When addressing the IP network, the first thing I'll say about the network trombone is that it is actually desirable for existing stateful connections (TCP): we want those connections to stay alive without disconnecting. For all new connections, we'd like to optimize the path to go through the local site. Path optimization breaks down into two cases: traffic coming into my subnet and traffic leaving it. For inbound traffic from remote subnets, Cisco GSS and ACE, with vCenter awareness, point clients to the site where the service is currently running: GSS resolves to the external ACE VIP at the site where the workload currently lives. For traffic leaving our subnet, we use HSRP default-gateway localization, forwarding traffic to the local Cisco ACE device for processing. This helps preserve symmetric routing so our firewalls don't drop our packets thinking something has gone wrong.
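As a very rough sketch of what gateway localization looks like, both sites configure the same virtual gateway address, and HSRP hellos are filtered at the datacenter interconnect so each site's router answers for the gateway locally. The interface, addresses, and group numbers below are hypothetical, and real deployments need the accompanying filtering and the site-specific details:

```
! Hypothetical sketch only -- addresses and group numbers are made up.
! The same standby (virtual gateway) IP is configured at both sites;
! HSRP control traffic is filtered at the DCI so each site's router
! becomes active locally instead of tromboning to the other site.
interface Vlan100
 ip address 10.1.100.2 255.255.255.0
 standby 100 ip 10.1.100.1
 standby 100 priority 110
 standby 100 preempt
```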

An alternative emerging technology is the Locator/ID Separation Protocol (LISP). LISP works both outside our subnet, pointing remote traffic to the correct site, and within it, pointing outbound traffic to the correct local default gateway. Think of LISP in terms of cell phones. We used to have phone numbers that were tied to a specific cell-phone provider; when we switched carriers, we needed to get a new phone number. Number portability untied our phone numbers from a particular provider: the number now goes with us and points to whatever company we switched to. LISP does a very similar thing for IP addresses: it lets us take an address with us to a new site and points to where we now live. LISP is available on certain blades in the Nexus 7000, and is also being ported to other products.

When addressing the storage area network, some products are already tackling this problem. EMC VPLEX and NetApp FlexCache each present their volumes locally, without MPIO paths extending across the datacenters, eliminating any traffic trombones. With the IBM SVC and NetApp MetroCluster varieties, since they are split-controller designs, the MPIO paths will be active to one datacenter and passive to the other. When the VMs move to the other site and back, one of those two sites will trombone traffic back to the primary controller for that volume, adding latency to the I/Os. In the case of SVC, we can only go campus distances today (< 10 km) anyway, so the distance and latency are pretty short. In the case of NetApp, we need to stay within synchronous distances, but my MetroCluster customers haven't seen an adverse impact on their I/O. Of course, your mileage will vary depending on how hard you stress your storage.

It is always important to have competent network, storage and virtualization architects take all latency, routing and cluster impacts into account for a successful implementation. Excessive network or storage traffic needs to be understood, and QoS should always be applied. A vSphere architect can help design how the cluster will be laid out, taking DRS and fault-tolerant VMs and their associated traffic into account.

Future Directions and My Wish List

While EMC VPLEX and NetApp FlexCache have node redundancy at each site, IBM SVC and NetApp MetroCluster do not. This has scared one or two customers away from those solutions, and it can make code upgrades require a bit more planning. I'd like to see NetApp and IBM come up with solutions that have node redundancy at each site. NetApp's evolving cluster mode for block storage (FC/FCoE, iSCSI) may provide some of this. IBM has other technologies in the DS8000 base, such as Open Systems HyperSwap, that could hold promise here.

When it comes to the storage trombone, both IBM and NetApp need to eliminate all storage trombones, and IBM needs to go beyond 10 km. This could come from more intelligent multipath drivers that plug into vCenter and automatically swap active/passive paths as the VMs move.

VMware vSphere itself needs some work to go beyond its current limit of 5 ms RTT, or about 400 km under perfect conditions. This is one of the biggest limitations of the technology today. At this year's EMC World, EMC announced VPLEX Geo for regional protection, moving Microsoft Hyper-V VMs about 3,000 km. EMC's VPLEX roadmap goes around the world (NetApp FlexCache already does this) with VPLEX Global.
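The 5 ms figure translates to distance in a straightforward way. Assuming roughly 5 microseconds of one-way propagation per km of fiber (an illustrative rule of thumb, not a vendor spec), propagation alone caps the distance around 500 km; switches, routers, and non-straight fiber routes eat into the budget, which is why the practical figure cited is closer to 400 km:

```python
# Back-of-the-envelope check of the 5 ms RTT budget.
# Assumption: ~5 microseconds of one-way fiber propagation per km,
# i.e. 0.01 ms of round-trip time per km of site separation.

RTT_BUDGET_MS = 5.0
RTT_MS_PER_KM = 0.01  # round trip: 2 * (1 km / 200 km-per-ms)

max_km = RTT_BUDGET_MS / RTT_MS_PER_KM
print(f"Theoretical maximum fiber distance at {RTT_BUDGET_MS} ms RTT: ~{max_km:.0f} km")
# Real deployments land well under this once equipment latency and
# the actual fiber route are accounted for.
```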


While I suddenly have a number of customers investigating and deploying Long-Distance vMotion, I do understand it is not an inexpensive solution. First we need a Metro Area Network (MAN) capable of 622 Mb/s or greater with under 5 ms of round-trip latency, and we'll need to transport the storage traffic as well. This doesn't come cheap, but depending on your options in your metro, the prices become more attainable every year.
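To put the 622 Mb/s figure in perspective, here's a rough sketch of what a single memory copy takes at that rate. The 8 GB VM size is a hypothetical example, and real vMotion times differ because memory is copied iteratively while the VM keeps dirtying pages:

```python
# Rough feel for what a 622 Mb/s link buys you during a vMotion.
# Hypothetical VM with 8 GB of memory to copy between sites.

LINK_MBPS = 622      # megabits per second
VM_MEMORY_GB = 8     # illustrative size, decimal units

bits_to_move = VM_MEMORY_GB * 8 * 1000   # GB -> megabits
seconds = bits_to_move / LINK_MBPS
print(f"~{seconds:.0f} s to copy {VM_MEMORY_GB} GB at {LINK_MBPS} Mb/s")
# Ignores protocol overhead and iterative pre-copy passes, so treat
# this as a lower bound on the actual migration time.
```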

I currently favor EMC VPLEX Metro and NetApp MetroCluster because both are validated, tested and referenced in VMware KB articles (see below), and I have customers deploying these solutions. IBM just doesn't go far enough, which is too bad, since I have a large SVC install base wanting this technology; however, the EMC solution can front-end an IBM SVC. I've been hard-pressed to find many NetApp FlexCache deployments outside of Hollywood. The technology has promise, but again, it is NFS only.

A lot of these same principles also apply to IBM PowerVM Live Partition Mobility, Microsoft Hyper-V Live Migration and Oracle VM for x86 Live Migration. Each hypervisor comes with its own limitations and peculiarities, so make sure you fully understand them before deploying. Most of these solutions can be deployed within the same infrastructure as LDVM.

This technology seems to have caught some of the storage vendors off guard. When I first presented this in March of this year, I had spent six months preparing for it, and I already had customers ready to deploy. Since March, I've run into around five more locally, and more nationally, looking at these solutions. As hardware and bandwidth prices fall, it will become as common as storage mirroring has become over the last decade.



  1. Reliance on data management systems has its own advantages and disadvantages. It may make our work faster, but there are also challenges in overcoming natural disasters and maintaining computer security. Even data systems such as Long-Distance vMotion need back-up systems in case emergency strikes.

    Mac Pherson

  2. Mac,

    I couldn't agree more; that's why we separate disaster recovery from business continuity. BC is Long-Distance vMotion, or workload mobility. We still use traditional SRM-type recovery for going beyond the local metro to preserve disaster recovery, which is specifically what you're talking about.

    You can never sacrifice disaster recovery, and LDVM doesn't aim to do that.