Tuesday, October 11, 2011

Long-Distance vMotion: Updates

This is an update to the Long-Distance vMotion series I did earlier this year. If you wish to read it all, start with Long-Distance vMotion: Part 1.

The problem with blogging about technology, techniques and architectures is that they change. Sometimes that change is rapid, sometimes it takes time over major releases. In a more converged world where multiple components interact, they can change quite rapidly.

Since writing my Long-Distance vMotion (LDVM) series, there have been some changes. Herein lies the dilemma: do I go back and change the old articles, or do I post an update like this new entry? I could add a section to the blog with the latest analysis, called Long-Distance vMotion. Part of me feels I should leave old posts unchanged (except for correcting typos and erroneous information). The other approach would be to change the old articles, preserving the search engine entry points that currently send people to them – readers wouldn’t have to go to another place in the blog for the latest updates. I can post-date new entries, but I can’t post-date new information. Which is the best approach? Let me know what you think.

Monday, August 29, 2011

vBlocks and FlexPods: is this Coke v Pepsi?

When Cisco came out with their UCS servers, I was impressed. They took the Nexus FCoE switches and built them into a whole new thing, the UCS: FCoE, service profiles and an expandable, distributed blade server model. What really makes sense is the bottom line: you can save real money by deploying them over traditional blade or standalone servers. They save money on price per port and by not having to buy additional switches for every 14-16 servers, as in a traditional blade enclosure. They simplify rapid deployment of servers. They allow moving workloads to new blades without having to rebuild them.

Converged networks suddenly start to make sense with the Cisco UCS. You begin to see Cisco’s master plan in action. It’s not just FCoE in a switch, but a whole system built around best practices: FCoE, boot from SAN, etc. The biggest immediately obvious gain is service profiles: VMware abstracts servers; service profiles sit one layer lower, virtualizing the hardware that VMware is built upon. Firmware, UUIDs, WWPNs, MAC addresses, everything is abstracted. It takes things one step farther than HP Virtual Connect.
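To make that concrete, here is a rough sketch in Python of what a service profile bundles together. This is not Cisco’s actual API or object model – the class, field names and identifiers are made up for illustration – but it shows the idea: identity and configuration live in the profile, so a workload moves to a new blade by re-associating the profile rather than rebuilding the server.

```python
# Illustrative sketch only: not Cisco's UCS object model, just the concept of
# a service profile as an identity bundle that can be moved between blades.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ServiceProfile:
    name: str
    uuid: str                                        # server identity, drawn from a pool
    wwpns: List[str] = field(default_factory=list)   # FC identities for boot-from-SAN zoning
    macs: List[str] = field(default_factory=list)    # NIC identities, preserved across blades
    firmware_policy: str = "default"                 # firmware pinned by policy, not by blade
    boot_policy: str = "boot-from-san"
    associated_blade: Optional[str] = None           # the only thing that changes on a move

    def associate(self, blade_id: str) -> None:
        """Bind this identity bundle to a physical blade."""
        self.associated_blade = blade_id

# Moving a workload: the same identities simply land on different hardware.
profile = ServiceProfile(name="esx-host-01",
                         uuid="0000-0000-0001",
                         wwpns=["20:00:00:25:b5:00:00:01"],
                         macs=["00:25:b5:00:00:01"])
profile.associate("chassis-1/blade-3")
profile.associate("chassis-2/blade-5")   # same UUID, WWPNs and MACs, new blade
```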

Tuesday, August 9, 2011

Cloud #fail

This is a very brief post on today’s cloud computing failure. I hope to have a guest writer post something better and lengthier in the future.

I’ve been partaking in some discussions among peers about today’s Amazon EC2 cloud outage – again. I’ve been listening to people say the cloud isn’t ready or is a bad idea. The cloud is the cloud, and it continues to be a great decision for a lot of people where it makes sense. The real failure is abandoning IT best practices when going to the cloud and relying on a single system or provider.

When we design for disaster recovery or business continuity, we usually design in redundant, diverse data paths to the secondary data center with carrier diversity (meaning more than one carrier). When going to the cloud, if you’ve decided to outsource everything, you should continue that diversity with multiple cloud providers and the resiliency to be able to use either. Failing to provide cloud diversity is the same as having one datacenter: you’ve got all your eggs (IT) in one basket.

When going to the cloud, you should either have a hybrid private/public cloud with redundancy, or two public cloud providers with diversity. Those who stray from IT best practices will pay the price – on Twitter you’ll get the dreaded #fail associated with your name.
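As a minimal sketch of what provider diversity can look like at the application edge, here is a Python snippet that prefers a primary provider and fails over to a second one when a health check fails. The endpoint URLs and the /healthz path are hypothetical placeholders, and a real deployment would use proper DNS or global load-balancer failover rather than an in-app loop; this just illustrates the principle.

```python
# Sketch of provider diversity: try the primary cloud endpoint first, fall back
# to a second provider when health checks fail. URLs and paths are placeholders.
import urllib.request

PROVIDERS = [
    "https://app.provider-a.example.com",   # e.g., public cloud A
    "https://app.provider-b.example.com",   # e.g., public cloud B or a private cloud
]

def healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the provider answers its health-check URL."""
    try:
        with urllib.request.urlopen(base_url + "/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def pick_endpoint() -> str:
    """Prefer the first healthy provider; raise if all are down."""
    for url in PROVIDERS:
        if healthy(url):
            return url
    raise RuntimeError("all providers failed health checks")
```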

Tuesday, July 12, 2011

Long-Distance vMotion: Part 3

This is the third part in a multi-part series on the topic of Long-Distance vMotion. Part 1 introduced us to disaster recovery as practiced today and laid the foundation to build upon. Part 2 built out the Long-Distance vMotion architecture with a few different approaches.

There are some limitations and challenges that we must consider when designing LDVM or other workload mobility technologies. If it were easy, everyone would be doing it. The first area we’ll address is commonly referred to as the traffic trombone.

Traffic Trombone

Understanding what I mean by a traffic trombone requires a bit of visualization. When we have one site, everything is local: my storage is connected within my datacenter, and my other subnets are routed by a local router. The path to these devices is very short, measured in meters, so latency is very small. When we migrate VMs to another datacenter, the VMs have moved, but the network and storage traffic continue to go back to their original router and storage controllers unless we take some extra measures. When we send a packet or a read/write, it goes back to the original datacenter, gets serviced, then returns to the new datacenter where we’re now running. That back-and-forth is what we refer to as tromboning, hence the traffic trombone. (My PowerPoint presentation drives this home.) I’ll address this in two parts: network and storage.
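To put rough numbers on the trombone, here is a back-of-the-envelope calculation in Python. The distance and I/O counts are assumptions purely for illustration; the only physics in it is that light in fiber propagates at roughly 200 km per millisecond, so every 100 km adds about a millisecond of round-trip time.

```python
# Back-of-the-envelope arithmetic for the traffic trombone. Distance and I/O
# pattern below are example assumptions; substitute your own numbers.

def trombone_rtt_ms(distance_km: float) -> float:
    """Approximate round-trip time added by hair-pinning back to the original site."""
    speed_km_per_ms = 200.0          # ~2e5 km/s propagation in fiber
    return 2 * distance_km / speed_km_per_ms

def added_storage_latency_ms(distance_km: float, ios_per_transaction: int) -> float:
    """Extra latency per transaction when every read/write trombones home."""
    return trombone_rtt_ms(distance_km) * ios_per_transaction

# Example: a VM moved 100 km away whose transactions each issue 20 synchronous
# I/Os against storage left behind in the original datacenter.
print(trombone_rtt_ms(100))                 # ~1.0 ms per round trip
print(added_storage_latency_ms(100, 20))    # ~20 ms added per transaction
```

Those milliseconds look small on paper, but stacked on top of every routed packet and every storage I/O they add up quickly, which is why the trombone deserves its own design discussion.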

Sunday, June 5, 2011

Long-Distance vMotion: Part 2

This is a rapidly changing field and there have been new updates. Please see Long-Distance vMotion: Updates for the latest changes.

This is the second part in a multi-part series on the topic of Long-Distance vMotion. Part 1 introduced us to disaster recovery as practiced today and laid the foundation to build upon.

When building out Long-Distance vMotion (LDVM), we still need to focus on the same components we focused on when building out disaster recovery. We will take the leap from a recovery time of five minutes to a continuous, non-disruptive operation. We’ll need to change our network from two different subnets in two different datacenters to one stretched subnet. We’ll need to take our mirrored storage and create what I call a single shared storage image. Last, we’ll get rid of Site Recovery Manager (SRM) and replace it with a VMware vSphere split cluster.
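As a quick sanity check on those building blocks, here is a small Python sketch that tests a design against the basics: inter-site latency, a stretched layer-2 network and a single shared storage image. The latency thresholds are the commonly cited figures of the day (roughly 5 ms round-trip for standard vMotion, about 10 ms with Metro vMotion in vSphere 5); treat them as assumptions and verify against the versions you are actually running.

```python
# A quick feasibility check for the LDVM building blocks described above.
# Thresholds are assumptions based on commonly cited vSphere figures of the era.

def ldvm_feasible(rtt_ms: float,
                  stretched_layer2: bool,
                  shared_storage_image: bool,
                  metro_vmotion: bool = False) -> list:
    """Return a list of blockers; an empty list means the basics are in place."""
    blockers = []
    limit_ms = 10.0 if metro_vmotion else 5.0
    if rtt_ms > limit_ms:
        blockers.append(f"round-trip latency {rtt_ms} ms exceeds {limit_ms} ms limit")
    if not stretched_layer2:
        blockers.append("VM subnet is not stretched across both datacenters")
    if not shared_storage_image:
        blockers.append("no single shared storage image presented to both sites")
    return blockers

print(ldvm_feasible(rtt_ms=7.5, stretched_layer2=True,
                    shared_storage_image=True, metro_vmotion=True))  # []
```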

Tuesday, May 3, 2011

Long-Distance vMotion: Part 1

This is the first part in a multi-part series on the topic of Long-Distance vMotion. I am currently architecting and building this out for a few of my customers.

Long-Distance vMotion (LDVM) is the holy grail of business continuity – the ability to migrate workloads across data centers or in and out of clouds with no disruption of service and zero downtime. When I started consulting in the 1990s, after several years as a software developer, I was a high availability clustering consultant, among other things. Later I architected geographic clusters, but one thing was certain: they were very expensive, complex in architecture and difficult to manage.

Long-Distance vMotion attempts to tackle one issue: business continuity. Let’s face it, disasters are rare. I know there are earthquakes, tornadoes, hurricanes, floods and other bad things that happen. In my many years of consulting, these have rarely happened to my customers. I have two customers that have had their storage ruined, each by their own fire suppression system failing and pouring water onto their equipment. These disasters, although rare, do happen, and they must be planned for. It’s risk mitigation, a business decision that doesn’t come for free.

Wednesday, April 6, 2011

The Empire Strikes Back

I was getting ready to write EMC off, at least in the mid-tier. The Clariion was old-tech, and an old way of doing things. They screamed unified, but it didn’t feel that way. Celerra in the NS/NX felt like a bolt-on. They were expensive, fragmented and difficult to work with.

EMC had been making a number of good buys over the past couple of years. RSA, VMware, Kashya and Data Domain come to mind. Avamar and YottaYotta were lesser-known pieces. When it came to primary storage, however, it seemed stale. Then they started showing the cards they were holding.

First came the VMAX. They refreshed the Symmetrix line with a modular, scalable architecture. It could grow from something small to something big. But the real changes started coming with FAST and FAST VP.
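For readers new to auto-tiering, here is a conceptual sketch in Python of the sub-LUN tiering idea behind FAST VP: score each extent by recent I/O activity and place the hottest extents on the fastest tier. To be clear, this is not EMC’s actual algorithm, and the heat numbers and tier capacities are made up; it just illustrates promotion and demotion by “heat”.

```python
# Conceptual sketch of sub-LUN auto-tiering (not EMC's implementation): rank
# extents by recent I/O activity and fill the fastest tiers first.

def place_extents(extent_heat: dict, tier_capacity: dict) -> dict:
    """Assign extents to tiers, hottest first, until each tier fills up.

    extent_heat:   {extent_id: io_count_over_window}
    tier_capacity: {"ssd": n_extents, "fc": n_extents, "sata": n_extents}
    """
    placement = {}
    tiers = ["ssd", "fc", "sata"]            # fastest to slowest
    remaining = dict(tier_capacity)
    for extent, _heat in sorted(extent_heat.items(), key=lambda kv: kv[1], reverse=True):
        for tier in tiers:
            if remaining.get(tier, 0) > 0:
                placement[extent] = tier
                remaining[tier] -= 1
                break
    return placement

heat = {"ext-1": 9000, "ext-2": 40, "ext-3": 1200, "ext-4": 3}
print(place_extents(heat, {"ssd": 1, "fc": 2, "sata": 10}))
# {'ext-1': 'ssd', 'ext-3': 'fc', 'ext-2': 'fc', 'ext-4': 'sata'}
```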

Monday, January 17, 2011

I’ve Got The Remote Replication, Single-Storage Image MPIO Blues

There are not a lot of customers I meet that don’t want some form of replication to a disaster recovery/colocation facility. What used to be financially unreachable has, over the past 10-15 years, come down in price to be affordable for most businesses. Remote replication, coupled with VMware or one of the other hypervisors providing server virtualization, has made recovery quick, easy and within budget.

So as I look at some of the new storage systems being released lately, I’m scratching my head. Why would an affordable small-to-medium-business mid-tier storage system provide only Fibre Channel-based replication – today?

Monday, January 10, 2011

The Storage Evolution Part 2: Deduplication

This is part 2 in the Storage Evolution series.

When we created SANs, we stored more and more data, retrieved at higher and higher speeds, and it was good. Then we added advanced functionality like creating copies (clones, snapshots, etc.) quickly, and it was good. Then development wanted six copies of that production database, one for each developer, plus test, QA and staging. We needed a daily clone for the data warehouse and business intelligence. We virtualized our servers, we booted from SAN, and some of us ditched the desktops and workstations, opting for Citrix and VDI. We were making copies of the same data, over and over and over again. And at the 11th hour of the second half of the day, we backed it all up.

And it just got worse.
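Deduplication is the answer to all those repeated copies, and the core idea fits in a few lines. Here is a minimal Python sketch: split data into fixed-size chunks, fingerprint each chunk, and store only the unique ones plus a recipe of references. Real systems add stronger collision handling, variable-size chunking and on-disk indexes; this only illustrates the principle.

```python
# Minimal sketch of block-level deduplication: store unique chunks once and
# rebuild logical copies from fingerprints. Illustration only.
import hashlib

CHUNK_SIZE = 4096

def dedup_store(data: bytes, store: dict) -> list:
    """Store data as chunk fingerprints; return the recipe to rebuild it."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)      # only unique chunks consume space
        recipe.append(digest)
    return recipe

def rebuild(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from its chunk fingerprints."""
    return b"".join(store[d] for d in recipe)

store = {}
copy1 = dedup_store(b"same production database" * 1000, store)
copy2 = dedup_store(b"same production database" * 1000, store)   # second copy adds no chunks
print(len(store), "unique chunks for 2 logical copies")
```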

The Tale of Tape

We had a love/hate relationship with tape. We loved its density and its streaming speed; we hated its bulk, daily off-site management, library management, and load and seek times. But it was cheap, and the alternatives were expensive.