Thursday, December 30, 2010

The Storage Evolution Part 1: Virtualization

Before I talk about storage virtualization, I feel it's important to describe what I mean by it. Storage virtualization is the introduction of a layer of virtualization, or abstraction, when dealing with storage. What I'm not talking about is storage for server virtualization, or storage for VMware, Hyper-V, etc. While server virtualization and storage virtualization are very complementary, they are not the same thing.

Storage virtualization adds a pointer-based layer that abstracts the physical blocks from the logical blocks of a disk, LUN or other unit of storage. That abstraction gives us far more flexibility in how we allocate and move storage, make recovery points (snapshots, copies, etc.) and replicate (mirror) for business continuity or disaster recovery.
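
To illustrate the idea, here's a minimal sketch in Python (my own illustration, with made-up names, not any vendor's implementation) of that pointer-based mapping: the virtual volume is just a table from logical block addresses to physical locations, so data can be moved by copying blocks and flipping pointers rather than disturbing the host.

```python
# Hypothetical sketch of a pointer-based virtual volume: logical block
# addresses map to (physical device, physical block) pairs, so migration
# is a copy plus a pointer update rather than a host-visible change.

class VirtualVolume:
    def __init__(self):
        # logical block address -> (device name, physical block number)
        self.map = {}

    def write(self, lba, device, pba):
        """Point a logical block at wherever the array actually put it."""
        self.map[lba] = (device, pba)

    def migrate(self, lba, new_device, new_pba):
        """Non-disruptive move: copy the block, then flip the pointer."""
        # (the actual data copy would happen here)
        self.map[lba] = (new_device, new_pba)

    def read(self, lba):
        return self.map.get(lba)


vol = VirtualVolume()
vol.write(0, "old_array_lun1", 4711)
vol.migrate(0, "new_array_lun7", 12)   # the host never notices the move
print(vol.read(0))                     # ('new_array_lun7', 12)
```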

This can be accomplished in a couple of different ways. The first is what I generally call appliance-based, or out-of-the-box, virtualization: some type of engine, box or controller that sits in the middle as an appliance and adds a layer of virtualization to heterogeneous storage, virtualization with vendor independence. The second I define as virtualization of a single vendor's storage; I call it in-box, and it's commonly a homogeneous approach.

Nothing New

Virtualization is an overused word. Storage abstraction has been around for a long time. Logical volume managers, available from many vendors, aggregate or concatenate physical disks into volume groups that are then chopped up into logical volumes. Some vendors have offered this technology for over 20 years (even longer in the mainframe world).

With the prevalence of logical volume managers and Storage vMotion in ESX and vSphere, one could argue that storage virtualization isn't needed. But in my opinion there are too many benefits to ignore it.

Primary v. Secondary Storage and the SAN

Primary storage is where your data lives and is directly accessed on a daily basis; it's what you access primarily and continuously. Today this typically takes the form of spinning disk, although it is already being replaced in high-end storage tiers by solid-state disk. Secondary storage is where your data is archived to or recovered from. This can take the form of automated tape libraries (ATLs), virtual tape libraries (VTLs) or high-density, low-cost disk (without all that VTL business). (I call this rolling your own VTL.) Sometimes that low-cost disk is in large archives, such as content-addressable (or object-based) storage.

You may or may not have a Storage Area Network (SAN). If your drives are all in your servers, we storage guys call that direct-attached storage (DAS), and you may still reap some of those benefits with a logical volume manager. However, most of us storage guys focus on the SAN. It allows shared storage for clustering: server farms (or grids) like Citrix and vSphere (ESX), storage consolidation (less wasted space), snapshots/recovery points for rapid recovery and mirroring for business continuity. Some applications, like vSphere's Site Recovery Manager or high-availability clusters, wouldn't work without a SAN.

The SAN: A Little History

When SANs came onto the scene in the late 90s, they made all kinds of promises about saving space by consolidating all that direct-attached storage in one place. They did save space, but not as much as we would have liked. They were also expensive. But that expense was offset by performance. High-performance computing easily choked the disks in servers. SANs, with their large caches and many RAID arrays, were the only thing that could make those systems hum.

Snapshots and mirroring were also introduced. We could now take a frozen point-in-time copy to make our backups from. We could mirror that data to another set of disks (local or remote) as a crash-consistent state in case of a disaster. This did the same thing for storage that clusters did for servers: if something failed, we could recover quickly without having to spin a bunch of tape to get back up and running.

Performance was often limited to a specific RAID array, and moving data from one place to another was a time-consuming chore. You either needed an outage, or you had to have mirroring licenses to get everything up and synced, then take an outage to cut over to your newly mirrored copy. The field was ripe for storage virtualization.

Appliance-based Solutions

IBM practically invented the appliance-based primary storage virtualization field in the early 2000s. Their SAN Volume Controller (SVC) was a great leap forward in flexibility: it allowed better performance by aggregating many RAID arrays into a disk group, made movement between storage systems and tiers non-disruptive, enabled easy snapshots with FlashCopy, and provided local and global mirroring. I've used it personally to enable business continuity between sites. It's an elegant solution that's pretty simple to administer.

Hitachi came out a few years later with the USP-V. It provides much of the same physical-to-virtual abstraction, with the snapshots, mirroring and movement between storage tiers that the SVC provides.

EMC tried and failed to make a splash with a different out-of-band (switch-based) approach with Invista. After many technical problems and poor acceptance, it died on the vine. I was ready to write them off when they came late to the game with VPLEX (from their acquisition of YottaYotta). VPLEX brings in-band heterogeneous support and allows multi-site federation, meaning that both sites appear as one storage system. That enables things like long-distance vMotion. It's a promising technology and a bright spot for them.

NetApp's V-Series lets you put all the nifty whiz-bang features of Data ONTAP (their storage OS) in front of your existing other-vendor storage. It allows attachment of native NetApp storage shelves as well. I think NetApp offers amazing features and flexibility to small and mid-size customers, the bulk of the companies out there. They are finally getting the ability to non-disruptively move storage between tiers, something that has been a weak spot in their functionality. If they ever get their multi-controller, grid-based software running over Fibre Channel, it may truly change things.

I’m not a 3PAR expert and I’m sure there are other solutions I’m missing, but these are the market leaders so I’m focusing on them for now.

In-Box Virtualization

There are a number of solutions that take a somewhat different approach. Compellent, NetApp, IBM's XIV, EMC's VMAX and others offer virtualization in the box, meaning the firmware provides the abstraction, along with many of the features that go with it, in a homogeneous, single-vendor solution.

Compellent's claim to fame has been automatic storage tiering. They were the first to offer automatic migration between tiers. That once-unique feature is now being matched by IBM's Easy Tier (DS8000, V7000, SVC) and EMC's FAST (VMAX, CLARiiON); it's becoming a space filled by many. (Compellent is currently being acquired by Dell.)

NetApp has an elegant snapshot implementation with robust integrated application support. Their claim to fame is data deduplication; they are the only primary storage vendor that offers it at the block level. When they came out with dedupe, I thought everyone would have it within 2-3 years, but they remain the only game in town. And there's a reason: they use an internal 4K block size (owing to their UNIX underpinnings). The other vendors use a larger internal block, making dedupe problematic. NetApp sells a lot of storage on the strength of this unique offering, and they have a space-savings guarantee to back it up.

IBM's XIV is an interesting grid-based offering. Coupled with IBM's FlashCopy Manager (which also works with their other storage products), it offers application integration similar to NetApp's. XIV's claim to fame is a single storage tier: don't worry about it, the caching algorithms take care of all the work for you. It works well for most, but not all, workloads. It's another promising technology that I expect will mature nicely.

EMC's VMAX offers virtualization with a strong emphasis on virtual (thin) provisioning and automated tiering. The VMAX is the Intel-based DMX replacement that scales out (as opposed to scaling up) as you build it. EMC's FAST (Fully Automated Storage Tiering) is available on both the VMAX and the CLARiiON and is very robust, with the ability to create tiering policies spanning SSD/EFD, Fibre Channel and SATA drives.

Some of the solutions, the HDS USP-V, IBM V7000 and NetApp V-Series, are hybrids that do both appliance-based and in-box virtualization simultaneously. They can virtualize your existing storage while also virtualizing their own native trays of disks.

The Secondary Storage Market

Most early adoption occurs in secondary storage: backup and restore products, virtual tape libraries (VTLs) and similar products like archives and content-addressable storage (CAS). I would venture to guess that more people test the waters with a VTL before trying a primary storage virtualization technology.

Tape libraries are being replaced with virtual tape libraries: disk controllers with intelligence to emulate a traditional automated tape library. Virtualizing tape libraries, while preferable to jockeying tapes, was an expensive proposition at first. Then VTL compression matched the compression of tape drives. When deduplication arrived, it reduced the amount of disk required to reasonable levels, making VTLs both high-performing and affordable.

Secondary storage seems to get the leading-edge product introductions; it's a lower-risk proposition. Deduplication, encryption and other technologies were often introduced first in the secondary storage realm. These products start out as appliances, reaching a wide audience and retrofitting existing equipment, then migrate into the storage devices themselves. A VTL gateway becomes a VTL appliance with everything built in. Encryption goes from gateways to being built into drives, VTLs, media servers and clients. Deduplication goes into VTLs, media servers and a global deduplication cache among backup agents, sending less data over the wire and storing less on disk.

But primary storage products are now mature as well. These technologies, dedupe and encryption, are no longer bleeding edge; they are gaining general acceptance. Soon they will be at every point in the storage ecosystem.

An Enabling Technology

Storage virtualization is an enabling technology. It allows migration of virtual volumes without worrying about their physical disks. That lets us migrate from old storage to new, from one vendor to another. You can do all of this without shutting down a single server, migrating everything non-disruptively.

As I said earlier, SANs promised better storage utilization. They delivered, but not enough. Storage virtualization technologies such as thin provisioning and storage pooling, where multiple RAID groups are pooled together and volumes are carved out of that pool, take that promise of better utilization and start to deliver on it. You can now achieve better utilization of storage resources with less wasted space. This is a mixed blessing: while gaining better utilization of your storage, you have to balance that against faster, just-in-time storage purchases. You never want to run out of space (bad things happen). If you're the kind of shop that can't acquire additional storage easily, better utilization might not be for you; you'll need to keep extra capacity at hand.
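
To make thin provisioning concrete, here's a toy sketch in Python (my own illustration under assumed names, not any product's design): a volume advertises a large logical size but only consumes extents from the shared pool when a block is first written.

```python
# Hypothetical thin provisioning sketch: physical extents are allocated
# from a shared pool only on first write, so volumes can be oversubscribed.

class ThinPool:
    def __init__(self, physical_extents):
        self.free = list(range(physical_extents))    # unallocated extents

    def allocate(self):
        if not self.free:
            raise RuntimeError("pool exhausted - bad things happen")
        return self.free.pop()


class ThinVolume:
    def __init__(self, pool, logical_extents):
        self.pool = pool
        self.logical_extents = logical_extents       # advertised size
        self.map = {}                                 # filled on demand

    def write(self, extent, data):
        if extent not in self.map:
            self.map[extent] = self.pool.allocate()   # just-in-time allocation
        # data would land in physical extent self.map[extent]

    def consumed(self):
        return len(self.map)


pool = ThinPool(physical_extents=100)
vol = ThinVolume(pool, logical_extents=1000)          # 10x oversubscribed
vol.write(0, b"hello")
print(vol.consumed(), "of", vol.logical_extents, "logical extents actually consume disk")
```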

The technology also allows us to make snapshots without copying all the data. There is a debate between redirect-on-write and copy-on-write. Some vendors just copy a bitmap when making a snapshot; then, when a block changes, they copy the old frozen block into the snapshot area before allowing the write to occur. That is copy-on-write. Redirect-on-write simply writes the newly changed block to a different area. The copy-on-write crowd argues that in some instances you gain a second physical copy on disk. It's a valid argument, but with today's RAID-6 and RAID-DP it's also an old argument that's losing its worth.
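
Here's a simplified contrast of the two approaches, again just an illustrative Python sketch rather than any vendor's actual implementation:

```python
# Toy comparison of copy-on-write and redirect-on-write snapshots.

def copy_on_write(volume, snapshot_area, block, new_data):
    # Preserve the frozen block in the snapshot area first...
    snapshot_area[block] = volume[block]
    # ...then let the write land in place: two I/Os on the first overwrite.
    volume[block] = new_data

def redirect_on_write(volume, block_map, block, new_data, free_block):
    # Write the new data to a fresh location and repoint the live volume;
    # the old block stays put and still serves the snapshot.
    volume[free_block] = new_data
    block_map[block] = free_block      # live view now points at the new block


vol = {0: "A", 1: "B", 2: None}
snap_area = {}
copy_on_write(vol, snap_area, 0, "A'")
print(vol[0], snap_area[0])            # A' A

live_map = {0: 0, 1: 1}
redirect_on_write(vol, live_map, 1, "B'", free_block=2)
print(vol[live_map[1]], vol[1])        # B' B  (block 1 stays frozen for the snapshot)
```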

Deduplication couldn't occur without virtualization. It finds like blocks and deletes one, updating the pointers so that both point to one copy of the identical data. The same technique works for primary storage as well as for backup and archive data.
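
A minimal sketch of the idea in Python (illustrative only, with made-up names): blocks are fingerprinted by hash, identical content is stored once, and every logical block simply points at that single physical copy.

```python
# Toy block-level dedupe: identical blocks detected by hash, stored once,
# with logical blocks holding pointers to the single physical copy.

import hashlib

store = {}        # fingerprint -> the one physical copy of the block
pointers = {}     # logical block address -> fingerprint

def write_block(lba, data):
    fp = hashlib.sha256(data).hexdigest()
    if fp not in store:          # first time we've seen this content
        store[fp] = data
    pointers[lba] = fp           # duplicate writes just add a pointer

for lba, data in enumerate([b"aaaa", b"bbbb", b"aaaa", b"aaaa"]):
    write_block(lba, data)

print(len(pointers), "logical blocks,", len(store), "physical blocks stored")
# -> 4 logical blocks, 2 physical blocks stored
```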

All of this adds up to less space being consumed. I love being in the storage business; people never have enough of it. Their needs constantly grow, often far faster than they would like. These virtualization technologies save real dollars and help control the growth that people endure year after year.

Do I Need This Stuff?

You don't need storage virtualization, just like you don't need server virtualization. These things make our lives easier. Whether it's automatic migration from one piece of storage to another (like moving a VM from one server to another) or a simplification of business continuity through mirroring, storage virtualization is an enabler. You can obviously migrate storage from one box to another manually (or maybe have VMware Storage vMotion move some of your storage, though likely not the whole environment), just like you can manually load a new physical server. It's time consuming either way.

Storage virtualization takes us farther down the road of storage consolidation, just like server virtualization enables server consolidation. Storage virtualization allows us to do things like thin provisioning, deduplication and real tiering.

Do you need the full-featured products? No. Do you need vSphere when free software like ESXi might fit? No; these are choices. They can give you greater performance, recoverability, migration and flexibility. They often save you real money.

It's hard to do these things without storage virtualization, though you can live without virtualization in all its forms. But once you've virtualized your storage, you'll never want to go back.
