Monday, January 10, 2011

The Storage Evolution Part 2: Deduplication

This is part 2 in the Storage Evolution series.

When we created SANs we stored more and more data, retrieved at higher and higher speeds, and it was good. Then we added advanced functionality like creating copies (clones, snapshots, etc.) quickly, and it was good. Then development wanted six copies of that production database, one for each developer, plus test, QA and staging. We needed a daily clone for the data warehouse and business intelligence. We virtualized our servers, we booted from SAN, and some of us ditched the desktops and workstations, opting for Citrix and VDI. We were making copies of the same data, over and over and over again. And at the 11th hour of the second half of the day, we backed it all up.

And it just got worse.

The Tale of Tape

We had a love/hate relationship with tape. We loved its density and its streaming speed; we hated its bulk, the daily off-site management, the library management, and the load and seek times. But it was cheap, and the alternatives were expensive.

VTLs wanted to conquer the world (and disk manufacturers that didn't make tape solutions marketed them heavily). They came out with expensive but speedy boxes. For those where money was no object it was a sigh of relief; the rest of us looked on with envy. Then VTLs got compression to match what tape drives did, lowering costs. A few years later they got deduplication, and VTLs became an affordable reality.

At first, deduplication arrived the way many new technologies do: as the universal retrofit, the trick-out-my-VTL option, the appliance. But appliances are a stop-gap measure, and they morphed in two ways: for some vendors the dedupe features went straight into the VTL itself; for others, the VTL was built around the appliance. Either way, the feature now lives in the device, making for a single solution.

The Tale of Disk

When NetApp introduced deduplication in the form of Advanced Single-Instance Storage (A-SIS for short), I thought it was cool. I also thought everyone would have it in 2-3 years. I was wrong.

NetApp's deduplication remains one of the most compelling features of their storage. It gets its biggest bang in vSphere, but works well in other places too. The reason NetApp remains king of the primary storage deduplication hill today is embedded in its allocation unit, the 4K block. Others use bigger chunks, which don't deduplicate as well. (It's much harder to find an exact match for a 1 MB or 1 GB chunk than for a 4K one.) This is why, years later, NetApp has one of the only elegant deduplication solutions.
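To illustrate why the allocation unit matters, here is a minimal sketch (the function name and synthetic data are mine, not NetApp's implementation): the same bytes fold better at a smaller chunk size because small chunks are far more likely to match exactly.

```python
import hashlib

def dedupe_ratio(data: bytes, chunk_size: int) -> float:
    """Fraction of chunks that are unique (lower = better deduplication)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha1(c).digest() for c in chunks}
    return len(unique) / len(chunks)

# Synthetic data: three identical 4K blocks plus one different 4K block.
data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096 + b"A" * 4096

print(dedupe_ratio(data, 4096))   # 0.5 -> half the 4K chunks are duplicates
print(dedupe_ratio(data, 8192))   # 1.0 -> at 8K, no two chunks match at all
```

The duplicate 4K blocks vanish at a 4K chunk size, but the moment the chunk spans the boundary between differing regions, the match is lost entirely.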

There are some other smaller players trying to jump into this water, as well as an open source effort. But for a major-league, market-proven solution, NetApp still remains king.

What Does It Buy Me?

Reread my first paragraph. Understand that each VM has Windows 2008, 2003, 2000, Linux, or some other commonality. Understand that all those copies can be folded, whether in backups, in variations on versions, or in instances. We store the same patterns, binaries, images and data over and over again. We back up the same data night after night, week after week, month after month, with little actual change. Those of us who are storage architects may see a 20% change rate on nightly backups, but only a 4% daily change rate in replication. We don't need the whole file copied over and over again; there is a lot of waste.

Deduplication aims to change that: reduce that footprint, control the growth, see if we can stretch things farther. We try to do more with less.

How Does It Work?

Most deduplication works by identifying duplicate data and removing it (very simplistically put). It fingerprints data using fast cryptographic hash algorithms also used in IPsec, namely MD5, SHA-1, or other more proprietary methods, to quickly compute a hash for each chunk. That hash is then stored in a database. Implementations diverge from there.

Most secondary storage (VTLs) then deduplicates in real time, removing the matches while ingesting the data (i.e., during the backup). Some vendors take the extra step of doing a bit-level verification, while others deem hash collisions (false matches) too statistically remote to worry over. The duplicate is removed, and a pointer to the original occurrence (the retained copy) is inserted in its place.
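That ingest path can be sketched as a toy model (my own names, an in-memory dict standing in for the hash database, not any vendor's implementation): each incoming block is fingerprinted, byte-verified against any existing match to rule out a collision, and stored as a pointer when it is a true duplicate.

```python
import hashlib

class DedupStore:
    """Toy inline deduplicator: 4K blocks, SHA-1 fingerprints, bit-level verify."""
    BLOCK = 4096

    def __init__(self):
        self.blocks = {}    # fingerprint -> the one physical copy of the block
        self.pointers = []  # logical layout: one fingerprint per block slot

    def ingest(self, data: bytes):
        """Split data into blocks; keep only the first copy of each."""
        for i in range(0, len(data), self.BLOCK):
            block = data[i:i + self.BLOCK]
            fp = hashlib.sha1(block).digest()
            if fp in self.blocks:
                # The extra bit-level verification step: guard against a
                # (statistically remote) hash collision before folding.
                assert self.blocks[fp] == block, "hash collision!"
            else:
                self.blocks[fp] = block   # first occurrence: keep the bytes
            self.pointers.append(fp)      # either way, just record a pointer

    def read(self) -> bytes:
        """Reassemble the logical data by following the pointers."""
        return b"".join(self.blocks[fp] for fp in self.pointers)

store = DedupStore()
store.ingest(b"X" * 4096 * 3 + b"Y" * 4096)   # 4 logical blocks, 3 identical
print(len(store.blocks))                       # 2 physical blocks stored
assert store.read() == b"X" * 4096 * 3 + b"Y" * 4096
```

Four logical blocks land as two physical ones; the pointer list is what lets reads reconstruct the original stream unchanged.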

The primary storage example does the folding offline (as a post-process) and bit-level verifies the data before throwing out the duplicate.

There are other approaches as well: content-aware deduplication and other proprietary mathematical schemes offer alternatives that may provide more protection or yield smaller datasets. Some approaches use variable-size chunks, while others are fixed. There are also backup software-based approaches: EMC Avamar and IBM TSM, to name two of the many.
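Variable-size (content-defined) chunking can be sketched as follows. This is a toy model under my own assumptions: it rehashes a sliding window with SHA-1 for clarity, where real implementations use a fast rolling fingerprint (e.g., Rabin), and the window and average-size parameters are arbitrary.

```python
import hashlib
import random

def chunk_boundaries(data: bytes, window: int = 16, avg: int = 256) -> list:
    """Content-defined chunking: cut wherever the hash of the trailing
    `window` bytes hits a chosen pattern, so boundaries depend on the
    content itself rather than on fixed offsets."""
    boundaries, start = [], 0
    for i in range(window, len(data) + 1):
        h = int.from_bytes(hashlib.sha1(data[i - window:i]).digest()[:4], "big")
        if h % avg == 0 and i - start >= window:  # ~1-in-avg chance per byte
            boundaries.append(i)
            start = i
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))              # final (possibly short) chunk
    return boundaries

random.seed(42)
data = bytes(random.randrange(256) for _ in range(4096))
cuts = chunk_boundaries(data)
chunks = [data[i:j] for i, j in zip([0] + cuts, cuts)]
assert b"".join(chunks) == data                   # chunks reassemble losslessly
```

Because cut points are chosen by content rather than by offset, inserting a few bytes near the front of a stream only disturbs the nearby chunks; later boundaries land on the same content, and those chunks still deduplicate, which fixed-size blocks cannot do.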

Where do I Deduplicate?

So you're deduplicating in your Data Domain, so you're good, right? Deduplication is one of those technologies that you're going to end up using everywhere: primary storage (disk), backup software, VTL, email, data archives, replication software; wherever and as often as you can, to store and transmit less. It's a technology that is a good fit at every point in your environment. As it proliferates (and it will spread throughout IT), we will gain greater storage efficiencies than we have today. We will store less even as we're asked to store more and more (retention laws, anyone?). We'll have less duplicate data stored over and over and over again all over the place.

There are datasets we don't recommend deduplicating today. I still don't recommend it for high-performance databases, largely sequential workloads, and other things I might just want to leave alone. It's not for everything, and some unfriendly datasets can actually grow with deduplication.

Compression is complementary, but it's not the same thing. It offers a different approach to reducing your storage footprint and fits well in the file-serving space. EMC (Celerra/Unified Compression), IBM (the [Storwize] Compression Appliance) and NetApp all have compression offerings, some built in.

Each and every one of these technologies is aimed at using less space. To help stem the uncontrollable growth, they will continue to evolve and offer better utilization than we had before. We're going to need it all.

1 comment:

  1. I'd be interested on your thoughts about any risks introduced by integrating deduplication into the fabric of an environment, particularly when deployed at various/multiple layers simultaneously as you suggest may be the case.