Compression, Encryption, Deduplication, and Replication: Strange Bedfellows

One of the great ironies of storage technology is the inverse relationship between efficiency and security: Improving performance or reducing storage requirements almost always comes at the cost of a system's confidentiality, integrity, or availability.

Many of the advances in capacity utilization put into production over the last few years rely on deduplication of data. This key technology has moved from basic compression tools to take on challenges in replication and archiving, and is even moving into primary storage. At the same time, interconnectedness and the digital revolution have made security a greater challenge, with attention turning to encryption and authentication to prevent identity theft or worse crimes. The only problem is that most encryption schemes are incompatible with compression or deduplication of data!

Incompatibility of Encryption and Compression

Consider a basic lossless compression algorithm: We take an input file consisting of binary data and replace all repeating patterns with a unique code. If a file contained the sequence “101110” eight hundred times in a row, we could replace the whole 4800-bit sequence with a much smaller one that says “repeat this eight hundred times”. In fact, this is exactly what I did (using English) in the previous sentence! This basic concept, called run-length encoding, illustrates how most modern compression technology functions.
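To make this concrete, here is a rough Python sketch of the idea: a hand-rolled run-length encoder, plus the standard-library zlib module, whose DEFLATE algorithm generalizes the trick by replacing any repeated pattern (like the “101110” sequence above) with short back-references. The function names are mine, purely for illustration.

```python
import zlib

def rle_encode(data: bytes):
    """Collapse runs of identical bytes into [byte_value, run_length] pairs."""
    runs = []
    for b in data:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1           # extend the current run
        else:
            runs.append([b, 1])        # start a new run
    return runs

print(rle_encode(b"aaaabbbcca"))       # [[97, 4], [98, 3], [99, 2], [97, 1]]

raw = b"101110" * 800                  # the 4800-bit sequence from the text
print(len(raw), len(zlib.compress(raw)))   # 4800 characters shrink to a few dozen bytes
```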
Replace the sequence of identical bits with a larger block of data or an entire file and you have deduplication and single-instance storage! In fact, as the compression technology gains access to the underlying data, it can become more and more efficient. The software from Ocarina, for example, actually decompresses JPEG and PDF files before recompressing them, resulting in astonishing capacity gains!
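Block-level deduplication can be sketched the same way: Fingerprint each block, store a given payload only once, and hand back references for repeats. This is a toy illustration (a hash-indexed dictionary standing in for a real block store), not any vendor's implementation:

```python
import hashlib

store = {}                                  # fingerprint -> the single stored copy of a block

def write_block(block: bytes) -> str:
    """Store a block only if its fingerprint is new; duplicates become mere references."""
    fingerprint = hashlib.sha256(block).hexdigest()
    store.setdefault(fingerprint, block)
    return fingerprint

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
refs = [write_block(b) for b in blocks]
print(len(blocks), "logical blocks,", len(store), "physical copies")   # 4 logical, 2 physical
```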

Now let’s look at compression’s secretive cousin, encryption. It’s only a small intellectual leap to use similar ideas to hide the contents of a file rather than just squashing it. But encryption algorithms are constantly under attack, so some very smart minds have come up with some incredibly clever methods to hide data. One of the most important advances was public-key cryptography, where two different keys are used: A public key used to encrypt (write) data, and a private key used to decrypt (read) it. This same technique can be used to authenticate identity, since only the designated reader would (in theory) hold the required key.
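For the curious, here is a textbook-RSA round trip with toy numbers, just to show the public/private split in action. Real deployments use 2048-bit keys and padding schemes, but the arithmetic is the same idea:

```python
# Toy textbook RSA -- illustration only, far too small to be secure.
# (pow(e, -1, m) for the modular inverse needs Python 3.8+.)
p, q = 61, 53
n = p * q                                # 3233: part of both keys
e = 17                                   # public exponent: anyone may encrypt ("write")
d = pow(e, -1, (p - 1) * (q - 1))        # private exponent: only the key holder can decrypt ("read")

message = 42
ciphertext = pow(message, e, n)          # encrypt with the public key
recovered = pow(ciphertext, d, n)        # decrypt with the private key
print(ciphertext, recovered)             # 2557 42

signature = pow(message, d, n)           # signing reverses the roles: the private key writes,
print(pow(signature, e, n) == message)   # the public key verifies -- the authentication use mentioned above
```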
Cryptography has become exceedingly complicated lately in response to repeated attacks. Most compression and encryption algorithms are deterministic, meaning that identical input always yields identical output. This is unacceptable for strong encryption: An attacker who can guess the plaintext can simply encrypt the guess with the public key and compare the result against the ciphertext. Much work has focused on eliminating residues of the original data from the encrypted version, as illustrated brilliantly on Wikipedia with the classic Linux “Tux” image encrypted in ECB mode. The goal is to make the encrypted data indistinguishable from random “noise”.
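The “Tux” effect is easy to reproduce. Assuming the third-party cryptography package (not part of the standard library), the sketch below encrypts four identical blocks two ways: Deterministic ECB leaks the repetition, while CBC with a random IV hides it.

```python
# pip install cryptography   (third-party package, assumed here)
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)
plaintext = b"the same block!!" * 4                       # four identical 16-byte blocks

ecb = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
ecb_ct = ecb.update(plaintext) + ecb.finalize()
print(len({ecb_ct[i:i + 16] for i in range(0, 64, 16)}))  # 1: identical blocks encrypt identically

cbc = Cipher(algorithms.AES(key), modes.CBC(os.urandom(16))).encryptor()
cbc_ct = cbc.update(plaintext) + cbc.finalize()
print(len({cbc_ct[i:i + 16] for i in range(0, 64, 16)}))  # 4: chaining plus a random IV hides the pattern
```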
What happens when we mix these powerful technologies? Deduplication and encryption defeat each other! Deduplication must have access to repeating, deterministic data, and encryption must not allow this to happen. The most common solution (apart from skipping the encryption) is to place the deduplication technology first, allowing it access to the raw data before sending it on to be encrypted. But this leaves the data unprotected longer, and limits the possible locations where encryption technology can be applied. For example, an archive platform would have to encrypt data internally, since many now include deduplication as an integral component.
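The ordering problem shows up even in a toy model. In the sketch below, a duplicate chunk is easy to spot by fingerprint before encryption, but once each chunk is encrypted with a fresh random IV (here a stand-in XOR construction, not real cryptography), every copy looks unique and the deduplication savings vanish:

```python
import hashlib, os

key = os.urandom(32)

def randomized_encrypt(chunk: bytes) -> bytes:
    """Toy stand-in for any cipher that uses a fresh random IV per chunk -- not real crypto."""
    iv = os.urandom(16)
    pad = hashlib.sha256(key + iv).digest()
    keystream = (pad * (len(chunk) // len(pad) + 1))[:len(chunk)]
    return iv + bytes(a ^ b for a, b in zip(chunk, keystream))

chunks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]          # the third chunk duplicates the first

before = {hashlib.sha256(c).hexdigest() for c in chunks}
after = {hashlib.sha256(randomized_encrypt(c)).hexdigest() for c in chunks}
print(len(before), len(after))                            # 2 unique fingerprints before encryption, 3 after
```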
Why do we prefer compression to encryption? Simply because that’s where the money is! If we can cut down on storage space or WAN bandwidth, we see cost avoidance or even real cost savings! But if we “waste” space by encrypting data, we only save money in the case of a security breach.

A Glimmer of Hope

I had long thought this was an intractable problem, but a glimmer of hope recently presented itself. My hosting provider allows users to back up their files to a special repository using the rsync protocol. This is pretty handy, as you can imagine, but I was concerned about the security of this service. What happens if someone gains access to all of my data by hacking their servers?
At first, I only stored non-sensitive data on the backup site, but this limited its appeal. So I went looking for something that would allow me to encrypt my data before uploading it, and I discovered two interesting concepts: rsyncrypto and gzip-rsyncable.
rsync is a solid protocol, reducing network demands by sending only the changed blocks of a file. But, as noted, compression and encryption tools typically change the entire output file even when only a tiny portion of the input has been altered. A few years back, the folks behind rsync (who also happen to be the minds behind the Samba CIFS server) developed a patch for gzip that compresses files in chunks rather than as a single stream. This patch, called gzip-rsyncable, hasn’t been merged into the main gzip source even after a dozen years, but it yields amazing results in accelerating rsync performance.
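The trick behind gzip-rsyncable is easy to sketch: A rolling checksum over a small window decides where to cut, so chunk boundaries depend on content rather than position and resynchronize shortly after an edit. The code below is a simplified illustration of that idea (the window size and boundary mask are arbitrary choices of mine), not the actual patch:

```python
import hashlib, os

WINDOW = 48          # size of the rolling-checksum window (arbitrary, for illustration)
MASK = 0x0FFF        # cut when the low bits of the sum are all ones (~4 KB average chunks)

def chunk(data: bytes):
    """Split data at content-defined boundaries chosen by a simple rolling sum."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]        # slide the window forward
        if (rolling & MASK) == MASK:           # checksum hit the magic value: cut here
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return chunks

original = os.urandom(100_000)
edited = original[:50] + b"X" + original[50:]             # insert one byte near the front

old = {hashlib.sha256(c).hexdigest() for c in chunk(original)}
new = {hashlib.sha256(c).hexdigest() for c in chunk(edited)}
print(f"{len(new - old)} of {len(new)} chunks changed")   # usually just the chunk containing the edit
```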
The same technique was then applied to RSA and AES cryptography to create rsyncrypto. This open-source encryption tool makes a simple tweak to the standard CBC encryption scheme (reusing the initialization vector) so that encrypted files can be sent more efficiently over rsync. In fact, it relies on gzip-rsyncable to work its magic. Of course, the resulting file is somewhat less secure, but it is probably more than enough to keep a casual snooper at bay.
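To see why IV reuse helps rsync (and what it gives up), the sketch below, again assuming the third-party cryptography package, encrypts two files that share the same first half. With a constant IV the unchanged leading blocks encrypt identically, so rsync can skip them; with a fresh random IV every block changes. This is a simplified illustration of the trade-off, not rsyncrypto’s actual file format:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)

def cbc_encrypt(data: bytes, iv: bytes) -> bytes:
    """AES-CBC without padding (inputs here are already block-aligned)."""
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return encryptor.update(data) + encryptor.finalize()

original = b"A" * 64 + b"B" * 64                 # two files that share an identical first half
edited = b"A" * 64 + b"C" * 64

fixed_iv = bytes(16)                             # reused IV, in the spirit of rsyncrypto's tweak
print(cbc_encrypt(original, fixed_iv)[:64] ==
      cbc_encrypt(edited, fixed_iv)[:64])        # True: the unchanged prefix encrypts identically
print(cbc_encrypt(original, os.urandom(16))[:64] ==
      cbc_encrypt(edited, os.urandom(16))[:64])  # False: a fresh random IV changes every block
```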
Both of these tools are similar to modern deduplication techniques in that they chop files up into smaller, variable-sized blocks before working their magic. And the result is awesome: I modified a single word in a large Word document that I had previously encrypted and stored at the backup site, and only a single block of the new file had to be transferred, taking an instant rather than a few minutes. My only real issue is the lack of integration between these tools: I had to write a bash script to encrypt my files to a temporary directory before rsyncing them. I wish they could be integrated with the main gzip and rsync sources!
If you are interested in trying out these tools for yourself, and if you use a Mac, you are in luck: MacPorts offers both tools as simple downloads! Just install MacPorts, type “sudo port install gzip +rsyncable” to get a gzip build with the --rsyncable flag, then type “sudo port install rsyncrypto” and you’re done! I’ll post more details here if there is interest.

Ref: http://blog.fosketts.net/2009/02/05/compression-encryption-deduplication-replication/
