Using deduplication to reduce storage demands on Cloud Providers

In the paper, “The Effectiveness of Deduplication on Virtual Machine Disk Images“, the authors perform an in-depth analysis of several factors that may or may not impact the level of deduplication of virtual machine images.

So, what’s exactly deduplication?

The main idea is to leverage data commonality in a storage system by identifying duplicate “chunks” of data across multiple files and storing only one copy of each chunk.

How do you do that?

The idea is to compute a digest (such as SHA-1) of each data chunk composing a file, and check if that data is already present in the chunk store. The chunk is only stored in case it is already not there, otherwise a pointer to the already stored chunk is added to the metadata describing how to reconstruct the original file.

How do you divide a file into chunks?

There are two main techniques for chunking: variable-size chunking and fixed-size chunking. Fixed size chunking is straightforward: you define a chunk size (such as 4KB), and divide a file into equal chunks of that size. However, if some data is appended or removed from this file, all the chunks after the modification will become invalid. Variable-size chunking is resistant to modification, since the chunks can have different sizes. A well-known technique for variable-size chunking is to compute a rabin fingerprint of the file stream to define where to place the boundaries of each chunk.

Why use deduplication on virtual machine images?

Virtualization technology is widely adopted in data centers and cloud computing providers in order to better utilize physical resources and to provide isolation between different applications/users. A problem that arises is the amount of storage needed to store multi-gigabyte VM disk images. Several researches identified that different VM images share a considerable amount of data between then, what suggests that the use of deduplication may reduce the total amount of storage needed in VM hosting facilities.

Below are some interesting findings of the aforementioned paper on deduplication in the context of VM images:

  • Deduplication can save 80% of more of storage space when stored VM images are from the same operating system “lineage”, such as Ubuntu or Fedora.
  • For mixed operating systems, the deduplication ratio is about 40%, which is still quite a considerable amount of space saved.
  • Fixed-size chunking outperforms variable-size chunking for VM images, which is good news, since typically that’s easier to implement.
  • Compression of chunks can further increase storage savings
  • Factors that have major impact on deduplication effectiveness:
    • Base operating system (the more homogeneous, the more the level of deduplication)
    • Chunk size (the smaller the chunk, the higher the deduplication level, the higher the overhead to reconstruct the original file)
  • Factors that have little impact on deduplication effectiveness:
    • Package installation or language localization within the same operating system
    • Surprisingly, consecutive releases of a single OS have a similar level of de-duplication of releases away from each other (normally high)

What are the implications of this to my work?

Deduplication in the context of VM images is a great use case for a content-addressable storage, which can be used as a storage backend for the chunk store needed for de-duplication of VM images. Current CAS solutions are either based on costly hardware (such as disk arrays) or centralized. However, a centralized CAS architecture will have limited capacity and will not scale as the amount of stored data grows.

Public and private cloud providers spend a massive amount of storage space to keep user’s VMs. Using a distributed content addressable storage to store VMs have the following advantages (among others):

  • Obvious scalability and elasticity
  • Reduce storage demands for multi-gigabyte VM hosting
  • Use the saved storage space for replicating data chunks, increasing availability and durability
  • Parallel transfer of chunks from multiple servers, possibly in a BitTorrent fashion, what may speed up transfer of a VM image to hosts
Posted in Uncategorized | 1 Comment

Leveraging data commonality with content-addressable storage

There are a lot of similarities between subsequent releases of a software at the binary level. In the figures below, each series represent how many percent of binary data blocks are exactly the same between a reference release of a software package and all previous releases of the same package. The first graph is for the Linux Kernel 2.4 source code releases, and the second graph is for 10 nightly binary releases of Mozilla (in March 2003).

Commonality between Linux 2.4 Kernel releases

Commonality between Mozilla nightly binary releases

On average, about 60% of the blocks from different releases are redudant, and a minimum of 30% blocks are common for all releases. That’s quite a lot! The similarities are more significant for source code then for binary releases, since a small change in the source code may have a large impact in the compiled code. Even though these results are at the block-level, for large software packages a similar level of commonality may also be observed between files of different releases. This is a typical case where content-addressable storage (CAS) may save lots of disk space, since each binary object (either whole file or blocks) is stored only once.

Those results are exciting because they are very related to the use case I’m focusing to build a distributed content-addressable storage. The CernVM file system currently uses a CAS to distribute applications’ software to virtual machines for the LHC experiments at CERN. In that scenario, applications are released every other day. Based on these results, the level of commonality between subsequent releases must be high, what justifies the use of a CAS at CernVM-fs. I wonder what are the actual levels of commonality for experiments’ releases @ CERN. Will see if I can find that out.

The results above were presented in the paper Opportunistic Use of Content Addressable Storage for Distributed File Systems (2003). In this paper they propose a distributed file system on the top of a content-addressable storage. The idea is to divide each file into blocks, hash the contents of each block and write a metadata file that describes how to rebuild the original file from the separate blocks. This metadata is called a recipe, and is shown below:

\begin{figure}\begin{center}\small\begin{verbatim}<?xml version=''1.0'......ist></recipe_choice></recipe>\end{verbatim}\end{center}\end{figure}

Sample File Recipe

The recipe abstraction allows to to split the file in many ways: variable or fixed block size; and to use different hashing algorithms: MD5, SHA-1, etc.

In the CASPER file system, when a client wants to fetch a file, it will try to download the whole file from the server (as in a typical client-server FS: Corba, AFS, NFS, etc). If the connection to the server is slow, the client will instead ask for the recipe of the file it wants to fetch. With the recipe available, the client tries to get individual blocks from nearby content-addressable storage providers. In their view, content-addressable storage providers will be available on local networks with much better bandwidths and latencies. If not all blocks are found on nearby CAS providers, the remaining blocks are fetched from the central server as usual.

Cheers!

Posted in Uncategorized | 1 Comment

A Scalable Architecture for a Distributed Content-Addressable Storage System

Definition

Content-Addressable-Storage (CAS) – A fancy name for a simple storage technique: instead of indexing stored objects by their location (such as file://home/user/example or http://www.example.com/file.jpg), as done in traditional storage systems, index objects by their content.

This is typically done by hashing the contents of the object (using MD5 or SHA-1, for example) and obtaining an unique identifier (UID). For instance, the object “” (empty string), has the MD5 digest: d41d8cd98f00b204e9800998ecf8427e, which can be used as an UID to access this object. Thus, no knowledge of the storage backend or physical data location is needed to retrieve the data. Two benefits of CAS over location-based storage systems are:

  • Automatic data integrity check, which allows to detect corrupted data;
  • Automatic data de-duplication, which can optimize disk space utilization, specially in scenarios where data repetition is common.

Since the UID changes when the content of an object changes, CAS are typically used to store fixed content data (immutable), such as archives or backups. According to analysts, static content account for 50% to 80% of the produced data in organizations nowadays [1][2]. However, content-addressable storage is also used to store mutable data, such as in popular distributed revision control system Git. In order to enable this, additional metadata is needed to correlate multiple versions  of a particular object.

The Problem

In the era of big data, Content Addressable Storage Systems are gaining popularity as means to store large amounts of data, both in volume and in quantity. Besides the traditional use case of archiving documents, cloud-based backup and online file hosting are potential use cases for CAS systems. In this context, critical requirements for modern CAS systems are scalability, fault-tolerance and availability. How can this be achieved? Yes, by distributing!

A distributed content-addressable storage system may attend these requirements by distributing and replicating its data across a set of storage nodes. However, if the UID computation is done at a single node (such as in the client-side), insertion throughput may be compromised if a very large number of objects in inserted at once. This is because the digest computation is normally an expensive operation, and may become a bottleneck during insertion of a large number of objects through a single node.

For instance, in the CernVM project the insertion of a software release composed of 250,000 files (total size: ~10GB) into a repository based on a centralized CAS takes up to 1 hour (UID computation and data compression). This overhead may be reduced if a high-speed connection is available (such as Gigabit Ethernet), since this computation may be distributed among the nodes participating in the DCASS.

The solution (or the way to it..)

The objective of my master thesis is to design a scalable architecture for a Distributed Content-Addressable Storage System (DCASS) of binary objects (blobs). This architecture will focus in the CernVM use-case, where hundreds of thousands of objects may be inserted as a single batch into the DCASS. On the top of the requirements presented previously, this system will have the following additional requirements:

  • Distribution of digest computation across nodes participating in the DCASS when insert throughput can be increased;
  • Elasticity – grow or shrink the DCASS by just adding or removing nodes;
  • Support for additional processing over the data on participating nodes, such as data compression. This additional processing can be enabled through a plugin;
  • Support for multiple storage backend types (such as file system, database system, networked storage, etc)

In contrast with other distributed object storage systems (such as distributed key-value stores), concurrency control does not have to be as strict since the update operation is not supported in a DCASS.

Some of the questions to answer throughout this work are:

  • How is performance (in terms of response time) improved/decreased when using a DCASS instead of other systems in the CernVM repository use case?
  • Given network conditions (bandwidth and delay) connecting the client and the nodes in a DCASS, what is the minimum number of objects (volume and size) that “pays off” distributing the hash computation in contrast of calculating the digests in the client node?
  • What is the overhead introduced by elasticity in a DCASS? (ie. by re-copying files when nodes are added or removed)

Let’s see where this will lead. Suggestions and comments are very welcome!

“Mi CAS es su CAS” ;)

[1] – http://www.internetnews.com/ent-news/article.php/1024271/EMC+Tackles+Fixed+Content+Storage.htm

[2] – http://documentmedia.com/ME2/dirmod.asp?sid=&nm=&type=news&mod=News&mid=9A02E3B96F2A415ABC72CB5F516B4C10&tier=3&nid=DF86E1F49983419F868090D9C5B7498C

Posted in Uncategorized | 6 Comments

Hello, World!

I’ve always wanted to have a tech blog to write about what I’ve been doing at university, eventual side-projects and some random stuff I come across every day. However I never took the time to start it, but now I guess the time has come…

Inspired by fellow EMDC* mates Lalith Suresh and Marcus Ljungblad (I guess I finally learned how to type his surname), who regularly post interesting stuff in their blogs, and in order to share ideas and thoughts about my coming master thesis @ EMDC, I finally decided to start my own blog. Yey! :-)

In the coming months I will mostly be writing about my thesis in designing a distributed storage system. Most of my fellow EMDC classmates will also write about the progress of their thesis in their blogs, which are available in the right sidebar. The cool thing is that the projects are very diverse, in many areas of academia or industry, so lot’s of interesting and cutting-edge content in Distributed Systems will be posted by briliant people, keep an eye! ;)

Looking forward to start writing about my thesis in the next post, some time early this week..

Cheers!

*EMDC = European Master in Distributed Computing – www.kth.se/emdc

Posted in Uncategorized | 1 Comment