There are many similarities between subsequent releases of a software package at the binary level. In the figures below, each series shows the percentage of binary data blocks that are exactly the same between a reference release of a software package and each of its previous releases. The first graph is for the Linux kernel 2.4 source code releases, and the second graph is for 10 nightly binary releases of Mozilla (from March 2003).
On average, about 60% of the blocks from different releases are redundant, and a minimum of 30% of the blocks are common to all releases. That’s quite a lot! The similarities are more significant for source code than for binary releases, since a small change in the source code may have a large impact on the compiled code. Even though these results are at the block level, for large software packages a similar level of commonality may also be observed between files of different releases. This is a typical case where content-addressable storage (CAS) can save a lot of disk space, since each binary object (either a whole file or a block) is stored only once.
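To make the measurement concrete, here is a minimal sketch of how block-level commonality between two releases could be computed. It assumes fixed-size 4 KiB blocks and SHA-1 hashes; the papers' exact block sizes and methodology may differ.

```python
import hashlib

def block_hashes(data: bytes, block_size: int = 4096) -> set:
    """Hash each fixed-size block of a release's contents."""
    return {
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    }

def commonality(release_a: bytes, release_b: bytes, block_size: int = 4096) -> float:
    """Fraction of release_a's distinct blocks that also appear in release_b."""
    a = block_hashes(release_a, block_size)
    b = block_hashes(release_b, block_size)
    return len(a & b) / len(a) if a else 0.0

# Two toy "releases" that differ in only one block:
old = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
new = b"A" * 4096 + b"B" * 4096 + b"D" * 4096
print(commonality(new, old))  # 2 of 3 blocks are shared
```

A CAS exploits exactly this overlap: the two shared blocks would be stored once and referenced by both releases.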
These results are exciting because they are closely related to the use case I’m focusing on to build a distributed content-addressable storage. The CernVM file system currently uses a CAS to distribute application software to virtual machines for the LHC experiments at CERN. In that scenario, applications are released every other day. Based on these results, the level of commonality between subsequent releases must be high, which justifies the use of a CAS in CernVM-fs. I wonder what the actual levels of commonality are for the experiments’ releases at CERN. I’ll see if I can find that out.
The results above were presented in the paper Opportunistic Use of Content Addressable Storage for Distributed File Systems (2003). In this paper, the authors propose a distributed file system on top of a content-addressable storage. The idea is to divide each file into blocks, hash the contents of each block, and write a metadata file that describes how to rebuild the original file from the separate blocks. This metadata file is called a recipe, and is shown below:
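As a rough illustration of the idea, here is a sketch of building a recipe and rebuilding the file from it. The field names and fixed 4 KiB block size are my own illustrative choices, not the paper's actual recipe format.

```python
import hashlib
import json

BLOCK_SIZE = 4096  # illustrative; recipes can also describe variable-size blocks

def make_recipe(data: bytes) -> dict:
    """Split a file into blocks and record the hash of each one."""
    hashes = [
        hashlib.sha1(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]
    return {
        "file_size": len(data),
        "block_size": BLOCK_SIZE,
        "hash_algorithm": "sha1",
        "blocks": hashes,
    }

def rebuild(recipe: dict, cas: dict) -> bytes:
    """Reassemble the original file from a CAS keyed by block hash."""
    data = b"".join(cas[h] for h in recipe["blocks"])
    assert len(data) == recipe["file_size"]
    return data

original = b"hello world" * 1000
recipe = make_recipe(original)
# A toy in-memory CAS: block hash -> block contents
cas = {h: original[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
       for i, h in enumerate(recipe["blocks"])}
assert rebuild(recipe, cas) == original
print(json.dumps(recipe, indent=2))  # the recipe is just serializable metadata
```

The recipe itself is small compared to the file, which is what makes it cheap to ship to clients even over a slow link.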
The recipe abstraction allows the file to be split in many ways (variable or fixed block size) and with different hashing algorithms (MD5, SHA-1, etc.).
In the CASPER file system, when a client wants to fetch a file, it first tries to download the whole file from the server (as in a typical client-server file system: Coda, AFS, NFS, etc.). If the connection to the server is slow, the client instead asks for the recipe of the file it wants to fetch. With the recipe available, the client tries to get the individual blocks from nearby content-addressable storage providers. In the authors’ view, CAS providers will be available on local networks with much better bandwidth and latency. If not all blocks are found on nearby CAS providers, the remaining blocks are fetched from the central server as usual.
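The fetch logic described above can be sketched roughly as follows. The `ToyServer` class and its `get_block` method are hypothetical stand-ins for the real client-server protocol; the point is the order of lookups and the fact that content addressing lets the client verify every block it receives, wherever it came from.

```python
import hashlib

class ToyServer:
    """Hypothetical stand-in for the central file server."""
    def __init__(self, store: dict):
        self.store = store  # block hash -> block contents

    def get_block(self, h: str) -> bytes:
        return self.store[h]

def fetch_file(recipe: dict, cas_providers: list, server: ToyServer) -> bytes:
    """Fetch each block from nearby CAS providers, falling back to the server."""
    blocks = []
    for h in recipe["blocks"]:
        # Try the nearby (dict-like) CAS providers first.
        block = next((p[h] for p in cas_providers if h in p), None)
        if block is None:  # not found nearby: fetch from the central server
            block = server.get_block(h)
        # Content addressing lets the client verify every block it receives.
        assert hashlib.sha1(block).hexdigest() == h
        blocks.append(block)
    return b"".join(blocks)

# Toy setup: a 4-block file, with only the first two blocks available nearby.
BLOCK = 4096
chunks = [bytes([i]) * BLOCK for i in range(4)]
data = b"".join(chunks)
recipe = {"blocks": [hashlib.sha1(c).hexdigest() for c in chunks]}
nearby = {recipe["blocks"][0]: chunks[0], recipe["blocks"][1]: chunks[1]}
server = ToyServer(dict(zip(recipe["blocks"], chunks)))
assert fetch_file(recipe, [nearby], server) == data
```

Note that a malicious or faulty CAS provider cannot corrupt the file: a block that does not hash to the requested value is rejected, which is what makes it safe to fetch from untrusted nearby machines.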