Distributed Storage Open Source - Celeste
By roger on Jul 10, 2008
I am an avid digital photography enthusiast. I have over 100,000 pictures in my library. With my current digital SLR camera, a single raw image can be 14MB, and I can generate them at 5 images per second. I've done a few HD video editing projects, and as you can imagine, those also eat gigabytes of disk space like popcorn. Managing that much data causes significant issues for me as the household's resident IT manager. I have to keep it all backed up and available for my family. We have several terabytes of spinning rust in our home. Recently, as we left on a week-long vacation, our neighbor asked, "If these California wildfires find our neighborhood while you are gone, is there anything that you want me to save?" Given that he probably can't save the piano, it was easy for me to ask him to grab the backup storage device, with years of photos and records on it.
Of course, I also work with sensors, devices whose main purpose in life is to generate data. We are generating data at an incredible rate. All that data needs to be stored, so storage becomes more and more important. That's why Jonathan is blogging about it and why we are announcing a slew of new storage products. Well, we at labs wouldn't want to be left out. That is why I'm happy to announce the availability of Celeste, an open source research project aimed at building reliable, secure storage out of unreliable, non-secure parts, and at scaling beyond imagination.
Celeste is a very interesting set of technologies. It is a distributed object store. That means that when you store an object in a Celeste system, it is split up into chunks that are stored on various machines spread across a network. The system hands back a handle, which can later be used to reconstruct the object. Because these chunks of data can be replicated or encoded for redundancy, the storage can be very reliable. This also has interesting implications for security, since each participating node knows only about the chunk of data it is asked to store. It does not know who stored it or what larger object that chunk of data belongs to. The system is designed to handle unreliable or even rogue nodes. In addition to checking the integrity of the data as it is retrieved, nodes can keep track of the reputation of other nodes in the system. This means that when you save an object, you can ask to save it someplace VERY reliable or someplace with a good reputation for fast access. The society of nodes in the Celeste system uses the reputations that are built up over time to match the user's wishes when storing future objects.
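To make the chunk-and-handle idea concrete, here is a minimal Python sketch. It is not Celeste's code or API: the names `Node`, `store_object`, and `fetch_object`, the tiny chunk size, the use of SHA-256 as the chunk id, and the simple replica placement are all assumptions made for illustration. It only shows the shape of the idea: nodes hold opaque chunks, the handle is what ties them back together, and every chunk is checked on the way out.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


class Node:
    """A hypothetical storage node. It only ever sees opaque chunks."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk id -> chunk bytes


def store_object(data, nodes, replicas=2):
    """Split data into chunks, copy each to `replicas` nodes, return a handle."""
    handle = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        start = int(chunk_id, 16) % len(nodes)
        for i in range(replicas):
            nodes[(start + i) % len(nodes)].chunks[chunk_id] = chunk
        handle.append(chunk_id)
    return handle


def fetch_object(handle, nodes):
    """Rebuild the object, accepting only chunks whose hash matches their id."""
    parts = []
    for chunk_id in handle:
        for node in nodes:
            chunk = node.chunks.get(chunk_id)
            if chunk is not None and hashlib.sha256(chunk).hexdigest() == chunk_id:
                parts.append(chunk)
                break
        else:
            raise KeyError(f"chunk {chunk_id} is unavailable")
    return b"".join(parts)
```

Because each chunk is named by its own hash in this sketch, a rogue node that hands back altered bytes is simply skipped and another copy is used instead.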
The system is alive. When left alone it will continually monitor itself to make sure that the proper levels of redundancy are maintained as nodes come and go from the system. This has the effect of sloshing data around the system to level storage over time. This has an additional, very interesting side effect. Say you have a data center and you decide to buy one of our cool new storage boxes. If you plug it in and tell it to start participating in the Celeste system, data will naturally start flowing into the new storage units. If you have some old storage boxes that you would like to retire, perhaps because they just draw too much power for the amount of data they hold, you can just turn them off. When you do this, the Celeste system will notice that some data has disappeared and that the redundancy levels may have fallen below the required threshold, so Celeste will start creating new redundancy-coded chunks of data, most probably in your new storage system. This means that your data will automatically migrate to the new storage hardware just by connecting the new storage and turning off an old one.
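The migration effect falls out of a very simple rule: restore missing copies, and prefer the emptiest node. Here is a rough Python sketch of that rule. It assumes plain replication rather than the redundancy coding Celeste can use, and the function name `repair` and the `placement` map are hypothetical, invented for this example.

```python
def repair(placement, live_nodes, target=3):
    """Return a new placement with each chunk restored to `target` copies.

    placement  -- dict mapping chunk id to the set of node names holding a copy
    live_nodes -- set of node names that are currently reachable
    """
    # Count how many chunks each live node already holds.
    load = {name: 0 for name in live_nodes}
    survivors = {}
    for chunk_id, holders in placement.items():
        alive = holders & live_nodes
        survivors[chunk_id] = alive
        for name in alive:
            load[name] += 1

    repaired = {}
    for chunk_id, alive in survivors.items():
        holders = set(alive)
        if not holders:
            # No copy is left to rebuild from; report the chunk as lost.
            repaired[chunk_id] = holders
            continue
        while len(holders) < target:
            candidates = [n for n in live_nodes if n not in holders]
            if not candidates:
                break  # fewer live nodes than requested copies
            # Prefer the emptiest node, so new hardware fills up first.
            emptiest = min(candidates, key=lambda n: (load[n], n))
            holders.add(emptiest)
            load[emptiest] += 1
        repaired[chunk_id] = holders
    return repaired
```

Turn off an old box and add an empty one, and the new copies land on the empty box simply because it has the lowest load.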
The heart of the system is the Distributed Object Locator (DOLR). This is based on a Distributed Hash Table (DHT), which is a very slick piece of technology. It allows a completely masterless set of nodes to work together to find any piece of data, typically in a number of network hops that grows only logarithmically with the number of nodes. This system as implemented in Celeste can scale incredibly. It can address yottabytes of information. Yes, yottabytes. Now you can tell your friends that you learned a new word: kilo, mega, giga, tera, peta, exa, zetta, yotta. A yottabyte is 1,000,000,000,000,000,000,000,000 (1000^8) bytes. That should be able to store all my photos with no problems.
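If you have not met a DHT before, the trick is that the hash of an object's name decides which node is responsible for it, so nobody has to ask a master. The Python sketch below shows that core idea with a consistent-hash ring. It is not the Celeste DOLR: the `Ring` class and `ring_position` function are hypothetical, and the sketch assumes every node knows the whole ring, whereas a real DHT keeps only a small routing table per node and forwards lookups over several hops.

```python
import bisect
import hashlib


def ring_position(key):
    """Map any string to a point on the hash ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)


class Ring:
    """A consistent-hash ring; any participant can compute an object's owner."""

    def __init__(self, node_names):
        self.points = sorted((ring_position(n), n) for n in node_names)
        self.positions = [position for position, _ in self.points]

    def owner(self, object_id):
        # The owner is the first node clockwise from the object's position,
        # wrapping around to the start of the ring if necessary.
        i = bisect.bisect_right(self.positions, ring_position(object_id))
        return self.points[i % len(self.points)][1]
```

A nice property of this layout is that when a node joins or leaves, only the objects next to it on the ring change owner; everything else stays where it was.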
Another interesting aspect of the system is that it is designed to handle drastic failures. For example, suppose that you have a Celeste system that consists of thousands of nodes spread across the United States. If someone digging a trench in St. Louis severs your main OC-192 cable between the east and west sides of the country, the system will continue to operate as two separate Celeste systems. Stores and retrieves can go ahead as normal. Eventually, when the damage is repaired, the system will stitch itself back together, including flagging potential conflicts where two independent writes have been carried out on a single object. Additionally, a side effect of the way that data is looked up in the DOLR is that data naturally migrates closer to where it is being used. This means that if you are accessing a lot of data and then travel halfway around the world, the data can follow you to ensure speedy access from anywhere. An interesting side effect of these two behaviors is that one can imagine a scenario where your laptop is participating in a Celeste system. As you use data, it finds its way to your laptop. At the end of the day, you take your laptop home and work disconnected from the network. You read and write data locally on your hard drive. The next day you come in and reconnect to the Celeste system, and your data is seamlessly reintegrated.
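How can a system tell that two writes were independent once the halves rejoin? One common technique is a version vector: each side counts its own writes to an object, and if neither side's counts are entirely ahead of the other's, the writes happened in parallel. The Python sketch below illustrates that technique only. I am not claiming it is how Celeste detects conflicts, and the `compare` function and the "east"/"west" writer names are made up for the example.

```python
def compare(a, b):
    """Compare two version vectors (dicts mapping writer name to write count).

    Returns "equal", "a_newer", "b_newer", or "conflict".
    """
    writers = set(a) | set(b)
    a_ahead = any(a.get(w, 0) > b.get(w, 0) for w in writers)
    b_ahead = any(b.get(w, 0) > a.get(w, 0) for w in writers)
    if a_ahead and b_ahead:
        return "conflict"  # each side saw a write the other did not
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"
```

In the severed-cable story, an object written once on each coast during the outage would end up with vectors like `{"east": 2, "west": 1}` and `{"east": 1, "west": 2}`, which compare as a conflict and get flagged rather than silently overwritten.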
A key differentiator between our system and other distributed storage systems is that Celeste is mutable. Many distributed storage systems go to great lengths never to lose data, so they are often designed to NEVER forget anything. In many cases this is not appropriate for real-world use. Nothing is ever deleted and storage is never recovered, so these systems just grow and grow. We have many customers who wrestle with regulations that require them to keep customer records for a certain length of time and then guarantee that they will get rid of them. Celeste includes some secret sauce that allows it to really forget data. You can delete data, change it, etc. just as you would on a standard file system.
There are many other aspects, like interesting security modes and possible modes of disconnected operation. This system is a research tool and not yet part of any product, so don't expect to just plug it in and replace your existing file system. However, if you are interested in this type of research, I encourage you to check it out. The project is entirely open source.
You can learn more about the Celeste project here: http://www.opensolaris.org/os/project/celeste/
Congratulations to Glenn and Glenn for getting this great technology out to the world!