Preservation and Archive Special Interest Group (PASIG)
By pmonday on May 29, 2008
I am in San Francisco at Sun's Preservation and Archive Special Interest Group (PASIG) Face to Face meeting. Preservation and Archiving is one of the most fascinating and complex problems to face our digital world, in my humble opinion. I've chatted about this before in the blog so I won't belabor the point that taking a chisel to the walls of this beautiful location and etching in the contents of your e-book would have a longer life span than most of our media types in use today.
Still, chiseling has some pretty serious downsides
- It doesn't scale well (oh, and you have to scale to get up the wall)
- You have to travel to this site to view the contents (while that might be pleasant, it can be quite inefficient)
- Its difficult to create a geographically remote backup site
- It seriously blemishes a beautiful location
It turns out, there is a community of people that think about this problem as their day job and PASIG is a gathering point for that community. The community is made up of the "new" digital librarians, software that provides the foundation for digital preservation and archives, storage vendors that can provide "temporal" storage locations for the digital archive, and system integrators that can customize the entire stack and tailor it for a particular domain (astronomy, human sciences, etc...).
The community is global, but their is an obvious disparity in how this type of work is funded across countries and boundaries. Here in the U.S., the leaders of preservation and archiving efforts appears to be the education community, with a variety of colleges (Penn State, University of Pittsburgh (Pennsylvania), etc...) and individual government agencies (Library of Congress, National Archives, etc...). Abroad, it seems the repositories are often nationally funded.
While various technologies were interesting, it was also interesting to hear the oft recurring theme of "sustainability"...how can we "sustain" the rate of ingest and cost over time of these types of solutions? Consider the cost of the initial storage of 1 GB of data today (just the storage medium), its between $0.25 and $1.00 depending on your storage medium. Now add in the cost of the various transitions that data makes over 20,000 years (life of a cave painting) the system administration, the building of new equipment, the power involved, etc... At this stage of human evolution, it is difficult to even predict what type of media we will store data on in 20 or 30 years. How can one predict the cost of sustaining these archives, when we won't know what the archive will look like, let alone the environment around it, in 20 years, 100 years or even 1,000 years.
Which brings up another interesting thread...whereas a cave painting is required to stay on the rock it originated on, the life of preserved digital data (and archived) must be assumed to migrate over the data generations. Consider each generation of media about 4 years (3-5 is common), tape can technically survive multiple generations and one can build systems (like the Sun Storage 5800 (Honeycomb) platform) that can theoretically store data for 100,000s years. But the reality is that the lifetime of a technology and a media is that you migrate approximately every 3-5 years.
This is a conceptually different model then many people think of when they think of an "archive" or a "storage vault". Rather than think of a monolithic system that weathers the years, its important to think of data as migratory, and evolutionary (much like humans themselves). To survive 30,000 years, our digital data originates...often in the physical world...but increasingly our data originates in the digital world itself...and migrates to a Tier 1 storage device. Regardless of whether it remains there, over the next 20 years, that data will move across 4 or more devices (not counting replicants or backups) if it stays "online".
So what are the three most important attributes of an archive and/or a preservation system (IMHO):
- The information is safe, stable, and can be verified to be "original" while it resides temporarily within the system
- The information can be moved off the system when the time is right and can still be certified as "original" data rather than a derivation of the data
- Information can be "moved into" a new preservation / archive system.
In the end, long-term data is migratory...and our systems and software that support this field must start thinking of data that way. This type of thinking can also be applied backwards into more traditional storage paradigms, how useful would it be for a photo repository to be able to migrate to bigger and bigger system with higher capacity and lower power consumption? At some point, there is a trade-off that will trigger that company to move, and facilitating that move is in the best interest of all companies.
What else came up at the conference? Tons of things, have we thought enough about:
- Access Rights (and access across time)
- Roles and Responsibilities in a digital archiving and library maintenance system (the new librarians)
- Geo-political boundaries as repositories move into the "cloud"
At this very minute, Sam Gustman is discussing The Shoah Foundation. The mission of The Shoah Foundation is To overcome prejudice, intolerance and bigotry - and the suffering they cause - through the educational use of the Institute's visual history testimonies. The documentation of The Holocaust contains over 120,000 hours of online video testimony. The foundation indexes minute by minute of this testimony and makes it accessible for people to teach and understand the impacts of the holocaust. The foundation moves beyond The Holocaust into testimonies of other events in human history that must be recorded and understood. As George Santayana said: Those who do not learn from history are doomed to repeat it.
Sam's presentation isn't online, I'm trying to get Art to post it. This work alone shows that digital archiving and preservation is not just an interesting segment of compute and storage...its a moral imperative.
I found this link to a video that discusses Fedora Commons, one of the open source repository software initiatives for creating and managing digital libraries and archives. The video discusses some of the solutions that are being built with the Fedora Commons software solution:
What a great opportunity the conference has been to sit and think about the big picture (and archiving and preservation truly are the big picture).