It's been a while. Thoughts on vast amounts of data
By Chris W Beal-Oracle on May 24, 2005
Now I'm back in the UK, things are back to their usual hectic nature and I've neglected my embryo of a blog. So I'm going to make a conscious effort to try and blog before going home.

Today I was sent a very interesting paper written by Clay Shirky. It started me thinking about the nature of file systems. UFS now supports 16 TB of data in a single file system. OK, there are limitations on the number of files, but that's still a heck of a lot of data that needs to be organised. The trouble is that our file systems, email folders and many other things we use on our computers assume that the "thing" we're looking at can be easily categorised, and in fact many times it can't.

The paper describes many examples, but the paper itself is as good an example as any. I wanted to provide a link to the paper in this blog entry. I remembered it had been emailed to me, but my email folders tend to run rampant with things I think are going to be useful, and I never remember where I've put something. I also remembered it had "ontology" in the title, and that it was sent to me by Peter Harvey. These could be described as metadata for the paper. By searching for this metadata I found the link to the paper again. Where it was stored was irrelevant.

The extension that Clay Shirky makes is that if you get a large number of people to add the metadata they think is important to a document (e.g. a URL), then you get, on average, a good classification of the document which can be used to find it. The beauty is that the quality of the categorisation, and hence the ability to find the data, increases with the number of people adding the metadata (referred to as tags).

From a technical point of view this could be done now using UFS extended attributes. What would be needed is a simple tool to add the tags to the data and search the extended attributes for the documents. Sounds like an RFE for Nautilus. The only trouble, I guess, is one of scale.
How would it work unless we have a large number of people adding their tags to the files? I need to give this more thought, but the paper really got me thinking and I hope it does you too.
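To make the "simple tool" idea a bit more concrete, here is a rough sketch in Python of how tagging and searching might look. Everything in it is an assumption for illustration: the attribute name `user.tags`, the JSON encoding of the tag counts, and the function and class names are all made up. Python's `os.getxattr` and `os.setxattr` are only available on Linux, so the `XattrStore` below would not work as-is on Solaris UFS; the `MemoryStore` is there so the logic can be tried anywhere.

```python
import json
import os
from collections import Counter

XATTR_NAME = "user.tags"  # hypothetical attribute name, chosen for this sketch


class MemoryStore:
    """Keeps tag data in a dict; stands in for a real attribute store."""

    def __init__(self):
        self._data = {}

    def read(self, path):
        return self._data.get(path, "")

    def write(self, path, text):
        self._data[path] = text


class XattrStore:
    """Keeps tag data in an extended attribute (Linux-only calls in Python)."""

    def read(self, path):
        try:
            return os.getxattr(path, XATTR_NAME).decode("utf-8")
        except OSError:
            return ""  # attribute missing or not supported on this file system

    def write(self, path, text):
        os.setxattr(path, XATTR_NAME, text.encode("utf-8"))


def load_counts(store, path):
    """Return a Counter mapping each tag to the number of people who added it."""
    text = store.read(path)
    return Counter(json.loads(text)) if text else Counter()


def add_tags(store, path, tags):
    """Record one person's tags for a file; each tag counts once per call."""
    counts = load_counts(store, path)
    counts.update({t.strip().lower() for t in tags if t.strip()})
    store.write(path, json.dumps(counts, sort_keys=True))


def search(store, paths, tag):
    """Return the paths carrying the tag, most-agreed-upon first."""
    tag = tag.lower()
    hits = [(load_counts(store, p)[tag], p) for p in paths]
    ranked = sorted(hits, key=lambda hit: (-hit[0], hit[1]))
    return [p for n, p in ranked if n > 0]
```

Calling `add_tags(store, "paper.html", ["ontology", "Shirky"])` once per person and then `search(store, paths, "ontology")` returns the files that the most people agreed on first. That ranking is the part that only gets useful with scale, which is exactly the problem above.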