Wednesday Mar 25, 2009

Think Twice Before Deleting Stuff (Or Better Not at All!)

Some piggy banks

No, this is not going to be another "Remember to do snapshots" post. I'm also not going to talk about backups. Instead, let's look at some very practical aspects of deleting files.

So, why delete a file? "Trivial", you think, "so I can save space!". Sure, dear reader, but at the expense of what?

Let's stop and think for a minute. Our lives try to center around doing cool, worthwhile, meaningful, useful stuff. Deleting files is neither cool nor fun; it's a necessity we're forced into. Don't you hate it when that dreaded "Your startup disk is almost full" message appears while you're in the middle of downloading new photos from your latest exciting vacation trip?

Actually, the seemingly simple act of deleting is really a challenge: "Will I need this again?", "Wouldn't it be better to archive this instead?", "Last time I was really glad I kept that email from 2 years ago, so why delete this one?". Sometimes I catch myself thinking for a long time before I actually press that "ok" button or hit "Enter" after the "rm".

The reality is: Storage is cheap, so why delete stuff in the first place?

To put things in perspective, let's try an ROI analysis of deleting files. Let's say we need about 6 seconds of thinking time before we can decide whether a particular file can really be deleted without regret. Let's also assign some value to our time, say $12 per hour (I hope you're getting paid much more than that, but this is just to keep the numbers simple).

Storage is cheap, and last time I checked, a 1 TB USB hard drive cost about $100 at a major electronics retailer, with prices falling by the hour.

Now, how much space does the act of deleting a file need to free up so it justifies the effort of deciding whether to delete or keep it?

Well, our $12 per hour conveniently breaks down to $0.20 per minute, which allows us to perform 10 delete-it-or-not decisions per minute at $0.02 each. Fine. Deleting seems to be cheap, doesn't it?

Now, for that $0.02 you can buy 1/5000th of a 1 TB hard drive ($100 / $0.02 = 5000). Wait a minute, 1 TB/5000 still amounts to 200 MB of data per $0.02! That's more than you need to store a 10-minute video, or a full CD of music, compressed at high quality! Or 20 presentations at 10 MB each! Not to mention countless emails, source code and other files!

So, unless the file you're pondering is bigger than 200MB, it's not really worth even considering deleting it. I'll call this 200MB boundary the "Destructive Utility Heuristic (DUH)".
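The arithmetic behind the DUH can be written down as a tiny calculation. This is only a sketch of the reasoning above: the function name and parameters are made up for illustration, and 1 TB is taken as 1,000,000 MB.

```python
def deletion_threshold_mb(seconds_per_decision, hourly_rate_usd,
                          drive_price_usd, drive_capacity_mb):
    """File size (in MB) at which the storage saved is worth exactly
    as much as the time spent deciding whether to delete the file."""
    # What one delete-it-or-not decision costs in working time.
    cost_per_decision = hourly_rate_usd / 3600 * seconds_per_decision
    # What one megabyte of disk costs.
    price_per_mb = drive_price_usd / drive_capacity_mb
    return cost_per_decision / price_per_mb


# Figures from this post: $12/hour, $100 for 1 TB (assumed to be 1,000,000 MB).
duh = deletion_threshold_mb(6, 12.0, 100.0, 1_000_000)   # 6 seconds of thinking
fast = deletion_threshold_mb(1, 12.0, 100.0, 1_000_000)  # 1 second of thinking

print(f"6 s per decision: about {duh:.0f} MB")
print(f"1 s per decision: about {fast:.0f} MB")
```

Plug in your own hourly rate and the current price of a drive to get your personal threshold. A higher rate or a cheaper drive pushes it up.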

The result is therefore: Save your time, buy more hard disk space (or upgrade your old hard drive to a bigger one before it dies) and move on. Life's too precious to waste on deleting stuff. Create good stuff instead! Only think about deleting stuff if the file in question is bigger than 200MB.

I can hear some "Wait, but!"'s in the audience, ok, one at a time:

  • "But I can delete much faster than 6 seconds!"
    No big deal. So you can delete 1 file per second; that's still a threshold of 33MB (200MB / 6), more than 5 songs' worth or even the biggest practical business presentation or the source code to a major open source project. And hard disks are getting cheaper every day, while your time will become more and more precious as you age. Yes, if you're dead sure that file is useless junk and don't need to think about it, go ahead and delete it, but why did you save it in the first place?

  • "But I like my directories to be clean and tidy!"
    Congratulations, that's a good habit! Keeping files organized doesn't mean you need to delete stuff, though. Set up an "Archive" folder somewhere and dump everything you think you may or may not use again there. Use one archive folder for each year if you want. File search technology is pretty advanced these days, so you should be able to find your archived files quicker than the time you'd take to decide which ones you'll never want to find again. Then, you can still decide to delete your whole archive from 3 years ago because you never used it. That will likely make some sense, because its size may be above the destructive utility heuristic. But chances are you won't really care: storage will have become even cheaper after those 3 years, so you won't save much, relatively speaking.

  • "That still doesn't help me when that damn 'Your startup disk is almost full' message comes!"
    You're right. The point is: It's often hard to sift through data and decide what to keep and what not. That's why we dread deleting stuff and instead wait until that message comes. I'm only offering relief to those who feel that having to delete stuff isn't really rewarding, and it isn't (at least while you're below the DUH). Go buy a bigger hard drive for your laptop; it's really the cost-effective option. Use the numbers above to help you justify that to your finance department.

  • "I'm still not convinced. I actually kinda like going through my files and deleting them once in a while..."
    Sure, go ahead. Just know that you could use that time to do more productive stuff, such as checking out the Sun cloud, installing OpenSolaris or testing our new Sun OpenStorage products.

  • "Wait, aren't you supposed to write about OpenSolaris, ZFS and this stuff anyway?"
    I'm glad you mentioned that :). Actually, OpenSolaris and ZFS make it even easier for you to both not care about deleting stuff while keeping your files organized at the same time. The amazing ZFS auto snapshot SMF service will create snapshots of your data automagically every 15 minutes, so it won't matter whether you delete files or not. You can then choose to either not delete them at all and just move them to some archive, or you can delete whatever you want, without the 6 seconds of thinking (just to keep stuff tidy), knowing that you'll always be able to recover those files with Time Slider later. You could then use zfs send/receive to dump your data incrementally to a file server as a backup mechanism and the hooks are already there to automate this.
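The "one archive folder per year" idea from the list above can be sketched in a few lines. This is just an illustration, not a finished tool: the function name and folder layout are assumptions, it sorts files by their last-modified year, and it only looks at the top level of the folder you point it at.

```python
import shutil
import time
from pathlib import Path


def archive_by_year(source_dir, archive_root):
    """Move every regular file in source_dir to archive_root/<year>/,
    where <year> is taken from the file's last-modified time."""
    moved = []
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file():
            continue  # leave sub-folders (including the archive itself) alone
        year = time.localtime(path.stat().st_mtime).tm_year
        target_dir = Path(archive_root) / str(year)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / path.name
        if target.exists():
            continue  # never overwrite something that is already archived
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved


# Hypothetical usage: the paths are examples only.
# archive_by_year(Path.home() / "Desktop", Path.home() / "Archive")
```

Nothing gets deleted here, so there is no 6 seconds of thinking involved: files just move out of sight, and search can still find them.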

See, once you think about it, there's not really a need to delete files at all any more. At least not for mere mortals like us with file sizes that are typically below the destructive utility heuristic of currently 200MB (and rising...) most of the time. Music has already reached the point where a song can be stored at studio quality with lossless compression at manageable file sizes, so that kind of data won't see significant growth any more. And photos and videos will soon follow. This means we'll need to care less and less about restricting personal data storage. Instead, we now need to focus more on managing personal storage.

Now there's a completely different problem that'll keep us entertained for some time...

Monday Apr 21, 2008

On Knowledge Management, Community Equity and Ontologies

Last week, I attended a meeting of the BITKOM Working Group for Knowledge Engineering & Management at the Sun Frankfurt office. The meeting was very nicely organized by Mr. Weber, Mr. Neuwirth and some colleagues from Sun in Germany (Hi Hansjörg, you should really blog!) and Peter Reiser from Sun in Switzerland. Therefore, I got to play host of the meeting without having to do too much work :).

Peter asked me to present his work on Community Equity (see also this interview with Shel Israel and this other one with Robert Scoble) and the CE 2.0 project to the group. The working group was very interested in how to encourage communities to participate and how Community Equity mechanisms can be used towards this goal. We had quite a few positive discussions during the breaks.

Image illustrating Community Equity 

But some people seem to be concerned about tracking community contribution and participation on an automatic basis; for example, see Mike's post on the subject and Alec's reaction to Peter's interview. These are all very valid thoughts, and indeed nobody wants to see their work or life reduced to a couple of numbers.

As always, the threat is not in the technology, but in the way we use it:

  • Measuring stuff is a good thing, if you know what you measure and how accurate that measurement is.
  • Telling people how their work is being received is also a good thing. I always get a kick out of the HELDENFunk download statistics (We should probably start publishing them), or my own blog's metrics. This is a huge motivator.
  • Telling people about how other people's work has been received is also a good thing. Nobody would put the kind of trust into eBay if it weren't for their rating system. How many books have you bought on Amazon based on other people's recommendations, stars, etc. on their site?
  • Web 2.0 style commenting, crosslinking, social networking, tagging and rating is also a good thing. Much of the web 2.0 world today would be untrusted, unnavigable and useless if it weren't for those mechanisms.
  • The next step is to take these concepts, and apply them to an enterprise context. This is what Peter's Community Equity work is all about. The goal I see here is: If you do a good job, others should be able to notice (including, but not limited to, your manager). If you're looking for an expert on topic X, you should be able to find people that may be able to help you. If you are talking to person Y or if you run into that person as part of a team, you should be able to see what kind of work that person has contributed to the enterprise before and what others are saying about them. Think Amazon and eBay and LinkedIn ratings, recommendations, tags etc. as a tool to better navigate the social network and knowledge base of your enterprise.

Notice that the part where discussions become heated is not the technology one, it's the "what do we do with the numbers" part. That, of course, is where we need to be careful. We need to understand how the data is generated, how it has been processed (i.e. the exact rules and formulas that are used to generate the Community Equity score) and what it does not tell us. You may trust your latest auction winner to transact with you on that particular sale, but you still don't know if she is actually a nice person or not :).

As long as the process is open, well-understood and transparent, using Web 2.0 mechanisms and Community Equity style metrics can be a very useful thing. You can generate a lot of useful information based on that kind of data: What are hot topics? Which documents are the most used, best rated, most re-used ones? Who are the company internal creators, connectors and consumers of knowledge? Which topics have trouble being picked up by the community? Sounds like fascinating stuff, if you're responsible for your company's knowledge...

Of course, this was only a small part of the BITKOM meeting. We heard presentations by other companies on different applications of knowledge management technologies in a customer service context. Interestingly, all of them (including CE 2.0) mentioned the term Ontology in one way or another. In a knowledge management context, an Ontology is the part of the system that relates "words" or other abstract data to real-world concepts and objects, resolving ambiguities, consolidating synonyms and clarifying user errors. It's the part of the system that tries to bring in semantic knowledge as opposed to merely processing words.

Ontologies are very hard to do. That's why most of the time they are generated "by hand", which is very time and resource consuming. The holy grail of ontologies is when the system can automatically generate semantic meaning out of naked data by itself, without any help. Some of these systems are seeded with hand-made ontologies that can then expand somewhat automatically.

An interesting approach to generating ontologies might be to analyze web 2.0 style tagging data that has been created by users. An ontology system could then try to identify clusters of tags and assign them to a real world concept, then try to identify relationships between those concepts. As an example, the tags "LDAP", "Directory Server", "DS" all belong to the same concept and they are related to (but not the same as) "Identity Management", "IdM", and "Databases". A search engine then can use this data to find better matches for a user that is looking for "Identity Management and LDAP interoperability".
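To make the LDAP example a bit more concrete, here is a minimal sketch of how a search engine might use such tag clusters to widen a query. Everything in it is an assumption for illustration: the concept names are invented, and the clusters and relationships are typed in by hand, whereas the approach described above would derive them from user tagging data.

```python
import re

# Hypothetical, hand-seeded clusters: each concept groups the tags that mean the same thing.
CONCEPTS = {
    "directory-services": {"ldap", "directory server", "ds"},
    "identity-management": {"identity management", "idm"},
    "databases": {"databases"},
}

# Hypothetical "related to, but not the same as" links between concepts.
RELATED = {
    "directory-services": {"identity-management", "databases"},
    "identity-management": {"directory-services"},
    "databases": {"directory-services"},
}


def concepts_for(query):
    """Return the concepts whose tags occur as whole words in the query."""
    q = query.lower()
    return {
        name
        for name, tags in CONCEPTS.items()
        if any(re.search(r"\b" + re.escape(tag) + r"\b", q) for tag in tags)
    }


def expand_query(query):
    """Return (matched concepts, all synonymous tags, related concepts)."""
    matched = concepts_for(query)
    synonyms = set().union(*(CONCEPTS[c] for c in matched))
    related = set().union(*(RELATED.get(c, set()) for c in matched)) - matched
    return matched, synonyms, related


matched, synonyms, related = expand_query("Identity Management and LDAP interoperability")
print("concepts:", sorted(matched))
print("also search for:", sorted(synonyms))
print("related concepts:", sorted(related))
```

With this data, a document tagged only "DS" or "IdM" becomes a candidate match for that query, even though neither tag appears in it. The hard part, of course, is producing the two dictionaries automatically.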

As you can see, even a seemingly dry and academic workshop on "Knowledge Engineering and Management", organized by an industry association, can be exciting, sometimes transcending the boundaries between technology, philosophy and anybody's daily web 2.0 style work.

Thursday Nov 01, 2007

7 Tips for Enhancing Your Email Efficiency

This article helps you deal more efficiently with large amounts of email. It looks at client and server side features that are useful, then concentrates on the most crucial aspect of email efficiency: Email processing workflow. We'll develop 7 easy to follow rules that will enable us to reach 0 mails in our Inbox in a short time while still staying informed earlier, easier and more reliably.
