By Achim Hasenmueller on Jun 13, 2009
Virtual disk images have a life of their own: what appears to be just a big file on your host system actually contains a complete filesystem managed by the guest. What does that mean? Most users choose dynamically expanding images in VirtualBox because they do not want to limit themselves to a small virtual disk, yet they also do not want to waste host disk space that the guest doesn't actually need. A great concept in theory, but there are some caveats. Why does the disk image grow when there is still plenty of free space in the guest? Why does it only keep getting bigger, even if you delete files in the guest?
The answer lies in the way modern filesystems work. In this article, I will focus on Windows guests and the NTFS filesystem, which is both the most common combination and the most problematic one. A disk is a (quite large) collection of bits that can be addressed in arbitrary order (random access). A filesystem is designed to manage such a disk and turn it into more useful things such as directories and files. The filesystem governs the disk and blindly assumes that it owns the whole disk and can address all of its bits. If it doesn't make use of some part of the disk, it considers that space wasted. It also makes some other assumptions, for example that it's better not to keep touching the same bits because they might degrade over time and make the disk more likely to fail. Some filesystems even try to benefit from the fact that the outer tracks of a spinning disk pass under the head faster than the inner ones, so reads and writes there are noticeably faster. What does all this mean for VirtualBox? Well, all these things explain why a filesystem is so wasteful with disk space: it tends to scatter its data all over the disk, and every region it touches makes the virtual disk image grow.
Why does the virtual disk image grow in large chunks, even if just a small file was written in the guest? That's actually an optimization. Dynamically growing disk images grow in chunks to limit the bookkeeping overhead: if an image grew one byte at a time, we'd need more than one byte of metadata for each byte written! For standard VDI files, the chunk size is 1MB. If one or more bytes are written to a 1MB chunk (we call them grains), the whole grain gets allocated. For VHD (originally from Microsoft Virtual PC), the grain size is even larger at 2MB. For VMDK (originally from VMware), the grain size is just 64kB. This means VMDK is the most storage-efficient file format, but its smaller grains also mean more bookkeeping overhead and somewhat lower write performance.
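To make the grain arithmetic concrete, here is a minimal Python sketch. It is not VirtualBox code: the function `allocated_bytes` and both workloads are made up for illustration, and it ignores the image's own metadata. It only counts how many grains a set of guest writes touches, using the grain sizes quoted above.

```python
KIB = 1024
MIB = 1024 * KIB

# Grain sizes as quoted in the article.
GRAIN_SIZES = {"VDI": 1 * MIB, "VHD": 2 * MIB, "VMDK": 64 * KIB}


def allocated_bytes(writes, grain_size):
    """Bytes the image must allocate for a list of (offset, length) writes.

    Hypothetical helper: a grain is allocated in full as soon as a single
    byte inside it is written.
    """
    touched = set()
    for offset, length in writes:
        if length <= 0:
            continue
        first = offset // grain_size
        last = (offset + length - 1) // grain_size
        touched.update(range(first, last + 1))
    return len(touched) * grain_size


# Two made-up workloads, each writing 100 x 4 KiB = 400 KiB of real data.
packed = [(i * 4 * KIB, 4 * KIB) for i in range(100)]      # back to back
scattered = [(i * 10 * MIB, 4 * KIB) for i in range(100)]  # 10 MiB apart

for name, grain in GRAIN_SIZES.items():
    print(name,
          "packed:", allocated_bytes(packed, grain) // KIB, "KiB,",
          "scattered:", allocated_bytes(scattered, grain) // KIB, "KiB")
```

With these assumed workloads, the same 400 KiB of data costs 1 MiB in a VDI image when packed, but 100 MiB when scattered (200 MiB for VHD, 6400 KiB for VMDK). The smaller the grain, the less a scattering filesystem inflates the image.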
Another important factor is fragmentation. If you delete a 1MB file on your NTFS disk, there will be 1MB of free space somewhere on the disk (assuming the file was not fragmented). Now you want to copy a 2MB file to your disk. What should the filesystem do? Should it look for a place on the disk where there is 2MB of contiguous free space? Or should it cut the file into pieces, put the first 1MB where you just deleted a file and try to squeeze in the rest somewhere else? That's a decision the filesystem has to make each time, and NTFS is known for tending towards fragmentation. Fragmentation is an efficient way of using free space, but if your files are spread all over the disk, it takes a lot of time to read them and performance degrades. Which user hasn't observed that Windows keeps getting slower and slower? Disk fragmentation is one explanation for that phenomenon.
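The two choices can be sketched in a few lines of Python. This is a toy model, not how NTFS actually decides: the block map, the 1MB block size and the function names are all assumptions made for illustration.

```python
def free_runs(bitmap):
    """Yield (start, length) for every run of free blocks (False = free)."""
    start = None
    for i, used in enumerate(bitmap):
        if not used and start is None:
            start = i
        elif used and start is not None:
            yield (start, i - start)
            start = None
    if start is not None:
        yield (start, len(bitmap) - start)


def allocate_contiguous(bitmap, n):
    """Find the first free run that holds all n blocks in one piece."""
    for start, length in free_runs(bitmap):
        if length >= n:
            return [(start, n)]
    return None  # no hole is big enough


def allocate_fragmented(bitmap, n):
    """Fill free holes front to back, splitting the file if necessary."""
    extents = []
    for start, length in free_runs(bitmap):
        take = min(length, n)
        extents.append((start, take))
        n -= take
        if n == 0:
            return extents
    return None  # disk is full


# Toy disk of eight 1MB blocks. Block 2 is the hole left by the deleted
# 1MB file; blocks 5-7 have never been used.
disk = [True, True, False, True, True, False, False, False]

in_one_piece = allocate_contiguous(disk, 2)  # the 2MB file as one extent
in_two_pieces = allocate_fragmented(disk, 2)  # the 2MB file split up
print(in_one_piece, in_two_pieces)
```

In this toy disk the contiguous strategy places the file at blocks 5-6 and leaves the 1MB hole unused, while the fragmenting strategy reuses the hole and splits the file into two extents, at block 2 and block 5.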
Let's look again at that 1MB file we just deleted. What happens when you delete a file? Not much, actually: the filesystem just marks the file as deleted in a global file structure (the MFT, or master file table, for NTFS). That's very quick and allows undelete programs to do their job in many cases. However, it also means that the space the file used to occupy still contains the contents of the file we just deleted. Until the filesystem allocates these blocks again, the data remains as it was. For dynamically growing disk images, this has a major consequence: because the blocks still contain data, they appear to VirtualBox as being in use, so they have to remain in the virtual disk image.
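The effect can be shown with a small Python model. Everything in it is hypothetical: `ToyImage` and `ToyFilesystem` are invented for this sketch and resemble neither the real VDI format nor the real MFT. The only point is that a delete touches the file table and never the data blocks.

```python
GRAIN = 1024 * 1024  # 1MB, the VDI grain size quoted earlier


class ToyImage:
    """Hypothetical growing image: only grains that were written exist."""

    def __init__(self):
        self.grains = {}  # grain index -> bytearray of GRAIN bytes

    def write(self, offset, data):
        for i, byte in enumerate(data):
            pos = offset + i
            grain = self.grains.setdefault(pos // GRAIN, bytearray(GRAIN))
            grain[pos % GRAIN] = byte

    def read(self, offset, length):
        out = bytearray()
        for pos in range(offset, offset + length):
            grain = self.grains.get(pos // GRAIN)
            out.append(grain[pos % GRAIN] if grain is not None else 0)
        return bytes(out)

    def size(self):
        return len(self.grains) * GRAIN


class ToyFilesystem:
    """Hypothetical guest filesystem with a single global file table."""

    def __init__(self, image):
        self.image = image
        self.table = {}  # name -> {"offset", "length", "deleted"}
        self.next_free = 0

    def create(self, name, data):
        self.table[name] = {"offset": self.next_free,
                            "length": len(data),
                            "deleted": False}
        self.image.write(self.next_free, data)
        self.next_free += len(data)

    def delete(self, name):
        # Only the table entry changes; the data blocks are left alone.
        self.table[name]["deleted"] = True


image = ToyImage()
fs = ToyFilesystem(image)
fs.create("report.txt", b"secret contents")
size_before = image.size()

fs.delete("report.txt")
entry = fs.table["report.txt"]
leftover = image.read(entry["offset"], entry["length"])
print(size_before, image.size(), leftover)
```

In this model the image is one grain (1MB) large both before and after the delete, and the old bytes can still be read back, which is also why undelete tools work. From the image's point of view nothing has changed, so there is nothing it could give back to the host.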
If you've made it this far, you've seen answers to the following questions:
- How are virtual disk images organized?
- Why do virtual disk images grow so fast?
- Why do virtual disk images never shrink?
- What is fragmentation and how does it affect virtual disk images?