Detecting data/file corruption
By gjl on Jul 22, 2005
Trying to debug the problem from the application down is probably going to be quite long-winded. So my first action is to verify that the file is actually the same on both machines, i.e. did it get corrupted in the transfer? If it did, then we can forget the application-layer stuff and concentrate on the method of transfer. It seems obvious when you think of it, but sometimes in the heat of the moment the simplest things get forgotten. What follows are some examples of how to use standard Solaris tools to detect data corruption.
For a long time we've had binaries that generate a checksum against a file, which is a simple way to tell if the source and destination copies are the same. There are sum, cksum and, now in Solaris 10, digest. We also have cmp, which will do a byte-for-byte comparison of two files.
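As a rough sketch of the checksum approach, the snippet below prints a file's checksum and size without the file name, so the same line can be produced on each machine and compared. The helper names `file_sig` and `same_sig` and the scratch files are hypothetical, and it assumes `mktemp -d` is available; here both files live on one machine purely for illustration.

```shell
#!/bin/sh
# file_sig: print "<checksum> <bytes>" for a file. Reading from stdin
# keeps the file name out of the output, so lines from two machines
# can be compared directly.
file_sig() {
    cksum < "$1"
}

# same_sig: succeed (exit 0) when two signature strings match.
same_sig() {
    [ "$1" = "$2" ]
}

# Demo on scratch files; in practice, run file_sig once on each machine.
tmpdir=$(mktemp -d)
printf 'hello world\n' > "$tmpdir/source"
cp "$tmpdir/source" "$tmpdir/dest"

src_sig=$(file_sig "$tmpdir/source")
dst_sig=$(file_sig "$tmpdir/dest")

if same_sig "$src_sig" "$dst_sig"; then
    echo "match: $src_sig"
else
    echo "MISMATCH: source=$src_sig dest=$dst_sig"
fi
```

If the signatures differ, the transfer is the first suspect; if they match, cmp can still be used for a byte-for-byte check when both copies are reachable from one host.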
All of these tools can be used on regular files and raw devices.

Copy a raw disk slice to an image file using dd.

# dd if=/dev/rdsk/c0t0d0s3 of=/var/tmp/c0t0d0s3.img bs=1024k
41+1 records in
41+1 records out

Now we can use the comparison tools; they should all come back identical or clean. Remember that cmp gives no output for a matching pair of files. For sum and cksum, the first column is the checksum and the second column is the size (in 512-byte blocks for sum, in bytes for cksum).

# cmp /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
# sum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
28918 85050 /dev/rdsk/c0t0d0s3
28918 85050 /var/tmp/c0t0d0s3.img
# cksum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
3185788260 43545600 /dev/rdsk/c0t0d0s3
3185788260 43545600 /var/tmp/c0t0d0s3.img
# digest -a md5 /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
(/dev/rdsk/c0t0d0s3) = 0616a55e0a4e30ecf49c974f23a56255
(/var/tmp/c0t0d0s3.img) = 0616a55e0a4e30ecf49c974f23a56255

To show what happens when a file is corrupted, we will write a single byte to the front of the file, which is currently all zeros. Here are the current contents of the first 10 bytes of the file (offsets are in octal):

# od -x -N 10 /var/tmp/c0t0d0s3.img
0000000 0000 0000 0000 0000 0000
0000012

Now we write the first byte of /etc/hosts (any file would do) to the front of the image file, to simulate corruption. The conv=notrunc option stops dd from truncating the image, so only that one byte is overwritten.

# dd if=/etc/hosts of=/var/tmp/c0t0d0s3.img bs=1 count=1 conv=notrunc

We now see that the file has changed by one byte.

# od -x -N 10 /var/tmp/c0t0d0s3.img
0000000 3100 0000 0000 0000 0000
0000012

Now we will re-run the comparison commands to see what is shown for a corrupted file.
# cmp /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
/dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img differ: char 1, line 1
# sum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
28918 85050 /dev/rdsk/c0t0d0s3
28967 85050 /var/tmp/c0t0d0s3.img
# cksum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
3185788260 43545600 /dev/rdsk/c0t0d0s3
1666608083 43545600 /var/tmp/c0t0d0s3.img

Again, note that for cksum and sum the second column is identical in the original and corrupt versions, since we have not changed the file length. Only the checksum in the first column differs.

Timings, comparing two identical files on a filesystem. Single-disk Ultra 10, Solaris 10. The timings are dominated by waiting for I/O.

# timex cmp /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
real 12.83
user 4.86
sys 1.31
# timex sum /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
28918 85050 /dev/dsk/c0t0d0s3
28918 85050 c0t0d0s3.img.bak
real 15.17
user 3.89
sys 1.15
# timex cksum /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
3185788260 43545600 /dev/dsk/c0t0d0s3
3185788260 43545600 c0t0d0s3.img.bak
real 14.57
user 2.73
sys 1.33
# timex digest -a md5 /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
(/dev/dsk/c0t0d0s3) = 0616a55e0a4e30ecf49c974f23a56255
(c0t0d0s3.img.bak) = 0616a55e0a4e30ecf49c974f23a56255
real 15.82
user 4.07
sys 1.68
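The single-byte corruption test can also be repeated on ordinary scratch files, with no raw device or root access needed. This is a minimal sketch, not the exact procedure used above: the 64 KB zero-filled file, the helper names `corrupt_first_byte` and `report`, and the use of `mktemp -d` are all assumptions made for illustration.

```shell
#!/bin/sh
# Repeat the single-byte corruption test on scratch files.
work=$(mktemp -d)

# A 64 KB file of zeros stands in for the disk slice image (hypothetical size).
dd if=/dev/zero of="$work/orig.img" bs=1024 count=64 2>/dev/null
cp "$work/orig.img" "$work/copy.img"

# corrupt_first_byte: overwrite byte 0 of a file in place.
# conv=notrunc keeps the rest of the file, so the length is unchanged.
corrupt_first_byte() {
    printf '1' | dd of="$1" bs=1 count=1 conv=notrunc 2>/dev/null
}

# report: say whether two files are byte-for-byte identical.
report() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "differ"
    fi
}

before=$(report "$work/orig.img" "$work/copy.img")
corrupt_first_byte "$work/copy.img"
after=$(report "$work/orig.img" "$work/copy.img")

echo "before corruption: $before"
echo "after corruption:  $after"

# The size (second field) stays the same; only the checksum changes.
cksum < "$work/orig.img"
cksum < "$work/copy.img"
```

Using `cmp -s` suppresses the "differ" message and relies on the exit status alone, which is convenient when the check is part of a script and not run by hand.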