Amazon S3 Silent Data Corruption
By gbrunett-Oracle on Jan 28, 2009
While catching up on my reading, I came across an interesting article focused on the Amazon's Simple Storage Service (S3). The author points to a number of complaints where Amazon S3 customers had experienced silent data corruption. The author recommends calculating MD5 digital fingerprints of files before posting them to S3 and validating those fingerprints after later retrieving them from the service. More recently, Amazon has posted a best practices document for using S3 that includes:
Amazon S3’s REST PUT operation provides the ability to specify an MD5 checksum (http://en.wikipedia.org/wiki/Checksum) for the data being sent to S3. When the request arrives at S3, an MD5 checksum will be recalculated for the object data received and compared to the provided MD5 checksum. If there’s a mismatch, the PUT will be failed, preventing data that was corrupted on the wire from being written into S3. At that point, you can retry the PUT.All in all - good advice, but it strikes me as unnecessarily "left as an exercise to the reader". Just as ZFS has revolutionized end-to-end data integrity within a single system, why can't we have similar protections at the Cloud level? While certainly it would help if Amazon was using ZFS on Amber Road as their storage back-end, even this would be insufficient...
MD5 checksums are also returned in the response to REST GET requests and may be used client-side to ensure that the data returned by the GET wasn’t corrupted in transit. If you need to ensure that values returned by a GET request are byte-for-byte what was stored in the service, calculate the returned value’s MD5 checksum and compare it to the checksum returned along with the value by the service.
Clearly, more is needed. For example, would it make sense to have an API layer be added that automates the calculation and validation of digital fingerprints? Most people don't think about silent data corruption and honestly they shouldn't have to! Integrity checks like these should be automated just as they are in ZFS and TCP/IP! As we move into 2009, we need to offer easy to use, approachable solutions to these problems, because if the future is in the clouds, it will revolve around the data.