Cometh the "Semantic Web", cometh "Semantic Attacks"...
By davew on May 04, 2007
As Alec blogged the other day, blogging and other Web 2.0-ish things can really help an organisation's reputation for openness, honesty and responsiveness in the wider world. They can also wreck it just as easily.
A while back, I blogged some thoughts about security in this kind of space; I stated that availability is probably the main aim of organisations providing these kinds of capabilities as public services, and I still think this stands.
We've seen how YouTube has had some issues with the Brazilian and Turkish governments in the last few months. From a service perspective, the posters of the respective videos successfully perpetrated semantic-level denial-of-service attacks: to Brazilian and Turkish users, YouTube was effectively down, so in all probability they went and posted their video clips elsewhere for the duration.
YouTube and Picasa have a usage model whereby uploaded content can be reported by viewers as inappropriate; if such reports need to be verified by people at the provider end (and they probably do, given that a further semantic attack would be to lodge a complaint against some material which was perfectly above-board), we have a scalability problem in terms of requirements for analysis.
Alec also mentioned the fact that a certain hex string is turning up all over Digg; I'm also seeing it turn up in a number of the blogs I read. Now that the AACS have revoked this string so that new content is encoded somewhat differently, the question of what - if anything - they could do to monitor its proliferation is academic, but nonetheless interesting.
In the Brazilian YouTube case, the main issue was not just that a video of a Brazilian starlet cavorting with her boyfriend while not wearing very much had been uploaded, but that numerous copies of it had been uploaded under different filenames. This made me think of a couple of things about file digests.
Given a file foo, what other stored files are copies of it?
Addressing storage by content should (in theory) be as simple as holding digests of files as well as the files themselves (which ZFS could theoretically do; or a suitable BART manifest could be constructed on the fly at scan time), and then doing a sort | uniq -d on the digests to list those which occur more than once.
Suppose the digests can be pulled into an area which is either unencrypted or encrypted with a key different to any user's key, and are updated by a cron job while users are logged in or by an extra hook in the login / logout mechanisms. It then becomes possible to do some kind of compare (although not time-synched) with the content of other users' home directories, even if those home directories are stored encrypted. (I just realised that this is one exception to the rule of keeping data and metadata together; however, unsigned digests are easy to reconstruct in any context, so we can get away with it.)
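As a rough illustration of the "hold digests, then group them" idea, here is a minimal Python sketch. It assumes SHA-256 as the digest and a plain directory walk as the scan; the function names (file_digest, build_index, duplicates, copies_of) are made up for this example and have nothing to do with how ZFS or BART actually store checksums.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_index(root):
    """Map each digest to the list of files under root with that content."""
    index = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            index[file_digest(path)].append(path)
    return index


def duplicates(index):
    """Keep only the digests shared by more than one file (the uniq -d step)."""
    return {d: paths for d, paths in index.items() if len(paths) > 1}


def copies_of(index, path):
    """Given a file foo, list the indexed files with identical content."""
    return index.get(file_digest(path), [])
```

Filenames play no part in the grouping, so the same content uploaded under different names ends up under one digest.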
Given a file foo or a digest of it, how can I prove easily that I have no other copies of it?
If a bunch of this content has made its way onto a system's accessible storage, then all that storage needs to be swept to ensure that no copies of the illegal files remain. Holding a central repository of digests for all files available to the system makes this straightforward - for filesystems held as cleartext, this could even be done most efficiently on RAID controllers provided there was a way for such entities to communicate with each other over NFS or FC (OK, doing it in the host is probably simpler!).
Given a file foo or a digest of it, can I prove that I have never had a copy of it?
Again, this is great when your friendly local law enforcement officials start asking questions about illegal files that they think might have spread to your organisation. If digest files are integrated into an archive solution so that digests of files are retained even when the file has been deleted (i.e., WORM the digests but not the content), then a search through the digest archives should act as conclusive proof that a file hasn't passed through the system (although the digest files will themselves probably need to be signed).
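To make the "WORM the digests but not the content" idea a bit more concrete, here is a small Python sketch of an append-only digest log. Everything in it is an assumption for illustration: the DigestArchive class is hypothetical, the log lives in memory rather than on write-once storage, and an HMAC with a shared key stands in for whatever signing scheme a real archive would use.

```python
import hashlib
import hmac


class DigestArchive:
    """In-memory stand-in for a write-once log of file digests."""

    def __init__(self, key):
        self._key = key          # assumption: a secret held by the archive
        self._entries = []       # (digest, tag) pairs; entries are only appended

    def _tag(self, digest):
        # HMAC-SHA256 over the digest, standing in for a real signature
        return hmac.new(self._key, digest.encode(), hashlib.sha256).hexdigest()

    def record(self, data):
        """Log the digest of some content; the content itself is not kept."""
        digest = hashlib.sha256(data).hexdigest()
        self._entries.append((digest, self._tag(digest)))
        return digest

    def verify(self):
        """Check that every logged digest still matches its tag."""
        return all(hmac.compare_digest(tag, self._tag(d))
                   for d, tag in self._entries)

    def ever_seen(self, digest):
        """Has content with this digest ever been recorded?"""
        return any(d == digest for d, _ in self._entries)
```

Note that per-entry tags like these only show that the entries present are untampered; they would not reveal an entry that had been quietly removed, so a real archive would also need to protect the log as a whole.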
For text such as "that hex string", simple greps rather than digests work best; however for images, and especially movies, greps and digests are likely to have problems. Changing even one pixel in an offending image or movie would result in a completely different digest, so techniques such as geo-temporal indexing would have to be used, effectively resulting in "pattern-matching blacklists for video".
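The "one pixel" point is easy to demonstrate. The Python sketch below uses a block of zero bytes as a stand-in for image data (an assumption; no real image format is involved) and SHA-256 as the digest, then flips a single bit and compares the results.

```python
import hashlib

original = bytes(1024)           # stand-in for image data: 1024 zero bytes
altered = bytearray(original)
altered[0] ^= 1                  # flip a single bit, i.e. "change one pixel"

d1 = hashlib.sha256(original).hexdigest()
d2 = hashlib.sha256(bytes(altered)).hexdigest()

# Count how many hex digits of the two digests disagree
differing = sum(a != b for a, b in zip(d1, d2))

print("digests equal:", d1 == d2)
print(differing, "of", len(d1), "hex digits differ")
```

The two digests share no useful structure, which is why exact-match digests cannot catch a re-encoded or slightly edited copy and some form of content-based matching is needed instead.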
There's a bunch of guys up in York who could do this, and whose hardware accelerators are designed for our kit - see http://www.cybula.com/.
If approaches such as these can reduce the duration of outages caused by semantic attacks, I'm sure that the larger service providers will be interested...