Tuesday Aug 28, 2007

My Bloomin' Friends

Closed Social Networks are blossoming all over the place. They provide a semblance of protection, at a price: lock in. Locked into the social network provider you get convenience in the form of tools to make conversation easier (video, email, chat boards, ...), some form of privacy protection (if you trust the provider), introductions to 'like minded' people, and other niceties.

Some of us work in the open air: we have to set standards in public view; we stand by what we say; we accept criticism from wherever it comes; and we can't choose our friends based on their social network provider. We describe ourselves in our foaf files where we can specify what we do, how to contact us, our interests, and links to who we know by pointing to their Universal Identifiers. There is no trouble linking between people who are open in this way. We are happy to reference each other: it strenghtens the exposure of our work and the quality of the web. This is how I link to Paul Gearon:

:me foaf:knows  [ = <http://web.mac.com/thegearons/people/PaulGearon/foaf.rdf#me>; 
                  a foaf:Person;
                  foaf:name "Paul Gearon" ] .
I could just point to his URL, but the little extra duplicate information can make life easier for people/robots browsing the data web. It can help people notice inconcistencies and help me correct them.

But not everyone lives in the open the same way, and not everyone wants to make the same amount of information about themselves public. There are a number of different ways to deal with this. I want to discuss a few of them here.

Content Negotition

How much someone says about themselves is up to them, and so is how they protect their information. The same URL that identifies someone, could return more or less information depending on who is asking. I could set up my foaf file so only friends who log in via openid can see my friends. Others would just get default information about me. I could be even more clever. I could allow any friend of my friend who logs in via their openid to see my full foaf file; others would see information about me, and a select group of open friends. Closed Social Networks could open up by making it convenient to specify these policies, and providing the right infrastructure to do so.

Indirect Identification

By directly identifying someone via a URL (as I do) we can leave a lot of the policy of what they make visible up to them. But those that don't have a foaf name, need to be identified indirectly. We can do that by identifying them via some property such as their blog, their home page, their email address, or their openid. I am very open about my email addresses. They are published and visible to all.
 <http://bblfish.net/people/henry/card#me>     <http://xmlns.com/foaf/0.1/mbox> <mailto:henry.story@bblfish.net> .
I value it more that people can contact me easily - living as I do in the middle of nowhere and often living nowhere in particular - than the pain of spammers. Too many people are lazy about security, using virus filled Windoze computers, obvious passwords, cracked software for me to be under any illusion that hiding my email is going to prevent the bad guys from getting it.

However I can't assume that everyone else will accept me applying this argument to their email address. For this there is a nice mathematical technique: I can encrypt their email address using the SHA1 hash function. This create a close to unique string that cannot be dissasembled. You cannot go from the sha1 sum of an email address back to the email. But you can always calculate the same sha1sum from an email. This is how I identify Simon Phipps, Sun's Open source Officer:

:me foaf:knows [ a foaf:Person;
                 foaf:mbox_sha1sum "4e377376e6977b765c1e78b2d0157a933ba11167";
                 foaf:name "Simon Phipps";
                 rdfs:seeAlso <http://www.webmink.net/foaf.rdf>

If you know Simon's email, then you will know that I know him. "What use is that?" I can hear someone ask. It's all about Working with People on the Internet. Imagine you are reading email on a newsgroup with a foaf enabled mail tool linked to a foaf enabled Address Book (such as Beatnik). You come on an email by Simon saying something interesting about how Sun has changed its stock ticker to JAVA for example. My logo and perhaps that of a couple of other people appears on the mail reader in a way that indicates to you that we know Simon. The post is no longer anonymous for you, and so has more trust value. You feel part of a community.[1]

So spammers can not use that information to spam. Either they already know your email address, and so they are probably already spamming you, or they don't, and this won't help them. They can only [2] learn about social network claims: who claims to know who. They could use this, it is true to introduce themselves as an aquaintance of a friend of yours. A bit of a risky strategy that could quickly get them on a black list. Currently being black listed may not be an expensive proposition. But in a cryptographic web of trust this will be both much easier to notice, and more damaging for the infringers.

Fuzzy Identification

I can directly and indirectly identify a lot of people in my Address Book as described above. This is perfectly acceptable for people who have an open life, like I do, and a large portion of the Open Source community, bloggers, standard setters, etc... But on last count I had over 700 people in my AddressBook. It is a lot of work to identify all of them individuall, and to decide how much visibility I should give them. I may not even want people to know how many people I know this way. Also I may want deniability: there are people one may know, but one may not want to highlight that, and one may want to be able to deny that one knows them to some people. The foaf:sha1sum gives me a way to identify someone, but if some nozy person comes to me and asks me about that person's life after having identified the corresponding email address, there is no escape route other than refusing any conversation, which by itself can easily be taken to be significant. What we need is a way to fuzily identify a group of one's aquaintance.

Bloom Filters

This is what Bloom Filters enable one to do. Originally used in times when memory was expensive, they allowed the whole vocabulary of a language to be condensed into a reasonably short string. Here we can use it to group all the email addresses of our friends together in one opaque string. I could express as follows in RDF (bear in mind that the rdf vocabulary has not been settled on):
:me foaf:bloomMbox [ a bloom:Bloom;
                     bloom:base64String """"
                     bloom:hashNum 4;
                     bloom:length 1000 ] .

Given the above Bloom someone can query it with an email address using the inverse algorithm and the Bloom will answer either that I may know that person, or that it can't tell. The loaf project explains some of the advantages of having this in more detail.

The best way to get a feel for how it works is to try it. Here I have written a little java applet [3] that allows you to test my Bloom for people I know, and to create your own bloom [4].

Your browser is completely ignoring the <APPLET> tag! Go to java.com to download the latest.

Some emails you can try with positive results are tbray attextuality dot C O M, or bill at dehora dot net (suitably transformed of course). The applet lowercases all email addresses when creating and when testing the bloom.

To create your own bloom just click the "Create Bloom" tab. An easy way to extract all your email addresses from an OSX Address Book is to run the following on the command line:

hjs@bblfish:0$ osascript -e 'tell application "Address Book" to get the value of every email of every person' | perl -pe 's/,+ /\\n/g' | sort | uniq | pbcopy

You should now be able to paste the list of all your contacts in the applet. To restrict the Addresses to on of your groups named "foaf" for example replace the relevant section above with tell application "Address Book" to get the value of every email of every person in foaf.

You will need to choose the number of hashes and the maximal size of the bucket you wish to fill. The greater the number of hashes and the greater the size of the bucket, the more precision you get and the less deniability.[5]


None of the above tools are by themselves the complete solution for creating an Open Social Network that will satisfy everyone. But for people willing to live in the open, the correct and astute use of them should satisfy most of people's requirements. Access Control on URLs can make it possible to reveal more or less information depending on who is looking; indirect identification can allow one to name people even without direct identification; sha1sums allows one to partially hide sensitve identifying information; and Blooms allow one to make fuzzy statements of set membership. All of these can be combined in different ways. So one can make statements about sha1sum identified people on the open web, or one can do so behind an access controlled file that only friends logged in with OpenId can see. There are bound to be more fun things to be discovered here. But this should make clear just how much can be done in this space.


  1. For the link from email addresses to sha1sums to work, it helps to canonicalise the emails to all lowercase. This should probably be made more explict in the foaf:mbox_sha1sum definition.
  2. "They can 'only' learn about social network claims", is quite a lot more than some people are willing to accept. See the article by Mark Wahl "Organizing principles for identity systems: Attacks on anonymized social networks and fudging oracles" which contains some very good pointers. For people who want to retain complete anonymity, and this is what people subscribe to when they answer public surveys, any leakage of information is too much leakage. The problem is that because of Metcalf's Law it is nearly impossible to stop information combining itself: Information wants to be linked. So I think, when we are not tied to stringent laws, we should accept this rather than fight it, and use it to our advantage when hunting down spammers: the law holds for them too.
  3. You can get the source code for the applet on the so(m)mer repository in the misc/Bloom subdirectory. I used the pt.tumba.spell.BloomFilter class which I adapted a little for my needs. This was just the first one I found out there. It is probably not the most efficient one, as it uses an array of booleans, when it could use an byte array. If you know of other libraries please let me know.
    The code was put together really quickly and may well contain bugs. Feedback and patches and contributions are welcome.
  4. the advantage of Java Applets over server side code is really obvious here:
    • I don't need a server with a fixed port number to show you this
    • someone can't easily start a denial of service attack to bring the server down
    • You email addresses never leave your computer, so there is no fear of loss of privacy.
    On the last point it would be nice if browser vendors made it easier to get info about the exact restrictions a Java Applet had. I would like to be able to click on an Applet and verify or set it to "no network communication whatever". This would increase trust even more in cases like this.
  5. More info on the load site. Apparently one needs more than 1/4 deniability if one is to preserve some measure of privacy, according to the paper "the price of privacy and the limits of LP decoding" by Cynthia Dwork, Frank McSherry and Kunal Talwar (Microsoft Research) who suggests that
    ... any privacy mechanism, interactive or non-interactive, providing reasonably accurate answers to a 0.761 fraction of randomly generated weighted subset sum queries, and arbitrary answers on the remaining 0.239 fraction, is blatantly non-private.
    Thanks again to Mark Wahl for these references.
  6. Thanks a lot to Dan Brickley for working together with me on this last Friday, and pointing me to many of the important work done here. Dan also wrote a little python script to do something similar. Some of the sites I came across during our discussion: Not having studied bloom filters in detail, I am not sure how compatible the blooms of each of these libraries are. The super simple ruby bloom library does not seem to specify the number of hashes that were used to create a Bloom.
  7. Nick Lothian reminded me in a comment to this that he has written a Bloom Filter demo for facebook. I don't have a facebook account (because I am already on LinkedIn, and I can't really be bothered to move all my information, and because I don't like closed networks), so I was not able to use it. Perhaps I should get a facebook account just for this... Let me know.



« August 2016