Monday Aug 25, 2008

Review of CEAS 2008

Last week, I attended CEAS 2008, an Email and anti-spam conference where researchers from all over the world presented the latest techniques they have devised against spam. Most of the talks were pretty heavy on statistical analysis. I have to admit I haven't seen this many math formulas since college, nor had I heard terms such as "OSBF-Lua", "(RO)SVM", or "TREC" from any anti-spam product vendor before.

There were two sessions that I found to be particularly interesting: A Survey of Modern Spam Tools by Henry Stern of IronPort/Cisco and Fighting Spam: Gmail's Story by Brad Taylor of Google. Henry talked about Dark Mailer, Send Safe and Reactor Mailer—the last one is responsible for 40-50% of all Email traffic on the Internet—and showed us an example of how spammers could use the 'rndline' template macro to generate 28.5 quadrillion unique messages. Brad talked about some of the anti-spam measures Gmail takes; he couldn't share all the details for fear that someone might try to game the system with that knowledge. The Gmail "Spam Czar" is no doubt a celebrity in this circle, but he wasn't the only one; Eric Allman (who developed sendmail) and David Crocker (author of RFC 822) were among those in the audience.
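To get a feel for how a line-picking macro like 'rndline' inflates the number of unique messages, here is a minimal Python sketch. The template slots and their contents are made up for illustration and are not the figures from Henry's talk; `count_variants` and `render` are hypothetical helpers, not part of any spam tool.

```python
import math
import random

# Hypothetical template: each slot is a pool of interchangeable lines.
# A macro such as 'rndline' would pick one line from each pool per message.
TEMPLATE_SLOTS = [
    ["Hi", "Hello", "Hey there", "Greetings"],
    ["check this out", "you have to see this", "thought of you"],
    ["Best", "Cheers", "Regards", "Thanks", "Later"],
]


def count_variants(slots):
    """Number of distinct messages: the product of the pool sizes."""
    return math.prod(len(pool) for pool in slots)


def render(slots, rng):
    """Build one message by picking a random line from every pool."""
    return "\n".join(rng.choice(pool) for pool in slots)


if __name__ == "__main__":
    print(count_variants(TEMPLATE_SLOTS))  # 4 * 3 * 5 = 60
    print(render(TEMPLATE_SLOTS, random.Random(0)))
```

Because the count is a product, every extra slot multiplies the total, which is how a modest template can reach the quadrillions.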

Social Honeypots: Making Friends With A Spammer Near You by Steve Webb of Georgia Tech was pretty entertaining as well. It's evident that spammers are reaching out beyond Email to social networks, though according to Steve's test on MySpace they seem to target only male users.

Perhaps the best part was meeting more than a handful of smart people genuinely interested in messaging. I'll definitely try to come back next year, especially if there's more focus on those other aspects of anti-spam, like maybe a best practice talk on DKIM or SPF, how to defeat SpamAssassin (or another anti-spam tool), how to avoid being blacklisted, etc.

Tuesday Apr 22, 2008

Race between spammers and anti-spammers

In recent months, several companies have released their statistics and outlook on spam.

  • Postini blocked 47 billion spam messages, over 320 Terabytes of spam in October 2007 alone.
  • Barracuda Networks reports that up to 95% of Emails were spam in 2007.
  • Sophos suggests that 92.3% of email sent during the first quarter of 2008 was spam.
  • Symantec Brightmail calls Europe the new King of Spam, taking the title from North America for the third month in a row.
  • Google says incoming spam to Gmail is declining. Some disagree.

The race between spammers and anti-spammers has been going on for years, and spammers have the upper hand now and for the foreseeable future because their opponents are constantly playing defense.

As amazing as computers are at pattern recognition, randomizing a pattern is still orders of magnitude faster and cheaper than resolving one. Take CAPTCHA for example. Or compare the number of turns it takes to scramble a Rubik's cube with the number it takes to unscramble it.

The problem with current spam detection techniques is that they are largely built on pattern recognition, be it phrase filtering, Bayesian, fingerprinting, DNSBL, heuristic analysis or profiling. As soon as an anti-spam algorithm finds a reliable way of countering a certain kind of spam, all it takes for spammers to defeat that and to launch a new wave is to randomize their attack further, and it's back to playing catch up for the blockers. With the vast amount of resources these spammers have, their operation can only become more and more efficient.
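As a minimal illustration of that catch-up game, the Python sketch below pits an exact-match fingerprint against a trivially randomized message. It is a toy under stated assumptions: real fingerprinting schemes are fuzzier than a plain SHA-256 hash, and `fingerprint` and `randomize` are hypothetical helpers written for this example.

```python
import hashlib
import random
import string


def fingerprint(message: str) -> str:
    """Exact-match fingerprint: SHA-256 of the lowercased, whitespace-normalized text."""
    normalized = " ".join(message.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def randomize(message: str, rng: random.Random) -> str:
    """Append a short random token, as a spammer's template might."""
    token = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{message} {token}"


if __name__ == "__main__":
    spam = "Cheap watches, click here"
    blocklist = {fingerprint(spam)}  # the filter has learned this one message

    rng = random.Random(42)
    variants = [randomize(spam, rng) for _ in range(5)]

    # The original is caught, but every randomized variant slips through.
    print(fingerprint(spam) in blocklist)
    print(any(fingerprint(v) in blocklist for v in variants))
```

The spammer's side costs one random token per message; the blocker's side has to learn a new pattern for every wave.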

Anti-spammers can't win with this strategy, no matter how good they become at it. If they want to win the race, they must start playing offense, and by that I don't mean legal action, because spammers are not regulated by any country.

Friday Dec 14, 2007

A way to stop Email Cc abuse?

Do you frequently add others to the Cc: distribution?

Carbon copying (Cc for short) in Email is undoubtedly one of the most useful features. Alice sends Bob and Charlie an Email, then Bob invites Dave to the discussion by adding him in the Cc: field.

[Growing Inboxes]

As it happens, however, Cc is also one of the most abused features. WSJ has an article titled Email's Friendly Fire which says:

Email overload is now considered a much bigger workplace problem than traditional email spam. Inboxes are bulging today partly because of what some are calling "colleague spam" -- that is, too many people are indiscriminately hitting the "reply to all" button or copying too many people on trivial messages, like inviting 100 colleagues to partake of brownies in the kitchen.

If you're Bob, the person who adds others, Cc is great. But if you're Dave, the person who is being added, sometimes you may wonder why you're on the distribution at all and silently curse Bob for contributing to the "colleague spam" you receive in your INBOX.

One difference between Facebookmail and Email I've noticed is that in Facebook, once the sender defines the distribution scope, it becomes fixed and cannot be expanded or shrunk. In other words, Alice, Bob or Charlie may not invite Dave into the discussion, nor could Bob respond in private to Alice or Charlie without starting a new thread. The upside is that whatever is said between Alice, Bob and Charlie remains private to them, but the downside is that others cannot chime in or add value.

In contrast, Facebook's event scheduling system allows participants to invite more friends as long as the event is open. A bigger party is always a better party, I suppose. :)

Can we think of a way to prevent Cc abuse yet maintain the flexibility of it at the same time?

Wednesday Nov 14, 2007

Yahoo and Google to turn Email into a social network

One of the clever things that Facebook does is give users an option to initialize their social graphs from their address book data on Yahoo! Mail, Hotmail and Gmail to see which of their friends are already on Facebook. I didn't bite, for fear of giving my credentials to Facebook (even though they promise to discard them after the data is pulled), but a thought struck me: isn't a social graph basically a more fashionable way of saying address book 2.0?

Then yesterday I read this blog post on NYTimes saying that Yahoo! is working to turn existing user profiles and address books into a social network, and they're calling it INBOX 2.0. Google is allegedly doing something similar. Makes sense.

I have been thinking for years that the address book should be consulted during spam detection to minimize false positives, and the only systems that have your address book in their possession and also handle your Email are webmail providers. Extending it to create social graphs seems like the logical next step.
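Here is a minimal sketch of what consulting the address book during spam detection could look like, assuming the filter already produces a numeric spam score. The threshold, the `classify` function and the sample addresses are all hypothetical; this is not how any particular webmail provider does it.

```python
from email.utils import parseaddr

SPAM_THRESHOLD = 5.0  # hypothetical cut-off for the filter's score


def normalize(from_header: str) -> str:
    """Extract the bare address from a From: header value and lowercase it."""
    return parseaddr(from_header)[1].lower()


def classify(from_header: str, spam_score: float, address_book: set) -> str:
    """Return 'inbox' or 'spam'; known correspondents bypass the score check."""
    if normalize(from_header) in address_book:
        return "inbox"
    return "spam" if spam_score >= SPAM_THRESHOLD else "inbox"


if __name__ == "__main__":
    book = {"betty@example.com"}
    # A high-scoring message from a known contact is still delivered.
    print(classify("Aunt Betty <Betty@Example.com>", 9.0, book))
    print(classify("Stranger <x@spam.example>", 9.0, book))
```

The obvious caveat is that a From: header is easy to forge, so a check like this only reduces false positives safely when the sender can also be authenticated.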

Tuesday Nov 06, 2007

Why Twitter won't delete Email

Because it isn't designed to be Email 2.0.

In a recent debate titled "E-mail Faces Deletion" hosted by BusinessWeek, Robert Scoble suggests that Twitter could overtake Email as the leading business communications tool. I read it a few weeks ago, but I wasn't on Twitter then, so I didn't feel qualified to comment. Since then, I've become more familiar with Twitter and found a few of his arguments flawed.

  1. Knowledge retention. While policy varies from country to country, publicly traded companies and even SMBs that don't host their own Email nowadays typically keep Email on the server side and have retention policies (for compliance reasons) which determine how the data of former employees is retained and transferred to their replacements.
  2. Spam problem. Twitter doesn't suffer from it because users decide who they wish to follow or unfollow. This method is similar to whitelisting and blacklisting, and it only works in Twitter because it is a walled communication platform and you don't give out your Twitter username as you would give out your Email address (on the last page of your presentation, when you fill out online forms, to merchants and service providers, etc).
  3. What happens in Twitter, stays in Twitter. You can depend on Twitter for as long as it is around. Possibly the best way to explain Twitter to non-technical people is that it is a news broadcasting system in which any member can be a broadcaster. This is very appealing to consumers but not so to corporations. For various reasons, good or bad, internal business communication most often flows in a controlled and structured manner rather than in a broadcasting model.
  4. Twitter lets you filter what others are saying. For example, when Google announced the Open Handset Alliance and its Android platform yesterday, you could do "track android" in Twitter and it would automatically direct every Twitter message (called a "tweet") containing that keyword to you. The upside is that you get to tap into a global community and track actions and thoughts on that topic in near real-time, but the downside is that the signal-to-noise ratio can be very low because everyone can be a broadcaster.

Furthermore, Twitter has a few design choices that make it unsuitable for business use:

  1. Messages are limited to 140 characters.
  2. No support for attachments.
  3. Can't define scope of distribution.
  4. No verification of status. Companies (especially large ones) may wish to cut their ties with terminated employees, and it's not clear how Twitter can handle that.

That being said, is Email perfect as a business communication tool? Absolutely not. It's been around for 25 years and I'm confident it'll stick around for another 25 years, but if its weaknesses are not addressed and improvements are not made in time then I doubt it'll maintain its usefulness. Although it's not fair to compare Email with Twitter, there are a few things Email can learn from Twitter:

  1. Needs stronger sender identification. When an Email claims it was sent by Aunt Betty, it must truly have come from her and no one else. Twitter's solution is to require account registration and a username/password. Systems such as SPF and DomainKeys go only so far as to ensure domain-level authenticity, but we need something that goes further, down to the sender level.
  2. Needs an API. Twitter offers an API so that users and other developers can discover new ways to use Twitter. Email doesn't have an API; it has RFCs written by lots of people over many years to ensure interoperability, and its fundamentals are largely unchanged even though the rest of the world has progressed. I say it's time for an update, a rethink of modern and future requirements, similar to what ZFS did for filesystems. Excuse the overuse but we need an "Email 2.0".
  3. Needs Permalink. A permalink is basically a fixed index to a web resource to which others link or respond. The vast majority of Email is either an inquiry for response or a response to another inquiry. If every Email message you write has a permalink, then it's a lot easier to track or search when others respond or add value to it.
  4. Follow & track. Once all of the above are in place, these become trivial. In fact, all kinds of new possibilities open up.
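Email already carries part of what a permalink needs: each message can have a globally unique Message-ID header, and replies point back to it through In-Reply-To and References. The sketch below uses Python's standard `email` package to show those identifiers at work. The addresses and domain are placeholders, and `new_message`, `make_reply` and `replies_to` are hypothetical helpers; this is not a permalink scheme, only the raw material one could build on.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def new_message(sender, recipient, subject, body, domain="example.com"):
    """Create a message with a freshly generated, unique Message-ID."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Message-ID"] = make_msgid(domain=domain)
    msg.set_content(body)
    return msg


def make_reply(original, sender, body, domain="example.com"):
    """Reply to `original`, linking back to it by its Message-ID."""
    reply = new_message(
        sender, str(original["From"]), "Re: " + str(original["Subject"]), body, domain
    )
    reply["In-Reply-To"] = str(original["Message-ID"])
    reply["References"] = str(original["Message-ID"])
    return reply


def replies_to(message_id, mailbox):
    """'Track' a message: find everything in the mailbox that answers it."""
    return [m for m in mailbox if m["In-Reply-To"] == message_id]


if __name__ == "__main__":
    first = new_message("alice@example.com", "bob@example.com", "Lunch", "Noon?")
    answer = make_reply(first, "bob@example.com", "Sure.")
    mailbox = [first, answer]
    for m in replies_to(first["Message-ID"], mailbox):
        print(m["Subject"])
```

What is missing is the "link" part: a Message-ID identifies a message but gives no one a place to look it up, which is the gap a permalink would fill.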

Do you think an old dog can learn new tricks? It's only limited by our imagination and drive. Consider how Google uses Email for project management (it's a rather long story, just search for "project management" when the page loads).

Thursday Nov 01, 2007

How Gmail's spam filter works

Gmail does a marvelous job at keeping spam out of its users' INBOX—probably the best among free spam filters (and better than some paid ones, too). But they have not said anything publicly about how they're doing it or how effective it is until now.

These are the four elements that make Gmail's spam filter tick:

  • Gmail users' participation
  • Google's compute grid
  • Other Google technologies
  • Sender verification

You could read the full article, or watch Brad Taylor, Gmail's top spam fighting engineer, talk about their techniques in this educational video:


I currently live in the San Francisco Bay Area. For the past seven years, I have been designing and building messaging solutions for Sun.

