Wednesday Nov 26, 2008

Barbara Bush Recovering From Full-Body Transplant

Fear not, concerned citizens. Former first lady Barbara Bush is recovering from her recent full-body transplant surgery. Or at least that's what you might think if you saw this blurb on Google News:

It looks like Google's news bot picked up an erroneous picture from Current World News' coverage of the story. In case that story/picture is corrected by the time you read this, here is a capture of its current content:

Kind of funny when a mistake like this can reverberate onto multiple sites. But it's also a little scary. Imagine if there were some similar incident involving a publicly-traded company. It could cause a billion-dollar stock swing. Oh wait, that already happened.

Tuesday Oct 07, 2008

Note to Google: I Drink Too Much

Big Brother Is Watching

Dear Google,

Please help me. I'm prone to bouts of drunken foolishness. They usually end with me sending a string of ill-advised and highly embarrassing emails.

I feel good about sharing this with you, since you're already handling information about my health conditions, sexual preferences, and financial concerns.

Sincerely,
John Doe

1975 Elm Avenue
New York, NY 10041
Ph: 212-555-1278
SSN: 123-45-6789

Sound like a good idea? Then you're going to love Gmail's new feature.

Tuesday Apr 24, 2007

When Does the Real Privacy Backlash Arrive?

Big Brother is coming. And we're welcoming him. He's hiding in our email, our web searches, the banner ads that annoy us, and our kids' MySpace pages (that frighten us). But most of all, he's hiding in plain sight. You see, Big Brother isn't coming from secret government agencies shrouded behind dark tinted windows. He's coming from colorful buildings filled with bright young programmers who have whimsical company logos on their business cards.

I've written about this before. And now Google's agreement to purchase DoubleClick has gotten more people thinking about the company's privacy impact. Why? Because Google is gaining an even larger window into everyone's online activities. Rich Tehrani estimated that if the acquisition is completed, Google could end up with "access to the behavioral information of over 90% of web users".

Tehrani also provides examples of just how this data can be used, such as quoting a Yahoo executive who brags that his company can now "predict with 75% certainty which of the 300,000 monthly visitors to Yahoo! Autos will purchase a new car within the next three months."

So a handful of web giants are amassing thorough records of our online activities and learning how to turn that data into a full picture of our behavior (and likely future behavior). Scary stuff. Still, it doesn't feel like the general public really cares. Yet.

We haven't yet seen real public outcry and backlash against these privacy threats. Part of that is because the companies involved have good reputations (and deservedly so, in most cases). Part is because most of us assume that only "bad people" with something to hide have reason to worry about privacy. But these are just delaying the backlash, not preventing it.

At some point, a catalyst will grab the attention of the general public. It could be a security breach at one of the web giants, exposing so much information about so many people that we can't ignore it. Or it could be the story of how lost privacy has ruined one individual's life, told in such a way that we can't forget it.

I don't know what that event will be or when it will happen. But I do know it's coming. The giants of the Internet are on a collision course with the privacy of the little guy. And when it happens, it won't just be the privacy watchdogs that are complaining.

Tuesday Mar 20, 2007

Does Google Track Search Result Clicks?

A lot of bloggers are talking about Google's patent application for a method of ranking blog search results. As Bill Slawski and Alex Chitu have noted, the method breaks down into a set of factors which provide positive and negative scoring influences. I won't repeat them all here, but I did find one of the positive factors particularly interesting: the implied popularity of a blog, as determined from click stream analysis in search results.

In other words, if users consistently click on a result from Blog A more often than one from Blog B when both show up in the results for a given search (such as on blogsearch.google.com), it can be seen as an indication that Blog A is more popular and/or of higher quality than Blog B. Pretty obvious stuff. Right?

Sure. And it's also pretty obvious that the same idea can be applied to non-blog resources (such as general web results returned by www.google.com or image results from images.google.com).
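As a rough illustration of the click stream idea, here is a minimal Python sketch that compares click-through rates for two results shown together. The log format, the `blog-a.example` and `blog-b.example` names, and the `click_through_rates` helper are all hypothetical; nothing here is taken from the patent application.

```python
from collections import Counter


def click_through_rates(impressions, clicks):
    """Return clicks / impressions for every result shown at least once."""
    return {
        url: clicks.get(url, 0) / shown
        for url, shown in impressions.items()
        if shown > 0
    }


# Hypothetical log: (results shown together, result clicked or None).
search_log = [
    (("blog-a.example", "blog-b.example"), "blog-a.example"),
    (("blog-a.example", "blog-b.example"), "blog-a.example"),
    (("blog-a.example", "blog-b.example"), "blog-b.example"),
    (("blog-a.example", "blog-b.example"), None),
]

impressions = Counter()
clicks = Counter()
for shown, clicked in search_log:
    impressions.update(shown)      # every listed result counts as shown once
    if clicked is not None:
        clicks[clicked] += 1       # at most one click per search in this toy log

rates = click_through_rates(impressions, clicks)
for url, rate in sorted(rates.items()):
    print(url, rate)
```

A real system would presumably also have to correct for position bias (the top result gets clicked more regardless of quality), which this sketch ignores.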

The question is... How would Google actually obtain this data?

Normally, the page which presents a hyperlink isn't notified when it's clicked. There are ways around this (such as using special JavaScript or pointing the hyperlink to an intermediate "redirector" service), but I don't see any evidence in Google's pages that they're employing these mechanisms in their regular search results (though paid ads are a different matter).

So when you click on a Google search result, Google should never know it.
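For readers unfamiliar with the "redirector" technique mentioned above, here is a minimal Python sketch of how such a service could work in principle. The `search.example` endpoint, the `q` and `dest` parameter names, and both helper functions are assumptions made up for illustration; they do not describe any real search engine's URLs.

```python
from urllib.parse import parse_qs, urlencode, urlparse

REDIRECTOR = "https://search.example/url"  # hypothetical endpoint

click_log = []  # stands in for whatever storage a real service would use


def tracked_link(destination, query):
    """Build a link that points at the redirector instead of the destination."""
    return f"{REDIRECTOR}?{urlencode({'q': query, 'dest': destination})}"


def handle_redirect(request_url):
    """Record the click, then return the URL the visitor should be sent to."""
    params = parse_qs(urlparse(request_url).query)
    destination = params["dest"][0]
    click_log.append((params["q"][0], destination))
    return destination  # a real service would answer with an HTTP redirect


link = tracked_link("http://blog-a.example/post", "privacy")
target = handle_redirect(link)
print(link)
print(target, click_log)
```

The tell-tale sign of this approach is visible in the page source: the hyperlink's `href` points at the search site itself rather than at the destination.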

But wait... There is a good chance that they do know it. If you use Google's toolbar and enable the "PageRank Display" feature, they'll know about this click (and all of your others, for that matter). Or if the final destination happens to use certain of Google's server-side services (such as AdSense or Google Analytics), they'll likewise know about it (and all other access to that site).

So does this imperfect but growing view of users' behavior on non-Google sites provide enough data to plug into their search ranking algorithms? Probably. And it's one more example of how a web giant such as Google is gaining a "moat" of data which guards against smaller competitors.

Thursday Mar 15, 2007

Privacy and the Private Sector

Big Brother Is Watching

How would you feel if you saw this headline on a search form? I bet the "I'm Feeling Lucky" button would take on a whole new light, for one thing.

In many ways, it's already happening. Major search engines keep records of every one of your searches. Tracing these records back to you depends on many factors: whether you've received a tracking cookie by logging into other services from that company, whether your ISP has assigned you a static IP address, whether you use a large or small ISP, and more. But the core point is this: by retaining search logs, these companies place your privacy at risk.

Google recently announced that they will be anonymizing search logs after 18-24 months. It's better than their old approach (retaining all information indefinitely). But is it good enough? Your searches in the last 18-24 months probably add up to a pretty interesting picture. It can be scary to think how accurate that picture might be. Even scarier is thinking about where its accuracy would be an illusion.

Take the case of Thelma Arnold, for example. She is the 62-year-old widow who was identified from "anonymized" search records which AOL deliberately exposed in 2006. She's not a terrorist, a drug dealer, or a sex addict. So she shouldn't have anything to hide. Right?

Maybe.

As the NY Times article reports, "Her search history includes 'hand tremors,' 'nicotine effects on the body,' 'dry mouth' and 'bipolar.'" Yikes. Hope Thelma isn't looking for health insurance... Or life insurance... Or a job with a company wanting to minimize the cost of insuring employees... Or anything else where this picture of her health could be held against her.

The worst part? It isn't a picture of her health at all. It's her friends' health. As the Times article continues: "Ms. Arnold said she routinely researched medical conditions for her friends to assuage their anxieties. Explaining her queries about nicotine, for example, she said: 'I have a friend who needs to quit smoking and I want to help her do it.'"

But aren't Ms. Arnold and the foolish release of AOL's search records a special situation? No company would follow in those footsteps after seeing the grilling AOL took. Right? Maybe. But why do they leave the possibility open by retaining these logs? Could one disgruntled employee expose the logs to harm the company? Could a failing company sell off the logs as a final way to salvage assets? Could one company become so large and involved in so many different fields that the Big Brother scenarios we fear could occur entirely within its own corporate boundaries?

Or could widespread tracking and sharing of online activity data just become a standard part of business? Look no further than our all-important credit reports to see how the monitoring of our personal information can become deeply ingrained into the private sector. Is it really so far-fetched to imagine a similar system built on information culled from our online activities?

George Orwell was brilliant in highlighting the importance of privacy to everyone (not just "bad guys" with something to hide). He was brilliant in foreseeing the clash between technology and privacy. Did his one error come in choosing a villain? Maybe the government isn't the primary threat.

Maybe Big Brother will be born out of Big Business.

Wednesday Mar 07, 2007

Google Maps Super-Zoom

Yes, that really is a screenshot of a Google Maps view showing a couple of guys, their camels, and a yak. Philipp Lenssen of Google Blogoscoped has details showing how it's possible to zoom-in beyond the normal limit for the satellite views of certain areas in Google Maps. (If you're really impatient or skeptical, here is a direct link to the Google Maps view.)

Unfortunately, it doesn't look like they have similar high-res imaging for the Java logo on Sun's SCA14 Building. If anyone in the area happens to be a pilot (or has some other way of obtaining such an aerial image), let me know. I could probably weave it into our own map.

Tuesday Mar 06, 2007

Is NoFollow Misnamed or Not?

Conventional wisdom is that the rel="nofollow" mechanism is misnamed. As the current version of the NoFollow Wikipedia article says:

rel="nofollow" actually tells a search engine "Don't score this link" rather than "Don't follow this link." This differs from the meaning of nofollow as used within a robots meta tag, which does tell a search engine: "Do not follow any of the hyperlinks in the body of this document."

But... Recently Matt Cutts (a Google specialist in SEO issues) has contradicted that. Specifically, a forum participant asked:

...does nofollow really prevent Google from crawling a page?

And Matt responded:

...if a page would have been found anyway via other links, it doesn't prevent crawling of that page. But I believe that if the only link to a page is a nofollow link, Google won't follow that link to the destination page.

So he's saying that rel="nofollow" really does mean "don't follow" (at least to Google), and that the conventional wisdom (and Wikipedia article) are wrong?

Is that right? It'd be nice to have a definitive answer, given the "I believe" opening in Matt's statement.

Friday Mar 02, 2007

NoFollow Considered Harmful?

I've noticed a fair number of people recently calling the rel="nofollow" mechanism a failure and calling for its end. Loren Baker is one such voice, with a post called "13 Reasons Why NoFollow Tags Suck". Andy Beal is another, with a post entitled "Google’s Lasnik Wishes 'NoFollow Didn’t Exist'".

I'm on the opposite side of this argument. As I mentioned a while back, I think that web pages need even more control over the "voting intent" of hyperlinks. So instead of sending NoFollow to its grave, I'd like to see it extended (though probably with a new name and format, such as the Vote Links microformat).

I don't want to re-hash that discussion today. Instead, I want to examine the most prominent argument from the anti-NoFollow crowd: that it just doesn't work. Comment spam has increased in blogs since the time when NoFollow was introduced. Because of that, these people argue that NoFollow is an outright failure and isn't needed in the first place because any good blogger is vigilant in moderating comments.

Again, I disagree. Of course comment spam has increased. Blogging and spamming both have little barrier to entry and high growth. It was inevitable that comment spam would increase, even if the benefit to the spammer for each instance was reduced (which NoFollow ensures, by eliminating any PageRank bonus). But that growth alone doesn't mean that NoFollow is a failure. If a disease grows, do we assume that all related medical treatments and research are failures and should be stopped?

Comment spam would be even worse if the NoFollow mechanism didn't exist. Its practitioners would be multiplied because every shady marketing guide around would be touting "amazing benefits" of using blog comments to increase one's standing in Google.

Even if I'm wrong and NoFollow has done nothing to reduce comment spam, at least it has protected the quality of search results. Google isn't the only one with a vested interest in maintaining quality search results. We would all suffer if we had to go back to the "bad old days" of low-quality web search.

What about the idea that any good blog will have vigilantly moderated comments and make NoFollow irrelevant? Good moderation of blog comments is very important. But the argument that it can displace NoFollow assumes that blatant spam is the only threat. As I mentioned in my "Hyperlinks as Votes" entry, a PageRank-style system in part depends upon us each voting in our own "name" (URL). Without NoFollow, that system breaks down with hyperlinks coming from your URL which aren't spam but also aren't something you would intend to positively endorse.

Suppose I post a comment on your blog with a link back to an entry of my own which is completely relevant but disagrees with you at every turn. It isn't spam. And unless you're particularly thin-skinned, you probably shouldn't exercise your moderation power to delete it. But should search engines interpret that link to be your positive vote for the quality or importance of my page? And even if you think it should, would you want that vote to be of the same strength as one given to something which you directly referenced in the body of your post?

It isn't time for NoFollow to go away. It's time for it to grow up into something more powerful and expressive.

Monday Feb 05, 2007

Lemonade 2.0: Could Blogging Be Your Kid's First Business?

Today, The Christian Science Monitor has a story about using contextual advertising systems (such as Google AdSense) to make money from blogging. It notes that moderately successful bloggers usually make at most a few hundred dollars a year from advertising, while only a very few uber-bloggers make enough to actually live off of blogging (and in their cases, indirect revenue from consulting and public speaking work is usually far more lucrative). Interesting, but not very surprising if you've read other writings on the subject.

More intriguing to me were a couple of side comments on the article's second page. One expert notes that his son now makes more from his blog's AdSense revenue than from his allowance. That's interesting. Blogging has practically zero barrier to entry and provides the realistic opportunity for revenues which most kids would find very meaningful. Hmm... Could starting a blog replace lemonade stands as the quintessential step in childhood entrepreneurialism?

Also catching my eye was a complaint that AdSense doesn't allow venue owners enough control over ad content. I've often thought this myself. Our policies at Sun prohibit AdSense ads on company blogs for this very reason. No business wants to open the door for competitors to advertise on its own site. Of course, many corporate blogging sites probably wouldn't allow advertising anyway. But some would. And the corporate blogging example is just one of many cases where advertising is being omitted due to a lack of control for the venue owner. Might this be a key vulnerability in Google's AdSense behemoth?

I think it could be. So if your entrepreneurial child is ready to graduate past professional blogging, you might just encourage them to create an AdSense competitor with better content controls. Success in that endeavor would certainly mean more than a few hundred dollars.

Friday Jan 26, 2007

A quick word about "A quick word about Googlebombs"

Google has just announced that they have tweaked their search algorithm in a way which "has begun minimizing the impact of many Googlebombs." I'm not sure whether I think that's a good thing or not. On one hand, susceptibility to any artificial manipulation of search results is probably bad. On the other hand, a little light-heartedness is one way that Google has always stood out as a company.

I have no such mixed feelings in looking at how Google announced this change, however. I think it's pathetic. Their blog entry essentially just says that the change is algorithmic and "very limited in scope and impact." Good intro, but how about some details?

Google Bombs worked in the first place because Google's search algorithm assumes that what people say when they link to a page can be used to better understand that page. That idea is an important piece in the search puzzle, and I'd like to understand how their new algorithm changes impact it. Presumably, being "very limited in scope and impact" means that they somehow detect and ignore "bad" context in some links (which match some Google Bomb profile) while still paying attention to "good" context in other links? Again, that sounds good (if my presumption is correct), but why not be more forthcoming with exactly what's being done? We all deserve to know if and how wording around hyperlinks impacts the target URL's status in search results.

I realize that Google is in a very competitive space. Keeping a lead over the likes of Microsoft and Yahoo (if you believe they're leading) requires that Google keep some technical secrets to itself. But the key word is some. There is value in allowing everyone to understand the basics of how a key service such as Google search works. Their core PageRank technology fundamentally depends on us all "voting" with our hyperlinks. And as I've mentioned before, I think that there is an obligation to allow its "electorate" to learn how to best use those votes. That can certainly be accomplished without giving away every detail of their technology. But I think it requires more detail than just telling us that something is algorithmic and low-impact.

Thursday Jan 18, 2007

Hyperlinks as Votes: Time for a PageRank Tune-up?

Treat the hyperlinks in web pages as "votes" for other web pages. Then use a feedback loop so that pages which receive more votes from others have their own votes become more powerful. That's how the PageRank algorithm pushes the best pages to the top of Google search results. Nearly nine years after Larry Page and Sergey Brin published the initial description of PageRank, Google says it still serves as the core of its technology.
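The "votes plus feedback loop" description can be sketched in a few lines of Python. This is a simplified textbook-style power iteration, not Google's production algorithm; the four-page `web` graph is made up, and the 0.85 damping factor is the commonly cited default rather than anything stated in this post.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its votes,
                # so votes from highly ranked pages carry more weight.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank everywhere.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b, and d all link to c; c links back to a.
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Note how page a outranks b and d even though each receives at most one link: a's single vote comes from c, the most heavily endorsed page in the graph.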

So if hyperlinks are votes, how do we make sure the electorate uses their power wisely?

For one, we need to ensure that people only vote in their own name. Not so long ago, that ideal was effectively violated by blog spam. Automated programs would comb the web looking for any blog where they could post hyperlinks to the likes of Viagra sales. Successfully adding such a hyperlink on a well-known blog would result in a strong PageRank "vote" for the spammer's page. So in effect, the spammer was voting in the blog owner's name (and hijacking his PageRank strength).

This issue was largely fixed in 2005, when Google announced that it would start interpreting a rel="nofollow" hyperlink attribute as a request for exclusion from PageRank calculations. Blog spam can still be a problem, but since most blogging software now adds the rel="nofollow" attribute to hyperlinks in comments, it won't benefit spammers' PageRank standings.
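To make the mechanism concrete, here is a small Python sketch of how a crawler might separate "voting" links from rel="nofollow" links when reading a page. The `VotingLinkParser` class and the sample markup are hypothetical, and how any real search engine treats these links internally is not something this sketch claims to show.

```python
from html.parser import HTMLParser


class VotingLinkParser(HTMLParser):
    """Collect hyperlinks, setting aside those marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.voting_links = []
        self.ignored_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.ignored_links.append(href)
        else:
            self.voting_links.append(href)


page = """
<p>Post body links to <a href="http://good.example/">a source</a>.</p>
<p>Comment: <a rel="nofollow" href="http://spam.example/">cheap pills</a></p>
"""

parser = VotingLinkParser()
parser.feed(page)
print("votes:", parser.voting_links)
print("ignored:", parser.ignored_links)
```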

But is just being able to mark a hyperlink as a "non vote" enough? Wouldn't it be nice to have even more control, such as specifying which hyperlinks are positive votes for the referenced page and which are negative votes? That's what some of the Technorati folks are aiming to allow with the Vote Links microformat. It proposes rev="vote-for", rev="vote-abstain", and rev="vote-against" attributes to allow page authors to express their voting intents for each hyperlink.

Still, is even that enough? I wonder why there is no effort to allow authors to control the relative strength of their votes. The Vote Links FAQ has an entry covering this, saying:

Q: Why only for and against? How about something more nuanced?

A: The point of this is to provide a strong yes/no response. Finer-grained measures of agreement don't make much sense on an individual basis; aggregating many votes is more interesting. For example, consider how eBay's user rating system has been reduced to a like/dislike switch by users. The 'Ayes, Noes, abstentions' model has served well in politics and committees, when a division is called for.

I'm not satisfied with this answer. The "interesting" aggregation of simple votes which they mention will sometimes be housed within a single page. For example, thousands of people may give a particular URL a positive response at Digg, but it still just shows up as one hyperlink. The same could be said for other sites with significant user input (such as YouTube, Slashdot, or their own example: eBay).

Obviously, no page should be able to artificially inflate the importance of its own hyperlink votes (e.g. rel="I_represent_1_million_votes--honest"). But why not allow pages to determine the portion of their fixed PageRank contribution which is passed along to each of their hyperlinks? So a Digg page, for example, might choose to give 10% of its PageRank voting value to an item getting 2000 Diggs and only 2% to another item which got just 200 Diggs. Search engines could then benefit from the internal ranking systems of sites (such as Digg) without having to understand their internal details. And we could all benefit from a more finely-tuned hyperlink democracy.
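One possible way to express this idea is sketched below in Python. The `distribute_vote` helper, the story names, and the choice of simple proportional weighting are my own assumptions for illustration; the key property is that the page's total contribution stays fixed no matter what weights it declares.

```python
def distribute_vote(page_rank_value, weights):
    """Split a page's fixed voting value among its links in proportion to weights.

    The shares always sum to page_rank_value, so a page can shift emphasis
    between its links but cannot inflate its total influence.
    """
    total = sum(weights.values())
    if total <= 0:
        return {url: 0.0 for url in weights}
    return {url: page_rank_value * w / total for url, w in weights.items()}


# Hypothetical listing page with three stories and their Digg counts.
diggs = {"story-1": 2000, "story-2": 200, "story-3": 800}
votes = distribute_vote(1.0, diggs)
for url, share in votes.items():
    print(url, f"{share:.1%}")
```

Under this scheme an item with ten times the Diggs receives ten times the share, while the sum of all shares is unchanged.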

Friday Jan 05, 2007

See Java from Space, Part 2

If you'll recall, I left off yesterday wondering how I might get up to visit the giant Java Logo on the roof of one of Sun's buildings. Rama has been kind enough to offer some assistance. Excellent.

Now I just need to find jumbo versions of a few key supplies and this:

Satellite Image of a Building's Roof

...will become this:

Updated Satellite Image of a Building's Roof

Thursday Jan 04, 2007

See Java from Space

Question: see anything strange in this picture?

Satellite Image of a Building's Roof

Time's up. It's a satellite picture of Building 14 in Sun's Santa Clara campus, which just happens to be decorated with a huge Java logo. Don't believe me? See for yourself with this view of our Solaris Registrations Map.

Next question: what's it doing there? Has Sun been getting marketing tips from The Colonel?

Nope. I got the real story from Steve Wilson: "It's been up there since Java Software moved out of [Cupertino] to [Santa Clara]. There was a big all hands with a gag that had Rich doing a video feed from the roof of SCA 14. It's still up there."

Last question: where can I rent a big ladder next time I'm in the Bay Area?

Tuesday Jan 02, 2007

Drilling Down in the Solaris Registrations Map

Recently Jim Grisanzio referenced our Solaris Registrations Map to illustrate the volume of OpenSolaris activity in Japan. Since I'd love to see more people using our map in this way, I thought I'd talk about it a bit.

For this kind of use, one handy feature is the ability to reference a particular map view directly. This can be done by copying the "Link To: ... This View" URL when you have the map zoomed and positioned as desired. For example, Jim could have referenced this URL when talking about Solaris-related activity in Japan. This allows everyone to look at approximately the same map view and statistics. (Note: the URL controls the map type, zoom level, and center coordinates; it obviously cannot control clients' screen resolution, which is what determines how much of the map around the center is shown and leads me to use the "approximately" qualifier.)

On my screen, this view currently shows:

Registrations In Visible Area
Solaris 10 / sparc: 1715
Solaris 10 / x86: 6387
OpenSolaris / sparc: 17
OpenSolaris / x86: 290
Total: 8409

How do we interpret these numbers? Well, one thing they do not mean is that there are just 8409 Solaris users in Japan. As the FAQ notes, this map only shows data for "Solaris 10 and Open Solaris instances that activated Sun Connection to receive automatic software updates." As with any product, only a subset of total users will go through a registration/activation process.

I'm not aware of a good way to estimate what percentage of total users will have registered. So I can't infer the total number of Solaris and OpenSolaris users in Japan from this map view. On the other hand, it seems likely that whatever percentage of users choose to register/activate in one region would roughly equal the percentage of users who do so in other regions. If that's true, we should be able to use this map to compare the relative size of Solaris and OpenSolaris users in different geographic regions. For example, a fully zoomed-out view of the map currently shows a total of 83268 activated instances. So comparing our Japan total (8409 instances) to this number, we could estimate that around 10% of worldwide Solaris and OpenSolaris users are in Japan. That's interesting (and makes me think that perhaps we should update the map to show such percentages automatically).
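The arithmetic behind that estimate is easy to check with a short Python snippet. The figures are the ones quoted above from the map at the time of writing; the variable names are just for illustration.

```python
# Registration counts from the Japan map view quoted above.
japan = {
    "Solaris 10 / sparc": 1715,
    "Solaris 10 / x86": 6387,
    "OpenSolaris / sparc": 17,
    "OpenSolaris / x86": 290,
}
worldwide_total = 83268  # fully zoomed-out view

japan_total = sum(japan.values())
share = japan_total / worldwide_total
print(f"{japan_total} of {worldwide_total} = {share:.1%}")
```

Keep in mind that this ratio is only meaningful under the assumption stated above: that the fraction of users who register is roughly the same in every region.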

Hopefully this gives you some ideas on how to dig for interesting views and stats in the map. As we've seen, the statistics it shows are not good indicators of the number of users or installations in absolute terms, but may be useful in estimating the relative populations for different geographic areas. We plan to investigate adding new data sets (such as Jim suggests) which may provide more absolute population info. Stay tuned. And let me know if you have ideas for a data set you think should be included.

About

woodjr
