Tuesday Mar 20, 2007

Does Google Track Search Result Clicks?

A lot of bloggers are talking about Google's patent application for a method of ranking blog search results. As Bill Slawski and Alex Chitu have noted, the method breaks down into a set of factors which provide positive and negative scoring influences. I won't repeat them all here, but I did find one of the positive factors particularly interesting: the implied popularity of a blog, as determined from clickstream analysis of search results.

In other words, if users consistently click on a result from Blog A more often than one from Blog B when both show up in the results for a given search (such as on blogsearch.google.com), it can be seen as an indication that Blog A is more popular and/or of higher quality than Blog B. Pretty obvious stuff. Right?

Sure. And it's also pretty obvious that the same idea can be applied to non-blog resources (such as general web results returned by www.google.com or image results from images.google.com).
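Assuming Google somehow had that click data in hand, a minimal sketch of the comparison might look like this (my own illustration, not anything described in the patent filing; the domains are purely hypothetical):

```python
from collections import defaultdict

# Hypothetical click log: (query, clicked_result) pairs observed on result pages.
click_log = [
    ("jazz piano lessons", "blog-a.example.com"),
    ("jazz piano lessons", "blog-a.example.com"),
    ("jazz piano lessons", "blog-b.example.com"),
]

# Tally clicks per (query, result) pair.
clicks = defaultdict(int)
for query, result in click_log:
    clicks[(query, result)] += 1

# When two results compete for the same query, the one clicked more often can be
# read as more popular and/or of higher quality.
query = "jazz piano lessons"
a = clicks[(query, "blog-a.example.com")]
b = clicks[(query, "blog-b.example.com")]
print("Blog A looks more popular" if a > b else "Blog B looks more popular")
```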

The question is... How would Google actually obtain this data?

Normally, the page which presents a hyperlink isn't notified when it's clicked. There are ways around this (such as using special javascript or pointing the hyperlink to an intermediate "redirector" service), but I don't see any evidence in Google's pages that they're employing these mechanisms in their regular search results (though paid ads are a different matter).
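For what it's worth, here is a bare-bones sketch of the "redirector" approach using only Python's standard library; the /click endpoint, port, and log line are my own inventions for illustration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class ClickRedirector(BaseHTTPRequestHandler):
    """Records which result was clicked, then bounces the browser to the real URL."""

    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        destination = params.get("url", ["/"])[0]
        print(f"click recorded: {destination}")   # in practice: append to a click log
        self.send_response(302)                   # temporary redirect
        self.send_header("Location", destination)
        self.end_headers()

if __name__ == "__main__":
    # Search results would point at http://localhost:8000/click?url=<real destination>
    HTTPServer(("localhost", 8000), ClickRedirector).serve_forever()
```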

So when you click on a Google search result, Google should never know it.

But wait... There is a good chance that they do know it. If you use Google's toolbar and enable the "PageRank Display" feature, they'll know about this click (and all of your others, for that matter). Or if the final destination happens to use certain of Google's server-side services (such as AdSense or Google Analytics), they'll likewise know about it (and all other access to that site).

So does this imperfect but growing view of users' behavior on non-Google sites provide enough data to plug into their search ranking algorithms? Probably. And it's one more example of how a web giant such as Google is gaining a "moat" of data which guards against smaller competitors.

Friday Mar 16, 2007

Product Quality Heatmaps

Read/WriteWeb has an interesting look at heatmap visualizations. In particular, they focus on Summize, a site specializing in product reviews.

Summize allows users to vote on the quality of products. What's new and interesting is how they present those voting results--with heatmaps. Here is an example:

The colored stripe is a heatmap showing what percentage of users think the iPod Nano is great (the green: 44%), what percentage think it's wretched (the red: 12%), and those in between (the orange, yellow, and yellowish green). One nice thing about this visualization is that it works well even when the heatmap image is small. So, for example, they use a scaled-down version of the stripe next to each item in search results.
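As a rough sketch of the underlying idea (the bucket names, colors, and numbers below are my own stand-ins, not Summize's actual scheme), the stripe is just vote percentages mapped to proportional colored segments:

```python
# Hypothetical vote buckets for a product, from "great" down to "wretched".
votes = {"great": 44, "good": 20, "okay": 14, "poor": 10, "wretched": 12}
colors = {"great": "green", "good": "yellowgreen", "okay": "yellow",
          "poor": "orange", "wretched": "red"}

total = sum(votes.values())
stripe_width = 200  # pixels; the same proportions survive scaling down for search results

# Each bucket becomes a colored segment whose width matches its share of the votes.
segments = [(colors[bucket], round(stripe_width * count / total))
            for bucket, count in votes.items()]
print(segments)  # [('green', 88), ('yellowgreen', 40), ('yellow', 28), ('orange', 20), ('red', 24)]
```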

The end result is a nice way to see and understand a lot of information packed into a small space--the very definition of a good visualization.

Thursday Mar 15, 2007

Privacy and the Private Sector

Big Brother Is Watching

How would you feel if you saw this headline on a search form? I bet the "I'm Feeling Lucky" button would take on a whole new meaning, for one thing.

In many ways, it's already happening. Major search engines keep records of every one of your searches. Tracing these records back to you depends on many factors: whether you've received a tracking cookie by logging into other services from that company, whether your ISP has assigned you a static IP address, whether you use a large or small ISP, and more. But the core point is this: by retaining search logs, these companies place your privacy at risk.

Google recently announced that they will be anonymizing search logs after 18-24 months. It's better than their old approach (retaining all information indefinitely). But is it good enough? Your searches over the last 18-24 months probably add up to a pretty interesting picture. It can be scary to think how accurate that picture might be. Even scarier is thinking about where its accuracy would be an illusion.

Take the case of Thelma Arnold, for example. She is the 62-year-old widow who was identified from "anonymized" search records which AOL deliberately exposed in 2006. She's not a terrorist, a drug dealer, or a sex addict. So she shouldn't have anything to hide. Right?

Maybe.

As the NY Times article reports, "Her search history includes 'hand tremors,' 'nicotine effects on the body,' 'dry mouth' and 'bipolar.'" Yikes. Hope Thelma isn't looking for health insurance... Or life insurance... Or a job with a company wanting to minimize the cost of insuring employees... Or anything else where this picture of her health could be held against her.

The worst part? It isn't a picture of her health at all. It's her friends' health. As the Times article continues: "Ms. Arnold said she routinely researched medical conditions for her friends to assuage their anxieties. Explaining her queries about nicotine, for example, she said: 'I have a friend who needs to quit smoking and I want to help her do it.'"

But aren't Ms. Arnold and the foolish release of AOL's search records a special situation? No company would follow in those footsteps after seeing the grilling AOL took. Right? Maybe. But why do they leave the possibility open by retaining these logs? Could one disgruntled employee expose the logs to harm the company? Could a failing company sell off the logs as a final way to salvage assets? Could one company become so large and involved in so many different fields that the Big Brother scenarios we fear could occur entirely within its own corporate boundaries?

Or could widespread tracking and sharing of online activity data just become a standard part of business? Look no further than our all-important credit reports to see how the monitoring of our personal information can become deeply ingrained into the private sector. Is it really so far-fetched to imagine a similar system built on information culled from our online activities?

George Orwell was brilliant in highlighting the importance of privacy to everyone (not just "bad guys" with something to hide). He was brilliant in foreseeing the clash between technology and privacy. Did his one error come in choosing a villain? Maybe the government isn't the primary threat.

Maybe Big Brother will be born out of Big Business.

Tuesday Mar 06, 2007

Is NoFollow Misnamed or Not?

Conventional wisdom is that the rel="nofollow" mechanism is misnamed. As the current version of the NoFollow Wikipedia article says:

rel="nofollow" actually tells a search engine "Don't score this link" rather than "Don't follow this link." This differs from the meaning of nofollow as used within a robots meta tag, which does tell a search engine: "Do not follow any of the hyperlinks in the body of this document."

But... Recently Matt Cutts (a Google specialist in SEO issues) has contradicted that. Specifically, a forum participant asked:

...does nofollow really prevent Google from crawling a page?
And Matt responded:
...if a page would have been found anyway via other links, it doesn't prevent crawling of that page. But I believe that if the only link to a page is a nofollow link, Google won't follow that link to the destination page.

So he seems to be saying that rel="nofollow" really does mean "don't follow" (at least to Google), and that the conventional wisdom (and the Wikipedia article) is wrong.

Is that right? It'd be nice to have a definitive answer, given the "I believe" opening in Matt's statement.

Friday Mar 02, 2007

NoFollow Considered Harmful?

I've noticed a fair number of people recently calling the rel="nofollow" mechanism a failure and calling for its end. Loren Baker is one such voice, with a post called "13 Reasons Why NoFollow Tags Suck". Andy Beal is another, with a post entitled "Google’s Lasnik Wishes 'NoFollow Didn’t Exist'".

I'm on the opposite side of this argument. As I mentioned a while back, I think that web pages need even more control over the "voting intent" of hyperlinks. So instead of sending NoFollow to its grave, I'd like to see it extended (though probably with a new name and format, such as the Vote Links microformat).

I don't want to rehash that discussion today. Instead, I want to examine the most prominent argument from the anti-NoFollow crowd: that it just doesn't work. Comment spam in blogs has increased since NoFollow was introduced. From that, these people argue that NoFollow is an outright failure, and that it wasn't needed in the first place because any good blogger vigilantly moderates comments.

Again, I disagree. Of course comment spam has increased. Blogging and spamming both have low barriers to entry and high growth. It was inevitable that comment spam would increase, even if the benefit to the spammer for each instance was reduced (which NoFollow ensures, by eliminating any PageRank bonus). But that growth alone doesn't mean that NoFollow is a failure. If a disease spreads, do we assume that all related medical treatments and research are failures and should be stopped?

Comment spam would be even worse if the NoFollow mechanism didn't exist. The ranks of its practitioners would multiply, because every shady marketing guide around would be touting the "amazing benefits" of using blog comments to increase one's standing in Google.

Even if I'm wrong and NoFollow has done nothing to reduce comment spam, at least it has protected the quality of search results. Google isn't the only one with a vested interest in maintaining quality search results. We would all suffer if we had to go back to the "bad old days" of low-quality web search.

What about the idea that any good blog will have vigilantly moderated comments, making NoFollow irrelevant? Good moderation of blog comments is very important. But the argument that it can displace NoFollow assumes that blatant spam is the only threat. As I mentioned in my "Hyperlinks as Votes" entry, a PageRank-style system depends in part on each of us voting in our own "name" (URL). Without NoFollow, that system breaks down whenever hyperlinks appear under your URL which aren't spam but also aren't something you intend to positively endorse.

Suppose I post a comment on your blog with a link back to an entry of my own which is completely relevant but disagrees with you at every turn. It isn't spam. And unless you're particularly thin-skinned, you probably shouldn't exercise your moderation power to delete it. But should search engines interpret that link as your positive vote for the quality or importance of my page? And even if you think they should, would you want that vote to be of the same strength as one given to something which you directly referenced in the body of your post?

It isn't time for NoFollow to go away. It's time for it to grow up into something more powerful and expressive.

Friday Jan 26, 2007

A quick word about "A quick word about Googlebombs"

Google has just announced that they have tweaked their search algorithm in a way which "has begun minimizing the impact of many Googlebombs." I'm not sure whether I think that's a good thing or not. On one hand, susceptibility to any artificial manipulation of search results is probably bad. On the other hand, a little light-heartedness is one way that Google has always stood out as a company.

I have no such mixed feelings in looking at how Google announced this change, however. I think it's pathetic. Their blog entry essentially just says that the change is algorithmic and "very limited in scope and impact." Good intro, but how about some details?

Googlebombs worked in the first place because Google's search algorithm assumes that what people say when they link to a page can be used to better understand that page. That idea is an important piece of the search puzzle, and I'd like to understand how their new algorithm changes affect it. Presumably, being "very limited in scope and impact" means that they somehow detect and ignore "bad" context in some links (those matching some Googlebomb profile) while still paying attention to "good" context in other links. Again, that sounds good (if my presumption is correct), but why not be more forthcoming about exactly what's being done? We all deserve to know whether and how the wording around hyperlinks affects a target URL's standing in search results.
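To see why the technique worked at all, here is a toy sketch of that underlying assumption (my own simplification, not Google's actual algorithm): anchor text from linking pages gets indexed against the target page, as if the target contained those words itself.

```python
from collections import defaultdict

# Hypothetical links discovered while crawling: (anchor_text, target_url) pairs.
links = [
    ("thoughtful biography", "http://example.com/bio"),
    ("miserable failure", "http://example.com/bio"),
    ("miserable failure", "http://example.com/bio"),
]

# Credit each anchor-text term to the *target* URL.
anchor_index = defaultdict(lambda: defaultdict(int))
for anchor, target in links:
    for term in anchor.lower().split():
        anchor_index[term][target] += 1

# Enough coordinated links can make the target rank for words it never uses itself.
print(dict(anchor_index["failure"]))  # {'http://example.com/bio': 2}
```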

I realize that Google is in a very competitive space. Keeping a lead over the likes of Microsoft and Yahoo (if you believe Google is indeed in the lead) requires that Google keep some technical secrets to itself. But the key word is some. There is value in allowing everyone to understand the basics of how a key service such as Google search works. Their core PageRank technology fundamentally depends on us all "voting" with our hyperlinks. And as I've mentioned before, I think there is an obligation to let that "electorate" learn how best to use its votes. That can certainly be accomplished without giving away every detail of the technology. But I think it requires more detail than just telling us that something is algorithmic and low-impact.

Monday Jan 22, 2007

Wikipedia Decides Its Outgoing Links Can't Be Trusted?

I find this sad. By adding the rel="nofollow" attribute to the outgoing links in all articles, the Wikipedia seems to be wavering in its trust of volunteers. Yes, link spam is a problem. And with its combination of high visibility and open authoring, the Wikipedia is a prime target. But why not deal with this problem the same way it deals with other inaccurate and abusive content? Count on the volunteer base to detect and correct issues quickly (and give the administrators tools to lock certain articles which are repeated targets).

Until yesterday, that's exactly how the English-language Wikipedia dealt with link spam. But now the project has thrown up a white flag and said that its volunteers and tools aren't adequate to police the situation. Instead, the equivalent of martial law has been declared and everyone suffers.

The Wikipedia is the closest thing we have to a collective and collaborative voice in describing our world. When an external URL is referenced in a Wikipedia article, it must pass the editorial "litmus test" of all Wikipedians watching that article (who will presumably have high interest and expertise in the subject). With the blanket inclusion of the nofollow attribute on these links, search engines such as Google will no longer use these links as part of their determination of which URLs are most important. So we end up with slightly poorer search results and one less way to register our "votes" for improving them. Sad.

On the bright side, the original announcement does note that "better heuristic and manual flagging tools for URLs would of course be super." Presumably, this means that when such tools are made available, the blanket application of nofollow will be removed. Let's hope that happens. Soon.

Thursday Jan 18, 2007

Hyperlinks as Votes: Time for a PageRank Tune-up?

Treat the hyperlinks in web pages as "votes" for other web pages. Then use a feedback loop so that pages which receive more votes from others have their own votes become more powerful. That's how the PageRank algorithm pushes the best pages to the top of Google search results. Nearly a decade after Larry Page and Sergey Brin published the initial description of PageRank, Google says it still serves as the core of its technology.
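As a minimal sketch of that feedback loop, here is a textbook power iteration over a toy link graph (my own illustration, obviously not Google's production system):

```python
# Toy link graph: each page "votes" for the pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
damping = 0.85
rank = {page: 1.0 / len(links) for page in links}

# Repeatedly redistribute each page's rank across its outgoing links; pages that
# collect more rank cast correspondingly stronger votes on the next pass.
for _ in range(50):
    new_rank = {page: (1 - damping) / len(links) for page in links}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += damping * share
    rank = new_rank

print(rank)  # C collects the most votes, so it ends up with the highest rank
```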

So if hyperlinks are votes, how do we make sure the electorate uses their power wisely?

For one, we need to ensure that people only vote in their own name. Not so long ago, that ideal was effectively violated by blog spam. Automated programs would comb the web looking for any blog where they could post hyperlinks to the likes of Viagra sales. Successfully adding such a hyperlink to a well-known blog would result in a strong PageRank "vote" for the spammer's page. So in effect, the spammer was voting in the blog owner's name (and hijacking the owner's PageRank strength).

This issue was largely fixed in 2005, when Google announced that it would start interpreting a rel="nofollow" hyperlink attribute as a request for exclusion from PageRank calculations. Blog spam can still be a problem, but since most blogging software now adds the rel="nofollow" attribute to hyperlinks in comments, it won't benefit spammers' PageRank standings.
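For example, here is a minimal sketch (my own, not taken from any particular blogging package) of the rewrite that comment markup typically goes through before being published:

```python
import re

def nofollow_comment_links(comment_html):
    """Add rel="nofollow" to every anchor tag found in a blog comment."""
    def add_rel(match):
        tag = match.group(0)
        if "rel=" in tag:
            return tag  # this simple sketch leaves existing rel attributes alone
        return tag[:-1] + ' rel="nofollow">'
    return re.sub(r"<a\b[^>]*>", add_rel, comment_html)

comment = 'Nice post! See <a href="http://spammy.example.com">cheap pills</a>.'
print(nofollow_comment_links(comment))
# Nice post! See <a href="http://spammy.example.com" rel="nofollow">cheap pills</a>.
```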

But is just being able to mark a hyperlink as a "non vote" enough? Wouldn't it be nice to have even more control, such as specifying which hyperlinks are positive votes for the referenced page and which are negative votes? That's what some of the Technorati folks are aiming to allow with the Vote Links microformat. It proposes rev="vote-for", rev="vote-abstain", and rev="vote-against" values so that page authors can express their voting intent for each hyperlink.
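Here is a hypothetical sketch of how an indexer might tally those annotations; the numeric weights are my own placeholders, since the microformat itself only names the intents:

```python
from typing import Optional

# Placeholder weights for the Vote Links rev values.
VOTE_WEIGHTS = {"vote-for": 1, "vote-abstain": 0, "vote-against": -1}

def link_vote(rev_value: Optional[str]) -> int:
    """Map a hyperlink's rev value to a vote; a plain link stays an implicit vote-for."""
    if rev_value is None:
        return 1
    return VOTE_WEIGHTS.get(rev_value, 1)

print(link_vote("vote-against"))  # -1
print(link_vote(None))            # 1
```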

Still, is even that enough? I wonder why there is no effort to allow authors to control the relative strength of their votes. The Vote Links FAQ has an entry covering this, saying:

Q: Why only for and against? How about something more nuanced?

A: The point of this is to provide a strong yes/no response. Finer-grained measures of agreement don't make much sense on an individual basis; aggregating many votes is more interesting. For example, consider how eBay's user rating system has been reduced to a like/dislike switch by users. The 'Ayes, Noes, abstentions' model has served well in politics and committees, when a division is called for.

I'm not satisfied with this answer. The "interesting" aggregation of simple votes which they mention will sometimes be housed within a single page. For example, thousands of people may give a particular URL a positive response at Digg, but it still just shows up as one hyperlink. The same could be said for other sites with significant user input (such as YouTube, Slashdot, or their own example: eBay).

Obviously, no page should be able to artificially inflate the importance of its own hyperlink votes (e.g. rel="I_represent_1_million_votes--honest"). But why not allow a page to determine the portion of its fixed PageRank contribution which is passed along to each of its hyperlinks? A Digg page, for example, might choose to give 10% of its PageRank voting value to an item with 2000 Diggs and only 2% to another item with just 200 Diggs. Search engines could then benefit from the internal ranking systems of sites (such as Digg) without having to understand their internal details. And we could all benefit from a more finely tuned hyperlink democracy.
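Here is a rough sketch of that proposal; the proportional rule and the numbers are mine, purely to illustrate one way a page might divide its fixed contribution among its links:

```python
# Hypothetical items on a Digg-style page, with their internal vote counts.
items = {
    "http://example.com/story-a": 2000,
    "http://example.com/story-b": 200,
    "http://example.com/story-c": 50,
}

page_contribution = 1.0  # the page's fixed PageRank "budget" to pass along
total_votes = sum(items.values())

# Split the budget in proportion to each item's internal score, instead of equally.
shares = {url: page_contribution * votes / total_votes for url, votes in items.items()}
for url, share in shares.items():
    print(f"{url}: {share:.1%} of this page's vote")
```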

Monday Jan 15, 2007

Tagging Formats: Why Not Follow Search's Lead?

Why is there so much argument over the need for a tagging standard? Of course we need a standard. It just doesn't have to be a new one. We have a ready-made standard in existing keyword-to-content systems: search engines.

How is tagging so different from searching? About the only difference I see is that they're done in reverse order. Instead of proposing keywords for a not-yet-found piece of content, tagging applies keywords to content up front. Great. Both are useful. Both are important. And both should share one syntax for keywords.

Since search came first, it gets to set the standard. So if you want to see how tagging should work, just look to Google's search form. It's space-delimited, spaces can exist within quoted items, and quotes can exist within items if they're escaped. There you go--the tagging standard wars are settled. The settlement just happens to predate tagging itself.
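As a quick sketch, Python's shlex module already parses roughly that syntax (space-delimited, with quotes grouping phrases and backslashes escaping quotes), so a tag parser could be little more than:

```python
import shlex

def parse_tags(raw):
    """Split a tag string the way a search box splits a query."""
    return shlex.split(raw)

print(parse_tags('travel "new york" say\\"cheese\\"'))
# ['travel', 'new york', 'say"cheese"']
```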

Thursday Dec 14, 2006

Jonathan Hits Madoogle Status?

Some people say you've really made it when you only need one name--like Madonna. Interesting theory. Some people say that top placement in search results is quickly becoming the most valuable real estate in the world. Interesting fact.

What happens if we combine these ideas? I think we get an updated version of the One Name Phenomenon: you've really made it when you're atop the results for a Googling of just your first name. And so I hereby acknowledge that Sun's own media star of a CEO has joined the exclusive "Madoogle" club.
