Fuzzy Authorship: A Mechanism to Motivate and Reward Collaboration

There is nothing quite as rewarding as seeing your name in lights for doing something good. Getting recognition for your work is one of the big motivators for social beings such as ourselves. Attribution is therefore an essential part of the publishing process--especially for collaborative documents like wiki pages. But how is authorship to be determined? Having to click different buttons for major and minor changes just complicates things. Better if we can automate the process. This post suggests some "fuzzy logic" heuristics we can use.

Assigning "fuzzy authorship" might be possible if we:

• Identify a Contribution Percentage
• Convert the percentages into Fuzzy Authorship
• Handle Special Situations

Contribution Percentage

The first step is to identify the percentage of material that a contributor is responsible for. That means identifying the size of an individual contribution and comparing it with the sum of all contributions.

The most accurate way to determine the size of a contribution is to use the sum of the absolute value of the deltas. So if a contributor adds 50 characters and removes 50, then the size of the change is abs(50) + abs(-50), or 100 characters. (A simpler way to do it would be to compare before-and-after file sizes, but it would obviously be less accurate.)
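To make that concrete, here is a minimal sketch (mine, not part of the original proposal) that computes a change's size as the sum of the absolute values of the deltas, using Python's standard-library difflib for character-level differencing:

```python
import difflib

def contribution_size(before, after):
    """Size of a change: the sum of the absolute values of the deltas
    (characters removed plus characters added)."""
    size = 0
    matcher = difflib.SequenceMatcher(None, before, after)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "delete":
            size += i2 - i1                  # abs(-removed)
        elif op == "insert":
            size += j2 - j1                  # abs(+added)
        elif op == "replace":
            size += (i2 - i1) + (j2 - j1)    # removed + added
    return size

# removing 50 characters and adding 50 scores abs(-50) + abs(50) = 100
size = contribution_size("x" * 50, "y" * 50)
```

Comparing before-and-after file sizes would need no differencing at all, but as noted above, it would score the 50-in/50-out edit as zero.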

The sizes of individual changes can then be accumulated for each contributor, and divided by the sum of all changes to create a percentage. And for the purposes of that calculation, we're more interested in tracking the sum of all changes than in tracking the final file size.

To summarize the ontology:

• A wiki has multiple contributors.
• A contributor makes a contribution to a page when they post changes.
• The size of the contribution is the sum of the absolute values of the deltas.
• The contributor's total contribution to the page is the sum of all past contributions.
• Their contribution percentage, or contributorship, is their total contribution divided by the sum of all contributions.
• Their authorship status is determined by their contribution percentage relative to the size and timing of other contributions. (The subject of the next section.)
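The bookkeeping that ontology implies can be sketched in a few lines of Python. (The class and method names here are my own invention, not any existing wiki's API.)

```python
from collections import defaultdict

class PageLedger:
    """Per-page bookkeeping: each contributor's accumulated change size."""

    def __init__(self):
        # contributor -> total characters changed (sum of all their deltas)
        self.totals = defaultdict(int)

    def record(self, contributor, change_size):
        """Accumulate the size of one posted change."""
        self.totals[contributor] += change_size

    def contributorship(self, contributor):
        """Contribution percentage: the contributor's total divided by
        the sum of all contributions (0.0 for an untouched page)."""
        grand_total = sum(self.totals.values())
        if grand_total == 0:
            return 0.0
        return self.totals[contributor] / grand_total

# two contributors, 750 characters of changes each
ledger = PageLedger()
ledger.record("A", 750)
ledger.record("B", 750)
```

Note that the ledger tracks the sum of all changes, not the final file size, per the paragraph above.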

Fuzzy Authorship

Once we've established contribution percentages, we're in a position to convert them into some form of tangible recognition. We could just report percentages, but what's the point? There's no real difference between 82.4% and 84.6%, so it makes more sense to partition the percentages into authorship categories that match our intuitions. That sort of partitioning is the essence of fuzzy logic, hence the name, fuzzy authorship.

Here are some thoughts on ways to assign authorship status, based on the size and timing of contribution percentages:

Percentage   Title              Notes
40-100%      Author             If no co-authors and no more than 3 major contributors.
40-80%       Primary Author     If there is a co-author or 2 or more major contributors.
<50%         Original Author    If author/primary author status persists for a month. (Once this state is reached, it never changes.)
30%          Co-Author          Range: 25-50%
20%          Major Contributor  Range: 14.5-25%
10%          Contributor        Range: 7.5-14.5%
5%           Reviewer           Range: 1.5-7.5%
1%           Proofreader        If no single change is greater than, say, 15 characters. Otherwise: reviewer.
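The simpler rows of that table amount to a range lookup, which could be sketched as follows. This is deliberately simplified: it ignores the Author/Primary Author/Original Author distinctions, which depend on the timing and number of other contributions, and the Proofreader's per-change size test.

```python
# (lower bound of percentage range, title), checked largest-first;
# the 50% Author threshold is a simplification of the table's top rows
TITLE_RANGES = [
    (50.0, "Author"),
    (25.0, "Co-Author"),
    (14.5, "Major Contributor"),
    (7.5,  "Contributor"),
    (1.5,  "Reviewer"),
    (0.0,  "Proofreader"),
]

def authorship_title(percentage):
    """Partition a contribution percentage into a fuzzy authorship title."""
    for lower_bound, title in TITLE_RANGES:
        if percentage >= lower_bound:
            return title
    return "Proofreader"
```

Notice that 82.4% and 84.6% land on the same title, which is the whole point of the partitioning.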

Sidebar: Fuzzy Logic
Fuzzy logic is generally concerned with reasoning about processes that have already been partitioned. So, for example, high-speed Japanese trains use different heuristics depending on whether the train is "speeding up", "slowing down", "starting" or "coming to a stop". You assign the rules to use in each situation, depending on the goal you want to achieve.

Because you're dealing with named states, rather than numbers, the rules can be read and written by regular people, in addition to programmers. That fact makes it easier for experts to contribute to the heuristics and review the results--a benefit that has also been observed by developers who have implemented a Domain Specific Language (DSL) that matches the problem they are trying to solve.

The fuzzy logic used in the 200mph Japanese trains is so successful that passengers don't need to hold on to the handrails at any time, from the time they step on board to the time they get off. So the technique is powerful. But what makes the logic work is the underlying step, where a speed of 86mph and an acceleration of such and such is translated into a state that makes a rule fire.

There are many ways to solve the problem of translating from a given speed to a state. It can be done by a kind of voting: 86mph might be considered "sorta fast" to degree 0.2, "fast" to degree 0.8, and "very fast" to degree 0.2. Rules associated with all three states might then fire, with their contributions weighted by .2, .8, and .2, respectively, to determine a numeric result. We don't necessarily need to do anything that complex--but it's interesting that we would have that capability, should we choose to use it.
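That voting step can be sketched with triangular membership functions. The speed ranges and braking values below are invented for illustration, not taken from any real train controller:

```python
def triangle(x, left, peak, right):
    """Degree of membership (0.0-1.0) in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def memberships(speed):
    """The vote: how strongly does this speed belong to each state?"""
    return {
        "sorta fast": triangle(speed, 60, 75, 90),
        "fast":       triangle(speed, 70, 90, 110),
        "very fast":  triangle(speed, 80, 120, 160),
    }

# each state's rule proposes a braking level (illustrative values)
BRAKING = {"sorta fast": 0.1, "fast": 0.3, "very fast": 0.6}

def blended_braking(speed):
    """Fire every state's rule, weighted by its membership degree."""
    m = memberships(speed)
    return sum(BRAKING[s] * w for s, w in m.items()) / sum(m.values())
```

At 86mph all three states fire with partial degrees, and the result blends their rules; a speed outside every set would need a default rule, omitted here for brevity.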

Special Situations

In some cases, it may be necessary to make changes without changing authorship percentages. That requirement implies a need for roles.

For example, an editor may move half of a page to a new page, to make things easier to read. But that doesn't necessarily make the editor responsible for half the content on the first page, and all of the content on the second! So one implication is the need for an editor role, which can reorganize things without affecting contributorship.

The other implication is the need to maintain contributorship data in the face of such changes.

That brings us to the subject of granularity. In the best of all possible worlds, individual changes would be attributed--internally, at least, if not in any visible way. So if contributor A was responsible for the first half of the page, and contributor B was responsible for the second half, then the attributions would ideally remain appropriate if an editor moved B's contribution to a new page.

On the other hand, if both contributors worked equally on all parts of the page, then accurate attribution would leave them with 50% contributorship for each of the resulting pages. But the ability to make that determination depends on the granularity and method with which authorship is recorded.

Current wikis and file systems may record the fact that a file was created by contributor A, and modified by contributor B, but that is generally as far as they go. More accurate calculations would be possible if the people responsible for individual words, characters, and paragraphs could be identified.
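One minimal way to record that granularity (a sketch of the idea, not any existing wiki's mechanism) is to store a page as a list of (author, text) runs, so that moving or splitting text carries its attribution along:

```python
def split_page(runs, char_offset):
    """Split a page (a list of (author, text) runs) at char_offset,
    returning the two resulting pages with attribution preserved."""
    first, second, seen = [], [], 0
    for author, text in runs:
        if seen + len(text) <= char_offset:
            first.append((author, text))        # run wholly before the cut
        elif seen >= char_offset:
            second.append((author, text))       # run wholly after the cut
        else:
            cut = char_offset - seen            # run straddles the cut
            first.append((author, text[:cut]))
            second.append((author, text[cut:]))
        seen += len(text)
    return first, second

# A wrote the first half, B the second; an editor splits mid-A
halves = split_page([("A", "aaaa"), ("B", "bbbb")], 2)
```

With this representation, an editor who moves B's runs to a new page moves B's attribution with them, without acquiring any of it.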

I've heard of one system that would be capable of doing that, devised by Sandy Klausner (CoreTalk). It cleverly maintains markup outside of the text, with pointers into it. That approach allows the text to be marked up in multiple ways, for multiple purposes. The idea is to be able to easily modify one set of markups, rearranging things at will, without disturbing the other markups. (That's a seriously complex undertaking. He has developed a completely new hardware/software architecture to solve the problem.)

There have been other attempts at constructing a truly granular storage system, as well:

• I attempted the challenge in 2000, with Kernel--a node-and-lists storage system based on the Java platform. I became overwhelmed by the complexity and gave it up soon after, but only after making copious notes on the idea and the hurdles I faced. (At the end of my Collaboration System index, under "Node Library".)

• Lee Iverson then took up the gauntlet, figured out where I went wrong, and took things to the next level. He eventually started a graduate-school program to work on the idea (Nodal), which produced several papers and a SourceForge project. There are even Java APIs (The Nodal Interface Specification for Java).

Lacking such a system, we'll probably have to be content with a system that uses the total contribution percentages for a page, when things are rearranged. So if an editor moves half of a 1,000-character page elsewhere, then the contributor percentages would remain the same for each of the new pages.

But note that while the percentages remain, the total contribution sizes would need to be adjusted. A 1,000-character page may have had 1,500 characters of changes over its lifetime. If Contributors A and B are each responsible for half, then they would be responsible for 750 characters each. If one third of the page were moved elsewhere, then the breakdown would be:

• Before the change:
  • total contributions for the page: 1,500 characters
  • A's share: 750 characters (50%)
  • B's share: 750 characters (50%)
• After the change:
  • total contribution for the original page: 1,000 characters (2/3)
  • total contribution for the new page: 500 characters (1/3)
  • A's contribution to the original page: 500 characters (50%)
  • A's contribution to the new page: 250 characters (50%)
  • ...and similarly for B
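The adjustment above can be sketched as a small function (the name is my own) that splits each contributor's total in proportion to the fraction of the page moved, leaving every percentage untouched. Fraction keeps the arithmetic exact:

```python
from fractions import Fraction

def split_contributions(totals, fraction_moved):
    """Divide each contributor's total between the original page and the
    new page in proportion to the fraction of the page moved. Each
    contributor's percentage is unchanged on both resulting pages."""
    stay, moved = {}, {}
    for contributor, chars in totals.items():
        moved[contributor] = chars * fraction_moved
        stay[contributor] = chars - moved[contributor]
    return stay, moved

# 1,500 characters of changes, A and B at 750 each; a third moves out
stay, moved = split_contributions({"A": 750, "B": 750}, Fraction(1, 3))
```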

Resources

• Recent blog entries on the subject of Collaboration
• Extant Granular Repositories
• Nodal: A granular storage system that stores document components in a directed graph.
• Sandy Klausner's CoreTalk system
• History and motivation recorded in these Collaboration System papers. (The exploration of collaboration system requirements surfaced the need for a granular storage mechanism.) In particular:

Interesting stuff, Eric. A question about contribution percentage:

When using the sum of the absolute values of the deltas, doesn't this give inordinate "points" to someone who edits an entry? For example, someone who reworks 50 characters for readability would get 100 "points" as opposed to 50 for the original contributor. Or are we assuming that copy editing is more of a corner case, and that most of the editing would be factual fine-tuning (where the editor might deserve credit both for removing the factually imprecise bit and for adding in the correct part)?

Posted by Patrick Keegan on December 11, 2007 at 02:41 AM PST #

Patrick Keegan wrote:
> When the sum is the absolute value of the deltas,
> doesn't that give inordinate "points" to someone
> who edits an entry?
>
Great question, Patrick. I think that it could. This post is a starting point for algorithm awareness. I expect a significant amount of fine-tuning and no doubt some major overhauling as time goes on.

Your question brings up an aspect of the equation that I assumed without mentioning it in the article: highly granular differencing. If we do character counts on differences generated with a line-based differencer, even a tiny edit counts as a whole line, so we'll go over the 15-character proofreading limit very fast. So with line differencing, we probably want to say that edits under 3 lines constitute "proofreading", rather than reviewing or a larger contribution.

Character-based differencing would be more accurate, but it may be possible to get significantly better accuracy with XML-based differencing.

If the Wiki is producing legal xHTML, then an XML-based differencing engine could be used. (That would also be useful when "purplizing" the page for addressability--the subject of a future post.)

XML differencing produces one highly useful capability right away: The ability to distinguish between categories of tags. So:

-> changes in text and *inline tags* are minor changes

-> additions and deletions of *structure tags* (H1..H5) represent major changes

-> promoting or demoting a structure tag (h2->h3, for example) is a relatively minor edit

-> intermediate tags like <li> and <p> may be minor if contiguous (breaking one paragraph into two), or represent a significant addition (adding a new paragraph to expand on a subject)
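Those categories could be sketched as a simple classifier (the names are hypothetical; a real implementation would sit on top of an XML differencing engine, which does the hard part of producing the change operations):

```python
STRUCTURE_TAGS = {"h1", "h2", "h3", "h4", "h5"}
INTERMEDIATE_TAGS = {"li", "p"}

def classify_change(op, tag, contiguous=False):
    """Rough significance of a single XML-diff operation.

    op         -- 'add', 'delete', or 'promote' (e.g. h2 -> h3)
    tag        -- lowercase element name, or None for a pure text change
    contiguous -- True when an intermediate tag merely splits existing
                  content (e.g. breaking one paragraph into two)
    """
    if tag in STRUCTURE_TAGS:
        # promoting/demoting a heading is minor; adding or deleting
        # one is a major structural change
        return "minor" if op == "promote" else "major"
    if tag in INTERMEDIATE_TAGS:
        return "minor" if contiguous else "major"
    return "minor"  # text and inline tags
```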

There are papers waiting to be written on the subject, I'm sure. I'm beginning to think now that the first step is to determine whether the changes are *significant*. If they are, use the authorship algorithm to determine recognition status. If not, credit the contributor for the proofreading, and leave the authorship values unchanged.

Posted by Eric Armstrong on December 11, 2007 at 03:14 AM PST #

In general, I have serious problems with a system that uses character counts as the sole criterion to establish authorship contribution to a piece of writing. A few questions immediately spring to mind, and I could probably think of many more with further reflection:

* Character counts are not a consistent indicator of quality or percentage of contribution -- it's actually harder to write succinctly. This attribution scheme seems to reward verbose writing!

* Similarly, one person's 500-word contribution might be cogent, meaningful, and invaluable while another's is convoluted and tangential. Using character counts alone to evaluate "authorship" seems unfair and misleading.

* Credit does not appear to be given for the originator of an idea. What if someone has come up with the approach of the article, and other contributions have merely been more words to back up points made by the original author? They would be considered co-authors even if the material they added is merely supplemental, rather than contributing to the structure or originality of the article.

* Doesn't this fly in the face of the community spirit? I can easily envision a scenario where people would start adding fluff or unnecessary verbiage in order to increase their credit.

Posted by JaniceG on December 11, 2007 at 05:44 AM PST #

JaniceG raises good, cogent points. They're worth a thoughtful response, so here goes:

> Character counts are not a consistent indicator of
> quality or percentage of contribution.
>
Absolutely true. But I think that an arbitrary judge may be better than none at all. The alternative is a manual system where someone makes a decision. There's nothing wrong with that. It's certainly worth trying, to see how it works out. But really good contributors who lack confidence won't claim recognition, while merely average contributors with an overblown sense of self will give themselves credit where it isn't due. The only way to solve that problem is by putting all such decisions in the hands of someone else. That someone else could be the author of the page. But what if they have moved on to other things and haven't visited in a while? Or, if the decisions are in the hands of some global administrator, it has the potential of becoming a thankless, full-time job where someone is sure to criticize your decision, no matter what it is.

I'd love to see an even better algorithm come into existence--one that evaluated the contribution with respect to its appropriateness and elegance would be just swell. But that would be some pretty deep AI. So far, only humans are good at that--which means a manual system, with the issues noted above.

> one person's 500-word contribution might be cogent,
> meaningful, and invaluable while another's is
> convoluted and tangential.
>
Same point, expressed another way. But it does prompt an idea: In a system that records authorship with sufficient granularity, reducing the words to a more minimal syntax should reduce the original writer's contribution. On the other hand, removing text should result in some form of recognition for "editing", rather than for "authoring". Not quite sure how to handle that, as yet. I'm glad you raised the issue.

> Credit does not appear to be given for the originator
> of an idea.
>
That was the goal of the "original author" tag. The note indicates that while you may move from author to primary author to original author, you never move below that. (I don't know that it's sufficient, but it was an attempt to make sure the originator remains recognized.)

> Doesn't this fly in the face of the community spirit?
> I can easily envision a scenario where people would
> start adding fluff or unnecessary verbiage in order
> to increase their credit.
>
That kind of behavior would certainly be anti-community, and it certainly wouldn't be a good idea to encourage that kind of gamesmanship. Even the idea of reducing someone's credit when their words are deleted is fraught with the potential for fraud. One would simply have to cut the text to the clipboard, save the page, and then paste in the same text as a new edit to become the primary author!

On the other hand, people said the same thing about wikis when they first came out. There was a great fear about letting "just anyone" edit a page. Time has shown that, other than the need to police wiki spam, people really are basically good, and they don't go out of their way to be unfair.

Of course, it is also possible not to implement any authorship recognition strategy at all. That's pretty much the situation we have now. Maybe that's all we need? Dunno. From what I hear, people are starting to look at *who* wrote a particular wiki page before deciding to trust it. It's the only way they have to establish confidence.

If there's a version history, people can sort of get an idea from that. The idea of fuzzy authorship was just to identify and implement the heuristics that people would naturally tend to use for that process.

Bottom line: I'm for whatever works. If this post helps to stimulate discussion, and something workable surfaces, I'm all for it. (Even if the eventual conclusion is that there is *no* workable strategy, and we should leave things as they are--but I suspect that the final determination will have to be made in practice, not in theory, just as it was with wikis themselves.)

Posted by Eric Armstrong on December 11, 2007 at 09:30 AM PST #

Eric -
Very cool stuff. Would love to see this integrated into our Community Equity model.

Posted by Peter Reiser on March 11, 2008 at 02:23 AM PDT #