Fuzzy Authorship: A Mechanism to Motivate and Reward Collaboration
By Eric Armstrong on Dec 09, 2007
There is nothing quite as rewarding as seeing your name in lights for doing something good. Getting recognition for your work is one of the big motivators for social beings such as ourselves. Attribution is therefore an essential part of the publishing process--especially for collaborative documents like Wiki pages. But how is authorship to be determined? Having to click different buttons for major and minor changes just complicates things. Better if we can automate the process. This post suggests some "fuzzy logic" heuristics we can use.
Assigning "fuzzy authorship" might be possible if we:
- Identify a Contribution Percentage
- Convert the percentages into Fuzzy Authorship
- Handle Special Situations
The first step is to identify the percentage of material that a contributor is responsible for. That means identifying the size of an individual contribution and comparing it with the sum of all contributions.
The most accurate way to determine the size of a contribution is to use the sum of the absolute values of the deltas. So if a contributor adds 50 characters and removes 50, then the size of the change is abs(50) + abs(-50), or 100 characters. (A simpler way to do it would be to compare before-and-after file sizes, but that is less accurate: the same change would register as zero, because the file size is unchanged.)
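As a rough sketch of that measurement, the function below adds up the characters removed and the characters added between two revisions. It assumes revisions are available as plain strings and uses Python's standard `difflib` to find the deltas; the name `change_size` is made up for illustration, and a real wiki would probably work from its own stored diffs.

```python
from difflib import SequenceMatcher

def change_size(before: str, after: str) -> int:
    """Return characters removed plus characters added between two revisions."""
    size = 0
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            size += i2 - i1  # characters removed from the old text
        if tag in ("replace", "insert"):
            size += j2 - j1  # characters added in the new text
    return size

# The example from the text: 50 characters removed, 50 added.
before = "x" * 50
after = "y" * 50
print(change_size(before, after))      # 100
print(abs(len(after) - len(before)))   # 0 (the file-size shortcut misses the change)
```

Note that the exact deltas depend on the diff algorithm, so different tools may report slightly different sizes for the same edit.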
The sizes of individual changes can then be accumulated for each contributor, and divided by the sum of all changes to create a percentage. And for the purposes of that calculation, we're more interested in tracking the sum of all changes than in tracking the final file size.
To summarize the ontology:
- A wiki has multiple contributors
- A contributor makes a contribution to a page when they post changes.
- The size of the contribution is the sum of the absolute values of the deltas
- The contributor's total contribution to the page is the sum of all past contributions.
- Their contribution percentage, or contributorship, is their total contribution divided by the sum of all contributions.
- Their authorship status is determined by their contribution percentage relative to the size and timing of other contributions. (The subject of the next section.)
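The first five items of that ontology could be sketched as a small per-page ledger. The class and method names here (`PageLedger`, `record`, `percentages`) are invented for illustration, and the sizes passed in are assumed to be sums of absolute deltas computed elsewhere.

```python
from collections import defaultdict

class PageLedger:
    """Running totals of contribution sizes, per contributor, for one page."""

    def __init__(self):
        self.totals = defaultdict(int)

    def record(self, contributor, size):
        """Add one posted change (already measured as a sum of absolute deltas)."""
        self.totals[contributor] += abs(size)

    def percentages(self):
        """Each contributor's total divided by the sum of all contributions."""
        grand_total = sum(self.totals.values())
        if grand_total == 0:
            return {}
        return {name: 100.0 * total / grand_total
                for name, total in self.totals.items()}

ledger = PageLedger()
ledger.record("ann", 300)
ledger.record("bob", 100)
ledger.record("ann", 100)
print(ledger.percentages())   # {'ann': 80.0, 'bob': 20.0}
```

The ledger divides by the sum of all changes rather than the current page size, in line with the point made above about which total matters.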
Once we've established contribution percentages, we're in a position to convert them into some form of tangible recognition. We could just report percentages, but what's the point? There's no real difference between 82.4% and 84.6%, so it makes more sense to partition the percentages into authorship categories that match our intuitions. That sort of partitioning is the essence of fuzzy logic, hence the name, fuzzy authorship.
Here are some thoughts on ways to assign authorship status, based on the size and timing of contribution percentages:
| Percentage | Title | Notes |
|---|---|---|
| 40-100% | Author | If no co-authors and no more than 3 major contributors. |
| 40-80% | Primary Author | If there is a co-author or 2 or more major contributors. |
| <50% | Original Author | If author/primary author status persists for a month. (Once this state is reached, it never changes.) |
| 30% | Co-Author | Range: 25-50% |
| 20% | Major Contributor | Range: 14.5-25% |
| 10% | Contributor | Range: 7.5-14.5% |
| 5% | Reviewer | Range: 1.5-7.5% |
| 1% | Proofreader | If no single change is greater than, say, 15 characters. Otherwise: reviewer. |
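One possible reading of those categories, as a sketch. The ranges overlap in places, so the code makes some assumptions that are mine rather than part of the proposal: only the largest contributor is considered for Author or Primary Author, and only at 40% or more; any co-author or two major contributors tips that person to Primary Author; and the time-based Original Author status is left out entirely. The function names are hypothetical.

```python
def base_title(pct, largest_change=0):
    """Title from the range column alone (Co-Author down to Proofreader)."""
    if pct >= 25:
        return "Co-Author"
    if pct >= 14.5:
        return "Major Contributor"
    if pct >= 7.5:
        return "Contributor"
    if pct >= 1.5:
        return "Reviewer"
    # Below 1.5%: proofreader only if every change was tiny (15 chars or fewer).
    return "Proofreader" if largest_change <= 15 else "Reviewer"

def assign_titles(shares, largest_change=None):
    """Map {contributor: percentage} to {contributor: title}."""
    largest_change = largest_change or {}
    titles = {name: base_title(pct, largest_change.get(name, 0))
              for name, pct in shares.items()}
    if not shares:
        return titles
    lead = max(shares, key=shares.get)  # assumption: one author candidate per page
    if shares[lead] >= 40:
        others = [title for name, title in titles.items() if name != lead]
        shared = "Co-Author" in others or others.count("Major Contributor") >= 2
        titles[lead] = "Primary Author" if shared else "Author"
    return titles

print(assign_titles({"ann": 85, "bob": 10, "cy": 5}))
# {'ann': 'Author', 'bob': 'Contributor', 'cy': 'Reviewer'}
print(assign_titles({"ann": 50, "bob": 30, "cy": 20}))
# {'ann': 'Primary Author', 'bob': 'Co-Author', 'cy': 'Major Contributor'}
```

Other tie-breaking choices are just as defensible; the point is only that the table reduces to a handful of readable rules.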
Sidebar: Fuzzy Logic
Fuzzy logic is generally concerned with reasoning about processes that have already been partitioned. So, for example, high-speed Japanese trains use different heuristics depending on whether the train is "speeding up", "slowing down", "starting" or "coming to a stop". You assign the rules to use in each situation, depending on the goal you want to achieve.
Because you're dealing with named states, rather than numbers, the rules can be read and written by regular people, in addition to programmers. That fact makes it easier for experts to contribute to the heuristics and review the results--a benefit that has also been observed by developers who have implemented a Domain Specific Language (DSL) that matches the problem they are trying to solve.
The fuzzy logic used in the 200mph Japanese trains is so successful that passengers don't need to hold on to the rails at any time, from the time they step on board to the time they get off. So the technique is powerful. But what makes the logic work is the underlying step where a speed of 86mph and an acceleration of such and such is translated into a state that makes a rule fire.
There are many ways to solve the problem of translating from a given speed to a state. It can be done by voting. So 86mph might be considered "sorta fast" 20% of the time, "fast" 80% of the time, and "very fast" 20% of the time. Rules associated with all three states might then fire, with their contributions being weighted by .2, .8, and .2, respectively, to determine a numeric result. We don't necessarily need to do anything that complex--but it's interesting that we would have that capability, should we choose to use it.
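To make the weighting concrete, here is a minimal sketch. The weights are the ones from the example above; the rule outputs (throttle adjustments) are invented numbers, and dividing by the sum of the weights is one common way to combine the rules, assumed here rather than taken from any actual train controller.

```python
def blend(memberships, rule_outputs):
    """Weighted average of rule outputs, using state memberships as weights."""
    total_weight = sum(memberships.values())
    weighted_sum = sum(weight * rule_outputs[state]
                       for state, weight in memberships.items())
    return weighted_sum / total_weight

# How strongly 86mph belongs to each named state (from the text).
memberships = {"sorta fast": 0.2, "fast": 0.8, "very fast": 0.2}

# Hypothetical output of the rule attached to each state.
rule_outputs = {"sorta fast": 1.0, "fast": 0.0, "very fast": -2.0}

adjustment = blend(memberships, rule_outputs)
print(round(adjustment, 3))   # -0.167
```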
In some cases, it may be necessary to make changes without changing authorship percentages. That requirement implies a need for roles.
For example, an editor may move half of a page to a new page, to make things easier to read. But that doesn't necessarily make the editor responsible for half the content on the first page, and all of the content on the second! So one implication is the need for an editor role, which can reorganize things without affecting contributorship.
The other implication is the need to maintain contributorship data in the face of such changes.
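A sketch of how an editor role might work: changes posted in that role simply never reach the contribution totals. The role names and the `apply_change` function are hypothetical.

```python
def apply_change(totals, contributor, size, role="contributor"):
    """Return updated per-contributor totals; editor-role changes are not counted."""
    if role == "editor":
        return dict(totals)  # reorganization only: contributorship is untouched
    updated = dict(totals)
    updated[contributor] = updated.get(contributor, 0) + abs(size)
    return updated

totals = {"ann": 750, "bob": 750}
print(apply_change(totals, "ed", 500, role="editor"))   # {'ann': 750, 'bob': 750}
print(apply_change(totals, "ann", 50))                  # {'ann': 800, 'bob': 750}
```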
That brings us to the subject of granularity. In the best of all possible worlds, individual changes would be attributed--internally, at least, if not in any visible way. So if contributor A was responsible for the first half of the page, and contributor B was responsible for the second half, then the attributions would ideally remain appropriate if an editor moved B's contribution to a new page.
On the other hand, if both contributors worked equally on all parts of the page, then accurate attribution would leave them with 50% contributorship for each of the resulting pages. But the ability to make that determination depends on the granularity and method with which authorship is recorded.
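Both scenarios can be illustrated with a toy model in which a page is a list of (author, text) spans. This is only a sketch: it assumes attribution is somehow recorded per span, and it measures surviving characters rather than accumulated deltas. The function names are invented.

```python
from collections import Counter

def shares(spans):
    """Percentage of characters attributed to each author."""
    counts = Counter()
    for author, text in spans:
        counts[author] += len(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {author: 100.0 * n / total for author, n in counts.items()}

def split_spans(spans, offset):
    """Split attributed spans at a character offset, keeping each piece's author."""
    left, right = [], []
    position = 0
    for author, text in spans:
        cut = min(max(offset - position, 0), len(text))
        if text[:cut]:
            left.append((author, text[:cut]))
        if text[cut:]:
            right.append((author, text[cut:]))
        position += len(text)
    return left, right

# A wrote the first half, B the second: each new page has a single author.
page = [("A", "a" * 500), ("B", "b" * 500)]
first, second = split_spans(page, 500)
print(shares(first), shares(second))   # {'A': 100.0} {'B': 100.0}

# A and B interleaved throughout: each new page stays 50/50.
mixed = [("A", "aa"), ("B", "bb"), ("A", "aa"), ("B", "bb")]
first, second = split_spans(mixed, 4)
print(shares(first), shares(second))   # {'A': 50.0, 'B': 50.0} {'A': 50.0, 'B': 50.0}
```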
Current wikis and file systems may record the fact that a file was created by contributor A, and modified by contributor B, but that is generally as far as they go. More accurate calculations would be possible if the people responsible for individual words, characters, and paragraphs could be identified.
I've heard of one system that would be capable of doing that, devised by Sandy Klausner (CoreTalk). It cleverly maintains markup outside of the text, with pointers into it. That system allows the text to be marked up in multiple ways, for multiple purposes. The idea is to be able to easily modify one set of markups and rearrange things at will without disturbing the other markups. (That's a seriously complex undertaking. He has developed a completely new hardware/software architecture to solve the problem.)
There have been other attempts at constructing a truly granular storage system, as well:
- I attempted the challenge in 2000, with Kernel--a node-and-lists storage system based on the Java platform. I became overwhelmed by the complexity and gave it up soon after, but only after making copious notes on the idea and the hurdles I faced. (At the end of my Collaboration System index, under "Node Library".)
- Lee Iverson then took up the gauntlet, figured out where I went wrong, and took things to the next level. He eventually started a graduate-school program to work on the idea (Nodal), which produced several papers and a SourceForge project. There are even Java APIs (The Nodal Interface Specification for Java).
Lacking such a system, we'll probably have to be content with a system that uses the total contribution percentages for a page, when things are rearranged. So if an editor moves half of a 1,000-character page elsewhere, then the contributor percentages would remain the same for each of the new pages.
But note that while the percentages remain, the total contribution sizes would need to be adjusted. A 1,000-character page may have had 1,500 characters of changes over its lifetime. If Contributors A and B are each responsible for half, then they would be responsible for 750 characters each. If one third of the page were moved elsewhere, then the breakdown would be:
- Before the change:
  - total contributions for the page: 1,500 characters
  - A's share: 750 characters (50%)
  - B's share: 750 characters (50%)
- After the change:
  - total contribution for original page: 1,000 characters (2/3)
  - total contribution for new page: 500 characters (1/3)
  - A's contribution for the original page: 500 characters (50%)
  - A's contribution for the new page: 250 characters (50%)
  - ...and similarly for B
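The arithmetic in that breakdown amounts to scaling every contributor's total by the fraction of the page that moved. A sketch, with a hypothetical `split_totals` function and exact fractions to avoid rounding noise:

```python
from fractions import Fraction

def split_totals(totals, moved_fraction):
    """Divide each contributor's total between the original page and the new page."""
    if not 0 <= moved_fraction <= 1:
        raise ValueError("moved_fraction must be between 0 and 1")
    stays = {name: size * (1 - moved_fraction) for name, size in totals.items()}
    moves = {name: size * moved_fraction for name, size in totals.items()}
    return stays, moves

# 1,500 characters of changes, split evenly; one third of the page is moved.
original, new_page = split_totals({"A": 750, "B": 750}, Fraction(1, 3))
print(original["A"], new_page["A"])                        # 500 250
print(sum(original.values()), sum(new_page.values()))      # 1000 500
```

Because every total is scaled by the same factor, each contributor's percentage is the same on both pages as it was before the move.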
- Recent blog entries on the subject of Collaboration
  - Wikis, Docs, and the Reuse Proposition
  - Online Document Collaboration
  - DITA, XDocs, and Online Collaboration with the Open Source Community
- Extant Granular Repositories
  - Nodal: A granular storage system that stores document components in a directed graph.
  - Sandy Klausner's CoreTalk system
- History and motivation recorded in these Collaboration System papers. (The exploration of collaboration system requirements surfaced the need for a granular storage mechanism.) In particular:
  - Requirements for a Collaborative Design/Discussion/Decision System
  - Knowledge Repositories and Design Discussions