# Measuring long-term fairness for locks

Senior Research Scientist
Lets say we have N concurrent and homogenous threads that each loop as follows : acquire a central lock; execute a critical section; release the lock; execute a non-critical section. We start the threads at the same time. At the end of the measurement interval we have N data points : the number of iterations completed by each thread. This is the so-called fixed-time-report-work benchmark methodology. Aggregate throughput is measured by the sum of the data points, but we're also interested in quantifying the fairness, as reflected by the distribution of those points. For simplicity, we'll assume that the scheduler is ideally fair, there are no NUMA or physical "geographic" issues in play, and that each thread has about the same cache footprint. The only remaining source of unfairness is the lock implementation or policies. (In practice there are myriad confounding factors, including energy and thermal caps, but we'll ignore those and assume an idealized model). A histogram of the completion counts is a good way to visualize the fairness or unfairness of the lock. I recommend HdrHistogram.
Sometimes, though, it's convenient to describe unfairness in terms of a simple real-valued numerical statistic. Such a descriptive statistic can be used to quantity the fairness of various lock policies, and in particular to help establish the trade-off between fairness and throughput. Ideally that statistic would be scale-invariant -- that property is desirable but optional. Some of the usual statistics are standard deviation or variance. Extrema-based statistics such as (max-min) or (max-min)/max can also be useful. These can give us a sense of the size of the range of the data points. The average divided by the max can also provide some insight. IQR is another commonly used statistic, as is Jain's Fairness Index. Cederman et al. suggested another useful fairness metric. In recent papers I've reported the relative standard deviation. (In communications literature it's not uncommon to see fairness expressed in terms of average/stddev, which is the reciprocal of the relative standard deviation). Median Absolute Deviation (MAD) is also informative. I don't have one favorite -- my lock benchmark harnesses report all of the above.
Recently I was looking at other measures of disparity or uniformity and came up with the following idea. First, we sort our N completion counts in ascending order. We then plot as follows. On the X-axis we have the number of threads T, and on the Y-axis we have the cumulative sum of iteration counts up to thread #T. (Think: CDF or Riemann Integral) If the completion counts were 10,15,22,33 then the Y values would be C(x) = 0,10,25,37,70, for instance, for 0,1,2,3,4 threads, respectively. Beware that the C(x) function is obviously discrete, but for the purposes of discussion and terminology I'll treat it as a continuous real-valued function. Next, we normalize both axes to [0,1]. For the Y-axis, we can simply divide C(x) by the maximum -- final -- C(x) value. If all the data points are equal -- ideally fair -- then C(x) is just the identity diagonal function : C(x)=x. Note that C(x) is convex. The more unfair the distribution, the larger the area under the diagonal and above the normalized C(x) function. And in fact that area measure makes a useful index of inequality. We could derive additional statistics from this approach, such as the maximum x-C(x) value, but the area between the diagonal and C(x) seems viable.
Subsequently, I realized a similar approach had been long used in economics to describe income disparity : Gini Coefficient.