I've been looking into "malloc" dynamic storage allocators targeted specifically at Intel Haswell i7-4770 processors.
For background, the i7-4770 has relatively simple cache geometry. The L1 (level-1 cache) is 32KB with 64-byte lines, is physically tagged, and is 8-way set-associative. There are 64 possible indices (sets). As such the cache page size is 4KB -- addresses that differ by an integer multiple of 4KB will map to the same index (set) in the L1. The low-order 6 bits of the address presented to the L1 form the offset into the line, and the next higher 6 bits serve as the L1 index. The MMU base page size is 4KB, so there is no overlap between the virtual page number and the index field in a virtual address; the L1 index field passes through address translation verbatim. As such, OS-level page coloring is not in play with respect to the L1. (An advantage of this design is that indexing can commence before the virtual address is translated to a physical address, although we still need the physical address for tag comparison). Some CPUs hash addresses -- usually XORing high-order physical address bits into the index bits -- to reduce the odds of index hotspots and imbalance, but experiments suggest that is not the case with the i7-4770.
Such simple caches -- without the index hashing mentioned above -- can be vulnerable to excessive index conflicts, but malloc allocators can be made index-aware to mitigate index conflicts. Index imbalance results in underutilization of the cache: some indices are "cold" (less frequently accessed) while others are "hot" and thus incur relatively higher miss rates. It's worth pointing out that most application/allocator combinations don't exhibit excessive index conflicts, but for those that do, the performance impact can be significant. An index-aware allocator can "immunize" an application against common cases of index imbalance while typically incurring no additional cost over index-oblivious allocators. Think of the index-aware allocator as cheap insurance against a rare but painful performance disorder. The Cache-Index Aware Memory Allocation paper describes an index-aware allocator designed for the L1 in a SPARC T2+ processor, but it's trivial to change a few cache geometry constants and retarget the allocator to the i7-4770.
The "CIA-Malloc" (Cache-Index Aware) allocator described in the paper has a number of other useful design properties. It also happens to be NUMA-friendly and large-page-friendly: the pages underlying a block returned by malloc() will typically reside on the node where the malloc() was invoked. The allocator is also scalable, with very little internal lock contention or coherence traffic. Each per-CPU sub-heap has a private lock -- the only time we'll encounter contention is via migration or preemption, which is relatively rare -- and the critical sections are constant-time and very short. We also make heavy use of trylock(), so if a thread is obstructed it can usually make progress by diverting to another data structure. Remote free() operations are lock-free. Critically, the allocator acts to reduce the cost of malloc() and free() operations as well as the cost to the application when accessing blocks allocated via malloc(). The allocator is also designed specifically to reduce common cases of false sharing: allocator metadata-vs-metadata, metadata-vs-block, and inter-block block-vs-block. Metadata-vs-metadata sharing and false sharing are reduced by using per-CPU sub-heaps, while false sharing between adjacent data blocks -- blocks returned by malloc() -- is addressed by placement and alignment. These attributes will prove even more useful when we use CIA-Malloc in conjunction with hardware transactions.
The i7-4770 provides hardware transactional memory (HTM). For the purposes of discussion we'll assume we're using TSX-RTM for transactional lock elision (TLE). The critical section body contains unmodified HTM-oblivious legacy code that expects to run under the lock in the usual fashion, but via TLE we can modify the lock implementation to attempt optimistic execution, reverting to the lock only as necessary. The i7-4770's HTM implementation tracks the transactional write-set in the L1 and the read-set over the cache hierarchy, and uses a requester-wins conflict resolution strategy implemented via the MESIF coherence protocol. At most a single cache can have a given line in M/E state at any one time -- a classic multiple-reader single-writer model. Eviction or invalidation of a tracked cache line results in a transactional abort. For example, if a transaction on CPU C loads address A, and some other CPU writes A before C commits, the write will invalidate the line from C's cache and cause an abort. Similarly, if C stores into A and some other CPU loads or stores into A before C commits, the invalidation of A will cause C's transaction to abort. In short, read-write or write-write sharing on locations accessed within a transaction results in coherence invalidation and a consequent abort. (The HTM implementation in the i7-4770 shares quite a few aspects with Sun's experimental ROCK processor).
In addition to coherence traffic, self-displacement via conflict misses can also result in aborts. This is where a CIA-Malloc allocator may provide benefit relative to other allocators. Normally an index-aware allocator is expected to reduce conflict misses arising from index imbalance, but it can also reduce transactional aborts caused by index conflicts evicting read-set or write-set entries. Aborts are usually far more expensive than simple cache misses. (Absent any benefit from warming up the caches, an aborted transaction is pure wasted effort).
Let's take an actual example. The following data was collected on an i7-4770 running Ubuntu 14.04. We use a simple single-threaded C benchmark that uses malloc() to individually allocate a set of 250 nodes and then arranges those nodes into a circular intrusive singly linked list. The benchmark was compiled with gcc 4.8.2 using the x32 ABI. The node structure has a "next" field at offset 0 followed by a volatile integer "w" field. A command-line switch gives us the ability to specify the effective size of the node as passed to malloc(). Since there may be a correlation between allocation order and virtual address, we randomize the order of the nodes with a Fisher-Yates shuffle in order to minimize the impact of automatic hardware stride-based prefetchers. (Such a randomized order can put stress on the TLBs with lots of page crossings as we traverse the list, but that's not the dominant performance issue for the cases we'll discuss). We then report the time needed to complete 10,000,000 steps of the following loop body:
a->w = 0 ; a = a->next

If we use an effective node size of 950 bytes, then the default glibc malloc() allocator places our nodes at 960-byte intervals (1024-64) and each step of the loop requires 2.1 nsecs. When we increase the node size to 1010 bytes the interval is 1024 bytes and each step takes 8.1 nsecs. If we further increase the node size to 1080 bytes then the interval is 1088 bytes (1024+64) and the time drops back to 2.1 nsecs. The performance drop at 1010 bytes was caused by the 1024-byte placement interval: the base addresses of our 250 nodes resided on just 4 of the 64 possible indices, so we grossly underutilized the L1. This nicely illustrates index conflicts arising from index-oblivious allocator placement policies. An index-aware allocator will avoid this scenario.
Now we'll extend our benchmark to use RTM hardware transactions. Each traversal of the ring is wrapped in an RTM XBEGIN-XEND transaction, and we'll measure and report success rates. This is intended to model TLE or "naked" use of RTM transactions. We keep the ring circumference at 250 nodes, so each transaction will iterate over that many elements. With nodes of 950 bytes (a 960-byte interval) the failure rate is 0.145%. Increasing the node size to 1010 bytes, the failure rate becomes 100% -- we make no progress. But if we bump the size to 1080 bytes then the failure rate drops back to 0.2%. The complete and persistent failure at 1010 bytes was caused by elements of the write-set being evicted and the consequent aborts. But if we use CIA-Malloc modified for the i7-4770 we can avoid such abrupt performance inflections.
To recap, an index-aware allocator can help avoid performance pathologies in normal non-transactional code as well as improve the success rate of hardware transactions. Specifically, conflict misses become aborts in transactions, and aborts are more expensive than normal misses.
Ideally, an HTM-friendly allocator will satisfy the desiderata enumerated in Cache-Index Aware Memory Allocation and also act to reduce the abort rate.
As a side note, under a requester-wins conflict resolution strategy, to the extent possible and reasonable it's a good idea to shift stores of frequently accessed shared variables toward the end of a transaction. You can do this by hand, or a transaction-aware compiler or JIT can perform some of the transformations. The modCount field in Java's Hashtable is a canonical example of an update that should be shifted. Shifting reduces the window of vulnerability during which the store resides in the transaction's write-set. The asymmetry in the i7-4770 -- the write-set is tracked in the L1 while the read-set is tracked in the L1-L2-L3 -- gives us yet another reason to shift stores toward the end of a transaction. Consider a transaction that executes a store followed by a large number of loads: those loads may displace the stored line and cause an abort. But if we shift the store to the end of the transaction, the same set of accesses (just reordered) can succeed without aborting. The store may displace a loaded line from the L1, but the L2 can still track the line.
Finally, when a given size-class is index-unfriendly, we can use the punctuated array approach described in the original CIA-Malloc paper. A more radical approach is to intentionally pick size-classes that are prime multiples of the cache line size, which helps reduce inter-size-class index conflicts.