Sunday Oct 03, 2010

Memory ordering

Just had a couple of white papers published on memory ordering. This is a topic which is quite hard to find documentation on, and also quite complex. Fortunately, it's also rarely encountered.

In Oracle Solaris Studio 12.2 we introduced the file mbarrier.h. This defines some intrinsics which allow the developer to enforce memory ordering.

The first paper covers avoiding the reordering of memory operations that the compiler may perform when compiling an application. The second paper covers the more complex issue of avoiding the reordering of memory operations that the processor may do at runtime.

Tuesday Mar 02, 2010

Compiler memory barriers

I've written in the past about memory barriers. Basically a membar instruction ensures that other processors see memory operations in the order that they appear in the source code. The obvious example being a mutex lock where you want the memory operations that occurred while the lock was held to be visible to other processors before the memory operation that releases the lock.

There's actually another kind of memory ordering and that is the ordering used by the compiler. If you write:

  \*a=1;
  \*b=2;

If the compiler can determine that a and b do not alias, then there's no reason for it not to swap the stores to a and b if it thinks that will be a more optimal code pattern.

The most cross-platform way of enforcing this ordering is to put a function call between the two stores:

  \*a=1;
  reorder_barrier();
  \*b=2;

Memory needs to be the program defined state at the call, so the compiler cannot defer the store to a, and cannot hoist the store to b.

This is a great solution, but causes the overhead of a function call, and function calls can have significant costs. There are some compiler intrinsics that cause the compiler to enforce the desired memory ordering. Sun Studio 12 Update 1 supports the GCC flavour:

  \*a=1;
  asm volatile ("":::"memory");
  \*b=2;

You can test the performance overhead using the following code:

void barrier(){}

void main()
{
  for (int i=0; i<1000000000;i++)
  {
    barrier();
  }
} 

On the test system this code took about 5 seconds to run. The alternative code is:

void main()
{
  for (int i=0; i<1000000000;i++)
  {
    asm volatile ("":::"memory");
  }
}

This code took under a second.

Thursday Mar 26, 2009

When to add a membar (an example)

I was recently having a discussion on one of the OpenSolaris lists on the topic of when to use the volatile keyword, and when it was necessary to use membars.

So volatile is a clue to the compiler to load the variable from memory and immediately store it back to memory. What it does not do is to tell the hardware anything. So the application can perform the store, but that store may not be immediately visible to the rest of the system. Most of the time this is not a problem - so long as the store is visible to the processor on which the thread is executing it's fine. Variability of when the store is visible to other processors may also be fine. There is one clear situation where the ordering of store operations could be a problem - and that's unlocking mutexes.

The problem here is best illustrated by the following scenario. I lock some data structure, then store new values into it, then unlock the structure. Immediately another thread comes along and uses the values in that structure. Not an uncommon situation. Unlocking a mutex is often just a case of storing a value (of zero) into the mutex structure. And here's the potential problem. In some weaker ordering architectures there is no guarantee that other processors see the stores in the same order that they are performed. So if you have Store A followed by Store B it may be possible for other processors to observe the change in the value of B before they see the change in the value of A. In the case of mutex unlock, the store of B would be the action that unlocked the mutex, enabling other threads to access the variable A... and there could be problems if they see the old value of A.

The solution to this is to put a membar in before the store that unlocks the mutex. You can see this happening in the OpenSolaris code:

     41 /\*
     42  \* lock_clear(lp)
     43  \*	- clear lock.
     44  \*/
     45 	ENTRY(_lock_clear)
     46 	membar	#LoadStore|#StoreStore
     47 	retl
     48 	  clrb	[%o0]
     49 	SET_SIZE(_lock_clear)

The membar ensures that all the pending stores are visible to other processors before the store that releases the lock becomes visible to them.

Wednesday May 07, 2008

When to use membars

membar instructions are SPARC assembly language instructions that enforce memory ordering. They tell the processor to ensure that memory operations are completed before it continues execution. However, the basic rule is that the instructions are usually only necessary in "unusual" circumstances - which fortunately will mean that most people don't encounter them.

The UltraSPARC Architecture manual documents the situation very well in section 9.5. It gives these rules which cover the default behaviour:

  • Each load instruction behaves as if it were followed by a MEMBAR #LoadLoad and #LoadStore.
  • Each store instruction behaves as if it were followed by a MEMBAR #StoreStore.
  • Each atomic load-store behaves as if it were followed by a MEMBAR #LoadLoad, #LoadStore, and #StoreStore.

There's a table in section 9.5.3 which covers when membars are necessary. Basically, membars are necessary for ordering of block loads and stores, and for ordering non-cacheable loads and stores. There is an interesting note where it indicates that a membar is necessary to order a store followed by a load to a different addresses; if the address is the same the load will get the correct data. This at first glance seems odd - why worry about whether the store is complete if the load is of independent data. However, I can imagine this being useful in situations where the same physical memory is mapped using different virtual address ranges - not something that happens often, but could happen in the kernel.

As a footnote, the equivalent x86 instruction is the mfence. There's a good discussion of memory ordering in section 7.2 of the Intel Systems Programming Guide.

There's some more discussion of this topic on Dave Dice's weblog.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download

Search

Categories
Archives
« July 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
  
       
Today
Bookmarks
The Developer's Edge
Solaris Application Programming
Publications
Webcasts
Presentations
OpenSPARC Book
Multicore Application Programming
Docs