Tuesday Dec 16, 2008

OpenSPARC Internals available on Amazon

OpenSPARC Internals is now available from Amazon. As well as print-on-demand from lulu, and as a free (after registration) download.

Thursday Nov 13, 2008

How to learn SPARC assembly language

Got a question this morning about how to learn SPARC assembly language. It's a topic that I cover briefly in my book, however, the coverage in the book was never meant to be complete. The text in my book is meant as a quick guide to reading SPARC (and x86) assembly, so that the later examples make some kind of sense. The basics are the instruction format:

[instruction]  [source register 1], [source register 2], [destination register]

For example:

faddd    %f0, %f2, %f4


%f4 = %f0 + %f2

The other thing to learn that's different about SPARC is the branch delay slot. Where the instruction placed after the branch is actually executed as part of the branch. This is different from x86 where a branch instruction is the delimiter of the block of code.

With those basics out the way, the next thing to do would be to take a look at the SPARC Architecture manual. Which is a very detailed reference to all the software visible implementation details.

Finally, I'd suggest just writing some simple codes, and profiling them using the Sun Studio Performance Analyzer. Use the disassembly view tab and the architecture manual to see how the instructions are used in practice.

Monday Nov 10, 2008

Poster for London workshop

Just got the poster for the OpenSPARC Workshop in London in December.

Wednesday Nov 05, 2008

OpenSPARC workshop in London

I'm thrilled to have been asked to present at the OpenSPARC workshop to be run in London, on December 4 and 5th. I'll be covering the 'software topics'. There's no charge for attending the workshop.

OpenSPARC presentations

As part of the OpenSPARC book, we were asked to provide slideware and to present that slideware. The details of what's available are listed on the OpenSPARC site, and are available for free from the site on wikis.sun.com.

I contributed two sections. I produced the slides and did the voice over for the material on developing for CMT, the accompanying slides are also available. I also did a voice over for someone else's slides on Operating Systems for CMT (again slides available).

The recording sessions were ok, but a bit strange since it was just myself and the sound engineer working in a meeting room in Santa Clara. I get a lot of energy from live presentations, particularly the interactions with people, and I found the setup rather too quiet for my liking.

The Sun Studio presentation was relatively easy. It runs for nearly an hour, and there's a couple of places where I felt that additional slides would have helped the flow. The Operating Systems presentation was much harder as it was trying to weave a story around someone else's slide deck.

OpenSPARC Internals book

The OpenSPARC Internals book has been released. This is available as a free (after registration) pdf or as a print-on-demand book. The book contains a lot of very detailed information about the OpenSPARC processors, and my contribution was a chapter about Sun Studio, tools, and developing for CMT.

Friday Oct 31, 2008

The limits of parallelism

We all know Amdahl's law. The way I tend to think of it is if you reduce the time spent in the hot region of code, the most benefit you can get is the total time that you initially spent there. However, the original setting for the 'law' was parallelisation - the runtime improvement depends on the proportion of the code that can be made to run in parallel.

Aside: When I'm looking at profiles of applications, and I see a hot region of code, I typically consider what the improvement in runtime would be if I entirely eliminated the time spent there, or if I halved it. Then use this as a guide as to whether it's worth the effort of changing the code.

The issue with Amdahl is that it's completely unrealistic to consider parallelisation without considering issues of the synchronisation overhead introduced when you use multiple threads. So let's do that and see what happens. Assume that:

P = parallel runtime
S = Serial runtime
N = Number of threads
Z = Synchronisation cost

Amdahl would give you:

P = S / N

The flaw is that I can keep adding processors, and the parallel runtime keeps getting smaller and smaller - why would I ever stop? A more accurate equation would be something like:

P = S / N + Z\*log(N)

This is probably a fair approximation to the cost of synchronisation, some kind of binary tree object that synchronises all the threads. So we can differentiate that:

dP/DN = -S / N\^2 + Z / N 

And then solve:

0 = -S / N\^2 + Z / N
S / N = Z
N = S / Z

Ok, it's a bit disappointing, you start off with some nasty looking equation, and end up with a ratio. But let's take a look at what the ratio actually means. Let's suppose I reduce the synchronisation cost (Z). If I keep the work constant, then I can scale to a greater number of threads on the system with the lower synchronisation cost. Or, if I keep the number of threads constant, I can make a smaller chunk of work run in parallel.

Let's take a practical example. If I do synchronisation between threads on traditional SMP system, then communication between cores occurs at memory latency. Let's say that's ~200ns. Now compare that with a CMT system, where the synchronisation between threads can occur through the second level cache, with a latency of ~20ns. That's a 10x reduction in latency, which means that I can either use 10x the threads on the same chunk of work, or I can run in parallel a chunk of work that is 10x smaller.

The logical conclusion is that CMT is a perfect enabler of Microparallelism. You have both a system with huge numbers of threads, and the synchronisation costs between threads are potentially very low.

Now, that's exciting!

Thursday Oct 30, 2008

The multi-core is complex meme (but it's not)

Hidden amoungst the interesting stories (Gaming museum opening, Tennant stepping down) on the BBC was this little gem from Andrew Herbert, the head of Microsoft research in the UK.

The article describes how multi-core computing is hard. Here's a snippet of it:

"For exciting, also read 'complicated'; this presents huge programming challenges as we have to address the immensely complex interplay between multiple processors (think of a juggler riding a unicycle on a high wire, and you're starting to get the idea)."

Now, I just happened to see this particular article, but there's plenty of other places where the same meme appears. And yes, writing a multi-threaded application can be very complex, but probably only if you do it badly :) I mean just how complex is:

#pragma omp parallel for

Ok, so it's not fair comparing using OpenMP to parallelise a loop with writing some convoluted juggler riding unicycle application, but let's take a look at the example he uses:

"Handwriting recognition systems, for example, work by either identifying pen movement, or by recognising the written image itself. Currently, each approach works, but each has certain drawbacks that hinder adoption as a serious interface.

Now, with a different processor focusing on each recognition approach, learning our handwriting style and combining results, multi-core PCs will dramatically increase the accuracy of handwriting recognition."

This sounds like a good example (better than the old examples of using a virus scanner on one core whilst you worked on the other), but to me it implies two independent tasks. Or two independent threads. Yes, having a multicore chip means that the two tasks can execute in parallel, but assuming a sufficiently fast single core processor we could use the same approach.

So yes, to get the best from multi-core, you need to use multi-threaded programming (or multi-process, or virtualisation, or consolidation, but that's not the current discussion). But multi-threaded programming, whilst it can be tricky, is pretty well understood, and more importantly, for quite a large range of codes easy to do using OpenMP.

So I'm going to put it the other way around. It's easy to find parallelism in today's environment, from the desktop, through gaming, or numeric codes. There's abundant examples of where many threads are simultaneously active. Where it gets really exciting (for exciting read "fun") is if you start looking at using CMT processors to parallelise things that previously were not practical to run with multiple threads.

Tuesday Oct 28, 2008

Second life - Utilising CMT slides and transcript

Just finished presenting in second life. This time the experience was not so good, my audio cut out unexpected during the presentation, so I ended up having to use chat to present the material. I was very glad that I'd gone to the effort of writing a script before the presentation, however, reading off the screen is not as effective as presenting the material.

Anyway, I found 'statistics' panel in the environment and looking at this indicated that I was down to <1 FPS, with massive lag. Interestingly, after the presentation and once everyone had left the presentation area, the FPS went up to 10-12. The SL program was still maxing out the CPU (as you might expect, I guess there's no reason to back off until the program hits the frame rate of the screen), but much more responsive - things actually happened when I clicked the on-screen controls.

So, I'm sorry for anyone who found the experience frustrating. I did too. And thank you to those people who turned up, and persevered, and particularly to the bunch of people who participated in the discussion at the end.

Anyway, for those who are interesting the slides and transcript for the presentation are available.

Sunday Oct 26, 2008

Second life presentation

I'll be presenting in Second Life on Tuesday 28th at 9am PST. The title of the talk is "Utilising CMT systems".

Thursday Jul 24, 2008

Haskell porting opportunity

There's an opportunity to work on improving the performance of Haskell on SPARC. The project is also going to explore using multiple threads to improve performance. Much more information is contained in the announcement.

Thursday May 22, 2008

Tonight's OpenSolaris User Group presentations

Slides for tonight's presentations are now available:

Wednesday May 21, 2008

Presenting at OpenSolaris Users group tomorrow

Tomorrow I'll be presenting at the Silicon Valley OpenSolaris Users group. Alan DuBoff has asked me to try and avoid the monolithic presentation, so I'll be aiming to have a couple of short presentations. The idea is to push the balance towards communications rather than presentations.

Tuesday May 13, 2008

OpenMP 3.0 specification released

The specification for OpenMP 3.0 has been put up on the OpenMP.org website. Using the previous OpenMP 2.5 standard, there's basically two supported modes of parallelisation:

  • Splitting a loop over multiple threads - each thread is responsible for a range of the iterations.
  • Splitting a serial code into sections - each thread executes a section of code.

The large change with OpenMP 3.0 is the introduction of tasks, where a thread can spawn a task to be completed by another thread at an unspecified point in the future. This should make OpenMP amenable to many more situations. An example of using tasks looks like:

  node \* p = head;
  while (p)
    #pragma omp task
    p = p->next;

The master thread iterates the linked list generating tasks for processing each element in the list. The brackets around the call to process(p) are unnecessary, but hopefully clarify what's happening.

Monday May 12, 2008

Slides for CommunityOne

All the slides for last week's CommunityOne conference are available for download. I was presenting in the CMT stream, you can find my slides here. Note that to download the slides, you'll need to use the username and password shown on the page.

My talk was on parallelisation. What's supported by the compiler, the steps to do it, and the tools that support that. I ended with an overview of microparallelisation.

Wednesday May 07, 2008

When to use membars

membar instructions are SPARC assembly language instructions that enforce memory ordering. They tell the processor to ensure that memory operations are completed before it continues execution. However, the basic rule is that the instructions are usually only necessary in "unusual" circumstances - which fortunately will mean that most people don't encounter them.

The UltraSPARC Architecture manual documents the situation very well in section 9.5. It gives these rules which cover the default behaviour:

  • Each load instruction behaves as if it were followed by a MEMBAR #LoadLoad and #LoadStore.
  • Each store instruction behaves as if it were followed by a MEMBAR #StoreStore.
  • Each atomic load-store behaves as if it were followed by a MEMBAR #LoadLoad, #LoadStore, and #StoreStore.

There's a table in section 9.5.3 which covers when membars are necessary. Basically, membars are necessary for ordering of block loads and stores, and for ordering non-cacheable loads and stores. There is an interesting note where it indicates that a membar is necessary to order a store followed by a load to a different addresses; if the address is the same the load will get the correct data. This at first glance seems odd - why worry about whether the store is complete if the load is of independent data. However, I can imagine this being useful in situations where the same physical memory is mapped using different virtual address ranges - not something that happens often, but could happen in the kernel.

As a footnote, the equivalent x86 instruction is the mfence. There's a good discussion of memory ordering in section 7.2 of the Intel Systems Programming Guide.

There's some more discussion of this topic on Dave Dice's weblog.

Friday May 02, 2008

Official reschedule notice for CommunityOne

Session ID: S297077
Session Title: Techniques for Utilizing CMT
Track: Chip Multithreading (CMT): OpenSPARCâ„¢
Room: Esplanade 302
Date: 2008-05-05
Start Time: 13:30 

The official timetable has also been updated

Embedded Systems Conference Presentation

I got the opportunity to present at the embedded systems conference in San Jose a couple of weeks back. My presentation covered parallelising a serial application, a quick tour of what to do, together with an overview of the tools that Sun Studio provides to help out. The presentation is now available on the OpenSPARC website.

Wednesday Apr 30, 2008

CommunityOne Panel and reschedule

I've heard that my session at CommunityOne is now scheduled from 1:30. The panel session that was scheduled for that time has been shifted to the 11:00am slot. However, the timetable currently up on the site does not reflect that change. I've been invited to appear on the panel - which I'm looking forward to. See you there!

Tuesday Apr 29, 2008

Multicore expo available - Microparallelisation

My presentation "Strategies for improving the performance of single threaded codes on a CMT system" has been made available on the OpenSPARC site.

The presentation discusses "microparallelisation", in the the context of parallelising an example loop. Microparallelisation is the aim of obtaining parallelism through assigning small chunks of work to discrete processors. Taking a step back...

With traditional parallelisation the idea is to identify large chunks of work that can be split between multiple processors. The chunks of work need to be large to amortise the synchronisation costs. This usually means that the loops have a huge trip count.

The synchronisation costs are derived from the time it takes to signal that a core has completed its work. The lower the synchronisation costs, the smaller amount of work is needed to make parallelisation profitable.

Now, a CMT processor has two big advantages here. First of all it has many threads. Secondly these threads have low latency access to a shared level of cache. The result of this is that the cost of synchronisation between threads is greatly reduced, and therefore each thread is free to do a smaller chunk of work in a parallel region.

All that's great in theory, the presentation uses some example code to try this out, and discovers, rather fortunately, that the idea also works in practice!

The presentation also covers using atomic operations rather that microparallelisation.

In summary the presentation is more research than solid science, but I hoped that presenting it would get some people thinking about non-traditional ways to extract parallelism from applications. I'm not alone in this area of work, Lawrence Spracklen is also working on it. We're both at presenting CommunityOne next week.

Tuesday Mar 25, 2008

Conference schedule

The next two months are likely to be a bit hectic for me. I'm presenting at three different conferences, as well as a chat session in Second Life. So I figured I'd put the information up in case anyone reading this is also going to one or other of the events. So in date order:

I'll be talking about parallelisation at the various conferences, the talks will be different. The multi-core expo talks focuses on microparallelisation. The ESC talk will probably be higher level, and the CommunityOne talk will probably be wider ranging, and I hope more interactive.

In the Second Life event I'll be talking about the book, although the whole idea of appearing is to do Q&A, so I hope that will be more of a discussion.

Tuesday Feb 26, 2008

Presentation at communityone

Had my presentation on parallelisation accepted for CommunityOne on 5th May in San Francisco.

Tuesday Feb 12, 2008

CMT related books

OpenSPARC.net has a side bar featuring books that are relevant to CMT. Mine is featured as one of the links. The other two books are Computer Architecture - a quantitative approach which features the UltraSPARC T1, and Chip-Multiprocessor Architecture which is by members of the team responsible for the UltraSPARC T1 processor.

Thursday Jan 31, 2008

Win $20,000!

Sun has announced a Community Innovation Awards Programme - basically a $1M of prize money available for various Sun-sponsored open source projects. There is an OpenSPARC programme, and the one that catches my eye is $20k for:

vi. Best Adaptation of a single-thread application to a multi-thread CMT (Chip Multi Threaded) environment

My guess is that they will expect more than the use of -xautopar -xreduction or a few OpenMP directives :) If I were allowed to enter (unfortunately Sun Employees are not) I'd be looking to exploit the features of the T1 or T2:

  • The threads can synchronise at the L2 cache level - so synchronisation costs are low
  • Memory latency is low

The upshot of this should be that it is possible to parallelise applications which traditionally have not been parallelisable because of synchronisation costs.

Funnily enough this is an area that I'm currently working in, and I do hope to have a paper accepted for the MultiExpo.

Monday Nov 26, 2007

Multi-threading webcast

A long while back I was asked to contribute a video that talked about parallelising applications. The final format is a webcast (audio and slides) rather than the expected video. This choice ended up being made to provide the clearest visuals of the slides, plus the smallest download.

I did get the opportunity to do the entire presentation on video - which was an interesting experience. I found it surprisingly hard to present to just a camera - I think the contrast with presenting to an audience is that you can look around the room and get feedback as to the appropriate level of energy to project. A video camera gives you no such feedback, and worse, there's no other place to look. Still I was quite pleased with the final video. The change to a webcast was made after this, so the audio from the video was carried over, and you still get to see about 3 seconds of the original film, but the rest has gone. I also ended up reworking quite a few of the slides - adding animation to clarify some of the topics.

The topics covered at a break-neck pace are, parallelising using Pthreads and OpenMP. Autoparallelisation by the compiler. Profiling parallel applications. Finally, detecting data races using the thread analyzer.

Tuesday Oct 09, 2007

CMT Developer Tools on the UltraSPARC T2 systems

The CMT Developer Tools are included on the new UltraSPARC T2 based systems, together with Sun Studio 12, and GCC for SPARC Systems.

The CMT Developer Tools are installed in:

and are (unfortunately) not on the default path.

Threads and cores

The UltraSPARC T2 has eight cores, each of these cores is capable of executing 8 threads; making a total of 64 virtual processors. The threads are grouped into two groups of four threads. Each core can execute at most 2 instructions per cycle, the load/store unit and the floating point unit are shared between the two groups of threads, but each thread has its own integer pipeline.

Usually, threads 0-7 are assigned to core 0, threads 8-15 to core 1 etc. The exact mapping is reported by the service processor (and prtdiag), but the mapping is only likely to be different if the system is configured with LDoMs. Taking core 0 as an example; threads 0-3 are assigned to one group and threads 4-7 are assigned to the other group.

To disable a core using psradm it is necessary to disable all the threads on that core. On the other hand if the objective is just to reduce the total number of active threads, keeping the core enabled, then best performance will be attained if the threads are disabled across all groups rather than just disabling threads within a single group. The reason for this is that each group gets to execute a single instruction per cycle, so disabling all the threads within a group will reduce the maximum number of instructions that can be executed in a cycle.

Compiling for the UltraSPARC T2

Today, Sun launched systems based on the UltraSPARC T2. A question that is bound to come up is what compiler flags should be used for the processor?

Sun Studio 12 has the flag -xtarget=ultraT2 to specifically target the UltraSPARC T2. But before jumping off and using this flag, let's take the flag apart and see what it actually means. There are three components that are set by the -xtarget flag :

  • -xcache flag. This flag tells the compiler to target a particular cache configuration. The flag will have an impact on floating point code where the loops can be tiled to fit into cache. Obviously not all codes are amenable to this optimisation, so the -xcache setting is usually unimportant.
  • -xchip flag. This sets the instruction latencies and instruction selection preferences. The UltraSPARC T2 (in common with the UltraSPARC T1) has a simple pipeline so there is nothing much to gain from accurately modelling the instruction latencies. There are also no real situations where it will do better with one instruction sequence in preference to another (unless one is longer than the other). So for the UltraSPARC T2 this flag has little impact on the generated code.
  • -xarch flag. The -xarch flag controls the target architecture. This is traditionally used principally to control whether 32-bit or 64-bit binaries are generated. However, Sun Studio 12 introduced the flags -m32 and -m64 to separate the address-size of the binary from the instruction set selection. There are no UltraSPARC T2 specific instructions which the compiler currently generates, so the default of the SPARC V9 ISA is fine.
  • To summarise, there is an UltraSPARC T2 specific compiler flag, but for most situations the best target to use would be -xtarget=generic which should give good performance over a wide range of processors.

Friday Sep 28, 2007

Multi-threading resources on the developer portal

Found a page of links to useful resources about multi-threading on the developer portal.

Solaris Application Programming Table of Contents

A couple of folks requested that I post the table of contents for my book. This is the draft TOC, not the finished product. I assume that there will be a good correspondence, but the final version should definitely look neater.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« July 2016
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming