Monday Sep 29, 2008

Doing my selfish bit for the environment

Over a year ago I blogged about building a Low power Solaris home server. I was very pleased with the 77W achieved, but have since added a couple of 1TB SATA drives. Sadly, my cheap Trust PW-5252 420W PSU didn't see out its first year, so I swapped in a generic Antec unit from an unused PC that was kicking around.

I've just found a new home for the unloved PC, so I decided to swap the PSU back and to take the opportunity to upgrade my Solaris server's PSU to something a little more environmentally friendly. I selected the Antec TruePower Trio 430 for its green credentials, especially its claimed "up to 85%" efficiency. I was amazed at the difference!

With the addition of the two 1TB SATA drives, and with the swapped-in generic Antec PSU, my Solaris server was pulling about 85W (i.e. up 10% from the 77W in the original posting). With the Antec TruePower Trio 430 in place, it is now drawing a miserly 37 Watts! Some of this is doubtless due to smarter cooling (e.g. the internal 12cm fan hardly ever kicks in, and the PSU also takes over control of the case fans), but the majority is probably attributable to smarter circuit design.

Smarter, greener power supplies don't come cheap. I paid £47 for mine. However, "doing the math" I find that this upgrade will have paid for itself in less than a year (mine is a 24x7 server). It also comes with a five year warranty! In all, buying the cheaper (£19) power supply has proven to be a false economy. Think on.

Thursday Dec 06, 2007

Solaris is to UNIX what Mindstorms is to Lego

I've now been at Sun for the best part of two decades. It was Solaris and SPARC which first attracted me to Sun, and it's exciting to see both very much alive and kicking all these years on. But this didn't happen by chance. Sun currently spends $2B per year on R&D, making Sun one of the top 50 technology investors worldwide.

Having myself spent the last five years doing R&D, why have I decided to move back into Sun UK's field organisation? Simply, it's because I think Sun has a very compelling story to tell, and a very exciting portfolio of products and services to offer. In short, I think I'm going to have a lot of fun!

In my new role I'm finding that the thinking behind my Killer Combination, Wicked Bible and Brief History postings is resonating well with a lot of people. Quite frankly, I'm astonished by the number of downloads of my extremely minimalist presentation given to the Sun HPC Consortium in Dresden.

Such has been the interest of late that I thought it would be worth sharing my latest minimalist slide deck: Solaris: greater than the sum of its parts. A lot of the material may be familiar, although I have been experimenting with tag clouds as an alternative to boring bullet points. My basic thesis is: Sun is where the innovation is, so why flirt with imitations?

I've been a big Lego fan for as long as I can remember. Whilst some kids are content to follow the supplied instruction guides, the real thrill for me has always been that Lego allows me to design and build my very own creations, limited only by my imagination. I feel the same way about UNIX.

UNIX has always been about innovation. UNIX has always provided a rich set of simple, consistent, elegant and well defined interfaces which enable the developer to "go create". This "innovation elsewhere" often takes the seed innovation to places unforeseen by its inventors, and this in turn leads to new innovations.

Lego has undergone a similar evolution. At first I only had chunky 2x2 and 2x4 bricks in red and white to work with. Then came 1xN bricks and 1/3 height plates and more colours. Next came the specialised pieces (e.g. window frames, door frames, wheels, flat plates, sloping roof-tile bricks, fences and so on). But all the time these innovations extended the original, well-designed interfaces, with a big commitment to compatibility, thus preserving investment in much the same way as the Solaris ABI (application binary interface).

Obviously, there are places where these parallels break down, but I think we can push the analogy a little further yet [note to self: a USENIX paper?]. In my mind, Solaris is somewhat akin to Lego Technics, and Solaris 10 to Lego Mindstorms. And in this vein, I see Linux rather in the Betta Builder mould (i.e. innovative interfaces copied from elsewhere, actually cheaper and lighter, but not quite the same build quality as the original implementation). And this is where I'm going to get a little more provocative.

In my new presentation I experiment with tag clouds to enumerate some of Sun's more important UNIX innovations over time. The first cloud lists prehistoric stuff from SunOS 3 and 4 days. The second focusses mostly on Solaris 2. The third focusses on Solaris 10. And while Sun may not be able to take sole credit for ideas such as /proc and mmap, it can claim to have the first substantive implementations.

The fourth tag cloud is included to demonstrate that Sun does not suffer from terminal NIH (not invented here) syndrome. Indeed, I think it recognises that Sun is a pretty good judge of excellence elsewhere (most of the time).

Whatever you think of the detail (and I concede some of it could do with a little more research) I do think it is helpful to ask "where does the innovation happen?". At the very least, I think I've shown that there is heaps of innovation in Solaris which we simply take for granted.

To put it another way: as a Solaris enthusiast I can't help feeling at ease in a Linux environment because I find so many familiar objects from home (I guess a GNU/BSD devotee might say something similar). That's not to deny the achievements of the Linux community in implementing interfaces invented elsewhere, but when I look at the flow of innovation between Solaris and Linux it does feel rather like a one-way street.

We live in interesting times! My own particular area of interest is multithreading. With the dawning of the brave new world of large scale chip multithreading Solaris seems uniquely placed to ride the next wave. This is not by accident. Sun has made a huge investment in thread scalability over the past 15 years.

One of my slides asks "What is the essential difference between single and multithreaded processes?" For some this is not a trivial question; for others the answer depends on which thread library is being used. But with Solaris, where application threads have been first class citizens ever since Solaris 10 first shipped, the answer is simply "the number of threads".

Enough of this! The weekend is upon us. Where's my Lego?

Tuesday Jul 10, 2007

Prstat + DTrace + Zones + ZFS + E25K = A Killer Combination

The table on the right was directly lifted from a report exploring the scalability of a fairly traditional client-server application on the Sun Fire E6900 and E25K platforms.

The system boards in both machines are identical, only the interconnects differ. Each system board has four CPU sockets, with a dual-core CPU in each, yielding a total of eight virtual processors per board.

The application client is written in COBOL and talks to a multithreaded C-ISAM database on the same host via TCP/IP loopback sockets. The workload was a real world overnight batch of many "read-only" jobs run 32 at a time.

The primary metric for the benchmark was the total elapsed time. A processor set was used to contain the database engine, with no more than eight virtual processors remaining for the application processes.

The report concludes that the E25K's negative scaling is due to NUMA considerations. I felt this had more to do with perceived "previous convictions" than fact. It bothered me that the E6900 performance had not been called into question at all or explored further.

The situation is made a little clearer by plotting the table as a graph, where the Y axis is a throughput metric rather than the total elapsed time.

Although the E25K plot does indeed appear to show negative scalability (which must surely be somehow related to data locality), it is the E6900 plot which reveals the real problem.

The most important question is not "Why does the E25K throughput drop as virtual processors are added?" but rather "Why does the E6900 hardly go any faster as virtual processors are added?"

Of course there could be many reasons for this (e.g. "because there are not enough virtual processors available to run the COBOL client").

However, further investigation with the prstat utility revealed severe thread scalability bottlenecks in the multithreaded database engine.

Using prstat's -m and -L flags it was possible to see microstate accounting data for each thread in the database. This revealed a large number of threads in the LCK state.

Some very basic questions (simple enough to be typed on a command line) were then asked using DTrace and these showed that the "lock waits" were due to heavy contention on a few hot mutex locks within the database.

Many multithreaded applications are known to scale well on machines such as the E25K. Such applications will not have highly contended locks. Good design of data structures and clever strategies to avoid contention are essential for success.

This second graph may be purely hypothetical but it does indicate how a carefully written multithreaded application's throughput might be expected to scale on both the E6900 and the E25K (taking into account the slightly longer inter-board latencies associated with the latter).

The graph also shows that something less than perfect scalability may still be economically viable on very large machines -- i.e. it may be possible to solve larger problems, even if this is achieved less efficiently.

As an aside, this is somewhat similar to the way in which drag takes over as the dominant factor limiting the speed of a car -- i.e. it may be necessary to double the engine size to increase the maximum speed by less than 50%.

Working with the database developer it was possible to use DTrace to begin "peeling the scalability onion" (an apt metaphor for an iterative process of diminishing returns -- and tears -- as layer after layer of contention is removed from code).

With DTrace it is a simple matter to generate call stacks for code implicated in heavily contended locks. Breaking such locks up and/or converting mutexes to rwlocks is a well understood technique for retrofitting scalability to serialised code, but it is beyond the scope of this post. Suffice it to say that some dramatic results were quickly achieved.

Using these techniques the database scalability limit was increased from 8 to 24 virtual processors in just a matter of days. Sensing that the next speed bump might take a lot more effort, some other really cool Solaris innovations were called on to go the next step.

The new improved scalable database engine was now working very nicely alongside the COBOL application on the E25K in the same processor set with up to 72 virtual processors (already somewhere the E6900 could not go).

For this benchmark the database consisted of a working set of about 120GB across some 100,000 files. With well in excess of 300GB of RAM in each system it seemed highly desirable to cache the data files entirely in RAM (something which the customer was very willing to consider).

The "read-only" benchmark workload actually resulted in something like 200MB of the 120GB dataset being updated each run. This was mostly due to writing intermediate temporary data (which is discarded at the end of the run).

Then came a flash of inspiration. Using clones of a ZFS snapshot of the data together with Zones it was possible to partition multiple instances of the application. But the really cool bit is that ZFS snapshots are almost instant and virtually free.

ZFS clones are implemented using copy-on-write relative to a snapshot. This means that most of the storage blocks on disk and filesystem cache in RAM can be shared across all instances. Although snapshots and partitioning are possible on other systems, they are not instant, and they are unable to share RAM.

The E25K's 144 virtual processors (on 18 system boards) were partitioned into a global zone and five local zones of 24 virtual processors (3 system boards) each. The database was quiesced and a ZFS snapshot taken. This snapshot was then cloned five times (once per local zone) and the same workload run against all six zones concurrently (in the real world the workload would also be partitioned).

The resulting throughput was nearly five times that of a single 24 virtual processor zone, and almost double the capacity of a fully configured E6900.

All of the Solaris technologies mentioned in this posting are pretty amazing in their own right. The main reason for writing is to underline how extremely powerful the combination of such innovative technologies can be when applied to real world problems. Just imagine what Solaris could do for you!

Solaris: the whole is greater than the sum of its parts.


Tuesday Jul 03, 2007

A Brief History Of Solaris

A week ago I was presenting A Brief History Of Solaris at the Sun HPC Consortium in Dresden. My slideware is pretty minimalist (audiences generally don't respond well to extended lists of bullet points), but it should give you a flavour of my presentation style and content. For more, see Josh Simon's writeup.

My main point is that although Solaris is a good place to be because it has a consistent track record of innovation (e.g. ONC, mmap, dynamic linking, audaciously scalable SMP, threads, doors, 64-bit, containers, large memory support, zones, ZFS, DTrace, ...), the clincher is that these innovations meet in a robust package with long term compatibility and support.

Linus may kid himself that ZFS is all Solaris has to offer, but the Linux community has been sincerely flattering Sun for years with its imitation and use of so many Solaris technologies. Yes, there is potential for this to work both ways, but until now the traffic has been mostly a one way street.

As a colleague recently pointed out, it is worth considering questions like "what would Solaris be without the Linux interfaces it has adopted?" and "what would Linux be without the interfaces it has adopted from Sun?" (e.g. NFS, NIS, PAM, nsswitch.conf, ld.so.1, LD_*, /proc, doors, kernel slab allocator, ...). Wow, isn't sharing cool!

Solaris: often imitated, seldom bettered.


Sunday Jul 01, 2007

Silent Data Corruption, The Wicked Bible and ZFS

Are you concerned about silent data corruption? You should be. History shows that silent data corruption has potential to end your career!

In 1631 printers introduced a single bit error into an edition of the King James Bible. They omitted an important "not" from Exodus 20:14, making the seventh commandment read "Thou shalt commit adultery."

These unfortunate professionals were fined £300 (roughly a lifetime's wages). Most of the copies of this "Wicked Bible" were recalled immediately, with only 11 copies known to exist today (source Wikipedia, image Urban Dictionary).

One of the problems with silent data corruption is that we may not even notice that it is there when we read the data back. Indeed, we may actually prefer the adulterated version. Happily, the 1631 error stands out against the rest of the text in which it is found.

ZFS protects all its data (including metadata) with non-local checksums. This means that it is impossible for silent data corruption introduced anywhere between the dirty spinning stuff and your system memory to go unnoticed (what happens from there onwards is entirely up to you).

ZFS is able to repair corrupted data automatically provided you have configured mirrored or RAIDZ storage pools. It's just a shame the British authorities didn't take the ditto blocks in Deuteronomy 5:18 into account way back in 1631.

Hmmm, does that make the Bible prior art for United States Patent 20070106862?


Friday Jun 29, 2007

ZFS: zee, zed or what?

"Sesame Street was brought to you today by the letter zee ..." was probably the first time I was aware of a problem. Whilst I do try to contextualise my zees and zeds, sometimes the audience is just too diverse (or I am just too old). I am grateful to my French-Canadian friend and colleague Roch Bourbonnais for proposing an alternative. So "zer-eff-ess" it is, then! After all, ZFS really is zer one true filesystem, n'est-ce pas?


Wednesday May 16, 2007

ZFS and RAID - "I" is for "Inexpensive" (sorry for any confusion)

When I were a lad "RAID" was always an acronym for "Redundant Array of Inexpensive Disks". According to this Wikipedia article 'twas always thus. So, why do so many people think that the "I" stands for "Independent"?

Well, I guess part of the reason is that when companies started to build RAID products they soon discovered that they were far from inexpensive. Stuff like fancy racking, redundant power supplies, large nonvolatile write caches, multipath I/O, high bandwidth interconnects, and data path error protection simply doesn't come cheap.

Then there's the "two's company, three's a crowd" factor: reliability, performance, and low cost ... pick any two. But just because the total RAID storage solution isn't cheap, doesn't necessarily mean that it cannot leverage inexpensive disk drives.

However, inexpensive disk drives (such as today's commodity IDE and SATA products) provide a lot less in terms of data path protection than more expensive devices (such as premium FC and SCSI drives). So RAID has become a somewhat elitist, premium technology, rather than goodness for the masses.

Enter ZFS.

Because ZFS provides separate checksum protection of all filesystem data and metadata, even IDE drives can be deployed to build simple RAID solutions with high data integrity. Indeed, ZFS's checksumming protects the entire data path from the spinning brown stuff to the host computer's memory.

This is why I rebuilt my home server around OpenSolaris using cheap non-ECC memory and low cost IDE drives. But ZFS also promises dramatically to reduce the cost of storage solutions for the datacentre. I'm sure we will see many more products like Sun's X4500 "Thumper".

ZFS - Restoring the "I" back in RAID


Friday Apr 27, 2007

Lower power Solaris home server

78 Watts

When my trusty Solaris 10 Athlon 32-bit home server blew up I enlisted a power hungry dual socket Opteron workstation as a stop-gap emergency measure. I also used the opportunity to upgrade from SVM/UFS to ZFS. But the heat and the noise were unacceptable, so I started thinking about a quieter and greener alternative ...

My initial intention was to build a simple ZFS-based NAS box, but after an abortive attempt to do things really on the cheap with a £20 mobo and £25 CPU (one of which didn't work, and I'm still waiting for Ebuyer to process my RMA), I decided I needed to make an effort to keep up with the Gerhards.

Although I'd seen Chris Gerhard's post about a system built around an ASUS M2NPV-VM, when I searched for this mobo at Insight UK (where Sun employees can get a useful discount, free next working day delivery, and expect to be treated as a valued customer), I was unable to find it. So instead, I opted for the cheaper ASUS M2V-MX (£38) ... and soon regretted it.

My config also included: an Antec SLK3000B case (£29), a quiet Trust PW-5252 420W PSU (£19), an AMD Athlon 64 3000+ 1.8GHz 512K L2 CPU (£45), two Kingston ValueRAM 1GB 667 DDRII DIMMS (£52 each), and two Seagate Barracuda 400GB 7200.10 ATA100 drives (£82 each). I also threw in a spare 80GB SATA drive and DVD rewriter I just happened to have lying around. Grand total: £399.

However, despite upgrading to the latest firmware, I couldn't get Casper's PowerNow stuff to work. My disappointment grew whilst talking to Chris about his ASUS M2NPV-VM. Not only had he got PowerNow working (rev 0705 firmware minimum required), but the mobo included gigabit networking, NVIDIA graphics, and greater memory capacity. By this time, feature creep had also set in. I could see that the machine might also be a useful workstation, and ZFS compression probably meant I could use as much CPU as I could get (sadly, the current version of Casper's code supports only a single core config).

Then I discovered that Insight UK did indeed stock the ASUS M2NPV-VM after all! It's just that their search engine is broken. So I decided to bite the bullet (£56) ... I have since reused the ASUS M2V-MX in a dual core Ubuntu config, but that's a story for another day ... perhaps.

To find out how much power I was saving, I invested in a Brennenstuhl PM230 inline power meter (shown above). Machine Mart only wanted £20 for it, and unlike some other cheap units, it does a proper job with the power factor. The only issue I've found is the crazy positioning of the power outlet relative to the display and control buttons (it's pretty obvious that UK power plugs were not considered in the original design). Anyway, here are some results:

Mode                  2x Opteron 2.2GHz     1x Athlon 64 1.8GHz   Intel Core Duo 1.83GHz
                      RioWorks HDAMB        ASUS M2NPV-VM         Apple MacBook Pro
                      2GB DDRII             2GB DDRII             2GB DDR2
                      6x 7200RPM            3x 7200RPM            1x 5400RPM

standby               40W (£23 PA)          4W (£2 PA)            7W (£4 PA)
idle                  240W (£137 PA)        78W (£47 PA)          34W (£19 PA)
idle + charging       -                     -                     60W (£34 PA)
1 loop                260W (£149 PA)        111W (£64 PA)         50W (£29 PA)
2 loops               280W (£160 PA)        111W (£64 PA)         55W (£32 PA)
2 loops + charging    -                     -                     81W (£46 PA)

The above calculated annual electricity costs are based on a charge of 9p per kWh. Since a home server spends most of its time idle, I calculate that my new machine will save me at least £90 per year relative to my stop-gap Opteron system. That hardly pays for the upgrade, but it does salve my green conscience a little ... just not much!

mbp$ ssh basket
Last login: Fri Apr 27 16:12:16 2007
Sun Microsystems Inc.   SunOS 5.11      snv_55  October 2007
basket$ prtdiag
System Configuration: System manufacturer System Product Name
BIOS Configuration: Phoenix Technologies, LTD ASUS M2NPV-VM ACPI BIOS Revision 0705 01/02/2007

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
AMD Athlon(tm) 64 Processor 3000+ Socket AM2 

==== Memory Device Sockets ================================

Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
unknown in use 0   A0                  Bank0/1
unknown empty  0   A1                  Bank2/3
unknown in use 0   A2                  Bank4/5
unknown empty  0   A3                  Bank6/7

==== On-Board Devices =====================================

==== Upgradeable Slots ====================================

ID  Status    Type             Description
--- --------- ---------------- ----------------------------
1   in use    PCI              PCI1
2   available PCI              PCI2
4   available PCI Express      PCIEX16
5   available PCI Express      PCIEX1_1
basket$ 


Tuesday Dec 06, 2005

Scorching Mutexes with CoolThreads - Part 1

How could I not be excited about CoolThreads?! Regular readers of Multiple Threads will be aware of my technical white paper on Multithreading in the Solaris Operating Environment, and of more recent work making getenv() scale for Solaris 10. And there's a lot more stuff I'm itching to blog -- not least libMicro, a scalable framework for microbenchmarking which is now available to the world via the OpenSolaris site.

For my small part in today's launch of the first CoolThreads servers, I thought it would be fun to use libMicro to explore some aspects of application level mutex performance on the UltraSPARC T1 chip. In traditional symmetric multiprocessor configurations mutex performance is dogged by inter-chip cache to cache latencies.

Little Monster

To applications, the Sun Fire T2000 Server looks like a 32-way monster. Indeed, log in to one of these babies over the network and you soon get the impression of a towering, floor-breaking, hot-air-blasting, mega-power-consuming beast. In reality, it's a cool, quiet, unassuming 2U rackable box with a tiny appetite!

Eight processor cores -- each with its own L1 cache and four hardware strands -- share a common on-chip L2 cache. The thirty-two virtual processors see very low latencies from strand to strand, and core to core. But how does this translate to mutex performance? And is there any measurable difference between inter-core and intra-core synchronization?

For comparison, I managed to scrounge a Sun Fire V890 Server with eight UltraSPARC IV processors (i.e. 16 virtual processors in all). Both machines are clocked at 1.2GHz.

Spin city

First up, I took libMicro's cascade_mutex test case for a spin. Literally! This test takes a defined number of threads and/or processes and arranges them in a ring. Each thread has two mutexes on which it blocks alternately; and each thread manipulates the two mutexes of the next thread in the ring such that only one thread is unblocked at a time. Just now, I'm only interested in the minimum time taken to get right around the loop.

The default application mutex implementation in Solaris uses an adaptive algorithm in which a thread waiting for a mutex does a short spin for the lock in the hope of avoiding a costly sleep in the kernel. However, in the case of an intraprocess mutex the waiter will only attempt the spin as long as the mutex holder is running (there is no point spinning for a mutex held by a thread which is making no forward progress).

Wow!

With 16 threads running cascade_mutex the T2000 achieved a blistering 11.9us/loop (that's less than 750ns per thread)! The V890, on the other hand, took a more leisurely 25.3us/loop. Clearly, mutex synchronization can be very fast with CoolThreads!

Naturally, spinning is not going to help the cascade_mutex case if you have more runnable threads than available virtual processors. With 32 active threads the V890 loop time rockets to 850us/loop, whereas the T2000 (with just enough hardware strands available) manages a very respectable 32.4us/loop. Only when the T2000 runs out of virtual processors does the V890 catch up (due to better single thread performance). At 33 threads the T2000 jumps to 1140us/loop versus 900us/loop on the V890.

In conclusion

libMicro's cascade_mutex case clearly shows that UltraSPARC T1 delivers incredibly low latency synchronization across 32 virtual processors. Whilst this is a good thing in general it is particularly good news for the many applications which use thread worker pools to express their concurrency.

In Part 2 we will explore the small difference cascade_mutex sees between strands on the same core and strands on different cores. Stay tuned!


Wednesday Jun 22, 2005

How not to load-balance connections with accept()

It is fairly well known that multiple threads in multiple processes can wait for TCP/IP connections on the same listening socket. What is not so well known are the nuances of implementation which affect the load balancing of such connections across processes and threads. Here I'm going to share what I learned whilst working on improving the scalability of a popular 3rd party software product (which shall remain nameless).

Let's not worry too much about how multiple processes come to be sharing the same listening socket (this could easily be achieved by inheritance across fork(2) or using ioctl(2) with I_SENDFD (see streamio(7I))). Instead, let's consider how connections are allocated to threads in such a situation.

The Relatively Easy Way

Threads blocking in accept(3SOCKET) join a socket-specific queue in "prioritised FIFO" order. This is almost exactly what we want for round-robin allocation of connections!

However, it does pose a challenge when a thread "just happens" to have a low priority when it arrives at the queue. If having a low priority is a relatively rare occurrence, this could mean that threads thus afflicted suffer starvation (i.e. if there is always at least one higher priority thread queued whenever a connection arrives).

In the olden days this was a real issue for threads in the TS scheduling class. Of course, one solution would be to use a fixed priority scheduling class (such as FX). However, what was needed was for threads sleeping in "prioritised FIFO" queues to have their priority and queue position recalculated on a regular basis (in the same manner as runnable threads waiting for a CPU).

The bug 4468181 low priority TS threads on a sleep queue can be victimised was first fixed in Solaris 9. The fix has since been made available for Solaris 8, Solaris 7, and Solaris 2.6 via patches, but in these cases it has to be enabled by adding the following to /etc/system:

set TS:ts_sleep_promote=1

The fix is enabled by default in Solaris 9 onwards, and once it is in place connections to a specific listening socket will be allocated to threads on a near round-robin basis.

The Extremely Hard Way

For whatever reason ("it seems to work ok on AIX" is not a good one!) some folk like to put their listening sockets into non-blocking mode, and then to wait for connections in select(3C) or poll(2) -- it is important to note that in Solaris, select() is implemented using poll() in any case.

This is all fine and dandy, except that Solaris uses LIFO for the poll sleep queues. Whilst this suits thread worker pools very well (because it helps keep caches warm) it is precisely not what one needs for the round-robin load-balancing of connections between threads.

Of course, the "prioritised FIFO" queueing of accept() and the LIFO queueing of poll() is either not specified or it is classified as "implementation specific" by the standards. Solaris does what it does because it seems to please most of the people most of the time. However, if you assume FIFO you can be in for a nasty surprise!

Actually, having multiple threads waiting in poll() for connections on the same listening socket is a pretty dumb idea. The reason is that a thundering herd will occur each time a connection comes in.

Some implementations actually result in every thread coming right out of poll() and diving into accept() where only one will succeed. Solaris is a little smarter in that once one thread has picked up a connection, other threads still in poll() will go back to sleep.

Desperate Situations, Desperate Measures

If you have an application which is doing things the extremely hard way, then the best solution is a complete rewrite! However, if this is not an option open to you "The Harman House Of Hacks" is proud to present the following:

#define _REENTRANT

#include <sys/types.h>
#include <sys/socket.h>
#include <dlfcn.h>
#include <poll.h>
#include <stdlib.h>

static int (*orig_accept)();

#pragma init(iaccept_init)
#pragma weak accept=_accept

static void
iaccept_init()
{
        orig_accept = (int(*)())dlsym(RTLD_NEXT, "_accept");
}

static int
_accept(int s, struct sockaddr *addr, socklen_t *addrlen)
{
        int res;

        /* sleep for up to 20ms to scramble the sleep queue order */
        (void) poll(0, 0, drand48() * 21.0);

        res = orig_accept(s, addr, addrlen);
        return res;
}

This is intended to be compiled and used something like this:

$ cc -K pic -G -o iaccept.so iaccept.c
$ LD_PRELOAD=./iaccept.so brokenapp

This works by interposing a new implementation of accept() which does a short random sleep before diving into the standard implementation of accept(). This means that the thread at the head of the listening socket's sleep queue is no longer likely to be the first thread into accept().

Conclusion

Of course, this is a rather blunt instrument which should not be applied universally -- or lightly. However, although more threads in the thundering herd are likely to come all the way out of poll() this hack does a very nice job of scrambling the sleep queue order. For the application for which the above hack was written, the latency added to accept() was not an issue. But more importantly, connections were very evenly balanced across processes.


Tuesday Jun 14, 2005

Caring For The Environment - Making getenv() Scale

Although a relatively minor contributor to OpenSolaris, I still have the satisfaction of knowing that every Solaris 10 process is using my code. But who in their right mind needs getenv(3C) to scale? Of course, if you don't care about thread safety (as is currently the case with glibc version 2.3.5 -- and hence with Linux) your implementation might scale very nicely, thank you!

Sun, on the other hand, does care about thread safety (and we've been doing so for a long time). However, we had rather assumed that no one in their right mind would need getenv() to scale, so our implementation was made thread safe by putting a dirty great mutex lock around every piece of code which manipulates environ. After all, as our very own Roger Faulkner is so fond of saying: "Correctness is a constraint; performance is a goal". And who cares about getenv() performance anyway?

But Who Really Cares?

Well it turns out that there are some significant applications which depend on getenv() scalability (and which scale wonderfully on Linux ... where thread safety is often ignored ... they are just very lucky that no one seems to be updating environ whilst anyone else is reading it). So Bart Smaalders filed bug 4991763 getenv doesn't scale and said he thought it was an excellent opportunity for my first putback. Thanks Bart!

For some time I've been saying: "If Linux is faster, it's a Solaris bug!" but somehow 4991763 didn't quite fit the bill. Firstly, I think an application which depends on getenv() scalability is broken. Secondly, Linux is just feeling lucky, punk. However, I do firmly believe that we should do all we can to ensure that Solaris runs applications well -- even those which really need some tuning of their own. I had also been itching for a chance to explore opportunities for lockless optimisations in libc, so all in all 4991763 was an opportunity not to be missed!

A Complete Rewrite

The existing implementation of getenv(), putenv(), setenv(), and unsetenv() was split across three files (getenv.c, putenv.c and nlspath_checks.c) with the global mutex _libc_environ_lock being defined elsewhere. Things had become fairly messy and inefficient so I decided on a complete rewrite.

NLSPATH security checks had introduced yet another global variable and a rather inefficient dance involving a mutex on every {get,put,set,unset}env() call just to ensure that clean_env() was called the first time in. In this instance it was an easy matter to remove the mutex from the fast path by doing a lazy check thus:

static mutex_t                  update_lock = DEFAULTMUTEX;
static int                      initenv_done = 0;

char *getenv(const char *name)
{
        char                    *value;

        if (!initenv_done)
                initenv();

        if (findenv(environ, name, 1, &value) != NULL)
                return (value);

        return (NULL);
}

The test was then repeated under the protection of the mutex in initenv() thus:

extern void                     clean_env();

static void
initenv()
{
        lmutex_lock(&update_lock);
        if (!initenv_done || ... ) {
                /* Call the NLSPATH janitor in. */
                clean_env();
                .
                .
                .
                initenv_done = 1;
        }
        lmutex_unlock(&update_lock);
}

By rolling putenv.c into getenv.c I was able to eliminate the use of globals altogether, which in turn allowed the compiler to produce better optimised code.

Look, No Locks!

But the biggest part of the rewrite was to make the fast path of getenv() entirely lockless. What is not apparent above is that findenv() is entirely lockless.

Various standards define the global environment list pointer:

extern char **environ;

This has to be kept as a consistent NULL terminated array of pointers to NULL terminated strings. However, the standards say nothing about how updates are to be synchronised. More recent standards forbid direct updates to environ itself if getenv() and friends are being used.

Yet the requirement that environ is kept consistent is precisely what we need to implement a lockless findenv(). The big issue is that whenever the environ list is updated, anyone else in the process of scanning it must not see an old value which has been removed, or miss a value which has not been removed.

The traditional approach is to allocate an array of pointers, with environ pointing to the first element. When someone needs to add a new value to the environment list we simply add it to the end of the array. But how do we go about deleting values? And what if we need to add a new value when the allocated array is already full? If you care about threads, it's not long before you need to introduce some locking!

The new implementation contains two "smarts" which meet these challenges without introducing locks into the getenv() path ...

Double It And Drop The Old One On The Floor

When the new implementation needs a bigger environ list, it simply allocates a new one which is twice as large and copies the old list into it. The old list is never reused -- it is left intact for the benefit of anyone else who might happen to be still traversing it.

This may sound wasteful, but the environment list rarely needs to be resized. The wastage is also bounded -- it is quite easy to prove mathematically that this strategy never consumes more than 3x the space required by an array of optimal size.

However, one teeny weeny issue with the "drop it on the floor" approach is that leak detectors can get a tad upset if they find allocated memory which is not referenced by anyone. With a view to keeping me on the straight and narrow -- but mostly to avert high customer support call volumes -- Chris Gerhard recommended that I keep a linked list of all internally dropped memory (just to keep those goody-goody leak detectors happy).

I first met Chris in 1989. He was on my interview panel when I joined Sun. I do hope he feels he did a good job that day!
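The grow-and-drop idea can be sketched like this. Everything here is illustrative -- the struct and variable names (env_block, blocks, env_size, grow_env) are mine, not the real libc ones -- but it shows both tricks: the old array is never freed (a reader may still be walking it), and chaining it onto a list keeps leak detectors happy. The copy goes into the top of the new block so that, as described below, the list can grow downwards:

```c
#include <stdlib.h>
#include <string.h>

struct env_block {
        struct env_block *next;  /* chains retired blocks, so leak
                                  * detectors still see a reference */
        char *vec[];             /* the pointer array itself */
};

static struct env_block *blocks; /* head is current, the rest dropped */
static size_t env_size;          /* slots in the current block */

/* Double the array and copy the live entries (plus the NULL
 * terminator) into the TOP of the new block.  The old array is
 * "dropped on the floor": never freed, never reused. */
static char **
grow_env(char **old, size_t in_use)
{
        size_t newsize = env_size ? env_size * 2 : 16;
        struct env_block *nb = malloc(sizeof (*nb) +
            newsize * sizeof (char *));

        if (nb == NULL)
                return (NULL);
        memcpy(nb->vec + newsize - (in_use + 1), old,
            (in_use + 1) * sizeof (char *));
        nb->next = blocks;       /* old blocks remain referenced */
        blocks = nb;
        env_size = newsize;
        return (nb->vec + newsize - (in_use + 1));
}
```

Because each retired block is at most half the size of its successor, the retired blocks sum to less than the current one, which is where the bounded-wastage argument comes from.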

Overwrite With The First Entry And Increment

I was bouncing some other getenv() ideas around with Chris when he also gave me just what I needed for deletes. The old code just grabbed the global lock, found the element to be deleted, shifted the rest of the list down one slot (overwriting the victim), and then released the lock.

Chris had the bright idea of copying the first element in the list over the victim, and then incrementing the environ pointer itself. The worst case would be that the same element might be seen twice by another thread, but this is not a correctness issue.

This led to two further changes:

  1. New values are now added at the bottom of the environment list (with the environ pointer being decremented once the new value is in place).
  2. When a new double-sized environment list needs to be allocated, the old one is copied into the top of the new one (instead of the bottom) so that the list can then be grown downwards.
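Chris's delete trick can be sketched as follows. This is an illustrative standalone version (the function name delete_env is mine, and the real libc code also serialises writers with update_lock): copy the first entry over the victim, publish that store, then shrink the list from the bottom by incrementing environ. A concurrent lockless reader may see the moved entry twice, but never a torn or truncated list:

```c
#include <string.h>
#ifdef __sun
#include <atomic.h>                     /* membar_producer() */
#else
/* portable stand-in for the Solaris membar_producer() store barrier */
#define membar_producer() __atomic_thread_fence(__ATOMIC_RELEASE)
#endif

extern char **environ;

static void
delete_env(const char *name)
{
        size_t len = strlen(name);
        char **p;

        for (p = environ; *p != NULL; p++) {
                if (strncmp(*p, name, len) == 0 && (*p)[len] == '=') {
                        *p = environ[0];        /* copy first over victim */
                        membar_producer();      /* publish the copy ... */
                        environ++;              /* ... then drop slot 0 */
                        return;
                }
        }
}
```

Note the degenerate case works too: if the victim is environ[0] itself, the copy is a no-op and the increment simply drops it.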

OK, Not Entirely Lockless

Obviously mutex lock protection is still needed to serialise all updates to the environment list. The new implementation has a lock for all seasons: update_lock (e.g. for updating initenv_done and for protecting environ itself). However the new getenv() is entirely lockless (i.e. once clean_env() has been called once).

Another important issue is that it is considered bad practice for system libraries to hold a lock while calling malloc(). For this reason the first two thirds of addtoenv() sit inside a for (;;) loop. If a larger environment array has to be allocated, addtoenv() must drop update_lock temporarily. However, this opens a window in which another thread may modify environ, in which case we must retry. The loop is controlled by a simple generation number, environ_gen (also protected by update_lock).
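The shape of that retry loop looks something like the sketch below. This is my own standalone approximation, not the libc source: inside libc the lock is taken with lmutex_lock(), whereas here a pthread mutex is used so the sketch is self-contained, and my_environ, array_base and array_size are invented names:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned environ_gen;    /* bumped on every update */
static char **my_environ;       /* first live slot (grows downwards) */
static char **array_base;       /* current allocation */
static size_t array_size;       /* slots in current allocation */

static int
addtoenv(char *string)
{
        for (;;) {
                pthread_mutex_lock(&update_lock);
                if (my_environ != NULL && my_environ > array_base) {
                        /* room below: store the value, then publish it
                         * by decrementing the list pointer */
                        my_environ[-1] = string;
                        my_environ--;
                        environ_gen++;
                        pthread_mutex_unlock(&update_lock);
                        return (0);
                }

                /* need a bigger array: note the generation, then drop
                 * the lock -- never hold it across malloc() */
                size_t in_use = 0;
                if (my_environ != NULL)
                        while (my_environ[in_use] != NULL)
                                in_use++;
                unsigned gen = environ_gen;
                size_t newsize = array_size ? array_size * 2 : 16;
                pthread_mutex_unlock(&update_lock);

                char **bigger = malloc(newsize * sizeof (char *));
                if (bigger == NULL)
                        return (-1);

                pthread_mutex_lock(&update_lock);
                if (environ_gen != gen) {
                        /* someone changed the list while we were in
                         * malloc(): discard our array and reassess */
                        pthread_mutex_unlock(&update_lock);
                        free(bigger);
                        continue;
                }
                /* copy the old list into the TOP of the new one so it
                 * can grow downwards; the old array is simply dropped */
                bigger[newsize - 1] = NULL;
                if (my_environ != NULL)
                        memcpy(bigger + newsize - (in_use + 1),
                            my_environ, (in_use + 1) * sizeof (char *));
                array_base = bigger;
                array_size = newsize;
                my_environ = bigger + newsize - (in_use + 1);
                environ_gen++;
                pthread_mutex_unlock(&update_lock);
                /* loop back: the fast path now has room */
        }
}
```

The generation check is what makes dropping the lock safe: an allocation made against a stale view of the list is thrown away, never installed.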

Guaranteeing Consistency

That's almost all there is to it. However in multiprocessor systems we still have to make sure that memory writes on one CPU happen in such a way that they don't appear out of sequence on another CPU. Of course this is taken care of automatically when we use mutex locks.

Consider the following code fragment to insert a new item:

        environ[-1] = string;
        environ--;

It is vitally important that the two memory writes implied by this are seen in the same order by every CPU in the system. On SPARC today this doesn't matter, since all ABI compliant binaries run in Total Store Order mode (i.e. stores are guaranteed to be visible in the order in which they are submitted). But it is possible that future systems will use a more relaxed memory model.

However, this is not just a hardware issue, it is also a compiler issue. Without extra care the compiler might reorder the two stores, since the C language cares nothing for threads. I had quite a long discussion with a number of folk concerning "sequence points" and the use of volatile in C.

The eventual solution was this:

        environ[-1] = string;
        membar_producer();
        environ--;

First, the function membar_producer() serves as a sequence point, guaranteeing that the C compiler will preserve the order of the preceding and following statements. Secondly, it issues any store barriers needed by the underlying hardware to guarantee the same effect as Total Store Order for the preceding and following instructions.

A Spirit Of Generosity

My new implementation was integrated into s10_67, yet despite my own extensive testing it caused a number of test suites to fail in later testing. This was tracked down to common test suite library code which updated environ directly. Yuck! Although this kind of thing is very much frowned upon by the more recent standards, it was felt that if our own test cases did it, there was a good chance that some real applications did it too. So, with some reluctance, I filed 6183277 getenv should be more gracious to broken apps.

If someone else is going to mess with environ under your nose there's not a lot you can do about it. However, it is fairly easy to detect the majority of cases (except for a few really nasty data races) by keeping a private copy my_environ which can be compared with the actual value of environ. If these two values are ever found to differ we just go back to square one and try again.

So the above fragment for adding an item now looks more like this:

        if (my_environ != environ || ...)
                initenv();
        .
        .
        .
        my_environ[-1] = string;
        membar_producer();
        my_environ--;
        environ = my_environ;

Conclusion

My second putback integrated into s10_71. Following this I had to fight off one other challenge from Rod Evans who filed 6178667 ldd list unexpected (file not found) in x86 environment. However this turned out to be not my fault, but a latent buffer overrun in ldd(1) exposed by different heap usage patterns. Of course, the engineer who introduced this heinous bug (back in January 2001) will remain anonymous (sort of). Still, he did buy me a beer!

Of course, the serious point is that when you change something which is used by everyone, it is possible that you expose problems elsewhere. It is to be expected that the thing last changed will get the blame. Such changes are not for the faint-hearted, but they can be a whole lot of fun!

My first experience of modifying Solaris actually resulted in two putbacks into the Solaris gate, but I learnt a great deal along the way. Dizzy with my success, I am now actively seeking other opportunities for lockless optimisations in libc!


Tuesday May 24, 2005

If Linux is faster, it's a Solaris bug!

Well that's what I said back in 2002 -- and it certainly had an impact within Sun -- but I see Darrin has toned it down a little in his inspirational list of performance mantras. Somehow "if a competing OS is faster, it's a Solaris bug and it's important" doesn't have quite the same punch!

I still stand behind my original call to action. And I'm glad to report that the gap has been closing and continues to close. Of course there will always be cases where the RAS (reliability, availability, and serviceability) or scalability of one platform make it slower than a competitive but less capable platform. However, the days of attributing all performance vices to other mitigating virtues are long gone. Quite often code is slow simply because it is slow and needs tuning.

I've already promised to blog about my lockless scalability fix for getenv(3C) (and I will do so soon). This is just one instance of me putting my putback where my slogan is. Although in this case it would have been possible to take the moral high ground since the other platform's implementation was only faster because it was completely multithread unsafe -- it may still be so for all I know, but I do know that Solaris 10 now has a fast and safe implementation.

Is it too cheeky to suggest a corollary? ...

"If Solaris has a cool distinguishing feature, it's a Linux bug!"


Friday May 20, 2005

Time, Timers, Timeouts and Ticks - Part 1

One annoying thing about DTrace is that there is now generally no good excuse for not finding out what really is going on whenever a puzzling system statistic rears its ugly head. In the past I would just shrug off such gaps in my understanding and move on to something "more important" (i.e. less soul destroying). With the old tools the "return on investment" was often just too low, but with DTrace the ROI really can be significant: not only is my understanding improving, but I'm finding new stuff to "go fix" all the time!

Jon Haslam and I were sipping coffee in a Washington DC hotel lobby, hacking together some examples for our demo session at the SUPerG 2005 conference. I was simulating a thundering herd scenario on my Acer Ferrari laptop by having 100 threads in a process doing this:

        for(;;)
                poll(0, 0, 10);

I'd done this kind of thing many times before, but this time I happened to notice that I wasn't getting my expected number of syscalls or context switches, as reported by good old vmstat(1M). By some strange quirk I was seeing only about 6,000 additional syscalls per second, but context switches were up by 10,000 per second (where I naively expected the syscalls to be). Of course, I now realise that I should have expected no more than 5,000 poll(2) syscalls per second.

I will presume the reader is pretty used to the idea that Solaris rounds most timers and timeouts (I'm not going to cover CLOCK_HIGHRES timers here) up to the next clock tick. The default clock frequency is 100Hz (and has been as long as I can remember, although from Solaris 7 onwards this has been configurable) which equates to a tick interval of 10ms.

This means that the following code segment running in a single thread will generate no more than 100 poll(2) syscalls per second:

        for (;;)
                poll(0, 0, 9);

Although we're asking for 9ms timeouts these will get rounded up to the next 10ms tick. On a quiet system we can expect the first call to synchronise us with the clock after one or two ticks. From then on all subsequent calls should only require one tick each (with nearly 1ms to spare for call and implementation overhead).

Likewise we can fairly safely predict that the following code segment will generate no more than 50 syscalls per second:

        for (;;)
                poll(0, 0, 11);

When we come to a 10ms timeout it is fairly safe to predict (based on the above) that the following code should generate somewhere between 50 and 100 syscalls per second tops:

        for (;;)
                poll(0, 0, 10);

Of course we could only hit 100 syscalls per second if the C language overhead and the poll(2) syscall implementation overhead were both zero. On reflection, this obviously cannot be the case!

A second guess would be that all 10ms sleeps must take more than 10ms elapsed time, so we might expect just 50 syscalls per second maximum. However, in the single-threaded case I observe about 60 on my system (and this accords with the 6,000 I saw from 100 threads).

At this point it is important to add that I have verified that none of the poll(2) syscalls are returning early - i.e. all are taking at least 10ms.

Right now my best guess is that the gap between two consecutive clock ticks reaching the timer callout queue sometimes exceeds 10ms. Of course if this happens it is very likely that the very next tick will reach the callout queue in less than 10ms, but this too would be consistent with my observations.

To summarise: I currently believe that something like this is happening:

poll()  tick0   tick1   tick2   tick3   tick4   tick5   tick6   tick7
 9ms            wakeup  wakeup  wakeup  wakeup  wakeup  wakeup  wakeup
10ms                    wakeup          wakeup  wakeup          wakeup
11ms                    wakeup          wakeup          wakeup

In the 10ms example above I'm guessing that the gap between tick4 and tick5 arriving at the timer callout queue is slightly more than 10ms. This results in two consecutive 10ms wakeups. Since I'm seeing 60 syscalls where I would expect 50, I can further surmise that this kind of thing happens about 20% of the time.

Whilst I did use some trivial DTrace scripts to verify some of the above data (e.g. the number of actual poll(2) syscalls per second) most of this post is intended to set the scene and whet your appetite for Part 2.

Next time I hope to explain how I used DTrace to find the reason for the excessive context switches I observed. This led to me filing a performance bug and opened up a host of additional questions to do with timers, timeouts and ticks and how they relate to the time of day.


About

pgdh
