Tuesday Sep 08, 2009

Low latency computing with Solaris and DTrace

Over the past couple of years I've helped a number of financial institutions identify and eliminate or significantly reduce sources of latency and jitter in their systems. In the city there's currently something akin to an arms race, as banks seek to gain a competitive edge by shaving microseconds off transaction times. It's all about beating the competition to make the right trade at the right price. A saving of one millisecond can be worth millions in profit. And the regulatory bodies need to be able to keep up with this new low latency world too.

Code path length is only part of the picture (although an important one). Processor architectures with challenging single-thread performance (such as Sun's T-series systems) can still offer a competitive advantage in scenarios where quick access to CPU resource matters more than raw single-thread speed. Your mileage will vary.

When it comes to jitter I've seen a fair amount of naivety. Just because I have 32 cores in a system doesn't mean I won't see issues such as preemption, interrupt pinning, thundering herds and thread migration. Thankfully, DTrace provides the kind of observability I need to identify and quantify such issues, and Solaris usually has the features needed to ameliorate them, often without any change to application code.
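
To give a flavour of what I mean, here's the kind of throwaway one-liner I might start with (a sketch only, not lifted from any customer engagement): it counts, per executable, how often a thread comes back on CPU somewhere other than where it last ran.

# dtrace -n 'sched:::off-cpu { self->last = cpu + 1; }
  sched:::on-cpu /self->last && self->last != cpu + 1/ { @migrations[execname] = count(); }'

A few minutes of that is usually enough to show whether thread migration is even worth chasing before reaching for processor sets, interrupt pinning or the like.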

I generally find that there is a lot of "low hanging fruit", and am often able to demonstrate a dramatic reduction in jitter and absolute latency in a short amount of time. You may have seen some pretty big claims for DTrace, but in my experience it is hard to over-hype what can be achieved. It's not just about shaving milliseconds off transaction times, but about reducing the amount of hardware that needs to be thrown at the problem.

DTrace for dummies - Complexity

DTrace is many things to many people. To me it is a tool for engaging with complexity. Sure, there's an important place for the DTrace Toolkit, advanced OpenStorage analytics, Chime and other wonderful technologies built on DTrace (most of which don't even come close to exposing the user to the more low-level, cranium-challenging detail), but for me DTrace remains "The One True Tool" (as one slashdot reviewer put it) and the means by which I can ask an arbitrary question and get an instant answer.

When presenting DTrace to a new audience, I see my primary goal as creating desire. Nothing worth having comes easily. Getting to grips with DTrace involves a steep learning curve. Before exposing candidates to potentially overwhelming detail, I need to show them why the gain is going to be worth the pain. It's also useful to sow some seeds of self-doubt and insecurity, to establish my authority as the teacher they can trust. So I generally start by talking about complexity.

All I'm going to blog here is one of my favourite complexity stories. It is best done live, with lots of stuff scrolling up a green screen, and plenty of theatrical flair. However, for the purpose of this post I've done the UNIX thing and used a pipe into the wc(1) command. I'm sorry if it loses something in the telling, but the base data is still interesting.

I usually start by talking about how complexity has increased during my time at Sun. In the good old days when we all programmed in C it was possible for one person to have a handle on the whole system. But today's world is very different. In a bid to connect with the old timers, we start talking about "Hello World!". I then show how good the truss(1) utility is at exposing some of the implementation detail.

We then move on to a Java implementation. The code looks similar, and it is functionally equivalent. Although both the C and Java versions complete in far less than a second, even the casual observer can see that the Java variant is slower. I then start digging deeper with truss(1). First, we compare just the number of system calls, then the number of inter-library function calls, and lastly, the number of intra-library function calls.

This post is really just the raw data, intended to underline two points: firstly, that today's software environments are a lot more complex than we often give them credit for; and secondly, that we need a new generation of tools to engage with this level of complexity. For added fun, I've added Perl and Python data to the mix. Enjoy!

The Code

opensolaris$ head -10 hello.c hello.pl hello.py hello.java
==> hello.c <==
#include <stdio.h>

int
main(int argc, char *argv[])
{
	(void) printf("Hello World!\n");
}

==> hello.pl <==
#!/usr/bin/perl

print "Hello World!\\n";

==> hello.py <==
#!/usr/bin/python

print "Hello World!"

==> hello.java <==
public class hello {
	public static void main(String args[]) {
		System.out.println("Hello World!");
	}
}

It works!

opensolaris$ ./hello
Hello World!
opensolaris$ ./hello.pl
Hello World!
opensolaris$ ./hello.py
Hello World!
opensolaris$ java hello 
Hello World!

Syscalls

opensolaris$ truss ./hello 2>&1 | wc -l   
33
opensolaris$ truss ./hello.pl 2>&1 | wc -l
118
opensolaris$ truss ./hello.py 2>&1 | wc -l
660
opensolaris$ truss java hello 2>&1 | wc -l     
2209

Inter-library calls

opensolaris$ truss -t!all -u : ./hello 2>&1 | wc -l
9
opensolaris$ truss -t!all -u : ./hello.pl 2>&1 | wc -l
232
opensolaris$ truss -t!all -u : ./hello.py 2>&1 | wc -l
31578
opensolaris$ truss -t!all -u : java hello 2>&1 | wc -l     
12055
Note: these numbers need to be divided by two (see the raw output for why).

Intra-library calls

opensolaris$ truss -t!all -u :: ./hello 2>&1 | wc -l    
329
opensolaris$ truss -t!all -u :: ./hello.pl 2>&1 | wc -l 
10337
opensolaris$ truss -t!all -u :: ./hello.py 2>&1 | wc -l
548908
opensolaris$ truss -t!all -u :: java hello 2>&1 | wc -l    
4142645
Note: these numbers also need to be divided by two (see above).

Context

opensolaris$ uname -a
SunOS opensolaris 5.11 snv_111b i86pc i386 i86pc Solaris

Conclusion

Of course the above gives no indication of how long each experiment took. Yes, I could have wrapped the experiment with ptime(1), but I'll leave that as an exercise for the reader. When I use this illustration with a live audience, it's generally sufficient to allow the longest case to continue to scroll up the screen for the rest of the presentation.
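
If you do want the numbers, ptime(1) will happily oblige -- something along these lines (output not reproduced here):

opensolaris$ ptime ./hello > /dev/null
opensolaris$ ptime ./hello.pl > /dev/null
opensolaris$ ptime ./hello.py > /dev/null
opensolaris$ ptime java hello > /dev/null

ptime(1) writes its real, user and system times to stderr, so redirecting stdout merely silences the greeting itself.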

At this point, I generally move on. Usually, I say some kind words about high level languages, abstraction, code reuse etc. I am not out to knock Java. That's not the point. The point is complexity. I then move on to how DTrace can help us to engage with complexity. I'd do that here, but I hope that I'll continue to be asked to speak on the subject, and I don't want to give it all away just now.

Thursday Dec 06, 2007

Solaris is to UNIX what Mindstorms is to Lego

I've now been at Sun for the best part of two decades. It was Solaris and SPARC which first attracted me to Sun, and it's exciting to see both very much alive and kicking all these years on. But this didn't happen by chance. Sun currently spends $2B per year on R&D, making it one of the top 50 technology investors worldwide.

Having myself spent the last five years doing R&D, why have I decided to move back into Sun UK's field organisation? Simply, it's because I think Sun has a very compelling story to tell, and a very exciting portfolio of products and services to offer. In short, I think I'm going to have a lot of fun!

In my new role I'm finding that the thinking behind my Killer Combination, Wicked Bible and Brief History postings is resonating well with a lot of people. Quite frankly, I'm astonished by the number of downloads of my extremely minimalist presentation given to the Sun HPC Consortium in Dresden.

Such has been the interest of late that I thought it would be worth sharing my latest minimalist slide deck: Solaris: greater than the sum of its parts. A lot of the material may be familiar, although I have been experimenting with tag clouds as an alternative to boring bullet points. My basic thesis is: Sun is where the innovation is, so why flirt with imitations?

I've been a big Lego fan for as long as I can remember. Whilst some kids are content to follow the supplied instruction guides, the real thrill for me has always been that Lego allows me to design and build my very own creations, limited only by my imagination. I feel the same way about UNIX.

UNIX has always been about innovation. UNIX has always provided a rich set of simple, consistent, elegant and well defined interfaces which enable the developer to "go create". This "innovation elsewhere" often takes the seed innovation to places unforeseen by its inventors, and this in turn leads to new innovations.

Lego has undergone a similar evolution. At first I only had chunky 2x2 and 2x4 bricks in red and white to work with. Then came 1xN bricks and 1/3 height plates and more colours. Next came the specialised pieces (e.g. window frames, door frames, wheels, flat plates, sloping roof bricks, fences and so on). But all the time these innovations extended the original, well-designed interfaces, with a big commitment to compatibility, thus preserving investment in much the same way as the Solaris ABI (application binary interface).

Obviously, there are places where these parallels break down, but I think we can push the analogy a little further yet [note to self: a USENIX paper?]. In my mind, Solaris is somewhat akin to Lego Technics, and Solaris 10 to Lego Mindstorms. And in this vein, I see Linux rather in the Betta Builder mould (i.e. innovative interfaces copied from elsewhere, actually cheaper and lighter, but not quite the same build quality as the original implementation). And this is where I'm going to get a little more provocative.

In my new presentation I experiment with tag clouds to enumerate some of Sun's more important UNIX innovations over time. The first cloud lists prehistoric stuff from SunOS 3 and 4 days. The second focusses mostly on Solaris 2. The third focusses on Solaris 10. And while Sun may not be able to take sole credit for ideas such as /proc and mmap, it can claim to have the first substantive implementations.

The fourth tag cloud is included to demonstrate that Sun does not suffer from terminal NIH (not invented here) syndrome. Indeed, I think it recognises that Sun is a pretty good judge of excellence elsewhere (most of the time).

Whatever you think of the detail (and I concede some of it could do with a little more research) I do think it is helpful to ask "where does the innovation happen?". At the very least, I think I've shown that there is heaps of innovation in Solaris which we simply take for granted.

To put it another way: as a Solaris enthusiast I can't help feeling at ease in a Linux environment because I find so many familiar objects from home (I guess a GNU/BSD devotee might say something similar). That's not to deny the achievements of the Linux community in implementing interfaces invented elsewhere, but when I look at the flow of innovation between Solaris and Linux it does feel rather like a one-way street.

We live in interesting times! My own particular area of interest is multithreading. With the dawning of the brave new world of large scale chip multithreading Solaris seems uniquely placed to ride the next wave. This is not by accident. Sun has made a huge investment in thread scalability over the past 15 years.

One of my slides asks "What is the essential difference between single and multithreaded processes?" For some this is not a trivial question. For others, it depends on which thread library is being used. But with Solaris, where application threads have been first class citizens ever since Solaris 10 first shipped, the answer is simply "the number of threads".

Enough of this! The weekend is upon us. Where's my Lego?

Tuesday Sep 11, 2007

Taking UFS new places safely with ZFS zvols

I've just read a couple of intriguing posts which discuss the possibility of hosting UFS filesystems on ZFS zvols. I mean, who in their right mind...? The story goes something like this ...

# zfs create tank/ufs
# newfs /dev/zvol/rdsk/tank/ufs
# mount /dev/zvol/dsk/tank/ufs /ufs
# touch /ufs/file
# zfs snapshot tank/ufs@snap
# zfs clone tank/ufs@snap tank/ufs_clone
# mount /dev/zvol/dsk/tank/ufs_clone /ufs_clone
# ls -l /ufs_clone/file

Whoopy doo. It just works. How cool is that? I can have the best of both worlds (e.g. UFS quotas with ZFS datapath protection and snapshots). I can have my cake and eat it!

Well, not quite. Consider this variation on the theme:

# zfs create tank/ufs
# newfs /dev/zvol/rdsk/tank/ufs
# mount /dev/zvol/dsk/tank/ufs /ufs
# date >/ufs/file
# zfs snapshot tank/ufs@snap
# zfs clone tank/ufs@snap tank/ufs_clone
# mount /dev/zvol/dsk/tank/ufs_clone /ufs_clone
# cat /ufs_clone/file

What will the output of the cat(1) command be?

Well, every time I've tried it so far, the file exists, but it contains nothing.

The reason for this is that whilst the UFS metadata gets updated immediately (ensuring that the file is created), the file's data has to wait a while in the Solaris page cache until the fsflush daemon initiates a write back to the storage device (a zvol in this case).

By default, fsflush will attempt to cover the entire page cache within 30 seconds. However, if the system is busy, or has lots of RAM -- or both -- it can take much longer for the file's data to hit the storage device.

Applications that care about data integrity across power outages and crashes don't rely on fsflush to do their dirty (page) work for them. Instead, they tend to use raw I/O interfaces, or fcntl(2) flags such as O_SYNC and O_DSYNC, or APIs such as fsync(3C), fdatasync(3RT) and msync(3C).
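
By way of illustration, here is a minimal C sketch (not lifted from any real application) which ensures its data is on stable storage before it exits:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	const char *msg = "some data\n";
	int fd;

	/* O_DSYNC: each write(2) returns only once the data is on stable storage */
	fd = open("/ufs/file", O_WRONLY | O_CREAT | O_DSYNC, 0644);
	if (fd == -1) {
		perror("open");
		return (1);
	}
	if (write(fd, msg, strlen(msg)) == -1)
		perror("write");

	/* belt and braces: fsync(3C) flushes anything still dirty for this fd */
	(void) fsync(fd);
	(void) close(fd);
	return (0);
}

Either the O_DSYNC flag or the explicit fsync(3C) call would do on its own; the point is that neither leaves the data's fate in the hands of fsflush.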

On systems with large amounts of RAM, the fsflush daemon can consume inordinate amounts of CPU. It is not uncommon to see a whole CPU pegged just scanning the page cache for dirty pages. In configurations where applications take care of their own write flushing, it is considered good practice to throttle fsflush with the /etc/system parameters autoup and tune_t_fsflushr. Many systems are configured for fsflush to take at least 5 minutes to scan the whole of the page cache.
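
As a sketch only (the right values depend on the machine and the workload), the corresponding /etc/system entries might look like this:

* fsflush takes ~300 seconds (5 minutes) to cover the whole page cache ...
set autoup=300
* ... waking every 5 seconds to scan 1/60th of it each time
set tune_t_fsflushr=5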

From this it is clear that we need to take a little more care before taking a snapshot of a UFS filesystem hosted on a ZFS zvol. Fortunately, Solaris has just what we need:

# zfs create tank/ufs
# newfs /dev/zvol/rdsk/tank/ufs
# mount /dev/zvol/dsk/tank/ufs /ufs
# date >/ufs/file
# lockfs -wf /ufs
# zfs snapshot tank/ufs@snap
# lockfs -u /ufs
# zfs clone tank/ufs@snap tank/ufs_clone
# mount /dev/zvol/dsk/tank/ufs_clone /ufs_clone
# cat /ufs_clone/file

Notice the addition of just two lockfs(1M) commands. The first blocks any writers to the filesystem and causes all dirty pages associated with the filesystem to be flushed to the storage device. The second releases any blocked writers once the snapshot has been cleanly taken.

Of course, this will be nothing like as quick as the initial example, but at least it will guarantee that you get all the data you are expecting. It's not just missing data we should be concerned about, but also stale data (which is much harder to detect).

I suppose this may be a useful workaround for folk waiting for some darling features to appear in ZFS. However, don't forget that "there's no such thing as a free lunch"! For instance, hosting UFS on ZFS zvols will result in the double caching of filesystem pages in RAM. Of course, as a SUNW^H^H^H^HJAVA stock holder, I'd like to encourage you to do just that!

Solaris is a wonderfully well-stocked tool box full of the great technology that is ideal for solving many real world problems. One of the joys of UNIX is that there is usually more than one way to tackle a problem. But hey, be careful out there! Make sure you do a good job, and please don't blame the tools when you screw up. A good rope is very useful. Just don't hang yourself!


Tuesday Jul 10, 2007

Prstat + DTrace + Zones + ZFS + E25K = A Killer Combination

The table on the right was directly lifted from a report exploring the scalability of a fairly traditional client-server application on the Sun Fire E6900 and E25K platforms.

The system boards in both machines are identical, only the interconnects differ. Each system board has four CPU sockets, with a dual-core CPU in each, yielding a total of eight virtual processors per board.

The application client is written in COBOL and talks to a multithreaded C-ISAM database on the same host via TCP/IP loopback sockets. The workload was a real world overnight batch of many "read-only" jobs run 32 at a time.

The primary metric for the benchmark was the total elapsed time. A processor set was used to contain the database engine, with no more than eight virtual processors remaining for the application processes.

The report concludes that the E25K's negative scaling is due to NUMA considerations. I felt this had more to do with perceived "previous convictions" than fact. It bothered me that the E6900 performance had not been called into question at all or explored further.

The situation is made a little clearer by plotting the table as a graph, where the Y axis is a throughput metric rather than the total elapsed time.

Although the E25K plot does indeed appear to show negative scalability (which must surely be somehow related to data locality), it is the E6900 plot which reveals the real problem.

The most important question is not "Why does the E25K throughput drop as virtual processors are added?" but rather "Why does the E6900 hardly go any faster as virtual processors are added?"

Of course there could be many reasons for this (e.g. "because there are not enough virtual processors available to run the COBOL client").

However, further investigation with the prstat utility revealed severe thread scalability bottlenecks in the multithreaded database engine.

Using prstat's -m and -L flags it was possible to see microstate accounting data for each thread in the database. This revealed a large number of threads in the LCK state.
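
The incantation itself is trivial (1234 below stands in for the database engine's process id):

# prstat -mL -p 1234 5

With microstate accounting enabled, the LCK column shows the percentage of time each LWP spent waiting on user-level locks, and LAT the time spent waiting for a CPU.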

Some very basic questions (simple enough to be typed on a command line) were then asked using DTrace and these showed that the "lock waits" were due to heavy contention on a few hot mutex locks within the database.
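
I can't reproduce the customer data here, but the questions were of this general shape (again, 1234 stands in for the database pid):

# plockstat -C -e 30 -p 1234
# dtrace -n 'plockstat$target:::mutex-block { @[ustack(5)] = count(); }' -p 1234

The first ranks the most contended user-level locks over a 30 second window; the second aggregates short user stacks every time a thread blocks on a mutex, pointing straight at the offending code paths.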

Many multithreaded applications are known to scale well on machines such as the E25K. Such applications will not have highly contended locks. Good design of data structures and clever strategies to avoid contention are essential for success.

This second graph may be purely hypothetical but it does indicate how a carefully written multithreaded application's throughput might be expected to scale on both the E6900 and the E25K (taking into account the slightly longer inter-board latencies associated with the latter).

The graph also shows that something less than perfect scalability may still be economically viable on very large machines -- i.e. it may be possible to solve larger problems, even if this is achieved less efficiently.

As an aside, this is somewhat similar to the way in which drag takes over as the dominant factor limiting the speed of a car -- i.e. it may be necessary to double the engine size to increase the maximum speed by less than 50%.

Working with the database developer it was possible to use DTrace to begin "peeling the scalability onion" (an apt metaphor for an iterative process of diminishing returns -- and tears -- as layer after layer of contention is removed from code).

With DTrace it is a simple matter to generate call stacks for code implicated in heavily contended locks. Breaking such locks up and/or converting mutexes to rwlocks is a well understood technique for retrofitting scalability to serialised code, but it is beyond the scope of this post. Suffice it to say that some dramatic results were quickly achieved.

Using these techniques the database scalability limit was increased from 8 to 24 virtual processors in just a matter of days. Sensing that the next speed bump might take a lot more effort, some other really cool Solaris innovations were called on to go the next step.

The new improved scalable database engine was now working very nicely alongside the COBOL application on the E25K in the same processor set with up to 72 virtual processors (already somewhere the E6900 could not go).

For this benchmark the database consisted of a working set of about 120GB across some 100,000 files. With well in excess of 300GB of RAM in each system it seemed highly desirable to cache the data files entirely in RAM (something which the customer was very willing to consider).

The "read-only" benchmark workload actually resulted in something like 200MB of the 120GB dataset being updated each run. This was mostly due to writing intermediate temporary data (which is discarded at the end of the run).

Then came a flash of inspiration. Using clones of a ZFS snapshot of the data together with Zones it was possible to partition multiple instances of the application. But the really cool bit is that ZFS snapshots are almost instant and virtually free.

ZFS clones are implemented using copy-on-write relative to a snapshot. This means that most of the storage blocks on disk and filesystem cache in RAM can be shared across all instances. Although snapshots and partitioning are possible on other systems, they are not instant, and they are unable to share RAM.

The E25K's 144 virtual processors (on 18 system boards) were partitioned into a global zone and five local zones of 24 virtual processors (3 system boards) each. The database was quiesced and a ZFS snapshot taken. This snapshot was then cloned five times (once per local zone) and the same workload run against all six zones concurrently (in the real world the workload would also be partitioned).
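
Stripped of the benchmark specifics, the plumbing for each zone looked something like this (the pool, dataset and zone names here are illustrative rather than the real ones):

# zfs snapshot tank/db@quiesced
# zfs clone tank/db@quiesced tank/db_zone1
# zonecfg -z zone1 'add dataset; set name=tank/db_zone1; end'
# zfs clone tank/db@quiesced tank/db_zone2
# zonecfg -z zone2 'add dataset; set name=tank/db_zone2; end'

And so on for the remaining zones, each bound to its own three-board set of 24 virtual processors.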

The resulting throughput was nearly five times that of a single 24 virtual processor zone, and almost double the capacity of a fully configured E6900.

All of the Solaris technologies mentioned in this posting are pretty amazing in their own right. The main reason for writing is to underline how extremely powerful the combination of such innovative technologies can be when applied to real world problems. Just imagine what Solaris could do for you!

Solaris: the whole is greater than the sum of its parts.


Wednesday May 16, 2007

ZFS and RAID - "I" is for "Inexpensive" (sorry for any confusion)

When I were a lad "RAID" was always an acronym for "Redundant Array of Inexpensive Disks". According to this Wikipedia article 'twas always thus. So, why do so many people think that the "I" stands for "Independent"?

Well, I guess part of the reason is that when companies started to build RAID products they soon discovered that they were far from inexpensive. Stuff like fancy racking, redundant power supplies, large nonvolatile write caches, multipath I/O, high bandwidth interconnects, and data path error protection simply don't come cheap.

Then there's the "two's company, three's a crowd" factor: reliability, performance, and low cost ... pick any two. But just because the total RAID storage solution isn't cheap, doesn't necessarily mean that it cannot leverage inexpensive disk drives.

However, inexpensive disk drives (such as today's commodity IDE and SATA products) provide a lot less in terms of data path protection than more expensive devices (such as premium FC and SCSI drives). So RAID has become a somewhat elitist, premium technology, rather than goodness for the masses.

Enter ZFS.

Because ZFS provides separate checksum protection of all filesystem data and metadata, even IDE drives can be deployed to build simple RAID solutions with high data integrity. Indeed, ZFS's checksumming protects the entire data path from the spinning brown stuff to the host computer's memory.
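
So a perfectly respectable pool can be built from a couple of bargain drives and checked end to end (the device names below are just an example):

# zpool create tank mirror c1d0 c2d0
# zpool scrub tank
# zpool status -v tank

The scrub walks every block and verifies it against its checksum; with a mirror, anything found to be corrupt on one side is repaired from the other.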

This is why I rebuilt my home server around OpenSolaris using cheap non-ECC memory and low cost IDE drives. But ZFS also promises dramatically to reduce the cost of storage solutions for the datacentre. I'm sure we will see many more products like Sun's X4500 "Thumper".

ZFS - restoring the "I" in RAID

