Thursday Apr 27, 2006

A Tale of Two Books

The Second Edition of Solaris Internals

Sit. Relax. Breathe. Read a book. Watch a movie. It's done (pretty much). Resync. Have fun. Get normal.

Recently, I found myself thinking about a book I read some time ago, Into Thin Air, Jon Krakauer's riveting account of the 1996 Mount Everest expedition that ended in disaster. In particular, I was recalling Krakauer's description of how he felt when he finally summited Everest. Having reached a milestone of this magnitude, standing on the tallest spot on the planet earth, Krakauer found the moment more surreal than anything else; feelings of intense joy and satisfaction were to come later. There was no jumping for joy (literally or figuratively)...he was mostly thinking about getting off the mountain.

I had a similar experience when Richard McDougall and I, along with Brendan Gregg, wrapped up the new edition of Solaris Internals. Now, please understand, I am not equating writing a technical book to climbing the tallest mountain in the world - it's not even in the same universe effort-wise. I'm simply drawing a comparison to a similar feeling of having accomplished something that I never thought I'd complete, and the hazy feeling I had (have) now that it's all done. To me, it's a fascinating example of what complex creatures we are: you achieve something of such significance (summiting Everest, not writing a technical book), and a natural gratification latency follows. I think it will kick in when we're shipping, and I can actually hold, in my hands, the new books. Which brings me to....

As per a blog I did last June, the updated edition of Solaris Internals did not have a smooth and predictable take-off. Even after we put a "stake in the ground", and narrowed our focus to Solaris 10 and OpenSolaris, we still suffered from self-induced scope-creep, and generally did all the wrong things in terms of constraining a complex project. The good news is that all the wrong things produced something that we are extremely proud of.

Our original goal was to produce an update to the first edition of Solaris Internals - no new content in terms of subject matter, but update, revise and rewrite the material such that it reflects Solaris 10 and OpenSolaris. Naturally, we could not ignore the new technology we had at our disposal, like DTrace, MDB, and all the other way-cool observability tools in Solaris 10 and OpenSolaris. Also, the availability of Solaris source code allowed us to reference specific areas of the code, and include code segments in the book. Additionally, the internal support and enthusiasm for the new edition was just overwhelming. The Sun engineering community, Solaris kernel engineering and adjunct groups, offered assistance in the writing, reviewing, content advice, etc. It was extremely gratifying to have so many talented and time-constrained engineers come forward and offer their time and expertise.

Given the tools and expertise we had at our disposal, it seemed inevitable that the end result would be something significantly different than our original intention of "a simple update". It actually brought us right back to one of our key goals when we wrote the first book. That is, in addition to describing the internals of the Solaris kernel, include methods of putting the information to practical use. To that end, we made extensive use of DTrace, MDB, etc., to illustrate the areas of the kernel discussed throughout the text. The tools examples naturally evolved into performance and behavior related text. This is a good thing, in that a great many readers will be using the text specifically to understand the performance and behavior of their Solaris systems. The not-so-good news is, once you cross the "performance" line, scope broadens pretty significantly, and more content gets created. A lot more content. Pages and pages. And man, if I may be so bold, it's all good.

So while I was busily trying to complete my internals chapters, Richard took off like a rocket with new material for performance and tools, and recruited Brendan Gregg (whose DTrace ToolKit is a must-have download) to add his considerable experience and expertise. Faster than you could say "DTrace rocks", we did a book build and found we had over 1400 pages of material. And we were not done. We had some calls with the publisher, and discovered that the publishing industry is not particularly fond of publishing very large books, and we had some concerns about our readers needing orthopedic surgery due to carrying Solaris Internals around. So it was decided that we split the work into two books: an internals book and a POD book. POD is an acronym that Richard and I have been using for some time, and expands to Performance, Observability and Debugging. We love being able to encapsulate so concisely what the vast majority of Solaris users wish to do - observe and understand for the purpose of improving performance, and/or root-causing pathological behavior. The tools are there, and now there's some documentation to leverage for guidance and examples.

As you can imagine, once we started on a performance book, scope-creep took a whole new path. Ideas flowed faster than content, and given sufficient time, we could easily have created 1000 pages on POD. As it was, finishing up turned out to be something of a herculean task. We were bound and determined not to miss another deadline, and to deliver the book files to the publisher. Richard, Brendan and I were communicating using AIM (Brendan is in Australia, Richard in California, and I'm located in New Jersey), and pounding away to get the material cleaned up and ready to ship. What started out as a late night turned into an all-nighter. Literally. I had to stop at 7:45AM ET to take my son to school, and I was happy to do it (stop, that is). Incredibly, Brendan and Richard seemed like they could go on for hours more (note to self, start consuming Vegemite). In the end, we met our goal, and handed the publisher two books, over 1600 pages of material, on that hazy Monday morning.

Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture

Solaris Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris

So there it is. We're currently working with the publisher on getting the cover art together, dealing with typesetting issues, etc. We're doing our best to accelerate getting the books to the printer, and getting them out the door. Hopefully, correct cover images, ISBN numbers and pricing will make their way to the retailers very soon. I want to take this blog opportunity to thank the Solaris community for the positive feedback on the first edition, and the support and interest in getting the second edition out the door. We really have a winning combination, with the best operating system on the planet (Solaris - my objective opinion), world-class observability tools, OpenSolaris source, and documentation to pull it all together, thus maximizing the using-Solaris experience. We look forward to hearing from and working with the Solaris community, and doing our part to broaden awareness of the power of Solaris, and contribute to the knowledge base. Keep an eye on the Solaris Internals and OpenSolaris web sites for feedback, forums and reader contributions.

Tuesday Dec 06, 2005

Niagara IO - Architecture & Performance

Today Sun is launching a revolutionary new set of server products. The Sun Fire™ CoolThreads servers, internally named Ontario and Erie, are both based on the Niagara multicore SPARC processor. The Niagara, or UltraSPARC™ T1 processor, represents a quantum leap in implementing multiple execution pipelines (cores) on a single chip, with support for multiple hardware threads per pipeline. We refer to this throughput-oriented design as Chip Multithreading (CMT) technology. The UltraSPARC T1 processor incorporates eight execution cores, with four hardware threads per core, providing on a single chip the capability of what previously required 32 processors (where each processor was a traditional design with a single instruction pipeline). The Sun Fire™ T2000 (Ontario) and the Sun Fire™ T1000 (Erie) represent groundbreaking technology, first and foremost in the amount of processing power (CPU, memory, I/O) available in a relatively small system. Both the Sun Fire™ T2000 and T1000 are rack mount chassis systems; the T2000 is two RU (rack units) high, and the T1000 is one rack unit in height. Within a relatively small package, we find an amazing amount of computing power - not only in terms of parallel processor-oriented tasks, but also in memory and I/O bandwidth capabilities. The icing on the cake is the low power design of the systems. The UltraSPARC T1 processor generates a remarkably low amount of heat, and the system as a whole has an amazing performance/power metric.

But my blog here today is not about the power and heat metrics of the T2000 and T1000. I'm sure that the launch blog-burst will include specific data on that particular feature. Nor will I be detailing the UltraSPARC T1 microprocessor architecture - the beauty of 8 execution cores, 4 hardware threads per core (32 threads total), and the whiz-bang performance and throughput these systems deliver with parallel workloads. My fellow bloggers will expound on these virtues, as well as other features. This discussion is intended to provide an overview of the I/O architecture of these systems, and a small sample of some performance numbers we have measured in our benchmarking work. These are not industry standard benchmark results - those can be found on the product pages.

The I/O architecture of the T2000 includes five PCI slots: three PCI-E and two PCI-X, as well as four on-board gigabit Ethernet ports. PCI-X is a 64-bit wide, 133MHz bus, capable of 1.06GB/sec bandwidth. PCI-E (PCI-Express) is a point-to-point bus that provides a non-shared link to a PCI-E device. A link can be implemented with one or more lanes to carry data, where each lane carries a full-duplex serial data bit stream at a rate of 2.5Gbits/second. PCI-E implementations can scale up bandwidth based on the number of lanes implemented in the link, referred to as X1, X2, X4, X8, X12, X16 and X32, where the value after the X corresponds to the number of data lanes. PCI-E on the T2000 and T1000 is X8, supporting devices with up to 8 lanes of data bandwidth capability. The transport bus between the Fire I/O bridge chip and the UltraSPARC T1 processor is the Jbus, which has a theoretical maximum bandwidth of 2.5GB/sec. Please note that this is not memory bandwidth - processor to memory data transfers take place on a different physical bus in the system (and of course through a cache memory hierarchy). The Jbus is dedicated to I/O, providing true high-end I/O bandwidth capability.
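
The bus figures above come down to simple arithmetic, and a few lines of C can make it explicit. This is only a sketch: the function names are made up for illustration, and the 8b/10b encoding overhead used to turn the raw 2.5Gbit/sec lane rate into a data rate is an assumption based on first-generation PCI-E in general, not a number measured on these systems.

```c
#include <assert.h>
#include <stdio.h>

/* Parallel bus (e.g. PCI-X): bytes per transfer times clock in MHz gives MB/sec. */
static double parallel_bus_mb_per_sec(int width_bits, double clock_mhz)
{
    assert(width_bits > 0 && width_bits % 8 == 0);
    return (width_bits / 8) * clock_mhz;
}

/* PCI-E link: raw signalling rate in Mbit/sec, one direction, at 2.5Gbit/sec per lane. */
static double pcie_raw_mbit_per_sec(int lanes)
{
    assert(lanes > 0);
    return lanes * 2500.0;
}

/*
 * Usable data rate in MB/sec, one direction.
 * Assumption (not from the post): 8b/10b encoding, so 8 of every 10 bits carry data.
 */
static double pcie_data_mb_per_sec(int lanes)
{
    return pcie_raw_mbit_per_sec(lanes) * 8.0 / 10.0 / 8.0;
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN to get a standalone program */
int main(void)
{
    printf("PCI-X 64-bit/133MHz: %.0f MB/sec\n", parallel_bus_mb_per_sec(64, 133.0));
    printf("PCI-E X8 raw:        %.0f Mbit/sec per direction\n", pcie_raw_mbit_per_sec(8));
    printf("PCI-E X8 data:       %.0f MB/sec per direction\n", pcie_data_mb_per_sec(8));
    return 0;
}
#endif
```

With these inputs the PCI-X figure works out to 1064 MB/sec, which is the 1.06GB/sec quoted above, and an X8 link comes to 20Gbit/sec raw in each direction, or about 2GB/sec of data under the 8b/10b assumption.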

The T1000 uses the same Fire I/O bridge chip and Jbus to interface the I/O subsystem to the UltraSPARC T1 processor. The T1000, at one RU in size, has fewer I/O slots, with one PCI-E slot.

Some quick tests on a Sun Fire T2000 system with well over 200 connected disks (multiple Sun 3510 storage arrays connected via multiple PCI-X and PCI-E dual port Gbit Fibre Channel adapters) indicate these systems are extremely I/O capable. The T2000 is able to sustain 1.6GB/sec of bandwidth doing sequential reads from raw disk devices. Running a database transactional workload, which has a random I/O profile (and small 4k I/Os), the T2000 sustains 58,000 IOPS (I/O operations per second). Using a smaller I/O size for the sequential tests (8k instead of 1MB), we can sustain 120,000 IOPS on reads (just under 1GB/sec bandwidth with 8k I/Os). On a combined read/write test, 60,000 reads/sec and 60,000 writes/sec are sustained.
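
The IOPS and bandwidth figures above can be cross-checked against each other, since bandwidth is just the operation rate times the I/O size. The sketch below does that conversion in both directions. The helper names are hypothetical, and it assumes "4k", "8k" and "1MB" mean 4096, 8192 and 1048576 bytes, with MB/sec counted in decimal megabytes.

```c
#include <assert.h>
#include <stdio.h>

/* Bandwidth in decimal MB/sec for a given operation rate and I/O size in bytes. */
static double bandwidth_mb_per_sec(double iops, double io_bytes)
{
    assert(iops >= 0.0 && io_bytes > 0.0);
    return iops * io_bytes / 1e6;
}

/* Operation rate needed to move mb_per_sec (decimal MB/sec) using io_bytes-sized I/Os. */
static double iops_for_bandwidth(double mb_per_sec, double io_bytes)
{
    assert(mb_per_sec >= 0.0 && io_bytes > 0.0);
    return mb_per_sec * 1e6 / io_bytes;
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN to get a standalone program */
int main(void)
{
    /* Assumption: "4k", "8k" and "1MB" mean 4096, 8192 and 1048576 bytes. */
    printf("58,000 IOPS x 4k:  %.1f MB/sec\n", bandwidth_mb_per_sec(58000.0, 4096.0));
    printf("120,000 IOPS x 8k: %.1f MB/sec\n", bandwidth_mb_per_sec(120000.0, 8192.0));
    printf("1.6GB/sec at 1MB:  %.0f IOPS\n", iops_for_bandwidth(1600.0, 1048576.0));
    return 0;
}
#endif
```

Under those assumptions, 120,000 8k reads per second come to roughly 983 MB/sec, consistent with "just under 1GB/sec", while the random 4k workload at 58,000 IOPS moves about 238 MB/sec.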

The numbers quoted above provide a solid indication that the Sun Fire™ T2000 system is not just a new system with another pretty processor (the UltraSPARC T1). These systems are designed to handle workloads that generate high rates of sustained I/O, making the T2000 system suitable for a broad range of applications and workloads.


Wednesday Jun 15, 2005

Solaris Internals (How I Spent My Summer Vacation)

Solaris Internals, Second Edition, and Watching the Kernel Flow

The launch of OpenSolaris, and all the very cool technology and features in Solaris 10 have generated a great deal of excitement in the industry. I for one am particularly jazzed about these events. Let me take a minute and talk about why, and then we'll do a bit of technical stuff...

Solaris Internals

Back in 1997 (through 2000), I coauthored Solaris Internals with my colleague and good mate Richard McDougall. After a few fits, sputters and false starts, we finally found ourselves sitting on opposite ends of a phone connection (Richard lives in the Bay Area, I'm in New Jersey), each holding a shiny new copy of Solaris Internals, exchanging grunts of disbelief. All of a sudden, as my euphoria was reaching a peak, I hear a groaned "Uh oh" from the other end of the phone. Richard (who never misses anything) spotted a bug - a rather severe bug. Page 107 was not followed by page 108, but rather by page 180! You're cruising along, reading all about kernel bootstrap and initialization, flip that page, and whammo, you're staring at a diagram of the page scanner's two-handed clock algorithm. Not good.

We immediately placed a call to the publisher, with a cry of "stop the presses"! They responded immediately, pulling the books that were printed, fixing the transposed page, and starting over. As far as I know, Richard and I have the only copies with the page bug (I'm waiting to sell mine on eBay - private bids accepted).

Following the publication of the first edition, Richard and I decided we needed a little time to decompress, get to know our families again, get caught up on some stuff around the house, and generally try to feel normal for a short while. After a bit of that, we'd jump in and get going on an update for Solaris 8 and Solaris 9. As it happens, writing and exercising (working out) have a few things in common. It's really hard to get started, and especially hard after you've stopped for a bit. Time marched on, and our efforts failed to keep pace. We finally reached a point where it just made more sense to focus our efforts on Solaris 10, and that is precisely what we are doing. We're working feverishly to complete the manuscript this summer, and get it on the shelves ASAP. Our thanks for all the support and kind words we have received over the years on the first edition.

What's Different The Second Time Around?

Working on the first edition was largely an exercise in reading through the kernel source, and describing what it does in English text. For those areas where I found myself scratching my head, there were a lot of email exchanges with various members of Solaris Kernel Engineering, seeking clarification, and/or the rationale behind a design choice. I'm compelled to add here that the engineers at the receiving side of my emails are the real authors (at least for the stuff I wrote - not speaking for Richard here). They are an amazing group of individuals that remained supportive and responsive throughout the effort. Their contributions to the accuracy and thoroughness of the text are immeasurable.

Working on the second edition, we have a new tool to make understanding what is happening in the kernel a whole heck of a lot easier. I am talking of course about DTrace. DTrace is a truly amazing technology, and having spent a bit of time looking at various areas of DTrace in the source code, I am even more amazed. I'm still trying to figure out how they did it! With DTrace at our disposal, we now have a tool that substantially reduces the amount of time required to understand the code path through the kernel for a specific event (that is of course just one of the zillions of things you can do with DTrace). Here are a couple of quick examples...

"Trussing" the Kernel...

A potentially interesting approach to walking through the Solaris kernel source is to pick an event or subsystem of interest, use DTrace to generate the code path, and use that as a reference point to start your source examination. An area that I've looked at and written about a bit is the process/threads architecture and subsystem, so we'll take a quick look at process and thread creation. But first, a quick note.

Threads, and thread models, exist in various forms in modern operating systems. In Solaris, they've been a core part of the kernel from the beginning (Solaris 2.0, circa 1991), as the kernel itself is multithreaded. Additionally, application programming interfaces (APIs) exist so software developers can write multithreaded applications. Solaris threads have evolved over time, going through a series of refinements, and some architectural changes. Phil Harman, also a very good friend and co-worker, is an expert in this area, and authored a great paper for Solaris 9 called Multithreading in the Solaris Operating System, which is an outstanding technical reference describing the architectural change, and characteristics of a new thread model that was introduced in Solaris 9 (well, Solaris 8 technically; see the paper for the particulars).

Threads are executable objects, where every process has one or more threads, and each thread can be scheduled and executed independently of other threads in the same process. In Solaris, think of a process as a state container for one or more threads. Various components of the process are shared by the threads in the process - the address space (memory space), open files, credentials, etc. Each thread however has its own stack, and each thread can issue system calls and enter the kernel.
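
To make the shared-versus-private distinction concrete, here is a small POSIX threads sketch. It is an illustration only (the names, thread count and iteration count are arbitrary choices, not anything from the Solaris source): every thread updates one global counter that lives in the process address space, while each thread also keeps a local counter on its own stack.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS   4
#define ITERATIONS 1000

/* One copy per process: every thread sees and updates the same variable. */
static long shared_total;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long *result = arg;
    long local_count = 0;   /* lives on this thread's own stack */

    for (int i = 0; i < ITERATIONS; i++) {
        local_count++;
        pthread_mutex_lock(&total_lock);
        shared_total++;
        pthread_mutex_unlock(&total_lock);
    }
    *result = local_count;
    return NULL;
}

/* Runs NTHREADS workers to completion and returns the shared total. */
static long run_workers(long per_thread[NTHREADS])
{
    pthread_t tids[NTHREADS];

    shared_total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, &per_thread[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return shared_total;
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN to get a standalone program */
int main(void)
{
    long per_thread[NTHREADS];
    long total = run_workers(per_thread);

    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d private count: %ld\n", i, per_thread[i]);
    printf("shared total: %ld\n", total);
    return 0;
}
#endif
```

Each thread ends with a private count of ITERATIONS, while the shared total is the sum across all threads, which is why the shared update needs the mutex and the local one does not.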

Processes, threads, and pretty much every other memory object that resides in the Solaris kernel are defined in header files, most of which can be found in the common/sys directory in the source tree, so this is the place to look to see the structure definitions - reference proc.h and thread.h.

Now let's say we're interested in understanding the code path through the kernel for process creation. This becomes a no-brainer with dtrace. All processes are created with the fork(2) system call, so all we need is a D script to trace the kernel when the fork(2) system call is entered. Additionally, we could write a simple C program to call fork(2) if we don't have a load on our system generating fork(2) calls. We need a D script that uses the syscall provider to enable a probe for fork(2) entry and return. I can verify the probe name simply on the command line:

solaris10> dtrace -n syscall::fork:entry
dtrace: invalid probe specifier syscall::fork:entry: probe description syscall::fork:entry does not match any probes

Egads! What madness is this? No fork(2) probe in the syscall provider? Let's see...:

solaris10> dtrace -l -P syscall | grep fork
    9    syscall                                             forkall entry
   10    syscall                                             forkall return
  201    syscall                                               vfork entry
  202    syscall                                               vfork return
  243    syscall                                               fork1 entry
  244    syscall                                               fork1 return

Ah, OK. We have probes for forkall(2), vfork(2) and fork1(2), but not fork(2). A quick look at the man page tells us most of what we need to know. It's an implementation detail...there are actually three flavors of fork(2) in Solaris 10: fork(2), forkall(2) and fork1(2) (well, there's also vfork(2), but that's a deprecated interface, since we now have the posix_spawn(3C) interface). Prior to Solaris 10, fork(2) would replicate all the threads when called from a multithreaded process, and fork1(2) was created for applications that wish to replicate only the calling thread in the child process (unless the code was linked with libpthread, in which case fork(2) would provide fork1(2) behaviour). In Solaris 10, fork(2) provides fork1(2) behaviour, replicating only the calling thread in the child process. A forkall(2) system call exists when the replicate-all-threads behaviour is desired. So what happens when we have a fork(2) call in our code? Since fork(2) and fork1(2) do the same thing in Solaris 10, there's no need to have multiple source files. In the libc source tree, you'll find fork(2) set to the same entry point as fork1(2) with a #pragma compiler directive in scalls.c:

116 /*
117  * fork() is fork1() for both Posix threads and Solaris threads.
118  * The forkall() interface exists for applications that require
119  * the semantics of replicating all threads.
120  */
121 #pragma weak fork = _fork1
122 #pragma weak _fork = _fork1
123 #pragma weak fork1 = _fork1
124 pid_t
125 _fork1(void)
126 {

This was done so the bazillions of applications out there that call fork(2) will just work on Solaris 10. Now where were we...oh yeah, why the fork1(2) probe in the dtrace script, when it's fork(2) we're interested in? Because fork(2) is associated with fork1(2) in the libc source, such that calls to fork(2) and fork1(2) enter the same function (_fork1(void), line 125 above). Thus, the dtrace syscall provider cannot locate a probe for the fork(2) entry point as one does not exist, which is just fine, because we now know why.
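
If the weak-symbol trick is unfamiliar, here is a tiny stand-alone sketch of the same idea. It is not the libc code: the demo_* names are hypothetical, the function body is a stand-in that only counts calls, and it uses the GCC/Clang alias attribute on an ELF platform as a rough counterpart to the #pragma weak lines above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for libc's _fork1(); it only counts how often it runs. */
int demo_fork1_impl(void)
{
    static int calls;
    return ++calls;
}

/*
 * Two extra names bound to the same function body, in the spirit of
 * "#pragma weak fork = _fork1". Assumes GCC or Clang on an ELF platform.
 */
int demo_fork(void)  __attribute__((weak, alias("demo_fork1_impl")));
int demo_fork1(void) __attribute__((weak, alias("demo_fork1_impl")));

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN to get a standalone program */
int main(void)
{
    int first = demo_fork();
    int second = demo_fork1();

    /* Both names ran the same body, so they share its call counter. */
    assert(second == first + 1);
    printf("demo_fork() -> %d, demo_fork1() -> %d\n", first, second);
    return 0;
}
#endif
```

Because both names resolve to one function, there is only one entry point to observe, which is the same reason only a fork1 probe shows up in the syscall provider listing.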

So let's get back to having a look at the code path for a process creation in Solaris 10. Here's a D script that'll do the trick:

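#!/usr/sbin/dtrace -s

#pragma D option flowindent

syscall::fork1:entry { self->trace = 1; }
fbt:::
/ self->trace /
{ }
syscall::fork1:return { self->trace = 0; exit(0); }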

The D script above sets the flowindent option, which makes a large function-call-return result much easier to read. When a fork(2) or fork1(2) system call is executed, the entry probe will fire. The action in the probe clause sets a thread local variable (self->trace) to 1. We use a probe name of fbt::: in the script, which enables every Function Boundary Tracing probe that exists (basically, most every kernel function). Note we have a predicate that tests 'if self->trace == 1', ensuring that we only take action (print the function name - the default action since nothing is specified in the probe clause) when we're in our fork(2) code path. Finally, when fork1(2) returns, we clear the trace flag and exit, so we only get one set of function call flow, which is all we're interested in. Running the D script, along with a simple C program that does a fork(2) system call, we get:

solaris10> ./pcreate.d -c ./fk
dtrace: script './pcreate.d' matched 32360 probes
Child PID: 5284
Parent PID: 5283
dtrace: pid 5283 has exited
CPU FUNCTION                                 
 12  -> fork1                                 
 12  <- fork1                                 
 12  -> cfork                                 
 12    -> secpolicy_basic_fork                
 12    <- secpolicy_basic_fork                
 12    -> priv_policy                         
 12    <- priv_policy                         
 12    -> holdlwps                            
 12      -> schedctl_finish_sigblock          
 12      <- schedctl_finish_sigblock          
 12      -> pokelwps                          
 12      <- pokelwps                          
 12    <- holdlwps                            
 12    -> flush_user_windows_to_stack         
 12    <- flush_user_windows_to_stack         
 12    -> pool_barrier_enter                  
 12    <- pool_barrier_enter                  
 12    -> getproc                             
 12        <- setbackdq                       
 12        -> generic_enq_thread              
 12        <- generic_enq_thread              
 12        -> disp_lock_exit                  
 12        <- disp_lock_exit                  
 12      <- continuelwps                      
 12      -> continuelwps                      
 12        -> thread_lock                     
 12        <- thread_lock                     
 12        -> disp_lock_exit                  
 12        <- disp_lock_exit                  
 12      <- continuelwps                      
 12      -> thread_lock                       
 12      <- thread_lock                       
 12      -> thread_transition                 
 12      <- thread_transition                 
 12      -> disp_lock_exit_high               
 12      <- disp_lock_exit_high               
 12      -> ts_setrun                         
 12      <- ts_setrun                         
 12      -> setbackdq                         
 12        -> cpu_update_pct                  
 12          -> cpu_grow                      
 12            -> cpu_decay                   
 12              -> exp_x                     
 12              <- exp_x                     
 12            <- cpu_decay                   
 12          <- cpu_grow                      
 12        <- cpu_update_pct                  
 13  <= fork1                                 

I cut most of the lines from the output for this post. The idea here is to illustrate how easy it is to plot the code path through the kernel for a fork call (process creation). You can see what a great option flowindent is. It does a beautiful job of presenting a long function call flow, showing entry points (->) and returns (<-). Note also in the command line output that I used the '-c' flag in the dtrace command line. This instructs dtrace to run the specified command, and exit when it's complete. In this case, the command was the fk executable, which is a simple piece of code that issues a fork(2) call.
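
For anyone who wants to try this at home, something along the following lines would serve as the test program. This is a sketch, not the actual fk source: the function name is made up, and the only thing taken from the run above is the shape of the "Child PID" and "Parent PID" output.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks once. The child reports its PID and exits; the parent waits for
 * the child, then reports its own PID. Returns the child's exit status,
 * or -1 on error.
 */
static int fork_and_report(void)
{
    int status = 0;
    pid_t child;

    fflush(stdout);          /* don't let the child inherit unflushed output */
    child = fork();
    if (child == -1) {
        perror("fork");
        return -1;
    }
    if (child == 0) {
        printf("Child PID: %ld\n", (long)getpid());
        fflush(stdout);
        _exit(0);
    }

    assert(child > 0);       /* in the parent, fork() returned the child's PID */
    if (waitpid(child, &status, 0) == -1) {
        perror("waitpid");
        return -1;
    }
    printf("Parent PID: %ld\n", (long)getpid());
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN to get a standalone program */
int main(void)
{
    return fork_and_report() == 0 ? 0 : 1;
}
#endif
```

Built with something like cc -DDEMO_MAIN -o fk fk.c, the resulting binary can be handed to dtrace with the '-c' flag as shown in the run above.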

Another quick example - this time, a thread create. Here's the D:

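#!/usr/sbin/dtrace -s

#pragma D option flowindent

pid$target::pthread_create:entry { self->trace = 1; }
fbt:::
/ self->trace /
{ }
pid$target::pthread_create:return { self->trace = 0; exit(0); }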

The logic of this D script is the same as the previous example. What is different here is we're using the PID provider to enable probes in a user process. This is because the entry point we're interested in is not a system call, it is a library routine (pthread_create(3C)). As such, it will be mapped into the user process's address space, and it's the PID provider that opens that door for us. We're also using the $target DTrace macro variable. This works in conjunction with the '-c' command line option, and will set $target to the PID of the command started on the command line, which we can use as part of the PID provider component of the probe name, to probe the process we're interested in. Here's the run:

solaris10> ./tcreate.d -c ./tds    
dtrace: script './tcreate.d' matched 32360 probes
Created 8 writers, 16 readers
CPU FUNCTION                                 
  0  -> pre_syscall                           
  0    -> syscall_mstate                      
  0    <- syscall_mstate                      
  0  <- pre_syscall                           
  0  -> sysconfig                             
  0  <- sysconfig                             
  0  -> post_syscall                          
  0    -> clear_stale_fd                      
  0    <- clear_stale_fd                      
  0    -> syscall_mstate                      
  0    <- syscall_mstate                      
  0  <- post_syscall                          
  0  -> pre_syscall                           
  0    -> syscall_mstate                      
  0    <- syscall_mstate                      
  0  <- pre_syscall                           
  0  -> smmap32                               
  0    -> smmap_common                        
  0      -> as_rangelock                      
  0      <- as_rangelock                      
  0      -> zmap                              
  0        -> map_addr                        
  0          -> map_addr_proc                 
  0            -> rctl_enforced_value         
  0              -> rctl_set_find             
  0    -> lwp_continue                        
  0      -> thread_lock                       
  0      <- thread_lock                       
  0      -> setrun_locked                     
  0        -> thread_transition               
  0        <- thread_transition               
  0        -> disp_lock_exit_high             
  0        <- disp_lock_exit_high             
  0      <- setrun_locked                     
  0      -> ts_setrun                         
  0      <- ts_setrun                         
  0      -> setbackdq                         
  0        -> cpu_update_pct                  
  0          -> cpu_decay                     
  0            -> exp_x                       
  0            <- exp_x                       
  0          <- cpu_decay                     
  0        <- cpu_update_pct                  
  0        -> cpu_choose                      
  0        <- cpu_choose                      
  0        -> disp_lock_enter_high            
  0        <- disp_lock_enter_high            
  0        -> cpu_resched                     
  0        <- cpu_resched                     
  0      <- setbackdq                         
  0      -> generic_enq_thread                
  0      <- generic_enq_thread                
  0      -> disp_lock_exit                    
  0      <- disp_lock_exit                    
  0    <- lwp_continue                        
  0    -> cv_broadcast                        
  0    <- cv_broadcast                        
  0  <- syslwp_continue                       
  0  -> post_syscall                          
  0    -> clear_stale_fd                      
  0    <- clear_stale_fd                      
  0    -> syscall_mstate                      
  0    <- syscall_mstate                      
  0  <- post_syscall                          
  0  <- pthread_create

Once again, most of the output was cut for space. This time, we executed a binary called 'tds', which of course is a program that creates threads.
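
A stand-in for a program like 'tds' does not need to be fancy. The sketch below is hypothetical (it is not the real tds source; the reader and writer bodies are placeholders), and simply creates 8 writer threads and 16 reader threads with pthread_create(3C), matching the message printed in the run above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define NWRITERS 8
#define NREADERS 16

static pthread_mutex_t value_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_value;

/* Placeholder writer: bumps the shared value once. */
static void *writer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&value_lock);
    shared_value++;
    pthread_mutex_unlock(&value_lock);
    return NULL;
}

/* Placeholder reader: records whatever value it happens to see. */
static void *reader(void *arg)
{
    int *seen = arg;

    pthread_mutex_lock(&value_lock);
    *seen = shared_value;
    pthread_mutex_unlock(&value_lock);
    return NULL;
}

/* Creates all threads, waits for them, and returns the final shared value. */
static int create_threads(void)
{
    pthread_t tids[NWRITERS + NREADERS];
    int seen[NREADERS];
    int n = 0;

    shared_value = 0;
    for (int i = 0; i < NWRITERS; i++, n++) {
        int rc = pthread_create(&tids[n], NULL, writer, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NREADERS; i++, n++) {
        int rc = pthread_create(&tids[n], NULL, reader, &seen[i]);
        assert(rc == 0);
        (void)rc;
    }
    printf("Created %d writers, %d readers\n", NWRITERS, NREADERS);

    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    for (int i = 0; i < NREADERS; i++)
        assert(seen[i] >= 0 && seen[i] <= NWRITERS);
    return shared_value;
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN to get a standalone program */
int main(void)
{
    return create_threads() == NWRITERS ? 0 : 1;
}
#endif
```

Any program that calls pthread_create(3C) at least once would do for the trace; the reader/writer split here just mirrors the output line.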

That's it for now. With some simple D scripting, and sample programs that generate an event of interest, you can trace the code path through the kernel, then use that data to zero in on points of interest in the source code.



