Friday Aug 01, 2014

Faster Boot with Many Devices

In addition to its highly touted features such as Kernel Zones and Unified Archives, the just-released Solaris 11.2 has some nice unsung optimizations that are noticed less because everything works the same, but faster. One set of optimizations improves the scalability and efficiency of the devfsadm daemon, which is responsible for managing the namespace of devices under the /devices and /dev mount points in the filesystem. This daemon is very busy at boot time and during certain device configuration operations. For example, it accumulates 11 minutes of CPU time during boot on a system with 1000's of devices:

       242 root        8  59    0   73M   72M sleep   11:03  0.00% devfsadm

Moreover, devfsadm is on the critical path during boot, and many SMF services depend directly or indirectly upon it completing the configuration of /dev, including for example the console login service.

We made many improvements to the userland and kernel components of devfs, including caching, hashing, and tuning timeouts. In one test, we measure the time from typing the OBP boot command to the appearance of the console login prompt. The system is a T5440 with 4000 disk devices, multi-pathed using MPxIO, which multiplies the number of device instances. The previous version of Solaris takes over 44 minutes, and Solaris 11.2 takes just over 3 minutes, for a 13X speedup. This is an old T-series processor, and both the old and new Solaris will be faster on a more recent processor such as the T5, but Solaris 11.2 will still reduce boot times by many minutes on current platforms with 1000's of devices. Your results will vary with device count. The algorithms we fixed have polynomial time complexity, so the times do not scale linearly. If your system has only 10's to 100's of devices, you might not notice the difference.

In addition to normal boot and reboot, commands for which we observe speedups include the following (but no doubt there are others I have missed):

  • devfsadm -C : Clean up dangling /dev links
  • reboot -r : reconfiguration reboot
  • cfgadm -al : show status of dynamically reconfigurable hardware resources

Do you configure systems with a huge number of devices? Does Solaris 11.2 make a difference for you? Please share your experiences.

Wednesday Apr 10, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 3

Today I conclude this series on M5-32 scalability [ Part1 , Part2 ] with enhancements we made in the Scheduler, Devices, Tools, and Reboot areas of Solaris.


Scheduler

The Solaris thread scheduler is little changed, as the architecture of balancing runnable threads across levels in the processor resource hierarchy, which I described when the T2 processor was introduced, has scaled well. However, we have continued to optimize the clock function of the scheduler. Clock is responsible for quanta expiration, timeout processing, resource accounting for every CPU, and miscellaneous housekeeping functions. Previously, we parallelized quanta expiration and timeout expiration (aka callouts). In Solaris 11, we eliminated the need to acquire the process and thread locks in most cases during quanta expiration and accounting, and we eliminated or reduced the impact of several smallish O(N) calculations that had become significant at 1536 CPUs. The net result is that all functionality associated with clock scales nicely, and CPU 0 does not accumulate noticeable %sys CPU time due to clock processing.


Devices

SPARC systems use an IOMMU to map PCI-E virtual addresses to physical memory. The PCI VA space is a limited resource with high demand. The VA span is only 2GB to maintain binary compatibility with traditional DDI functions, and many drivers pre-map large DMA buffer pools so that mapping is not on the critical path for transport operations. Every CPU can post concurrent DMA requests, thus demand increases with scale. Managing these conflicting demands is a challenge. We reimplemented DVMA allocation using the Solaris kmem_cache and vmem facilities, with object size and quanta chosen to match common DMA transfer sizes. This provides a good balance between contention-free per-CPU caching, and redistribution of free space in the back end magazine and slab layers. We also modified drivers to use DMA pools more efficiently, and we modified the IOMMU code so that 2GB of VA is available per PCI function, rather than per PCI root port.

The net result for the end user is higher device throughput and/or lower CPU utilization per unit of throughput on larger systems.


Tools

The very tools we use to analyze scalability may exhibit problems themselves, because they must collect data for all the entities on a system. We noticed that mpstat was consuming so much CPU time on large systems that it could not sample at 1 second intervals and was falling behind. mpstat collects data for all CPUs in every interval, but 1536 CPUs is not a large number to handle in 1 second, so something was amiss. Profiling showed the time was spent searching for per-CPU kstats (see kstat(3KSTAT)), and every lookup searched the entire kc_chain linked list of all kstats. Since the number of kstats grows with NCPU, the overall algorithm takes time O(NCPU^2), which explodes on the larger systems. We modified the kstat library to build a hash table when kstats are opened, and re-implemented kstat_lookup() on that. This reduced CPU consumption by 8X on our "small" 512-CPU test system, and improves the performance of all tools that are based on libkstat, including mpstat, vmstat, iostat, and sar.
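To make the quadratic cost concrete, here is a minimal Python sketch of the two lookup strategies. It is not the libkstat code: the Kstat record, the three per-CPU stat names, and the function names are hypothetical stand-ins, and a Python list plays the role of the kc_chain linked list.

```python
from collections import namedtuple

# Hypothetical stand-in for a kstat record; not the real libkstat structure.
Kstat = namedtuple("Kstat", ["module", "instance", "name"])

def build_chain(ncpu, per_cpu_stats=("sys", "vm", "intrstat")):
    """Build a flat chain of per-CPU records; its length grows with ncpu."""
    return [Kstat("cpu", cpu, name)
            for cpu in range(ncpu) for name in per_cpu_stats]

def lookup_linear(chain, module, instance, name):
    """Walk the chain from the head; return (record, entries examined)."""
    for steps, ks in enumerate(chain, start=1):
        if (ks.module, ks.instance, ks.name) == (module, instance, name):
            return ks, steps
    return None, len(chain)

def build_index(chain):
    """Build a hash index once, at open time, keyed by the lookup triple."""
    return {(ks.module, ks.instance, ks.name): ks for ks in chain}

def entries_examined_per_interval(chain, ncpu):
    """One sampling interval: one linear lookup per CPU."""
    return sum(lookup_linear(chain, "cpu", cpu, "sys")[1]
               for cpu in range(ncpu))

for ncpu in (64, 512):
    chain = build_chain(ncpu)
    index = build_index(chain)
    # The indexed lookup finds the same record with a single hash probe.
    assert index[("cpu", ncpu - 1, "sys")] is chain[-3]
    print(ncpu, "CPUs:", entries_examined_per_interval(chain, ncpu),
          "chain entries examined per interval with linear lookup")
```

In this toy model, going from 64 to 512 CPUs (8X) multiplies the linear-lookup work by about 64X, while the indexed version does one probe per CPU.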

Even dtrace is not immune. When a script starts, dtrace allocates multi-megabyte trace buffers for every CPU in the domain, using a single thread, and frees the buffers on script termination using a single thread. On a T3-4 with 512 CPUs, it took 30 seconds to run a null D script. Even worse, the allocation is done while holding the global cpu_lock, which serializes the startup of other D scripts, and causes long pauses in the output of some stat commands that briefly take cpu_lock while sampling. We fixed this in Solaris 11.1 by allocating and freeing the trace buffers in parallel using vmtasks, and by hoisting allocation out of the cpu_lock critical path.

Large scale can impact the usability of a tool. Some stat tools produce a row of output per CPU in every sampling interval, making it hard to spot important clues in the torrent of data. In Solaris 11.1, we provide new aggregation and sorting options for the mpstat, cpustat, and trapstat commands that allow the user to make sense of the data. For example, the command

  mpstat -k intr -A 4 -m 10 5
sorts CPUs by the interrupts metric, partitions them into quartiles, and aggregates each quartile into a single row by computing the mean column values within each. See the man pages for details.
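The sort, partition, and aggregate steps can be sketched in a few lines of Python. This only illustrates the idea and is not how mpstat is implemented; the sample data, the column names, and the descending sort order are assumptions.

```python
from statistics import mean

def aggregate_by_metric(rows, key, groups):
    """Sort rows by `key` (descending, an assumption), split them into
    `groups` nearly equal partitions, and return one row of column means
    per partition."""
    ordered = sorted(rows, key=lambda r: r[key], reverse=True)
    n = len(ordered)
    result = []
    for g in range(groups):
        part = ordered[g * n // groups:(g + 1) * n // groups]
        if not part:
            continue
        summary = {c: mean(r[c] for r in part) for c in part[0] if c != "cpu"}
        summary["ncpu"] = len(part)   # how many CPUs this row represents
        result.append(summary)
    return result

# Hypothetical samples: 8 CPUs, with interrupts concentrated on CPUs 0 and 1.
samples = [{"cpu": i, "intr": intr, "sys": sys_pct}
           for i, (intr, sys_pct) in enumerate(
               [(900, 40), (700, 30), (20, 5), (10, 5),
                (8, 4), (4, 2), (2, 1), (0, 1)])]

for row in aggregate_by_metric(samples, "intr", 4):
    print(row)
```

Eight rows collapse to four, and the top quartile immediately stands out as the one handling nearly all interrupts.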


Reboot

Large servers take longer to reboot than small servers. Why? They must initialize more CPUs, memory, and devices, but much of the shutdown and startup code in firmware and the kernel is single threaded. We are addressing that. On shutdown, Solaris now scans memory in parallel to look for dirty pages that must be flushed to disk. The sun4v hypervisor zeroes a domain's memory in parallel, using CPUs that are physically closest to memory for maximum bandwidth. On startup, Solaris VM initializes per-page metadata using SPARC cache initializing block stores, which speeds metadata initialization by more than 2X. We also fixed an O(NCPU^2) algorithm in bringing CPUs online, and an O(NCPU) algorithm in reclaiming memory from firmware. In total, we have reduced the reboot time for M5-32 systems by many minutes, and we continue to work on optimizations in this area.

In these few short posts, I have summarized the work of many people over a period of years that has pushed Solaris to new heights of scalability, and I look forward to seeing what our customers will do with their massive T5-8 and M5-32 systems. However, if you have seen the SPARC processor roadmap, you know that our work is not done. Onward and upward!

Friday Apr 05, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 2

Last time, I outlined the general issues that must be addressed to achieve operating system scalability. Next I will provide more detail on what we modified in Solaris to reach the M5-32 scalability level. We worked in most of the major areas of Solaris, including Virtual Memory, Resources, Scheduler, Devices, Tools, and Reboot. Today I cover VM and resources.

Virtual Memory

When a page of virtual memory is freed, the virtual to physical address translation must be deleted from the MMU of all CPUs which may have accessed the page. On Solaris, this is implemented by posting a software interrupt known as an xcall to each target CPU. This "TLB shootdown" operation poses one of the thorniest scalability challenges in the VM area, as a single-threaded process may have migrated and run on all the CPUs in a domain, and a multi-threaded process may run threads on all CPUs concurrently. This is a frequent cause of sub-optimal scaling when porting an application from a small to a large server, for a wide variety of systems and vendors.

The T5 and M5 processors provide hardware acceleration for this operation. A single PIO write (an ASI write in SPARC parlance) can demap a VA in all cores of a single socket. Solaris need only send an xcall to one CPU per socket, rather than sending an xcall to every CPU. This achieves a 48X reduction in xcalls on M5-32, and a 128X reduction in xcalls on T5-8, for mappings such as kernel pages that are used on every CPU. For user page mappings, one xcall is sent to each socket on which the process runs. The net result is that the cost of demap operations in dynamic memory workloads is not measurably higher on large T5 and M5 systems than on small.
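The reduction factors are simply the number of CPUs per socket. Here is the arithmetic as a small Python check; the CPU counts are the ones quoted in this series, while the socket count for the M5-32 (32) is an assumption taken from the model name.

```python
def xcall_targets(ncpu, nsockets):
    """Return (xcalls with one per CPU, xcalls with one per socket,
    reduction factor) for a mapping that is in use on every CPU."""
    return ncpu, nsockets, ncpu // nsockets

# CPU counts are from this series; the M5-32 socket count is assumed.
for name, ncpu, nsockets in (("M5-32", 1536, 32), ("T5-8", 1024, 8)):
    per_cpu, per_socket, factor = xcall_targets(ncpu, nsockets)
    print(f"{name}: {per_cpu} xcalls -> {per_socket} xcalls ({factor}X fewer)")
```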

The VM2 project re-implemented the physical page management layer in Solaris 11.1, and offers several scalability benefits. It manages a large page as a single unit, rather than as a collection of contained small pages, which reduces the cost of allocating and freeing large pages. It predicts the demand for large pages and proactively defragments physical memory to build more, reducing delays when an application page faults and needs a large page. These enhancements make it practical for Solaris to use a range of large page sizes, in every segment type, which maximizes run-time efficiency of large memory applications. VM2 also allows kernel memory to be allocated near any socket. Previously, kernel memory was confined to a single "kernel cage", a physically contiguous region that often fit in the memory connected to a single socket, which could become a memory hot spot for kernel intensive workloads. Spreading kernel memory across sockets reduces hot spots, and also allows kernel data such as DMA buffers to be allocated near threads or devices for lower latency and higher bandwidth.

The VM system manages certain resources on a per-domain basis, in units of pages. These include swap space, locked memory, and reserved memory, among others. These quantities are adjusted when a page is allocated, freed, locked, and unlocked. Each is represented by a global counter protected by a global lock. The lock hold times are small, but at some CPU count they become bottlenecks. How does one scale a global counter? Using a new data structure I call the Credit Tree, which provides O(K * log(NCPU)) allocation performance with a very small constant K. I will describe it in a future posting. We replaced the VM system's global counters with credit trees in S11.1, and achieved a 45X speedup on an mmap() microbenchmark on T4-4 with 256 CPUs. This is good for the Oracle database, because it uses mmap() and munmap() to dynamically allocate space for its per-process PGA memory.
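To show why a global counter behind a global lock is a bottleneck, and one well-known way to relieve it, here is a Python sketch of a two-level scheme in which each CPU caches a batch of credit locally. This is explicitly NOT the Credit Tree, which is not described in this post; the class, its method names, and the batch size are all hypothetical.

```python
import threading

class CachedCounter:
    """A global resource pool fronted by per-CPU credit caches.
    A simpler stand-in for illustration; not the Credit Tree."""

    def __init__(self, total, ncpu, batch):
        self.global_free = total
        self.global_lock = threading.Lock()
        self.local = [0] * ncpu
        self.local_locks = [threading.Lock() for _ in range(ncpu)]
        self.batch = batch
        self.global_ops = 0          # how often the global lock was taken

    def _take_global(self, want):
        with self.global_lock:
            self.global_ops += 1
            got = min(want, self.global_free)
            self.global_free -= got
            return got

    def alloc(self, cpu, n=1):
        """Reserve n units, refilling the local cache in batches."""
        with self.local_locks[cpu]:
            if self.local[cpu] < n:
                want = max(self.batch, n - self.local[cpu])
                self.local[cpu] += self._take_global(want)
            if self.local[cpu] < n:
                return False         # credit cached on other CPUs is not reclaimed here
            self.local[cpu] -= n
            return True

    def free(self, cpu, n=1):
        """Return n units locally; spill to the global pool when the cache is large."""
        with self.local_locks[cpu]:
            self.local[cpu] += n
            if self.local[cpu] > 2 * self.batch:
                spill = self.local[cpu] - self.batch
                self.local[cpu] -= spill
                with self.global_lock:
                    self.global_ops += 1
                    self.global_free += spill

counter = CachedCounter(total=1000, ncpu=4, batch=64)
for _ in range(100):
    counter.alloc(cpu=0)
print("global lock acquisitions for 100 allocations:", counter.global_ops)
```

The sketch takes the global lock twice for 100 allocations instead of 100 times. Its weakness is that credit cached on one CPU is invisible to the others, so an allocation can fail while free credit exists elsewhere; a production design has to redistribute that credit.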

The virtual address space is a finite resource that must be partitioned carefully to support large memory systems. 64 bits of VA is sufficient, but we had to adjust the kernel's VA's to support a larger heap and more physical memory pages, and adjust process VA's to support larger shared memory segments (e.g., for the Oracle SGA).

Lastly, we reduced contention on various locks by increasing lock array sizes and improving the object-to-lock hash functions.

Resource Limits

Solaris limits the number of processes that can be created to prevent metadata such as the process table and the proc_t structures from consuming too much kernel memory. This is enforced by the tunables maxusers, max_nprocs, and pidmax. The default for the latter was 30000, which is too small for M5-32 with 1536 CPUs, allowing only 20 processes per CPU. As of Solaris 11.1, the default for these tunables automatically scales up with CPU count and memory size, to a maximum of 999999 processes. You should rarely if ever need to change these tunables in /etc/system, though that is still allowed.

Similarly, Solaris limits the number of threads that can be created, by limiting the space reserved for kernel thread stacks with the segkpsize tunable, whose default allowed approximately 64K threads. In Solaris 11.1, the default scales with CPU and memory to a maximum of 1.6M threads.

Next time: Scheduler, Devices, Tools, and Reboot.

Tuesday Apr 02, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 1

How do you scale a general purpose operating system to handle a single system image with 1000's of CPUs and 10's of terabytes of memory? You start with the scalable Solaris foundation. You use superior tools such as Dtrace to expose issues, quantify them, and extrapolate to the future. You pay careful attention to computer science, data structures, and algorithms, when designing fixes. You implement fixes that automatically scale with system size, so that once exposed, an issue never recurs in future systems, and the set of issues you must fix in each larger generation steadily shrinks.

The T5-8 has 8 sockets, each containing 16 cores of 8 hardware strands each, which Solaris sees as 1024 CPUs to manage. The M5-32 has 1536 CPUs and 32 TB of memory. Both are many times larger than the previous generation of Oracle T-class and M-class servers. Solaris scales well on that generation, but every leap in size exposes previously benign O(N) and O(N^2) algorithms that explode into prominence on the larger system, consuming excessive CPU time, memory, and other resources, and limiting scalability. To find these, knowing what to look for helps. Most OS scaling issues can be categorized as CPU issues, memory issues, device issues, or resource shortage issues.

CPU scaling issues include:

  • increased lock contention at higher thread counts
  • O(NCPU) and worse algorithms
Lock contention is addressed using fine grained locking based on domain decomposition or hashed lock arrays, and the number of locks is automatically scaled with NCPU for a future-proof solution. O(NCPU^2) algorithms are often the result of naive data structures, or interactions between sub-systems each of which does O(N) work, and once recognized can be recoded easily enough with an adequate supply of caffeine. O(NCPU) algorithms are often the result of a single thread managing resources that grow with machine size, and the solution is to apply parallelism. A good example is the use of vmtasks for shared memory allocation.
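A hashed lock array that sizes itself from the CPU count can be sketched as follows in Python. The class name, the locks-per-CPU ratio, and the multiplicative hash constant are illustrative assumptions, not the values used in Solaris.

```python
import os
import threading

class HashedLockArray:
    """An array of locks sized from the CPU count; objects hash to a lock."""

    def __init__(self, ncpu=None, locks_per_cpu=4):
        ncpu = ncpu or os.cpu_count() or 1
        # Round up to a power of two so the hash can be reduced with a mask.
        size = 1
        while size < ncpu * locks_per_cpu:
            size *= 2
        self.mask = size - 1
        self.locks = [threading.Lock() for _ in range(size)]

    def lock_for(self, obj_id):
        """Map an object id (for example an address) to one lock."""
        # Multiplicative hash: mixes the bits so that ids with a regular
        # stride, such as page-aligned addresses, spread across the array.
        h = (obj_id * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
        return self.locks[(h >> 32) & self.mask]

locks = HashedLockArray(ncpu=16)
refcount = {}

def hold(page_addr):
    """Update per-page state under the lock that covers that page."""
    with locks.lock_for(page_addr):
        refcount[page_addr] = refcount.get(page_addr, 0) + 1

for addr in range(0, 64 * 4096, 4096):
    hold(addr)
print(len(locks.locks), "locks protect", len(refcount), "pages")
```

Because the array size is derived from the CPU count rather than a constant, a larger machine gets more locks automatically, which is the future-proofing described above.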

Memory scaling issues include:

  • working sets that exceed VA translation caches
  • unmapping translations in all CPUs that access a memory page
  • O(memory) algorithms
  • memory hotspots
Virtual to physical address translations are cached at multiple levels in hardware and software, from TLB through TSB and HME on SPARC. A miss in the smaller lower level caches requires a more costly lookup at the higher level(s). Solaris maximizes the span of each cache and minimizes misses by supporting shared MMU contexts, a range of hardware page sizes up to 2 GB, and the ability to use large pages in every type of memory segment: user, kernel, text, data, private, shared. Solaris uses a novel hardware feature of the T5 and M5 processors to unmap memory on a large number of CPUs efficiently. O(memory) algorithms are fixed using parallelism. Memory hotspots are fixed by avoiding false sharing and spreading data structures across caches and memory controllers.

Device scaling issues include:

  • O(Ndevice) and worse algorithms
  • system bandwidth limitations
  • lock contention in interrupt threads and service threads
The O(N) algorithms tend to be hit during administrative actions such as system boot and hot plug, and are fixed with parallelism and improved data structures. System bandwidth is maximized by spreading devices across PCI roots and system boards, by spreading DMA buffers across memory controllers, and by co-locating DMA buffers with either the producer or consumer of the data. Lock contention is a CPU scaling issue.

Resource shortages occur when too many CPUs compete for a finite set of resources. Sometimes the resource limit is artificial and defined by software, such as for the maximum process and thread count, in which case the fix is to scale the limit automatically with NCPU. Sometimes the limit is imposed by hardware, such as for the number of MMU contexts, and the fix requires more clever resource management in software.

Next time I will provide more details on new Solaris improvements in all of these areas that enable superior performance and scaling on T5 and M5 systems. Stay tuned.

Thursday Nov 08, 2012

Faster Memory Allocation Using vmtasks

You may have noticed a new system process called "vmtasks" on Solaris 11 systems:

    % pgrep vmtasks
    % prstat -p 8
         8 root        0K    0K sleep   99  -20   9:10:59 0.0% vmtasks/32

What is vmtasks, and why should you care? In a nutshell, vmtasks accelerates creation, locking, and destruction of pages in shared memory segments. This is particularly helpful for locked memory, as creating a page of physical memory is much more expensive than creating a page of virtual memory. For example, an ISM segment (shmflg & SHM_SHARE_MMU) is locked in memory on the first shmat() call, and a DISM segment (shmflg & SHM_PAGEABLE) is locked using mlock() or memcntl(). Segment operations such as creation and locking are typically single threaded, performed by the thread making the system call. In many applications, the size of a shared memory segment is a large fraction of total physical memory, and the single-threaded initialization is a scalability bottleneck which increases application startup time.

To break the bottleneck, we apply parallel processing, harnessing the power of the additional CPUs that are always present on modern platforms. For sufficiently large segments, as many as 16 threads of vmtasks are employed to assist an application thread during creation, locking, and destruction operations. The segment is implicitly divided at page boundaries, and each thread is given a chunk of pages to process. The per-page processing time can vary, so for dynamic load balancing, the number of chunks is greater than the number of threads, and threads grab chunks dynamically as they finish their work. Because the threads modify a single application address space in a compressed time interval, contention on the locks protecting VM data structures was a problem, and we had to re-scale a number of VM locks to get good parallel efficiency. The vmtasks process has 1 thread per CPU and may accelerate multiple segment operations simultaneously, but each operation gets at most 16 helper threads to avoid monopolizing CPU resources. We may reconsider this limit in the future.
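The chunking scheme can be sketched in user-level Python as follows. This is an analogy for the work distribution only, not the kernel implementation: the function name, the chunks-per-thread ratio, and the use of a queue are assumptions, and Python threads will not show the real speedup.

```python
import queue
import threading

def process_segment(npages, nthreads, chunks_per_thread, work):
    """Split [0, npages) into more chunks than threads; each helper grabs
    the next chunk when it finishes its current one. Returns the number of
    pages each helper ended up processing."""
    nchunks = nthreads * chunks_per_thread
    chunk = max(1, -(-npages // nchunks))        # ceiling division
    todo = queue.Queue()
    for start in range(0, npages, chunk):
        todo.put((start, min(start + chunk, npages)))

    done = [0] * nthreads                        # one slot per helper, no sharing

    def helper(tid):
        while True:
            try:
                start, end = todo.get_nowait()   # grab the next free chunk
            except queue.Empty:
                return                           # no work left
            for page in range(start, end):
                work(page)
            done[tid] += end - start

    threads = [threading.Thread(target=helper, args=(t,))
               for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

touched = [0] * 1000

def touch(page):
    touched[page] += 1       # each page belongs to exactly one chunk

per_helper = process_segment(npages=1000, nthreads=4, chunks_per_thread=8,
                             work=touch)
print("pages per helper:", per_helper)
```

A helper that draws slow pages simply takes fewer chunks, so no thread sits idle while others are still working, which is the point of having more chunks than threads.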

Acceleration using vmtasks is enabled out of the box, with no tuning required, and works for all Solaris platform architectures (SPARC sun4u, SPARC sun4v, x86).

The following tables show the time to create + lock + destroy a large segment, normalized as milliseconds per gigabyte, before and after the introduction of vmtasks:

    ISM
        system     ncpu    before      after   speedup
        ------     ----    ------      -----   -------
        x4600      32      1386        245     6X
        X7560      64      1016        153     7X
        M9000      512     1196        206     6X
        T5240      128     2506        234     11X
        T4-2       128     1197        107     11X

    DISM
        system     ncpu    before      after   speedup
        ------     ----    ------      -----   -------
        x4600      32      1582        265     6X
        X7560      64      1116        158     7X
        M9000      512     1165        152     8X
        T5240      128     2796        198     14X

(I am missing the data for T4 DISM, for no good reason; it works fine).

The following table separates the creation and destruction times:

    ISM, T4-2
                  before    after  
                  ------    -----
        create    702       64
        destroy   495       43

To put this in perspective, consider creating a 512 GB ISM segment on T4-2. Creating the segment would take 6 minutes with the old code, and only 33 seconds with the new. If this is your Oracle SGA, you save over 5 minutes when starting the database, and you also save when shutting it down prior to a restart. Those minutes go directly to your bottom line for service availability.
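The arithmetic behind those figures, using the per-gigabyte creation times for ISM on T4-2 from the table above (the function name is just for illustration):

```python
def segment_time_seconds(size_gb, ms_per_gb):
    """Time to process a segment, given a normalized cost in msec per GB."""
    return size_gb * ms_per_gb / 1000.0

before = segment_time_seconds(512, 702)   # create cost before vmtasks
after = segment_time_seconds(512, 64)     # create cost with vmtasks
print(f"before: {before / 60:.1f} min, after: {after:.0f} s, "
      f"saved: {(before - after) / 60:.1f} min")
```

This prints a creation time of about 6 minutes before, 33 seconds after, and a saving of roughly 5.4 minutes.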

Wednesday Oct 31, 2012

High Resolution Timeouts

The default resolution of application timers and timeouts is now 1 msec in Solaris 11.1, down from 10 msec in previous releases. This improves out-of-the-box performance of polling and event based applications, such as ticker applications, and even the Oracle rdbms log writer. More on that in a moment.

As a simple example, the poll() system call takes a timeout argument in units of msec:

System Calls                                              poll(2)
     poll - input/output multiplexing
     int poll(struct pollfd fds[], nfds_t nfds, int timeout);

In Solaris 11, a call to poll(NULL,0,1) returns in 10 msec, because even though a 1 msec interval is requested, the implementation rounds to the system clock resolution of 10 msec. In Solaris 11.1, this call returns in 1 msec.
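You can observe this on your own system with a few lines of Python, which reaches poll() through the standard select module. This is a rough sketch: the rounding function is a simplified model of the behavior described above, the helper names are hypothetical, and select.poll() is not available on every platform.

```python
import math
import select
import time

def rounded_timeout_ms(requested_ms, resolution_ms):
    """Simplified model: round a timeout up to a multiple of the resolution."""
    return math.ceil(requested_ms / resolution_ms) * resolution_ms

def measure_poll_ms(timeout_ms=1, iterations=20):
    """Return the shortest observed duration of an empty poll() call, in msec."""
    poller = select.poll()      # no descriptors registered, like poll(NULL, 0, t)
    best = float("inf")
    for _ in range(iterations):
        start = time.monotonic()
        poller.poll(timeout_ms)
        best = min(best, (time.monotonic() - start) * 1000.0)
    return best

print("10 msec resolution:", rounded_timeout_ms(1, 10), "msec for a 1 msec request")
print(" 1 msec resolution:", rounded_timeout_ms(1, 1), "msec for a 1 msec request")
print(f"shortest measured poll(1): {measure_poll_ms():.2f} msec on this system")
```

Taking the minimum over several iterations filters out scheduling noise, so the measured value approximates the effective timeout resolution.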

In specification lawyer terms, the resolution of CLOCK_REALTIME, introduced by POSIX.1b real time extensions, is now 1 msec. The function clock_getres(CLOCK_REALTIME,&res) returns 1 msec, and any library calls whose man pages explicitly mention CLOCK_REALTIME, such as nanosleep(), are subject to the new resolution. Additionally, many legacy functions that pre-date POSIX.1b and do not explicitly mention a clock domain, such as poll(), are subject to the new resolution. Here is a fairly comprehensive list:

      pthread_mutex_timedlock pthread_mutex_reltimedlock_np
      pthread_rwlock_timedrdlock pthread_rwlock_reltimedrdlock_np
      pthread_rwlock_timedwrlock pthread_rwlock_reltimedwrlock_np
      mq_timedreceive mq_reltimedreceive_np
      mq_timedsend mq_reltimedsend_np
      sem_timedwait sem_reltimedwait_np
      poll select pselect
      _lwp_cond_timedwait _lwp_cond_reltimedwait
      semtimedop sigtimedwait
      aiowait aio_waitn aio_suspend
      port_get port_getn
      cond_timedwait cond_reltimedwait
      setitimer (ITIMER_REAL)
      misc rpc calls, misc ldap calls
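To check what your system reports for CLOCK_REALTIME, you can call clock_getres() through Python's time module. This is a minimal sketch; the helper name is hypothetical, and the value reported depends on the platform and release.

```python
import time

def realtime_resolution_ms():
    """Resolution reported for CLOCK_REALTIME, converted to msec."""
    return time.clock_getres(time.CLOCK_REALTIME) * 1000.0

print(f"CLOCK_REALTIME resolution: {realtime_resolution_ms():g} msec")
```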

This change in resolution was made feasible because we made the implementation of timeouts more efficient a few years back when we re-architected the callout subsystem of Solaris. Previously, timeouts were tested and expired by the kernel's clock thread which ran 100 times per second, yielding a resolution of 10 msec. This did not scale, as timeouts could be posted by every CPU, but were expired by only a single thread. The resolution could be changed by setting hires_tick=1 in /etc/system, but this caused the clock thread to run at 1000 Hz, which made the potential scalability problem worse. Given enough CPUs posting enough timeouts, the clock thread could be a performance bottleneck. We fixed that by re-implementing the timeout as a per-CPU timer interrupt (using the cyclic subsystem, for those familiar with Solaris internals). This decoupled the clock thread frequency from timeout resolution, and allowed us to improve default timeout resolution without adding CPU overhead in the clock thread.

Here are some exceptions for which the default resolution is still 10 msec.

  • The thread scheduler's time quantum is 10 msec by default, because preemption is driven by the clock thread (plus helper threads for scalability). See for example dispadmin, priocntl, fx_dptbl, rt_dptbl, and ts_dptbl. This may be changed using hires_tick.
  • The resolution of the clock_t data type, primarily used in DDI functions, is 10 msec. It may be changed using hires_tick. These functions are only used by developers writing kernel modules.
  • A few functions that pre-date POSIX CLOCK_REALTIME mention _SC_CLK_TCK, CLK_TCK, "system clock", or no clock domain. These functions are still driven by the clock thread, and their resolution is 10 msec. They include alarm, pcsample, times, clock, and setitimer for ITIMER_VIRTUAL and ITIMER_PROF. Their resolution may be changed using hires_tick.

Now back to the database. How does this help the Oracle log writer? Foreground processes post a redo record to the log writer, which releases them after the redo has committed. When a large number of foregrounds are waiting, the release step can slow down the log writer, so under heavy load, the foregrounds switch to a mode where they poll for completion. This scales better because every foreground can poll independently, but at the cost of waiting the minimum polling interval. That was 10 msec, but is now 1 msec in Solaris 11.1, so the foregrounds process transactions faster under load. Pretty cool.

Friday Oct 26, 2012

Solaris 11.1 Performance

Solaris 11.1 has many improvements in performance and scalability, some of which I worked on and can finally talk about.  Check this space over the next few weeks for new postings.

Tuesday Jul 19, 2011

Fast Crash Dump

You may have noticed the new system crash dump file vmdump.N that was introduced in Solaris 10 9/10. However, you perhaps did not notice that the crash dump is generated much more quickly than before, reducing your down time by many minutes on large memory systems, by harnessing parallelism and high compression at dump time. In this entry, I describe the Fast Crash optimizations that Dave Plauger and I added to Solaris.

In the previous implementation, if a system panics, the panic thread freezes all other CPUs and proceeds to copy system memory to the dump device. By default, only kernel pages are saved, which is usually sufficient to diagnose a kernel bug, but that can be changed with the dumpadm(1M) command. I/O is the bottleneck, so the panic thread compresses pages on the fly to reduce the data to be written. It uses lzjb compression, which provides decent compression at a reasonable CPU utilization. When the system reboots, the single-threaded savecore(1M) process reads the dump device, uncompresses the data, and creates the crash dump files vmcore.N and unix.N, for a small integer N.

Even with lzjb compression, writing a crash dump on systems with gigabytes to terabytes of memory takes a long time. What if we use stronger compression to further reduce the amount of data to write? The following chart compares the compression ratio of lzjb vs bzip2 for 42 crash dumps picked at random from our internal support site.

bzip2 compresses 2x more than lzjb for most cases, and in the extreme case, bzip2 achieves a 39X compression vs 9X for lzjb. (We also tested gzip levels 1 through 9, and they fall in between the two.) Thus we could reduce the disk I/O time for crash dump by using bzip2. The catch is that bzip2 requires significantly more CPU time than lzjb per byte compressed, some 20X to 40X more on the SPARC and x86 CPUs we tested, so introducing bzip2 in a single threaded dump would be a net loss. However, we hijack the frozen CPUs to compress different ranges of physical memory in parallel. The panic thread traverses physical memory in 4 MB chunks, mapping each chunk and passing its address to a helper CPU. The helper compresses the chunk to an output buffer, and passes the result back to the panic thread, which writes it to disk. This is implemented in a pipelined, dataflow fashion such that the helper CPUs are kept busy compressing the next batch of data while the panic thread writes the previous batch of data to disk.
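The pipeline can be imitated at user level in Python with the standard bz2 module. This is only an analogy: the function name and helper count are assumptions, and the real dump code cannot use threads or kernel services at panic time.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4 * 1024 * 1024     # 4 MB, the chunk size mentioned above

def parallel_dump(memory, write, helpers=4, chunk=CHUNK):
    """Compress `memory` chunk by chunk on helper threads and hand the
    compressed chunks to `write` in their original order."""
    offsets = range(0, len(memory), chunk)
    with ThreadPoolExecutor(max_workers=helpers) as pool:
        # map() queues every chunk up front and yields results in input
        # order, so helpers keep compressing later chunks while the caller
        # is busy writing earlier ones.
        compressed = pool.map(
            lambda off: bz2.compress(memory[off:off + chunk]), offsets)
        for blob in compressed:
            write(blob)

# Stand-in for kernel memory: 1 MB of repetitive data, dumped in 256 KB chunks.
image = bytes(range(256)) * 4096
dump = []
parallel_dump(image, dump.append, helpers=2, chunk=256 * 1024)
print(len(image), "bytes in,", sum(len(b) for b in dump), "bytes written in",
      len(dump), "chunks")
```

A single writer consumes results in order while several compressors run ahead of it, which mirrors the roles of the panic thread and the helper CPUs.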

We dealt with several practical problems to make this work. Each helper CPU needs several MB of buffer space to run the bzip2 algorithm, which really adds up for 100's of CPUs, and we did not want to statically reserve that much memory per domain. Instead, we scavenge memory that is not included in the dump, such as userland pages in a kernel-only dump. Also, during a crash dump, only the panic thread is allowed to use kernel services, because the state of kernel data structures is suspect and concurrency is not safe. Thus the panic thread and helper CPUs must communicate using shared memory and spin locks only.

The speedup of parallel crash dump versus the serial dump depends on compression factor, CPU speed, and disk speed, but here are a few examples. These are "kernel only" dumps, and the dumpsize column below is the uncompressed kernel size. The disk is either a raw disk or a simple ZFS zvol, with no striping. Before is the time for a serial dump, and after is the time for a parallel dump, measured from the "halt -d" command to the last write to the dump device.

    system  NCPU  disk  dumpsize  compression  before  after  speedup
                          (GB)                 mm:ss   mm:ss
    ------  ----  ----  --------  -----------  -----   -----  -------
    M9000   512   zvol    90.6      42.0       28:30    2:03    13.8X
    T5440   256   zvol    19.4       7.2       27:21    4:29     6.1X
    x4450    16   raw      4.9       6.7        0:22    0:07     3.2X
    x4600    32   raw     14.6       3.1        3:47    1:46     2.1X

The higher compression, performed concurrently with I/O, gives a significant speedup, but we are still I/O limited, and future speedup will depend on improvements in I/O. For example, a striped zvol is not supported as a dump device, but that would help. You can use hardware raid to configure a faster dump device.

We also optimized live dumps, which are crash dumps that are generated with the "savecore -L" command, without stopping the system. This is useful for diagnosing systems that are misbehaving in some way but are still performing adequately, without interrupting service. We fixed a bug (CR 6878030) in which live dump writes were broken into 8K physical transfers, giving terrible I/O throughput. The live dumps could take hours on large systems, making them practically unusable. We obtained these speedups for various dump sizes:

    system disk  before   after  speedup
                 h:mm:ss  mm:ss
    ------ ----  -------  -----  -------
    M9000  zvol  3:28:10  15:06  13.8X
    T5440  zvol    23:29   1:19  18.6X
    T5440  raw      9:17   0:55  10.5X

Lastly, we optimized savecore, which runs when the system boots after a panic, and copies data from the dump device to a persistent file, such as /var/crash/hostname/vmdump.N. savecore is 3X to 5X faster because the dump device contents are more compressed, so there are fewer bytes to copy, and because it no longer uncompresses the dump by default. That makes more sense if you need to send the vmdump across the internet. To uncompress by default, use "dumpadm -z off". To uncompress a vmdump.N file and produce vmcore.N and unix.N, which are required for using tools such as mdb, run the "savecore -f" command. We multi-threaded savecore to uncompress in parallel.

Having said all that, I sincerely hope you never see a crash dump on Solaris! The best way to reduce downtime is for the system to stay up. The Sun Ray server I am working on now has been up for 160 days since its previous planned downtime.


Steve Sistare

