Friday Aug 26, 2005

Least Privilege Chapter done

This week I completed a draft of the Least Privilege chapter, based on some technical content from Casper Dik. It's added about 40 pages to the book. Darren Moffat is working on a complementary chapter on the Solaris 10 crypto framework too, so this edition will have a Security section...


Chasing the San Francisco Fog

Sometimes I think we're a little crazy, given the amount of effort invested in a simple, nonsensical goal like this one: trying to capture the essence of the fog around San Francisco...

Recently, we ascended the higher vantage points of the Marin Headlands, looking for the Golden Gate Bridge nestled in a blanket of fog.

Here you can see Joost Pronk waiting patiently for the fog to lower and reveal the perfect image...

After getting sunburned, we left with no such pictures. We were entertained for quite some time by interesting implementations of chaos theory at work:

So this week's pictures are two of the city, but not actually what we set out to get:

Friday Aug 19, 2005

This week's picture

Phil Harman and I took a quick trip to Yosemite earlier this year, and stumbled across this scene with the sun setting over the windmills at Altamont Pass, just east of the bay area.

Friday Aug 12, 2005

File System Performance Benchmarks (FileBench) added to Performance Community

For those who have been following FileBench, we've finally Open Sourced it (FileBench is our file system performance measurement and benchmark framework). We've started a community web page and discussion within the OpenSolaris performance community for this topic.

We're working on some Linux ext3/reiser comparisons, which will serve partly as a workload validation experiment; these should be available soon.

Documentation is being worked on as we speak. I'll be providing some worked examples for reference in the near future.


Thursday Aug 11, 2005

Photo of the week

It's about time I added something to that "Photography" category...

I was in San Francisco early one morning last week, and snapped this scene which I found interesting.

Friday Jul 15, 2005

New look for Solaris Internals Website

Well, one thing that having a blog with any significant technical content forces you to do is understand cascading style sheets. As a result, I've also given Solaris Internals a CSS makeover with our OpenSolaris theme.

I found a picture that my good friend Bill Walker took of our two cars parked together in the Grand Canyon, and included that too.

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Tuesday Jul 12, 2005

Performance Improvements for the File System Cache in Solaris 10

One of the major improvements in Solaris 10 was the overhaul of the file system cache. Here is a quick summary of the performance changes, which I measured recently using FileBench.

The file system caching mechanism was overhauled to eliminate the majority of the MMU overhead associated with the page cache. The new segkpm facility provides fast mappings for physical pages within the kernel's address space. It is used by the file systems to provide a virtual address that can be used to copy data to and from the user's address space for file system I/O.

Since the available virtual address range in a 64-bit kernel is always larger than physical memory, the entire physical memory can be mapped into the kernel. This eliminates the need to map and unmap pages every time they are accessed via segmap, significantly reducing the code path and the need for TLB shootdowns. In addition, segkpm can use large TLB mappings to minimize TLB miss overhead.

Some quick measurements on Solaris 8 and 10 show the code path reduction. There are three important cases:

  • Small In-Memory: When the file fits in memory, and is smaller than the size of the segmap cache
  • Large In-Memory: When the file fits in memory, and is larger than the size of the segmap cache
  • Large Non-Cached: When the file does not fit in memory

Measurements show that the most significant gain is a reduction in CPU time per system call when a file is in memory (fits in the page cache) and is larger than segmap (which is 12% of physmem on SPARC, and 64MB on x64). Importantly, this is the most common case I've seen with real applications, too.

Random Read of 8k                                            Solaris 8   Solaris 10
Small In-Memory (less than segmap)                               89us         83us
Large In-Memory (greater than segmap, less than physmem)        181us         98us
Large Non-Cached                                                236us        153us

The throughput gains for a CPU-constrained system are shown here:

With Solaris 10 you should expect significant improvements for applications which are I/O intensive into the file system cache. The actual improvement will vary, and will be greater on systems with higher CPU counts. You should also expect to see the cross call count drop (see xcal in mpstat), and a significantly reduced amount of system time.

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Monday Jul 11, 2005

A terabyte of NAS for $800!

I just noticed these little devices at Fry's: the TeraStation

While the target market is no doubt CIFS/Windows, it does claim to support NFS (Appletalk and FTP, too).

It has 4 x 250GB drives in a RAID configuration, and gigabit ethernet as the primary transport.

Looks pretty cool on the surface, I wonder if it has a decent NFS implementation?

Friday Jul 08, 2005

Solaris Internals: 2nd Edition!

It's no secret that we hope to get an updated Solaris Internals book out. Jim and I have had this action item on our desks for a while. The good news is that it's been making quite a bit of progress of late!

The idea is to update the existing book from Solaris 7 to Solaris 10, leveraging OpenSolaris, DTrace, and mdb heavily. There's a lot to add, given the onslaught of development: substantially revised virtual memory, a new file system interface, a new threads model, zones, ZFS, least privilege, SMF, and the list goes on. We scoped adding all of this and found we'd have a 2000+ page book by the time we were done.

What we've decided to do is break the work into smaller deliverable chunks and publish it in parts. Yes, we're taking the Knuth approach: Solaris 10 Internals will have more than one volume. We're splitting some of the new material, and most of the performance discussion, out into the subsequent volume, and we're enlisting a few helpers for that volume to make it more of a community effort.

So that you can keep the pressure on us, I thought I'd share where we are with the current volume. Our target is to be done with this volume in the next couple of months.

Part                        Chapter                    Primary          Old pages  Target pages  Left
                            Preface                    JM                    6          7          2
I – Intro                   Introduction               Phil Harman/JM       36         40         10
II – Tools                  Introduction               JM                    2          2          2
                            DTrace                     Jon                   0         30          5
                            MDB                        RMC                   5          5          5
                            Kstat                      Boothby/RMC           0         15          0
III – Memory                VM Intro                   RMC                   6          6          0
                            VM Monitoring              RMC                  44         44          0
                            Large Pages                RMC                  14         14          0
                            Memory Arch                RMC                  36         36          0
                            Physical Mem Mngmnt        RMC                  20         20          0
                            HAT                        Tariq                12         20          7
                            Kernel Memory              RMC                  48         48          0
IV – Platform               Sync Intro                 JM                   16         16          5
                            Sync Impl                  JM                   16         16          4
                            NUMA/CMT                   RMC/Saxe/Chew        16         18          4
                            Kernel Services            JM                   37         38         20
                            Kernel Modules & Linker    JM                    0         20         20
V – Process Model           Process Model              JM                   48         48         20
                            Sched Classes & Disp       JM                   65         50         40
                            ProcFS                     JM                   22         22          6
                            Signals                    JM                   18         20         10
                            Resource Management        JM                    8         20         12
                            IPC                        JM                   48         48         10
VI – Files & File Systems   Files                      RMC                  40         40          4
                            Intro                      RMC                  18         22          6
                            FS Architecture            RMC                  46         70          0
                            UFS                        Shawn                24         30          6
                            NFS                        Spencer/Sameer        0         30          0
                            ZFS                        RMC                  20          0         20
Appendix A                  ELF File Format            JM                   12         12         12
Appendix B                  Kernel Maps                RMC                  12         12          0

Running page total: 819                                                               Left: 260

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Technorati Tag: DTrace

Tracing the Solaris 10 File System Interface

Here's a quick script to trace activity through the central file system interface. Until there is a general file system provider, this script should serve as a basic framework to help construct other file system tracing scripts.

# ./voptrace.d /tmp
Event           Device                                                Path  RW     Size   Offset
fop_putpage     -          /tmp//filebench/bin/i386/fastsu                   -     4096     4096
fop_inactive    -          /tmp//filebench/bin/i386/fastsu                   -        0        0
fop_putpage     -          /tmp//filebench/xanadu/WEB-INF/lib/classes12.jar  -     4096   204800
fop_inactive    -          /tmp//filebench/xanadu/WEB-INF/lib/classes12.jar  -        0        0
fop_putpage     -          /tmp/filebench1.63_s10_x86_sparc_pkg.tar.Z        -     4096  7655424
fop_inactive    -          /tmp/filebench1.63_s10_x86_sparc_pkg.tar.Z        -        0        0
fop_putpage     -          /tmp//filebench/xanadu/WEB-INF/lib/classes12.jar  -     4096   782336
fop_inactive    -          /tmp//filebench/xanadu/WEB-INF/lib/classes12.jar  -        0        0
fop_putpage     -          /tmp//filebench/bin/amd64/filebench               -     4096    36864

The source is below:

#!/usr/sbin/dtrace -s
/*
 * Trace the vnode interface
 * USAGE: voptrace.d [/all | /mountname ]
 * Author: Richard McDougall
 * 7/8/2005
 */

#pragma D option quiet

BEGIN
{
        printf("%-15s %-10s %51s %2s %8s %8s\n",
            "Event", "Device", "Path", "RW", "Size", "Offset");
        self->trace = 0;
        self->path = "";
}

fbt::fop_*:entry
/self->trace == 0/
{
        /* Get vp: fop_open has a pointer to vp */
        self->vpp = (vnode_t **)arg0;
        self->vp = (vnode_t *)arg0;
        self->vp = probefunc == "fop_open" ? (vnode_t *)*self->vpp : self->vp;

        /* And the containing vfs */
        self->vfsp = self->vp ? self->vp->v_vfsp : 0;

        /* And the paths for the vp and containing vfs */
        self->vfsvp = self->vfsp ?
            (struct vnode *)((vfs_t *)self->vfsp)->vfs_vnodecovered : 0;
        self->vfspath = self->vfsvp ? stringof(self->vfsvp->v_path) : "unknown";

        /* Check if we should trace the root fs */
        ($$1 == "/all" ||
         ($$1 == "/" && self->vfsp &&
         (self->vfsp == `rootvfs))) ? self->trace = 1 : self->trace;

        /* Check if we should trace the fs */
        ($$1 == "/all" || (self->vfspath == $$1)) ? self->trace = 1 : self->trace;
}

/* Trace the entry point to each fop */
fbt::fop_*:entry
/self->trace == 1/
{
        self->path = (self->vp != NULL && self->vp->v_path) ?
            stringof(self->vp->v_path) : "unknown";
        self->len = 0;
        self->off = 0;

        /* Some fops have the len in arg2 */
        (probefunc == "fop_getpage" ||
         probefunc == "fop_putpage" ||
         probefunc == "fop_none") ? self->len = arg2 : 1;

        /* Some fops have the len in arg3 */
        (probefunc == "fop_pageio" ||
         probefunc == "fop_none") ? self->len = arg3 : 1;

        /* Some fops have the len in arg4 */
        (probefunc == "fop_addmap" ||
         probefunc == "fop_map" ||
         probefunc == "fop_delmap") ? self->len = arg4 : 1;

        /* Some fops have the offset in arg1 */
        (probefunc == "fop_addmap" ||
         probefunc == "fop_map" ||
         probefunc == "fop_getpage" ||
         probefunc == "fop_putpage" ||
         probefunc == "fop_seek" ||
         probefunc == "fop_delmap") ? self->off = arg1 : 1;

        /* Some fops have the offset in arg3 */
        (probefunc == "fop_close" ||
         probefunc == "fop_pageio") ? self->off = arg3 : 1;

        /* Some fops have the offset in arg4 */
        probefunc == "fop_frlock" ? self->off = arg4 : 1;

        /* Some fops have the pathname in arg1 */
        self->path = (probefunc == "fop_create" ||
         probefunc == "fop_mkdir" ||
         probefunc == "fop_rmdir" ||
         probefunc == "fop_remove" ||
         probefunc == "fop_lookup") ?
            strjoin(self->path, strjoin("/", stringof(arg1))) : self->path;

        printf("%-15s %-10s %51s %2s %8d %8d\n",
            probefunc, "-", self->path, "-", self->len, self->off);
        self->type = probefunc;
}

fbt::fop_*:return
/self->trace == 1/
{
        self->trace = 0;
}

/* Capture any I/O within this fop */
io:::start
/self->trace == 1/
{
        printf("%-15s %-10s %51s %2s %8d %8u\n",
            self->type, args[1]->dev_statname,
            self->path, args[0]->b_flags & B_READ ? "R" : "W",
            args[0]->b_bcount, args[0]->b_blkno);
}


Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Technorati Tag: DTrace

Thursday Jun 30, 2005

Using DTrace for Memory Analysis

Following on from yesterday's post on using prstat to look at memory slow-downs, here is a more advanced way of investigating. The DTrace probes help us identify all sources of paging in the system, and give us the ability to drill down quickly to identify cause and effect.

With DTrace, you can probe more deeply into the sources of activity observed with higher-level memory analysis tools. For example, if you determine that a significant amount of paging activity is due to a memory shortage, you can determine which process is initiating the paging activity. In another example, if you see a significant amount of paging due to file activity, you can drill down to see which process and which file is responsible.

DTrace allows for memory analysis through a vminfo provider, and, optionally, through deeper tracing of virtual memory paging with the fbt provider.

The vminfo provider probes correspond to the fields in the "vm" named kstat: a probe provided by vminfo fires immediately before the corresponding vm value is incremented. The table below, from the DTrace guide, lists the probes available from the vminfo provider. Each probe takes the following arguments:

  • arg0 - The value by which the statistic is to be incremented. For most probes, this argument is always 1, but for some it may take other values.
  • arg1 - A pointer to the current value of the statistic to be incremented. This value is a 64-bit quantity that is incremented by the value in arg0. Dereferencing this pointer allows consumers to determine the current count of the statistic corresponding to the probe.

For example, if you see the following paging activity with vmstat, indicating page-in from the swap device, you could drill down to investigate:

sol8# vmstat -p 3

     memory           page          executable      anonymous      filesystem 
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 1512488 837792 160 20 12   0   0    0    0    0 8102    0    0   12   12   12
 1715812 985116 7  82   0   0   0    0    0    0 7501    0    0   45    0    0
 1715784 983984 0   2   0   0   0    0    0    0 1231    0    0   53    0    0
 1715780 987644 0   0   0   0   0    0    0    0 2451    0    0   33    0    0

sol10$ dtrace -n anonpgin'{@[execname] = count()}'
dtrace: description 'anonpgin' matched 1 probe
  svc.startd                                                       1
  sshd                                                             2
  ssh                                                              3
  dtrace                                                           6
  vmstat                                                          28
  filebench                                                      913

DTrace VM Provider Probes and Descriptions
Probe Name Description
anonfree Fires whenever an unmodified anonymous page is freed as part of paging activity. Anonymous pages are those not associated with a file; examples include heap memory, stack memory, and memory obtained by explicitly mapping zero(7D).
anonpgin Fires whenever an anonymous page is paged in from a swap device.
anonpgout Fires whenever a modified anonymous page is paged out to a swap device.
as_fault Fires whenever a fault is taken on a page and the fault is neither a protection fault nor a copy-on-write fault.
cow_fault Fires whenever a copy-on-write fault is taken on a page. arg0 contains the number of pages that are created as a result of the copy-on-write.
dfree Fires whenever a page is freed as a result of paging activity. Whenever dfree fires, exactly one of anonfree, execfree, or fsfree will also subsequently fire.
execfree Fires whenever an unmodified executable page is freed as a result of paging activity.
execpgin Fires whenever an executable page is paged in from the backing store.
execpgout Fires whenever a modified executable page is paged out to the backing store. If it occurs at all, most paging of executable pages occurs in terms of execfree; execpgout can fire only if an executable page is modified in memory, an uncommon occurrence in most systems.
fsfree Fires whenever an unmodified file system data page is freed as part of paging activity.
fspgin Fires whenever a file system page is paged in from the backing store.
fspgout Fires whenever a modified file system page is paged out to the backing store.
kernel_asflt Fires whenever a page fault is taken by the kernel on a page in its own address space. Whenever kernel_asflt fires, it will be immediately preceded by a firing of the as_fault probe.
maj_fault Fires whenever a page fault is taken that results in I/O from a backing store or swap device. Whenever maj_fault fires, it will be immediately preceded by a firing of the pgin probe.
pgfrec Fires whenever a page is reclaimed off the free page list.

Using DTrace to Estimate Memory Slowdowns

Using DTrace, we can directly measure the elapsed time around the page-in probes while a process is waiting for page-in from the swap device, as in this example.

#!/usr/sbin/dtrace -s

#pragma D option quiet

sched:::on-cpu
{
        self->on = vtimestamp;
}

sched:::off-cpu
/self->on/
{
        @oncpu[execname] = sum(vtimestamp - self->on);
        self->on = 0;
}

vminfo:::anonpgin
{
        self->anonpgin = 1;
}

sched:::off-cpu
{
        self->wait = timestamp;
}

sched:::on-cpu
/self->anonpgin == 1/
{
        self->anonpgin = 0;
        @pageintime[execname] = sum(timestamp - self->wait);
        self->wait = 0;
}

END
{
        normalize(@oncpu, 1000000);
        printf("Who's on cpu (milliseconds):\n");
        printa("  %-50s %15@d\n", @oncpu);

        normalize(@pageintime, 1000000);
        printf("Who's waiting for pagein (milliseconds):\n");
        printa("  %-50s %15@d\n", @pageintime);
}


With an aggregation by execname, we can see who is being held up by paging the most.

Who's on cpu (milliseconds):
  svc.startd                                                       1
  sshd                                                             2
  ssh                                                              3
  dtrace                                                           6
  vmstat                                                          28
  pageout                                                         60
  fsflush                                                        120
  filebench                                                      913
  sched                                                        84562
Who's waiting for pagein (milliseconds):
  filebench                                                   230704

The DTrace script displays the amount of time the program spends doing useful work compared to the amount of time it spends waiting for page-in.

The next script measures the elapsed time from when a program stalls on a page-in from the swap device (anonymous page-in) to when it resumes, for a specific target pid specified on the command line.

#!/usr/sbin/dtrace -s

#pragma D option quiet

sched:::on-cpu
/pid == $1/
{
        self->on = vtimestamp;
}

sched:::off-cpu
/pid == $1/
{
        @time["<on cpu>"] = sum(vtimestamp - self->on);
        self->on = 0;
}

vminfo:::anonpgin
/pid == $1/
{
        self->anon = 1;
}

sched:::off-cpu
/pid == $1/
{
        self->wait = timestamp;
}

sched:::on-cpu
/self->anon == 1/
{
        self->anon = 0;
        @time["<paging wait>"] = sum(timestamp - self->wait);
        self->wait = 0;
}

END
{
        printf("Time breakdown (milliseconds):\n");
        normalize(@time, 1000000);
        printa("  %-50s %15@d\n", @time);
}

In the following example, the program spends 0.9 seconds doing useful work, and 230 seconds waiting for page-ins.

sol10$ ./pagingtime.d 22599
dtrace: script './pagingtime.d' matched 10 probes
  1      2                             :END 
Time breakdown (milliseconds):
  <on cpu>                                                        913
  <paging wait>                                                   230704

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Technorati Tag: DTrace

Wednesday Jun 29, 2005

Using prstat to estimate memory slow-downs

There are many indicators that show when Solaris has a memory shortage, but when you get into this situation, wouldn't it be nice if you could tell who is being affected, and to what degree?

Here's a simple method: use the microstate measurement facility in prstat(1M). Using the memory-stall counts, you can observe the percentage of wall clock time spent waiting in data faults.

The microstates account for 100% of the wall-clock time of a thread between samples. The time is broken down into eight categories: on-CPU time in user and system modes (USR and SYS), time in software traps (TRP), memory stall time (TFL and DFL), time in user locks (LCK), time sleeping on I/O or something else (SLP), and time spent waiting on the run queue (LAT). Of particular interest here is the DFL column, which shows the percentage of time spent waiting for data faults to be serviced. When an application is stalled waiting to be paged in from the swap device, time accumulates in this column.

The following example shows a severe memory shortage. The system was running short of memory, and each thread in filebench was waiting for memory approximately 90% of the time.

sol8$ prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 15625 rmc      0.1 0.7 0.0 0.0  95 0.0 0.9 3.2  1K 726  88   0 filebench/2
 15652 rmc      0.1 0.7 0.0 0.0  94 0.0 1.8 3.6  1K  1K  10   0 filebench/2
 15635 rmc      0.1 0.7 0.0 0.0  96 0.0 0.5 3.2  1K  1K   8   0 filebench/2
 15626 rmc      0.1 0.6 0.0 0.0  95 0.0 1.4 2.6  1K 813  10   0 filebench/2
 15712 rmc      0.1 0.5 0.0 0.0  47 0.0  49 3.8  1K 831 104   0 filebench/2
 15628 rmc      0.1 0.5 0.0 0.0  96 0.0 0.0 3.1  1K 735   4   0 filebench/2
 15725 rmc      0.0 0.4 0.0 0.0  92 0.0 1.7 5.7 996 736   8   0 filebench/2
 15719 rmc      0.0 0.4 0.0 0.0  40  40  17 2.9  1K 708 107   0 filebench/2
 15614 rmc      0.0 0.3 0.0 0.0  92 0.0 4.7 2.4 874 576  40   0 filebench/2
 15748 rmc      0.0 0.3 0.0 0.0  94 0.0 0.0 5.5 868 646   8   0 filebench/2
 15674 rmc      0.0 0.3 0.0 0.0  86 0.0 9.7 3.2 888 571  62   0 filebench/2
 15666 rmc      0.0 0.3 0.0 0.0  29  46  23 2.1 689 502 107   0 filebench/2
 15682 rmc      0.0 0.2 0.0 0.0  24  43  31 1.9 660 450 107   0 filebench/2

Tuesday Jun 14, 2005

Adding your own performance statistics to Solaris

Have you ever wondered where all of those magical vmstat statistics come from, or how you might go about adding your own? Now that we have DTrace, you can get all the ad-hoc statistics you need, so we seldom need to add hard-coded statistics. However, the kstat mechanism is used to provide lightweight statistics that are a stable part of your kernel code. The kstat interface provides standard information that can be reported by a user-level tool. For example, if you wanted to add your own device driver I/O statistics to the pool reported by the iostat command, you would add a kstat provider.

Peter Boothby provides an excellent introduction to the kstat facility and the user API in his developer article. Since Pete's covered the detail, I'll just recap the major parts and jump into an overview of a kstat provider.

We'll start with the basics about adding your own kstat provider, and now that we've launched OpenSolaris, I can include source references ;-)

The statistics reported by vmstat, iostat and most of the other Solaris tools are gathered via a central kernel statistics subsystem, known as "kstat". The kstat facility is an all-purpose interface for collecting and reporting named and typed data.

A typical scenario has a kstat producer and a kstat reader. The kstat reader is a user-mode utility that reads, potentially aggregates, and then reports the results. For example, the vmstat utility is a kstat reader that aggregates statistics provided by the VM system in the kernel.

Statistics are named and accessed via a four-tuple: module, instance, name, and class. Solaris 8 introduced a new method to access kstat information from the command line or in custom-written scripts: you can use the command-line tool /usr/bin/kstat interactively to print all or selected kstat information from a system. This program is written in Perl, and you can use the Perl XS extension module to write your own custom Perl programs. Both facilities are documented in the online manual pages.

The kstat Command

You can invoke the kstat command on the command line or within shell scripts to selectively extract kernel statistics. Like many other Solaris commands, kstat takes optional interval and count arguments for repetitive, periodic output. Its command options are quite flexible.

The command has two invocation forms: the first follows standard UNIX command-line option syntax (-m module, -i instance, -n name, -s statistic), and the second passes the same specifiers as a colon-separated module:instance:name:statistic operand. Both forms offer the same functionality. Each of the module, instance, name, or statistic specifiers may be a shell glob pattern or a Perl regular expression enclosed by "/" characters. You can use both specifier types within a single operand. Leaving a specifier empty is equivalent to using the "*" glob pattern for that specifier. Running kstat with no arguments prints nearly all kstat entries from the running kernel (most, but not all, kstats of KSTAT_TYPE_RAW are decoded).

The tests specified by the options are logically ANDed, and all matching kstats are selected. The argument for the -c, -i, -m, -n, and -s options may be specified as a shell glob pattern, or as a Perl regular expression enclosed in "/" characters. If you want to pass a regular expression containing shell metacharacters to the command, you must protect it from the shell by enclosing it in the appropriate quotation marks. For example, to show all kstats that have a statistic name beginning with intr in the module cpu_stat, you could use the following command:

$ kstat -p -m cpu_stat -s 'intr*'
cpu_stat:0:cpu_stat0:intr       878951000
cpu_stat:0:cpu_stat0:intrblk    21604
cpu_stat:0:cpu_stat0:intrthread 668353070
cpu_stat:1:cpu_stat1:intr       211041358
cpu_stat:1:cpu_stat1:intrblk    280
cpu_stat:1:cpu_stat1:intrthread 209879640

The -p option used in the preceding example displays output in a parsable format. If you do not specify this option, kstat produces output in a human-readable, tabular format. In the following example, we leave out the -p flag and use the module:instance:name:statistic argument form with a Perl regular expression.

$ kstat cpu_stat:::/^intr/
module: cpu_stat                        instance: 0
name:   cpu_stat0                       class:    misc
        intr                            879131909
        intrblk                         21608
        intrthread                      668490486
module: cpu_stat                        instance: 1
name:   cpu_stat1                       class:    misc
        intr                            211084960
        intrblk                         280
        intrthread                      209923001

A kstat provider - a walk-through

To add your own statistics to the Solaris kernel, you create a kstat provider, which consists of an initialization function that creates the statistics group, and a callback function that updates the statistics just before they are read. The callback function is often used to aggregate or summarize information before it is reported back to the reader. The kstat provider interface is defined in kstat(3KSTAT) and kstat(9S); there is more verbose information in usr/src/uts/common/sys/kstat.h.

The first step is to decide on the type of information you want to export. The three primary types are RAW, NAMED, and IO. The RAW interface exports raw C data structures to userland; its use is strongly discouraged, since a change in the C structure causes incompatibilities in the reader. The NAMED mechanisms are preferred, since the data is typed and extensible; both NAMED and IO use typed data.

The NAMED type is for providing single or multiple records of data, and is the most common choice. The IO record is specifically for providing I/O statistics. It is collected and reported by the iostat command, and therefore should be used only for items which can be viewed and reported as I/O devices (we do this currently for I/O devices and NFS file systems).

A simple example of named statistics is the virtual memory summaries provided via "system_pages":

$ kstat -n system_pages
module: unix                            instance: 0     
name:   system_pages                    class:    pages                         
        availrmem                       343567
        crtime                          0
        desfree                         4001
        desscan                         25
        econtig                         4278190080
        fastscan                        256068
        freemem                         248309
        kernelbase                      3556769792
        lotsfree                        8002
        minfree                         2000
        nalloc                          11957763
        nalloc_calls                    9981
        nfree                           11856636
        nfree_calls                     6689
        nscan                           0
        pagesfree                       248309
        pageslocked                     168569
        pagestotal                      512136
        physmem                         522272
        pp_kernel                       64102
        slowscan                        100
        snaptime                        6573953.83957897

These are first declared and initialized by the following C structs in usr/src/uts/common/os/kstat_fr.c:

struct {
        kstat_named_t physmem;
        kstat_named_t nalloc;
        kstat_named_t nfree;
        kstat_named_t nalloc_calls;
        kstat_named_t nfree_calls;
        kstat_named_t kernelbase;
        kstat_named_t econtig;
        kstat_named_t freemem;
        kstat_named_t availrmem;
        kstat_named_t lotsfree;
        kstat_named_t desfree;
        kstat_named_t minfree;
        kstat_named_t fastscan;
        kstat_named_t slowscan;
        kstat_named_t nscan;
        kstat_named_t desscan;
        kstat_named_t pp_kernel;
        kstat_named_t pagesfree;
        kstat_named_t pageslocked;
        kstat_named_t pagestotal;
} system_pages_kstat = {

        { "physmem",            KSTAT_DATA_ULONG },
        { "nalloc",             KSTAT_DATA_ULONG },
        { "nfree",              KSTAT_DATA_ULONG },
        { "nalloc_calls",       KSTAT_DATA_ULONG },
        { "nfree_calls",        KSTAT_DATA_ULONG },
        { "kernelbase",         KSTAT_DATA_ULONG },
        { "econtig",            KSTAT_DATA_ULONG },
        { "freemem",            KSTAT_DATA_ULONG },
        { "availrmem",          KSTAT_DATA_ULONG },
        { "lotsfree",           KSTAT_DATA_ULONG },
        { "desfree",            KSTAT_DATA_ULONG },
        { "minfree",            KSTAT_DATA_ULONG },
        { "fastscan",           KSTAT_DATA_ULONG },
        { "slowscan",           KSTAT_DATA_ULONG },
        { "nscan",              KSTAT_DATA_ULONG },
        { "desscan",            KSTAT_DATA_ULONG },
        { "pp_kernel",          KSTAT_DATA_ULONG },
        { "pagesfree",          KSTAT_DATA_ULONG },
        { "pageslocked",        KSTAT_DATA_ULONG },
        { "pagestotal",         KSTAT_DATA_ULONG },
};

These statistics are the most simple type; merely a basic list of 64-bit variables. Once declared, the kstats are registered with the subsystem:

static int system_pages_kstat_update(kstat_t *, int);

        kstat_t *ksp;

        ksp = kstat_create("unix", 0, "system_pages", "pages",
                KSTAT_TYPE_NAMED,
                sizeof (system_pages_kstat) / sizeof (kstat_named_t),
                KSTAT_FLAG_VIRTUAL);
        if (ksp) {
                ksp->ks_data = (void *) &system_pages_kstat;
                ksp->ks_update = system_pages_kstat_update;
                kstat_install(ksp);
        }

The kstat_create() function takes the 4-tuple description, the type, and the number of statistics, and returns a handle to the created kstat. The handle is then updated to include a pointer to the data and a callback function that is invoked whenever a user reads the statistics.

The callback function's job is to refresh the data structure pointed to by ks_data each time it is invoked. If you choose not to have one, simply set the callback function to default_kstat_update(). The system_pages kstat preamble looks like this:

static int
system_pages_kstat_update(kstat_t *ksp, int rw)
{
        if (rw == KSTAT_WRITE) {
                return (EACCES);
        }
This basic preamble checks whether user code is trying to read or write the structure. (Yes, it is possible to write to some statistics, if the provider allows it.) Once the basic checks are done, the update callback simply stores the statistics into the predefined data structure and returns:

        system_pages_kstat.freemem.value.ul     = (ulong_t)freemem;
        system_pages_kstat.availrmem.value.ul   = (ulong_t)availrmem;
        system_pages_kstat.lotsfree.value.ul    = (ulong_t)lotsfree;
        system_pages_kstat.desfree.value.ul     = (ulong_t)desfree;
        system_pages_kstat.minfree.value.ul     = (ulong_t)minfree;
        system_pages_kstat.fastscan.value.ul    = (ulong_t)fastscan;
        system_pages_kstat.slowscan.value.ul    = (ulong_t)slowscan;
        system_pages_kstat.nscan.value.ul       = (ulong_t)nscan;
        system_pages_kstat.desscan.value.ul     = (ulong_t)desscan;
        system_pages_kstat.pagesfree.value.ul   = (ulong_t)freemem;

        return (0);
}
That's it for a basic named kstat.

I/O statistics

Moving on to I/O, we can see how I/O statistics are measured and recorded. There is a special kstat type for I/O statistics, which provides a common methodology for recording statistics about devices that have queuing, utilization, and response-time metrics.

These devices are measured as queues using a "Riemann sum": a count of visits to the queue together with a running sum of "active" time. These two metrics can be used to derive the average service time and I/O counts for the device. There are typically two queues for each device: the wait queue and the run (active) queue. These represent the time spent after a request has been accepted and enqueued, and then the time spent being actively processed by the device. The statistics are covered in kstat(3KSTAT):

     typedef struct kstat_io {
        /*
         * Basic counters.
         */
        u_longlong_t    nread;          /* number of bytes read */
        u_longlong_t    nwritten;       /* number of bytes written */
        uint_t          reads;          /* number of read operations */
        uint_t          writes;         /* number of write operations */

        /*
         * Accumulated time and queue length statistics.
         *
         * Time statistics are kept as a running sum of "active" time.
         * Queue length statistics are kept as a running sum of the
         * product of queue length and elapsed time at that length --
         * that is, a Riemann sum for queue length integrated against time.
         *
         *              ^
         *              |                       _________
         *              8                       | i4    |
         *              |                       |       |
         *      Queue   6                       |       |
         *      Length  |       _________      |       |
         *              4       | i2    |_______|       |
         *              |       |       i3              |
         *              2_______|                       |
         *              |  i1                           |
         *              |_______________________________|
         *              Time->  t1      t2      t3      t4
         *
         * At each change of state (entry or exit from the queue),
         * we add the elapsed time (since the previous state change)
         * to the active time if the queue length was non-zero during
         * that interval; and we add the product of the elapsed time
         * times the queue length to the running length*time sum.
         *
         * This method is generalizable to measuring residency
         * in any defined system: instead of queue lengths, think
         * of "outstanding RPC calls to server X".
         *
         * A large number of I/O subsystems have at least two basic
         * "lists" of transactions they manage: one for transactions
         * that have been accepted for processing but for which processing
         * has yet to begin, and one for transactions which are actively
         * being processed (but not done). For this reason, two cumulative
         * time statistics are defined here: pre-service (wait) time,
         * and service (run) time.
         *
         * The units of cumulative busy time are accumulated nanoseconds.
         * The units of cumulative length*time products are elapsed time
         * times queue length.
         */
        hrtime_t        wtime;          /* cumulative wait (pre-service) time */
        hrtime_t        wlentime;       /* cumulative wait length*time product */
        hrtime_t        wlastupdate;    /* last time wait queue changed */
        hrtime_t        rtime;          /* cumulative run (service) time */
        hrtime_t        rlentime;       /* cumulative run length*time product */
        hrtime_t        rlastupdate;    /* last time run queue changed */
        uint_t          wcnt;           /* count of elements in wait state */
        uint_t          rcnt;           /* count of elements in run state */
     } kstat_io_t;

An I/O device driver has a similar declaration and creation section, as we saw with the named statistics. A quick look at the floppy disk device driver (usr/src/uts/sun/io/fd.c) shows the kstat_create() call in the device driver's attach function:

static int
fd_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
        ...
        fdc->c_un->un_iostat = kstat_create("fd", 0, "fd0", "disk",
            KSTAT_TYPE_IO, 1, KSTAT_FLAG_PERSISTENT);
        if (fdc->c_un->un_iostat) {
                fdc->c_un->un_iostat->ks_lock = &fdc->c_lolock;
                kstat_install(fdc->c_un->un_iostat);
        }
        ...

The per-I/O statistics are updated in two places. The first is the device driver's strategy function, where the I/O is first received and queued; at this point, the I/O is marked as waiting on the wait queue:

#define	KIOSP	KSTAT_IO_PTR(un->un_iostat)

static int
fd_strategy(register struct buf *bp)
{
        struct fdctlr *fdc;
        struct fdunit *un;

        fdc = fd_getctlr(bp->b_edev);
        un = fdc->c_un;

        /* Mark I/O as waiting on wait q */
        if (un->un_iostat)
                kstat_waitq_enter(KIOSP);
        ...


The I/O spends some time on the wait queue until the device is able to process the request. For each I/O the fdstart() routine moves the I/O from the wait queue to the run queue via the kstat_waitq_to_runq() function:

static void
fdstart(struct fdctlr *fdc)
{
        ...
        /* Mark I/O as active, move from wait to active q */
        if (un->un_iostat)
                kstat_waitq_to_runq(KIOSP);

        /* Do I/O... */
        ...

When the I/O is complete (still in the fdstart() function), it is marked as leaving the active queue via kstat_runq_exit(). This updates the last part of the statistics, leaving us with the number of I/Os and the total time spent on each queue.

        /* Mark I/O as complete */
        if (un->un_iostat) {
                if (bp->b_flags & B_READ) {
                        KIOSP->reads++;
                        KIOSP->nread += (bp->b_bcount - bp->b_resid);
                } else {
                        KIOSP->writes++;
                        KIOSP->nwritten += (bp->b_bcount - bp->b_resid);
                }
                kstat_runq_exit(KIOSP);
        }



These statistics provide us with our familiar iostat metrics: actv is the average length of the queue of active I/Os and asvc_t is the average service time in the device. The wait queue is represented accordingly by wait and wsvc_t.

$ iostat -xn 10
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.2    0.1    9.2    1.1  0.1  0.5    0.1   10.4   1   1 fd0

That should be enough for a primer on kstat providers, as more detailed information is available elsewhere... Enjoy!


Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Monday Jun 13, 2005

Understanding Memory Allocation and File System Caching in OpenSolaris

Yes, I'm guilty. While the workings of the Solaris virtual memory system prior to Solaris 8 are documented quite well, I haven't written much about the new cyclic file system cache. There will be quite a bit on this subject in the new Solaris Internals edition, but to answer a giant FAQ somewhat sooner, I'm posting this overview.

A quick Introduction

File system caching has been implemented as an integrated part of the Solaris virtual memory system since as far back as SunOS 4.0. This has the great advantage of dynamically using available memory as a file system cache, which can speed up some I/O-intensive applications by as much as 500x. But there were some historic side effects: applications doing a lot of file system I/O could swamp the memory system with demand for allocations, putting so much pressure on it that memory pages were aggressively stolen from important applications. Typical symptoms of this condition were that everything seemed to "slow down" when file I/O was occurring and that the system reported it was constantly out of memory. In Solaris 2.6 and 7, I updated the paging algorithms to steal only file system pages unless there was a real memory shortage, as part of the feature named "Priority Paging". This meant that although there was still significant pressure from file I/O and high "scan rates", applications no longer got paged out, nor suffered from the pressure. A healthy Solaris 7 system still reported it was out of memory, but performed well.

Solaris 8 - "Cyclic Page Cache"

Starting with Solaris 8, we provided a significant architectural enhancement with a more effective solution: the file system cache was changed so that it steals memory from itself, rather than from other parts of the system. Hence, a system with a large amount of file I/O remains in a healthy virtual memory state, with large amounts of visible free memory, and since the page scanner doesn't need to run, there are no aggressive scan rates. Because the page scanner is no longer required to constantly free up large amounts of memory, it no longer limits file-system-related I/O throughput. Another benefit of the enhancement is that applications that want to allocate a large amount of memory can do so efficiently, consuming it directly from the file system cache. For example, starting Oracle with a 50-Gbyte SGA now takes less than a minute, compared with the 20-30 minutes of the prior implementation.

The old allocation algorithm

To keep this explanation relatively simple, let's take a brief look at what used to happen with Solaris 7, even with priority paging. The file system consumes memory from the freelist every time a new page is read from disk (or wherever) into the file system cache. The more pages we read, the more pages are depleted from the system's freelist (the central place where memory is kept for reuse). Eventually (sometimes rather quickly), the free memory pool is depleted. At this point, if there is enough pressure, further requests for new memory pages are blocked until the pool is replenished by the page scanner. The page scanner scans inefficiently through all of memory looking for pages it can free, and slowly refills the freelist -- but only by enough to satisfy the immediate request. Processes resume for a short time, then stop again as they once more run short on memory. The page scanner is a bottleneck in the whole memory life cycle.

In the diagram above, we can see the file system's cache mechanism (segmap) consuming memory from the freelist until it is depleted. After those pages are used, they are kept around, but they are only immediately accessible to the file system cache in the direct-reuse case; that is, if a file system cache hit occurs, they can be "reclaimed" back into segmap to avoid a subsequent physical I/O. However, if the file system cache needs a new page, there is no easy way of finding these pages -- rather, the page scanner is used to stumble across them. The page scanner blindly trawls through all of memory looking for pages with which to refill the freelist, and it has to fill the freelist at the same rate as the file system is reading new pages -- a single point of constraint in the whole design.

The new allocation algorithm

The new algorithm uses a central list to hold inactive file cache pages (those that aren't immediately mapped anywhere), so that they can easily be used to satisfy new memory requests. This is a very subtle change, but one with significant, demonstrable effects. Firstly, the file system cache now behaves as a single age-ordered FIFO: recently read pages are placed at the tail of the list, and new pages are consumed from the head. While on the list, the pages remain valid cached portions of their file, so if a read cache hit occurs, a page is simply removed from wherever it is on the list. This means that frequently accessed pages keep returning to the tail of the list, and only the oldest, least-used pages migrate to the head as candidates for freeing.

The cachelist is linked to the freelist, such that if the free list is exhausted then pages will be taken from the head of the cachelist and their contents discarded. New page requests are requested from the freelist, but since this list is often empty, allocations occur mostly from the head of the cache list, consuming the oldest file system cache pages. The page scanner doesn't need to get involved, eliminating the paging bottleneck and the need to run the scanner at high rates (and hence, not wasting CPU either).

If an application process requests a large amount of memory, it too can take from the cachelist via the freelist. Thus, an application can take a large amount of memory from the file system cache without needing to start the page scanner, resulting in substantially faster allocation.

Putting it all together: The Allocation Cycle of Physical Memory

The most significant central pool of physical memory is the freelist. Physical memory is placed on the freelist in page-size chunks when the system is first booted and is then consumed as required. Three major types of allocation occur from the freelist, as shown above.

Anonymous/process allocations

Anonymous memory, the most common form of allocation from the freelist, is used for most of a process’s memory allocation, including heap and stack. Anonymous memory also fulfills shared memory mappings allocations. A small amount of anonymous memory is also used in the kernel for items such as thread stacks. Anonymous memory is pageable and is returned to the freelist when it is unmapped or if it is stolen by the page scanner daemon.

File system “page cache”

The page cache is used for caching of file data for file systems. The file system page cache grows on demand to consume available physical memory as a file cache and caches file data in page-size chunks. Pages are consumed from the freelist as files are read into memory. The pages then reside in one of three places: the segmap cache, a process's address space to which they are mapped, or the cachelist.

The cachelist is the heart of the page cache. All unmapped file pages reside on the cachelist. Working in conjunction with the cache list are mapped files and the segmap cache.

Think of the segmap file cache as the fast first level file system read/write cache. segmap is a cache that holds file data read and written through the read and write system calls. Memory is allocated from the freelist to satisfy a read of a new file page, which then resides in the segmap file cache. File pages are eventually moved from the segmap cache to the cachelist to make room for more pages in the segmap cache.

The segmap cache is typically sized at 12% of physical memory on SPARC systems and works in conjunction with the cachelist to cache file data. When files are accessed through the read and write system calls, up to 12% of physical memory's worth of file data resides in the segmap cache, and the remainder is on the cachelist.

Memory-mapped files also allocate memory from the freelist, and it remains allocated for the duration of the mapping or until a global memory shortage occurs. When a file is unmapped (explicitly or via madvise), its pages are returned to the cachelist.

The cachelist operates as part of the freelist. When the freelist is depleted, allocations are made from the oldest pages in the cachelist. This allows the file system page cache to grow to consume all available memory and to dynamically shrink as memory is required for other purposes.

Kernel allocations

The kernel uses memory to manage information about internal system state; for example, memory used to hold the list of processes in the system. The kernel allocates memory from the freelist for these purposes with its own allocators: vmem and the slab allocator. However, unlike process and file allocations, the kernel seldom returns memory to the freelist; memory is instead recycled between kernel subsystems and the kernel allocators, and is consumed from the freelist only when the total kernel allocation grows.

Memory allocated to the kernel is mostly nonpageable and so cannot be managed by the system page scanner daemon. Memory is returned to the system freelist proactively by the kernel's allocators when a global memory shortage occurs.

How to observe and monitor the new VM algorithms

The page scanner and its metrics are an important indicator of memory health. If the page scanner is running, there is likely a memory shortage. This is an interesting departure from the behavior you might have been accustomed to on Solaris 7 and earlier, where the page scanner was almost always running. Since Solaris 8, the file system cache resides on the cachelist, which is part of the global free memory count. Thus, if a significant amount of memory is available, even if it's being used as a file system cache, the page scanner won't be running.

The most important metric is the scan rate, which indicates whether the page scanner is running. The scanner starts scanning at an initial rate (slowscan) when free memory falls to the configured watermark, lotsfree, and then scans faster as free memory drops lower, up to a maximum rate (fastscan).

We can perform a quick and simple health check by determining whether there is a significant memory shortage. To do this, use vmstat to look at scanning activity and check to see if there is sufficient free memory on the system.

Let’s first look at a healthy system. This system is showing 970 Mbytes of free memory in the free column, and a scan rate (sr) of zero.

sol8# vmstat -p 3
     memory           page          executable      anonymous      filesystem 
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 1512488 837792 160 20 12   0   0    0    0    0    0    0    0   12   12   12
 1715812 985116 7  82   0   0   0    0    0    0    0    0    0   45    0    0
 1715784 983984 0   2   0   0   0    0    0    0    0    0    0   53    0    0
 1715780 987644 0   0   0   0   0    0    0    0    0    0    0   33    0    0

Looking at a second case, we can see two of the key indicators showing a memory shortage—both high scan rates (sr > 50000 in this case) and very low free memory (free < 10 Mbytes).

sol8# vmstat -p 3
     memory           page          executable      anonymous      filesystem 
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 2276000 1589424 2128 19969 1 0 0    0    0    0    0    0    0    0    1    1
 1087652 388768 12 129675 13879 0 85590 0 0   12    0 3238 3238   10 9391 10630
 608036 51464  20 8853 37303 0 65871 38   0  781   12 19934 19930 95 16548 16591
  94448  8000  17 23674 30169 0 238522 16 0  810   23 28739 28804 56  547  556

Given that the page scanner runs only when the freelist and cachelist are effectively depleted, any scanning activity is our first sign of memory shortage. Drilling down further with the mdb ::memstat dcmd shows us where the major allocations are:

sol9# mdb -k
Loading modules: [ unix krtld genunix ip ufs_log logindmux ptm cpc sppp ipc random nfs ]
> ::memstat

Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      53444               208   10%
Anon                       119088               465   23%
Exec and libs                2299                 8    0%
Page cache                  29185               114    6%
Free (cachelist)              347                 1    0%
Free (freelist)            317909              1241   61%

Total                      522272              2040
Physical                   512136              2000

The categories are described as follows:


Kernel

The total memory used for nonpageable kernel allocations. This is how much memory the kernel is using, excluding anonymous memory used for kernel ancillaries (see Anon).


Anon

The amount of anonymous memory. This includes user process heap, stack, and copy-on-write pages; shared memory mappings; and small kernel ancillaries, such as lwp thread stacks, present on behalf of user processes.

Exec and libs

The amount of memory used for mapped files interpreted as binaries or libraries. This is typically the sum of memory used for user binaries and shared libraries. Technically, this memory is part of the page cache, but it is page cache tagged as “executable” when a file is mapped with PROT_EXEC and file permissions include execute permission.

Page cache

The amount of unmapped page cache, that is, page cache not on the cachelist. This category includes the segmap portion of the page cache, and any memory mapped files. If the applications on the system are solely using a read/write path, then we would expect the size of this bucket not to exceed segmap_percent (defaults to 12% of physical memory size). Files in /tmp are also included in this category.

Free (cachelist)

The amount of page cache on the cachelist. The cachelist contains unmapped file pages and is typically where the majority of the file system cache resides. Expect to see a large cachelist on a system that has large file sets and sufficient memory for file caching.

Free (freelist)

The amount of memory that is actually free. This is memory that has no association with any file or process.

If you want this functionality on Solaris 8, copy the downloadable into /usr/lib/mdb/kvm/sparcv9, and then use ::load memory before running ::memstat. (Note that this is not Sun-supported code, but it is considered low risk since it affects only the mdb user-level program.)

# wget
# cp /usr/lib/mdb/kvm/sparcv9
# mdb -k
Loading modules: [ unix krtld genunix ip ufs_log logindmux ptm cpc sppp ipc random nfs ]
> ::load memory
> ::memstat

That's it for now.

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Technorati Tag: mdb



