Friday Jul 08, 2005

Tracing the Solaris 10 File System Interface

Here's a quick script to trace activity through the central file system interface. Until there is a general file system provider, this script should serve as a basic framework to help construct other file system tracing scripts.

# ./voptrace.d /tmp
Event           Device                                                Path  RW     Size   Offset
fop_putpage     -          /tmp//filebench/bin/i386/fastsu                   -     4096     4096
fop_inactive    -          /tmp//filebench/bin/i386/fastsu                   -        0        0
fop_putpage     -          /tmp//filebench/xanadu/WEB-INF/lib/classes12.jar  -     4096   204800
fop_inactive    -          /tmp//filebench/xanadu/WEB-INF/lib/classes12.jar  -        0        0
fop_putpage     -          /tmp/filebench1.63_s10_x86_sparc_pkg.tar.Z        -     4096  7655424
fop_inactive    -          /tmp/filebench1.63_s10_x86_sparc_pkg.tar.Z        -        0        0
fop_putpage     -          /tmp//filebench/xanadu/WEB-INF/lib/classes12.jar  -     4096   782336
fop_inactive    -          /tmp//filebench/xanadu/WEB-INF/lib/classes12.jar  -        0        0
fop_putpage     -          /tmp//filebench/bin/amd64/filebench               -     4096    36864

The source is below:

#!/usr/sbin/dtrace -s

/*
 * Trace the vnode interface
 *
 * USAGE: voptrace.d [/all | /mountname ]
 *
 * Author: Richard McDougall
 *
 * 7/8/2005
 */

#pragma D option quiet

:::BEGIN
{
        printf("%-15s %-10s %51s %2s %8s %8s\\n",
                "Event", "Device", "Path", "RW", "Size", "Offset");
        self->trace = 0;
        self->path = "";
}


::fop_*:entry
/self->trace == 0/
{
        /* Get vp: fop_open has a pointer to vp */
        self->vpp = (vnode_t **)arg0;
        self->vp = (vnode_t *)arg0;
        self->vp = probefunc == "fop_open" ? (vnode_t *)*self->vpp : self->vp;

        /* And the containing vfs */
        self->vfsp = self->vp ? self->vp->v_vfsp : 0;

        /* And the paths for the vp and containing vfs */
        self->vfsvp = self->vfsp ? (struct vnode *)((vfs_t *)self->vfsp)->vfs_vnodecovered : 0;
        self->vfspath = self->vfsvp ? stringof(self->vfsvp->v_path) : "unknown";

        /* Check if we should trace the root fs */
        ($1 == "/all" ||
         ($1 == "/" && self->vfsp && \
         (self->vfsp == `rootvfs))) ? self->trace = 1 : self->trace;

        /\* Check if we should trace the fs \*/
        ($1 == "/all" || (self->vfspath == $1)) ? self->trace = 1 : self->trace;
}

/*
 * Trace the entry point to each fop
 *
 */
::fop_*:entry
/self->trace/
{
        self->path = (self->vp != NULL && self->vp->v_path) ? stringof(self->vp->v_path) : "unknown";
        self->len = 0;
        self->off = 0;

        /* Some fops have the len in arg2 */
        (probefunc == "fop_getpage" || \
         probefunc == "fop_putpage" || \
         probefunc == "fop_none") ? self->len = arg2 : 1;

        /* Some fops have the len in arg3 */
        (probefunc == "fop_pageio" || \
         probefunc == "fop_none") ? self->len = arg3 : 1;

        /* Some fops have the len in arg4 */
        (probefunc == "fop_addmap" || \
         probefunc == "fop_map" || \
         probefunc == "fop_delmap") ? self->len = arg4 : 1;

        /* Some fops have the offset in arg1 */
        (probefunc == "fop_addmap" || \
         probefunc == "fop_map" || \
         probefunc == "fop_getpage" || \
         probefunc == "fop_putpage" || \
         probefunc == "fop_seek" || \
         probefunc == "fop_delmap") ? self->off = arg1 : 1;

        /* Some fops have the offset in arg3 */
        (probefunc == "fop_close" || \
         probefunc == "fop_pageio") ? self->off = arg3 : 1;

        /* Some fops have the offset in arg4 */
        probefunc == "fop_frlock" ? self->off = arg4 : 1;

        /* Some fops have the pathname in arg1 */
        self->path = (probefunc == "fop_create" || \
         probefunc == "fop_mkdir" || \
         probefunc == "fop_rmdir" || \
         probefunc == "fop_remove" || \
         probefunc == "fop_lookup") ?
                strjoin(self->path, strjoin("/", stringof(arg1))) : self->path;
        printf("%-15s %-10s %51s %2s %8d %8d\n",
                probefunc,
                "-", self->path, "-", self->len, self->off);
        self->type = probefunc;
}

::fop_*:return
/self->trace == 1/
{
        self->trace = 0;
}


/* Capture any I/O within this fop */
io:::start
/self->trace/
{
        printf("%-15s %-10s %51s %2s %8d %8u\\n",
                self->type, args[1]->dev_statname,
                self->path, args[0]->b_flags & B_READ ? "R" : "W",
                args[0]->b_bcount, args[0]->b_blkno);

}
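
If you only need a quick count of which vnode operations are firing system-wide before reaching for the full script, a one-liner sketch along these lines (counting entries to the same fop_* wrappers the script traces, via the fbt provider) may be enough:

# dtrace -n 'fbt::fop_*:entry { @[probefunc] = count(); }'

Press Ctrl-C to print the per-operation counts.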

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Technorati Tag: DTrace

Thursday Jun 30, 2005

Using DTrace for Memory Analysis

Following on from yesterday's post on using prstat to look at memory slowdowns, here is a more advanced way of investigating. The DTrace probes help us identify all sources of paging in the system, and give us the ability to drill down quickly to identify cause and effect.

With DTrace, you can probe more deeply into the sources of activity observed with higher-level memory analysis tools. For example, if you determine that a significant amount of paging activity is due to a memory shortage, you can determine which process is initiating the paging activity. In another example, if you see a significant amount of paging due to file activity, you can drill down to see which process and which file is responsible.
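
For the file-activity case, for example, a one-liner sketch along these lines (using the io provider's fileinfo translator, so the attribution is only as good as args[2]->fi_pathname and the thread that happens to issue the I/O) shows which processes and files are driving the physical I/O:

# dtrace -n 'io:::start { @[execname, args[2]->fi_pathname] = count(); }'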

DTrace allows for memory analysis through the vminfo provider and, optionally, through deeper tracing of virtual memory paging with the fbt provider.

The vminfo provider probes correspond to the fields in the "vm" named kstat: a probe provided by vminfo fires immediately before the corresponding vm value is incremented. The table below, from the DTrace guide, lists the probes available from the vminfo provider. Each probe takes the following arguments (a short usage sketch follows the list):

  • arg0 - The value by which the statistic is to be incremented. For most probes, this argument is always 1, but for some it may take other values.
  • arg1 - A pointer to the current value of the statistic to be incremented. This value is a 64-bit quantity that is incremented by the value in arg0. Dereferencing this pointer allows consumers to determine the current count of the statistic corresponding to the probe.
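
A minimal sketch of how these arguments are used: summing arg0, rather than simply counting probe firings, accumulates the same totals the vm kstat would, broken down here by the process responsible (the probe names and aggregation key are just an example):

#!/usr/sbin/dtrace -s

/* Sum the kstat increment (arg0) per process for two page-in statistics */
vminfo:::anonpgin,
vminfo:::fspgin
{
        @pgin[execname, probename] = sum(arg0);
}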

For example, if you see the following paging activity in vmstat, indicating page-ins from the swap device, you could drill down to investigate.

sol8# vmstat -p 3

     memory           page          executable      anonymous      filesystem 
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 1512488 837792 160 20 12   0   0    0    0    0 8102    0    0   12   12   12
 1715812 985116 7  82   0   0   0    0    0    0 7501    0    0   45    0    0
 1715784 983984 0   2   0   0   0    0    0    0 1231    0    0   53    0    0
 1715780 987644 0   0   0   0   0    0    0    0 2451    0    0   33    0    0

sol10$ dtrace -n anonpgin'{@[execname] = count()}'
dtrace: description 'anonpgin' matched 1 probe
  svc.startd                                                       1
  sshd                                                             2
  ssh                                                              3
  dtrace                                                           6
  vmstat                                                          28
  filebench                                                      913


DTrace VM Provider Probes and Descriptions

  • anonfree - Fires whenever an unmodified anonymous page is freed as part of paging activity. Anonymous pages are those that are not associated with a file; memory containing such pages includes heap memory, stack memory, or memory obtained by explicitly mapping zero(7D).
  • anonpgin - Fires whenever an anonymous page is paged in from a swap device.
  • anonpgout - Fires whenever a modified anonymous page is paged out to a swap device.
  • as_fault - Fires whenever a fault is taken on a page and the fault is neither a protection fault nor a copy-on-write fault.
  • cow_fault - Fires whenever a copy-on-write fault is taken on a page. arg0 contains the number of pages that are created as a result of the copy-on-write.
  • dfree - Fires whenever a page is freed as a result of paging activity. Whenever dfree fires, exactly one of anonfree, execfree, or fsfree will also subsequently fire.
  • execfree - Fires whenever an unmodified executable page is freed as a result of paging activity.
  • execpgin - Fires whenever an executable page is paged in from the backing store.
  • execpgout - Fires whenever a modified executable page is paged out to the backing store. If it occurs at all, most paging of executable pages will occur in terms of execfree; execpgout can only fire if an executable page is modified in memory -- an uncommon occurrence in most systems.
  • fsfree - Fires whenever an unmodified file system data page is freed as part of paging activity.
  • fspgin - Fires whenever a file system page is paged in from the backing store.
  • fspgout - Fires whenever a modified file system page is paged out to the backing store.
  • kernel_asflt - Fires whenever a page fault is taken by the kernel on a page in its own address space. Whenever kernel_asflt fires, it will be immediately preceded by a firing of the as_fault probe.
  • maj_fault - Fires whenever a page fault is taken that results in I/O from a backing store or swap device. Whenever maj_fault fires, it will be immediately preceded by a firing of the pgin probe.
  • pgfrec - Fires whenever a page is reclaimed off of the free page list.
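
As a quick illustration of the freeing probes from this table, a one-liner sketch such as the following shows what kind of pages are being freed by paging activity, which hints at whether the pressure is on anonymous memory, executables, or the file system cache:

# dtrace -n 'anonfree,execfree,fsfree { @[probename] = count(); }'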

Using DTrace to Estimate Memory Slowdowns

Using DTrace, you can directly measure the elapsed time around the page-in probes while a process is waiting for a page-in from the swap device, as in this example.

#!/usr/sbin/dtrace -s

#pragma D option quiet

sched:::on-cpu
{
        self->on = vtimestamp;
}

sched:::off-cpu
/self->on/
{
        @oncpu[execname] = sum(vtimestamp - self->on);
        self->on = 0;
}

vminfo:::anonpgin
{
        self->anonpgin = 1;
}

fbt::pageio_setup:return
{
        self->wait = timestamp;
}

fbt::pageio_done:entry
/self->anonpgin == 1/
{
        self->anonpgin = 0;
        @pageintime[execname] = sum(timestamp - self->wait);
        self->wait = 0;
}

END
{
        normalize(@oncpu, 1000000);
        printf("Who's on cpu (milliseconds):\\n");
        printa("  %-50s %15@d\\n", @oncpu);

        normalize(@pageintime, 1000000);
        printf("Who's waiting for pagein (milliseconds):\\n");
        printa("  %-50s %15@d\\n", @pageintime);

}

With an aggregation by execname, you can see who is being held up by paging the most.

sol10$ ./whospaging.d
^C
Who's on cpu (milliseconds):
  svc.startd                                                       1
  loop.sh                                                          2
  sshd                                                             2
  ssh                                                              3
  dtrace                                                           6
  vmstat                                                          28
  pageout                                                         60
  fsflush                                                        120
  filebench                                                      913
  sched                                                        84562
Who's waiting for pagein (milliseconds):
  filebench                                                   230704

The DTrace script displays the amount of time the program spends doing useful work compared to the amount of time it spends waiting for page-in.

The next script measures the elapsed time from when a program stalls on a page-in from the swap device (an anonymous page-in) until it resumes, for a specific target pid specified on the command line.

#!/usr/sbin/dtrace -s

#pragma D option quiet

sched:::on-cpu
/pid == $1/
{
        self->on = vtimestamp;
}

sched:::off-cpu
/self->on/
{
        @time[""] = sum(vtimestamp - self->on);
        self->on = 0;
}

vminfo:::anonpgin
/pid == $1/
{
        self->anon = 1;
}

fbt::pageio_setup:return
/pid == $1/
{
        self->wait = timestamp;
}

fbt::pageio_done:entry
/self->anon == 1/
{
        self->anon = 0;
        @time[""] = sum(timestamp - self->wait);
        self->wait = 0;
}

tick-5s
{
        printf("\\033[H\\033[2J");
        printf("Time breakdown (milliseconds):\\n");
        normalize(@time, 1000000);
        printa("  %-50s %15@d\\n", @time);
        trunc(@time);
}

In the following example, the program spends 0.9 seconds doing useful work, and 230 seconds waiting for page-ins.

sol10$ ./pagingtime.d 22599
dtrace: script './pagingtime.d' matched 10 probes
^C
  1      2                             :END 
Time breakdown (milliseconds):
  <on cpu>                                                        913
  <paging wait>                                                   230704

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Technorati Tag: DTrace

Tuesday Apr 19, 2005

DTrace Repository Created

I have created a repository for the DTrace scripts that we have been referencing. Today it's just a download; we will be adding more detailed descriptions of what the tools do shortly.

Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Technorati Tag: DTrace

Wednesday Apr 13, 2005

Using DTrace to Estimate File System Cache Hit Ratio

We now have a method of collecting one of the most significant performance statistics for a file system in Solaris -- the cache hit rate in the file system page cache. By using DTrace with probes at the entry to and exit from the file system, we can collect the logical I/Os into the file system and the physical I/Os from the file system into the device I/O subsystem. These two statistics give us insight into how effective the file system cache is, and whether there might be grounds for adding physical memory in order to increase the amount of file system level caching.

#!/usr/sbin/dtrace -s

#pragma D option quiet

::fop_read:entry
/self->trace == 0 && (((vnode_t *)arg0)->v_vfsp)->vfs_vnodecovered/
{
        vp = (vnode_t *)arg0;
        vfs = (vfs_t *)vp->v_vfsp;
        mountvp = vfs->vfs_vnodecovered;
        uio = (uio_t *)arg1;
        self->path=stringof(mountvp->v_path);
        @rio[stringof(mountvp->v_path), "logical"] = count();
        @rbytes[stringof(mountvp->v_path), "logical"] = sum(uio->uio_resid);
        self->trace = 1;
}

::fop_read:entry
/self->trace == 0 && (((vnode_t *)arg0)->v_vfsp == `rootvfs)/
{
        vp = (vnode_t *)arg0;
        vfs = (vfs_t *)vp->v_vfsp;
        mountvp = vfs->vfs_vnodecovered;
        uio = (uio_t *)arg1;
        self->path="/";
        @rio[stringof("/"), "logical"] = count();
        @rbytes[stringof("/"), "logical"] = sum(uio->uio_resid);
        self->trace = 1;
}

::fop_read:return
/self->trace == 1/
{
        self->trace = 0;
}

io::bdev_strategy:start
/self->trace/
{
        @rio[self->path, "physical"] = count();
        @rbytes[self->path, "physical"] = sum(args[0]->b_bcount);
}


tick-5s
{
        trunc (@rio, 20);
        trunc (@rbytes, 20);
        printf("\\033[H\\033[2J");
        printf ("\\nRead IOPS\\n");
        printa ("%-60s %10s %10@d\\n", @rio);
        printf ("\\nRead Bandwidth\\n");
        printa ("%-60s %10s %10@d\\n", @rbytes);
        trunc (@rbytes);
        trunc (@rio);
}

Using this script, we probe the logical bytes read into the file system through the new Solaris 10 file system fop layer. We count the physical bytes using the io provider. Running the script, we can see the logical and physical reads for a file system, which can be used to calculate the hit rate:

Read IOPS
/data1                                                          physical        287
/data1                                                           logical       2401

Read Bandwidth
/data1                                                          physical    2351104
/data1                                                           logical    5101240

The /data1 file system on this machine is doing 2401 logical IOPS and 287 physical, i.e. a hit rate of 2401 / (2401 + 287) = 89%. It is also doing 5.1 MB/s logical and 2.3 MB/s physical. We can also do this at the file level:


#!/usr/sbin/dtrace -s

#pragma D option quiet

::fop_read:entry
/self->trace == 0 && (((vnode_t *)arg0)->v_path)/
{
        vp = (vnode_t *)arg0;
        uio = (uio_t *)arg1;
        self->path=stringof(vp->v_path);
        self->trace = 1;
        @rio[stringof(vp->v_path), "logical"] = count();
        @rbytes[stringof(vp->v_path), "logical"] = sum(uio->uio_resid);
}

::fop_read:return
/self->trace == 1/
{
        self->trace = 0;
}

io::bdev_strategy:start
/self->trace/
{
        @rio[self->path, "physical"] = count();
        @rbytes[self->path, "physical"] = sum(args[0]->b_bcount);
}

tick-5s
{
        trunc (@rio, 20);
        trunc (@rbytes, 20);
        printf("\\033[H\\033[2J");
        printf ("\\nRead IOPS\\n");
        printa ("%-60s %10s %10@d\\n", @rio);
        printf ("\\nRead Bandwidth\\n");
        printa ("%-60s %10s %10@d\\n", @rbytes);
        trunc (@rbytes);
        trunc (@rio);
}
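
As a lighter-weight sketch for the physical side only, the io provider's fileinfo translator can attribute disk reads directly to a pathname without tracing fop_read at all (this misses the logical counts needed for the hit-rate calculation, and pathname attribution is not always available for every buffer):

# dtrace -n 'io:::start { @[args[2]->fi_pathname] = sum(args[0]->b_bcount); }'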


Technorati Tag: OpenSolaris

Technorati Tag: Solaris

Technorati Tag: DTrace
