Analyzing Interrupt Activity with DTrace

This article is about interrupt analysis using DTrace. It is also
available on the Solaris Internals and Performance FAQ Wiki,
as part of the DTrace Topics collection.

Interrupt Analysis

Interrupts are events delivered to CPUs, usually by external devices
(e.g. FC, SCSI, Ethernet and Infiniband adapters). Interrupts can
cause performance and observability problems for applications.

Performance problems are caused when an interrupt "steals" a CPU from
an application thread, halting that thread's progress while the
interrupt is serviced. This is called pinning: the interrupt will pin
an application thread if it is delivered to a CPU on which that thread
was executing at the time.

This can affect other threads or processes in the application if, for
example, the pinned thread was holding one or more synchronization
objects (locks, semaphores, etc.).

Observability problems can arise if we are trying to account for work
the application is completing versus the CPU time it is consuming.
While an interrupt has an application thread pinned, the CPU time the
interrupt consumes is charged to the application.

Strategy

The SDT provider offers the following probes that indicate
when an interrupt is being serviced:

    interrupt-start
    interrupt-complete

The first argument (arg0) to both probes is the address of a
struct dev_info (AKA dev_info_t *), which can be
used to identify the driver and instance for the interrupt.
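
As a minimal sketch of how these two probes can be paired, the following D script sums the time spent between interrupt-start and interrupt-complete, keyed by driver name and instance. It follows the pattern of the sdt provider example in the Solaris Dynamic Tracing Guide; the aggregation name @time and the output layout are illustrative choices, not taken from the attached scripts.

```dtrace
#!/usr/sbin/dtrace -s

#pragma D option quiet

sdt:::interrupt-start
{
	/* Thread-local start time for the interrupt being serviced. */
	self->ts = timestamp;
}

sdt:::interrupt-complete
/self->ts/
{
	/* arg0 is the address of the struct dev_info for this interrupt. */
	this->devi = (struct dev_info *)arg0;

	/* Resolve the driver name via the kernel's devnamesp[] table. */
	@time[stringof(`devnamesp[this->devi->devi_major].dn_name),
	    this->devi->devi_instance] = sum(timestamp - self->ts);

	self->ts = 0;
}

dtrace:::END
{
	printf("%-16s %8s %16s\n", "DRIVER", "INSTANCE", "NANOSECONDS");
	printa("%-16s %8d %@16d\n", @time);
}
```

Run it as root, let it collect for a while, and press Ctrl-C to print the per-driver totals.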

Pinning

If the interrupt has indeed pinned a user thread, the following will
be true:

    curthread->t_intr != 0
    curthread->t_intr->t_procp->p_pidp->pid_id != 0

The pid_id field will correspond to the PID of the process that has
been pinned. The thread will remain pinned until either
sdt:::interrupt-complete or fbt::thread_unpin:return fires.
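
The two conditions can be combined into a probe predicate. The sketch below is one possible way to total pinned time per PID; it is not one of the attached scripts, and the variable names (pin_ts, pinned_pid) and the aggregation name @pinned are assumptions made for illustration.

```dtrace
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Record the start of a pin only when a user thread is underneath. */
sdt:::interrupt-start
/curthread->t_intr != NULL &&
    curthread->t_intr->t_procp->p_pidp->pid_id != 0/
{
	self->pinned_pid = curthread->t_intr->t_procp->p_pidp->pid_id;
	self->pin_ts = timestamp;
}

/* Whichever of these fires first ends the pin for this interrupt. */
sdt:::interrupt-complete,
fbt::thread_unpin:return
/self->pin_ts/
{
	@pinned[self->pinned_pid] = sum(timestamp - self->pin_ts);
	self->pinned_pid = 0;
	self->pin_ts = 0;
}

dtrace:::END
{
	printf("%8s %16s\n", "PID", "PINNED (ns)");
	printa("%8d %@16d\n", @pinned);
}
```

Clearing self->pin_ts in the closing clause keeps the same interval from being counted twice if both closing probes fire for one interrupt.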

DTrace Scripts

Attached are some scripts that can be used to assess the effect of
pinning. These have been tested with Solaris 10 and Solaris 11.

Probe effect will vary. De-referencing four pointers and then hashing
against a character-string device name each time an interrupt fires,
as some of the scripts do, can be expensive. The last two scripts are
designed to have a lower probe effect, for applications or systems
that are sensitive to this.
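
To illustrate the trade-off, here is a hedged sketch of a cheaper variant: it watches a single process and keys the aggregation on the raw arg0 value, so no driver-name lookup or string hashing happens in the probe. It assumes the target PID is passed as the first script argument ($1); the variable and aggregation names are hypothetical.

```dtrace
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Only trace interrupts that pin a thread of the PID given as $1. */
sdt:::interrupt-start
/curthread->t_intr != NULL &&
    curthread->t_intr->t_procp->p_pidp->pid_id == $1/
{
	self->devi = arg0;	/* raw struct dev_info address, unresolved */
	self->ts = timestamp;
}

sdt:::interrupt-complete,
fbt::thread_unpin:return
/self->ts/
{
	/* An integer key avoids the pointer chasing and string hashing. */
	@pinned[self->devi] = sum(timestamp - self->ts);
	self->devi = 0;
	self->ts = 0;
}

dtrace:::END
{
	printf("%16s %16s\n", "DEV_INFO", "PINNED (ns)");
	printa("%16x %@16d\n", @pinned);
}
```

It could be invoked as, for example, dtrace -s script.d 1234, where script.d and 1234 stand in for the file name and the PID of interest. The raw addresses then have to be mapped to drivers separately.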


The scripts and their outputs are:

pin_by_drivers.d
    How much each driver is pinning processes. Does not identify the
    PID(s) affected.
pids_by_drivers.d
    How much each driver is pinning each process.
pid_cpu_pin.d
    CPU consumption for a process, including pinning per driver, and
    time waiting on run queues.
intr_flow.d
    Identifies the interrupt routine name for a specified driver.

The following scripts are designed to have a lower probe effect:

pid_pin_devi.d
    Pinning on a specific process. Shows drivers as raw
    "struct dev_info *" values.
pid_pin_any.d
    Lowest probe effect. Shows pinning on a specific process without
    identifying the driver(s) responsible.

Resolving Pinning Issues

The primary technique used to improve the performance of an
application experiencing pinning is to "fence" the interrupts from the
application. This involves using processor binding or processor sets
(sets are usually preferable) in one of two ways: dedicating CPUs to
the application that are known not to have the high-impact interrupts
targeted at them, or dedicating CPUs to the driver(s) delivering the
high-impact interrupts.

This is not the optimal solution for all situations. Testing is
recommended.

Another technique is to investigate whether the interrupt handling for
the driver(s) in question can be modified. Some drivers allow for
more or less work to be performed by worker threads, reducing the time
during which an interrupt will pin a user thread. Other drivers can
direct interrupts at more than a single CPU, usually depending on the
interface on which the I/O event has occurred. Some network drivers
can wait for more or fewer incoming packets before sending an
interrupt.

Most importantly, only attempt to resolve these issues yourself if you
have a good understanding of the implications, preferably one
backed up by testing. An alternative is to open a service call with
Oracle asking for assistance to resolve a suspected pinning issue.
You can reference this article and include data obtained by using the
DTrace scripts.

Exercise For The Reader

If you have identified that your multi-threaded or multi-process
application is being pinned, but the stolen CPU time does not seem to
account for the drop in performance, the next step in DTrace would be
to identify whether any critical kernel or user locks are being held
during any of the pinning events. This would require marrying
information gained about how long application threads are pinned with
information gained from the lockstat and plockstat providers.

Comments ( 1 )
  • Brendan Gregg Wednesday, January 4, 2012

    Excellent - I've needed this type of observability in the past, and those are neat ways to identify pinned threads from DTrace. Thanks for sharing the scripts!

