From an article that I wrote last month, published in the September 2007 issue of Sun's Technocrat, this examination of System Latency picks up where we left off with the last discussion, "What is Performance ? .. in the Real World". That discussion identified the following list of key attributes and metrics that most in the IT world associate with optimal system performance :
- Response Times (Client GUI's, Client/Server Transactions, Service Transactions, ..) Measured as "acceptable" Latency.
- Throughput (how much Volume of data can be pushed through a specific subsystem.. IO, Network, etc...)
- Transaction Rates (DataBase, Application Services, Infrastructure / OS / Network Services, etc.). These can be rates per second, hour, or even day, measuring various service-related transactions.
- Failure Rates (# or Frequency of exceeding High or Low Water Marks .. aka Threshold Exceptions)
- Resource Utilization (CPU Kernel vs. User vs. Idle, Memory Consumption, etc..)
- Startup Time (System HW, OS boot, Volume Mgmt Mirroring, Filesystem validation, Cluster Data Services, etc..)
- FailOver / Recovery Time (HA clustered DataServices, Disaster Recovery of a Geographic Service, ..) Time to recover a failed Service (includes recovery and/or startup time of restoring the failed Service)
- etc ...
Each of the attributes and perceived gauges of performance listed above has its own intrinsic relationships and dependencies on specific subsystems and components, in turn reflecting a type of "latency" (delay in response). It is these latencies that are investigated and examined for root cause and correlation as the basis of most Performance Analysis activities.
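As a sketch of how "acceptable" response-time latency is usually judged in practice, consider a set of hypothetical transaction samples (every number below is illustrative, not measured); percentiles are typically reported alongside, or instead of, the mean, because the mean hides outliers :

```python
# Hypothetical transaction response-time samples in milliseconds.
samples_ms = [12, 15, 11, 240, 14, 13, 16, 12, 500, 15]
samples_ms.sort()

mean = sum(samples_ms) / len(samples_ms)
# Simple nearest-rank style 90th percentile over the sorted samples.
p90 = samples_ms[int(0.9 * (len(samples_ms) - 1))]

print(f"mean = {mean:.1f} ms, p90 = {p90} ms")  # two slow outliers dominate the mean
```

Here the two slow outliers (240 ms and 500 ms) pull the mean far above what most users actually experienced, which is exactly why latency targets are usually phrased as percentiles.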
How do you define Latency ?
In the past, the most commonly used terminology relating to latency within the field of Computer Science had been "Rotational Latency". This was due to the huge discrepancy between the responsiveness of an operation requiring mechanical movement vs. the flow of electrons between components, where the discrepancy was astronomical (nanoseconds vs. milliseconds).
Although the most common bottlenecks do typically relate to physical disk-based I/O latency, the paradigm of latency is shifting. With today's built-in HW caching controllers and memory-resident DB's (along with other optimizations in the HW, media, drivers, and protocols...), the gap has narrowed. Realize that in 1 nanosecond (1 billionth of a second), electricity can travel approximately one foot down a wire (approaching the speed of light).
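A quick back-of-the-envelope check of that figure (using the speed of light in vacuum; signals in copper actually propagate somewhat slower, so one foot per nanosecond is an upper bound) :

```python
# How far can a signal travel in one nanosecond, at the speed of light?
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # metres per second, in vacuum

one_ns = 1e-9  # 1 nanosecond, in seconds
distance_m = SPEED_OF_LIGHT_M_PER_S * one_ns
distance_ft = distance_m * 3.28084  # metres -> feet

print(f"~{distance_m:.3f} m (~{distance_ft:.2f} ft) per nanosecond")
```

This is why physical distance (across a large bus, backplane, or grid interconnect) becomes a real latency term at GHz clock rates: one clock cycle at 1 GHz is one nanosecond, i.e. roughly one foot of wire.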
However, given the industry's latest CPU's running multiple cores at clock speeds upwards of multiple GigaHertz (with >= 1 thread per core, each theoretically executing 1+ billion instructions per second...), many bottlenecks can now easily be realized within memory: densities have increased dramatically, the distances across huge supercomputer buses (and grids) have expanded dramatically, and, most significantly, the latency of memory has not decreased at the same rate that CPU speeds have increased.
In order to best investigate system latency, we first need to define it and fully understand what we're dealing with :

Latency : The delay, or time that it takes prior to a function, operation, and/or transaction occurring. (my own definition)

Latent : Present or potential but not evident or active.

Bottleneck : A place or stage in a process at which progress is impeded.

Throughput : Output relative to input; the amount of data passing through a system from input to output.

Bandwidth : The amount of data that can be passed along a communications channel in a given period of time.

(definitions cited from www.dictionary.com)
The "Application Environment" and its basic subsystems :
Once again, the all-inclusive entity that we need to realize and examine in its entirety is the "Application Environment", and its standard subsystems :
- OS / Kernel (System processing)
- Processors / CPU's
- Storage related I/O
- Network related I/O
- Application (User) SW
The "Critical Path" of (End-to-End) System Performance :
Although system performance might frequently be associated with one (or a few) system metrics, we must take ten steps back and realize that overall system performance is one long inter-related sequence of events (both parallel and sequential). Depending on the type of workload and services running within an Application Environment, the Critical Path might vary, as each system has its own performance profile and bottlenecks. Using the typical OLTP RDBMS environment as an example, the Critical Path would include everything (and ALL Latencies incurred) between :
Client Node / User -> Client GUI -> Client Application / Services ->
Client OS / Kernel -> Client HW -> NICs -> Client LAN ->
(network / naming services, etc..) -> WAN (switches, routers, ...) ->
Network Load Balancing Devices -> Middleware Tier(s) -> Web Server(s) ->
Application Server(s) -> Directory, Naming, NFS... Servers / Services ->
RDBMS Server(s) [Infrastructure Svcs, Application SW,
OS / kernel, VM, FS / Cache, Device Drivers, System HW, HBA's, ...] ->
External SAN / NAS I/O [Switches, Zones/Paths, Array(s), Controllers,
Cache, LUN(s), Disk Drives, ...] -> RDBMS Svr ... LAN ... ->
... and back to the Client Node through the WAN, etc... <<-
(NOTE: MANY sub-system components / interactions are left out in this example of a transaction and
response between a client and DB Server)
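To make the critical path concrete, the end-to-end response time a user sees is simply the sum of the latencies incurred at every hop along it. The sketch below uses entirely hypothetical per-hop numbers (none of these are measurements from the article) :

```python
# Hypothetical per-hop latencies (milliseconds) along a simplified
# client -> DB-server critical path; end-to-end latency is their sum.
hop_latency_ms = {
    "client GUI / application": 2.0,
    "client OS + NIC": 0.5,
    "LAN / WAN round trip": 30.0,
    "load balancer": 1.0,
    "web + app tier": 8.0,
    "RDBMS server (CPU, FS cache)": 5.0,
    "SAN I/O (array, LUN, disk)": 6.0,
}

total_ms = sum(hop_latency_ms.values())
slowest = max(hop_latency_ms, key=hop_latency_ms.get)

print(f"end-to-end response: {total_ms:.1f} ms")
print(f"largest single contributor: {slowest}")
```

Note how one hop (here the WAN round trip) can dominate the total; that dominant hop is where performance analysis effort pays off first.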
Categories of Latency :
Latency, in and of itself, simply refers to a delay of sorts. In the realm of Performance Analysis and Workload Characterization, an association can generally be made between certain types of latency and a specific sub-system "bottleneck". However, in many cases the underlying "root causes" of bottlenecks are the result of several overlapping conditions, none of which individually causes performance degradation, but which together can result in a bottleneck. It is for this reason that performance analysis is typically an iterative exercise, where the removal of one bottleneck can easily result in the creation of another "hot spot" elsewhere, requiring further investigation and/or correlation once a bottleneck has been removed.
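One way to see why a nearly saturated sub-system becomes the dominant source of latency is a standard queueing-theory sketch. This is an M/M/1 approximation I'm adding for illustration (not a model from the article): as utilization approaches 100%, response time grows without bound.

```python
def mm1_response_time(service_time_ms: float, utilization: float) -> float:
    """M/M/1 queue approximation: response time R = S / (1 - U).

    service_time_ms : time to service one request with no queueing
    utilization     : fraction of time the resource is busy, in [0, 1)
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)

# A hypothetical 5 ms disk service time inflates as the disk saturates:
for u in (0.50, 0.80, 0.90, 0.95):
    print(f"U = {u:.0%}: response time ~ {mm1_response_time(5.0, u):.0f} ms")
```

At 50% busy the request takes ~10 ms; at 95% busy the same request takes ~100 ms, which is why a sub-system can look "fine" on average utilization yet still be the bottleneck.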
Internal vs. External Latency ...
Internal Forms of Latency :
- CPU Saturation (100% Utilization, High Run Queues, Blocked Kthreads, CPU Contention ... Migrations / Context Switching / SMTX spins, ...)
- Memory Contention (100% Utilization; Allocation Latency due to location, Translation, and/or paging/swapping, ...)
- OS Kernel Contention / Overhead (aka "Thrashing")
- IO Latency (Hot Spots, High Svc Times, ...)
- Network Latency
- OS Infrastructure Service Latency (Telnet, FTP, Naming Svcs, ...)
- Application SW / Services (Application Libraries, JVM, DB, ...)
External Forms of Latency :
- SAN or External Storage Devices (Arrays, LUNS, Controllers, Disk
Drives, Switches, NAS, ...)
- LAN/WAN Device Latency (Switches, Routers, Collisions, Duplicate
IP's, Media Errors, ....)
- External Services (DNS, NIS, NFS, LDAP, SNMP, SMTP, DB, ...)
- Protocol Latency (NACK's, .. Collisions, Errors, etc...)
- Client Side Latency
Perceived vs. Actual Latency ...
Anyone who has worked in the field with end-users has likely experienced scenarios where users attribute a change in application behavior to a performance issue, in many cases incorrectly. The following is a short list of the top reasons for a lapse in user perception of system performance :
- Mis-Alignment of user expectations, vantage points, anticipation,
etc.. (Responsiveness / Response Times, ...)
- Deceptive expectations based upon marketing "PEAK" Throughput
and/or CPU clock-speed #'s and promised increases in
performance. (high clock speeds do NOT always equate to higher throughput or better overall performance, especially if ANY bottlenecks are present)
- PEAK Throughput #'s can only be achieved if there is NO
bottleneck or related latency along the critical path as described
above. The saturation of ANY sub-system will degrade the performance
until that bottleneck is removed.
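The "slowest component wins" rule above can be sketched directly: the peak end-to-end throughput of a path is the minimum of its components' peak bandwidths. The component names and numbers below are hypothetical, chosen only to illustrate the idea :

```python
# Hypothetical peak bandwidths (MB/s) of components along one I/O path.
# End-to-end peak throughput is capped by the slowest component.
path_bandwidth_mb_s = {
    "PCI-X bus": 1066,
    "HBA (4 Gb FC)": 400,
    "SAN switch port": 400,
    "array controller": 350,
    "single disk drive": 75,
}

peak = min(path_bandwidth_mb_s.values())
bottleneck = min(path_bandwidth_mb_s, key=path_bandwidth_mb_s.get)

print(f"end-to-end peak ~ {peak} MB/s, limited by: {bottleneck}")
```

Marketing "PEAK" numbers quote the fastest link in isolation; the sustained number a workload sees is governed by the minimum, which here is the single spindle at the end of the path.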
The PEAK performance of a system will be dictated by the performance of its most latent and/or contentious components (or sub-systems) along the critical path of system performance. (eg. The PEAK bandwidth of a system is no greater than that of its slowest components along the path of a transaction and all its interactions.)
As the holy grail of system performance (along with Capacity Planning and ROI) dictates, a system that spends as close to 100% of its CPU time as possible doing useful processing (vs. WAIT events that pause processing) is what every IT Architect and System Administrator strives for. This is where systems using CMT (multiple cores per CPU, each with multiple threads per core) shine, allowing more processing to continue even when many threads are waiting on I/O.
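A rough way to quantify the CMT argument (my own sketch, not a formula from the article): if each request does C ms of compute and then waits W ms on I/O, then while one thread waits, roughly W / C other threads' worth of compute can be interleaved, so about 1 + W / C hardware threads keep a core busy.

```python
import math

def threads_to_hide_io(compute_ms: float, io_wait_ms: float) -> int:
    """Classic latency-hiding estimate: hardware threads needed to keep
    one core busy when each request computes, then waits on I/O."""
    return math.ceil(1 + io_wait_ms / compute_ms)

# e.g. 1 ms of compute followed by 9 ms of I/O wait per request:
print(threads_to_hide_io(1.0, 9.0), "threads keep the core busy")
```

With I/O-heavy ratios like this, a core with 4 or 8 hardware threads recovers cycles that a single-threaded core would simply burn in WAIT, which is the CMT value proposition in a nutshell.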
The Application Environment and its Sub-Systems ... where the bottlenecks can be found :
Within Computing, or more broadly Information Technology, "latency" and its underlying causes can be tied to one or more specific "sub-systems". The following list reflects the first level of sub-systems that you will find in any Application Environment, along with the metrics, measurements, and/or interactions of interest for each :
- System Interconnects (centerplane, I/O bus, etc..) : many types of connectivity and media are possible, all with individual response times and bandwidth properties. Metrics / Tools : aggregated total throughput #'s (from kstat, etc..).

- Processors / CPU's : # cores, # HW threads per core, clock speed / frequency in GHz (cycles per second), operations (instructions) per sec, cache, DMA, etc.. Metrics / Tools : trapstat, cpustat, cputrack, mpstat, ... (Run Queue, Blocked Kthreads, ITLB_Misses, % S/U/Idle Utilization, # lwp's, ...).

- Memory : speed of bus, bandwidth of bus, bus latency, DMA config, L1/L2/L3 cache locations / sizes, FS page cache, physical proximity of cache and/or RAM, tmpfs, pagesizes, ... Metrics / Tools : mdb, kstat, prstat, trapstat, ipcs, pagesize, swap, ... (Cache Misses, DTLB_misses, Page Scan Rate, heap/stack/kernel sizes, ...).

- I/O Interfaces (NIC's, HBA's, ..) : interrupt saturation, NIC overflows, NIC / HBA caching, HBA SW vs. HW RAID, bus/controller bridges/switches, DMP, MPxIO, ... Metrics / Tools : (RX Pkts / Sec, Network Errors, ...), iostat, vxstat (Response Times, Storage device Svc_times, ...), lockstat, intrstat, ...

- Storage : RAID LUN's, File Systems (types, block sizes, ...), Volumes, RAID configuration (stripes, mirrors, RAID level, paths, ...), physical fragmentation, MPxIO, etc.. Metrics / Tools : vxstat, kstat, dtrace, statspack, ... (%wait, Service Times, blocked kernel threads, FS/LUN Hot Spots).

- OS / Kernel : scheduling, virtual memory mgmt, HW mgmt/control, interrupt handling, polling, system calls, ... Metrics / Tools : (utilization, interrupts, syscalls, %Sys / %Usr, ...), prstat, top, mpstat, ps, lockstat (for smtx, lock, spin contention), ...

- OS Infrastructure Services : BIND/DNS, Naming Svcs, LDAP, Authentication / Authorization, ... Metrics / Tools : svcadm, .. various ..

- Application (User) SW : DB Svr, Web Svr, Application Svr, ...
Note: if you want a single Solaris utility to do the heavy lifting, performance / workload correlation, and reporting for you, take a look at sys_diag if you haven't already done so (or the README).
Bandwidth and related Latencies :
The following table demonstrates the wide range of typical operating frequencies and latencies per sub-system, component, and/or media type :

Component / Transport Media | Response Time / Frequency / Speed | Throughput / Bandwidth
CPU | > 1+ GigaHertz (1+ billion cycles per second) * (# cores * HW threads / core) | > 1 billion operations per second (huge theoretical # ops/s per system)
Memory (RAM) | DDR (PC-3200 @ 200MHz/200MHz bus) ~5 ns ; DDR2 (PC2-5300 @ 166MHz/333MHz bus) ~6 ns ; DDR2 (PC2-8500 @ 266MHz/533MHz bus) ~3.75 ns (billionths of a second) | Peak transfer 3.2 GB/s ; peak transfer 8.5 GB/s <TBD>
Disk Drive | Service times ~5+ ms = ~X ms latency + Y ms seek time (1 millisecond = 1000th of a second) [platter size, # cylinders / platters, RPM, ...] | varies greatly, see below
Ultra 320 SCSI (16-bit parallel) | high performance; cable & device limits | up to 320 MBps
SAS [Serial Attached SCSI] | > 300 MBps (> 3 Gbps) | up to 1200 MBps <TBD>
SATA [Serial ATA] | low cost, higher capacity (poorer performance) | up to 300 MBps ; up to 600 MBps <TBD>
USB | (1 microsecond [us] = 1 millionth of a second) | up to 480 Mbps (60 MBps), ~40 MBps in practice ; up to 50 MBps
Fibre Channel (dual channel) | 4 Gb (4 / 2 / 1 Gb) *2 ; 8 Gb (8 / 4 / 2 Gb) | up to 1.6 GBps (1 GB usable) ; up to 3.2 GBps (1.8 GB usable)
1 Gigabit Ethernet | latency ~50 us | 125 MBps (~1 Gbps) theoretical
10 Gigabit Ethernet | .. | up to 20 Gbps (<= 9 Gbps usable)
Infiniband (dual-ported HCA) | x4 (SDR / DDR) dual ported = *2 ; latency < 2 microseconds ; x8 (DDR) *2 <TBD> | 2*10 Gb = 20 Gbps (16 Gbps usable) ; up to 40 Gbps (32 Gbps usable)
PCI | 32-bit @ 33 MHz ; 64-bit @ 33 MHz ; 64-bit @ 66 MHz | ..
PCI-X | 64-bit bus width @ 100 MHz (parallel bus) ; 64-bit bus width @ 133 MHz (parallel bus) | up to 800 MBps ; 1066 MBps (1 GBps)
PCI Express | serial, bi-directional bus @ 2.5 GHz ; v.2 @ 5 GHz (10's - 100's of nanoseconds for latencies) | 4 GBps (x16 lanes) one direction ; 8 GBps (x32 lanes) one direction ; up to 16 GBps bi-directional (x32) ; v.2 : 32 GBps bi-directional (x32)
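As a companion to the disk-drive row above, the "rotational latency" half of a disk service time follows directly from spindle RPM: on average the head waits half a revolution for the right sector. The seek times below are hypothetical vendor-style figures used only for illustration.

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average rotational latency = half a revolution: (60,000 / RPM) / 2 ms."""
    return (60_000.0 / rpm) / 2.0

def avg_service_time_ms(rpm: int, avg_seek_ms: float) -> float:
    # Service time ~= seek time + rotational latency
    # (transfer and queueing time are ignored in this sketch).
    return avg_seek_ms + avg_rotational_latency_ms(rpm)

# Hypothetical average seek times per spindle speed:
for rpm, seek_ms in ((7_200, 8.5), (10_000, 4.7), (15_000, 3.5)):
    print(f"{rpm:>6} RPM: ~{avg_service_time_ms(rpm, seek_ms):.1f} ms avg service time")
```

This is why even the fastest 15K RPM spindles bottom out at a few milliseconds per random I/O, roughly a million times slower than the nanosecond-scale memory latencies in the table.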
Other Considerations Regarding System Latency :
Other considerations regarding system latency that are often overlooked include the following, which offers us a more holistic vantage point of system performance and the items that might work against "peak" system capabilities :
- For Application SW that supports advanced capabilities such as
Infiniband RDMA (Remote Direct Memory Access), interconnect latencies
can be virtually eliminated via Application RDMA "kernel bypass".
This would be applicable in an HPC grid and/or possibly Oracle
RAC Deployments, etc. (confirming certifications of SW/HW..).
- Level of Multi-Threading vs. Monolithic serial or "batch" jobs. (If Applications are not Multi-Threaded, then SMP and/or CMT systems with multiple processors / cores will likely remain under-utilized.)
- Architectural configurations supporting load distribution across multiple devices / paths (cpu's, cores, NIC's, HBA's, Switches, LUN's, ...)
- System Over Utilization (too much running on one system.. due to
under-sizing or over-growth, resulting in system "Thrashing" overhead)
- External Latency Due to Network and/or SAN I/O Contention
- Saturated Sub-Systems / Devices (NIC's, HBA's, Ports, Switches,
...) create system overhead handling the contention.
- Excessive Interrupt Handling (vs. Polling, Msg passing, etc..),
resulting in overhead where Interrupt Handling can cause CPU migrations
/ context switching (interrupts have the HIGHEST priority within the
Solaris Kernel, and are handled even before RT processing, preempting
running threads if necessary). Note, this can easily occur with
NIC cards/ports that become saturated (> ~25K RX pkts/sec),
especially for older drivers and/or over-utilized systems.
- Java Garbage collection Overhead (sub-par programming practices,
or more frequently OLD JVM's, and/or missing compilation optimizations).
- Use of Binaries that are compiled generically using GCC, vs. HW
optimized compilations using Sun's Studio Compilers (Sun Studio 12 can give you 200% + better performance than gcc binaries).
- Virtualization Overhead (significant overhead relating to traps
and library calls... when using VmWare, etc..)
- System Monitoring Overhead (the cumulative impact of monitoring
utilities, tools, system accounting, ... as well as the IO incurred to
store that historical performance trending data).
- OS and/or SW ... Patches, Bugs, Upgrades (newly applied, or outstanding).
- Systems that are MIS-tuned are accidents waiting to happen. Only tune kernel/drivers if you KNOW what you are doing, or have been instructed by support to do so (and have FIRST tested on a NON-production system). I can't tell you how many performance issues I have encountered that were due to administrator "tweaks" to kernel tunables (to the point of taking down entire LAN segments !). The defaults are generally the BEST starting point unless a world-class benchmarking effort is under-way.
The Iterative nature of Performance Analysis and System Tuning :
No matter what the root causes are
found to be, in the realm of Performance Analysis and system Tuning,
... once you remove one bottleneck, the system processing
characteristics will change, resulting in a new performance profile,
and new "hot spots"
that require further data collection and analysis. The process is
iterative, and requires a methodical approach to remediation.
Make certain that ONLY ONE (1) change is made at a time; otherwise, the effects ( + or - ) of each individual change cannot be isolated and measured.
Hopefully at some point in the future we'll be operating at latencies measured in attoseconds (10^-18, or 1 quintillionth of a second), but until then .... Happy tuning :)
For more information regarding
Performance Analysis, Capacity Planning, and related Tools, review some of my other postings at : http://blogs.sun.com/toddjobson/category/Performance+and+Capacity+Planning
Copyright 2007 Todd A. Jobson