The Many Flavors of System Latency.. along the Critical Path of Peak Performance
By tjobson on Sep 30, 2007
Adapted from an article I wrote last month, published in the September 2007 issue of Sun's Technocrat, this examination of System Latency starts where we left off with the last discussion, "What is Performance?.. in the Real World". That discussion identified the following list of key attributes and metrics that most in the IT world associate with optimal system performance:
- Response Times (Client GUI's, Client/Server Transactions, Service Transactions, ..) Measured as "acceptable" Latency.
- Throughput (how much Volume of data can be pushed through a specific subsystem.. IO, Network, etc...)
- Transaction Rates (Database, Application Services, Infrastructure / OS / Network Services, etc.). These can be rates per Second, Hour, or even Day... measuring various service-related transactions.
- Failure Rates (# or Frequency of exceeding High or Low Water Marks .. aka Threshold Exceptions)
- Resource Utilization (CPU Kernel vs. User vs. Idle, Memory Consumption, etc..)
- Startup Time (System HW, OS boot, Volume Mgmt Mirroring, Filesystem validation, Cluster Data Services, etc..)
- FailOver / Recovery Time (HA clustered DataServices, Disaster Recovery of a Geographic Service, ..) Time to recover a failed Service (includes recovery and/or startup time of restoring the failed Service)
- etc ...
Each of the attributes and perceived gauges of performance listed above has its own intrinsic relationships and dependencies to specific subsystems and components... in turn reflecting a type of "latency" (a delay in response). It is these latencies that are investigated and examined for root cause and correlation as the basis for most Performance Analysis activities.
How do you define Latency ?
In the past, the most commonly used terminology relating to latency within the field of Computer Science had been "Rotational Latency". This was due to the astronomical discrepancy between the responsiveness of an operation requiring mechanical movement and the flow of electrons between components (milliseconds vs. nanoseconds). Although the most common bottlenecks do still typically relate to physical disk-based I/O latency, the paradigm of latency is shifting. With today's built-in HW caching controllers and memory-resident DB's (along with other optimizations in the HW, media, drivers, and protocols...), the gap has narrowed. Realize that in 1 nanosecond (1 billionth of a second), electricity can travel approximately one foot down a wire (approaching the speed of light).
However, given the industry's latest CPU's running multiple cores at clock speeds of multiple GigaHertz (with one or more HW threads per core, each theoretically executing over a billion instructions per second...), many bottlenecks can now easily be realized within memory: densities have increased dramatically, the distances across huge supercomputer buses (and grids) have expanded dramatically, and, most significantly, memory latency has not decreased at the same rate that CPU speed has increased. In order to best investigate system latency, we first need to define it and fully understand what we're dealing with.
The delay, or time that it takes, prior to a function, operation, and/or transaction occurring. (my own definition)
- adj. (Latent) Present or potential but not evident or active.
- noun (Bottleneck) A place or stage in a process at which progress is impeded.
- noun (Throughput) Output relative to input; the amount of data passing through a system from input to output.
- noun (Bandwidth) The amount of data that can be passed along a communications channel in a given period of time.
(definitions cited from www.dictionary.com)
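To make the CPU-vs.-memory gap described above concrete, here is a minimal C sketch (my own illustration; the array size, iteration count, and use of clock_gettime() are arbitrary choices, not from the article) that estimates average memory-access latency by chasing dependent pointers through an array far larger than any cache:

```c
/* latency_chase.c -- rough estimate of average memory-access latency via
 * dependent (serialized) pointer chasing.  Build: cc -O2 latency_chase.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (16 * 1024 * 1024)   /* 16M pointers (~128 MB), larger than cache */
#define ITERS (16 * 1024 * 1024)   /* dependent loads to time                   */

int main(void) {
    size_t *next = malloc(N * sizeof(size_t));
    if (next == NULL) return 1;

    /* Sattolo's shuffle builds ONE long random cycle, so every load
     * depends on the previous one and cannot be overlapped by the CPU. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        /* combine two rand() calls so the index range covers all 16M slots */
        size_t j = (((size_t)rand() << 15) | (size_t)rand()) % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) p = next[p];   /* the serialized chase */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns (p=%zu)\n", ns / ITERS, p);
    free(next);
    return 0;
}
```

Shrink N until the working set fits in cache and the same loop reports single-digit nanoseconds; blow the caches out and the reported latency jumps by an order of magnitude, which is exactly the memory-wall effect described above.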
The "Application Environment" and it's basic subsystems :
Once again, the all-inclusive entity that we need to realize and examine in it's entirety is the "Application Environment", and it's standard subsystems :
- OS / Kernel (System processing)
- Processors / CPU's
- Storage related I/O
- Network related I/O
- Application (User) SW
The "Critical Path" of (End-to-End) System Performance :
Although system performance might frequently be associated with one (or a few) system metrics, we must take 10 steps back and realize that overall system performance is one long inter-related sequence of events (both parallel and sequential). Depending on the type of workload and services running within an Application Environment, the Critical Path might vary, as each system has its own performance profile and related "personality". Using the typical OLTP RDBMS environment as an example, the Critical Path would include everything (and ALL Latencies incurred) between:
Client Node / User -> Client GUI -> Client Application / Services -> Client OS / Kernel -> Client HW -> NICs -> Client LAN -> (network / naming services, etc.. ) -> WAN (switches, routers, ...) -> ... Network Load Balancing Devices
-> Middleware / Tier(s) -> Web Server(s) -> Application Server(s) -> Directory, Naming, NFS... Servers/Services->
-> RDBMS Server(s) [Infrastructure Svcs, Application SW, OS / Kernel, VM, FS / Cache, Device Drivers, System HW, HBA's, ...] -> External SAN / NAS I/O [Switches, Zones/Paths, Array(s), Controllers, HW Cache, LUN(s), Disk Drives, ...] -> RDBMS Svr ... LAN ... -> ... and back to the Client Node through the WAN, etc...
(NOTE: MANY sub-system components / interactions are left out in this example of a transaction and response between a client and DB Server)
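One practical consequence of the critical path: a single timed round trip captures the SUM of every latency along it, end to end. Here's a minimal C sketch of that idea (the server address, the echo port, and the 1-byte payload are placeholders of my own, not anything from the article):

```c
/* rtt_probe.c -- time one request/response round trip against a TCP service.
 * Everything between this client and the server (NICs, switches, the server's
 * kernel and application stack) contributes to the measured time.
 * Build: cc -O2 rtt_probe.c     (add -lsocket -lnsl on Solaris)
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(7);                        /* echo svc (assumed on) */
    inet_pton(AF_INET, "192.0.2.10", &addr.sin_addr);  /* placeholder server    */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }
    char buf = 'x';
    write(fd, &buf, 1);   /* request traverses the whole path out ... */
    read(fd, &buf, 1);    /* ... and the reply traverses it all back  */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("connect + 1-byte round trip: %.3f ms\n", ms);
    close(fd);
    return 0;
}
```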
Categories of Latency :
Latency, in and of itself, simply refers to a delay of sorts. In the realm of Performance Analysis and Workload Characterization, an association can generally be made between certain types of latency and a specific sub-system "bottleneck". However, in many cases the underlying root causes of bottlenecks are the result of several overlapping conditions, none of which individually causes performance degradation, but which together can result in a bottleneck. It is for this reason that performance analysis is typically an iterative exercise, where the removal of one bottleneck can easily result in the creation of another "hot spot" elsewhere, requiring further investigation and/or correlation.
Internal vs. External Latency ...
Internal Forms of Latency :
- CPU Saturation (100% Utilization, High Run Queues, Blocked Kthreads, Cpu Contention ... Migrations / Context Switching / ... SMTX, ..)
- Memory Contention (100% Utilization, Allocation Latency due to locality, Translation, and/or paging/swapping, ...)
- OS Kernel Contention Overhead ( aka .. "Thrashing" due to saturation.. )
- IO Latency ( Hot Spots, High Svc Times, ...)
- Network Latency
- OS Infrastructure Service Latency (Telnet, FTP, Naming Svcs, ...)
- Application SW / Services (Application Libraries, JVM, DB, ...)
External Forms of Latency :
- SAN or External Storage Devices (Arrays, LUNS, Controllers, Disk Drives, Switches, NAS, ...)
- LAN/WAN Device Latency (Switches, Routers, Collisions, Duplicate IP's, Media Errors, ....)
- External Services (DNS, NIS, NFS, LDAP, SNMP, SMTP, DB, ...)
- Protocol Latency (NACK's, Collisions, Errors, etc...)
- Client Side Latency
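One quick way to tell internal (on-CPU) latency apart from external (waiting) latency in your own code is to compare wall-clock time against consumed CPU time: the difference is time spent waiting on something else. A rough C sketch, using an I/O-bound read loop as a stand-in workload (the file path is a placeholder of my own):

```c
/* wait_vs_cpu.c -- split elapsed time into on-CPU work vs. off-CPU waiting.
 * Build: cc -O2 wait_vs_cpu.c
 */
#include <stdio.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/resource.h>

static double tv2s(struct timeval tv) { return tv.tv_sec + tv.tv_usec / 1e6; }

int main(void) {
    struct timespec w0, w1;
    struct rusage r0, r1;
    char buf[65536];
    int fd = open("/var/tmp/bigfile", O_RDONLY);   /* placeholder workload */
    if (fd < 0) { perror("open"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &w0);
    getrusage(RUSAGE_SELF, &r0);

    while (read(fd, buf, sizeof(buf)) > 0)   /* mostly blocked on disk I/O */
        ;

    getrusage(RUSAGE_SELF, &r1);
    clock_gettime(CLOCK_MONOTONIC, &w1);

    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;
    double cpu  = (tv2s(r1.ru_utime) - tv2s(r0.ru_utime))
                + (tv2s(r1.ru_stime) - tv2s(r0.ru_stime));
    printf("wall %.3fs  cpu %.3fs  waiting (latency) %.3fs\n",
           wall, cpu, wall - cpu);
    close(fd);
    return 0;
}
```

If "waiting" dominates, the bottleneck is external to the process (disk, network, lock); if CPU time dominates, look at the internal forms of latency listed above.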
Perceived vs. Actual Latency ...
Anyone who has worked in the field with end-users has likely experienced scenarios where users attribute a change in application behavior to a performance issue, in many cases incorrectly. The following is a short list of the top reasons for a lapse in user perception of system performance:
- Mis-Alignment of user expectations, vantage points, anticipation, etc. (Responsiveness / Response Times, ...)
- Deceptive expectations based upon marketing "PEAK" Throughput and/or CPU clock-speed #'s and promised increases in performance (high clock speeds do NOT always equate to higher throughput or better overall performance, especially if ANY bottlenecks are present).
- PEAK Throughput #'s can only be achieved if there is NO bottleneck or related latency along the critical path as described above. The saturation of ANY sub-system will degrade performance until that bottleneck is removed.
The PEAK Performance of a system will be dictated by the performance of its most latent and/or contentious components (or sub-systems) along the critical path of system performance. (e.g. The PEAK bandwidth of a system is no greater than that of its slowest components along the path of a transaction and all its interactions.)
The holy grail of system performance (along with Capacity Planning... and ROI) is a system that spends as close to 100% of its CPU time as possible doing useful processing (vs. WAIT events that pause processing); this is what every IT Architect and System Administrator strives for. It is where systems using CMT (multiple cores per CPU, each with multiple threads per core) shine, allowing more processing to continue even while many threads are waiting on I/O.
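As a rough illustration of why overlapping those waits matters (the thread count and simulated I/O delay below are arbitrary choices of mine), eight threads that each block 100 ms on fake I/O finish in roughly 100 ms of wall time when overlapped, not the 800 ms a serial loop would take:

```c
/* overlap.c -- when work is spread across threads, I/O waits overlap:
 * total wall time tracks the LONGEST single wait, not the sum of all waits.
 * Build: cc -O2 overlap.c -lpthread
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>

#define NTHREADS 8
#define IO_MS    100          /* each "transaction" waits 100 ms on fake I/O */

static void *worker(void *arg) {
    (void)arg;
    usleep(IO_MS * 1000);     /* stand-in for a blocking disk/network call */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    /* Serial execution would take NTHREADS * IO_MS = 800 ms; overlapped,
     * wall time stays near IO_MS, leaving the CPU free for other work.   */
    printf("%d overlapped %d ms waits took %.1f ms of wall time\n",
           NTHREADS, IO_MS, ms);
    return 0;
}
```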
The Application Environment and its Sub-Systems ... where the bottlenecks can be found
Within Computing, or more broadly, Information Technology, "latency" and its underlying causes can be tied to one or more specific "sub-systems". The following table reflects the first level of "sub-systems" that you will find for any Application Environment:
| Subsystem / Components | Attributes and Key Characteristics | Related Metrics, Measurements, and/or Interactions |
| --- | --- | --- |
| System "Bus" / Backplane | Backplane / centerplane, I/O bus, etc. (many types of connectivity and media are possible, all with individual response times and bandwidth properties) | busstat output, aggregated total throughput #'s (from kstat, etc.) |
| Processors / CPU's | # Cores, # HW Threads per core, Clock speed / Frequency in GHz (cycles per second), Operations (instructions) per Sec, Cache, DMA, etc. | vmstat, trapstat, cpustat, cputrack, mpstat, ... (Run Queue, Blocked Kthreads, ITLB_Misses, % S/U/Idle Utilization, # lwp's, ...) |
| Memory / Cache | Speed/Frequency of Bus, Bandwidth of Bus, Bus Latency, DMA Config, L1/L2/L3 Cache Locations / Sizes, FS page cache, Physical Proximity of Cache and/or RAM, tmpfs, pagesizes, ... | vmstat, pmap, mdb, kstat, prstat, trapstat, ipcs, pagesize, swap, ... (Cache Misses, DTLB_misses, Page Scan Rate, heap/stack/kernel sizes, ...) |
| Controllers (NIC's, HBA's, ...) | NIC RX Interrupt Saturation, NIC Overflows, NIC / HBA Caching, HBA SW vs. HW RAID, Bus/Controller Bridges/Switches, DMP, MPxIO, ... | netstat, kstat (RX Pkts / Sec, Network Errors, ...), iostat, vxstat (Response Times, Storage device Svc_times, ...), lockstat, intrstat, ... |
| Disk-Based Devices | Boot Devices, RAID LUN's, File Systems (types, block sizes, ...), Volumes, RAID configuration (stripes, mirrors, RAID level, paths, ...), physical fragmentation, MPxIO, etc. | iostat, vxstat, kstat, dtrace, statspack, ... (%wait, Service Times, blocked kernel threads, ... FS/LUN Hot Spots) |
| OS / Kernel | Process Scheduling, Virtual Memory Mgmt, HW Mgmt/Control, Interrupt handling, polling, system calls, ... | vmstat (utilization, interrupts, syscalls, %Sys / %Usr, ...), prstat, top, mpstat, ps, lockstat (for smtx, lock, spin contention), ... |
| OS Infrastructure Services | FTP, Telnet, BIND/DNS, Naming Svcs, LDAP, Authentication/Authorization, ... | prstat, ps, svcadm, ... various |
| Application SW / Services | DB Svr, Web Svr, Application Svr, ... | ... |
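Most of the Solaris tools named in the table above (vmstat, mpstat, iostat, ...) are consumers of the kernel's kstat facility, which you can also read directly. A minimal libkstat sketch in C (reading the nproc counter from unix:0:system_misc is simply my example of choice; any kstat can be read the same way):

```c
/* kstat_demo.c -- minimal sketch of reading a kernel statistic via libkstat
 * on Solaris, the same interface behind the tools in the table above.
 * Build: cc kstat_demo.c -lkstat
 */
#include <stdio.h>
#include <kstat.h>

int main(void) {
    kstat_ctl_t *kc = kstat_open();
    if (kc == NULL) { perror("kstat_open"); return 1; }

    /* unix:0:system_misc holds assorted system-wide counters */
    kstat_t *ksp = kstat_lookup(kc, "unix", 0, "system_misc");
    if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1) {
        perror("kstat lookup/read");
        return 1;
    }

    kstat_named_t *kn = kstat_data_lookup(ksp, "nproc");
    if (kn != NULL)
        printf("processes on system: %u\n", kn->value.ui32);

    kstat_close(kc);
    return 0;
}
```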
Bandwidth and related Latencies :
The following table demonstrates the wide range of typical operating frequencies and latencies PER Sub-System, Component, and/or Media Type :
| Component / Transport Media | Response Time / Frequency / Speed | Throughput / Bandwidth |
| --- | --- | --- |
| Processors / CPU's | > 1 GHz (1+ billion cycles per second) | > 1 billion operations per second |
| Memory: DDR (PC-3200 @ 200MHz/200MHz bus) | ~5 ns (billionths of a second) | Peak Transfer 3.2 GB/s |
| Memory: DDR2 (PC2-5300 @ 166MHz/333MHz bus) | ~6 ns | Peak Transfer 5.3 GB/s |
| Memory: DDR2 (PC2-8500 @ 266MHz/533MHz bus) | ~3.75 ns | Peak Transfer 8.5 GB/s <TBD> |
| Disk Drives | Service Times ~5+ ms (milliseconds) | varies greatly, see below |
| Ultra 320 SCSI (16-bit parallel) | high performance; cable & dev... limits | Up to 320 MBps |
| SAS [Serial Attached SCSI] | | > 300 MBps (> 3 Gbps) |
| SATA [Serial ATA] | low cost, higher capacity (poor performance) | Up to 300 MBps |
| USB 2.0 | | Up to 480 Mbps (60 MBps); ~40 MBps actual |
| FireWire [IEEE 1394a] | | Up to 50 MBps |
| Fibre Channel (Dual Ch.) | 4 Gb (4 / 2 / 1 Gb) x2 | Up to 1.6 GBps (1 GB Usable); Up to 3.2 GBps (1.8 GB Usable) |
| 1 Gigabit Ethernet | Latency ~50 us (1 microsecond [us] = 1 millionth of a second) | 125 MBps (~1 Gbps) theoretical |
| 10 Gigabit Ethernet (dual port) | | Up to 20 Gbps (<= 9 Gbps Usable) |
| InfiniBand (Dual-Ported HCA) | x4 (SDR / DDR), Dual Ported = x2; Latency < 2 microseconds | 2 x 10 Gb = 20 Gbps (16 Gbps Usable); Up to 40 Gbps (32 Gbps Usable) |
| PCI (parallel bus) | 32-bit @ 33 MHz; 64-bit @ 33 MHz; 64-bit @ 66 MHz | Up to 133 / 266 / 533 MBps |
| PCI-X (parallel bus) | 64-bit bus width @ 100 MHz; 64-bit bus width @ 133 MHz | Up to 800 MBps; 1066 MBps (1 GBps) |
| PCI Express (serial) | Bi-directional bus @ 2.5 GHz; v.2 @ 5 GHz <TBD>; latencies in the 10's-100's of nanoseconds | 4 GBps (x16 lanes) one direction; 8 GBps (x32 lanes) one direction; Up to 16 GBps bi-directional (x32); 32 GBps bi-directional (x32 lanes) |
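A useful first-order model for reading the table above is: transfer time = latency + size / bandwidth. The C sketch below applies it to a few media (the sustained disk and InfiniBand byte rates are my own round-number assumptions, not figures from the table). The lesson: for small transfers, per-operation latency dominates, which is why a ~5 ms disk service time dwarfs the actual data movement.

```c
/* xfer_time.c -- first-order transfer-time model: time = latency + size/bw.
 * Build: cc -O2 xfer_time.c
 */
#include <stdio.h>

struct medium {
    const char *name;
    double latency_s;      /* per-operation latency, seconds   */
    double bandwidth_Bps;  /* sustained bandwidth, bytes/sec   */
};

int main(void) {
    struct medium media[] = {
        { "disk drive (5 ms, ~60 MB/s)", 5e-3,  60e6  },  /* assumed rate */
        { "1 GbE (~50 us, 125 MB/s)",    50e-6, 125e6 },
        { "InfiniBand (<2 us, 2 GB/s)",  2e-6,  2e9   },  /* assumed rate */
    };
    double size = 8192.0;   /* one 8 KB page */

    for (unsigned i = 0; i < sizeof(media) / sizeof(media[0]); i++) {
        double t = media[i].latency_s + size / media[i].bandwidth_Bps;
        printf("%-32s %10.1f us for an 8 KB transfer\n",
               media[i].name, t * 1e6);
    }
    return 0;
}
```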
Other Considerations Regarding System Latency :
Other considerations regarding system latency that are often overlooked include the following, which offers us a more holistic vantage point of system performance and the items that might work against "Peak" system capabilities:
- For Application SW that supports advanced capabilities such as Infiniband RDMA (Remote Direct Memory Access), interconnect latencies can be virtually eliminated via Application RDMA "kernel bypass". This would be applicable in an HPC grid and/or possibly Oracle RAC deployments, etc. (confirming certifications of SW/HW..).
- Level of Multi-Threading vs. Monolithic serial or "batch" jobs (If Applications are not Multi-Threaded, then SMP and/or CMT systems with multiple processors / cores will likely always remain under-utilized).
- Architectural configurations supporting load distribution across multiple devices / paths (cpu's, cores, NIC's, HBA's, Switches, LUNs, Drives, ...)
- System Over Utilization (too much running on one system.. due to under-sizing or over-growth, resulting in system "Thrashing" overhead)
- External Latency Due to Network and/or SAN I/O Contention
- Saturated Sub-Systems / Devices (NIC's, HBA's, Ports, Switches, ...) create system overhead handling the contention.
- Excessive Interrupt Handling (vs. Polling, Msg passing, etc..), which can cause CPU migrations / context switching (interrupts have the HIGHEST priority within the Solaris Kernel, and are handled even before RT processing, preempting running threads if necessary). Note that this can easily occur with NIC cards/ports that become saturated (> ~25K RX pkts/sec), especially with older drivers and/or over-utilized systems.
- Java Garbage collection Overhead (sub-par programming practices, or more frequently OLD JVM's, and/or missing compilation optimizations).
- Use of Binaries that are compiled generically using GCC, vs. HW-optimized compilations using Sun's Studio Compilers (Sun Studio 12 can give you 200%+ better performance than gcc binaries).
- Virtualization Overhead (significant overhead relating to traps and library calls... when using VmWare, etc..)
- System Monitoring Overhead (the cumulative impact of monitoring utilities, tools, system accounting, ... as well as the IO incurred to store that historical performance trending data).
- OS and/or SW ... Patches, Bugs, Upgrades (newly applied, or possibly missing)
- Systems that are MIS-tuned are accidents waiting to happen. Only tune kernel/driver parameters if you KNOW what you are doing, or have been instructed by support to do so (and have FIRST tested on a NON-production system). I can't tell you how many performance issues I have encountered that were due to administrator "tweaks" of kernel tunables (to the point of taking down entire LAN segments!). The defaults are generally the BEST starting point unless a world-class benchmarking effort is under way.
The "Iterative" nature of Performance Analysis and System Tuning
No matter what the root causes are found to be, in the realm of Performance Analysis and system Tuning, ... once you remove one bottleneck, the system processing characteristics will change, resulting in a new performance profile, and new "hot spots" that require further data collection and analysis. The process is iterative, and requires a methodical approach to remediation.
Make certain that ONLY ONE (1) change is made at a time; otherwise, the effects (+ or -) cannot be quantified.
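A simple way to enforce that discipline is to quantify each single change against run-to-run noise. A minimal C harness along those lines (the benchmark command path is a placeholder, and RUNS=5 is an arbitrary choice of mine):

```c
/* quantify.c -- time a workload several times so a single tuning change
 * can be judged against run-to-run noise (mean +/- stddev).
 * Build: cc -O2 quantify.c -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define RUNS 5

int main(void) {
    const char *cmd = "/var/tmp/mybench.sh";   /* placeholder workload */
    double t[RUNS], sum = 0.0, sq = 0.0;

    for (int i = 0; i < RUNS; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        if (system(cmd) != 0) { fprintf(stderr, "run %d failed\n", i); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &b);
        t[i] = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        sum += t[i];
    }
    double mean = sum / RUNS;
    for (int i = 0; i < RUNS; i++) sq += (t[i] - mean) * (t[i] - mean);
    printf("mean %.3fs  stddev %.3fs over %d runs\n",
           mean, sqrt(sq / (RUNS - 1)), RUNS);
    return 0;
}
```

Run it before and after the one change; if the shift in the mean is smaller than the spread between runs, you haven't proven anything yet.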
Hopefully at some point in the future we'll be operating at latencies measured in attoseconds (10^-18, or 1 quintillionth of a second), but until then.... Happy tuning :)
For more information regarding Performance Analysis, Capacity Planning, and related Tools, review some of my other postings at : http://blogs.sun.com/toddjobson/category/Performance+and+Capacity+Planning
Copyright 2007 Todd A. Jobson